VAL 09 — Stuck Rollout Detection Validation

Purpose

This plan validates that the AutonomyOps control-plane correctly identifies stuck rollout plans, provides operator-visible diagnostics, and transitions plans to expected phases via the retry and rollback recovery strategies exercised by this slice.


Claims Under Test

ID

Claim

VAL09-C1

Plans in active phases with updated_at older than the threshold appear in GET /v1/rollouts/stuck with non-empty diagnosis strings

VAL09-C2

Plans in paused or terminal phases are excluded from stuck detection regardless of updated_at staleness

VAL09-C3

POST /v1/rollouts/{id}/recover with strategy=retry transitions the plan to active and refreshes updated_at, removing it from the stuck scan

VAL09-C4

POST /v1/rollouts/{id}/recover with strategy=rollback transitions the plan to rolled_back (terminal), removing it from the stuck scan


Branch-Specific Rule

Question

Answer

Covered by an existing lab?

No. The main rollout lab (port 18888) and VAL07/VAL08 CPs are killed before any stuck validation can run. Stuck detection requires a live CP with plans in active phases.

Lab to extend

scripts/labs/run_cli_audit_lab.sh — new function run_stuck_detection_val09_lab()

Port allocation

CP: 127.0.0.1:18994, Metrics: 127.0.0.1:19094

New runner required?

No. Extending the existing CLI audit lab is sufficient.


Staleness Injection Method

The GET /v1/rollouts/stuck endpoint accepts a ?threshold_seconds=N query parameter (default: 3600). The lab uses threshold_seconds=3 (3-second threshold) and sleeps 4 seconds after plan creation. This is:

  • Deterministic: all 5 plans become stale at the same time (same updated_at timestamp from the CreatePlan call)

  • Dependency-free: no SQLite manipulation, no external clock injection

  • Minimal overhead: 4-second sleep is the only intentional delay in the function

Plans created in published phase — which is listed in activePhases — satisfy all three stuck preconditions: active phase, stale timestamp, no recent stage promotion.


Architecture

API Endpoints Exercised

Endpoint

Purpose

GET /v1/health

CP readiness check

POST /v1/rollouts

Create 5 test plans

GET /v1/rollouts/stuck?threshold_seconds=3

Stuck scan (6 calls: empty, fresh, stale, after-pause, after-cancel, final)

POST /v1/rollouts/{id}/pause

Transition plan-b to paused phase

DELETE /v1/rollouts/{id}

Transition plan-c to cancelled (terminal) phase

POST /v1/rollouts/{id}/recover

Recovery: retry (plan-d), rollback (plan-e)

Plan Lifecycle Map

Plan ID

State After Creation

Action

Expected at VAL09-10

val09-plan-a

published (stale)

none

still stuck

val09-plan-b

published → paused

POST /pause

absent (paused ∉ activePhases)

val09-plan-c

published → cancelled

DELETE

absent (terminal phase)

val09-plan-d

published → active

recover retry

absent (updated_at refreshed)

val09-plan-e

published → rolled_back

recover rollback

absent (terminal phase)

Diagnosis Logic (from diagnoseStuck)

Since lab plans are created in published phase with no status record (no stage activations, no gate evaluations), the GetStatus call returns nil or an empty status. This produces the diagnosis:

"zero activations nodes may not be receiving the plan or artifact distribution is incomplete"

This is the expected and correct diagnosis for a freshly-created plan that has never been picked up by an edge agent.


Environment Assumptions

Assumption

Value

Platform

Linux (linux/amd64)

CP binary

$AUTONOMY_BIN

Transport

Plain HTTP (no TLS)

Stuck threshold

3 seconds (query parameter)

Sleep duration

4 seconds (ensures all plans exceed threshold)

Plans created

5 (val09-plan-a through val09-plan-e)

Data dir

$WORK_DIR/val09 (removed and recreated before each run)


Scenario Matrix

VAL09-01 — Control-Plane Reachable

Action: GET /v1/health against CP at 18994. Evidence: val09-health.txt Pass criterion: HTTP 200, status=ok.


VAL09-02 — Empty Store Baseline

Action: GET /v1/rollouts/stuck?threshold_seconds=3 on an empty store (no plans created yet). Evidence: val09-scan-empty.json, val09-baseline-check.txt Pass criterion: stuck_count=0. Rationale: Confirms the detection function handles empty stores without error.


VAL09-03 — Fresh Plans Not Stuck

Action: Create 5 plans; immediately scan without sleeping. Evidence: val09-plans-created.txt, val09-scan-fresh.json, val09-fresh-check.txt Pass criterion: stuck_count=0 — plans are younger than the 3-second threshold.


VAL09-04 — Stale Plans Detected

Action: Sleep 4 seconds (exceeding the 3-second threshold); rescan. Evidence: val09-scan-stale.json, val09-stale-check.txt Pass criterion: stuck_count=5 — all 5 plans appear as stuck.


VAL09-05 — Diagnosis Populated

Action: Inspect stuck_plans[*].diagnosis from the stale scan JSON. Evidence: val09-diagnosis-check.txt Pass criterion: All 5 stuck plans have the exact expected diagnosis string. Expected diagnosis: "zero activations nodes may not be receiving the plan or artifact distribution is incomplete" (published phase, no activations recorded).


VAL09-06 — Paused Plan Excluded

Action: POST /v1/rollouts/val09-plan-b/pause (transitions to paused phase, which is excluded from activePhases); rescan. Evidence: val09-pause-planb.json, val09-scan-after-pause.json, val09-pause-check.txt Pass criterion: val09-plan-b is absent from stuck_plans[*].plan_id.


VAL09-07 — Terminal Plan Excluded

Action: DELETE /v1/rollouts/val09-plan-c (transitions to cancelled terminal phase); rescan. Evidence: val09-cancel-planc.json, val09-scan-after-cancel.json, val09-cancel-check.txt Pass criterion: val09-plan-c is absent from stuck_plans[*].plan_id.


VAL09-08 — Recovery: Retry

Action: POST /v1/rollouts/val09-plan-d/recover with {"strategy":"retry","reason":"val09-stuck-test"}. Evidence: val09-recover-retry-pland.json, val09-retry-check.txt Pass criterion: Response body has new_phase=active and strategy=retry. Effect: UpdatePhase sets updated_at = now() — plan is no longer stale.


VAL09-09 — Recovery: Rollback

Action: POST /v1/rollouts/val09-plan-e/recover with {"strategy":"rollback","reason":"val09-stuck-test"}. Evidence: val09-recover-rollback-plane.json, val09-rollback-check.txt Pass criterion: Response body has new_phase=rolled_back and strategy=rollback. Effect: RollbackPlan transitions to terminal phase — excluded from active scan.


VAL09-10 — Post-Recovery Scan Clean

Action: Final GET /v1/rollouts/stuck?threshold_seconds=3. Evidence: val09-scan-final.json, val09-final-check.txt Pass criterion:

  • val09-plan-a IS present (no action taken; still stale and active)

  • val09-plan-b is absent (paused)

  • val09-plan-c is absent (cancelled — terminal)

  • val09-plan-d is absent (retry refreshed updated_at — no longer stale)

  • val09-plan-e is absent (rollback — terminal phase)


Evidence Files

File

Description

val09-cp.log

Control-plane stdout/stderr

val09-health.txt

Health check result

val09-plans-created.txt

5 plan IDs with HTTP create codes

val09-scan-empty.json

Stuck scan on empty store (VAL09-02)

val09-baseline-check.txt

stuck_count, expected, pass for empty scan

val09-scan-fresh.json

Stuck scan immediately after creation (VAL09-03)

val09-fresh-check.txt

stuck_count, expected, pass for fresh scan

val09-scan-stale.json

Stuck scan after sleep — all 5 plans stale (VAL09-04)

val09-stale-check.txt

stuck_count, expected, pass for stale scan

val09-diagnosis-check.txt

total, diagnosis_populated, diagnosis_exact, pass (VAL09-05)

val09-pause-planb.json

Pause response for plan-b

val09-scan-after-pause.json

Stuck scan after pausing plan-b (VAL09-06)

val09-pause-check.txt

paused_plan, in_stuck_scan, pass

val09-cancel-planc.json

Cancel response for plan-c

val09-scan-after-cancel.json

Stuck scan after cancelling plan-c (VAL09-07)

val09-cancel-check.txt

cancelled_plan, in_stuck_scan, pass

val09-recover-retry-pland.json

Retry recovery response for plan-d (VAL09-08)

val09-retry-check.txt

new_phase, pass for retry check

val09-recover-rollback-plane.json

Rollback recovery response for plan-e (VAL09-09)

val09-rollback-check.txt

new_phase, pass for rollback check

val09-scan-final.json

Final stuck scan after all recovery actions (VAL09-10)

val09-final-check.txt

Per-plan presence/absence flags + pass

val09-report.txt

Human-readable composite report (10 checks)

val09-report.json

Machine-readable composite report


Pass/Fail Criteria

Full pass: All 10 checks report PASS.

Minimum acceptable: VAL09-01, VAL09-04, VAL09-05, VAL09-08, VAL09-09 pass — detection works with populated diagnostics and both recovery strategies execute.

Key thresholds:

Check

Threshold

VAL09-04 (stale detection)

stuck_count=5 after 4-second sleep

VAL09-05 (diagnosis)

All stuck_plans match the expected diagnosis string

VAL09-08 (retry recovery)

new_phase=active in response

VAL09-09 (rollback recovery)

new_phase=rolled_back in response

VAL09-10 (post-recovery)

plan-a present; plan-b/c/d/e absent


Failure Handling

Symptom

Likely Cause

Resolution

VAL09-01 FAIL

CP binary missing or port 18994 conflict

Check val09-cp.log

VAL09-03 FAIL, fresh scan shows stuck

Host clock skew or very slow CP start

Check if sleep 4 completed before plans were created

VAL09-04 FAIL, stuck_count < 5

One or more plans failed to create

Check val09-plans-created.txt for non-201 codes

VAL09-05 FAIL, diagnosis empty

Store GetStatus returning errors

Inspect val09-scan-stale.json .stuck_plans[*].diagnosis

VAL09-06/07 FAIL, plan still in scan

Phase transition failed (pause/cancel returned error)

Check val09-pause-planb.json / val09-cancel-planc.json for error body

VAL09-08 FAIL, new_phase ≠ active

Retry recovery precondition blocked (plan already terminal)

Verify plan-d was not accidentally cancelled earlier

VAL09-09 FAIL, new_phase ≠ rolled_back

Rollback recovery returned error

Check val09-recover-rollback-plane.json for error body

VAL09-10 FAIL, plan-d still stuck

retry recovery did not refresh updated_at

Check store.UpdatePhase is called with current timestamp