VAL 09 — Stuck Rollout Detection Validation¶
Purpose¶
This plan validates that the AutonomyOps control-plane correctly identifies stuck rollout
plans, provides operator-visible diagnostics, and transitions plans to expected phases
via the retry and rollback recovery strategies exercised by this slice.
Claims Under Test¶
ID |
Claim |
|---|---|
VAL09-C1 |
Plans in active phases with |
VAL09-C2 |
Plans in paused or terminal phases are excluded from stuck detection regardless of |
VAL09-C3 |
|
VAL09-C4 |
|
Branch-Specific Rule¶
Question |
Answer |
|---|---|
Covered by an existing lab? |
No. The main rollout lab (port 18888) and VAL07/VAL08 CPs are killed before any stuck validation can run. Stuck detection requires a live CP with plans in active phases. |
Lab to extend |
|
Port allocation |
CP: |
New runner required? |
No. Extending the existing CLI audit lab is sufficient. |
Staleness Injection Method¶
The GET /v1/rollouts/stuck endpoint accepts a ?threshold_seconds=N query parameter
(default: 3600). The lab uses threshold_seconds=3 (3-second threshold) and sleeps 4
seconds after plan creation. This is:
Deterministic: all 5 plans become stale at the same time (same
updated_attimestamp from theCreatePlancall)Dependency-free: no SQLite manipulation, no external clock injection
Minimal overhead: 4-second sleep is the only intentional delay in the function
Plans created in published phase — which is listed in activePhases — satisfy all
three stuck preconditions: active phase, stale timestamp, no recent stage promotion.
Architecture¶
API Endpoints Exercised¶
Endpoint |
Purpose |
|---|---|
|
CP readiness check |
|
Create 5 test plans |
|
Stuck scan (6 calls: empty, fresh, stale, after-pause, after-cancel, final) |
|
Transition plan-b to paused phase |
|
Transition plan-c to cancelled (terminal) phase |
|
Recovery: retry (plan-d), rollback (plan-e) |
Plan Lifecycle Map¶
Plan ID |
State After Creation |
Action |
Expected at VAL09-10 |
|---|---|---|---|
|
published (stale) |
none |
still stuck |
|
published → paused |
|
absent (paused ∉ activePhases) |
|
published → cancelled |
|
absent (terminal phase) |
|
published → active |
|
absent ( |
|
published → rolled_back |
|
absent (terminal phase) |
Diagnosis Logic (from diagnoseStuck)¶
Since lab plans are created in published phase with no status record (no stage
activations, no gate evaluations), the GetStatus call returns nil or an empty
status. This produces the diagnosis:
"zero activations — nodes may not be receiving the plan or artifact distribution is incomplete"
This is the expected and correct diagnosis for a freshly-created plan that has never been picked up by an edge agent.
Environment Assumptions¶
Assumption |
Value |
|---|---|
Platform |
Linux ( |
CP binary |
|
Transport |
Plain HTTP (no TLS) |
Stuck threshold |
|
Sleep duration |
|
Plans created |
5 ( |
Data dir |
|
Scenario Matrix¶
VAL09-01 — Control-Plane Reachable¶
Action: GET /v1/health against CP at 18994.
Evidence: val09-health.txt
Pass criterion: HTTP 200, status=ok.
VAL09-02 — Empty Store Baseline¶
Action: GET /v1/rollouts/stuck?threshold_seconds=3 on an empty store (no plans
created yet).
Evidence: val09-scan-empty.json, val09-baseline-check.txt
Pass criterion: stuck_count=0.
Rationale: Confirms the detection function handles empty stores without error.
VAL09-03 — Fresh Plans Not Stuck¶
Action: Create 5 plans; immediately scan without sleeping.
Evidence: val09-plans-created.txt, val09-scan-fresh.json, val09-fresh-check.txt
Pass criterion: stuck_count=0 — plans are younger than the 3-second threshold.
VAL09-04 — Stale Plans Detected¶
Action: Sleep 4 seconds (exceeding the 3-second threshold); rescan.
Evidence: val09-scan-stale.json, val09-stale-check.txt
Pass criterion: stuck_count=5 — all 5 plans appear as stuck.
VAL09-05 — Diagnosis Populated¶
Action: Inspect stuck_plans[*].diagnosis from the stale scan JSON.
Evidence: val09-diagnosis-check.txt
Pass criterion: All 5 stuck plans have the exact expected diagnosis string.
Expected diagnosis: "zero activations — nodes may not be receiving the plan or artifact distribution is incomplete" (published phase, no activations recorded).
VAL09-06 — Paused Plan Excluded¶
Action: POST /v1/rollouts/val09-plan-b/pause (transitions to paused phase,
which is excluded from activePhases); rescan.
Evidence: val09-pause-planb.json, val09-scan-after-pause.json, val09-pause-check.txt
Pass criterion: val09-plan-b is absent from stuck_plans[*].plan_id.
VAL09-07 — Terminal Plan Excluded¶
Action: DELETE /v1/rollouts/val09-plan-c (transitions to cancelled terminal
phase); rescan.
Evidence: val09-cancel-planc.json, val09-scan-after-cancel.json, val09-cancel-check.txt
Pass criterion: val09-plan-c is absent from stuck_plans[*].plan_id.
VAL09-08 — Recovery: Retry¶
Action: POST /v1/rollouts/val09-plan-d/recover with
{"strategy":"retry","reason":"val09-stuck-test"}.
Evidence: val09-recover-retry-pland.json, val09-retry-check.txt
Pass criterion: Response body has new_phase=active and strategy=retry.
Effect: UpdatePhase sets updated_at = now() — plan is no longer stale.
VAL09-09 — Recovery: Rollback¶
Action: POST /v1/rollouts/val09-plan-e/recover with
{"strategy":"rollback","reason":"val09-stuck-test"}.
Evidence: val09-recover-rollback-plane.json, val09-rollback-check.txt
Pass criterion: Response body has new_phase=rolled_back and strategy=rollback.
Effect: RollbackPlan transitions to terminal phase — excluded from active scan.
VAL09-10 — Post-Recovery Scan Clean¶
Action: Final GET /v1/rollouts/stuck?threshold_seconds=3.
Evidence: val09-scan-final.json, val09-final-check.txt
Pass criterion:
val09-plan-aIS present (no action taken; still stale and active)val09-plan-bis absent (paused)val09-plan-cis absent (cancelled — terminal)val09-plan-dis absent (retryrefreshedupdated_at— no longer stale)val09-plan-eis absent (rollback— terminal phase)
Evidence Files¶
File |
Description |
|---|---|
|
Control-plane stdout/stderr |
|
Health check result |
|
5 plan IDs with HTTP create codes |
|
Stuck scan on empty store (VAL09-02) |
|
|
|
Stuck scan immediately after creation (VAL09-03) |
|
|
|
Stuck scan after sleep — all 5 plans stale (VAL09-04) |
|
|
|
|
|
Pause response for plan-b |
|
Stuck scan after pausing plan-b (VAL09-06) |
|
|
|
Cancel response for plan-c |
|
Stuck scan after cancelling plan-c (VAL09-07) |
|
|
|
Retry recovery response for plan-d (VAL09-08) |
|
|
|
Rollback recovery response for plan-e (VAL09-09) |
|
|
|
Final stuck scan after all recovery actions (VAL09-10) |
|
Per-plan presence/absence flags + |
|
Human-readable composite report (10 checks) |
|
Machine-readable composite report |
Pass/Fail Criteria¶
Full pass: All 10 checks report PASS.
Minimum acceptable: VAL09-01, VAL09-04, VAL09-05, VAL09-08, VAL09-09 pass — detection works with populated diagnostics and both recovery strategies execute.
Key thresholds:
Check |
Threshold |
|---|---|
VAL09-04 (stale detection) |
|
VAL09-05 (diagnosis) |
All |
VAL09-08 (retry recovery) |
|
VAL09-09 (rollback recovery) |
|
VAL09-10 (post-recovery) |
plan-a present; plan-b/c/d/e absent |
Failure Handling¶
Symptom |
Likely Cause |
Resolution |
|---|---|---|
VAL09-01 FAIL |
CP binary missing or port 18994 conflict |
Check |
VAL09-03 FAIL, fresh scan shows stuck |
Host clock skew or very slow CP start |
Check if |
VAL09-04 FAIL, stuck_count < 5 |
One or more plans failed to create |
Check |
VAL09-05 FAIL, diagnosis empty |
Store |
Inspect |
VAL09-06/07 FAIL, plan still in scan |
Phase transition failed (pause/cancel returned error) |
Check |
VAL09-08 FAIL, new_phase ≠ active |
Retry recovery precondition blocked (plan already terminal) |
Verify plan-d was not accidentally cancelled earlier |
VAL09-09 FAIL, new_phase ≠ rolled_back |
Rollback recovery returned error |
Check |
VAL09-10 FAIL, plan-d still stuck |
|
Check |