# VAL11 — Fleet Rollout Chaos Test Pack

## Purpose
This plan validates that the AutonomyOps control-plane and CLI operator surface behave correctly under representative failure conditions: control-plane process failure, data durability under repeated restart, gate-wait plan survival, artifact corruption proxy, and bulk stuck-plan cascade recovery. It establishes that operator-facing diagnostics and recovery commands remain usable after each chaos injection.
## Branch-Specific Rule

| Question | Answer |
|---|---|
| Covered by existing lab? | Partially. VAL09 covers the post-failure observable state (stuck detection) but does not inject failure. VAL08 covers concurrent creates under load but not CP kill. VAL10 covers rollback reliability on clean plans. No existing lab kills the CP, verifies durability across restarts, or tests client error handling under partition. |
| Lab to extend | |
| Why new function? | A dedicated fresh CP at port 18996 is required. All prior CPs are torn down before this function runs. The chaos kill/restart cycle must not interact with other lab phases. |
| New runner required? | No. Extending |
## Chaos Mechanisms

Network-level injection (iptables) requires root and is outside the lab’s permitted scope. All chaos is injected via process-level signals — the CP process is killed with SIGTERM and restarted against the same SQLite data directory. This directly exercises:

| User scenario | Lab mechanism |
|---|---|
| Network partition during rollout | SIGTERM → client gets |
| Control-plane failure mid-rollout | SIGTERM after plan creates; verify SQLite data survives |
| Device unresponsive during update | Stuck scan proxy: |
| Artifact corruption mid-rollout | Plan with invalid |
| CP failover during gate wait | Kill CP while plan in |
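As a minimal sketch of the process-level injection, a sleeping Python child stands in for the CP process (the real harness launches the CP binary, which is not shown here); on POSIX, a SIGTERM death surfaces as a negative returncode:

```python
import signal
import subprocess
import sys
import time

# stand-in for the CP process; the real harness would launch the CP binary
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
time.sleep(0.2)                   # let the child start

proc.send_signal(signal.SIGTERM)  # process-level chaos injection
rc = proc.wait(timeout=5)
print(rc)                         # -15 on POSIX (negative signal number)
```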
## Why Process Kill Is Sufficient

SQLite’s write-ahead log (WAL) is flushed on every successful INSERT before the HTTP 201 is returned. A clean SIGTERM (not SIGKILL) gives the process a chance to close the WAL; even a SIGKILL does not corrupt committed rows, because WAL recovery runs at the next open. The lab uses SIGTERM to match what a process supervisor (systemd, Kubernetes liveness) would send first.
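The durability claim can be demonstrated with a standalone SQLite sketch (the table and row names are illustrative, not the CP’s actual schema): a child process commits a row in WAL mode and dies abruptly, yet the row survives the next open.

```python
import os
import sqlite3
import subprocess
import sys
import tempfile

db = os.path.join(tempfile.mkdtemp(prefix="val11-"), "plans.db")

# child: commit one row in WAL mode, then die abruptly without a clean close
child = (
    "import os, sqlite3\n"
    f"con = sqlite3.connect({db!r})\n"
    "con.execute('PRAGMA journal_mode=WAL')\n"
    "con.execute('CREATE TABLE plans (id TEXT)')\n"
    "con.execute(\"INSERT INTO plans VALUES ('val11-dur-1')\")\n"
    "con.commit()\n"
    "os._exit(9)\n"
)
subprocess.run([sys.executable, "-c", child], check=False)

# WAL recovery runs at the next open; the committed row is intact
rows = sqlite3.connect(db).execute("SELECT id FROM plans").fetchall()
print(rows)  # [('val11-dur-1',)]
```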
## Safety Guardrails

| Guardrail | Implementation |
|---|---|
| Isolated CP | Dedicated port 18996; fresh data dir removed before each run |
| Scoped to local process | No changes to system state; no iptables; no external service calls |
| Cleanup on exit | |
| | Chaos scenarios wrapped with |
| Workers awaited before kill | |
| Operator identity scoped | All CLI calls use |
| Evidence captured before kill | Stuck scan and plan queries happen while CP is live |
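The cleanup guardrail can be sketched as a try/finally wrapper around the chaos scenarios; the data-dir path and the sleeping stand-in process here are illustrative, not the lab’s real values:

```python
import os
import shutil
import signal
import subprocess
import sys
import tempfile

data_dir = tempfile.mkdtemp(prefix="val11-data-")  # illustrative data dir
cp = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
try:
    pass  # chaos scenarios would run here
finally:
    cp.send_signal(signal.SIGTERM)               # CP is always torn down
    cp.wait(timeout=5)
    shutil.rmtree(data_dir, ignore_errors=True)  # data dir is always removed

print(os.path.exists(data_dir))  # False
```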
## Claims Under Test

| ID | Claim |
|---|---|
| VAL11-C1 | After a CP process kill, client requests exit non-zero with a connection error; no silent data loss |
| VAL11-C2 | Plans committed before a CP kill are present in the store after CP restart; |
| VAL11-C3 | Three successive kill+restart cycles do not corrupt the store; new plan creates succeed after the final restart |
| VAL11-C4 | Operator-visible diagnostics and the |
## Scenario Matrix

### VAL11-01 — Control-Plane Baseline Reachable

Action: `GET /v1/health` against CP at 18996.
Evidence: `val11-health.txt`
Pass criterion: HTTP 200, `status=ok`.
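A sketch of the health probe, using a throwaway in-process HTTP server in place of the real CP (the `/v1/health` path and `status` field follow the plan; the server itself is a stand-in):

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Health(BaseHTTPRequestHandler):
    """Stand-in CP that serves only GET /v1/health."""
    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):
        pass  # keep output clean of per-request logs

srv = HTTPServer(("127.0.0.1", 0), Health)
threading.Thread(target=srv.serve_forever, daemon=True).start()

with urllib.request.urlopen(f"http://127.0.0.1:{srv.server_port}/v1/health") as r:
    code, body = r.status, json.load(r)
srv.shutdown()
print(code, body["status"])  # 200 ok
```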
### VAL11-02 — CP Kill → Client Error

Action: Kill CP (SIGTERM); attempt `autonomy rollback execute --target rollout_plan --resource val11-dur-1 --strategy retry --reason chaos-test`.
Evidence: `val11-kill-client-error.txt`, `val11-kill-check.txt`
Pass criterion: Command exits non-zero. Output contains a connection error keyword (`connection refused`, `dial`, or `connect`).
Simulates: Network partition / control-plane failure from the client perspective.
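The client-side detection can be exercised without the CLI: connecting to a local port with no listener raises the same refusal the CLI must surface. The keyword list mirrors the pass criterion; the classifier function itself is illustrative.

```python
import socket

KEYWORDS = ("connection refused", "dial", "connect")  # from the pass criterion

def classify_client_error(exc: OSError) -> str:
    """Map a client-side error to the first matching pass-criterion keyword."""
    msg = str(exc).lower()
    return next((kw for kw in KEYWORDS if kw in msg), "unknown")

# find a port with no listener by binding an ephemeral port, then closing it
probe = socket.socket()
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

try:
    socket.create_connection(("127.0.0.1", port), timeout=2).close()
    result = "unexpected success"
except OSError as exc:
    result = classify_client_error(exc)
print(result)  # connection refused
```

There is a small race if another process grabs the probed port between close and connect, but for a local sketch that is acceptable.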
### VAL11-03 — Data Durability Post-Restart

Action: Restart CP against the same data dir; `GET /v1/rollouts`.
Evidence: `val11-durability-list.json`, `val11-durability-check.txt`
Pass criterion: `list_count ≥ 10` (all 10 pre-kill plans are present after restart).
Simulates: Recovery after control-plane failure mid-rollout.
### VAL11-04 — Rapid Restart Resilience

Action: Kill and restart the CP 3 more times in quick succession (simulating unstable restarts); after the final restart, verify the plan count is unchanged and a new plan create (`val11-post-chaos-1`) returns HTTP 201.
Evidence: `val11-rapid-restart.txt`
Pass criterion: `list_count_final ≥ list_count_before`, `new_plan_code=201`.
Simulates: Repeated CP failover events.
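The resilience claim can be sketched against a plain SQLite file (schema and plan IDs are illustrative): three abrupt open-and-die cycles leave the store intact and writable.

```python
import os
import sqlite3
import subprocess
import sys
import tempfile

db = os.path.join(tempfile.mkdtemp(prefix="val11-"), "plans.db")
con = sqlite3.connect(db)
con.execute("PRAGMA journal_mode=WAL")
con.execute("CREATE TABLE plans (id TEXT)")
con.executemany("INSERT INTO plans VALUES (?)",
                [(f"val11-dur-{i}",) for i in range(1, 6)])
con.commit()
con.close()

# three "restart" cycles: each child opens the store, then dies abruptly
child = (f"import os, sqlite3; "
         f"sqlite3.connect({db!r}).execute('SELECT count(*) FROM plans'); "
         f"os._exit(9)")
for _ in range(3):
    subprocess.run([sys.executable, "-c", child], check=False)

# after the final cycle: count unchanged, and a new create still succeeds
con = sqlite3.connect(db)
con.execute("INSERT INTO plans VALUES ('val11-post-chaos-1')")
con.commit()
count = con.execute("SELECT count(*) FROM plans").fetchone()[0]
print(count)  # 6
```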
### VAL11-05 — Gate-Wait Plan Survives CP Restart

Action: After the VAL11-03 restart, `GET /v1/rollouts/val11-gate-1` and inspect `phase`.
Evidence: `val11-gate-plan.json`, `val11-gate-check.txt`
Pass criterion: Plan `val11-gate-1` exists and has `phase=published` (the gate-wait state is preserved across the CP kill boundary).
Simulates: CP failover during gate wait — the operator’s gate is not lost.
### VAL11-06 — Device-Unresponsive Detection (Stuck Proxy)

Action: 3 “device-unresponsive” proxy plans (`val11-dev-{1..3}`) created in `published` phase; sleep 3 seconds (threshold=2s); `GET /v1/rollouts/stuck?threshold_seconds=2`.
Evidence: `val11-scan-stuck.json`, `val11-stuck-check.txt`
Pass criterion: `stuck_count ≥ 3`, all stuck plans have a non-empty `diagnosis`.
Note: Other pre-existing plans also become stale during the sleep; the criterion is ≥ 3, not = 3.
Simulates: Operator detecting unresponsive devices during an active rollout.
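The threshold logic behind the stuck scan can be sketched as follows (field names like `updated_at` and the `diagnosis` text are assumptions about the endpoint’s model, not its actual schema):

```python
import time

def scan_stuck(plans, threshold_seconds, now=None):
    """Return published plans whose last update is older than the threshold."""
    now = time.time() if now is None else now
    stuck = []
    for p in plans:
        age = now - p["updated_at"]
        if p["phase"] == "published" and age > threshold_seconds:
            stuck.append({**p, "diagnosis": f"no phase transition for {age:.0f}s"})
    return stuck

now = 1_000_000.0
plans = [{"id": f"val11-dev-{i}", "phase": "published", "updated_at": now - 3}
         for i in (1, 2, 3)]
plans.append({"id": "val11-fresh", "phase": "published", "updated_at": now - 1})

stuck = scan_stuck(plans, threshold_seconds=2, now=now)
print(len(stuck), all(p["diagnosis"] for p in stuck))  # 3 True
```

This also shows why the pass criterion is ≥ 3 rather than = 3: any plan older than the threshold qualifies, not only the dev proxies.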
### VAL11-07 — Artifact Corruption Proxy: Plan Accepted

Action: `POST /v1/rollouts` with `artifact_ref=corrupt://chaos/artifact@sha256:deadbeefcorrupt` and `target_lock_fingerprint=blake3:0000000000000000corruption` as plan `val11-corrupt-1`. Then `GET /v1/rollouts/val11-corrupt-1`.
Evidence: `val11-corrupt-plan-create.txt`, `val11-corrupt-plan.json`, `val11-corrupt-check.txt`
Pass criterion: Create returns HTTP 201; GET returns a rollout payload with `plan.metadata.id=val11-corrupt-1`.
Rationale: The CP stores artifact metadata without validating the URI scheme or hash format. An operator who notices unusual artifact metadata can still query and act on the plan.
Simulates: Artifact corruption registered at rollout creation time.
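Since the CP accepts any scheme, the harness can only observe acceptance. A sketch of what advisory scheme validation might look like (the allow-list here is an assumption for illustration, not the CP’s policy):

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"oci", "https", "s3"}  # assumed allow-list, not CP policy

def artifact_ref_warnings(artifact_ref: str) -> list:
    """Advisory-only check; the CP under test stores any ref as-is."""
    scheme = urlparse(artifact_ref).scheme
    if scheme not in ALLOWED_SCHEMES:
        return [f"unrecognized artifact scheme: {scheme!r}"]
    return []

warnings = artifact_ref_warnings("corrupt://chaos/artifact@sha256:deadbeefcorrupt")
print(warnings)  # ["unrecognized artifact scheme: 'corrupt'"]
```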
### VAL11-08 — Artifact Corruption Proxy: Rollback Succeeds

Action: `autonomy rollback execute --target rollout_plan --resource val11-corrupt-1 --strategy rollback --reason chaos-artifact-corruption`.
Evidence: `val11-execute-rollback-corrupt.txt`, `val11-rollback-corrupt-check.txt`
Pass criterion: Command exits 0. Output contains `outcome=success` or an `outcome` field in JSON.
Simulates: Operator issuing a rollback upon discovering artifact corruption.
### VAL11-09 — Bulk Cascade Recovery

Action: For each of `val11-dev-{1..3}` (still in `published`/active phase after VAL11-06 stuck detection): run `autonomy rollback execute --target rollout_plan --resource val11-dev-N --strategy retry --reason chaos-cascade-recovery`.
Evidence: `cascade/execute-retry-dev-{1..3}.txt`, `val11-cascade-check.txt`
Pass criterion: All 3 commands exit 0. `cascade_ok=3`, `cascade_fail=0`.
Simulates: Operator issuing batch recovery after a cascade device failure during rollout.
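The batch recovery loop can be sketched with a stand-in for the CLI call (the real harness would shell out to `autonomy rollback execute` once per plan; the stand-in here just echoes a success line):

```python
import subprocess
import sys

def retry_plan(plan_id: str) -> bool:
    """Stand-in for one `autonomy rollback execute --strategy retry` call."""
    cmd = [sys.executable, "-c", f"print('outcome=success resource={plan_id}')"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0 and "outcome=success" in result.stdout

outcomes = [retry_plan(f"val11-dev-{i}") for i in (1, 2, 3)]
print(f"cascade_ok={sum(outcomes)} cascade_fail={outcomes.count(False)}")
```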
### VAL11-10 — Chaos Audit Integrity

Action: `autonomy audit query --event-type rollback.executed --actor val11-chaos-op --start-time <val11_start_time> --output json` against the retained audit store.
Evidence: `val11-audit-executed.json`, `val11-audit-check.txt`
Pass criterion: At least 1 `rollback.executed` event with `outcome=success` is present from the chaos session operator (`val11-chaos-op`) for this slice.
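The audit check reduces to an actor-, type-, and time-scoped filter. A sketch over an assumed event shape (`event_type`, `actor`, `timestamp`, `outcome` are illustrative field names, not the audit store’s real schema):

```python
def query_audit(events, event_type, actor, start_time):
    """Scope events the way the audit query does (UTC ISO-8601 sorts lexically)."""
    return [e for e in events
            if e["event_type"] == event_type
            and e["actor"] == actor
            and e["timestamp"] >= start_time]

events = [
    {"event_type": "rollback.executed", "actor": "val11-chaos-op",
     "timestamp": "2024-01-01T00:05:00Z", "outcome": "success"},
    {"event_type": "rollback.executed", "actor": "other-op",
     "timestamp": "2024-01-01T00:06:00Z", "outcome": "success"},
]
hits = query_audit(events, "rollback.executed", "val11-chaos-op",
                   "2024-01-01T00:00:00Z")
print(len(hits), hits[0]["outcome"])  # 1 success
```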
## Harness Plan

### Tools

| Tool | Purpose |
|---|---|
| | Chaos injection — CP process failure simulation |
| | Plan creates, stuck scan, plan GET, health checks |
| | Client-side failure detection (VAL11-02), rollback (VAL11-08, 09) |
| | JSON field extraction |
| | Verify actor-scoped, time-scoped audit events from retained store |
### Control-Plane Setup

| Resource | Value |
|---|---|
| Port | |
| Data dir | |
| RBAC | |
| Operator identity | |
### Plan Inventory

| Plan ID | Purpose | Pre-kill phase | Expected post-restart phase |
|---|---|---|---|
| | Durability baseline (VAL11-03) | published | published |
| | Gate-wait survival (VAL11-05) | published | published |
| | Artifact corruption proxy (VAL11-07/08) | published | rolled_back |
| | Device-unresponsive proxy (VAL11-06/09) | published → active (after retry) | active |
| | Created after 3× rapid restart (VAL11-04) | — | published |
### Execution Order

1. Start CP → VAL11-01 (health)
2. Create plans (dur-1..5, gate-1, corrupt-1, dev-1..3)
3. Sleep 3s (dev plans exceed threshold=2s)
4. VAL11-06 (stuck scan)
5. VAL11-07 (corrupt plan GET)
6. VAL11-08 (rollback execute corrupt)
7. VAL11-09 (cascade retry dev-1..3)
8. Kill CP
9. VAL11-02 (client error)
10. Restart CP
11. VAL11-03 (durability list)
12. VAL11-05 (gate plan phase)
13. Kill + restart CP × 3
14. VAL11-04 (rapid restart: list + new create)
15. VAL11-10 (audit integrity)
16. Kill CP (cleanup)
## Known Failure Modes

| Mode | Description | Detectable by |
|---|---|---|
| CP start failure | Port 18996 already in use | |
| VAL11-02 false pass | CP not fully dead when | Add |
| VAL11-03 phantom plans | SQLite WAL not recovered on restart | |
| VAL11-06 stuck_count < 3 | CP start took longer than 3s minus plan-create time | Increase |
| VAL11-08/09 CLI error | RBAC guard firing (unexpected) | Verify |
| VAL11-10 no events | Audit store not shared with CLI emitter | Check |
## Out-of-Scope Items

| Item | Reason |
|---|---|
| iptables / network-level partition | Requires root; not available in lab environment |
| Concurrent creates under kill | Inherently racy; replaced by deterministic rapid-restart resilience test |
| CP crash under write-heavy load (SIGKILL) | WAL recovery is SQLite-internal; not observable at CLI layer |
| PostgreSQL backend chaos | Requires live PG instance; separate Gate D soak test |
| Edge-agent device reconnect after partition | Edge agent not running in this lab |
| Automatic rollout stage promotion under chaos | Requires running batch promoter |
## Evidence Files

| File | Description |
|---|---|
| | CP stdout/stderr (across all start/stop cycles; append mode) |
| | CP baseline health check result |
| | 10 plan create HTTP codes (dur-1..5, gate-1, corrupt-1, dev-1..3) |
| | Stuck scan result (threshold=2s, after sleep=3s) |
| | |
| | HTTP 201 response for corrupt-artifact plan |
| | GET /v1/rollouts/val11-corrupt-1 |
| | |
| | rollback execute stdout for val11-corrupt-1 |
| | |
| | Cascade retry for val11-dev-1 |
| | Cascade retry for val11-dev-2 |
| | Cascade retry for val11-dev-3 |
| | |
| | CLI output after CP kill |
| | |
| | GET /v1/rollouts after restart |
| | |
| | GET /v1/rollouts/val11-gate-1 after restart |
| | |
| | 3× kill/restart log; final list count + new plan create code (VAL11-04) |
| | |
| | |
| | Human-readable composite report (10 checks) |
| | Machine-readable JSON report |
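A possible shape for the machine-readable report (the field names are assumptions; only the 10-check count comes from this plan):

```python
import json

checks = {f"VAL11-{i:02d}": "PASS" for i in range(1, 11)}  # placeholder results
report = {
    "pack": "VAL11",
    "environment": "local-SQLite / port 18996",
    "pass_count": sum(v == "PASS" for v in checks.values()),
    "total_checks": len(checks),
    "checks": checks,
}
print(json.dumps({"pass_count": report["pass_count"],
                  "total_checks": report["total_checks"]}))
```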
## Pass/Fail Criteria

Full pass: All 10 checks report PASS.
Minimum acceptable: VAL11-01, VAL11-02, VAL11-03, VAL11-05 pass — the core durability + client-error story.
Key thresholds:

| Check | Threshold |
|---|---|
| VAL11-02 (kill error) | Exit code ≠ 0; connection error keyword in output |
| VAL11-03 (durability) | |
| VAL11-04 (rapid restart) | |
| VAL11-05 (gate wait) | |
| VAL11-06 (unresponsive) | |
| VAL11-09 (cascade) | |
| VAL11-10 (audit) | ≥ 1 |
## Recommendations Section Template

```markdown
## VAL11 Chaos Run — Operator Recommendations
Date: <ISO-8601>
Pass: <N>/10
Environment: local-SQLite / port 18996

### Durability
- [PASS/FAIL] All pre-kill plans recovered after restart (VAL11-03, 04, 05)
- Finding: <any plans missing or phase unexpected>
- Recommendation: <if FAIL, check WAL checkpoint settings or SQLite build>

### Client Error Handling
- [PASS/FAIL] CLI exits non-zero with connection error after CP kill (VAL11-02)
- Finding: <actual exit code and error text>
- Recommendation: <if FAIL, verify CLI does not silently swallow dial errors>

### Artifact Corruption
- [PASS/FAIL] CP accepts and stores corrupt-artifact plan; rollback succeeds (VAL11-07, 08)
- Finding: <any unexpected rejection or rollback failure>
- Recommendation: Consider adding artifact URI scheme validation at plan-creation time

### Cascade Recovery
- [PASS/FAIL] All 3 stuck (device-unresponsive proxy) plans recovered via retry (VAL11-06, 09)
- Finding: <any recovery failures or unexpected phases>
- Recommendation: <if failures, check stuck detection threshold vs actual staleness>

### Audit Integrity
- [PASS/FAIL] Rollback events captured across CP kill boundary (VAL11-10)
- Finding: <event count vs expected>
- Recommendation: <if events missing, verify AUTONOMY_AUDIT_DIR is the same path for CLI and CP>
```