VAL 11 — Fleet Rollout Chaos Test Pack

Purpose

This plan validates that the AutonomyOps control-plane and CLI operator surface behave correctly under representative failure conditions: control-plane process failure, data durability under repeated restart, gate-wait plan survival, artifact corruption proxy, and bulk stuck-plan cascade recovery. It establishes that operator-facing diagnostics and recovery commands remain usable after each chaos injection.


Branch-Specific Rule

| Question | Answer |
| --- | --- |
| Covered by existing lab? | Partially. VAL09 covers the post-failure observable state (stuck detection) but does not inject failure. VAL08 covers concurrent creates under load but not CP kill. VAL10 covers rollback reliability on clean plans. No existing lab kills the CP, verifies durability across restarts, or tests client error handling under partition. |
| Lab to extend | scripts/labs/run_cli_audit_lab.sh — new function run_chaos_val11_lab() |
| Why new function? | A dedicated fresh CP at port 18996 is required. All prior CPs are torn down before this function runs. The chaos kill/restart cycle must not interact with other lab phases. |
| New runner required? | No. Extending run_cli_audit_lab.sh is sufficient. |


Chaos Mechanisms

Network-level injection (iptables) requires root and is outside the lab’s permitted scope. All chaos is injected via process-level signals — the CP process is killed with SIGTERM and restarted against the same SQLite data directory. This directly exercises:

| User scenario | Lab mechanism |
| --- | --- |
| Network partition during rollout | SIGTERM → client gets connection refused or dial error |
| Control-plane failure mid-rollout | SIGTERM after plan creates; verify SQLite data survives |
| Device unresponsive during update | Stuck scan proxy: threshold_seconds=2, sleep 3 |
| Artifact corruption mid-rollout | Plan with invalid artifact_ref; CP accepts; operator rollback tested |
| CP failover during gate wait | Kill CP while plan in published phase; restart; verify phase preserved |

Why Process Kill Is Sufficient

SQLite’s write-ahead log (WAL) is flushed on every successful INSERT before the HTTP 201 is returned. A clean SIGTERM (not SIGKILL) gives the process a chance to close the WAL; even a SIGKILL does not corrupt committed rows because WAL recovery runs at next open. The lab uses SIGTERM to match what a process supervisor (systemd, Kubernetes liveness) would send first.
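As a sketch, the kill/restart cycle might look like the helper below. The CP_CMD and CP_PID variable names and the --data-dir flag are assumptions for illustration, not the real lab's interface; the key points are SIGTERM (not SIGKILL), reaping the old process before restart, and reusing the same data directory.

```shell
# Hypothetical kill/restart helper. CP_CMD, CP_PID, and --data-dir are
# assumptions; substitute the real control-plane binary and flags.
kill_and_restart_cp() {
  kill -TERM "$CP_PID" 2>/dev/null || true   # supervisor-style graceful stop
  wait "$CP_PID" 2>/dev/null || true         # reap before reusing the data dir
  "$CP_CMD" --data-dir "$WORK_DIR/val11" >>"$WORK_DIR/val11-cp.log" 2>&1 &
  CP_PID=$!                                  # track the new process for the next cycle
}
```

Appending to val11-cp.log (rather than truncating) keeps one continuous log across all start/stop cycles, matching the Evidence Files description.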


Safety Guardrails

| Guardrail | Implementation |
| --- | --- |
| Isolated CP | Dedicated port 18996; fresh data dir removed before each run |
| Scoped to local process | No changes to system state; no iptables; no external service calls |
| Cleanup on exit | cleanup() trap kills all background PIDs and removes $WORK_DIR |
| set -euo pipefail maintained | Chaos scenarios wrapped with \|\| true to prevent false set -e exits |
| Workers awaited before kill | wait before chaos kill prevents zombie subshells |
| Operator identity scoped | All CLI calls use AUTONOMY_OPERATOR=val11-chaos-op for audit attribution |
| Evidence captured before kill | Stuck scan and plan queries happen while CP is live |
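The cleanup-on-exit guardrail can be sketched as below; the BG_PIDS array name is an assumption, but the trap-on-EXIT pattern and the || true suppression are the mechanisms the guardrails describe.

```shell
# Sketch of the cleanup guardrail. BG_PIDS as the background-PID array name
# is an assumption; the EXIT trap fires on normal exit and set -e aborts alike.
set -euo pipefail
BG_PIDS=()
cleanup() {
  for pid in "${BG_PIDS[@]:-}"; do
    kill "$pid" 2>/dev/null || true           # stop any CP still running
  done
  [ -n "${WORK_DIR:-}" ] && rm -rf "$WORK_DIR" || true   # remove scratch state
}
trap cleanup EXIT
```

Because the trap also fires when set -e aborts the script, evidence directories never outlive a failed run.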


Claims Under Test

| ID | Claim |
| --- | --- |
| VAL11-C1 | After a CP process kill, client requests exit non-zero with a connection error; no silent data loss |
| VAL11-C2 | Plans committed before a CP kill are present in the store after CP restart; the published phase is preserved across the kill boundary |
| VAL11-C3 | Three successive kill+restart cycles do not corrupt the store; new plan creates succeed after the final restart |
| VAL11-C4 | Operator-visible diagnostics and the rollback execute path function correctly for artifact-corruption proxies and cascade-recovery workflows |


Scenario Matrix

VAL11-01 — Control-Plane Baseline Reachable

Action: GET /v1/health against CP at 18996.
Evidence: val11-health.txt
Pass criterion: HTTP 200, status=ok.


VAL11-02 — CP Kill → Client Error

Action: Kill CP (SIGTERM); attempt autonomy rollback execute --target rollout_plan --resource val11-dur-1 --strategy retry --reason chaos-test.
Evidence: val11-kill-client-error.txt, val11-kill-check.txt
Pass criterion: Command exits non-zero. Output contains a connection-error keyword (connection refused, dial, or connect).
Simulates: Network partition / control-plane failure from the client perspective.
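The VAL11-02 pass check can be sketched as a small predicate; check_connection_error is a hypothetical helper name, and the keyword list mirrors the pass criterion (the exact CLI error text will vary by transport layer).

```shell
# Hypothetical VAL11-02 pass predicate: non-zero exit plus a connection-error
# keyword in the CLI output. Helper name and output format are assumptions.
check_connection_error() {
  local rc="$1" output="$2"
  if [ "$rc" -ne 0 ] && echo "$output" | grep -Eqi 'connection refused|dial|connect'; then
    echo "pass"
  else
    echo "fail"
  fi
}
```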


VAL11-03 — Data Durability Post-Restart

Action: Restart CP against the same data dir; GET /v1/rollouts.
Evidence: val11-durability-list.json, val11-durability-check.txt
Pass criterion: list_count ≥ 10 (all 10 pre-kill plans are present after restart).
Simulates: Recovery after control-plane failure mid-rollout.


VAL11-04 — Rapid Restart Resilience

Action: Kill and restart the CP 3 more times in quick succession (simulating unstable restarts); after the final restart, verify the plan count is unchanged and a new plan create (val11-post-chaos-1) returns HTTP 201.
Evidence: val11-rapid-restart.txt
Pass criterion: list_count_final ≥ list_count_before; new_plan_code=201.
Simulates: Repeated CP failover events.


VAL11-05 — Gate-Wait Plan Survives CP Restart

Action: After the VAL11-03 restart, GET /v1/rollouts/val11-gate-1 and inspect phase.
Evidence: val11-gate-plan.json, val11-gate-check.txt
Pass criterion: Plan val11-gate-1 exists and has phase=published (the gate-wait state is preserved across the CP kill boundary).
Simulates: CP failover during gate wait — the operator’s gate is not lost.


VAL11-06 — Device-Unresponsive Detection (Stuck Proxy)

Action: 3 “device-unresponsive” proxy plans (val11-dev-{1..3}) are created in published phase; sleep 3 seconds (threshold=2s); GET /v1/rollouts/stuck?threshold_seconds=2.
Evidence: val11-scan-stuck.json, val11-stuck-check.txt
Pass criterion: stuck_count ≥ 3; all stuck plans have a non-empty diagnosis.
Note: Other pre-existing plans also become stale during the sleep; the criterion is ≥ 3, not exactly 3.
Simulates: Operator detecting unresponsive devices during an active rollout.


VAL11-07 — Artifact Corruption Proxy: Plan Accepted

Action: POST /v1/rollouts with artifact_ref=corrupt://chaos/artifact@sha256:deadbeefcorrupt and target_lock_fingerprint=blake3:0000000000000000corruption as plan val11-corrupt-1. Then GET /v1/rollouts/val11-corrupt-1.
Evidence: val11-corrupt-plan-create.txt, val11-corrupt-plan.json, val11-corrupt-check.txt
Pass criterion: Create returns HTTP 201; GET returns a rollout payload with plan.metadata.id=val11-corrupt-1.
Rationale: The CP stores artifact metadata without validating the URI scheme or hash format. An operator who notices unusual artifact metadata can still query and act on the plan.
Simulates: Artifact corruption registered at rollout creation time.
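A minimal sketch of the VAL11-07 request body; the artifact_ref and target_lock_fingerprint values are quoted from the scenario, but the "id" field name is an assumption about the create-plan schema.

```shell
# Sketch of the corrupt-artifact create payload for VAL11-07. The "id" field
# name is an assumption; the two corrupt values come from the scenario text.
payload='{
  "id": "val11-corrupt-1",
  "artifact_ref": "corrupt://chaos/artifact@sha256:deadbeefcorrupt",
  "target_lock_fingerprint": "blake3:0000000000000000corruption"
}'
# Validate the JSON locally before POSTing it to /v1/rollouts.
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload_ok"
```

The corruption is deliberately syntactically valid JSON: the chaos lives in the URI scheme and hash content, not in the wire format.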


VAL11-08 — Artifact Corruption Proxy: Rollback Succeeds

Action: autonomy rollback execute --target rollout_plan --resource val11-corrupt-1 --strategy rollback --reason chaos-artifact-corruption.
Evidence: val11-execute-rollback-corrupt.txt, val11-rollback-corrupt-check.txt
Pass criterion: Command exits 0. Output contains outcome=success or an outcome field in JSON.
Simulates: Operator issuing a rollback upon discovering artifact corruption.


VAL11-09 — Bulk Cascade Recovery

Action: For each of val11-dev-{1..3} (still in published / active phase after VAL11-06 stuck detection): run autonomy rollback execute --target rollout_plan --resource val11-dev-N --strategy retry --reason chaos-cascade-recovery.
Evidence: cascade/execute-retry-dev-{1..3}.txt, val11-cascade-check.txt
Pass criterion: All 3 commands exit 0; cascade_ok=3, cascade_fail=0.
Simulates: Operator issuing batch recovery after a cascading device failure during rollout.
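The cascade loop and its ok/fail counters might look like the sketch below; run_retry is a placeholder for the real autonomy rollback execute invocation, so the sketch stays self-contained.

```shell
# Sketch of the VAL11-09 cascade loop. run_retry stands in for:
#   autonomy rollback execute --target rollout_plan --resource "$1" \
#     --strategy retry --reason chaos-cascade-recovery
run_retry() { true; }            # placeholder so the sketch runs standalone
outdir=$(mktemp -d)              # stands in for the cascade/ evidence dir
cascade_ok=0; cascade_fail=0
for n in 1 2 3; do
  if run_retry "val11-dev-$n" > "$outdir/execute-retry-dev-$n.txt" 2>&1; then
    cascade_ok=$((cascade_ok + 1))
  else
    cascade_fail=$((cascade_fail + 1))
  fi
done
echo "cascade_ok=$cascade_ok cascade_fail=$cascade_fail"
```

Counting failures instead of aborting on the first one matches the guardrail of wrapping chaos steps in || true-style handling: the loop must visit all three plans even if one retry fails.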


VAL11-10 — Chaos Audit Integrity

Action: autonomy audit query --event-type rollback.executed --actor val11-chaos-op --start-time <val11_start_time> --output json against the retained audit store.
Evidence: val11-audit-executed.json, val11-audit-check.txt
Pass criterion: At least 1 rollback.executed event with outcome=success is present from the chaos-session operator (val11-chaos-op) for this slice.


Harness Plan

Tools

| Tool | Purpose |
| --- | --- |
| kill (SIGTERM) | Chaos injection — CP process failure simulation |
| curl | Plan creates, stuck scan, plan GET, health checks |
| autonomy rollback execute | Client-side failure detection (VAL11-02), rollback (VAL11-08, 09) |
| python3 -c | JSON field extraction |
| autonomy audit query | Verify actor-scoped, time-scoped audit events from retained store |
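The python3 -c extraction pattern can be sketched as below; the {"rollouts": [...]} response shape is an assumption about the GET /v1/rollouts payload, not a documented schema.

```shell
# Sketch of JSON field extraction with python3 -c, using a sample payload.
# The "rollouts" key is an assumption about the /v1/rollouts response shape.
sample='{"rollouts": [{"id": "val11-dur-1"}, {"id": "val11-dur-2"}]}'
list_count=$(echo "$sample" \
  | python3 -c 'import json,sys; print(len(json.load(sys.stdin)["rollouts"]))')
echo "list_count=$list_count"
```

Using python3 for parsing avoids grep-counting braces, which breaks as soon as the payload nests.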

Control-Plane Setup

| Resource | Value |
| --- | --- |
| Port | 127.0.0.1:18996 |
| Data dir | $WORK_DIR/val11 (removed and recreated before each run; reused across restarts) |
| RBAC | AUTONOMY_RBAC_ENFORCEMENT=0 |
| Operator identity | AUTONOMY_OPERATOR=val11-chaos-op |
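The setup above might translate to the following shell preamble; only the two AUTONOMY_* variables are documented, so CP_ADDR, DATA_DIR, and the rm/mkdir sequence are illustrative assumptions.

```shell
# Sketch of the control-plane setup. The two exports are documented; the
# CP_ADDR/DATA_DIR names and the reset sequence are assumptions.
WORK_DIR="${WORK_DIR:-$(mktemp -d)}"
export AUTONOMY_RBAC_ENFORCEMENT=0          # RBAC disabled for the chaos run
export AUTONOMY_OPERATOR=val11-chaos-op     # audit attribution identity
CP_ADDR=127.0.0.1:18996
DATA_DIR="$WORK_DIR/val11"
rm -rf "$DATA_DIR" && mkdir -p "$DATA_DIR"  # fresh per run; reused across restarts
```

The data dir is reset once per lab run but deliberately NOT reset between kill/restart cycles, since durability across restarts is the property under test.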

Plan Inventory

| Plan ID | Purpose | Pre-kill phase | Expected post-restart phase |
| --- | --- | --- | --- |
| val11-dur-{1..5} | Durability baseline (VAL11-03) | published | published |
| val11-gate-1 | Gate-wait survival (VAL11-05) | published | published |
| val11-corrupt-1 | Artifact corruption proxy (VAL11-07/08) | published | rolled_back |
| val11-dev-{1..3} | Device-unresponsive proxy (VAL11-06/09) | published → active (after retry) | active |
| val11-post-chaos-1 | Created after 3× rapid restart (VAL11-04) | — (created post-chaos) | published |

Execution Order

1. Start CP → VAL11-01 (health)
2. Create plans (dur-1..5, gate-1, corrupt-1, dev-1..3)
3. Sleep 3s (dev plans exceed threshold=2s)
4. VAL11-06 (stuck scan)
5. VAL11-07 (corrupt plan GET)
6. VAL11-08 (rollback execute corrupt)
7. VAL11-09 (cascade retry dev-1..3)
8. Kill CP
9. VAL11-02 (client error)
10. Restart CP
11. VAL11-03 (durability list)
12. VAL11-05 (gate plan phase)
13. Kill + restart CP × 3
14. VAL11-04 (rapid restart: list + new create)
15. VAL11-10 (audit integrity)
16. Kill CP (cleanup)
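The execution order can be sketched as a run_chaos_val11_lab() skeleton. Every helper name below is an assumption (the real lab may inline these steps), and VAL11_CHAOS_SLEEP is a hypothetical knob defaulting to the 3-second stale window.

```shell
# Skeleton of run_chaos_val11_lab() mirroring the execution order.
# All helper names and the VAL11_CHAOS_SLEEP variable are assumptions.
run_chaos_val11_lab() {
  start_cp                            # VAL11-01: baseline health
  create_plans                        # dur-1..5, gate-1, corrupt-1, dev-1..3
  sleep "${VAL11_CHAOS_SLEEP:-3}"     # let dev plans exceed threshold=2s
  scan_stuck                          # VAL11-06
  check_corrupt_plan                  # VAL11-07
  rollback_corrupt                    # VAL11-08
  cascade_retry                       # VAL11-09
  kill_cp
  check_client_error                  # VAL11-02
  start_cp
  check_durability                    # VAL11-03
  check_gate_phase                    # VAL11-05
  for _ in 1 2 3; do kill_cp; start_cp; done
  check_rapid_restart                 # VAL11-04
  check_audit                         # VAL11-10
  kill_cp                             # final cleanup
}
```

Making the sleep a variable (rather than a literal 3) also covers the VAL11-06 failure mode below, where the stale window must be widened to 5 seconds.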

Known Failure Modes

| Mode | Description | Detection / mitigation |
| --- | --- | --- |
| CP start failure | Port 18996 already in use | val11-health.txt shows status=unreachable |
| VAL11-02 false pass | CP not fully dead when the autonomy CLI runs | Add sleep 0.3 after kill, before the CLI invocation |
| VAL11-03 phantom plans | SQLite WAL not recovered on restart | val11-durability-list.json plan count < 5 |
| VAL11-06 stuck_count < 3 | CP start took longer than 3s minus plan-create time | Increase val11_chaos_sleep from 3 to 5 |
| VAL11-08/09 CLI error | RBAC guard firing (unexpected) | Verify AUTONOMY_RBAC_ENFORCEMENT=0 is exported |
| VAL11-10 no events | Audit store not shared with CLI emitter | Check $AUTONOMY_AUDIT_DIR path in the CP invocation |

Out-of-Scope Items

| Item | Reason |
| --- | --- |
| iptables / network-level partition | Requires root; not available in the lab environment |
| Concurrent creates under kill | Inherently racy; replaced by the deterministic rapid-restart resilience test |
| CP crash under write-heavy load (SIGKILL) | WAL recovery is SQLite-internal; not observable at the CLI layer |
| PostgreSQL backend chaos | Requires a live PG instance; separate Gate D soak test |
| Edge-agent device reconnect after partition | Edge agent not running in this lab |
| Automatic rollout stage promotion under chaos | Requires a running batch promoter |


Evidence Files

| File | Description |
| --- | --- |
| val11-cp.log | CP stdout/stderr (across all start/stop cycles; append mode) |
| val11-health.txt | CP baseline health check result |
| val11-plans-created.txt | 10 plan-create HTTP codes (dur-1..5, gate-1, corrupt-1, dev-1..3) |
| val11-scan-stuck.json | Stuck scan result (threshold=2s, after sleep=3s) |
| val11-stuck-check.txt | stuck_count, diagnosis_ok, pass (VAL11-06) |
| val11-corrupt-plan-create.txt | HTTP 201 response for the corrupt-artifact plan |
| val11-corrupt-plan.json | GET /v1/rollouts/val11-corrupt-1 |
| val11-corrupt-check.txt | create_code=201, get_ok=true with plan.metadata.id=val11-corrupt-1, pass (VAL11-07) |
| val11-execute-rollback-corrupt.txt | rollback execute stdout for val11-corrupt-1 |
| val11-rollback-corrupt-check.txt | exit_code=0, pass (VAL11-08) |
| cascade/execute-retry-dev-1.txt | Cascade retry for val11-dev-1 |
| cascade/execute-retry-dev-2.txt | Cascade retry for val11-dev-2 |
| cascade/execute-retry-dev-3.txt | Cascade retry for val11-dev-3 |
| val11-cascade-check.txt | cascade_ok=3, cascade_fail=0, pass (VAL11-09) |
| val11-kill-client-error.txt | CLI output after the CP kill |
| val11-kill-check.txt | exit_code≠0, has_connection_error, pass (VAL11-02) |
| val11-durability-list.json | GET /v1/rollouts after restart |
| val11-durability-check.txt | list_count≥10, pass (VAL11-03) |
| val11-gate-plan.json | GET /v1/rollouts/val11-gate-1 after restart |
| val11-gate-check.txt | phase=published, pass (VAL11-05) |
| val11-rapid-restart.txt | 3× kill/restart log; final list count + new plan create code (VAL11-04) |
| val11-audit-executed.json | rollback.executed events from the retained store |
| val11-audit-check.txt | count≥1 scoped by actor + start_time, pass (VAL11-10) |
| val11-report.txt | Human-readable composite report (10 checks) |
| val11-report.json | Machine-readable JSON report |


Pass/Fail Criteria

Full pass: All 10 checks report PASS.

Minimum acceptable: VAL11-01, VAL11-02, VAL11-03, VAL11-05 pass — the core durability + client-error story.

Key thresholds:

| Check | Threshold |
| --- | --- |
| VAL11-02 (kill error) | Exit code ≠ 0; connection-error keyword in output |
| VAL11-03 (durability) | list_count ≥ 10 after first restart |
| VAL11-04 (rapid restart) | list_count_final ≥ list_count_before; new_plan_code=201 |
| VAL11-05 (gate wait) | phase=published after kill+restart |
| VAL11-06 (unresponsive) | stuck_count ≥ 3; all have non-empty diagnosis |
| VAL11-09 (cascade) | cascade_ok=3, cascade_fail=0 |
| VAL11-10 (audit) | ≥ 1 rollback.executed with outcome=success |


Recommendations Section Template

## VAL11 Chaos Run — Operator Recommendations

Date: <ISO-8601>
Pass: <N>/10
Environment: local-SQLite / port 18996

### Durability
- [PASS/FAIL] All pre-kill plans recovered after restart (VAL11-03, 04, 05)
- Finding: <any plans missing or phase unexpected>
- Recommendation: <if FAIL, check WAL checkpoint settings or SQLite build>

### Client Error Handling
- [PASS/FAIL] CLI exits non-zero with connection error after CP kill (VAL11-02)
- Finding: <actual exit code and error text>
- Recommendation: <if FAIL, verify CLI does not silently swallow dial errors>

### Artifact Corruption
- [PASS/FAIL] CP accepts and stores corrupt-artifact plan; rollback succeeds (VAL11-07, 08)
- Finding: <any unexpected rejection or rollback failure>
- Recommendation: Consider adding artifact URI scheme validation at plan-creation time

### Cascade Recovery
- [PASS/FAIL] All 3 stuck (device-unresponsive proxy) plans recovered via retry (VAL11-06, 09)
- Finding: <any recovery failures or unexpected phases>
- Recommendation: <if failures, check stuck detection threshold vs actual staleness>

### Audit Integrity
- [PASS/FAIL] Rollback events captured across CP kill boundary (VAL11-10)
- Finding: <event count vs expected>
- Recommendation: <if events missing, verify AUTONOMY_AUDIT_DIR is the same path for CLI and CP>