VAL 11 — Fleet Rollout Chaos Test Pack

Purpose

This plan validates that the AutonomyOps control-plane and CLI operator surface behave correctly under representative failure conditions: control-plane process failure, data durability under repeated restart, gate-wait plan survival, artifact corruption proxy, and bulk stuck-plan cascade recovery. It establishes that operator-facing diagnostics and recovery commands remain usable after each chaos injection.


Branch-Specific Rule

| Question | Answer |
| --- | --- |
| Covered by existing lab? | Partially. VAL09 covers the post-failure observable state (stuck detection) but does not inject failure. VAL08 covers concurrent creates under load but not CP kill. VAL10 covers rollback reliability on clean plans. No existing lab kills the CP, verifies durability across restarts, or tests client error handling under partition. |
| Lab to extend | scripts/labs/run_cli_audit_lab.sh — new function run_chaos_val11_lab() |
| Why new function? | A dedicated fresh CP at port 18996 is required. All prior CPs are torn down before this function runs. The chaos kill/restart cycle must not interact with other lab phases. |
| New runner required? | No. Extending run_cli_audit_lab.sh is sufficient. |


Chaos Mechanisms

Network-level injection (iptables) requires root and is outside the lab’s permitted scope. All chaos is injected via process-level signals — the CP process is killed with SIGTERM and restarted against the same SQLite data directory. This directly exercises:

| User scenario | Lab mechanism |
| --- | --- |
| Network partition during rollout | SIGTERM → client gets connection refused or dial error |
| Control-plane failure mid-rollout | SIGTERM after plan creates; verify SQLite data survives |
| Device unresponsive during update | Stuck scan proxy: threshold_seconds=2, sleep 3 |
| Artifact corruption mid-rollout | Plan with invalid artifact_ref; CP accepts; operator rollback tested |
| CP failover during gate wait | Kill CP while plan in published phase; restart; verify phase preserved |

Why Process Kill Is Sufficient

SQLite’s write-ahead log (WAL) is flushed on every successful INSERT before the HTTP 201 is returned. A clean SIGTERM (not SIGKILL) gives the process a chance to close the WAL; even a SIGKILL does not corrupt committed rows because WAL recovery runs at next open. The lab uses SIGTERM to match what a process supervisor (systemd, Kubernetes liveness) would send first.
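As a sketch, the kill/restart cycle might look like the helper below. The CP_CMD and CP_PID variable names and the --data-dir flag are assumptions for illustration, not the real lab's interface; the key points are SIGTERM (not SIGKILL), reaping the old process before restart, and reusing the same data directory.

```shell
# Hypothetical kill/restart helper. CP_CMD, CP_PID, and --data-dir are
# assumptions; substitute the real control-plane binary and flags.
kill_and_restart_cp() {
  kill -TERM "$CP_PID" 2>/dev/null || true   # supervisor-style graceful stop
  wait "$CP_PID" 2>/dev/null || true         # reap before reusing the data dir
  "$CP_CMD" --data-dir "$WORK_DIR/val11" >>"$WORK_DIR/val11-cp.log" 2>&1 &
  CP_PID=$!                                  # track the new process for the next cycle
}
```

Appending to val11-cp.log (rather than truncating) keeps one continuous log across all start/stop cycles, matching the Evidence Files description.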


Safety Guardrails

| Guardrail | Implementation |
| --- | --- |
| Isolated CP | Dedicated port 18996; fresh data dir removed before each run |
| Scoped to local process | No changes to system state; no iptables; no external service calls |
| Cleanup on exit | cleanup() trap kills all background PIDs and removes $WORK_DIR |
| set -euo pipefail maintained | Chaos scenarios wrapped with \|\| true to prevent false set -e exits |
| Workers awaited before kill | wait before chaos kill prevents zombie subshells |
| Operator identity scoped | All CLI calls use AUTONOMY_OPERATOR=val11-chaos-op for audit attribution |
| Evidence captured before kill | Stuck scan and plan queries happen while CP is live |
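The cleanup-on-exit guardrail can be sketched as below; the BG_PIDS array name is an assumption, but the trap-on-EXIT pattern and the || true suppression are the mechanisms the guardrails describe.

```shell
# Sketch of the cleanup guardrail. BG_PIDS as the background-PID array name
# is an assumption; the EXIT trap fires on normal exit and set -e aborts alike.
set -euo pipefail
BG_PIDS=()
cleanup() {
  for pid in "${BG_PIDS[@]:-}"; do
    kill "$pid" 2>/dev/null || true           # stop any CP still running
  done
  [ -n "${WORK_DIR:-}" ] && rm -rf "$WORK_DIR" || true   # remove scratch state
}
trap cleanup EXIT
```

Because the trap also fires when set -e aborts the script, evidence directories never outlive a failed run.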


Claims Under Test

| ID | Claim |
| --- | --- |
| VAL11-C1 | After a CP process kill, client requests exit non-zero with a connection error; no silent data loss |
| VAL11-C2 | Plans committed before a CP kill are present in the store after CP restart; the published phase is preserved across the kill boundary |
| VAL11-C3 | Three successive kill+restart cycles do not corrupt the store; new plan creates succeed after the final restart |
| VAL11-C4 | Operator-visible diagnostics and the rollback execute path function correctly for artifact-corruption proxies and cascade-recovery workflows |


Scenario Matrix

VAL11-01 — Control-Plane Baseline Reachable

Action: GET /v1/health against CP at 18996.
Evidence: val11-health.txt
Pass criterion: HTTP 200, status=ok.


VAL11-02 — CP Kill → Client Error

Action: Kill CP (SIGTERM); attempt autonomy rollback execute --target rollout_plan --resource val11-dur-1 --strategy retry --reason chaos-test.
Evidence: val11-kill-client-error.txt, val11-kill-check.txt
Pass criterion: Command exits non-zero. Output contains a connection-error keyword (connection refused, dial, or connect).
Simulates: Network partition / control-plane failure from the client perspective.
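The VAL11-02 pass check can be sketched as a small predicate; check_connection_error is a hypothetical helper name, and the keyword list mirrors the pass criterion (the exact CLI error text will vary by transport layer).

```shell
# Hypothetical VAL11-02 pass predicate: non-zero exit plus a connection-error
# keyword in the CLI output. Helper name and output format are assumptions.
check_connection_error() {
  local rc="$1" output="$2"
  if [ "$rc" -ne 0 ] && echo "$output" | grep -Eqi 'connection refused|dial|connect'; then
    echo "pass"
  else
    echo "fail"
  fi
}
```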


VAL11-03 — Data Durability Post-Restart

Action: Restart CP against the same data dir; GET /v1/rollouts.
Evidence: val11-durability-list.json, val11-durability-check.txt
Pass criterion: list_count ≥ 10 (all 10 pre-kill plans are present after restart).
Simulates: Recovery after control-plane failure mid-rollout.


VAL11-04 — Rapid Restart Resilience

Action: Kill and restart the CP 3 more times in quick succession (simulating unstable restarts); after the final restart, verify the plan count is unchanged and a new plan create (val11-post-chaos-1) returns HTTP 201.
Evidence: val11-rapid-restart.txt
Pass criterion: list_count_final ≥ list_count_before; new_plan_code=201.
Simulates: Repeated CP failover events.


VAL11-05 — Gate-Wait Plan Survives CP Restart

Action: After the VAL11-03 restart, GET /v1/rollouts/val11-gate-1 and inspect phase.
Evidence: val11-gate-plan.json, val11-gate-check.txt
Pass criterion: Plan val11-gate-1 exists and has phase=published (the gate-wait state is preserved across the CP kill boundary).
Simulates: CP failover during gate wait — the operator’s gate is not lost.


VAL11-06 — Device-Unresponsive Detection (Stuck Proxy)

Action: 3 “device-unresponsive” proxy plans (val11-dev-{1..3}) are created in published phase; sleep 3 seconds (threshold=2s); GET /v1/rollouts/stuck?threshold_seconds=2.
Evidence: val11-scan-stuck.json, val11-stuck-check.txt
Pass criterion: stuck_count ≥ 3; all stuck plans have a non-empty diagnosis.
Note: Other pre-existing plans also become stale during the sleep; the criterion is ≥ 3, not exactly 3.
Simulates: Operator detecting unresponsive devices during an active rollout.


VAL11-07 — Artifact Corruption Proxy: Plan Accepted

Action: POST /v1/rollouts with artifact_ref=corrupt://chaos/artifact@sha256:deadbeefcorrupt and target_lock_fingerprint=blake3:0000000000000000corruption as plan val11-corrupt-1. Then GET /v1/rollouts/val11-corrupt-1.
Evidence: val11-corrupt-plan-create.txt, val11-corrupt-plan.json, val11-corrupt-check.txt
Pass criterion: Create returns HTTP 201; GET returns a rollout payload with plan.metadata.id=val11-corrupt-1.
Rationale: The CP stores artifact metadata without validating the URI scheme or hash format. An operator who notices unusual artifact metadata can still query and act on the plan.
Simulates: Artifact corruption registered at rollout creation time.
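A minimal sketch of the VAL11-07 request body; the artifact_ref and target_lock_fingerprint values are quoted from the scenario, but the "id" field name is an assumption about the create-plan schema.

```shell
# Sketch of the corrupt-artifact create payload for VAL11-07. The "id" field
# name is an assumption; the two corrupt values come from the scenario text.
payload='{
  "id": "val11-corrupt-1",
  "artifact_ref": "corrupt://chaos/artifact@sha256:deadbeefcorrupt",
  "target_lock_fingerprint": "blake3:0000000000000000corruption"
}'
# Validate the JSON locally before POSTing it to /v1/rollouts.
echo "$payload" | python3 -m json.tool > /dev/null && echo "payload_ok"
```

The corruption is deliberately syntactically valid JSON: the chaos lives in the URI scheme and hash content, not in the wire format.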


VAL11-08 — Artifact Corruption Proxy: Rollback Succeeds

Action: autonomy rollback execute --target rollout_plan --resource val11-corrupt-1 --strategy rollback --reason chaos-artifact-corruption.
Evidence: val11-execute-rollback-corrupt.txt, val11-rollback-corrupt-check.txt
Pass criterion: Command exits 0. Output contains outcome=success or an outcome field in JSON.
Simulates: Operator issuing a rollback upon discovering artifact corruption.


VAL11-09 — Bulk Cascade Recovery

Action: For each of val11-dev-{1..3} (still in published / active phase after VAL11-06 stuck detection): run autonomy rollback execute --target rollout_plan --resource val11-dev-N --strategy retry --reason chaos-cascade-recovery.
Evidence: cascade/execute-retry-dev-{1..3}.txt, val11-cascade-check.txt
Pass criterion: All 3 commands exit 0; cascade_ok=3, cascade_fail=0.
Simulates: Operator issuing batch recovery after a cascading device failure during rollout.
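The cascade loop and its ok/fail counters might look like the sketch below; run_retry is a placeholder for the real autonomy rollback execute invocation, so the sketch stays self-contained.

```shell
# Sketch of the VAL11-09 cascade loop. run_retry stands in for:
#   autonomy rollback execute --target rollout_plan --resource "$1" \
#     --strategy retry --reason chaos-cascade-recovery
run_retry() { true; }            # placeholder so the sketch runs standalone
outdir=$(mktemp -d)              # stands in for the cascade/ evidence dir
cascade_ok=0; cascade_fail=0
for n in 1 2 3; do
  if run_retry "val11-dev-$n" > "$outdir/execute-retry-dev-$n.txt" 2>&1; then
    cascade_ok=$((cascade_ok + 1))
  else
    cascade_fail=$((cascade_fail + 1))
  fi
done
echo "cascade_ok=$cascade_ok cascade_fail=$cascade_fail"
```

Counting failures instead of aborting on the first one matches the guardrail of wrapping chaos steps in || true-style handling: the loop must visit all three plans even if one retry fails.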


VAL11-10 — Chaos Audit Integrity

Action: autonomy audit query --event-type rollback.executed --actor val11-chaos-op --start-time <val11_start_time> --output json against the retained audit store.
Evidence: val11-audit-executed.json, val11-audit-check.txt
Pass criterion: At least 1 rollback.executed event with outcome=success is present from the chaos-session operator (val11-chaos-op) for this slice.


Harness Plan

Tools

| Tool | Purpose |
| --- | --- |
| kill (SIGTERM) | Chaos injection — CP process failure simulation |
| curl | Plan creates, stuck scan, plan GET, health checks |
| autonomy rollback execute | Client-side failure detection (VAL11-02), rollback (VAL11-08, 09) |
| python3 -c | JSON field extraction |
| autonomy audit query | Verify actor-scoped, time-scoped audit events from retained store |
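The python3 -c extraction pattern can be sketched as below; the {"rollouts": [...]} response shape is an assumption about the GET /v1/rollouts payload, not a documented schema.

```shell
# Sketch of JSON field extraction with python3 -c, using a sample payload.
# The "rollouts" key is an assumption about the /v1/rollouts response shape.
sample='{"rollouts": [{"id": "val11-dur-1"}, {"id": "val11-dur-2"}]}'
list_count=$(echo "$sample" \
  | python3 -c 'import json,sys; print(len(json.load(sys.stdin)["rollouts"]))')
echo "list_count=$list_count"
```

Using python3 for parsing avoids grep-counting braces, which breaks as soon as the payload nests.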

Control-Plane Setup

| Resource | Value |
| --- | --- |
| Port | 127.0.0.1:18996 |
| Data dir | $WORK_DIR/val11 (removed and recreated before each run; reused across restarts) |
| RBAC | AUTONOMY_RBAC_ENFORCEMENT=0 |
| Operator identity | AUTONOMY_OPERATOR=val11-chaos-op |
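The setup above might translate to the following shell preamble; only the two AUTONOMY_* variables are documented, so CP_ADDR, DATA_DIR, and the rm/mkdir sequence are illustrative assumptions.

```shell
# Sketch of the control-plane setup. The two exports are documented; the
# CP_ADDR/DATA_DIR names and the reset sequence are assumptions.
WORK_DIR="${WORK_DIR:-$(mktemp -d)}"
export AUTONOMY_RBAC_ENFORCEMENT=0          # RBAC disabled for the chaos run
export AUTONOMY_OPERATOR=val11-chaos-op     # audit attribution identity
CP_ADDR=127.0.0.1:18996
DATA_DIR="$WORK_DIR/val11"
rm -rf "$DATA_DIR" && mkdir -p "$DATA_DIR"  # fresh per run; reused across restarts
```

The data dir is reset once per lab run but deliberately NOT reset between kill/restart cycles, since durability across restarts is the property under test.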

Plan Inventory

| Plan ID | Purpose | Pre-kill phase | Expected post-restart phase |
| --- | --- | --- | --- |
| val11-dur-{1..5} | Durability baseline (VAL11-03) | published | published |
| val11-gate-1 | Gate-wait survival (VAL11-05) | published | published |
| val11-corrupt-1 | Artifact corruption proxy (VAL11-07/08) | published | rolled_back |
| val11-dev-{1..3} | Device-unresponsive proxy (VAL11-06/09) | published → active (after retry) | active |
| val11-post-chaos-1 | Created after 3× rapid restart (VAL11-04) | — (created post-chaos) | published |

Execution Order

1. Start CP → VAL11-01 (health)
2. Create plans (dur-1..5, gate-1, corrupt-1, dev-1..3)
3. Sleep 3s (dev plans exceed threshold=2s)
4. VAL11-06 (stuck scan)
5. VAL11-07 (corrupt plan GET)
6. VAL11-08 (rollback execute corrupt)
7. VAL11-09 (cascade retry dev-1..3)
8. Kill CP
9. VAL11-02 (client error)
10. Restart CP
11. VAL11-03 (durability list)
12. VAL11-05 (gate plan phase)
13. Kill + restart CP × 3
14. VAL11-04 (rapid restart: list + new create)
15. VAL11-10 (audit integrity)
16. Kill CP (cleanup)
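The execution order can be sketched as a run_chaos_val11_lab() skeleton. Every helper name below is an assumption (the real lab may inline these steps), and VAL11_CHAOS_SLEEP is a hypothetical knob defaulting to the 3-second stale window.

```shell
# Skeleton of run_chaos_val11_lab() mirroring the execution order.
# All helper names and the VAL11_CHAOS_SLEEP variable are assumptions.
run_chaos_val11_lab() {
  start_cp                            # VAL11-01: baseline health
  create_plans                        # dur-1..5, gate-1, corrupt-1, dev-1..3
  sleep "${VAL11_CHAOS_SLEEP:-3}"     # let dev plans exceed threshold=2s
  scan_stuck                          # VAL11-06
  check_corrupt_plan                  # VAL11-07
  rollback_corrupt                    # VAL11-08
  cascade_retry                       # VAL11-09
  kill_cp
  check_client_error                  # VAL11-02
  start_cp
  check_durability                    # VAL11-03
  check_gate_phase                    # VAL11-05
  for _ in 1 2 3; do kill_cp; start_cp; done
  check_rapid_restart                 # VAL11-04
  check_audit                         # VAL11-10
  kill_cp                             # final cleanup
}
```

Making the sleep a variable (rather than a literal 3) also covers the VAL11-06 failure mode below, where the stale window must be widened to 5 seconds.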

Known Failure Modes

| Mode | Description | Detection / mitigation |
| --- | --- | --- |
| CP start failure | Port 18996 already in use | val11-health.txt shows status=unreachable |
| VAL11-02 false pass | CP not fully dead when the autonomy CLI runs | Add sleep 0.3 after kill, before the CLI invocation |
| VAL11-03 phantom plans | SQLite WAL not recovered on restart | val11-durability-list.json plan count < 5 |
| VAL11-06 stuck_count < 3 | CP start took longer than 3s minus plan-create time | Increase val11_chaos_sleep from 3 to 5 |
| VAL11-08/09 CLI error | RBAC guard firing (unexpected) | Verify AUTONOMY_RBAC_ENFORCEMENT=0 is exported |
| VAL11-10 no events | Audit store not shared with CLI emitter | Check $AUTONOMY_AUDIT_DIR path in the CP invocation |

Out-of-Scope Items

| Item | Reason |
| --- | --- |
| iptables / network-level partition | Requires root; not available in the lab environment |
| Concurrent creates under kill | Inherently racy; replaced by the deterministic rapid-restart resilience test |
| CP crash under write-heavy load (SIGKILL) | WAL recovery is SQLite-internal; not observable at the CLI layer |
| PostgreSQL backend chaos | Requires a live PG instance; separate Gate D soak test |
| Edge-agent device reconnect after partition | Edge agent not running in this lab |
| Automatic rollout stage promotion under chaos | Requires a running batch promoter |


Evidence Files

| File | Description |
| --- | --- |
| val11-cp.log | CP stdout/stderr (across all start/stop cycles; append mode) |
| val11-health.txt | CP baseline health check result |
| val11-plans-created.txt | 10 plan-create HTTP codes (dur-1..5, gate-1, corrupt-1, dev-1..3) |
| val11-scan-stuck.json | Stuck scan result (threshold=2s, after sleep=3s) |
| val11-stuck-check.txt | stuck_count, diagnosis_ok, pass (VAL11-06) |
| val11-corrupt-plan-create.txt | HTTP 201 response for the corrupt-artifact plan |
| val11-corrupt-plan.json | GET /v1/rollouts/val11-corrupt-1 |
| val11-corrupt-check.txt | create_code=201, get_ok=true with plan.metadata.id=val11-corrupt-1, pass (VAL11-07) |
| val11-execute-rollback-corrupt.txt | rollback execute stdout for val11-corrupt-1 |
| val11-rollback-corrupt-check.txt | exit_code=0, pass (VAL11-08) |
| cascade/execute-retry-dev-1.txt | Cascade retry for val11-dev-1 |
| cascade/execute-retry-dev-2.txt | Cascade retry for val11-dev-2 |
| cascade/execute-retry-dev-3.txt | Cascade retry for val11-dev-3 |
| val11-cascade-check.txt | cascade_ok=3, cascade_fail=0, pass (VAL11-09) |
| val11-kill-client-error.txt | CLI output after the CP kill |
| val11-kill-check.txt | exit_code≠0, has_connection_error, pass (VAL11-02) |
| val11-durability-list.json | GET /v1/rollouts after restart |
| val11-durability-check.txt | list_count≥10, pass (VAL11-03) |
| val11-gate-plan.json | GET /v1/rollouts/val11-gate-1 after restart |
| val11-gate-check.txt | phase=published, pass (VAL11-05) |
| val11-rapid-restart.txt | 3× kill/restart log; final list count + new plan create code (VAL11-04) |
| val11-audit-executed.json | rollback.executed events from the retained store |
| val11-audit-check.txt | count≥1 scoped by actor + start_time, pass (VAL11-10) |
| val11-report.txt | Human-readable composite report (10 checks) |
| val11-report.json | Machine-readable JSON report |


Pass/Fail Criteria

Full pass: All 10 checks report PASS.

Minimum acceptable: VAL11-01, VAL11-02, VAL11-03, VAL11-05 pass — the core durability + client-error story.

Key thresholds:

| Check | Threshold |
| --- | --- |
| VAL11-02 (kill error) | Exit code ≠ 0; connection-error keyword in output |
| VAL11-03 (durability) | list_count ≥ 10 after first restart |
| VAL11-04 (rapid restart) | list_count_final ≥ list_count_before; new_plan_code=201 |
| VAL11-05 (gate wait) | phase=published after kill+restart |
| VAL11-06 (unresponsive) | stuck_count ≥ 3; all have non-empty diagnosis |
| VAL11-09 (cascade) | cascade_ok=3, cascade_fail=0 |
| VAL11-10 (audit) | ≥ 1 rollback.executed with outcome=success |


Recommendations Section Template

## VAL11 Chaos Run — Operator Recommendations

Date: <ISO-8601>
Pass: <N>/10
Environment: local-SQLite / port 18996

### Durability
- [PASS/FAIL] All pre-kill plans recovered after restart (VAL11-03, 04, 05)
- Finding: <any plans missing or phase unexpected>
- Recommendation: <if FAIL, check WAL checkpoint settings or SQLite build>

### Client Error Handling
- [PASS/FAIL] CLI exits non-zero with connection error after CP kill (VAL11-02)
- Finding: <actual exit code and error text>
- Recommendation: <if FAIL, verify CLI does not silently swallow dial errors>

### Artifact Corruption
- [PASS/FAIL] CP accepts and stores corrupt-artifact plan; rollback succeeds (VAL11-07, 08)
- Finding: <any unexpected rejection or rollback failure>
- Recommendation: Consider adding artifact URI scheme validation at plan-creation time

### Cascade Recovery
- [PASS/FAIL] All 3 stuck (device-unresponsive proxy) plans recovered via retry (VAL11-06, 09)
- Finding: <any recovery failures or unexpected phases>
- Recommendation: <if failures, check stuck detection threshold vs actual staleness>

### Audit Integrity
- [PASS/FAIL] Rollback events captured across CP kill boundary (VAL11-10)
- Finding: <event count vs expected>
- Recommendation: <if events missing, verify AUTONOMY_AUDIT_DIR is the same path for CLI and CP>