VAL 12 — Fleet Rollout 30-Day Soak¶
Purpose¶
This plan defines the long-duration validation framework that satisfies the workplan Gate D requirement:
“30-day soak, ≥100 concurrent device rollouts, ≥99% rollback success rate”
The framework continuously exercises the control-plane rollout and recovery surfaces over a 30-day window, collecting structured round evidence and daily summaries that aggregate into a final pass/fail report against the four VAL12 claims.
Branch-Specific Rule¶
| Question | Answer |
|---|---|
| Covered by existing lab? | No. All existing labs ( |
| Lab to extend | N/A. Cannot extend the existing single-shot lab cleanly. |
| New runner required? | Yes. Three new scripts: run_soak_val12_setup.sh, run_soak_val12_round.sh, run_soak_val12_report.sh |
| cli-audit-lab.md impact | §8 scope note only (no new slice in |
Claims Under Test¶
| ID | Claim |
|---|---|
| VAL12-C1 | The control-plane sustains ≥ 100 concurrent plan entries in the store for 30 consecutive days without data corruption |
| VAL12-C2 | Rollback/recovery of stuck plans succeeds ≥ 99% of the time over the 30-day window (Gate D rollback success target) |
| VAL12-C3 | P99 plan-create latency remains ≤ 500 ms throughout the soak (no degradation under accumulated store state) |
| VAL12-C4 | CP availability is ≥ 99.9% measured as healthy rounds / total scheduled rounds |
Soak Environment Design¶
Persistent Control-Plane¶
| Resource | Value |
|---|---|
| CP listen | 127.0.0.1:19000 |
| Metrics | http://127.0.0.1:19100/metrics |
| Data dir | $SOAK_DIR/cp-data/ |
| Log | $SOAK_DIR/cp.log |
| PID file | $SOAK_DIR/cp.pid |
| RBAC | |
| Audit dir | $SOAK_DIR/audit-store/ |
| Operator | |
The CP is started once by run_soak_val12_setup.sh and is expected to stay up
for the full 30-day window. If the CP dies, the round script detects the failure,
records health=failed, increments the downtime counter, attempts a restart,
and continues only if the replacement instance becomes healthy.
Directory Structure¶
$SOAK_DIR/
config.env # CP URL, binary path, thresholds (written by setup)
cp.pid # PID of running CP process
cp.log # CP stdout/stderr (append across restarts)
cp-data/ # SQLite WAL + database files
audit-store/ # Persistent audit JSONL files
rounds/
YYYY-MM-DD/
round-HHMMSS/
round-summary.json # Per-round stats
plans-created.txt # HTTP codes for 10 creates
create-times.txt # Latencies from curl -w '%{time_total}'
stuck-scan.json # GET /v1/rollouts/stuck response
recovery.txt # rollback execute outputs (if any)
metrics-scrape.prom # Prometheus scrape
daily/
YYYY-MM-DD-summary.json # Daily aggregation from all rounds that day
checkpoints/
checkpoint-7d.json # 7-day checkpoint report (written by report script)
checkpoint-14d.json # 14-day checkpoint report
final-report.txt # Human-readable 30-day final report
final-report.json # Machine-readable 30-day final report
alerts.log # Append-only log of threshold breaches
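For orientation, a minimal provisioning sketch that would produce this layout is shown below; the config.env keys (CP_URL, METRICS_URL, CP_BIN, threshold values) are illustrative assumptions rather than the real contract of run_soak_val12_setup.sh.

```bash
#!/usr/bin/env bash
# Hypothetical provisioning sketch for the layout above.
# Key names in config.env are assumptions, not the real script's contract.
set -euo pipefail

SOAK_DIR="${1:?usage: run_soak_val12_setup.sh <soak-dir>}"

mkdir -p "$SOAK_DIR"/{cp-data,audit-store,rounds,daily,checkpoints}
touch "$SOAK_DIR/alerts.log"

cat > "$SOAK_DIR/config.env" <<EOF
CP_URL=http://127.0.0.1:19000
METRICS_URL=http://127.0.0.1:19100/metrics
CP_BIN=/path/to/control-plane        # placeholder binary path
P99_THRESHOLD_MS=500
ROLLBACK_RATE_THRESHOLD=0.990
EOF
```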
CP Restart Recovery¶
If run_soak_val12_round.sh finds the CP PID dead, it:
1. Records "health": "failed", "restarted": true in round-summary.json
2. Starts a new CP instance (same data dir, same port)
3. Waits for the health check to pass (up to 30 seconds)
4. Proceeds with the round if the restart succeeds
5. Marks v01_health as FAIL for this round (counts against uptime)
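A minimal sketch of that liveness-check-and-restart path, assuming config.env exports CP_URL and CP_BIN and that the CP binary takes a data-dir flag (both assumptions, not the real script):

```bash
# Liveness check + restart sketch; CP_BIN flags and config.env names are placeholders.
source "$SOAK_DIR/config.env"

health="ok"; restarted=false
if ! kill -0 "$(cat "$SOAK_DIR/cp.pid" 2>/dev/null)" 2>/dev/null; then
  health="failed"; restarted=true
  # Same data dir, same port; append to the existing log across restarts.
  "$CP_BIN" --data-dir "$SOAK_DIR/cp-data" >> "$SOAK_DIR/cp.log" 2>&1 &
  echo $! > "$SOAK_DIR/cp.pid"
  # Give the replacement instance up to 30 s to become healthy.
  for _ in $(seq 1 30); do
    curl -fsS "$CP_URL/v1/health" >/dev/null 2>&1 && break
    sleep 1
  done
  curl -fsS "$CP_URL/v1/health" >/dev/null 2>&1 || exit 1  # restart failed: abort the round
fi
```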
Workload Schedule¶
Round Frequency¶
| Schedule | Recommended value |
|---|---|
| Round interval | Every 30 minutes (*/30 * * * * via cron) |
| Daily report | Once at 01:00 UTC (0 1 * * * via cron) |
| 7-day checkpoint | Day 7, 14, 21 at 02:00 UTC (manual or cron) |
| Final report | Day 30 (manual) |
Per-Round Workload (10 plans × every 30 min = 480 plans/day)¶
Each invocation of run_soak_val12_round.sh performs the following in order:
1. CP health check — GET /v1/health; restart the CP if it is dead
2. 10 plan creates — unique IDs soak-<unix_ts>-{01..10}, timed with curl -w '%{time_total}' (sketched below)
3. Stuck scan — GET /v1/rollouts/stuck?threshold_seconds=3600; record stuck_count
4. Stuck recovery — for each stuck plan: rollback execute strategy=retry --reason soak-auto-recovery; record success/fail
5. Plan count check — page through GET /v1/rollouts?limit=100 until next_cursor="", sum the full corpus, and check >= 100 after round 10
6. Prometheus scrape — curl http://127.0.0.1:19100/metrics; extract cp_http_requests_total and cp_rollout_plans_total
7. Round summary write — write round-summary.json with all round stats
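A sketch of step 2 (the timed creates), assuming a POST /v1/rollouts create endpoint and a minimal JSON payload; the real round script defines the actual plan body, and ROUND_DIR stands in for the current round directory:

```bash
# Step 2 sketch: ten timed plan creates. Endpoint and payload shape are assumed.
ts=$(date +%s)
: > "$ROUND_DIR/plans-created.txt"
: > "$ROUND_DIR/create-times.txt"
for i in $(seq -w 1 10); do
  out=$(curl -sS -o /dev/null -w '%{http_code} %{time_total}' \
        -X POST "$CP_URL/v1/rollouts" \
        -H 'Content-Type: application/json' \
        -d "{\"plan_id\": \"soak-${ts}-${i}\"}")
  echo "${out%% *}" >> "$ROUND_DIR/plans-created.txt"   # HTTP code
  echo "${out##* }" >> "$ROUND_DIR/create-times.txt"    # seconds; converted to ms by the report
done
```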
Plan Accumulation Model¶
Plans are created but not deleted by the round script. After 10 rounds (5 hours at 30-min intervals), the store has ≥ 100 plans, satisfying the concurrent-fleet target. Over 30 days at 48 rounds/day × 10 plans = 14,400 plans total.
| Day | Plans in store (cumulative) | Remarks |
|---|---|---|
| 0.2 (10 rounds) | ≥ 100 | Fleet target reached |
| 1 | ≥ 480 | 1 day × 48 rounds × 10 plans |
| 7 | ≥ 3,360 | 7-day checkpoint |
| 30 | ≥ 14,400 | Final report target |
The SQLite store is effectively append-only for this soak (plans are created but never deleted); database and WAL size growth is a tracked metric.
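Because nothing is pruned, the round script can also capture store growth cheaply; the variable below is illustrative and may simply be folded into round-summary.json:

```bash
# Capture plan-store size (database + WAL) each round so growth can be plotted later.
db_bytes=$(du -sb "$SOAK_DIR/cp-data" | cut -f1)   # illustrative field; GNU du assumed
```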
Crontab Example¶
# In crontab: crontab -e
SOAK_DIR=/path/to/evidence/soak-val12
REPO_ROOT=/path/to/autonomyops
# Every 30 minutes: execute one workload round
*/30 * * * * $REPO_ROOT/scripts/labs/run_soak_val12_round.sh $SOAK_DIR >> $SOAK_DIR/cron.log 2>&1
# Daily at 01:00 UTC: generate daily summary
0 1 * * * $REPO_ROOT/scripts/labs/run_soak_val12_report.sh $SOAK_DIR --type daily >> $SOAK_DIR/cron.log 2>&1
Metrics and Alerts¶
Metrics Captured Per Round¶
| Metric | Source | Target |
|---|---|---|
| create_errors | round plans-created.txt | = 0 |
| p99_ms | create-times.txt (Python nearest-rank) | p99 ≤ 500ms |
| stuck_count | stuck-scan.json | alert if > 0 for 3 consecutive rounds |
| recovery_ok / recovery_fail | recovery.txt exit codes | recovery_fail = 0 |
| health | GET /v1/health | alert if failed; restarted rounds still count as failed for uptime |
| cp_http_requests_total, cp_rollout_plans_total | metrics-scrape.prom | increasing each round |
| plan_count | GET /v1/rollouts count | increasing each round |
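The p99 row relies on a nearest-rank percentile over the ten samples in create-times.txt. A sketch of that computation (shelling out to Python, per the table) follows; note that with only 10 samples the nearest-rank p99 is simply the slowest create.

```bash
# Nearest-rank p99 over create-times.txt (seconds from curl), reported in milliseconds.
p99_ms=$(python3 - "$ROUND_DIR/create-times.txt" <<'PY'
import math, sys
xs = sorted(float(line) for line in open(sys.argv[1]) if line.strip())
rank = math.ceil(0.99 * len(xs))      # nearest-rank: ceil(p * N), 1-indexed
print(round(xs[rank - 1] * 1000))     # with N = 10 this picks the max sample
PY
)
```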
Daily Aggregated Metrics¶
| Metric | Computation | Alert threshold |
|---|---|---|
| | sum(round.create_errors) | > 0 → WARN |
| daily_rollback_rate | recovery_ok / (recovery_ok + recovery_fail), only when recovery attempts > 0 | < 0.990 → ALERT |
| daily_p99_ms | max(round.p99_ms) across all rounds | > 1000 → WARN (2× baseline) |
| | rounds_healthy / total_rounds × 100 | < 99.9 → ALERT |
| | count(round.stuck_count > 0) | > 3 consecutive → ALERT |
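A sketch of the daily aggregation using jq over one day's round summaries; jq availability and the exact round-summary field names (create_errors, recovery_ok, recovery_fail, health, p99_ms) are assumptions consistent with the rest of this plan.

```bash
# Daily aggregation sketch over one day's round summaries (field names assumed).
day=$(date -u +%F)
jq -s '{
    total_rounds:   length,
    rounds_healthy: ([.[] | select(.health == "ok")] | length),
    create_errors:  ([.[].create_errors] | add),
    recovery_ok:    ([.[].recovery_ok] | add),
    recovery_fail:  ([.[].recovery_fail] | add),
    max_p99_ms:     ([.[].p99_ms] | max)
  }
  | .rollback_rate = (if (.recovery_ok + .recovery_fail) > 0
                      then .recovery_ok / (.recovery_ok + .recovery_fail) else null end)
  | .uptime_pct    = (.rounds_healthy / .total_rounds * 100)
' "$SOAK_DIR/rounds/$day"/round-*/round-summary.json \
  > "$SOAK_DIR/daily/$day-summary.json"
```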
Alert Log Format (alerts.log)¶
<ISO-8601> ALERT daily_rollback_rate=0.950 threshold=0.990 day=7
<ISO-8601> WARN daily_p99_ms=850ms threshold=1000ms day=12
<ISO-8601> ALERT cp_health_failed consecutive_failures=3 day=3
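The report script can emit these lines with a small helper along the following lines; SOAK_DAY is a hypothetical stand-in for however the script tracks the current soak day.

```bash
# Append a threshold breach to alerts.log in the format above (SOAK_DAY is hypothetical).
alert() {
  # usage: alert <ALERT|WARN> "key=value [key=value ...]"
  echo "$(date -u +%FT%TZ) $1 $2 day=${SOAK_DAY}" >> "$SOAK_DIR/alerts.log"
}
alert ALERT "daily_rollback_rate=0.950 threshold=0.990"
```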
Evidence Retention Plan¶
Round Evidence¶
Retained for the full 30-day window. Each round directory is ≈ 20 KB (plans-created.txt + create-times.txt + stuck-scan.json + round-summary.json + metrics-scrape.prom). Total estimated storage:
1,440 rounds × 20 KB/round = ~28 MB over 30 days
This is well within typical lab storage budgets. No pruning required during the soak.
Daily Summaries¶
Retained permanently (one JSON file per day, ≈ 2 KB each = 60 KB total for 30 days).
Audit Store¶
The persistent audit JSONL files grow continuously. Expected size:
~25 audit events/round × 1,440 rounds × ~200 bytes/event = ~7 MB
No pruning during the soak. After the soak, autonomy audit prune --older-than 31d
can be used to expire records.
Archiving at End of Soak¶
After final report generation, archive the full soak directory:
tar -czf soak-val12-$(date +%F).tar.gz "$SOAK_DIR"
sha256sum soak-val12-$(date +%F).tar.gz > soak-val12-$(date +%F).tar.gz.sha256
The archive + SHA file constitute the permanent Gate D evidence artifact.
Scenario Matrix (10 Checks)¶
VAL12-01 — Framework Provisioned¶
When: Setup time.
Action: run_soak_val12_setup.sh exits 0; GET /v1/health returns 200.
Evidence: setup-summary.txt
Pass criterion: CP starts within 10s; setup exits 0.
VAL12-02 — Initial Workload Round¶
When: Round 1 (immediately after setup).
Action: First run_soak_val12_round.sh invocation creates 10 plans with 0 errors.
Evidence: rounds/<date>/round-<T>/round-summary.json
Pass criterion: create_errors=0, ok=10.
VAL12-03 — Fleet Target Reached¶
When: After round 10 (≈5 hours into soak).
Action: Page through GET /v1/rollouts?limit=100 and sum all returned plans until
next_cursor is empty; total count must reach >= 100.
Evidence: rounds/<date>/round-<T>/round-summary.json plan_count field.
Pass criterion: plan_count ≥ 100 (workplan GA gate: ≥100 concurrent plans).
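A pagination sketch for this check, assuming the list response carries an items array alongside next_cursor (the array name is an assumption):

```bash
# Count the full plan corpus by following next_cursor until it is empty.
count=0; cursor=""
while :; do
  page=$(curl -sS "$CP_URL/v1/rollouts?limit=100&cursor=${cursor}")
  count=$(( count + $(echo "$page" | jq '.items | length') ))
  cursor=$(echo "$page" | jq -r '.next_cursor // empty')
  [ -z "$cursor" ] && break
done
echo "plan_count=${count}"    # VAL12-03 passes once this reaches >= 100
```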
VAL12-04 — P99 Latency Baseline¶
When: Round 1 (establishes baseline before store accumulation).
Action: Compute p99 of 10 create latencies from round 1 create-times.txt.
Evidence: rounds/<date>/round-<T>/round-summary.json p99_ms field.
Pass criterion: p99_ms ≤ 500 (workplan target: plan creation < 500ms p99).
VAL12-05 — Stuck Detection Operational¶
When: Every round.
Action: GET /v1/rollouts/stuck?threshold_seconds=3600 returns HTTP 200 and a
valid stuck-scan JSON payload.
Evidence: rounds/<date>/round-<T>/stuck-scan.json
Pass criterion: scan_http_code="200" and scan_ok=true in every round. A
stuck_count > 0 is not a failure; it shows the operator surface is working.
A scan transport error, non-200 status, or invalid JSON payload is a failure.
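A sketch of the scan-and-recover step; the stuck-scan response shape and the argument order of the autonomy rollback invocation (beyond the fragment quoted in the workload list) are assumptions:

```bash
# Stuck scan, then per-plan auto-recovery. Response shape {stuck: [{plan_id: ...}]} is assumed.
scan_http_code=$(curl -sS -o "$ROUND_DIR/stuck-scan.json" -w '%{http_code}' \
  "$CP_URL/v1/rollouts/stuck?threshold_seconds=3600")
scan_ok=false
if [ "$scan_http_code" = "200" ] && jq -e . "$ROUND_DIR/stuck-scan.json" >/dev/null 2>&1; then
  scan_ok=true
fi
stuck_count=$(jq '[.stuck[]?] | length' "$ROUND_DIR/stuck-scan.json" 2>/dev/null || echo 0)

for plan_id in $(jq -r '.stuck[]? | .plan_id' "$ROUND_DIR/stuck-scan.json" 2>/dev/null); do
  # Argument order is an assumption; only the quoted strategy/reason flags come from this plan.
  if autonomy rollback execute "$plan_id" strategy=retry --reason soak-auto-recovery \
       >> "$ROUND_DIR/recovery.txt" 2>&1; then
    echo "ok $plan_id"   >> "$ROUND_DIR/recovery.txt"
  else
    echo "fail $plan_id" >> "$ROUND_DIR/recovery.txt"
  fi
done
```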
VAL12-06 — 7-Day Rollback Rate ≥ 99%¶
When: Day 7 checkpoint.
Action: Aggregate recovery results from all rounds in days 1–7.
Evidence: checkpoints/checkpoint-7d.json
Pass criterion: total_recovery_attempts > 0 and rollback_rate_7d ≥ 0.990
(equivalent to ≤ 1 failure per 100 recoveries).
VAL12-07 — P99 Latency Maintained at 7 Days¶
When: Day 7 checkpoint.
Action: Compute p99 across all rounds in days 1–7 from daily summaries.
Evidence: checkpoints/checkpoint-7d.json
Pass criterion: p99_ms_7d ≤ 500 — no degradation under accumulated store state.
VAL12-08 — CP Availability ≥ 99.9%¶
When: 30-day final report.
Action: healthy_rounds / total_rounds × 100, where only rounds with
health="ok" count as healthy; restarted rounds still count against uptime.
Evidence: final-report.json uptime_pct field.
Pass criterion: uptime_pct ≥ 99.9 (at most 1 failed round out of 1,440, i.e. roughly 43 minutes of downtime budget over 30 days).
VAL12-09 — Total Plans Created ≥ 4,000¶
When: 30-day final report.
Action: Sum create_ok across all round summaries.
Evidence: final-report.json total_creates field.
Pass criterion: total_creates ≥ 4000 (conservative: only ~28% of the 14,400 scheduled
creates need to succeed; actual expected ≥ 14,000).
VAL12-10 — 30-Day Aggregate Rollback Rate ≥ 99%¶
When: 30-day final report.
Action: total_recovery_ok / (total_recovery_ok + total_recovery_fail).
Evidence: final-report.json agg_rollback_rate and total_recovery_attempts
fields.
Pass criterion: total_recovery_attempts > 0 and agg_rollback_rate ≥ 0.990
(workplan Gate D target).
Pass/Fail Criteria¶
Full pass: All 10 checks report PASS at their respective verification times.
Minimum acceptable (Gate D): VAL12-03, VAL12-10 pass — fleet target reached and 30-day rollback rate ≥ 99%.
Key thresholds:
| Check | Threshold | Verification time |
|---|---|---|
| VAL12-01 | CP starts within 10s | Setup |
| VAL12-02 | create_errors=0 for round 1 | Round 1 |
| VAL12-03 | paginated plan_count ≥ 100 | Round 10 (~5h) |
| VAL12-04 | p99_ms ≤ 500 | Round 1 |
| VAL12-05 | scan_ok=true every round | Continuous |
| VAL12-06 | recovery_attempts_7d > 0 and rollback_rate_7d ≥ 0.990 | Day 7 |
| VAL12-07 | p99_ms_7d ≤ 500 | Day 7 |
| VAL12-08 | uptime_pct ≥ 99.9 | Day 30 |
| VAL12-09 | total_creates ≥ 4000 | Day 30 |
| VAL12-10 | total_recovery_attempts > 0 and agg_rollback_rate ≥ 0.990 | Day 30 |
Final Report Template¶
# VAL 12 — Fleet Rollout 30-Day Soak Final Report
Soak start: <ISO-8601>
Soak end: <ISO-8601>
Duration days: 30
Total rounds: <N> (scheduled: 1440)
Healthy rounds: <N>
Failed rounds: <N>
CP restart count: <N>
## Fleet Coverage
Total plans created: <N> (target ≥ 4,000 | expected ~14,400)
Max concurrent plans: <N> (target ≥ 100)
Plan store final size: <N>
## Rollback Reliability
Total recovery attempts: <N>
Rollback sampled: <true|false>
Total recovery successes: <N>
Total recovery failures: <N>
Aggregate rollback rate: <X.XXX> (target ≥ 0.990)
7-day window rate: <X.XXX>
30-day window rate: <X.XXX> ← Gate D target
## Latency (create P99 across all rounds)
Round 1 p99_ms: <N> (baseline)
Day 7 p99_ms: <N>
Day 30 p99_ms: <N>
All-time max p99_ms: <N> (target ≤ 500ms)
All-time 95th pct of p99: <N>
## Availability
Total rounds: <N>
Healthy rounds: <N>
Uptime %: <X.X> (target ≥ 99.9%)
Total CP restarts: <N>
Longest downtime streak: <N> rounds
## Stuck Detection
Total stuck plans detected: <N>
Total recoveries attempted: <N>
Auto-recovery success rate: <X.XXX>
## Alerts Issued
<copy from alerts.log — or "none">
## VAL12 Check Results
VAL12-01 framework_provisioned: PASS / FAIL
VAL12-02 initial_round_success: PASS / FAIL
VAL12-03 fleet_target_reached: PASS / FAIL (plan_count=<N>, target=100)
VAL12-04 latency_baseline: PASS / FAIL (p99=<N>ms, target=500ms)
VAL12-05 stuck_detection_ops: PASS / FAIL (scan_failures=<N>)
VAL12-06 rollback_rate_7d: PASS / FAIL (attempts=<N>, rate=<X>, target=0.990)
VAL12-07 latency_maintained_7d: PASS / FAIL (p99_7d=<N>ms, target=500ms)
VAL12-08 cp_availability: PASS / FAIL (uptime=<X.X>%, target=99.9%)
VAL12-09 total_creates: PASS / FAIL (creates=<N>, target=4000)
VAL12-10 agg_rollback_rate: PASS / FAIL (attempts=<N>, rate=<X.XXX>, target=0.990)
Overall: PASS / FAIL (<N>/10 checks)
## Gate D Assessment
<PASS: All Gate D criteria met — fleet soak demonstrates ≥99% rollback reliability
across ≥100 concurrent plans over 30 days.>
OR
<FAIL: Gate D criteria not met — see check results above. Required: VAL12-03 + VAL12-10.>
## Artifact
Archive: soak-val12-<YYYY-MM-DD>.tar.gz
SHA-256: <hash>
Known Failure Modes¶
| Mode | Description | Detectable by |
|---|---|---|
| SQLite lock contention | Under high round frequency, WAL may stall on writes | create_errors > 0 or latency spikes in create-times.txt |
| P99 degradation | Accumulated plans slow plan creates over time | rising p99_ms in daily summaries (daily_p99_ms WARN) |
| CP OOM / crash | Long-running CP accumulates memory state | failed health checks and CP restart count |
| Port conflict on restart | 19000 re-used by another process after CP death | restart never passes health check; errors in cp.log |
| Stuck scan always empty | Soak plans never exceed threshold (72h set in spec) | stuck_count = 0 in every round (recovery path never exercised) |
Out-of-Scope Items¶
| Item | Reason |
|---|---|
| Edge agent behavior | Edge agents not running in soak; device simulation deferred |
| PostgreSQL backend | Requires live PG; SQLite soak is Gate C; PG soak is Gate D+ |
| Network partition injection | Requires iptables/root; see VAL11 for process-level chaos |
| Multi-region soak | Out of scope for v1 |
| Automated rollout promotion | Batch promoter not running; plans stay in published phase |