VAL 12 — Fleet Rollout 30-Day Soak

Purpose

This plan defines the long-duration validation framework that satisfies the workplan Gate D requirement:

“30-day soak, ≥100 concurrent device rollouts, ≥99% rollback success rate”

The framework continuously exercises the control-plane rollout and recovery surfaces over a 30-day window, collecting structured round evidence and daily summaries that aggregate into a final pass/fail report against the four VAL12 claims.


Branch-Specific Rule

| Question | Answer |
| --- | --- |
| Covered by existing lab? | No. All existing labs (run_cli_audit_lab.sh, run_edge_deadletter_lab.sh) are synchronous, single-shot evidence collectors. A 30-day soak requires persistent infrastructure, scheduled round execution, rolling evidence windows, and final aggregation. |
| Lab to extend | N/A. Cannot extend the existing single-shot lab cleanly. |
| New runner required? | Yes. Three new scripts: run_soak_val12_setup.sh, run_soak_val12_round.sh, run_soak_val12_report.sh. Concrete reason: the existing runner has no persistent state, no round directory model, no daily aggregation, and no 30-day time horizon. |
| cli-audit-lab.md impact | §8 scope note only (no new slice in run_cli_audit_lab.sh). |


Claims Under Test

| ID | Claim |
| --- | --- |
| VAL12-C1 | The control-plane sustains ≥ 100 concurrent plan entries in the store for 30 consecutive days without data corruption |
| VAL12-C2 | rollback execute success rate across all executed recoveries is ≥ 99% over the 30-day window, with at least one sampled recovery in the measured window |
| VAL12-C3 | P99 plan-create latency remains ≤ 500 ms throughout the soak (no degradation under accumulated store state) |
| VAL12-C4 | CP availability is ≥ 99.9%, measured as healthy rounds / total scheduled rounds |


Soak Environment Design

Persistent Control-Plane

| Resource | Value |
| --- | --- |
| CP listen | 127.0.0.1:19000 |
| Metrics | 127.0.0.1:19100 |
| Data dir | $SOAK_DIR/cp-data (persists across rounds and restarts) |
| Log | $SOAK_DIR/cp.log (append mode across restarts) |
| PID file | $SOAK_DIR/cp.pid |
| RBAC | AUTONOMY_RBAC_ENFORCEMENT=0 |
| Audit dir | $SOAK_DIR/audit-store (persistent, shared with round CLI calls) |
| Operator | AUTONOMY_OPERATOR=soak-operator |

The CP is started once by run_soak_val12_setup.sh and is expected to stay up for the full 30-day window. If the CP dies, the round script detects the failure, records health=failed, increments the downtime counter, attempts a restart, and continues only if the replacement instance becomes healthy.
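
A minimal sketch of that startup flow follows. The CP launch flags are repo-specific and omitted; only the paths, ports, and environment variables come from the table above, and everything else is an assumption.

```bash
#!/usr/bin/env bash
# Sketch of run_soak_val12_setup.sh. The CP launch command is a placeholder;
# paths, ports, and env vars follow the Persistent Control-Plane table.
set -euo pipefail
SOAK_DIR="$1"
CP_BINARY="${CP_BINARY:?set to the control-plane binary path}"

mkdir -p "$SOAK_DIR"/{cp-data,audit-store,rounds,daily,checkpoints}
printf 'CP_URL=%s\nMETRICS_URL=%s\n' \
  "http://127.0.0.1:19000" "http://127.0.0.1:19100" > "$SOAK_DIR/config.env"

# Start the CP once; it is expected to survive the full 30-day window.
AUTONOMY_RBAC_ENFORCEMENT=0 AUTONOMY_OPERATOR=soak-operator \
  "$CP_BINARY" >> "$SOAK_DIR/cp.log" 2>&1 &   # launch flags omitted: repo-specific
echo $! > "$SOAK_DIR/cp.pid"

# VAL12-01 pass criterion: healthy within 10 s.
for _ in $(seq 1 10); do
  sleep 1
  curl -fsS http://127.0.0.1:19000/v1/health > /dev/null && exit 0
done
echo "CP failed to become healthy within 10 s" >&2
exit 1
```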

Directory Structure

$SOAK_DIR/
  config.env                 # CP URL, binary path, thresholds (written by setup)
  cp.pid                     # PID of running CP process
  cp.log                     # CP stdout/stderr (append across restarts)
  cp-data/                   # SQLite WAL + database files
  audit-store/               # Persistent audit JSONL files
  rounds/
    YYYY-MM-DD/
      round-HHMMSS/
        round-summary.json   # Per-round stats
        plans-created.txt    # HTTP codes for 10 creates
        create-times.txt     # Latencies from curl -w '%{time_total}'
        stuck-scan.json      # GET /v1/rollouts/stuck response
        recovery.txt         # rollback execute outputs (if any)
        metrics-scrape.prom  # Prometheus scrape
  daily/
    YYYY-MM-DD-summary.json  # Daily aggregation from all rounds that day
  checkpoints/
    checkpoint-7d.json       # 7-day checkpoint report (written by report script)
    checkpoint-14d.json      # 14-day checkpoint report
  final-report.txt           # Human-readable 30-day final report
  final-report.json          # Machine-readable 30-day final report
  alerts.log                 # Append-only log of threshold breaches

CP Restart Recovery

If run_soak_val12_round.sh finds the CP PID dead, it performs the sequence below (a sketch follows the list):

  1. Records "health": "failed", "restarted": true in round-summary.json

  2. Starts a new CP instance (same data dir, same port)

  3. Waits for health check to pass (up to 30 seconds)

  4. Proceeds with the round if the restart succeeds

  5. Marks v01_health as FAIL for this round (counts against uptime)
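
A sketch of that branch, under the assumption of a hypothetical start_cp helper that re-launches the CP exactly as the setup script did:

```bash
# Sketch of the restart branch in run_soak_val12_round.sh; start_cp is a
# hypothetical helper that re-launches the CP as the setup script did.
if ! kill -0 "$(cat "$SOAK_DIR/cp.pid" 2>/dev/null)" 2>/dev/null; then
  HEALTH="failed"; RESTARTED=true        # steps 1 and 5: counts against uptime
  start_cp                               # step 2: same data dir, same port
  ok=false
  for _ in $(seq 1 30); do               # step 3: wait up to 30 s
    sleep 1
    curl -fsS "$CP_URL/v1/health" > /dev/null && { ok=true; break; }
  done
  "$ok" || exit 1                        # step 4: proceed only if healthy again
fi
```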


Workload Schedule

Round Frequency

| Schedule | Recommended value |
| --- | --- |
| Round interval | Every 30 minutes (*/30 * * * * crontab) |
| Daily report | Once at 01:00 UTC (0 1 * * *) |
| Checkpoint reports | Days 7, 14, 21 at 02:00 UTC (manual or cron) |
| Final report | Day 30 (manual) |

Per-Round Workload (10 plans × 48 rounds/day = 480 plans/day)

Each invocation of run_soak_val12_round.sh performs the following in order (a condensed sketch of the key steps follows the list):

  1. CP health check — GET /v1/health; restart CP if dead

  2. 10 plan creates — unique IDs soak-<unix_ts>-{01..10}, timed with curl -w '%{time_total}'

  3. Stuck scan — GET /v1/rollouts/stuck?threshold_seconds=3600; record stuck_count

  4. Stuck recovery — for each stuck plan: rollback execute strategy=retry --reason soak-auto-recovery; record success/fail

  5. Plan count check — page through GET /v1/rollouts?limit=100 until next_cursor="", sum the full corpus, and check >= 100 after round 10

  6. Prometheus scrape — curl http://127.0.0.1:19100/metrics; extract cp_http_requests_total and cp_rollout_plans_total

  7. Round summary write — write round-summary.json with all round stats
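
A condensed sketch of steps 2, 3, and 6. The plan-create endpoint and request body are assumptions (this plan names only the health, list, stuck, and metrics surfaces); ports and query strings are taken from the sections above.

```bash
# Steps 2, 3, 6 of run_soak_val12_round.sh (condensed sketch). The POST
# endpoint and body are assumptions; ports/queries come from this plan.
TS=$(date +%s)
for i in $(seq -w 1 10); do             # step 2: 10 timed plan creates
  out=$(curl -s -o /dev/null -w '%{http_code} %{time_total}' \
    -X POST "$CP_URL/v1/rollouts" -H 'Content-Type: application/json' \
    -d "{\"plan_id\":\"soak-$TS-$i\"}")
  echo "${out%% *}" >> "$ROUND_DIR/plans-created.txt"   # HTTP code
  echo "${out##* }" >> "$ROUND_DIR/create-times.txt"    # seconds
done

# Step 3: stuck scan, recorded verbatim as stuck-scan.json evidence.
curl -s "$CP_URL/v1/rollouts/stuck?threshold_seconds=3600" \
  > "$ROUND_DIR/stuck-scan.json"

# Step 6: Prometheus scrape; keep the two counters tracked per round.
curl -s "http://127.0.0.1:19100/metrics" > "$ROUND_DIR/metrics-scrape.prom"
grep -E '^cp_(http_requests|rollout_plans)_total' \
  "$ROUND_DIR/metrics-scrape.prom" || true
```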

Plan Accumulation Model

Plans are created but not deleted by the round script. After 10 rounds (5 hours at 30-min intervals), the store has ≥ 100 plans, satisfying the concurrent-fleet target. Over 30 days at 48 rounds/day × 10 plans = 14,400 plans total.

| Day | Plans in store (cumulative) | Remarks |
| --- | --- | --- |
| 0.2 (10 rounds) | ≥ 100 | Fleet target reached |
| 1 | ≥ 480 | 1 day × 48 rounds × 10 plans |
| 7 | ≥ 3,360 | 7-day checkpoint |
| 30 | ≥ 14,400 | Final report target |

The plan store is append-only for this soak (plans are never deleted), so store size growth is a tracked metric.

Crontab Example

# In crontab: crontab -e
SOAK_DIR=/path/to/evidence/soak-val12
REPO_ROOT=/path/to/autonomyops

# Every 30 minutes: execute one workload round
*/30 * * * * $REPO_ROOT/scripts/labs/run_soak_val12_round.sh $SOAK_DIR >> $SOAK_DIR/cron.log 2>&1

# Daily at 01:00 UTC: generate daily summary
0 1 * * * $REPO_ROOT/scripts/labs/run_soak_val12_report.sh $SOAK_DIR --type daily >> $SOAK_DIR/cron.log 2>&1

Metrics and Alerts

Metrics Captured Per Round

| Metric | Source | Target |
| --- | --- | --- |
| create_error_rate | round plans-created.txt | = 0 |
| p50_ms, p95_ms, p99_ms | create-times.txt (Python nearest-rank) | p99 ≤ 500 ms |
| stuck_count | stuck-scan.json | alert if > 0 for 3 consecutive rounds |
| recovery_ok, recovery_fail | recovery.txt exit codes | recovery_fail = 0 |
| health_status | GET /v1/health | alert if failed; restarted rounds still count as failed for uptime |
| cp_http_requests_total | metrics-scrape.prom | increasing each round |
| plan_store_size | GET /v1/rollouts count | increasing each round |
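
The "Python nearest-rank" computation referenced above could look like this sketch, which converts curl's seconds into milliseconds and assumes one latency per line in create-times.txt:

```bash
# Nearest-rank percentiles over create-times.txt (curl reports seconds).
python3 - "$ROUND_DIR/create-times.txt" <<'PY'
import math, sys
xs = sorted(float(line) * 1000 for line in open(sys.argv[1]) if line.strip())
def pct(p):
    # nearest-rank: the ceil(p/100 * N)-th smallest sample, 1-indexed
    return xs[max(0, math.ceil(p / 100 * len(xs)) - 1)]
print(f"p50_ms={pct(50):.1f} p95_ms={pct(95):.1f} p99_ms={pct(99):.1f}")
PY
```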

Daily Aggregated Metrics

| Metric | Computation | Alert threshold |
| --- | --- | --- |
| daily_create_errors | sum(round.create_errors) | > 0 → WARN |
| daily_rollback_rate | recovery_ok / (recovery_ok + recovery_fail), only when recovery attempts > 0 | < 0.990 → ALERT |
| daily_p99_ms | max(round.p99_ms) across all rounds | > 1000 → WARN (2× baseline) |
| daily_uptime_pct | rounds_healthy / total_rounds × 100 | < 99.9 → ALERT |
| daily_stuck_rounds | count(round.stuck_count > 0) | > 3 consecutive → ALERT |
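
A sketch of the daily rollup, assuming each round-summary.json carries the per-round fields named in the tables (any field name not already quoted in this plan is an assumption):

```bash
# Daily rollup sketch for run_soak_val12_report.sh --type daily. Assumes
# round-summary.json fields create_errors, p99_ms, health, recovery_ok,
# recovery_fail, matching the tables above.
DAY=$(date -u +%F)
jq -s --arg day "$DAY" '{
  day: $day,
  daily_create_errors: (map(.create_errors) | add),
  daily_p99_ms:        (map(.p99_ms) | max),
  daily_uptime_pct:    (100 * (map(select(.health == "ok")) | length) / length),
  recovery_ok:         (map(.recovery_ok) | add),
  recovery_fail:       (map(.recovery_fail) | add)
}' "$SOAK_DIR/rounds/$DAY"/*/round-summary.json \
  > "$SOAK_DIR/daily/$DAY-summary.json"
```

daily_rollback_rate is then derived from the summed recovery_ok / recovery_fail pair only when at least one recovery was attempted, per the table above.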

Alert Log Format (alerts.log)

<ISO-8601>  ALERT  daily_rollback_rate=0.950  threshold=0.990  day=7
<ISO-8601>  WARN   daily_p99_ms=850ms  threshold=1000ms  day=12
<ISO-8601>  ALERT  cp_health_failed  consecutive_failures=3  day=3
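
An illustrative helper for emitting lines in this format (alert() is not an existing script function):

```bash
# Illustrative helper that appends breaches in the alerts.log format above.
alert() {  # usage: alert ALERT|WARN "key=value  threshold=...  day=N"
  printf '%s  %-5s  %s\n' "$(date -u +%FT%TZ)" "$1" "$2" >> "$SOAK_DIR/alerts.log"
}
alert ALERT "daily_rollback_rate=0.950  threshold=0.990  day=7"
```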

Evidence Retention Plan

Round Evidence

Retained for the full 30-day window. Each round directory is ≈ 20 KB (plans-created.txt + create-times.txt + stuck-scan.json + round-summary.json + metrics-scrape.prom). Total estimated storage:

1,440 rounds × 20 KB/round = ~28 MB over 30 days

This is well within typical lab storage budgets. No pruning required during the soak.

Daily Summaries

Retained permanently (one JSON file per day, ≈ 2 KB each = 60 KB total for 30 days).

Audit Store

The persistent audit JSONL files grow continuously. Expected size:

~25 audit events/round × 1,440 rounds × ~200 bytes/event = ~7 MB

No pruning during the soak. After the soak, autonomy audit prune --older-than 31d can be used to expire records.

Archiving at End of Soak

After final report generation, archive the full soak directory:

tar -czf soak-val12-$(date +%F).tar.gz "$SOAK_DIR"
sha256sum soak-val12-$(date +%F).tar.gz > soak-val12-$(date +%F).tar.gz.sha256

The archive + SHA file constitute the permanent Gate D evidence artifact.


Scenario Matrix (10 Checks)

VAL12-01 — Framework Provisioned

When: Setup time. Action: run_soak_val12_setup.sh exits 0; GET /v1/health returns 200. Evidence: setup-summary.txt. Pass criterion: CP starts within 10s; setup exits 0.


VAL12-02 — Initial Workload Round

When: Round 1 (immediately after setup). Action: First run_soak_val12_round.sh invocation creates 10 plans with 0 errors. Evidence: rounds/<date>/round-<T>/round-summary.json. Pass criterion: create_errors=0, ok=10.


VAL12-03 — Fleet Target Reached

When: After round 10 (≈5 hours into soak). Action: Page through GET /v1/rollouts?limit=100 and sum all returned plans until next_cursor is empty; the total count must reach ≥ 100. Evidence: plan_count field of rounds/<date>/round-<T>/round-summary.json. Pass criterion: plan_count ≥ 100 (workplan GA gate: ≥100 concurrent plans).
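
A sketch of that pagination loop; limit=100 and the empty next_cursor terminator come from this check, while the items field and cursor parameter names are assumptions:

```bash
# Pagination loop for VAL12-03. "items" and the cursor query parameter are
# assumptions; limit=100 and the empty next_cursor terminator are from VAL12-03.
count=0; cursor=""
while :; do
  page=$(curl -fsS "$CP_URL/v1/rollouts?limit=100&cursor=$cursor")
  count=$(( count + $(jq '.items | length' <<< "$page") ))
  cursor=$(jq -r '.next_cursor' <<< "$page")
  [ -z "$cursor" ] && break
done
echo "plan_count=$count"    # VAL12-03 passes when this reaches >= 100
```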


VAL12-04 — P99 Latency Baseline

When: Round 1 (establishes baseline before store accumulation). Action: Compute p99 of 10 create latencies from round 1 create-times.txt. Evidence: p99_ms field of rounds/<date>/round-<T>/round-summary.json. Pass criterion: p99_ms ≤ 500 (workplan target: plan creation < 500ms p99).


VAL12-05 — Stuck Detection Operational

When: Every round. Action: GET /v1/rollouts/stuck?threshold_seconds=3600 returns HTTP 200 and a valid stuck-scan JSON payload. Evidence: rounds/<date>/round-<T>/stuck-scan.json. Pass criterion: scan_http_code="200" and scan_ok=true in every round. A stuck_count > 0 is not a failure; it shows the operator surface is working. A scan transport error, non-200 status, or invalid JSON payload is a failure.


VAL12-06 — 7-Day Rollback Rate ≥ 99%

When: Day 7 checkpoint. Action: Aggregate recovery results from all rounds in days 1–7. Evidence: checkpoints/checkpoint-7d.json. Pass criterion: total_recovery_attempts > 0 and rollback_rate_7d ≥ 0.990 (equivalent to ≤ 1 failure per 100 recoveries).


VAL12-07 — P99 Latency Maintained at 7 Days

When: Day 7 checkpoint. Action: Compute p99 across all rounds in days 1–7 from daily summaries. Evidence: checkpoints/checkpoint-7d.json. Pass criterion: p99_ms_7d ≤ 500 — no degradation under accumulated store state.


VAL12-08 — CP Availability ≥ 99.9%

When: 30-day final report. Action: healthy_rounds / total_rounds × 100, where only rounds with health="ok" count as healthy; restarted rounds still count against uptime. Evidence: final-report.json uptime_pct field. Pass criterion: uptime_pct ≥ 99.9 (at most 1 failed round out of 1,440, since 0.1% of 1,440 rounds is 1.44).


VAL12-09 — Total Plans Created ≥ 4,000

When: 30-day final report. Action: Sum create_ok across all round summaries. Evidence: final-report.json total_creates field. Pass criterion: total_creates ≥ 4000 (conservative: 1,440 rounds × 10 plans = 14,400 expected creates, so the target requires only ≈28% of them).


VAL12-10 — 30-Day Aggregate Rollback Rate ≥ 99%

When: 30-day final report. Action: total_recovery_ok / (total_recovery_ok + total_recovery_fail). Evidence: final-report.json agg_rollback_rate and total_recovery_attempts fields. Pass criterion: total_recovery_attempts > 0 and agg_rollback_rate ≥ 0.990 (workplan Gate D target).
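
A sketch of the 30-day aggregate, globbing every round summary (field names follow the daily-metrics table):

```bash
# 30-day aggregate for VAL12-10 across every round-summary.json.
jq -s '
  (map(.recovery_ok)   | add // 0) as $ok   |
  (map(.recovery_fail) | add // 0) as $fail |
  { total_recovery_attempts: ($ok + $fail),
    agg_rollback_rate: (if ($ok + $fail) > 0
                        then ($ok / ($ok + $fail)) else null end) }
' "$SOAK_DIR"/rounds/*/*/round-summary.json
```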


Pass/Fail Criteria

Full pass: All 10 checks report PASS at their respective verification times.

Minimum acceptable (Gate D): VAL12-03, VAL12-10 pass — fleet target reached and 30-day rollback rate ≥ 99%.

Key thresholds:

| Check | Threshold | Verification time |
| --- | --- | --- |
| VAL12-01 | CP starts within 10s | Setup |
| VAL12-02 | create_errors=0 for round 1 | Round 1 |
| VAL12-03 | paginated plan_count ≥ 100 | Round 10 (~5h) |
| VAL12-04 | p99_ms ≤ 500 | Round 1 |
| VAL12-05 | scan_ok=true every round | Continuous |
| VAL12-06 | recovery_attempts_7d > 0 and rollback_rate_7d ≥ 0.990 | Day 7 |
| VAL12-07 | p99_ms_7d ≤ 500 | Day 7 |
| VAL12-08 | uptime_pct ≥ 99.9 | Day 30 |
| VAL12-09 | total_creates ≥ 4000 | Day 30 |
| VAL12-10 | total_recovery_attempts > 0 and agg_rollback_rate ≥ 0.990 | Day 30 |


Final Report Template

# VAL 12 — Fleet Rollout 30-Day Soak Final Report

Soak start:        <ISO-8601>
Soak end:          <ISO-8601>
Duration days:     30
Total rounds:      <N>   (scheduled: 1440)
Healthy rounds:    <N>
Failed rounds:     <N>
CP restart count:  <N>

## Fleet Coverage
Total plans created:    <N>   (target ≥ 4,000 | expected ~14,400)
Max concurrent plans:   <N>   (target ≥ 100)
Plan store final size:  <N>

## Rollback Reliability
Total recovery attempts:  <N>
Rollback sampled:         <true|false>
Total recovery successes: <N>
Total recovery failures:  <N>
Aggregate rollback rate:  <X.XXX>  (target ≥ 0.990)
7-day window rate:        <X.XXX>
30-day window rate:       <X.XXX>  ← Gate D target

## Latency (create P99 across all rounds)
Round 1 p99_ms:           <N>   (baseline)
Day 7 p99_ms:             <N>
Day 30 p99_ms:            <N>
All-time max p99_ms:      <N>   (target ≤ 500ms)
All-time 95th pct of p99: <N>

## Availability
Total rounds:             <N>
Healthy rounds:           <N>
Uptime %:                 <X.X>  (target ≥ 99.9%)
Total CP restarts:        <N>
Longest downtime streak:  <N> rounds

## Stuck Detection
Total stuck plans detected: <N>
Total recoveries attempted: <N>
Auto-recovery success rate: <X.XXX>

## Alerts Issued
<copy from alerts.log — or "none">

## VAL12 Check Results
VAL12-01 framework_provisioned:      PASS / FAIL
VAL12-02 initial_round_success:      PASS / FAIL
VAL12-03 fleet_target_reached:       PASS / FAIL  (plan_count=<N>, target=100)
VAL12-04 latency_baseline:           PASS / FAIL  (p99=<N>ms, target=500ms)
VAL12-05 stuck_detection_ops:        PASS / FAIL  (scan_failures=<N>)
VAL12-06 rollback_rate_7d:           PASS / FAIL  (attempts=<N>, rate=<X>, target=0.990)
VAL12-07 latency_maintained_7d:      PASS / FAIL  (p99_7d=<N>ms, target=500ms)
VAL12-08 cp_availability:            PASS / FAIL  (uptime=<X.X>%, target=99.9%)
VAL12-09 total_creates:              PASS / FAIL  (creates=<N>, target=4000)
VAL12-10 agg_rollback_rate:          PASS / FAIL  (attempts=<N>, rate=<X.XXX>, target=0.990)

Overall: PASS / FAIL (<N>/10 checks)

## Gate D Assessment
<PASS: All Gate D criteria met — fleet soak demonstrates ≥99% rollback reliability
across ≥100 concurrent plans over 30 days.>
OR
<FAIL: Gate D criteria not met — see check results above. Required: VAL12-03 + VAL12-10.>

## Artifact
Archive: soak-val12-<YYYY-MM-DD>.tar.gz
SHA-256: <hash>

Known Failure Modes

| Mode | Description | Detectable by |
| --- | --- | --- |
| SQLite lock contention | Under high round frequency, WAL may stall on writes | create_error_rate > 0 in round-summary |
| P99 degradation | Accumulated plans slow the GET /v1/rollouts list path | p99_ms > 500 trending upward in daily summaries |
| CP OOM / crash | Long-running CP accumulates memory state | cp.pid check fails; restart counter increments |
| Port conflict on restart | 19000 re-used by another process after CP death | run_soak_val12_round.sh exits 1 at health check |
| Stuck scan always empty | Soak plans may never satisfy the stuck criteria (the round scan uses threshold_seconds=3600, and plans stay in the published phase) | stuck_count=0 for all 1,440 rounds → OK, not an error |

Out-of-Scope Items

| Item | Reason |
| --- | --- |
| Edge agent behavior | Edge agents not running in soak; device simulation deferred |
| PostgreSQL backend | Requires live PG; SQLite soak is Gate C; PG soak is Gate D+ |
| Network partition injection | Requires iptables/root; see VAL11 for process-level chaos |
| Multi-region soak | Out of scope for v1 |
| Automated rollout promotion | Batch promoter not running; plans stay in published phase |