VAL 12 — Fleet Rollout 30-Day Soak¶
Purpose¶
This plan defines the long-duration validation framework that satisfies the workplan Gate D requirement:
“30-day soak, ≥100 concurrent device rollouts, ≥99% rollback success rate”
The framework continuously exercises the control-plane rollout and recovery surfaces over a 30-day window, collecting structured round evidence and daily summaries that aggregate into a final pass/fail report against the four VAL12 claims.
Branch-Specific Rule¶
| Question | Answer |
|---|---|
| Covered by existing lab? | No. All existing labs ( |
| Lab to extend | N/A. Cannot extend the existing single-shot lab cleanly. |
| New runner required? | Yes. Three new scripts: run_soak_val12_setup.sh, run_soak_val12_round.sh, run_soak_val12_report.sh |
| cli-audit-lab.md impact | §8 scope note only (no new slice in |
Claims Under Test¶
| ID | Claim |
|---|---|
| VAL12-C1 | The control-plane sustains ≥ 100 concurrent plan entries in the store for 30 consecutive days without data corruption |
| VAL12-C2 | Rollback/recovery of stuck plans succeeds ≥ 99% of the time over the 30-day window (Gate D rollback success target) |
| VAL12-C3 | P99 plan-create latency remains ≤ 500 ms throughout the soak (no degradation under accumulated store state) |
| VAL12-C4 | CP availability is ≥ 99.9% measured as healthy rounds / total scheduled rounds |
Soak Environment Design¶
Persistent Control-Plane¶
| Resource | Value |
|---|---|
| CP listen | 127.0.0.1:19000 |
| Metrics | http://127.0.0.1:19100/metrics |
| Data dir | $SOAK_DIR/cp-data/ |
| Log | $SOAK_DIR/cp.log |
| PID file | $SOAK_DIR/cp.pid |
| RBAC | |
| Audit dir | $SOAK_DIR/audit-store/ |
| Operator | |
The CP is started once by run_soak_val12_setup.sh and is expected to stay up
for the full 30-day window. If the CP dies, the round script detects the failure,
records health=failed, increments the downtime counter, attempts a restart,
and continues only if the replacement instance becomes healthy.
Directory Structure¶
$SOAK_DIR/
config.env # CP URL, binary path, thresholds (written by setup)
cp.pid # PID of running CP process
cp.log # CP stdout/stderr (append across restarts)
cp-data/ # SQLite WAL + database files
audit-store/ # Persistent audit JSONL files
rounds/
YYYY-MM-DD/
round-HHMMSS/
round-summary.json # Per-round stats
plans-created.txt # HTTP codes for 10 creates
create-times.txt # Latencies from curl -w '%{time_total}'
stuck-scan.json # GET /v1/rollouts/stuck response
recovery.txt # rollback execute outputs (if any)
metrics-scrape.prom # Prometheus scrape
daily/
YYYY-MM-DD-summary.json # Daily aggregation from all rounds that day
checkpoints/
checkpoint-7d.json # 7-day checkpoint report (written by report script)
checkpoint-14d.json # 14-day checkpoint report
final-report.txt # Human-readable 30-day final report
final-report.json # Machine-readable 30-day final report
alerts.log # Append-only log of threshold breaches
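For orientation, a minimal provisioning sketch that would produce this layout is shown below; the config.env keys (CP_URL, METRICS_URL, CP_BIN, threshold values) are illustrative assumptions rather than the real contract of run_soak_val12_setup.sh.

```bash
#!/usr/bin/env bash
# Hypothetical provisioning sketch for the layout above.
# Key names in config.env are assumptions, not the real script's contract.
set -euo pipefail

SOAK_DIR="${1:?usage: run_soak_val12_setup.sh <soak-dir>}"

mkdir -p "$SOAK_DIR"/{cp-data,audit-store,rounds,daily,checkpoints}
touch "$SOAK_DIR/alerts.log"

cat > "$SOAK_DIR/config.env" <<EOF
CP_URL=http://127.0.0.1:19000
METRICS_URL=http://127.0.0.1:19100/metrics
CP_BIN=/path/to/control-plane        # placeholder binary path
P99_THRESHOLD_MS=500
ROLLBACK_RATE_THRESHOLD=0.990
EOF
```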
CP Restart Recovery¶
If run_soak_val12_round.sh finds the CP PID dead, it:
1. Records "health": "failed", "restarted": true in round-summary.json
2. Starts a new CP instance (same data dir, same port)
3. Waits for the health check to pass (up to 30 seconds)
4. Proceeds with the round if the restart succeeds
5. Marks v01_health as FAIL for this round (counts against uptime)
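A minimal sketch of that liveness-check-and-restart path, assuming config.env exports CP_URL and CP_BIN and that the CP binary takes a data-dir flag (both assumptions, not the real script):

```bash
# Liveness check + restart sketch; CP_BIN flags and config.env names are placeholders.
source "$SOAK_DIR/config.env"

health="ok"; restarted=false
if ! kill -0 "$(cat "$SOAK_DIR/cp.pid" 2>/dev/null)" 2>/dev/null; then
  health="failed"; restarted=true
  # Same data dir, same port; append to the existing log across restarts.
  "$CP_BIN" --data-dir "$SOAK_DIR/cp-data" >> "$SOAK_DIR/cp.log" 2>&1 &
  echo $! > "$SOAK_DIR/cp.pid"
  # Give the replacement instance up to 30 s to become healthy.
  for _ in $(seq 1 30); do
    curl -fsS "$CP_URL/v1/health" >/dev/null 2>&1 && break
    sleep 1
  done
  curl -fsS "$CP_URL/v1/health" >/dev/null 2>&1 || exit 1  # restart failed: abort the round
fi
```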
Workload Schedule¶
Round Frequency¶
| Schedule | Recommended value |
|---|---|
| Round interval | Every 30 minutes (*/30 * * * * via cron) |
| Daily report | Once at 01:00 UTC (0 1 * * * via cron) |
| 7-day checkpoint | Day 7, 14, 21 at 02:00 UTC (manual or cron) |
| Final report | Day 30 (manual) |
Per-Round Workload (10 plans × every 30 min = 480 plans/day)¶
Each invocation of run_soak_val12_round.sh performs the following in order:
1. CP health check — GET /v1/health; restart the CP if it is dead
2. 10 plan creates — unique IDs soak-<unix_ts>-{01..10}, timed with curl -w '%{time_total}' (sketched below)
3. Stuck scan — GET /v1/rollouts/stuck?threshold_seconds=3600; record stuck_count
4. Stuck recovery — for each stuck plan: rollback execute strategy=retry --reason soak-auto-recovery; record success/fail
5. Plan count check — page through GET /v1/rollouts?limit=100 until next_cursor="", sum the full corpus, and check >= 100 after round 10
6. Prometheus scrape — curl http://127.0.0.1:19100/metrics; extract cp_http_requests_total and cp_rollout_plans_total
7. Round summary write — write round-summary.json with all round stats
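A sketch of step 2 (the timed creates), assuming a POST /v1/rollouts create endpoint and a minimal JSON payload; the real round script defines the actual plan body, and ROUND_DIR stands in for the current round directory:

```bash
# Step 2 sketch: ten timed plan creates. Endpoint and payload shape are assumed.
ts=$(date +%s)
: > "$ROUND_DIR/plans-created.txt"
: > "$ROUND_DIR/create-times.txt"
for i in $(seq -w 1 10); do
  out=$(curl -sS -o /dev/null -w '%{http_code} %{time_total}' \
        -X POST "$CP_URL/v1/rollouts" \
        -H 'Content-Type: application/json' \
        -d "{\"plan_id\": \"soak-${ts}-${i}\"}")
  echo "${out%% *}" >> "$ROUND_DIR/plans-created.txt"   # HTTP code
  echo "${out##* }" >> "$ROUND_DIR/create-times.txt"    # seconds; converted to ms by the report
done
```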
Plan Accumulation Model¶
Plans are created but not deleted by the round script. After 10 rounds (5 hours at 30-min intervals), the store has ≥ 100 plans, satisfying the concurrent-fleet target. Over 30 days at 48 rounds/day × 10 plans = 14,400 plans total.
| Day | Plans in store (cumulative) | Remarks |
|---|---|---|
| 0.2 (10 rounds) | ≥ 100 | Fleet target reached |
| 1 | ≥ 480 | 1 day × 48 rounds × 10 plans |
| 7 | ≥ 3,360 | 7-day checkpoint |
| 30 | ≥ 14,400 | Final report target |
The SQLite store is effectively append-only for this soak (plans are created but never deleted); database and WAL size growth is a tracked metric.
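Because nothing is pruned, the round script can also capture store growth cheaply; the variable below is illustrative and may simply be folded into round-summary.json:

```bash
# Capture plan-store size (database + WAL) each round so growth can be plotted later.
db_bytes=$(du -sb "$SOAK_DIR/cp-data" | cut -f1)   # illustrative field; GNU du assumed
```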
Crontab Example¶
# In crontab: crontab -e
SOAK_DIR=/path/to/evidence/soak-val12
REPO_ROOT=/path/to/autonomyops
# Every 30 minutes: execute one workload round
*/30 * * * * $REPO_ROOT/scripts/labs/run_soak_val12_round.sh $SOAK_DIR >> $SOAK_DIR/cron.log 2>&1
# Daily at 01:00 UTC: generate daily summary
0 1 * * * $REPO_ROOT/scripts/labs/run_soak_val12_report.sh $SOAK_DIR --type daily >> $SOAK_DIR/cron.log 2>&1
Metrics and Alerts¶
Metrics Captured Per Round¶
| Metric | Source | Target |
|---|---|---|
| create_errors | round plans-created.txt | = 0 |
| p99_ms | create-times.txt (Python nearest-rank) | p99 ≤ 500ms |
| stuck_count | stuck-scan.json | alert if > 0 for 3 consecutive rounds |
| recovery_ok / recovery_fail | recovery.txt exit codes | recovery_fail = 0 |
| health | GET /v1/health | alert if failed; restarted rounds still count as failed for uptime |
| cp_http_requests_total, cp_rollout_plans_total | metrics-scrape.prom | increasing each round |
| plan_count | GET /v1/rollouts count | increasing each round |
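The p99 row relies on a nearest-rank percentile over the ten samples in create-times.txt. A sketch of that computation (shelling out to Python, per the table) follows; note that with only 10 samples the nearest-rank p99 is simply the slowest create.

```bash
# Nearest-rank p99 over create-times.txt (seconds from curl), reported in milliseconds.
p99_ms=$(python3 - "$ROUND_DIR/create-times.txt" <<'PY'
import math, sys
xs = sorted(float(line) for line in open(sys.argv[1]) if line.strip())
rank = math.ceil(0.99 * len(xs))      # nearest-rank: ceil(p * N), 1-indexed
print(round(xs[rank - 1] * 1000))     # with N = 10 this picks the max sample
PY
)
```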
Daily Aggregated Metrics¶
| Metric | Computation | Alert threshold |
|---|---|---|
| | sum(round.create_errors) | > 0 → WARN |
| daily_rollback_rate | recovery_ok / (recovery_ok + recovery_fail), only when recovery attempts > 0 | < 0.990 → ALERT |
| daily_p99_ms | max(round.p99_ms) across all rounds | > 1000 → WARN (2× baseline) |
| | rounds_healthy / total_rounds × 100 | < 99.9 → ALERT |
| | count(round.stuck_count > 0) | > 3 consecutive → ALERT |
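A sketch of the daily aggregation using jq over one day's round summaries; jq availability and the exact round-summary field names (create_errors, recovery_ok, recovery_fail, health, p99_ms) are assumptions consistent with the rest of this plan.

```bash
# Daily aggregation sketch over one day's round summaries (field names assumed).
day=$(date -u +%F)
jq -s '{
    total_rounds:   length,
    rounds_healthy: ([.[] | select(.health == "ok")] | length),
    create_errors:  ([.[].create_errors] | add),
    recovery_ok:    ([.[].recovery_ok] | add),
    recovery_fail:  ([.[].recovery_fail] | add),
    max_p99_ms:     ([.[].p99_ms] | max)
  }
  | .rollback_rate = (if (.recovery_ok + .recovery_fail) > 0
                      then .recovery_ok / (.recovery_ok + .recovery_fail) else null end)
  | .uptime_pct    = (.rounds_healthy / .total_rounds * 100)
' "$SOAK_DIR/rounds/$day"/round-*/round-summary.json \
  > "$SOAK_DIR/daily/$day-summary.json"
```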
Alert Log Format (alerts.log)¶
<ISO-8601> ALERT daily_rollback_rate=0.950 threshold=0.990 day=7
<ISO-8601> WARN daily_p99_ms=850ms threshold=1000ms day=12
<ISO-8601> ALERT cp_health_failed consecutive_failures=3 day=3
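The report script can emit these lines with a small helper along the following lines; SOAK_DAY is a hypothetical stand-in for however the script tracks the current soak day.

```bash
# Append a threshold breach to alerts.log in the format above (SOAK_DAY is hypothetical).
alert() {
  # usage: alert <ALERT|WARN> "key=value [key=value ...]"
  echo "$(date -u +%FT%TZ) $1 $2 day=${SOAK_DAY}" >> "$SOAK_DIR/alerts.log"
}
alert ALERT "daily_rollback_rate=0.950 threshold=0.990"
```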
Evidence Retention Plan¶
Round Evidence¶
Retained for the full 30-day window. Each round directory is ≈ 20 KB (plans-created.txt + create-times.txt + stuck-scan.json + round-summary.json + metrics-scrape.prom). Total estimated storage:
1,440 rounds × 20 KB/round = ~28 MB over 30 days
This is well within typical lab storage budgets. No pruning required during the soak.
Daily Summaries¶
Retained permanently (one JSON file per day, ≈ 2 KB each = 60 KB total for 30 days).
Audit Store¶
The persistent audit JSONL files grow continuously. Expected size:
~25 audit events/round × 1,440 rounds × ~200 bytes/event = ~7 MB
No pruning during the soak. After the soak, autonomy audit prune --older-than 31d
can be used to expire records.
Archiving at End of Soak¶
After final report generation, archive the full soak directory:
tar -czf soak-val12-$(date +%F).tar.gz "$SOAK_DIR"
sha256sum soak-val12-$(date +%F).tar.gz > soak-val12-$(date +%F).tar.gz.sha256
The archive + SHA file constitute the permanent Gate D evidence artifact.
Scenario Matrix (10 Checks)¶
VAL12-01 — Framework Provisioned¶
When: Setup time.
Action: run_soak_val12_setup.sh exits 0; GET /v1/health returns 200.
Evidence: setup-summary.txt
Pass criterion: CP starts within 10s; setup exits 0.
VAL12-02 — Initial Workload Round¶
When: Round 1 (immediately after setup).
Action: First run_soak_val12_round.sh invocation creates 10 plans with 0 errors.
Evidence: rounds/<date>/round-<T>/round-summary.json
Pass criterion: create_errors=0, ok=10.
VAL12-03 — Fleet Target Reached¶
When: After round 10 (≈5 hours into soak).
Action: Page through GET /v1/rollouts?limit=100 and sum all returned plans until
next_cursor is empty; total count must reach >= 100.
Evidence: rounds/<date>/round-<T>/round-summary.json plan_count field.
Pass criterion: plan_count ≥ 100 (workplan GA gate: ≥100 concurrent plans).
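A pagination sketch for this check, assuming the list response carries an items array alongside next_cursor (the array name is an assumption):

```bash
# Count the full plan corpus by following next_cursor until it is empty.
count=0; cursor=""
while :; do
  page=$(curl -sS "$CP_URL/v1/rollouts?limit=100&cursor=${cursor}")
  count=$(( count + $(echo "$page" | jq '.items | length') ))
  cursor=$(echo "$page" | jq -r '.next_cursor // empty')
  [ -z "$cursor" ] && break
done
echo "plan_count=${count}"    # VAL12-03 passes once this reaches >= 100
```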
VAL12-04 — P99 Latency Baseline¶
When: Round 1 (establishes baseline before store accumulation).
Action: Compute p99 of 10 create latencies from round 1 create-times.txt.
Evidence: rounds/<date>/round-<T>/round-summary.json p99_ms field.
Pass criterion: p99_ms ≤ 500 (workplan target: plan creation < 500ms p99).
VAL12-05 — Stuck Detection Operational¶
When: Every round.
Action: GET /v1/rollouts/stuck?threshold_seconds=3600 returns HTTP 200 and a
valid stuck-scan JSON payload.
Evidence: rounds/<date>/round-<T>/stuck-scan.json
Pass criterion: scan_http_code="200" and scan_ok=true in every round. A
stuck_count > 0 is not a failure; it shows the operator surface is working.
A scan transport error, non-200 status, or invalid JSON payload is a failure.
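A sketch of the scan-and-recover step; the stuck-scan response shape and the argument order of the autonomy rollback invocation (beyond the fragment quoted in the workload list) are assumptions:

```bash
# Stuck scan, then per-plan auto-recovery. Response shape {stuck: [{plan_id: ...}]} is assumed.
scan_http_code=$(curl -sS -o "$ROUND_DIR/stuck-scan.json" -w '%{http_code}' \
  "$CP_URL/v1/rollouts/stuck?threshold_seconds=3600")
scan_ok=false
if [ "$scan_http_code" = "200" ] && jq -e . "$ROUND_DIR/stuck-scan.json" >/dev/null 2>&1; then
  scan_ok=true
fi
stuck_count=$(jq '[.stuck[]?] | length' "$ROUND_DIR/stuck-scan.json" 2>/dev/null || echo 0)

for plan_id in $(jq -r '.stuck[]? | .plan_id' "$ROUND_DIR/stuck-scan.json" 2>/dev/null); do
  # Argument order is an assumption; only the quoted strategy/reason flags come from this plan.
  if autonomy rollback execute "$plan_id" strategy=retry --reason soak-auto-recovery \
       >> "$ROUND_DIR/recovery.txt" 2>&1; then
    echo "ok $plan_id"   >> "$ROUND_DIR/recovery.txt"
  else
    echo "fail $plan_id" >> "$ROUND_DIR/recovery.txt"
  fi
done
```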
VAL12-06 — 7-Day Rollback Rate ≥ 99%¶
When: Day 7 checkpoint.
Action: Aggregate recovery results from all rounds in days 1–7.
Evidence: checkpoints/checkpoint-7d.json
Pass criterion: total_recovery_attempts > 0 and rollback_rate_7d ≥ 0.990
(equivalent to ≤ 1 failure per 100 recoveries).
VAL12-07 — P99 Latency Maintained at 7 Days¶
When: Day 7 checkpoint.
Action: Compute p99 across all rounds in days 1–7 from daily summaries.
Evidence: checkpoints/checkpoint-7d.json
Pass criterion: p99_ms_7d ≤ 500 — no degradation under accumulated store state.
VAL12-08 — CP Availability ≥ 99.9%¶
When: 30-day final report.
Action: healthy_rounds / total_rounds × 100, where only rounds with
health="ok" count as healthy; restarted rounds still count against uptime.
Evidence: final-report.json uptime_pct field.
Pass criterion: uptime_pct ≥ 99.9 (at most 1 failed round out of 1,440, i.e. roughly 43 minutes of downtime budget over 30 days).
VAL12-09 — Total Plans Created ≥ 4,000¶
When: 30-day final report.
Action: Sum create_ok across all round summaries.
Evidence: final-report.json total_creates field.
Pass criterion: total_creates ≥ 4000 (conservative: only ~28% of the 14,400 scheduled
creates need to succeed; actual expected ≥ 14,000).
VAL12-10 — 30-Day Aggregate Rollback Rate ≥ 99%¶
When: 30-day final report.
Action: total_recovery_ok / (total_recovery_ok + total_recovery_fail).
Evidence: final-report.json agg_rollback_rate and total_recovery_attempts
fields.
Pass criterion: total_recovery_attempts > 0 and agg_rollback_rate ≥ 0.990
(workplan Gate D target).
Pass/Fail Criteria¶
Full pass: All 10 checks report PASS at their respective verification times.
Minimum acceptable (Gate D): VAL12-03, VAL12-10 pass — fleet target reached and 30-day rollback rate ≥ 99%.
Key thresholds:
| Check | Threshold | Verification time |
|---|---|---|
| VAL12-01 | CP starts within 10s | Setup |
| VAL12-02 | create_errors=0 for round 1 | Round 1 |
| VAL12-03 | paginated plan_count ≥ 100 | Round 10 (~5h) |
| VAL12-04 | p99_ms ≤ 500 | Round 1 |
| VAL12-05 | scan_ok=true every round | Continuous |
| VAL12-06 | recovery_attempts_7d > 0 and rollback_rate_7d ≥ 0.990 | Day 7 |
| VAL12-07 | p99_ms_7d ≤ 500 | Day 7 |
| VAL12-08 | uptime_pct ≥ 99.9 | Day 30 |
| VAL12-09 | total_creates ≥ 4000 | Day 30 |
| VAL12-10 | total_recovery_attempts > 0 and agg_rollback_rate ≥ 0.990 | Day 30 |
Final Report Template¶
# VAL 12 — Fleet Rollout 30-Day Soak Final Report
Soak start: <ISO-8601>
Soak end: <ISO-8601>
Duration days: 30
Total rounds: <N> (scheduled: 1440)
Healthy rounds: <N>
Failed rounds: <N>
CP restart count: <N>
## Fleet Coverage
Total plans created: <N> (target ≥ 4,000 | expected ~14,400)
Max concurrent plans: <N> (target ≥ 100)
Plan store final size: <N>
## Rollback Reliability
Total recovery attempts: <N>
Rollback sampled: <true|false>
Total recovery successes: <N>
Total recovery failures: <N>
Aggregate rollback rate: <X.XXX> (target ≥ 0.990)
7-day window rate: <X.XXX>
30-day window rate: <X.XXX> ← Gate D target
## Latency (create P99 across all rounds)
Round 1 p99_ms: <N> (baseline)
Day 7 p99_ms: <N>
Day 30 p99_ms: <N>
All-time max p99_ms: <N> (target ≤ 500ms)
All-time 95th pct of p99: <N>
## Availability
Total rounds: <N>
Healthy rounds: <N>
Uptime %: <X.X> (target ≥ 99.9%)
Total CP restarts: <N>
Longest downtime streak: <N> rounds
## Stuck Detection
Total stuck plans detected: <N>
Total recoveries attempted: <N>
Auto-recovery success rate: <X.XXX>
## Alerts Issued
<copy from alerts.log — or "none">
## VAL12 Check Results
VAL12-01 framework_provisioned: PASS / FAIL
VAL12-02 initial_round_success: PASS / FAIL
VAL12-03 fleet_target_reached: PASS / FAIL (plan_count=<N>, target=100)
VAL12-04 latency_baseline: PASS / FAIL (p99=<N>ms, target=500ms)
VAL12-05 stuck_detection_ops: PASS / FAIL (scan_failures=<N>)
VAL12-06 rollback_rate_7d: PASS / FAIL (attempts=<N>, rate=<X>, target=0.990)
VAL12-07 latency_maintained_7d: PASS / FAIL (p99_7d=<N>ms, target=500ms)
VAL12-08 cp_availability: PASS / FAIL (uptime=<X.X>%, target=99.9%)
VAL12-09 total_creates: PASS / FAIL (creates=<N>, target=4000)
VAL12-10 agg_rollback_rate: PASS / FAIL (attempts=<N>, rate=<X.XXX>, target=0.990)
Overall: PASS / FAIL (<N>/10 checks)
## Gate D Assessment
<PASS: All Gate D criteria met — fleet soak demonstrates ≥99% rollback reliability
across ≥100 concurrent plans over 30 days.>
OR
<FAIL: Gate D criteria not met — see check results above. Required: VAL12-03 + VAL12-10.>
## Artifact
Archive: soak-val12-<YYYY-MM-DD>.tar.gz
SHA-256: <hash>
Known Failure Modes¶
| Mode | Description | Detectable by |
|---|---|---|
| SQLite lock contention | Under high round frequency, WAL may stall on writes | create_errors > 0 or latency spikes in create-times.txt |
| P99 degradation | Accumulated plans slow plan creates over time | rising p99_ms in daily summaries (daily_p99_ms WARN) |
| CP OOM / crash | Long-running CP accumulates memory state | failed health checks and CP restart count |
| Port conflict on restart | 19000 re-used by another process after CP death | restart never passes health check; errors in cp.log |
| Stuck scan always empty | Soak plans never exceed threshold (72h set in spec) | stuck_count = 0 in every round (recovery path never exercised) |
Out-of-Scope Items¶
| Item | Reason |
|---|---|
| Edge agent behavior | Edge agents not running in soak; device simulation deferred |
| PostgreSQL backend | Requires live PG; SQLite soak is Gate C; PG soak is Gate D+ |
| Network partition injection | Requires iptables/root; see VAL11 for process-level chaos |
| Multi-region soak | Out of scope for v1 |
| Automated rollout promotion | Batch promoter not running; plans stay in published phase |