VAL 08: Fleet Rollout Throughput Validation

Purpose

This plan validates that the AutonomyOps control-plane can accept concurrent rollout plan creation at the workplan target of ≥100 concurrent device rollouts without errors.

It measures throughput at four concurrency tiers (N=1, 10, 50, 100 workers), documents the scaling behaviour of the SQLite-backed store under load, and produces a repeatable evidence record that is safe to quote in design-partner conversations.

Claims Under Test

ID	Claim
VAL08-C1	The control-plane accepts 500 plan creates issued by 100 concurrent workers (5 sequential plans each) with zero errors
VAL08-C2	All plans created across the full N=1/10/50/100 matrix are stored durably and appear in paginated `GET /v1/rollouts` results after the run
VAL08-C3	The entire N=100 scenario completes within 30 seconds wall-clock
VAL08-C4	Throughput at N=100 is ≥ throughput at N=1 (no regression under concurrency)

Why a New Function (Not Extending VAL07)

Concern	VAL07	VAL08
Focus	Single-operation latency (p50/p95/p99)	Throughput at increasing concurrency
Plan history	Polluted by 45 plans from VAL07 sequential + concurrent batches	Clean DB (removed and recreated before each run)
Concurrency tiers	Fixed N=5 for wall-clock check	N=1, 10, 50, 100 (scenario matrix)
Workplan target	`plan_create p99 ≤ 500ms`	`≥100 concurrent device rollouts`

Using a fresh control-plane (CP) instance with a clean database keeps accumulated plan history from influencing the list-consistency check and isolates the timing measurements.

Architecture

Tooling

Tool	Purpose
`bash` background jobs (`&` + `wait`)	Concurrent worker simulation — N subshells, each creating `batch_per_worker=5` plans sequentially
`curl -s -o /dev/null -w '%{http_code}'`	HTTP status capture without response body overhead
`date +%s%3N`	Wall-clock millisecond timestamps bracketing each scenario
`python3 -c`	Throughput arithmetic (plans/sec = total / elapsed_s)
Per-worker temp files (`worker-<w>.txt`, one per worker index `w`)	Error count aggregation across concurrent subshells
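As a rough illustration of how the tools in the table fit together, the sketch below captures an HTTP status with `curl`, takes millisecond timestamps with `date +%s%3N`, and derives plans/sec. The function names `now_ms`, `post_status`, and `throughput` are hypothetical and not taken from the lab script. The URL and request body are left to the caller because the plan does not specify them. `awk` stands in for the plan's `python3 -c` arithmetic here, purely to keep the sketch free of an interpreter dependency.

```bash
# Hypothetical helpers; names are illustrative, not from the lab script.

# Millisecond wall-clock timestamp (GNU date; %3N is not portable to BSD/macOS).
now_ms() { date +%s%3N; }

# POST a JSON body and print only the HTTP status code.
# $1 = URL, $2 = JSON body (both supplied by the caller).
post_status() {
  curl -s -o /dev/null -w '%{http_code}' \
    -X POST -H 'Content-Type: application/json' -d "$2" "$1"
}

# plans/sec = total / elapsed_s, with elapsed given in milliseconds.
# $1 = total plans, $2 = elapsed milliseconds.
throughput() {
  awk -v total="$1" -v ms="$2" 'BEGIN {
    if (ms <= 0) { print "0.00"; exit }   # avoid division by zero
    printf "%.2f\n", total * 1000 / ms
  }'
}

# Usage (not executed here):
#   start=$(now_ms)
#   ... run one scenario ...
#   elapsed_ms=$(( $(now_ms) - start ))
#   throughput 500 "$elapsed_ms"
```

Taking elapsed time in milliseconds and converting inside `throughput` keeps the shell side in integer arithmetic, so only the final division needs floating point.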

Port Allocation

Resource	Value
CP listen	`127.0.0.1:18993`
Metrics	`127.0.0.1:19093`
Data dir	`$WORK_DIR/val08` (removed and recreated before the run)

Concurrency Model

Each scenario launches N bash background subshells. Each subshell POSTs `batch_per_worker` (5) plans sequentially, using deterministic plan IDs of the form `val08-n${N}-w${w}-b${b}`, where `w` is the worker index and `b` is the plan's index within that worker's batch. The script calls `wait` on all subshells before it stops the wall-clock timer.

Plan IDs are globally unique across all scenarios because N, w, and b together form a unique triple.
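The fan-out and fan-in pattern described above can be sketched roughly as follows. This is a minimal illustration, not the lab script itself: `run_scenario` and `create_plan` are hypothetical names, `create_plan` is assumed to print the HTTP status for one plan ID (the real script would use `curl`), and each worker file is assumed to hold a single integer error count. Treating 201 as the only success code follows the failure-handling guidance to look for non-201 codes.

```bash
# Hypothetical sketch of the concurrency model.
# create_plan PLAN_ID must be defined by the caller and print an HTTP status.

BATCH_PER_WORKER=5

# run_scenario N OUT_DIR: launch N background workers, then wait for all.
run_scenario() {
  local n="$1" out_dir="$2" w=1
  mkdir -p "$out_dir"
  while [ "$w" -le "$n" ]; do
    (
      # Each worker creates its plans sequentially and counts failures.
      errors=0
      b=1
      while [ "$b" -le "$BATCH_PER_WORKER" ]; do
        code=$(create_plan "val08-n${n}-w${w}-b${b}")
        [ "$code" = "201" ] || errors=$((errors + 1))
        b=$((b + 1))
      done
      # One file per worker avoids concurrent writes to a shared counter.
      echo "$errors" > "$out_dir/worker-${w}.txt"
    ) &
    w=$((w + 1))
  done
  wait   # block until every worker subshell has exited
}
```

Because each subshell has its own copy of `errors`, results have to leave the worker through the filesystem, which is why the per-worker files exist.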

SQLite Serialisation Note

The control-plane uses `modernc.org/sqlite` with `SetMaxOpenConns(1)`, which serialises all writes through a single connection. This is the expected bottleneck. VAL08-07 checks that throughput at N=100 does not drop below throughput at N=1: a plateau is acceptable and expected, whereas a regression would indicate lock contention or a panic loop.

Environment Assumptions

Assumption	Value
Platform	Linux (`linux/amd64`)
CP binary	`$AUTONOMY_BIN` from lab script
Transport	Plain HTTP (no TLS)
RBAC enforcement	Not set (bootstrap-mode; see lab script CP invocation)
Workers per scenario	N ∈ {1, 10, 50, 100}
Plans per worker	5 (sequential within worker)
Total plans created	(1+10+50+100) × 5 = 805
Wall-clock bound (N=100)	30 s
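The total in the table can be reproduced with a few lines of shell arithmetic. The variable names below are illustrative (`grand_total` matches the name used in the list-consistency check).

```bash
# Recompute the expected number of plans across the four tiers.
BATCH_PER_WORKER=5
grand_total=0
for n in 1 10 50 100; do
  grand_total=$((grand_total + n * BATCH_PER_WORKER))
done
echo "grand_total=$grand_total"   # prints grand_total=805
```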

Scenario Matrix

VAL08-01: Control-Plane Reachable

Action: `GET /v1/health` against the dedicated CP at `127.0.0.1:18993`.
Evidence: `val08-health.txt`
Pass criterion: HTTP 200.

VAL08-02: N=1 Scenario, Zero Errors

Action: 1 worker creates 5 plans sequentially.
Evidence: `scenario-n1/scenario-report.txt`, `scenario-n1/worker-1.txt`
Pass criterion: `errors=0`, `ok=5`.

VAL08-03: N=10 Scenario, Zero Errors

Action: 10 concurrent workers each create 5 plans (50 total).
Evidence: `scenario-n10/scenario-report.txt`, `scenario-n10/worker-{1..10}.txt`
Pass criterion: `errors=0`, `ok=50`.

VAL08-04: N=50 Scenario, Zero Errors

Action: 50 concurrent workers each create 5 plans (250 total).
Evidence: `scenario-n50/scenario-report.txt`, `scenario-n50/worker-{1..50}.txt`
Pass criterion: `errors=0`, `ok=250`.

VAL08-05: N=100 Scenario, Zero Errors (workplan target)

Action: 100 concurrent workers each create 5 plans (500 total).
Evidence: `scenario-n100/scenario-report.txt`, `scenario-n100/worker-{1..100}.txt`
Pass criterion: `errors=0`, `ok=500`.
Workplan reference: “≥100 concurrent device rollouts (proposed validation target)”

VAL08-06: N=100 Wall-Clock ≤ 30 s

Action: Measure elapsed time for the N=100 scenario.
Evidence: `val08-wall-clock-n100.txt`
Pass criterion: `elapsed_ms ≤ 30000`.

VAL08-07: Throughput Scaling (N=100 ≥ N=1)

Action: Compare `tput_n100` (plans/sec) to `tput_n1`.
Evidence: `val08-throughput-scaling.txt`
Pass criterion: `tput_n100 ≥ tput_n1`.
Rationale: Verifies that issuing more concurrent requests does not make throughput worse. A plateau (equal throughput) is acceptable given SQLite’s single-writer model.
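One way to evaluate this criterion without floating-point rounding is to cross-multiply: since throughput is total / elapsed and elapsed times are positive, `tput_n100 ≥ tput_n1` is equivalent to `total_n100 × elapsed_n1 ≥ total_n1 × elapsed_n100`. The sketch below applies that identity. It is an alternative to comparing the computed plans/sec values directly, and `scaling_ok` is a hypothetical name. The timings in the usage comments are illustrative inputs, not measured results.

```bash
# scaling_ok TOTAL_N1 ELAPSED_MS_N1 TOTAL_N100 ELAPSED_MS_N100
# Succeeds (exit 0) when throughput at N=100 is >= throughput at N=1.
# Uses integer cross-multiplication, so equal throughput counts as a pass.
scaling_ok() {
  [ $(( $3 * $2 )) -ge $(( $1 * $4 )) ]
}

# Usage with made-up timings:
#   scaling_ok 5 250 500 12500   # 20/s vs 40/s -> pass
#   scaling_ok 5 250 500 25000   # 20/s vs 20/s -> pass (plateau)
#   scaling_ok 5 250 500 50000   # 20/s vs 10/s -> fail (regression)
```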

VAL08-08: Aggregate Zero Errors

Action: Sum error counts across all four scenarios.
Evidence: `val08-error-aggregate.txt`
Pass criterion: `total_errors=0`.
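A minimal sketch of the aggregation step follows. It assumes each `worker-*.txt` file contains a single integer error count, which the plan does not state explicitly, and `total_errors` as a function name is hypothetical.

```bash
# total_errors ROOT_DIR
# Sums the per-worker error counts under ROOT_DIR/scenario-n*/worker-*.txt.
# Assumption: each worker file holds one integer (the worker's error count).
total_errors() {
  local root="$1" sum=0 f count
  for f in "$root"/scenario-n*/worker-*.txt; do
    [ -f "$f" ] || continue            # glob matched nothing
    count=$(cat "$f")
    sum=$((sum + ${count:-0}))         # treat an empty file as zero
  done
  echo "total_errors=$sum"
}
```

An empty worker file is counted as zero here. A stricter script might treat it as an error instead, since it could mean the worker died before writing its result.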

VAL08-09: List Count Consistent

Action: Page through `GET /v1/rollouts?limit=100` after all scenarios complete and compare the accumulated `.plans[]` count to `expected_min = grand_total − total_errors`.
Evidence: `val08-list-consistency.txt`
Pass criterion: `list_count ≥ expected_min`.
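The paging loop might look roughly like the sketch below. Several details are assumptions rather than facts from this plan: `fetch_page` is a hypothetical hook that prints one JSON page for a given cursor, `next_cursor` is assumed to be a top-level field that is null or absent on the last page, and the `cursor` query parameter name in the comment is a guess. The sketch depends on `jq`.

```bash
# fetch_page CURSOR must be defined by the caller and print one JSON page.
# A curl-based version might resemble (parameter name "cursor" is assumed):
#   fetch_page() {
#     curl -s "http://127.0.0.1:18993/v1/rollouts?limit=100&cursor=$1"
#   }

# count_plans: follow next_cursor until exhausted and print the plan count.
count_plans() {
  local cursor="" page n total=0
  while :; do
    page=$(fetch_page "$cursor") || return 1
    n=$(printf '%s' "$page" | jq '.plans | length')
    total=$((total + n))
    # "// empty" yields nothing when next_cursor is null or missing.
    cursor=$(printf '%s' "$page" | jq -r '.next_cursor // empty')
    [ -n "$cursor" ] || break
  done
  echo "$total"
}
```

Counting only the first page would cap `list_count` at 100 and fail the check even when all 805 plans are stored, so the loop has to follow the cursor to the end.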

VAL08-10: Prometheus Observations Recorded

Action: Scrape `http://127.0.0.1:19093/metrics` and sum all `cp_http_requests_total` samples.
Evidence: `val08-metrics-raw.txt`, `val08-prometheus-check.txt`
Pass criterion: `cp_http_requests_total > 0`.
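Summing the counter from a saved scrape can be sketched with `awk`. This assumes the scrape is in the Prometheus text exposition format with no trailing timestamps on sample lines, so the last field of each line is the value. `sum_requests` is a hypothetical name.

```bash
# sum_requests SCRAPE_FILE
# Adds up every cp_http_requests_total sample, across all label sets.
# Assumption: sample lines carry no trailing timestamp, so $NF is the value.
sum_requests() {
  awk '
    /^#/ { next }                                        # skip HELP/TYPE lines
    $1 ~ /^cp_http_requests_total([{]|$)/ { sum += $NF } # bare or labelled
    END { printf "%d\n", sum }
  ' "$1"
}

# Usage (not executed here):
#   curl -s http://127.0.0.1:19093/metrics > val08-metrics-raw.txt
#   [ "$(sum_requests val08-metrics-raw.txt)" -gt 0 ] && echo PASS
```

Anchoring the match on the metric name followed by `{` or end of field keeps similarly prefixed metrics out of the sum.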

Evidence Files

File	Description
`val08-cp.log`	Control-plane stdout/stderr
`val08-health.txt`	Health check result
`scenario-n1/scenario-report.txt`	N=1 throughput + error summary
`scenario-n10/scenario-report.txt`	N=10 throughput + error summary
`scenario-n50/scenario-report.txt`	N=50 throughput + error summary
`scenario-n100/scenario-report.txt`	N=100 throughput + error summary
`scenario-n{1,10,50,100}/worker-*.txt`	Per-worker error counts (one file per worker)
`val08-wall-clock-n100.txt`	Elapsed ms for N=100 scenario
`val08-throughput-scaling.txt`	Throughput at all four tiers + scaling pass flag
`val08-error-aggregate.txt`	Total error count across all scenarios
`val08-list-consistency.txt`	List endpoint count vs expected minimum
`val08-metrics-raw.txt`	Raw Prometheus scrape
`val08-prometheus-check.txt`	`cp_http_requests_total` aggregate + pass flag
`val08-report.txt`	Human-readable composite report (10 checks)
`val08-report.json`	Machine-readable composite report

Pass/Fail Criteria

Full pass: All 10 checks report PASS.

Minimum acceptable: VAL08-01, VAL08-05, VAL08-08 pass (workplan target + zero-error guarantee). Remaining checks are context for performance characterisation.
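The two verdict levels can be expressed as a small classifier. The sketch below is hypothetical: it assumes check results arrive as `CHECK_ID STATUS` lines on stdin, which is not necessarily how `val08-report.txt` is laid out, and the labels `FULL_PASS`, `MINIMUM_ACCEPTABLE`, and `FAIL` are illustrative.

```bash
# verdict: read "CHECK_ID STATUS" lines on stdin and print an overall result.
# Assumed input format, e.g. "VAL08-05 PASS"; one line per check.
verdict() {
  awk '
    { total++ }
    $2 == "PASS" { passed++ }
    # The three checks named as the minimum acceptable set.
    $2 == "PASS" && ($1 == "VAL08-01" || $1 == "VAL08-05" || $1 == "VAL08-08") {
      required++
    }
    END {
      if (total == 10 && passed == 10) print "FULL_PASS"
      else if (required == 3)          print "MINIMUM_ACCEPTABLE"
      else                             print "FAIL"
    }
  '
}
```

Counting required checks that passed, rather than required checks that failed, means a missing result line is treated as a failure instead of silently passing.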

Key thresholds:

Check	Threshold
VAL08-05 (N=100 success)	`errors=0` across 500 plans
VAL08-06 (wall clock)	`elapsed_ms ≤ 30000`
VAL08-07 (scaling)	`tput_n100 ≥ tput_n1`
VAL08-08 (aggregate)	`total_errors=0`

Failure Handling

Symptom	Likely Cause	Resolution
VAL08-01 FAIL	CP binary missing or port conflict	Check `val08-cp.log`; verify port 18993 free
VAL08-05 FAIL, errors > 0	SQLite write contention returning 5xx	Inspect `scenario-n100/worker-*.txt` for non-201 codes
VAL08-06 FAIL, elapsed > 30s	Very slow host or high swap	Note hardware; bound is intentionally generous (30s for 500 creates)
VAL08-07 FAIL, tput_n100 < tput_n1	Serialisation regression or panic loop	Check CP logs for errors during N=100 scenario
VAL08-09 FAIL, list_count low	Missing pages, failed creates, or list-path regression	Check `val08-list-consistency.txt`, `next_cursor` handling, and scenario error files