CLI Audit And Support Bundle Lab

Audience: reviewers and operators who want reproducible local evidence, captured from real command runs, for the following surfaces:

  • the PR-17 CLI and retained audit surfaces

  • the PR-18 support-bundle surface

  • the PR-19 audit query/export refinements

  • the PR-20 RBAC role model CLI work

  • the PR-21 RBAC enforcement plumbing work

  • the PR-22 metrics CLI surface work

  • the PR-24 HA quorum status surface work

  • the PR-25 split-brain detect workflow work

  • the PR-26 split-brain manual recovery workflow work

  • the PR-27 config migration tooling work

  • the PR-29-followup-a RBAC default-on hardening work

  • the PR-29-followup-c cert revocation transport enforcement work

  • the PR-29-followup-d database-backed audit persistence work

  • the PR-29-followup-e cert-management RBAC coverage completion work

This lab captures actual CLI stdout plus the structured audit lines emitted on stderr by live commands in the current branch, and then queries and exports the retained records written by those same runs:

  1. autonomy rollout plan create

  2. autonomy rollout plan publish

  3. autonomy rollout plan cancel

  4. autonomy ha backup create

  5. autonomy ha backup restore

  6. autonomy ha failover trigger

  7. autonomy cert issue

  8. autonomy cert rotate

  9. autonomy rbac role create

  10. autonomy rbac role list

  11. autonomy rbac role assign

  12. edgectl relay deadletter retry

  13. edgectl relay deadletter purge --force

  14. edgectl relay config set-bandwidth

  15. autonomy audit query

  16. autonomy audit export --format json|csv

  17. autonomy support-bundle generate

  18. Default-on RBAC bootstrap mode denial (no env var set, empty store)

  19. RBAC bootstrap seed via rbac role assign (bootstrap path)

  20. RBAC break-glass bypass (AUTONOMY_RBAC_BREAK_GLASS=1)

  21. RBAC opt-out backward compat (AUTONOMY_RBAC_ENFORCEMENT=0)

  22. Default-on RBAC-enforced autonomy audit query

  23. Default-on RBAC-enforced autonomy audit export

  24. Default-on RBAC-enforced autonomy ha status

  25. Direct /v1/ha/status requests with and without operator identity

  26. autonomy metrics list

  27. autonomy metrics query

  28. autonomy ha quorum status

  29. autonomy ha split-brain detect

  30. autonomy ha split-brain recover

  31. autonomy config migrate

  32. Cert RBAC enforcement — denied and allowed flows for all six autonomy cert subcommands under AUTONOMY_RBAC_ENFORCEMENT=1

1. Gaps Closed by This Lab

Before this lab refresh, the checked-in PR-17 evidence relied on in-package test harnesses. That left the following evidence gaps:

  • rollout, HA, and backup audit captures were not tied to a live control-plane process

  • relay audit captures were not clearly tied to the existing daemon-backed edge lab

  • the HA helper lived outside the repo under /tmp, which weakened reproducibility

  • the retained query and export surface was covered by tests only, without checked-in live evidence from the same captured audit dataset

  • support-bundle generation had no checked-in live evidence against a real control-plane, retained audit store, and log file

  • RBAC role create/list/assign had no checked-in live evidence against the retained audit store and local file-backed assignment model

  • RBAC enforcement had no checked-in live evidence proving allowed and denied behavior across CLI and direct HA-status HTTP paths

  • metrics list/query had no checked-in live evidence against a real control-plane metrics endpoint

  • HA quorum status had no checked-in live evidence for healthy, degraded, lost, and restored transitions from the same live HA lab

  • split-brain detection had no checked-in live evidence for healthy, detected, and possible states or for deduplicated ha.split_brain.detected audit emission

  • split-brain recovery had no checked-in live evidence for dry-run planning, executed recovery, post-recovery convergence, or retained ha.split_brain.recovered audit records

  • config migration tooling had no checked-in live evidence for deterministic dry-run output, format selection, file and in-place writes, or fail-closed handling of malformed and invalid legacy input

This lab closes those gaps by:

  • starting a real local SQLite-backed control-plane for rollout verification

  • running the real PostgreSQL primary + standby HA workflow for backup and failover verification

  • reusing the live daemon-backed edge relay lab for retry, purge, and bandwidth

  • building the HA helper from the repo at scripts/labs/orchestrator_ha_server.go

  • retaining audit records under the evidence directory and querying/exporting those same live records through autonomy audit query and autonomy audit export

  • generating a support bundle from the same live HA endpoint, retained audit store, and HA server log used by the rest of the lab

  • capturing live evidence for the PR-19 retained-audit refinements:

    • category, outcome, and source filtering

    • explicit invalid --output rejection

    • invalid start/end time-range rejection

    • invalid export format rejection without truncating an existing file

  • capturing live evidence for PR-20 RBAC role create/list/assign, including:

    • canonicalized role-name persistence

    • idempotent repeat assignment behavior

    • retained auth.role.assigned query proof

  • capturing live evidence for PR-21 RBAC enforcement and PR-29-followup-a default-on hardening, including:

    • default-on bootstrap mode denial with no AUTONOMY_RBAC_ENFORCEMENT set

    • bootstrap seed via rbac role assign with empty store (only allowed bootstrap action)

    • post-bootstrap outsider denial (full enforcement is active)

    • post-bootstrap allowed access after correct role is granted

    • break-glass bypass (AUTONOMY_RBAC_BREAK_GLASS=1) with mandatory auth.break_glass.used audit event

    • break-glass safety: without AUTONOMY_OPERATOR the bypass is still denied

    • opt-out backward compat (AUTONOMY_RBAC_ENFORCEMENT=0)

    • default-on denied audit query and audit export for a non-auditor (no env var)

    • default-on allowed audit query, audit export, and ha status for authorized roles

    • denied direct /v1/ha/status access without operator identity

    • retained auth.access.denied and auth.break_glass.used audit proof

  • capturing live evidence for PR-22 metrics visibility, including:

    • local metrics list text and JSON catalog output

    • live metrics query output against a real /metrics endpoint

    • exact metric-family filtering for rollout counters

    • histogram-family filtering for request-duration buckets, count, and sum

  • capturing live evidence for PR-24 quorum visibility, including:

    • healthy quorum status on a live HA leader

    • degraded quorum when the sync standby is stopped

    • lost quorum when the primary PostgreSQL container is unavailable

    • restored quorum after PostgreSQL connectivity is recovered

    • retained ha.quorum.lost / ha.quorum.restored audit records from the same live run

  • capturing live evidence for PR-25 split-brain visibility, including:

    • healthy split-brain status on the live quorum helper

    • detected split-brain after durable epoch divergence is introduced

    • possible split-brain after a ghost unclosed epoch row is injected

    • deduplicated retained ha.split_brain.detected audit records after repeated identical detected-state polls

  • capturing live evidence for PR-26 split-brain manual recovery, including:

    • manual-reconcile dry-run output against a real detected stale-leader state

    • promote-leader execution against that same live stale-leader state

    • post-recovery risk: none verification from the same helper

    • retained ha.split_brain.recovered audit records for both dry-run and executed recovery

  • capturing live evidence for PR-27 config migration tooling, including:

    • deterministic dry-run output for a full v0 fixture

    • stdout migration output in both YAML and TOML

    • file and in-place writes for migrated configs

    • fail-closed rejection of unsupported schema versions

    • fail-closed rejection of malformed input and invalid migrated configs

  • capturing live evidence for PR-29-followup-d database-backed audit persistence (when AUTONOMY_AUDIT_PG_URL or POSTGRES_URL is set), including:

    • audit_events table write via PGAuditEmitter (parallel to file emitter)

    • autonomy audit query --pg-url reading from PostgreSQL (primary query path)

    • autonomy audit export --pg-url exporting in JSON and CSV from the DB

    • autonomy audit prune --older-than Nd retention enforcement

    • post-prune query confirming deleted rows are absent

    • file-backed query remaining as fallback when no --pg-url is set

  • capturing live evidence for PR-29-followup-e cert-management RBAC coverage, including:

    • all six autonomy cert subcommands denied when operator lacks cert:manage/cert:read

    • cert list and cert check-revocation denied (newly guarded; previously unguarded)

    • cert issue, cert rotate, cert revoke, cert sync-crl denied (require cert:manage)

    • cert list and cert issue allowed after granting the appropriate cert role

    • retained auth.access.denied audit records for all denied cert operations

2. Prerequisites

  • Docker available locally

  • Go 1.25.7

  • PostgreSQL client tools available in the Docker postgres:16 image

  • openssl, curl, and xxd available locally

  • ability to bind localhost TCP ports and a Unix socket

3. Run the Lab

From the repository root:

export GOROOT=/home/ubuntu/.local/go1.25.7
export PATH="$GOROOT/bin:$PATH"
export GOTOOLCHAIN=local

bash scripts/labs/run_cli_audit_lab.sh

Optional custom evidence directory:

bash scripts/labs/run_cli_audit_lab.sh \
  "$PWD/evidence/pr17-cli-audit-local-$(date +%F)"

4. What the Runner Does

The runner performs twenty-seven real verification slices:

  1. Starts autonomy-orchestrator serve on 127.0.0.1:18888 and runs rollout create, publish, and cancel against the live HTTP API, while also exposing /metrics on 127.0.0.1:19090 for the PR-22 metrics captures.
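
    Once the orchestrator is up, both listeners can be spot-checked directly; an illustrative probe (the grep pattern is one metric family named in the later VAL05 checks):

    curl -s http://127.0.0.1:18888/v1/health
    curl -s http://127.0.0.1:19090/metrics | grep '^cp_rollout_plans_total'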

  2. Generates a local CA with openssl and runs the certificate workflow against real certificate files, including:

    • autonomy cert issue

    • autonomy cert rotate

    • autonomy cert revoke

    • autonomy cert check-revocation

    • autonomy cert sync-crl --min-sources 2 against a multi-publisher set

    • a source control-plane with --tls-crl-file

    • a second source control-plane serving the same CRL

    • a follower control-plane with repeated --tls-crl-sync-url and --tls-crl-sync-min-sources 2 proving cross-host CRL pull plus publisher agreement

    • VAL 01 — certificate rotation validation (Phase 8): issues a 2-day node-c cert for expiry-window testing, proves the old cert is accepted over live mTLS before rotation, times the atomic cert rotate to confirm it completes within 300 seconds, confirms the expiry window clears after rotation, and then connects with the new cert without restarting the control-plane — proving continuity for fresh mTLS client connections across the rotation window; also captures a cert.rotated retained audit query and a composite 6-check PASS/FAIL report

    • VAL 02 — trust-chain rejection validation (Phase 9): generates a rogue CA and an expired cert inline, then proves the control-plane consistently rejects all five rejection cases against the same live control-plane: missing client cert, invalid chain (rogue CA with legitimate-looking CN), expired cert, revoked cert (reusing the CRL from Phase 1), and wrong server trust (client-side CA bundle mismatch); captures stderr evidence for each and a composite 5-check PASS/FAIL report
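
    The multi-source CRL agreement path can be sketched in isolation; an illustrative shape only (the URLs are placeholders, and the exact serve invocation may differ from the runner's):

    # Follower pulls the CRL from two publishers and requires agreement.
    autonomy-orchestrator serve \
      --tls-crl-sync-url https://crl-source-a.local/crl.pem \
      --tls-crl-sync-url https://crl-source-b.local/crl.pem \
      --tls-crl-sync-min-sources 2

    # CLI-side equivalent of the same agreement rule.
    autonomy cert sync-crl --min-sources 2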

  3. Brings up a disposable PostgreSQL primary + streaming standby, runs the HA helper from scripts/labs/orchestrator_ha_server.go, and verifies:

    • backup create

    • backup restore in maintenance mode

    • manual failover to a second helper

  4. Runs autonomy rbac role create, list, and assign against a local file-backed RBAC store and captures the emitted auth.role.assigned audit line.

  5. Demonstrates default-on RBAC enforcement (no env var required) and verifies the bootstrap, break-glass, and opt-out paths:

    • default-on bootstrap mode denial against a fresh empty RBAC store

    • bootstrap seed via rbac role assign with no enforcement env var

    • post-bootstrap outsider denial (full enforcement)

    • post-bootstrap allowed access after correct role assignment

    • opt-out backward compat via AUTONOMY_RBAC_ENFORCEMENT=0

    • break-glass bypass with AUTONOMY_RBAC_BREAK_GLASS=1 + AUTONOMY_OPERATOR

    • break-glass safety check: no bypass without AUTONOMY_OPERATOR

    Then restarts the HA helper and verifies default-on enforcement with a populated store:

    • denied direct /v1/ha/status without operator identity

    • allowed direct /v1/ha/status with X-Autonomy-Operator

    • denied autonomy audit query and autonomy audit export for an operator lacking audit_history:read (default-on, no env var)

    • allowed autonomy audit query, autonomy audit export, and autonomy ha status for authorized identities
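
    Condensed, the three enforcement modes look like this from a shell (the operator name is a placeholder; the env vars, header, and endpoint are the ones this slice exercises):

    autonomy ha status                                          # default-on: unassigned operator denied
    AUTONOMY_RBAC_ENFORCEMENT=0 autonomy ha status              # opt-out backward compat
    AUTONOMY_RBAC_BREAK_GLASS=1 AUTONOMY_OPERATOR=oncall-1 \
      autonomy ha status                                        # bypass + auth.break_glass.used audit

    curl -s http://127.0.0.1:18090/v1/ha/status                 # denied: no operator identity
    curl -s -H 'X-Autonomy-Operator: oncall-1' \
      http://127.0.0.1:18090/v1/ha/status                       # allowed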

  6. Runs the existing daemon-backed edge relay lab with --with-bandwidth so retry, purge, and bandwidth configuration audit lines come from a live edged process.
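
    Illustrative invocations of the three relay commands (the bandwidth argument shape is an assumption, not the runner's exact value):

    edgectl relay deadletter retry
    edgectl relay deadletter purge --force
    edgectl relay config set-bandwidth 1mbit    # argument shape assumed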

  7. Queries and exports the retained audit records written by those live runs from AUTONOMY_AUDIT_DIR under the evidence bundle, including:

    • full retained query

    • category-filtered auth query

    • category-filtered rollout query

    • source-filtered edge query

    • outcome-filtered success query

    • invalid --output and invalid time-range checks

    • invalid export-format preservation of an existing output file
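
    Condensed to their shape, the query and export calls look like this (the store path is a placeholder; the --source and --outcome flag names are assumptions inferred from the filter descriptions above):

    export AUTONOMY_AUDIT_DIR="$PWD/evidence/retained-audit"    # placeholder path
    autonomy audit query --category auth --output json
    autonomy audit query --source edge --output json            # flag name assumed
    autonomy audit query --outcome success                      # flag name assumed
    autonomy audit export --format csv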

  8. Generates a live support bundle against:

    • --orchestrator-url http://127.0.0.1:18089

    • the retained audit directory under the evidence bundle

    • the active HA server log

    • a redactable config file created by the runner
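
    The generation step itself is a single command; an illustrative shape (only --orchestrator-url is shown above; the audit directory, log, and config inputs are wired in by the runner, and any flags for them are not reproduced here):

    autonomy support-bundle generate --orchestrator-url http://127.0.0.1:18089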

  9. Starts a dedicated HA quorum helper on 127.0.0.1:18091 with a fast quorum monitor interval and captures:

    • healthy status while the primary and synchronous standby are available

    • degraded status after stopping the standby

    • lost status after stopping the primary container

    • restored status after the database containers are started again

    • retained ha.quorum.lost and ha.quorum.restored audit records
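
    The transition sequence can be replayed by hand against the same helper; a sketch (container names are placeholders for the lab's disposable pair):

    curl -s http://127.0.0.1:18091/v1/ha/quorum    # healthy
    docker stop pg-standby                         # placeholder name
    curl -s http://127.0.0.1:18091/v1/ha/quorum    # degraded
    docker stop pg-primary                         # placeholder name
    curl -s http://127.0.0.1:18091/v1/ha/quorum    # lost
    docker start pg-primary pg-standby
    curl -s http://127.0.0.1:18091/v1/ha/quorum    # restored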

  10. Reuses the live quorum helper on 127.0.0.1:18091 and captures:

    • healthy split-brain status with matching local and durable epochs

    • detected split-brain after advancing leadership_state.current_epoch away from the helper’s cached epoch

    • manual-reconcile dry-run output for the detected stale-leader case

    • promote-leader execution output for the same detected stale-leader case

    • post-recovery risk: none verification after the stale elector clears its local leader claim

    • possible split-brain after inserting a second unclosed leader_epochs row

    • deduplicated retained ha.split_brain.detected audit records after repeated identical detected-state polls

    • retained ha.split_brain.recovered audit records from the same live run
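
    The detect-and-recover sequence, stripped to its essence (the commented UPDATE stands in for the divergence injection described above; the strategies are the ones this slice runs):

    curl -s http://127.0.0.1:18091/v1/ha/split-brain              # risk: none
    # inject divergence: UPDATE leadership_state SET current_epoch = current_epoch + 1;
    autonomy ha split-brain detect                                # risk: detected
    autonomy ha split-brain recover --strategy manual-reconcile   # dry-run plan only
    autonomy ha split-brain recover --strategy promote-leader     # executes recovery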

  11. Runs autonomy config migrate against checked-in v0 fixtures and captures:

    • dry-run output without writing files

    • migrated YAML and TOML stdout output

    • migrated YAML and TOML file output

    • in-place migration with before/after mode and checksum captures

    • unsupported-version rejection

    • malformed-input rejection

    • invalid-migrated-config rejection
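
    Illustrative invocations only; the flag names here are assumptions (the checked-in fixtures and the runner encode the real shapes):

    autonomy config migrate --dry-run config-v0.yaml          # flags assumed
    autonomy config migrate --format toml config-v0.yaml     # flags assumed
    autonomy config migrate --in-place config-v0.yaml         # flags assumed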

  12. Runs the database-backed audit persistence slice when AUTONOMY_AUDIT_PG_URL or POSTGRES_URL is set (skipped cleanly when absent), capturing:

    • DB-backed audit query text output (primary query path via audit_events)

    • DB-backed category-filtered query (cert category, JSON format)

    • DB-backed JSON export and CSV export from audit_events

    • audit prune --older-than 90d with no-op output (no qualifying rows)

    • audit prune --older-than 1d deleting the seeded 1h and 2h rows

    • post-prune query confirming only the most recent row remains
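
    The DB-backed path reuses the same CLI with --pg-url; a sketch (the connection string is a placeholder, and reuse of --pg-url by prune is an assumption):

    PG='postgres://audit:audit@127.0.0.1:5432/audit'    # placeholder
    autonomy audit query  --pg-url "$PG" --category cert --output json
    autonomy audit export --pg-url "$PG" --format csv
    autonomy audit prune  --pg-url "$PG" --older-than 90d    # --pg-url reuse assumed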

  13. Runs the cert RBAC enforcement slice (run_cert_rbac_lab), capturing:

    • all six autonomy cert subcommands denied for an operator without cert:manage or cert:read (confirmed by the presence of “cert:manage” in the error output)

    • cert issue allowed for cert-admin after granting a custom cert-operator role holding cert:manage

    • cert list and cert check-revocation allowed for cert-reader after granting a custom cert-reader role holding cert:read

    • autonomy audit query --category auth capturing retained auth.access.denied records from the same denied-access runs

    • file-backed fallback query captured alongside DB query for comparison

  14. Runs the RBAC permission enforcement validation slice (run_rbac_val03_lab), proving the three VAL03 claims across a 14-check matrix:

    • 5 DENY checks: unassigned and under-privileged identities are blocked with RBAC error messages before any network call is made

    • 5 ALLOW checks: identities whose role includes the required permission proceed past the guard and reach the control-plane successfully

    • 3 NOT_GUARDED checks: commands without an RBAC guard (rbac role list, rollout plan list, support-bundle generate) succeed or fail for non-RBAC reasons regardless of the operator’s assignment

    • 1 PRESENT check: the retained audit store contains auth.access.denied records from the denial runs

    Uses the HA server still running on 127.0.0.1:18090 (started by slice 5) for the ALLOW-path ha status probes; if the helper is unavailable, those three checks are recorded as SKIP instead of being counted as PASS. Produces both a human-readable val03-report.txt and a machine-readable val03-report.json.

  15. Runs the audit completeness validation slice (run_audit_completeness_val04_lab), proving the four VAL04 claims across a 10-check matrix:

    • VAL04-01: retained store is non-empty after all prior phases have run

    • VAL04-02: all 6 audit categories (rollout, ha, cert, relay, auth, rollback) return at least one record from the retained store

    • VAL04-03..08: every record returned in each category query contains all 6 mandatory schema fields (event_name, category, action, outcome, source, timestamp)

    • VAL04-09: a full retained-store query with --limit 0 --output json succeeds and completes within a 2000 ms latency bound (actual typical elapsed < 100 ms)

    • VAL04-10: all 25 defined wired event types are present in the retained store, proving complete wired-surface coverage for the current lab contract

    Produces both a human-readable val04-report.txt and a machine-readable val04-report.json.

  16. Runs the OTel integration validation slice (run_otel_val05_lab), proving the four VAL05 claims across a 9-check matrix:

    • VAL05-01..03 (Prometheus): the control-plane /metrics endpoint returns HTTP 200, all expected metric families are present, and cp_http_requests_total, cp_http_request_duration_seconds_count, cp_rollout_plans_total, plus cp_events_ingested_total show non-zero observations after lab traffic and an explicit POST /v1/events ingest

    • VAL05-04..06 (WAL pipeline): the telemetry WAL is non-empty after telemetry_emit_helper runs, telemetry export produces non-empty JSONL, and the JSONL contains all mandatory event fields

    • VAL05-07 (OTLP delivery): telemetry flush to a local autonomy telemetry sink exits 0 and the sink prints at least one “received N log records” payload line

    • VAL05-08..09 (correlation IDs): the known trace_id and span_id from the helper events appear in the JSONL export, and traceId / spanId appear in the OTLP sink output

    Uses an isolated temp WAL directory (does not touch the runtime WAL). Produces both a human-readable val05-report.txt and a machine-readable val05-report.json.

  17. Runs the support-bundle validation slice (run_support_bundle_val06_lab), proving the four VAL06 claims across a 10-check matrix:

    • VAL06-01: support-bundle generate exits 0 and writes a non-empty .tar.gz archive

    • VAL06-02: generation completes within 30 seconds

    • VAL06-03: the three always-present core files (manifest.json, system_info.json, build_info.json) appear in the archive listing

    • VAL06-04: manifest.json records all 6 collector names (system_info, build_info, config, ha_status, audit_recent, logs) regardless of their individual status

    • VAL06-05: system_info.json contains all 5 required fields (os, arch, go_version, hostname, collected_at)

    • VAL06-06: audit_recent.json is a non-empty JSON array with ≥ 1 record from the retained audit store

    • VAL06-07: config_redacted.yaml contains fleet_salt: <REDACTED> and the original salt value is absent

    • VAL06-08: config_redacted.yaml contains REDACTED in the postgres URL and the original password (val06-secret-pass) is absent

    • VAL06-09: no PEM block (-----BEGIN) appears in any archive entry

    • VAL06-10: degraded mode — generating with --orchestrator-url pointing to a non-existent server exits 0; manifest.json records ha_status: "failed" (not a fatal error)

    Creates two bundles: a normal bundle for checks VAL06-01..09 and a degraded bundle for VAL06-10. Produces both a human-readable val06-report.txt and a machine-readable val06-report.json.

  18. Runs the rollout latency baseline slice (run_rollout_latency_val07_lab), proving the four VAL07 claims across a 9-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18992 with an isolated SQLite data directory that is removed and recreated on each run (no prior state) and measures latency using curl -w '%{time_total}' with Python percentile computation:

    • VAL07-01: dedicated control plane starts and GET /v1/health returns 200

    • VAL07-02..04 (plan-create latency): 20 sequential POST /v1/rollouts requests; p50 ≤ 100 ms, p95 ≤ 300 ms, p99 ≤ 500 ms — the 500 ms p99 bound is the primary workplan latency target for rollout plan creation

    • VAL07-05 (plan-list latency): 20 sequential GET /v1/rollouts requests with 20 plans in store; p99 ≤ 500 ms

    • VAL07-06..07 (concurrent): 5 parallel plan creates all return 2xx and total wall-clock time is ≤ 2000 ms, proving operator-facing responsiveness is not blocked by single-writer SQLite serialisation

    • VAL07-08: zero non-2xx responses across all 45 benchmark requests

    • VAL07-09: cp_http_requests_total Prometheus counter is non-zero after the benchmark run

    Produces both a human-readable val07-report.txt and a machine-readable val07-report.json (includes plan_create_ms and plan_list_ms latency objects).
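
    The timing harness is ordinary curl plus a percentile helper; a minimal sketch of the same idea (the request body is a placeholder, and p99 is approximated by the max at n=20):

    for i in $(seq 1 20); do
      curl -s -o /dev/null -w '%{http_code} %{time_total}\n' \
        -X POST http://127.0.0.1:18992/v1/rollouts -d '{"plan_id":"p-'"$i"'"}'
    done > create-raw.txt
    awk '{print $2*1000}' create-raw.txt | sort -n | python3 -c \
      'import sys; v=[float(x) for x in sys.stdin]; print("p50_ms=%.1f p95_ms=%.1f p99_ms=%.1f" % (v[9], v[18], v[19]))'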

  19. Runs the rollout throughput slice (run_rollout_throughput_val08_lab), proving the four VAL08 claims across a 10-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18993 with an isolated SQLite data directory that is removed and recreated on each run, then runs four concurrency tiers (N=1, 10, 50, 100 workers) each creating 5 plans sequentially using background bash subshells:

    • VAL08-01: dedicated control plane starts and GET /v1/health returns 200

    • VAL08-02..05 (concurrency tiers): at N=1, 10, 50, and 100 concurrent workers all plans are accepted with zero errors; VAL08-05 (N=100, 500 total plans) is the primary workplan target (≥100 concurrent device rollouts)

    • VAL08-06 (wall clock): the N=100 scenario completes within 30 s

    • VAL08-07 (throughput scaling): throughput at N=100 is ≥ throughput at N=1, confirming concurrency does not regress the write path

    • VAL08-08: aggregate error count across all four scenarios is zero

    • VAL08-09: GET /v1/rollouts after all scenarios returns ≥ (grand_total − errors) plans, confirming durable storage

    • VAL08-10: cp_http_requests_total Prometheus counter is non-zero

    Produces both a human-readable val08-report.txt and a machine-readable val08-report.json (includes throughput object with plans/sec at each tier).
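
    The tier mechanics reduce to background bash subshells; the core pattern (request body placeholder as above):

    N=100; PER=5
    for w in $(seq 1 "$N"); do
      (
        for p in $(seq 1 "$PER"); do
          curl -s -o /dev/null -X POST http://127.0.0.1:18993/v1/rollouts \
            -d '{"plan_id":"w'"$w"'-p'"$p"'"}'
        done
      ) &
    done
    wait    # throughput = (N * PER) / wall-clock seconds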

  20. Runs the stuck rollout detection slice (run_stuck_detection_val09_lab), proving the four VAL09 claims across a 10-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18994 with an isolated SQLite data directory that is removed and recreated on each run. Staleness injection uses threshold_seconds=3 plus a 4-second sleep — no SQLite manipulation required:

    • VAL09-01: dedicated control plane starts and GET /v1/health returns 200

    • VAL09-02 (empty baseline): scan on empty store returns stuck_count=0

    • VAL09-03 (fresh plans): 5 plans created; immediate scan returns stuck_count=0 (plans younger than threshold)

    • VAL09-04 (stale detection): after sleeping 4 s, rescan returns stuck_count=5 — all 5 plans detected as stuck

    • VAL09-05 (diagnosis): all 5 stuck plans carry the exact expected diagnosis string ("zero activations nodes may not be receiving the plan or artifact distribution is incomplete")

    • VAL09-06 (paused excluded): pausing plan-b removes it from the stuck scan (paused is not an active phase)

    • VAL09-07 (terminal excluded): cancelling plan-c removes it from the stuck scan (terminal phase)

    • VAL09-08 (retry recovery): POST recover strategy=retry on plan-d returns new_phase=active; updated_at is refreshed

    • VAL09-09 (rollback recovery): POST recover strategy=rollback on plan-e returns new_phase=rolled_back

    • VAL09-10 (post-recovery clean): final scan confirms plan-a still stuck, plan-b/c/d/e absent (paused/terminal/retry-refreshed)

    Produces both a human-readable val09-report.txt and a machine-readable val09-report.json.
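
    The staleness injection needs no clock tricks; in outline (the scan path shown is an assumption; the real route is whatever the control-plane exposes for the stuck scan):

    sleep 4    # plans created above are now older than threshold_seconds=3
    curl -s 'http://127.0.0.1:18994/v1/rollouts/stuck?threshold_seconds=3'    # path assumed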

  21. Runs the rollback reliability slice (run_rollback_reliability_val10_lab), proving the four VAL10 claims across a 10-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18995. Runs rollback preview for all four target kinds (read-only, no CP needed) then exercises the CLI execute path with 5 + 5 = 10 plan creates and dispatches:

    • VAL10-01: all 4 preview commands exit 0

    • VAL10-02: rollout_plan JSON preview has safety_class=terminal, orchestrated=true, valid strategies retry and rollback

    • VAL10-03: relay_deadletter JSON preview has orchestrated=false and edgectl instructions in manual_path

    • VAL10-04: 5 × rollback execute strategy=retry on real plans — all exit 0 (success_rate=1.000)

    • VAL10-05: 5 × rollback execute strategy=rollback on real plans — all exit 0 (success_rate=1.000)

    • VAL10-06: --output json execute has outcome, new_state, kind fields

    • VAL10-07: execute on nonexistent plan exits non-zero

    • VAL10-08: execute on relay_deadletter (not orchestrated) exits non-zero and prints edgectl instructions

    • VAL10-09: audit query --event-type rollback.preview.requested returns ≥ 4 actor-scoped events from this slice of the retained store

    • VAL10-10: aggregate success rate across 10 executes is ≥ 0.990 (workplan target: ≥99%); at least 10 actor-scoped rollback.executed success events from this slice are retained

    Produces both a human-readable val10-report.txt and a machine-readable val10-report.json (includes success_rate object with per-strategy and aggregate rates).

  22. Runs the chaos validation slice (run_chaos_val11_lab), proving the four VAL11 claims across a 10-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18996 with an isolated SQLite data directory that is reused across kill/restart cycles within the function (durability requires the same data dir). Chaos injection is via SIGTERM — no root/iptables required:

    • VAL11-01: dedicated control plane starts and GET /v1/health returns 200

    • VAL11-02 (CP kill → client error): after SIGTERM, autonomy rollback execute exits non-zero with a connection-refused / dial error, confirming the CLI does not silently swallow CP unavailability

    • VAL11-03 (data durability): after CP restart against the same SQLite data dir, GET /v1/rollouts returns ≥ 10 plans, confirming the full pre-kill corpus survived

    • VAL11-04 (rapid restart resilience): 3 additional kill+restart cycles; final plan count ≥ pre-rapid count; new plan create returns 201 — confirms repeated restarts do not corrupt the store or break the write path

    • VAL11-05 (gate-wait survival): plan val11-gate-1 (created in published phase before the kill) retains phase=published after restart, confirming the gate-wait state is not lost across the CP kill boundary

    • VAL11-06 (device-unresponsive proxy): 3 proxy plans created, sleep 3s (threshold=2s); stuck scan returns stuck_count ≥ 3 with non-empty diagnosis strings, proving the operator can detect unresponsive devices

    • VAL11-07 (artifact corruption proxy): plan with invalid artifact_ref and target_lock_fingerprint is accepted with HTTP 201 and is queryable by the operator via GET /v1/rollouts/val11-corrupt-1

    • VAL11-08 (corrupt plan rollback): rollback execute strategy=rollback on the corrupt-artifact plan exits 0, proving the operator can roll back a plan with suspicious artifact metadata

    • VAL11-09 (bulk cascade recovery): 3 × rollback execute strategy=retry on the device-unresponsive proxy plans — all exit 0 (cascade_ok=3)

    • VAL11-10 (audit integrity): audit query --event-type rollback.executed scoped to the chaos actor and slice start-time returns ≥ 1 success event, confirming audit capture is not disrupted by CP kill/restart cycles

    Produces both a human-readable val11-report.txt and a machine-readable val11-report.json.

  23. Runs the HA failover validation slice (run_ha_failover_val13_lab), exercising the HA leader-election subsystem under unplanned failure conditions. Two orchestrator_ha_server nodes (ports 18997/18998) share a dedicated Docker PostgreSQL instance. The 10-check matrix covers:

    • VAL13-01 (baseline): node-1 acquires leadership first; node-2 is a follower

    • VAL13-02 (pre-kill write): SQL probe row inserted while node-1 is leader

    • VAL13-03 (SIGTERM failover): failover time measured from kill to node-2 “acquired leadership” log entry; threshold ≤ 5000 ms

    • VAL13-04 (zero data loss): probe row note=pre-kill readable from the shared PG instance after failover, proving shared-database durability

    • VAL13-05 (post-failover leader active): node-2 ha status confirms it holds leadership after node-1 exits

    • VAL13-06 (PG crash detected): docker stop on PG primary → node-2 /v1/ha/quorum transitions to quorum_health=lost

    • VAL13-07 (PG crash recovery): manual docker start → quorum_health=healthy restored; probe row still intact

    • VAL13-08 (rapid kill cycles): 3 × alternating SIGTERM failover cycles, each measured ≤ 5000 ms, followed by leader-status and probe-row checks

    • VAL13-09 (SIGKILL disk-fault proxy): SIGKILL on leader (no graceful Resign, simulates OOM/disk crash) → node-2 takes over ≤ 5000 ms

    • VAL13-10 (post-chaos stability): single write-ready leader confirmed after all failure scenarios

    Produces val13-report.txt and val13-report.json.
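
    The failover clock is read off the follower's log; the essence of the measurement (the PID variable and log path are placeholders; the log phrase is the one named in VAL13-03):

    t0=$(python3 -c 'import time; print(int(time.time()*1000))')
    kill -TERM "$NODE1_PID"    # placeholder
    until grep -q 'acquired leadership' node-2.log; do sleep 0.05; done    # placeholder log path
    t1=$(python3 -c 'import time; print(int(time.time()*1000))')
    echo "failover_ms=$((t1 - t0))"    # bound: ≤ 5000 ms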

  24. Runs the HA replication lag baseline slice (run_ha_replication_lag_val14_lab), benchmarking PostgreSQL streaming replication lag under the autonomyops HA architecture. A val14-pg-primary + val14-pg-standby Docker pair is provisioned via pg_basebackup; a single HA server at port 18999 uses --min-sync-replicas 1 so quorum health tracks standby availability. The 10-check matrix covers:

    • VAL14-01 (replication streaming): pg_stat_replication.state = streaming confirmed after standby provisioning

    • VAL14-02 (idle LSN gap zero): (pg_current_wal_lsn() - write_lsn)::bigint = 0 at rest — no unreplicated WAL

    • VAL14-03 (light load drain): 100 rows × 500 bytes inserted with synchronous_commit=off; lag is sampled during the active drain window and WAL LSN gap drains to 0 within 2000 ms

    • VAL14-04 (heavy load drain): 500 rows × 2000 bytes (~1 MB WAL) with synchronous_commit=off; LSN gap drains within 5000 ms

    • VAL14-05 (post-drain gap closed): LSN gap confirmed 0 after the heavy drain sequence completes

    • VAL14-06 (HA replication endpoint): /v1/health/replication responds 200 with a valid JSON body

    • VAL14-07 (standby stop degraded): docker stop val14-pg-standby → quorum_health=degraded within 30 s

    • VAL14-08 (standby start healthy): docker start val14-pg-standby → quorum_health=healthy restored

    • VAL14-09 (catch-up after restart): after generating WAL while the standby is offline, the backlog drains to 0 within 10 s of standby restart, proving the replication slot catches up correctly

    • VAL14-10 (threshold report): practical alerting thresholds derived from observed p95 lag using the formula healthy = max(p95×3+1, 10), degraded = max(healthy×10, 100), alert = max(healthy×50, 500) — all three thresholds must be positive

    Produces val14-report.txt and val14-report.json.
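
    The lag probe is the single SQL expression from VAL14-02; a sketch against the primary (the psql user is an assumption):

    docker exec val14-pg-primary psql -U postgres -t -c \
      "SELECT (pg_current_wal_lsn() - write_lsn)::bigint FROM pg_stat_replication;"
    # 0 means no unreplicated WAL; the drain checks poll this until it hits 0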

  25. Runs the backup/restore validation slice (run_backup_restore_val15_lab), proving the ha backup create/list/restore workflow end-to-end with integrity verification, timing bounds, and error-path safety checks. A dedicated Docker PostgreSQL instance (val15-pg-primary) is provisioned with two fixture tables (val15_small: 100 rows x 200 bytes; val15_medium: 1,000 rows x 1,000 bytes). The HA server at port 19001 is restarted between phases to transition between normal mode (backup) and maintenance mode (restore). The 10-check matrix covers:

    • VAL15-01 (backup created): ha backup create exits 0; .dump file exists with size > 0

    • VAL15-02 (backup file valid): pg_restore -l on the .dump file exits 0 and the TOC contains TABLE DATA entries

    • VAL15-03 (backup metadata correct): ha backup list --output json shows backup_id with status=completed

    • VAL15-04 (checksum verified): SHA-256 of the .dump file computed independently matches the checksum=<hex> field in the CLI output

    • VAL15-05 (backup timing bound): ha backup create wall time ≤ 30,000 ms

    • VAL15-06 (restore correct): after mutating the tables post-backup (UPDATE first 50 rows; DELETE half the medium table), ha backup restore --confirm reverts the data; post-restore counts are small=100, medium=1000 with correct spot-check payload values

    • VAL15-07 (restore timing bound): ha backup restore wall time ≤ 60,000 ms

    • VAL15-08 (multi-backup inventory): a second ha backup create followed by ha backup list returns count 2 with both backup IDs present

    • VAL15-09 (restore requires confirm): ha backup restore without --confirm exits non-zero and mentions --confirm in the error output (safety gate enforcement)

    • VAL15-10 (audit events captured): audit query --event-type ha.backup.created and --event-type ha.backup.restored each return ≥ 1 event from the shared audit store

    Produces val15-report.txt and val15-report.json.
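
    The checksum cross-check is deliberately CLI-independent; in outline (the dump path is a placeholder; checksum=<hex> is the CLI field named in VAL15-04):

    autonomy ha backup create | tee backup-create.txt
    want=$(grep -o 'checksum=[0-9a-f]*' backup-create.txt | cut -d= -f2)
    got=$(sha256sum backups/latest.dump | awk '{print $1}')    # placeholder path
    [ "$want" = "$got" ] && echo "VAL15-04 PASS"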

  26. Runs the split-brain chaos validation slice (run_split_brain_chaos_val16_lab), proving the split-brain detection and recovery subsystem under SQL-injected fault conditions. A dedicated Docker PostgreSQL instance (val16-pg-primary) is provisioned with a single HA server at port 19002 (--min-sync-replicas 0, --campaign 500ms). Injection is performed directly via psql against leadership_state and leader_epochs — no iptables or second HA node required. The 10-check matrix covers:

    • VAL16-01 (baseline risk none): /v1/ha/split-brain returns risk=none before any injection

    • VAL16-02 (epoch inject detected): after UPDATE leadership_state SET current_epoch = current_epoch + 99, holder_id = 'val16-injected-node', the API returns risk=detected

    • VAL16-03 (detection idempotent): a second API call returns the same risk=detected without side effects

    • VAL16-04 (dry-run reconcile ok): ha split-brain recover --strategy manual-reconcile exits 0 and a follow-up /v1/ha/split-brain read still reports risk=detected (planning path, no DB writes)

    • VAL16-05 (promote leader recovers): ha split-brain recover --strategy promote-leader exits 0 and the API subsequently returns risk=none

    • VAL16-06 (data integrity after recovery): a user-table probe row (val16_probe.note = 'pre-inject') is untouched after promote-leader recovery, confirming recovery is scoped to metadata tables only

    • VAL16-07 (ghost node possible): inserting two leader_epochs rows with resigned_at IS NULL and foreign holder_id values raises risk=possible

    • VAL16-08 (ghost node self-clears): stamping resigned_at on the injected rows causes the API to return risk=none without any HA server restart

    • VAL16-09 (audit events captured): audit query --event-type ha.split_brain.detected --start-time <slice_start> and --event-type ha.split_brain.recovered --actor val16-operator --start-time <slice_start> each return ≥ 1 event from this slice in the shared audit store

    • VAL16-10 (post chaos stability): /v1/ha/status confirms holder_id contains cp-val16-node after all chaos scenarios complete

    Produces val16-report.txt and val16-report.json.
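
    The detected-state injection is a single UPDATE against the lab database; trimmed to its essence (the psql user is an assumption; the statement and port are those given above):

    docker exec val16-pg-primary psql -U postgres -c \
      "UPDATE leadership_state SET current_epoch = current_epoch + 99, holder_id = 'val16-injected-node';"
    curl -s http://127.0.0.1:19002/v1/ha/split-brain    # now reports risk=detected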

  27. Runs the quorum loss validation slice (run_quorum_loss_val17_lab), proving quorum loss detection timing, write-blocking safety behavior, and recovery detection against the workplan’s ≤ 60 s target (Gap HA-004). A dedicated Docker PostgreSQL instance (val17-pg-primary) is provisioned with a single HA server at port 19003 (--min-sync-replicas 0, --quorum-monitor-interval 500ms). Quorum loss is induced by docker stop val17-pg-primary; recovery by docker start. Timing is measured in milliseconds using python3 time.time()*1000 wrapped around each docker stop/start call. The 10-check matrix covers:

    • VAL17-01 (baseline quorum healthy): /v1/ha/quorum returns quorum_health=healthy and write_block_active=false before any fault

    • VAL17-02 (loss detected): quorum_health=lost after docker stop

    • VAL17-03 (loss detection timing bound): loss_ms ≤ 30,000 (30 s threshold for workplan ≤ 60 s target with --quorum-monitor-interval 500ms)

    • VAL17-04 (write blocked during loss): write_block_active=true and can_accept_protected_writes=false in the quorum status JSON during loss

    • VAL17-05 (loss reason populated): quorum_loss_reason non-empty during loss

    • VAL17-06 (recovery detected): quorum_health=healthy after docker start

    • VAL17-07 (recovery timing bound): recovery_ms ≤ 30,000

    • VAL17-08 (write unblocked after recovery): write_block_active=false; can_accept_protected_writes=true; last_lost_at and last_restored_at timestamps populated; detected_loss_count ≥ 1

    • VAL17-09 (second cycle count increments): after a second loss/recovery completes, detected_loss_count ≥ 2

    • VAL17-10 (audit events captured): audit query --event-type ha.quorum.lost --start-time <val17_start_time> and --event-type ha.quorum.restored --start-time <val17_start_time> each return ≥ 1 event

    Produces val17-report.txt and val17-report.json.
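
    The millisecond wrapper named above is just python3 around the docker call plus a poll; in outline (the JSON key shape in the grep is an assumption):

    t0=$(python3 -c 'import time; print(int(time.time()*1000))')
    docker stop val17-pg-primary
    until curl -s http://127.0.0.1:19003/v1/ha/quorum | grep -q '"quorum_health"[: ]*"lost"'; do
      sleep 0.5
    done
    t1=$(python3 -c 'import time; print(int(time.time()*1000))')
    echo "loss_ms=$((t1 - t0))"    # bound: ≤ 30,000 ms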

5. Generated Artifacts

The evidence bundle is split by surface:

  • autonomy/ for rollout and cert captures

  • metrics/ for the live PR-22 metrics list/query captures

  • ha/ for live HA backup and failover captures

  • quorum/ for live HA quorum status and transition captures

  • split-brain/ for live split-brain detection captures

  • rbac/ for local RBAC store and assignment captures

  • relay/ for the live edge relay captures

  • retained/ for the file-backed audit store plus query/export captures

  • support-bundle/ for the generated archive plus extracted proof files

  • config-migrate/ for the live PR-27 config migration captures

  • cert_rbac/ for the PR-29-followup-e cert RBAC enforcement captures

  • val03/ for the VAL03 RBAC permission enforcement validation captures

  • val04/ for the VAL04 audit completeness validation captures

  • val05/ for the VAL05 OTel integration validation captures

  • val06/ for the VAL06 support-bundle validation captures

  • val07/ for the VAL07 rollout latency baseline captures

  • val08/ for the VAL08 rollout throughput validation captures

  • val09/ for the VAL09 stuck rollout detection validation captures

  • val10/ for the VAL10 rollback reliability validation captures

  • val11/ for the VAL11 chaos validation captures

Representative files:

  • autonomy/rollout-plan-create.txt

  • autonomy/audit-rollout-plan-create.log

  • autonomy/cert-rotate.txt

  • autonomy/audit-cert-rotate.log

  • autonomy/cert-rotation-list-expiring.txt — VAL01-1: 2-day cert appears in expiry window

  • autonomy/cert-rotation-prerotate-health.json — VAL01-2: old cert accepted over live mTLS

  • autonomy/cert-rotation-before-dates.txt — openssl serial/dates before rotation (baseline)

  • autonomy/cert-rotation-timing.txt — VAL01-3: elapsed seconds + pass=true vs 300s bound

  • autonomy/cert-rotation-rotate.txt — cert rotate stdout

  • autonomy/cert-rotation-audit-rotate.log — slog cert.rotated event from rotation

  • autonomy/cert-rotation-after-dates.txt — openssl serial/dates after rotation

  • autonomy/cert-rotation-list-after.txt — VAL01-4: no certificates matched in expiry window

  • autonomy/cert-rotation-postrotate-health.json — VAL01-5: rotated client cert accepted without restart

  • autonomy/cert-rotation-audit-events.json — VAL01-6: retained cert.rotated query

  • autonomy/cert-rotation-val01-report.txt — composite 6/6 PASS report + serial assertion

  • autonomy/cert-rejection-missing-client-cert.stderr — VAL02-1: TLS cert required error

  • autonomy/cert-rejection-invalid-chain.stderr — VAL02-2: chain verification failure (rogue CA)

  • autonomy/cert-rejection-expired-cert.stderr — VAL02-3: cert expired rejection

  • autonomy/cert-rejection-revoked.stderr — VAL02-4: CRL rejection (node-a, same gate as Phase 5)

  • autonomy/cert-rejection-wrong-server-trust.stderr — VAL02-5: client-side verify failure

  • autonomy/cert-rejection-val02-report.txt — composite 5/5 PASS report + right_ca_wrong_cn note

  • metrics/orchestrator-metrics-raw.txt

  • metrics/metrics-list.txt

  • metrics/metrics-list.json

  • metrics/metrics-query-all.txt

  • metrics/metrics-query-rollout-plans.json

  • metrics/metrics-query-http-duration.txt

  • rbac/rbac-role-create.txt

  • rbac/rbac-role-list-before-assign.txt

  • rbac/rbac-role-assign.txt

  • rbac/audit-rbac-role-assign.log

  • rbac/rbac-role-assign-repeat.txt

  • rbac/rbac-role-list-after-assign.txt

  • rbac/rbac-role-list.json

  • rbac/assignments.json

  • rbac/rbac-default-on-denied.stderr — bootstrap mode denial with no env var

  • rbac/rbac-bootstrap-seed.txt — bootstrap role assign output

  • rbac/audit-rbac-bootstrap-seed.log — auth.bootstrap.access audit event

  • rbac/rbac-post-bootstrap-insufficient.stderr — outsider denied after bootstrap

  • rbac/rbac-post-bootstrap-allowed.json — allowed after bootstrap + role grant

  • rbac/rbac-enforcement-disabled.json — opt-out AUTONOMY_RBAC_ENFORCEMENT=0

  • rbac/rbac-break-glass-allowed.json — break-glass bypassed action

  • rbac/audit-rbac-break-glass.log — auth.break_glass.used audit event

  • rbac/audit-break-glass-events.json — retained break-glass events

  • rbac/rbac-break-glass-no-operator.stderr — break-glass safety: denied without operator

  • rbac/rbac-role-assign-operator.txt

  • rbac/rbac-ha-status-denied.stderr — default-on denial (no env var)

  • rbac/rbac-ha-status-allowed.txt

  • rbac/rbac-audit-query-denied.stderr — default-on denial (no env var)

  • rbac/rbac-audit-query-allowed.json

  • rbac/rbac-audit-export-denied.stderr — default-on denial (no env var)

  • rbac/rbac-audit-export-allowed.json

  • rbac/retained-auth-access-denied.json

  • ha/ha-status-no-header.headers

  • ha/ha-status-no-header.json

  • ha/ha-status-with-header.headers

  • ha/ha-status-with-header.json

  • ha/ha-backup-create.txt

  • ha/audit-ha-backup-create.log

  • ha/ha-backup-restore.txt

  • ha/audit-ha-backup-restore.log

  • ha/ha-failover-trigger.txt

  • ha/audit-ha-failover-trigger.log

  • quorum/ha-quorum-healthy.txt

  • quorum/ha-quorum-degraded.txt

  • quorum/ha-quorum-lost.json

  • quorum/ha-quorum-restored.json

  • quorum/audit-ha-quorum-lost.json

  • quorum/audit-ha-quorum-restored.json

  • split-brain/ha-split-brain-healthy.txt

  • split-brain/ha-split-brain-detected.json

  • split-brain/ha-split-brain-detected-repeat.json

  • split-brain/ha-split-brain-recover-dry-run.txt

  • split-brain/ha-split-brain-recover-execute.txt

  • split-brain/ha-split-brain-recovered.txt

  • split-brain/ha-split-brain-recovered.json

  • split-brain/ha-status-after-recovery.json

  • split-brain/ha-split-brain-possible.txt

  • split-brain/audit-ha-split-brain-detected-after-dedupe.json

  • split-brain/audit-ha-split-brain-detected-all.json

  • split-brain/audit-ha-split-brain-recovered-after-execute.json

  • split-brain/audit-ha-split-brain-recovered-all.json

  • split-brain/ha-split-brain-summary.json

  • relay/relay-deadletter-retry.txt

  • relay/audit-relay-deadletter-retry.log

  • relay/relay-bandwidth-set.txt

  • relay/audit-relay-bandwidth-set.log

  • retained/audit-query-all.txt

  • retained/audit-query-category-auth.json

  • retained/audit-query-category-rollout.txt

  • retained/audit-query-ha-backup-created.json

  • retained/audit-query-source-edge.json

  • retained/audit-query-outcome-success.txt

  • retained/audit-query-invalid-output.stderr

  • retained/audit-query-invalid-range.stderr

  • retained/audit-export-all.json

  • retained/audit-export-all.csv

  • retained/audit-export-invalid-format.stderr

  • retained/audit-export-invalid-format-target-before.sha256

  • retained/audit-export-invalid-format-target-after.sha256

  • retained/audit-export-invalid-format-target.txt

  • retained/retained-file-list.txt

  • support-bundle/autonomy-support-bundle-live.tar.gz

  • support-bundle/support-bundle-generate.log

  • support-bundle/support-bundle-contents.txt

  • support-bundle/manifest.json

  • support-bundle/config_redacted.yaml

  • support-bundle/ha_status.json

  • support-bundle/audit_recent.json

  • support-bundle/logs-autonomy.log

  • support-bundle/support-bundle-summary.txt

  • config-migrate/config-migrate-dry-run.txt

  • config-migrate/config-migrate-stdout.yaml

  • config-migrate/config-migrate-stdout.toml

  • config-migrate/config-migrated.yaml

  • config-migrate/config-migrated.toml

  • config-migrate/config-migrated-in-place.yaml

  • config-migrate/config-migrate-in-place-before.stat

  • config-migrate/config-migrate-in-place-after.stat

  • config-migrate/config-migrate-unsupported.stderr

  • config-migrate/config-migrate-invalid-input.stderr

  • config-migrate/config-migrate-invalid-v0.stderr

  • db_audit/schema-dry-run.txt (DB-backed; skipped when no PG URL)

  • db_audit/query-file-all.txt — file-backed baseline query (fallback path)

  • db_audit/query-db-all.txt — DB-backed query text (primary path)

  • db_audit/query-db-cert.json — DB-backed query filtered to cert category

  • db_audit/export-db-all.json — DB-backed JSON export

  • db_audit/export-db-all.csv — DB-backed CSV export

  • db_audit/prune-90d.txt — prune with 90d cutoff (no-op output)

  • db_audit/prune-1d.txt — prune with 1d cutoff (deletes seeded old rows)

  • db_audit/query-db-after-prune.json — post-prune query confirming remaining rows

  • cert_rbac/denied-issue.txt — cert issue denied (cert:manage in error)

  • cert_rbac/denied-rotate.txt — cert rotate denied

  • cert_rbac/denied-revoke.txt — cert revoke denied

  • cert_rbac/denied-list.txt — cert list denied (newly guarded; cert:manage in error)

  • cert_rbac/denied-check-revocation.txt — cert check-revocation denied (newly guarded)

  • cert_rbac/denied-sync-crl.txt — cert sync-crl denied

  • cert_rbac/allowed-issue.txt — successful cert issue under cert:manage

  • cert_rbac/allowed-list.txt — successful cert list under cert:read

  • cert_rbac/allowed-check-revocation.txt — successful read-only non-revoked check under cert:read

  • cert_rbac/audit-denied-events.json — retained auth.access.denied records for cert operations

  • val03/setup-seed-auditor.txt — VAL03 bootstrap auditor assignment

  • val03/setup-assign-operator.txt — VAL03 operator assignment

  • val03/setup-assign-analyst.txt — VAL03 analyst assignment

  • val03/setup-server-assign-auditor.txt — mirrored server-side auditor assignment

  • val03/setup-server-assign-operator.txt — mirrored server-side operator assignment

  • val03/setup-server-assign-analyst.txt — mirrored server-side analyst assignment

  • val03/val03-01-ha-status-deny.stderr — unassigned denied ha status (rbac: pattern)

  • val03/val03-02-ha-status-operator-allow.txt — operator allowed ha status

  • val03/val03-03-ha-status-analyst-allow.txt — analyst allowed ha status

  • val03/val03-04-ha-status-auditor-allow.txt — auditor allowed ha status

  • val03/val03-05-audit-query-operator-deny.stderr — operator denied audit query

  • val03/val03-06-audit-query-analyst-deny.stderr — analyst denied audit query

  • val03/val03-07-audit-query-auditor-allow.json — auditor allowed audit query

  • val03/val03-08-rbac-role-list-unassigned.txt — unassigned rbac role list (not guarded)

  • val03/val03-09-rollout-plan-list-unassigned.stderr — unassigned rollout plan list (connection error, not RBAC)

  • val03/val03-10-rbac-role-create-operator-deny.stderr — operator denied rbac role create

  • val03/val03-11-rbac-role-create-analyst-deny.stderr — analyst denied rbac role create

  • val03/val03-12-rbac-role-create-auditor-allow.txt — auditor allowed rbac role create

  • val03/val03-13-support-bundle-unassigned.stderr — unassigned support-bundle generate progress with bundle written: (top-level command not guarded)

  • val03/val03-support-bundle.tar.gz — bundle generated by the unguarded support-bundle check

  • val03/val03-14-access-denied-events.json — retained denial tuples from this VAL03 slice

  • val03/val03-report.txt — composite 14-check PASS/FAIL report

  • val03/val03-report.json — machine-readable JSON report

  • val04/val04-store-inventory.txt — retained store JSONL file count

  • val04/val04-category-rollout.json — all rollout-category records from retained store

  • val04/val04-category-ha.json — all ha-category records

  • val04/val04-category-cert.json — all cert-category records

  • val04/val04-category-relay.json — all relay-category records

  • val04/val04-category-auth.json — all auth-category records

  • val04/val04-category-rollback.json — all rollback-category records

  • val04/val04-category-summary.txt — category=<cat> count=<N> lines for all 6 categories

  • val04/val04-schema-check.txt — per-category schema field check results (PASS/FAIL + MISSING lines)

  • val04/val04-latency.txt — query_ok, query_elapsed_ms, bound_ms=2000, pass=true/false

  • val04/val04-coverage-all.json — full retained store JSON (all records, --limit 0)

  • val04/val04-coverage-report.txt — PRESENT/ABSENT per event type, final found/expected/threshold/pass

  • val04/val04-report.txt — composite 10-check PASS/FAIL report

  • val04/val04-report.json — machine-readable JSON report with latency and coverage counts

  • val05/val05-prometheus-status.txt — HTTP status code for /metrics endpoint

  • val05/val05-prometheus-raw.txt — full Prometheus text exposition from orchestrator

  • val05/val05-prometheus-families.txt — PRESENT/ABSENT per required metric family

  • val05/val05-events-ingest.json — explicit POST /v1/events response used to exercise cp_events_ingested_total

  • val05/val05-prometheus-observations.txt — sample lines for 4 non-zero observation checks

  • val05/val05-emit-helper.txt — telemetry_emit_helper output (emitted 3 events to )

  • val05/val05-wal-status.json — telemetry status --json output {total, exported, pending}

  • val05/val05-wal-inventory.txt — WAL dir, file count, total events, pass flag

  • val05/val05-export.jsonl — JSONL output from telemetry export (3 events)

  • val05/val05-export-fields.txt — PRESENT/ABSENT per mandatory JSONL event field

  • val05/val05-sink-output.txt — OTLP payloads received by autonomy telemetry sink

  • val05/val05-flush-stdout.txt — telemetry flush: OK N events sent to

  • val05/val05-flush-summary.txt — flush_ok, sink_lines, sink_payloads, pass flag

  • val05/val05-traceid-jsonl.txt — trace_id_found=<value>, span_id_found=<value>, or ABSENT

  • val05/val05-traceid-otlp.txt — traceId_found=true/false, spanId_found=true/false

  • val05/val05-report.txt — composite 9-check PASS/FAIL report

  • val05/val05-report.json — machine-readable JSON report

  • val06/val06-bundle.tar.gz — normal support bundle archive under inspection

  • val06/val06-generate-stdout.txt — normal stdout capture (typically empty; progress is emitted on stderr)

  • val06/val06-generate.log — stderr including generating support bundle, per-collector progress, and bundle written: confirmation

  • val06/val06-timing.txt — generate_ok, elapsed_s, bound_s=30, pass_timing

  • val06/val06-contents.txt — archive file listing from tar -tzf

  • val06/val06-manifest.json — extracted manifest.json with all 6 collector statuses

  • val06/val06-system-info.json — extracted system_info.json

  • val06/val06-config-redacted.yaml — extracted config_redacted.yaml for redaction inspection

  • val06/val06-audit-recent.json — extracted audit_recent.json (50 most recent records)

  • val06/val06-core-files.txt — PRESENT/ABSENT per required core file

  • val06/val06-manifest-check.txt — collector=<name> status=<status> for each of the 6 collectors

  • val06/val06-sysinfo-check.txt — PRESENT/ABSENT per required system_info field

  • val06/val06-audit-check.txt — audit_recent_count, pass

  • val06/val06-redaction-salt.txt — fleet_salt_placeholder, fleet_salt_actual_absent, pass

  • val06/val06-redaction-pg.txt — pg_redacted_present, pg_secret_absent, pass

  • val06/val06-privkey-check.txt — privkey_hits=0, pass

  • val06/val06-bundle-degraded.tar.gz — degraded bundle (ha_status failed)

  • val06/val06-degraded-manifest.json — extracted manifest from degraded bundle

  • val06/val06-degraded-check.txt — bundle_exit_ok=true, ha_status_status=failed, pass

  • val06/val06-report.txt — composite 10-check PASS/FAIL report

  • val06/val06-report.json — machine-readable JSON report with elapsed_s and per-check statuses

  • val07/val07-cp.log — dedicated VAL07 control-plane startup and per-request logs

  • val07/val07-health.txt — health_code=200, pass

  • val07/val07-create-raw.txt — 20 lines: <http_code> <time_total_s> from sequential creates

  • val07/val07-create-percentiles.txt — expected_n=20, n=20, sample_complete=true, p50_ms, p95_ms, p99_ms, min_ms, max_ms

  • val07/val07-list-raw.txt — 20 lines: <http_code> <time_total_s> from sequential lists

  • val07/val07-list-percentiles.txt — expected_n=20, n=20, sample_complete=true, p50_ms, p95_ms, p99_ms, min_ms, max_ms

  • val07/val07-concurrent-raw.txt — 5 lines from concurrent creates

  • val07/val07-concurrent-summary.txt — concurrent_n=5, conc_ok, conc_errors, wall_ms, bound_ms=2000, wall_pass

  • val07/val07-error-summary.txt — total_requests=45, error_count, pass

  • val07/val07-metrics-raw.txt — Prometheus text exposition from dedicated VAL07 control plane

  • val07/val07-prometheus-check.txt — cp_http_requests_total count, pass

  • val07/val07-report.txt — composite 9-check PASS/FAIL report with p50/p95/p99 values

  • val07/val07-report.json — machine-readable JSON report with plan_create_ms and plan_list_ms latency objects

  • val08/val08-cp.log — dedicated VAL08 control-plane startup and per-request logs

  • val08/val08-health.txt — status=ok, pass=true

  • val08/scenario-n1/scenario-report.txt — n_workers=1, total_plans=5, ok=5, errors=0, throughput_plans_per_sec

  • val08/scenario-n10/scenario-report.txt — n_workers=10, total_plans=50, ok=50, errors=0, throughput_plans_per_sec

  • val08/scenario-n50/scenario-report.txt — n_workers=50, total_plans=250, ok=250, errors=0, throughput_plans_per_sec

  • val08/scenario-n100/scenario-report.txt — n_workers=100, total_plans=500, ok=500, errors=0, throughput_plans_per_sec

  • val08/scenario-n{1,10,50,100}/worker-*.txt — per-worker error counts (one file per worker; all should contain 0)

  • val08/val08-wall-clock-n100.txt — elapsed_ms, bound_ms=30000, pass

  • val08/val08-throughput-scaling.txt — tput_n1, tput_n10, tput_n50, tput_n100, scaling_pass

  • val08/val08-error-aggregate.txt — total_errors=0, pass=true

  • val08/val08-list-consistency.txt — grand_total_created=805, expected_min, list_count, pass

  • val08/val08-metrics-raw.txt — Prometheus text exposition from dedicated VAL08 control plane

  • val08/val08-prometheus-check.txt — cp_http_requests_total count, pass

  • val08/val08-report.txt — composite 10-check PASS/FAIL report with throughput-by-tier table

  • val08/val08-report.json — machine-readable JSON report with throughput object and per-check statuses

  • val09/val09-cp.log — dedicated VAL09 control-plane startup and per-request logs

  • val09/val09-health.txt — status=ok, pass=true

  • val09/val09-plans-created.txt — 5 plan IDs with HTTP 201 create codes

  • val09/val09-scan-empty.json — stuck scan on empty store (stuck_count=0)

  • val09/val09-baseline-check.txt — stuck_count=0, expected=0, pass=true

  • val09/val09-scan-fresh.json — stuck scan immediately after creation (stuck_count=0)

  • val09/val09-fresh-check.txt — stuck_count=0, expected=0, pass=true

  • val09/val09-scan-stale.json — stuck scan after sleep; full stuck_plans array with diagnosis fields

  • val09/val09-stale-check.txt — stuck_count=5, expected=5, pass=true

  • val09/val09-diagnosis-check.txt — total=5, diagnosis_populated=5, pass=true

  • val09/val09-pause-planb.json — pause response for plan-b (phase=paused)

  • val09/val09-scan-after-pause.json — stuck scan after pause (plan-b absent)

  • val09/val09-pause-check.txt — paused_plan=val09-plan-b, in_stuck_scan=no, pass=true

  • val09/val09-cancel-planc.json — cancel response for plan-c

  • val09/val09-scan-after-cancel.json — stuck scan after cancel (plan-c absent)

  • val09/val09-cancel-check.txt — cancelled_plan=val09-plan-c, in_stuck_scan=no, pass=true

  • val09/val09-recover-retry-pland.json — retry recovery response (new_phase=active)

  • val09/val09-retry-check.txt — new_phase=active, pass=true

  • val09/val09-recover-rollback-plane.json — rollback recovery response (new_phase=rolled_back)

  • val09/val09-rollback-check.txt — new_phase=rolled_back, pass=true

  • val09/val09-scan-final.json — final stuck scan (plan-a present; b/c/d/e absent)

  • val09/val09-final-check.txt — per-plan presence flags + pass=true

  • val09/val09-report.txt — composite 10-check PASS/FAIL report

  • val09/val09-report.json — machine-readable JSON report with scan counts and per-check statuses

  • val10/val10-cp.log — dedicated VAL10 control-plane startup and per-request logs

  • val10/val10-health.txt — status=ok, pass=true

  • val10/val10-preview-rollout_plan.txt — text safety profile for rollout_plan target

  • val10/val10-preview-rollout_stage.txt — text safety profile for rollout_stage target

  • val10/val10-preview-ha_leader_resign.txt — text safety profile for ha_leader_resign target

  • val10/val10-preview-relay_deadletter.txt — text safety profile for relay_deadletter target

  • val10/val10-preview-check.txt — preview_errors=0, pass=true

  • val10/val10-preview-rollout_plan.json — JSON preview (safety_class=terminal, orchestrated=true)

  • val10/val10-preview-rollout_plan-check.txt — field checks, pass=true

  • val10/val10-preview-relay_deadletter.json — JSON preview (orchestrated=false)

  • val10/val10-preview-relay-check.txt — orchestrated=false, manual_path_has_edgectl=true, pass=true

  • val10/val10-retry-plans-created.txt — 5 × plan_id=val10-retry-N  code=201

  • val10/retry/execute-retry-{1..5}.txt — per-plan retry execute output (outcome=success  previous=published  new=active)

  • val10/val10-retry-rate.txt — strategy=retry  ok=5  fail=0  total=5  success_rate=1.000  pass=true

  • val10/val10-rollback-plans-created.txt — 5 × plan_id=val10-rollback-N  code=201

  • val10/rollback/execute-rollback-{1..5}.txt — per-plan rollback execute output (outcome=success  previous=published  new=rolled_back)

  • val10/val10-rollback-rate.txt — strategy=rollback  ok=5  fail=0  total=5  success_rate=1.000  pass=true

  • val10/val10-execute-json.json — JSON output from retry execute (Outcome, NewState, Kind fields)

  • val10/val10-execute-json-check.txt — field presence checks, pass=true

  • val10/val10-execute-nonexistent.txt — error output for nonexistent plan

  • val10/val10-nonexistent-check.txt — exit_code=1, pass=true

  • val10/val10-execute-relay-not-orchestrated.txt — edgectl instructions in error output

  • val10/val10-relay-not-orchestrated-check.txt — exit_code!=0, has_edgectl_instructions=1, pass=true

  • val10/val10-audit-preview-events.json — rollback.preview.requested events from audit store

  • val10/val10-audit-preview-check.txt — rollback.preview.requested_count ≥ 4 with actor/start-time scope, pass=true

  • val10/val10-audit-execute-events.json — rollback.executed events from audit store

  • val10/val10-aggregate-rate.txt — agg_ok=10  agg_total=10  agg_success_rate=1.000  pass=true

  • val10/val10-report.txt — composite 10-check PASS/FAIL report with success rate table

  • val10/val10-report.json — machine-readable JSON with success_rate object and per-check statuses

  • val11/val11-cp.log — chaos CP stdout/stderr across all start/stop cycles (append mode)

  • val11/val11-health.txt — status=ok, pass=true

  • val11/val11-plans-created.txt — 9 plan create HTTP codes (dur-1..5, gate-1, corrupt-1, dev-1..3)

  • val11/val11-scan-stuck.json — stuck scan result (threshold=2s, after sleep=3s)

  • val11/val11-stuck-check.txt — stuck_count≥3, diagnosis_ok=true, pass=true

  • val11/val11-corrupt-plan-create.txt — HTTP 201 for corrupt-artifact plan

  • val11/val11-corrupt-plan.json — GET /v1/rollouts/val11-corrupt-1 response

  • val11/val11-corrupt-check.txt — create_code=201, get_ok=true with plan.metadata.id=val11-corrupt-1, pass=true

  • val11/val11-execute-rollback-corrupt.txt — rollback execute output for val11-corrupt-1

  • val11/val11-rollback-corrupt-check.txt — exit_ok=true, pass=true

  • val11/cascade/execute-retry-dev-{1..3}.txt — cascade retry output per plan

  • val11/val11-cascade-check.txt — cascade_ok=3, cascade_fail=0, pass=true

  • val11/val11-kill-client-error.txt — CLI output after CP kill

  • val11/val11-kill-check.txt — exit_nonzero=true, has_connection_error=true, pass=true

  • val11/val11-durability-list.json — GET /v1/rollouts after first restart

  • val11/val11-durability-check.txt — list_count≥10, pass=true

  • val11/val11-gate-plan.json — GET /v1/rollouts/val11-gate-1 after restart

  • val11/val11-gate-check.txt — phase=published, pass=true

  • val11/val11-rapid-restart.txt — 3× kill/restart cycle log + final list count + new plan code

  • val11/val11-audit-executed.json — rollback.executed events from retained store

  • val11/val11-audit-check.txt — rollback_executed_success_count≥1 with actor/start-time scope, pass=true

  • val11/val11-report.txt — composite 10-check PASS/FAIL chaos report

  • val11/val11-report.json — machine-readable JSON chaos report

  • val13/ for the VAL13 HA failover validation captures

  • val13/probe-table-create.txt — output of CREATE TABLE val13_probe

  • val13/val13-node1-s0.log — node-1 HA server initial session log

  • val13/val13-node2-s0.log — node-2 HA server initial session log

  • val13/val13-node1-s1.log — node-1 HA server restart-1 log

  • val13/val13-node1-s2.log — node-1 HA server restart-2 log

  • val13/val13-node2-s1.log — node-2 HA server restart-1 log

  • val13/val13-node2-s2.log — node-2 HA server restart-2 log

  • val13/val13-01-node1-status.txt — ha status for node-1 at baseline

  • val13/val13-01-node2-status.txt — ha status for node-2 at baseline (follower)

  • val13/val13-02-pre-kill-write.txt — SQL INSERT output before kill

  • val13/val13-03-failover-timing.txt — failover_ms=<N>  signal=TERM

  • val13/val13-03-node2-status-after.txt — ha status immediately after node-2 takes over

  • val13/val13-04-data-probe.txt — SELECT note result (should contain pre-kill)

  • val13/val13-05-write-ready.txt — ha status confirming node-2 is active leader

  • val13/val13-06-quorum-lost.json — /v1/ha/quorum JSON showing quorum_health=lost

  • val13/val13-07-quorum-healthy.json — /v1/ha/quorum JSON showing quorum_health=healthy

  • val13/val13-07-data-after-pg-restart.txt — probe row intact after PG stop/start

  • val13/val13-08-cycle1.txt — rapid kill cycle 1 timing

  • val13/val13-08-cycle2.txt — rapid kill cycle 2 timing

  • val13/val13-08-cycle3.txt — rapid kill cycle 3 timing

  • val13/val13-08-rapid-summary.txt — cycle1_ms=N  cycle2_ms=N  cycle3_ms=N

  • val13/val13-08-post-cycle-status.txt — ha status after the rapid cycles

  • val13/val13-08-data-after-rapid.txt — probe row still intact after the rapid cycles

  • val13/val13-09-sigkill-timing.txt — failover_ms=<N>  signal=KILL

  • val13/val13-10-final-status.txt — ha status for surviving leader after all tests

  • val13/val13-report.txt — composite 10-check PASS/FAIL HA failover report

  • val13/val13-report.json — machine-readable JSON HA failover report

  • val14/ for the VAL14 HA replication lag baseline captures

  • val14/val14-pg-setup.txt — role create, pg_hba update, table create, synchronous config

  • val14/val14-pg-basebackup.txt — pg_basebackup progress + replication slot creation

  • val14/val14-ha-server.log — HA server log (leader election + quorum monitor)

  • val14/val14-01-replication.txt — pg_stat_replication output (state=streaming)

  • val14/val14-02-idle-lsn-gap.txt — LSN gap at rest (should be 0)

  • val14/val14-02-idle-lag-samples.txt — 10 idle write_lag_ms samples

  • val14/val14-light-write.txt — INSERT output for 100-row light write

  • val14/val14-03-light-lag-samples.txt — 5 lag samples captured before the light-load drain fully completes

  • val14/val14-03-light-result.txt — light_drain_ms=<N>  threshold=2000

  • val14/val14-heavy-write.txt — INSERT output for 500-row heavy write

  • val14/val14-04-heavy-lag-samples.txt — 5 lag samples during/after heavy write

  • val14/val14-04-heavy-result.txt — heavy_drain_ms=<N>  threshold=5000

  • val14/val14-05-post-drain-samples.txt — 5 lag samples after heavy drain

  • val14/val14-05-drain-result.txt — post_drain_gap=<N>

  • val14/val14-06-ha-endpoint.json — /v1/health/replication JSON response

  • val14/val14-07-quorum-degraded.json — /v1/ha/quorum JSON showing quorum_health=degraded

  • val14/val14-08-quorum-healthy.json — /v1/ha/quorum JSON showing quorum_health=healthy

  • val14/val14-09-offline-write.txt — INSERT output for the backlog generated while standby is offline

  • val14/val14-09-post-restart-replication.txt — pg_stat_replication after standby restart

  • val14/val14-09-catchup-result.txt — catchup_drain_ms=<N>  ok=true/false

  • val14/val14-report.txt — composite 10-check report with lag measurements and derived thresholds

  • val14/val14-report.json — machine-readable JSON replication lag baseline report

  • val15/ for the VAL15 backup/restore validation captures

  • val15/val15-pg-setup.txt — Docker container IP, postgres URL, table create and fixture load output

  • val15/val15-ha-server.log — HA server log (normal mode, backup phase)

  • val15/val15-ha-server-maint.log — HA server log (maintenance mode, restore phase)

  • val15/val15-ha-server-post.log — HA server log (normal mode, multi-backup and error-path phase)

  • val15/val15-01-backup-create.txt — ha backup create stdout+stderr (backup_id, checksum, size_bytes)

  • val15/val15-01-backup-file-stat.txt — stat of the .dump file

  • val15/val15-02-backup-toc.txt — pg_restore -l table-of-contents of the backup archive

  • val15/val15-03-backup-list.txt — ha backup list --output json after first backup

  • val15/val15-03-metadata-check.txt — assertion result: backup_id found with status=completed

  • val15/val15-04-checksum-verify.txt — cli_checksum=<hex> + file_checksum=<hex> comparison

  • val15/val15-05-backup-timing.txt — backup_ms=<N>

  • val15/val15-05-db-before.txt — row counts before backup (small=100 medium=1000)

  • val15/val15-06-post-backup-mutation.txt — SQL UPDATE/DELETE output confirming post-backup mutation

  • val15/val15-06-restore.txt — ha backup restore stdout+stderr

  • val15/val15-06-db-after-restore.txt — row counts after restore

  • val15/val15-06-data-check.txt — SQL spot-check query output (100|1000|t|t)

  • val15/val15-06-integrity-result.txt — assertion result: restore_correct=true small=100 medium=1000

  • val15/val15-07-restore-timing.txt — restore_ms=<N>

  • val15/val15-08-backup-create-2.txt — second ha backup create output

  • val15/val15-08-backup-list-multi.txt — ha backup list --output json showing both backup IDs

  • val15/val15-08-inventory-check.txt — assertion result: multi_backup_count=2

  • val15/val15-09-restore-no-confirm.txt — ha backup restore without --confirm (expected error mentioning --confirm)

  • val15/val15-10-audit-backup-created.json — audit query --event-type ha.backup.created result

  • val15/val15-10-audit-backup-restored.json — audit query --event-type ha.backup.restored result

  • val15/val15-10-audit-check.txt — assertion result: event counts for both audit event types

  • val15/backups/backup-val15-a.dump — pg_dump custom-format archive (first backup)

  • val15/backups/backup-val15-b.dump — pg_dump custom-format archive (second backup)

  • val15/val15-report.txt — composite 10-check PASS/FAIL report with timing values

  • val15/val15-report.json — machine-readable JSON report with backup_ms, restore_ms, pass_count

  • val16/ for the VAL16 split-brain chaos validation captures

  • val16/val16-ha-server.log — HA server log (single normal session throughout)

  • val16/val16-probe-setup.txt — Docker psql output: CREATE TABLE + INSERT for probe row

  • val16/val16-01-baseline.json — /v1/ha/split-brain JSON at baseline (risk=none)

  • val16/val16-01-baseline.txt — ha split-brain detect CLI output at baseline

  • val16/val16-02-epoch-inject.txt — SQL UPDATE output (epoch divergence injection)

  • val16/val16-02-detected.json — /v1/ha/split-brain JSON after injection (risk=detected)

  • val16/val16-02-detected.txt — ha split-brain detect CLI output after injection

  • val16/val16-03-detect-repeat.json — second /v1/ha/split-brain call (idempotency check)

  • val16/val16-04-recover-dry-run.txt — ha split-brain recover --strategy manual-reconcile stdout+stderr

  • val16/val16-04-risk-after-dry-run.json — /v1/ha/split-brain JSON after dry-run (risk unchanged)

  • val16/val16-04-risk-check.txt — assertion result: risk_after_dry_run=detected

  • val16/val16-05-recover-execute.txt — ha split-brain recover --strategy promote-leader stdout+stderr

  • val16/val16-05-recovered.json — /v1/ha/split-brain JSON after promote-leader (risk=none)

  • val16/val16-05-recovered.txt — ha split-brain detect CLI output after recovery

  • val16/val16-06-probe-after-recovery.txt — SQL SELECT: note from val16_probe WHERE id=1

  • val16/val16-07-ghost-inject.txt — SQL INSERT output (ghost-node epoch rows)

  • val16/val16-07-possible.json — /v1/ha/split-brain JSON after ghost injection (risk=possible)

  • val16/val16-07-possible.txt — ha split-brain detect CLI output for risk=possible

  • val16/val16-08-ghost-clear.txt — SQL UPDATE output (stamp resigned_at on ghost rows)

  • val16/val16-08-cleared.json — /v1/ha/split-brain JSON after clearing (risk=none)

  • val16/val16-09-audit-detected.json — audit query --event-type ha.split_brain.detected --start-time <slice_start> JSON result

  • val16/val16-09-audit-recovered.json — audit query --event-type ha.split_brain.recovered --actor val16-operator --start-time <slice_start> JSON result

  • val16/val16-09-audit-check.txt — assertion result: slice-scoped event counts for both audit event types

  • val16/val16-10-final-status.json — /v1/ha/status JSON confirming cp-val16-node holds leadership

  • val16/val16-10-stability-check.txt — assertion result: holder_id check

  • val16/val16-report.txt — composite 10-check PASS/FAIL report

  • val16/val16-report.json — machine-readable JSON report with pass_count and scenario results

  • val17/ for the VAL17 quorum loss validation captures

  • val17/val17-pg-setup.txt — Docker container IP and PG URL used by HA server

  • val17/val17-ha-server.log — HA server log (single session throughout all phases)

  • val17/val17-01-baseline.json — /v1/ha/quorum JSON at baseline (quorum_health=healthy)

  • val17/val17-01-baseline.txt — ha quorum status CLI output at baseline

  • val17/val17-01-baseline-check.txt — assertion: quorum_health=healthy write_block_active=False

  • val17/val17-02-quorum-lost.json — /v1/ha/quorum JSON after PG stop (quorum_health=lost)

  • val17/val17-02-quorum-lost.txt — ha quorum status CLI output during loss

  • val17/val17-03-loss-timing.txt — loss_ms=<N>

  • val17/val17-04-write-block-check.txt — assertion: write_block_active=True can_accept_protected_writes=False

  • val17/val17-05-loss-reason.txt — assertion: quorum_loss_reason=<message>

  • val17/val17-06-quorum-recovered.json — /v1/ha/quorum JSON after PG restart (quorum_health=healthy)

  • val17/val17-06-quorum-recovered.txt — ha quorum status CLI output after recovery

  • val17/val17-07-recovery-timing.txt — recovery_ms=<N>

  • val17/val17-08-recovery-check.txt — assertion: write_block_active, can_accept_protected_writes, timestamps, detected_loss_count

  • val17/val17-09-second-loss.json — /v1/ha/quorum JSON during second loss

  • val17/val17-09-second-recovery.json — /v1/ha/quorum JSON after second recovery

  • val17/val17-09-count-check.txt — assertion: detected_loss_count=2 (second cycle confirmed)

  • val17/val17-10-audit-lost.json — audit query --event-type ha.quorum.lost JSON result

  • val17/val17-10-audit-restored.json — audit query --event-type ha.quorum.restored JSON result

  • val17/val17-10-audit-check.txt — assertion: lost_events=N restored_events=M

  • val17/val17-10-audit-check.txt — assertion: lost_events=N restored_events=M

  • val17/val17-report.txt — composite 10-check PASS/FAIL report with loss_ms and recovery_ms

  • val17/val17-report.json — machine-readable JSON with loss_ms, recovery_ms, pass_count

6. Expected Results

Every audit log in the bundle should include the canonical fields (a presence-check sketch follows this list):

  • audit.event

  • audit.category

  • audit.action

  • audit.outcome

  • audit.resource

  • audit.resource_type

  • audit.source=cli
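
A minimal presence check over one captured log, assuming the canonical keys appear literally in each emitted line (slog-style key=value or JSON encoding; adjust the match if the encoding differs):

```bash
# Verify one captured audit log carries every canonical key.
log=audit-rollout-plan-create.log
for key in audit.event audit.category audit.action audit.outcome \
           audit.resource audit.resource_type audit.source; do
  grep -q "$key" "$log" && echo "$key=present" || echo "$key=MISSING"
done
```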

Expected event mapping:

  • audit-rollout-plan-create.log → rollout.plan.created

  • audit-rollout-plan-publish.log → rollout.plan.published

  • audit-rollout-plan-cancel.log → rollout.plan.cancelled

  • audit-ha-backup-create.log → ha.backup.created

  • audit-ha-backup-restore.log → ha.backup.restored

  • audit-ha-failover-trigger.log → ha.failover.triggered and ha.failover.completed

  • audit-cert-issue.log → cert.issued

  • audit-cert-rotate.log → cert.rotated

  • rbac/audit-rbac-role-assign.log → auth.role.assigned

  • relay/audit-relay-deadletter-retry.log → relay.deadletter.retried

  • relay/audit-relay-deadletter-purge.log → relay.deadletter.purged

  • relay/audit-relay-bandwidth-set.log → relay.bandwidth.configured

The paired stdout files should show that each command completed successfully in the same live run.

The metrics surface should also show (capture sketch after the list):

  • metrics/orchestrator-metrics-raw.txt contains the real Prometheus exposition emitted by the live control-plane started by the lab

  • metrics/metrics-list.txt and metrics/metrics-list.json contain the same metric catalog in text and JSON form

  • metrics/metrics-query-all.txt contains live samples such as cp_http_requests_total, cp_health_checks_total, and cp_rollout_plans_total

  • metrics/metrics-query-rollout-plans.json contains the filtered rollout phase counters generated by the rollout actions in the same run

  • metrics/metrics-query-http-duration.txt contains the histogram family, including _bucket, _count, and _sum
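
A minimal capture sketch; the control-plane address below is a placeholder for the port the lab run actually starts:

```bash
# Scrape the live Prometheus exposition and confirm the histogram family
# is complete. The address is a placeholder.
cp_addr=http://127.0.0.1:18080
curl -fsS "$cp_addr/metrics" > metrics/orchestrator-metrics-raw.txt
for suffix in _bucket _count _sum; do
  grep -q "cp_http_request_duration_seconds${suffix}" \
       metrics/orchestrator-metrics-raw.txt \
    && echo "duration${suffix}=present" || echo "duration${suffix}=MISSING"
done
```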

The RBAC surface should also show (direct-request sketch after the list):

  • rbac/rbac-role-create.txt creates a custom lowercase role derived from the canonicalized input name

  • rbac/rbac-role-list-before-assign.txt shows the custom role with 0 assignments

  • rbac/rbac-role-assign.txt assigns the role to the trimmed subject identity

  • rbac/rbac-role-assign-repeat.txt shows the idempotent repeat-assignment no-op

  • rbac/rbac-role-list-after-assign.txt shows the role with 1 assignment

  • rbac/assignments.json persists the normalized role and subject

  • retained/audit-query-category-auth.json returns the retained auth.role.assigned record from the same live run

  • rbac/rbac-role-assign-operator.txt assigns the predefined operator role used by the PR-21 allow-path checks

  • rbac/rbac-audit-query-denied.stderr and rbac/rbac-audit-export-denied.stderr show fail-closed denial for an operator who lacks audit_history:read

  • rbac/rbac-audit-query-allowed.json and rbac/rbac-audit-export-allowed.json show the authorized retained auth view for reviewer@example.com

  • rbac/rbac-ha-status-denied.stderr shows fail-closed denial for an unassigned operator on ha status

  • rbac/rbac-ha-status-allowed.txt shows successful HA status output for fleet-op@example.com

  • ha/ha-status-no-header.headers and ha/ha-status-no-header.json show the server-side 403 denial when /v1/ha/status is called without operator identity under enforcement

  • ha/ha-status-with-header.headers and ha/ha-status-with-header.json show the server-side 200 success path when the authorized operator header is set

  • rbac/retained-auth-access-denied.json returns the retained auth.access.denied records from the same live enforcement run
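
A minimal sketch of the direct /v1/ha/status checks. Both the control-plane address and the operator identity header name (X-Autonomy-Operator) are assumptions; substitute the values the lab actually uses:

```bash
base=http://127.0.0.1:18080            # placeholder address

# Without operator identity: expect a 403 under enforcement.
curl -s -D ha/ha-status-no-header.headers \
     -o ha/ha-status-no-header.json "$base/v1/ha/status"

# With an authorized operator identity: expect a 200.
# X-Autonomy-Operator is a placeholder header name.
curl -s -D ha/ha-status-with-header.headers \
     -H 'X-Autonomy-Operator: fleet-op@example.com' \
     -o ha/ha-status-with-header.json "$base/v1/ha/status"

head -1 ha/ha-status-no-header.headers     # expect a 403 status line
head -1 ha/ha-status-with-header.headers   # expect a 200 status line
```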

The retained surface should also show (no-truncation sketch after the list):

  • retained/retained-file-list.txt includes the daily JSONL file under retained/store/

  • retained/audit-query-all.txt returns a mixed set of rollout, HA, cert, and relay records from the live run

  • retained/audit-query-ha-backup-created.json returns the filtered ha.backup.created records

  • retained/audit-query-category-rollout.txt returns only rollout-category records from the same live retained dataset

  • retained/audit-query-source-edge.json returns only source=edge relay records from the same live retained dataset

  • retained/audit-query-outcome-success.txt returns the success-only retained dataset

  • retained/audit-query-invalid-output.stderr shows unknown --output values fail closed with the supported value list

  • retained/audit-query-invalid-range.stderr shows malformed time ranges fail closed instead of returning a silent empty result

  • retained/audit-export-all.json and retained/audit-export-all.csv contain the same retained dataset in export form

  • retained/audit-export-invalid-format.stderr shows unsupported export formats fail before writing

  • retained/audit-export-invalid-format-target-before.sha256 and retained/audit-export-invalid-format-target-after.sha256 match, proving the invalid-format failure did not truncate the existing target file
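
A minimal sketch of the no-truncation assertion. The --output-file flag name is a placeholder for the CLI's actual export-target flag:

```bash
target=retained/audit-export-all.json

sha256sum "$target" > retained/audit-export-invalid-format-target-before.sha256

# Attempt an export with an unsupported format; the command must fail
# before writing. --output-file is a placeholder flag name.
autonomy audit export --format xml --output-file "$target" \
  2> retained/audit-export-invalid-format.stderr || true

sha256sum "$target" > retained/audit-export-invalid-format-target-after.sha256
diff retained/audit-export-invalid-format-target-before.sha256 \
     retained/audit-export-invalid-format-target-after.sha256 \
  && echo "target_untouched=true"
```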

The support-bundle surface should also show (inspection sketch after the list):

  • support-bundle/support-bundle-generate.log records the live collection run

  • support-bundle/support-bundle-contents.txt lists the expected archive members

  • support-bundle/manifest.json marks system_info, build_info, config, ha_status, audit_recent, and logs as ok

  • support-bundle/config_redacted.yaml contains <REDACTED> for fleet_salt and REDACTED in the postgres_url password position

  • support-bundle/ha_status.json contains the live HA snapshot from the running helper

  • support-bundle/audit_recent.json contains retained records from the same lab run

  • support-bundle/logs-autonomy.log contains the tailed HA server log

  • support-bundle/support-bundle-sha256.txt records the resulting archive hash
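
A minimal inspection sketch; the bundle path and the manifest's collectors layout are assumptions inferred from the artifacts above:

```bash
bundle=support-bundle/bundle.tar.gz    # placeholder archive path

tar -tzf "$bundle" > support-bundle/support-bundle-contents.txt
sha256sum "$bundle" > support-bundle/support-bundle-sha256.txt

# Print collector statuses from the manifest; the .collectors[] shape is
# an assumption based on the manifest-check artifact format.
tar -xzOf "$bundle" manifest.json \
  | jq -r '.collectors[] | "collector=\(.name) status=\(.status)"'
```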

The database-backed audit surface (when AUTONOMY_AUDIT_PG_URL is set) should also show (query sketch after the list):

  • db_audit/query-db-all.txt contains the three seeded rows (rollout, cert, ha) ordered newest-first with event_name, actor, resource, outcome, and source

  • db_audit/query-db-cert.json contains exactly one record with category=cert

  • db_audit/export-db-all.json and db_audit/export-db-all.csv contain the same seeded dataset as query-db-all.txt in export format

  • db_audit/prune-90d.txt shows deleted=0 (no rows older than 90 days in the lab)

  • db_audit/prune-1d.txt shows deleted=2 (the 1h and 2h rows are removed)

  • db_audit/query-db-after-prune.json contains exactly one row (the most recent)

  • db_audit/query-file-all.txt contains records from the file store in parallel, proving the file emitter remained active alongside the DB emitter
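
A minimal sketch of the DB-backed path. The PG URL is a placeholder and the --category flag name is an assumption; --pg-url (via env here), --format, --output, and --older-than are the surfaces this lab documents:

```bash
export AUTONOMY_AUDIT_PG_URL='postgres://audit:audit@127.0.0.1:5432/audit?sslmode=disable'

autonomy audit query  --output json       > db_audit/query-db-all.json
autonomy audit query  --category cert --output json \
                                          > db_audit/query-db-cert.json  # --category is assumed
autonomy audit export --format csv        > db_audit/export-db-all.csv
autonomy audit prune  --older-than 1d     > db_audit/prune-1d.txt
```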

The VAL 01 zero-downtime rotation surface should also show (timing sketch after the list):

  • autonomy/cert-rotation-list-expiring.txt contains either expiring or node-c, confirming the 2-day cert falls inside the --expiring-within-days 5 window

  • autonomy/cert-rotation-prerotate-health.json contains "status":"ok", confirming the old cert was accepted over live mTLS before rotation

  • autonomy/cert-rotation-timing.txt contains pass=true and rotation_elapsed_seconds=<N> where N is well below the 300-second bound, proving the rotation operation itself is effectively instantaneous

  • autonomy/cert-rotation-rotate.txt contains rotated  identity=node-c.edge.local cert=... valid_days=90, confirming the operation succeeded with the default 90-day renewal

  • autonomy/cert-rotation-list-after.txt contains no certificates matched, confirming the 90-day replacement cert is not in the 5-day expiry window

  • autonomy/cert-rotation-postrotate-health.json contains "status":"ok", proving the rotated client cert was accepted without restarting the control-plane

  • autonomy/cert-rotation-audit-events.json contains a record with cert.rotated, confirming the event was retained in the audit store

  • autonomy/cert-rotation-before-dates.txt and autonomy/cert-rotation-after-dates.txt have different serial= values, confirming a new keypair was issued

  • autonomy/cert-rotation-val01-report.txt reports 6/6 checks PASS with serials_differ=true
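
A minimal sketch of the timing and serial checks; the cert path and the positional identity argument are placeholders for the lab's actual invocation:

```bash
cert=certs/node-c.edge.local.pem       # placeholder path for the node-c cert

openssl x509 -in "$cert" -noout -serial -enddate \
  > autonomy/cert-rotation-before-dates.txt

t0=$(date +%s)
# The positional identity argument is an assumption about the CLI shape.
autonomy cert rotate node-c.edge.local > autonomy/cert-rotation-rotate.txt
t1=$(date +%s)
elapsed=$((t1 - t0))
echo "pass=$([ "$elapsed" -le 300 ] && echo true || echo false) rotation_elapsed_seconds=$elapsed" \
  > autonomy/cert-rotation-timing.txt

openssl x509 -in "$cert" -noout -serial -enddate \
  > autonomy/cert-rotation-after-dates.txt
```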

The VAL 02 trust-chain rejection surface should also show:

  • autonomy/cert-rejection-missing-client-cert.stderr is non-empty while the paired .stdout is empty, and stderr matches the missing-client-cert handshake pattern, confirming the control-plane requires a client certificate

  • autonomy/cert-rejection-invalid-chain.stderr matches the invalid-chain pattern, confirming a cert signed by a rogue CA is rejected even when the CN matches a legitimate node identity

  • autonomy/cert-rejection-expired-cert.stderr matches the expired-cert pattern, confirming expired certificates are rejected (validity period enforcement)

  • autonomy/cert-rejection-revoked.stderr matches the revoked-cert pattern, and the existing autonomy/cert-revocation-rejected-events.json retained audit evidence still proves the VerifyPeerCertificate callback path

  • autonomy/cert-rejection-wrong-server-trust.stderr matches the server-trust-verify pattern, confirming mTLS is bidirectional: the client cannot connect when it cannot verify the server’s cert chain

  • autonomy/cert-rejection-val02-report.txt reports 5/5 checks PASS and includes the right_ca_wrong_cn note confirming that a cert from the trusted CA with an unexpected CN is accepted at the TLS layer (identity-layer authorization is RBAC-based, not CN-based)

The cert RBAC surface should also show:

  • cert_rbac/denied-list.txt contains the string “cert:manage”, confirming that cert list (previously unguarded) now requires RBAC authorization

  • cert_rbac/denied-check-revocation.txt contains the string “cert:manage”, confirming that cert check-revocation (previously unguarded) now requires RBAC authorization

  • cert_rbac/denied-issue.txt, denied-rotate.txt, denied-revoke.txt, and denied-sync-crl.txt each contain “cert:manage”, confirming consistent coverage

  • cert_rbac/allowed-issue.txt contains the successful issued  identity=... line and does NOT contain an RBAC denial, confirming mutation success under cert:manage

  • cert_rbac/allowed-list.txt contains the listed node-a.edge.local row and does NOT contain an RBAC denial, confirming read-only success under cert:read

  • cert_rbac/allowed-check-revocation.txt contains not_revoked, confirming read-only revocation inspection succeeds under cert:read

  • cert_rbac/audit-denied-events.json contains auth.access.denied records with permission fields referencing cert:manage and cert:read | cert:manage, confirming denial is audited before the error is returned

The VAL03 RBAC enforcement surface should also show:

  • val03/val03-01-ha-status-deny.stderr contains rbac: and the fleet:read permission name, confirming the guard fires before any HTTP call when the operator has no assignment

  • val03/val03-05-audit-query-operator-deny.stderr and val03/val03-06-audit-query-analyst-deny.stderr each contain rbac: with audit_history:read, confirming the operator and analyst roles both lack the audit permission

  • val03/val03-10-rbac-role-create-operator-deny.stderr and val03/val03-11-rbac-role-create-analyst-deny.stderr each contain rbac: with rbac:manage, confirming neither operator nor analyst can create roles

  • val03/val03-02-ha-status-operator-allow.txt, val03-03-ha-status-analyst-allow.txt, and val03-04-ha-status-auditor-allow.txt each contain HA status JSON (or SKIP if the HA server was unavailable), confirming all three predefined roles include fleet:read and that the mirrored server-side RBAC store authorizes the same identities

  • val03/val03-07-audit-query-auditor-allow.json contains auth-category audit records, confirming audit_history:read in the auditor role allows the query

  • val03/val03-12-rbac-role-create-auditor-allow.txt contains created role "val03-test-role", confirming rbac:manage in the auditor role allows custom role creation

  • val03/val03-08-rbac-role-list-unassigned.txt does NOT contain an rbac: denial and lists known roles such as operator and auditor, confirming rbac role list has no RBAC guard

  • val03/val03-09-rollout-plan-list-unassigned.stderr contains a connection error, NOT rbac:, confirming rollout plan list has no RBAC guard

  • val03/val03-13-support-bundle-unassigned.stderr contains normal bundle generation progress plus bundle written:, and val03/val03-support-bundle.tar.gz is non-empty, confirming support-bundle generate has no top-level RBAC guard even if optional nested collectors emit RBAC warnings for guarded HA sub-requests

  • val03/val03-14-access-denied-events.json contains auth.access.denied records for the five expected VAL03 deny tuples, confirming every denial from the VAL03 DENY checks was written to the retained audit store before the error was returned

  • val03/val03-report.txt reports the final pass=<N> skip=<M> fail=<K> total=14 summary with zero failures

  • val03/val03-report.json contains pass_count, skip_count, and per-check status values so HA unavailability is recorded as SKIP rather than PASS

The VAL04 audit completeness surface should also show (latency sketch after the list):

  • val04/val04-store-inventory.txt reports store_jsonl_files > 0, confirming the retained store is non-empty after all prior lab phases

  • val04/val04-category-summary.txt reports count > 0 for all 6 categories (rollout, ha, cert, relay, auth, rollback), confirming every category is populated

  • val04/val04-schema-check.txt reports PASS for all 6 category schema checks with no MISSING field lines, confirming every returned record carries the mandatory audit fields

  • val04/val04-latency.txt reports query_ok=true and pass=true with query_elapsed_ms ≤ 2000, confirming the full retained store query both succeeded and stayed within the latency bound

  • val04/val04-coverage-report.txt reports PRESENT for all 25 of the 25 wired event types, with pass=true on the final summary line; any ABSENT line is now a real validation failure because the runner is expected to exercise the full wired event surface deterministically

  • val04/val04-report.txt reports pass=10 fail=0 total=10 summary

  • val04/val04-report.json contains pass_count=10, coverage_found=25, latency_ms ≤ 2000, and per-check status values
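
A minimal sketch of the VAL04-C3 latency check, using GNU date millisecond timestamps:

```bash
start_ms=$(date +%s%3N)
if autonomy audit query --limit 0 --output json > /tmp/val04-all.json; then
  ok=true
else
  ok=false
fi
end_ms=$(date +%s%3N)
elapsed=$((end_ms - start_ms))
echo "query_ok=$ok query_elapsed_ms=$elapsed pass=$([ "$ok" = true ] && [ "$elapsed" -le 2000 ] && echo true || echo false)"
```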

The VAL05 OTel integration surface should also show (field-check sketch after the list):

  • val05/val05-prometheus-status.txt reports http_code=200, confirming the control-plane Prometheus endpoint is reachable

  • val05/val05-prometheus-families.txt reports PRESENT for all 4 required metric families (cp_http_requests_total, cp_http_request_duration_seconds, cp_rollout_plans_total, cp_events_ingested_total)

  • val05/val05-events-ingest.json shows a successful POST /v1/events response, confirming that VAL05 itself exercised the event-ingestion path

  • val05/val05-prometheus-observations.txt reports non-zero sample values for cp_http_requests_total, cp_http_request_duration_seconds_count, cp_rollout_plans_total, and cp_events_ingested_total, confirming that lab traffic plus the explicit ingest produced real observations

  • val05/val05-emit-helper.txt contains emitted 3 events to, confirming the telemetry.Emitter → WAL write path succeeded

  • val05/val05-wal-status.json contains "total":3 and "pending":3, confirming events are persisted and not yet flushed

  • val05/val05-export.jsonl contains exactly 3 lines with "kind", "ts", "seq", "written_at", and "attrs" fields present in each line

  • val05/val05-flush-stdout.txt contains telemetry flush: OK 3 events sent to http://127.0.0.1:14318, confirming end-to-end OTLP delivery

  • val05/val05-flush-summary.txt reports sink_payloads > 0, confirming the sink printed at least one actual OTLP payload receipt line rather than only its startup banner

  • val05/val05-traceid-jsonl.txt reports trace_id_found=4bf92f3577b34da6a3ce929d0e0e4736 and span_id_found=00f067aa0ba902b7, confirming trace/span propagation through WAL → JSONL

  • val05/val05-traceid-otlp.txt reports traceId_found=true and spanId_found=true, confirming trace/span propagation in the OTLP/HTTP path

  • val05/val05-report.txt reports pass=9 fail=0 total=9 summary

  • val05/val05-report.json contains pass_count=9 and per-check status values
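
A minimal sketch of the JSONL field check over the exported WAL lines:

```bash
# True only if every exported line carries all five required fields.
jq -es 'all(has("kind") and has("ts") and has("seq")
            and has("written_at") and has("attrs"))' \
  val05/val05-export.jsonl > /dev/null \
  && echo "jsonl_fields=complete" || echo "jsonl_fields=incomplete"
```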

The VAL06 support-bundle surface should also show (PEM-scan sketch after the list):

  • val06/val06-timing.txt reports generate_ok=true and elapsed_s ≤ 30, confirming the bundle was created within the time bound

  • val06/val06-core-files.txt reports PRESENT for all three core files (manifest.json, system_info.json, build_info.json)

  • val06/val06-manifest-check.txt reports a status line for each of the 6 collectors (system_info, build_info, config, ha_status, audit_recent, logs), confirming all are recorded in the manifest

  • val06/val06-sysinfo-check.txt reports PRESENT for all 5 required fields (os, arch, go_version, hostname, collected_at)

  • val06/val06-audit-check.txt reports audit_recent_count > 0, confirming the bundle captured records from the retained audit store

  • val06/val06-redaction-salt.txt reports fleet_salt_placeholder=true and fleet_salt_actual_absent=true, confirming the known test salt was replaced with <REDACTED> and the original value does not appear

  • val06/val06-redaction-pg.txt reports pg_redacted_present=true and pg_secret_absent=true, confirming the postgres password was replaced with REDACTED in the URL and the original password does not appear

  • val06/val06-privkey-check.txt reports privkey_hits=0, confirming no PEM block (-----BEGIN) appears anywhere in the bundle archive

  • val06/val06-degraded-check.txt reports bundle_exit_ok=true and ha_status_status=failed, confirming graceful degradation when the control-plane URL is unreachable

  • val06/val06-report.txt reports pass=10 fail=0 total=10 summary

  • val06/val06-report.json contains pass_count=10 and per-check status values
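
A minimal sketch of the private-key scan; the bundle path is a placeholder for the archive produced by the run:

```bash
workdir=$(mktemp -d)
tar -xzf val06/val06-bundle.tar.gz -C "$workdir"   # placeholder bundle path
hits=$(grep -ra -- '-----BEGIN' "$workdir" | wc -l)
echo "privkey_hits=$hits pass=$([ "$hits" -eq 0 ] && echo true || echo false)" \
  > val06/val06-privkey-check.txt
rm -rf "$workdir"
```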

The VAL07 rollout latency surface should also show (percentile sketch after the list):

  • val07/val07-health.txt reports health_code=200, confirming the dedicated VAL07 control plane started and is reachable before the benchmark begins

  • val07/val07-create-percentiles.txt reports p50_ms, p95_ms, and p99_ms all within their respective bounds (100/300/500 ms), with n=20 and sample_complete=true confirming a full successful sample was collected

  • val07/val07-list-percentiles.txt reports p99_ms ≤ 500, confirming the list path (with 20 existing plans) is within the same latency target; it also reports n=20 and sample_complete=true

  • val07/val07-concurrent-summary.txt reports conc_ok=5 and conc_errors=0, confirming all 5 parallel creates succeeded, and wall_ms ≤ 2000, confirming that single-writer SQLite serialisation does not make concurrent operator requests unacceptably slow

  • val07/val07-error-summary.txt reports error_count=0 across all 45 benchmark requests (20 creates + 20 lists + 5 concurrent creates)

  • val07/val07-prometheus-check.txt reports cp_http_requests_total > 0, confirming the Prometheus instrumentation on the VAL07 control plane is wired and received observations from the benchmark traffic

  • val07/val07-report.txt reports pass=9 fail=0 total=9 summary

  • val07/val07-report.json contains pass_count=9, plan_create_ms latency object, and per-check status values
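
A minimal nearest-rank percentile sketch over the raw capture (gawk's asort is assumed):

```bash
awk '{ ms[NR] = $2 * 1000 }            # column 2 is time_total in seconds
END {
  n = NR
  asort(ms)                            # gawk extension
  printf "n=%d p50_ms=%.1f p95_ms=%.1f p99_ms=%.1f min_ms=%.1f max_ms=%.1f\n",
    n, ms[int(0.50 * n + 0.999)], ms[int(0.95 * n + 0.999)],
    ms[int(0.99 * n + 0.999)], ms[1], ms[n]
}' val07/val07-create-raw.txt
```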

The VAL08 rollout throughput surface should also show:

  • val08/val08-health.txt reports status=ok, confirming the dedicated VAL08 control plane started and is reachable before the throughput run begins

  • val08/scenario-n100/scenario-report.txt reports ok=500 and errors=0, proving the primary workplan target (≥100 concurrent device rollouts without errors) is met

  • val08/val08-wall-clock-n100.txt reports elapsed_ms ≤ 30000, confirming 500 concurrent plan creates complete within the 30-second bound

  • val08/val08-throughput-scaling.txt reports tput_n100 ≥ tput_n1, confirming that issuing 100 concurrent worker streams does not regress throughput below the single-worker serial rate; the SQLite single-writer model is expected to produce a plateau (near-equal throughput) rather than linear scaling, which is acceptable

  • val08/val08-error-aggregate.txt reports total_errors=0 across all 805 plans created (N=1+10+50+100, 5 plans each)

  • val08/val08-list-consistency.txt reports list_count ≥ 805, confirming that all created plans are durably stored and returned across paginated list results

  • val08/val08-prometheus-check.txt reports cp_http_requests_total > 0, confirming Prometheus instrumentation received observations from the throughput traffic

  • val08/val08-report.txt reports pass=10 fail=0 total=10 summary

  • val08/val08-report.json contains pass_count=10, throughput object with n1/n10/n50/n100 plans/sec values, and per-check status values

The VAL09 stuck detection surface should also show:

  • val09/val09-health.txt reports status=ok, confirming the dedicated VAL09 control plane started before any stuck checks run

  • val09/val09-baseline-check.txt reports stuck_count=0 on an empty store, confirming the detection function handles the zero-plan case without error

  • val09/val09-fresh-check.txt reports stuck_count=0 immediately after creating 5 plans, confirming freshly-created plans are not falsely reported as stuck before the 3-second threshold elapses

  • val09/val09-stale-check.txt reports stuck_count=5 after the 4-second sleep, confirming all 5 published-phase plans exceed the threshold and are detected as stuck

  • val09/val09-diagnosis-check.txt reports diagnosis_populated=5 and diagnosis_exact=5, confirming every stuck plan carries the exact expected "zero activations" diagnosis string

  • val09/val09-pause-check.txt reports in_stuck_scan=no for val09-plan-b, confirming paused plans are excluded from the active-phase scan

  • val09/val09-cancel-check.txt reports in_stuck_scan=no for val09-plan-c, confirming terminal plans are excluded from the active-phase scan

  • val09/val09-retry-check.txt reports new_phase=active, confirming the retry recovery strategy transitions the plan to the active phase and refreshes updated_at (removing it from the stuck list at VAL09-10)

  • val09/val09-rollback-check.txt reports new_phase=rolled_back, confirming the rollback recovery strategy transitions the plan to the terminal phase

  • val09/val09-final-check.txt reports plan_a_present=true, plan_b_absent=true, plan_c_absent=true, plan_d_absent=true, plan_e_absent=true, pass=true — the final scan correctly surfaces only the one plan that received no operator action

  • val09/val09-report.txt reports pass=10 fail=0 total=10 summary

  • val09/val09-report.json contains pass_count=10 and per-check status values with scan_stale_count=5 and scan_final_count=1

The VAL10 rollback reliability surface should also show:

  • val10/val10-preview-check.txt reports preview_errors=0, confirming all four rollback preview targets exit 0 and produce safety profiles

  • val10/val10-preview-rollout_plan-check.txt reports safety_class=terminal, orchestrated=true, and valid_strategies=['retry', 'rollback'], confirming the rollout_plan preview JSON schema is correct

  • val10/val10-preview-relay-check.txt reports orchestrated=false and manual_path_has_edgectl=true, confirming relay_deadletter is correctly surfaced as a manual-only target with edgectl instructions

  • val10/val10-retry-rate.txt reports ok=5  fail=0  success_rate=1.000, confirming all 5 retry executes on real plans succeed

  • val10/val10-rollback-rate.txt reports ok=5  fail=0  success_rate=1.000, confirming all 5 rollback executes on real plans succeed

  • val10/retry/execute-retry-*.txt each show outcome=success  previous=published new=active, confirming the retry strategy transitions plans to active phase

  • val10/rollback/execute-rollback-*.txt each show outcome=success previous=published  new=rolled_back, confirming rollback transitions to terminal

  • val10/val10-execute-json-check.txt reports all three JSON output fields present (outcome, new_state, kind), confirming the --output json format is stable

  • val10/val10-nonexistent-check.txt reports non-zero exit code, confirming the CLI surfaces CP 404 errors as non-zero exit rather than silently succeeding

  • val10/val10-relay-not-orchestrated-check.txt reports non-zero exit code and edgectl instructions present, confirming manual-only targets are blocked from execute with actionable guidance

  • val10/val10-audit-preview-check.txt reports rollback.preview.requested_count ≥ 4 with actor/start-time scope, confirming this slice’s preview commands emit audit records to the retained store

  • val10/val10-aggregate-rate.txt reports agg_success_rate=1.000 (10/10), plus at least 10 retained success events scoped to this slice, satisfying the workplan target of ≥99% rollback success rate

  • val10/val10-report.txt reports pass=10 fail=0 total=10 summary

  • val10/val10-report.json contains pass_count=10, success_rate.aggregate.rate=1.000, and per-check status values

The VAL11 chaos surface should also show:

  • val11/val11-health.txt reports status=ok, confirming the dedicated chaos CP started cleanly on port 18996

  • val11/val11-kill-check.txt reports exit_nonzero=true and has_connection_error=true, confirming the CLI surfaces CP unavailability as a non-zero exit with an actionable message rather than silently succeeding or hanging

  • val11/val11-durability-check.txt reports list_count ≥ 10, confirming the full pre-kill plan corpus committed before the CP kill is recoverable from SQLite’s WAL after process restart

  • val11/val11-rapid-restart.txt shows all 3 kill+restart cycles completing with cp_ready=true, list_count_final ≥ list_count_before, and new_plan_code=201, confirming the write path is fully operational after repeated restarts

  • val11/val11-gate-check.txt reports phase=published, confirming a plan in the gate-wait state is not lost or corrupted by a CP kill — the operator’s pending gate decision survives

  • val11/val11-stuck-check.txt reports stuck_count ≥ 3 and diagnosis_ok=true, confirming the stuck-detection surface correctly identifies the device-unresponsive proxy plans and populates operator-visible diagnosis strings

  • val11/val11-corrupt-check.txt reports create_code=201 and get_ok=true with plan.metadata.id=val11-corrupt-1, confirming the CP accepts and stores plans with unconventional artifact references without rejecting them at ingestion (validation is the edge agent’s responsibility)

  • val11/val11-rollback-corrupt-check.txt reports exit_ok=true, confirming the operator can roll back a suspect plan regardless of its artifact metadata

  • val11/val11-cascade-check.txt reports cascade_ok=3  cascade_fail=0, confirming all 3 device-unresponsive proxy plans are recoverable via retry in a single operator pass

  • val11/val11-audit-check.txt reports rollback_executed_success_count ≥ 1 with actor/start-time scope, confirming the audit capture pipeline is not disrupted by CP kill/restart cycles — events emitted during chaos recovery sessions are retained in the shared audit store

  • val11/val11-report.txt reports pass=10 fail=0 total=10 summary

  • val11/val11-report.json contains pass_count=10 and per-check status values

The HA failover surface should also show (timing sketch after the list):

  • val13/val13-node1-s0.log contains acquired leadership, confirming node-1 won the initial leader election via advisory lock Campaign

  • val13/val13-01-node1-status.txt contains cp-val13-node1 confirming that node-1 reports itself as the active leader at baseline

  • val13/val13-03-failover-timing.txt reports failover_ms=<N> where N ≤ 5000 is the measured latency from SIGTERM to node-2 logging “acquired leadership”

  • val13/val13-04-data-probe.txt contains exactly pre-kill, confirming the shared PostgreSQL instance retained the probe row across leader failover

  • val13/val13-06-quorum-lost.json contains "quorum_health":"lost", confirming the quorum monitor detects PG unavailability within the polling window

  • val13/val13-07-quorum-healthy.json contains "quorum_health":"healthy", confirming quorum health returns after the explicit docker start recovery step

  • val13/val13-08-rapid-summary.txt reports all three cycle timings ≤ 5000 ms, and val13-08-post-cycle-status.txt plus val13-08-data-after-rapid.txt confirm the rapid cycles end with a stable leader and intact probe row

  • val13/val13-09-sigkill-timing.txt reports failover_ms=<N> where N ≤ 5000 after SIGKILL (no graceful Resign), validating that advisory lock release via TCP RST is fast enough to meet the HA readiness threshold

  • val13/val13-report.txt reports pass=10 fail=0 total=10

  • val13/val13-report.json contains pass_count=10, sigterm_failover_ms, sigkill_failover_ms, and rapid_cycle_ms array values as measurement evidence
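
A minimal sketch of the SIGTERM timing measurement; NODE1_PID is a placeholder for the leader's process ID:

```bash
t0=$(date +%s%3N)
kill -TERM "$NODE1_PID"                      # placeholder PID variable
until grep -q 'acquired leadership' val13/val13-node2-s0.log; do
  sleep 0.05
done
t1=$(date +%s%3N)
echo "failover_ms=$((t1 - t0))  signal=TERM" > val13/val13-03-failover-timing.txt
```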

The replication lag baseline surface should show (lag-probe sketch after the list):

  • val14/val14-01-replication.txt contains streaming in the state column of pg_stat_replication, confirming the standby is connected and receiving WAL

  • val14/val14-02-idle-lsn-gap.txt contains 0, confirming no unacknowledged WAL at rest

  • val14/val14-03-light-result.txt shows light_drain_ms ≤ 2000 ms for the 100-row × 500-byte write workload

  • val14/val14-04-heavy-result.txt shows heavy_drain_ms ≤ 5000 ms for the 500-row × 2000-byte (~1 MB) write workload

  • val14/val14-07-quorum-degraded.json shows quorum_health=degraded after docker stop val14-pg-standby

  • val14/val14-08-quorum-healthy.json shows quorum_health=healthy after docker start val14-pg-standby

  • val14/val14-09-offline-write.txt shows the write burst issued while the standby was offline, and val14/val14-09-catchup-result.txt shows ok=true and catchup_drain_ms ≤ 10000 ms after standby restart

  • val14/val14-report.txt shows pass=10  fail=0  total=10 and lists derived threshold values: healthy_thresh_ms, degraded_thresh_ms, alert_thresh_ms

  • val14/val14-report.json contains pass_count=10, idle_p95_ms, light_p95_ms, heavy_p99_ms, light_drain_ms, heavy_drain_ms, healthy_thresh_ms, degraded_thresh_ms, and alert_thresh_ms as measurement and threshold evidence
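
A minimal lag-probe sketch against the primary; the connection URL is a placeholder:

```bash
psql "$VAL14_PRIMARY_URL" -Atc "
  SELECT state,
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)      AS lsn_gap_bytes,
         COALESCE(EXTRACT(EPOCH FROM write_lag) * 1000, 0)::int AS write_lag_ms
  FROM pg_stat_replication;"
```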

The backup/restore validation surface should show (checksum sketch after the list):

  • val15/val15-01-backup-create.txt contains created  backup_id=backup-val15-a with non-empty checksum= and positive size= values

  • val15/val15-02-backup-toc.txt exits without error and contains TABLE DATA entries for val15_small and val15_medium

  • val15/val15-03-metadata-check.txt contains backup_id=backup-val15-a status=completed

  • val15/val15-04-checksum-verify.txt shows cli_checksum and file_checksum fields with matching 64-character hex strings

  • val15/val15-05-backup-timing.txt shows backup_ms ≤ 30,000

  • val15/val15-06-data-check.txt contains 100|1000|t|t — row counts and payload spot-checks confirming tables were restored to pre-backup state

  • val15/val15-06-integrity-result.txt contains restore_correct=true small=100 medium=1000

  • val15/val15-07-restore-timing.txt shows restore_ms ≤ 60,000

  • val15/val15-08-inventory-check.txt contains multi_backup_count=2

  • val15/val15-09-restore-no-confirm.txt shows an error message about missing --confirm; the CLI must exit non-zero

  • val15/val15-10-audit-check.txt contains created_events=N restored_events=M with both N ≥ 1 and M ≥ 1

  • val15/val15-report.txt shows pass=10  fail=0  total=10

  • val15/val15-report.json contains pass_count=10, backup_ms, and restore_ms as baseline timing evidence
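
A minimal sketch of the checksum cross-check, assuming the CLI reports a SHA-256 hex digest in its checksum= field:

```bash
cli=$(grep -o 'checksum=[0-9a-f]*' val15/val15-01-backup-create.txt | cut -d= -f2)
file=$(sha256sum val15/backups/backup-val15-a.dump | cut -d' ' -f1)
{
  echo "cli_checksum=$cli"
  echo "file_checksum=$file"
  echo "match=$([ "$cli" = "$file" ] && echo true || echo false)"
} > val15/val15-04-checksum-verify.txt
```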

The split-brain chaos validation surface should show:

  • val16/val16-01-baseline.json contains "risk": "none" before any injection

  • val16/val16-02-detected.json contains "risk": "detected" after epoch divergence injection

  • val16/val16-03-detect-repeat.json also contains "risk": "detected", confirming idempotency

  • val16/val16-04-recover-dry-run.txt exits without error and includes planning/recommendation output from manual-reconcile

  • val16/val16-04-risk-after-dry-run.json still contains "risk": "detected" (dry-run does not write to DB)

  • val16/val16-04-risk-check.txt reports risk_after_dry_run=detected

  • val16/val16-05-recover-execute.txt exits successfully and val16/val16-05-recovered.json contains "risk": "none" after promote-leader

  • val16/val16-06-probe-after-recovery.txt contains pre-inject, confirming user data is untouched by recovery

  • val16/val16-07-possible.json contains "risk": "possible" after ghost-node epoch injection

  • val16/val16-08-cleared.json contains "risk": "none" after resigned_at is stamped on ghost rows

  • val16/val16-09-audit-check.txt contains detected_events=N recovered_events=M with both N ≥ 1 and M ≥ 1, scoped to this slice by start-time and actor=val16-operator for the recovery event

  • val16/val16-10-final-status.json confirms holder_id contains cp-val16-node

  • val16/val16-report.txt shows pass=10  fail=0  total=10

  • val16/val16-report.json contains pass_count=10 and per-check results

The quorum loss validation surface should show (loss-timing sketch after the list):

  • val17/val17-01-baseline.json contains "quorum_health": "healthy" and "write_block_active": false before any fault

  • val17/val17-02-quorum-lost.json contains "quorum_health": "lost" after docker stop val17-pg-primary

  • val17/val17-03-loss-timing.txt shows loss_ms ≤ 30,000

  • val17/val17-04-write-block-check.txt confirms write_block_active=True and can_accept_protected_writes=False during loss

  • val17/val17-05-loss-reason.txt shows a non-empty quorum_loss_reason (e.g. "database connection unavailable")

  • val17/val17-06-quorum-recovered.json contains "quorum_health": "healthy" after docker start val17-pg-primary

  • val17/val17-07-recovery-timing.txt shows recovery_ms ≤ 30,000

  • val17/val17-08-recovery-check.txt confirms write_block_active=False, can_accept_protected_writes=True, non-empty last_lost_at and last_restored_at, and detected_loss_count ≥ 1

  • val17/val17-09-count-check.txt contains detected_loss_count=2 (second cycle confirmed) after the second recovery succeeds

  • val17/val17-10-audit-check.txt contains lost_events=N restored_events=M with both N ≥ 1 and M ≥ 1

  • val17/val17-report.txt shows pass=10  fail=0  total=10

  • val17/val17-report.json contains pass_count=10, loss_ms, and recovery_ms as baseline timing evidence against workplan ≤ 60 s target
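
A minimal loss-timing sketch; HA_URL is a placeholder for the HA server address:

```bash
t0=$(date +%s%3N)
docker stop val17-pg-primary > /dev/null
until curl -fsS "$HA_URL/v1/ha/quorum" | jq -e '.quorum_health == "lost"' > /dev/null; do
  sleep 0.5
done
t1=$(date +%s%3N)
echo "loss_ms=$((t1 - t0))" > val17/val17-03-loss-timing.txt
```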

The config migration surface should also show:

  • config-migrate/config-migrate-dry-run.txt reports the planned v0-to-v1 changes without writing any output file

  • config-migrate/config-migrate-stdout.yaml and config-migrate/config-migrate-stdout.toml show deterministic migrated output in both supported formats

  • config-migrate/config-migrated.yaml and config-migrate/config-migrated.toml show successful file output

  • config-migrate/config-migrated-in-place.yaml differs from the pre-migrate checksum while preserving the target file mode captured in the paired stat files

  • config-migrate/config-migrate-unsupported.stderr names both supported schema versions for unsupported input

  • config-migrate/config-migrate-invalid-input.stderr shows malformed input fails closed instead of producing a synthetic v1 skeleton

  • config-migrate/config-migrate-invalid-v0.stderr shows invalid migrated configs fail validation before any output is written

7. Current Evidence Bundle

Reference local run:

  • evidence/pr17-cli-audit-local-2026-03-17/README.md

  • evidence/pr18-support-bundle-local-2026-03-18/README.md

  • evidence/pr20-rbac-local-2026-03-18/README.md

  • evidence/pr22-metrics-local-2026-03-18/README.md

  • evidence/pr27-config-migration-local-2026-03-19/README.md

8. Scope Boundary

This lab proves the current PR-17 and PR-18 scope honestly:

  • canonical audit schema

  • CLI-side emission at the wired action sites

  • reproducible local evidence from real command invocations

  • support-bundle generation against a live control-plane, retained audit store, and log file

  • RBAC role create/list/assign against the local file-backed model plus retained audit capture

  • RBAC enforcement on the current read surfaces, including the server-side /v1/ha/status path and retained denial auditing

  • metrics catalog visibility and point-in-time metric queries against a live control-plane metrics endpoint

  • config migration tooling against checked-in v0 fixtures, including safe dry-run output, format selection, fail-closed invalid input handling, and atomic in-place replacement

The VAL 01 surface (Phase 8 of the cert lab) proves bounded certificate rotation with continuity for new client connections:

  • A 2-day cert for node-c.edge.local is accepted over live mTLS before rotation

  • autonomy cert rotate completes within the 300-second bound (actual: sub-second)

  • The rotated 90-day cert is accepted over live mTLS without restarting the control-plane, proving continuity for a fresh client connection using the same cert/key file paths after atomic replacement

  • The cert.rotated audit event is retained and queryable via audit query

  • Serial numbers differ before and after rotation, proving a new keypair was issued

  • All six checks captured in cert-rotation-val01-report.txt as a composite PASS/FAIL

It does not prove CA rotation, server-certificate hot reload, uninterrupted in-flight request continuity, or coordinated multi-node rotation. See the deferred coverage matrix in cert-rotation-validation.md for the exact status of each excluded area.

The VAL 02 surface (Phase 9 of the cert lab) proves consistent rejection across all five trust-chain failure categories:

  • Missing client certificate is rejected (RequireAndVerifyClientCert active)

  • Certificate from a rogue CA is rejected by chain verification — even when the CN matches a known legitimate node, proving rejection is CA-anchor-based, not CN-based

  • Expired certificate is rejected by the validity period check in Go’s chain verification

  • Revoked certificate is rejected by the VerifyPeerCertificate CRL callback

  • Wrong CA bundle on the client causes server cert verification failure, proving mTLS is bidirectional

  • A cert from the trusted CA with an unexpected CN is accepted at the TLS layer (documented in the report as right_ca_wrong_cn: expected accepted) — confirming that identity-layer authorization is RBAC-based, not cert CN-based

See cert-rejection-validation.md for the full VAL 02 validation plan, scenario matrix, pass/fail criteria, and report template.

The PR-29-followup-e surface proves cert-management RBAC coverage:

  • All six autonomy cert subcommands (issue, rotate, revoke, list, check-revocation, sync-crl) require cert:manage or, for read-only operations, cert:read

  • cert list and cert check-revocation were unguarded before PR-29-followup-e; they are now guarded with newRBACGuard().CheckAny([]string{"cert:read","cert:manage"}, ...)

  • cert:read is a new recognized permission included in the auditor predefined role; cert:manage requires a custom role

  • RBAC denial for any cert command emits auth.access.denied before returning the error; no separate audit-on-denial code is needed — rbacGuard.emitDenied() fires automatically

The PR-29-followup-d surface proves the database-backed audit query path:

  • audit_events table in the pgstore PostgreSQL schema (append-only, INV-AUDIT-01)

  • PGAuditEmitter writing records at Class 3 (best-effort) durability — write failures are logged and counted but never propagated to the audited operation

  • InitPGAuditEmitter(db) upgrading the package-level emitter to MultiEmitter (slog + PG + file) after a successful pgstore connection

  • autonomy audit query --pg-url / AUTONOMY_AUDIT_PG_URL as the primary operator query surface when PostgreSQL is available

  • autonomy audit export --pg-url for JSON and CSV export from the DB

  • autonomy audit prune --older-than Nd for operator-initiated retention enforcement against the audit_events table

  • QueryAuditEvents as a read-only function safe to run on any replica

  • file-based audit.FileEmitter preserved in parallel as the fallback / compat mode when no --pg-url is provided

It does not yet claim background retention jobs, OCSP-style online status queries, or multi-tenant audit isolation.
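
A hedged sketch of the DB-backed operator flow, assuming a reachable PostgreSQL at a placeholder URL (credentials and retention window illustrative):

export AUTONOMY_AUDIT_PG_URL='postgres://autonomy:secret@127.0.0.1:5432/autonomy'
autonomy audit query --pg-url "$AUTONOMY_AUDIT_PG_URL"
autonomy audit export --pg-url "$AUTONOMY_AUDIT_PG_URL" --format csv
autonomy audit prune --older-than 90d   # retention enforcement on audit_events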

The VAL03 surface (slice 14) proves RBAC permission enforcement across a 14-check matrix covering all three enforcement claims:

  • VAL03-C1 (unauthorized blocked): ha status (fleet:read), audit query (audit_history:read), and rbac role create (rbac:manage) are each denied for identities whose role does not include the required permission, with auth.access.denied emitted before any network call

  • VAL03-C2 (authorized succeeds): all three permissions are exercised on the allow path: fleet:read for operator, analyst, and auditor; audit_history:read for auditor; rbac:manage for auditor. The VAL03 identities are mirrored into the HA helper’s server-side RBAC store so the ha status allow-path checks exercise both client-side and server-side authorization. If the HA helper is unavailable, the three ha status allow-path checks are recorded as SKIP, not PASS

  • VAL03-C3 (unguarded unrestricted): rbac role list, rollout plan list, and support-bundle generate succeed or fail for non-RBAC reasons regardless of the operator’s assignment, confirming those commands have no guard

  • VAL03-C4 (denial audit visibility): the retained audit query is narrowed to the current VAL03 time window and must contain the five expected actor/action/permission denial tuples from this slice itself

VAL03 covers representative commands from the guarded surface; it does not re-exercise the bootstrap, break-glass, opt-out, or cert RBAC paths already covered by run_rbac_enforcement_lab and run_cert_rbac_lab. See rbac-enforcement-validation.md for the full VAL03 validation plan, guard coverage map, pass/fail criteria, and report template.
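
A minimal sketch of a C4-style denial check against the retained store (the real slice narrows to the VAL03 time window; the pattern here is deliberately coarse):

autonomy audit query --limit 0 --output json \
  | grep auth.access.denied | grep -c 'rbac:manage'
# One of the five expected actor/action/permission tuples; repeat per tuple.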

The VAL04 surface (slice 15) proves audit completeness across a 10-check matrix covering all four completeness claims:

  • VAL04-C1 (store populated): the retained file-backed store is non-empty and all 6 audit categories contain at least one record after all prior lab phases have run

  • VAL04-C2 (schema complete): every audit category’s records contain all 6 mandatory fields (event_name, category, action, outcome, source, timestamp), confirming no field is silently dropped by any emit call

  • VAL04-C3 (queryable within latency bound): a full retained-store query with --limit 0 --output json succeeds and completes in ≤ 2000 ms, confirming operational usability at lab corpus sizes

  • VAL04-C4 (event-type coverage): all 25 defined wired event types appear in the retained store; absent events are listed explicitly in val04-coverage-report.txt and any absence is a validation failure

VAL04 validates against the 25 wired event types. The 6 deferred event types (rollout.gate.approved, rollout.recovered, rollout.stuck.detected, auth.login.succeeded, auth.login.failed, relay.deadletter.inspected) are excluded — their absence is expected and correct. See audit-completeness-validation.md for the full VAL04 validation plan, wired event inventory, pass/fail criteria, and report template.
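
The C2 field check is mechanical; a minimal sketch, assuming python3 is available and that the query output parses as a JSON array of records:

autonomy audit query --limit 0 --output json > audit.json
python3 - <<'EOF'
import json
# The 6 mandatory fields every retained record must carry (VAL04-C2).
fields = {"event_name", "category", "action", "outcome", "source", "timestamp"}
records = json.load(open("audit.json"))
bad = [r for r in records if not fields.issubset(r)]
print("PASS" if not bad else "FAIL: %d record(s) missing fields" % len(bad))
EOF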

The VAL05 surface (slice 16) proves OTel integration across a 9-check matrix covering all four integration claims:

  • VAL05-C1 (Prometheus metrics): the control-plane /metrics endpoint returns HTTP 200, all 4 required metric families are present, and cp_http_requests_total / cp_rollout_plans_total have non-zero values after lab traffic — confirming that Prometheus instrumentation is wired and receiving real observations

  • VAL05-C2 (WAL durability): events emitted via telemetry.NewEmitter are persisted to the local WAL and readable by telemetry status and telemetry export; this is validated with an isolated test WAL populated by telemetry_emit_helper (a small lab binary), because no CLI command exists to emit adapter-side telemetry events directly

  • VAL05-C3 (JSONL export): telemetry export --out produces non-empty JSONL with all mandatory event fields (kind, ts, seq, written_at, attrs) — confirming the offline pipeline can surface events to downstream consumers that do not use OTLP

  • VAL05-C4 (correlation ID propagation): trace_id / span_id set at emit time survive through the WAL → JSONL path (as trace_id / span_id) and the WAL → OTLP flush path (as traceId / spanId in the OTLP log record), confirming that the custom OTLP encoding correctly propagates correlation context for consumers such as Jaeger and Grafana Tempo

VAL05 validates the two implemented observability paths (Prometheus metrics and WAL/OTLP events). It does not validate the OTel Go SDK (not used), automatic traceparent header extraction (not implemented), slog trace context injection (not implemented), or the edge Prometheus metrics (edge process not started by this lab). See otel-integration-validation.md for the full VAL05 validation plan, architecture notes, pass/fail criteria, and report template.
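
The C1 metrics probe reduces to a scrape-and-grep; a minimal sketch with a placeholder control-plane address:

CP=http://127.0.0.1:18990   # placeholder; use your lab's control-plane port
curl -fsS "$CP/metrics" | grep -E '^cp_(http_requests_total|rollout_plans_total)'
# Expected: both metric families present with non-zero values after lab traffic.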

The VAL06 surface (slice 17) proves support-bundle correctness across a 10-check matrix covering all four bundle claims:

  • VAL06-C1 (generation succeeds): support-bundle generate exits 0 and produces a non-empty .tar.gz archive within 30 seconds — confirming the collector pipeline runs to completion and the archive-write path is functional at lab corpus sizes

  • VAL06-C2 (diagnostic coverage): the archive contains all three always-present core files (manifest.json, system_info.json, build_info.json) and manifest.json records all 6 collector names regardless of their individual outcome; system_info.json contains all 5 required runtime fields; audit_recent.json has at least 1 record from the retained store, proving end-to-end connectivity between the bundle and the audit subsystem

  • VAL06-C3 (secrets redacted): config_redacted.yaml replaces the known test fleet_salt with <REDACTED> and the postgres URL password with REDACTED; original values are verified absent; no PEM block (-----BEGIN) appears anywhere in the extracted archive

  • VAL06-C4 (graceful degradation): generating a bundle with a non-existent --orchestrator-url exits 0 and manifest.json records ha_status as "failed", confirming the non-fatal collector pattern is preserved for optional data sources

VAL06 validates the CLI surface and archive structure. It does not test bundle ingestion by external tools, RBAC guarding of the command (confirmed unguarded by VAL03-C3), per-field value correctness of system_info.json, or the DB-backed audit path (requires a live PostgreSQL instance). See support-bundle-validation.md for the full VAL06 validation plan, bundle architecture, collector status definitions, pass/fail criteria, and report template.
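
The C3 secret scan is reproducible on any extracted bundle (archive name illustrative):

mkdir -p bundle && tar -xzf support-bundle.tar.gz -C bundle
if grep -r -- '-----BEGIN' bundle/ >/dev/null; then
  echo 'FAIL: PEM material present in archive'
else
  echo 'PASS: no PEM blocks anywhere in the extracted archive'
fi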

The VAL07 surface (slice 18) establishes a rollout latency baseline across a 9-check matrix covering all four latency claims:

  • VAL07-C1 (control plane reachable): a dedicated fresh control-plane instance starts on 127.0.0.1:18992 with an isolated SQLite data directory and responds to GET /v1/health with 200, establishing a clean starting point for the benchmark

  • VAL07-C2 (plan-create latency): 20 sequential POST /v1/rollouts requests are timed with curl -w '%{time_total}' and percentiles are computed in Python; p50 ≤ 100 ms, p95 ≤ 300 ms, and p99 ≤ 500 ms prove the primary workplan target (rollout plan creation < 500ms p99) is met in the local-SQLite environment

  • VAL07-C3 (plan-list latency): 20 sequential GET /v1/rollouts requests against a store containing 20 plans; p99 ≤ 500 ms proves the read path is within the same bound after realistic state accumulation

  • VAL07-C4 (concurrent responsiveness): 5 parallel POST /v1/rollouts requests all return 2xx and complete within a 2000 ms wall clock, proving that the single-writer SQLite connection serialises concurrent creates without returning errors or making the operator API unacceptably slow

VAL07 is a local-lab latency baseline. The bounds (100/300/500 ms) are generous for in-process loopback SQLite and are designed to detect regressions (e.g. an accidentally synchronous fsync, a missing index on the list path) rather than to measure production PostgreSQL performance. Benchmark methodology, sample size rationale, and environment assumptions are documented in rollout-latency-validation.md.
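
A minimal sketch of the C2 timing method against the VAL07 instance (the request body shape is illustrative; the real slice computes p50/p95/p99 in Python):

for i in $(seq 1 20); do
  curl -s -o /dev/null -w '%{time_total}\n' \
    -X POST http://127.0.0.1:18992/v1/rollouts \
    -H 'Content-Type: application/json' \
    -d '{"name":"bench-'"$i"'"}'
done | sort -n > times.txt
tail -n 1 times.txt   # with n=20 the max is a conservative stand-in for p99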

The VAL08 surface (slice 19) validates concurrent fleet rollout throughput across a 10-check matrix covering all four throughput claims:

  • VAL08-C1 (N=100 zero errors): 100 concurrent workers each creating 5 plans produce zero errors, proving the workplan target (≥100 concurrent device rollouts) is met in the local-SQLite environment

  • VAL08-C2 (durable storage): all 805 created plans (across four concurrency tiers) appear across paginated GET /v1/rollouts?limit=100 responses, confirming the serialised SQLite writer commits every plan before returning 201

  • VAL08-C3 (wall-clock bound): the N=100 scenario (500 plans) completes within 30 seconds, establishing a safe upper bound for operator-facing throughput at design-partner fleet sizes

  • VAL08-C4 (no concurrency regression): throughput at N=100 is ≥ throughput at N=1, confirming that the single-writer SQLite connection serialises concurrent creates without causing a performance regression

VAL08 validates the control-plane write path under concurrent load at lab scale. It does not test PostgreSQL backend throughput (requires a live PG instance), network-constrained relay delivery, edge-agent reconciliation latency, or fleet sizes beyond 1,000 devices (the proposed workplan maximum). Scenario matrix design, SQLite serialisation notes, and pass/fail thresholds are documented in rollout-throughput-validation.md.
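
A scaled-down sketch of the concurrency pattern (5 workers of 5 creates instead of 100×5; endpoint and body illustrative):

for w in $(seq 1 5); do
  (
    for i in $(seq 1 5); do
      curl -sf -o /dev/null -X POST http://127.0.0.1:18992/v1/rollouts \
        -H 'Content-Type: application/json' \
        -d '{"name":"w'"$w"'-p'"$i"'"}' || echo "error: w$w p$i"
    done
  ) &
done
wait   # zero "error:" lines is the C1-style pass condition at this scale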

The VAL09 surface (slice 20) validates stuck rollout detection and recovery across a 10-check matrix covering all four stuck-detection claims:

  • VAL09-C1 (detection accuracy): plans in active phases with updated_at older than the threshold appear in GET /v1/rollouts/stuck with non-empty diagnosis strings — validated by detecting all 5 test plans after a 4-second sleep against a 3-second threshold

  • VAL09-C2 (exclusion correctness): plans in paused or terminal phases are excluded from stuck detection regardless of updated_at staleness — validated by pausing plan-b and cancelling plan-c, then confirming both are absent from subsequent scans

  • VAL09-C3 (retry recovery): recover strategy=retry transitions the plan to active and refreshes updated_at, removing it from future stuck scans — validated by the plan-d flow and confirmed at VAL09-10

  • VAL09-C4 (rollback recovery): recover strategy=rollback transitions the plan to the rolled_back terminal phase, removing it from all subsequent scans — validated by the plan-e flow and confirmed at VAL09-10

VAL09 validates the detection and recovery surfaces against lab-scale plans in published phase with no edge-agent activity. It does not test automatic periodic stuck scanning (not yet implemented), stuck detection across HA replicas, the skip_failed recovery strategy (which requires a plan with an open stage in stage_in_progress phase), or the rollout.stuck.detected audit event path (slog-only; not yet wired to the retained audit store). Staleness injection method, diagnosis logic, and scenario design are documented in stuck-detection-validation.md.
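
The staleness-injection pattern behind C1, sketched against the stuck endpoint (the slice configures a 3-second threshold; how the threshold is set in your run may differ):

# Create and publish plans, then let updated_at go stale past the threshold.
sleep 4
curl -s http://127.0.0.1:18992/v1/rollouts/stuck | grep -c diagnosis
# Expected: every active-phase stale plan listed with a non-empty diagnosis.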

The VAL10 surface (slice 21) validates rollback reliability across a 10-check matrix covering all four rollback claims:

  • VAL10-C1 (preview coverage): rollback preview exits 0 for all four target kinds (rollout_plan, rollout_stage, ha_leader_resign, relay_deadletter), producing safety class, trigger conditions, and known limitations

  • VAL10-C2 (retry success rate): rollback execute strategy=retry on real rollout plans succeeds with 100% success rate across a batch of 5 executions, with each plan transitioning from published to active

  • VAL10-C3 (rollback success rate): rollback execute strategy=rollback on real rollout plans succeeds with 100% success rate across a batch of 5 executions, with each plan transitioning from published to rolled_back

  • VAL10-C4 (aggregate rate + audit): aggregate rate across all 10 executes is ≥ 99%; rollback.executed audit events with outcome=success are captured in the retained store

VAL10 validates the CLI execute path and the workplan ≥99% target against the local-SQLite control-plane. It does not test skip_failed, execution of ha_leader_resign (preview-only in VAL10), automatic rollback, the 30-day soak (workplan GA gate), or the PostgreSQL backend. Success rate formula, JSON field handling, and evidence structure are documented in rollback-reliability-validation.md.

The VAL11 surface (slice 22) validates operator-surface resilience under representative chaos conditions across a 10-check matrix covering all four chaos claims:

  • VAL11-C1 (kill → client error + no silent data loss): after a CP SIGTERM, client CLI requests exit non-zero with a connection error keyword, and the full pre-kill plan corpus is present after restart

  • VAL11-C2 (gate-wait survival): a plan in published phase (gate-wait state) retains its phase across the CP kill boundary, confirming the operator’s pending gate decision is not lost

  • VAL11-C3 (rapid restart resilience): three successive kill+restart cycles do not corrupt the store; new plan creates succeed after the final restart, confirming the write path is operational after repeated restarts

  • VAL11-C4 (diagnostic and recovery surfaces functional post-chaos): stuck detection, artifact corruption proxy queries, rollback execute, and audit capture all function correctly in and around chaos injection windows

VAL11 validates the control-plane durability and operator-surface resilience using process-level SIGTERM injection only. It does not test iptables-level network partitions (requires root), concurrent creates under kill (inherently racy; replaced by deterministic rapid-restart), SIGKILL WAL recovery, PostgreSQL backend chaos, edge-agent reconnect after partition, or automatic stage promotion under chaos. Chaos mechanism rationale, safety guardrails, and scenario design are documented in chaos-validation.md.

VAL12 — Fleet Rollout 30-Day Soak is the workplan Gate D long-duration framework and is not a slice of this runner. The existing run_cli_audit_lab.sh is a synchronous single-shot evidence collector; the 30-day soak requires persistent infrastructure, scheduled round execution (cron every 30 minutes), rolling evidence windows, daily aggregation, and a final pass/fail report.

The soak is driven by three separate scripts:

  • scripts/labs/run_soak_val12_setup.sh — one-time environment provisioning; starts a persistent CP at 127.0.0.1:19000, writes config.env, and runs the first workload round to verify VAL12-01 (framework provisioned) and VAL12-02 (initial round zero errors)

  • scripts/labs/run_soak_val12_round.sh — single workload round called by cron every 30 minutes; creates 10 plans, runs stuck scan + auto-recovery, scrapes Prometheus metrics, and writes round-summary.json to $SOAK_DIR/rounds/YYYY-MM-DD/round-HHMMSS/

  • scripts/labs/run_soak_val12_report.sh — daily summary and final report aggregator; reads all round summaries, computes rollback rate, P99 latency, and CP uptime, and checks all 10 VAL12 thresholds

The soak satisfies four claims over 30 days: ≥100 concurrent plans sustained (VAL12-C1), ≥99% rollback success rate (VAL12-C2), P99 ≤500ms maintained under accumulated store state (VAL12-C3), and CP availability ≥99.9% (VAL12-C4). The Gate D minimum-acceptable pass is VAL12-03 (fleet target reached) + VAL12-10 (30-day aggregate rollback rate ≥0.990). Soak environment design, workload schedule, alert thresholds, evidence retention plan, and the final report template are documented in soak-validation.md.
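
A hedged crontab sketch for the cadence described above (SOAK_DIR and repo path illustrative; the report time is arbitrary):

*/30 * * * * SOAK_DIR=/var/lib/autonomy-soak bash /repo/scripts/labs/run_soak_val12_round.sh
0 6 * * *    SOAK_DIR=/var/lib/autonomy-soak bash /repo/scripts/labs/run_soak_val12_report.sh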

VAL13 — HA Failover Validation is slice 23 of this runner (run_ha_failover_val13_lab). It extends the existing HA lab infrastructure in run_cli_audit_lab.sh with a dedicated function using fresh Docker containers (val13-pg-primary, val13-ha-net) and isolated ports (18997/18998) to avoid interference with the backup/restore and quorum labs.

VAL13 validates four HA readiness claims:

  • VAL13-C1 (SIGTERM failover latency): leader election completes within 5 seconds of SIGTERM on the current leader, measured end-to-end from kill signal to follower logging “acquired leadership”

  • VAL13-C2 (zero data loss): rows written directly to the shared PostgreSQL instance while the original leader held the advisory lock remain readable from that same instance after failover

  • VAL13-C3 (PG crash recovery): quorum monitor detects PostgreSQL unavailability (quorum_health=lost) and returns to quorum_health=healthy after explicit operator restart of PostgreSQL

  • VAL13-C4 (unplanned crash / disk-fault proxy): SIGKILL on the leader (no graceful Resign(), simulating OOM or disk crash) results in advisory lock release via TCP RST and a new leader within 5 seconds

VAL13 does NOT validate: streaming-replication failover (covered by run_ha_lab() and run_quorum_lab()), iptables-based network partitions (requires root), concurrent writes under kill, SIGKILL on PostgreSQL (WAL recovery path; covered by VAL13-09 indirectly via the same single-node PG), multi-region or cross-datacenter failover, or automatic rollback trigger under HA failure. Scenario design, measurement method, and pass/fail criteria are documented in ha-failover-validation.md.

VAL14 — HA Replication Lag Baseline benchmarks PostgreSQL streaming replication lag under the autonomyops HA architecture and derives practical alerting thresholds from observed data. A val14-pg-primary + val14-pg-standby pair is provisioned via pg_basebackup, and a single HA server at port 18999 uses --min-sync-replicas 1 so quorum health tracks standby availability. VAL14 proves these workplan claims:

  • VAL14-C1 (replication streaming): a PostgreSQL streaming-replication standby is established, confirmed by pg_stat_replication.state = streaming and an LSN gap of 0 at rest

  • VAL14-C2 (light load drain ≤ 2 s): after a 100-row × 500-byte write batch committed with synchronous_commit=off, lag is sampled during the active drain window and the WAL LSN gap drains to 0 within 2000 ms on local Docker

  • VAL14-C3 (heavy load drain ≤ 5 s): after a 500-row × 2000-byte (~1 MB) write batch committed with synchronous_commit=off, the LSN gap drains within 5000 ms

  • VAL14-C4 (threshold derivation): practical alerting thresholds (healthy, degraded, alert) are derived from observed p95 lag using the formula healthy = max(p95×3+1, 10), degraded = max(healthy×10, 100), alert = max(healthy×50, 500), anchoring monitoring configuration to measured behaviour

VAL14 does NOT validate: write-path lag through the HA server HTTP surface (writes go directly to PostgreSQL), streaming-replication switchover or promotion (covered by run_ha_lab() and run_quorum_lab()), PG logical replication, multi-standby topologies, network-partition-induced lag (requires iptables/root), or production-scale throughput on cloud storage. Benchmark design, analysis method, and threshold derivation formula are documented in ha-replication-lag-validation.md.
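
The lag measurement behind C2/C3 can be sampled directly on the primary; a minimal sketch (the psql user is illustrative):

docker exec val14-pg-primary psql -U postgres -Atc \
  "SELECT state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;"
# state=streaming with a diff of 0 is the at-rest C1 condition; during a drain
# window, resample until the gap returns to 0 and record the elapsed time.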

VAL15 — Backup/Restore Validation proves the ha backup create/list/restore CLI workflow end-to-end, including integrity verification, timing bounds, and safety-gate enforcement. A dedicated Docker PostgreSQL instance with two fixture tables (~1 MB total) is provisioned for isolation. The HA server is cycled through normal -> maintenance -> normal modes to match the real operator workflow. VAL15 proves these workplan claims:

  • VAL15-C1 (backup file integrity): ha backup create produces a valid pg_dump custom-format archive; the SHA-256 checksum recorded in the inventory matches an independently computed hash of the produced file

  • VAL15-C2 (restore correctness): after post-backup mutations (UPDATE + DELETE), ha backup restore reverts the database to its pre-backup state; row counts and spot-check payload values are verified by SQL assertion

  • VAL15-C3 (timing bounds): backup completes in ≤ 30 s; restore completes in ≤ 60 s on local Docker (conservative thresholds that flag hangs or permission errors without constraining normal operation)

  • VAL15-C4 (safety gate): ha backup restore without --confirm exits non-zero, confirming the mandatory confirmation flag prevents accidental restores

VAL15 does NOT validate: backup rotation / retention policy (no implementation), cross-PG-version restore compatibility, concurrent writes during backup, backup storage to remote object stores, disaster recovery runbook execution timing, or automatic scheduled backups. Fixture strategy, checksum method, and test sequence are documented in backup-restore-validation.md.
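
The C1 integrity check reduces to an independent hash comparison (backup path illustrative; the recorded checksum comes from the backup inventory):

sha256sum backups/autonomy-backup.dump   # independently computed hash
autonomy ha backup list                  # inventory with the recorded SHA-256
# The two digests must match byte-for-byte for VAL15-C1 to pass.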

VAL16 (run_split_brain_chaos_val16_lab) injects inconsistent leadership state directly via SQL into the leadership_state and leader_epochs tables to trigger risk=detected (epoch divergence + holder mismatch) and risk=possible (unclosed ghost-node epoch rows). A user-table probe row (val16_probe) is checked post-recovery to confirm recovery does not affect data beyond the leadership metadata tables. VAL16 proves these workplan claims:

  • VAL16-C1 (split-brain detection): epoch divergence and ghost-node conditions are reliably detected and reported as risk=detected / risk=possible respectively via the /v1/ha/split-brain API and ha split-brain detect CLI

  • VAL16-C2 (recovery correctness): ha split-brain recover --strategy promote-leader clears risk=detected and returns the cluster to risk=none without corrupting user data

  • VAL16-C3 (dry-run safety): manual-reconcile exits 0 and does not write to the database, confirming operators can plan a recovery before committing

  • VAL16-C4 (self-clearing ghost nodes): unclosed epoch rows that are subsequently resigned clear automatically, returning the cluster to risk=none without operator intervention

VAL16 does NOT validate: real network partitions (requires iptables/root), genuine two-node concurrent-write split-brain, automatic rollback under detected split-brain, multi-region scenarios, or HA server binary restart-triggered epoch divergence. Scenario design, injection SQL, and safety rationale are documented in split-brain-chaos-validation.md.
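
The operator-facing detect/recover pair as exercised by the slice (risk values from the claims above):

autonomy ha split-brain detect
# risk=detected after the epoch-divergence injection
autonomy ha split-brain recover --strategy promote-leader
autonomy ha split-brain detect
# risk=none, with the val16_probe row verified untouched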

VAL17 (run_quorum_loss_val17_lab) exercises the QuorumMonitor’s healthy -> lost -> healthy cycle with timed measurements and write-blocking assertions. A single PG instance runs with --min-sync-replicas 0 and --quorum-monitor-interval 500ms; loss is induced by docker stop, recovery by docker start. VAL17 proves these workplan claims (Gap HA-004):

  • VAL17-C1 (loss detection timing): quorum_health=lost is detected within ≤ 30,000 ms (30 s, well under the workplan’s 60 s target) of docker stop with a 500 ms monitor interval

  • VAL17-C2 (write safety during loss): write_block_active=true and can_accept_protected_writes=false are confirmed in the quorum status JSON during the loss window, proving the WriteGate middleware is engaged

  • VAL17-C3 (recovery detection timing): quorum_health=healthy is detected within ≤ 30,000 ms after PostgreSQL is restored with docker start, confirming timely recovery detection once the dependency returns

  • VAL17-C4 (monitor history correctness): last_lost_at, last_restored_at, and detected_loss_count are correctly populated and increment across repeated cycles, confirming the QuorumMonitor’s state-change tracking is reliable

VAL17 does NOT validate: the healthy -> degraded path (covered by run_quorum_lab() with --min-sync-replicas 1), network-partition-induced quorum loss (iptables, requires root), HTTP write-gating via rollout endpoint (HA server binary does not expose /v1/rollouts), or multi-region topologies. Timing method, threshold rationale, and scenario design are documented in quorum-loss-validation.md.
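
A hedged sketch of the loss-detection timing method, assuming GNU date, that the CLI output carries the quorum_health field, and an illustrative container name:

docker stop val17-pg
start=$(date +%s%3N)
until autonomy ha quorum status | grep -q 'quorum_health.*lost'; do sleep 0.5; done
echo "loss detected after $(( $(date +%s%3N) - start )) ms"   # expect ≤ 30,000
docker start val17-pg
until autonomy ha quorum status | grep -q 'quorum_health.*healthy'; do sleep 0.5; done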

VAL18 — HA 30-Day Soak is the workplan Gate D long-duration HA framework and is not a slice of this runner. The 30-day lifecycle — with persistent Docker infrastructure, cron-scheduled round execution, PID tracking across invocations, node restart recovery, and progressive report generation — cannot be expressed as a run_cli_audit_lab.sh function. The same reasoning applies as for VAL12 (the fleet soak).

The soak is driven by three separate scripts:

  • scripts/labs/run_soak_val18_setup.sh — one-time environment provisioning; builds orchestrator_ha_server + autonomy binaries, starts a persistent Docker PostgreSQL instance (val18-pg-primary, host port 5488) for a stable 30-day connection address, creates the val18_probe table, starts HA node1 (19010) + node2 (19011) as background processes, writes config.env, and runs the first health round to verify VAL18-01 (framework provisioned) and VAL18-02 (initial round success). On a normal rerun it reuses the persistent Docker volume/container instead of deleting 30-day soak state

  • scripts/labs/run_soak_val18_round.sh — single HA health round called by cron every 2 hours; checks PG + HA node liveness (restarts dead processes), identifies the current leader from /v1/ha/status holder_id, checks quorum health + probe row, triggers a timed SIGTERM failover every SOAK_FAILOVER_INTERVAL_ROUNDS rounds (polls follower every 50 ms for holder_id change, measures failover_ms, restarts killed node as follower, verifies probe row on new leader), and writes round-summary.json to $SOAK_DIR/rounds/YYYY-MM-DD/round-HHMMSS/

  • scripts/labs/run_soak_val18_report.sh — daily summary and checkpoint/final report aggregator; reads all round-summary.json files, computes HA uptime%, failover count, p50/p95/p99 failover_ms, and data continuity rate, and checks all VAL18 thresholds with a Gate D HA assessment. The report uses separate failover-count thresholds for the 7-day checkpoint (>= 1) and the 30-day final Gate D check (>= 3)

The soak satisfies four claims over 30 days: ≥ 3 scheduled leader failovers sustained (VAL18-C1), failover_ms ≤ 10,000 on every failover (VAL18-C2), probe row accessible after every failover (data continuity rate = 1.0, VAL18-C3), and HA uptime ≥ 99.9% (VAL18-C4). The Gate D minimum-acceptable pass requires VAL18-09 (≥ 3 total failovers) + VAL18-10 (data continuity rate = 1.000) + VAL18-07 (max failover timing maintained ≤ 10,000 ms). Soak environment design, failover schedule strategy, observability plan, evidence retention, and the final report template are documented in ha-soak-validation.md.
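
A hedged crontab sketch for the 2-hour round cadence (paths illustrative; the report aggregator additionally runs daily):

0 */2 * * * SOAK_DIR=/var/lib/autonomy-ha-soak bash /repo/scripts/labs/run_soak_val18_round.sh
30 6 * * *  SOAK_DIR=/var/lib/autonomy-ha-soak bash /repo/scripts/labs/run_soak_val18_report.sh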

VAL25 — Fleet Rollout Proof Report

VAL25 is a report generator (not a test runner) that reads the evidence produced by the five fleet rollout validation slices (VAL07–VAL11) and emits a consolidated proof report suitable for engineering leads and external reviewers.

It evaluates three readiness levels and clearly separates measured results from proposed targets:

  • Design Partner — YES with VAL07–VAL11, if all five slices pass and key targets are met; nothing further required

  • GA — NO; additionally requires the PostgreSQL backend and the VAL12 30-day soak

  • Public Production — NO; additionally requires everything above GA plus a security audit and multi-region support

Run after completing a full cli-audit-lab run:

bash scripts/labs/run_fleet_rollout_proof_report_val25.sh \
  evidence/cli-audit-lab-YYYY-MM-DD

Output is written to val25/ within the evidence directory. The formal plan, metric definitions, and readiness criteria are documented in fleet-rollout-proof-report-validation.md.

VAL26 — HA Proof Report

VAL26 is a report generator (not a test runner) that reads the evidence produced by the five HA validation slices (VAL13–VAL17) and emits a consolidated HA proof report.

It evaluates three readiness levels with measured results clearly separated from proposed targets and derived thresholds:

  • HA Design Partner — YES with VAL13–VAL17, if all five slices pass and key targets are met; nothing further required

  • HA GA — NO; additionally requires the VAL18 30-day soak and streaming-replication promotion

  • HA Public Production — NO; additionally requires everything above GA plus multi-AZ deployment and a security audit

Run after completing a full cli-audit-lab run:

bash scripts/labs/run_ha_proof_report_val26.sh \
  evidence/cli-audit-lab-YYYY-MM-DD

Output is written to val26/ within the evidence directory. The formal plan, metric definitions, and readiness criteria are documented in ha-proof-report-validation.md.

VAL28 — Cross-Cutting Proof Report

VAL28 is a report generator (not a test runner) that reads the evidence produced by six cross-cutting validation slices (VAL01–VAL06) and emits a consolidated report covering the Security, Observability, and Operations surfaces of the AutonomyOps ADK control plane.

It evaluates three readiness levels:

  • Cross-Cutting Design Partner — YES with VAL01–VAL06, if all six slices pass and key targets are met; nothing further required

  • Cross-Cutting GA — NO; additionally requires PG-backed audit performance, a production OTel collector, and no SKIP checks

  • Public Production Claim — NO; additionally requires everything above GA plus external security and compliance audits

Run after completing a full cli-audit-lab run:

bash scripts/labs/run_crosscut_proof_report_val28.sh \
  evidence/cli-audit-lab-YYYY-MM-DD

Output is written to val28/ within the evidence directory. VAL28 handles the mixed evidence formats: VAL01/VAL02 text reports are parsed via regex; VAL03–VAL06 JSON reports are schema-validated before extraction. The formal plan, metric definitions, and readiness criteria are documented in crosscut-proof-report-validation.md.

VAL29 — v1 Public-Claim Evidence Matrix

VAL29 is a meta-aggregator that reads the JSON artifacts from all four proof reports (VAL25/VAL26/VAL27/VAL28) and produces a single capability-level evidence matrix covering every v1 public claim.

The matrix assigns one of five evidence states to each claim:

  • VALIDATED — claim fully supported by completed VAL runs

  • BETA — claim supported with disclosed limitations

  • NOT_STARTED — soak framework exists; Gate D not yet run

  • DEFER — not validated; additional work required

  • FUTURE_REQUIRED — required for the Public Production Claim only

Run after all four proof reports have been generated:

bash scripts/labs/run_evidence_matrix_val29.sh \
  evidence/cli-audit-lab-YYYY-MM-DD \
  evidence/

Output is written to val29/ within the cli-audit-lab evidence directory. The formal plan and full matrix definition are documented in evidence-matrix-validation.md.