CLI Audit And Support Bundle Lab

Audience: reviewers and operators who want reproducible local evidence, captured from real command runs, for the following surfaces:

  • the PR-17 CLI and retained audit surfaces

  • the PR-18 support-bundle surface

  • the PR-19 audit query/export refinements

  • the PR-20 RBAC role model CLI work

  • the PR-21 RBAC enforcement plumbing work

  • the PR-22 metrics CLI surface work

  • the PR-24 HA quorum status surface work

  • the PR-25 split-brain detect workflow work

  • the PR-26 split-brain manual recovery workflow work

  • the PR-27 config migration tooling work

  • the PR-29-followup-a RBAC default-on hardening work

  • the PR-29-followup-c cert revocation transport enforcement work

  • the PR-29-followup-d database-backed audit persistence work

  • the PR-29-followup-e cert-management RBAC coverage completion work

This lab captures actual CLI stdout plus the structured audit lines emitted on stderr by live commands in the current branch, and then queries and exports the retained records written by those same runs:

  1. autonomy rollout plan create

  2. autonomy rollout plan publish

  3. autonomy rollout plan cancel

  4. autonomy ha backup create

  5. autonomy ha backup restore

  6. autonomy ha failover trigger

  7. autonomy cert issue

  8. autonomy cert rotate

  9. autonomy rbac role create

  10. autonomy rbac role list

  11. autonomy rbac role assign

  12. edgectl relay deadletter retry

  13. edgectl relay deadletter purge --force

  14. edgectl relay config set-bandwidth

  15. autonomy audit query

  16. autonomy audit export --format json|csv

  17. autonomy support-bundle generate

  18. Default-on RBAC bootstrap mode denial (no env var set, empty store)

  19. RBAC bootstrap seed via rbac role assign (bootstrap path)

  20. RBAC break-glass bypass (AUTONOMY_RBAC_BREAK_GLASS=1)

  21. RBAC opt-out backward compat (AUTONOMY_RBAC_ENFORCEMENT=0)

  22. Default-on RBAC-enforced autonomy audit query

  23. Default-on RBAC-enforced autonomy audit export

  24. Default-on RBAC-enforced autonomy ha status

  25. Direct /v1/ha/status requests with and without operator identity

  26. autonomy metrics list

  27. autonomy metrics query

  28. autonomy ha quorum status

  29. autonomy ha split-brain detect

  30. autonomy ha split-brain recover

  31. autonomy config migrate

  32. Cert RBAC enforcement — denied and allowed flows for all six autonomy cert subcommands under AUTONOMY_RBAC_ENFORCEMENT=1

1. Gaps Closed by This Lab

Before this lab refresh, the checked-in PR-17 evidence relied on in-package test harnesses. That left the following evidence gaps:

  • rollout, HA, and backup audit captures were not tied to a live control-plane process

  • relay audit captures were not clearly tied to the existing daemon-backed edge lab

  • the HA helper lived outside the repo under /tmp, which weakened reproducibility

  • the retained query and export surface was covered by tests only, without checked-in live evidence from the same captured audit dataset

  • support-bundle generation had no checked-in live evidence against a real control-plane, retained audit store, and log file

  • RBAC role create/list/assign had no checked-in live evidence against the retained audit store and local file-backed assignment model

  • RBAC enforcement had no checked-in live evidence proving allowed and denied behavior across CLI and direct HA-status HTTP paths

  • metrics list/query had no checked-in live evidence against a real control-plane metrics endpoint

  • HA quorum status had no checked-in live evidence for healthy, degraded, lost, and restored transitions from the same live HA lab

  • split-brain detection had no checked-in live evidence for healthy, detected, and possible states or for deduplicated ha.split_brain.detected audit emission

  • split-brain recovery had no checked-in live evidence for dry-run planning, executed recovery, post-recovery convergence, or retained ha.split_brain.recovered audit records

  • config migration tooling had no checked-in live evidence for deterministic dry-run output, format selection, file and in-place writes, or fail-closed handling of malformed and invalid legacy input

This lab closes those gaps by:

  • starting a real local SQLite-backed control-plane for rollout verification

  • running the real PostgreSQL primary + standby HA workflow for backup and failover verification

  • reusing the live daemon-backed edge relay lab for retry, purge, and bandwidth

  • building the HA helper from the repo at scripts/labs/orchestrator_ha_server.go

  • retaining audit records under the evidence directory and querying/exporting those same live records through autonomy audit query and autonomy audit export

  • generating a support bundle from the same live HA endpoint, retained audit store, and HA server log used by the rest of the lab

  • capturing live evidence for the PR-19 retained-audit refinements:

    • category, outcome, and source filtering

    • explicit invalid --output rejection

    • invalid start/end time-range rejection

    • invalid export format rejection without truncating an existing file

  • capturing live evidence for PR-20 RBAC role create/list/assign, including:

    • canonicalized role-name persistence

    • idempotent repeat assignment behavior

    • retained auth.role.assigned query proof

  • capturing live evidence for PR-21 RBAC enforcement and PR-29-followup-a default-on hardening, including:

    • default-on bootstrap mode denial with no AUTONOMY_RBAC_ENFORCEMENT set

    • bootstrap seed via rbac role assign with empty store (only allowed bootstrap action)

    • post-bootstrap outsider denial (full enforcement is active)

    • post-bootstrap allowed access after correct role is granted

    • break-glass bypass (AUTONOMY_RBAC_BREAK_GLASS=1) with mandatory auth.break_glass.used audit event

    • break-glass safety: without AUTONOMY_OPERATOR the bypass is still denied

    • opt-out backward compat (AUTONOMY_RBAC_ENFORCEMENT=0)

    • default-on denied audit query and audit export for a non-auditor (no env var)

    • default-on allowed audit query, audit export, and ha status for authorized roles

    • denied direct /v1/ha/status access without operator identity

    • retained auth.access.denied and auth.break_glass.used audit proof

  • capturing live evidence for PR-22 metrics visibility, including:

    • local metrics list text and JSON catalog output

    • live metrics query output against a real /metrics endpoint

    • exact metric-family filtering for rollout counters

    • histogram-family filtering for request-duration buckets, count, and sum

  • capturing live evidence for PR-24 quorum visibility, including:

    • healthy quorum status on a live HA leader

    • degraded quorum when the sync standby is stopped

    • lost quorum when the primary PostgreSQL container is unavailable

    • restored quorum after PostgreSQL connectivity is recovered

    • retained ha.quorum.lost / ha.quorum.restored audit records from the same live run

  • capturing live evidence for PR-25 split-brain visibility, including:

    • healthy split-brain status on the live quorum helper

    • detected split-brain after durable epoch divergence is introduced

    • possible split-brain after a ghost unclosed epoch row is injected

    • deduplicated retained ha.split_brain.detected audit records after repeated identical detected-state polls

  • capturing live evidence for PR-26 split-brain manual recovery, including:

    • manual-reconcile dry-run output against a real detected stale-leader state

    • promote-leader execution against that same live stale-leader state

    • post-recovery risk: none verification from the same helper

    • retained ha.split_brain.recovered audit records for both dry-run and executed recovery

  • capturing live evidence for PR-27 config migration tooling, including:

    • deterministic dry-run output for a full v0 fixture

    • stdout migration output in both YAML and TOML

    • file and in-place writes for migrated configs

    • fail-closed rejection of unsupported schema versions

    • fail-closed rejection of malformed input and invalid migrated configs

  • capturing live evidence for PR-29-followup-d database-backed audit persistence (when AUTONOMY_AUDIT_PG_URL or POSTGRES_URL is set), including:

    • audit_events table write via PGAuditEmitter (parallel to file emitter)

    • autonomy audit query --pg-url reading from PostgreSQL (primary query path)

    • autonomy audit export --pg-url exporting in JSON and CSV from the DB

    • autonomy audit prune --older-than Nd retention enforcement

    • post-prune query confirming deleted rows are absent

    • file-backed query remaining as fallback when no --pg-url is set

  • capturing live evidence for PR-29-followup-e cert-management RBAC coverage, including:

    • all six autonomy cert subcommands denied when operator lacks cert:manage/cert:read

    • cert list and cert check-revocation denied (newly guarded; previously unguarded)

    • cert issue, cert rotate, cert revoke, cert sync-crl denied (require cert:manage)

    • cert list and cert issue allowed after granting the appropriate cert role

    • retained auth.access.denied audit records for all denied cert operations

2. Prerequisites

  • Docker available locally

  • Go 1.25.7

  • PostgreSQL client tools available in the Docker postgres:16 image

  • openssl, curl, and xxd available locally

  • ability to bind localhost TCP ports and a Unix socket

3. Run the Lab

From the repository root:

export GOROOT=/home/ubuntu/.local/go1.25.7
export PATH="$GOROOT/bin:$PATH"
export GOTOOLCHAIN=local

bash scripts/labs/run_cli_audit_lab.sh

Optional custom evidence directory:

bash scripts/labs/run_cli_audit_lab.sh \
  "$PWD/evidence/pr17-cli-audit-local-$(date +%F)"

4. What the Runner Does

The runner performs twenty-seven real verification slices:

  1. Starts autonomy-orchestrator serve on 127.0.0.1:18888 and runs rollout create, publish, and cancel against the live HTTP API, while also exposing /metrics on 127.0.0.1:19090 for the PR-22 metrics captures.
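
    Once the orchestrator is up, both listeners can be spot-checked directly; an illustrative probe (the grep pattern is one metric family named in the later VAL05 checks):

    curl -s http://127.0.0.1:18888/v1/health
    curl -s http://127.0.0.1:19090/metrics | grep '^cp_rollout_plans_total'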

  2. Generates a local CA with openssl and runs the certificate workflow against real certificate files, including:

    • autonomy cert issue

    • autonomy cert rotate

    • autonomy cert revoke

    • autonomy cert check-revocation

    • autonomy cert sync-crl --min-sources 2 against a multi-publisher set

    • a source control-plane with --tls-crl-file

    • a second source control-plane serving the same CRL

    • a follower control-plane with repeated --tls-crl-sync-url and --tls-crl-sync-min-sources 2 proving cross-host CRL pull plus publisher agreement

    • VAL 01 — certificate rotation validation (Phase 8): issues a 2-day node-c cert for expiry-window testing, proves the old cert is accepted over live mTLS before rotation, times the atomic cert rotate to confirm it completes within 300 seconds, confirms the expiry window clears after rotation, and then connects with the new cert without restarting the control-plane — proving continuity for fresh mTLS client connections across the rotation window; also captures a cert.rotated retained audit query and a composite 6-check PASS/FAIL report

    • VAL 02 — trust-chain rejection validation (Phase 9): generates a rogue CA and an expired cert inline, then proves the control-plane consistently rejects all five rejection cases against the same live control-plane: missing client cert, invalid chain (rogue CA with legitimate-looking CN), expired cert, revoked cert (reusing the CRL from Phase 1), and wrong server trust (client-side CA bundle mismatch); captures stderr evidence for each and a composite 5-check PASS/FAIL report
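
    The multi-source CRL agreement path can be sketched in isolation; an illustrative shape only (the URLs are placeholders, and the exact serve invocation may differ from the runner's):

    # Follower pulls the CRL from two publishers and requires agreement.
    autonomy-orchestrator serve \
      --tls-crl-sync-url https://crl-source-a.local/crl.pem \
      --tls-crl-sync-url https://crl-source-b.local/crl.pem \
      --tls-crl-sync-min-sources 2

    # CLI-side equivalent of the same agreement rule.
    autonomy cert sync-crl --min-sources 2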

  3. Brings up a disposable PostgreSQL primary + streaming standby, runs the HA helper from scripts/labs/orchestrator_ha_server.go, and verifies:

    • backup create

    • backup restore in maintenance mode

    • manual failover to a second helper

  4. Runs autonomy rbac role create, list, and assign against a local file-backed RBAC store and captures the emitted auth.role.assigned audit line.

  5. Demonstrates default-on RBAC enforcement (no env var required) and verifies the bootstrap, break-glass, and opt-out paths:

    • default-on bootstrap mode denial against a fresh empty RBAC store

    • bootstrap seed via rbac role assign with no enforcement env var

    • post-bootstrap outsider denial (full enforcement)

    • post-bootstrap allowed access after correct role assignment

    • opt-out backward compat via AUTONOMY_RBAC_ENFORCEMENT=0

    • break-glass bypass with AUTONOMY_RBAC_BREAK_GLASS=1 + AUTONOMY_OPERATOR

    • break-glass safety check: no bypass without AUTONOMY_OPERATOR

    Then restarts the HA helper and verifies default-on enforcement with a populated store:

    • denied direct /v1/ha/status without operator identity

    • allowed direct /v1/ha/status with X-Autonomy-Operator

    • denied autonomy audit query and autonomy audit export for an operator lacking audit_history:read (default-on, no env var)

    • allowed autonomy audit query, autonomy audit export, and autonomy ha status for authorized identities
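
    Condensed, the three enforcement modes look like this from a shell (the operator name is a placeholder; the env vars, header, and endpoint are the ones this slice exercises):

    autonomy ha status                                          # default-on: unassigned operator denied
    AUTONOMY_RBAC_ENFORCEMENT=0 autonomy ha status              # opt-out backward compat
    AUTONOMY_RBAC_BREAK_GLASS=1 AUTONOMY_OPERATOR=oncall-1 \
      autonomy ha status                                        # bypass + auth.break_glass.used audit

    curl -s http://127.0.0.1:18090/v1/ha/status                 # denied: no operator identity
    curl -s -H 'X-Autonomy-Operator: oncall-1' \
      http://127.0.0.1:18090/v1/ha/status                       # allowed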

  6. Runs the existing daemon-backed edge relay lab with --with-bandwidth so retry, purge, and bandwidth configuration audit lines come from a live edged process.
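
    Illustrative invocations of the three relay commands (the bandwidth argument shape is an assumption, not the runner's exact value):

    edgectl relay deadletter retry
    edgectl relay deadletter purge --force
    edgectl relay config set-bandwidth 1mbit    # argument shape assumed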

  7. Queries and exports the retained audit records written by those live runs from AUTONOMY_AUDIT_DIR under the evidence bundle, including:

    • full retained query

    • category-filtered auth query

    • category-filtered rollout query

    • source-filtered edge query

    • outcome-filtered success query

    • invalid --output and invalid time-range checks

    • invalid export-format preservation of an existing output file
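
    Condensed to their shape, the query and export calls look like this (the store path is a placeholder; the --source and --outcome flag names are assumptions inferred from the filter descriptions above):

    export AUTONOMY_AUDIT_DIR="$PWD/evidence/retained-audit"    # placeholder path
    autonomy audit query --category auth --output json
    autonomy audit query --source edge --output json            # flag name assumed
    autonomy audit query --outcome success                      # flag name assumed
    autonomy audit export --format csv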

  8. Generates a live support bundle against:

    • --orchestrator-url http://127.0.0.1:18089

    • the retained audit directory under the evidence bundle

    • the active HA server log

    • a redactable config file created by the runner
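
    The generation step itself is a single command; an illustrative shape (only --orchestrator-url is shown above; the audit directory, log, and config inputs are wired in by the runner, and any flags for them are not reproduced here):

    autonomy support-bundle generate --orchestrator-url http://127.0.0.1:18089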

  9. Starts a dedicated HA quorum helper on 127.0.0.1:18091 with a fast quorum monitor interval and captures:

    • healthy status while the primary and synchronous standby are available

    • degraded status after stopping the standby

    • lost status after stopping the primary container

    • restored status after the database containers are started again

    • retained ha.quorum.lost and ha.quorum.restored audit records
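
    The transition sequence can be replayed by hand against the same helper; a sketch (container names are placeholders for the lab's disposable pair):

    curl -s http://127.0.0.1:18091/v1/ha/quorum    # healthy
    docker stop pg-standby                         # placeholder name
    curl -s http://127.0.0.1:18091/v1/ha/quorum    # degraded
    docker stop pg-primary                         # placeholder name
    curl -s http://127.0.0.1:18091/v1/ha/quorum    # lost
    docker start pg-primary pg-standby
    curl -s http://127.0.0.1:18091/v1/ha/quorum    # restored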

  10. Reuses the live quorum helper on 127.0.0.1:18091 and captures:

    • healthy split-brain status with matching local and durable epochs

    • detected split-brain after advancing leadership_state.current_epoch away from the helper’s cached epoch

    • manual-reconcile dry-run output for the detected stale-leader case

    • promote-leader execution output for the same detected stale-leader case

    • post-recovery risk: none verification after the stale elector clears its local leader claim

    • possible split-brain after inserting a second unclosed leader_epochs row

    • deduplicated retained ha.split_brain.detected audit records after repeated identical detected-state polls

    • retained ha.split_brain.recovered audit records from the same live run
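
    The detect-and-recover sequence, stripped to its essence (the commented UPDATE stands in for the divergence injection described above; the strategies are the ones this slice runs):

    curl -s http://127.0.0.1:18091/v1/ha/split-brain              # risk: none
    # inject divergence: UPDATE leadership_state SET current_epoch = current_epoch + 1;
    autonomy ha split-brain detect                                # risk: detected
    autonomy ha split-brain recover --strategy manual-reconcile   # dry-run plan only
    autonomy ha split-brain recover --strategy promote-leader     # executes recovery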

  11. Runs autonomy config migrate against checked-in v0 fixtures and captures:

    • dry-run output without writing files

    • migrated YAML and TOML stdout output

    • migrated YAML and TOML file output

    • in-place migration with before/after mode and checksum captures

    • unsupported-version rejection

    • malformed-input rejection

    • invalid-migrated-config rejection
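
    Illustrative invocations only; the flag names here are assumptions (the checked-in fixtures and the runner encode the real shapes):

    autonomy config migrate --dry-run config-v0.yaml          # flags assumed
    autonomy config migrate --format toml config-v0.yaml     # flags assumed
    autonomy config migrate --in-place config-v0.yaml         # flags assumed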

  12. Runs the database-backed audit persistence slice when AUTONOMY_AUDIT_PG_URL or POSTGRES_URL is set (skipped cleanly when absent), capturing:

    • DB-backed audit query text output (primary query path via audit_events)

    • DB-backed category-filtered query (cert category, JSON format)

    • DB-backed JSON export and CSV export from audit_events

    • audit prune --older-than 90d with no-op output (no qualifying rows)

    • audit prune --older-than 1d deleting the seeded 1h and 2h rows

    • post-prune query confirming only the most recent row remains
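
    The DB-backed path reuses the same CLI with --pg-url; a sketch (the connection string is a placeholder, and reuse of --pg-url by prune is an assumption):

    PG='postgres://audit:audit@127.0.0.1:5432/audit'    # placeholder
    autonomy audit query  --pg-url "$PG" --category cert --output json
    autonomy audit export --pg-url "$PG" --format csv
    autonomy audit prune  --pg-url "$PG" --older-than 90d    # --pg-url reuse assumed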

  13. Runs the cert RBAC enforcement slice (run_cert_rbac_lab), capturing:

    • all six autonomy cert subcommands denied for an operator without cert:manage or cert:read (confirmed by the presence of “cert:manage” in the error output)

    • cert issue allowed for cert-admin after granting a custom cert-operator role holding cert:manage

    • cert list and cert check-revocation allowed for cert-reader after granting a custom cert-reader role holding cert:read

    • autonomy audit query --category auth capturing retained auth.access.denied records from the same denied-access runs

    • file-backed fallback query captured alongside DB query for comparison

  14. Runs the RBAC permission enforcement validation slice (run_rbac_val03_lab), proving the three VAL03 claims across a 14-check matrix:

    • 5 DENY checks: unassigned and under-privileged identities are blocked with RBAC error messages before any network call is made

    • 5 ALLOW checks: identities whose role includes the required permission proceed past the guard and reach the control-plane successfully

    • 3 NOT_GUARDED checks: commands without an RBAC guard (rbac role list, rollout plan list, support-bundle generate) succeed or fail for non-RBAC reasons regardless of the operator’s assignment

    • 1 PRESENT check: the retained audit store contains auth.access.denied records from the denial runs

    Uses the HA server still running on 127.0.0.1:18090 (started by slice 5) for the ALLOW-path ha status probes; if the helper is unavailable, those three checks are recorded as SKIP instead of being counted as PASS. Produces both a human-readable val03-report.txt and a machine-readable val03-report.json.

  15. Runs the audit completeness validation slice (run_audit_completeness_val04_lab), proving the four VAL04 claims across a 10-check matrix:

    • VAL04-01: retained store is non-empty after all prior phases have run

    • VAL04-02: all 6 audit categories (rollout, ha, cert, relay, auth, rollback) return at least one record from the retained store

    • VAL04-03..08: every record returned in each category query contains all 6 mandatory schema fields (event_name, category, action, outcome, source, timestamp)

    • VAL04-09: a full retained-store query with --limit 0 --output json succeeds and completes within a 2000 ms latency bound (actual typical elapsed < 100 ms)

    • VAL04-10: all 25 defined wired event types are present in the retained store, proving complete wired-surface coverage for the current lab contract

    Produces both a human-readable val04-report.txt and a machine-readable val04-report.json.

  16. Runs the OTel integration validation slice (run_otel_val05_lab), proving the four VAL05 claims across a 9-check matrix:

    • VAL05-01..03 (Prometheus): the control-plane /metrics endpoint returns HTTP 200, all expected metric families are present, and cp_http_requests_total, cp_http_request_duration_seconds_count, cp_rollout_plans_total, plus cp_events_ingested_total show non-zero observations after lab traffic and an explicit POST /v1/events ingest

    • VAL05-04..06 (WAL pipeline): the telemetry WAL is non-empty after telemetry_emit_helper runs, telemetry export produces non-empty JSONL, and the JSONL contains all mandatory event fields

    • VAL05-07 (OTLP delivery): telemetry flush to a local autonomy telemetry sink exits 0 and the sink prints at least one “received N log records” payload line

    • VAL05-08..09 (correlation IDs): the known trace_id and span_id from the helper events appear in the JSONL export, and traceId / spanId appear in the OTLP sink output

    Uses an isolated temp WAL directory (does not touch the runtime WAL). Produces both a human-readable val05-report.txt and a machine-readable val05-report.json.

  17. Runs the support-bundle validation slice (run_support_bundle_val06_lab), proving the four VAL06 claims across a 10-check matrix:

    • VAL06-01: support-bundle generate exits 0 and writes a non-empty .tar.gz archive

    • VAL06-02: generation completes within 30 seconds

    • VAL06-03: the three always-present core files (manifest.json, system_info.json, build_info.json) appear in the archive listing

    • VAL06-04: manifest.json records all 6 collector names (system_info, build_info, config, ha_status, audit_recent, logs) regardless of their individual status

    • VAL06-05: system_info.json contains all 5 required fields (os, arch, go_version, hostname, collected_at)

    • VAL06-06: audit_recent.json is a non-empty JSON array with ≥ 1 record from the retained audit store

    • VAL06-07: config_redacted.yaml contains fleet_salt: <REDACTED> and the original salt value is absent

    • VAL06-08: config_redacted.yaml contains REDACTED in the postgres URL and the original password (val06-secret-pass) is absent

    • VAL06-09: no PEM block (-----BEGIN) appears in any archive entry

    • VAL06-10: degraded mode — generating with --orchestrator-url pointing to a non-existent server exits 0; manifest.json records ha_status: "failed" (not a fatal error)

    Creates two bundles: a normal bundle for checks VAL06-01..09 and a degraded bundle for VAL06-10. Produces both a human-readable val06-report.txt and a machine-readable val06-report.json.

  18. Runs the rollout latency baseline slice (run_rollout_latency_val07_lab), proving the four VAL07 claims across a 9-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18992 with an isolated SQLite data directory that is removed and recreated on each run (no prior state) and measures latency using curl -w '%{time_total}' with Python percentile computation:

    • VAL07-01: dedicated control plane starts and GET /v1/health returns 200

    • VAL07-02..04 (plan-create latency): 20 sequential POST /v1/rollouts requests; p50 ≤ 100 ms, p95 ≤ 300 ms, p99 ≤ 500 ms — the 500 ms p99 bound is the primary workplan latency target for rollout plan creation

    • VAL07-05 (plan-list latency): 20 sequential GET /v1/rollouts requests with 20 plans in store; p99 ≤ 500 ms

    • VAL07-06..07 (concurrent): 5 parallel plan creates all return 2xx and total wall-clock time is ≤ 2000 ms, proving operator-facing responsiveness is not blocked by single-writer SQLite serialisation

    • VAL07-08: zero non-2xx responses across all 45 benchmark requests

    • VAL07-09: cp_http_requests_total Prometheus counter is non-zero after the benchmark run

    Produces both a human-readable val07-report.txt and a machine-readable val07-report.json (includes plan_create_ms and plan_list_ms latency objects).
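
    The timing harness is ordinary curl plus a percentile helper; a minimal sketch of the same idea (the request body is a placeholder, and p99 is approximated by the max at n=20):

    for i in $(seq 1 20); do
      curl -s -o /dev/null -w '%{http_code} %{time_total}\n' \
        -X POST http://127.0.0.1:18992/v1/rollouts -d '{"plan_id":"p-'"$i"'"}'
    done > create-raw.txt
    awk '{print $2*1000}' create-raw.txt | sort -n | python3 -c \
      'import sys; v=[float(x) for x in sys.stdin]; print("p50_ms=%.1f p95_ms=%.1f p99_ms=%.1f" % (v[9], v[18], v[19]))'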

  19. Runs the rollout throughput slice (run_rollout_throughput_val08_lab), proving the four VAL08 claims across a 10-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18993 with an isolated SQLite data directory that is removed and recreated on each run, then runs four concurrency tiers (N=1, 10, 50, 100 workers) each creating 5 plans sequentially using background bash subshells:

    • VAL08-01: dedicated control plane starts and GET /v1/health returns 200

    • VAL08-02..05 (concurrency tiers): at N=1, 10, 50, and 100 concurrent workers all plans are accepted with zero errors; VAL08-05 (N=100, 500 total plans) is the primary workplan target (≥100 concurrent device rollouts)

    • VAL08-06 (wall clock): the N=100 scenario completes within 30 s

    • VAL08-07 (throughput scaling): throughput at N=100 is ≥ throughput at N=1, confirming concurrency does not regress the write path

    • VAL08-08: aggregate error count across all four scenarios is zero

    • VAL08-09: GET /v1/rollouts after all scenarios returns ≥ (grand_total − errors) plans, confirming durable storage

    • VAL08-10: cp_http_requests_total Prometheus counter is non-zero

    Produces both a human-readable val08-report.txt and a machine-readable val08-report.json (includes throughput object with plans/sec at each tier).
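
    The tier mechanics reduce to background bash subshells; the core pattern (request body placeholder as above):

    N=100; PER=5
    for w in $(seq 1 "$N"); do
      (
        for p in $(seq 1 "$PER"); do
          curl -s -o /dev/null -X POST http://127.0.0.1:18993/v1/rollouts \
            -d '{"plan_id":"w'"$w"'-p'"$p"'"}'
        done
      ) &
    done
    wait    # throughput = (N * PER) / wall-clock seconds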

  20. Runs the stuck rollout detection slice (run_stuck_detection_val09_lab), proving the four VAL09 claims across a 10-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18994 with an isolated SQLite data directory that is removed and recreated on each run. Staleness injection uses threshold_seconds=3 plus a 4-second sleep — no SQLite manipulation required:

    • VAL09-01: dedicated control plane starts and GET /v1/health returns 200

    • VAL09-02 (empty baseline): scan on empty store returns stuck_count=0

    • VAL09-03 (fresh plans): 5 plans created; immediate scan returns stuck_count=0 (plans younger than threshold)

    • VAL09-04 (stale detection): after sleeping 4 s, rescan returns stuck_count=5 — all 5 plans detected as stuck

    • VAL09-05 (diagnosis): all 5 stuck plans carry the exact expected diagnosis string ("zero activations nodes may not be receiving the plan or artifact distribution is incomplete")

    • VAL09-06 (paused excluded): pausing plan-b removes it from the stuck scan (paused is not an active phase)

    • VAL09-07 (terminal excluded): cancelling plan-c removes it from the stuck scan (terminal phase)

    • VAL09-08 (retry recovery): POST recover strategy=retry on plan-d returns new_phase=active; updated_at is refreshed

    • VAL09-09 (rollback recovery): POST recover strategy=rollback on plan-e returns new_phase=rolled_back

    • VAL09-10 (post-recovery clean): final scan confirms plan-a still stuck, plan-b/c/d/e absent (paused/terminal/retry-refreshed)

    Produces both a human-readable val09-report.txt and a machine-readable val09-report.json.
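
    The staleness injection needs no clock tricks; in outline (the scan path shown is an assumption; the real route is whatever the control-plane exposes for the stuck scan):

    sleep 4    # plans created above are now older than threshold_seconds=3
    curl -s 'http://127.0.0.1:18994/v1/rollouts/stuck?threshold_seconds=3'    # path assumed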

  21. Runs the rollback reliability slice (run_rollback_reliability_val10_lab), proving the four VAL10 claims across a 10-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18995. Runs rollback preview for all four target kinds (read-only, no CP needed) then exercises the CLI execute path with 5 + 5 = 10 plan creates and dispatches:

    • VAL10-01: all 4 preview commands exit 0

    • VAL10-02: rollout_plan JSON preview has safety_class=terminal, orchestrated=true, valid strategies retry and rollback

    • VAL10-03: relay_deadletter JSON preview has orchestrated=false and edgectl instructions in manual_path

    • VAL10-04: 5 × rollback execute strategy=retry on real plans — all exit 0 (success_rate=1.000)

    • VAL10-05: 5 × rollback execute strategy=rollback on real plans — all exit 0 (success_rate=1.000)

    • VAL10-06: --output json execute has outcome, new_state, kind fields

    • VAL10-07: execute on nonexistent plan exits non-zero

    • VAL10-08: execute on relay_deadletter (not orchestrated) exits non-zero and prints edgectl instructions

    • VAL10-09: audit query --event-type rollback.preview.requested returns ≥ 4 actor-scoped events from this slice of the retained store

    • VAL10-10: aggregate success rate across 10 executes is ≥ 0.990 (workplan target: ≥99%); at least 10 actor-scoped rollback.executed success events from this slice are retained

    Produces both a human-readable val10-report.txt and a machine-readable val10-report.json (includes success_rate object with per-strategy and aggregate rates).

  22. Runs the chaos validation slice (run_chaos_val11_lab), proving the four VAL11 claims across a 10-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18996 with an isolated SQLite data directory that is reused across kill/restart cycles within the function (durability requires the same data dir). Chaos injection is via SIGTERM — no root/iptables required:

    • VAL11-01: dedicated control plane starts and GET /v1/health returns 200

    • VAL11-02 (CP kill → client error): after SIGTERM, autonomy rollback execute exits non-zero with a connection-refused / dial error, confirming the CLI does not silently swallow CP unavailability

    • VAL11-03 (data durability): after CP restart against the same SQLite data dir, GET /v1/rollouts returns ≥ 10 plans, confirming the full pre-kill corpus survived

    • VAL11-04 (rapid restart resilience): 3 additional kill+restart cycles; final plan count ≥ pre-rapid count; new plan create returns 201 — confirms repeated restarts do not corrupt the store or break the write path

    • VAL11-05 (gate-wait survival): plan val11-gate-1 (created in published phase before the kill) retains phase=published after restart, confirming the gate-wait state is not lost across the CP kill boundary

    • VAL11-06 (device-unresponsive proxy): 3 proxy plans created, sleep 3s (threshold=2s); stuck scan returns stuck_count ≥ 3 with non-empty diagnosis strings, proving the operator can detect unresponsive devices

    • VAL11-07 (artifact corruption proxy): plan with invalid artifact_ref and target_lock_fingerprint is accepted with HTTP 201 and is queryable by the operator via GET /v1/rollouts/val11-corrupt-1

    • VAL11-08 (corrupt plan rollback): rollback execute strategy=rollback on the corrupt-artifact plan exits 0, proving the operator can roll back a plan with suspicious artifact metadata

    • VAL11-09 (bulk cascade recovery): 3 × rollback execute strategy=retry on the device-unresponsive proxy plans — all exit 0 (cascade_ok=3)

    • VAL11-10 (audit integrity): audit query --event-type rollback.executed scoped to the chaos actor and slice start-time returns ≥ 1 success event, confirming audit capture is not disrupted by CP kill/restart cycles

    Produces both a human-readable val11-report.txt and a machine-readable val11-report.json.

  23. Runs the HA failover validation slice (run_ha_failover_val13_lab), exercising the HA leader-election subsystem under unplanned failure conditions. Two orchestrator_ha_server nodes (ports 18997/18998) share a dedicated Docker PostgreSQL instance. The 10-check matrix covers:

    • VAL13-01 (baseline): node-1 acquires leadership first; node-2 is a follower

    • VAL13-02 (pre-kill write): SQL probe row inserted while node-1 is leader

    • VAL13-03 (SIGTERM failover): failover time measured from kill to node-2 “acquired leadership” log entry; threshold ≤ 5000 ms

    • VAL13-04 (zero data loss): probe row note=pre-kill readable from the shared PG instance after failover, proving shared-database durability

    • VAL13-05 (post-failover leader active): node-2 ha status confirms it holds leadership after node-1 exits

    • VAL13-06 (PG crash detected): docker stop on PG primary → node-2 /v1/ha/quorum transitions to quorum_health=lost

    • VAL13-07 (PG crash recovery): manual docker start → quorum_health=healthy restored; probe row still intact

    • VAL13-08 (rapid kill cycles): 3 × alternating SIGTERM failover cycles, each measured ≤ 5000 ms, followed by leader-status and probe-row checks

    • VAL13-09 (SIGKILL disk-fault proxy): SIGKILL on leader (no graceful Resign, simulates OOM/disk crash) → node-2 takes over ≤ 5000 ms

    • VAL13-10 (post-chaos stability): single write-ready leader confirmed after all failure scenarios

    Produces val13-report.txt and val13-report.json.
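
    The failover clock is read off the follower's log; the essence of the measurement (the PID variable and log path are placeholders; the log phrase is the one named in VAL13-03):

    t0=$(python3 -c 'import time; print(int(time.time()*1000))')
    kill -TERM "$NODE1_PID"    # placeholder
    until grep -q 'acquired leadership' node-2.log; do sleep 0.05; done    # placeholder log path
    t1=$(python3 -c 'import time; print(int(time.time()*1000))')
    echo "failover_ms=$((t1 - t0))"    # bound: ≤ 5000 ms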

  24. Runs the HA replication lag baseline slice (run_ha_replication_lag_val14_lab), benchmarking PostgreSQL streaming replication lag under the autonomyops HA architecture. A val14-pg-primary + val14-pg-standby Docker pair is provisioned via pg_basebackup; a single HA server at port 18999 uses --min-sync-replicas 1 so quorum health tracks standby availability. The 10-check matrix covers:

    • VAL14-01 (replication streaming): pg_stat_replication.state = streaming confirmed after standby provisioning

    • VAL14-02 (idle LSN gap zero): (pg_current_wal_lsn() - write_lsn)::bigint = 0 at rest — no unreplicated WAL

    • VAL14-03 (light load drain): 100 rows × 500 bytes inserted with synchronous_commit=off; lag is sampled during the active drain window and WAL LSN gap drains to 0 within 2000 ms

    • VAL14-04 (heavy load drain): 500 rows × 2000 bytes (~1 MB WAL) with synchronous_commit=off; LSN gap drains within 5000 ms

    • VAL14-05 (post-drain gap closed): LSN gap confirmed 0 after the heavy drain sequence completes

    • VAL14-06 (HA replication endpoint): /v1/health/replication responds 200 with a valid JSON body

    • VAL14-07 (standby stop degraded): docker stop val14-pg-standby → quorum_health=degraded within 30 s

    • VAL14-08 (standby start healthy): docker start val14-pg-standby → quorum_health=healthy restored

    • VAL14-09 (catch-up after restart): after generating WAL while the standby is offline, the backlog drains to 0 within 10 s of standby restart, proving the replication slot catches up correctly

    • VAL14-10 (threshold report): practical alerting thresholds derived from observed p95 lag using the formula healthy = max(p95×3+1, 10), degraded = max(healthy×10, 100), alert = max(healthy×50, 500) — all three thresholds must be positive

    Produces val14-report.txt and val14-report.json.
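
    The lag probe is the single SQL expression from VAL14-02; a sketch against the primary (the psql user is an assumption):

    docker exec val14-pg-primary psql -U postgres -t -c \
      "SELECT (pg_current_wal_lsn() - write_lsn)::bigint FROM pg_stat_replication;"
    # 0 means no unreplicated WAL; the drain checks poll this until it hits 0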

  25. Runs the backup/restore validation slice (run_backup_restore_val15_lab), proving the ha backup create/list/restore workflow end-to-end with integrity verification, timing bounds, and error-path safety checks. A dedicated Docker PostgreSQL instance (val15-pg-primary) is provisioned with two fixture tables (val15_small: 100 rows x 200 bytes; val15_medium: 1,000 rows x 1,000 bytes). The HA server at port 19001 is restarted between phases to transition between normal mode (backup) and maintenance mode (restore). The 10-check matrix covers:

    • VAL15-01 (backup created): ha backup create exits 0; .dump file exists with size > 0

    • VAL15-02 (backup file valid): pg_restore -l on the .dump file exits 0 and the TOC contains TABLE DATA entries

    • VAL15-03 (backup metadata correct): ha backup list --output json shows backup_id with status=completed

    • VAL15-04 (checksum verified): SHA-256 of the .dump file computed independently matches the checksum=<hex> field in the CLI output

    • VAL15-05 (backup timing bound): ha backup create wall time ≤ 30,000 ms

    • VAL15-06 (restore correct): after mutating the tables post-backup (UPDATE first 50 rows; DELETE half the medium table), ha backup restore --confirm reverts the data; post-restore counts are small=100, medium=1000 with correct spot-check payload values

    • VAL15-07 (restore timing bound): ha backup restore wall time ≤ 60,000 ms

    • VAL15-08 (multi-backup inventory): a second ha backup create followed by ha backup list returns count 2 with both backup IDs present

    • VAL15-09 (restore requires confirm): ha backup restore without --confirm exits non-zero and mentions --confirm in the error output (safety gate enforcement)

    • VAL15-10 (audit events captured): audit query --event-type ha.backup.created and --event-type ha.backup.restored each return ≥ 1 event from the shared audit store

    Produces val15-report.txt and val15-report.json.
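
    The checksum cross-check is deliberately CLI-independent; in outline (the dump path is a placeholder; checksum=<hex> is the CLI field named in VAL15-04):

    autonomy ha backup create | tee backup-create.txt
    want=$(grep -o 'checksum=[0-9a-f]*' backup-create.txt | cut -d= -f2)
    got=$(sha256sum backups/latest.dump | awk '{print $1}')    # placeholder path
    [ "$want" = "$got" ] && echo "VAL15-04 PASS"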

  26. Runs the split-brain chaos validation slice (run_split_brain_chaos_val16_lab), proving the split-brain detection and recovery subsystem under SQL-injected fault conditions. A dedicated Docker PostgreSQL instance (val16-pg-primary) is provisioned with a single HA server at port 19002 (--min-sync-replicas 0, --campaign 500ms). Injection is performed directly via psql against leadership_state and leader_epochs — no iptables or second HA node required. The 10-check matrix covers:

    • VAL16-01 (baseline risk none): /v1/ha/split-brain returns risk=none before any injection

    • VAL16-02 (epoch inject detected): after UPDATE leadership_state SET current_epoch = current_epoch + 99, holder_id = 'val16-injected-node', the API returns risk=detected

    • VAL16-03 (detection idempotent): a second API call returns the same risk=detected without side effects

    • VAL16-04 (dry-run reconcile ok): ha split-brain recover --strategy manual-reconcile exits 0 and a follow-up /v1/ha/split-brain read still reports risk=detected (planning path, no DB writes)

    • VAL16-05 (promote leader recovers): ha split-brain recover --strategy promote-leader exits 0 and the API subsequently returns risk=none

    • VAL16-06 (data integrity after recovery): a user-table probe row (val16_probe.note = 'pre-inject') is untouched after promote-leader recovery, confirming recovery is scoped to metadata tables only

    • VAL16-07 (ghost node possible): inserting two leader_epochs rows with resigned_at IS NULL and foreign holder_id values raises risk=possible

    • VAL16-08 (ghost node self-clears): stamping resigned_at on the injected rows causes the API to return risk=none without any HA server restart

    • VAL16-09 (audit events captured): audit query --event-type ha.split_brain.detected --start-time <slice_start> and --event-type ha.split_brain.recovered --actor val16-operator --start-time <slice_start> each return ≥ 1 event from this slice in the shared audit store

    • VAL16-10 (post chaos stability): /v1/ha/status confirms holder_id contains cp-val16-node after all chaos scenarios complete

    Produces val16-report.txt and val16-report.json.
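
    The detected-state injection is a single UPDATE against the lab database; trimmed to its essence (the psql user is an assumption; the statement and port are those given above):

    docker exec val16-pg-primary psql -U postgres -c \
      "UPDATE leadership_state SET current_epoch = current_epoch + 99, holder_id = 'val16-injected-node';"
    curl -s http://127.0.0.1:19002/v1/ha/split-brain    # now reports risk=detected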

  27. Runs the quorum loss validation slice (run_quorum_loss_val17_lab), proving quorum loss detection timing, write-blocking safety behavior, and recovery detection against the workplan’s ≤ 60 s target (Gap HA-004). A dedicated Docker PostgreSQL instance (val17-pg-primary) is provisioned with a single HA server at port 19003 (--min-sync-replicas 0, --quorum-monitor-interval 500ms). Quorum loss is induced by docker stop val17-pg-primary; recovery by docker start. Timing is measured in milliseconds using python3 time.time()*1000 wrapped around each docker stop/start call. The 10-check matrix covers:

    • VAL17-01 (baseline quorum healthy): /v1/ha/quorum returns quorum_health=healthy and write_block_active=false before any fault

    • VAL17-02 (loss detected): quorum_health=lost after docker stop

    • VAL17-03 (loss detection timing bound): loss_ms ≤ 30,000 (30 s threshold for workplan ≤ 60 s target with --quorum-monitor-interval 500ms)

    • VAL17-04 (write blocked during loss): write_block_active=true and can_accept_protected_writes=false in the quorum status JSON during loss

    • VAL17-05 (loss reason populated): quorum_loss_reason non-empty during loss

    • VAL17-06 (recovery detected): quorum_health=healthy after docker start

    • VAL17-07 (recovery timing bound): recovery_ms ≤ 30,000

    • VAL17-08 (write unblocked after recovery): write_block_active=false; can_accept_protected_writes=true; last_lost_at and last_restored_at timestamps populated; detected_loss_count ≥ 1

    • VAL17-09 (second cycle count increments): after a second loss/recovery completes, detected_loss_count ≥ 2

    • VAL17-10 (audit events captured): audit query --event-type ha.quorum.lost --start-time <val17_start_time> and --event-type ha.quorum.restored --start-time <val17_start_time> each return ≥ 1 event

    Produces val17-report.txt and val17-report.json.
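
    The millisecond wrapper named above is just python3 around the docker call plus a poll; in outline (the JSON key shape in the grep is an assumption):

    t0=$(python3 -c 'import time; print(int(time.time()*1000))')
    docker stop val17-pg-primary
    until curl -s http://127.0.0.1:19003/v1/ha/quorum | grep -q '"quorum_health"[: ]*"lost"'; do
      sleep 0.5
    done
    t1=$(python3 -c 'import time; print(int(time.time()*1000))')
    echo "loss_ms=$((t1 - t0))"    # bound: ≤ 30,000 ms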

5. Generated Artifacts

The evidence bundle is split by surface:

  • autonomy/ for rollout and cert captures

  • metrics/ for the live PR-22 metrics list/query captures

  • ha/ for live HA backup and failover captures

  • quorum/ for live HA quorum status and transition captures

  • split-brain/ for live split-brain detection captures

  • rbac/ for local RBAC store and assignment captures

  • relay/ for the live edge relay captures

  • retained/ for the file-backed audit store plus query/export captures

  • support-bundle/ for the generated archive plus extracted proof files

  • config-migrate/ for the live PR-27 config migration captures

  • cert_rbac/ for the PR-29-followup-e cert RBAC enforcement captures

  • val03/ for the VAL03 RBAC permission enforcement validation captures

  • val04/ for the VAL04 audit completeness validation captures

  • val05/ for the VAL05 OTel integration validation captures

  • val06/ for the VAL06 support-bundle validation captures

  • val07/ for the VAL07 rollout latency baseline captures

  • val08/ for the VAL08 rollout throughput validation captures

  • val09/ for the VAL09 stuck rollout detection validation captures

  • val10/ for the VAL10 rollback reliability validation captures

  • val11/ for the VAL11 chaos validation captures

Representative files:

  • autonomy/rollout-plan-create.txt

  • autonomy/audit-rollout-plan-create.log

  • autonomy/cert-rotate.txt

  • autonomy/audit-cert-rotate.log

  • autonomy/cert-rotation-list-expiring.txt — VAL01-1: 2-day cert appears in expiry window

  • autonomy/cert-rotation-prerotate-health.json — VAL01-2: old cert accepted over live mTLS

  • autonomy/cert-rotation-before-dates.txt — openssl serial/dates before rotation (baseline)

  • autonomy/cert-rotation-timing.txt — VAL01-3: elapsed seconds + pass=true vs 300s bound

  • autonomy/cert-rotation-rotate.txt — cert rotate stdout

  • autonomy/cert-rotation-audit-rotate.log — slog cert.rotated event from rotation

  • autonomy/cert-rotation-after-dates.txt — openssl serial/dates after rotation

  • autonomy/cert-rotation-list-after.txt — VAL01-4: no certificates matched in expiry window

  • autonomy/cert-rotation-postrotate-health.json — VAL01-5: rotated client cert accepted without restart

  • autonomy/cert-rotation-audit-events.json — VAL01-6: retained cert.rotated query

  • autonomy/cert-rotation-val01-report.txt — composite 6/6 PASS report + serial assertion

  • autonomy/cert-rejection-missing-client-cert.stderr — VAL02-1: TLS cert required error

  • autonomy/cert-rejection-invalid-chain.stderr — VAL02-2: chain verification failure (rogue CA)

  • autonomy/cert-rejection-expired-cert.stderr — VAL02-3: cert expired rejection

  • autonomy/cert-rejection-revoked.stderr — VAL02-4: CRL rejection (node-a, same gate as Phase 5)

  • autonomy/cert-rejection-wrong-server-trust.stderr — VAL02-5: client-side verify failure

  • autonomy/cert-rejection-val02-report.txt — composite 5/5 PASS report + right_ca_wrong_cn note

  • metrics/orchestrator-metrics-raw.txt

  • metrics/metrics-list.txt

  • metrics/metrics-list.json

  • metrics/metrics-query-all.txt

  • metrics/metrics-query-rollout-plans.json

  • metrics/metrics-query-http-duration.txt

  • rbac/rbac-role-create.txt

  • rbac/rbac-role-list-before-assign.txt

  • rbac/rbac-role-assign.txt

  • rbac/audit-rbac-role-assign.log

  • rbac/rbac-role-assign-repeat.txt

  • rbac/rbac-role-list-after-assign.txt

  • rbac/rbac-role-list.json

  • rbac/assignments.json

  • rbac/rbac-default-on-denied.stderr — bootstrap mode denial with no env var

  • rbac/rbac-bootstrap-seed.txt — bootstrap role assign output

  • rbac/audit-rbac-bootstrap-seed.log — auth.bootstrap.access audit event

  • rbac/rbac-post-bootstrap-insufficient.stderr — outsider denied after bootstrap

  • rbac/rbac-post-bootstrap-allowed.json — allowed after bootstrap + role grant

  • rbac/rbac-enforcement-disabled.json — opt-out AUTONOMY_RBAC_ENFORCEMENT=0

  • rbac/rbac-break-glass-allowed.json — break-glass bypassed action

  • rbac/audit-rbac-break-glass.log — auth.break_glass.used audit event

  • rbac/audit-break-glass-events.json — retained break-glass events

  • rbac/rbac-break-glass-no-operator.stderr — break-glass safety: denied without operator

  • rbac/rbac-role-assign-operator.txt

  • rbac/rbac-ha-status-denied.stderr — default-on denial (no env var)

  • rbac/rbac-ha-status-allowed.txt

  • rbac/rbac-audit-query-denied.stderr — default-on denial (no env var)

  • rbac/rbac-audit-query-allowed.json

  • rbac/rbac-audit-export-denied.stderr — default-on denial (no env var)

  • rbac/rbac-audit-export-allowed.json

  • rbac/retained-auth-access-denied.json

  • ha/ha-status-no-header.headers

  • ha/ha-status-no-header.json

  • ha/ha-status-with-header.headers

  • ha/ha-status-with-header.json

  • ha/ha-backup-create.txt

  • ha/audit-ha-backup-create.log

  • ha/ha-backup-restore.txt

  • ha/audit-ha-backup-restore.log

  • ha/ha-failover-trigger.txt

  • ha/audit-ha-failover-trigger.log

  • quorum/ha-quorum-healthy.txt

  • quorum/ha-quorum-degraded.txt

  • quorum/ha-quorum-lost.json

  • quorum/ha-quorum-restored.json

  • quorum/audit-ha-quorum-lost.json

  • quorum/audit-ha-quorum-restored.json

  • split-brain/ha-split-brain-healthy.txt

  • split-brain/ha-split-brain-detected.json

  • split-brain/ha-split-brain-detected-repeat.json

  • split-brain/ha-split-brain-recover-dry-run.txt

  • split-brain/ha-split-brain-recover-execute.txt

  • split-brain/ha-split-brain-recovered.txt

  • split-brain/ha-split-brain-recovered.json

  • split-brain/ha-status-after-recovery.json

  • split-brain/ha-split-brain-possible.txt

  • split-brain/audit-ha-split-brain-detected-after-dedupe.json

  • split-brain/audit-ha-split-brain-detected-all.json

  • split-brain/audit-ha-split-brain-recovered-after-execute.json

  • split-brain/audit-ha-split-brain-recovered-all.json

  • split-brain/ha-split-brain-summary.json

  • relay/relay-deadletter-retry.txt

  • relay/audit-relay-deadletter-retry.log

  • relay/relay-bandwidth-set.txt

  • relay/audit-relay-bandwidth-set.log

  • retained/audit-query-all.txt

  • retained/audit-query-category-auth.json

  • retained/audit-query-category-rollout.txt

  • retained/audit-query-ha-backup-created.json

  • retained/audit-query-source-edge.json

  • retained/audit-query-outcome-success.txt

  • retained/audit-query-invalid-output.stderr

  • retained/audit-query-invalid-range.stderr

  • retained/audit-export-all.json

  • retained/audit-export-all.csv

  • retained/audit-export-invalid-format.stderr

  • retained/audit-export-invalid-format-target-before.sha256

  • retained/audit-export-invalid-format-target-after.sha256

  • retained/audit-export-invalid-format-target.txt

  • retained/retained-file-list.txt

  • support-bundle/autonomy-support-bundle-live.tar.gz

  • support-bundle/support-bundle-generate.log

  • support-bundle/support-bundle-contents.txt

  • support-bundle/manifest.json

  • support-bundle/config_redacted.yaml

  • support-bundle/ha_status.json

  • support-bundle/audit_recent.json

  • support-bundle/logs-autonomy.log

  • support-bundle/support-bundle-summary.txt

  • config-migrate/config-migrate-dry-run.txt

  • config-migrate/config-migrate-stdout.yaml

  • config-migrate/config-migrate-stdout.toml

  • config-migrate/config-migrated.yaml

  • config-migrate/config-migrated.toml

  • config-migrate/config-migrated-in-place.yaml

  • config-migrate/config-migrate-in-place-before.stat

  • config-migrate/config-migrate-in-place-after.stat

  • config-migrate/config-migrate-unsupported.stderr

  • config-migrate/config-migrate-invalid-input.stderr

  • config-migrate/config-migrate-invalid-v0.stderr

  • db_audit/schema-dry-run.txt (DB-backed; skipped when no PG URL)

  • db_audit/query-file-all.txt — file-backed baseline query (fallback path)

  • db_audit/query-db-all.txt — DB-backed query text (primary path)

  • db_audit/query-db-cert.json — DB-backed query filtered to cert category

  • db_audit/export-db-all.json — DB-backed JSON export

  • db_audit/export-db-all.csv — DB-backed CSV export

  • db_audit/prune-90d.txt — prune with 90d cutoff (no-op output)

  • db_audit/prune-1d.txt — prune with 1d cutoff (deletes seeded old rows)

  • db_audit/query-db-after-prune.json — post-prune query confirming remaining rows

  • cert_rbac/denied-issue.txt — cert issue denied (cert:manage in error)

  • cert_rbac/denied-rotate.txt — cert rotate denied

  • cert_rbac/denied-revoke.txt — cert revoke denied

  • cert_rbac/denied-list.txt — cert list denied (newly guarded; cert:manage in error)

  • cert_rbac/denied-check-revocation.txt — cert check-revocation denied (newly guarded)

  • cert_rbac/denied-sync-crl.txt — cert sync-crl denied

  • cert_rbac/allowed-issue.txt — successful cert issue under cert:manage

  • cert_rbac/allowed-list.txt — successful cert list under cert:read

  • cert_rbac/allowed-check-revocation.txt — successful read-only non-revoked check under cert:read

  • cert_rbac/audit-denied-events.json — retained auth.access.denied records for cert operations

  • val03/setup-seed-auditor.txt — VAL03 bootstrap auditor assignment

  • val03/setup-assign-operator.txt — VAL03 operator assignment

  • val03/setup-assign-analyst.txt — VAL03 analyst assignment

  • val03/setup-server-assign-auditor.txt — mirrored server-side auditor assignment

  • val03/setup-server-assign-operator.txt — mirrored server-side operator assignment

  • val03/setup-server-assign-analyst.txt — mirrored server-side analyst assignment

  • val03/val03-01-ha-status-deny.stderr — unassigned denied ha status (rbac: pattern)

  • val03/val03-02-ha-status-operator-allow.txt — operator allowed ha status

  • val03/val03-03-ha-status-analyst-allow.txt — analyst allowed ha status

  • val03/val03-04-ha-status-auditor-allow.txt — auditor allowed ha status

  • val03/val03-05-audit-query-operator-deny.stderr — operator denied audit query

  • val03/val03-06-audit-query-analyst-deny.stderr — analyst denied audit query

  • val03/val03-07-audit-query-auditor-allow.json — auditor allowed audit query

  • val03/val03-08-rbac-role-list-unassigned.txt — unassigned rbac role list (not guarded)

  • val03/val03-09-rollout-plan-list-unassigned.stderr — unassigned rollout plan list (connection error, not RBAC)

  • val03/val03-10-rbac-role-create-operator-deny.stderr — operator denied rbac role create

  • val03/val03-11-rbac-role-create-analyst-deny.stderr — analyst denied rbac role create

  • val03/val03-12-rbac-role-create-auditor-allow.txt — auditor allowed rbac role create

  • val03/val03-13-support-bundle-unassigned.stderr — unassigned support-bundle generate progress with bundle written: (top-level command not guarded)

  • val03/val03-support-bundle.tar.gz — bundle generated by the unguarded support-bundle check

  • val03/val03-14-access-denied-events.json — retained denial tuples from this VAL03 slice

  • val03/val03-report.txt — composite 14-check PASS/FAIL report

  • val03/val03-report.json — machine-readable JSON report

  • val04/val04-store-inventory.txt — retained store JSONL file count

  • val04/val04-category-rollout.json — all rollout-category records from retained store

  • val04/val04-category-ha.json — all ha-category records

  • val04/val04-category-cert.json — all cert-category records

  • val04/val04-category-relay.json — all relay-category records

  • val04/val04-category-auth.json — all auth-category records

  • val04/val04-category-rollback.json — all rollback-category records

  • val04/val04-category-summary.txt — category=<cat> count=<N> lines for all 6 categories

  • val04/val04-schema-check.txt — per-category schema field check results (PASS/FAIL + MISSING lines)

  • val04/val04-latency.txt — query_ok, query_elapsed_ms, bound_ms=2000, pass=true/false

  • val04/val04-coverage-all.json — full retained store JSON (all records, --limit 0)

  • val04/val04-coverage-report.txt — PRESENT/ABSENT per event type, final found/expected/threshold/pass

  • val04/val04-report.txt — composite 10-check PASS/FAIL report

  • val04/val04-report.json — machine-readable JSON report with latency and coverage counts

  • val05/val05-prometheus-status.txt — HTTP status code for /metrics endpoint

  • val05/val05-prometheus-raw.txt — full Prometheus text exposition from orchestrator

  • val05/val05-prometheus-families.txt — PRESENT/ABSENT per required metric family

  • val05/val05-events-ingest.json — explicit POST /v1/events response used to exercise cp_events_ingested_total

  • val05/val05-prometheus-observations.txt — sample lines for 4 non-zero observation checks

  • val05/val05-emit-helper.txt — telemetry_emit_helper output (emitted 3 events to )

  • val05/val05-wal-status.json — telemetry status --json output {total, exported, pending}

  • val05/val05-wal-inventory.txt — WAL dir, file count, total events, pass flag

  • val05/val05-export.jsonl — JSONL output from telemetry export (3 events)

  • val05/val05-export-fields.txt — PRESENT/ABSENT per mandatory JSONL event field

  • val05/val05-sink-output.txt — OTLP payloads received by autonomy telemetry sink

  • val05/val05-flush-stdout.txt — telemetry flush: OK N events sent to

  • val05/val05-flush-summary.txt — flush_ok, sink_lines, sink_payloads, pass flag

  • val05/val05-traceid-jsonl.txt — trace_id_found=<value>, span_id_found=<value>, or ABSENT

  • val05/val05-traceid-otlp.txt — traceId_found=true/false, spanId_found=true/false

  • val05/val05-report.txt — composite 9-check PASS/FAIL report

  • val05/val05-report.json — machine-readable JSON report

  • val06/val06-bundle.tar.gz — normal support bundle archive under inspection

  • val06/val06-generate-stdout.txt — normal stdout capture (typically empty; progress is emitted on stderr)

  • val06/val06-generate.log — stderr including generating support bundle, per-collector progress, and bundle written: confirmation

  • val06/val06-timing.txt — generate_ok, elapsed_s, bound_s=30, pass_timing

  • val06/val06-contents.txt — archive file listing from tar -tzf

  • val06/val06-manifest.json — extracted manifest.json with all 6 collector statuses

  • val06/val06-system-info.json — extracted system_info.json

  • val06/val06-config-redacted.yaml — extracted config_redacted.yaml for redaction inspection

  • val06/val06-audit-recent.json — extracted audit_recent.json (50 most recent records)

  • val06/val06-core-files.txt — PRESENT/ABSENT per required core file

  • val06/val06-manifest-check.txt — collector=<name> status=<status> for each of the 6 collectors

  • val06/val06-sysinfo-check.txt — PRESENT/ABSENT per required system_info field

  • val06/val06-audit-check.txt — audit_recent_count, pass

  • val06/val06-redaction-salt.txt — fleet_salt_placeholder, fleet_salt_actual_absent, pass

  • val06/val06-redaction-pg.txt — pg_redacted_present, pg_secret_absent, pass

  • val06/val06-privkey-check.txt — privkey_hits=0, pass

  • val06/val06-bundle-degraded.tar.gz — degraded bundle (ha_status failed)

  • val06/val06-degraded-manifest.json — extracted manifest from degraded bundle

  • val06/val06-degraded-check.txt — bundle_exit_ok=true, ha_status_status=failed, pass

  • val06/val06-report.txt — composite 10-check PASS/FAIL report

  • val06/val06-report.json — machine-readable JSON report with elapsed_s and per-check statuses

  • val07/val07-cp.log — dedicated VAL07 control-plane startup and per-request logs

  • val07/val07-health.txt — health_code=200, pass

  • val07/val07-create-raw.txt — 20 lines: <http_code> <time_total_s> from sequential creates

  • val07/val07-create-percentiles.txt — expected_n=20, n=20, sample_complete=true, p50_ms, p95_ms, p99_ms, min_ms, max_ms

  • val07/val07-list-raw.txt — 20 lines: <http_code> <time_total_s> from sequential lists

  • val07/val07-list-percentiles.txt — expected_n=20, n=20, sample_complete=true, p50_ms, p95_ms, p99_ms, min_ms, max_ms

  • val07/val07-concurrent-raw.txt — 5 lines from concurrent creates

  • val07/val07-concurrent-summary.txt — concurrent_n=5, conc_ok, conc_errors, wall_ms, bound_ms=2000, wall_pass

  • val07/val07-error-summary.txt — total_requests=45, error_count, pass

  • val07/val07-metrics-raw.txt — Prometheus text exposition from dedicated VAL07 control plane

  • val07/val07-prometheus-check.txt — cp_http_requests_total count, pass

  • val07/val07-report.txt — composite 9-check PASS/FAIL report with p50/p95/p99 values

  • val07/val07-report.json — machine-readable JSON report with plan_create_ms and plan_list_ms latency objects

  • val08/val08-cp.log — dedicated VAL08 control-plane startup and per-request logs

  • val08/val08-health.txt — status=ok, pass=true

  • val08/scenario-n1/scenario-report.txt — n_workers=1, total_plans=5, ok=5, errors=0, throughput_plans_per_sec

  • val08/scenario-n10/scenario-report.txt — n_workers=10, total_plans=50, ok=50, errors=0, throughput_plans_per_sec

  • val08/scenario-n50/scenario-report.txt — n_workers=50, total_plans=250, ok=250, errors=0, throughput_plans_per_sec

  • val08/scenario-n100/scenario-report.txt — n_workers=100, total_plans=500, ok=500, errors=0, throughput_plans_per_sec

  • val08/scenario-n{1,10,50,100}/worker-*.txt — per-worker error counts (one file per worker; all should contain 0)

  • val08/val08-wall-clock-n100.txt — elapsed_ms, bound_ms=30000, pass

  • val08/val08-throughput-scaling.txt — tput_n1, tput_n10, tput_n50, tput_n100, scaling_pass

  • val08/val08-error-aggregate.txt — total_errors=0, pass=true

  • val08/val08-list-consistency.txt — grand_total_created=805, expected_min, list_count, pass

  • val08/val08-metrics-raw.txt — Prometheus text exposition from dedicated VAL08 control plane

  • val08/val08-prometheus-check.txt — cp_http_requests_total count, pass

  • val08/val08-report.txt — composite 10-check PASS/FAIL report with throughput-by-tier table

  • val08/val08-report.json — machine-readable JSON report with throughput object and per-check statuses

  • val09/val09-cp.log — dedicated VAL09 control-plane startup and per-request logs

  • val09/val09-health.txt — status=ok, pass=true

  • val09/val09-plans-created.txt — 5 plan IDs with HTTP 201 create codes

  • val09/val09-scan-empty.json — stuck scan on empty store (stuck_count=0)

  • val09/val09-baseline-check.txt — stuck_count=0, expected=0, pass=true

  • val09/val09-scan-fresh.json — stuck scan immediately after creation (stuck_count=0)

  • val09/val09-fresh-check.txt — stuck_count=0, expected=0, pass=true

  • val09/val09-scan-stale.json — stuck scan after sleep; full stuck_plans array with diagnosis fields

  • val09/val09-stale-check.txt — stuck_count=5, expected=5, pass=true

  • val09/val09-diagnosis-check.txt — total=5, diagnosis_populated=5, pass=true

  • val09/val09-pause-planb.json — pause response for plan-b (phase=paused)

  • val09/val09-scan-after-pause.json — stuck scan after pause (plan-b absent)

  • val09/val09-pause-check.txt — paused_plan=val09-plan-b, in_stuck_scan=no, pass=true

  • val09/val09-cancel-planc.json — cancel response for plan-c

  • val09/val09-scan-after-cancel.json — stuck scan after cancel (plan-c absent)

  • val09/val09-cancel-check.txt — cancelled_plan=val09-plan-c, in_stuck_scan=no, pass=true

  • val09/val09-recover-retry-pland.json — retry recovery response (new_phase=active)

  • val09/val09-retry-check.txt — new_phase=active, pass=true

  • val09/val09-recover-rollback-plane.json — rollback recovery response (new_phase=rolled_back)

  • val09/val09-rollback-check.txt — new_phase=rolled_back, pass=true

  • val09/val09-scan-final.json — final stuck scan (plan-a present; b/c/d/e absent)

  • val09/val09-final-check.txt — per-plan presence flags + pass=true

  • val09/val09-report.txt — composite 10-check PASS/FAIL report

  • val09/val09-report.json — machine-readable JSON report with scan counts and per-check statuses

  • val10/val10-cp.log — dedicated VAL10 control-plane startup and per-request logs

  • val10/val10-health.txt — status=ok, pass=true

  • val10/val10-preview-rollout_plan.txt — text safety profile for rollout_plan target

  • val10/val10-preview-rollout_stage.txt — text safety profile for rollout_stage target

  • val10/val10-preview-ha_leader_resign.txt — text safety profile for ha_leader_resign target

  • val10/val10-preview-relay_deadletter.txt — text safety profile for relay_deadletter target

  • val10/val10-preview-check.txt — preview_errors=0, pass=true

  • val10/val10-preview-rollout_plan.json — JSON preview (safety_class=terminal, orchestrated=true)

  • val10/val10-preview-rollout_plan-check.txt — field checks, pass=true

  • val10/val10-preview-relay_deadletter.json — JSON preview (orchestrated=false)

  • val10/val10-preview-relay-check.txt — orchestrated=false, manual_path_has_edgectl=true, pass=true

  • val10/val10-retry-plans-created.txt — 5 × plan_id=val10-retry-N  code=201

  • val10/retry/execute-retry-{1..5}.txt — per-plan retry execute output (outcome=success  previous=published  new=active)

  • val10/val10-retry-rate.txt — strategy=retry  ok=5  fail=0  total=5  success_rate=1.000  pass=true

  • val10/val10-rollback-plans-created.txt — 5 × plan_id=val10-rollback-N  code=201

  • val10/rollback/execute-rollback-{1..5}.txt — per-plan rollback execute output (outcome=success  previous=published  new=rolled_back)

  • val10/val10-rollback-rate.txt — strategy=rollback  ok=5  fail=0  total=5  success_rate=1.000  pass=true

  • val10/val10-execute-json.json — JSON output from retry execute (Outcome, NewState, Kind fields)

  • val10/val10-execute-json-check.txt — field presence checks, pass=true

  • val10/val10-execute-nonexistent.txt — error output for nonexistent plan

  • val10/val10-nonexistent-check.txt — exit_code=1, pass=true

  • val10/val10-execute-relay-not-orchestrated.txt — edgectl instructions in error output

  • val10/val10-relay-not-orchestrated-check.txt — exit_code!=0, has_edgectl_instructions=1, pass=true

  • val10/val10-audit-preview-events.json — rollback.preview.requested events from audit store

  • val10/val10-audit-preview-check.txt — rollback.preview.requested_count ≥ 4 with actor/start-time scope, pass=true

  • val10/val10-audit-execute-events.json — rollback.executed events from audit store

  • val10/val10-aggregate-rate.txt — agg_ok=10  agg_total=10  agg_success_rate=1.000  pass=true

  • val10/val10-report.txt — composite 10-check PASS/FAIL report with success rate table

  • val10/val10-report.json — machine-readable JSON with success_rate object and per-check statuses

  • val11/val11-cp.log — chaos CP stdout/stderr across all start/stop cycles (append mode)

  • val11/val11-health.txt — status=ok, pass=true

  • val11/val11-plans-created.txt — 9 plan create HTTP codes (dur-1..5, gate-1, corrupt-1, dev-1..3)

  • val11/val11-scan-stuck.json — stuck scan result (threshold=2s, after sleep=3s)

  • val11/val11-stuck-check.txt — stuck_count≥3, diagnosis_ok=true, pass=true

  • val11/val11-corrupt-plan-create.txt — HTTP 201 for corrupt-artifact plan

  • val11/val11-corrupt-plan.json — GET /v1/rollouts/val11-corrupt-1 response

  • val11/val11-corrupt-check.txt — create_code=201, get_ok=true with plan.metadata.id=val11-corrupt-1, pass=true

  • val11/val11-execute-rollback-corrupt.txt — rollback execute output for val11-corrupt-1

  • val11/val11-rollback-corrupt-check.txt — exit_ok=true, pass=true

  • val11/cascade/execute-retry-dev-{1..3}.txt — cascade retry output per plan

  • val11/val11-cascade-check.txt — cascade_ok=3, cascade_fail=0, pass=true

  • val11/val11-kill-client-error.txt — CLI output after CP kill

  • val11/val11-kill-check.txt — exit_nonzero=true, has_connection_error=true, pass=true

  • val11/val11-durability-list.json — GET /v1/rollouts after first restart

  • val11/val11-durability-check.txt — list_count≥10, pass=true

  • val11/val11-gate-plan.json — GET /v1/rollouts/val11-gate-1 after restart

  • val11/val11-gate-check.txt — phase=published, pass=true

  • val11/val11-rapid-restart.txt — 3× kill/restart cycle log + final list count + new plan code

  • val11/val11-audit-executed.json — rollback.executed events from retained store

  • val11/val11-audit-check.txt — rollback_executed_success_count≥1 with actor/start-time scope, pass=true

  • val11/val11-report.txt — composite 10-check PASS/FAIL chaos report

  • val11/val11-report.json — machine-readable JSON chaos report

  • val13/ for the VAL13 HA failover validation captures

  • val13/probe-table-create.txt — output of CREATE TABLE val13_probe

  • val13/val13-node1-s0.log — node-1 HA server initial session log

  • val13/val13-node2-s0.log — node-2 HA server initial session log

  • val13/val13-node1-s1.log — node-1 HA server restart-1 log

  • val13/val13-node1-s2.log — node-1 HA server restart-2 log

  • val13/val13-node2-s1.log — node-2 HA server restart-1 log

  • val13/val13-node2-s2.log — node-2 HA server restart-2 log

  • val13/val13-01-node1-status.txt — ha status for node-1 at baseline

  • val13/val13-01-node2-status.txt — ha status for node-2 at baseline (follower)

  • val13/val13-02-pre-kill-write.txt — SQL INSERT output before kill

  • val13/val13-03-failover-timing.txt — failover_ms=<N>  signal=TERM

  • val13/val13-03-node2-status-after.txt — ha status immediately after node-2 takes over

  • val13/val13-04-data-probe.txt — SELECT note result (should contain pre-kill)

  • val13/val13-05-write-ready.txt — ha status confirming node-2 is active leader

  • val13/val13-06-quorum-lost.json — /v1/ha/quorum JSON showing quorum_health=lost

  • val13/val13-07-quorum-healthy.json — /v1/ha/quorum JSON showing quorum_health=healthy

  • val13/val13-07-data-after-pg-restart.txt — probe row intact after PG stop/start

  • val13/val13-08-cycle1.txt — rapid kill cycle 1 timing

  • val13/val13-08-cycle2.txt — rapid kill cycle 2 timing

  • val13/val13-08-cycle3.txt — rapid kill cycle 3 timing

  • val13/val13-08-rapid-summary.txt — cycle1_ms=N  cycle2_ms=N  cycle3_ms=N

  • val13/val13-08-post-cycle-status.txt — ha status after the rapid cycles

  • val13/val13-08-data-after-rapid.txt — probe row still intact after the rapid cycles

  • val13/val13-09-sigkill-timing.txt — failover_ms=<N>  signal=KILL

  • val13/val13-10-final-status.txt — ha status for surviving leader after all tests

  • val13/val13-report.txt — composite 10-check PASS/FAIL HA failover report

  • val13/val13-report.json — machine-readable JSON HA failover report

  • val14/ for the VAL14 HA replication lag baseline captures

  • val14/val14-pg-setup.txt — role create, pg_hba update, table create, synchronous config

  • val14/val14-pg-basebackup.txt — pg_basebackup progress + replication slot creation

  • val14/val14-ha-server.log — HA server log (leader election + quorum monitor)

  • val14/val14-01-replication.txt — pg_stat_replication output (state=streaming)

  • val14/val14-02-idle-lsn-gap.txt — LSN gap at rest (should be 0)

  • val14/val14-02-idle-lag-samples.txt — 10 idle write_lag_ms samples

  • val14/val14-light-write.txt — INSERT output for 100-row light write

  • val14/val14-03-light-lag-samples.txt — 5 lag samples captured before the light-load drain fully completes

  • val14/val14-03-light-result.txt — light_drain_ms=<N>  threshold=2000

  • val14/val14-heavy-write.txt — INSERT output for 500-row heavy write

  • val14/val14-04-heavy-lag-samples.txt — 5 lag samples during/after heavy write

  • val14/val14-04-heavy-result.txt — heavy_drain_ms=<N>  threshold=5000

  • val14/val14-05-post-drain-samples.txt — 5 lag samples after heavy drain

  • val14/val14-05-drain-result.txt — post_drain_gap=<N>

  • val14/val14-06-ha-endpoint.json — /v1/health/replication JSON response

  • val14/val14-07-quorum-degraded.json — /v1/ha/quorum JSON showing quorum_health=degraded

  • val14/val14-08-quorum-healthy.json — /v1/ha/quorum JSON showing quorum_health=healthy

  • val14/val14-09-offline-write.txt — INSERT output for the backlog generated while standby is offline

  • val14/val14-09-post-restart-replication.txt — pg_stat_replication after standby restart

  • val14/val14-09-catchup-result.txt — catchup_drain_ms=<N>  ok=true/false

  • val14/val14-report.txt — composite 10-check report with lag measurements and derived thresholds

  • val14/val14-report.json — machine-readable JSON replication lag baseline report

  • val15/ for the VAL15 backup/restore validation captures

  • val15/val15-pg-setup.txt — Docker container IP, postgres URL, table create and fixture load output

  • val15/val15-ha-server.log — HA server log (normal mode, backup phase)

  • val15/val15-ha-server-maint.log — HA server log (maintenance mode, restore phase)

  • val15/val15-ha-server-post.log — HA server log (normal mode, multi-backup and error-path phase)

  • val15/val15-01-backup-create.txt — ha backup create stdout+stderr (backup_id, checksum, size_bytes)

  • val15/val15-01-backup-file-stat.txt — stat of the .dump file

  • val15/val15-02-backup-toc.txt — pg_restore -l table-of-contents of the backup archive

  • val15/val15-03-backup-list.txt — ha backup list --output json after first backup

  • val15/val15-03-metadata-check.txt — assertion result: backup_id found with status=completed

  • val15/val15-04-checksum-verify.txt — cli_checksum=<hex> + file_checksum=<hex> comparison

  • val15/val15-05-backup-timing.txt — backup_ms=<N>

  • val15/val15-05-db-before.txt — row counts before backup (small=100 medium=1000)

  • val15/val15-06-post-backup-mutation.txt — SQL UPDATE/DELETE output confirming post-backup mutation

  • val15/val15-06-restore.txt — ha backup restore stdout+stderr

  • val15/val15-06-db-after-restore.txt — row counts after restore

  • val15/val15-06-data-check.txt — SQL spot-check query output (100|1000|t|t)

  • val15/val15-06-integrity-result.txt — assertion result: restore_correct=true small=100 medium=1000

  • val15/val15-07-restore-timing.txt — restore_ms=<N>

  • val15/val15-08-backup-create-2.txt — second ha backup create output

  • val15/val15-08-backup-list-multi.txt — ha backup list --output json showing both backup IDs

  • val15/val15-08-inventory-check.txt — assertion result: multi_backup_count=2

  • val15/val15-09-restore-no-confirm.txt — ha backup restore without --confirm (expected error mentioning --confirm)

  • val15/val15-10-audit-backup-created.json — audit query --event-type ha.backup.created result

  • val15/val15-10-audit-backup-restored.json — audit query --event-type ha.backup.restored result

  • val15/val15-10-audit-check.txt — assertion result: event counts for both audit event types

  • val15/backups/backup-val15-a.dump — pg_dump custom-format archive (first backup)

  • val15/backups/backup-val15-b.dump — pg_dump custom-format archive (second backup)

  • val15/val15-report.txt — composite 10-check PASS/FAIL report with timing values

  • val15/val15-report.json — machine-readable JSON report with backup_ms, restore_ms, pass_count

  • val16/ for the VAL16 split-brain chaos validation captures

  • val16/val16-ha-server.log — HA server log (single normal session throughout)

  • val16/val16-probe-setup.txt — Docker psql output: CREATE TABLE + INSERT for probe row

  • val16/val16-01-baseline.json — /v1/ha/split-brain JSON at baseline (risk=none)

  • val16/val16-01-baseline.txt — ha split-brain detect CLI output at baseline

  • val16/val16-02-epoch-inject.txt — SQL UPDATE output (epoch divergence injection)

  • val16/val16-02-detected.json — /v1/ha/split-brain JSON after injection (risk=detected)

  • val16/val16-02-detected.txt — ha split-brain detect CLI output after injection

  • val16/val16-03-detect-repeat.json — second /v1/ha/split-brain call (idempotency check)

  • val16/val16-04-recover-dry-run.txt — ha split-brain recover --strategy manual-reconcile stdout+stderr

  • val16/val16-04-risk-after-dry-run.json — /v1/ha/split-brain JSON after dry-run (risk unchanged)

  • val16/val16-04-risk-check.txt — assertion result: risk_after_dry_run=detected

  • val16/val16-05-recover-execute.txt — ha split-brain recover --strategy promote-leader stdout+stderr

  • val16/val16-05-recovered.json — /v1/ha/split-brain JSON after promote-leader (risk=none)

  • val16/val16-05-recovered.txt — ha split-brain detect CLI output after recovery

  • val16/val16-06-probe-after-recovery.txt — SQL SELECT: note from val16_probe WHERE id=1

  • val16/val16-07-ghost-inject.txt — SQL INSERT output (ghost-node epoch rows)

  • val16/val16-07-possible.json — /v1/ha/split-brain JSON after ghost injection (risk=possible)

  • val16/val16-07-possible.txt — ha split-brain detect CLI output for risk=possible

  • val16/val16-08-ghost-clear.txt — SQL UPDATE output (stamp resigned_at on ghost rows)

  • val16/val16-08-cleared.json — /v1/ha/split-brain JSON after clearing (risk=none)

  • val16/val16-09-audit-detected.json — audit query --event-type ha.split_brain.detected --start-time <slice_start> JSON result

  • val16/val16-09-audit-recovered.json — audit query --event-type ha.split_brain.recovered --actor val16-operator --start-time <slice_start> JSON result

  • val16/val16-09-audit-check.txt — assertion result: slice-scoped event counts for both audit event types

  • val16/val16-10-final-status.json — /v1/ha/status JSON confirming cp-val16-node holds leadership

  • val16/val16-10-stability-check.txt — assertion result: holder_id check

  • val16/val16-report.txt — composite 10-check PASS/FAIL report

  • val16/val16-report.json — machine-readable JSON report with pass_count and scenario results

  • val17/ for the VAL17 quorum loss validation captures

  • val17/val17-pg-setup.txt — Docker container IP and PG URL used by HA server

  • val17/val17-ha-server.log — HA server log (single session throughout all phases)

  • val17/val17-01-baseline.json — /v1/ha/quorum JSON at baseline (quorum_health=healthy)

  • val17/val17-01-baseline.txt — ha quorum status CLI output at baseline

  • val17/val17-01-baseline-check.txt — assertion: quorum_health=healthy write_block_active=False

  • val17/val17-02-quorum-lost.json — /v1/ha/quorum JSON after PG stop (quorum_health=lost)

  • val17/val17-02-quorum-lost.txt — ha quorum status CLI output during loss

  • val17/val17-03-loss-timing.txt — loss_ms=<N>

  • val17/val17-04-write-block-check.txt — assertion: write_block_active=True can_accept_protected_writes=False

  • val17/val17-05-loss-reason.txt — assertion: quorum_loss_reason=<message>

  • val17/val17-06-quorum-recovered.json — /v1/ha/quorum JSON after PG restart (quorum_health=healthy)

  • val17/val17-06-quorum-recovered.txt — ha quorum status CLI output after recovery

  • val17/val17-07-recovery-timing.txt — recovery_ms=<N>

  • val17/val17-08-recovery-check.txt — assertion: write_block_active, can_accept_protected_writes, timestamps, detected_loss_count

  • val17/val17-09-second-loss.json — /v1/ha/quorum JSON during second loss

  • val17/val17-09-second-recovery.json — /v1/ha/quorum JSON after second recovery

  • val17/val17-09-count-check.txt — assertion: detected_loss_count=2 (second cycle confirmed)

  • val17/val17-10-audit-lost.json — audit query --event-type ha.quorum.lost JSON result

  • val17/val17-10-audit-restored.json — audit query --event-type ha.quorum.restored JSON result

  • val17/val17-10-audit-check.txt — assertion: lost_events=N restored_events=M

  • val17/val17-10-audit-check.txt — assertion: lost_events=N restored_events=M

  • val17/val17-report.txt — composite 10-check PASS/FAIL report with loss_ms and recovery_ms

  • val17/val17-report.json — machine-readable JSON with loss_ms, recovery_ms, pass_count

6. Expected Results

Every audit log in the bundle should include the canonical fields (a presence-check sketch follows this list):

  • audit.event

  • audit.category

  • audit.action

  • audit.outcome

  • audit.resource

  • audit.resource_type

  • audit.source=cli
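
A minimal presence check over one captured log, assuming the canonical keys appear literally in each emitted line (slog-style key=value or JSON encoding; adjust the match if the encoding differs):

```bash
# Verify one captured audit log carries every canonical key.
log=audit-rollout-plan-create.log
for key in audit.event audit.category audit.action audit.outcome \
           audit.resource audit.resource_type audit.source; do
  grep -q "$key" "$log" && echo "$key=present" || echo "$key=MISSING"
done
```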

Expected event mapping:

  • audit-rollout-plan-create.log → rollout.plan.created

  • audit-rollout-plan-publish.log → rollout.plan.published

  • audit-rollout-plan-cancel.log → rollout.plan.cancelled

  • audit-ha-backup-create.log → ha.backup.created

  • audit-ha-backup-restore.log → ha.backup.restored

  • audit-ha-failover-trigger.log → ha.failover.triggered and ha.failover.completed

  • audit-cert-issue.log → cert.issued

  • audit-cert-rotate.log → cert.rotated

  • rbac/audit-rbac-role-assign.log → auth.role.assigned

  • relay/audit-relay-deadletter-retry.log → relay.deadletter.retried

  • relay/audit-relay-deadletter-purge.log → relay.deadletter.purged

  • relay/audit-relay-bandwidth-set.log → relay.bandwidth.configured

The paired stdout files should show that each command completed successfully in the same live run.

The metrics surface should also show (capture sketch after the list):

  • metrics/orchestrator-metrics-raw.txt contains the real Prometheus exposition emitted by the live control-plane started by the lab

  • metrics/metrics-list.txt and metrics/metrics-list.json contain the same metric catalog in text and JSON form

  • metrics/metrics-query-all.txt contains live samples such as cp_http_requests_total, cp_health_checks_total, and cp_rollout_plans_total

  • metrics/metrics-query-rollout-plans.json contains the filtered rollout phase counters generated by the rollout actions in the same run

  • metrics/metrics-query-http-duration.txt contains the histogram family, including _bucket, _count, and _sum
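
A minimal capture sketch; the control-plane address below is a placeholder for the port the lab run actually starts:

```bash
# Scrape the live Prometheus exposition and confirm the histogram family
# is complete. The address is a placeholder.
cp_addr=http://127.0.0.1:18080
curl -fsS "$cp_addr/metrics" > metrics/orchestrator-metrics-raw.txt
for suffix in _bucket _count _sum; do
  grep -q "cp_http_request_duration_seconds${suffix}" \
       metrics/orchestrator-metrics-raw.txt \
    && echo "duration${suffix}=present" || echo "duration${suffix}=MISSING"
done
```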

The RBAC surface should also show (direct-request sketch after the list):

  • rbac/rbac-role-create.txt creates a custom lowercase role derived from the canonicalized input name

  • rbac/rbac-role-list-before-assign.txt shows the custom role with 0 assignments

  • rbac/rbac-role-assign.txt assigns the role to the trimmed subject identity

  • rbac/rbac-role-assign-repeat.txt shows the idempotent repeat-assignment no-op

  • rbac/rbac-role-list-after-assign.txt shows the role with 1 assignment

  • rbac/assignments.json persists the normalized role and subject

  • retained/audit-query-category-auth.json returns the retained auth.role.assigned record from the same live run

  • rbac/rbac-role-assign-operator.txt assigns the predefined operator role used by the PR-21 allow-path checks

  • rbac/rbac-audit-query-denied.stderr and rbac/rbac-audit-export-denied.stderr show fail-closed denial for an operator who lacks audit_history:read

  • rbac/rbac-audit-query-allowed.json and rbac/rbac-audit-export-allowed.json show the authorized retained auth view for reviewer@example.com

  • rbac/rbac-ha-status-denied.stderr shows fail-closed denial for an unassigned operator on ha status

  • rbac/rbac-ha-status-allowed.txt shows successful HA status output for fleet-op@example.com

  • ha/ha-status-no-header.headers and ha/ha-status-no-header.json show the server-side 403 denial when /v1/ha/status is called without operator identity under enforcement

  • ha/ha-status-with-header.headers and ha/ha-status-with-header.json show the server-side 200 success path when the authorized operator header is set

  • rbac/retained-auth-access-denied.json returns the retained auth.access.denied records from the same live enforcement run
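
A minimal sketch of the direct /v1/ha/status checks. Both the control-plane address and the operator identity header name (X-Autonomy-Operator) are assumptions; substitute the values the lab actually uses:

```bash
base=http://127.0.0.1:18080            # placeholder address

# Without operator identity: expect a 403 under enforcement.
curl -s -D ha/ha-status-no-header.headers \
     -o ha/ha-status-no-header.json "$base/v1/ha/status"

# With an authorized operator identity: expect a 200.
# X-Autonomy-Operator is a placeholder header name.
curl -s -D ha/ha-status-with-header.headers \
     -H 'X-Autonomy-Operator: fleet-op@example.com' \
     -o ha/ha-status-with-header.json "$base/v1/ha/status"

head -1 ha/ha-status-no-header.headers     # expect a 403 status line
head -1 ha/ha-status-with-header.headers   # expect a 200 status line
```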

The retained surface should also show (no-truncation sketch after the list):

  • retained/retained-file-list.txt includes the daily JSONL file under retained/store/

  • retained/audit-query-all.txt returns a mixed set of rollout, HA, cert, and relay records from the live run

  • retained/audit-query-ha-backup-created.json returns the filtered ha.backup.created records

  • retained/audit-query-category-rollout.txt returns only rollout-category records from the same live retained dataset

  • retained/audit-query-source-edge.json returns only source=edge relay records from the same live retained dataset

  • retained/audit-query-outcome-success.txt returns the success-only retained dataset

  • retained/audit-query-invalid-output.stderr shows unknown --output values fail closed with the supported value list

  • retained/audit-query-invalid-range.stderr shows malformed time ranges fail closed instead of returning a silent empty result

  • retained/audit-export-all.json and retained/audit-export-all.csv contain the same retained dataset in export form

  • retained/audit-export-invalid-format.stderr shows unsupported export formats fail before writing

  • retained/audit-export-invalid-format-target-before.sha256 and retained/audit-export-invalid-format-target-after.sha256 match, proving the invalid-format failure did not truncate the existing target file
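
A minimal sketch of the no-truncation assertion. The --output-file flag name is a placeholder for the CLI's actual export-target flag:

```bash
target=retained/audit-export-all.json

sha256sum "$target" > retained/audit-export-invalid-format-target-before.sha256

# Attempt an export with an unsupported format; the command must fail
# before writing. --output-file is a placeholder flag name.
autonomy audit export --format xml --output-file "$target" \
  2> retained/audit-export-invalid-format.stderr || true

sha256sum "$target" > retained/audit-export-invalid-format-target-after.sha256
diff retained/audit-export-invalid-format-target-before.sha256 \
     retained/audit-export-invalid-format-target-after.sha256 \
  && echo "target_untouched=true"
```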

The support-bundle surface should also show (inspection sketch after the list):

  • support-bundle/support-bundle-generate.log records the live collection run

  • support-bundle/support-bundle-contents.txt lists the expected archive members

  • support-bundle/manifest.json marks system_info, build_info, config, ha_status, audit_recent, and logs as ok

  • support-bundle/config_redacted.yaml contains <REDACTED> for fleet_salt and REDACTED in the postgres_url password position

  • support-bundle/ha_status.json contains the live HA snapshot from the running helper

  • support-bundle/audit_recent.json contains retained records from the same lab run

  • support-bundle/logs-autonomy.log contains the tailed HA server log

  • support-bundle/support-bundle-sha256.txt records the resulting archive hash
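
A minimal inspection sketch; the bundle path and the manifest's collectors layout are assumptions inferred from the artifacts above:

```bash
bundle=support-bundle/bundle.tar.gz    # placeholder archive path

tar -tzf "$bundle" > support-bundle/support-bundle-contents.txt
sha256sum "$bundle" > support-bundle/support-bundle-sha256.txt

# Print collector statuses from the manifest; the .collectors[] shape is
# an assumption based on the manifest-check artifact format.
tar -xzOf "$bundle" manifest.json \
  | jq -r '.collectors[] | "collector=\(.name) status=\(.status)"'
```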

The database-backed audit surface (when AUTONOMY_AUDIT_PG_URL is set) should also show (query sketch after the list):

  • db_audit/query-db-all.txt contains the three seeded rows (rollout, cert, ha) ordered newest-first with event_name, actor, resource, outcome, and source

  • db_audit/query-db-cert.json contains exactly one record with category=cert

  • db_audit/export-db-all.json and db_audit/export-db-all.csv contain the same seeded dataset as query-db-all.txt in export format

  • db_audit/prune-90d.txt shows deleted=0 (no rows older than 90 days in the lab)

  • db_audit/prune-1d.txt shows deleted=2 (the 1h and 2h rows are removed)

  • db_audit/query-db-after-prune.json contains exactly one row (the most recent)

  • db_audit/query-file-all.txt contains records from the file store in parallel, proving the file emitter remained active alongside the DB emitter
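
A minimal sketch of the DB-backed path. The PG URL is a placeholder and the --category flag name is an assumption; --pg-url (via env here), --format, --output, and --older-than are the surfaces this lab documents:

```bash
export AUTONOMY_AUDIT_PG_URL='postgres://audit:audit@127.0.0.1:5432/audit?sslmode=disable'

autonomy audit query  --output json       > db_audit/query-db-all.json
autonomy audit query  --category cert --output json \
                                          > db_audit/query-db-cert.json  # --category is assumed
autonomy audit export --format csv        > db_audit/export-db-all.csv
autonomy audit prune  --older-than 1d     > db_audit/prune-1d.txt
```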

The VAL 01 zero-downtime rotation surface should also show (timing sketch after the list):

  • autonomy/cert-rotation-list-expiring.txt contains either expiring or node-c, confirming the 2-day cert falls inside the --expiring-within-days 5 window

  • autonomy/cert-rotation-prerotate-health.json contains "status":"ok", confirming the old cert was accepted over live mTLS before rotation

  • autonomy/cert-rotation-timing.txt contains pass=true and rotation_elapsed_seconds=<N> where N is well below the 300-second bound, proving the rotation operation itself is effectively instantaneous

  • autonomy/cert-rotation-rotate.txt contains rotated  identity=node-c.edge.local cert=... valid_days=90, confirming the operation succeeded with the default 90-day renewal

  • autonomy/cert-rotation-list-after.txt contains no certificates matched, confirming the 90-day replacement cert is not in the 5-day expiry window

  • autonomy/cert-rotation-postrotate-health.json contains "status":"ok", proving the rotated client cert was accepted without restarting the control-plane

  • autonomy/cert-rotation-audit-events.json contains a record with cert.rotated, confirming the event was retained in the audit store

  • autonomy/cert-rotation-before-dates.txt and autonomy/cert-rotation-after-dates.txt have different serial= values, confirming a new keypair was issued

  • autonomy/cert-rotation-val01-report.txt reports 6/6 checks PASS with serials_differ=true
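
A minimal sketch of the timing and serial checks; the cert path and the positional identity argument are placeholders for the lab's actual invocation:

```bash
cert=certs/node-c.edge.local.pem       # placeholder path for the node-c cert

openssl x509 -in "$cert" -noout -serial -enddate \
  > autonomy/cert-rotation-before-dates.txt

t0=$(date +%s)
# The positional identity argument is an assumption about the CLI shape.
autonomy cert rotate node-c.edge.local > autonomy/cert-rotation-rotate.txt
t1=$(date +%s)
elapsed=$((t1 - t0))
echo "pass=$([ "$elapsed" -le 300 ] && echo true || echo false) rotation_elapsed_seconds=$elapsed" \
  > autonomy/cert-rotation-timing.txt

openssl x509 -in "$cert" -noout -serial -enddate \
  > autonomy/cert-rotation-after-dates.txt
```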

The VAL 02 trust-chain rejection surface should also show:

  • autonomy/cert-rejection-missing-client-cert.stderr is non-empty while the paired .stdout is empty, and stderr matches the missing-client-cert handshake pattern, confirming the control-plane requires a client certificate

  • autonomy/cert-rejection-invalid-chain.stderr matches the invalid-chain pattern, confirming a cert signed by a rogue CA is rejected even when the CN matches a legitimate node identity

  • autonomy/cert-rejection-expired-cert.stderr matches the expired-cert pattern, confirming expired certificates are rejected (validity period enforcement)

  • autonomy/cert-rejection-revoked.stderr matches the revoked-cert pattern, and the existing autonomy/cert-revocation-rejected-events.json retained audit evidence still proves the VerifyPeerCertificate callback path

  • autonomy/cert-rejection-wrong-server-trust.stderr matches the server-trust-verify pattern, confirming mTLS is bidirectional: the client cannot connect when it cannot verify the server’s cert chain

  • autonomy/cert-rejection-val02-report.txt reports 5/5 checks PASS and includes the right_ca_wrong_cn note confirming that a cert from the trusted CA with an unexpected CN is accepted at the TLS layer (identity-layer authorization is RBAC-based, not CN-based)

The cert RBAC surface should also show:

  • cert_rbac/denied-list.txt contains the string “cert:manage”, confirming that cert list (previously unguarded) now requires RBAC authorization

  • cert_rbac/denied-check-revocation.txt contains the string “cert:manage”, confirming that cert check-revocation (previously unguarded) now requires RBAC authorization

  • cert_rbac/denied-issue.txt, denied-rotate.txt, denied-revoke.txt, and denied-sync-crl.txt each contain “cert:manage”, confirming consistent coverage

  • cert_rbac/allowed-issue.txt contains the successful issued  identity=... line and does NOT contain an RBAC denial, confirming mutation success under cert:manage

  • cert_rbac/allowed-list.txt contains the listed node-a.edge.local row and does NOT contain an RBAC denial, confirming read-only success under cert:read

  • cert_rbac/allowed-check-revocation.txt contains not_revoked, confirming read-only revocation inspection succeeds under cert:read

  • cert_rbac/audit-denied-events.json contains auth.access.denied records with permission fields referencing cert:manage and cert:read | cert:manage, confirming denial is audited before the error is returned

The VAL03 RBAC enforcement surface should also show:

  • val03/val03-01-ha-status-deny.stderr contains rbac: and the fleet:read permission name, confirming the guard fires before any HTTP call when the operator has no assignment

  • val03/val03-05-audit-query-operator-deny.stderr and val03/val03-06-audit-query-analyst-deny.stderr each contain rbac: with audit_history:read, confirming the operator and analyst roles both lack the audit permission

  • val03/val03-10-rbac-role-create-operator-deny.stderr and val03/val03-11-rbac-role-create-analyst-deny.stderr each contain rbac: with rbac:manage, confirming neither operator nor analyst can create roles

  • val03/val03-02-ha-status-operator-allow.txt, val03-03-ha-status-analyst-allow.txt, and val03-04-ha-status-auditor-allow.txt each contain HA status JSON (or SKIP if the HA server was unavailable), confirming all three predefined roles include fleet:read and that the mirrored server-side RBAC store authorizes the same identities

  • val03/val03-07-audit-query-auditor-allow.json contains auth-category audit records, confirming audit_history:read in the auditor role allows the query

  • val03/val03-12-rbac-role-create-auditor-allow.txt contains created role "val03-test-role", confirming rbac:manage in the auditor role allows custom role creation

  • val03/val03-08-rbac-role-list-unassigned.txt does NOT contain an rbac: denial and lists known roles such as operator and auditor, confirming rbac role list has no RBAC guard

  • val03/val03-09-rollout-plan-list-unassigned.stderr contains a connection error, NOT rbac:, confirming rollout plan list has no RBAC guard

  • val03/val03-13-support-bundle-unassigned.stderr contains normal bundle generation progress plus bundle written:, and val03/val03-support-bundle.tar.gz is non-empty, confirming support-bundle generate has no top-level RBAC guard even if optional nested collectors emit RBAC warnings for guarded HA sub-requests

  • val03/val03-14-access-denied-events.json contains auth.access.denied records for the five expected VAL03 deny tuples, confirming every denial from the VAL03 DENY checks was written to the retained audit store before the error was returned

  • val03/val03-report.txt reports the final pass=<N> skip=<M> fail=<K> total=14 summary with zero failures

  • val03/val03-report.json contains pass_count, skip_count, and per-check status values so HA unavailability is recorded as SKIP rather than PASS

The VAL04 audit completeness surface should also show (latency sketch after the list):

  • val04/val04-store-inventory.txt reports store_jsonl_files > 0, confirming the retained store is non-empty after all prior lab phases

  • val04/val04-category-summary.txt reports count > 0 for all 6 categories (rollout, ha, cert, relay, auth, rollback), confirming every category is populated

  • val04/val04-schema-check.txt reports PASS for all 6 category schema checks with no MISSING field lines, confirming every returned record carries the mandatory audit fields

  • val04/val04-latency.txt reports query_ok=true and pass=true with query_elapsed_ms ≤ 2000, confirming the full retained store query both succeeded and stayed within the latency bound

  • val04/val04-coverage-report.txt reports PRESENT for all 25 of the 25 wired event types, with pass=true on the final summary line; any ABSENT line is now a real validation failure because the runner is expected to exercise the full wired event surface deterministically

  • val04/val04-report.txt reports pass=10 fail=0 total=10 summary

  • val04/val04-report.json contains pass_count=10, coverage_found=25, latency_ms ≤ 2000, and per-check status values
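
A minimal sketch of the VAL04-C3 latency check, using GNU date millisecond timestamps:

```bash
start_ms=$(date +%s%3N)
if autonomy audit query --limit 0 --output json > /tmp/val04-all.json; then
  ok=true
else
  ok=false
fi
end_ms=$(date +%s%3N)
elapsed=$((end_ms - start_ms))
echo "query_ok=$ok query_elapsed_ms=$elapsed pass=$([ "$ok" = true ] && [ "$elapsed" -le 2000 ] && echo true || echo false)"
```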

The VAL05 OTel integration surface should also show (field-check sketch after the list):

  • val05/val05-prometheus-status.txt reports http_code=200, confirming the control-plane Prometheus endpoint is reachable

  • val05/val05-prometheus-families.txt reports PRESENT for all 4 required metric families (cp_http_requests_total, cp_http_request_duration_seconds, cp_rollout_plans_total, cp_events_ingested_total)

  • val05/val05-events-ingest.json shows a successful POST /v1/events response, confirming that VAL05 itself exercised the event-ingestion path

  • val05/val05-prometheus-observations.txt reports non-zero sample values for cp_http_requests_total, cp_http_request_duration_seconds_count, cp_rollout_plans_total, and cp_events_ingested_total, confirming that lab traffic plus the explicit ingest produced real observations

  • val05/val05-emit-helper.txt contains emitted 3 events to, confirming the telemetry.Emitter → WAL write path succeeded

  • val05/val05-wal-status.json contains "total":3 and "pending":3, confirming events are persisted and not yet flushed

  • val05/val05-export.jsonl contains exactly 3 lines with "kind", "ts", "seq", "written_at", and "attrs" fields present in each line

  • val05/val05-flush-stdout.txt contains telemetry flush: OK 3 events sent to http://127.0.0.1:14318, confirming end-to-end OTLP delivery

  • val05/val05-flush-summary.txt reports sink_payloads > 0, confirming the sink printed at least one actual OTLP payload receipt line rather than only its startup banner

  • val05/val05-traceid-jsonl.txt reports trace_id_found=4bf92f3577b34da6a3ce929d0e0e4736 and span_id_found=00f067aa0ba902b7, confirming trace/span propagation through WAL → JSONL

  • val05/val05-traceid-otlp.txt reports traceId_found=true and spanId_found=true, confirming trace/span propagation in the OTLP/HTTP path

  • val05/val05-report.txt reports pass=9 fail=0 total=9 summary

  • val05/val05-report.json contains pass_count=9 and per-check status values
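
A minimal sketch of the JSONL field check over the exported WAL lines:

```bash
# True only if every exported line carries all five required fields.
jq -es 'all(has("kind") and has("ts") and has("seq")
            and has("written_at") and has("attrs"))' \
  val05/val05-export.jsonl > /dev/null \
  && echo "jsonl_fields=complete" || echo "jsonl_fields=incomplete"
```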

The VAL06 support-bundle surface should also show (PEM-scan sketch after the list):

  • val06/val06-timing.txt reports generate_ok=true and elapsed_s ≤ 30, confirming the bundle was created within the time bound

  • val06/val06-core-files.txt reports PRESENT for all three core files (manifest.json, system_info.json, build_info.json)

  • val06/val06-manifest-check.txt reports a status line for each of the 6 collectors (system_info, build_info, config, ha_status, audit_recent, logs), confirming all are recorded in the manifest

  • val06/val06-sysinfo-check.txt reports PRESENT for all 5 required fields (os, arch, go_version, hostname, collected_at)

  • val06/val06-audit-check.txt reports audit_recent_count > 0, confirming the bundle captured records from the retained audit store

  • val06/val06-redaction-salt.txt reports fleet_salt_placeholder=true and fleet_salt_actual_absent=true, confirming the known test salt was replaced with <REDACTED> and the original value does not appear

  • val06/val06-redaction-pg.txt reports pg_redacted_present=true and pg_secret_absent=true, confirming the postgres password was replaced with REDACTED in the URL and the original password does not appear

  • val06/val06-privkey-check.txt reports privkey_hits=0, confirming no PEM block (-----BEGIN) appears anywhere in the bundle archive

  • val06/val06-degraded-check.txt reports bundle_exit_ok=true and ha_status_status=failed, confirming graceful degradation when the control-plane URL is unreachable

  • val06/val06-report.txt reports pass=10 fail=0 total=10 summary

  • val06/val06-report.json contains pass_count=10 and per-check status values
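
A minimal sketch of the private-key scan; the bundle path is a placeholder for the archive produced by the run:

```bash
workdir=$(mktemp -d)
tar -xzf val06/val06-bundle.tar.gz -C "$workdir"   # placeholder bundle path
hits=$(grep -ra -- '-----BEGIN' "$workdir" | wc -l)
echo "privkey_hits=$hits pass=$([ "$hits" -eq 0 ] && echo true || echo false)" \
  > val06/val06-privkey-check.txt
rm -rf "$workdir"
```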

The VAL07 rollout latency surface should also show (percentile sketch after the list):

  • val07/val07-health.txt reports health_code=200, confirming the dedicated VAL07 control plane started and is reachable before the benchmark begins

  • val07/val07-create-percentiles.txt reports p50_ms, p95_ms, and p99_ms all within their respective bounds (100/300/500 ms), with n=20 and sample_complete=true confirming a full successful sample was collected

  • val07/val07-list-percentiles.txt reports p99_ms ≤ 500, confirming the list path (with 20 existing plans) is within the same latency target; it also reports n=20 and sample_complete=true

  • val07/val07-concurrent-summary.txt reports conc_ok=5 and conc_errors=0, confirming all 5 parallel creates succeeded, and wall_ms ≤ 2000, confirming that single-writer SQLite serialisation does not make concurrent operator requests unacceptably slow

  • val07/val07-error-summary.txt reports error_count=0 across all 45 benchmark requests (20 creates + 20 lists + 5 concurrent creates)

  • val07/val07-prometheus-check.txt reports cp_http_requests_total > 0, confirming the Prometheus instrumentation on the VAL07 control plane is wired and received observations from the benchmark traffic

  • val07/val07-report.txt reports pass=9 fail=0 total=9 summary

  • val07/val07-report.json contains pass_count=9, plan_create_ms latency object, and per-check status values
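
A minimal nearest-rank percentile sketch over the raw capture (gawk's asort is assumed):

```bash
awk '{ ms[NR] = $2 * 1000 }            # column 2 is time_total in seconds
END {
  n = NR
  asort(ms)                            # gawk extension
  printf "n=%d p50_ms=%.1f p95_ms=%.1f p99_ms=%.1f min_ms=%.1f max_ms=%.1f\n",
    n, ms[int(0.50 * n + 0.999)], ms[int(0.95 * n + 0.999)],
    ms[int(0.99 * n + 0.999)], ms[1], ms[n]
}' val07/val07-create-raw.txt
```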

The VAL08 rollout throughput surface should also show:

  • val08/val08-health.txt reports status=ok, confirming the dedicated VAL08 control plane started and is reachable before the throughput run begins

  • val08/scenario-n100/scenario-report.txt reports ok=500 and errors=0, proving the primary workplan target (≥100 concurrent device rollouts without errors) is met

  • val08/val08-wall-clock-n100.txt reports elapsed_ms ≤ 30000, confirming 500 concurrent plan creates complete within the 30-second bound

  • val08/val08-throughput-scaling.txt reports tput_n100 ≥ tput_n1, confirming that issuing 100 concurrent worker streams does not regress throughput below the single-worker serial rate; the SQLite single-writer model is expected to produce a plateau (near-equal throughput) rather than linear scaling, which is acceptable

  • val08/val08-error-aggregate.txt reports total_errors=0 across all 805 plans created (N=1+10+50+100, 5 plans each)

  • val08/val08-list-consistency.txt reports list_count ≥ 805, confirming that all created plans are durably stored and returned across paginated list results

  • val08/val08-prometheus-check.txt reports cp_http_requests_total > 0, confirming Prometheus instrumentation received observations from the throughput traffic

  • val08/val08-report.txt reports pass=10 fail=0 total=10 summary

  • val08/val08-report.json contains pass_count=10, throughput object with n1/n10/n50/n100 plans/sec values, and per-check status values

The VAL09 stuck detection surface should also show:

  • val09/val09-health.txt reports status=ok, confirming the dedicated VAL09 control plane started before any stuck checks run

  • val09/val09-baseline-check.txt reports stuck_count=0 on an empty store, confirming the detection function handles the zero-plan case without error

  • val09/val09-fresh-check.txt reports stuck_count=0 immediately after creating 5 plans, confirming freshly-created plans are not falsely reported as stuck before the 3-second threshold elapses

  • val09/val09-stale-check.txt reports stuck_count=5 after the 4-second sleep, confirming all 5 published-phase plans exceed the threshold and are detected as stuck

  • val09/val09-diagnosis-check.txt reports diagnosis_populated=5 and diagnosis_exact=5, confirming every stuck plan carries the exact expected "zero activations" diagnosis string

  • val09/val09-pause-check.txt reports in_stuck_scan=no for val09-plan-b, confirming paused plans are excluded from the active-phase scan

  • val09/val09-cancel-check.txt reports in_stuck_scan=no for val09-plan-c, confirming terminal plans are excluded from the active-phase scan

  • val09/val09-retry-check.txt reports new_phase=active, confirming the retry recovery strategy transitions the plan to the active phase and refreshes updated_at (removing it from the stuck list at VAL09-10)

  • val09/val09-rollback-check.txt reports new_phase=rolled_back, confirming the rollback recovery strategy transitions the plan to the terminal phase

  • val09/val09-final-check.txt reports plan_a_present=true, plan_b_absent=true, plan_c_absent=true, plan_d_absent=true, plan_e_absent=true, pass=true — the final scan correctly surfaces only the one plan that received no operator action

  • val09/val09-report.txt reports pass=10 fail=0 total=10 summary

  • val09/val09-report.json contains pass_count=10 and per-check status values with scan_stale_count=5 and scan_final_count=1

The VAL10 rollback reliability surface should also show:

  • val10/val10-preview-check.txt reports preview_errors=0, confirming all four rollback preview targets exit 0 and produce safety profiles

  • val10/val10-preview-rollout_plan-check.txt reports safety_class=terminal, orchestrated=true, and valid_strategies=['retry', 'rollback'], confirming the rollout_plan preview JSON schema is correct

  • val10/val10-preview-relay-check.txt reports orchestrated=false and manual_path_has_edgectl=true, confirming relay_deadletter is correctly surfaced as a manual-only target with edgectl instructions

  • val10/val10-retry-rate.txt reports ok=5  fail=0  success_rate=1.000, confirming all 5 retry executes on real plans succeed

  • val10/val10-rollback-rate.txt reports ok=5  fail=0  success_rate=1.000, confirming all 5 rollback executes on real plans succeed

  • val10/retry/execute-retry-*.txt each show outcome=success  previous=published new=active, confirming the retry strategy transitions plans to active phase

  • val10/rollback/execute-rollback-*.txt each show outcome=success previous=published  new=rolled_back, confirming rollback transitions to terminal

  • val10/val10-execute-json-check.txt reports all three JSON output fields present (outcome, new_state, kind), confirming the --output json format is stable

  • val10/val10-nonexistent-check.txt reports non-zero exit code, confirming the CLI surfaces CP 404 errors as non-zero exit rather than silently succeeding

  • val10/val10-relay-not-orchestrated-check.txt reports non-zero exit code and edgectl instructions present, confirming manual-only targets are blocked from execute with actionable guidance

  • val10/val10-audit-preview-check.txt reports rollback.preview.requested_count ≥ 4 with actor/start-time scope, confirming this slice’s preview commands emit audit records to the retained store

  • val10/val10-aggregate-rate.txt reports agg_success_rate=1.000 (10/10), plus at least 10 retained success events scoped to this slice, satisfying the workplan target of ≥99% rollback success rate

  • val10/val10-report.txt reports pass=10 fail=0 total=10 summary

  • val10/val10-report.json contains pass_count=10, success_rate.aggregate.rate=1.000, and per-check status values

The VAL11 chaos surface should also show:

  • val11/val11-health.txt reports status=ok, confirming the dedicated chaos CP started cleanly on port 18996

  • val11/val11-kill-check.txt reports exit_nonzero=true and has_connection_error=true, confirming the CLI surfaces CP unavailability as a non-zero exit with an actionable message rather than silently succeeding or hanging

  • val11/val11-durability-check.txt reports list_count ≥ 10, confirming the full pre-kill plan corpus committed before the CP kill is recoverable from SQLite’s WAL after process restart

  • val11/val11-rapid-restart.txt shows all 3 kill+restart cycles completing with cp_ready=true, list_count_final ≥ list_count_before, and new_plan_code=201, confirming the write path is fully operational after repeated restarts

  • val11/val11-gate-check.txt reports phase=published, confirming a plan in the gate-wait state is not lost or corrupted by a CP kill — the operator’s pending gate decision survives

  • val11/val11-stuck-check.txt reports stuck_count ≥ 3 and diagnosis_ok=true, confirming the stuck-detection surface correctly identifies the device-unresponsive proxy plans and populates operator-visible diagnosis strings

  • val11/val11-corrupt-check.txt reports create_code=201 and get_ok=true with plan.metadata.id=val11-corrupt-1, confirming the CP accepts and stores plans with unconventional artifact references without rejecting them at ingestion (validation is the edge agent’s responsibility)

  • val11/val11-rollback-corrupt-check.txt reports exit_ok=true, confirming the operator can roll back a suspect plan regardless of its artifact metadata

  • val11/val11-cascade-check.txt reports cascade_ok=3  cascade_fail=0, confirming all 3 device-unresponsive proxy plans are recoverable via retry in a single operator pass

  • val11/val11-audit-check.txt reports rollback_executed_success_count ≥ 1 with actor/start-time scope, confirming the audit capture pipeline is not disrupted by CP kill/restart cycles — events emitted during chaos recovery sessions are retained in the shared audit store

  • val11/val11-report.txt reports pass=10 fail=0 total=10 summary

  • val11/val11-report.json contains pass_count=10 and per-check status values

The HA failover surface should also show (timing sketch after the list):

  • val13/val13-node1-s0.log contains acquired leadership, confirming node-1 won the initial leader election via advisory lock Campaign

  • val13/val13-01-node1-status.txt contains cp-val13-node1 confirming that node-1 reports itself as the active leader at baseline

  • val13/val13-03-failover-timing.txt reports failover_ms=<N> where N ≤ 5000 is the measured latency from SIGTERM to node-2 logging “acquired leadership”

  • val13/val13-04-data-probe.txt contains exactly pre-kill, confirming the shared PostgreSQL instance retained the probe row across leader failover

  • val13/val13-06-quorum-lost.json contains "quorum_health":"lost", confirming the quorum monitor detects PG unavailability within the polling window

  • val13/val13-07-quorum-healthy.json contains "quorum_health":"healthy", confirming quorum health returns after the explicit docker start recovery step

  • val13/val13-08-rapid-summary.txt reports all three cycle timings ≤ 5000 ms, and val13-08-post-cycle-status.txt plus val13-08-data-after-rapid.txt confirm the rapid cycles end with a stable leader and intact probe row

  • val13/val13-09-sigkill-timing.txt reports failover_ms=<N> where N ≤ 5000 after SIGKILL (no graceful Resign), validating that advisory lock release via TCP RST is fast enough to meet the HA readiness threshold

  • val13/val13-report.txt reports pass=10 fail=0 total=10

  • val13/val13-report.json contains pass_count=10, sigterm_failover_ms, sigkill_failover_ms, and rapid_cycle_ms array values as measurement evidence
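
A minimal sketch of the SIGTERM timing measurement; NODE1_PID is a placeholder for the leader's process ID:

```bash
t0=$(date +%s%3N)
kill -TERM "$NODE1_PID"                      # placeholder PID variable
until grep -q 'acquired leadership' val13/val13-node2-s0.log; do
  sleep 0.05
done
t1=$(date +%s%3N)
echo "failover_ms=$((t1 - t0))  signal=TERM" > val13/val13-03-failover-timing.txt
```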

The replication lag baseline surface should show (lag-probe sketch after the list):

  • val14/val14-01-replication.txt contains streaming in the state column of pg_stat_replication, confirming the standby is connected and receiving WAL

  • val14/val14-02-idle-lsn-gap.txt contains 0, confirming no unacknowledged WAL at rest

  • val14/val14-03-light-result.txt shows light_drain_ms ≤ 2000 ms for the 100-row × 500-byte write workload

  • val14/val14-04-heavy-result.txt shows heavy_drain_ms ≤ 5000 ms for the 500-row × 2000-byte (~1 MB) write workload

  • val14/val14-07-quorum-degraded.json shows quorum_health=degraded after docker stop val14-pg-standby

  • val14/val14-08-quorum-healthy.json shows quorum_health=healthy after docker start val14-pg-standby

  • val14/val14-09-offline-write.txt shows the write burst issued while the standby was offline, and val14/val14-09-catchup-result.txt shows ok=true and catchup_drain_ms ≤ 10000 ms after standby restart

  • val14/val14-report.txt shows pass=10  fail=0  total=10 and lists derived threshold values: healthy_thresh_ms, degraded_thresh_ms, alert_thresh_ms

  • val14/val14-report.json contains pass_count=10, idle_p95_ms, light_p95_ms, heavy_p99_ms, light_drain_ms, heavy_drain_ms, healthy_thresh_ms, degraded_thresh_ms, and alert_thresh_ms as measurement and threshold evidence
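
A minimal lag-probe sketch against the primary; the connection URL is a placeholder:

```bash
psql "$VAL14_PRIMARY_URL" -Atc "
  SELECT state,
         pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)      AS lsn_gap_bytes,
         COALESCE(EXTRACT(EPOCH FROM write_lag) * 1000, 0)::int AS write_lag_ms
  FROM pg_stat_replication;"
```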

The backup/restore validation surface should show (checksum sketch after the list):

  • val15/val15-01-backup-create.txt contains created  backup_id=backup-val15-a with non-empty checksum= and positive size= values

  • val15/val15-02-backup-toc.txt exits without error and contains TABLE DATA entries for val15_small and val15_medium

  • val15/val15-03-metadata-check.txt contains backup_id=backup-val15-a status=completed

  • val15/val15-04-checksum-verify.txt shows cli_checksum and file_checksum fields with matching 64-character hex strings

  • val15/val15-05-backup-timing.txt shows backup_ms ≤ 30,000

  • val15/val15-06-data-check.txt contains 100|1000|t|t — row counts and payload spot-checks confirming tables were restored to pre-backup state

  • val15/val15-06-integrity-result.txt contains restore_correct=true small=100 medium=1000

  • val15/val15-07-restore-timing.txt shows restore_ms ≤ 60,000

  • val15/val15-08-inventory-check.txt contains multi_backup_count=2

  • val15/val15-09-restore-no-confirm.txt shows an error message about missing --confirm; the CLI must exit non-zero

  • val15/val15-10-audit-check.txt contains created_events=N restored_events=M with both N ≥ 1 and M ≥ 1

  • val15/val15-report.txt shows pass=10  fail=0  total=10

  • val15/val15-report.json contains pass_count=10, backup_ms, and restore_ms as baseline timing evidence
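
A minimal sketch of the checksum cross-check, assuming the CLI reports a SHA-256 hex digest in its checksum= field:

```bash
cli=$(grep -o 'checksum=[0-9a-f]*' val15/val15-01-backup-create.txt | cut -d= -f2)
file=$(sha256sum val15/backups/backup-val15-a.dump | cut -d' ' -f1)
{
  echo "cli_checksum=$cli"
  echo "file_checksum=$file"
  echo "match=$([ "$cli" = "$file" ] && echo true || echo false)"
} > val15/val15-04-checksum-verify.txt
```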

The split-brain chaos validation surface should show:

  • val16/val16-01-baseline.json contains "risk": "none" before any injection

  • val16/val16-02-detected.json contains "risk": "detected" after epoch divergence injection

  • val16/val16-03-detect-repeat.json also contains "risk": "detected", confirming idempotency

  • val16/val16-04-recover-dry-run.txt exits without error and includes planning/recommendation output from manual-reconcile

  • val16/val16-04-risk-after-dry-run.json still contains "risk": "detected" (dry-run does not write to DB)

  • val16/val16-04-risk-check.txt reports risk_after_dry_run=detected

  • val16/val16-05-recover-execute.txt exits successfully and val16/val16-05-recovered.json contains "risk": "none" after promote-leader

  • val16/val16-06-probe-after-recovery.txt contains pre-inject, confirming user data is untouched by recovery

  • val16/val16-07-possible.json contains "risk": "possible" after ghost-node epoch injection

  • val16/val16-08-cleared.json contains "risk": "none" after resigned_at is stamped on ghost rows

  • val16/val16-09-audit-check.txt contains detected_events=N recovered_events=M with both N ≥ 1 and M ≥ 1, scoped to this slice by start-time and actor=val16-operator for the recovery event

  • val16/val16-10-final-status.json confirms holder_id contains cp-val16-node

  • val16/val16-report.txt shows pass=10  fail=0  total=10

  • val16/val16-report.json contains pass_count=10 and per-check results

The quorum loss validation surface should show (loss-timing sketch after the list):

  • val17/val17-01-baseline.json contains "quorum_health": "healthy" and "write_block_active": false before any fault

  • val17/val17-02-quorum-lost.json contains "quorum_health": "lost" after docker stop val17-pg-primary

  • val17/val17-03-loss-timing.txt shows loss_ms ≤ 30,000

  • val17/val17-04-write-block-check.txt confirms write_block_active=True and can_accept_protected_writes=False during loss

  • val17/val17-05-loss-reason.txt shows a non-empty quorum_loss_reason (e.g. "database connection unavailable")

  • val17/val17-06-quorum-recovered.json contains "quorum_health": "healthy" after docker start val17-pg-primary

  • val17/val17-07-recovery-timing.txt shows recovery_ms ≤ 30,000

  • val17/val17-08-recovery-check.txt confirms write_block_active=False, can_accept_protected_writes=True, non-empty last_lost_at and last_restored_at, and detected_loss_count ≥ 1

  • val17/val17-09-count-check.txt contains detected_loss_count=2 (second cycle confirmed) after the second recovery succeeds

  • val17/val17-10-audit-check.txt contains lost_events=N restored_events=M with both N ≥ 1 and M ≥ 1

  • val17/val17-report.txt shows pass=10  fail=0  total=10

  • val17/val17-report.json contains pass_count=10, loss_ms, and recovery_ms as baseline timing evidence against workplan ≤ 60 s target
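
A minimal loss-timing sketch; HA_URL is a placeholder for the HA server address:

```bash
t0=$(date +%s%3N)
docker stop val17-pg-primary > /dev/null
until curl -fsS "$HA_URL/v1/ha/quorum" | jq -e '.quorum_health == "lost"' > /dev/null; do
  sleep 0.5
done
t1=$(date +%s%3N)
echo "loss_ms=$((t1 - t0))" > val17/val17-03-loss-timing.txt
```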

The config migration surface should also show:

  • config-migrate/config-migrate-dry-run.txt reports the planned v0-to-v1 changes without writing any output file

  • config-migrate/config-migrate-stdout.yaml and config-migrate/config-migrate-stdout.toml show deterministic migrated output in both supported formats

  • config-migrate/config-migrated.yaml and config-migrate/config-migrated.toml show successful file output

  • config-migrate/config-migrated-in-place.yaml differs from the pre-migrate checksum while preserving the target file mode captured in the paired stat files

  • config-migrate/config-migrate-unsupported.stderr names both supported schema versions for unsupported input

  • config-migrate/config-migrate-invalid-input.stderr shows malformed input fails closed instead of producing a synthetic v1 skeleton

  • config-migrate/config-migrate-invalid-v0.stderr shows invalid migrated configs fail validation before any output is written

7. Current Evidence Bundle

Reference local run:

  • evidence/pr17-cli-audit-local-2026-03-17/README.md

  • evidence/pr18-support-bundle-local-2026-03-18/README.md

  • evidence/pr20-rbac-local-2026-03-18/README.md

  • evidence/pr22-metrics-local-2026-03-18/README.md

  • evidence/pr27-config-migration-local-2026-03-19/README.md

8. Scope Boundary

This lab proves the current PR-17 and PR-18 scope honestly:

  • canonical audit schema

  • CLI-side emission at the wired action sites

  • reproducible local evidence from real command invocations

  • support-bundle generation against a live control-plane, retained audit store, and log file

  • RBAC role create/list/assign against the local file-backed model plus retained audit capture

  • RBAC enforcement on the current read surfaces, including the server-side /v1/ha/status path and retained denial auditing

  • metrics catalog visibility and point-in-time metric queries against a live control-plane metrics endpoint

  • config migration tooling against checked-in v0 fixtures, including safe dry-run output, format selection, fail-closed invalid input handling, and atomic in-place replacement

The VAL 01 surface (Phase 8 of the cert lab) proves bounded certificate rotation with continuity for new client connections:

  • A 2-day cert for node-c.edge.local is accepted over live mTLS before rotation

  • autonomy cert rotate completes within the 300-second bound (actual: sub-second)

  • The rotated 90-day cert is accepted over live mTLS without restarting the control-plane, proving continuity for a fresh client connection using the same cert/key file paths after atomic replacement

  • The cert.rotated audit event is retained and queryable via audit query

  • Serial numbers differ before and after rotation, proving a new keypair was issued

  • All six checks captured in cert-rotation-val01-report.txt as a composite PASS/FAIL

It does not prove CA rotation, server-certificate hot reload, uninterrupted in-flight request continuity, or coordinated multi-node rotation. See the deferred coverage matrix in cert-rotation-validation.md for the exact status of each excluded area.

The VAL 02 surface (Phase 9 of the cert lab) proves consistent rejection across all five trust-chain failure categories:

  • Missing client certificate is rejected (RequireAndVerifyClientCert active)

  • Certificate from a rogue CA is rejected by chain verification — even when the CN matches a known legitimate node, proving rejection is CA-anchor-based, not CN-based

  • Expired certificate is rejected by the validity period check in Go’s chain verification

  • Revoked certificate is rejected by the VerifyPeerCertificate CRL callback

  • Wrong CA bundle on the client causes server cert verification failure, proving mTLS is bidirectional

  • A cert from the trusted CA with an unexpected CN is accepted at the TLS layer (documented in the report as right_ca_wrong_cn: expected accepted) — confirming that identity-layer authorization is RBAC-based, not cert CN-based

See cert-rejection-validation.md for the full VAL 02 validation plan, scenario matrix, pass/fail criteria, and report template.

The PR-29-followup-e surface proves cert-management RBAC coverage:

  • All six autonomy cert subcommands (issue, rotate, revoke, list, check-revocation, sync-crl) require cert:manage or, for read-only operations, cert:read

  • cert list and cert check-revocation were unguarded before PR-29-followup-e; they are now guarded with newRBACGuard().CheckAny([]string{"cert:read","cert:manage"}, ...)

  • cert:read is a new recognized permission included in the auditor predefined role; cert:manage requires a custom role

  • RBAC denial for any cert command emits auth.access.denied before returning the error; no separate audit-on-denial code is needed — rbacGuard.emitDenied() fires automatically

The PR-29-followup-d surface proves the database-backed audit query path:

  • audit_events table in the pgstore PostgreSQL schema (append-only, INV-AUDIT-01)

  • PGAuditEmitter writing records at Class 3 (best-effort) durability — write failures are logged and counted but never propagated to the audited operation

  • InitPGAuditEmitter(db) upgrading the package-level emitter to MultiEmitter (slog + PG + file) after a successful pgstore connection

  • autonomy audit query --pg-url / AUTONOMY_AUDIT_PG_URL as the primary operator query surface when PostgreSQL is available

  • autonomy audit export --pg-url for JSON and CSV export from the DB

  • autonomy audit prune --older-than Nd for operator-initiated retention enforcement against the audit_events table

  • QueryAuditEvents as a read-only function safe to run on any replica

  • file-based audit.FileEmitter preserved in parallel as the fallback / compat mode when no --pg-url is provided

It does not yet claim background retention jobs, OCSP-style online status queries, or multi-tenant audit isolation.
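
A hedged sketch of the DB-backed operator flow, assuming a reachable PostgreSQL at a placeholder URL (credentials and retention window illustrative):

export AUTONOMY_AUDIT_PG_URL='postgres://autonomy:secret@127.0.0.1:5432/autonomy'
autonomy audit query --pg-url "$AUTONOMY_AUDIT_PG_URL"
autonomy audit export --pg-url "$AUTONOMY_AUDIT_PG_URL" --format csv
autonomy audit prune --older-than 90d   # retention enforcement on audit_events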

The VAL03 surface (slice 14) proves RBAC permission enforcement across a 14-check matrix covering all three enforcement claims:

  • VAL03-C1 (unauthorized blocked): ha status (fleet:read), audit query (audit_history:read), and rbac role create (rbac:manage) are each denied for identities whose role does not include the required permission, with auth.access.denied emitted before any network call

  • VAL03-C2 (authorized succeeds): all three permissions are exercised on the allow path: fleet:read for operator, analyst, and auditor; audit_history:read for auditor; rbac:manage for auditor. The VAL03 identities are mirrored into the HA helper’s server-side RBAC store so the ha status allow-path checks exercise both client-side and server-side authorization. If the HA helper is unavailable, the three ha status allow-path checks are recorded as SKIP, not PASS

  • VAL03-C3 (unguarded unrestricted): rbac role list, rollout plan list, and support-bundle generate succeed or fail for non-RBAC reasons regardless of the operator’s assignment, confirming those commands have no guard

  • VAL03-C4 (denial audit visibility): the retained audit query is narrowed to the current VAL03 time window and must contain the five expected actor/action/permission denial tuples from this slice itself

VAL03 covers representative commands from the guarded surface; it does not re-exercise the bootstrap, break-glass, opt-out, or cert RBAC paths already covered by run_rbac_enforcement_lab and run_cert_rbac_lab. See rbac-enforcement-validation.md for the full VAL03 validation plan, guard coverage map, pass/fail criteria, and report template.
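
A minimal sketch of a C4-style denial check against the retained store (the real slice narrows to the VAL03 time window; the pattern here is deliberately coarse):

autonomy audit query --limit 0 --output json \
  | grep auth.access.denied | grep -c 'rbac:manage'
# One of the five expected actor/action/permission tuples; repeat per tuple.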

The VAL04 surface (slice 15) proves audit completeness across a 10-check matrix covering all four completeness claims:

  • VAL04-C1 (store populated): the retained file-backed store is non-empty and all 6 audit categories contain at least one record after all prior lab phases have run

  • VAL04-C2 (schema complete): every audit category’s records contain all 6 mandatory fields (event_name, category, action, outcome, source, timestamp), confirming no field is silently dropped by any emit call

  • VAL04-C3 (queryable within latency bound): a full retained-store query with --limit 0 --output json succeeds and completes in ≤ 2000 ms, confirming operational usability at lab corpus sizes

  • VAL04-C4 (event-type coverage): all 25 defined wired event types appear in the retained store; absent events are listed explicitly in val04-coverage-report.txt and any absence is a validation failure

VAL04 validates against the 25 wired event types. The 6 deferred event types (rollout.gate.approved, rollout.recovered, rollout.stuck.detected, auth.login.succeeded, auth.login.failed, relay.deadletter.inspected) are excluded — their absence is expected and correct. See audit-completeness-validation.md for the full VAL04 validation plan, wired event inventory, pass/fail criteria, and report template.
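
The C2 field check is mechanical; a minimal sketch, assuming python3 is available and that the query output parses as a JSON array of records:

autonomy audit query --limit 0 --output json > audit.json
python3 - <<'EOF'
import json
# The 6 mandatory fields every retained record must carry (VAL04-C2).
fields = {"event_name", "category", "action", "outcome", "source", "timestamp"}
records = json.load(open("audit.json"))
bad = [r for r in records if not fields.issubset(r)]
print("PASS" if not bad else "FAIL: %d record(s) missing fields" % len(bad))
EOF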

The VAL05 surface (slice 16) proves OTel integration across a 9-check matrix covering all four integration claims:

  • VAL05-C1 (Prometheus metrics): the control-plane /metrics endpoint returns HTTP 200, all 4 required metric families are present, and cp_http_requests_total / cp_rollout_plans_total have non-zero values after lab traffic — confirming that Prometheus instrumentation is wired and receiving real observations

  • VAL05-C2 (WAL durability): events emitted via telemetry.NewEmitter are persisted to the local WAL and readable by telemetry status and telemetry export; this is validated with an isolated test WAL populated by telemetry_emit_helper (a small lab binary), because no CLI command exists to emit adapter-side telemetry events directly

  • VAL05-C3 (JSONL export): telemetry export --out produces non-empty JSONL with all mandatory event fields (kind, ts, seq, written_at, attrs) — confirming the offline pipeline can surface events to downstream consumers that do not use OTLP

  • VAL05-C4 (correlation ID propagation): trace_id / span_id set at emit time survive through the WAL → JSONL path (as trace_id / span_id) and the WAL → OTLP flush path (as traceId / spanId in the OTLP log record), confirming that the custom OTLP encoding correctly propagates correlation context for consumers such as Jaeger and Grafana Tempo

VAL05 validates the two implemented observability paths (Prometheus metrics and WAL/OTLP events). It does not validate the OTel Go SDK (not used), automatic traceparent header extraction (not implemented), slog trace context injection (not implemented), or the edge Prometheus metrics (edge process not started by this lab). See otel-integration-validation.md for the full VAL05 validation plan, architecture notes, pass/fail criteria, and report template.
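
The C1 metrics probe reduces to a scrape-and-grep; a minimal sketch with a placeholder control-plane address:

CP=http://127.0.0.1:18990   # placeholder; use your lab's control-plane port
curl -fsS "$CP/metrics" | grep -E '^cp_(http_requests_total|rollout_plans_total)'
# Expected: both metric families present with non-zero values after lab traffic.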

The VAL06 surface (slice 17) proves support-bundle correctness across a 10-check matrix covering all four bundle claims:

  • VAL06-C1 (generation succeeds): support-bundle generate exits 0 and produces a non-empty .tar.gz archive within 30 seconds — confirming the collector pipeline runs to completion and the archive-write path is functional at lab corpus sizes

  • VAL06-C2 (diagnostic coverage): the archive contains all three always-present core files (manifest.json, system_info.json, build_info.json) and manifest.json records all 6 collector names regardless of their individual outcome; system_info.json contains all 5 required runtime fields; audit_recent.json has at least 1 record from the retained store, proving end-to-end connectivity between the bundle and the audit subsystem

  • VAL06-C3 (secrets redacted): config_redacted.yaml replaces the known test fleet_salt with <REDACTED> and the postgres URL password with REDACTED; original values are verified absent; no PEM block (-----BEGIN) appears anywhere in the extracted archive

  • VAL06-C4 (graceful degradation): generating a bundle with a non-existent --orchestrator-url exits 0 and manifest.json records ha_status as "failed", confirming the non-fatal collector pattern is preserved for optional data sources

VAL06 validates the CLI surface and archive structure. It does not test bundle ingestion by external tools, RBAC guarding of the command (confirmed unguarded by VAL03-C3), per-field value correctness of system_info.json, or the DB-backed audit path (requires a live PostgreSQL instance). See support-bundle-validation.md for the full VAL06 validation plan, bundle architecture, collector status definitions, pass/fail criteria, and report template.
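
The C3 secret scan is reproducible on any extracted bundle (archive name illustrative):

mkdir -p bundle && tar -xzf support-bundle.tar.gz -C bundle
if grep -r -- '-----BEGIN' bundle/ >/dev/null; then
  echo 'FAIL: PEM material present in archive'
else
  echo 'PASS: no PEM blocks anywhere in the extracted archive'
fi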

The VAL07 surface (slice 18) establishes a rollout latency baseline across a 9-check matrix covering all four latency claims:

  • VAL07-C1 (control plane reachable): a dedicated fresh control-plane instance starts on 127.0.0.1:18992 with an isolated SQLite data directory and responds to GET /v1/health with 200, establishing a clean starting point for the benchmark

  • VAL07-C2 (plan-create latency): 20 sequential POST /v1/rollouts requests are timed with curl -w '%{time_total}' and percentiles are computed in Python; p50 ≤ 100 ms, p95 ≤ 300 ms, and p99 ≤ 500 ms prove the primary workplan target (rollout plan creation < 500ms p99) is met in the local-SQLite environment

  • VAL07-C3 (plan-list latency): 20 sequential GET /v1/rollouts requests against a store containing 20 plans; p99 ≤ 500 ms proves the read path is within the same bound after realistic state accumulation

  • VAL07-C4 (concurrent responsiveness): 5 parallel POST /v1/rollouts requests all return 2xx and complete within a 2000 ms wall clock, proving that the single-writer SQLite connection serialises concurrent creates without returning errors or making the operator API unacceptably slow

VAL07 is a local-lab latency baseline. The bounds (100/300/500 ms) are generous for in-process loopback SQLite and are designed to detect regressions (e.g. an accidentally synchronous fsync, a missing index on the list path) rather than to measure production PostgreSQL performance. Benchmark methodology, sample size rationale, and environment assumptions are documented in rollout-latency-validation.md.
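
A minimal sketch of the C2 timing method against the VAL07 instance (the request body shape is illustrative; the real slice computes p50/p95/p99 in Python):

for i in $(seq 1 20); do
  curl -s -o /dev/null -w '%{time_total}\n' \
    -X POST http://127.0.0.1:18992/v1/rollouts \
    -H 'Content-Type: application/json' \
    -d '{"name":"bench-'"$i"'"}'
done | sort -n > times.txt
tail -n 1 times.txt   # with n=20 the max is a conservative stand-in for p99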

The VAL08 surface (slice 19) validates concurrent fleet rollout throughput across a 10-check matrix covering all four throughput claims:

  • VAL08-C1 (N=100 zero errors): 100 concurrent workers each creating 5 plans produce zero errors, proving the workplan target (≥100 concurrent device rollouts) is met in the local-SQLite environment

  • VAL08-C2 (durable storage): all 805 created plans (across four concurrency tiers) appear across paginated GET /v1/rollouts?limit=100 responses, confirming the serialised SQLite writer commits every plan before returning 201

  • VAL08-C3 (wall-clock bound): the N=100 scenario (500 plans) completes within 30 seconds, establishing a safe upper bound for operator-facing throughput at design-partner fleet sizes

  • VAL08-C4 (no concurrency regression): throughput at N=100 is ≥ throughput at N=1, confirming that the single-writer SQLite connection serialises concurrent creates without causing a performance regression

VAL08 validates the control-plane write path under concurrent load at lab scale. It does not test PostgreSQL backend throughput (requires a live PG instance), network-constrained relay delivery, edge-agent reconciliation latency, or fleet sizes beyond 1,000 devices (the proposed workplan maximum). Scenario matrix design, SQLite serialisation notes, and pass/fail thresholds are documented in rollout-throughput-validation.md.
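
A scaled-down sketch of the concurrency pattern (5 workers of 5 creates instead of 100×5; endpoint and body illustrative):

for w in $(seq 1 5); do
  (
    for i in $(seq 1 5); do
      curl -sf -o /dev/null -X POST http://127.0.0.1:18992/v1/rollouts \
        -H 'Content-Type: application/json' \
        -d '{"name":"w'"$w"'-p'"$i"'"}' || echo "error: w$w p$i"
    done
  ) &
done
wait   # zero "error:" lines is the C1-style pass condition at this scale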

The VAL09 surface (slice 20) validates stuck rollout detection and recovery across a 10-check matrix covering all four stuck-detection claims:

  • VAL09-C1 (detection accuracy): plans in active phases with updated_at older than the threshold appear in GET /v1/rollouts/stuck with non-empty diagnosis strings — validated by detecting all 5 test plans after a 4-second sleep against a 3-second threshold

  • VAL09-C2 (exclusion correctness): plans in paused or terminal phases are excluded from stuck detection regardless of updated_at staleness — validated by pausing plan-b and cancelling plan-c, then confirming both are absent from subsequent scans

  • VAL09-C3 (retry recovery): recover strategy=retry transitions the plan to active and refreshes updated_at, removing it from future stuck scans — validated by the plan-d flow and confirmed at VAL09-10

  • VAL09-C4 (rollback recovery): recover strategy=rollback transitions the plan to the rolled_back terminal phase, removing it from all subsequent scans — validated by the plan-e flow and confirmed at VAL09-10

VAL09 validates the detection and recovery surfaces against lab-scale plans in published phase with no edge-agent activity. It does not test automatic periodic stuck scanning (not yet implemented), stuck detection across HA replicas, the skip_failed recovery strategy (which requires a plan with an open stage in stage_in_progress phase), or the rollout.stuck.detected audit event path (slog-only; not yet wired to the retained audit store). Staleness injection method, diagnosis logic, and scenario design are documented in stuck-detection-validation.md.
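
The staleness-injection pattern behind C1, sketched against the stuck endpoint (the slice configures a 3-second threshold; how the threshold is set in your run may differ):

# Create and publish plans, then let updated_at go stale past the threshold.
sleep 4
curl -s http://127.0.0.1:18992/v1/rollouts/stuck | grep -c diagnosis
# Expected: every active-phase stale plan listed with a non-empty diagnosis.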

The VAL10 surface (slice 21) validates rollback reliability across a 10-check matrix covering all four rollback claims:

  • VAL10-C1 (preview coverage): rollback preview exits 0 for all four target kinds (rollout_plan, rollout_stage, ha_leader_resign, relay_deadletter), producing safety class, trigger conditions, and known limitations

  • VAL10-C2 (retry success rate): rollback execute strategy=retry on real rollout plans succeeds with 100% success rate across a batch of 5 executions, with each plan transitioning from published to active

  • VAL10-C3 (rollback success rate): rollback execute strategy=rollback on real rollout plans succeeds with 100% success rate across a batch of 5 executions, with each plan transitioning from published to rolled_back

  • VAL10-C4 (aggregate rate + audit): aggregate rate across all 10 executes is ≥ 99%; rollback.executed audit events with outcome=success are captured in the retained store

VAL10 validates the CLI execute path and the workplan ≥99% target against the local-SQLite control-plane. It does not test skip_failed, execution of ha_leader_resign (preview-only in VAL10), automatic rollback, the 30-day soak (workplan GA gate), or the PostgreSQL backend. Success rate formula, JSON field handling, and evidence structure are documented in rollback-reliability-validation.md.

The VAL11 surface (slice 22) validates operator-surface resilience under representative chaos conditions across a 10-check matrix covering all four chaos claims:

  • VAL11-C1 (kill → client error + no silent data loss): after a CP SIGTERM, client CLI requests exit non-zero with a connection error keyword, and the full pre-kill plan corpus is present after restart

  • VAL11-C2 (gate-wait survival): a plan in published phase (gate-wait state) retains its phase across the CP kill boundary, confirming the operator’s pending gate decision is not lost

  • VAL11-C3 (rapid restart resilience): three successive kill+restart cycles do not corrupt the store; new plan creates succeed after the final restart, confirming the write path is operational after repeated restarts

  • VAL11-C4 (diagnostic and recovery surfaces functional post-chaos): stuck detection, artifact corruption proxy queries, rollback execute, and audit capture all function correctly in and around chaos injection windows

VAL11 validates the control-plane durability and operator-surface resilience using process-level SIGTERM injection only. It does not test iptables-level network partitions (requires root), concurrent creates under kill (inherently racy; replaced by deterministic rapid-restart), SIGKILL WAL recovery, PostgreSQL backend chaos, edge-agent reconnect after partition, or automatic stage promotion under chaos. Chaos mechanism rationale, safety guardrails, and scenario design are documented in chaos-validation.md.

VAL12 — Fleet Rollout 30-Day Soak is the workplan Gate D long-duration framework and is not a slice of this runner. The existing run_cli_audit_lab.sh is a synchronous single-shot evidence collector; the 30-day soak requires persistent infrastructure, scheduled round execution (cron every 30 minutes), rolling evidence windows, daily aggregation, and a final pass/fail report.

The soak is driven by three separate scripts:

  • scripts/labs/run_soak_val12_setup.sh — one-time environment provisioning; starts a persistent CP at 127.0.0.1:19000, writes config.env, and runs the first workload round to verify VAL12-01 (framework provisioned) and VAL12-02 (initial round zero errors)

  • scripts/labs/run_soak_val12_round.sh — single workload round called by cron every 30 minutes; creates 10 plans, runs stuck scan + auto-recovery, scrapes Prometheus metrics, and writes round-summary.json to $SOAK_DIR/rounds/YYYY-MM-DD/round-HHMMSS/

  • scripts/labs/run_soak_val12_report.sh — daily summary and final report aggregator; reads all round summaries, computes rollback rate, P99 latency, and CP uptime, and checks all 10 VAL12 thresholds

The soak satisfies four claims over 30 days: ≥100 concurrent plans sustained (VAL12-C1), ≥99% rollback success rate (VAL12-C2), P99 ≤500ms maintained under accumulated store state (VAL12-C3), and CP availability ≥99.9% (VAL12-C4). The Gate D minimum-acceptable pass is VAL12-03 (fleet target reached) + VAL12-10 (30-day aggregate rollback rate ≥0.990). Soak environment design, workload schedule, alert thresholds, evidence retention plan, and the final report template are documented in soak-validation.md.
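
A hedged crontab sketch for the cadence described above (SOAK_DIR and repo path illustrative; the report time is arbitrary):

*/30 * * * * SOAK_DIR=/var/lib/autonomy-soak bash /repo/scripts/labs/run_soak_val12_round.sh
0 6 * * *    SOAK_DIR=/var/lib/autonomy-soak bash /repo/scripts/labs/run_soak_val12_report.sh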

VAL13 — HA Failover Validation is slice 23 of this runner (run_ha_failover_val13_lab). It extends the existing HA lab infrastructure in run_cli_audit_lab.sh with a dedicated function using fresh Docker containers (val13-pg-primary, val13-ha-net) and isolated ports (18997/18998) to avoid interference with the backup/restore and quorum labs.

VAL13 validates four HA readiness claims:

  • VAL13-C1 (SIGTERM failover latency): leader election completes within 5 seconds of SIGTERM on the current leader, measured end-to-end from kill signal to follower logging “acquired leadership”

  • VAL13-C2 (zero data loss): rows written directly to the shared PostgreSQL instance while the original leader held the advisory lock remain readable from that same instance after failover

  • VAL13-C3 (PG crash recovery): quorum monitor detects PostgreSQL unavailability (quorum_health=lost) and returns to quorum_health=healthy after explicit operator restart of PostgreSQL

  • VAL13-C4 (unplanned crash / disk-fault proxy): SIGKILL on the leader (no graceful Resign(), simulating OOM or disk crash) results in advisory lock release via TCP RST and a new leader within 5 seconds

VAL13 does NOT validate: streaming-replication failover (covered by run_ha_lab() and run_quorum_lab()), iptables-based network partitions (requires root), concurrent writes under kill, SIGKILL on PostgreSQL (WAL recovery path; covered by VAL13-09 indirectly via the same single-node PG), multi-region or cross-datacenter failover, or automatic rollback trigger under HA failure. Scenario design, measurement method, and pass/fail criteria are documented in ha-failover-validation.md.

VAL14 — HA Replication Lag Baseline benchmarks PostgreSQL streaming replication lag under the autonomyops HA architecture and derives practical alerting thresholds from observed data. A val14-pg-primary + val14-pg-standby pair is provisioned via pg_basebackup, and a single HA server at port 18999 uses --min-sync-replicas 1 so quorum health tracks standby availability. VAL14 proves these workplan claims:

  • VAL14-C1 (replication streaming): a PostgreSQL streaming-replication standby is established, confirmed by pg_stat_replication.state = streaming and an LSN gap of 0 at rest

  • VAL14-C2 (light load drain ≤ 2 s): after a 100-row × 500-byte write batch committed with synchronous_commit=off, lag is sampled during the active drain window and the WAL LSN gap drains to 0 within 2000 ms on local Docker

  • VAL14-C3 (heavy load drain ≤ 5 s): after a 500-row × 2000-byte (~1 MB) write batch committed with synchronous_commit=off, the LSN gap drains within 5000 ms

  • VAL14-C4 (threshold derivation): practical alerting thresholds (healthy, degraded, alert) are derived from observed p95 lag using the formula healthy = max(p95×3+1, 10), degraded = max(healthy×10, 100), alert = max(healthy×50, 500), anchoring monitoring configuration to measured behaviour

VAL14 does NOT validate: write-path lag through the HA server HTTP surface (writes go directly to PostgreSQL), streaming-replication switchover or promotion (covered by run_ha_lab() and run_quorum_lab()), PG logical replication, multi-standby topologies, network-partition-induced lag (requires iptables/root), or production-scale throughput on cloud storage. Benchmark design, analysis method, and threshold derivation formula are documented in ha-replication-lag-validation.md.
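
The lag measurement behind C2/C3 can be sampled directly on the primary; a minimal sketch (the psql user is illustrative):

docker exec val14-pg-primary psql -U postgres -Atc \
  "SELECT state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;"
# state=streaming with a diff of 0 is the at-rest C1 condition; during a drain
# window, resample until the gap returns to 0 and record the elapsed time.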

VAL15 — Backup/Restore Validation proves the ha backup create/list/restore CLI workflow end-to-end, including integrity verification, timing bounds, and safety-gate enforcement. A dedicated Docker PostgreSQL instance with two fixture tables (~1 MB total) is provisioned for isolation. The HA server is cycled through normal -> maintenance -> normal modes to match the real operator workflow. VAL15 proves these workplan claims:

  • VAL15-C1 (backup file integrity): ha backup create produces a valid pg_dump custom-format archive; the SHA-256 checksum recorded in the inventory matches an independently computed hash of the produced file

  • VAL15-C2 (restore correctness): after post-backup mutations (UPDATE + DELETE), ha backup restore reverts the database to its pre-backup state; row counts and spot-check payload values are verified by SQL assertion

  • VAL15-C3 (timing bounds): backup completes in ≤ 30 s; restore completes in ≤ 60 s on local Docker (conservative thresholds that flag hangs or permission errors without constraining normal operation)

  • VAL15-C4 (safety gate): ha backup restore without --confirm exits non-zero, confirming the mandatory confirmation flag prevents accidental restores

VAL15 does NOT validate: backup rotation / retention policy (no implementation), cross-PG-version restore compatibility, concurrent writes during backup, backup storage to remote object stores, disaster recovery runbook execution timing, or automatic scheduled backups. Fixture strategy, checksum method, and test sequence are documented in backup-restore-validation.md.
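
The C1 integrity check reduces to an independent hash comparison (backup path illustrative; the recorded checksum comes from the backup inventory):

sha256sum backups/autonomy-backup.dump   # independently computed hash
autonomy ha backup list                  # inventory with the recorded SHA-256
# The two digests must match byte-for-byte for VAL15-C1 to pass.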

VAL16 (run_split_brain_chaos_val16_lab) injects inconsistent leadership state directly via SQL into the leadership_state and leader_epochs tables to trigger risk=detected (epoch divergence + holder mismatch) and risk=possible (unclosed ghost-node epoch rows). A user-table probe row (val16_probe) is checked post-recovery to confirm recovery does not affect data beyond the leadership metadata tables. VAL16 proves these workplan claims:

  • VAL16-C1 (split-brain detection): epoch divergence and ghost-node conditions are reliably detected and reported as risk=detected / risk=possible respectively via the /v1/ha/split-brain API and ha split-brain detect CLI

  • VAL16-C2 (recovery correctness): ha split-brain recover --strategy promote-leader clears risk=detected and returns the cluster to risk=none without corrupting user data

  • VAL16-C3 (dry-run safety): manual-reconcile exits 0 and does not write to the database, confirming operators can plan a recovery before committing

  • VAL16-C4 (self-clearing ghost nodes): unclosed epoch rows that are subsequently resigned clear automatically, returning the cluster to risk=none without operator intervention

VAL16 does NOT validate: real network partitions (requires iptables/root), genuine two-node concurrent-write split-brain, automatic rollback under detected split-brain, multi-region scenarios, or HA server binary restart-triggered epoch divergence. Scenario design, injection SQL, and safety rationale are documented in split-brain-chaos-validation.md.
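
The operator-facing detect/recover pair as exercised by the slice (risk values from the claims above):

autonomy ha split-brain detect
# risk=detected after the epoch-divergence injection
autonomy ha split-brain recover --strategy promote-leader
autonomy ha split-brain detect
# risk=none, with the val16_probe row verified untouched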

VAL17 (run_quorum_loss_val17_lab) exercises the QuorumMonitor’s healthy -> lost -> healthy cycle with timed measurements and write-blocking assertions. A single PG instance runs with --min-sync-replicas 0 and --quorum-monitor-interval 500ms; loss is induced by docker stop, recovery by docker start. VAL17 proves these workplan claims (Gap HA-004):

  • VAL17-C1 (loss detection timing): quorum_health=lost is detected within ≤ 30,000 ms (30 s, well under the workplan’s 60 s target) of docker stop with a 500 ms monitor interval

  • VAL17-C2 (write safety during loss): write_block_active=true and can_accept_protected_writes=false are confirmed in the quorum status JSON during the loss window, proving the WriteGate middleware is engaged

  • VAL17-C3 (recovery detection timing): quorum_health=healthy is detected within ≤ 30,000 ms after PostgreSQL is restored with docker start, confirming timely recovery detection once the dependency returns

  • VAL17-C4 (monitor history correctness): last_lost_at, last_restored_at, and detected_loss_count are correctly populated and increment across repeated cycles, confirming the QuorumMonitor’s state-change tracking is reliable

VAL17 does NOT validate: the healthy -> degraded path (covered by run_quorum_lab() with --min-sync-replicas 1), network-partition-induced quorum loss (iptables, requires root), HTTP write-gating via rollout endpoint (HA server binary does not expose /v1/rollouts), or multi-region topologies. Timing method, threshold rationale, and scenario design are documented in quorum-loss-validation.md.
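
A hedged sketch of the loss-detection timing method, assuming GNU date, that the CLI output carries the quorum_health field, and an illustrative container name:

docker stop val17-pg
start=$(date +%s%3N)
until autonomy ha quorum status | grep -q 'quorum_health.*lost'; do sleep 0.5; done
echo "loss detected after $(( $(date +%s%3N) - start )) ms"   # expect ≤ 30,000
docker start val17-pg
until autonomy ha quorum status | grep -q 'quorum_health.*healthy'; do sleep 0.5; done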

VAL18 — HA 30-Day Soak is the workplan Gate D long-duration HA framework and is not a slice of this runner. The 30-day lifecycle — with persistent Docker infrastructure, cron-scheduled round execution, PID tracking across invocations, node restart recovery, and progressive report generation — cannot be expressed as a run_cli_audit_lab.sh function. The same reasoning applies as for VAL12 (the fleet soak).

The soak is driven by three separate scripts:

  • scripts/labs/run_soak_val18_setup.sh — one-time environment provisioning; builds orchestrator_ha_server + autonomy binaries, starts a persistent Docker PostgreSQL instance (val18-pg-primary, host port 5488) for a stable 30-day connection address, creates the val18_probe table, starts HA node1 (19010) + node2 (19011) as background processes, writes config.env, and runs the first health round to verify VAL18-01 (framework provisioned) and VAL18-02 (initial round success). On a normal rerun it reuses the persistent Docker volume/container instead of deleting 30-day soak state

  • scripts/labs/run_soak_val18_round.sh — single HA health round called by cron every 2 hours; checks PG + HA node liveness (restarts dead processes), identifies the current leader from /v1/ha/status holder_id, checks quorum health + probe row, triggers a timed SIGTERM failover every SOAK_FAILOVER_INTERVAL_ROUNDS rounds (polls follower every 50 ms for holder_id change, measures failover_ms, restarts killed node as follower, verifies probe row on new leader), and writes round-summary.json to $SOAK_DIR/rounds/YYYY-MM-DD/round-HHMMSS/

  • scripts/labs/run_soak_val18_report.sh — daily summary and checkpoint/final report aggregator; reads all round-summary.json files, computes HA uptime%, failover count, p50/p95/p99 failover_ms, and data continuity rate, and checks all VAL18 thresholds with a Gate D HA assessment. The report uses separate failover-count thresholds for the 7-day checkpoint (>= 1) and the 30-day final Gate D check (>= 3)

The soak satisfies four claims over 30 days: ≥ 3 scheduled leader failovers sustained (VAL18-C1), failover_ms ≤ 10,000 on every failover (VAL18-C2), probe row accessible after every failover (data continuity rate = 1.0, VAL18-C3), and HA uptime ≥ 99.9% (VAL18-C4). The Gate D minimum-acceptable pass requires VAL18-09 (≥ 3 total failovers) + VAL18-10 (data continuity rate = 1.000) + VAL18-07 (max failover timing maintained ≤ 10,000 ms). Soak environment design, failover schedule strategy, observability plan, evidence retention, and the final report template are documented in ha-soak-validation.md.
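
A hedged crontab sketch for the 2-hour round cadence (paths illustrative; the report aggregator additionally runs daily):

0 */2 * * * SOAK_DIR=/var/lib/autonomy-ha-soak bash /repo/scripts/labs/run_soak_val18_round.sh
30 6 * * *  SOAK_DIR=/var/lib/autonomy-ha-soak bash /repo/scripts/labs/run_soak_val18_report.sh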

VAL25 — Fleet Rollout Proof Report

VAL25 is a report generator (not a test runner) that reads the evidence produced by the five fleet rollout validation slices (VAL07–VAL11) and emits a consolidated proof report suitable for engineering leads and external reviewers.

It evaluates three readiness levels and clearly separates measured results from proposed targets:

  • Design Partner — YES with VAL07–VAL11, if all five slices pass and key targets are met; nothing further required

  • GA — NO; additionally requires the PostgreSQL backend and the VAL12 30-day soak

  • Public Production — NO; additionally requires everything above GA plus a security audit and multi-region support

Run after completing a full cli-audit-lab run:

bash scripts/labs/run_fleet_rollout_proof_report_val25.sh \
  evidence/cli-audit-lab-YYYY-MM-DD

Output is written to val25/ within the evidence directory. The formal plan, metric definitions, and readiness criteria are documented in fleet-rollout-proof-report-validation.md.

VAL26 — HA Proof Report

VAL26 is a report generator (not a test runner) that reads the evidence produced by the five HA validation slices (VAL13–VAL17) and emits a consolidated HA proof report.

It evaluates three readiness levels with measured results clearly separated from proposed targets and derived thresholds:

  • HA Design Partner — YES with VAL13–VAL17, if all five slices pass and key targets are met; nothing further required

  • HA GA — NO; additionally requires the VAL18 30-day soak and streaming-replication promotion

  • HA Public Production — NO; additionally requires everything above GA plus multi-AZ deployment and a security audit

Run after completing a full cli-audit-lab run:

bash scripts/labs/run_ha_proof_report_val26.sh \
  evidence/cli-audit-lab-YYYY-MM-DD

Output is written to val26/ within the evidence directory. The formal plan, metric definitions, and readiness criteria are documented in ha-proof-report-validation.md.

VAL28 — Cross-Cutting Proof Report

VAL28 is a report generator (not a test runner) that reads the evidence produced by six cross-cutting validation slices (VAL01–VAL06) and emits a consolidated report covering the Security, Observability, and Operations surfaces of the AutonomyOps ADK control plane.

It evaluates three readiness levels:

  • Cross-Cutting Design Partner — YES with VAL01–VAL06, if all six slices pass and key targets are met; nothing further required

  • Cross-Cutting GA — NO; additionally requires PG-backed audit performance, a production OTel collector, and no SKIP checks

  • Public Production Claim — NO; additionally requires everything above GA plus external security and compliance audits

Run after completing a full cli-audit-lab run:

bash scripts/labs/run_crosscut_proof_report_val28.sh \
  evidence/cli-audit-lab-YYYY-MM-DD

Output is written to val28/ within the evidence directory. VAL28 handles the mixed evidence formats: VAL01/VAL02 text reports are parsed via regex; VAL03–VAL06 JSON reports are schema-validated before extraction. The formal plan, metric definitions, and readiness criteria are documented in crosscut-proof-report-validation.md.

VAL29 — v1 Public-Claim Evidence Matrix

VAL29 is a meta-aggregator that reads the JSON artifacts from all four proof reports (VAL25/VAL26/VAL27/VAL28) and produces a single capability-level evidence matrix covering every v1 public claim.

The matrix assigns one of five evidence states to each claim:

  • VALIDATED — claim fully supported by completed VAL runs

  • BETA — claim supported with disclosed limitations

  • NOT_STARTED — soak framework exists; Gate D not yet run

  • DEFER — not validated; additional work required

  • FUTURE_REQUIRED — required for the Public Production Claim only

Run after all four proof reports have been generated:

bash scripts/labs/run_evidence_matrix_val29.sh \
  evidence/cli-audit-lab-YYYY-MM-DD \
  evidence/

Output is written to val29/ within the cli-audit-lab evidence directory. The formal plan and full matrix definition are documented in evidence-matrix-validation.md.