CLI Audit And Support Bundle Lab¶
Audience: reviewers and operators who want reproducible local evidence, from real command runs, for the PR-17 CLI and retained audit surfaces, the PR-18 support-bundle surface, the PR-19 audit query/export refinements, the PR-20 RBAC role model CLI work, the PR-21 RBAC enforcement plumbing work, the PR-22 metrics CLI surface work, the PR-24 HA quorum status surface work, the PR-25 split-brain detect workflow work, the PR-26 split-brain manual recovery workflow work, the PR-27 config migration tooling work, the PR-29-followup-a RBAC default-on hardening work, the PR-29-followup-c cert revocation transport enforcement work, the PR-29-followup-d database-backed audit persistence work, and the PR-29-followup-e cert-management RBAC coverage completion work.
This lab captures actual CLI stdout plus the structured audit lines emitted on stderr by live commands in the current branch, and then queries and exports the retained records written by those same runs:
autonomy rollout plan create
autonomy rollout plan publish
autonomy rollout plan cancel
autonomy ha backup create
autonomy ha backup restore
autonomy ha failover trigger
autonomy cert issue
autonomy cert rotate
autonomy rbac role create
autonomy rbac role list
autonomy rbac role assign
edgectl relay deadletter retry
edgectl relay deadletter purge --force
edgectl relay config set-bandwidth
autonomy audit query
autonomy audit export --format json|csv
autonomy support-bundle generate
Default-on RBAC bootstrap mode denial (no env var set, empty store)
RBAC bootstrap seed via rbac role assign (bootstrap path)
RBAC break-glass bypass (AUTONOMY_RBAC_BREAK_GLASS=1)
RBAC opt-out backward compat (AUTONOMY_RBAC_ENFORCEMENT=0)
Default-on RBAC-enforced autonomy audit query
Default-on RBAC-enforced autonomy audit export
Default-on RBAC-enforced autonomy ha status
Direct /v1/ha/status requests with and without operator identity
autonomy metrics list
autonomy metrics query
autonomy ha quorum status
autonomy ha split-brain detect
autonomy ha split-brain recover
autonomy config migrate
Cert RBAC enforcement — denied and allowed flows for all six autonomy cert subcommands under AUTONOMY_RBAC_ENFORCEMENT=1
1. Gaps Closed by This Lab¶
Before this lab refresh, the checked-in PR-17 evidence relied on in-package test harnesses. That left the following evidence gaps:
rollout, HA, and backup audit captures were not tied to a live control-plane process
relay audit captures were not clearly tied to the existing daemon-backed edge lab
the HA helper lived outside the repo under /tmp, which weakened reproducibility
the retained query and export surface was covered by tests only, without checked-in live evidence from the same captured audit dataset
support-bundle generation had no checked-in live evidence against a real control-plane, retained audit store, and log file
RBAC role create/list/assign had no checked-in live evidence against the retained audit store and local file-backed assignment model
RBAC enforcement had no checked-in live evidence proving allowed and denied behavior across CLI and direct HA-status HTTP paths
metrics list/query had no checked-in live evidence against a real control-plane metrics endpoint
HA quorum status had no checked-in live evidence for healthy, degraded, lost, and restored transitions from the same live HA lab
split-brain detection had no checked-in live evidence for healthy, detected, and possible states or for deduplicated ha.split_brain.detected audit emission
split-brain recovery had no checked-in live evidence for dry-run planning, executed recovery, post-recovery convergence, or retained ha.split_brain.recovered audit records
config migration tooling had no checked-in live evidence for deterministic dry-run output, format selection, file and in-place writes, or fail-closed handling of malformed and invalid legacy input
This lab closes those gaps by:
starting a real local SQLite-backed control-plane for rollout verification
running the real PostgreSQL primary + standby HA workflow for backup and failover verification
reusing the live daemon-backed edge relay lab for retry, purge, and bandwidth
building the HA helper from the repo at scripts/labs/orchestrator_ha_server.go
retaining audit records under the evidence directory and querying/exporting those same live records through autonomy audit query and autonomy audit export
generating a support bundle from the same live HA endpoint, retained audit store, and HA server log used by the rest of the lab
capturing live evidence for the PR-19 retained-audit refinements:
category, outcome, and source filtering
explicit invalid --output rejection
invalid start/end time-range rejection
invalid export format rejection without truncating an existing file
capturing live evidence for PR-20 RBAC role create/list/assign, including:
canonicalized role-name persistence
idempotent repeat assignment behavior
retained auth.role.assigned query proof
capturing live evidence for PR-21 RBAC enforcement and PR-29-followup-a default-on hardening, including:
default-on bootstrap mode denial with no AUTONOMY_RBAC_ENFORCEMENT set
bootstrap seed via rbac role assign with empty store (only allowed bootstrap action)
post-bootstrap outsider denial (full enforcement is active)
post-bootstrap allowed access after correct role is granted
break-glass bypass (AUTONOMY_RBAC_BREAK_GLASS=1) with mandatory auth.break_glass.used audit event
break-glass safety: without AUTONOMY_OPERATOR the bypass is still denied
opt-out backward compat (AUTONOMY_RBAC_ENFORCEMENT=0)
default-on denied audit query and audit export for a non-auditor (no env var)
default-on allowed audit query, audit export, and ha status for authorized roles
denied direct /v1/ha/status access without operator identity
retained auth.access.denied and auth.break_glass.used audit proof
capturing live evidence for PR-22 metrics visibility, including:
local metrics list text and JSON catalog output
live metrics query output against a real /metrics endpoint
exact metric-family filtering for rollout counters
histogram-family filtering for request-duration buckets, count, and sum
capturing live evidence for PR-24 quorum visibility, including:
healthy quorum status on a live HA leader
degraded quorum when the sync standby is stopped
lost quorum when the primary PostgreSQL container is unavailable
restored quorum after PostgreSQL connectivity is recovered
retained ha.quorum.lost / ha.quorum.restored audit records from the same live run
capturing live evidence for PR-25 split-brain visibility, including:
healthy split-brain status on the live quorum helper
detected split-brain after durable epoch divergence is introduced
possible split-brain after a ghost unclosed epoch row is injected
deduplicated retained ha.split_brain.detected audit records after repeated identical detected-state polls
capturing live evidence for PR-26 split-brain manual recovery, including:
manual-reconcile dry-run output against a real detected stale-leader state
promote-leader execution against that same live stale-leader state
post-recovery risk: none verification from the same helper
retained ha.split_brain.recovered audit records for both dry-run and executed recovery
capturing live evidence for PR-27 config migration tooling, including:
deterministic dry-run output for a full v0 fixture
stdout migration output in both YAML and TOML
file and in-place writes for migrated configs
fail-closed rejection of unsupported schema versions
fail-closed rejection of malformed input and invalid migrated configs
capturing live evidence for PR-29-followup-d database-backed audit persistence (when AUTONOMY_AUDIT_PG_URL or POSTGRES_URL is set), including:
audit_events table write via PGAuditEmitter (parallel to file emitter)
autonomy audit query --pg-url reading from PostgreSQL (primary query path)
autonomy audit export --pg-url exporting in JSON and CSV from the DB
autonomy audit prune --older-than Nd retention enforcement
post-prune query confirming deleted rows are absent
file-backed query remaining as fallback when no --pg-url is set
capturing live evidence for PR-29-followup-e cert-management RBAC coverage, including:
all six autonomy cert subcommands denied when the operator lacks cert:manage / cert:read
cert list and cert check-revocation denied (newly guarded; previously unguarded)
cert issue, cert rotate, cert revoke, cert sync-crl denied (require cert:manage)
cert list and cert issue allowed after granting the appropriate cert role
retained auth.access.denied audit records for all denied cert operations
2. Prerequisites¶
Docker available locally
Go 1.25.7
PostgreSQL client tools available in the Docker postgres:16 image
openssl, curl, and xxd available locally
ability to bind localhost TCP ports and a Unix socket
3. Run the Lab¶
From the repository root:
export GOROOT=/home/ubuntu/.local/go1.25.7
export PATH="$GOROOT/bin:$PATH"
export GOTOOLCHAIN=local
bash scripts/labs/run_cli_audit_lab.sh
Optional custom evidence directory:
bash scripts/labs/run_cli_audit_lab.sh \
"$PWD/evidence/pr17-cli-audit-local-$(date +%F)"
4. What the Runner Does¶
The runner performs the following real verification slices:
Starts autonomy-orchestrator serve on 127.0.0.1:18888 and runs rollout create, publish, and cancel against the live HTTP API, while also exposing /metrics on 127.0.0.1:19090 for the PR-22 metrics captures.

Generates a local CA with openssl and runs the certificate workflow against real certificate files (a minimal CA sketch follows this item), including:
autonomy cert issue
autonomy cert rotate
autonomy cert revoke
autonomy cert check-revocation
autonomy cert sync-crl --min-sources 2 against a multi-publisher set:
a source control-plane with --tls-crl-file
a second source control-plane serving the same CRL
a follower control-plane with repeated --tls-crl-sync-url and --tls-crl-sync-min-sources 2, proving cross-host CRL pull plus publisher agreement
VAL 01 — certificate rotation validation (Phase 8): issues a 2-day node-c cert for expiry-window testing, proves the old cert is accepted over live mTLS before rotation, times the atomic cert rotate to confirm it completes within 300 seconds, confirms the expiry window clears after rotation, and then connects with the new cert without restarting the control-plane — proving continuity for fresh mTLS client connections across the rotation window; also captures a cert.rotated retained audit query and a composite 6-check PASS/FAIL report
VAL 02 — trust-chain rejection validation (Phase 9): generates a rogue CA and an expired cert inline, then proves the control-plane consistently rejects all five rejection cases against the same live control-plane: missing client cert, invalid chain (rogue CA with a legitimate-looking CN), expired cert, revoked cert (reusing the CRL from Phase 1), and wrong server trust (client-side CA bundle mismatch); captures stderr evidence for each and a composite 5-check PASS/FAIL report
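For orientation, a minimal openssl sketch of a throwaway CA and a short-lived client certificate of the kind used for the expiry-window check; the file names are illustrative and the runner's own helper may generate these differently.

# throwaway CA (illustrative names; not the lab's exact helper)
openssl req -x509 -newkey rsa:2048 -nodes -days 7 \
  -subj "/CN=lab-local-ca" -keyout ca.key -out ca.crt
# 2-day client cert for expiry-window testing (node-c is the CN used in this lab)
openssl req -newkey rsa:2048 -nodes -subj "/CN=node-c" \
  -keyout node-c.key -out node-c.csr
openssl x509 -req -in node-c.csr -CA ca.crt -CAkey ca.key \
  -CAcreateserial -days 2 -out node-c.crt
openssl x509 -noout -serial -dates -in node-c.crt   # serial/dates captured before and after rotation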
Brings up a disposable PostgreSQL primary + streaming standby, runs the HA helper from scripts/labs/orchestrator_ha_server.go, and verifies (a backup/restore sketch follows this item):
backup create
backup restore in maintenance mode
manual failover to a second helper
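A minimal sketch of the backup and failover commands exercised in this slice, using only subcommands and flags that appear elsewhere in this lab; how the CLI is pointed at the helper is left to the runner and omitted here.

autonomy ha backup create              # emits an ha.backup.created audit line on stderr
autonomy ha backup list --output json  # inventory: backup_id, status, checksum
autonomy ha backup restore --confirm   # maintenance-mode restore; refuses to run without --confirm
autonomy ha failover trigger           # manual failover to the second helper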
Runs autonomy rbac role create, list, and assign against a local file-backed RBAC store and captures the emitted auth.role.assigned audit line.

Demonstrates default-on RBAC enforcement (no env var required) and verifies the bootstrap, break-glass, and opt-out paths (a command sketch follows this item):
default-on bootstrap mode denial against a fresh empty RBAC store
bootstrap seed via rbac role assign with no enforcement env var
post-bootstrap outsider denial (full enforcement)
post-bootstrap allowed access after correct role assignment
opt-out backward compat via AUTONOMY_RBAC_ENFORCEMENT=0
break-glass bypass with AUTONOMY_RBAC_BREAK_GLASS=1 + AUTONOMY_OPERATOR
break-glass safety check: no bypass without AUTONOMY_OPERATOR
Then restarts the HA helper and verifies default-on enforcement with a populated store:
denied direct /v1/ha/status without operator identity
allowed direct /v1/ha/status with X-Autonomy-Operator
denied autonomy audit query and autonomy audit export for an operator lacking audit_history:read (default-on, no env var)
allowed autonomy audit query, autonomy audit export, and autonomy ha status for authorized identities
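A minimal sketch of the enforcement states above, using only the environment variables and commands named in this lab; the operator name, role name, and argument order of rbac role assign are illustrative assumptions.

autonomy audit query                                          # empty store, no env var: bootstrap-mode denial
autonomy rbac role assign lab-auditor auditor                 # bootstrap seed (names and argument order are assumptions)
AUTONOMY_RBAC_ENFORCEMENT=0 autonomy audit query              # opt-out backward compat
AUTONOMY_RBAC_BREAK_GLASS=1 AUTONOMY_OPERATOR=lab-auditor \
  autonomy ha status                                          # break-glass bypass, emits auth.break_glass.used
AUTONOMY_RBAC_BREAK_GLASS=1 autonomy ha status                # safety check: still denied without AUTONOMY_OPERATOR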
Runs the existing daemon-backed edge relay lab with --with-bandwidth so retry, purge, and bandwidth configuration audit lines come from a live edged process.

Queries and exports the retained audit records written by those live runs from AUTONOMY_AUDIT_DIR under the evidence bundle (a query/export sketch follows this item), including:
full retained query
category-filtered auth query
category-filtered rollout query
source-filtered edge query
outcome-filtered success query
invalid --output and invalid time-range checks
invalid export-format preservation of an existing output file
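A minimal sketch of the retained query and export calls, limited to flags that appear in this lab; the audit directory path is illustrative and the export output destination flags are omitted.

export AUTONOMY_AUDIT_DIR="$PWD/evidence/retained"       # illustrative path
autonomy audit query                                      # full retained query (text)
autonomy audit query --category auth --output json        # category filter
autonomy audit query --event-type ha.backup.created --output json
autonomy audit export --format json                        # JSON export
autonomy audit export --format csv                         # CSV export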
Generates a live support bundle (a generation-and-inspection sketch follows this item) against:
--orchestrator-url http://127.0.0.1:18089
the retained audit directory under the evidence bundle
the active HA server log
a redactable config file created by the runner
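A minimal sketch of generating and inspecting a bundle; only --orchestrator-url is shown because the remaining flags (audit directory, log file, config path, output location) are not spelled out in this section, and the archive name is illustrative.

autonomy support-bundle generate --orchestrator-url http://127.0.0.1:18089
tar -tzf autonomy-support-bundle-live.tar.gz                               # listing should show manifest.json
mkdir -p bundle && tar -xzf autonomy-support-bundle-live.tar.gz -C bundle
grep -n 'REDACTED' bundle/config_redacted.yaml                             # redaction spot-check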
Starts a dedicated HA quorum helper on 127.0.0.1:18091 with a fast quorum monitor interval and captures (a probe sketch follows this item):
healthy status while the primary and synchronous standby are available
degraded status after stopping the standby
lost status after stopping the primary container
restored status after the database containers are started again
retained ha.quorum.lost and ha.quorum.restored audit records
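A minimal sketch of driving the quorum transitions by hand; the container names are illustrative, and the /v1/ha/quorum path and quorum_health field match the ones used by the later VAL slices.

curl -s http://127.0.0.1:18091/v1/ha/quorum    # expect quorum_health=healthy
docker stop lab-pg-standby                      # illustrative container name
curl -s http://127.0.0.1:18091/v1/ha/quorum    # expect quorum_health=degraded
docker stop lab-pg-primary
curl -s http://127.0.0.1:18091/v1/ha/quorum    # expect quorum_health=lost
docker start lab-pg-primary lab-pg-standby
curl -s http://127.0.0.1:18091/v1/ha/quorum    # expect quorum_health=healthy again
autonomy ha quorum status                       # same state through the CLI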
Reuses the live quorum helper on 127.0.0.1:18091 and captures (a detect/recover sketch follows this item):
healthy split-brain status with matching local and durable epochs
detected split-brain after advancing leadership_state.current_epoch away from the helper’s cached epoch
manual-reconcile dry-run output for the detected stale-leader case
promote-leader execution output for the same detected stale-leader case
post-recovery risk: none verification after the stale elector clears its local leader claim
possible split-brain after inserting a second unclosed leader_epochs row
deduplicated retained ha.split_brain.detected audit records after repeated identical detected-state polls
retained ha.split_brain.recovered audit records from the same live run
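A minimal sketch of the detect and recover round trip; the two --strategy values are the ones named in this slice, and the /v1/ha/split-brain path is the one used by the VAL16 slice.

autonomy ha split-brain detect                               # healthy: risk none
curl -s http://127.0.0.1:18091/v1/ha/split-brain             # same state over HTTP
autonomy ha split-brain recover --strategy manual-reconcile  # planning/dry-run path
autonomy ha split-brain recover --strategy promote-leader    # executes recovery
autonomy ha split-brain detect                               # expect risk: none afterwards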
Runs autonomy config migrate against checked-in v0 fixtures and captures (a before/after verification sketch follows this item):
dry-run output without writing files
migrated YAML and TOML stdout output
migrated YAML and TOML file output
in-place migration with before/after mode and checksum captures
unsupported-version rejection
malformed-input rejection
invalid-migrated-config rejection
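A minimal sketch of the before/after capture around an in-place migration, using standard tools only; config-v0.yaml is an illustrative fixture name and the autonomy config migrate invocation itself is elided because its flags are not reproduced in this section.

stat -c '%a %s %Y' config-v0.yaml > before.stat     # mode, size, mtime before migration
sha256sum config-v0.yaml          > before.sha256
# ... run the in-place `autonomy config migrate` invocation here ...
stat -c '%a %s %Y' config-v0.yaml > after.stat
sha256sum config-v0.yaml          > after.sha256
diff before.stat after.stat || true                  # compare mode/size/mtime
diff before.sha256 after.sha256 || true              # content hash should differ after a real migration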
Runs the database-backed audit persistence slice when AUTONOMY_AUDIT_PG_URL or POSTGRES_URL is set (skipped cleanly when absent), capturing (a DB-backed sketch follows this item):
DB-backed audit query text output (primary query path via audit_events)
DB-backed category-filtered query (cert category, JSON format)
DB-backed JSON export and CSV export from audit_events
audit prune --older-than 90d with no-op output (no qualifying rows)
audit prune --older-than 1d deleting the seeded 1h and 2h rows
post-prune query confirming only the most recent row remains
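A minimal sketch of the DB-backed path, assuming a reachable PostgreSQL URL; the URL is illustrative, and whether audit prune picks up the URL from the environment or needs an explicit flag is not spelled out here.

export AUTONOMY_AUDIT_PG_URL="postgres://audit:audit@127.0.0.1:5432/audit"   # illustrative URL
autonomy audit query  --pg-url "$AUTONOMY_AUDIT_PG_URL"                      # primary path: audit_events
autonomy audit query  --pg-url "$AUTONOMY_AUDIT_PG_URL" --category cert --output json
autonomy audit export --pg-url "$AUTONOMY_AUDIT_PG_URL" --format json
autonomy audit export --pg-url "$AUTONOMY_AUDIT_PG_URL" --format csv
autonomy audit prune --older-than 90d    # retention enforcement; no-op when nothing qualifies
autonomy audit prune --older-than 1d     # deletes the seeded older rows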
Runs the cert RBAC enforcement slice (run_cert_rbac_lab), capturing:
all six autonomy cert subcommands denied for an operator without cert:manage or cert:read (confirmed by the presence of “cert:manage” in the error output)
cert issue allowed for cert-admin after granting a custom cert-operator role holding cert:manage
cert list and cert check-revocation allowed for cert-reader after granting a custom cert-reader role holding cert:read
autonomy audit query --category auth capturing retained auth.access.denied records from the same denied-access runs
file-backed fallback query captured alongside DB query for comparison
Runs the RBAC permission enforcement validation slice (run_rbac_val03_lab), proving the three VAL03 claims across a 14-check matrix:
5 DENY checks: unassigned and under-privileged identities are blocked with RBAC error messages before any network call is made
5 ALLOW checks: identities whose role includes the required permission proceed past the guard and reach the control-plane successfully
3 NOT_GUARDED checks: commands without an RBAC guard (rbac role list, rollout plan list, support-bundle generate) succeed or fail for non-RBAC reasons regardless of the operator’s assignment
1 PRESENT check: the retained audit store contains auth.access.denied records from the denial runs
Uses the HA server still running on 127.0.0.1:18090 (started by the default-on RBAC enforcement slice above) for ALLOW-path ha status probes; if the helper is unavailable those three checks are recorded as SKIP instead of being counted as PASS. Produces both a human-readable val03-report.txt and a machine-readable val03-report.json.
Runs the audit completeness validation slice (run_audit_completeness_val04_lab), proving the four VAL04 claims across a 10-check matrix (a field-check sketch follows this item):
VAL04-01: retained store is non-empty after all prior phases have run
VAL04-02: all 6 audit categories (rollout, ha, cert, relay, auth, rollback) return at least one record from the retained store
VAL04-03..08: every record returned in each category query contains all 6 mandatory schema fields (event_name, category, action, outcome, source, timestamp)
VAL04-09: a full retained-store query with --limit 0 --output json succeeds and completes within a 2000 ms latency bound (actual typical elapsed < 100 ms)
VAL04-10: all 25 of the 25 defined wired event types are present in the retained store, proving complete wired-surface coverage for the current lab contract
Produces both a human-readable val04-report.txt and a machine-readable val04-report.json.
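A small sketch of the per-category schema check using jq over the JSON query output; the jq expression is illustrative rather than the lab's exact script.

for cat in rollout ha cert relay auth rollback; do
  autonomy audit query --category "$cat" --output json |
    jq -e 'all(.[]; has("event_name") and has("category") and has("action")
                    and has("outcome") and has("source") and has("timestamp"))' >/dev/null \
    && echo "category=$cat schema=PASS" || echo "category=$cat schema=FAIL"
done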
Runs the OTel integration validation slice (run_otel_val05_lab), proving the four VAL05 claims across a 9-check matrix:
VAL05-01..03 (Prometheus): the control-plane /metrics endpoint returns HTTP 200, all expected metric families are present, and cp_http_requests_total, cp_http_request_duration_seconds_count, cp_rollout_plans_total, plus cp_events_ingested_total show non-zero observations after lab traffic and an explicit POST /v1/events ingest
VAL05-04..06 (WAL pipeline): the telemetry WAL is non-empty after telemetry_emit_helper runs, telemetry export produces non-empty JSONL, and the JSONL contains all mandatory event fields
VAL05-07 (OTLP delivery): telemetry flush to a local autonomy telemetry sink exits 0 and the sink prints at least one received N log records payload line
VAL05-08..09 (correlation IDs): the known trace_id and span_id from the helper events appear in the JSONL export, and traceId / spanId appear in the OTLP sink output
Uses an isolated temp WAL directory (does not touch the runtime WAL). Produces both a human-readable val05-report.txt and a machine-readable val05-report.json.
Runs the support-bundle validation slice (run_support_bundle_val06_lab), proving the four VAL06 claims across a 10-check matrix (an archive-inspection sketch follows this item):
VAL06-01: support-bundle generate exits 0 and writes a non-empty .tar.gz archive
VAL06-02: generation completes within 30 seconds
VAL06-03: the three always-present core files (manifest.json, system_info.json, build_info.json) appear in the archive listing
VAL06-04: manifest.json records all 6 collector names (system_info, build_info, config, ha_status, audit_recent, logs) regardless of their individual status
VAL06-05: system_info.json contains all 5 required fields (os, arch, go_version, hostname, collected_at)
VAL06-06: audit_recent.json is a non-empty JSON array with ≥ 1 record from the retained audit store
VAL06-07: config_redacted.yaml contains fleet_salt: <REDACTED> and the original salt value is absent
VAL06-08: config_redacted.yaml contains REDACTED in the postgres URL and the original password (val06-secret-pass) is absent
VAL06-09: no PEM block (-----BEGIN) appears in any archive entry
VAL06-10: degraded mode — generating with --orchestrator-url pointing to a non-existent server exits 0; manifest.json records ha_status: "failed" (not a fatal error)
Creates two bundles: a normal bundle for checks VAL06-01..09 and a degraded bundle for VAL06-10. Produces both a human-readable val06-report.txt and a machine-readable val06-report.json.
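A small sketch of the redaction and PEM checks run against an extracted bundle, using standard tools; the archive and directory names are illustrative and assume the named files sit at the archive root.

mkdir -p val06-bundle && tar -xzf val06-bundle.tar.gz -C val06-bundle
grep -n 'fleet_salt: <REDACTED>' val06-bundle/config_redacted.yaml    # VAL06-07
! grep -rq 'val06-secret-pass' val06-bundle/                          # VAL06-08: original secret absent
! grep -rq -- '-----BEGIN' val06-bundle/                              # VAL06-09: no PEM blocks anywhere
jq 'length' val06-bundle/audit_recent.json                            # VAL06-06: expect >= 1 record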
Runs the rollout latency baseline slice (run_rollout_latency_val07_lab), proving the four VAL07 claims across a 9-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18992 with an isolated SQLite data directory that is removed and recreated on each run (no prior state) and measures latency using curl -w '%{time_total}' with Python percentile computation (a timing-loop sketch follows this item):
VAL07-01: dedicated control plane starts and GET /v1/health returns 200
VAL07-02..04 (plan-create latency): 20 sequential POST /v1/rollouts requests; p50 ≤ 100 ms, p95 ≤ 300 ms, p99 ≤ 500 ms — the 500 ms p99 bound is the primary workplan latency target for rollout plan creation
VAL07-05 (plan-list latency): 20 sequential GET /v1/rollouts requests with 20 plans in store; p99 ≤ 500 ms
VAL07-06..07 (concurrent): 5 parallel plan creates all return 2xx and total wall-clock time is ≤ 2000 ms, proving operator-facing responsiveness is not blocked by single-writer SQLite serialisation
VAL07-08: zero non-2xx responses across all 45 benchmark requests
VAL07-09: cp_http_requests_total Prometheus counter is non-zero after the benchmark run
Produces both a human-readable val07-report.txt and a machine-readable val07-report.json (includes plan_create_ms and plan_list_ms latency objects).
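A minimal sketch of the measurement technique named above (curl -w timing plus a Python percentile pass); the request body is a placeholder and the real slice posts valid plan payloads.

for i in $(seq 1 20); do
  curl -s -o /dev/null -w '%{http_code} %{time_total}\n' \
    -X POST http://127.0.0.1:18992/v1/rollouts \
    -H 'Content-Type: application/json' -d '{"metadata":{"id":"lat-'"$i"'"}}'   # placeholder body
done > create-raw.txt
awk '{print $2*1000}' create-raw.txt | python3 -c '
import sys, statistics
ms = sorted(float(x) for x in sys.stdin)
q = statistics.quantiles(ms, n=100)
print(f"p50={q[49]:.1f} p95={q[94]:.1f} p99={q[98]:.1f}")'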
Runs the rollout throughput slice (run_rollout_throughput_val08_lab), proving the four VAL08 claims across a 10-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18993 with an isolated SQLite data directory that is removed and recreated on each run, then runs four concurrency tiers (N=1, 10, 50, 100 workers) each creating 5 plans sequentially using background bash subshells (a worker-tier sketch follows this item):
VAL08-01: dedicated control plane starts and GET /v1/health returns 200
VAL08-02..05 (concurrency tiers): at N=1, 10, 50, and 100 concurrent workers all plans are accepted with zero errors; VAL08-05 (N=100, 500 total plans) is the primary workplan target (≥100 concurrent device rollouts)
VAL08-06 (wall clock): the N=100 scenario completes within 30 s
VAL08-07 (throughput scaling): throughput at N=100 is ≥ throughput at N=1, confirming concurrency does not regress the write path
VAL08-08: aggregate error count across all four scenarios is zero
VAL08-09: GET /v1/rollouts after all scenarios returns ≥ (grand_total − errors) plans, confirming durable storage
VAL08-10: cp_http_requests_total Prometheus counter is non-zero
Produces both a human-readable val08-report.txt and a machine-readable val08-report.json (includes throughput object with plans/sec at each tier).
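A compact sketch of the background-subshell pattern for one concurrency tier (N workers, 5 sequential creates each); the plan body is again a placeholder.

N=10                                       # tier size: 1, 10, 50, or 100
for w in $(seq 1 "$N"); do
  (
    errors=0
    for p in $(seq 1 5); do
      code=$(curl -s -o /dev/null -w '%{http_code}' \
        -X POST http://127.0.0.1:18993/v1/rollouts \
        -H 'Content-Type: application/json' -d '{"metadata":{"id":"tp-'"$w-$p"'"}}')
      [ "$code" = "201" ] || errors=$((errors+1))
    done
    echo "$errors" > "worker-$w.txt"       # per-worker error count; all should be 0
  ) &
done
wait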
Runs the stuck rollout detection slice (run_stuck_detection_val09_lab), proving the four VAL09 claims across a 10-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18994 with an isolated SQLite data directory that is removed and recreated on each run. Staleness injection uses threshold_seconds=3 plus a 4-second sleep — no SQLite manipulation required:
VAL09-01: dedicated control plane starts and GET /v1/health returns 200
VAL09-02 (empty baseline): scan on empty store returns stuck_count=0
VAL09-03 (fresh plans): 5 plans created; immediate scan returns stuck_count=0 (plans younger than threshold)
VAL09-04 (stale detection): after sleeping 4 s, rescan returns stuck_count=5 — all 5 plans detected as stuck
VAL09-05 (diagnosis): all 5 stuck plans carry the exact expected diagnosis string ("zero activations — nodes may not be receiving the plan or artifact distribution is incomplete")
VAL09-06 (paused excluded): pausing plan-b removes it from the stuck scan (paused is not an active phase)
VAL09-07 (terminal excluded): cancelling plan-c removes it from the stuck scan (terminal phase)
VAL09-08 (retry recovery): POST recover strategy=retry on plan-d returns new_phase=active; updated_at is refreshed
VAL09-09 (rollback recovery): POST recover strategy=rollback on plan-e returns new_phase=rolled_back
VAL09-10 (post-recovery clean): final scan confirms plan-a still stuck, plan-b/c/d/e absent (paused/terminal/retry-refreshed)
Produces both a human-readable val09-report.txt and a machine-readable val09-report.json.
Runs the rollback reliability slice (run_rollback_reliability_val10_lab), proving the four VAL10 claims across a 10-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18995. Runs rollback preview for all four target kinds (read-only, no CP needed) then exercises the CLI execute path with 5 + 5 = 10 plan creates and dispatches:
VAL10-01: all 4 preview commands exit 0
VAL10-02: rollout_plan JSON preview has safety_class=terminal, orchestrated=true, valid strategies retry and rollback
VAL10-03: relay_deadletter JSON preview has orchestrated=false and edgectl instructions in manual_path
VAL10-04: 5 × rollback execute strategy=retry on real plans — all exit 0 (success_rate=1.000)
VAL10-05: 5 × rollback execute strategy=rollback on real plans — all exit 0 (success_rate=1.000)
VAL10-06: --output json execute has outcome, new_state, kind fields
VAL10-07: execute on nonexistent plan exits non-zero
VAL10-08: execute on relay_deadletter (not orchestrated) exits non-zero and prints edgectl instructions
VAL10-09: audit query --event-type rollback.preview.requested returns ≥ 4 actor-scoped events from this slice of the retained store
VAL10-10: aggregate success rate across 10 executes is ≥ 0.990 (workplan target: ≥99%); at least 10 actor-scoped rollback.executed success events from this slice are retained
Produces both a human-readable val10-report.txt and a machine-readable val10-report.json (includes success_rate object with per-strategy and aggregate rates).
Runs the chaos validation slice (run_chaos_val11_lab), proving the four VAL11 claims across a 10-check matrix. Starts a fresh dedicated control plane on 127.0.0.1:18996 with an isolated SQLite data directory that is reused across kill/restart cycles within the function (durability requires the same data dir). Chaos injection is via SIGTERM — no root/iptables required (a kill/restart sketch follows this item):
VAL11-01: dedicated control plane starts and GET /v1/health returns 200
VAL11-02 (CP kill → client error): after SIGTERM, autonomy rollback execute exits non-zero with a connection-refused / dial error, confirming the CLI does not silently swallow CP unavailability
VAL11-03 (data durability): after CP restart against the same SQLite data dir, GET /v1/rollouts returns ≥ 10 plans, confirming the full pre-kill corpus survived
VAL11-04 (rapid restart resilience): 3 additional kill+restart cycles; final plan count ≥ pre-rapid count; new plan create returns 201 — confirms repeated restarts do not corrupt the store or break the write path
VAL11-05 (gate-wait survival): plan val11-gate-1 (created in published phase before the kill) retains phase=published after restart, confirming the gate-wait state is not lost across the CP kill boundary
VAL11-06 (device-unresponsive proxy): 3 proxy plans created, sleep 3 s (threshold=2 s); stuck scan returns stuck_count ≥ 3 with non-empty diagnosis strings, proving the operator can detect unresponsive devices
VAL11-07 (artifact corruption proxy): plan with invalid artifact_ref and target_lock_fingerprint is accepted with HTTP 201 and is queryable by the operator via GET /v1/rollouts/val11-corrupt-1
VAL11-08 (corrupt plan rollback): rollback execute strategy=rollback on the corrupt-artifact plan exits 0, proving the operator can roll back a plan with suspicious artifact metadata
VAL11-09 (bulk cascade recovery): 3 × rollback execute strategy=retry on the device-unresponsive proxy plans — all exit 0 (cascade_ok=3)
VAL11-10 (audit integrity): audit query --event-type rollback.executed scoped to the chaos actor and slice start-time returns ≥ 1 success events, confirming audit capture is not disrupted by CP kill/restart cycles
Produces both a human-readable val11-report.txt and a machine-readable val11-report.json.
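A minimal sketch of the SIGTERM kill/restart durability pattern; the serve flags and data-directory handling are elided because they belong to the runner, and $CP_PID is illustrative.

kill -TERM "$CP_PID"                                      # chaos injection: graceful kill of the CP
curl -s http://127.0.0.1:18996/v1/health || echo "down"   # connection refused while the CP is down
autonomy-orchestrator serve ... &                         # restart against the SAME SQLite data dir (flags elided)
CP_PID=$!
sleep 1
curl -s http://127.0.0.1:18996/v1/rollouts                # the pre-kill plan corpus should still be listed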
Runs the HA failover validation slice (run_ha_failover_val13_lab), proving the HA leader-election subsystem under unplanned failure conditions. Two orchestrator_ha_server nodes (ports 18997/18998) share a dedicated Docker PostgreSQL instance. The 10-check matrix covers:
VAL13-01 (baseline): node-1 acquires leadership first; node-2 is a follower
VAL13-02 (pre-kill write): SQL probe row inserted while node-1 is leader
VAL13-03 (SIGTERM failover): failover time measured from kill to node-2 “acquired leadership” log entry; threshold ≤ 5000 ms
VAL13-04 (zero data loss): probe row note=pre-kill readable from the shared PG instance after failover, proving shared-database durability
VAL13-05 (post-failover leader active): node-2 ha status confirms it holds leadership after node-1 exits
VAL13-06 (PG crash detected): docker stop on PG primary → node-2 /v1/ha/quorum transitions to quorum_health=lost
VAL13-07 (PG crash recovery): manual docker start → quorum_health=healthy restored; probe row still intact
VAL13-08 (rapid kill cycles): 3 × alternating SIGTERM failover cycles, each measured ≤ 5000 ms, followed by leader-status and probe-row checks
VAL13-09 (SIGKILL disk-fault proxy): SIGKILL on leader (no graceful Resign, simulates OOM/disk crash) → node-2 takes over ≤ 5000 ms
VAL13-10 (post-chaos stability): single write-ready leader confirmed after all failure scenarios
Produces val13-report.txt and val13-report.json.

Runs the HA replication lag baseline slice (run_ha_replication_lag_val14_lab), benchmarking PostgreSQL streaming replication lag under the autonomyops HA architecture. A val14-pg-primary + val14-pg-standby Docker pair is provisioned via pg_basebackup; a single HA server at port 18999 uses --min-sync-replicas 1 so quorum health tracks standby availability. The 10-check matrix covers:
VAL14-01 (replication streaming): pg_stat_replication.state = streaming confirmed after standby provisioning
VAL14-02 (idle LSN gap zero): (pg_current_wal_lsn() - write_lsn)::bigint = 0 at rest — no unreplicated WAL
VAL14-03 (light load drain): 100 rows × 500 bytes inserted with synchronous_commit=off; lag is sampled during the active drain window and the WAL LSN gap drains to 0 within 2000 ms
VAL14-04 (heavy load drain): 500 rows × 2000 bytes (~1 MB WAL) with synchronous_commit=off; LSN gap drains within 5000 ms
VAL14-05 (post-drain gap closed): LSN gap confirmed 0 after the heavy drain sequence completes
VAL14-06 (HA replication endpoint): /v1/health/replication responds 200 with a valid JSON body
VAL14-07 (standby stop degraded): docker stop val14-pg-standby → quorum_health=degraded within 30 s
VAL14-08 (standby start healthy): docker start val14-pg-standby → quorum_health=healthy restored
VAL14-09 (catch-up after restart): after generating WAL while the standby is offline, the backlog drains to 0 within 10 s of standby restart, proving the replication slot catches up correctly
VAL14-10 (threshold report): practical alerting thresholds derived from observed p95 lag using the formula healthy = max(p95×3+1, 10), degraded = max(healthy×10, 100), alert = max(healthy×50, 500) — all three thresholds must be positive
Produces val14-report.txt and val14-report.json.

Runs the backup/restore validation slice (run_backup_restore_val15_lab), proving the ha backup create / list / restore workflow end-to-end with integrity verification, timing bounds, and error-path safety checks. A dedicated Docker PostgreSQL instance (val15-pg-primary) is provisioned with two fixture tables (val15_small: 100 rows × 200 bytes; val15_medium: 1,000 rows × 1,000 bytes). The HA server at port 19001 is restarted between phases to transition between normal mode (backup) and maintenance mode (restore). The 10-check matrix covers:
VAL15-01 (backup created): ha backup create exits 0; .dump file exists with size > 0
VAL15-02 (backup file valid): pg_restore -l on the .dump file exits 0 and the TOC contains TABLE DATA entries
VAL15-03 (backup metadata correct): ha backup list --output json shows backup_id with status=completed
VAL15-04 (checksum verified): SHA-256 of the .dump file computed independently matches the checksum=<hex> field in the CLI output
VAL15-05 (backup timing bound): ha backup create wall time ≤ 30,000 ms
VAL15-06 (restore correct): after mutating the tables post-backup (UPDATE first 50 rows; DELETE half the medium table), ha backup restore --confirm reverts the data; post-restore counts are small=100, medium=1000 with correct spot-check payload values
VAL15-07 (restore timing bound): ha backup restore wall time ≤ 60,000 ms
VAL15-08 (multi-backup inventory): a second ha backup create followed by ha backup list returns count ≥ 2 with both backup IDs present
VAL15-09 (restore requires confirm): ha backup restore without --confirm exits non-zero and mentions --confirm in the error output (safety gate enforcement)
VAL15-10 (audit events captured): audit query --event-type ha.backup.created and --event-type ha.backup.restored each return ≥ 1 event from the shared audit store
Produces val15-report.txt and val15-report.json.

Runs the split-brain chaos validation slice (run_split_brain_chaos_val16_lab), proving the split-brain detection and recovery subsystem under SQL-injected fault conditions. A dedicated Docker PostgreSQL instance (val16-pg-primary) is provisioned with a single HA server at port 19002 (--min-sync-replicas 0, --campaign 500ms). Injection is performed directly via psql against leadership_state and leader_epochs — no iptables or second HA node required. The 10-check matrix covers:
VAL16-01 (baseline risk none): /v1/ha/split-brain returns risk=none before any injection
VAL16-02 (epoch inject detected): after UPDATE leadership_state SET current_epoch = current_epoch + 99, holder_id = 'val16-injected-node', the API returns risk=detected
VAL16-03 (detection idempotent): a second API call returns the same risk=detected without side effects
VAL16-04 (dry-run reconcile ok): ha split-brain recover --strategy manual-reconcile exits 0 and a follow-up /v1/ha/split-brain read still reports risk=detected (planning path, no DB writes)
VAL16-05 (promote leader recovers): ha split-brain recover --strategy promote-leader exits 0 and the API subsequently returns risk=none
VAL16-06 (data integrity after recovery): a user-table probe row (val16_probe.note = 'pre-inject') is untouched after promote-leader recovery, confirming recovery is scoped to metadata tables only
VAL16-07 (ghost node possible): inserting two leader_epochs rows with resigned_at IS NULL and foreign holder_id values raises risk=possible
VAL16-08 (ghost node self-clears): stamping resigned_at on the injected rows causes the API to return risk=none without any HA server restart
VAL16-09 (audit events captured): audit query --event-type ha.split_brain.detected --start-time <slice_start> and --event-type ha.split_brain.recovered --actor val16-operator --start-time <slice_start> each return ≥ 1 event from this slice in the shared audit store
VAL16-10 (post chaos stability): /v1/ha/status confirms holder_id contains cp-val16-node after all chaos scenarios complete
Produces val16-report.txt and val16-report.json.

Runs the quorum loss validation slice (run_quorum_loss_val17_lab), proving quorum loss detection timing, write-blocking safety behavior, and recovery detection against the workplan’s ≤ 60 s target (Gap HA-004). A dedicated Docker PostgreSQL instance (val17-pg-primary) is provisioned with a single HA server at port 19003 (--min-sync-replicas 0, --quorum-monitor-interval 500ms). Quorum loss is induced by docker stop val17-pg-primary; recovery by docker start. Timing is measured in milliseconds using python3 time.time()*1000 wrapped around each docker stop/start call (a timing sketch follows this item). The 10-check matrix covers:
VAL17-01 (baseline quorum healthy): /v1/ha/quorum returns quorum_health=healthy and write_block_active=false before any fault
VAL17-02 (loss detected): quorum_health=lost after docker stop
VAL17-03 (loss detection timing bound): loss_ms ≤ 30,000 (30 s threshold for the workplan ≤ 60 s target with --quorum-monitor-interval 500ms)
VAL17-04 (write blocked during loss): write_block_active=true and can_accept_protected_writes=false in the quorum status JSON during loss
VAL17-05 (loss reason populated): quorum_loss_reason non-empty during loss
VAL17-06 (recovery detected): quorum_health=healthy after docker start
VAL17-07 (recovery timing bound): recovery_ms ≤ 30,000
VAL17-08 (write unblocked after recovery): write_block_active=false; can_accept_protected_writes=true; last_lost_at and last_restored_at timestamps populated; detected_loss_count ≥ 1
VAL17-09 (second cycle count increments): after a second loss/recovery completes, detected_loss_count ≥ 2
VAL17-10 (audit events captured): audit query --event-type ha.quorum.lost --start-time <val17_start_time> and --event-type ha.quorum.restored --start-time <val17_start_time> each return ≥ 1 event
Produces val17-report.txt and val17-report.json.
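A minimal sketch of the millisecond timing wrapper used for the VAL17 loss/recovery bounds, polling /v1/ha/quorum until the expected state appears; the polling interval is illustrative.

start_ms=$(python3 -c 'import time; print(int(time.time()*1000))')
docker stop val17-pg-primary
until [ "$(curl -s http://127.0.0.1:19003/v1/ha/quorum | jq -r '.quorum_health')" = "lost" ]; do
  sleep 0.5
done
end_ms=$(python3 -c 'import time; print(int(time.time()*1000))')
echo "loss_ms=$((end_ms - start_ms))"    # compare against the 30,000 ms bound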
5. Generated Artifacts¶
The evidence bundle is split by surface:
autonomy/ for rollout and cert captures
metrics/ for the live PR-22 metrics list/query captures
ha/ for live HA backup and failover captures
quorum/ for live HA quorum status and transition captures
split-brain/ for live split-brain detection captures
rbac/ for local RBAC store and assignment captures
relay/ for the live edge relay captures
retained/ for the file-backed audit store plus query/export captures
support-bundle/ for the generated archive plus extracted proof files
config-migrate/ for the live PR-27 config migration captures
cert_rbac/ for the PR-29-followup-e cert RBAC enforcement captures
val03/ for the VAL03 RBAC permission enforcement validation captures
val04/ for the VAL04 audit completeness validation captures
val05/ for the VAL05 OTel integration validation captures
val06/ for the VAL06 support-bundle validation captures
val07/ for the VAL07 rollout latency baseline captures
val08/ for the VAL08 rollout throughput validation captures
val09/ for the VAL09 stuck rollout detection validation captures
val10/ for the VAL10 rollback reliability validation captures
val11/ for the VAL11 chaos validation captures
Representative files:
autonomy/rollout-plan-create.txtautonomy/audit-rollout-plan-create.logautonomy/cert-rotate.txtautonomy/audit-cert-rotate.logautonomy/cert-rotation-list-expiring.txt— VAL01-1: 2-day cert appears in expiry windowautonomy/cert-rotation-prerotate-health.json— VAL01-2: old cert accepted over live mTLSautonomy/cert-rotation-before-dates.txt— openssl serial/dates before rotation (baseline)autonomy/cert-rotation-timing.txt— VAL01-3: elapsed seconds +pass=truevs 300s boundautonomy/cert-rotation-rotate.txt—cert rotatestdoutautonomy/cert-rotation-audit-rotate.log— slogcert.rotatedevent from rotationautonomy/cert-rotation-after-dates.txt— openssl serial/dates after rotationautonomy/cert-rotation-list-after.txt— VAL01-4: no certificates matched in expiry windowautonomy/cert-rotation-postrotate-health.json— VAL01-5: rotated client cert accepted without restartautonomy/cert-rotation-audit-events.json— VAL01-6: retainedcert.rotatedqueryautonomy/cert-rotation-val01-report.txt— composite 6/6 PASS report + serial assertionautonomy/cert-rejection-missing-client-cert.stderr— VAL02-1: TLS cert required errorautonomy/cert-rejection-invalid-chain.stderr— VAL02-2: chain verification failure (rogue CA)autonomy/cert-rejection-expired-cert.stderr— VAL02-3: cert expired rejectionautonomy/cert-rejection-revoked.stderr— VAL02-4: CRL rejection (node-a, same gate as Phase 5)autonomy/cert-rejection-wrong-server-trust.stderr— VAL02-5: client-side verify failureautonomy/cert-rejection-val02-report.txt— composite 5/5 PASS report + right_ca_wrong_cn notemetrics/orchestrator-metrics-raw.txtmetrics/metrics-list.txtmetrics/metrics-list.jsonmetrics/metrics-query-all.txtmetrics/metrics-query-rollout-plans.jsonmetrics/metrics-query-http-duration.txtrbac/rbac-role-create.txtrbac/rbac-role-list-before-assign.txtrbac/rbac-role-assign.txtrbac/audit-rbac-role-assign.logrbac/rbac-role-assign-repeat.txtrbac/rbac-role-list-after-assign.txtrbac/rbac-role-list.jsonrbac/assignments.jsonrbac/rbac-default-on-denied.stderr— bootstrap mode denial with no env varrbac/rbac-bootstrap-seed.txt— bootstrap role assign outputrbac/audit-rbac-bootstrap-seed.log— auth.bootstrap.access audit eventrbac/rbac-post-bootstrap-insufficient.stderr— outsider denied after bootstraprbac/rbac-post-bootstrap-allowed.json— allowed after bootstrap + role grantrbac/rbac-enforcement-disabled.json— opt-out AUTONOMY_RBAC_ENFORCEMENT=0rbac/rbac-break-glass-allowed.json— break-glass bypassed actionrbac/audit-rbac-break-glass.log— auth.break_glass.used audit eventrbac/audit-break-glass-events.json— retained break-glass eventsrbac/rbac-break-glass-no-operator.stderr— break-glass safety: denied without operatorrbac/rbac-role-assign-operator.txtrbac/rbac-ha-status-denied.stderr— default-on denial (no env var)rbac/rbac-ha-status-allowed.txtrbac/rbac-audit-query-denied.stderr— default-on denial (no env var)rbac/rbac-audit-query-allowed.jsonrbac/rbac-audit-export-denied.stderr— default-on denial (no env 
var)rbac/rbac-audit-export-allowed.jsonrbac/retained-auth-access-denied.jsonha/ha-status-no-header.headersha/ha-status-no-header.jsonha/ha-status-with-header.headersha/ha-status-with-header.jsonha/ha-backup-create.txtha/audit-ha-backup-create.logha/ha-backup-restore.txtha/audit-ha-backup-restore.logha/ha-failover-trigger.txtha/audit-ha-failover-trigger.logquorum/ha-quorum-healthy.txtquorum/ha-quorum-degraded.txtquorum/ha-quorum-lost.jsonquorum/ha-quorum-restored.jsonquorum/audit-ha-quorum-lost.jsonquorum/audit-ha-quorum-restored.jsonsplit-brain/ha-split-brain-healthy.txtsplit-brain/ha-split-brain-detected.jsonsplit-brain/ha-split-brain-detected-repeat.jsonsplit-brain/ha-split-brain-recover-dry-run.txtsplit-brain/ha-split-brain-recover-execute.txtsplit-brain/ha-split-brain-recovered.txtsplit-brain/ha-split-brain-recovered.jsonsplit-brain/ha-status-after-recovery.jsonsplit-brain/ha-split-brain-possible.txtsplit-brain/audit-ha-split-brain-detected-after-dedupe.jsonsplit-brain/audit-ha-split-brain-detected-all.jsonsplit-brain/audit-ha-split-brain-recovered-after-execute.jsonsplit-brain/audit-ha-split-brain-recovered-all.jsonsplit-brain/ha-split-brain-summary.jsonrelay/relay-deadletter-retry.txtrelay/audit-relay-deadletter-retry.logrelay/relay-bandwidth-set.txtrelay/audit-relay-bandwidth-set.logretained/audit-query-all.txtretained/audit-query-category-auth.jsonretained/audit-query-category-rollout.txtretained/audit-query-ha-backup-created.jsonretained/audit-query-source-edge.jsonretained/audit-query-outcome-success.txtretained/audit-query-invalid-output.stderrretained/audit-query-invalid-range.stderrretained/audit-export-all.jsonretained/audit-export-all.csvretained/audit-export-invalid-format.stderrretained/audit-export-invalid-format-target-before.sha256retained/audit-export-invalid-format-target-after.sha256retained/audit-export-invalid-format-target.txtretained/retained-file-list.txtsupport-bundle/autonomy-support-bundle-live.tar.gzsupport-bundle/support-bundle-generate.logsupport-bundle/support-bundle-contents.txtsupport-bundle/manifest.jsonsupport-bundle/config_redacted.yamlsupport-bundle/ha_status.jsonsupport-bundle/audit_recent.jsonsupport-bundle/logs-autonomy.logsupport-bundle/support-bundle-summary.txtconfig-migrate/config-migrate-dry-run.txtconfig-migrate/config-migrate-stdout.yamlconfig-migrate/config-migrate-stdout.tomlconfig-migrate/config-migrated.yamlconfig-migrate/config-migrated.tomlconfig-migrate/config-migrated-in-place.yamlconfig-migrate/config-migrate-in-place-before.statconfig-migrate/config-migrate-in-place-after.statconfig-migrate/config-migrate-unsupported.stderrconfig-migrate/config-migrate-invalid-input.stderrconfig-migrate/config-migrate-invalid-v0.stderrdb_audit/schema-dry-run.txt(DB-backed; skipped when no PG URL)db_audit/query-file-all.txt— file-backed baseline query (fallback path)db_audit/query-db-all.txt— DB-backed query text (primary path)db_audit/query-db-cert.json— DB-backed query filtered to cert categorydb_audit/export-db-all.json— DB-backed JSON exportdb_audit/export-db-all.csv— DB-backed CSV exportdb_audit/prune-90d.txt— prune with 90d cutoff (no-op output)db_audit/prune-1d.txt— prune with 1d cutoff (deletes seeded old rows)db_audit/query-db-after-prune.json— post-prune query confirming remaining rowscert_rbac/denied-issue.txt— cert issue denied (cert:manage in error)cert_rbac/denied-rotate.txt— cert rotate deniedcert_rbac/denied-revoke.txt— cert revoke deniedcert_rbac/denied-list.txt— cert list denied (newly guarded; cert:manage in 
error)cert_rbac/denied-check-revocation.txt— cert check-revocation denied (newly guarded)cert_rbac/denied-sync-crl.txt— cert sync-crl deniedcert_rbac/allowed-issue.txt— successful cert issue undercert:managecert_rbac/allowed-list.txt— successful cert list undercert:readcert_rbac/allowed-check-revocation.txt— successful read-only non-revoked check undercert:readcert_rbac/audit-denied-events.json— retained auth.access.denied records for cert operationsval03/setup-seed-auditor.txt— VAL03 bootstrap auditor assignmentval03/setup-assign-operator.txt— VAL03 operator assignmentval03/setup-assign-analyst.txt— VAL03 analyst assignmentval03/setup-server-assign-auditor.txt— mirrored server-side auditor assignmentval03/setup-server-assign-operator.txt— mirrored server-side operator assignmentval03/setup-server-assign-analyst.txt— mirrored server-side analyst assignmentval03/val03-01-ha-status-deny.stderr— unassigned deniedha status(rbac:pattern)val03/val03-02-ha-status-operator-allow.txt— operator allowedha statusval03/val03-03-ha-status-analyst-allow.txt— analyst allowedha statusval03/val03-04-ha-status-auditor-allow.txt— auditor allowedha statusval03/val03-05-audit-query-operator-deny.stderr— operator deniedaudit queryval03/val03-06-audit-query-analyst-deny.stderr— analyst deniedaudit queryval03/val03-07-audit-query-auditor-allow.json— auditor allowedaudit queryval03/val03-08-rbac-role-list-unassigned.txt— unassignedrbac role list(not guarded)val03/val03-09-rollout-plan-list-unassigned.stderr— unassignedrollout plan list(connection error, not RBAC)val03/val03-10-rbac-role-create-operator-deny.stderr— operator deniedrbac role createval03/val03-11-rbac-role-create-analyst-deny.stderr— analyst deniedrbac role createval03/val03-12-rbac-role-create-auditor-allow.txt— auditor allowedrbac role createval03/val03-13-support-bundle-unassigned.stderr— unassignedsupport-bundle generateprogress withbundle written:(top-level command not guarded)val03/val03-support-bundle.tar.gz— bundle generated by the unguarded support-bundle checkval03/val03-14-access-denied-events.json— retained denial tuples from this VAL03 sliceval03/val03-report.txt— composite 14-check PASS/FAIL reportval03/val03-report.json— machine-readable JSON reportval04/val04-store-inventory.txt— retained store JSONL file countval04/val04-category-rollout.json— all rollout-category records from retained storeval04/val04-category-ha.json— all ha-category recordsval04/val04-category-cert.json— all cert-category recordsval04/val04-category-relay.json— all relay-category recordsval04/val04-category-auth.json— all auth-category recordsval04/val04-category-rollback.json— all rollback-category recordsval04/val04-category-summary.txt—category=<cat> count=<N>lines for all 6 categoriesval04/val04-schema-check.txt— per-category schema field check results (PASS/FAIL + MISSING lines)val04/val04-latency.txt—query_ok,query_elapsed_ms,bound_ms=2000,pass=true/falseval04/val04-coverage-all.json— full retained store JSON (all records,--limit 0)val04/val04-coverage-report.txt— PRESENT/ABSENT per event type, finalfound/expected/threshold/passval04/val04-report.txt— composite 10-check PASS/FAIL reportval04/val04-report.json— machine-readable JSON report with latency and coverage countsval05/val05-prometheus-status.txt— HTTP status code for/metricsendpointval05/val05-prometheus-raw.txt— full Prometheus text exposition from orchestratorval05/val05-prometheus-families.txt— PRESENT/ABSENT per required metric familyval05/val05-events-ingest.json— explicitPOST /v1/eventsresponse 
used to exercisecp_events_ingested_totalval05/val05-prometheus-observations.txt— sample lines for 4 non-zero observation checksval05/val05-emit-helper.txt—telemetry_emit_helperoutput (emitted 3 events to …)val05/val05-wal-status.json—telemetry status --jsonoutput{total, exported, pending}val05/val05-wal-inventory.txt— WAL dir, file count, total events, pass flagval05/val05-export.jsonl— JSONL output fromtelemetry export(3 events)val05/val05-export-fields.txt— PRESENT/ABSENT per mandatory JSONL event fieldval05/val05-sink-output.txt— OTLP payloads received byautonomy telemetry sinkval05/val05-flush-stdout.txt—telemetry flush: OK — N events sent to …val05/val05-flush-summary.txt—flush_ok,sink_lines,sink_payloads,passflagval05/val05-traceid-jsonl.txt—trace_id_found=<value>,span_id_found=<value>, or ABSENTval05/val05-traceid-otlp.txt—traceId_found=true/false,spanId_found=true/falseval05/val05-report.txt— composite 9-check PASS/FAIL reportval05/val05-report.json— machine-readable JSON reportval06/val06-bundle.tar.gz— normal support bundle archive under inspectionval06/val06-generate-stdout.txt— normal stdout capture (typically empty; progress is emitted on stderr)val06/val06-generate.log— stderr includinggenerating support bundle, per-collector progress, andbundle written:confirmationval06/val06-timing.txt—generate_ok,elapsed_s,bound_s=30,pass_timingval06/val06-contents.txt— archive file listing fromtar -tzfval06/val06-manifest.json— extractedmanifest.jsonwith all 6 collector statusesval06/val06-system-info.json— extractedsystem_info.jsonval06/val06-config-redacted.yaml— extractedconfig_redacted.yamlfor redaction inspectionval06/val06-audit-recent.json— extractedaudit_recent.json(50 most recent records)val06/val06-core-files.txt—PRESENT/ABSENTper required core fileval06/val06-manifest-check.txt—collector=<name> status=<status>for each of the 6 collectorsval06/val06-sysinfo-check.txt—PRESENT/ABSENTper requiredsystem_infofieldval06/val06-audit-check.txt—audit_recent_count,passval06/val06-redaction-salt.txt—fleet_salt_placeholder,fleet_salt_actual_absent,passval06/val06-redaction-pg.txt—pg_redacted_present,pg_secret_absent,passval06/val06-privkey-check.txt—privkey_hits=0,passval06/val06-bundle-degraded.tar.gz— degraded bundle (ha_status failed)val06/val06-degraded-manifest.json— extracted manifest from degraded bundleval06/val06-degraded-check.txt—bundle_exit_ok=true,ha_status_status=failed,passval06/val06-report.txt— composite 10-check PASS/FAIL reportval06/val06-report.json— machine-readable JSON report with elapsed_s and per-check statusesval07/val07-cp.log— dedicated VAL07 control-plane startup and per-request logsval07/val07-health.txt—health_code=200,passval07/val07-create-raw.txt— 20 lines:<http_code> <time_total_s>from sequential createsval07/val07-create-percentiles.txt—expected_n=20,n=20,sample_complete=true,p50_ms,p95_ms,p99_ms,min_ms,max_msval07/val07-list-raw.txt— 20 lines:<http_code> <time_total_s>from sequential listsval07/val07-list-percentiles.txt—expected_n=20,n=20,sample_complete=true,p50_ms,p95_ms,p99_ms,min_ms,max_msval07/val07-concurrent-raw.txt— 5 lines from concurrent createsval07/val07-concurrent-summary.txt—concurrent_n=5,conc_ok,conc_errors,wall_ms,bound_ms=2000,wall_passval07/val07-error-summary.txt—total_requests=45,error_count,passval07/val07-metrics-raw.txt— Prometheus text exposition from dedicated VAL07 control planeval07/val07-prometheus-check.txt—cp_http_requests_totalcount,passval07/val07-report.txt— composite 9-check PASS/FAIL report with p50/p95/p99 
valuesval07/val07-report.json— machine-readable JSON report withplan_create_msandplan_list_mslatency objectsval08/val08-cp.log— dedicated VAL08 control-plane startup and per-request logsval08/val08-health.txt—status=ok,pass=trueval08/scenario-n1/scenario-report.txt—n_workers=1,total_plans=5,ok=5,errors=0,throughput_plans_per_secval08/scenario-n10/scenario-report.txt—n_workers=10,total_plans=50,ok=50,errors=0,throughput_plans_per_secval08/scenario-n50/scenario-report.txt—n_workers=50,total_plans=250,ok=250,errors=0,throughput_plans_per_secval08/scenario-n100/scenario-report.txt—n_workers=100,total_plans=500,ok=500,errors=0,throughput_plans_per_secval08/scenario-n{1,10,50,100}/worker-*.txt— per-worker error counts (one file per worker; all should contain0)val08/val08-wall-clock-n100.txt—elapsed_ms,bound_ms=30000,passval08/val08-throughput-scaling.txt—tput_n1,tput_n10,tput_n50,tput_n100,scaling_passval08/val08-error-aggregate.txt—total_errors=0,pass=trueval08/val08-list-consistency.txt—grand_total_created=805,expected_min,list_count,passval08/val08-metrics-raw.txt— Prometheus text exposition from dedicated VAL08 control planeval08/val08-prometheus-check.txt—cp_http_requests_totalcount,passval08/val08-report.txt— composite 10-check PASS/FAIL report with throughput-by-tier tableval08/val08-report.json— machine-readable JSON report withthroughputobject and per-check statusesval09/val09-cp.log— dedicated VAL09 control-plane startup and per-request logsval09/val09-health.txt—status=ok,pass=trueval09/val09-plans-created.txt— 5 plan IDs with HTTP 201 create codesval09/val09-scan-empty.json— stuck scan on empty store (stuck_count=0)val09/val09-baseline-check.txt—stuck_count=0,expected=0,pass=trueval09/val09-scan-fresh.json— stuck scan immediately after creation (stuck_count=0)val09/val09-fresh-check.txt—stuck_count=0,expected=0,pass=trueval09/val09-scan-stale.json— stuck scan after sleep; fullstuck_plansarray withdiagnosisfieldsval09/val09-stale-check.txt—stuck_count=5,expected=5,pass=trueval09/val09-diagnosis-check.txt—total=5,diagnosis_populated=5,pass=trueval09/val09-pause-planb.json— pause response for plan-b (phase=paused)val09/val09-scan-after-pause.json— stuck scan after pause (plan-b absent)val09/val09-pause-check.txt—paused_plan=val09-plan-b,in_stuck_scan=no,pass=trueval09/val09-cancel-planc.json— cancel response for plan-cval09/val09-scan-after-cancel.json— stuck scan after cancel (plan-c absent)val09/val09-cancel-check.txt—cancelled_plan=val09-plan-c,in_stuck_scan=no,pass=trueval09/val09-recover-retry-pland.json— retry recovery response (new_phase=active)val09/val09-retry-check.txt—new_phase=active,pass=trueval09/val09-recover-rollback-plane.json— rollback recovery response (new_phase=rolled_back)val09/val09-rollback-check.txt—new_phase=rolled_back,pass=trueval09/val09-scan-final.json— final stuck scan (plan-a present; b/c/d/e absent)val09/val09-final-check.txt— per-plan presence flags +pass=trueval09/val09-report.txt— composite 10-check PASS/FAIL reportval09/val09-report.json— machine-readable JSON report with scan counts and per-check statusesval10/val10-cp.log— dedicated VAL10 control-plane startup and per-request logsval10/val10-health.txt—status=ok,pass=trueval10/val10-preview-rollout_plan.txt— text safety profile for rollout_plan targetval10/val10-preview-rollout_stage.txt— text safety profile for rollout_stage targetval10/val10-preview-ha_leader_resign.txt— text safety profile for ha_leader_resign targetval10/val10-preview-relay_deadletter.txt— text safety profile for 
relay_deadletter targetval10/val10-preview-check.txt—preview_errors=0,pass=trueval10/val10-preview-rollout_plan.json— JSON preview (safety_class=terminal,orchestrated=true)val10/val10-preview-rollout_plan-check.txt— field checks,pass=trueval10/val10-preview-relay_deadletter.json— JSON preview (orchestrated=false)val10/val10-preview-relay-check.txt—orchestrated=false,manual_path_has_edgectl=true,pass=trueval10/val10-retry-plans-created.txt— 5 ×plan_id=val10-retry-N code=201val10/retry/execute-retry-{1..5}.txt— per-plan retry execute output (outcome=success previous=published new=active)val10/val10-retry-rate.txt—strategy=retry ok=5 fail=0 total=5 success_rate=1.000 pass=trueval10/val10-rollback-plans-created.txt— 5 ×plan_id=val10-rollback-N code=201val10/rollback/execute-rollback-{1..5}.txt— per-plan rollback execute output (outcome=success previous=published new=rolled_back)val10/val10-rollback-rate.txt—strategy=rollback ok=5 fail=0 total=5 success_rate=1.000 pass=trueval10/val10-execute-json.json— JSON output from retry execute (Outcome,NewState,Kindfields)val10/val10-execute-json-check.txt— field presence checks,pass=trueval10/val10-execute-nonexistent.txt— error output for nonexistent planval10/val10-nonexistent-check.txt—exit_code=1,pass=trueval10/val10-execute-relay-not-orchestrated.txt—edgectlinstructions in error outputval10/val10-relay-not-orchestrated-check.txt—exit_code!=0,has_edgectl_instructions=1,pass=trueval10/val10-audit-preview-events.json—rollback.preview.requestedevents from audit storeval10/val10-audit-preview-check.txt—rollback.preview.requested_count ≥ 4with actor/start-time scope,pass=trueval10/val10-audit-execute-events.json—rollback.executedevents from audit storeval10/val10-aggregate-rate.txt—agg_ok=10 agg_total=10 agg_success_rate=1.000 pass=trueval10/val10-report.txt— composite 10-check PASS/FAIL report with success rate tableval10/val10-report.json— machine-readable JSON withsuccess_rateobject and per-check statusesval11/val11-cp.log— chaos CP stdout/stderr across all start/stop cycles (append mode)val11/val11-health.txt—status=ok,pass=trueval11/val11-plans-created.txt— 9 plan create HTTP codes (dur-1..5, gate-1, corrupt-1, dev-1..3)val11/val11-scan-stuck.json— stuck scan result (threshold=2s, aftersleep=3s)val11/val11-stuck-check.txt—stuck_count≥3,diagnosis_ok=true,pass=trueval11/val11-corrupt-plan-create.txt— HTTP 201 for corrupt-artifact planval11/val11-corrupt-plan.json— GET /v1/rollouts/val11-corrupt-1 responseval11/val11-corrupt-check.txt—create_code=201,get_ok=truewithplan.metadata.id=val11-corrupt-1,pass=trueval11/val11-execute-rollback-corrupt.txt— rollback execute output for val11-corrupt-1val11/val11-rollback-corrupt-check.txt—exit_ok=true,pass=trueval11/cascade/execute-retry-dev-{1..3}.txt— cascade retry output per planval11/val11-cascade-check.txt—cascade_ok=3,cascade_fail=0,pass=trueval11/val11-kill-client-error.txt— CLI output after CP killval11/val11-kill-check.txt—exit_nonzero=true,has_connection_error=true,pass=trueval11/val11-durability-list.json— GET /v1/rollouts after first restartval11/val11-durability-check.txt—list_count≥10,pass=trueval11/val11-gate-plan.json— GET /v1/rollouts/val11-gate-1 after restartval11/val11-gate-check.txt—phase=published,pass=trueval11/val11-rapid-restart.txt— 3× kill/restart cycle log + final list count + new plan codeval11/val11-audit-executed.json—rollback.executedevents from retained storeval11/val11-audit-check.txt—rollback_executed_success_count≥1with actor/start-time scope,pass=trueval11/val11-report.txt— 
composite 10-check PASS/FAIL chaos reportval11/val11-report.json— machine-readable JSON chaos reportval13/for the VAL13 HA failover validation capturesval13/probe-table-create.txt— output ofCREATE TABLE val13_probeval13/val13-node1-s0.log— node-1 HA server initial session logval13/val13-node2-s0.log— node-2 HA server initial session logval13/val13-node1-s1.log— node-1 HA server restart-1 logval13/val13-node1-s2.log— node-1 HA server restart-2 logval13/val13-node2-s1.log— node-2 HA server restart-1 logval13/val13-node2-s2.log— node-2 HA server restart-2 logval13/val13-01-node1-status.txt—ha statusfor node-1 at baselineval13/val13-01-node2-status.txt—ha statusfor node-2 at baseline (follower)val13/val13-02-pre-kill-write.txt— SQL INSERT output before killval13/val13-03-failover-timing.txt—failover_ms=<N> signal=TERMval13/val13-03-node2-status-after.txt—ha statusimmediately after node-2 takes overval13/val13-04-data-probe.txt—SELECT noteresult (should containpre-kill)val13/val13-05-write-ready.txt—ha statusconfirming node-2 is active leaderval13/val13-06-quorum-lost.json—/v1/ha/quorumJSON showingquorum_health=lostval13/val13-07-quorum-healthy.json—/v1/ha/quorumJSON showingquorum_health=healthyval13/val13-07-data-after-pg-restart.txt— probe row intact after PG stop/startval13/val13-08-cycle1.txt— rapid kill cycle 1 timingval13/val13-08-cycle2.txt— rapid kill cycle 2 timingval13/val13-08-cycle3.txt— rapid kill cycle 3 timingval13/val13-08-rapid-summary.txt—cycle1_ms=N cycle2_ms=N cycle3_ms=Nval13/val13-08-post-cycle-status.txt—ha statusafter the rapid cyclesval13/val13-08-data-after-rapid.txt— probe row still intact after the rapid cyclesval13/val13-09-sigkill-timing.txt—failover_ms=<N> signal=KILLval13/val13-10-final-status.txt—ha statusfor surviving leader after all testsval13/val13-report.txt— composite 10-check PASS/FAIL HA failover reportval13/val13-report.json— machine-readable JSON HA failover reportval14/for the VAL14 HA replication lag baseline capturesval14/val14-pg-setup.txt— role create, pg_hba update, table create, synchronous configval14/val14-pg-basebackup.txt—pg_basebackupprogress + replication slot creationval14/val14-ha-server.log— HA server log (leader election + quorum monitor)val14/val14-01-replication.txt—pg_stat_replicationoutput (state=streaming)val14/val14-02-idle-lsn-gap.txt— LSN gap at rest (should be0)val14/val14-02-idle-lag-samples.txt— 10 idlewrite_lag_mssamplesval14/val14-light-write.txt— INSERT output for 100-row light writeval14/val14-03-light-lag-samples.txt— 5 lag samples captured before the light-load drain fully completesval14/val14-03-light-result.txt—light_drain_ms=<N> threshold=2000val14/val14-heavy-write.txt— INSERT output for 500-row heavy writeval14/val14-04-heavy-lag-samples.txt— 5 lag samples during/after heavy writeval14/val14-04-heavy-result.txt—heavy_drain_ms=<N> threshold=5000val14/val14-05-post-drain-samples.txt— 5 lag samples after heavy drainval14/val14-05-drain-result.txt—post_drain_gap=<N>val14/val14-06-ha-endpoint.json—/v1/health/replicationJSON responseval14/val14-07-quorum-degraded.json—/v1/ha/quorumJSON showingquorum_health=degradedval14/val14-08-quorum-healthy.json—/v1/ha/quorumJSON showingquorum_health=healthyval14/val14-09-offline-write.txt— INSERT output for the backlog generated while standby is offlineval14/val14-09-post-restart-replication.txt—pg_stat_replicationafter standby restartval14/val14-09-catchup-result.txt—catchup_drain_ms=<N> ok=true/falseval14/val14-report.txt— composite 10-check report with lag measurements and 
derived thresholdsval14/val14-report.json— machine-readable JSON replication lag baseline reportval15/for the VAL15 backup/restore validation capturesval15/val15-pg-setup.txt— Docker container IP, postgres URL, table create and fixture load outputval15/val15-ha-server.log— HA server log (normal mode, backup phase)val15/val15-ha-server-maint.log— HA server log (maintenance mode, restore phase)val15/val15-ha-server-post.log— HA server log (normal mode, multi-backup and error-path phase)val15/val15-01-backup-create.txt—ha backup createstdout+stderr (backup_id, checksum, size_bytes)val15/val15-01-backup-file-stat.txt—statof the.dumpfileval15/val15-02-backup-toc.txt—pg_restore -ltable-of-contents of the backup archiveval15/val15-03-backup-list.txt—ha backup list --output jsonafter first backupval15/val15-03-metadata-check.txt— assertion result: backup_id found with status=completedval15/val15-04-checksum-verify.txt—cli_checksum=<hex>+file_checksum=<hex>comparisonval15/val15-05-backup-timing.txt—backup_ms=<N>val15/val15-05-db-before.txt— row counts before backup (small=100 medium=1000)val15/val15-06-post-backup-mutation.txt— SQL UPDATE/DELETE output confirming post-backup mutationval15/val15-06-restore.txt—ha backup restorestdout+stderrval15/val15-06-db-after-restore.txt— row counts after restoreval15/val15-06-data-check.txt— SQL spot-check query output (100|1000|t|t)val15/val15-06-integrity-result.txt— assertion result:restore_correct=true small=100 medium=1000val15/val15-07-restore-timing.txt—restore_ms=<N>val15/val15-08-backup-create-2.txt— secondha backup createoutputval15/val15-08-backup-list-multi.txt—ha backup list --output jsonshowing both backup IDsval15/val15-08-inventory-check.txt— assertion result:multi_backup_count=2val15/val15-09-restore-no-confirm.txt—ha backup restorewithout--confirm(expected error mentioning--confirm)val15/val15-10-audit-backup-created.json—audit query --event-type ha.backup.createdresultval15/val15-10-audit-backup-restored.json—audit query --event-type ha.backup.restoredresultval15/val15-10-audit-check.txt— assertion result: event counts for both audit event typesval15/backups/backup-val15-a.dump— pg_dump custom-format archive (first backup)val15/backups/backup-val15-b.dump— pg_dump custom-format archive (second backup)val15/val15-report.txt— composite 10-check PASS/FAIL report with timing valuesval15/val15-report.json— machine-readable JSON report withbackup_ms,restore_ms,pass_countval16/for the VAL16 split-brain chaos validation capturesval16/val16-ha-server.log— HA server log (single normal session throughout)val16/val16-probe-setup.txt— Docker psql output: CREATE TABLE + INSERT for probe rowval16/val16-01-baseline.json—/v1/ha/split-brainJSON at baseline (risk=none)val16/val16-01-baseline.txt—ha split-brain detectCLI output at baselineval16/val16-02-epoch-inject.txt— SQL UPDATE output (epoch divergence injection)val16/val16-02-detected.json—/v1/ha/split-brainJSON after injection (risk=detected)val16/val16-02-detected.txt—ha split-brain detectCLI output after injectionval16/val16-03-detect-repeat.json— second/v1/ha/split-braincall (idempotency check)val16/val16-04-recover-dry-run.txt—ha split-brain recover --strategy manual-reconcilestdout+stderrval16/val16-04-risk-after-dry-run.json—/v1/ha/split-brainJSON after dry-run (risk unchanged)val16/val16-04-risk-check.txt— assertion result:risk_after_dry_run=detectedval16/val16-05-recover-execute.txt—ha split-brain recover --strategy promote-leaderstdout+stderrval16/val16-05-recovered.json—/v1/ha/split-brainJSON 
after promote-leader (risk=none)val16/val16-05-recovered.txt—ha split-brain detectCLI output after recoveryval16/val16-06-probe-after-recovery.txt— SQL SELECT:notefromval16_probe WHERE id=1val16/val16-07-ghost-inject.txt— SQL INSERT output (ghost-node epoch rows)val16/val16-07-possible.json—/v1/ha/split-brainJSON after ghost injection (risk=possible)val16/val16-07-possible.txt—ha split-brain detectCLI output forrisk=possibleval16/val16-08-ghost-clear.txt— SQL UPDATE output (stampresigned_aton ghost rows)val16/val16-08-cleared.json—/v1/ha/split-brainJSON after clearing (risk=none)val16/val16-09-audit-detected.json—audit query --event-type ha.split_brain.detected --start-time <slice_start>JSON resultval16/val16-09-audit-recovered.json—audit query --event-type ha.split_brain.recovered --actor val16-operator --start-time <slice_start>JSON resultval16/val16-09-audit-check.txt— assertion result: slice-scoped event counts for both audit event typesval16/val16-10-final-status.json—/v1/ha/statusJSON confirmingcp-val16-nodeholds leadershipval16/val16-10-stability-check.txt— assertion result:holder_idcheckval16/val16-report.txt— composite 10-check PASS/FAIL reportval16/val16-report.json— machine-readable JSON report withpass_countand scenario resultsval17/for the VAL17 quorum loss validation capturesval17/val17-pg-setup.txt— Docker container IP and PG URL used by HA serverval17/val17-ha-server.log— HA server log (single session throughout all phases)val17/val17-01-baseline.json—/v1/ha/quorumJSON at baseline (quorum_health=healthy)val17/val17-01-baseline.txt—ha quorum statusCLI output at baselineval17/val17-01-baseline-check.txt— assertion:quorum_health=healthy write_block_active=Falseval17/val17-02-quorum-lost.json—/v1/ha/quorumJSON after PG stop (quorum_health=lost)val17/val17-02-quorum-lost.txt—ha quorum statusCLI output during lossval17/val17-03-loss-timing.txt—loss_ms=<N>val17/val17-04-write-block-check.txt— assertion:write_block_active=True can_accept_protected_writes=Falseval17/val17-05-loss-reason.txt— assertion:quorum_loss_reason=<message>val17/val17-06-quorum-recovered.json—/v1/ha/quorumJSON after PG restart (quorum_health=healthy)val17/val17-06-quorum-recovered.txt—ha quorum statusCLI output after recoveryval17/val17-07-recovery-timing.txt—recovery_ms=<N>val17/val17-08-recovery-check.txt— assertion: write_block_active, can_accept_protected_writes, timestamps, detected_loss_countval17/val17-09-second-loss.json—/v1/ha/quorumJSON during second lossval17/val17-09-second-recovery.json—/v1/ha/quorumJSON after second recoveryval17/val17-09-count-check.txt— assertion:detected_loss_count=2 (second cycle confirmed)val17/val17-10-audit-lost.json—audit query --event-type ha.quorum.lostJSON resultval17/val17-10-audit-restored.json—audit query --event-type ha.quorum.restoredJSON resultval17/val17-10-audit-check.txt— assertion:lost_events=N restored_events=Mval17/val17-report.txt— composite 10-check PASS/FAIL report withloss_msandrecovery_msval17/val17-report.json— machine-readable JSON withloss_ms,recovery_ms,pass_count
6. Expected Results¶
Every audit log in the bundle should include the canonical fields:
- `audit.event`
- `audit.category`
- `audit.action`
- `audit.outcome`
- `audit.resource`
- `audit.resource_type`
- `audit.source=cli`
Expected event mapping:
- `audit-rollout-plan-create.log` → `rollout.plan.created`
- `audit-rollout-plan-publish.log` → `rollout.plan.published`
- `audit-rollout-plan-cancel.log` → `rollout.plan.cancelled`
- `audit-ha-backup-create.log` → `ha.backup.created`
- `audit-ha-backup-restore.log` → `ha.backup.restored`
- `audit-ha-failover-trigger.log` → `ha.failover.triggered` and `ha.failover.completed`
- `audit-cert-issue.log` → `cert.issued`
- `audit-cert-rotate.log` → `cert.rotated`
- `rbac/audit-rbac-role-assign.log` → `auth.role.assigned`
- `relay/audit-relay-deadletter-retry.log` → `relay.deadletter.retried`
- `relay/audit-relay-deadletter-purge.log` → `relay.deadletter.purged`
- `relay/audit-relay-bandwidth-set.log` → `relay.bandwidth.configured`
The paired stdout files should show that each command completed successfully in the same live run.
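A quick way to sanity-check one captured pair by hand is to grep the stderr log for the canonical keys and the mapped event name. This is a minimal sketch, assuming the dotted field names appear verbatim in each structured line; the path is one of the captures listed above.

```bash
# Hedged spot-check: canonical keys plus the expected event name in one capture.
LOG=audit-rollout-plan-create.log
for field in audit.event audit.category audit.action audit.outcome \
             audit.resource audit.resource_type audit.source; do
  grep -q "$field" "$LOG" || echo "MISSING: $field"
done
grep -q 'rollout.plan.created' "$LOG" && echo "event mapping ok"
```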
The metrics surface should also show:
metrics/orchestrator-metrics-raw.txtcontains the real Prometheus exposition emitted by the live control-plane started by the labmetrics/metrics-list.txtandmetrics/metrics-list.jsoncontain the same metric catalog in text and JSON formmetrics/metrics-query-all.txtcontains live samples such ascp_http_requests_total,cp_health_checks_total, andcp_rollout_plans_totalmetrics/metrics-query-rollout-plans.jsoncontains the filtered rollout phase counters generated by the rollout actions in the same runmetrics/metrics-query-http-duration.txtcontains the histogram family, including_bucket,_count, and_sum
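The metrics checks can be reproduced outside the lab scripts by scraping the exposition directly. The port below is an assumption; substitute whatever address the lab control-plane actually bound.

```bash
# Illustrative re-scrape of the live Prometheus exposition (port is an assumption).
CP=http://127.0.0.1:18080
curl -fsS "$CP/metrics" -o /tmp/orchestrator-metrics-raw.txt
grep -c '^cp_http_requests_total' /tmp/orchestrator-metrics-raw.txt
grep -E '^cp_http_request_duration_seconds_(bucket|count|sum)' /tmp/orchestrator-metrics-raw.txt | head -n 5
```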
The RBAC surface should also show:
rbac/rbac-role-create.txtcreates a custom lowercase role derived from the canonicalized input namerbac/rbac-role-list-before-assign.txtshows the custom role with0assignmentsrbac/rbac-role-assign.txtassigns the role to the trimmed subject identityrbac/rbac-role-assign-repeat.txtshows the idempotent repeat-assignment no-oprbac/rbac-role-list-after-assign.txtshows the role with1assignmentrbac/assignments.jsonpersists the normalized role and subjectretained/audit-query-category-auth.jsonreturns the retainedauth.role.assignedrecord from the same live runrbac/rbac-role-assign-operator.txtassigns the predefinedoperatorrole used by the PR-21 allow-path checksrbac/rbac-audit-query-denied.stderrandrbac/rbac-audit-export-denied.stderrshow fail-closed denial for an operator who lacksaudit_history:readrbac/rbac-audit-query-allowed.jsonandrbac/rbac-audit-export-allowed.jsonshow the authorized retained auth view forreviewer@example.comrbac/rbac-ha-status-denied.stderrshows fail-closed denial for an unassigned operator onha statusrbac/rbac-ha-status-allowed.txtshows successful HA status output forfleet-op@example.comha/ha-status-no-header.headersandha/ha-status-no-header.jsonshow the server-side 403 denial when/v1/ha/statusis called without operator identity under enforcementha/ha-status-with-header.headersandha/ha-status-with-header.jsonshow the server-side 200 success path when the authorized operator header is setrbac/retained-auth-access-denied.jsonreturns the retainedauth.access.deniedrecords from the same live enforcement run
The retained surface should also show:
retained/retained-file-list.txtincludes the daily JSONL file underretained/store/retained/audit-query-all.txtreturns a mixed set of rollout, HA, cert, and relay records from the live runretained/audit-query-ha-backup-created.jsonreturns the filteredha.backup.createdrecordsretained/audit-query-category-rollout.txtreturns only rollout-category records from the same live retained datasetretained/audit-query-source-edge.jsonreturns onlysource=edgerelay records from the same live retained datasetretained/audit-query-outcome-success.txtreturns the success-only retained datasetretained/audit-query-invalid-output.stderrshows unknown--outputvalues fail closed with the supported value listretained/audit-query-invalid-range.stderrshows malformed time ranges fail closed instead of returning a silent empty resultretained/audit-export-all.jsonandretained/audit-export-all.csvcontain the same retained dataset in export formretained/audit-export-invalid-format.stderrshows unsupported export formats fail before writingretained/audit-export-invalid-format-target-before.sha256andretained/audit-export-invalid-format-target-after.sha256match, proving the invalid-format failure did not truncate the existing target file
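The same retained dataset can be re-queried after the lab run. The sketch below reuses only flags this lab already exercises, except where the comment marks a flag spelling as an assumption.

```bash
# Re-querying the retained store written by this run.
# Assumes the JSON output is an array of records.
autonomy audit query --event-type ha.backup.created --output json | jq 'length'
# The per-category capture implies a category filter; its exact flag name is an assumption here.
autonomy audit query --category rollout | head -n 5
```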
The support-bundle surface should also show:
support-bundle/support-bundle-generate.logrecords the live collection runsupport-bundle/support-bundle-contents.txtlists the expected archive memberssupport-bundle/manifest.jsonmarkssystem_info,build_info,config,ha_status,audit_recent, andlogsasoksupport-bundle/config_redacted.yamlcontains<REDACTED>forfleet_saltandREDACTEDin thepostgres_urlpassword positionsupport-bundle/ha_status.jsoncontains the live HA snapshot from the running helpersupport-bundle/audit_recent.jsoncontains retained records from the same lab runsupport-bundle/logs-autonomy.logcontains the tailed HA server logsupport-bundle/support-bundle-sha256.txtrecords the resulting archive hash
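After generation, the archive can be inspected independently of the lab assertions with standard tooling; the archive filename below is illustrative.

```bash
# Independent inspection of a generated bundle (archive name is illustrative).
BUNDLE=support-bundle/bundle.tar.gz
tar -tzf "$BUNDLE"                           # member listing, compare with support-bundle-contents.txt
mkdir -p /tmp/bundle && tar -xzf "$BUNDLE" -C /tmp/bundle
grep -R -e '-----BEGIN' /tmp/bundle && echo "FAIL: key material present" || echo "ok: no PEM blocks"
grep -R '<REDACTED>' /tmp/bundle | head -n 3  # redaction placeholders survived extraction
sha256sum "$BUNDLE"                           # compare with support-bundle-sha256.txt
```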
The database-backed audit surface (when AUTONOMY_AUDIT_PG_URL is set) should also show:
db_audit/query-db-all.txtcontains the three seeded rows (rollout, cert, ha) ordered newest-first withevent_name,actor,resource,outcome, andsourcedb_audit/query-db-cert.jsoncontains exactly one record withcategory=certdb_audit/export-db-all.jsonanddb_audit/export-db-all.csvcontain the same seeded dataset asquery-db-all.txtin export formatdb_audit/prune-90d.txtshowsdeleted=0(no rows older than 90 days in the lab)db_audit/prune-1d.txtshowsdeleted=2(the 1h and 2h rows are removed)db_audit/query-db-after-prune.jsoncontains exactly one row (the most recent)db_audit/query-file-all.txtcontains records from the file store in parallel, proving the file emitter remained active alongside the DB emitter
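When a lab PostgreSQL instance is available, the DB path can be exercised directly with the flags named above. The connection string is illustrative, and the export destination handling is assumed to follow the CLI's defaults.

```bash
# DB-backed audit path (connection string is illustrative).
export AUTONOMY_AUDIT_PG_URL='postgres://autonomy:autonomy@127.0.0.1:5432/autonomy?sslmode=disable'
autonomy audit query --pg-url "$AUTONOMY_AUDIT_PG_URL" --output json | jq 'length'
autonomy audit export --pg-url "$AUTONOMY_AUDIT_PG_URL" --format csv
autonomy audit prune --older-than 90d   # retention enforcement against audit_events
```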
The VAL 01 zero-downtime rotation surface should also show:
autonomy/cert-rotation-list-expiring.txtcontains eitherexpiringornode-c, confirming the 2-day cert falls inside the--expiring-within-days 5windowautonomy/cert-rotation-prerotate-health.jsoncontains"status":"ok", confirming the old cert was accepted over live mTLS before rotationautonomy/cert-rotation-timing.txtcontainspass=trueandrotation_elapsed_seconds=<N>where N is well below the 300-second bound, proving the rotation operation itself is effectively instantaneousautonomy/cert-rotation-rotate.txtcontainsrotated identity=node-c.edge.local cert=... valid_days=90, confirming the operation succeeded with the default 90-day renewalautonomy/cert-rotation-list-after.txtcontainsno certificates matched, confirming the 90-day replacement cert is not in the 5-day expiry windowautonomy/cert-rotation-postrotate-health.jsoncontains"status":"ok", proving the rotated client cert was accepted without restarting the control-planeautonomy/cert-rotation-audit-events.jsoncontains a record withcert.rotated, confirming the event was retained in the audit storeautonomy/cert-rotation-before-dates.txtandautonomy/cert-rotation-after-dates.txthave differentserial=values, confirming a new keypair was issuedautonomy/cert-rotation-val01-report.txtreports6/6 checks PASSwithserials_differ=true
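The serial comparison can also be reproduced from the on-disk certificate, independent of the captured date files; the certificate path is illustrative.

```bash
# Inspect the active node-c certificate after rotation (path is illustrative).
openssl x509 -in certs/node-c.edge.local.crt -noout -serial -startdate -enddate
# The serials recorded before and after rotation should differ; identical output here is a failure.
diff <(grep serial= autonomy/cert-rotation-before-dates.txt) \
     <(grep serial= autonomy/cert-rotation-after-dates.txt) && echo "FAIL: serial unchanged"
```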
The VAL 02 trust-chain rejection surface should also show:
autonomy/cert-rejection-missing-client-cert.stderris non-empty and the paired.stdoutis empty, and stderr matches the missing-client-cert handshake pattern, confirming the control-plane requires a client certificateautonomy/cert-rejection-invalid-chain.stderrmatches the invalid-chain pattern, confirming a cert signed by a rogue CA is rejected even when the CN matches a legitimate node identityautonomy/cert-rejection-expired-cert.stderrmatches the expired-cert pattern, confirming expired certificates are rejected (validity period enforcement)autonomy/cert-rejection-revoked.stderrmatches the revoked-cert pattern, and the existingautonomy/cert-revocation-rejected-events.jsonretained audit evidence still proves theVerifyPeerCertificatecallback pathautonomy/cert-rejection-wrong-server-trust.stderrmatches the server-trust-verify pattern, confirming mTLS is bidirectional: the client cannot connect when it cannot verify the server’s cert chainautonomy/cert-rejection-val02-report.txtreports5/5 checks PASSand includes theright_ca_wrong_cnnote confirming that a cert from the trusted CA with an unexpected CN is accepted at the TLS layer (identity-layer authorization is RBAC-based, not CN-based)
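The missing-client-cert case can be reproduced by hand with curl against the mTLS listener. The address and CA path are illustrative, and the exact alert text depends on the TLS stack, so the grep pattern is deliberately loose.

```bash
# Reproduce the missing-client-cert rejection (address and CA path are illustrative).
curl -sS --cacert ca/ca.pem https://127.0.0.1:18443/v1/health \
  2> /tmp/missing-client-cert.stderr; echo "exit=$?"
grep -Ei 'certificate required|handshake|alert' /tmp/missing-client-cert.stderr
```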
The cert RBAC surface should also show:
cert_rbac/denied-list.txtcontains the string “cert:manage”, confirming thatcert list(previously unguarded) now requires RBAC authorizationcert_rbac/denied-check-revocation.txtcontains the string “cert:manage”, confirming thatcert check-revocation(previously unguarded) now requires RBAC authorizationcert_rbac/denied-issue.txt,denied-rotate.txt,denied-revoke.txt, anddenied-sync-crl.txteach contain “cert:manage”, confirming consistent coveragecert_rbac/allowed-issue.txtcontains the successfulissued identity=...line and does NOT contain an RBAC denial, confirming mutation success undercert:managecert_rbac/allowed-list.txtcontains the listednode-a.edge.localrow and does NOT contain an RBAC denial, confirming read-only success undercert:readcert_rbac/allowed-check-revocation.txtcontainsnot_revoked, confirming read-only revocation inspection succeeds undercert:readcert_rbac/audit-denied-events.jsoncontainsauth.access.deniedrecords withpermissionfields referencingcert:manageandcert:read | cert:manage, confirming denial is audited before the error is returned
The VAL03 RBAC enforcement surface should also show:
val03/val03-01-ha-status-deny.stderrcontainsrbac:and thefleet:readpermission name, confirming the guard fires before any HTTP call when the operator has no assignmentval03/val03-05-audit-query-operator-deny.stderrandval03/val03-06-audit-query-analyst-deny.stderreach containrbac:withaudit_history:read, confirming the operator and analyst roles both lack the audit permissionval03/val03-10-rbac-role-create-operator-deny.stderrandval03/val03-11-rbac-role-create-analyst-deny.stderreach containrbac:withrbac:manage, confirming neither operator nor analyst can create rolesval03/val03-02-ha-status-operator-allow.txt,val03-03-ha-status-analyst-allow.txt, andval03-04-ha-status-auditor-allow.txteach contain HA status JSON (orSKIPif the HA server was unavailable), confirming all three predefined roles includefleet:readand that the mirrored server-side RBAC store authorizes the same identitiesval03/val03-07-audit-query-auditor-allow.jsoncontains auth-category audit records, confirmingaudit_history:readin the auditor role allows the queryval03/val03-12-rbac-role-create-auditor-allow.txtcontainscreated role "val03-test-role", confirmingrbac:managein the auditor role allows custom role creationval03/val03-08-rbac-role-list-unassigned.txtdoes NOT containrbac: operatorand lists known roles such asoperatorandauditor, confirmingrbac role listhas no RBAC guardval03/val03-09-rollout-plan-list-unassigned.stderrcontains a connection error, NOTrbac:, confirmingrollout plan listhas no RBAC guardval03/val03-13-support-bundle-unassigned.stderrcontains normal bundle generation progress plusbundle written:, andval03/val03-support-bundle.tar.gzis non-empty, confirmingsupport-bundle generatehas no top-level RBAC guard even if optional nested collectors emit RBAC warnings for guarded HA sub-requestsval03/val03-14-access-denied-events.jsoncontainsauth.access.deniedrecords for the five expected VAL03 deny tuples, confirming every denial from the VAL03 DENY checks was written to the retained audit store before the error was returnedval03/val03-report.txtreports the finalpass=<N> skip=<M> fail=<K> total=14summary with zero failuresval03/val03-report.jsoncontainspass_count,skip_count, and per-checkstatusvalues so HA unavailability is recorded asSKIPrather thanPASS
The VAL04 audit completeness surface should also show:
val04/val04-store-inventory.txtreportsstore_jsonl_files> 0, confirming the retained store is non-empty after all prior lab phasesval04/val04-category-summary.txtreportscount> 0 for all 6 categories (rollout, ha, cert, relay, auth, rollback), confirming every category is populatedval04/val04-schema-check.txtreportsPASSfor all 6 category schema checks with noMISSING fieldlines, confirming every returned record carries the mandatory audit fieldsval04/val04-latency.txtreportsquery_ok=trueandpass=truewithquery_elapsed_ms≤ 2000, confirming the full retained store query both succeeded and stayed within the latency boundval04/val04-coverage-report.txtreportsPRESENTfor all 25 of the 25 wired event types, withpass=trueon the final summary line; anyABSENTline is now a real validation failure because the runner is expected to exercise the full wired event surface deterministicallyval04/val04-report.txtreportspass=10 fail=0 total=10summaryval04/val04-report.jsoncontainspass_count=10,coverage_found=25,latency_ms≤ 2000, and per-checkstatusvalues
The VAL05 OTel integration surface should also show:
val05/val05-prometheus-status.txtreportshttp_code=200, confirming the control-plane Prometheus endpoint is reachableval05/val05-prometheus-families.txtreportsPRESENTfor all 4 required metric families (cp_http_requests_total,cp_http_request_duration_seconds,cp_rollout_plans_total,cp_events_ingested_total)val05/val05-events-ingest.jsonshows a successfulPOST /v1/eventsresponse, confirming that VAL05 itself exercised the event-ingestion pathval05/val05-prometheus-observations.txtreports non-zero sample values forcp_http_requests_total,cp_http_request_duration_seconds_count,cp_rollout_plans_total, andcp_events_ingested_total, confirming that lab traffic plus the explicit ingest produced real observationsval05/val05-emit-helper.txtcontainsemitted 3 events to, confirming thetelemetry.Emitter→ WAL write path succeededval05/val05-wal-status.jsoncontains"total":3and"pending":3, confirming events are persisted and not yet flushedval05/val05-export.jsonlcontains exactly 3 lines with"kind","ts","seq","written_at", and"attrs"fields present in each lineval05/val05-flush-stdout.txtcontainstelemetry flush: OK — 3 events sent to http://127.0.0.1:14318, confirming end-to-end OTLP deliveryval05/val05-flush-summary.txtreportssink_payloads > 0, confirming the sink printed at least one actual OTLP payload receipt line rather than only its startup bannerval05/val05-traceid-jsonl.txtreportstrace_id_found=4bf92f3577b34da6a3ce929d0e0e4736andspan_id_found=00f067aa0ba902b7, confirming trace/span propagation through WAL → JSONLval05/val05-traceid-otlp.txtreportstraceId_found=trueandspanId_found=true, confirming trace/span propagation in the OTLP/HTTP pathval05/val05-report.txtreportspass=9 fail=0 total=9summaryval05/val05-report.jsoncontainspass_count=9and per-checkstatusvalues
The VAL06 support-bundle surface should also show:
val06/val06-timing.txtreportsgenerate_ok=trueandelapsed_s≤ 30, confirming the bundle was created within the time boundval06/val06-core-files.txtreportsPRESENTfor all three core files (manifest.json,system_info.json,build_info.json)val06/val06-manifest-check.txtreports astatusline for each of the 6 collectors (system_info,build_info,config,ha_status,audit_recent,logs), confirming all are recorded in the manifestval06/val06-sysinfo-check.txtreportsPRESENTfor all 5 required fields (os,arch,go_version,hostname,collected_at)val06/val06-audit-check.txtreportsaudit_recent_count> 0, confirming the bundle captured records from the retained audit storeval06/val06-redaction-salt.txtreportsfleet_salt_placeholder=trueandfleet_salt_actual_absent=true, confirming the known test salt was replaced with<REDACTED>and the original value does not appearval06/val06-redaction-pg.txtreportspg_redacted_present=trueandpg_secret_absent=true, confirming the postgres password was replaced withREDACTEDin the URL and the original password does not appearval06/val06-privkey-check.txtreportsprivkey_hits=0, confirming no PEM block (-----BEGIN) appears anywhere in the bundle archiveval06/val06-degraded-check.txtreportsbundle_exit_ok=trueandha_status_status=failed, confirming graceful degradation when the control-plane URL is unreachableval06/val06-report.txtreportspass=10 fail=0 total=10summaryval06/val06-report.jsoncontainspass_count=10and per-checkstatusvalues
The VAL07 rollout latency surface should also show:
val07/val07-health.txtreportshealth_code=200, confirming the dedicated VAL07 control plane started and is reachable before the benchmark beginsval07/val07-create-percentiles.txtreportsp50_ms,p95_ms, andp99_msall within their respective bounds (100/300/500 ms), withn=20andsample_complete=trueconfirming a full successful sample was collectedval07/val07-list-percentiles.txtreportsp99_ms≤ 500, confirming the list path (with 20 existing plans) is within the same latency target; it also reportsn=20andsample_complete=trueval07/val07-concurrent-summary.txtreportsconc_ok=5andconc_errors=0, confirming all 5 parallel creates succeeded;wall_ms≤ 2000, confirming that single-writer SQLite serialisation does not make concurrent operator requests unacceptably slowval07/val07-error-summary.txtreportserror_count=0across all 45 benchmark requests (20 creates + 20 lists + 5 concurrent creates)val07/val07-prometheus-check.txtreportscp_http_requests_total> 0, confirming the Prometheus instrumentation on the VAL07 control plane is wired and received observations from the benchmark trafficval07/val07-report.txtreportspass=9 fail=0 total=9summaryval07/val07-report.jsoncontainspass_count=9,plan_create_mslatency object, and per-checkstatusvalues
The VAL08 rollout throughput surface should also show:
val08/val08-health.txtreportsstatus=ok, confirming the dedicated VAL08 control plane started and is reachable before the throughput run beginsval08/scenario-n100/scenario-report.txtreportsok=500anderrors=0, proving the primary workplan target (≥100 concurrent device rollouts without errors) is metval08/val08-wall-clock-n100.txtreportselapsed_ms ≤ 30000, confirming 500 concurrent plan creates complete within the 30-second boundval08/val08-throughput-scaling.txtreportstput_n100 ≥ tput_n1, confirming that issuing 100 concurrent worker streams does not regress throughput below the single-worker serial rate; the SQLite single-writer model is expected to produce a plateau (near-equal throughput) rather than linear scaling, which is acceptableval08/val08-error-aggregate.txtreportstotal_errors=0across all 805 plans created (N=1+10+50+100, 5 plans each)val08/val08-list-consistency.txtreportslist_count ≥ 805, confirming that all created plans are durably stored and returned across paginated list resultsval08/val08-prometheus-check.txtreportscp_http_requests_total> 0, confirming Prometheus instrumentation received observations from the throughput trafficval08/val08-report.txtreportspass=10 fail=0 total=10summaryval08/val08-report.jsoncontainspass_count=10,throughputobject withn1/n10/n50/n100plans/sec values, and per-checkstatusvalues
The VAL09 stuck detection surface should also show:
val09/val09-health.txtreportsstatus=ok, confirming the dedicated VAL09 control plane started before any stuck checks runval09/val09-baseline-check.txtreportsstuck_count=0on an empty store, confirming the detection function handles the zero-plan case without errorval09/val09-fresh-check.txtreportsstuck_count=0immediately after creating 5 plans, confirming freshly-created plans are not falsely reported as stuck before the 3-second threshold elapsesval09/val09-stale-check.txtreportsstuck_count=5after the 4-second sleep, confirming all 5 published-phase plans exceed the threshold and are detected as stuckval09/val09-diagnosis-check.txtreportsdiagnosis_populated=5anddiagnosis_exact=5, confirming every stuck plan carries the exact expected"zero activations"diagnosis stringval09/val09-pause-check.txtreportsin_stuck_scan=noforval09-plan-b, confirming paused plans are excluded from the active-phase scanval09/val09-cancel-check.txtreportsin_stuck_scan=noforval09-plan-c, confirming terminal plans are excluded from the active-phase scanval09/val09-retry-check.txtreportsnew_phase=active, confirming the retry recovery strategy transitions the plan to the active phase and refreshesupdated_at(removing it from the stuck list at VAL09-10)val09/val09-rollback-check.txtreportsnew_phase=rolled_back, confirming the rollback recovery strategy transitions the plan to the terminal phaseval09/val09-final-check.txtreportsplan_a_present=true,plan_b_absent=true,plan_c_absent=true,plan_d_absent=true,plan_e_absent=true,pass=true— the final scan correctly surfaces only the one plan that received no operator actionval09/val09-report.txtreportspass=10 fail=0 total=10summaryval09/val09-report.jsoncontainspass_count=10and per-checkstatusvalues withscan_stale_count=5andscan_final_count=1
The VAL10 rollback reliability surface should also show:
val10/val10-preview-check.txtreportspreview_errors=0, confirming all fourrollback previewtargets exit 0 and produce safety profilesval10/val10-preview-rollout_plan-check.txtreportssafety_class=terminal,orchestrated=true, andvalid_strategies=['retry', 'rollback'], confirming the rollout_plan preview JSON schema is correctval10/val10-preview-relay-check.txtreportsorchestrated=falseandmanual_path_has_edgectl=true, confirming relay_deadletter is correctly surfaced as a manual-only target with edgectl instructionsval10/val10-retry-rate.txtreportsok=5 fail=0 success_rate=1.000, confirming all 5 retry executes on real plans succeedval10/val10-rollback-rate.txtreportsok=5 fail=0 success_rate=1.000, confirming all 5 rollback executes on real plans succeedval10/retry/execute-retry-*.txteach showoutcome=success previous=published new=active, confirming the retry strategy transitions plans to active phaseval10/rollback/execute-rollback-*.txteach showoutcome=success previous=published new=rolled_back, confirming rollback transitions to terminalval10/val10-execute-json-check.txtreports all three JSON output fields present (outcome,new_state,kind), confirming the--output jsonformat is stableval10/val10-nonexistent-check.txtreports non-zero exit code, confirming the CLI surfaces CP 404 errors as non-zero exit rather than silently succeedingval10/val10-relay-not-orchestrated-check.txtreports non-zero exit code andedgectlinstructions present, confirming manual-only targets are blocked from execute with actionable guidanceval10/val10-audit-preview-check.txtreportsrollback.preview.requested_count ≥ 4with actor/start-time scope, confirming this slice’s preview commands emit audit records to the retained storeval10/val10-aggregate-rate.txtreportsagg_success_rate=1.000(10/10), plus at least 10 retained success events scoped to this slice, satisfying the workplan target of ≥99% rollback success rateval10/val10-report.txtreportspass=10 fail=0 total=10summaryval10/val10-report.jsoncontainspass_count=10,success_rate.aggregate.rate=1.000, and per-checkstatusvalues
The VAL11 chaos surface should also show:
val11/val11-health.txtreportsstatus=okconfirming the dedicated chaos CP started cleanly on port 18996val11/val11-kill-check.txtreportsexit_nonzero=trueandhas_connection_error=true, confirming the CLI surfaces CP unavailability as a non-zero exit with an actionable message rather than silently succeeding or hangingval11/val11-durability-check.txtreportslist_count ≥ 10, confirming the full pre-kill plan corpus committed before the CP kill are recoverable from SQLite’s WAL after process restartval11/val11-rapid-restart.txtshows all 3 kill+restart cycles completing withcp_ready=true,list_count_final ≥ list_count_before, andnew_plan_code=201, confirming the write path is fully operational after repeated restartsval11/val11-gate-check.txtreportsphase=published, confirming a plan in the gate-wait state is not lost or corrupted by a CP kill — the operator’s pending gate decision survivesval11/val11-stuck-check.txtreportsstuck_count ≥ 3anddiagnosis_ok=true, confirming the stuck-detection surface correctly identifies the device-unresponsive proxy plans and populates operator-visible diagnosis stringsval11/val11-corrupt-check.txtreportscreate_code=201andget_ok=truewithplan.metadata.id=val11-corrupt-1, confirming the CP accepts and stores plans with unconventional artifact references without rejecting them at ingestion (validation is the edge agent’s responsibility)val11/val11-rollback-corrupt-check.txtreportsexit_ok=true, confirming the operator can roll back a suspect plan regardless of its artifact metadataval11/val11-cascade-check.txtreportscascade_ok=3 cascade_fail=0, confirming all 3 device-unresponsive proxy plans are recoverable via retry in a single operator passval11/val11-audit-check.txtreportsrollback_executed_success_count ≥ 1with actor/start-time scope, confirming the audit capture pipeline is not disrupted by CP kill/restart cycles — events emitted during chaos recovery sessions are retained in the shared audit storeval11/val11-report.txtreportspass=10 fail=0 total=10summaryval11/val11-report.jsoncontainspass_count=10and per-checkstatusvalues
The HA failover surface should also show:
val13/val13-node1-s0.logcontainsacquired leadership, confirming node-1 won the initial leader election via advisory lock Campaignval13/val13-01-node1-status.txtcontainscp-val13-node1confirming that node-1 reports itself as the active leader at baselineval13/val13-03-failover-timing.txtreportsfailover_ms=<N>where N ≤ 5000; the measured latency from SIGTERM to node-2 logging “acquired leadership”val13/val13-04-data-probe.txtcontains exactlypre-kill, confirming the shared PostgreSQL instance retained the probe row across leader failoverval13/val13-06-quorum-lost.jsoncontains"quorum_health":"lost", confirming the quorum monitor detects PG unavailability within the polling windowval13/val13-07-quorum-healthy.jsoncontains"quorum_health":"healthy", confirming quorum health returns after the explicitdocker startrecovery stepval13/val13-08-rapid-summary.txtreports all three cycle timings ≤ 5000 ms, andval13-08-post-cycle-status.txtplusval13-08-data-after-rapid.txtconfirm the rapid cycles end with a stable leader and intact probe rowval13/val13-09-sigkill-timing.txtreportsfailover_ms=<N>where N ≤ 5000 after SIGKILL (no graceful Resign), validating that advisory lock release via TCP RST is fast enough to meet the HA readiness thresholdval13/val13-report.txtreportspass=10 fail=0 total=10val13/val13-report.jsoncontainspass_count=10,sigterm_failover_ms,sigkill_failover_ms, andrapid_cycle_msarray values as measurement evidence
The replication lag baseline surface should show:
val14/val14-01-replication.txtcontainsstreamingin thestatecolumn ofpg_stat_replication, confirming the standby is connected and receiving WALval14/val14-02-idle-lsn-gap.txtcontains0, confirming no unacknowledged WAL at restval14/val14-03-light-result.txtshowslight_drain_ms≤ 2000 ms for the 100-row × 500-byte write workloadval14/val14-04-heavy-result.txtshowsheavy_drain_ms≤ 5000 ms for the 500-row × 2000-byte (~1 MB) write workloadval14/val14-07-quorum-degraded.jsonshowsquorum_health=degradedafterdocker stop val14-pg-standbyval14/val14-08-quorum-healthy.jsonshowsquorum_health=healthyafterdocker start val14-pg-standbyval14/val14-09-offline-write.txtshows the write burst issued while the standby was offline, andval14/val14-09-catchup-result.txtshowsok=trueandcatchup_drain_ms≤ 10000 ms after standby restartval14/val14-report.txtshowspass=10 fail=0 total=10and lists derived threshold values:healthy_thresh_ms,degraded_thresh_ms,alert_thresh_msval14/val14-report.jsoncontainspass_count=10,idle_p95_ms,light_p95_ms,heavy_p99_ms,light_drain_ms,heavy_drain_ms,healthy_thresh_ms,degraded_thresh_ms, andalert_thresh_msas measurement and threshold evidence
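The lag samples in this slice reduce to one catalog query. A hand-run equivalent, assuming an illustrative primary container name and superuser role, looks like this.

```bash
# Sample replication state, byte gap, and write lag on the primary
# (container name and psql role are illustrative).
docker exec val14-pg-primary psql -U postgres -Atc \
  "SELECT state,
          pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS byte_gap,
          COALESCE(EXTRACT(EPOCH FROM write_lag) * 1000, 0) AS write_lag_ms
     FROM pg_stat_replication;"
```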
The backup/restore validation surface should show:
val15/val15-01-backup-create.txtcontainscreated backup_id=backup-val15-awith non-emptychecksum=and positivesize=valuesval15/val15-02-backup-toc.txtexits without error and containsTABLE DATAentries forval15_smallandval15_mediumval15/val15-03-metadata-check.txtcontainsbackup_id=backup-val15-a status=completedval15/val15-04-checksum-verify.txtshowscli_checksumandfile_checksumfields with matching 64-character hex stringsval15/val15-05-backup-timing.txtshowsbackup_ms≤ 30,000val15/val15-06-data-check.txtcontains100|1000|t|t— row counts and payload spot-checks confirming tables were restored to pre-backup stateval15/val15-06-integrity-result.txtcontainsrestore_correct=true small=100 medium=1000val15/val15-07-restore-timing.txtshowsrestore_ms≤ 60,000val15/val15-08-inventory-check.txtcontainsmulti_backup_count=2val15/val15-09-restore-no-confirm.txtshows an error message about missing--confirm; the CLI must exit non-zeroval15/val15-10-audit-check.txtcontainscreated_events=N restored_events=Mwith both N ≥ 1 and M ≥ 1val15/val15-report.txtshowspass=10 fail=0 total=10val15/val15-report.jsoncontainspass_count=10,backup_ms, andrestore_msas baseline timing evidence
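Both archives can be verified independently of the lab's own assertions with standard PostgreSQL and coreutils tooling.

```bash
# Independent checks on the first captured backup archive.
pg_restore -l val15/backups/backup-val15-a.dump | grep 'TABLE DATA'
sha256sum val15/backups/backup-val15-a.dump   # compare with the checksum in val15-01-backup-create.txt
```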
The split-brain chaos validation surface should show:
val16/val16-01-baseline.jsoncontains"risk": "none"before any injectionval16/val16-02-detected.jsoncontains"risk": "detected"after epoch divergence injectionval16/val16-03-detect-repeat.jsonalso contains"risk": "detected", confirming idempotencyval16/val16-04-recover-dry-run.txtexits without error and includes planning/recommendation output from manual-reconcileval16/val16-04-risk-after-dry-run.jsonstill contains"risk": "detected"(dry-run does not write to DB)val16/val16-04-risk-check.txtreportsrisk_after_dry_run=detectedval16/val16-05-recover-execute.txtexits successfully andval16/val16-05-recovered.jsoncontains"risk": "none"after promote-leaderval16/val16-06-probe-after-recovery.txtcontainspre-inject, confirming user data is untouched by recoveryval16/val16-07-possible.jsoncontains"risk": "possible"after ghost-node epoch injectionval16/val16-08-cleared.jsoncontains"risk": "none"afterresigned_atis stamped on ghost rowsval16/val16-09-audit-check.txtcontainsdetected_events=N recovered_events=Mwith both N ≥ 1 and M ≥ 1, scoped to this slice bystart-timeandactor=val16-operatorfor the recovery eventval16/val16-10-final-status.jsonconfirmsholder_idcontainscp-val16-nodeval16/val16-report.txtshowspass=10 fail=0 total=10val16/val16-report.jsoncontainspass_count=10and per-check results
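The detector state can be polled directly between injection steps; the helper address is illustrative.

```bash
# Poll the split-brain detector directly (helper address is illustrative).
HA=http://127.0.0.1:18093
curl -fsS "$HA/v1/ha/split-brain" | jq '.risk'
autonomy ha split-brain detect   # CLI view of the same state
```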
The quorum loss validation surface should show:
val17/val17-01-baseline.jsoncontains"quorum_health": "healthy"and"write_block_active": falsebefore any faultval17/val17-02-quorum-lost.jsoncontains"quorum_health": "lost"afterdocker stop val17-pg-primaryval17/val17-03-loss-timing.txtshowsloss_ms≤ 30,000val17/val17-04-write-block-check.txtconfirmswrite_block_active=Trueandcan_accept_protected_writes=Falseduring lossval17/val17-05-loss-reason.txtshows a non-emptyquorum_loss_reason(e.g."database connection unavailable")val17/val17-06-quorum-recovered.jsoncontains"quorum_health": "healthy"afterdocker start val17-pg-primaryval17/val17-07-recovery-timing.txtshowsrecovery_ms≤ 30,000val17/val17-08-recovery-check.txtconfirmswrite_block_active=False,can_accept_protected_writes=True, non-emptylast_lost_atandlast_restored_at, anddetected_loss_count ≥ 1val17/val17-09-count-check.txtcontainsdetected_loss_count=2 (second cycle confirmed)after the second recovery succeedsval17/val17-10-audit-check.txtcontainslost_events=N restored_events=Mwith both N ≥ 1 and M ≥ 1val17/val17-report.txtshowspass=10 fail=0 total=10val17/val17-report.jsoncontainspass_count=10,loss_ms, andrecovery_msas baseline timing evidence against workplan ≤ 60 s target
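One loss/recovery cycle can be reproduced by hand with the container name used in this slice; the helper address is illustrative.

```bash
# Drive one quorum loss/recovery cycle by hand (helper address is illustrative).
HA=http://127.0.0.1:18093
docker stop val17-pg-primary
until curl -fsS "$HA/v1/ha/quorum" | jq -e '.quorum_health == "lost"' > /dev/null; do sleep 1; done
docker start val17-pg-primary
until curl -fsS "$HA/v1/ha/quorum" | jq -e '.quorum_health == "healthy"' > /dev/null; do sleep 1; done
autonomy ha quorum status
```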
The config migration surface should also show:
config-migrate/config-migrate-dry-run.txtreports the planned v0-to-v1 changes without writing any output fileconfig-migrate/config-migrate-stdout.yamlandconfig-migrate/config-migrate-stdout.tomlshow deterministic migrated output in both supported formatsconfig-migrate/config-migrated.yamlandconfig-migrate/config-migrated.tomlshow successful file outputconfig-migrate/config-migrated-in-place.yamldiffers from the pre-migrate checksum while preserving the target file mode captured in the paired stat filesconfig-migrate/config-migrate-unsupported.stderrnames both supported schema versions for unsupported inputconfig-migrate/config-migrate-invalid-input.stderrshows malformed input fails closed instead of producing a synthetic v1 skeletonconfig-migrate/config-migrate-invalid-v0.stderrshows invalid migrated configs fail validation before any output is written
7. Current Evidence Bundle¶
Reference local runs:
- `evidence/pr17-cli-audit-local-2026-03-17/README.md`
- `evidence/pr18-support-bundle-local-2026-03-18/README.md`
- `evidence/pr20-rbac-local-2026-03-18/README.md`
- `evidence/pr22-metrics-local-2026-03-18/README.md`
- `evidence/pr27-config-migration-local-2026-03-19/README.md`
8. Scope Boundary¶
This lab proves the currently delivered scope honestly:
- canonical audit schema
- CLI-side emission at the wired action sites
- reproducible local evidence from real command invocations
- support-bundle generation against a live control-plane, retained audit store, and log file
- RBAC role create/list/assign against the local file-backed model plus retained audit capture
- RBAC enforcement on the current read surfaces, including the server-side `/v1/ha/status` path and retained denial auditing
- metrics catalog visibility and point-in-time metric queries against a live control-plane metrics endpoint
- config migration tooling against checked-in v0 fixtures, including safe dry-run output, format selection, fail-closed invalid input handling, and atomic in-place replacement
The VAL 01 surface (Phase 8 of the cert lab) proves bounded certificate rotation with continuity for new client connections:
A 2-day cert for
node-c.edge.localis accepted over live mTLS before rotationautonomy cert rotatecompletes within the 300-second bound (actual: sub-second)The rotated 90-day cert is accepted over live mTLS without restarting the control-plane, proving continuity for a fresh client connection using the same cert/key file paths after atomic replacement
The
cert.rotatedaudit event is retained and queryable viaaudit querySerial numbers differ before and after rotation, proving a new keypair was issued
All six checks captured in
cert-rotation-val01-report.txtas a composite PASS/FAIL
It does not prove CA rotation, server-certificate hot reload, uninterrupted in-flight request continuity, or coordinated multi-node rotation. See the deferred coverage matrix in cert-rotation-validation.md for the exact status of each excluded area.
The VAL 02 surface (Phase 9 of the cert lab) proves consistent rejection across all five trust-chain failure categories:
- Missing client certificate is rejected (`RequireAndVerifyClientCert` active)
- Certificate from a rogue CA is rejected by chain verification — even when the CN matches a known legitimate node, proving rejection is CA-anchor-based, not CN-based
- Expired certificate is rejected by the validity period check in Go's chain verification
- Revoked certificate is rejected by the `VerifyPeerCertificate` CRL callback
- Wrong CA bundle on the client causes server cert verification failure, proving mTLS is bidirectional
- A cert from the trusted CA with an unexpected CN is accepted at the TLS layer (documented in the report as `right_ca_wrong_cn: expected accepted`), confirming that identity-layer authorization is RBAC-based, not cert CN-based
See cert-rejection-validation.md for the full VAL 02 validation plan, scenario matrix, pass/fail criteria, and report template.
The PR-29-followup-e surface proves cert-management RBAC coverage:
- All six `autonomy cert` subcommands (`issue`, `rotate`, `revoke`, `list`, `check-revocation`, `sync-crl`) require `cert:manage` or, for read-only operations, `cert:read`
- `cert list` and `cert check-revocation` were unguarded before PR-29-followup-e; they are now guarded with `newRBACGuard().CheckAny([]string{"cert:read","cert:manage"}, ...)`
- `cert:read` is a new recognized permission included in the `auditor` predefined role; `cert:manage` requires a custom role
- RBAC denial for any cert command emits `auth.access.denied` before returning the error; no separate audit-on-denial code is needed — `rbacGuard.emitDenied()` fires automatically
The PR-29-followup-d surface proves the database-backed audit query path:
- `audit_events` table in the pgstore PostgreSQL schema (append-only, INV-AUDIT-01)
- `PGAuditEmitter` writing records at Class 3 (best-effort) durability — write failures are logged and counted but never propagated to the audited operation
- `InitPGAuditEmitter(db)` upgrading the package-level emitter to MultiEmitter (slog + PG + file) after a successful pgstore connection
- `autonomy audit query --pg-url` / `AUTONOMY_AUDIT_PG_URL` as the primary operator query surface when PostgreSQL is available
- `autonomy audit export --pg-url` for JSON and CSV export from the DB
- `autonomy audit prune --older-than Nd` for operator-initiated retention enforcement against the `audit_events` table
- `QueryAuditEvents` as a read-only function safe to run on any replica
- file-based `audit.FileEmitter` preserved in parallel as the fallback / compat mode when no `--pg-url` is provided
It does not yet claim background retention jobs, OCSP-style online status queries, or multi-tenant audit isolation.
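A direct row count against the table named above is a useful cross-check that the PG emitter is actually receiving writes. Only the table name comes from this document; the connection handling is illustrative.

```bash
# Cross-check that the PG emitter is writing rows (connection comes from the env var).
psql "$AUTONOMY_AUDIT_PG_URL" -Atc "SELECT count(*) FROM audit_events;"
```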
The VAL03 surface (slice 14) proves RBAC permission enforcement across a 14-check matrix covering all three enforcement claims:
- VAL03-C1 (unauthorized blocked): `ha status` (`fleet:read`), `audit query` (`audit_history:read`), and `rbac role create` (`rbac:manage`) are each denied for identities whose role does not include the required permission, with `auth.access.denied` emitted before any network call
- VAL03-C2 (authorized succeeds): all three permissions are exercised on the allow path: `fleet:read` for operator, analyst, and auditor; `audit_history:read` for auditor; `rbac:manage` for auditor. The VAL03 identities are mirrored into the HA helper's server-side RBAC store so the `ha status` allow-path checks exercise both client-side and server-side authorization. If the HA helper is unavailable, the three `ha status` allow-path checks are recorded as `SKIP`, not `PASS`
- VAL03-C3 (unguarded unrestricted): `rbac role list`, `rollout plan list`, and `support-bundle generate` succeed or fail for non-RBAC reasons regardless of the operator's assignment, confirming those commands have no guard
- VAL03-C4 (denial audit visibility): the retained audit query is narrowed to the current VAL03 time window and must contain the five expected actor/action/permission denial tuples from this slice itself
VAL03 covers representative commands from the guarded surface; it does not
re-exercise the bootstrap, break-glass, opt-out, or cert RBAC paths already
covered by run_rbac_enforcement_lab and run_cert_rbac_lab. See
rbac-enforcement-validation.md for the full
VAL03 validation plan, guard coverage map, pass/fail criteria, and report
template.
The VAL04 surface (slice 15) proves audit completeness across a 10-check matrix covering all four completeness claims:
- VAL04-C1 (store populated): the retained file-backed store is non-empty and all 6 audit categories contain at least one record after all prior lab phases have run
- VAL04-C2 (schema complete): every audit category's records contain all 6 mandatory fields (`event_name`, `category`, `action`, `outcome`, `source`, `timestamp`), confirming no field is silently dropped by any emit call
- VAL04-C3 (queryable within latency bound): a full retained-store query with `--limit 0 --output json` succeeds and completes in ≤ 2000 ms, confirming operational usability at lab corpus sizes
- VAL04-C4 (event-type coverage): all 25 of the 25 defined wired event types appear in the retained store; absent events are listed explicitly in `val04-coverage-report.txt` and any absence is a validation failure
VAL04 validates against the 25 wired event types. The 6 deferred event types
(rollout.gate.approved, rollout.recovered, rollout.stuck.detected,
auth.login.succeeded, auth.login.failed, relay.deadletter.inspected)
are excluded — their absence is expected and correct. See
audit-completeness-validation.md for the
full VAL04 validation plan, wired event inventory, pass/fail criteria, and
report template.
The VAL05 surface (slice 16) proves OTel integration across a 9-check matrix covering all four integration claims:
- VAL05-C1 (Prometheus metrics): the control-plane `/metrics` endpoint returns HTTP 200, all 4 required metric families are present, and `cp_http_requests_total` / `cp_rollout_plans_total` have non-zero values after lab traffic — confirming that Prometheus instrumentation is wired and receiving real observations
- VAL05-C2 (WAL durability): events emitted via `telemetry.NewEmitter` are persisted to the local WAL and readable by `telemetry status` and `telemetry export`; this is validated with an isolated test WAL populated by `telemetry_emit_helper` (a small lab binary), because no CLI command exists to emit adapter-side telemetry events directly
- VAL05-C3 (JSONL export): `telemetry export --out` produces non-empty JSONL with all mandatory event fields (`kind`, `ts`, `seq`, `written_at`, `attrs`) — confirming the offline pipeline can surface events to downstream consumers that do not use OTLP
- VAL05-C4 (correlation ID propagation): `trace_id`/`span_id` set at emit time survive through the WAL → JSONL path (as `trace_id`/`span_id`) and the WAL → OTLP flush path (as `traceId`/`spanId` in the OTLP log record), confirming that the custom OTLP encoding correctly propagates correlation context for consumers such as Jaeger and Grafana Tempo
VAL05 validates the two implemented observability paths (Prometheus metrics and WAL/OTLP events). It does not validate the OTel Go SDK (not used), automatic traceparent header extraction (not implemented), slog trace context injection (not implemented), or the edge Prometheus metrics (edge process not started by this lab). See otel-integration-validation.md for the full VAL05 validation plan, architecture notes, pass/fail criteria, and report template.
The VAL06 surface (slice 17) proves support-bundle correctness across a 10-check matrix covering all four bundle claims:
- VAL06-C1 (generation succeeds): `support-bundle generate` exits 0 and produces a non-empty `.tar.gz` archive within 30 seconds — confirming the collector pipeline runs to completion and the archive-write path is functional at lab corpus sizes
- VAL06-C2 (diagnostic coverage): the archive contains all three always-present core files (`manifest.json`, `system_info.json`, `build_info.json`) and `manifest.json` records all 6 collector names regardless of their individual outcome; `system_info.json` contains all 5 required runtime fields; `audit_recent.json` has at least 1 record from the retained store, proving end-to-end connectivity between the bundle and the audit subsystem
- VAL06-C3 (secrets redacted): `config_redacted.yaml` replaces the known test `fleet_salt` with `<REDACTED>` and the postgres URL password with `REDACTED`; original values are verified absent; no PEM block (`-----BEGIN`) appears anywhere in the extracted archive
- VAL06-C4 (graceful degradation): generating a bundle with a non-existent `--orchestrator-url` exits 0 and `manifest.json` records `ha_status` as `"failed"`, confirming the non-fatal collector pattern is preserved for optional data sources
VAL06 validates the CLI surface and archive structure. It does not test
bundle ingestion by external tools, RBAC guarding of the command (confirmed
unguarded by VAL03-C3), per-field value correctness of system_info.json,
or the DB-backed audit path (requires a live PostgreSQL instance). See
support-bundle-validation.md for the full
VAL06 validation plan, bundle architecture, collector status definitions,
pass/fail criteria, and report template.
The VAL07 surface (slice 18) establishes a rollout latency baseline across a 9-check matrix covering all four latency claims:
- VAL07-C1 (control plane reachable): a dedicated fresh control-plane instance starts on `127.0.0.1:18992` with an isolated SQLite data directory and responds to `GET /v1/health` with 200, establishing a clean starting point for the benchmark
- VAL07-C2 (plan-create latency): 20 sequential `POST /v1/rollouts` requests are timed with `curl -w '%{time_total}'` and percentiles are computed in Python; p50 ≤ 100 ms, p95 ≤ 300 ms, and p99 ≤ 500 ms prove the primary workplan target (rollout plan creation < 500 ms p99) is met in the local-SQLite environment
- VAL07-C3 (plan-list latency): 20 sequential `GET /v1/rollouts` requests against a store containing 20 plans; p99 ≤ 500 ms proves the read path is within the same bound after realistic state accumulation
- VAL07-C4 (concurrent responsiveness): 5 parallel `POST /v1/rollouts` requests all return 2xx and complete within a 2000 ms wall clock, proving that the single-writer SQLite connection serialises concurrent creates without returning errors or making the operator API unacceptably slow
VAL07 is a local-lab latency baseline. The bounds (100/300/500 ms) are generous for in-process loopback SQLite and are designed to detect regressions (e.g. an accidentally synchronous fsync, a missing index on the list path) rather than to measure production PostgreSQL performance. Benchmark methodology, sample size rationale, and environment assumptions are documented in rollout-latency-validation.md.
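A trimmed-down version of the create-latency sample can be re-run by hand. The request body below is illustrative, since the exact plan schema accepted by `POST /v1/rollouts` is not reproduced in this document, and the awk percentile math is a rough stand-in for the lab's Python computation.

```bash
# Rough re-run of VAL07-C2 (payload shape is illustrative; port from VAL07-C1).
for i in $(seq 1 20); do
  curl -s -o /dev/null -w '%{time_total}\n' \
    -X POST http://127.0.0.1:18992/v1/rollouts \
    -H 'Content-Type: application/json' \
    -d "{\"metadata\":{\"id\":\"lat-probe-$i\"}}"
done | sort -n | awk '{ a[NR] = $1 }
  END { printf "p50=%s p95=%s p99=%s\n", a[int(NR*0.50)], a[int(NR*0.95)], a[NR] }'
```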
The VAL08 surface (slice 19) validates concurrent fleet rollout throughput across a 10-check matrix covering all four throughput claims:
- VAL08-C1 (N=100 zero errors): 100 concurrent workers each creating 5 plans produce zero errors, proving the workplan target (≥ 100 concurrent device rollouts) is met in the local-SQLite environment
- VAL08-C2 (durable storage): all 805 created plans (across four concurrency tiers) appear across paginated `GET /v1/rollouts?limit=100` responses, confirming the serialised SQLite writer commits every plan before returning 201
- VAL08-C3 (wall-clock bound): the N=100 scenario (500 plans) completes within 30 seconds, establishing a safe upper bound for operator-facing throughput at design-partner fleet sizes
- VAL08-C4 (no concurrency regression): throughput at N=100 is ≥ throughput at N=1, confirming that the single-writer SQLite connection serialises concurrent creates without causing a performance regression
VAL08 validates the control-plane write path under concurrent load at lab scale. It does not test PostgreSQL backend throughput (requires a live PG instance), network-constrained relay delivery, edge-agent reconciliation latency, or fleet sizes beyond 1,000 devices (the proposed workplan maximum). Scenario matrix design, SQLite serialisation notes, and pass/fail thresholds are documented in rollout-throughput-validation.md.
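The concurrency tiers reduce to a fan-out of workers each posting five plans. A hand-runnable equivalent is sketched below; the port and payload shape are illustrative, not the lab's actual values.

```bash
# Rough re-run of the N=100 tier (port and payload are illustrative).
N=100
seq 1 "$N" | xargs -P "$N" -I{} sh -c '
  for p in 1 2 3 4 5; do
    curl -s -o /dev/null -w "%{http_code}\n" \
      -X POST http://127.0.0.1:18993/v1/rollouts \
      -H "Content-Type: application/json" \
      -d "{\"metadata\":{\"id\":\"tput-{}-$p\"}}"
  done' > /tmp/val08-codes.txt
echo "errors=$(grep -cv '^201$' /tmp/val08-codes.txt)"
```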
The VAL09 surface (slice 20) validates stuck rollout detection and recovery across a 10-check matrix covering all four stuck-detection claims:
- VAL09-C1 (detection accuracy): plans in active phases with `updated_at` older than the threshold appear in `GET /v1/rollouts/stuck` with non-empty `diagnosis` strings — validated by detecting all 5 test plans after a 4-second sleep against a 3-second threshold
- VAL09-C2 (exclusion correctness): plans in paused or terminal phases are excluded from stuck detection regardless of `updated_at` staleness — validated by pausing plan-b and cancelling plan-c, then confirming both are absent from subsequent scans
- VAL09-C3 (retry recovery): `recover strategy=retry` transitions the plan to `active` and refreshes `updated_at`, removing it from future stuck scans — validated by the plan-d flow and confirmed at VAL09-10
- VAL09-C4 (rollback recovery): `recover strategy=rollback` transitions the plan to the `rolled_back` terminal phase, removing it from all subsequent scans — validated by the plan-e flow and confirmed at VAL09-10
VAL09 validates the detection and recovery surfaces against lab-scale plans
in published phase with no edge-agent activity. It does not test automatic
periodic stuck scanning (not yet implemented), stuck detection across HA
replicas, the skip_failed recovery strategy (which requires a plan with an
open stage in stage_in_progress phase), or the rollout.stuck.detected
audit event path (slog-only; not yet wired to the retained audit store).
Staleness injection method, diagnosis logic, and scenario design are
documented in stuck-detection-validation.md.
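A minimal sketch of the VAL09-C1 style assertion against the stuck-detection surface, assuming the endpoint returns a JSON array whose entries carry the `diagnosis` field described above (the array shape is an assumption):

```bash
# Query the stuck-rollout surface and assert every entry carries a diagnosis.
CP=http://127.0.0.1:18992            # illustrative lab port
stuck=$(curl -s "$CP/v1/rollouts/stuck")
count=$(echo "$stuck" | jq 'length')
missing=$(echo "$stuck" | jq '[.[] | select((.diagnosis // "") == "")] | length')
echo "stuck_plans=$count missing_diagnosis=$missing"
[ "$missing" -eq 0 ] || echo "FAIL: stuck plan reported without a diagnosis string"
```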
The VAL10 surface (slice 21) validates rollback reliability across a 10-check matrix covering all four rollback claims:
- VAL10-C1 (preview coverage): `rollback preview` exits 0 for all four target kinds (`rollout_plan`, `rollout_stage`, `ha_leader_resign`, `relay_deadletter`), producing safety class, trigger conditions, and known limitations
- VAL10-C2 (retry success rate): `rollback execute strategy=retry` on real rollout plans succeeds with 100% success rate across a batch of 5 executions, with each plan transitioning from `published` to `active`
- VAL10-C3 (rollback success rate): `rollback execute strategy=rollback` on real rollout plans succeeds with 100% success rate across a batch of 5 executions, with each plan transitioning from `published` to `rolled_back`
- VAL10-C4 (aggregate rate + audit): aggregate rate across all 10 executes is ≥ 99%; `rollback.executed` audit events with `outcome=success` are captured in the retained store
VAL10 validates the CLI execute path and the workplan ≥99% target against the
local-SQLite control-plane. It does not test skip_failed, ha_leader_resign
via VAL10, automatic rollback, the 30-day soak (workplan GA gate), or the
PostgreSQL backend. Success rate formula, JSON field handling, and evidence
structure are documented in rollback-reliability-validation.md.
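A minimal sketch of the VAL10-C4 aggregate-rate arithmetic, assuming each execute outcome has already been collected into an intermediate file (the file name and one-word-per-line format are assumptions):

```bash
# Compute the aggregate execute success rate and compare against the >= 99% target.
# val10_outcomes.txt is an assumed intermediate file: one "success"/"failure" line per execute.
total=$(wc -l < val10_outcomes.txt)
ok=$(grep -c '^success$' val10_outcomes.txt)
rate=$(python3 -c "print(f'{$ok / $total:.3f}')")
echo "executes=$total successes=$ok rate=$rate"
awk -v r="$rate" 'BEGIN { exit !(r >= 0.99) }' && echo PASS || echo FAIL
```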
The VAL11 surface (slice 22) validates operator-surface resilience under representative chaos conditions across a 10-check matrix covering all four chaos claims:
- VAL11-C1 (kill → client error + no silent data loss): after a CP SIGTERM, client CLI requests exit non-zero with a connection error keyword, and the full pre-kill plan corpus is present after restart
- VAL11-C2 (gate-wait survival): a plan in `published` phase (gate-wait state) retains its phase across the CP kill boundary, confirming the operator’s pending gate decision is not lost
- VAL11-C3 (rapid restart resilience): three successive kill+restart cycles do not corrupt the store; new plan creates succeed after the final restart, confirming the write path is operational after repeated restarts
- VAL11-C4 (diagnostic and recovery surfaces functional post-chaos): stuck detection, artifact corruption proxy queries, rollback execute, and audit capture all function correctly in and around chaos injection windows
VAL11 validates the control-plane durability and operator-surface resilience using process-level SIGTERM injection only. It does not test iptables-level network partitions (requires root), concurrent creates under kill (inherently racy; replaced by deterministic rapid-restart), SIGKILL WAL recovery, PostgreSQL backend chaos, edge-agent reconnect after partition, or automatic stage promotion under chaos. Chaos mechanism rationale, safety guardrails, and scenario design are documented in chaos-validation.md.
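A minimal sketch of the VAL11-C1/C3 kill-and-restart pattern, assuming a `start_cp` helper that relaunches the control-plane and a `CP_PID` variable tracking the running process (both are placeholders; the lab manages its own process lifecycle):

```bash
# Record the pre-kill plan count, SIGTERM the control-plane, confirm a client
# error while it is down, restart, and verify no plans were silently lost.
CP=http://127.0.0.1:18992            # illustrative; the chaos slice uses its own lab port
before=$(curl -s "$CP/v1/rollouts?limit=100" | jq 'length')
kill -TERM "$CP_PID"
sleep 1
if curl -sf "$CP/v1/health" > /dev/null; then
  echo "FAIL: control-plane still answering after SIGTERM"
fi
start_cp                             # assumed helper: restart CP and wait for /v1/health
after=$(curl -s "$CP/v1/rollouts?limit=100" | jq 'length')
[ "$after" -ge "$before" ] && echo "PASS: pre-kill plan corpus intact" || echo "FAIL: plans missing"
```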
VAL 12 — Fleet Rollout 30-Day Soak is the workplan Gate D long-duration
framework and is not a slice of this runner. The existing run_cli_audit_lab.sh
is a synchronous single-shot evidence collector; the 30-day soak requires
persistent infrastructure, scheduled round execution (cron every 30 minutes),
rolling evidence windows, daily aggregation, and a final pass/fail report.
The soak is driven by three separate scripts:
- `scripts/labs/run_soak_val12_setup.sh` — one-time environment provisioning; starts a persistent CP at `127.0.0.1:19000`, writes `config.env`, and runs the first workload round to verify VAL12-01 (framework provisioned) and VAL12-02 (initial round zero errors)
- `scripts/labs/run_soak_val12_round.sh` — single workload round called by cron every 30 minutes; creates 10 plans, runs stuck scan + auto-recovery, scrapes Prometheus metrics, and writes `round-summary.json` to `$SOAK_DIR/rounds/YYYY-MM-DD/round-HHMMSS/`
- `scripts/labs/run_soak_val12_report.sh` — daily summary and final report aggregator; reads all round summaries, computes rollback rate, P99 latency, and CP uptime, and checks all 10 VAL12 thresholds
The soak satisfies four claims over 30 days: ≥100 concurrent plans sustained (VAL12-C1), ≥99% rollback success rate (VAL12-C2), P99 ≤500ms maintained under accumulated store state (VAL12-C3), and CP availability ≥99.9% (VAL12-C4). The Gate D minimum-acceptable pass is VAL12-03 (fleet target reached) + VAL12-10 (30-day aggregate rollback rate ≥0.990). Soak environment design, workload schedule, alert thresholds, evidence retention plan, and the final report template are documented in soak-validation.md.
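A minimal crontab sketch for the VAL12 schedule described above, assuming the setup script has already provisioned the soak directory (the `SOAK_DIR` and log paths shown are illustrative):

```bash
# One workload round every 30 minutes; daily aggregation shortly after midnight.
*/30 * * * *  SOAK_DIR=$HOME/soak-val12 bash scripts/labs/run_soak_val12_round.sh  >> $HOME/soak-val12/cron.log 2>&1
5 0 * * *     SOAK_DIR=$HOME/soak-val12 bash scripts/labs/run_soak_val12_report.sh >> $HOME/soak-val12/cron.log 2>&1
```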
VAL 13 — HA Failover Validation is slice 23 of this runner
(run_ha_failover_val13_lab). It extends the existing HA lab infrastructure in
run_cli_audit_lab.sh with a dedicated function using fresh Docker containers
(val13-pg-primary, val13-ha-net) and isolated ports (18997/18998) to avoid
interference with the backup/restore and quorum labs.
VAL13 validates four HA readiness claims:
- VAL13-C1 (SIGTERM failover latency): leader election completes within 5 seconds of SIGTERM on the current leader, measured end-to-end from kill signal to follower logging “acquired leadership”
- VAL13-C2 (zero data loss): rows written directly to the shared PostgreSQL instance while the original leader held the advisory lock remain readable from that same instance after failover
- VAL13-C3 (PG crash recovery): quorum monitor detects PostgreSQL unavailability (`quorum_health=lost`) and returns to `quorum_health=healthy` after explicit operator restart of PostgreSQL
- VAL13-C4 (unplanned crash / disk-fault proxy): SIGKILL on the leader (no graceful `Resign()`, simulating OOM or disk crash) results in advisory lock release via TCP RST and a new leader within 5 seconds
VAL13 does NOT validate: streaming-replication failover (covered by
run_ha_lab() and run_quorum_lab()), iptables-based network partitions
(requires root), concurrent writes under kill, SIGKILL on PostgreSQL (WAL
recovery path; covered by VAL13-09 indirectly via the same single-node PG),
multi-region or cross-datacenter failover, or automatic rollback trigger under
HA failure. Scenario design, measurement method, and pass/fail criteria are
documented in ha-failover-validation.md.
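A minimal sketch of the VAL13-C1 end-to-end measurement, assuming the leader PID is known and the follower’s stdout is being captured to a log file (the `LEADER_PID` variable and log path are placeholders; the lab tracks both itself):

```bash
# Measure SIGTERM-to-"acquired leadership" latency end to end.
FOLLOWER_LOG=/tmp/val13-node2.log    # illustrative follower log path
start_ms=$(date +%s%3N)
kill -TERM "$LEADER_PID"
while ! grep -q 'acquired leadership' "$FOLLOWER_LOG"; do
  sleep 0.05
  now_ms=$(date +%s%3N)
  if [ $((now_ms - start_ms)) -gt 5000 ]; then
    echo "FAIL: no new leader within 5000 ms"; break
  fi
done
echo "failover_ms=$(( $(date +%s%3N) - start_ms ))"
```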
VAL 14 — HA Replication Lag Baseline benchmarks PostgreSQL streaming
replication lag under the autonomyops HA architecture and derives practical
alerting thresholds from observed data. A val14-pg-primary + val14-pg-standby
pair is provisioned via pg_basebackup, and a single HA server at port 18999
uses --min-sync-replicas 1 so quorum health tracks standby availability.
VAL14 proves these workplan claims:
- VAL14-C1 (replication streaming): a PostgreSQL streaming-replication standby is established, confirmed by `pg_stat_replication.state = streaming` and an LSN gap of 0 at rest
- VAL14-C2 (light load drain ≤ 2 s): after a 100-row × 500-byte write batch committed with `synchronous_commit=off`, lag is sampled during the active drain window and the WAL LSN gap drains to 0 within 2000 ms on local Docker
- VAL14-C3 (heavy load drain ≤ 5 s): after a 500-row × 2000-byte (~1 MB) write batch committed with `synchronous_commit=off`, the LSN gap drains within 5000 ms
- VAL14-C4 (threshold derivation): practical alerting thresholds (`healthy`, `degraded`, `alert`) are derived from observed p95 lag using the formula `healthy = max(p95×3+1, 10)`, `degraded = max(healthy×10, 100)`, `alert = max(healthy×50, 500)`, anchoring monitoring configuration to measured behaviour
VAL14 does NOT validate: write-path lag through the HA server HTTP surface
(writes go directly to PostgreSQL), streaming-replication switchover or
promotion (covered by run_ha_lab() and run_quorum_lab()), PG logical
replication, multi-standby topologies, network-partition-induced lag
(requires iptables/root), or production-scale throughput on cloud storage.
Benchmark design, analysis method, and threshold derivation formula are
documented in ha-replication-lag-validation.md.
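A minimal sketch of the VAL14 LSN-gap sample and the VAL14-C4 threshold derivation, assuming `psql` access to the `val14-pg-primary` container and a file of lag samples in milliseconds (the superuser name and sample file are assumptions; the formula matches the description above):

```bash
# Sample the WAL LSN gap (bytes) between primary and streaming standby.
docker exec val14-pg-primary psql -U postgres -Atc \
  "SELECT COALESCE(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn), -1)
     FROM pg_stat_replication WHERE state = 'streaming';"

python3 - <<'EOF'
import math
# val14_lag_ms.txt is an assumed file of observed lag samples in milliseconds.
lags = sorted(float(x) for x in open("val14_lag_ms.txt"))
p95 = lags[max(0, math.ceil(0.95 * len(lags)) - 1)]
healthy  = max(p95 * 3 + 1, 10)
degraded = max(healthy * 10, 100)
alert    = max(healthy * 50, 500)
print(f"p95={p95:.1f}ms healthy={healthy:.0f}ms degraded={degraded:.0f}ms alert={alert:.0f}ms")
EOF
```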
VAL 15 — Backup/Restore Validation proves the ha backup create/list/restore
CLI workflow end-to-end, including integrity verification, timing bounds, and
safety-gate enforcement. A dedicated Docker PostgreSQL instance with two fixture
tables (~1 MB total) is provisioned for isolation. The HA server is cycled
through normal -> maintenance -> normal modes to match the real operator workflow.
VAL15 proves these workplan claims:
- VAL15-C1 (backup file integrity): `ha backup create` produces a valid pg_dump custom-format archive; the SHA-256 checksum recorded in the inventory matches an independently computed hash of the produced file
- VAL15-C2 (restore correctness): after post-backup mutations (UPDATE + DELETE), `ha backup restore` reverts the database to its pre-backup state; row counts and spot-check payload values are verified by SQL assertion
- VAL15-C3 (timing bounds): backup completes in ≤ 30 s; restore completes in ≤ 60 s on local Docker (conservative thresholds that flag hangs or permission errors without constraining normal operation)
- VAL15-C4 (safety gate): `ha backup restore` without `--confirm` exits non-zero, confirming the mandatory confirmation flag prevents accidental restores
VAL15 does NOT validate: backup rotation / retention policy (no implementation), cross-PG-version restore compatibility, concurrent writes during backup, backup storage to remote object stores, disaster recovery runbook execution timing, or automatic scheduled backups. Fixture strategy, checksum method, and test sequence are documented in backup-restore-validation.md.
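A minimal sketch of the VAL15-C1 integrity check, assuming the backup inventory is a JSON array whose entries expose `path` and `sha256` fields (the inventory location and field names are assumptions; the real schema is owned by the `ha backup` implementation):

```bash
# Recompute the SHA-256 of the newest backup archive and compare it against
# the checksum recorded in the inventory.
INVENTORY=backups/inventory.json     # assumed inventory path
path=$(jq -r '.[-1].path'   "$INVENTORY")
want=$(jq -r '.[-1].sha256' "$INVENTORY")
have=$(sha256sum "$path" | awk '{print $1}')
[ "$want" = "$have" ] && echo "PASS: checksum matches" || echo "FAIL: $want != $have"
```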
VAL16 (run_split_brain_chaos_val16_lab) injects faults directly via SQL into the
`leadership_state` and `leader_epochs` tables to trigger `risk=detected` (epoch
divergence + holder mismatch) and `risk=possible` (unclosed ghost-node epoch
rows). A user-table probe row (`val16_probe`) is checked post-recovery to
confirm recovery does not touch data beyond the leadership metadata tables.
VAL16 proves these workplan claims:
- VAL16-C1 (split-brain detection): epoch divergence and ghost-node conditions are reliably detected and reported as `risk=detected` / `risk=possible` respectively via the `/v1/ha/split-brain` API and `ha split-brain detect` CLI
- VAL16-C2 (recovery correctness): `ha split-brain recover --strategy promote-leader` clears `risk=detected` and returns the cluster to `risk=none` without corrupting user data
- VAL16-C3 (dry-run safety): `manual-reconcile` exits 0 and does not write to the database, confirming operators can plan a recovery before committing
- VAL16-C4 (self-clearing ghost nodes): unclosed epoch rows that are subsequently resigned clear automatically, returning the cluster to `risk=none` without operator intervention
VAL16 does NOT validate: real network partitions (requires iptables/root), genuine two-node concurrent-write split-brain, automatic rollback under detected split-brain, multi-region scenarios, or HA server binary restart-triggered epoch divergence. Scenario design, injection SQL, and safety rationale are documented in split-brain-chaos-validation.md.
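A minimal sketch of the VAL16-C1 detection check against the split-brain API, assuming the HA server port shown and a JSON response with a top-level `risk` field (the port is illustrative; the field name comes from the description above):

```bash
# Poll the split-brain surface and report the current risk classification.
# The CLI equivalent is `autonomy ha split-brain detect`.
HA=http://127.0.0.1:18998            # illustrative HA lab port
risk=$(curl -s "$HA/v1/ha/split-brain" | jq -r '.risk')
echo "risk=$risk"
case "$risk" in
  none)              echo "cluster healthy" ;;
  detected|possible) echo "split-brain condition reported" ;;
  *)                 echo "unexpected risk value" ;;
esac
```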
VAL17 (run_quorum_loss_val17_lab) exercises the QuorumMonitor’s healthy → lost → healthy cycle with timed measurements and write-blocking assertions.
Single PG with --min-sync-replicas 0 and --quorum-monitor-interval 500ms;
loss induced by docker stop, recovery by docker start. VAL17 proves these
workplan claims (Gap HA-004):
- VAL17-C1 (loss detection timing): `quorum_health=lost` is detected within ≤ 30,000 ms (30 s, well under the workplan’s 60 s target) of `docker stop` with a 500 ms monitor interval
- VAL17-C2 (write safety during loss): `write_block_active=true` and `can_accept_protected_writes=false` are confirmed in the quorum status JSON during the loss window, proving the WriteGate middleware is engaged
- VAL17-C3 (recovery detection timing): `quorum_health=healthy` is detected within ≤ 30,000 ms after PostgreSQL is restored with `docker start`, confirming timely recovery detection once the dependency returns
- VAL17-C4 (monitor history correctness): `last_lost_at`, `last_restored_at`, and `detected_loss_count` are correctly populated and increment across repeated cycles, confirming the QuorumMonitor’s state-change tracking is reliable
VAL17 does NOT validate: the healthy → degraded path (covered by
run_quorum_lab() with --min-sync-replicas 1), network-partition-induced
quorum loss (iptables, requires root), HTTP write-gating via rollout endpoint
(HA server binary does not expose /v1/rollouts), or multi-region topologies.
Timing method, threshold rationale, and scenario design are documented in
quorum-loss-validation.md.
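A minimal sketch of the VAL17-C1/C2 loss window, assuming `autonomy ha quorum status` prints the quorum status JSON described above (whether an output flag is required is not shown here) and a placeholder PostgreSQL container name:

```bash
# Stop PostgreSQL, time how long the monitor takes to report quorum_health=lost,
# and confirm protected writes are blocked during the loss window.
PG_CONTAINER=val17-pg                # assumed container name for the single PG instance
docker stop "$PG_CONTAINER"
start_ms=$(date +%s%3N)
until autonomy ha quorum status | jq -e '.quorum_health == "lost"' > /dev/null; do
  sleep 0.2
done
echo "loss_detected_ms=$(( $(date +%s%3N) - start_ms ))"
autonomy ha quorum status | \
  jq '{quorum_health, write_block_active, can_accept_protected_writes}'
docker start "$PG_CONTAINER"
```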
VAL 18 — HA 30-Day Soak is the workplan Gate D long-duration HA framework
and is not a slice of this runner. The 30-day lifecycle — with persistent
Docker infrastructure, cron-scheduled round execution, PID tracking across
invocations, node restart recovery, and progressive report generation — cannot
be expressed as a run_cli_audit_lab.sh function. The same reasoning applies to
VAL12 (the fleet soak).
The soak is driven by three separate scripts:
- `scripts/labs/run_soak_val18_setup.sh` — one-time environment provisioning; builds the `orchestrator_ha_server` + `autonomy` binaries, starts a persistent Docker PostgreSQL instance (`val18-pg-primary`, host port 5488) for a stable 30-day connection address, creates the `val18_probe` table, starts HA node1 (19010) + node2 (19011) as background processes, writes `config.env`, and runs the first health round to verify VAL18-01 (framework provisioned) and VAL18-02 (initial round success). On a normal rerun it reuses the persistent Docker volume/container instead of deleting 30-day soak state
- `scripts/labs/run_soak_val18_round.sh` — single HA health round called by cron every 2 hours; checks PG + HA node liveness (restarts dead processes), identifies the current leader from `/v1/ha/status` `holder_id`, checks quorum health + probe row, triggers a timed SIGTERM failover every `SOAK_FAILOVER_INTERVAL_ROUNDS` rounds (polls the follower every 50 ms for a `holder_id` change, measures `failover_ms`, restarts the killed node as a follower, verifies the probe row on the new leader), and writes `round-summary.json` to `$SOAK_DIR/rounds/YYYY-MM-DD/round-HHMMSS/`
- `scripts/labs/run_soak_val18_report.sh` — daily summary and checkpoint/final report aggregator; reads all `round-summary.json` files, computes HA uptime %, failover count, p50/p95/p99 `failover_ms`, and data continuity rate, and checks all VAL18 thresholds with a Gate D HA assessment. The report uses separate failover-count thresholds for the 7-day checkpoint (>= 1) and the 30-day final Gate D check (>= 3)
The soak satisfies four claims over 30 days: ≥ 3 scheduled leader failovers
sustained (VAL18-C1), failover_ms ≤ 10,000 on every failover (VAL18-C2),
probe row accessible after every failover (data continuity rate = 1.0,
VAL18-C3), and HA uptime ≥ 99.9% (VAL18-C4). The Gate D minimum-acceptable
pass requires VAL18-09 (≥ 3 total failovers) + VAL18-10 (data continuity
rate = 1.000) + VAL18-07 (max failover timing maintained ≤ 10,000 ms). Soak
environment design, failover schedule strategy, observability plan, evidence
retention, and the final report template are documented in
ha-soak-validation.md.
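A minimal crontab sketch for the VAL18 schedule described above, assuming the setup script has already written `config.env` (the `SOAK_DIR` and log paths shown are illustrative):

```bash
# One HA health round every 2 hours; daily report aggregation once per day.
0 */2 * * *  SOAK_DIR=$HOME/soak-val18 bash scripts/labs/run_soak_val18_round.sh  >> $HOME/soak-val18/cron.log 2>&1
15 0 * * *   SOAK_DIR=$HOME/soak-val18 bash scripts/labs/run_soak_val18_report.sh >> $HOME/soak-val18/cron.log 2>&1
```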
VAL25 — Fleet Rollout Proof Report¶
VAL25 is a report generator (not a test runner) that reads the evidence produced by the five fleet rollout validation slices (VAL07–VAL11) and emits a consolidated proof report suitable for engineering leads and external reviewers.
It evaluates three readiness levels and clearly separates measured results from proposed targets:
| Readiness level | Achievable with VAL07–VAL11? | Additional requirements |
|---|---|---|
| Design Partner | YES — if all five slices pass and key targets met | None beyond VAL07–VAL11 |
| GA | NO | PostgreSQL backend + VAL12 30-day soak |
| Public Production | NO | Everything above GA + security audit + multi-region |
Run after completing a full cli-audit-lab run:
```bash
bash scripts/labs/run_fleet_rollout_proof_report_val25.sh \
  evidence/cli-audit-lab-YYYY-MM-DD
```
Output is written to val25/ within the evidence directory. The formal plan,
metric definitions, and readiness criteria are documented in
fleet-rollout-proof-report-validation.md.
VAL26 — HA Proof Report¶
VAL26 is a report generator (not a test runner) that reads the evidence produced by the five HA validation slices (VAL13–VAL17) and emits a consolidated HA proof report.
It evaluates three readiness levels with measured results clearly separated from proposed targets and derived thresholds:
| Readiness level | Achievable with VAL13–VAL17? | Additional requirements |
|---|---|---|
| HA Design Partner | YES — if all five slices pass and key targets met | None beyond VAL13–VAL17 |
| HA GA | NO | VAL18 30-day soak + streaming replication promotion |
| HA Public Production | NO | Everything above GA + multi-AZ + security audit |
Run after completing a full cli-audit-lab run:
```bash
bash scripts/labs/run_ha_proof_report_val26.sh \
  evidence/cli-audit-lab-YYYY-MM-DD
```
Output is written to val26/ within the evidence directory. The formal plan,
metric definitions, and readiness criteria are documented in
ha-proof-report-validation.md.
VAL28 — Cross-Cutting Proof Report¶
VAL28 is a report generator (not a test runner) that reads the evidence produced by six cross-cutting validation slices (VAL01–VAL06) and emits a consolidated report covering the Security, Observability, and Operations surfaces of the AutonomyOps ADK control plane.
It evaluates three readiness levels:
| Readiness level | Achievable with VAL01–VAL06? | Additional requirements |
|---|---|---|
| Cross-Cutting Design Partner | YES — if all six slices pass and key targets met | None beyond VAL01–VAL06 |
| Cross-Cutting GA | NO | PG-backed audit perf + production OTel collector + no SKIP checks |
| Public Production Claim | NO | Everything above GA + external security audit + compliance audit |
Run after completing a full cli-audit-lab run:
```bash
bash scripts/labs/run_crosscut_proof_report_val28.sh \
  evidence/cli-audit-lab-YYYY-MM-DD
```
Output is written to val28/ within the evidence directory. VAL28 handles the
mixed evidence formats: VAL01/VAL02 text reports are parsed via regex;
VAL03–VAL06 JSON reports are schema-validated before extraction. The formal
plan, metric definitions, and readiness criteria are documented in
crosscut-proof-report-validation.md.
VAL29 — v1 Public-Claim Evidence Matrix¶
VAL29 is a meta-aggregator that reads the JSON artifacts from all four proof reports (VAL25/VAL26/VAL27/VAL28) and produces a single capability-level evidence matrix covering every v1 public claim.
The matrix assigns one of five evidence states to each claim:
| State | Meaning |
|---|---|
| | Claim fully supported by completed VAL runs |
| | Claim supported with disclosed limitations |
| | Soak framework exists; Gate D not yet run |
| | Not validated; additional work required |
| | Required for Public Production Claim only |
Run after all four proof reports have been generated:
```bash
bash scripts/labs/run_evidence_matrix_val29.sh \
  evidence/cli-audit-lab-YYYY-MM-DD \
  evidence/
```
Output is written to val29/ within the cli-audit-lab evidence directory.
The formal plan and full matrix definition are documented in
evidence-matrix-validation.md.