VAL 15 — Backup/Restore Validation

Status: Implemented
Runner: run_backup_restore_val15_lab() in scripts/labs/run_cli_audit_lab.sh
Evidence dir: $EVIDENCE_DIR/val15/
Port: cp-val15-node → 19001

Purpose

Validates the autonomy ha backup create/list/restore workflow end-to-end:

Confirms backup creates a valid pg_dump custom-format archive
Verifies stored metadata (backup_id, status, checksum) matches the produced file
Measures backup and restore durations against bounded thresholds
Proves restore correctness by injecting post-backup mutations and verifying reversion
Validates the backup inventory grows correctly with multiple backups
Confirms the safety gate (mandatory --confirm flag) blocks restore without explicit confirmation
Captures ha.backup.created and ha.backup.restored audit events in the shared store

Branch-Specific Rule Application

Question	Answer
Is this covered by an existing LAB?	Partially. `run_ha_lab()` exercises `ha backup create/list/restore` as part of a larger HA flow (lines 1283–1334). It does NOT cover: backup duration measurement, checksum verification, restore correctness assertion with row-count comparison, multi-backup inventory, error-path testing (missing `--confirm`), or a standalone 10-check pass/fail report.
Which LAB/evidence bundle is extended?	`run_cli_audit_lab.sh` — new function `run_backup_restore_val15_lab()` appended as slice 25. Reuses `start_ha_server()`, `wait_for_http()`, `wait_for_log()`, `wait_for_pg_container()` helpers defined in the same file.
New evidence files	29 files in `$EVIDENCE_DIR/val15/` — see Evidence Files table below.
Tutorial/runbook docs updated	`docs/tutorials/cli-audit-lab.md` §4 (slice 25), §5 (val15/ files), §6 (expected results), §8 (scope).
Reason new runner function required	The backup/restore steps in `run_ha_lab()` are embedded mid-function in a larger HA bring-up/tear-down flow and share Docker infrastructure (`pr17ha-primary`, `pr17-ha-net`). Injecting timing measurement, integrity assertions, multi-backup tests, and error-path tests into that flow would break the HA lab’s sequencing. A narrowly scoped `run_backup_restore_val15_lab()` with isolated infrastructure is cleaner.

Backup/Restore Test Plan

Environment

val15-ha-net  (Docker bridge network, isolated)
     │
val15-pg-primary  (postgres:16, DB: autonomy)
     │
cp-val15-node:19001  (orchestrator_ha_server binary)
     │  ← normal mode for backup create (phases 1–3)
cp-val15-maint:19001 (orchestrator_ha_server --maintenance-mode)
     │  ← maintenance mode for restore (phase 4)
cp-val15-post:19001  (orchestrator_ha_server binary)
     │  ← normal mode for multi-backup + error-path tests (phases 5–6)

Test Sequence

Phase	Actions
1. Setup	Provision `val15-pg-primary` Docker container; load fixture tables
2. Backup	Start HA server (normal); `ha backup create`; capture timing + checksum
3. Validate	`pg_restore -l`; `ha backup list`; checksum comparison
4. Mutate + Restore	Mutate tables; restart HA in `--maintenance-mode`; `ha backup restore`; verify data
5. Multi-backup	Restart HA (normal); `ha backup create` second backup; `ha backup list` shows count ≥ 2
6. Error path	`ha backup restore` without `--confirm`; must exit non-zero
7. Audit	`audit query --event-type ha.backup.created/restored`
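
Phase 6 in the table above reduces to a two-part assertion on the captured result of the unconfirmed restore. The sketch below assumes the harness has already captured the exit code and the combined stdout+stderr; the function name is hypothetical and the sample error text is illustrative, not the CLI's actual wording.

```python
def restore_gate_holds(exit_code: int, output: str) -> bool:
    """Return True when a restore without --confirm was rejected with guidance.

    Both conditions must hold: a non-zero exit code, and an error message
    that names the missing --confirm flag.
    """
    return exit_code != 0 and "--confirm" in output


# Illustrative inputs only; the real error wording may differ.
print(restore_gate_holds(1, "error: restore requires --confirm"))  # True
print(restore_gate_holds(0, "restore completed"))                  # False
```

Checking the message as well as the exit code separates a working safety gate from an unrelated failure that also happens to exit non-zero.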

Dataset / Fixture Strategy

Two tables are loaded via psql using generate_series:

Table	Rows	Bytes/row	Approximate size
`val15_small`	100	200	~20 KB
`val15_medium`	1,000	1,000	~1 MB

Why two tables?

val15_small exercises the happy-path integrity check (spot-value comparison)
val15_medium adds enough WAL and backup data to make timing measurements meaningful

Mutation (post-backup, pre-restore):

UPDATE val15_small SET payload = 'MUTATED' WHERE id <= 50 — corrupts first 50 rows
DELETE FROM val15_medium WHERE id > 500 — removes half the medium table

These mutations are deliberately visible: after a correct restore, val15_small should have 100 rows with LEFT(payload,1)='s' and val15_medium should have 1,000 rows with LEFT(payload,1)='m'.
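
To see why these mutations are easy to detect, the fixture and the two statements can be modelled in memory. This is only an illustrative model, not the lab's code: it assumes each payload is a single repeated letter of the documented length, and the helper names are hypothetical.

```python
def make_fixture():
    # Assumption: payloads are one repeated letter ('s' or 'm') of the documented size.
    small = {i: "s" * 200 for i in range(1, 101)}
    medium = {i: "m" * 1000 for i in range(1, 1001)}
    return small, medium


def mutate(small, medium):
    # Mirrors: UPDATE val15_small SET payload = 'MUTATED' WHERE id <= 50
    for row_id in small:
        if row_id <= 50:
            small[row_id] = "MUTATED"
    # Mirrors: DELETE FROM val15_medium WHERE id > 500
    for row_id in [k for k in medium if k > 500]:
        del medium[row_id]


def spot_check(small, medium):
    # Same four values the integrity query reports.
    return (len(small), len(medium), small[1][:1] == "s", medium[1][:1] == "m")


small, medium = make_fixture()
before = spot_check(small, medium)
mutate(small, medium)
after = spot_check(small, medium)
print(before)  # (100, 1000, True, True)
print(after)   # (100, 500, False, True)
```

Under this model, a restore that silently did nothing would leave the row count of val15_medium at 500 and the val15_small spot value wrong, so two of the four values would disagree with the expected result.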

Integrity Verification Method

Row-count and spot-value check

SELECT
  (SELECT COUNT(*) FROM val15_small)::int  AS small_count,
  (SELECT COUNT(*) FROM val15_medium)::int AS medium_count,
  (SELECT LEFT(payload,1) FROM val15_small  WHERE id=1) = 's' AS small_payload_ok,
  (SELECT LEFT(payload,1) FROM val15_medium WHERE id=1) = 'm' AS medium_payload_ok;

Expected post-restore: small_count=100 | medium_count=1000 | t | t
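
The assertion on that result row could look like the following sketch. It assumes the query is captured as a single pipe-separated line such as `100|1000|t|t` (the form psql produces in unaligned, tuples-only mode); both function names are hypothetical.

```python
def parse_data_check(line: str) -> dict:
    """Parse one pipe-separated result row into named fields."""
    parts = [p.strip() for p in line.strip().split("|")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    return {
        "small_count": int(parts[0]),
        "medium_count": int(parts[1]),
        "small_payload_ok": parts[2] == "t",   # PostgreSQL prints booleans as t/f
        "medium_payload_ok": parts[3] == "t",
    }


def restore_correct(row: dict) -> bool:
    """All four expectations must hold for the restore to count as correct."""
    return (
        row["small_count"] == 100
        and row["medium_count"] == 1000
        and row["small_payload_ok"]
        and row["medium_payload_ok"]
    )


print(restore_correct(parse_data_check("100|1000|t|t")))  # True
print(restore_correct(parse_data_check("100|500|f|t")))   # False
```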

Checksum verification

cli_checksum  = extract "checksum=<hex>" from ha backup create output
file_checksum = SHA-256 of .dump file (Python hashlib, 64K chunks)
PASS if cli_checksum == file_checksum

The ha backup create command records the SHA-256 hex string in the backup_inventory table and includes it in the text output as checksum=<hex>. The lab computes the same hash independently of the CLI to confirm no truncation or corruption occurred.
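
A minimal sketch of that independent comparison is shown below. It assumes the CLI text output contains a `checksum=<hex>` token with a 64-character hex digest somewhere in it; the function names are hypothetical and the rest of the output format is not assumed.

```python
import hashlib
import re
from typing import Optional


def extract_cli_checksum(cli_output: str) -> Optional[str]:
    """Pull the 64-hex-digit SHA-256 value out of a 'checksum=<hex>' token."""
    match = re.search(r"checksum=([0-9a-fA-F]{64})\b", cli_output)
    return match.group(1).lower() if match else None


def sha256_file(path: str, chunk_size: int = 64 * 1024) -> str:
    """Hash the file in 64K chunks so large dumps are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_verified(cli_output: str, dump_path: str) -> bool:
    """PASS only if the CLI reported a checksum and it matches the file."""
    cli_checksum = extract_cli_checksum(cli_output)
    return cli_checksum is not None and cli_checksum == sha256_file(dump_path)
```

Treating a missing or unparseable `checksum=` token as a failure, rather than skipping the check, keeps an output-format regression from passing silently.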

pg_dump format validation

docker cp <dump_file> val15-pg-primary:/tmp/<backup_id>.dump
docker exec val15-pg-primary pg_restore -l /tmp/<backup_id>.dump
PASS if exit code = 0 AND output contains "TABLE DATA"

The -l flag lists the TOC without restoring. The harness first copies the archive into the container so pg_restore reads the real backup file generated by ha backup create, then captures the TOC in the evidence bundle.
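
The two steps and the pass condition can be sketched in Python as follows. The `docker cp` and `docker exec ... pg_restore -l` invocations are the ones listed above; the function names and the idea of driving them through `subprocess` are assumptions made for illustration.

```python
import subprocess


def list_toc_in_container(container: str, host_dump: str, backup_id: str):
    """Copy the archive into the container and list its TOC without restoring."""
    target = f"/tmp/{backup_id}.dump"
    subprocess.run(["docker", "cp", host_dump, f"{container}:{target}"], check=True)
    proc = subprocess.run(
        ["docker", "exec", container, "pg_restore", "-l", target],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout


def toc_is_valid(exit_code: int, toc_output: str) -> bool:
    """PASS if pg_restore exited 0 and the TOC lists at least one TABLE DATA entry."""
    return exit_code == 0 and "TABLE DATA" in toc_output
```

Keeping `toc_is_valid` separate from the Docker calls means the pass condition can be exercised against a saved TOC, such as the one stored in the evidence bundle, without a running container.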

Timing Measurements

Measurement	Method	Threshold
`backup_ms`	`int(time.time()*1000)` before/after `ha backup create` CLI call	≤ 30,000 ms
`restore_ms`	`int(time.time()*1000)` before/after `ha backup restore` CLI call	≤ 60,000 ms

Threshold rationale:

30 s for backup: pg_dump -Fc on ~1 MB of data on local Docker completes in < 2 s. 30 s is a conservative upper bound that accommodates cold Docker starts and slow CI environments.
60 s for restore: pg_restore --clean --if-exists requires a brief CP stop and re-creation of all objects. 60 s is 2× the backup threshold to account for the additional DROP/CREATE overhead and the brief HA server restart cycle.

Both thresholds are deliberately lenient: they flag runaway hangs or permission errors, not performance regressions.
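
The measurement method can be sketched as a small wrapper around any callable that performs the CLI call. The wrapper and constant names are hypothetical; the `int(time.time()*1000)` sampling and the two thresholds come from the table above.

```python
import time

BACKUP_THRESHOLD_MS = 30_000
RESTORE_THRESHOLD_MS = 60_000


def timed_ms(action):
    """Run action() and return (result, elapsed wall-clock milliseconds)."""
    start_ms = int(time.time() * 1000)
    result = action()
    end_ms = int(time.time() * 1000)
    return result, end_ms - start_ms


def within_bound(elapsed_ms: int, threshold_ms: int) -> bool:
    """The bound is inclusive: exactly threshold_ms still passes."""
    return elapsed_ms <= threshold_ms


# Example with a stand-in action; the lab would invoke the CLI here instead.
_, elapsed = timed_ms(lambda: sum(range(1000)))
print(within_bound(elapsed, BACKUP_THRESHOLD_MS))
```

Because `time.time()` is wall-clock time, a clock adjustment during the call could skew a reading; with bounds this lenient that is unlikely to change the outcome, though `time.monotonic()` would be the more robust choice if tighter bounds were ever introduced.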

VAL15 10-Check Matrix

Check	Name	Threshold	Phase
VAL15-01	backup_created	.dump file exists and `size > 0`; CLI exits 0	Backup
VAL15-02	backup_file_valid	`pg_restore -l` exits 0; output contains `TABLE DATA`	Validate
VAL15-03	backup_metadata_correct	`ha backup list --output json` contains `backup_id` with `status=completed`	Validate
VAL15-04	checksum_verified	CLI-reported checksum == SHA-256 of .dump file	Validate
VAL15-05	backup_timing_bound	`backup_ms ≤ 30000`	Backup
VAL15-06	restore_correct	Post-restore: `small_count=100`, `medium_count=1000`, both payload checks = `t`	Restore
VAL15-07	restore_timing_bound	`restore_ms ≤ 60000`	Restore
VAL15-08	multi_backup_inventory	`ha backup list` shows `count ≥ 2` with both backup IDs present	Multi-backup
VAL15-09	restore_requires_confirm	`ha backup restore` without `--confirm` exits non-zero and mentions `--confirm` in the error output	Error path
VAL15-10	audit_events_captured	≥ 1 `ha.backup.created` event + ≥ 1 `ha.backup.restored` event in audit store	Audit
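
The two inventory checks (VAL15-03 and VAL15-08) both operate on the JSON from `ha backup list --output json`. The sketch below assumes that output is a JSON array of records carrying `backup_id` and `status` keys; the array shape is an assumption, the function names are hypothetical, and the sample IDs are illustrative.

```python
import json


def load_inventory(raw_json: str) -> list:
    """Parse the list output; assumed to be a JSON array of backup records."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of backup records")
    return data


def metadata_correct(inventory: list, backup_id: str) -> bool:
    """VAL15-03: the backup is listed and its status is 'completed'."""
    return any(
        rec.get("backup_id") == backup_id and rec.get("status") == "completed"
        for rec in inventory
    )


def multi_backup_ok(inventory: list, expected_ids: list) -> bool:
    """VAL15-08: at least two backups listed, including every expected ID."""
    present = {rec.get("backup_id") for rec in inventory}
    return len(inventory) >= 2 and set(expected_ids) <= present


# Illustrative payload; real records may carry additional fields.
raw = '[{"backup_id": "backup-val15-a", "status": "completed"}, {"backup_id": "backup-val15-b", "status": "completed"}]'
inventory = load_inventory(raw)
print(metadata_correct(inventory, "backup-val15-a"))                    # True
print(multi_backup_ok(inventory, ["backup-val15-a", "backup-val15-b"]))  # True
```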

Pass/Fail Criteria

Outcome	Condition
PASS	All 10 checks pass
PARTIAL	Checks 1, 5, 6, 7 pass (file created, both timing bounds, restore correctness)
FAIL	Check 6 fails (restore did not return data to pre-backup state) OR check 1 fails (no backup file produced)

The mandatory check is VAL15-06 (restore correctness) — a backup that creates a valid-looking file but restores to wrong data is a critical defect regardless of timing or audit.
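
One way to encode the outcome table is sketched below. The table does not say what happens when checks 1 and 6 pass but 5 or 7 fails; this sketch assumes that case is a FAIL, which is an interpretation rather than documented behaviour. The function name is hypothetical.

```python
PARTIAL_REQUIRED = (1, 5, 6, 7)   # file created, both timing bounds, restore correctness
HARD_FAIL = (1, 6)                # failing either of these is always a FAIL


def classify(results: dict) -> str:
    """Map {check_number: passed} for checks 1..10 to PASS, PARTIAL, or FAIL.

    A check missing from the mapping is treated as failed.
    """
    def passed(n: int) -> bool:
        return bool(results.get(n, False))

    if all(passed(n) for n in range(1, 11)):
        return "PASS"
    if any(not passed(n) for n in HARD_FAIL):
        return "FAIL"
    if all(passed(n) for n in PARTIAL_REQUIRED):
        return "PARTIAL"
    # Assumption: not covered by the table, treated conservatively as FAIL.
    return "FAIL"


all_ok = {n: True for n in range(1, 11)}
print(classify(all_ok))                  # PASS
print(classify({**all_ok, 10: False}))   # PARTIAL
print(classify({**all_ok, 6: False}))    # FAIL
```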

Evidence Files

File	Description
`val15-pg-setup.txt`	Docker container IP, postgres URL, table create + fixture load output
`val15-ha-server.log`	HA server log (initial normal session — backup phase)
`val15-ha-server-maint.log`	HA server log (maintenance mode session — restore phase)
`val15-ha-server-post.log`	HA server log (post-restore normal session — multi-backup + error-path)
`val15-01-backup-create.txt`	`ha backup create` stdout+stderr (includes `backup_id`, `checksum`, `size_bytes`)
`val15-01-backup-file-stat.txt`	`stat` of `.dump` file (confirms existence and size)
`val15-02-backup-toc.txt`	`pg_restore -l` table-of-contents of the backup archive
`val15-03-backup-list.txt`	`ha backup list --output json` after first backup
`val15-03-metadata-check.txt`	Python assertion result: `backup_id` found with `status=completed`
`val15-04-checksum-verify.txt`	`cli_checksum=<hex>` + `file_checksum=<hex>` comparison
`val15-05-backup-timing.txt`	`backup_ms=<N>`
`val15-05-db-before.txt`	Row counts before backup (`small=100 medium=1000`)
`val15-06-post-backup-mutation.txt`	SQL UPDATE/DELETE output (post-backup data mutation)
`val15-06-restore.txt`	`ha backup restore` stdout+stderr
`val15-06-db-after-restore.txt`	Row counts after restore (should match pre-backup)
`val15-06-data-check.txt`	SQL spot-check query output (`100|1000|t|t`)
`val15-06-integrity-result.txt`	Python assertion result: `restore_correct=true small=100 medium=1000`
`val15-07-restore-timing.txt`	`restore_ms=<N>`
`val15-08-backup-create-2.txt`	Second `ha backup create` output
`val15-08-backup-list-multi.txt`	`ha backup list --output json` showing both backup IDs
`val15-08-inventory-check.txt`	Python assertion result: `multi_backup_count=2`
`val15-09-restore-no-confirm.txt`	`ha backup restore` without `--confirm` (expected error output mentioning `--confirm`)
`val15-10-audit-backup-created.json`	`audit query --event-type ha.backup.created` JSON result
`val15-10-audit-backup-restored.json`	`audit query --event-type ha.backup.restored` JSON result
`val15-10-audit-check.txt`	Python assertion result: event counts for both types
`backups/backup-val15-a.dump`	pg_dump custom-format archive for first backup
`backups/backup-val15-b.dump`	pg_dump custom-format archive for second backup
`val15-report.txt`	Human-readable 10-check PASS/FAIL report with timing values
`val15-report.json`	Machine-readable JSON report with `backup_ms`, `restore_ms`, `pass_count`

Known Failure Modes

Failure	Likely Cause	Mitigation
VAL15-01 FAIL: no .dump file	`ha backup create` failed; HA server not yet leader or PG unavailable	Check `val15-01-backup-create.txt` for error message; verify `val15-ha-server.log` shows `acquired leadership`
VAL15-04 FAIL: checksum mismatch	`ha backup create` stdout did not include `checksum=<hex>` in parseable form	Check `val15-01-backup-create.txt` for correct output format; verify CLI uses `--output text` (default)
VAL15-06 FAIL: row counts wrong	Restore did not run or completed against wrong DB	Check `val15-06-restore.txt` for error; verify maintenance mode was entered before restore
VAL15-09 FAIL: missing-confirm exits 0 or omits `--confirm` guidance	Restore accepted without `--confirm` or safety-gate text regressed	Investigate `ha_backup.go` `confirm` flag handling
VAL15-10 FAIL: no audit events	Audit dir not set or HA server using different audit dir	Verify `AUTONOMY_AUDIT_DIR` env var is set; check `val15-ha-server.log` for audit emitter startup
Docker not available	CI environment without Docker daemon	Function prints SKIP and exits 0; VAL15 not counted in pass total

Final Report Template

# VAL 15 — Backup/Restore Validation

Generated:     <timestamp>
Node:          cp-val15-node:19001

## Timing
Backup duration:  <N> ms  (threshold=30000)
Restore duration: <N> ms  (threshold=60000)

## Checks
VAL15-01 backup_created:          PASS
VAL15-02 backup_file_valid:        PASS
VAL15-03 backup_metadata_correct:  PASS
VAL15-04 checksum_verified:        PASS
VAL15-05 backup_timing_bound:      PASS  (backup_ms=<N>, threshold=30000)
VAL15-06 restore_correct:          PASS
VAL15-07 restore_timing_bound:     PASS  (restore_ms=<N>, threshold=60000)
VAL15-08 multi_backup_inventory:   PASS
VAL15-09 restore_requires_confirm: PASS
VAL15-10 audit_events_captured:    PASS

## Summary
pass=10  fail=0  total=10

Backup/Restore Readiness Assessment:

A readiness PASS requires at least VAL15-06 (restore correctness), VAL15-01 (backup file produced), and VAL15-09 (safety gate) to pass
Record backup_ms and restore_ms as baseline timing evidence
Repeat VAL15 after any change to pgstore/backup.go, PG version upgrade, or storage backend migration
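
For recording the baseline, the machine-readable report and the summary line from the template can be assembled as in the sketch below. Only `backup_ms`, `restore_ms`, and `pass_count` are documented fields of val15-report.json; the other keys (`fail_count`, `total`, `checks`) and the function names are assumptions for illustration.

```python
import json

CHECKS = [
    ("VAL15-01", "backup_created"),
    ("VAL15-02", "backup_file_valid"),
    ("VAL15-03", "backup_metadata_correct"),
    ("VAL15-04", "checksum_verified"),
    ("VAL15-05", "backup_timing_bound"),
    ("VAL15-06", "restore_correct"),
    ("VAL15-07", "restore_timing_bound"),
    ("VAL15-08", "multi_backup_inventory"),
    ("VAL15-09", "restore_requires_confirm"),
    ("VAL15-10", "audit_events_captured"),
]


def build_report(results: dict, backup_ms: int, restore_ms: int) -> dict:
    """results maps a check ID such as 'VAL15-01' to True/False."""
    pass_count = sum(1 for check_id, _ in CHECKS if results.get(check_id, False))
    total = len(CHECKS)
    return {
        "backup_ms": backup_ms,
        "restore_ms": restore_ms,
        "pass_count": pass_count,
        "fail_count": total - pass_count,  # assumed field
        "total": total,                    # assumed field
        "checks": {                        # assumed field
            check_id: "PASS" if results.get(check_id, False) else "FAIL"
            for check_id, _ in CHECKS
        },
    }


def render_summary(report: dict) -> str:
    """Render the per-check lines and the summary line of the text report."""
    lines = ["## Checks"]
    for check_id, name in CHECKS:
        lines.append(f"{check_id} {name}: {report['checks'][check_id]}")
    lines += [
        "",
        "## Summary",
        f"pass={report['pass_count']}  fail={report['fail_count']}  total={report['total']}",
    ]
    return "\n".join(lines)


report = build_report({check_id: True for check_id, _ in CHECKS}, 1200, 3400)
print(json.dumps(report, indent=2))
print(render_summary(report))
```

The timing values passed in the example (1200 and 3400) are placeholders, not measured results.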