# VAL 15 — Backup/Restore Validation

- Status: Implemented
- Runner: `run_backup_restore_val15_lab()` in `scripts/labs/run_cli_audit_lab.sh`
- Evidence dir: `$EVIDENCE_DIR/val15/`
- Port: `cp-val15-node` → 19001
## Purpose

Validates the `autonomy ha backup create/list/restore` workflow end-to-end:

- Confirms `backup create` produces a valid `pg_dump` custom-format archive
- Verifies stored metadata (`backup_id`, `status`, `checksum`) matches the produced file
- Measures backup and restore durations against bounded thresholds
- Proves restore correctness by injecting post-backup mutations and verifying reversion
- Validates that the backup inventory grows correctly across multiple backups
- Confirms the safety gate (the mandatory `--confirm` flag) blocks restore without explicit confirmation
- Captures `ha.backup.created` and `ha.backup.restored` audit events in the shared store
## Branch-Specific Rule Application

| Question | Answer |
|---|---|
| Is this covered by an existing LAB? | Partially. |
| Which LAB/evidence bundle is extended? | |
| New evidence files | 29 files in |
| Tutorial/runbook docs updated | |
| Reason new runner function required | The backup/restore steps in |
## Backup/Restore Test Plan

### Environment

```
val15-ha-net (Docker bridge network, isolated)
│
val15-pg-primary (postgres:16, DB: autonomy)
│
cp-val15-node:19001 (orchestrator_ha_server binary)
│ ← normal mode for backup create (phases 1–3)
cp-val15-maint:19001 (orchestrator_ha_server --maintenance-mode)
│ ← maintenance mode for restore (phase 4)
cp-val15-post:19001 (orchestrator_ha_server binary)
│ ← normal mode for multi-backup + error-path tests (phases 5–6)
```
### Test Sequence

| Phase | Actions |
|---|---|
| 1. Setup | Provision |
| 2. Backup | Start HA server (normal); |
| 3. Validate | `pg_restore -l`; `ha backup list`; checksum comparison |
| 4. Mutate + Restore | Mutate tables; restart HA in |
| 5. Multi-backup | Restart HA (normal); |
| 6. Error path | |
| 7. Audit | |
## Dataset / Fixture Strategy

Two tables loaded via `psql` `generate_series`:

| Table | Rows | Bytes/row | Approximate size |
|---|---|---|---|
| `val15_small` | 100 | 200 | ~20 KB |
| `val15_medium` | 1,000 | 1,000 | ~1 MB |

Why two tables?

- `val15_small` exercises the happy-path integrity check (spot-value comparison)
- `val15_medium` adds enough WAL and backup data to make timing measurements meaningful

Mutation (post-backup, pre-restore):

- `UPDATE val15_small SET payload = 'MUTATED' WHERE id <= 50` — corrupts the first 50 rows
- `DELETE FROM val15_medium WHERE id > 500` — removes half the medium table

These mutations are deliberately visible: after a correct restore, `val15_small` should have 100 rows with `LEFT(payload,1)='s'` and `val15_medium` should have 1,000 rows with `LEFT(payload,1)='m'`.
## Integrity Verification Method

### Row-count and spot-value check

```sql
SELECT
  (SELECT COUNT(*) FROM val15_small)::int AS small_count,
  (SELECT COUNT(*) FROM val15_medium)::int AS medium_count,
  (SELECT LEFT(payload,1) FROM val15_small WHERE id=1) = 's' AS small_payload_ok,
  (SELECT LEFT(payload,1) FROM val15_medium WHERE id=1) = 'm' AS medium_payload_ok
```

Expected post-restore: `small_count=100 | medium_count=1000 | t | t`
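The lab's post-restore assertion reduces to a predicate over the four columns returned by the query above. A minimal sketch (function name and argument shape are illustrative, not the lab's actual helper):

```python
def restore_ok(small_count: int, medium_count: int,
               small_payload_ok: bool, medium_payload_ok: bool) -> bool:
    # All four conditions must hold for the restore to count as correct (VAL15-06).
    return (small_count == 100
            and medium_count == 1000
            and small_payload_ok
            and medium_payload_ok)
```

A partially applied mutation (for example, only 50 surviving rows in `val15_small`) fails this predicate even when the payload spot-checks pass.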
### Checksum verification

```
cli_checksum  = extract "checksum=<hex>" from ha backup create output
file_checksum = SHA-256 of .dump file (Python hashlib, 64K chunks)
PASS if cli_checksum == file_checksum
```

The `ha backup create` command records the SHA-256 hex string in the `backup_inventory` table and includes it in the text output as `checksum=<hex>`. The lab computes the same hash independently of the CLI to confirm no truncation or corruption occurred.
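The independent hash computation can be sketched as a streaming SHA-256 over 64K chunks, as described above (the function name is hypothetical; the lab's actual helper may differ):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 64 * 1024) -> str:
    """Hash the file in 64K chunks so large dumps never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The check then passes when this hex string equals the `checksum=<hex>` value parsed from the CLI output.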
### pg_dump format validation

```shell
docker cp <dump_file> val15-pg-primary:/tmp/<backup_id>.dump
docker exec val15-pg-primary pg_restore -l /tmp/<backup_id>.dump
```

PASS if exit code = 0 AND the output contains "TABLE DATA".

The `-l` flag lists the archive's table of contents (TOC) without restoring anything. The harness first copies the archive into the container so `pg_restore` reads the real backup file generated by `ha backup create`, then captures the TOC in the evidence bundle.
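The pass condition can be expressed as a small predicate over the captured exit code and TOC text; a sketch (function name hypothetical) that also counts `TABLE DATA` entries for the evidence log:

```python
def toc_check(exit_code: int, toc_output: str):
    """Return (passed, table_data_entries) for a captured `pg_restore -l` run."""
    # Each data section in a custom-format TOC appears on a line containing "TABLE DATA".
    table_data = sum(1 for line in toc_output.splitlines() if "TABLE DATA" in line)
    return exit_code == 0 and table_data > 0, table_data
```

For this lab a valid archive should list two `TABLE DATA` entries, one per fixture table.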
## Timing Measurements

| Measurement | Method | Threshold |
|---|---|---|
| Backup duration (`backup_ms`) | | ≤ 30,000 ms |
| Restore duration (`restore_ms`) | | ≤ 60,000 ms |

Threshold rationale:

- 30 s for backup: `pg_dump -Fc` on ~1 MB of data on local Docker completes in < 2 s. 30 s is a conservative upper bound that accommodates cold Docker starts and slow CI environments.
- 60 s for restore: `pg_restore --clean --if-exists` requires a brief CP stop and re-creation of all objects. 60 s is 2× the backup threshold to account for the additional `DROP`/`CREATE` overhead and the brief HA server restart cycle.

Both thresholds are deliberately lenient — they flag runaway hangs or permission errors, not performance regressions.
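The bounded-duration checks reduce to "measure wall clock, compare to threshold". A minimal sketch (names hypothetical, not the harness's actual code):

```python
import time

def timed_step(step, threshold_ms: int):
    """Run `step`, returning (elapsed_ms, within_bound) against the threshold."""
    start = time.monotonic()  # monotonic clock: immune to wall-clock adjustments
    step()
    elapsed_ms = int((time.monotonic() - start) * 1000)
    return elapsed_ms, elapsed_ms <= threshold_ms
```

The backup phase would use `threshold_ms=30000` and the restore phase `threshold_ms=60000`, with `elapsed_ms` recorded as `backup_ms` / `restore_ms` in the evidence bundle.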
## VAL15 10-Check Matrix

| Check | Name | Threshold | Phase |
|---|---|---|---|
| VAL15-01 | backup_created | .dump file exists and | Backup |
| VAL15-02 | backup_file_valid | `pg_restore -l` exits 0 and the TOC contains "TABLE DATA" | Backup |
| VAL15-03 | backup_metadata_correct | | Validate |
| VAL15-04 | checksum_verified | CLI-reported checksum == SHA-256 of .dump file | Validate |
| VAL15-05 | backup_timing_bound | `backup_ms` ≤ 30,000 | Backup |
| VAL15-06 | restore_correct | Post-restore: `small_count=100`, `medium_count=1000`, both payload checks `t` | Restore |
| VAL15-07 | restore_timing_bound | `restore_ms` ≤ 60,000 | Restore |
| VAL15-08 | multi_backup_inventory | | Multi-backup |
| VAL15-09 | restore_requires_confirm | | Error path |
| VAL15-10 | audit_events_captured | ≥ 1 of each event type | Audit |
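VAL15-10 tallies the two audit event types in the shared store. A sketch of the counting step, assuming (this is an assumption, not the store's documented format) JSON-lines audit records with a `type` field:

```python
import json
from collections import Counter

def count_audit_events(lines):
    """Tally audit events by type, skipping malformed lines."""
    counts = Counter()
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # non-JSON noise in the log is ignored, not fatal
        counts[event.get("type", "unknown")] += 1
    return counts
```

The check passes when both `ha.backup.created` and `ha.backup.restored` appear at least once.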
## Pass/Fail Criteria

| Outcome | Condition |
|---|---|
| PASS | All 10 checks pass |
| PARTIAL | Checks 1, 5, 6, 7 pass (file created, both timing bounds, restore correctness) |
| FAIL | Check 6 fails (restore did not return data to its pre-backup state) OR check 1 fails (no backup file produced) |

The mandatory check is VAL15-06 (restore correctness) — a backup that creates a valid-looking file but restores to the wrong data is a critical defect regardless of timing or audit results.
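The criteria above can be sketched as a decision function; note the final branch (treating "PARTIAL criteria not met" as FAIL) is an assumption the table leaves open:

```python
def val15_outcome(checks: dict) -> str:
    """Map the 10 check results (check id -> bool) to PASS/PARTIAL/FAIL."""
    # Hard failures: restore incorrect, or no backup file produced.
    if not checks["VAL15-06"] or not checks["VAL15-01"]:
        return "FAIL"
    if all(checks.values()):
        return "PASS"
    # PARTIAL: file created, both timing bounds, restore correctness.
    if all(checks[c] for c in ("VAL15-01", "VAL15-05", "VAL15-06", "VAL15-07")):
        return "PARTIAL"
    return "FAIL"  # assumption: anything outside the stated rows counts as FAIL
```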
## Evidence Files

| File | Description |
|---|---|
| | Docker container IP, postgres URL, table create + fixture load output |
| | HA server log (initial normal session — backup phase) |
| | HA server log (maintenance mode session — restore phase) |
| | HA server log (post-restore normal session — multi-backup + error-path) |
| | |
| | |
| | |
| | |
| | Python assertion result: |
| | |
| | |
| | Row counts before backup ( |
| | SQL UPDATE/DELETE output (post-backup data mutation) |
| | |
| | Row counts after restore (should match pre-backup) |
| | SQL spot-check query output ( |
| | Python assertion result: |
| | |
| | Second |
| | |
| | Python assertion result: |
| | |
| | |
| | |
| | Python assertion result: event counts for both types |
| | pg_dump custom-format archive for first backup |
| | pg_dump custom-format archive for second backup |
| | Human-readable 10-check PASS/FAIL report with timing values |
| | Machine-readable JSON report with |
## Known Failure Modes

| Failure | Likely Cause | Mitigation |
|---|---|---|
| VAL15-01 FAIL: no .dump file | | Check |
| VAL15-04 FAIL: checksum mismatch | | Check |
| VAL15-06 FAIL: row counts wrong | Restore did not run or completed against the wrong DB | Check |
| VAL15-09 FAIL: missing-confirm exits 0 or omits | Restore accepted without `--confirm` | Investigate |
| VAL15-10 FAIL: no audit events | Audit dir not set or HA server using a different audit dir | Verify |
| Docker not available | CI environment without a Docker daemon | Function prints SKIP and exits 0; VAL15 is not counted in the pass total |
## Final Report Template

```
# VAL 15 — Backup/Restore Validation
Generated: <timestamp>
Node: cp-val15-node:19001

## Timing
Backup duration: <N> ms (threshold=30000)
Restore duration: <N> ms (threshold=60000)

## Checks
VAL15-01 backup_created: PASS
VAL15-02 backup_file_valid: PASS
VAL15-03 backup_metadata_correct: PASS
VAL15-04 checksum_verified: PASS
VAL15-05 backup_timing_bound: PASS (backup_ms=<N>, threshold=30000)
VAL15-06 restore_correct: PASS
VAL15-07 restore_timing_bound: PASS (restore_ms=<N>, threshold=60000)
VAL15-08 multi_backup_inventory: PASS
VAL15-09 restore_requires_confirm: PASS
VAL15-10 audit_events_captured: PASS

## Summary
pass=10 fail=0 total=10
```
Backup/Restore Readiness Assessment:

- PASS requires VAL15-06 (restore correctness) + VAL15-01 (backup file produced) + VAL15-09 (safety gate)
- Record `backup_ms` and `restore_ms` as baseline timing evidence
- Repeat VAL15 after any change to `pgstore/backup.go`, a PG version upgrade, or a storage backend migration