VAL 15 — Backup/Restore Validation

Status: Implemented
Runner: run_backup_restore_val15_lab() in scripts/labs/run_cli_audit_lab.sh
Evidence dir: $EVIDENCE_DIR/val15/
Port: cp-val15-node → 19001


Purpose

Validates the autonomy ha backup create/list/restore workflow end-to-end:

  • Confirms backup creates a valid pg_dump custom-format archive

  • Verifies stored metadata (backup_id, status, checksum) matches the produced file

  • Measures backup and restore durations against bounded thresholds

  • Proves restore correctness by injecting post-backup mutations and verifying reversion

  • Validates the backup inventory grows correctly with multiple backups

  • Confirms the safety gate (mandatory --confirm flag) blocks restore without explicit confirmation

  • Captures ha.backup.created and ha.backup.restored audit events in the shared store


Branch-Specific Rule Application

| Question | Answer |
| --- | --- |
| Is this covered by an existing LAB? | Partially. run_ha_lab() exercises ha backup create/list/restore as part of a larger HA flow (lines 1283–1334). It does NOT cover: backup duration measurement, checksum verification, restore correctness assertion with row-count comparison, multi-backup inventory, error-path testing (missing --confirm), or a standalone 10-check pass/fail report. |
| Which LAB/evidence bundle is extended? | run_cli_audit_lab.sh: new function run_backup_restore_val15_lab() appended as slice 25. Reuses the start_ha_server(), wait_for_http(), wait_for_log(), and wait_for_pg_container() helpers defined in the same file. |
| New evidence files | 29 files in $EVIDENCE_DIR/val15/ (see the Evidence Files table below). |
| Tutorial/runbook docs updated | docs/tutorials/cli-audit-lab.md §4 (slice 25), §5 (val15/ files), §6 (expected results), §8 (scope). |
| Reason new runner function required | The backup/restore steps in run_ha_lab() are embedded mid-function in a larger HA bring-up/tear-down flow and share Docker infrastructure (pr17ha-primary, pr17-ha-net). Injecting timing measurement, integrity assertions, multi-backup tests, and error-path tests into that flow would break the HA lab’s sequencing. A narrowly scoped run_backup_restore_val15_lab() with isolated infrastructure is cleaner. |


Backup/Restore Test Plan

Environment

val15-ha-net  (Docker bridge network, isolated)
     │
val15-pg-primary  (postgres:16, DB: autonomy)
     │
cp-val15-node:19001  (orchestrator_ha_server binary)
     │  ← normal mode for backup create (phases 1–3)
cp-val15-maint:19001 (orchestrator_ha_server --maintenance-mode)
     │  ← maintenance mode for restore (phase 4)
cp-val15-post:19001  (orchestrator_ha_server binary)
     │  ← normal mode for multi-backup + error-path tests (phases 5–6)

Test Sequence

| Phase | Actions |
| --- | --- |
| 1. Setup | Provision val15-pg-primary Docker container; load fixture tables |
| 2. Backup | Start HA server (normal); ha backup create; capture timing + checksum |
| 3. Validate | pg_restore -l; ha backup list; checksum comparison |
| 4. Mutate + Restore | Mutate tables; restart HA in --maintenance-mode; ha backup restore; verify data |
| 5. Multi-backup | Restart HA (normal); ha backup create second backup; ha backup list shows count ≥ 2 |
| 6. Error path | ha backup restore without --confirm; must exit non-zero |
| 7. Audit | audit query --event-type ha.backup.created/restored |


Dataset / Fixture Strategy

Two tables loaded via psql generate_series:

| Table | Rows | Bytes/row | Approximate size |
| --- | --- | --- | --- |
| val15_small | 100 | 200 | ~20 KB |
| val15_medium | 1,000 | 1,000 | ~1 MB |

Why two tables?

  • val15_small exercises the happy-path integrity check (spot-value comparison)

  • val15_medium adds enough WAL and backup data to make timing measurements meaningful

Mutation (post-backup, pre-restore):

  • UPDATE val15_small SET payload = 'MUTATED' WHERE id <= 50 — corrupts first 50 rows

  • DELETE FROM val15_medium WHERE id > 500 — removes half the medium table

These mutations are deliberately visible: after a correct restore, val15_small should have 100 rows with LEFT(payload,1)='s' and val15_medium should have 1,000 rows with LEFT(payload,1)='m'.
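The reason the check needs both a row count and a spot value can be shown with a small model. This is a stand-in only: it uses Python's sqlite3 in place of PostgreSQL and substr() in place of LEFT(), and it assumes each payload is simply the letter 's' or 'm' repeated to the row size (the real fixture payload contents are not specified in this plan).

```python
import sqlite3


def build_fixture(conn):
    """Create stand-in fixture tables (payload contents are an assumption)."""
    conn.execute("CREATE TABLE val15_small (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.execute("CREATE TABLE val15_medium (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO val15_small VALUES (?, ?)",
                     [(i, "s" * 200) for i in range(1, 101)])
    conn.executemany("INSERT INTO val15_medium VALUES (?, ?)",
                     [(i, "m" * 1000) for i in range(1, 1001)])


def mutate(conn):
    """Apply the two post-backup mutations from the test plan."""
    conn.execute("UPDATE val15_small SET payload = 'MUTATED' WHERE id <= 50")
    conn.execute("DELETE FROM val15_medium WHERE id > 500")


def snapshot(conn):
    """Return (small_count, medium_count, small_payload_ok, medium_payload_ok)."""
    row = conn.execute(
        "SELECT (SELECT COUNT(*) FROM val15_small),"
        "       (SELECT COUNT(*) FROM val15_medium),"
        "       (SELECT substr(payload, 1, 1) FROM val15_small WHERE id = 1) = 's',"
        "       (SELECT substr(payload, 1, 1) FROM val15_medium WHERE id = 1) = 'm'"
    ).fetchone()
    return (row[0], row[1], bool(row[2]), bool(row[3]))


conn = sqlite3.connect(":memory:")
build_fixture(conn)
before = snapshot(conn)
mutate(conn)
after = snapshot(conn)
print("before:", before)
print("after: ", after)
```

In this model the UPDATE leaves the small table's row count at 100, so only the spot value exposes it, while the DELETE leaves row 1 of the medium table intact, so only the row count exposes it. Neither signal alone would catch both mutations.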


Integrity Verification Method

Row-count and spot-value check

SELECT
  (SELECT COUNT(*) FROM val15_small)::int  AS small_count,
  (SELECT COUNT(*) FROM val15_medium)::int AS medium_count,
  (SELECT LEFT(payload,1) FROM val15_small  WHERE id=1) = 's' AS small_payload_ok,
  (SELECT LEFT(payload,1) FROM val15_medium WHERE id=1) = 'm' AS medium_payload_ok

Expected post-restore: small_count=100 | medium_count=1000 | t | t
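A minimal sketch of how the query result could be turned into the VAL15-06 verdict. It assumes the harness captures a single pipe-separated row such as 100|1000|t|t (the form shown for val15-06-data-check.txt); the function names are illustrative and not taken from the lab script.

```python
def parse_integrity_row(line):
    """Parse one 'small|medium|t|t' row into typed values."""
    small, medium, small_ok, medium_ok = (f.strip() for f in line.strip().split("|"))
    return int(small), int(medium), small_ok == "t", medium_ok == "t"


def restore_correct(line, expected_small=100, expected_medium=1000):
    """True only if both counts match and both spot-value checks are 't'."""
    try:
        small, medium, small_ok, medium_ok = parse_integrity_row(line)
    except ValueError:
        # Empty or malformed output counts as a failed check, not a crash.
        return False
    return (small == expected_small and medium == expected_medium
            and small_ok and medium_ok)


print(restore_correct("100|1000|t|t"))
print(restore_correct("100|500|f|t"))
```

Treating unparseable output as a failure keeps a restore that errored out before the query ran from being reported as correct.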

Checksum verification

cli_checksum  = extract "checksum=<hex>" from ha backup create output
file_checksum = SHA-256 of .dump file (Python hashlib, 64K chunks)
PASS if cli_checksum == file_checksum

The ha backup create command records the SHA-256 hex string in the backup_inventory table and includes it in the text output as checksum=<hex>. The lab computes the same hash independently of the CLI to confirm no truncation or corruption occurred.
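The comparison above could be implemented along these lines. Only the checksum=<hex> token and the 64K chunk size come from this plan; the rest of the CLI output format is unknown, so the regular expression and the function names are assumptions.

```python
import hashlib
import re


def extract_cli_checksum(cli_output):
    """Return the hex digest from a 'checksum=<hex>' token, or None if absent."""
    match = re.search(r"checksum=([0-9a-fA-F]{64})\b", cli_output)
    return match.group(1).lower() if match else None


def sha256_of_file(path, chunk_size=64 * 1024):
    """Hash the file in fixed-size chunks so large dumps are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_verified(cli_output, dump_path):
    """PASS condition for VAL15-04: CLI-reported digest equals the file digest."""
    cli_checksum = extract_cli_checksum(cli_output)
    return cli_checksum is not None and cli_checksum == sha256_of_file(dump_path)
```

Returning None when no token is found lets the caller distinguish unparseable CLI output from a genuine digest mismatch, which are listed as separate causes under Known Failure Modes.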

pg_dump format validation

docker cp <dump_file> val15-pg-primary:/tmp/<backup_id>.dump
docker exec val15-pg-primary pg_restore -l /tmp/<backup_id>.dump
PASS if exit code = 0 AND output contains "TABLE DATA"

The -l flag lists the TOC without restoring. The harness first copies the archive into the container so pg_restore reads the real backup file generated by ha backup create, then captures the TOC in the evidence bundle.
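A sketch of the same two steps driven from Python, with the pass condition separated out so it can be exercised without Docker. The docker cp, docker exec, and pg_restore -l invocations mirror the commands above; the function names and the injectable run parameter are illustrative.

```python
import subprocess


def toc_commands(dump_file, backup_id, container="val15-pg-primary"):
    """Build the copy and list commands for one backup archive."""
    target = f"/tmp/{backup_id}.dump"
    return (
        ["docker", "cp", dump_file, f"{container}:{target}"],
        ["docker", "exec", container, "pg_restore", "-l", target],
    )


def backup_file_valid(returncode, toc_output):
    """PASS condition for VAL15-02: exit code 0 and a TABLE DATA entry in the TOC."""
    return returncode == 0 and "TABLE DATA" in toc_output


def validate_dump(dump_file, backup_id, run=subprocess.run):
    """Copy the archive into the container, list its TOC, return (ok, captured_text)."""
    copy_cmd, list_cmd = toc_commands(dump_file, backup_id)
    copied = run(copy_cmd, capture_output=True, text=True)
    if copied.returncode != 0:
        return False, copied.stderr
    listed = run(list_cmd, capture_output=True, text=True)
    return backup_file_valid(listed.returncode, listed.stdout), listed.stdout
```

The second element of the returned tuple is the text that would be written to val15-02-backup-toc.txt.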


Timing Measurements

| Measurement | Method | Threshold |
| --- | --- | --- |
| backup_ms | int(time.time()*1000) before/after ha backup create CLI call | ≤ 30,000 ms |
| restore_ms | int(time.time()*1000) before/after ha backup restore CLI call | ≤ 60,000 ms |

Threshold rationale:

  • 30 s for backup: pg_dump -Fc on ~1 MB of data on local Docker completes in < 2 s. 30 s is a conservative upper bound that accommodates cold Docker starts and slow CI environments.

  • 60 s for restore: pg_restore --clean --if-exists requires a brief CP stop and re-creation of all objects. 60 s is 2× the backup threshold to account for the additional DROP/CREATE overhead and the brief HA server restart cycle.

Both thresholds are deliberately lenient: they are meant to flag runaway hangs or permission errors, not performance regressions.
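The measurement method can be sketched as a small wrapper. The timing expression and the two thresholds come from this plan; the wrapper name and the stand-in command are illustrative, and the real lab would time the ha backup create and ha backup restore CLI calls instead.

```python
import subprocess
import sys
import time

BACKUP_THRESHOLD_MS = 30_000
RESTORE_THRESHOLD_MS = 60_000


def timed_run(cmd):
    """Run cmd and return (exit_code, elapsed_ms) using wall-clock milliseconds."""
    start_ms = int(time.time() * 1000)
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed_ms = int(time.time() * 1000) - start_ms
    return result.returncode, elapsed_ms


def within_bound(elapsed_ms, threshold_ms):
    """Inclusive bound check; a negative reading is treated as invalid."""
    return 0 <= elapsed_ms <= threshold_ms


# Stand-in command so the sketch runs anywhere.
exit_code, backup_ms = timed_run([sys.executable, "-c", "pass"])
print(f"backup_ms={backup_ms} ok={within_bound(backup_ms, BACKUP_THRESHOLD_MS)}")
```

time.time() follows the system clock, so a clock adjustment during the run could distort a reading; time.monotonic() would avoid that, although with thresholds this lenient the difference is unlikely to change a verdict.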


VAL15 10-Check Matrix

| Check | Name | Threshold | Phase |
| --- | --- | --- | --- |
| VAL15-01 | backup_created | .dump file exists and size > 0; CLI exits 0 | Backup |
| VAL15-02 | backup_file_valid | pg_restore -l exits 0; output contains TABLE DATA | Backup |
| VAL15-03 | backup_metadata_correct | ha backup list --output json contains backup_id with status=completed | Validate |
| VAL15-04 | checksum_verified | CLI-reported checksum == SHA-256 of .dump file | Validate |
| VAL15-05 | backup_timing_bound | backup_ms ≤ 30000 | Backup |
| VAL15-06 | restore_correct | Post-restore: small_count=100, medium_count=1000, both payload checks = t | Restore |
| VAL15-07 | restore_timing_bound | restore_ms ≤ 60000 | Restore |
| VAL15-08 | multi_backup_inventory | ha backup list shows count ≥ 2 with both backup IDs present | Multi-backup |
| VAL15-09 | restore_requires_confirm | ha backup restore without --confirm exits non-zero and mentions --confirm in the error output | Error path |
| VAL15-10 | audit_events_captured | ≥ 1 ha.backup.created event + ≥ 1 ha.backup.restored event in audit store | Audit |
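Three of the checks reduce to simple predicates over captured CLI output. The sketch below covers VAL15-03, VAL15-08, and VAL15-09. The JSON shape is an assumption: it treats ha backup list --output json as a top-level array of objects carrying backup_id and status fields, which this plan names but does not show. The function names are illustrative.

```python
import json


def load_backups(list_json):
    """Parse the inventory; ASSUMES a top-level JSON array of objects."""
    data = json.loads(list_json)
    if not isinstance(data, list):
        return []
    return [entry for entry in data if isinstance(entry, dict)]


def metadata_correct(list_json, backup_id):
    """VAL15-03: the backup is listed with status=completed."""
    return any(entry.get("backup_id") == backup_id
               and entry.get("status") == "completed"
               for entry in load_backups(list_json))


def multi_backup_inventory(list_json, expected_ids):
    """VAL15-08: at least two distinct backups, including every expected ID."""
    ids = {entry.get("backup_id") for entry in load_backups(list_json)}
    return len(ids) >= 2 and set(expected_ids) <= ids


def restore_requires_confirm(exit_code, error_output):
    """VAL15-09: restore without --confirm fails and says which flag is missing."""
    return exit_code != 0 and "--confirm" in error_output
```

If the real output wraps the array in an envelope object, only load_backups() would need to change.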


Pass/Fail Criteria

| Outcome | Condition |
| --- | --- |
| PASS | All 10 checks pass |
| PARTIAL | Checks 1, 5, 6, 7 pass (file created, both timing bounds, restore correctness) but at least one other check fails |
| FAIL | Check 6 fails (restore did not return data to pre-backup state) OR check 1 fails (no backup file produced) |

The mandatory check is VAL15-06 (restore correctness) — a backup that creates a valid-looking file but restores to wrong data is a critical defect regardless of timing or audit.
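The outcome rules can be expressed as a small classifier. One case is not covered by the criteria: checks 1 and 6 pass but a timing check (5 or 7) fails. The sketch treats that case as FAIL, which is an assumption made here for conservatism and not a rule stated in this plan.

```python
ALL_CHECKS = [f"VAL15-{n:02d}" for n in range(1, 11)]
HARD_FAIL = ("VAL15-01", "VAL15-06")
REQUIRED_FOR_PARTIAL = ("VAL15-01", "VAL15-05", "VAL15-06", "VAL15-07")


def classify(results):
    """Map {check_id: bool} to PASS, PARTIAL, or FAIL; missing checks count as failed."""
    passed = {cid for cid in ALL_CHECKS if results.get(cid) is True}
    if len(passed) == len(ALL_CHECKS):
        return "PASS"
    if any(cid not in passed for cid in HARD_FAIL):
        return "FAIL"
    if all(cid in passed for cid in REQUIRED_FOR_PARTIAL):
        return "PARTIAL"
    # ASSUMPTION: checks 1 and 6 passed but a timing bound failed. The
    # criteria do not define this case, so it is reported as FAIL here.
    return "FAIL"


all_pass = {cid: True for cid in ALL_CHECKS}
print(classify(all_pass))
print(classify({**all_pass, "VAL15-10": False}))
print(classify({**all_pass, "VAL15-06": False}))
```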


Evidence Files

| File | Description |
| --- | --- |
| val15-pg-setup.txt | Docker container IP, postgres URL, table create + fixture load output |
| val15-ha-server.log | HA server log (initial normal session, backup phase) |
| val15-ha-server-maint.log | HA server log (maintenance mode session, restore phase) |
| val15-ha-server-post.log | HA server log (post-restore normal session, multi-backup + error-path) |
| val15-01-backup-create.txt | ha backup create stdout+stderr (includes backup_id, checksum, size_bytes) |
| val15-01-backup-file-stat.txt | stat of .dump file (confirms existence and size) |
| val15-02-backup-toc.txt | pg_restore -l table-of-contents of the backup archive |
| val15-03-backup-list.txt | ha backup list --output json after first backup |
| val15-03-metadata-check.txt | Python assertion result: backup_id found with status=completed |
| val15-04-checksum-verify.txt | cli_checksum=<hex> + file_checksum=<hex> comparison |
| val15-05-backup-timing.txt | backup_ms=<N> |
| val15-05-db-before.txt | Row counts before backup (small=100 medium=1000) |
| val15-06-post-backup-mutation.txt | SQL UPDATE/DELETE output (post-backup data mutation) |
| val15-06-restore.txt | ha backup restore stdout+stderr |
| val15-06-db-after-restore.txt | Row counts after restore (should match pre-backup) |
| val15-06-data-check.txt | SQL spot-check query output (100\|1000\|t\|t) |
| val15-06-integrity-result.txt | Python assertion result: restore_correct=true small=100 medium=1000 |
| val15-07-restore-timing.txt | restore_ms=<N> |
| val15-08-backup-create-2.txt | Second ha backup create output |
| val15-08-backup-list-multi.txt | ha backup list --output json showing both backup IDs |
| val15-08-inventory-check.txt | Python assertion result: multi_backup_count=2 |
| val15-09-restore-no-confirm.txt | ha backup restore without --confirm (expected error output mentioning --confirm) |
| val15-10-audit-backup-created.json | audit query --event-type ha.backup.created JSON result |
| val15-10-audit-backup-restored.json | audit query --event-type ha.backup.restored JSON result |
| val15-10-audit-check.txt | Python assertion result: event counts for both types |
| backups/backup-val15-a.dump | pg_dump custom-format archive for first backup |
| backups/backup-val15-b.dump | pg_dump custom-format archive for second backup |
| val15-report.txt | Human-readable 10-check PASS/FAIL report with timing values |
| val15-report.json | Machine-readable JSON report with backup_ms, restore_ms, pass_count |


Known Failure Modes

| Failure | Likely Cause | Mitigation |
| --- | --- | --- |
| VAL15-01 FAIL: no .dump file | ha backup create failed; HA server not yet leader or PG unavailable | Check val15-01-backup-create.txt for error message; verify val15-ha-server.log shows acquired leadership |
| VAL15-04 FAIL: checksum mismatch | ha backup create stdout did not include checksum=<hex> in parseable form | Check val15-01-backup-create.txt for correct output format; verify CLI uses --output text (default) |
| VAL15-06 FAIL: row counts wrong | Restore did not run or completed against wrong DB | Check val15-06-restore.txt for error; verify maintenance mode was entered before restore |
| VAL15-09 FAIL: missing-confirm exits 0 or omits --confirm guidance | Restore accepted without --confirm or safety-gate text regressed | Investigate ha_backup.go confirm flag handling |
| VAL15-10 FAIL: no audit events | Audit dir not set or HA server using different audit dir | Verify AUTONOMY_AUDIT_DIR env var is set; check val15-ha-server.log for audit emitter startup |
| Docker not available | CI environment without Docker daemon | Function prints SKIP and exits 0; VAL15 not counted in pass total |


Final Report Template

# VAL 15 — Backup/Restore Validation

Generated:     <timestamp>
Node:          cp-val15-node:19001

## Timing
Backup duration:  <N> ms  (threshold=30000)
Restore duration: <N> ms  (threshold=60000)

## Checks
VAL15-01 backup_created:           PASS
VAL15-02 backup_file_valid:        PASS
VAL15-03 backup_metadata_correct:  PASS
VAL15-04 checksum_verified:        PASS
VAL15-05 backup_timing_bound:      PASS  (backup_ms=<N>, threshold=30000)
VAL15-06 restore_correct:          PASS
VAL15-07 restore_timing_bound:     PASS  (restore_ms=<N>, threshold=60000)
VAL15-08 multi_backup_inventory:   PASS
VAL15-09 restore_requires_confirm: PASS
VAL15-10 audit_events_captured:    PASS

## Summary
pass=10  fail=0  total=10

Backup/Restore Readiness Assessment:

  • A readiness PASS requires, at minimum, that VAL15-06 (restore correctness), VAL15-01 (backup file produced), and VAL15-09 (safety gate) all pass

  • Record backup_ms and restore_ms as baseline timing evidence

  • Repeat VAL15 after any change to pgstore/backup.go, PG version upgrade, or storage backend migration
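A minimal sketch of how the check lines, the summary line, and the machine-readable report could be produced from one set of results. The check names, the pass=/fail=/total= summary line, and the backup_ms, restore_ms, and pass_count fields come from this plan. The remaining JSON keys and the function name are assumptions, and the per-check timing annotations from the template are omitted for brevity.

```python
import json

CHECKS = [
    ("VAL15-01", "backup_created"),
    ("VAL15-02", "backup_file_valid"),
    ("VAL15-03", "backup_metadata_correct"),
    ("VAL15-04", "checksum_verified"),
    ("VAL15-05", "backup_timing_bound"),
    ("VAL15-06", "restore_correct"),
    ("VAL15-07", "restore_timing_bound"),
    ("VAL15-08", "multi_backup_inventory"),
    ("VAL15-09", "restore_requires_confirm"),
    ("VAL15-10", "audit_events_captured"),
]


def build_reports(results, backup_ms, restore_ms):
    """Return (text_report, json_report) for a {check_id: bool} result map."""
    total = len(CHECKS)
    pass_count = sum(1 for check_id, _ in CHECKS if results.get(check_id))
    lines = []
    for check_id, name in CHECKS:
        label = name + ":"
        status = "PASS" if results.get(check_id) else "FAIL"
        lines.append(f"{check_id} {label:<26}{status}")
    lines.append(f"pass={pass_count}  fail={total - pass_count}  total={total}")
    machine = {
        "backup_ms": backup_ms,
        "restore_ms": restore_ms,
        "pass_count": pass_count,
        # ASSUMPTION: the keys below are illustrative additions.
        "fail_count": total - pass_count,
        "total": total,
    }
    return "\n".join(lines), json.dumps(machine, indent=2)


text_report, json_report = build_reports(
    {check_id: True for check_id, _ in CHECKS}, backup_ms=850, restore_ms=2100)
print(text_report)
```

The timing values in the usage line are placeholders, not measured results.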