VAL26 — HA Proof Report Generator

Audience: engineering leads, product managers, and external reviewers who need a consolidated, evidence-backed HA readiness assessment.

VAL26 is a report generator, not a test runner. It reads evidence from five completed HA validation slices (VAL13–VAL17) and produces a single production-quality proof report with explicit readiness conclusions.

1. Scope

VAL26 consolidates evidence from:

Slice   Name                           What it proves
-----   ----                           --------------
VAL13   HA Failover Validation         SIGTERM/SIGKILL leader failover timing (≤5 s), zero data loss
VAL14   HA Replication Lag Baseline    write_lag distribution; derived healthy/degraded/alert thresholds
VAL15   Backup/Restore Validation      Backup timing, SHA-256 integrity, restore correctness (row-count + payload)
VAL16   Split-Brain Chaos Validation   Epoch-divergence detection, metadata-injection recovery, user data integrity
VAL17   Quorum Loss Validation         Loss detection timing (≤30 s), write blocking, recovery detection after PostgreSQL is restored

Branch rule: coverage by existing runner

Existing asset              Coverage
--------------              --------
run_soak_val18_report.sh    30-day HA soak — VAL18 only, different metrics (failover count, uptime, data continuity over 30 days)
run_cli_audit_lab.sh        Runs VAL13–VAL17 but produces per-slice JSON reports with no cross-slice aggregation

New aggregator required. No existing script reads and combines VAL13–VAL17 evidence into a single HA proof artifact with readiness conclusions.

Out of scope

  • 30-day HA durability soak (covered by VAL18 framework, not yet run)

  • Streaming-replication promotion failover (standby→primary under real PG HA)

  • Real network partitions (iptables / tc — all chaos is SIGTERM/SIGKILL or docker stop)

  • Concurrent-under-kill write requests

  • apply_lag measurement (VAL14 measures write_lag only)

  • PITR / WAL-based backup (VAL15 uses pg_dump/pg_restore)

  • Multi-AZ or multi-region topology

  • Write-gate HTTP endpoint rejection test (HA binary in this deployment does not expose /v1/rollouts; write-gate is verified via status endpoint fields)

2. Evidence Structure

VAL26 reads from the evidence directory produced by run_cli_audit_lab.sh (default: evidence/cli-audit-lab-YYYY-MM-DD).

Input file               Produced by                          Contents
----------               -----------                          --------
val13/val13-report.json  run_ha_failover_val13_lab()          Failover timing (SIGTERM/SIGKILL/rapid), 10-check results
val14/val14-report.json  run_ha_replication_lag_val14_lab()   Lag distribution (idle/light/heavy), derived thresholds, 10-check results
val15/val15-report.json  run_backup_restore_val15_lab()       Backup/restore timing, 10-check results
val16/val16-report.json  run_split_brain_chaos_val16_lab()    Split-brain detection/recovery outcomes, 10-check results
val17/val17-report.json  run_quorum_loss_val17_lab()          Loss detection timing, write-block state, 10-check results

VAL26 expects these five reports to come from one coherent run_cli_audit_lab.sh evidence set. The generator checks each report’s embedded timestamp and requires all found reports to fall within a single 6-hour evidence window before issuing a design-partner readiness conclusion.

Missing slice reports are reported as MISSING in the coverage table rather than aborting. Reports that exist but do not match the expected schema are degraded to MISSING with a schema-mismatch detail.
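The 6-hour coherence requirement can be sketched as a small shell check. This is a minimal sketch, assuming each slice report's embedded timestamp is an ISO-8601 UTC string that has already been extracted from the JSON; it relies on GNU `date` for epoch conversion.

```shell
# Minimal sketch of the 6-hour evidence-window check. Timestamps are
# assumed to be ISO-8601 UTC strings already pulled out of the slice
# reports; requires GNU date for -d parsing.
within_evidence_window() {
  local min=9999999999 max=0 t epoch
  for t in "$@"; do
    epoch=$(date -u -d "$t" +%s) || return 2   # unparseable timestamp
    if (( epoch < min )); then min=$epoch; fi
    if (( epoch > max )); then max=$epoch; fi
  done
  (( max - min <= 21600 ))                     # 6 hours = 21,600 s
}
```

The check succeeds only when the spread between the earliest and latest found report is at most six hours.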

3. Metric Definitions and Targets

VAL26 reports on the following metrics. Workplan targets are annotated with [target]. VAL14 alerting thresholds are derived from measured lab values and are explicitly labelled as derived — they are NOT workplan targets.

Failover Timeline (VAL13)

Metric                       Target            Source
------                       ------            ------
SIGTERM graceful failover    ≤ 5,000 ms        VAL13 / Gap HA-003
SIGKILL unplanned failover   ≤ 5,000 ms        VAL13 / Gap HA-003
3× rapid cycle timings       All ≤ 5,000 ms    VAL13 resilience check
Data loss across failover    = 0 rows          VAL13 mandatory criterion

Method: poll follower /v1/ha/status holder_id every 50 ms after leader kill.
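The polling method can be sketched as below. `get_holder_id` is a stub standing in for the real status query (in the lab it would be a `curl` against the follower's /v1/ha/status with the holder_id field extracted); the stub's behaviour — leadership flipping after three polls — is purely illustrative.

```shell
HOLDER_STATE=$(mktemp)            # file-backed counter so the stub survives subshells
echo 0 > "$HOLDER_STATE"

get_holder_id() {                 # stub; real lab queries the follower's /v1/ha/status
  local n
  n=$(( $(cat "$HOLDER_STATE") + 1 ))
  echo "$n" > "$HOLDER_STATE"
  if (( n <= 3 )); then echo "node1"; else echo "node2"; fi
}

measure_failover_ms() {
  local old_leader=$1 elapsed=0 step=50           # poll every 50 ms
  while [ "$(get_holder_id)" = "$old_leader" ]; do
    sleep 0.05
    elapsed=$(( elapsed + step ))
    if (( elapsed > 5000 )); then return 1; fi    # target: <= 5,000 ms
  done
  echo "$elapsed"
}
```

With the stub above, leadership changes on the fourth poll, so `measure_failover_ms node1` reports 150 ms of polling before the new holder is observed.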

Replication Lag Distribution (VAL14)

Metric          What it is                           Threshold
------          ----------                           ---------
idle_p95_ms     write_lag at rest                    Baseline noise floor (informational)
light_p95_ms    write_lag under 100 rows × 500 B     N/A (feeds threshold derivation)
light_drain_ms  LSN gap drain after light workload   ≤ 2,000 ms [target]
heavy_p99_ms    write_lag under 500 rows × 2,000 B   N/A (informational)
heavy_drain_ms  LSN gap drain after heavy workload   ≤ 5,000 ms [target]

Derived alerting thresholds (computed from measured values, NOT from the workplan):

observed_p95   = max(idle_p95_ms, light_p95_ms)
healthy_ms     = max(observed_p95 × 3 + 1, 10)
degraded_ms    = max(healthy_ms × 10, 100)
alert_ms       = max(healthy_ms × 50, 500)

These thresholds must be recalibrated against production write_lag observations before deployment: local Docker write latencies are materially lower than those of cloud VMs.
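The derivation above, written out as integer arithmetic in shell (a sketch; the function and variable names are illustrative, values are in milliseconds):

```shell
# Derive VAL14 alerting thresholds from measured p95 lag values (ms).
derive_thresholds() {
  local idle_p95=$1 light_p95=$2
  local observed_p95=$(( idle_p95 > light_p95 ? idle_p95 : light_p95 ))
  local healthy=$(( observed_p95 * 3 + 1 ))
  if (( healthy < 10 )); then healthy=10; fi          # floor: 10 ms
  local degraded=$(( healthy * 10 ))
  if (( degraded < 100 )); then degraded=100; fi      # floor: 100 ms
  local alert=$(( healthy * 50 ))
  if (( alert < 500 )); then alert=500; fi            # floor: 500 ms
  echo "healthy_ms=$healthy degraded_ms=$degraded alert_ms=$alert"
}
```

With the measured lab values (idle and light p95 both 0 ms) the floors dominate, yielding healthy 10 ms, degraded 100 ms, alert 500 ms — the figures shown in the §7 sample report.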

Backup/Restore (VAL15)

Metric

Target

Source

backup_ms

≤ 30,000 ms

VAL15 workplan target

restore_ms

≤ 60,000 ms

VAL15 workplan target

SHA-256 checksum

CLI == file hash

VAL15 integrity criterion

Restore correctness

Row counts + payload spot-values match pre-backup

VAL15 mandatory
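The checksum criterion (CLI == file hash) amounts to hashing the dump file independently and comparing it against the hash the CLI reported. A sketch, with the CLI-reported hash passed in as a plain string (the function name is illustrative):

```shell
# Compare an independently computed SHA-256 of the backup file against
# the checksum the backup CLI reported.
verify_backup_checksum() {
  local backup_file=$1 cli_reported_hash=$2
  local file_hash
  file_hash=$(sha256sum "$backup_file" | awk '{print $1}')
  [ "$file_hash" = "$cli_reported_hash" ]
}
```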

Quorum Loss (VAL17)

Metric                                    Target        Source
------                                    ------        ------
loss_detection_ms                         ≤ 30,000 ms   VAL17 / Gap HA-004
recovery_ms                               ≤ 30,000 ms   VAL17 / Gap HA-004
write_block_active during loss            = true        VAL17 mandatory safety criterion
can_accept_protected_writes during loss   = false       VAL17 mandatory safety criterion
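The two safety fields above can be asserted together as one write-gate check. A sketch using grep as a dependency-free stand-in for real JSON parsing (the field names come from the table; the function name is illustrative):

```shell
# During quorum loss the status must show write_block_active=true and
# can_accept_protected_writes=false; anything else is a safety failure.
quorum_write_gate_safe() {
  local status_json=$1
  grep -q '"write_block_active"[[:space:]]*:[[:space:]]*true'            <<<"$status_json" &&
  grep -q '"can_accept_protected_writes"[[:space:]]*:[[:space:]]*false'  <<<"$status_json"
}
```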

4. Readiness Level Definitions

VAL26 evaluates three readiness levels. Only HA Design Partner level is achievable with this validation suite.

HA Design Partner Ready

Criteria (all must hold):

  1. All five VAL13–VAL17 slices pass (zero failed checks each)

  2. SIGTERM failover ≤ 5,000 ms (VAL13-03)

  3. Zero data loss across failover (VAL13-04)

  4. Backup restore correctness verified (VAL15-06)

  5. Quorum loss detected within 30,000 ms AND writes blocked during loss (VAL17-02 + VAL17-03 + VAL17-04)

  6. Evidence timestamps fall within a single 6-hour evidence window

Meaning: the core HA lifecycle is functional and stable enough to offer to early adopters under the scope limitations stated in §1.

HA GA Ready

Not achievable with VAL13–VAL17 alone. Additional requirements:

  1. VAL18 30-day HA soak: Gate D requires ≥ 3 scheduled failovers, all failover_ms ≤ 10,000 ms, data_continuity_rate = 1.000, HA uptime ≥ 99.9%

  2. Streaming-replication promotion failover: standby is promoted to primary and the cluster recovers under a real PostgreSQL HA configuration

  3. VAL14 alerting thresholds recalibrated against production write_lag

HA Public Production Claim

Requires everything for HA GA Ready plus:

  1. Multi-AZ or multi-region topology validation

  2. External security hardening audit covering HA auth surfaces

  3. Production-grade monitoring and alerting validation

5. 10-Check Matrix

ID        When     Description                                              Pass criterion
--        ----     -----------                                              --------------
VAL26-01  Setup    VAL13 failover report found and all checks pass          s13 == "PASS" (10/10 checks)
VAL26-02  Setup    VAL14 replication lag report found and all checks pass   s14 == "PASS" (10/10 checks)
VAL26-03  Setup    VAL15 backup/restore report found and all checks pass    s15 == "PASS" (10/10 checks)
VAL26-04  Setup    VAL16 split-brain report found and all checks pass       s16 == "PASS" (10/10 checks)
VAL26-05  Setup    VAL17 quorum loss report found and all checks pass       s17 == "PASS" (10/10 checks)
VAL26-06  Metric   SIGTERM leader failover ≤ 5,000 ms                       val13 VAL13-03 passes
VAL26-07  Metric   Zero data loss across leader failover                    val13 VAL13-04 passes
VAL26-08  Metric   Backup restore correctness verified                      val15 VAL15-06 passes
VAL26-09  Metric   Quorum loss detected + writes blocked ≤ 30,000 ms        val17 VAL17-02 AND VAL17-03 AND VAL17-04 pass
VAL26-10  Summary  HA design partner readiness — all above pass             VAL26-01..09 all PASS + 6-hour window
                   and evidence coherent
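The roll-up rule in the final row reduces to a logical AND over the first nine check results (the real generator additionally requires the 6-hour evidence window). A sketch, taking PASS/FAIL strings as arguments:

```shell
# VAL26-10 summary verdict: PASS only when every input check is PASS.
overall_verdict() {
  local r
  for r in "$@"; do
    if [ "$r" != "PASS" ]; then echo "FAIL"; return 1; fi
  done
  echo "PASS"
}
```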

6. Run the Report

Prerequisites

Run the full cli-audit-lab (or, at minimum, the VAL13–VAL17 slices); both require Docker and docker compose:

export GOROOT=/home/ubuntu/.local/go1.25.7
export PATH="$GOROOT/bin:$PATH"
export GOTOOLCHAIN=local

bash scripts/labs/run_cli_audit_lab.sh

The evidence directory is printed at the end: evidence/cli-audit-lab-YYYY-MM-DD.

Generate the proof report

bash scripts/labs/run_ha_proof_report_val26.sh \
  evidence/cli-audit-lab-2026-03-23

Output files

File                            Contents
----                            --------
stdout                          Human-readable HA proof report
val26/val26-proof-report.txt    Same content as stdout
val26/val26-proof-report.json   Machine-readable JSON artifact

7. Final Report Format

VAL26 — HA Proof Report
Generated:    <YYYY-MM-DDTHH:MM:SSZ>
Evidence dir: <path>

Tested Topology:
  Cluster:    Two-node HA (node1 + node2), single-region
  Database:   PostgreSQL in Docker (single primary; VAL14 adds Docker standby)
  Chaos:      SIGTERM + SIGKILL for failover; SQL metadata injection for split-brain
              docker stop/start for quorum loss; NO iptables, NO real network partitions
  Evidence:   single coherent evidence window
  Note:       All timing figures are from controlled local Docker runs.
              Results are NOT representative of cloud VM or multi-region deployments.

Scenario Coverage:
  VAL13  HA Failover Validation         PASS     (10/10 checks)
  VAL14  HA Replication Lag Baseline    PASS     (10/10 checks)
  VAL15  Backup/Restore Validation      PASS     (10/10 checks)
  VAL16  Split-Brain Chaos Validation   PASS     (10/10 checks)
  VAL17  Quorum Loss Validation         PASS     (10/10 checks)

Failover Timeline  (VAL13, Docker two-node cluster):
  SIGTERM graceful failover:      342 ms   [target <= 5,000 ms]   PASS
  SIGKILL unplanned failover:     389 ms   [target <= 5,000 ms]   PASS
  Rapid 3x cycles:                281 ms, 305 ms, 318 ms   [target: all <= 5,000 ms]   PASS
  Zero data loss across failover:                                  PASS
  Method: SIGTERM/SIGKILL leader PID → poll follower holder_id every 50ms

Replication Lag Distribution  (VAL14, Docker primary + standby):
  Measured write_lag (pg_stat_replication.write_lag):
    Idle p95:            0 ms   (baseline noise floor)
    Light load p95:      0 ms   (100 rows × 500 B)   drain=87 ms   [target drain <= 2,000 ms]   PASS
    Heavy load p99:      1 ms   (500 rows × 2,000 B)  drain=412 ms  [target drain <= 5,000 ms]   PASS

  Derived alerting thresholds (healthy = max(p95×3+1, 10)):
    Healthy:   <=    10 ms   (lag expected within normal ops)
    Degraded:  <=   100 ms   (standby lagging; investigate)
    Alert:     >=   500 ms   (sustained lag; alert fires)
  Note: Thresholds are DERIVED from measured lab values, not workplan targets.
        Recalibrate against production observed_p95 before alerting deployment.

Backup/Restore Results  (VAL15):
  Backup duration:        1247 ms   [target <= 30,000 ms]   PASS
  Restore duration:       2891 ms   [target <= 60,000 ms]   PASS
  SHA-256 checksum:                                          PASS   (CLI checksum == file hash)
  Restore correctness:                                       PASS   (row counts + payload spot-values)
  ...

10-Check Matrix:
  VAL26-01 PASS  VAL13 failover report: all 10 checks passed
           val13 10/10 checks
  ...

Overall: PASS=10 FAIL=0

Risks and Open Gaps:
  ...  (see §1 Out of scope)

Readiness Conclusion:
  HA DESIGN PARTNER READY  ✓
  ...
  HA GA READY  ✗  (NOT YET)
  ...
  HA PUBLIC PRODUCTION CLAIM  ✗  (NOT YET)
  ...

Verdict: HA DESIGN PARTNER READY

8. Tooling

File                                              Role
----                                              ----
scripts/labs/run_ha_proof_report_val26.sh         VAL26 HA proof report generator
scripts/labs/run_cli_audit_lab.sh                 Source of VAL13–VAL17 evidence
docs/tutorials/ha-failover-validation.md          VAL13 formal plan
docs/tutorials/ha-replication-lag-validation.md   VAL14 formal plan
docs/tutorials/backup-restore-validation.md       VAL15 formal plan
docs/tutorials/split-brain-chaos-validation.md    VAL16 formal plan
docs/tutorials/quorum-loss-validation.md          VAL17 formal plan
docs/tutorials/ha-soak-validation.md              VAL18 30-day soak (GA gate, not yet run)