VAL26 — HA Proof Report Generator

Audience: engineering leads, product managers, and external reviewers who need a consolidated, evidence-backed HA readiness assessment.

VAL26 is a report generator, not a test runner. It reads evidence from five completed HA validation slices (VAL13–VAL17) and produces a single production-quality proof report with explicit readiness conclusions.

1. Scope

VAL26 consolidates evidence from:

Slice   Name                           What it proves
-----   ----                           --------------
VAL13   HA Failover Validation         SIGTERM/SIGKILL leader failover timing (≤5 s), zero data loss
VAL14   HA Replication Lag Baseline    write_lag distribution; derived healthy/degraded/alert thresholds
VAL15   Backup/Restore Validation      Backup timing, SHA-256 integrity, restore correctness (row-count + payload)
VAL16   Split-Brain Chaos Validation   Epoch-divergence detection, metadata-injection recovery, user data integrity
VAL17   Quorum Loss Validation         Loss detection timing (≤30 s), write blocking, recovery detection after PostgreSQL is restored

Branch rule: coverage by existing runner

Existing asset              Coverage
--------------              --------
run_soak_val18_report.sh    30-day HA soak — VAL18 only, different metrics (failover count, uptime, data continuity over 30 days)
run_cli_audit_lab.sh        Runs VAL13–VAL17 but produces per-slice JSON reports with no cross-slice aggregation

New aggregator required. No existing script reads and combines VAL13–VAL17 evidence into a single HA proof artifact with readiness conclusions.

Out of scope

  • 30-day HA durability soak (covered by VAL18 framework, not yet run)

  • Streaming-replication promotion failover (standby→primary under real PG HA)

  • Real network partitions (iptables / tc — all chaos is SIGTERM/SIGKILL or docker stop)

  • Concurrent-under-kill write requests

  • apply_lag measurement (VAL14 measures write_lag only)

  • PITR / WAL-based backup (VAL15 uses pg_dump/pg_restore)

  • Multi-AZ or multi-region topology

  • Write-gate HTTP endpoint rejection test (HA binary in this deployment does not expose /v1/rollouts; write-gate is verified via status endpoint fields)

2. Evidence Structure

VAL26 reads from the evidence directory produced by run_cli_audit_lab.sh (default: evidence/cli-audit-lab-YYYY-MM-DD).

Input file               Produced by                          Contents
----------               -----------                          --------
val13/val13-report.json  run_ha_failover_val13_lab()          Failover timing (SIGTERM/SIGKILL/rapid), 10-check results
val14/val14-report.json  run_ha_replication_lag_val14_lab()   Lag distribution (idle/light/heavy), derived thresholds, 10-check results
val15/val15-report.json  run_backup_restore_val15_lab()       Backup/restore timing, 10-check results
val16/val16-report.json  run_split_brain_chaos_val16_lab()    Split-brain detection/recovery outcomes, 10-check results
val17/val17-report.json  run_quorum_loss_val17_lab()          Loss detection timing, write-block state, 10-check results

VAL26 expects these five reports to come from one coherent run_cli_audit_lab.sh evidence set. The generator checks each report’s embedded timestamp and requires all found reports to fall within a single 6-hour evidence window before issuing a design-partner readiness conclusion.

Missing slice reports are reported as MISSING in the coverage table rather than aborting. Reports that exist but do not match the expected schema are degraded to MISSING with a schema-mismatch detail.
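The 6-hour coherence requirement can be sketched as a small shell check. This is a minimal sketch, assuming each slice report's embedded timestamp is an ISO-8601 UTC string that has already been extracted from the JSON; it relies on GNU `date` for epoch conversion.

```shell
# Minimal sketch of the 6-hour evidence-window check. Timestamps are
# assumed to be ISO-8601 UTC strings already pulled out of the slice
# reports; requires GNU date for -d parsing.
within_evidence_window() {
  local min=9999999999 max=0 t epoch
  for t in "$@"; do
    epoch=$(date -u -d "$t" +%s) || return 2   # unparseable timestamp
    if (( epoch < min )); then min=$epoch; fi
    if (( epoch > max )); then max=$epoch; fi
  done
  (( max - min <= 21600 ))                     # 6 hours = 21,600 s
}
```

The check succeeds only when the spread between the earliest and latest found report is at most six hours.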

3. Metric Definitions and Targets

VAL26 reports on the following metrics. Workplan targets are annotated with [target]. VAL14 alerting thresholds are derived from measured lab values and are explicitly labelled as derived — they are NOT workplan targets.

Failover Timeline (VAL13)

Metric                       Target            Source
------                       ------            ------
SIGTERM graceful failover    ≤ 5,000 ms        VAL13 / Gap HA-003
SIGKILL unplanned failover   ≤ 5,000 ms        VAL13 / Gap HA-003
3× rapid cycle timings       All ≤ 5,000 ms    VAL13 resilience check
Data loss across failover    = 0 rows          VAL13 mandatory criterion

Method: poll follower /v1/ha/status holder_id every 50 ms after leader kill.
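The polling method can be sketched as below. `get_holder_id` is a stub standing in for the real status query (in the lab it would be a `curl` against the follower's /v1/ha/status with the holder_id field extracted); the stub's behaviour — leadership flipping after three polls — is purely illustrative.

```shell
HOLDER_STATE=$(mktemp)            # file-backed counter so the stub survives subshells
echo 0 > "$HOLDER_STATE"

get_holder_id() {                 # stub; real lab queries the follower's /v1/ha/status
  local n
  n=$(( $(cat "$HOLDER_STATE") + 1 ))
  echo "$n" > "$HOLDER_STATE"
  if (( n <= 3 )); then echo "node1"; else echo "node2"; fi
}

measure_failover_ms() {
  local old_leader=$1 elapsed=0 step=50           # poll every 50 ms
  while [ "$(get_holder_id)" = "$old_leader" ]; do
    sleep 0.05
    elapsed=$(( elapsed + step ))
    if (( elapsed > 5000 )); then return 1; fi    # target: <= 5,000 ms
  done
  echo "$elapsed"
}
```

With the stub above, leadership changes on the fourth poll, so `measure_failover_ms node1` reports 150 ms of polling before the new holder is observed.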

Replication Lag Distribution (VAL14)

Metric          What it is                           Threshold
------          ----------                           ---------
idle_p95_ms     write_lag at rest                    Baseline noise floor (informational)
light_p95_ms    write_lag under 100 rows × 500 B     N/A (feeds threshold derivation)
light_drain_ms  LSN gap drain after light workload   ≤ 2,000 ms [target]
heavy_p99_ms    write_lag under 500 rows × 2,000 B   N/A (informational)
heavy_drain_ms  LSN gap drain after heavy workload   ≤ 5,000 ms [target]

Derived alerting thresholds (computed from measured values, NOT from the workplan):

observed_p95   = max(idle_p95_ms, light_p95_ms)
healthy_ms     = max(observed_p95 × 3 + 1, 10)
degraded_ms    = max(healthy_ms × 10, 100)
alert_ms       = max(healthy_ms × 50, 500)

These thresholds must be recalibrated against production write_lag observations before deployment: local Docker write latencies are materially lower than those of cloud VMs.
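The derivation above, written out as integer arithmetic in shell (a sketch; the function and variable names are illustrative, values are in milliseconds):

```shell
# Derive VAL14 alerting thresholds from measured p95 lag values (ms).
derive_thresholds() {
  local idle_p95=$1 light_p95=$2
  local observed_p95=$(( idle_p95 > light_p95 ? idle_p95 : light_p95 ))
  local healthy=$(( observed_p95 * 3 + 1 ))
  if (( healthy < 10 )); then healthy=10; fi          # floor: 10 ms
  local degraded=$(( healthy * 10 ))
  if (( degraded < 100 )); then degraded=100; fi      # floor: 100 ms
  local alert=$(( healthy * 50 ))
  if (( alert < 500 )); then alert=500; fi            # floor: 500 ms
  echo "healthy_ms=$healthy degraded_ms=$degraded alert_ms=$alert"
}
```

With the measured lab values (idle and light p95 both 0 ms) the floors dominate, yielding healthy 10 ms, degraded 100 ms, alert 500 ms — the figures shown in the §7 sample report.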

Backup/Restore (VAL15)

Metric

Target

Source

backup_ms

≤ 30,000 ms

VAL15 workplan target

restore_ms

≤ 60,000 ms

VAL15 workplan target

SHA-256 checksum

CLI == file hash

VAL15 integrity criterion

Restore correctness

Row counts + payload spot-values match pre-backup

VAL15 mandatory
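The checksum criterion (CLI == file hash) amounts to hashing the dump file independently and comparing it against the hash the CLI reported. A sketch, with the CLI-reported hash passed in as a plain string (the function name is illustrative):

```shell
# Compare an independently computed SHA-256 of the backup file against
# the checksum the backup CLI reported.
verify_backup_checksum() {
  local backup_file=$1 cli_reported_hash=$2
  local file_hash
  file_hash=$(sha256sum "$backup_file" | awk '{print $1}')
  [ "$file_hash" = "$cli_reported_hash" ]
}
```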

Quorum Loss (VAL17)

Metric                                    Target        Source
------                                    ------        ------
loss_detection_ms                         ≤ 30,000 ms   VAL17 / Gap HA-004
recovery_ms                               ≤ 30,000 ms   VAL17 / Gap HA-004
write_block_active during loss            = true        VAL17 mandatory safety criterion
can_accept_protected_writes during loss   = false       VAL17 mandatory safety criterion
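The two safety fields above can be asserted together as one write-gate check. A sketch using grep as a dependency-free stand-in for real JSON parsing (the field names come from the table; the function name is illustrative):

```shell
# During quorum loss the status must show write_block_active=true and
# can_accept_protected_writes=false; anything else is a safety failure.
quorum_write_gate_safe() {
  local status_json=$1
  grep -q '"write_block_active"[[:space:]]*:[[:space:]]*true'            <<<"$status_json" &&
  grep -q '"can_accept_protected_writes"[[:space:]]*:[[:space:]]*false'  <<<"$status_json"
}
```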

4. Readiness Level Definitions

VAL26 evaluates three readiness levels. Only HA Design Partner level is achievable with this validation suite.

HA Design Partner Ready

Criteria (all must hold):

  1. All five VAL13–VAL17 slices pass (zero failed checks each)

  2. SIGTERM failover ≤ 5,000 ms (VAL13-03)

  3. Zero data loss across failover (VAL13-04)

  4. Backup restore correctness verified (VAL15-06)

  5. Quorum loss detected within 30,000 ms AND writes blocked during loss (VAL17-02 + VAL17-03 + VAL17-04)

  6. Evidence timestamps fall within a single 6-hour evidence window

Meaning: the core HA lifecycle is functional and stable enough to offer to early adopters under the scope limitations stated in §1.

HA GA Ready

Not achievable with VAL13–VAL17 alone. Additional requirements:

  1. VAL18 30-day HA soak: Gate D requires ≥ 3 scheduled failovers, all failover_ms ≤ 10,000 ms, data_continuity_rate = 1.000, HA uptime ≥ 99.9%

  2. Streaming-replication promotion failover: standby is promoted to primary and the cluster recovers under a real PostgreSQL HA configuration

  3. VAL14 alerting thresholds recalibrated against production write_lag

HA Public Production Claim

Requires everything for HA GA Ready plus:

  1. Multi-AZ or multi-region topology validation

  2. External security hardening audit covering HA auth surfaces

  3. Production-grade monitoring and alerting validation

5. 10-Check Matrix

ID        When     Description                                              Pass criterion
--        ----     -----------                                              --------------
VAL26-01  Setup    VAL13 failover report found and all checks pass          s13 == "PASS" (10/10 checks)
VAL26-02  Setup    VAL14 replication lag report found and all checks pass   s14 == "PASS" (10/10 checks)
VAL26-03  Setup    VAL15 backup/restore report found and all checks pass    s15 == "PASS" (10/10 checks)
VAL26-04  Setup    VAL16 split-brain report found and all checks pass       s16 == "PASS" (10/10 checks)
VAL26-05  Setup    VAL17 quorum loss report found and all checks pass       s17 == "PASS" (10/10 checks)
VAL26-06  Metric   SIGTERM leader failover ≤ 5,000 ms                       val13 VAL13-03 passes
VAL26-07  Metric   Zero data loss across leader failover                    val13 VAL13-04 passes
VAL26-08  Metric   Backup restore correctness verified                      val15 VAL15-06 passes
VAL26-09  Metric   Quorum loss detected + writes blocked ≤ 30,000 ms        val17 VAL17-02 AND VAL17-03 AND VAL17-04 pass
VAL26-10  Summary  HA design partner readiness — all above pass             VAL26-01..09 all PASS + 6-hour window
                   and evidence coherent
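The roll-up rule in the final row reduces to a logical AND over the first nine check results (the real generator additionally requires the 6-hour evidence window). A sketch, taking PASS/FAIL strings as arguments:

```shell
# VAL26-10 summary verdict: PASS only when every input check is PASS.
overall_verdict() {
  local r
  for r in "$@"; do
    if [ "$r" != "PASS" ]; then echo "FAIL"; return 1; fi
  done
  echo "PASS"
}
```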

6. Run the Report

Prerequisites

Run the full cli-audit-lab (or, at minimum, the VAL13–VAL17 slices); both require Docker and docker compose:

export GOROOT=/home/ubuntu/.local/go1.25.7
export PATH="$GOROOT/bin:$PATH"
export GOTOOLCHAIN=local

bash scripts/labs/run_cli_audit_lab.sh

The evidence directory is printed at the end: evidence/cli-audit-lab-YYYY-MM-DD.

Generate the proof report

bash scripts/labs/run_ha_proof_report_val26.sh \
  evidence/cli-audit-lab-2026-03-23

Output files

File                            Contents
----                            --------
stdout                          Human-readable HA proof report
val26/val26-proof-report.txt    Same content as stdout
val26/val26-proof-report.json   Machine-readable JSON artifact

7. Final Report Format

VAL26 — HA Proof Report
Generated:    <YYYY-MM-DDTHH:MM:SSZ>
Evidence dir: <path>

Tested Topology:
  Cluster:    Two-node HA (node1 + node2), single-region
  Database:   PostgreSQL in Docker (single primary; VAL14 adds Docker standby)
  Chaos:      SIGTERM + SIGKILL for failover; SQL metadata injection for split-brain
              docker stop/start for quorum loss; NO iptables, NO real network partitions
  Evidence:   single coherent evidence window
  Note:       All timing figures are from controlled local Docker runs.
              Results are NOT representative of cloud VM or multi-region deployments.

Scenario Coverage:
  VAL13  HA Failover Validation         PASS     (10/10 checks)
  VAL14  HA Replication Lag Baseline    PASS     (10/10 checks)
  VAL15  Backup/Restore Validation      PASS     (10/10 checks)
  VAL16  Split-Brain Chaos Validation   PASS     (10/10 checks)
  VAL17  Quorum Loss Validation         PASS     (10/10 checks)

Failover Timeline  (VAL13, Docker two-node cluster):
  SIGTERM graceful failover:      342 ms   [target <= 5,000 ms]   PASS
  SIGKILL unplanned failover:     389 ms   [target <= 5,000 ms]   PASS
  Rapid 3x cycles:                281 ms, 305 ms, 318 ms   [target: all <= 5,000 ms]   PASS
  Zero data loss across failover:                                  PASS
  Method: SIGTERM/SIGKILL leader PID → poll follower holder_id every 50ms

Replication Lag Distribution  (VAL14, Docker primary + standby):
  Measured write_lag (pg_stat_replication.write_lag):
    Idle p95:            0 ms   (baseline noise floor)
    Light load p95:      0 ms   (100 rows × 500 B)   drain=87 ms   [target drain <= 2,000 ms]   PASS
    Heavy load p99:      1 ms   (500 rows × 2,000 B)  drain=412 ms  [target drain <= 5,000 ms]   PASS

  Derived alerting thresholds (healthy = max(p95×3+1, 10)):
    Healthy:   <=    10 ms   (lag expected within normal ops)
    Degraded:  <=   100 ms   (standby lagging; investigate)
    Alert:     >=   500 ms   (sustained lag; alert fires)
  Note: Thresholds are DERIVED from measured lab values, not workplan targets.
        Recalibrate against production observed_p95 before alerting deployment.

Backup/Restore Results  (VAL15):
  Backup duration:        1247 ms   [target <= 30,000 ms]   PASS
  Restore duration:       2891 ms   [target <= 60,000 ms]   PASS
  SHA-256 checksum:                                          PASS   (CLI checksum == file hash)
  Restore correctness:                                       PASS   (row counts + payload spot-values)
  ...

10-Check Matrix:
  VAL26-01 PASS  VAL13 failover report: all 10 checks passed
           val13 10/10 checks
  ...

Overall: PASS=10 FAIL=0

Risks and Open Gaps:
  ...  (see §1 Out of scope)

Readiness Conclusion:
  HA DESIGN PARTNER READY  ✓
  ...
  HA GA READY  ✗  (NOT YET)
  ...
  HA PUBLIC PRODUCTION CLAIM  ✗  (NOT YET)
  ...

Verdict: HA DESIGN PARTNER READY

8. Tooling

File                                              Role
----                                              ----
scripts/labs/run_ha_proof_report_val26.sh         VAL26 HA proof report generator
scripts/labs/run_cli_audit_lab.sh                 Source of VAL13–VAL17 evidence
docs/tutorials/ha-failover-validation.md          VAL13 formal plan
docs/tutorials/ha-replication-lag-validation.md   VAL14 formal plan
docs/tutorials/backup-restore-validation.md       VAL15 formal plan
docs/tutorials/split-brain-chaos-validation.md    VAL16 formal plan
docs/tutorials/quorum-loss-validation.md          VAL17 formal plan
docs/tutorials/ha-soak-validation.md              VAL18 30-day soak (GA gate, not yet run)