VAL26: HA Proof Report Generator
Audience: engineering leads, product managers, and external reviewers who need a consolidated, evidence-backed HA readiness assessment.
VAL26 is a report generator, not a test runner. It reads evidence from five completed HA validation slices (VAL13–VAL17) and produces a single production-quality proof report with explicit readiness conclusions.
1. Scope
VAL26 consolidates evidence from:
| Slice | Name | What it proves |
|---|---|---|
| VAL13 | HA Failover Validation | SIGTERM/SIGKILL leader failover timing (≤5 s), zero data loss |
| VAL14 | HA Replication Lag Baseline | write_lag distribution; derived healthy/degraded/alert thresholds |
| VAL15 | Backup/Restore Validation | Backup timing, SHA-256 integrity, restore correctness (row-count + payload) |
| VAL16 | Split-Brain Chaos Validation | Epoch-divergence detection, metadata-injection recovery, user data integrity |
| VAL17 | Quorum Loss Validation | Loss detection timing (≤30 s), write blocking, recovery detection after PostgreSQL is restored |
Branch rule: coverage by existing runner
| Existing asset | Coverage |
|---|---|
| | 30-day HA soak: VAL18 only, different metrics (failover count, uptime, data continuity over 30 days) |
| | Runs VAL13–VAL17 but produces per-slice JSON reports with no cross-slice aggregation |
New aggregator required. No existing script reads and combines VAL13–VAL17 evidence into a single HA proof artifact with readiness conclusions.
Out of scope
30-day HA durability soak (covered by VAL18 framework, not yet run)
Streaming-replication promotion failover (standby→primary under real PG HA)
Real network partitions (iptables / tc — all chaos is SIGTERM/SIGKILL or docker stop)
Write requests issued concurrently with a leader kill
apply_lag measurement (VAL14 measures write_lag only)
PITR / WAL-based backup (VAL15 uses pg_dump/pg_restore)
Multi-AZ or multi-region topology
Write-gate HTTP endpoint rejection test (the HA binary in this deployment does not expose /v1/rollouts; the write gate is verified via status endpoint fields)
2. Evidence Structure
VAL26 reads from the evidence directory produced by run_cli_audit_lab.sh
(default: evidence/cli-audit-lab-YYYY-MM-DD).
| Input file | Produced by | Contents |
|---|---|---|
| | | Failover timing (SIGTERM/SIGKILL/rapid), 10-check results |
| | | Lag distribution (idle/light/heavy), derived thresholds, 10-check results |
| | | Backup/restore timing, 10-check results |
| | | Split-brain detection/recovery outcomes, 10-check results |
| | | Loss detection timing, write-block state, 10-check results |
VAL26 expects these five reports to come from one coherent run_cli_audit_lab.sh
evidence set. The generator checks each report’s embedded timestamp and
requires all found reports to fall within a single 6-hour evidence window
before issuing a design-partner readiness conclusion.
Missing slice reports are reported as MISSING in the coverage table rather
than aborting. Reports that exist but do not match the expected schema are
degraded to MISSING with a schema-mismatch detail.
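The coherence rule and the MISSING handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the generator's actual code: the function name `evidence_window_ok`, the slice-to-timestamp mapping, the `FOUND` label, and the ISO 8601 timestamp format are all hypothetical, and the real reports may store their timestamps under a different field or format.

```python
from datetime import datetime, timedelta

# Hypothetical helper: names, input shape, and timestamp format are
# assumptions for illustration, not the generator's real interface.
EVIDENCE_WINDOW = timedelta(hours=6)
EXPECTED_SLICES = ("VAL13", "VAL14", "VAL15", "VAL16", "VAL17")


def evidence_window_ok(timestamps):
    """Return (coverage, coherent) for a {slice: ISO 8601 string or None} map.

    A slice without a timestamp is recorded as MISSING instead of raising.
    `coherent` is True only when at least one report was found and all found
    reports fall within a single 6-hour window.
    """
    coverage = {}
    found = []
    for name in EXPECTED_SLICES:
        raw = timestamps.get(name)
        if raw is None:
            coverage[name] = "MISSING"
            continue
        # datetime.fromisoformat rejects a trailing "Z" before Python 3.11.
        found.append(datetime.fromisoformat(raw.replace("Z", "+00:00")))
        coverage[name] = "FOUND"
    coherent = bool(found) and (max(found) - min(found)) <= EVIDENCE_WINDOW
    return coverage, coherent
```

Comparing the earliest and latest timestamps is enough here: if that spread is at most six hours, every found report lies inside one window.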
3. Metric Definitions and Targets
VAL26 reports on the following metrics. Workplan targets are annotated with
[target]. VAL14 alerting thresholds are derived from measured lab values
and are explicitly labelled as derived — they are NOT workplan targets.
Failover Timeline (VAL13)
| Metric | Target | Source |
|---|---|---|
| SIGTERM graceful failover | ≤ 5,000 ms | VAL13 / Gap HA-003 |
| SIGKILL unplanned failover | ≤ 5,000 ms | VAL13 / Gap HA-003 |
| 3× rapid cycle timings | All ≤ 5,000 ms | VAL13 resilience check |
| Data loss across failover | = 0 rows | VAL13 mandatory criterion |
Method: poll follower /v1/ha/status holder_id every 50 ms after leader kill.
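A polling loop of this kind might look like the sketch below. It is illustrative only: `measure_failover_ms` and its parameters are hypothetical names, and `get_holder_id` stands in for whatever call reads `holder_id` from the follower's /v1/ha/status response, whose exact shape is not specified here. The sketch also assumes failover counts as complete once `holder_id` is non-empty and differs from the killed leader's ID.

```python
import time


def measure_failover_ms(get_holder_id, old_leader_id, interval_s=0.05,
                        timeout_s=5.0, clock=time.monotonic, sleep=time.sleep):
    """Poll until holder_id names a node other than the killed leader.

    Returns the elapsed time in milliseconds, or None if the timeout
    (5 s by default, matching the failover target) passes first.
    `get_holder_id` is an injected callable (an assumption of this sketch).
    """
    start = clock()
    while True:
        holder = get_holder_id()
        elapsed_ms = (clock() - start) * 1000.0
        if holder and holder != old_leader_id:
            return elapsed_ms
        if elapsed_ms >= timeout_s * 1000.0:
            return None
        sleep(interval_s)  # 50 ms polling interval by default
```

Injecting the clock and sleep functions keeps the loop testable without a live cluster. With a 50 ms interval, a measured value can overstate the true failover time by up to roughly one polling interval.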
Replication Lag Distribution (VAL14)
| Metric | What it is | Threshold |
|---|---|---|
| idle_p95_ms | write_lag at rest | Baseline noise floor (informational) |
| light_p95_ms | write_lag under 100 rows × 500 B | N/A (feeds threshold derivation) |
| light_drain_ms | LSN gap drain after light workload | ≤ 2,000 ms [target] |
| heavy_p99_ms | write_lag under 500 rows × 2,000 B | N/A (informational) |
| heavy_drain_ms | LSN gap drain after heavy workload | ≤ 5,000 ms [target] |
Derived alerting thresholds (computed from measured values, NOT from the workplan):
observed_p95 = max(idle_p95_ms, light_p95_ms)
healthy_ms = max(observed_p95 × 3 + 1, 10)
degraded_ms = max(healthy_ms × 10, 100)
alert_ms = max(healthy_ms × 50, 500)
These thresholds must be recalibrated against production write_lag observations before deployment. Writes in the local Docker lab complete materially faster than on cloud VMs, so thresholds derived here may be too tight for production.
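The derivation can be expressed directly in code. The sketch below transcribes the four formulas above; the function name `derive_thresholds` and the returned dictionary shape are assumptions made for illustration.

```python
def derive_thresholds(idle_p95_ms, light_p95_ms):
    """Derive healthy/degraded/alert write_lag thresholds (ms) from measured p95s."""
    observed_p95 = max(idle_p95_ms, light_p95_ms)
    healthy_ms = max(observed_p95 * 3 + 1, 10)
    degraded_ms = max(healthy_ms * 10, 100)
    alert_ms = max(healthy_ms * 50, 500)
    return {
        "observed_p95": observed_p95,
        "healthy_ms": healthy_ms,
        "degraded_ms": degraded_ms,
        "alert_ms": alert_ms,
    }


# Near-zero lab lag: the floor values (10 / 100 / 500) take over.
print(derive_thresholds(0, 0))
# {'observed_p95': 0, 'healthy_ms': 10, 'degraded_ms': 100, 'alert_ms': 500}

# A higher observed p95 lifts all three thresholds above their floors.
print(derive_thresholds(2, 4))
# {'observed_p95': 4, 'healthy_ms': 13, 'degraded_ms': 130, 'alert_ms': 650}
```

When measured lag is close to zero, as in a local Docker lab, the result is determined entirely by the floors, which is one reason the thresholds need recalibration against production data.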
Backup/Restore (VAL15)
| Metric | Target | Source |
|---|---|---|
| backup_ms | ≤ 30,000 ms | VAL15 workplan target |
| restore_ms | ≤ 60,000 ms | VAL15 workplan target |
| SHA-256 checksum | CLI == file hash | VAL15 integrity criterion |
| Restore correctness | Row counts + payload spot-values match pre-backup | VAL15 mandatory |
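The SHA-256 integrity criterion compares the checksum reported by the CLI with a hash computed independently over the backup file. A minimal sketch of that comparison follows, assuming the CLI reports a hexadecimal digest string; the helper names `file_sha256` and `checksum_matches` are hypothetical and are not part of VAL15.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backups are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(cli_checksum, path):
    """Compare a CLI-reported hex digest with the file's actual SHA-256.

    Whitespace and letter case are normalised before comparing
    (an assumption about how the CLI prints the digest).
    """
    return cli_checksum.strip().lower() == file_sha256(path)
```

Hashing the file separately from the tool that produced it means a match confirms the checksum the CLI printed actually describes the bytes on disk.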
Quorum Loss (VAL17)
| Metric | Target | Source |
|---|---|---|
| loss_detection_ms | ≤ 30,000 ms | VAL17 / Gap HA-004 |
| recovery_ms | ≤ 30,000 ms | VAL17 / Gap HA-004 |
| write_block_active during loss | = true | VAL17 mandatory safety criterion |
| can_accept_protected_writes during loss | = false | VAL17 mandatory safety criterion |
4. Readiness Level Definitions
VAL26 evaluates three readiness levels. Only the HA Design Partner level is achievable with this validation suite.
HA Design Partner Ready
Criteria (all must hold):
All five VAL13–VAL17 slices pass (zero failed checks each)
SIGTERM failover ≤ 5,000 ms (VAL13-03)
Zero data loss across failover (VAL13-04)
Backup restore correctness verified (VAL15-06)
Quorum loss detected within 30,000 ms AND writes blocked during loss (VAL17-02 + VAL17-03 + VAL17-04)
Evidence timestamps fall within a single 6-hour evidence window
Meaning: the core HA lifecycle is functional and stable enough to offer to early adopters under the scope limitations stated in §1.
HA GA Ready
Not achievable with VAL13–VAL17 alone. Additional requirements:
VAL18 30-day HA soak: Gate D requires ≥ 3 scheduled failovers, all failover_ms ≤ 10,000, data_continuity_rate = 1.000, HA uptime ≥ 99.9%
Streaming-replication promotion failover: standby is promoted to primary and the cluster recovers under a real PostgreSQL HA configuration
VAL14 alerting thresholds recalibrated against production write_lag
HA Public Production Claim
Requires everything for HA GA Ready plus:
Multi-AZ or multi-region topology validation
External security hardening audit covering HA auth surfaces
Production-grade monitoring and alerting validation
5. 10-Check Matrix
| ID | When | Description | Pass criterion |
|---|---|---|---|
| VAL26-01 | Setup | VAL13 failover report found and all checks pass | |
| VAL26-02 | Setup | VAL14 replication lag report found and all checks pass | |
| VAL26-03 | Setup | VAL15 backup/restore report found and all checks pass | |
| VAL26-04 | Setup | VAL16 split-brain report found and all checks pass | |
| VAL26-05 | Setup | VAL17 quorum loss report found and all checks pass | |
| VAL26-06 | Metric | SIGTERM leader failover ≤ 5,000 ms | |
| VAL26-07 | Metric | Zero data loss across leader failover | |
| VAL26-08 | Metric | Backup restore correctness verified | |
| VAL26-09 | Metric | Quorum loss detected + writes blocked ≤ 30,000 ms | |
| VAL26-10 | Summary | HA design partner readiness: all above pass and evidence coherent | VAL26-01..09 all PASS + 6-hour window |
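The summary check VAL26-10 is a roll-up of the nine checks before it plus the evidence-window condition. The sketch below shows one way to compute it and the overall tally; the function name `roll_up`, the boolean input map, and the choice to count an absent check as a failure are assumptions for illustration, not the generator's confirmed behaviour.

```python
def roll_up(checks, window_ok):
    """Compute VAL26-10 and an overall tally from VAL26-01..09 results.

    `checks` maps check IDs to booleans. A check that is absent from the
    map counts as FAIL (an assumption of this sketch).
    """
    ids = [f"VAL26-{i:02d}" for i in range(1, 10)]
    results = {cid: bool(checks.get(cid, False)) for cid in ids}
    # VAL26-10 passes only if all nine checks pass and the evidence
    # timestamps fall within a single 6-hour window.
    results["VAL26-10"] = all(results.values()) and bool(window_ok)
    passed = sum(results.values())
    summary = f"Overall: PASS={passed} FAIL={len(results) - passed}"
    return results, summary


all_pass = {f"VAL26-{i:02d}": True for i in range(1, 10)}
print(roll_up(all_pass, window_ok=True)[1])   # Overall: PASS=10 FAIL=0
print(roll_up(all_pass, window_ok=False)[1])  # Overall: PASS=9 FAIL=1
```

Because VAL26-10 depends on every other check, a single failure among VAL26-01..09 always produces at least two FAIL entries in the tally.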
6. Run the Report
Prerequisites
Run the full cli-audit-lab (or at minimum the VAL13–VAL17 slices, which require
Docker and docker compose):
export GOROOT=/home/ubuntu/.local/go1.25.7
export PATH="$GOROOT/bin:$PATH"
export GOTOOLCHAIN=local
bash scripts/labs/run_cli_audit_lab.sh
The evidence directory is printed at the end:
evidence/cli-audit-lab-YYYY-MM-DD.
Generate the proof report
bash scripts/labs/run_ha_proof_report_val26.sh \
evidence/cli-audit-lab-2026-03-23
Output files
| File | Contents |
|---|---|
| stdout | Human-readable HA proof report |
| | Same content as stdout |
| | Machine-readable JSON artifact |
7. Final Report Format
VAL26 — HA Proof Report
Generated: <YYYY-MM-DDTHH:MM:SSZ>
Evidence dir: <path>
Tested Topology:
Cluster: Two-node HA (node1 + node2), single-region
Database: PostgreSQL in Docker (single primary; VAL14 adds Docker standby)
Chaos: SIGTERM + SIGKILL for failover; SQL metadata injection for split-brain
docker stop/start for quorum loss; NO iptables, NO real network partitions
Evidence: single coherent evidence window
Note: All timing figures are from controlled local Docker runs.
Results are NOT representative of cloud VM or multi-region deployments.
Scenario Coverage:
VAL13 HA Failover Validation PASS (10/10 checks)
VAL14 HA Replication Lag Baseline PASS (10/10 checks)
VAL15 Backup/Restore Validation PASS (10/10 checks)
VAL16 Split-Brain Chaos Validation PASS (10/10 checks)
VAL17 Quorum Loss Validation PASS (10/10 checks)
Failover Timeline (VAL13, Docker two-node cluster):
SIGTERM graceful failover: 342 ms [target <= 5,000 ms] PASS
SIGKILL unplanned failover: 389 ms [target <= 5,000 ms] PASS
Rapid 3x cycles: 281ms, 305ms, 318ms [target: all <= 5,000 ms] PASS
Zero data loss across failover: PASS
Method: SIGTERM/SIGKILL leader PID → poll follower holder_id every 50ms
Replication Lag Distribution (VAL14, Docker primary + standby):
Measured write_lag (pg_stat_replication.write_lag):
Idle p95: 0 ms (baseline noise floor)
Light load p95: 0 ms (100 rows × 500 B) drain=87 ms [target drain <= 2,000 ms] PASS
Heavy load p99: 1 ms (500 rows × 2,000 B) drain=412 ms [target drain <= 5,000 ms] PASS
Derived alerting thresholds (healthy = max(p95×3+1, 10)):
Healthy: <= 10 ms (lag expected within normal ops)
Degraded: <= 100 ms (standby lagging; investigate)
Alert: > 100 ms (alert threshold = 500 ms)
Note: Thresholds are DERIVED from measured lab values, not workplan targets.
Recalibrate against production observed_p95 before alerting deployment.
Backup/Restore Results (VAL15):
Backup duration: 1247 ms [target <= 30,000 ms] PASS
Restore duration: 2891 ms [target <= 60,000 ms] PASS
SHA-256 checksum: PASS (CLI checksum == file hash)
Restore correctness: PASS (row counts + payload spot-values)
...
10-Check Matrix:
VAL26-01 PASS VAL13 failover report: all 10 checks passed
val13 10/10 checks
...
Overall: PASS=10 FAIL=0
Risks and Open Gaps:
... (see §1 Out of scope)
Readiness Conclusion:
HA DESIGN PARTNER READY ✓
...
HA GA READY ✗ (NOT YET)
...
HA PUBLIC PRODUCTION CLAIM ✗ (NOT YET)
...
Verdict: HA DESIGN PARTNER READY
8. Tooling
| File | Role |
|---|---|
| | VAL26 HA proof report generator |
| | Source of VAL13–VAL17 evidence |
| | VAL13 formal plan |
| | VAL14 formal plan |
| | VAL15 formal plan |
| | VAL16 formal plan |
| | VAL17 formal plan |
| | VAL18 30-day soak (GA gate, not yet run) |