# VAL 13 — HA Failover Validation

- **Status:** Implemented
- **Runner:** `run_ha_failover_val13_lab()` in `scripts/labs/run_cli_audit_lab.sh`
- **Evidence dir:** `$EVIDENCE_DIR/val13/`
- **Ports:** `cp-val13-node1` → 18997 · `cp-val13-node2` → 18998
## Purpose

Validates the HA leader-election subsystem’s behaviour under unplanned failure conditions:

- Measures end-to-end failover latency under different kill signals
- Verifies zero-data-loss for rows written before leader loss
- Confirms PostgreSQL crash detection and manual quorum recovery
- Establishes that rapid restart cycles preserve single-leader state and probe-row integrity
- Provides an evidence-backed HA readiness result against the workplan’s Gate C criteria
## Branch-Specific Rule Application

| Question | Answer |
|---|---|
| Is this covered by an existing LAB? | Partially. |
| Which LAB/evidence bundle is extended? | |
| New evidence files | 27 files in `$EVIDENCE_DIR/val13/` |
| Tutorial/runbook docs updated | |
| Reason new runner function required | The graceful-failover test in |
## Scenarios Under Test

| Scenario | Mechanism | Notes |
|---|---|---|
| Leader killed (planned) | SIGTERM leader | Go runtime calls |
| Leader killed (unplanned) | SIGKILL leader | OS sends RST to PG, advisory lock released by PG on connection close |
| PostgreSQL crash on leader | `docker stop val13-pg-primary` | PG receives SIGTERM, checkpoints, closes all client connections |
| Disk fault approximation | | Simulates hard crash without graceful PG shutdown (no checkpoint); serves as practical proxy for disk failure causing host OOM/reset |
| Rapid restart resilience | 3 × SIGTERM leader alternation | Verifies repeated failovers preserve a stable leader and intact probe row |
| Partition proxy | SIGKILL leader | No iptables available; immediate process death approximates a network partition where the leader becomes unreachable |
## Out-of-scope

- Real network partitions via iptables (requires root)
- PostgreSQL streaming replication failover (covered by `run_quorum_lab()` and `run_ha_lab()`)
- Concurrent writes under kill (racy; not deterministic in CI)
- Full CP (`autonomy-orchestrator serve`) instances — HA server binary used for speed and isolation
- Automatic rollback trigger under HA failure (deferred, no implementation)
- Multi-region / WAN failover
## Safety Guardrails

| Guardrail | Detail |
|---|---|
| Fresh Docker containers | |
| Local | Kills both HA server PIDs, removes container/volume/network before returning — called before any |
| | Top-level cleanup trap now also removes |
| Docker availability guard | |
| Single PG node | |
| `--campaign 200ms` | Ensures follower acquires lock within ~200–400 ms of leader releasing it |
| | Advisory lock keepalive; PG connection loss detected within one keepalive interval |
| | Fast quorum state polling for VAL13-06/07 |
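The kill-and-remove behaviour described in the table can be pictured as a small shell pattern. This is a minimal sketch only: the helper name `cleanup_val13`, the PID-file paths, and the `val13-pg-data` volume name are illustrative assumptions; the real helper and trap live in `scripts/labs/run_cli_audit_lab.sh`.

```bash
# Illustrative cleanup pattern; identifiers are assumptions, not the runner's real names.
cleanup_val13() {
  # Stop both HA server processes if their PID files are present.
  for pidfile in /tmp/val13-node1.pid /tmp/val13-node2.pid; do
    if [ -f "$pidfile" ]; then
      kill "$(cat "$pidfile")" 2>/dev/null || true
    fi
  done
  # Remove the PG container, its volume, and the isolated bridge network.
  docker rm -f val13-pg-primary  >/dev/null 2>&1 || true
  docker volume rm val13-pg-data >/dev/null 2>&1 || true
  docker network rm val13-ha-net >/dev/null 2>&1 || true
}

# Run the same cleanup if the lab exits early for any reason.
trap cleanup_val13 EXIT
```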
## Harness Plan

### Environment

```
val13-ha-net (Docker bridge network, isolated)
│
val13-pg-primary (postgres:16, port 5432 internal)
│
├── cp-val13-node1:18997 (orchestrator_ha_server binary)
└── cp-val13-node2:18998 (orchestrator_ha_server binary)
```

Both HA nodes connect to the same PostgreSQL instance. An advisory lock (`pg_try_advisory_lock`) governs which node is the write-ready leader. The follower continuously re-tries Campaign at `--campaign 200ms`.
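The election rule itself can be illustrated with a short shell sketch. The lock key (42 here) and the connection string are assumptions, and the real campaign loop runs inside the `orchestrator_ha_server` binary over one long-lived connection; a fresh `psql` per attempt would drop the lock as soon as that process exits, so treat this strictly as an illustration of the exclusivity rule.

```bash
# Sketch of the advisory-lock campaign rule (lock key and PGURI are assumptions).
PGURI="postgres://postgres@val13-pg-primary:5432/postgres"

campaign_once() {
  # pg_try_advisory_lock returns "t" only for the single session that now holds
  # the lock, so at most one node can consider itself the write-ready leader.
  psql "$PGURI" -Atc "SELECT pg_try_advisory_lock(42)"
}

until [ "$(campaign_once)" = "t" ]; do
  sleep 0.2   # mirrors the --campaign 200ms retry interval
done
echo "acquired leadership"
```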
### Port Allocation

| Resource | Port | Notes |
|---|---|---|
| cp-val13-node1 HTTP | 18997 | Not used by any other lab function |
| cp-val13-node2 HTTP | 18998 | Not used by any other lab function |
| Docker PG | 5432 (internal) | Reachable via Docker network IP |
### Log Files

Each node session writes to a numbered log file (e.g., `val13-node1-s0.log`, `val13-node1-s1.log`). A new session log is created on each restart so that `grep -c "acquired leadership"` counts are session-local (starting from 0 after each restart).
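A sketch of that session-scoped naming, assuming a simple per-node counter (the actual variable names in the runner may differ):

```bash
# Hypothetical per-node session counter; bumped on every restart so grep counts
# never include lines from a previous leadership session.
node=node1
session=0

start_session_log() {
  log="$EVIDENCE_DIR/val13/val13-${node}-s${session}.log"
  session=$((session + 1))
}

start_session_log            # log = .../val13-node1-s0.log for the initial start
# ...start the HA server with stdout/stderr redirected to "$log"...
grep -c "acquired leadership" "$log"   # counts only the current session
```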
## Measurement Method

### Failover Timing

1. `pre_count = grep -c "acquired leadership" <follower_log>`
2. `start_ms` = `python3: int(time.time()*1000)`
3. `kill -TERM`/`-KILL <leader_pid>`
4. Poll the follower log every 50 ms until the line count exceeds `pre_count`
5. `elapsed_ms` = `python3: int(time.time()*1000) - start_ms`
6. Write `failover_ms=<value>` to the evidence file (see the sketch below)
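The same procedure as a condensed shell sketch; the PID-file path and the evidence filename are placeholders, not the runner's actual names:

```bash
# Condensed failover-timing sketch (paths and filenames are placeholders).
follower_log="$EVIDENCE_DIR/val13/val13-node2-s0.log"
leader_pid=$(cat /tmp/val13-node1.pid)

pre_count=$(grep -c "acquired leadership" "$follower_log" || true)
start_ms=$(python3 -c 'import time; print(int(time.time()*1000))')

kill -TERM "$leader_pid"          # kill -KILL for the SIGKILL scenario

# Poll every 50 ms until the follower records a new leadership acquisition.
while [ "$(grep -c "acquired leadership" "$follower_log" || true)" -le "$pre_count" ]; do
  sleep 0.05
done

end_ms=$(python3 -c 'import time; print(int(time.time()*1000))')
echo "failover_ms=$((end_ms - start_ms))" > "$EVIDENCE_DIR/val13/sigterm-failover.txt"
```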
Expected ranges:

| Signal | Typical range | Explanation |
|---|---|---|
| SIGTERM | 200–500 ms | Go runtime calls |
| SIGKILL | 200–1000 ms | OS sends RST → PG detects dead client → releases advisory lock → follower Campaign |
Why ≤ 5000 ms threshold? The advisory lock Campaign interval is 200 ms. Even with worst-case detection latency (several Campaign retries + TCP keepalive delay), 5000 ms is a conservative upper bound for local Docker networking. This threshold mirrors the workplan’s “sub-5-second failover” HA readiness criterion.
### Zero-Data-Loss Method

A `val13_probe` table row (`id=1, note='pre-kill'`) is written via direct `docker exec psql` while node-1 holds leadership. After failover, the same SELECT verifies the row is intact. Because the write bypasses the HA server surface and goes straight to the shared PostgreSQL instance, this slice proves shared-database durability across leader loss, not application-write continuity through node-1’s HTTP surface.
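A sketch of the probe write/read pair, assuming the default `postgres` role and database (the runner's exact `psql` invocation may differ):

```bash
# Write the probe row while node-1 is still leader (VAL13-02).
docker exec val13-pg-primary psql -U postgres -Atc \
  "INSERT INTO val13_probe (id, note) VALUES (1, 'pre-kill');"

# ...leader is killed, follower takes over...

# Verify the same row after failover (VAL13-04); expected output: pre-kill
docker exec val13-pg-primary psql -U postgres -Atc \
  "SELECT note FROM val13_probe WHERE id = 1;"
```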
### PG Crash Detection Method

Uses `wait_for_quorum_health()` (already defined in `run_cli_audit_lab.sh`); a simplified polling sketch follows the list:

- Crash: `docker stop val13-pg-primary` → polls `/v1/ha/quorum` until `quorum_health=lost`
- Explicit operator recovery: `docker start val13-pg-primary` + `wait_for_pg_container` → polls until `quorum_health=healthy`
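An illustrative polling loop for this slice. `wait_for_quorum_health()` remains the authoritative implementation; the JSON shape of `/v1/ha/quorum` (a `quorum_health` field readable with `jq`) is an assumption here.

```bash
# Poll node-2's quorum endpoint until it reports the desired health state.
poll_quorum() {
  local want=$1
  for _ in $(seq 1 100); do                            # ~10 s at 100 ms per poll
    state=$(curl -fsS http://127.0.0.1:18998/v1/ha/quorum | jq -r '.quorum_health // empty')
    [ "$state" = "$want" ] && return 0
    sleep 0.1
  done
  return 1
}

docker stop val13-pg-primary  && poll_quorum lost       # VAL13-06
docker start val13-pg-primary && poll_quorum healthy    # VAL13-07
```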
## VAL13 10-Check Matrix

| Check | Name | Threshold | Scenario |
|---|---|---|---|
| VAL13-01 | `ha_baseline_two_node` | node-1 log contains `acquired leadership` | Leader election baseline |
| VAL13-02 | `pre_failover_data_written` | SQL INSERT returns 0 exit code | Write path before kill |
| VAL13-03 | `leader_sigterm_failover` | `failover_ms ≤ 5000` | Leader killed (planned) |
| VAL13-04 | `zero_data_loss` | post-failover SELECT still returns the `pre-kill` probe row | Data integrity |
| VAL13-05 | `post_failover_leader_active` | node-2 ha status contains | New leader reachable |
| VAL13-06 | `pg_crash_quorum_lost` | `quorum_health=lost` | PG crash detection |
| VAL13-07 | `pg_crash_quorum_recovered` | `quorum_health=healthy` | PG crash recovery |
| VAL13-08 | `rapid_leader_kill_3x` | all 3 cycles: | Rapid restart resilience |
| VAL13-09 | `leader_sigkill_disk_proxy` | `failover_ms ≤ 5000` | Unplanned crash / disk-fault proxy |
| VAL13-10 | `leader_stable_post_chaos` | node-2 ha status contains | Final cluster stability |
## Pass/Fail Criteria

| Outcome | Condition |
|---|---|
| PASS | All 10 checks pass |
| PARTIAL | Checks 3, 4, 6, 7, 9 pass (core failover + data integrity + PG crash) |
| FAIL | Any of checks 3, 4, or 9 fail |

The three mandatory checks (VAL13-03, VAL13-04, VAL13-09) correspond directly to the workplan HA readiness claims:

- Failover completes within 5 seconds (VAL13-03 + VAL13-09)
- No data loss under architecture constraints (VAL13-04)
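A sketch of how a runner could map per-check results onto these outcomes, assuming an associative array `result` keyed by check ID; combinations the table does not name (for example, the mandatory checks passing while VAL13-06 fails) are treated as FAIL here by choice.

```bash
declare -A result      # e.g. result[VAL13-03]=PASS, filled in by each check

all=(VAL13-0{1..9} VAL13-10)
core=(VAL13-03 VAL13-04 VAL13-06 VAL13-07 VAL13-09)
mandatory=(VAL13-03 VAL13-04 VAL13-09)

all_pass() { for c in "$@"; do [ "${result[$c]:-FAIL}" = PASS ] || return 1; done; }

if   all_pass "${all[@]}";         then verdict=PASS
elif ! all_pass "${mandatory[@]}"; then verdict=FAIL
elif all_pass "${core[@]}";        then verdict=PARTIAL
else                                    verdict=FAIL   # case not named in the table
fi
echo "VAL13 verdict: $verdict"
```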
## Evidence Files

| File | Description |
|---|---|
| | Output of CREATE TABLE for `val13_probe` |
| `val13-node1-s0.log` | Node-1 HA server log (initial session) |
| `val13-node2-s0.log` | Node-2 HA server log (initial session) |
| `val13-node1-s1.log` | Node-1 HA server log (restart after cycle 1) |
| `val13-node1-s2.log` | Node-1 HA server log (restart after cycle 2) |
| `val13-node2-s1.log` | Node-2 HA server log (restart after cycle 1) |
| `val13-node2-s2.log` | Node-2 HA server log (restart after cycle 2) |
| | |
| | |
| | SQL INSERT result before kill |
| | |
| | |
| | SQL SELECT result (should contain |
| | |
| | |
| | |
| | SQL SELECT after PG restart (data intact) |
| | Rapid cycle 1 timing |
| | Rapid cycle 2 timing |
| | Rapid cycle 3 timing |
| | |
| | |
| | SQL SELECT after rapid cycles (probe row still intact) |
| | |
| | |
| | Human-readable report with all 10 check results |
| | Machine-readable JSON report |
## Known Failure Modes

| Failure | Likely Cause | Mitigation |
|---|---|---|
| VAL13-01 FAIL: node-1 not shown as leader | Race: node-2 starts fast enough to Campaign before node-1 | Node-1 starts first and |
| VAL13-03/09 FAIL: | Follower log never saw “acquired leadership” within 10 s polling window | Check HA server logs for error; usually PG connectivity or binary startup issue |
| VAL13-06 FAIL: quorum stays healthy | | Increase wait; check |
| VAL13-07 FAIL: quorum doesn’t recover | PG volume corruption after forced stop | |
| Docker not available | CI environment without Docker daemon | Function prints SKIP and exits 0; VAL13 not counted in pass total |
## Final Report Template

```
# VAL 13 — HA Failover Validation
Generated: <timestamp>
Nodes: cp-val13-node1:18997 cp-val13-node2:18998

## Scenario Results
VAL13-01 ha_baseline_two_node: PASS
VAL13-02 pre_failover_data_written: PASS
VAL13-03 leader_sigterm_failover: PASS (failover_ms=<N>, threshold=5000)
VAL13-04 zero_data_loss: PASS
VAL13-05 post_failover_leader_active: PASS
VAL13-06 pg_crash_quorum_lost: PASS
VAL13-07 pg_crash_quorum_recovered: PASS
VAL13-08 rapid_leader_kill_3x: PASS (c1=<N>ms c2=<N>ms c3=<N>ms)
VAL13-09 leader_sigkill_disk_proxy: PASS (failover_ms=<N>, threshold=5000)
VAL13-10 leader_stable_post_chaos: PASS

## Summary
pass=10 fail=0 total=10
```

Gate C HA Readiness Assessment:

- PASS if VAL13-03 + VAL13-09 both have `failover_ms ≤ 5000` AND VAL13-04 passes
- Record measured `sigterm_failover_ms` and `sigkill_failover_ms` as evidence values in the Gate C sign-off document