VAL 13 — HA Failover Validation

Status: Implemented
Runner: run_ha_failover_val13_lab() in scripts/labs/run_cli_audit_lab.sh
Evidence dir: $EVIDENCE_DIR/val13/
Ports: cp-val13-node1 → 18997 · cp-val13-node2 → 18998


Purpose

Validates the HA leader-election subsystem’s behaviour under unplanned failure conditions:

  • Measures end-to-end failover latency under different kill signals

  • Verifies zero-data-loss for rows written before leader loss

  • Confirms PostgreSQL crash detection and manual quorum recovery

  • Establishes that rapid restart cycles preserve single-leader state and probe-row integrity

  • Provides an evidence-backed HA readiness result against the workplan’s Gate C criteria


Branch-Specific Rule Application

| Question | Answer |
| --- | --- |
| Is this covered by an existing LAB? | Partially. run_ha_lab() covers graceful failover (ha failover trigger CLI, graceful Resign()). It does NOT cover: failover time measurement, unplanned kill (SIGTERM/SIGKILL), zero-data-loss verification with probe rows, PG crash/recovery detection, or rapid restart resilience. |
| Which LAB/evidence bundle is extended? | run_cli_audit_lab.sh — new function run_ha_failover_val13_lab() appended. Reuses the start_ha_server(), wait_for_http(), wait_for_log(), wait_for_quorum_health(), and wait_for_pg_container() helpers defined in the same file. |
| New evidence files | 27 files in $EVIDENCE_DIR/val13/ — see the Evidence Files table below. |
| Tutorial/runbook docs updated | docs/tutorials/cli-audit-lab.md §4 (slice 23), §5 (val13/ files), §6 (expected results), §8 (scope). |
| Reason a new runner function is required | The graceful-failover test in run_ha_lab() uses ha failover trigger (orchestrated resign). VAL13 needs: (a) timing telemetry around the failover gap, (b) unplanned kill paths (SIGTERM + SIGKILL), (c) PG crash/recovery observation, (d) rapid successive cycles, (e) a dedicated 10-check report. These cannot be added to run_ha_lab() without breaking its backup/restore flow; a narrowly scoped function is cleaner. |


Scenarios Under Test

| Scenario | Mechanism | Notes |
| --- | --- | --- |
| Leader killed (planned) | kill -TERM on node-1 process | Go signal handler calls Resign(); explicit advisory lock release |
| Leader killed (unplanned) | kill -KILL on node-1 process | OS closes the connection to PG; advisory lock released by PG on connection close |
| PostgreSQL crash on leader | docker stop val13-pg-primary | PG receives SIGTERM, checkpoints, closes all client connections |
| Disk fault approximation | kill -KILL on PG-connected leader | Simulates a hard crash without graceful PG shutdown (no checkpoint); a practical proxy for a disk failure that causes a host OOM/reset |
| Rapid restart resilience | 3 × SIGTERM leader alternation | Verifies repeated failovers preserve a stable leader and an intact probe row |
| Partition proxy | SIGKILL leader | No iptables available; immediate process death approximates a network partition in which the leader becomes unreachable |

Out-of-scope

  • Real network partitions via iptables (requires root)

  • PostgreSQL streaming replication failover (covered by run_quorum_lab() and run_ha_lab())

  • Concurrent writes under kill (racy; not deterministic in CI)

  • Full CP (autonomy-orchestrator serve) instances — HA server binary used for speed and isolation

  • Automatic rollback trigger under HA failure (deferred, no implementation)

  • Multi-region / WAN failover


Safety Guardrails

| Guardrail | Detail |
| --- | --- |
| Fresh Docker resources | val13-pg-primary, val13-pg-vol, val13-ha-net — names distinct from pr17ha-* resources to avoid cross-lab interference |
| Local _v13_cleanup | Kills both HA server PIDs and removes the container/volume/network before returning — called before any return 0/1 (see the sketch after this table) |
| docker_cleanup() extended | Top-level cleanup trap now also removes val13-pg-* resources on EXIT |
| Docker availability guard | command -v docker && docker info check at function entry; prints SKIP and returns 0 if unavailable |
| Single PG node | --min-sync-replicas 0 — no standby required; quorum is healthy when PG is reachable, lost when PG is down. Avoids replica setup complexity and run_quorum_lab() duplication. |
| --campaign 200ms | Ensures the follower acquires the lock within ~200–400 ms of the leader releasing it |
| --keepalive 500ms | Advisory lock keepalive; PG connection loss detected within one keepalive interval |
| --quorum-monitor-interval 500ms | Fast quorum state polling for VAL13-06/07 |
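
The entry guard and local cleanup take roughly the following shape (a minimal sketch: the PID variable names are illustrative, not the identifiers used in run_cli_audit_lab.sh):

```bash
# Sketch only: V13_NODE1_PID / V13_NODE2_PID are illustrative names for the two
# HA server PIDs tracked by the runner.
_v13_cleanup() {
  [ -n "${V13_NODE1_PID:-}" ] && kill "$V13_NODE1_PID" 2>/dev/null || true
  [ -n "${V13_NODE2_PID:-}" ] && kill "$V13_NODE2_PID" 2>/dev/null || true
  # Remove the lab-scoped Docker resources (container, volume, network).
  docker rm -f val13-pg-primary   2>/dev/null || true
  docker volume rm val13-pg-vol   2>/dev/null || true
  docker network rm val13-ha-net  2>/dev/null || true
}

run_ha_failover_val13_lab() {
  # Docker availability guard: skip (not fail) when no daemon is reachable.
  if ! command -v docker >/dev/null 2>&1 || ! docker info >/dev/null 2>&1; then
    echo "SKIP: Docker unavailable, skipping VAL13"
    return 0
  fi
  # ... scenario slices run here; _v13_cleanup is invoked before every return 0/1 ...
}
```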


Harness Plan

Environment

```
val13-ha-net  (Docker bridge network, isolated)
     │
val13-pg-primary  (postgres:16, port 5432 internal)
     │
     ├── cp-val13-node1:18997   (orchestrator_ha_server binary)
     └── cp-val13-node2:18998   (orchestrator_ha_server binary)
```

Both HA nodes connect to the same PostgreSQL instance. An advisory lock (pg_try_advisory_lock) governs which node is the write-ready leader. The follower continuously retries Campaign at the --campaign 200ms interval.
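
The leadership primitive can be exercised directly with psql. Below is a minimal illustration of the session-level advisory lock semantics, assuming the default postgres superuser inside the container and an arbitrary lock key (42 is not the key the HA server uses):

```bash
# Hold the lock in a long-lived session, then sleep so a second session can probe it.
docker exec val13-pg-primary psql -U postgres -c \
  "SELECT pg_try_advisory_lock(42); SELECT pg_sleep(10);" &
sleep 2

# While the holder session is alive, another session cannot take the key -> returns 'f'.
docker exec val13-pg-primary psql -U postgres -tAc "SELECT pg_try_advisory_lock(42);"

# Once the holder session ends (a proxy for leader death), PG frees the lock -> returns 't'.
wait
docker exec val13-pg-primary psql -U postgres -tAc "SELECT pg_try_advisory_lock(42);"
```

The HA server does the same thing on its long-lived PG connection, which is why killing the leader process, or PG closing that connection, frees the lock for the follower's next Campaign.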

Port Allocation

| Resource | Port | Notes |
| --- | --- | --- |
| cp-val13-node1 HTTP | 18997 | Not used by any other lab function |
| cp-val13-node2 HTTP | 18998 | Not used by any other lab function |
| Docker PG | 5432 (internal only) | Reachable via the Docker network IP |

Log Files

Each node session writes to a numbered log file (e.g., val13-node1-s0.log, val13-node1-s1.log). A new session log is created on each restart so that grep -c "acquired leadership" counts are session-local (starting from 0 after each restart).
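
A compact illustration of the session-log convention (the counter variable and helper below are illustrative, not the names used by the runner):

```bash
# Illustrative: a per-node session counter selects the current log file, so
# leadership counts are always measured against the newest session only.
V13_NODE1_SESSION=0
node1_log() { echo "$EVIDENCE_DIR/val13/val13-node1-s${V13_NODE1_SESSION}.log"; }

V13_NODE1_SESSION=$(( V13_NODE1_SESSION + 1 ))   # bumped on every node-1 restart

grep -c "acquired leadership" "$(node1_log)"     # session-local count, starts at 0
```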


Measurement Method

Failover Timing

1. pre_count  = grep -c "acquired leadership" <follower_log>
2. start_ms   = python3: int(time.time()*1000)
3. kill -TERM/-KILL <leader_pid>
4. Poll follower_log every 50 ms until line count > pre_count
5. elapsed_ms = python3: int(time.time()*1000) - start_ms
6. Write failover_ms=<value> to evidence file
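
A shell sketch of this measurement follows; measure_failover_ms is an illustrative name, and the 99999 sentinel plus the 10 s polling window mirror the Known Failure Modes section:

```bash
# Sketch of the failover timing measurement described in steps 1-6 above.
measure_failover_ms() {
  local follower_log="$1" leader_pid="$2" sig="$3"   # sig is TERM or KILL
  local pre_count start_ms now_ms
  pre_count=$(grep -c "acquired leadership" "$follower_log" || true)
  start_ms=$(python3 -c 'import time; print(int(time.time()*1000))')
  kill "-$sig" "$leader_pid"
  for _ in $(seq 1 200); do                          # 200 x 50 ms = 10 s window
    if [ "$(grep -c "acquired leadership" "$follower_log" || true)" -gt "$pre_count" ]; then
      now_ms=$(python3 -c 'import time; print(int(time.time()*1000))')
      echo $(( now_ms - start_ms ))
      return 0
    fi
    sleep 0.05
  done
  echo 99999                                         # sentinel: no takeover observed
  return 1
}
```

A caller would capture the result, for example failover_ms=$(measure_failover_ms "$follower_log" "$leader_pid" TERM), and write failover_ms=<value> signal=TERM to the corresponding evidence file.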

Expected ranges:

| Signal | Typical range | Explanation |
| --- | --- | --- |
| SIGTERM | 200–500 ms | Go signal handler calls Resign() → explicit lock release → follower wins the next Campaign |
| SIGKILL | 200–1000 ms | OS closes the PG connection → PG detects the dead client → releases the advisory lock → follower wins Campaign |

Why ≤ 5000 ms threshold? The advisory lock Campaign interval is 200 ms. Even with worst-case detection latency (several Campaign retries + TCP keepalive delay), 5000 ms is a conservative upper bound for local Docker networking. This threshold mirrors the workplan’s “sub-5-second failover” HA readiness criterion.

Zero-Data-Loss Method

A val13_probe table row (id=1, note='pre-kill') is written via direct docker exec psql while node-1 holds leadership. After failover, the same SELECT verifies the row is intact. Because the write bypasses the HA server surface and goes straight to the shared PostgreSQL instance, this slice proves shared-database durability across leader loss, not application-write continuity through node-1’s HTTP surface.
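
A minimal sketch of the probe write and post-failover read, assuming the default postgres user and database inside the container (the runner's exact DDL may differ):

```bash
# Written via docker exec psql while node-1 still holds leadership (before the kill).
docker exec val13-pg-primary psql -U postgres -c \
  "CREATE TABLE IF NOT EXISTS val13_probe (id int PRIMARY KEY, note text);"
docker exec val13-pg-primary psql -U postgres -c \
  "INSERT INTO val13_probe (id, note) VALUES (1, 'pre-kill');"

# Read back after failover; the output must still contain 'pre-kill' (VAL13-04).
docker exec val13-pg-primary psql -U postgres -tAc \
  "SELECT note FROM val13_probe WHERE id = 1;"
```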

PG Crash Detection Method

Uses wait_for_quorum_health() (already defined in run_cli_audit_lab.sh):

  • docker stop val13-pg-primary → polls /v1/ha/quorum until quorum_health=lost

  • explicit operator recovery: docker start val13-pg-primary + wait_for_pg_container → polls until quorum_health=healthy
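
Roughly, the slice looks like this; the argument shapes of wait_for_quorum_health and wait_for_pg_container are assumptions here, while the endpoint, states, ports, and the 30 s bound come from the check matrix:

```bash
# VAL13-06: crash PG and observe the current leader (node-2) report quorum loss.
docker stop val13-pg-primary
wait_for_quorum_health "http://127.0.0.1:18998" "lost" 30            # argument shape assumed
curl -s http://127.0.0.1:18998/v1/ha/quorum \
  > "$EVIDENCE_DIR/val13/val13-06-quorum-lost.json"

# VAL13-07: manual operator recovery, then wait for quorum to report healthy again.
docker start val13-pg-primary
wait_for_pg_container val13-pg-primary                                # argument shape assumed
wait_for_quorum_health "http://127.0.0.1:18998" "healthy" 30
curl -s http://127.0.0.1:18998/v1/ha/quorum \
  > "$EVIDENCE_DIR/val13/val13-07-quorum-healthy.json"
```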


VAL13 10-Check Matrix

| Check | Name | Threshold | Scenario |
| --- | --- | --- | --- |
| VAL13-01 | ha_baseline_two_node | node-1 log contains cp-val13-node1; node-2 session-0 log contains no "acquired leadership" line | Leader election baseline |
| VAL13-02 | pre_failover_data_written | SQL INSERT returns 0 exit code | Write path before kill |
| VAL13-03 | leader_sigterm_failover | failover_ms ≤ 5000 | Leader killed (planned) |
| VAL13-04 | zero_data_loss | note = pre-kill readable after failover | Data integrity |
| VAL13-05 | post_failover_leader_active | node-2 ha status contains cp-val13-node2 | New leader reachable |
| VAL13-06 | pg_crash_quorum_lost | quorum_health = lost within 30 s of docker stop | PG crash detection |
| VAL13-07 | pg_crash_quorum_recovered | quorum_health = healthy after manual docker start | PG crash recovery |
| VAL13-08 | rapid_leader_kill_3x | all 3 cycles: failover_ms ≤ 5000 and post-cycle leader/data checks succeed | Rapid restart resilience |
| VAL13-09 | leader_sigkill_disk_proxy | failover_ms ≤ 5000 after SIGKILL | Unplanned crash / disk-fault proxy |
| VAL13-10 | leader_stable_post_chaos | node-2 ha status contains cp-val13-node2 after all tests | Final cluster stability |


Pass/Fail Criteria

| Outcome | Condition |
| --- | --- |
| PASS | All 10 checks pass |
| PARTIAL | Checks 3, 4, 6, 7, 9 pass (core failover + data integrity + PG crash) |
| FAIL | Any of checks 3, 4, or 9 fail |

The three mandatory checks (VAL13-03, VAL13-04, VAL13-09) correspond directly to the workplan HA readiness claims:

  • Failover completes within 5 seconds (VAL13-03 + VAL13-09)

  • No data loss under architecture constraints (VAL13-04)


Evidence Files

| File | Description |
| --- | --- |
| probe-table-create.txt | Output of CREATE TABLE for val13_probe |
| val13-node1-s0.log | Node-1 HA server log (initial session) |
| val13-node2-s0.log | Node-2 HA server log (initial session) |
| val13-node1-s1.log | Node-1 HA server log (restart after cycle 1) |
| val13-node1-s2.log | Node-1 HA server log (restart after cycle 2) |
| val13-node2-s1.log | Node-2 HA server log (restart after cycle 1) |
| val13-node2-s2.log | Node-2 HA server log (restart after cycle 2) |
| val13-01-node1-status.txt | ha status output for node-1 at baseline |
| val13-01-node2-status.txt | ha status output for node-2 at baseline |
| val13-02-pre-kill-write.txt | SQL INSERT result before kill |
| val13-03-failover-timing.txt | failover_ms=<N>  signal=TERM |
| val13-03-node2-status-after.txt | ha status for node-2 immediately after takeover |
| val13-04-data-probe.txt | SQL SELECT result (should contain pre-kill) |
| val13-05-write-ready.txt | ha status confirming node-2 is active leader |
| val13-06-quorum-lost.json | /v1/ha/quorum JSON showing quorum_health=lost |
| val13-07-quorum-healthy.json | /v1/ha/quorum JSON showing quorum_health=healthy |
| val13-07-data-after-pg-restart.txt | SQL SELECT after PG restart (data intact) |
| val13-08-cycle1.txt | Rapid cycle 1 timing |
| val13-08-cycle2.txt | Rapid cycle 2 timing |
| val13-08-cycle3.txt | Rapid cycle 3 timing |
| val13-08-rapid-summary.txt | cycle1_ms=N  cycle2_ms=N  cycle3_ms=N |
| val13-08-post-cycle-status.txt | ha status after the rapid cycles (node-1 should hold leadership) |
| val13-08-data-after-rapid.txt | SQL SELECT after rapid cycles (probe row still intact) |
| val13-09-sigkill-timing.txt | failover_ms=<N>  signal=KILL |
| val13-10-final-status.txt | ha status for node-2 after all chaos tests |
| val13-report.txt | Human-readable report with all 10 check results |
| val13-report.json | Machine-readable JSON report |


Known Failure Modes

| Failure | Likely Cause | Mitigation |
| --- | --- | --- |
| VAL13-01 FAIL: node-1 not shown as leader | Race: node-2 starts fast enough to Campaign before node-1 | Node-1 starts first, and wait_for_log blocks until it logs "acquired leadership" before node-2 starts |
| VAL13-03/09 FAIL: failover_ms=99999 | Follower log never saw "acquired leadership" within the 10 s polling window | Check the HA server logs for errors; usually a PG connectivity or binary startup issue |
| VAL13-06 FAIL: quorum stays healthy | --quorum-monitor-interval not fast enough, or PG not stopped cleanly | Increase the wait; check docker ps |
| VAL13-07 FAIL: quorum doesn't recover | PG volume corruption after a forced stop | docker exec val13-pg-primary pg_isready to check PG health |
| Docker not available | CI environment without a Docker daemon | Function prints SKIP and exits 0; VAL13 not counted in the pass total |


Final Report Template

```
# VAL 13 — HA Failover Validation

Generated:     <timestamp>
Nodes:         cp-val13-node1:18997  cp-val13-node2:18998

## Scenario Results
VAL13-01 ha_baseline_two_node:         PASS
VAL13-02 pre_failover_data_written:    PASS
VAL13-03 leader_sigterm_failover:      PASS  (failover_ms=<N>, threshold=5000)
VAL13-04 zero_data_loss:               PASS
VAL13-05 post_failover_leader_active:  PASS
VAL13-06 pg_crash_quorum_lost:         PASS
VAL13-07 pg_crash_quorum_recovered:    PASS
VAL13-08 rapid_leader_kill_3x:         PASS  (c1=<N>ms c2=<N>ms c3=<N>ms)
VAL13-09 leader_sigkill_disk_proxy:    PASS  (failover_ms=<N>, threshold=5000)
VAL13-10 leader_stable_post_chaos:     PASS

## Summary
pass=10  fail=0  total=10
```

Gate C HA Readiness Assessment:

  • PASS if VAL13-03 + VAL13-09 both have failover_ms ≤ 5000 AND VAL13-04 passes

  • Record measured sigterm_failover_ms and sigkill_failover_ms as evidence values in the Gate C sign-off document
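
For the sign-off document, the two measured values can be lifted straight from the timing evidence files; a small sketch, assuming the failover_ms=<N> line format listed in the Evidence Files table:

```bash
# Extract the measured failover times recorded by VAL13-03 (SIGTERM) and VAL13-09 (SIGKILL).
v13="$EVIDENCE_DIR/val13"
sigterm_failover_ms=$(grep -o 'failover_ms=[0-9]*' "$v13/val13-03-failover-timing.txt" | cut -d= -f2)
sigkill_failover_ms=$(grep -o 'failover_ms=[0-9]*' "$v13/val13-09-sigkill-timing.txt"  | cut -d= -f2)

# VAL13-04 (zero_data_loss) must also pass for the overall Gate C criterion.
if [ "$sigterm_failover_ms" -le 5000 ] && [ "$sigkill_failover_ms" -le 5000 ]; then
  echo "Gate C failover criterion: PASS (TERM=${sigterm_failover_ms}ms, KILL=${sigkill_failover_ms}ms)"
else
  echo "Gate C failover criterion: FAIL (TERM=${sigterm_failover_ms}ms, KILL=${sigkill_failover_ms}ms)"
fi
```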