VAL 13 — HA Failover Validation

Status: Implemented
Runner: run_ha_failover_val13_lab() in scripts/labs/run_cli_audit_lab.sh
Evidence dir: $EVIDENCE_DIR/val13/
Ports: cp-val13-node1 → 18997 · cp-val13-node2 → 18998


Purpose

Validates the HA leader-election subsystem’s behaviour under unplanned failure conditions:

  • Measures end-to-end failover latency under different kill signals

  • Verifies zero-data-loss for rows written before leader loss

  • Confirms PostgreSQL crash detection and manual quorum recovery

  • Establishes that rapid restart cycles preserve single-leader state and probe-row integrity

  • Provides an evidence-backed HA readiness result against the workplan’s Gate C criteria


Branch-Specific Rule Application

| Question | Answer |
| --- | --- |
| Is this covered by an existing LAB? | Partially. run_ha_lab() covers graceful failover (ha failover trigger CLI, graceful Resign()). It does NOT cover: failover time measurement, unplanned kill (SIGTERM/SIGKILL), zero-data-loss verification with probe rows, PG crash/recovery detection, or rapid restart resilience. |
| Which LAB/evidence bundle is extended? | run_cli_audit_lab.sh — new function run_ha_failover_val13_lab() appended. Reuses the start_ha_server(), wait_for_http(), wait_for_log(), wait_for_quorum_health(), and wait_for_pg_container() helpers defined in the same file. |
| New evidence files | 27 files in $EVIDENCE_DIR/val13/ — see the Evidence Files table below. |
| Tutorial/runbook docs updated | docs/tutorials/cli-audit-lab.md §4 (slice 23), §5 (val13/ files), §6 (expected results), §8 (scope). |
| Reason a new runner function is required | The graceful-failover test in run_ha_lab() uses ha failover trigger (orchestrated resign). VAL13 needs: (a) timing telemetry around the failover gap, (b) unplanned kill paths (SIGTERM + SIGKILL), (c) PG crash/recovery observation, (d) rapid successive cycles, (e) a dedicated 10-check report. These cannot be added to run_ha_lab() without breaking its backup/restore flow; a narrowly scoped function is cleaner. |


Scenarios Under Test

| Scenario | Mechanism | Notes |
| --- | --- | --- |
| Leader killed (planned) | kill -TERM on node-1 process | Go signal handler calls Resign(); explicit advisory lock release |
| Leader killed (unplanned) | kill -KILL on node-1 process | OS closes the connection to PG; advisory lock released by PG on connection close |
| PostgreSQL crash on leader | docker stop val13-pg-primary | PG receives SIGTERM, checkpoints, closes all client connections |
| Disk fault approximation | kill -KILL on PG-connected leader | Simulates a hard crash without graceful PG shutdown (no checkpoint); a practical proxy for a disk failure that causes a host OOM/reset |
| Rapid restart resilience | 3 × SIGTERM leader alternation | Verifies repeated failovers preserve a stable leader and an intact probe row |
| Partition proxy | SIGKILL leader | No iptables available; immediate process death approximates a network partition in which the leader becomes unreachable |

Out-of-scope

  • Real network partitions via iptables (requires root)

  • PostgreSQL streaming replication failover (covered by run_quorum_lab() and run_ha_lab())

  • Concurrent writes under kill (racy; not deterministic in CI)

  • Full CP (autonomy-orchestrator serve) instances — HA server binary used for speed and isolation

  • Automatic rollback trigger under HA failure (deferred, no implementation)

  • Multi-region / WAN failover


Safety Guardrails

| Guardrail | Detail |
| --- | --- |
| Fresh Docker resources | val13-pg-primary, val13-pg-vol, val13-ha-net — names distinct from pr17ha-* resources to avoid cross-lab interference |
| Local _v13_cleanup | Kills both HA server PIDs and removes the container/volume/network before returning — called before any return 0/1 (see the sketch after this table) |
| docker_cleanup() extended | Top-level cleanup trap now also removes val13-pg-* resources on EXIT |
| Docker availability guard | command -v docker && docker info check at function entry; prints SKIP and returns 0 if unavailable |
| Single PG node | --min-sync-replicas 0 — no standby required; quorum is healthy when PG is reachable, lost when PG is down. Avoids replica setup complexity and run_quorum_lab() duplication. |
| --campaign 200ms | Ensures the follower acquires the lock within ~200–400 ms of the leader releasing it |
| --keepalive 500ms | Advisory lock keepalive; PG connection loss detected within one keepalive interval |
| --quorum-monitor-interval 500ms | Fast quorum state polling for VAL13-06/07 |
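
The entry guard and local cleanup take roughly the following shape (a minimal sketch: the PID variable names are illustrative, not the identifiers used in run_cli_audit_lab.sh):

```bash
# Sketch only: V13_NODE1_PID / V13_NODE2_PID are illustrative names for the two
# HA server PIDs tracked by the runner.
_v13_cleanup() {
  [ -n "${V13_NODE1_PID:-}" ] && kill "$V13_NODE1_PID" 2>/dev/null || true
  [ -n "${V13_NODE2_PID:-}" ] && kill "$V13_NODE2_PID" 2>/dev/null || true
  # Remove the lab-scoped Docker resources (container, volume, network).
  docker rm -f val13-pg-primary   2>/dev/null || true
  docker volume rm val13-pg-vol   2>/dev/null || true
  docker network rm val13-ha-net  2>/dev/null || true
}

run_ha_failover_val13_lab() {
  # Docker availability guard: skip (not fail) when no daemon is reachable.
  if ! command -v docker >/dev/null 2>&1 || ! docker info >/dev/null 2>&1; then
    echo "SKIP: Docker unavailable, skipping VAL13"
    return 0
  fi
  # ... scenario slices run here; _v13_cleanup is invoked before every return 0/1 ...
}
```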


Harness Plan

Environment

```
val13-ha-net  (Docker bridge network, isolated)
     │
val13-pg-primary  (postgres:16, port 5432 internal)
     │
     ├── cp-val13-node1:18997   (orchestrator_ha_server binary)
     └── cp-val13-node2:18998   (orchestrator_ha_server binary)
```

Both HA nodes connect to the same PostgreSQL instance. An advisory lock (pg_try_advisory_lock) governs which node is the write-ready leader. The follower continuously retries Campaign at the --campaign 200ms interval.
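
The leadership primitive can be exercised directly with psql. Below is a minimal illustration of the session-level advisory lock semantics, assuming the default postgres superuser inside the container and an arbitrary lock key (42 is not the key the HA server uses):

```bash
# Hold the lock in a long-lived session, then sleep so a second session can probe it.
docker exec val13-pg-primary psql -U postgres -c \
  "SELECT pg_try_advisory_lock(42); SELECT pg_sleep(10);" &
sleep 2

# While the holder session is alive, another session cannot take the key -> returns 'f'.
docker exec val13-pg-primary psql -U postgres -tAc "SELECT pg_try_advisory_lock(42);"

# Once the holder session ends (a proxy for leader death), PG frees the lock -> returns 't'.
wait
docker exec val13-pg-primary psql -U postgres -tAc "SELECT pg_try_advisory_lock(42);"
```

The HA server does the same thing on its long-lived PG connection, which is why killing the leader process, or PG closing that connection, frees the lock for the follower's next Campaign.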

Port Allocation

| Resource | Port | Notes |
| --- | --- | --- |
| cp-val13-node1 HTTP | 18997 | Not used by any other lab function |
| cp-val13-node2 HTTP | 18998 | Not used by any other lab function |
| Docker PG | 5432 (internal only) | Reachable via the Docker network IP |

Log Files

Each node session writes to a numbered log file (e.g., val13-node1-s0.log, val13-node1-s1.log). A new session log is created on each restart so that grep -c "acquired leadership" counts are session-local (starting from 0 after each restart).
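
A compact illustration of the session-log convention (the counter variable and helper below are illustrative, not the names used by the runner):

```bash
# Illustrative: a per-node session counter selects the current log file, so
# leadership counts are always measured against the newest session only.
V13_NODE1_SESSION=0
node1_log() { echo "$EVIDENCE_DIR/val13/val13-node1-s${V13_NODE1_SESSION}.log"; }

V13_NODE1_SESSION=$(( V13_NODE1_SESSION + 1 ))   # bumped on every node-1 restart

grep -c "acquired leadership" "$(node1_log)"     # session-local count, starts at 0
```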


Measurement Method

Failover Timing

1. pre_count  = grep -c "acquired leadership" <follower_log>
2. start_ms   = python3: int(time.time()*1000)
3. kill -TERM/-KILL <leader_pid>
4. Poll follower_log every 50 ms until line count > pre_count
5. elapsed_ms = python3: int(time.time()*1000) - start_ms
6. Write failover_ms=<value> to evidence file
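
A shell sketch of this measurement follows; measure_failover_ms is an illustrative name, and the 99999 sentinel plus the 10 s polling window mirror the Known Failure Modes section:

```bash
# Sketch of the failover timing measurement described in steps 1-6 above.
measure_failover_ms() {
  local follower_log="$1" leader_pid="$2" sig="$3"   # sig is TERM or KILL
  local pre_count start_ms now_ms
  pre_count=$(grep -c "acquired leadership" "$follower_log" || true)
  start_ms=$(python3 -c 'import time; print(int(time.time()*1000))')
  kill "-$sig" "$leader_pid"
  for _ in $(seq 1 200); do                          # 200 x 50 ms = 10 s window
    if [ "$(grep -c "acquired leadership" "$follower_log" || true)" -gt "$pre_count" ]; then
      now_ms=$(python3 -c 'import time; print(int(time.time()*1000))')
      echo $(( now_ms - start_ms ))
      return 0
    fi
    sleep 0.05
  done
  echo 99999                                         # sentinel: no takeover observed
  return 1
}
```

A caller would capture the result, for example failover_ms=$(measure_failover_ms "$follower_log" "$leader_pid" TERM), and write failover_ms=<value> signal=TERM to the corresponding evidence file.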

Expected ranges:

| Signal | Typical range | Explanation |
| --- | --- | --- |
| SIGTERM | 200–500 ms | Go signal handler calls Resign() → explicit lock release → follower wins the next Campaign |
| SIGKILL | 200–1000 ms | OS closes the PG connection → PG detects the dead client → releases the advisory lock → follower wins Campaign |

Why ≤ 5000 ms threshold? The advisory lock Campaign interval is 200 ms. Even with worst-case detection latency (several Campaign retries + TCP keepalive delay), 5000 ms is a conservative upper bound for local Docker networking. This threshold mirrors the workplan’s “sub-5-second failover” HA readiness criterion.

Zero-Data-Loss Method

A val13_probe table row (id=1, note='pre-kill') is written via direct docker exec psql while node-1 holds leadership. After failover, the same SELECT verifies the row is intact. Because the write bypasses the HA server surface and goes straight to the shared PostgreSQL instance, this slice proves shared-database durability across leader loss, not application-write continuity through node-1’s HTTP surface.
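
A minimal sketch of the probe write and post-failover read, assuming the default postgres user and database inside the container (the runner's exact DDL may differ):

```bash
# Written via docker exec psql while node-1 still holds leadership (before the kill).
docker exec val13-pg-primary psql -U postgres -c \
  "CREATE TABLE IF NOT EXISTS val13_probe (id int PRIMARY KEY, note text);"
docker exec val13-pg-primary psql -U postgres -c \
  "INSERT INTO val13_probe (id, note) VALUES (1, 'pre-kill');"

# Read back after failover; the output must still contain 'pre-kill' (VAL13-04).
docker exec val13-pg-primary psql -U postgres -tAc \
  "SELECT note FROM val13_probe WHERE id = 1;"
```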

PG Crash Detection Method

Uses wait_for_quorum_health() (already defined in run_cli_audit_lab.sh):

  • docker stop val13-pg-primary → polls /v1/ha/quorum until quorum_health=lost

  • explicit operator recovery: docker start val13-pg-primary + wait_for_pg_container → polls until quorum_health=healthy
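
Roughly, the slice looks like this; the argument shapes of wait_for_quorum_health and wait_for_pg_container are assumptions here, while the endpoint, states, ports, and the 30 s bound come from the check matrix:

```bash
# VAL13-06: crash PG and observe the current leader (node-2) report quorum loss.
docker stop val13-pg-primary
wait_for_quorum_health "http://127.0.0.1:18998" "lost" 30            # argument shape assumed
curl -s http://127.0.0.1:18998/v1/ha/quorum \
  > "$EVIDENCE_DIR/val13/val13-06-quorum-lost.json"

# VAL13-07: manual operator recovery, then wait for quorum to report healthy again.
docker start val13-pg-primary
wait_for_pg_container val13-pg-primary                                # argument shape assumed
wait_for_quorum_health "http://127.0.0.1:18998" "healthy" 30
curl -s http://127.0.0.1:18998/v1/ha/quorum \
  > "$EVIDENCE_DIR/val13/val13-07-quorum-healthy.json"
```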


VAL13 10-Check Matrix

| Check | Name | Threshold | Scenario |
| --- | --- | --- | --- |
| VAL13-01 | ha_baseline_two_node | node-1 log contains cp-val13-node1; node-2 session-0 log contains no "acquired leadership" line | Leader election baseline |
| VAL13-02 | pre_failover_data_written | SQL INSERT returns 0 exit code | Write path before kill |
| VAL13-03 | leader_sigterm_failover | failover_ms ≤ 5000 | Leader killed (planned) |
| VAL13-04 | zero_data_loss | note = pre-kill readable after failover | Data integrity |
| VAL13-05 | post_failover_leader_active | node-2 ha status contains cp-val13-node2 | New leader reachable |
| VAL13-06 | pg_crash_quorum_lost | quorum_health = lost within 30 s of docker stop | PG crash detection |
| VAL13-07 | pg_crash_quorum_recovered | quorum_health = healthy after manual docker start | PG crash recovery |
| VAL13-08 | rapid_leader_kill_3x | all 3 cycles: failover_ms ≤ 5000 and post-cycle leader/data checks succeed | Rapid restart resilience |
| VAL13-09 | leader_sigkill_disk_proxy | failover_ms ≤ 5000 after SIGKILL | Unplanned crash / disk-fault proxy |
| VAL13-10 | leader_stable_post_chaos | node-2 ha status contains cp-val13-node2 after all tests | Final cluster stability |


Pass/Fail Criteria

| Outcome | Condition |
| --- | --- |
| PASS | All 10 checks pass |
| PARTIAL | Checks 3, 4, 6, 7, 9 pass (core failover + data integrity + PG crash) |
| FAIL | Any of checks 3, 4, or 9 fail |

The three mandatory checks (VAL13-03, VAL13-04, VAL13-09) correspond directly to the workplan HA readiness claims:

  • Failover completes within 5 seconds (VAL13-03 + VAL13-09)

  • No data loss under architecture constraints (VAL13-04)


Evidence Files

| File | Description |
| --- | --- |
| probe-table-create.txt | Output of CREATE TABLE for val13_probe |
| val13-node1-s0.log | Node-1 HA server log (initial session) |
| val13-node2-s0.log | Node-2 HA server log (initial session) |
| val13-node1-s1.log | Node-1 HA server log (restart after cycle 1) |
| val13-node1-s2.log | Node-1 HA server log (restart after cycle 2) |
| val13-node2-s1.log | Node-2 HA server log (restart after cycle 1) |
| val13-node2-s2.log | Node-2 HA server log (restart after cycle 2) |
| val13-01-node1-status.txt | ha status output for node-1 at baseline |
| val13-01-node2-status.txt | ha status output for node-2 at baseline |
| val13-02-pre-kill-write.txt | SQL INSERT result before kill |
| val13-03-failover-timing.txt | failover_ms=<N>  signal=TERM |
| val13-03-node2-status-after.txt | ha status for node-2 immediately after takeover |
| val13-04-data-probe.txt | SQL SELECT result (should contain pre-kill) |
| val13-05-write-ready.txt | ha status confirming node-2 is active leader |
| val13-06-quorum-lost.json | /v1/ha/quorum JSON showing quorum_health=lost |
| val13-07-quorum-healthy.json | /v1/ha/quorum JSON showing quorum_health=healthy |
| val13-07-data-after-pg-restart.txt | SQL SELECT after PG restart (data intact) |
| val13-08-cycle1.txt | Rapid cycle 1 timing |
| val13-08-cycle2.txt | Rapid cycle 2 timing |
| val13-08-cycle3.txt | Rapid cycle 3 timing |
| val13-08-rapid-summary.txt | cycle1_ms=N  cycle2_ms=N  cycle3_ms=N |
| val13-08-post-cycle-status.txt | ha status after the rapid cycles (node-1 should hold leadership) |
| val13-08-data-after-rapid.txt | SQL SELECT after rapid cycles (probe row still intact) |
| val13-09-sigkill-timing.txt | failover_ms=<N>  signal=KILL |
| val13-10-final-status.txt | ha status for node-2 after all chaos tests |
| val13-report.txt | Human-readable report with all 10 check results |
| val13-report.json | Machine-readable JSON report |


Known Failure Modes

| Failure | Likely Cause | Mitigation |
| --- | --- | --- |
| VAL13-01 FAIL: node-1 not shown as leader | Race: node-2 starts fast enough to Campaign before node-1 | Node-1 starts first, and wait_for_log blocks until it logs "acquired leadership" before node-2 starts |
| VAL13-03/09 FAIL: failover_ms=99999 | Follower log never saw "acquired leadership" within the 10 s polling window | Check the HA server logs for errors; usually a PG connectivity or binary startup issue |
| VAL13-06 FAIL: quorum stays healthy | --quorum-monitor-interval not fast enough, or PG not stopped cleanly | Increase the wait; check docker ps |
| VAL13-07 FAIL: quorum doesn't recover | PG volume corruption after a forced stop | docker exec val13-pg-primary pg_isready to check PG health |
| Docker not available | CI environment without a Docker daemon | Function prints SKIP and exits 0; VAL13 not counted in the pass total |


Final Report Template

```
# VAL 13 — HA Failover Validation

Generated:     <timestamp>
Nodes:         cp-val13-node1:18997  cp-val13-node2:18998

## Scenario Results
VAL13-01 ha_baseline_two_node:         PASS
VAL13-02 pre_failover_data_written:    PASS
VAL13-03 leader_sigterm_failover:      PASS  (failover_ms=<N>, threshold=5000)
VAL13-04 zero_data_loss:               PASS
VAL13-05 post_failover_leader_active:  PASS
VAL13-06 pg_crash_quorum_lost:         PASS
VAL13-07 pg_crash_quorum_recovered:    PASS
VAL13-08 rapid_leader_kill_3x:         PASS  (c1=<N>ms c2=<N>ms c3=<N>ms)
VAL13-09 leader_sigkill_disk_proxy:    PASS  (failover_ms=<N>, threshold=5000)
VAL13-10 leader_stable_post_chaos:     PASS

## Summary
pass=10  fail=0  total=10
```

Gate C HA Readiness Assessment:

  • PASS if VAL13-03 + VAL13-09 both have failover_ms ≤ 5000 AND VAL13-04 passes

  • Record measured sigterm_failover_ms and sigkill_failover_ms as evidence values in the Gate C sign-off document
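
For the sign-off document, the two measured values can be lifted straight from the timing evidence files; a small sketch, assuming the failover_ms=<N> line format listed in the Evidence Files table:

```bash
# Extract the measured failover times recorded by VAL13-03 (SIGTERM) and VAL13-09 (SIGKILL).
v13="$EVIDENCE_DIR/val13"
sigterm_failover_ms=$(grep -o 'failover_ms=[0-9]*' "$v13/val13-03-failover-timing.txt" | cut -d= -f2)
sigkill_failover_ms=$(grep -o 'failover_ms=[0-9]*' "$v13/val13-09-sigkill-timing.txt"  | cut -d= -f2)

# VAL13-04 (zero_data_loss) must also pass for the overall Gate C criterion.
if [ "$sigterm_failover_ms" -le 5000 ] && [ "$sigkill_failover_ms" -le 5000 ]; then
  echo "Gate C failover criterion: PASS (TERM=${sigterm_failover_ms}ms, KILL=${sigkill_failover_ms}ms)"
else
  echo "Gate C failover criterion: FAIL (TERM=${sigterm_failover_ms}ms, KILL=${sigkill_failover_ms}ms)"
fi
```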