VAL 17 — Quorum Loss Validation

Status: Implemented
Runner: run_quorum_loss_val17_lab() in scripts/labs/run_cli_audit_lab.sh
Evidence dir: $EVIDENCE_DIR/val17/
Port: cp-val17-node → 19003


Purpose

Validates the quorum loss detection and recovery subsystem with timed measurements and write-blocking assertions:

  • Confirms quorum_health=healthy and write_block_active=false at baseline

  • Measures end-to-end quorum loss detection latency against the workplan ≤ 60 s target

  • Verifies write_block_active=true and can_accept_protected_writes=false during loss

  • Verifies quorum_loss_reason is populated with a diagnostic message during loss

  • Verifies last_lost_at timestamp is set by the QuorumMonitor on first loss

  • Measures end-to-end quorum recovery detection latency after PostgreSQL is restored

  • Verifies write_block_active=false and can_accept_protected_writes=true after recovery

  • Verifies last_restored_at timestamp is set and detected_loss_count = 1

  • Proves detected_loss_count increments correctly across a second loss/recovery cycle

  • Captures slice-local ha.quorum.lost and ha.quorum.restored audit events in the shared store
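
For reference, the baseline assertion (VAL17-01) follows the same curl + python3 pattern used throughout the lab. The sketch below is illustrative only, assuming the HA server listens on localhost:19003 and that /v1/ha/quorum returns the top-level field names used in this document:

```bash
# Hedged sketch of the VAL17-01 baseline check (not the literal lab code).
# Assumes the cp-val17-node HA server is reachable on localhost:19003 and that
# the /v1/ha/quorum response carries the fields named in this document.
curl -sS "http://localhost:19003/v1/ha/quorum" | python3 -c '
import json, sys
q = json.load(sys.stdin)
assert q["quorum_health"] == "healthy", q
assert q["write_block_active"] is False, q
print("quorum_health=healthy write_block_active=False")
'
```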


Branch-Specific Rule Application

| Question | Answer |
| --- | --- |
| Is this covered by an existing LAB? | Partially. run_quorum_lab() (line 1434 of run_cli_audit_lab.sh) exercises healthy → degraded → lost → healthy state transitions. It does NOT cover: loss/recovery timing measurement, write_block_active assertion, quorum_loss_reason validation, detected_loss_count progression, last_lost_at/last_restored_at timestamp verification, second-cycle count increment, or a standalone 10-check pass/fail report. |
| Which LAB/evidence bundle is extended? | run_cli_audit_lab.sh — new function run_quorum_loss_val17_lab() appended as slice 27. Reuses start_ha_server(), wait_for_http(), wait_for_log(), wait_for_pg_container(), and wait_for_quorum_health() helpers defined in the same file (a skeleton sketch follows this table). |
| New evidence files | 22 files in $EVIDENCE_DIR/val17/ — see Evidence Files table below. |
| Tutorial/runbook docs updated | docs/tutorials/cli-audit-lab.md §4 (slice 27), §5 (val17/ files), §6 (expected results), §8 (scope). |
| Reason new runner function required | run_quorum_lab() uses shared infrastructure (pr17ha-primary, pr17ha-standby, port 18091) and produces no structured pass/fail matrix. Injecting timing measurements, write-block assertions, and a 10-check report into run_quorum_lab() would change its sequencing (which is interleaved with the HA lab's shared PG setup). A narrowly scoped run_quorum_loss_val17_lab() with isolated Docker infrastructure is cleaner. |
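
A minimal skeleton of the new slice, assuming but not reproducing the helper signatures in run_cli_audit_lab.sh; argument order and messages here are illustrative only:

```bash
# Illustrative skeleton only; the authoritative implementation is
# run_quorum_loss_val17_lab() in scripts/labs/run_cli_audit_lab.sh.
# Helper argument lists below are assumptions, not the real signatures.
run_quorum_loss_val17_lab() {
    command -v docker >/dev/null 2>&1 || { echo "VAL17 SKIP: docker unavailable"; return 0; }
    mkdir -p "$EVIDENCE_DIR/val17"

    # Isolated infrastructure: one PG container, one HA server session (--min-sync-replicas 0)
    wait_for_pg_container val17-pg-primary          # hypothetical argument order
    start_ha_server cp-val17-node 19003             # hypothetical argument order
    wait_for_http "http://localhost:19003/v1/ha/quorum"

    # Baseline, first loss/recovery cycle with timing, second cycle, audit query, report
    wait_for_quorum_health 19003 healthy            # hypothetical argument order
    docker stop val17-pg-primary
    wait_for_quorum_health 19003 lost
    # ... timing capture, write-block assertions, 10-check report generation ...
}
```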


Scenarios Under Test

| Scenario | Mechanism | Expected Outcome |
| --- | --- | --- |
| Baseline | HA server started, PG running | quorum_health=healthy, write_block_active=false |
| Primary PG loss (first) | docker stop val17-pg-primary | quorum_health=lost within ≤ 30,000 ms; write_block_active=true |
| Primary PG recovery (first) | docker start val17-pg-primary | quorum_health=healthy within ≤ 30,000 ms; write_block_active=false |
| Second loss/recovery cycle | docker stop + docker start | detected_loss_count = 2 |

Why single PG (--min-sync-replicas 0) rather than primary + standby

VAL17’s workplan targets are:

  • “Quorum loss detected within 1 minute” (Gap HA-004)

  • “Cluster stops writes when quorum lost” (Gap HA-004)

  • “Automatic recovery when quorum restored” (Gap HA-004)

All three are fully observable with a single-PG, zero-replica configuration. The healthy → degraded → lost transition path (which requires a standby) is already exercised by run_quorum_lab() using the --min-sync-replicas 1 setup with pr17ha-primary + pr17ha-standby. VAL17 avoids duplicating that streaming-replication infrastructure and focuses instead on the measurement and assertion gaps that run_quorum_lab() does not cover.

Out-of-scope

  • healthy → degraded path (standby loss with primary reachable) — covered by run_quorum_lab()

  • Streaming replication setup, pg_basebackup standby provisioning — covered by VAL14

  • Write-path HTTP gating verification (attempting POST /v1/rollouts during loss) — the HA server binary (orchestrator_ha_server) does not expose rollout endpoints; write_block_active in the quorum status JSON reflects the same underlying WriteGate state

  • Quorum loss triggered by network partition (iptables) — requires root

  • Multi-region quorum topologies


Quorum State Machine

healthy  ──── PG stops ────►  lost
  ▲                             │
  └────── PG restarts ──────────┘

degraded ──── primary stops ►  lost   (handled by run_quorum_lab, out of scope here)

State classification rules (from quorum.go):

  • healthy: CanAcceptProtectedWrites = true

  • degraded: primary reachable but CanAcceptProtectedWrites = false (lock absent, replicas short, or standby-only connection)

  • lost: DatabaseReachable = false (primary PG unreachable)

write_block_active: alias for !CanAcceptProtectedWrites. The WriteGate HTTP middleware uses the same underlying state to block incoming protected writes. write_block_active=true in the quorum status is authoritative evidence that writes are blocked.
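
The loss-phase safety assertion (VAL17-04) therefore needs nothing beyond the quorum status JSON. A hedged sketch, reusing the field names above (the exact layout of /v1/ha/quorum beyond those fields is an assumption):

```bash
# Hedged sketch of the VAL17-04 assertion plus the alias invariant described above.
curl -sS "http://localhost:19003/v1/ha/quorum" | python3 -c '
import json, sys
q = json.load(sys.stdin)
# write_block_active is defined as the negation of can_accept_protected_writes
assert q["write_block_active"] == (not q["can_accept_protected_writes"]), q
# during the loss phase both safety conditions must hold
assert q["write_block_active"] is True, q
assert q["can_accept_protected_writes"] is False, q
print("write_block_active=True can_accept_protected_writes=False")
'
```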


Timing Measurement Method

1. loss_start_ms  = python3: int(time.time()*1000)
2. docker stop val17-pg-primary
3. Poll /v1/ha/quorum every 200 ms until quorum_health == "lost"
   (via wait_for_quorum_health helper, 150 iterations × 200 ms = 30 s max)
4. loss_end_ms    = python3: int(time.time()*1000)
5. loss_ms        = loss_end_ms - loss_start_ms

Recovery timing follows the same pattern:

1. recovery_start_ms  = python3: int(time.time()*1000)
2. docker start val17-pg-primary (+ wait_for_pg_container)
3. Poll /v1/ha/quorum every 200 ms until quorum_health == "healthy"
4. recovery_end_ms    = python3: int(time.time()*1000)
5. recovery_ms        = recovery_end_ms - recovery_start_ms

Threshold: ≤ 30,000 ms (30 s) for both loss and recovery detection.
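
The numbered steps above map onto shell in a straightforward way. A sketch of the loss-side measurement is shown below; the real lab uses the wait_for_quorum_health helper, and the inline loop here only mirrors its documented behaviour (poll every 200 ms, at most 150 iterations). The URL and the evidence path are assumptions consistent with this document; recovery timing is symmetric.

```bash
# Sketch of the loss-detection timing measurement (recovery timing follows the same pattern).
now_ms() { python3 -c 'import time; print(int(time.time()*1000))'; }

loss_start_ms=$(now_ms)
docker stop val17-pg-primary >/dev/null

loss_ms=99999                                   # sentinel kept when the 30 s poll times out
for _ in $(seq 1 150); do                       # 150 iterations x 200 ms = 30 s max
    health=$(curl -sS "http://localhost:19003/v1/ha/quorum" \
        | python3 -c 'import json,sys; print(json.load(sys.stdin)["quorum_health"])')
    if [ "$health" = "lost" ]; then
        loss_ms=$(( $(now_ms) - loss_start_ms ))
        break
    fi
    sleep 0.2
done
echo "loss_ms=${loss_ms}" > "$EVIDENCE_DIR/val17/val17-03-loss-timing.txt"
```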

Threshold rationale:

| Bound | Workplan target | VAL17 threshold | Rationale |
| --- | --- | --- | --- |
| Loss detection | ≤ 60 s | ≤ 30,000 ms | --quorum-monitor-interval 500ms; PG connection timeout ~2-5 s in Docker; 30 s gives 15× margin |
| Recovery detection | ≤ 60 s | ≤ 30,000 ms | Same: PG ready within 3-5 s after docker start; QuorumMonitor picks up on next 500 ms tick |

The 30 s threshold is conservative enough to accommodate cold Docker starts and slow CI environments while being well within the workplan’s 60 s requirement.


VAL17 10-Check Matrix

| Check | Name | Pass condition | Phase |
| --- | --- | --- | --- |
| VAL17-01 | baseline_quorum_healthy | quorum_health=healthy; write_block_active=false | Baseline |
| VAL17-02 | loss_detected | quorum_health=lost after docker stop | Loss detection |
| VAL17-03 | loss_detection_timing_bound | loss_ms ≤ 30000 | Timing |
| VAL17-04 | write_blocked_during_loss | write_block_active=true; can_accept_protected_writes=false | Safety |
| VAL17-05 | loss_reason_populated | quorum_loss_reason non-empty when quorum_health=lost | Diagnostics |
| VAL17-06 | recovery_detected | quorum_health=healthy after docker start | Recovery |
| VAL17-07 | recovery_timing_bound | recovery_ms ≤ 30000 | Timing |
| VAL17-08 | write_unblocked_after_recovery | write_block_active=false; can_accept_protected_writes=true; last_lost_at non-empty; last_restored_at non-empty; detected_loss_count = 1 | Safety + Monitor |
| VAL17-09 | second_cycle_count_increments | After second loss/recovery completes: detected_loss_count = 2 | Resilience |
| VAL17-10 | audit_events_captured | ≥ 1 slice-local ha.quorum.lost + ≥ 1 slice-local ha.quorum.restored in audit store | Audit |
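
For VAL17-10, the counts can be derived directly from the audit query output. In the sketch below, "$AUDIT_CLI" is a hypothetical stand-in for the lab's audit CLI entry point, and the assumption that the query result parses as a JSON list is mine rather than the lab's:

```bash
# Hedged sketch of the VAL17-10 event counting. Only --event-type and --start-time
# come from this document; "$AUDIT_CLI" and the JSON-list result shape are assumptions.
count_events() {
    "$AUDIT_CLI" audit query --event-type "$1" --start-time "$val17_start_time" \
        | python3 -c 'import json, sys; print(len(json.load(sys.stdin)))'
}

lost_events=$(count_events ha.quorum.lost)
restored_events=$(count_events ha.quorum.restored)
echo "lost_events=${lost_events} restored_events=${restored_events}" \
    > "$EVIDENCE_DIR/val17/val17-10-audit-check.txt"
[ "$lost_events" -ge 1 ] && [ "$restored_events" -ge 1 ]
```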


Pass/Fail Criteria

| Outcome | Condition |
| --- | --- |
| PASS | All 10 checks pass |
| PARTIAL | Checks 2, 4, 6, 8 pass (loss detected, write blocked, recovery detected, write unblocked), but not all 10 |
| FAIL | Check 2 fails (quorum loss not detected) OR check 4 fails (writes not blocked during loss) |

The two mandatory checks are VAL17-02 (loss detected) and VAL17-04 (write blocked). A system that detects loss but does not block writes provides no safety guarantee. A system that blocks writes but never detects loss means the write_block_active signal is unreliable.
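
One way the table above could be encoded, assuming check results are stored in a bash associative array alongside a pass_count tally; how combinations outside the three rows are classified is an assumption of this sketch:

```bash
# Sketch of the outcome rule. result[VAL17-NN] is assumed to hold "PASS" or "FAIL",
# and any combination not covered by the table is treated as FAIL here.
declare -A result            # filled by the individual check steps
pass_count=0                 # incremented once per passing check

if [ "${result[VAL17-02]}" != "PASS" ] || [ "${result[VAL17-04]}" != "PASS" ]; then
    outcome="FAIL"           # mandatory checks: loss detected + writes blocked
elif [ "$pass_count" -eq 10 ]; then
    outcome="PASS"
elif [ "${result[VAL17-06]}" = "PASS" ] && [ "${result[VAL17-08]}" = "PASS" ]; then
    outcome="PARTIAL"        # core loss/block/recovery/unblock cycle passes, but not all 10
else
    outcome="FAIL"
fi
```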


Evidence Files

| File | Description |
| --- | --- |
| val17-pg-setup.txt | Docker container IP, PG URL used by HA server |
| val17-ha-server.log | HA server log (single session throughout all phases) |
| val17-01-baseline.json | /v1/ha/quorum JSON at baseline (quorum_health=healthy) |
| val17-01-baseline.txt | ha quorum status CLI output at baseline |
| val17-01-baseline-check.txt | Python assertion result: quorum_health=healthy write_block_active=False |
| val17-02-quorum-lost.json | /v1/ha/quorum JSON after PG stop (quorum_health=lost) |
| val17-02-quorum-lost.txt | ha quorum status CLI output during loss |
| val17-03-loss-timing.txt | loss_ms=<N> |
| val17-04-write-block-check.txt | Python assertion: write_block_active=True can_accept_protected_writes=False |
| val17-05-loss-reason.txt | Python assertion: quorum_loss_reason=<message> |
| val17-06-quorum-recovered.json | /v1/ha/quorum JSON after PG restart (quorum_health=healthy) |
| val17-06-quorum-recovered.txt | ha quorum status CLI output after recovery |
| val17-07-recovery-timing.txt | recovery_ms=<N> |
| val17-08-recovery-check.txt | Python assertion: write_block_active, can_accept_protected_writes, last_lost_at, last_restored_at, detected_loss_count |
| val17-09-second-loss.json | /v1/ha/quorum JSON during second loss (quorum_health=lost) |
| val17-09-second-recovery.json | /v1/ha/quorum JSON after second recovery |
| val17-09-count-check.txt | Python assertion: detected_loss_count=2 (second cycle confirmed) after second recovery succeeds |
| val17-10-audit-lost.json | audit query --event-type ha.quorum.lost --start-time <val17_start_time> JSON result |
| val17-10-audit-restored.json | audit query --event-type ha.quorum.restored --start-time <val17_start_time> JSON result |
| val17-10-audit-check.txt | Python assertion: lost_events=N restored_events=M |
| val17-report.txt | Human-readable 10-check PASS/FAIL report with loss_ms and recovery_ms |
| val17-report.json | Machine-readable JSON report with loss_ms, recovery_ms, pass_count |
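
A minimal sketch of how val17-report.json could be emitted; loss_ms, recovery_ms, and pass_count come from the description above, while the surrounding JSON layout is illustrative:

```bash
# Sketch of the machine-readable report writer; the exact schema is an assumption.
python3 - "$loss_ms" "$recovery_ms" "$pass_count" \
    > "$EVIDENCE_DIR/val17/val17-report.json" <<'PY'
import json, sys
loss_ms, recovery_ms, pass_count = (int(a) for a in sys.argv[1:4])
print(json.dumps({"loss_ms": loss_ms, "recovery_ms": recovery_ms, "pass_count": pass_count}, indent=2))
PY
```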


Known Failure Modes

| Failure | Likely Cause | Mitigation |
| --- | --- | --- |
| VAL17-01 FAIL: baseline not healthy | HA server not yet leader; QuorumMonitor first tick hasn't run | Increase wait_for_log target; check val17-ha-server.log for acquired leadership before baseline capture |
| VAL17-02/03 FAIL: loss_ms=99999 | wait_for_quorum_health timed out (30 s) — PG stop didn't propagate | Check val17-ha-server.log for connection error; verify docker stop succeeded; verify --quorum-monitor-interval flag was accepted |
| VAL17-04 FAIL: write_block_active=false during loss | QuorumMonitor not wired to WriteGate in HA server binary; or snapshot taken before monitor tick | Check if --quorum-monitor-interval flag is supported by the binary version; re-run and check timing |
| VAL17-08 FAIL: last_lost_at empty | QuorumMonitor not running (no --quorum-monitor-interval passed) | Verify HA server log shows quorum monitor started; check flag spelling |
| VAL17-09 FAIL: detected_loss_count < 2 | Second wait_for_quorum_health for "healthy" captured before monitor incremented counter | Wait an extra 500 ms after recovery before reading count; this is a timing race |
| VAL17-10 FAIL: no audit events | AUTONOMY_AUDIT_DIR not set, HA server using a different audit dir, or slice-local events not emitted after val17_start_time | Verify AUTONOMY_AUDIT_DIR export; check val17-ha-server.log for audit emitter startup |
| Docker not available | CI environment without Docker daemon | Function prints SKIP and exits 0; VAL17 not counted in pass total |


Final Report Template

# VAL 17 — Quorum Loss Validation

Generated:     <timestamp>
Node:          cp-val17-node:19003

## Timing
Loss detection:     <N> ms  (threshold=30000)
Recovery detection: <N> ms  (threshold=30000)

## Checks
VAL17-01 baseline_quorum_healthy:       PASS
VAL17-02 loss_detected:                 PASS
VAL17-03 loss_detection_timing_bound:   PASS  (loss_ms=<N>, threshold=30000)
VAL17-04 write_blocked_during_loss:     PASS
VAL17-05 loss_reason_populated:         PASS
VAL17-06 recovery_detected:             PASS
VAL17-07 recovery_timing_bound:         PASS  (recovery_ms=<N>, threshold=30000)
VAL17-08 write_unblocked_after_recovery: PASS
VAL17-09 second_cycle_count_increments: PASS
VAL17-10 audit_events_captured:         PASS

## Summary
pass=10  fail=0  total=10

Quorum Loss Readiness Assessment (Gap HA-004):

  • PASS requires VAL17-02 (loss detected) + VAL17-04 (write blocked) + VAL17-06 (recovery detected)

  • Record loss_ms and recovery_ms as baseline timing evidence against workplan ≤ 60 s target

  • Repeat VAL17 after any change to pgstore/quorum.go, QuorumMonitor interval configuration, or WriteGate middleware