VAL 17 — Quorum Loss Validation

Status: Implemented
Runner: run_quorum_loss_val17_lab() in scripts/labs/run_cli_audit_lab.sh
Evidence dir: $EVIDENCE_DIR/val17/
Port: cp-val17-node → 19003


Purpose

Validates the quorum loss detection and recovery subsystem with timed measurements and write-blocking assertions:

  • Confirms quorum_health=healthy and write_block_active=false at baseline

  • Measures end-to-end quorum loss detection latency against the workplan ≤ 60 s target

  • Verifies write_block_active=true and can_accept_protected_writes=false during loss

  • Verifies quorum_loss_reason is populated with a diagnostic message during loss

  • Verifies last_lost_at timestamp is set by the QuorumMonitor on first loss

  • Measures end-to-end quorum recovery detection latency after PostgreSQL is restored

  • Verifies write_block_active=false and can_accept_protected_writes=true after recovery

  • Verifies last_restored_at timestamp is set and detected_loss_count = 1

  • Proves detected_loss_count increments correctly across a second loss/recovery cycle

  • Captures slice-local ha.quorum.lost and ha.quorum.restored audit events in the shared store
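
For reference, the baseline assertion (VAL17-01) follows the same curl + python3 pattern used throughout the lab. The sketch below is illustrative only, assuming the HA server listens on localhost:19003 and that /v1/ha/quorum returns the top-level field names used in this document:

```bash
# Hedged sketch of the VAL17-01 baseline check (not the literal lab code).
# Assumes the cp-val17-node HA server is reachable on localhost:19003 and that
# the /v1/ha/quorum response carries the fields named in this document.
curl -sS "http://localhost:19003/v1/ha/quorum" | python3 -c '
import json, sys
q = json.load(sys.stdin)
assert q["quorum_health"] == "healthy", q
assert q["write_block_active"] is False, q
print("quorum_health=healthy write_block_active=False")
'
```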


Branch-Specific Rule Application

| Question | Answer |
| --- | --- |
| Is this covered by an existing LAB? | Partially. run_quorum_lab() (line 1434 of run_cli_audit_lab.sh) exercises healthy → degraded → lost → healthy state transitions. It does NOT cover: loss/recovery timing measurement, write_block_active assertion, quorum_loss_reason validation, detected_loss_count progression, last_lost_at/last_restored_at timestamp verification, second-cycle count increment, or a standalone 10-check pass/fail report. |
| Which LAB/evidence bundle is extended? | run_cli_audit_lab.sh — new function run_quorum_loss_val17_lab() appended as slice 27. Reuses start_ha_server(), wait_for_http(), wait_for_log(), wait_for_pg_container(), and wait_for_quorum_health() helpers defined in the same file (a skeleton sketch follows this table). |
| New evidence files | 22 files in $EVIDENCE_DIR/val17/ — see Evidence Files table below. |
| Tutorial/runbook docs updated | docs/tutorials/cli-audit-lab.md §4 (slice 27), §5 (val17/ files), §6 (expected results), §8 (scope). |
| Reason new runner function required | run_quorum_lab() uses shared infrastructure (pr17ha-primary, pr17ha-standby, port 18091) and produces no structured pass/fail matrix. Injecting timing measurements, write-block assertions, and a 10-check report into run_quorum_lab() would change its sequencing (which is interleaved with the HA lab's shared PG setup). A narrowly scoped run_quorum_loss_val17_lab() with isolated Docker infrastructure is cleaner. |
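
A minimal skeleton of the new slice, assuming but not reproducing the helper signatures in run_cli_audit_lab.sh; argument order and messages here are illustrative only:

```bash
# Illustrative skeleton only; the authoritative implementation is
# run_quorum_loss_val17_lab() in scripts/labs/run_cli_audit_lab.sh.
# Helper argument lists below are assumptions, not the real signatures.
run_quorum_loss_val17_lab() {
    command -v docker >/dev/null 2>&1 || { echo "VAL17 SKIP: docker unavailable"; return 0; }
    mkdir -p "$EVIDENCE_DIR/val17"

    # Isolated infrastructure: one PG container, one HA server session (--min-sync-replicas 0)
    wait_for_pg_container val17-pg-primary          # hypothetical argument order
    start_ha_server cp-val17-node 19003             # hypothetical argument order
    wait_for_http "http://localhost:19003/v1/ha/quorum"

    # Baseline, first loss/recovery cycle with timing, second cycle, audit query, report
    wait_for_quorum_health 19003 healthy            # hypothetical argument order
    docker stop val17-pg-primary
    wait_for_quorum_health 19003 lost
    # ... timing capture, write-block assertions, 10-check report generation ...
}
```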


Scenarios Under Test

| Scenario | Mechanism | Expected Outcome |
| --- | --- | --- |
| Baseline | HA server started, PG running | quorum_health=healthy, write_block_active=false |
| Primary PG loss (first) | docker stop val17-pg-primary | quorum_health=lost within ≤ 30,000 ms; write_block_active=true |
| Primary PG recovery (first) | docker start val17-pg-primary | quorum_health=healthy within ≤ 30,000 ms; write_block_active=false |
| Second loss/recovery cycle | docker stop + docker start | detected_loss_count = 2 |

Why single PG (--min-sync-replicas 0) rather than primary + standby

VAL17’s workplan targets are:

  • “Quorum loss detected within 1 minute” (Gap HA-004)

  • “Cluster stops writes when quorum lost” (Gap HA-004)

  • “Automatic recovery when quorum restored” (Gap HA-004)

All three are fully observable with a single-PG, zero-replica configuration. The healthy → degraded → lost transition path (which requires a standby) is already exercised by run_quorum_lab() using the --min-sync-replicas 1 setup with pr17ha-primary + pr17ha-standby. VAL17 avoids duplicating that streaming-replication infrastructure and focuses instead on the measurement and assertion gaps that run_quorum_lab() does not cover.

Out-of-scope

  • healthy → degraded path (standby loss with primary reachable) — covered by run_quorum_lab()

  • Streaming replication setup, pg_basebackup standby provisioning — covered by VAL14

  • Write-path HTTP gating verification (attempting POST /v1/rollouts during loss) — the HA server binary (orchestrator_ha_server) does not expose rollout endpoints; write_block_active in the quorum status JSON reflects the same underlying WriteGate state

  • Quorum loss triggered by network partition (iptables) — requires root

  • Multi-region quorum topologies


Quorum State Machine

healthy  ──── PG stops ────►  lost
  ▲                             │
  └────── PG restarts ──────────┘

degraded ──── primary stops ►  lost   (handled by run_quorum_lab, out of scope here)

State classification rules (from quorum.go):

  • healthy: CanAcceptProtectedWrites = true

  • degraded: primary reachable but CanAcceptProtectedWrites = false (lock absent, replicas short, or standby-only connection)

  • lost: DatabaseReachable = false (primary PG unreachable)

write_block_active: alias for !CanAcceptProtectedWrites. The WriteGate HTTP middleware uses the same underlying state to block incoming protected writes. write_block_active=true in the quorum status is authoritative evidence that writes are blocked.
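
The loss-phase safety assertion (VAL17-04) therefore needs nothing beyond the quorum status JSON. A hedged sketch, reusing the field names above (the exact layout of /v1/ha/quorum beyond those fields is an assumption):

```bash
# Hedged sketch of the VAL17-04 assertion plus the alias invariant described above.
curl -sS "http://localhost:19003/v1/ha/quorum" | python3 -c '
import json, sys
q = json.load(sys.stdin)
# write_block_active is defined as the negation of can_accept_protected_writes
assert q["write_block_active"] == (not q["can_accept_protected_writes"]), q
# during the loss phase both safety conditions must hold
assert q["write_block_active"] is True, q
assert q["can_accept_protected_writes"] is False, q
print("write_block_active=True can_accept_protected_writes=False")
'
```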


Timing Measurement Method

1. loss_start_ms  = python3: int(time.time()*1000)
2. docker stop val17-pg-primary
3. Poll /v1/ha/quorum every 200 ms until quorum_health == "lost"
   (via wait_for_quorum_health helper, 150 iterations × 200 ms = 30 s max)
4. loss_end_ms    = python3: int(time.time()*1000)
5. loss_ms        = loss_end_ms - loss_start_ms

Recovery timing follows the same pattern:

1. recovery_start_ms  = python3: int(time.time()*1000)
2. docker start val17-pg-primary (+ wait_for_pg_container)
3. Poll /v1/ha/quorum every 200 ms until quorum_health == "healthy"
4. recovery_end_ms    = python3: int(time.time()*1000)
5. recovery_ms        = recovery_end_ms - recovery_start_ms

Threshold: ≤ 30,000 ms (30 s) for both loss and recovery detection.
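
The numbered steps above map onto shell in a straightforward way. A sketch of the loss-side measurement is shown below; the real lab uses the wait_for_quorum_health helper, and the inline loop here only mirrors its documented behaviour (poll every 200 ms, at most 150 iterations). The URL and the evidence path are assumptions consistent with this document; recovery timing is symmetric.

```bash
# Sketch of the loss-detection timing measurement (recovery timing follows the same pattern).
now_ms() { python3 -c 'import time; print(int(time.time()*1000))'; }

loss_start_ms=$(now_ms)
docker stop val17-pg-primary >/dev/null

loss_ms=99999                                   # sentinel kept when the 30 s poll times out
for _ in $(seq 1 150); do                       # 150 iterations x 200 ms = 30 s max
    health=$(curl -sS "http://localhost:19003/v1/ha/quorum" \
        | python3 -c 'import json,sys; print(json.load(sys.stdin)["quorum_health"])')
    if [ "$health" = "lost" ]; then
        loss_ms=$(( $(now_ms) - loss_start_ms ))
        break
    fi
    sleep 0.2
done
echo "loss_ms=${loss_ms}" > "$EVIDENCE_DIR/val17/val17-03-loss-timing.txt"
```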

Threshold rationale:

| Bound | Workplan target | VAL17 threshold | Rationale |
| --- | --- | --- | --- |
| Loss detection | ≤ 60 s | ≤ 30,000 ms | --quorum-monitor-interval 500ms; PG connection timeout ~2-5 s in Docker; 30 s gives 15× margin |
| Recovery detection | ≤ 60 s | ≤ 30,000 ms | Same: PG ready within 3-5 s after docker start; QuorumMonitor picks up on next 500 ms tick |

The 30 s threshold is conservative enough to accommodate cold Docker starts and slow CI environments while being well within the workplan’s 60 s requirement.


VAL17 10-Check Matrix

| Check | Name | Pass condition | Phase |
| --- | --- | --- | --- |
| VAL17-01 | baseline_quorum_healthy | quorum_health=healthy; write_block_active=false | Baseline |
| VAL17-02 | loss_detected | quorum_health=lost after docker stop | Loss detection |
| VAL17-03 | loss_detection_timing_bound | loss_ms ≤ 30000 | Timing |
| VAL17-04 | write_blocked_during_loss | write_block_active=true; can_accept_protected_writes=false | Safety |
| VAL17-05 | loss_reason_populated | quorum_loss_reason non-empty when quorum_health=lost | Diagnostics |
| VAL17-06 | recovery_detected | quorum_health=healthy after docker start | Recovery |
| VAL17-07 | recovery_timing_bound | recovery_ms ≤ 30000 | Timing |
| VAL17-08 | write_unblocked_after_recovery | write_block_active=false; can_accept_protected_writes=true; last_lost_at non-empty; last_restored_at non-empty; detected_loss_count = 1 | Safety + Monitor |
| VAL17-09 | second_cycle_count_increments | After second loss/recovery completes: detected_loss_count = 2 | Resilience |
| VAL17-10 | audit_events_captured | ≥ 1 slice-local ha.quorum.lost + ≥ 1 slice-local ha.quorum.restored in audit store | Audit |
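
For VAL17-10, the counts can be derived directly from the audit query output. In the sketch below, "$AUDIT_CLI" is a hypothetical stand-in for the lab's audit CLI entry point, and the assumption that the query result parses as a JSON list is mine rather than the lab's:

```bash
# Hedged sketch of the VAL17-10 event counting. Only --event-type and --start-time
# come from this document; "$AUDIT_CLI" and the JSON-list result shape are assumptions.
count_events() {
    "$AUDIT_CLI" audit query --event-type "$1" --start-time "$val17_start_time" \
        | python3 -c 'import json, sys; print(len(json.load(sys.stdin)))'
}

lost_events=$(count_events ha.quorum.lost)
restored_events=$(count_events ha.quorum.restored)
echo "lost_events=${lost_events} restored_events=${restored_events}" \
    > "$EVIDENCE_DIR/val17/val17-10-audit-check.txt"
[ "$lost_events" -ge 1 ] && [ "$restored_events" -ge 1 ]
```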


Pass/Fail Criteria

| Outcome | Condition |
| --- | --- |
| PASS | All 10 checks pass |
| PARTIAL | Checks 2, 4, 6, 8 pass (loss detected, write blocked, recovery detected, write unblocked), but not all 10 |
| FAIL | Check 2 fails (quorum loss not detected) OR check 4 fails (writes not blocked during loss) |

The two mandatory checks are VAL17-02 (loss detected) and VAL17-04 (write blocked). A system that detects loss but does not block writes provides no safety guarantee. A system that blocks writes but never detects loss means the write_block_active signal is unreliable.
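
One way the table above could be encoded, assuming check results are stored in a bash associative array alongside a pass_count tally; how combinations outside the three rows are classified is an assumption of this sketch:

```bash
# Sketch of the outcome rule. result[VAL17-NN] is assumed to hold "PASS" or "FAIL",
# and any combination not covered by the table is treated as FAIL here.
declare -A result            # filled by the individual check steps
pass_count=0                 # incremented once per passing check

if [ "${result[VAL17-02]}" != "PASS" ] || [ "${result[VAL17-04]}" != "PASS" ]; then
    outcome="FAIL"           # mandatory checks: loss detected + writes blocked
elif [ "$pass_count" -eq 10 ]; then
    outcome="PASS"
elif [ "${result[VAL17-06]}" = "PASS" ] && [ "${result[VAL17-08]}" = "PASS" ]; then
    outcome="PARTIAL"        # core loss/block/recovery/unblock cycle passes, but not all 10
else
    outcome="FAIL"
fi
```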


Evidence Files

| File | Description |
| --- | --- |
| val17-pg-setup.txt | Docker container IP, PG URL used by HA server |
| val17-ha-server.log | HA server log (single session throughout all phases) |
| val17-01-baseline.json | /v1/ha/quorum JSON at baseline (quorum_health=healthy) |
| val17-01-baseline.txt | ha quorum status CLI output at baseline |
| val17-01-baseline-check.txt | Python assertion result: quorum_health=healthy write_block_active=False |
| val17-02-quorum-lost.json | /v1/ha/quorum JSON after PG stop (quorum_health=lost) |
| val17-02-quorum-lost.txt | ha quorum status CLI output during loss |
| val17-03-loss-timing.txt | loss_ms=<N> |
| val17-04-write-block-check.txt | Python assertion: write_block_active=True can_accept_protected_writes=False |
| val17-05-loss-reason.txt | Python assertion: quorum_loss_reason=<message> |
| val17-06-quorum-recovered.json | /v1/ha/quorum JSON after PG restart (quorum_health=healthy) |
| val17-06-quorum-recovered.txt | ha quorum status CLI output after recovery |
| val17-07-recovery-timing.txt | recovery_ms=<N> |
| val17-08-recovery-check.txt | Python assertion: write_block_active, can_accept_protected_writes, last_lost_at, last_restored_at, detected_loss_count |
| val17-09-second-loss.json | /v1/ha/quorum JSON during second loss (quorum_health=lost) |
| val17-09-second-recovery.json | /v1/ha/quorum JSON after second recovery |
| val17-09-count-check.txt | Python assertion: detected_loss_count=2 (second cycle confirmed) after second recovery succeeds |
| val17-10-audit-lost.json | audit query --event-type ha.quorum.lost --start-time <val17_start_time> JSON result |
| val17-10-audit-restored.json | audit query --event-type ha.quorum.restored --start-time <val17_start_time> JSON result |
| val17-10-audit-check.txt | Python assertion: lost_events=N restored_events=M |
| val17-report.txt | Human-readable 10-check PASS/FAIL report with loss_ms and recovery_ms |
| val17-report.json | Machine-readable JSON report with loss_ms, recovery_ms, pass_count |
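
A minimal sketch of how val17-report.json could be emitted; loss_ms, recovery_ms, and pass_count come from the description above, while the surrounding JSON layout is illustrative:

```bash
# Sketch of the machine-readable report writer; the exact schema is an assumption.
python3 - "$loss_ms" "$recovery_ms" "$pass_count" \
    > "$EVIDENCE_DIR/val17/val17-report.json" <<'PY'
import json, sys
loss_ms, recovery_ms, pass_count = (int(a) for a in sys.argv[1:4])
print(json.dumps({"loss_ms": loss_ms, "recovery_ms": recovery_ms, "pass_count": pass_count}, indent=2))
PY
```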


Known Failure Modes

| Failure | Likely Cause | Mitigation |
| --- | --- | --- |
| VAL17-01 FAIL: baseline not healthy | HA server not yet leader; QuorumMonitor first tick hasn't run | Increase wait_for_log target; check val17-ha-server.log for acquired leadership before baseline capture |
| VAL17-02/03 FAIL: loss_ms=99999 | wait_for_quorum_health timed out (30 s) — PG stop didn't propagate | Check val17-ha-server.log for connection error; verify docker stop succeeded; verify --quorum-monitor-interval flag was accepted |
| VAL17-04 FAIL: write_block_active=false during loss | QuorumMonitor not wired to WriteGate in HA server binary; or snapshot taken before monitor tick | Check if --quorum-monitor-interval flag is supported by the binary version; re-run and check timing |
| VAL17-08 FAIL: last_lost_at empty | QuorumMonitor not running (no --quorum-monitor-interval passed) | Verify HA server log shows quorum monitor started; check flag spelling |
| VAL17-09 FAIL: detected_loss_count < 2 | Second wait_for_quorum_health for "healthy" captured before monitor incremented counter | Wait an extra 500 ms after recovery before reading count; this is a timing race |
| VAL17-10 FAIL: no audit events | AUTONOMY_AUDIT_DIR not set, HA server using a different audit dir, or slice-local events not emitted after val17_start_time | Verify AUTONOMY_AUDIT_DIR export; check val17-ha-server.log for audit emitter startup |
| Docker not available | CI environment without Docker daemon | Function prints SKIP and exits 0; VAL17 not counted in pass total |


Final Report Template

# VAL 17 — Quorum Loss Validation

Generated:     <timestamp>
Node:          cp-val17-node:19003

## Timing
Loss detection:     <N> ms  (threshold=30000)
Recovery detection: <N> ms  (threshold=30000)

## Checks
VAL17-01 baseline_quorum_healthy:       PASS
VAL17-02 loss_detected:                 PASS
VAL17-03 loss_detection_timing_bound:   PASS  (loss_ms=<N>, threshold=30000)
VAL17-04 write_blocked_during_loss:     PASS
VAL17-05 loss_reason_populated:         PASS
VAL17-06 recovery_detected:             PASS
VAL17-07 recovery_timing_bound:         PASS  (recovery_ms=<N>, threshold=30000)
VAL17-08 write_unblocked_after_recovery: PASS
VAL17-09 second_cycle_count_increments: PASS
VAL17-10 audit_events_captured:         PASS

## Summary
pass=10  fail=0  total=10

Quorum Loss Readiness Assessment (Gap HA-004):

  • PASS requires VAL17-02 (loss detected) + VAL17-04 (write blocked) + VAL17-06 (recovery detected)

  • Record loss_ms and recovery_ms as baseline timing evidence against workplan ≤ 60 s target

  • Repeat VAL17 after any change to pgstore/quorum.go, QuorumMonitor interval configuration, or WriteGate middleware