# VAL 17 — Quorum Loss Validation

- Status: Implemented
- Runner: `run_quorum_loss_val17_lab()` in `scripts/labs/run_cli_audit_lab.sh`
- Evidence dir: `$EVIDENCE_DIR/val17/`
- Port: `cp-val17-node` → 19003
## Purpose

Validates the quorum loss detection and recovery subsystem with timed measurements and write-blocking assertions:

- Confirms `quorum_health=healthy` and `write_block_active=false` at baseline
- Measures end-to-end quorum loss detection latency against the workplan ≤ 60 s target
- Verifies `write_block_active=true` and `can_accept_protected_writes=false` during loss
- Verifies `quorum_loss_reason` is populated with a diagnostic message during loss
- Verifies the `last_lost_at` timestamp is set by the QuorumMonitor on first loss
- Measures end-to-end quorum recovery detection latency after PostgreSQL is restored
- Verifies `write_block_active=false` and `can_accept_protected_writes=true` after recovery
- Verifies the `last_restored_at` timestamp is set and `detected_loss_count ≥ 1`
- Proves `detected_loss_count` increments correctly across a second loss/recovery cycle
- Captures slice-local `ha.quorum.lost` and `ha.quorum.restored` audit events in the shared store
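As a hedged illustration, the baseline assertion (VAL17-01) can be expressed as a small Python check over the quorum status JSON. The field names come from the checks above; the status dict is a hand-written stand-in for the real HTTP response, and `assert_baseline` is an illustrative helper, not part of the runner:

```python
# Baseline invariant check over a /v1/ha/quorum status document.
# Field names match those asserted at baseline; the fetch step is stubbed.

def assert_baseline(status: dict) -> None:
    """Raise AssertionError unless the node is at a healthy baseline."""
    assert status["quorum_health"] == "healthy", status
    assert status["write_block_active"] is False, status
    assert status["can_accept_protected_writes"] is True, status

# Example status as the HA server might report at baseline (illustrative).
baseline = {
    "quorum_health": "healthy",
    "write_block_active": False,
    "can_accept_protected_writes": True,
}
assert_baseline(baseline)  # passes silently
```

In the actual runner the status would come from an HTTP GET against the node's quorum endpoint rather than a literal dict.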
Branch-Specific Rule Application¶
Question |
Answer |
|---|---|
Is this covered by an existing LAB? |
Partially. |
Which LAB/evidence bundle is extended? |
|
New evidence files |
22 files in |
Tutorial/runbook docs updated |
|
Reason new runner function required |
|
## Scenarios Under Test

| Scenario | Mechanism | Expected Outcome |
|---|---|---|
| Baseline | HA server started, PG running | |
| Primary PG loss (first) | | |
| Primary PG recovery (first) | | |
| Second loss/recovery cycle | | |
## Why single PG (`--min-sync-replicas 0`) rather than primary + standby

VAL17’s workplan targets are:

- “Quorum loss detected within 1 minute” (Gap HA-004)
- “Cluster stops writes when quorum lost” (Gap HA-004)
- “Automatic recovery when quorum restored” (Gap HA-004)

All three are fully observable with a single-PG, zero-replica configuration.
The healthy → degraded → lost transition path (which requires a standby) is
already exercised by `run_quorum_lab()` using the `--min-sync-replicas 1`
setup with `pr17ha-primary` + `pr17ha-standby`. VAL17 avoids duplicating that
streaming-replication infrastructure and focuses instead on the measurement and
assertion gaps that `run_quorum_lab()` does not cover.
## Out-of-scope

- `healthy → degraded` path (standby loss with primary reachable) — covered by `run_quorum_lab()`
- Streaming replication setup, `pg_basebackup` standby provisioning — covered by VAL14
- Write-path HTTP gating verification (attempting `POST /v1/rollouts` during loss) — the HA server binary (`orchestrator_ha_server`) does not expose rollout endpoints; `write_block_active` in the quorum status JSON reflects the same underlying WriteGate state
- Quorum loss triggered by network partition (iptables) — requires root
- Multi-region quorum topologies
## Quorum State Machine

```
healthy ──── PG stops ────► lost
   ▲                          │
   └────── PG restarts ───────┘

degraded ──── primary stops ► lost   (handled by run_quorum_lab, out of scope here)
```
State classification rules (from `quorum.go`):

- `healthy`: `CanAcceptProtectedWrites = true`
- `degraded`: primary reachable but `CanAcceptProtectedWrites = false` (lock absent, replicas short, or standby-only connection)
- `lost`: `DatabaseReachable = false` (primary PG unreachable)

`write_block_active` is an alias for `!CanAcceptProtectedWrites`. The WriteGate
HTTP middleware uses the same underlying state to block incoming protected
writes, so `write_block_active=true` in the quorum status is authoritative
evidence that writes are blocked.
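The classification rules above can be sketched in Python. This is a simplified model of the Go logic in `quorum.go`, not a transcription of it; only the two inputs named in the rules (`DatabaseReachable`, `CanAcceptProtectedWrites`) are assumed, and the function names are illustrative:

```python
# Simplified model of the quorum health classification described above:
# "lost" when the primary DB is unreachable, "healthy" when protected
# writes are accepted, "degraded" otherwise.

def classify(database_reachable: bool, can_accept_protected_writes: bool) -> str:
    if not database_reachable:
        return "lost"
    if can_accept_protected_writes:
        return "healthy"
    return "degraded"

def write_block_active(can_accept_protected_writes: bool) -> bool:
    # Alias relation from the text: write_block_active = !CanAcceptProtectedWrites.
    return not can_accept_protected_writes

assert classify(True, True) == "healthy"
assert classify(True, False) == "degraded"
assert classify(False, False) == "lost"
assert write_block_active(False) is True
```

Note that `lost` takes precedence: an unreachable database is reported as `lost` even though `CanAcceptProtectedWrites` is also false in that state.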
## Timing Measurement Method

1. `loss_start_ms = python3: int(time.time()*1000)`
2. `docker stop val17-pg-primary`
3. Poll `/v1/ha/quorum` every 200 ms until `quorum_health == "lost"`
   (via the `wait_for_quorum_health` helper, 150 iterations × 200 ms = 30 s max)
4. `loss_end_ms = python3: int(time.time()*1000)`
5. `loss_ms = loss_end_ms - loss_start_ms`

Recovery timing follows the same pattern:

1. `recovery_start_ms = python3: int(time.time()*1000)`
2. `docker start val17-pg-primary` (+ `wait_for_pg_container`)
3. Poll `/v1/ha/quorum` every 200 ms until `quorum_health == "healthy"`
4. `recovery_end_ms = python3: int(time.time()*1000)`
5. `recovery_ms = recovery_end_ms - recovery_start_ms`

Threshold: ≤ 30,000 ms (30 s) for both loss and recovery detection.
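The polling loop above can be sketched in Python as a testable stand-in for the shell helper. The fetch function is injected so the real runner's HTTP call could be substituted; the iteration count and interval match the 150 × 200 ms bound stated above:

```python
import time

def wait_for_quorum_health(fetch_status, target: str,
                           iterations: int = 150, interval_s: float = 0.2):
    """Poll until fetch_status()['quorum_health'] == target.

    Returns elapsed milliseconds, or None if the target state was not
    reached within iterations * interval_s (150 x 200 ms = 30 s max).
    """
    start_ms = int(time.time() * 1000)
    for _ in range(iterations):
        if fetch_status().get("quorum_health") == target:
            return int(time.time() * 1000) - start_ms
        time.sleep(interval_s)
    return None

# Illustrative stub: the node reports "lost" on the third poll.
responses = iter(["healthy", "healthy", "lost"])
loss_ms = wait_for_quorum_health(lambda: {"quorum_health": next(responses)},
                                 "lost", interval_s=0.001)
assert loss_ms is not None and loss_ms <= 30_000
```

In the runner the elapsed time spans the `docker stop` as well, so the measured `loss_ms` includes both the container shutdown and the monitor's detection lag.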
Threshold rationale:

| Bound | Workplan target | VAL17 threshold | Rationale |
|---|---|---|---|
| Loss detection | ≤ 60 s | ≤ 30,000 ms | |
| Recovery detection | ≤ 60 s | ≤ 30,000 ms | Same: PG ready within 3-5 s after |
The 30 s threshold is conservative enough to accommodate cold Docker starts and slow CI environments while being well within the workplan’s 60 s requirement.
## VAL17 10-Check Matrix

| Check | Name | Threshold | Phase |
|---|---|---|---|
| VAL17-01 | `baseline_quorum_healthy` | | Baseline |
| VAL17-02 | `loss_detected` | | Loss detection |
| VAL17-03 | `loss_detection_timing_bound` | | Timing |
| VAL17-04 | `write_blocked_during_loss` | | Safety |
| VAL17-05 | `loss_reason_populated` | | Diagnostics |
| VAL17-06 | `recovery_detected` | | Recovery |
| VAL17-07 | `recovery_timing_bound` | | Timing |
| VAL17-08 | `write_unblocked_after_recovery` | | Safety + Monitor |
| VAL17-09 | `second_cycle_count_increments` | After second loss/recovery completes: | Resilience |
| VAL17-10 | `audit_events_captured` | ≥ 1 slice-local | Audit |
## Pass/Fail Criteria

| Outcome | Condition |
|---|---|
| PASS | All 10 checks pass |
| PARTIAL | Checks 2, 4, 6, 8 pass (loss detected, write blocked, recovery detected, write unblocked) |
| FAIL | Check 2 fails (quorum loss not detected) OR check 4 fails (writes not blocked during loss) |
The two mandatory checks are VAL17-02 (loss detected) and VAL17-04 (write
blocked). A system that detects loss but does not block writes provides no
safety guarantee; a system that blocks writes but never detects loss means the
`write_block_active` signal is unreliable.
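Under the criteria above, the overall verdict can be derived from the per-check results. A minimal sketch, assuming a dict of booleans keyed by check number (the `verdict` function is illustrative, and the fallback for combinations the table leaves unspecified is a conservative assumption):

```python
# Verdict derivation from the Pass/Fail Criteria table:
#   PASS    - all 10 checks pass
#   FAIL    - check 2 (loss detected) or check 4 (write blocked) fails
#   PARTIAL - otherwise, provided checks 2, 4, 6, 8 all pass
# Combinations the table does not cover are reported as FAIL here
# as a conservative default (an assumption, not a stated rule).

def verdict(results: dict) -> str:
    """results maps check number (1-10) to True/False."""
    if all(results[i] for i in range(1, 11)):
        return "PASS"
    if not results[2] or not results[4]:
        return "FAIL"
    if all(results[i] for i in (2, 4, 6, 8)):
        return "PARTIAL"
    return "FAIL"

all_pass = {i: True for i in range(1, 11)}
assert verdict(all_pass) == "PASS"
assert verdict({**all_pass, 5: False}) == "PARTIAL"
assert verdict({**all_pass, 4: False}) == "FAIL"
```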
## Evidence Files

| File | Description |
|---|---|
| | Docker container IP, PG URL used by HA server |
| | HA server log (single session throughout all phases) |
| | |
| | |
| | Python assertion result: |
| | |
| | |
| | |
| | Python assertion: |
| | Python assertion: |
| | |
| | |
| | |
| | Python assertion: `write_block_active`, `can_accept_protected_writes`, `last_lost_at`, `last_restored_at`, `detected_loss_count` |
| | |
| | |
| | Python assertion: |
| | |
| | |
| | Python assertion: |
| | Human-readable 10-check PASS/FAIL report with |
| | Machine-readable JSON report with |
## Known Failure Modes

| Failure | Likely Cause | Mitigation |
|---|---|---|
| VAL17-01 FAIL: baseline not healthy | HA server not yet leader; QuorumMonitor first tick hasn’t run | Increase |
| VAL17-02/03 FAIL: | | Check |
| VAL17-04 FAIL: | QuorumMonitor not wired to WriteGate in HA server binary; or snapshot taken before monitor tick | Check if |
| VAL17-08 FAIL: | QuorumMonitor not running (no | Verify HA server log shows |
| VAL17-09 FAIL: | Second | Wait an extra 500 ms after recovery before reading count; timing race |
| VAL17-10 FAIL: no audit events | | Verify |
| Docker not available | CI environment without Docker daemon | Function prints SKIP and exits 0; VAL17 not counted in pass total |
## Final Report Template

```
# VAL 17 — Quorum Loss Validation
Generated: <timestamp>
Node: cp-val17-node:19003

## Timing
Loss detection: <N> ms (threshold=30000)
Recovery detection: <N> ms (threshold=30000)

## Checks
VAL17-01 baseline_quorum_healthy: PASS
VAL17-02 loss_detected: PASS
VAL17-03 loss_detection_timing_bound: PASS (loss_ms=<N>, threshold=30000)
VAL17-04 write_blocked_during_loss: PASS
VAL17-05 loss_reason_populated: PASS
VAL17-06 recovery_detected: PASS
VAL17-07 recovery_timing_bound: PASS (recovery_ms=<N>, threshold=30000)
VAL17-08 write_unblocked_after_recovery: PASS
VAL17-09 second_cycle_count_increments: PASS
VAL17-10 audit_events_captured: PASS

## Summary
pass=10 fail=0 total=10
```
Quorum Loss Readiness Assessment (Gap HA-004):

- PASS requires VAL17-02 (loss detected) + VAL17-04 (write blocked) + VAL17-06 (recovery detected)
- Record `loss_ms` and `recovery_ms` as baseline timing evidence against the workplan ≤ 60 s target
- Repeat VAL17 after any change to `pgstore/quorum.go`, QuorumMonitor interval configuration, or WriteGate middleware