VAL 16 — Split-Brain Chaos Validation¶
Status: Implemented
Runner: run_split_brain_chaos_val16_lab() in scripts/labs/run_cli_audit_lab.sh
Evidence dir: $EVIDENCE_DIR/val16/
Port: cp-val16-node → 19002
Purpose¶
Validates the split-brain detection and recovery subsystem under injected fault conditions:
Confirms risk=none at baseline (healthy single-node cluster)
Verifies risk=detected is raised when epoch divergence is injected directly into leadership_state
Verifies risk=possible is raised when unclosed ghost-node epoch rows exist in leader_epochs
Confirms detection is idempotent (repeated API calls return the same risk level without side effects)
Proves manual-reconcile strategy exits successfully without writing to the database (dry-run)
Proves promote-leader strategy resolves risk=detected and restores risk=none
Verifies user data rows are untouched by leadership metadata recovery
Confirms ghost-node risk=possible self-clears after resigned_at is stamped on injected rows
Captures ha.split_brain.detected and ha.split_brain.recovered audit events in the shared store
Verifies cluster stability after all chaos scenarios complete
Branch-Specific Rule Application¶
| Question | Answer |
|---|---|
| Is this covered by an existing LAB? | Partially. |
| Which LAB/evidence bundle is extended? | |
| New evidence files | 26 files in |
| Tutorial/runbook docs updated | |
| Reason new runner function required | |
Scenarios Under Test¶
| Scenario | Injection Mechanism | Expected Outcome |
|---|---|---|
| Baseline (no injection) | None | |
| Epoch divergence | | |
| Idempotent re-detection | Second API call after injection | Same |
| Dry-run recovery | | Exit 0; |
| Execute recovery | | Exit 0; |
| Ghost-node unclosed epochs | | |
| Ghost-node self-clearing | | |
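The idempotent re-detection scenario amounts to polling the risk level more than once and requiring a single distinct answer. A minimal sketch follows; `fetch_risk` is a hypothetical callable standing in for whatever API call the runner makes, and the helper only compares returned values (it cannot observe side effects).

```python
from typing import Callable


def detection_is_idempotent(fetch_risk: Callable[[], str], calls: int = 2) -> bool:
    """True when `calls` consecutive polls all report the same risk level."""
    if calls < 2:
        raise ValueError("need at least two polls to compare")
    observed = [fetch_risk() for _ in range(calls)]
    return len(set(observed)) == 1


# Stub standing in for the real API poll (hypothetical).
def stable_poll() -> str:
    return "detected"


print(detection_is_idempotent(stable_poll))  # True
```

A poll that flips between levels across calls would make the helper return False, which corresponds to a failing idempotency check.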
Out-of-scope¶
Real network partitions via iptables (requires root/privileged containers)
Two-node genuine split-brain with separate PostgreSQL connections (would require network isolation not available in standard Docker without iptables)
Automatic rollback trigger under split-brain (deferred, no implementation)
Multi-region / WAN split-brain scenarios
Write-path corruption during split-brain window (advisory lock release governs this; not directly testable without concurrent writes and partition)
Safety and Isolation Approach¶
| Guardrail | Detail |
|---|---|
| Fresh Docker containers | |
| Local | Kills HA server PID, removes container/volume/network before returning; called before any |
| | Top-level cleanup trap also removes |
| Docker availability guard | |
| Single PG node | |
| | Stable leadership before injection; fast enough for Campaign to notice cleared epochs |
| | Advisory lock keepalive; keeps lock held during injection so HA server doesn't Campaign and overwrite |
| SQL injection scope | Only |
Why SQL injection rather than real partition¶
The split-brain detection logic in split_brain.go evaluates two conditions:
Epoch divergence (risk=detected): leadership_state.current_epoch does not match the in-process elector and holder_id differs from the local node ID. Directly updatable via psql.
Unclosed epoch count (risk=possible): leader_epochs rows with resigned_at IS NULL AND holder_id != current_holder count > 1. Directly insertable via psql.
Both conditions are reliably triggered and cleared via SQL without requiring a second HA server, network partition, or iptables rules. This approach is deterministic, fast, and CI-safe.
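The two conditions can be expressed as a small classifier. This is an illustrative sketch, not the code in split_brain.go: the `LeadershipView` type, its field names, and the choice to let `detected` take precedence over `possible` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeadershipView:
    """Inputs to the risk check; every field name here is illustrative."""
    local_epoch: int              # epoch held by the in-process elector
    local_node_id: str            # this node's ID
    db_epoch: int                 # leadership_state.current_epoch
    db_holder_id: str             # leadership_state.holder_id
    unclosed_foreign_epochs: int  # leader_epochs rows: resigned_at IS NULL, foreign holder


def classify_risk(view: LeadershipView) -> str:
    """Return 'detected', 'possible', or 'none' for the two conditions above."""
    epoch_diverged = view.db_epoch != view.local_epoch
    foreign_holder = view.db_holder_id != view.local_node_id
    if epoch_diverged and foreign_holder:
        return "detected"  # assumed to outrank 'possible'
    if view.unclosed_foreign_epochs > 1:
        return "possible"
    return "none"


healthy = LeadershipView(5, "cp-val16-node", 5, "cp-val16-node", 0)
print(classify_risk(healthy))  # none
```

Because both inputs come straight from table rows, either branch can be driven from psql alone, which is what makes the SQL-based approach deterministic.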
Injection Details¶
Epoch divergence (→ risk=detected)¶
UPDATE leadership_state
SET current_epoch = current_epoch + 99,
holder_id = 'val16-injected-node'
WHERE id = 1;
This simultaneously advances the epoch counter far beyond the in-process
value and changes the holder ID to a foreign node. The HA server’s next
split-brain check (triggered by the API poll) compares its in-process epoch
and holder to the DB row and raises risk=detected.
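To see the effect of the injection in isolation, the same UPDATE can be replayed against a throwaway database. The sketch below uses an in-memory SQLite database as a stand-in for the lab's PostgreSQL instance; the three-column table shape and the starting epoch of 7 are assumptions, since only the column names used by the UPDATE are known from this page.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leadership_state "
    "(id INTEGER PRIMARY KEY, current_epoch INTEGER, holder_id TEXT)"
)
conn.execute("INSERT INTO leadership_state VALUES (1, 7, 'cp-val16-node')")

# What the in-process elector believes (hypothetical values).
local_epoch, local_node_id = 7, "cp-val16-node"

# Same statement as the injection above.
conn.execute(
    "UPDATE leadership_state "
    "SET current_epoch = current_epoch + 99, holder_id = 'val16-injected-node' "
    "WHERE id = 1"
)

db_epoch, db_holder = conn.execute(
    "SELECT current_epoch, holder_id FROM leadership_state WHERE id = 1"
).fetchone()
diverged = db_epoch != local_epoch and db_holder != local_node_id
print(db_epoch, db_holder, diverged)  # 106 val16-injected-node True
```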
Ghost-node injection (→ risk=possible)¶
INSERT INTO leader_epochs (epoch, holder_id, acquired_at)
SELECT current_epoch + 1000, 'val16-ghost-a', NOW() FROM leadership_state WHERE id = 1;
INSERT INTO leader_epochs (epoch, holder_id, acquired_at)
SELECT current_epoch + 1001, 'val16-ghost-b', NOW() FROM leadership_state WHERE id = 1;
Two rows with resigned_at IS NULL for foreign holder IDs simulate the condition where multiple past leaders never wrote a clean resign. The HA server counts unclosed-epoch rows and raises risk=possible.
Ghost-node self-clearing¶
UPDATE leader_epochs
SET resigned_at = NOW()
WHERE holder_id IN ('val16-ghost-a', 'val16-ghost-b')
AND resigned_at IS NULL;
Stamping resigned_at causes the unclosed count to drop to 0, returning risk=none without any HA server restart.
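The inject-then-clear cycle can be replayed end to end on a scratch database. This is a sketch under stated assumptions: SQLite replaces PostgreSQL (so `CURRENT_TIMESTAMP` is used where the lab uses `NOW()`), the table shapes are guessed from the column names above, and the counting query is one plausible reading of the unclosed-epoch condition rather than the query the HA server runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL (assumption)
conn.executescript("""
CREATE TABLE leadership_state (id INTEGER PRIMARY KEY, current_epoch INTEGER, holder_id TEXT);
CREATE TABLE leader_epochs (epoch INTEGER, holder_id TEXT, acquired_at TEXT, resigned_at TEXT);
INSERT INTO leadership_state VALUES (1, 7, 'cp-val16-node');
INSERT INTO leader_epochs (epoch, holder_id, acquired_at)
    VALUES (7, 'cp-val16-node', CURRENT_TIMESTAMP);
""")


def unclosed_foreign(conn: sqlite3.Connection) -> int:
    """Count open epoch rows held by anyone other than the current holder."""
    return conn.execute(
        "SELECT COUNT(*) FROM leader_epochs "
        "WHERE resigned_at IS NULL "
        "AND holder_id != (SELECT holder_id FROM leadership_state WHERE id = 1)"
    ).fetchone()[0]


before = unclosed_foreign(conn)

# Ghost-node injection: two open rows for foreign holders.
for offset, name in ((1000, "val16-ghost-a"), (1001, "val16-ghost-b")):
    conn.execute(
        "INSERT INTO leader_epochs (epoch, holder_id, acquired_at) "
        "SELECT current_epoch + ?, ?, CURRENT_TIMESTAMP FROM leadership_state WHERE id = 1",
        (offset, name),
    )
during = unclosed_foreign(conn)

# Self-clearing: stamp resigned_at on the injected rows.
conn.execute(
    "UPDATE leader_epochs SET resigned_at = CURRENT_TIMESTAMP "
    "WHERE holder_id IN ('val16-ghost-a', 'val16-ghost-b') AND resigned_at IS NULL"
)
after = unclosed_foreign(conn)

print(before, during, after)  # 0 2 0
```

The legitimate leader's own open row is excluded by the holder comparison, so only the two ghost rows move the count.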
Data Integrity Probe¶
A val16_probe table row (id=1, note='pre-inject') is inserted while the HA server holds leadership:
CREATE TABLE IF NOT EXISTS val16_probe (id INT PRIMARY KEY, note TEXT);
INSERT INTO val16_probe (id, note) VALUES (1, 'pre-inject')
ON CONFLICT (id) DO UPDATE SET note = EXCLUDED.note;
After promote-leader recovery, the row is read back:
SELECT note FROM val16_probe WHERE id = 1;
Expected: note = 'pre-inject' (unchanged). This proves that split-brain recovery operations — which only write to leadership_state and leader_epochs — do not affect user data tables.
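The probe logic reduces to a read, a metadata-only write, and a second read. In the sketch below, SQLite stands in for PostgreSQL and a single UPDATE on leadership_state stands in for promote-leader; what the real strategy writes is not shown on this page, so that statement is purely an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL (assumption)
conn.executescript("""
CREATE TABLE leadership_state (id INTEGER PRIMARY KEY, current_epoch INTEGER, holder_id TEXT);
INSERT INTO leadership_state VALUES (1, 106, 'val16-injected-node');
CREATE TABLE IF NOT EXISTS val16_probe (id INT PRIMARY KEY, note TEXT);
INSERT INTO val16_probe (id, note) VALUES (1, 'pre-inject')
    ON CONFLICT (id) DO UPDATE SET note = EXCLUDED.note;
""")


def read_probe(conn: sqlite3.Connection):
    """Return the probe note, or None if the row is missing."""
    row = conn.execute("SELECT note FROM val16_probe WHERE id = 1").fetchone()
    return row[0] if row else None


before = read_probe(conn)

# Stand-in for promote-leader: touches leadership metadata only (assumption).
conn.execute("UPDATE leadership_state SET holder_id = 'cp-val16-node' WHERE id = 1")

after = read_probe(conn)
probe_intact = before == after == "pre-inject"
print(probe_intact)  # True
```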
VAL16 10-Check Matrix¶
| Check | Name | Threshold | Scenario |
|---|---|---|---|
| VAL16-01 | baseline_risk_none | | Baseline |
| VAL16-02 | epoch_inject_detected | | Epoch injection |
| VAL16-03 | detection_idempotent | Repeated API call returns same | Idempotency |
| VAL16-04 | dry_run_reconcile_ok | | Dry-run |
| VAL16-05 | promote_leader_recovers | | Execute recovery |
| VAL16-06 | data_integrity_after_recovery | | User data |
| VAL16-07 | ghost_node_possible | | Ghost-node injection |
| VAL16-08 | ghost_node_self_clears | | Ghost-node clearing |
| VAL16-09 | audit_events_captured | ≥ 1 | Audit |
| VAL16-10 | post_chaos_stability | | Final stability |
Pass/Fail Criteria¶
| Outcome | Condition |
|---|---|
| PASS | All 10 checks pass |
| PARTIAL | Checks 2, 5, 6 pass (epoch detection, promote-leader recovery, data integrity) |
| FAIL | Check 5 fails (promote-leader did not clear detected risk) OR check 6 fails (user data corrupted) |
Two checks are mandatory. VAL16-05 (promote-leader recovery) must pass because detection without a working recovery path is a critical defect. VAL16-06 (data integrity) must pass to prove that recovery is scoped to metadata only.
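The outcome rules can be written as a short decision function. This is a sketch, not the runner's logic: the function name is hypothetical, a missing result is treated as a failure, and the case where checks 5 and 6 pass but check 2 fails is not covered by the criteria above, so it is mapped to FAIL here as an explicit assumption.

```python
from typing import Mapping


def val16_outcome(results: Mapping[int, bool]) -> str:
    """Map {check number 1..10: passed?} to PASS, PARTIAL, or FAIL."""
    def passed(n: int) -> bool:
        return results.get(n, False)  # missing result counts as a failure

    if all(passed(n) for n in range(1, 11)):
        return "PASS"
    if not passed(5) or not passed(6):
        return "FAIL"  # a mandatory check failed
    if passed(2):
        return "PARTIAL"  # checks 2, 5, 6 pass but something else failed
    return "FAIL"  # assumption: criteria do not cover 5 and 6 passing with 2 failing


all_green = {n: True for n in range(1, 11)}
print(val16_outcome(all_green))                 # PASS
print(val16_outcome({**all_green, 9: False}))   # PARTIAL
print(val16_outcome({**all_green, 5: False}))   # FAIL
```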
Evidence Files¶
| File | Description |
|---|---|
| | HA server log (single normal session throughout) |
| | Docker psql output: |
| | |
| | |
| | SQL UPDATE output (epoch divergence injection) |
| | |
| | |
| | Second |
| | |
| | |
| | Python assertion result: |
| | |
| | |
| | |
| | SQL SELECT: |
| | SQL INSERT output (ghost-node epoch rows) |
| | |
| | |
| | SQL UPDATE output (stamp |
| | |
| | |
| | |
| | Python assertion result: slice-scoped event counts for both types |
| | |
| | Python assertion result: |
| | Human-readable 10-check PASS/FAIL report |
| | Machine-readable JSON report with |
Known Failure Modes¶
| Failure | Likely Cause | Mitigation |
|---|---|---|
| VAL16-01 FAIL: baseline not | Prior test left stale state in | Check |
| VAL16-02 FAIL: risk stays | HA server Campaign ran and overwrote injected epoch before poll detected it | Increase |
| VAL16-05 FAIL: risk stays | | Check |
| VAL16-06 FAIL: probe row missing or mutated | Recovery accidentally truncated/dropped user tables | Check |
| VAL16-07 FAIL: risk stays | Ghost epoch rows have | Verify ghost insert used distinct names ( |
| VAL16-09 FAIL: no slice-scoped audit events | | Verify |
| Docker not available | CI environment without Docker daemon | Function prints SKIP and exits 0; VAL16 not counted in pass total |
Final Report Template¶
# VAL 16 — Split-Brain Chaos Validation
Generated: <timestamp>
Node: cp-val16-node:19002
## Scenario Results
VAL16-01 baseline_risk_none: PASS
VAL16-02 epoch_inject_detected: PASS
VAL16-03 detection_idempotent: PASS
VAL16-04 dry_run_reconcile_ok: PASS
VAL16-05 promote_leader_recovers: PASS
VAL16-06 data_integrity_after_recovery: PASS
VAL16-07 ghost_node_possible: PASS
VAL16-08 ghost_node_self_clears: PASS
VAL16-09 audit_events_captured: PASS
VAL16-10 post_chaos_stability: PASS
## Summary
pass=10 fail=0 total=10
Split-Brain Recovery Readiness Assessment:
PASS requires VAL16-05 (promote-leader recovery) + VAL16-06 (data integrity) + VAL16-02 (detection works)
Repeat VAL16 after any change to split_brain.go, promote_leader.go, or the leadership_state/leader_epochs schema
VAL16-04 (manual-reconcile dry-run) confirms the planning path is safe for operator use before committing a recovery action
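For reference, the scenario-results and summary lines of the report template can be produced from a mapping of check results. The sketch below is illustrative only: `render_report` is a hypothetical helper, not the runner's implementation, and it treats a missing result as FAIL.

```python
from typing import Mapping

# Check names in VAL16-01..VAL16-10 order, taken from the check matrix.
CHECK_NAMES = [
    "baseline_risk_none",
    "epoch_inject_detected",
    "detection_idempotent",
    "dry_run_reconcile_ok",
    "promote_leader_recovers",
    "data_integrity_after_recovery",
    "ghost_node_possible",
    "ghost_node_self_clears",
    "audit_events_captured",
    "post_chaos_stability",
]


def render_report(results: Mapping[str, bool], generated: str,
                  node: str = "cp-val16-node:19002") -> str:
    """Render the template text; a missing result counts as FAIL (assumption)."""
    lines = [
        "# VAL 16 \u2014 Split-Brain Chaos Validation",
        f"Generated: {generated}",
        f"Node: {node}",
        "## Scenario Results",
    ]
    passed = 0
    for number, name in enumerate(CHECK_NAMES, start=1):
        ok = bool(results.get(name, False))
        passed += ok
        lines.append(f"VAL16-{number:02d} {name}: {'PASS' if ok else 'FAIL'}")
    total = len(CHECK_NAMES)
    lines += ["## Summary", f"pass={passed} fail={total - passed} total={total}"]
    return "\n".join(lines)


results = {name: True for name in CHECK_NAMES}
results["audit_events_captured"] = False
report = render_report(results, generated="<timestamp>")
print(report.splitlines()[-1])  # pass=9 fail=1 total=10
```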