VAL 16 — Split-Brain Chaos Validation

Status: Implemented
Runner: run_split_brain_chaos_val16_lab() in scripts/labs/run_cli_audit_lab.sh
Evidence dir: $EVIDENCE_DIR/val16/
Port: cp-val16-node → 19002


Purpose

Validates the split-brain detection and recovery subsystem under injected fault conditions:

  • Confirms risk=none at baseline (healthy single-node cluster)

  • Verifies risk=detected is raised when epoch divergence is injected directly into leadership_state

  • Verifies risk=possible is raised when unclosed ghost-node epoch rows exist in leader_epochs

  • Confirms detection is idempotent (repeated API calls return the same risk level without side effects)

  • Proves manual-reconcile strategy exits successfully without writing to the database (dry-run)

  • Proves promote-leader strategy resolves risk=detected and restores risk=none

  • Verifies user data rows are untouched by leadership metadata recovery

  • Confirms ghost-node risk=possible self-clears after resigned_at is stamped on injected rows

  • Captures ha.split_brain.detected and ha.split_brain.recovered audit events in the shared store

  • Verifies cluster stability after all chaos scenarios complete
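The risk levels above are read from the /v1/ha/split-brain endpoint. A minimal extraction sketch follows; the JSON field name "risk" is an assumption inferred from the risk=none/detected/possible values used throughout this document, not confirmed from the API schema.

```shell
# Extract the risk level from a /v1/ha/split-brain JSON body.
# NOTE: the "risk" field name is assumed, not taken from the API schema.
risk_of() {
  python3 -c 'import json, sys; print(json.load(sys.stdin).get("risk", "unknown"))' <<<"$1"
}
```

A caller would combine it with curl, e.g. `risk_of "$(curl -s http://localhost:19002/v1/ha/split-brain)"`.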


Branch-Specific Rule Application

| Question | Answer |
| --- | --- |
| Is this covered by an existing LAB? | Partially. run_split_brain_lab() (line 1525 of run_cli_audit_lab.sh) exercises ha split-brain detect and ha split-brain recover as part of a larger shared HA bring-up/tear-down flow. It does NOT cover: the structured 10-check pass/fail matrix, the data integrity proof, the idempotency assertion, the ghost-node (risk=possible) path, ghost-node self-clearing, or a standalone evidence bundle. |
| Which LAB/evidence bundle is extended? | run_cli_audit_lab.sh — new function run_split_brain_chaos_val16_lab() appended as slice 26. Reuses the start_ha_server(), wait_for_http(), wait_for_log(), wait_for_pg_container(), and wait_for_split_brain_risk() helpers defined in the same file. |
| New evidence files | 26 files in $EVIDENCE_DIR/val16/ — see the Evidence Files table below. |
| Tutorial/runbook docs updated | docs/tutorials/cli-audit-lab.md §4 (slice 26), §5 (val16/ files), §6 (expected results), §8 (scope). |
| Reason a new runner function is required | run_split_brain_lab() is embedded in shared infrastructure (pr17ha-primary, pr17-ha-net) and lacks structured assertions. Injecting isolated Docker infrastructure, data integrity probes, ghost-node paths, and a 10-check report would break the existing flow. A narrowly scoped run_split_brain_chaos_val16_lab() with isolated containers is cleaner. |


Scenarios Under Test

| Scenario | Injection Mechanism | Expected Outcome |
| --- | --- | --- |
| Baseline (no injection) | None | risk=none |
| Epoch divergence | UPDATE leadership_state SET current_epoch = current_epoch + 99, holder_id = 'val16-injected-node' | risk=detected |
| Idempotent re-detection | Second API call after injection | Same risk=detected |
| Dry-run recovery | ha split-brain recover --strategy manual-reconcile | Exit 0; risk remains detected (DB not modified) |
| Execute recovery | ha split-brain recover --strategy promote-leader | Exit 0; risk=none restored |
| Ghost-node unclosed epochs | INSERT INTO leader_epochs with two rows, no resigned_at | risk=possible |
| Ghost-node self-clearing | UPDATE leader_epochs SET resigned_at = NOW() for injected rows | risk=none |

Out-of-scope

  • Real network partitions via iptables (requires root/privileged containers)

  • Two-node genuine split-brain with separate PostgreSQL connections (would require network isolation not available in standard Docker without iptables)

  • Automatic rollback trigger under split-brain (deferred, no implementation)

  • Multi-region / WAN split-brain scenarios

  • Write-path corruption during split-brain window (advisory lock release governs this; not directly testable without concurrent writes and partition)


Safety and Isolation Approach

| Guardrail | Detail |
| --- | --- |
| Fresh Docker containers | val16-pg-primary, val16-pg-vol, val16-ha-net — names distinct from all other lab functions to avoid cross-lab interference |
| Local _v16_cleanup | Kills the HA server PID and removes the container/volume/network before returning — called before any return 0/1 |
| docker_cleanup() extended | Top-level cleanup trap also removes val16-pg-* resources on EXIT |
| Docker availability guard | command -v docker && docker info check at function entry; prints SKIP and returns 0 if unavailable |
| Single PG node | --min-sync-replicas 0 — no standby required; avoids replica setup complexity |
| --campaign 500ms | Stable leadership before injection; fast enough for Campaign to notice cleared epochs |
| --keepalive 500ms | Advisory-lock keepalive; keeps the lock held during injection so the HA server doesn't Campaign and overwrite the injected state |
| SQL injection scope | Only the leadership_state and leader_epochs metadata tables are touched — the user data table val16_probe is never modified by injection or recovery |
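The availability guard described above can be sketched as follows. The function name and SKIP message wording are illustrative, and the binary name is parameterized only so the sketch can be exercised without a Docker daemon.

```shell
# Skip the slice cleanly when no usable Docker daemon is present.
# $1 optionally overrides the binary name; defaults to docker.
v16_docker_guard() {
  local docker_bin="${1:-docker}"
  if ! command -v "$docker_bin" >/dev/null 2>&1 || ! "$docker_bin" info >/dev/null 2>&1; then
    echo "SKIP: $docker_bin unavailable"
    return 1
  fi
}
```

The lab function would call `v16_docker_guard || return 0` at entry so an environment without Docker skips the slice rather than failing it.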

Why SQL injection rather than real partition

The split-brain detection logic in split_brain.go evaluates two conditions:

  1. Epoch divergence (risk=detected): leadership_state.current_epoch does not match the in-process elector's epoch AND holder_id differs from the local node ID. Directly updatable via psql.

  2. Unclosed epoch count (risk=possible): more than one leader_epochs row with resigned_at IS NULL AND holder_id != current_holder. Directly insertable via psql.

Both conditions are reliably triggered and cleared via SQL without requiring a second HA server, network partition, or iptables rules. This approach is deterministic, fast, and CI-safe.
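The two conditions can be mirrored in a small classification sketch. This is illustrative shell, not the Go logic in split_brain.go, and the precedence of detected over possible is an assumption.

```shell
# Mirror of the two detection conditions (illustrative, not the Go code).
# $1 = "true" if leadership_state diverges from the in-process elector
# $2 = count of unclosed foreign leader_epochs rows
v16_classify_risk() {
  local epoch_diverged="$1" unclosed_count="$2"
  if [ "$epoch_diverged" = true ]; then
    echo detected                       # condition 1: epoch divergence
  elif [ "$unclosed_count" -gt 1 ]; then
    echo possible                       # condition 2: unclosed epoch count > 1
  else
    echo none
  fi
}
```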


Injection Details

Epoch divergence (→ risk=detected)

UPDATE leadership_state
SET current_epoch = current_epoch + 99,
    holder_id = 'val16-injected-node'
WHERE id = 1;

This simultaneously advances the epoch counter far beyond the in-process value and changes the holder ID to a foreign node. The HA server’s next split-brain check (triggered by the API poll) compares its in-process epoch and holder to the DB row and raises risk=detected.

Ghost-node injection (→ risk=possible)

INSERT INTO leader_epochs (epoch, holder_id, acquired_at)
SELECT current_epoch + 1000, 'val16-ghost-a', NOW() FROM leadership_state WHERE id = 1;

INSERT INTO leader_epochs (epoch, holder_id, acquired_at)
SELECT current_epoch + 1001, 'val16-ghost-b', NOW() FROM leadership_state WHERE id = 1;

Two rows with resigned_at IS NULL for foreign holder IDs simulate the condition where multiple past leaders never wrote a clean resign. The HA server counts unclosed-epoch rows and raises risk=possible.

Ghost-node self-clearing

UPDATE leader_epochs
SET resigned_at = NOW()
WHERE holder_id IN ('val16-ghost-a', 'val16-ghost-b')
  AND resigned_at IS NULL;

Stamping resigned_at causes the unclosed count to drop to 0, returning risk=none without any HA server restart.
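Because the risk level only changes on the HA server's next check, the lab polls until the expected level appears. A minimal polling sketch in the spirit of wait_for_split_brain_risk() — the real helper's signature is not reproduced here, and all names are illustrative:

```shell
# Poll a probe command until it prints the expected risk level.
# $1 = expected risk, $2 = command printing the current risk, $3 = max attempts
v16_wait_for_risk() {
  local expected="$1" probe="$2" attempts="${3:-20}" i risk
  for ((i = 1; i <= attempts; i++)); do
    risk="$(eval "$probe")"
    [ "$risk" = "$expected" ] && return 0
    sleep 0.2
  done
  echo "timeout waiting for risk=$expected (last: $risk)" >&2
  return 1
}
```

A caller would pass a probe such as a curl-plus-JSON-extraction one-liner against /v1/ha/split-brain.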


Data Integrity Probe

A val16_probe table row (id=1, note='pre-inject') is inserted while the HA server holds leadership:

CREATE TABLE IF NOT EXISTS val16_probe (id INT PRIMARY KEY, note TEXT);
INSERT INTO val16_probe (id, note) VALUES (1, 'pre-inject')
ON CONFLICT (id) DO UPDATE SET note = EXCLUDED.note;

After promote-leader recovery, the row is read back:

SELECT note FROM val16_probe WHERE id = 1;

Expected: note = 'pre-inject' (unchanged). This proves that split-brain recovery operations — which only write to leadership_state and leader_epochs — do not affect user data tables.
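The read-back assertion can be sketched as below. The function name is illustrative; the note value would come from whatever psql wrapper the lab uses (e.g. docker exec against val16-pg-primary).

```shell
# Assert the probe row survived recovery untouched.
# $1 = the note value read back, e.g. from:
#   SELECT note FROM val16_probe WHERE id = 1;
v16_assert_probe_unchanged() {
  if [ "$1" = "pre-inject" ]; then
    echo "VAL16-06 data_integrity_after_recovery: PASS"
  else
    echo "VAL16-06 data_integrity_after_recovery: FAIL (note='$1')"
    return 1
  fi
}
```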


VAL16 10-Check Matrix

| Check | Name | Threshold | Scenario |
| --- | --- | --- | --- |
| VAL16-01 | baseline_risk_none | risk = none before any injection | Baseline |
| VAL16-02 | epoch_inject_detected | risk = detected after epoch-divergence SQL injection | Epoch injection |
| VAL16-03 | detection_idempotent | Repeated API call returns the same risk = detected | Idempotency |
| VAL16-04 | dry_run_reconcile_ok | ha split-brain recover --strategy manual-reconcile exits 0 and a follow-up check shows risk still detected | Dry-run |
| VAL16-05 | promote_leader_recovers | ha split-brain recover --strategy promote-leader exits 0; risk = none after | Execute recovery |
| VAL16-06 | data_integrity_after_recovery | val16_probe.note = 'pre-inject' after recovery | User data |
| VAL16-07 | ghost_node_possible | risk = possible after ghost-node epoch row insertion | Ghost-node injection |
| VAL16-08 | ghost_node_self_clears | risk = none after resigned_at is stamped on ghost rows | Ghost-node clearing |
| VAL16-09 | audit_events_captured | ≥ 1 ha.split_brain.detected event and ≥ 1 ha.split_brain.recovered event (actor val16-operator), both after slice start_time | Audit |
| VAL16-10 | post_chaos_stability | /v1/ha/status confirms holder_id contains cp-val16-node | Final stability |


Pass/Fail Criteria

| Outcome | Condition |
| --- | --- |
| PASS | All 10 checks pass |
| PARTIAL | Checks 2, 5, and 6 pass (epoch detection, promote-leader recovery, data integrity) but at least one other check fails |
| FAIL | Check 5 fails (promote-leader did not clear detected risk) OR check 6 fails (user data corrupted) |

Two checks are mandatory: VAL16-05 (promote-leader recovery), because detection without a working recovery path is a critical defect, and VAL16-06 (data integrity), which proves recovery is scoped to leadership metadata only.
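The verdict logic above can be sketched as a pure function. Names are illustrative, and the case where VAL16-02 fails while 5 and 6 pass is not defined by the table, so the sketch treats it conservatively as FAIL.

```shell
# Map check results to an overall verdict per the pass/fail criteria.
# $1 = number of passing checks (0-10); $2/$3/$4 = PASS|FAIL for
# VAL16-02, VAL16-05, VAL16-06.
v16_verdict() {
  local pass_count="$1" c02="$2" c05="$3" c06="$4"
  if [ "$c05" != PASS ] || [ "$c06" != PASS ]; then
    echo FAIL      # mandatory checks: recovery and data integrity
  elif [ "$pass_count" -eq 10 ]; then
    echo PASS
  elif [ "$c02" = PASS ]; then
    echo PARTIAL   # detection, recovery, and integrity hold; others failed
  else
    echo FAIL      # conservative default for the undefined case
  fi
}
```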


Evidence Files

| File | Description |
| --- | --- |
| val16-ha-server.log | HA server log (single normal session throughout) |
| val16-probe-setup.txt | Docker psql output: CREATE TABLE + INSERT for the probe row |
| val16-01-baseline.json | /v1/ha/split-brain JSON at baseline (risk=none) |
| val16-01-baseline.txt | ha split-brain detect CLI output at baseline |
| val16-02-epoch-inject.txt | SQL UPDATE output (epoch divergence injection) |
| val16-02-detected.json | /v1/ha/split-brain JSON after injection (risk=detected) |
| val16-02-detected.txt | ha split-brain detect CLI output after injection |
| val16-03-detect-repeat.json | Second /v1/ha/split-brain call (idempotency check) |
| val16-04-recover-dry-run.txt | ha split-brain recover --strategy manual-reconcile stdout+stderr |
| val16-04-risk-after-dry-run.json | /v1/ha/split-brain JSON after dry-run (risk still detected) |
| val16-04-risk-check.txt | Python assertion result: risk_after_dry_run=detected |
| val16-05-recover-execute.txt | ha split-brain recover --strategy promote-leader stdout+stderr |
| val16-05-recovered.json | /v1/ha/split-brain JSON after promote-leader (risk=none) |
| val16-05-recovered.txt | ha split-brain detect CLI output after recovery |
| val16-06-probe-after-recovery.txt | SQL SELECT: note from val16_probe WHERE id=1 |
| val16-07-ghost-inject.txt | SQL INSERT output (ghost-node epoch rows) |
| val16-07-possible.json | /v1/ha/split-brain JSON after ghost injection (risk=possible) |
| val16-07-possible.txt | ha split-brain detect CLI output for risk=possible |
| val16-08-ghost-clear.txt | SQL UPDATE output (stamp resigned_at on ghost rows) |
| val16-08-cleared.json | /v1/ha/split-brain JSON after clearing (risk=none) |
| val16-09-audit-detected.json | audit query --event-type ha.split_brain.detected --start-time <slice_start> JSON result |
| val16-09-audit-recovered.json | audit query --event-type ha.split_brain.recovered --actor val16-operator --start-time <slice_start> JSON result |
| val16-09-audit-check.txt | Python assertion result: slice-scoped event counts for both types |
| val16-10-final-status.json | /v1/ha/status JSON confirming cp-val16-node holds leadership |
| val16-10-stability-check.txt | Python assertion result: holder_id check |
| val16-report.txt | Human-readable 10-check PASS/FAIL report |
| val16-report.json | Machine-readable JSON report with pass_count and scenario results |
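The machine-readable report's schema is not specified beyond pass_count; a minimal sketch of emitting the summary fields, where fail_count and total are assumed field names:

```shell
# Emit a minimal JSON summary in the shape val16-report.json might use.
# Field names other than pass_count are assumptions.
v16_write_json_report() {
  local pass="$1" fail="$2"
  printf '{"pass_count": %d, "fail_count": %d, "total": %d}\n' \
    "$pass" "$fail" "$((pass + fail))"
}
```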


Known Failure Modes

| Failure | Likely Cause | Mitigation |
| --- | --- | --- |
| VAL16-01 FAIL: baseline not none | Prior test left stale state in leadership_state; isolation failure | Check that _v16_cleanup ran before infrastructure setup; verify val16-ha-net is freshly created |
| VAL16-02 FAIL: risk stays none after injection | HA server Campaign ran and overwrote the injected epoch before the poll detected it | Increase the --campaign interval; verify the advisory lock is still held (check val16-ha-server.log for acquired leadership and no subsequent resigned) |
| VAL16-05 FAIL: risk stays detected after promote-leader | promote-leader wrote the wrong epoch or wrong holder to the DB | Check val16-05-recover-execute.txt for an error; verify the CLI received the correct --orchestrator-url |
| VAL16-06 FAIL: probe row missing or mutated | Recovery accidentally truncated/dropped user tables | Check val16-05-recover-execute.txt for SQL output; this would be a critical defect in promote_leader.go |
| VAL16-07 FAIL: risk stays none after ghost injection | Ghost epoch rows have a holder_id that matches the live HA server's holder_id | Verify the ghost insert used distinct names (val16-ghost-a/b); check val16-07-ghost-inject.txt |
| VAL16-09 FAIL: no slice-scoped audit events | AUTONOMY_AUDIT_DIR not set, HA server started before the env var export, or the recovery actor/start-time filters do not match this run | Verify AUTONOMY_AUDIT_DIR, val16_start_time, and val16-operator; check val16-ha-server.log for the audit emitter startup message |
| Docker not available | CI environment without a Docker daemon | Function prints SKIP and exits 0; VAL16 not counted in the pass total |


Final Report Template

# VAL 16 — Split-Brain Chaos Validation

Generated:     <timestamp>
Node:          cp-val16-node:19002

## Scenario Results
VAL16-01 baseline_risk_none:        PASS
VAL16-02 epoch_inject_detected:     PASS
VAL16-03 detection_idempotent:      PASS
VAL16-04 dry_run_reconcile_ok:      PASS
VAL16-05 promote_leader_recovers:   PASS
VAL16-06 data_integrity_after_recovery: PASS
VAL16-07 ghost_node_possible:       PASS
VAL16-08 ghost_node_self_clears:    PASS
VAL16-09 audit_events_captured:     PASS
VAL16-10 post_chaos_stability:      PASS

## Summary
pass=10  fail=0  total=10

Split-Brain Recovery Readiness Assessment:

  • PASS requires VAL16-05 (promote-leader recovery) + VAL16-06 (data integrity) + VAL16-02 (detection works)

  • Repeat VAL16 after any change to split_brain.go, promote_leader.go, or the leadership_state/leader_epochs schema

  • VAL16-04 (manual-reconcile dry-run) confirms the planning path is safe for operator use before committing a recovery action