VAL 16 — Split-Brain Chaos Validation

Status: Implemented
Runner: run_split_brain_chaos_val16_lab() in scripts/labs/run_cli_audit_lab.sh
Evidence dir: $EVIDENCE_DIR/val16/
Port: cp-val16-node → 19002


Purpose

Validates the split-brain detection and recovery subsystem under injected fault conditions:

  • Confirms risk=none at baseline (healthy single-node cluster)

  • Verifies risk=detected is raised when epoch divergence is injected directly into leadership_state

  • Verifies risk=possible is raised when unclosed ghost-node epoch rows exist in leader_epochs

  • Confirms detection is idempotent (repeated API calls return the same risk level without side effects)

  • Proves manual-reconcile strategy exits successfully without writing to the database (dry-run)

  • Proves promote-leader strategy resolves risk=detected and restores risk=none

  • Verifies user data rows are untouched by leadership metadata recovery

  • Confirms ghost-node risk=possible self-clears after resigned_at is stamped on injected rows

  • Captures ha.split_brain.detected and ha.split_brain.recovered audit events in the shared store

  • Verifies cluster stability after all chaos scenarios complete
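The risk levels above are read from the /v1/ha/split-brain endpoint. A minimal extraction sketch follows; the JSON field name "risk" is an assumption inferred from the risk=none/detected/possible values used throughout this document, not confirmed from the API schema.

```shell
# Extract the risk level from a /v1/ha/split-brain JSON body.
# NOTE: the "risk" field name is assumed, not taken from the API schema.
risk_of() {
  python3 -c 'import json, sys; print(json.load(sys.stdin).get("risk", "unknown"))' <<<"$1"
}
```

A caller would combine it with curl, e.g. `risk_of "$(curl -s http://localhost:19002/v1/ha/split-brain)"`.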


Branch-Specific Rule Application

| Question | Answer |
| --- | --- |
| Is this covered by an existing LAB? | Partially. run_split_brain_lab() (line 1525 of run_cli_audit_lab.sh) exercises ha split-brain detect and ha split-brain recover as part of a larger shared HA bring-up/tear-down flow. It does NOT cover: the structured 10-check pass/fail matrix, the data integrity proof, the idempotency assertion, the ghost-node (risk=possible) path, ghost-node self-clearing, or a standalone evidence bundle. |
| Which LAB/evidence bundle is extended? | run_cli_audit_lab.sh — new function run_split_brain_chaos_val16_lab() appended as slice 26. Reuses the start_ha_server(), wait_for_http(), wait_for_log(), wait_for_pg_container(), and wait_for_split_brain_risk() helpers defined in the same file. |
| New evidence files | 26 files in $EVIDENCE_DIR/val16/ — see the Evidence Files table below. |
| Tutorial/runbook docs updated | docs/tutorials/cli-audit-lab.md §4 (slice 26), §5 (val16/ files), §6 (expected results), §8 (scope). |
| Reason a new runner function is required | run_split_brain_lab() is embedded in shared infrastructure (pr17ha-primary, pr17-ha-net) and lacks structured assertions. Injecting isolated Docker infrastructure, data integrity probes, ghost-node paths, and a 10-check report would break the existing flow. A narrowly scoped run_split_brain_chaos_val16_lab() with isolated containers is cleaner. |


Scenarios Under Test

| Scenario | Injection Mechanism | Expected Outcome |
| --- | --- | --- |
| Baseline (no injection) | None | risk=none |
| Epoch divergence | UPDATE leadership_state SET current_epoch = current_epoch + 99, holder_id = 'val16-injected-node' | risk=detected |
| Idempotent re-detection | Second API call after injection | Same risk=detected |
| Dry-run recovery | ha split-brain recover --strategy manual-reconcile | Exit 0; risk remains detected (DB not modified) |
| Execute recovery | ha split-brain recover --strategy promote-leader | Exit 0; risk=none restored |
| Ghost-node unclosed epochs | INSERT INTO leader_epochs with two rows, no resigned_at | risk=possible |
| Ghost-node self-clearing | UPDATE leader_epochs SET resigned_at = NOW() for injected rows | risk=none |

Out-of-scope

  • Real network partitions via iptables (requires root/privileged containers)

  • Two-node genuine split-brain with separate PostgreSQL connections (would require network isolation not available in standard Docker without iptables)

  • Automatic rollback trigger under split-brain (deferred, no implementation)

  • Multi-region / WAN split-brain scenarios

  • Write-path corruption during split-brain window (advisory lock release governs this; not directly testable without concurrent writes and partition)


Safety and Isolation Approach

| Guardrail | Detail |
| --- | --- |
| Fresh Docker containers | val16-pg-primary, val16-pg-vol, val16-ha-net — names distinct from all other lab functions to avoid cross-lab interference |
| Local _v16_cleanup | Kills the HA server PID and removes the container/volume/network before returning — called before any return 0/1 |
| docker_cleanup() extended | Top-level cleanup trap also removes val16-pg-* resources on EXIT |
| Docker availability guard | command -v docker && docker info check at function entry; prints SKIP and returns 0 if unavailable |
| Single PG node | --min-sync-replicas 0 — no standby required; avoids replica setup complexity |
| --campaign 500ms | Stable leadership before injection; fast enough for Campaign to notice cleared epochs |
| --keepalive 500ms | Advisory-lock keepalive; keeps the lock held during injection so the HA server doesn't Campaign and overwrite the injected state |
| SQL injection scope | Only the leadership_state and leader_epochs metadata tables are touched — the user data table val16_probe is never modified by injection or recovery |
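The availability guard described above can be sketched as follows. The function name and SKIP message wording are illustrative, and the binary name is parameterized only so the sketch can be exercised without a Docker daemon.

```shell
# Skip the slice cleanly when no usable Docker daemon is present.
# $1 optionally overrides the binary name; defaults to docker.
v16_docker_guard() {
  local docker_bin="${1:-docker}"
  if ! command -v "$docker_bin" >/dev/null 2>&1 || ! "$docker_bin" info >/dev/null 2>&1; then
    echo "SKIP: $docker_bin unavailable"
    return 1
  fi
}
```

The lab function would call `v16_docker_guard || return 0` at entry so an environment without Docker skips the slice rather than failing it.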

Why SQL injection rather than real partition

The split-brain detection logic in split_brain.go evaluates two conditions:

  1. Epoch divergence (risk=detected): leadership_state.current_epoch does not match the in-process elector's epoch AND holder_id differs from the local node ID. Directly updatable via psql.

  2. Unclosed epoch count (risk=possible): more than one leader_epochs row with resigned_at IS NULL AND holder_id != current_holder. Directly insertable via psql.

Both conditions are reliably triggered and cleared via SQL without requiring a second HA server, network partition, or iptables rules. This approach is deterministic, fast, and CI-safe.
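The two conditions can be mirrored in a small classification sketch. This is illustrative shell, not the Go logic in split_brain.go, and the precedence of detected over possible is an assumption.

```shell
# Mirror of the two detection conditions (illustrative, not the Go code).
# $1 = "true" if leadership_state diverges from the in-process elector
# $2 = count of unclosed foreign leader_epochs rows
v16_classify_risk() {
  local epoch_diverged="$1" unclosed_count="$2"
  if [ "$epoch_diverged" = true ]; then
    echo detected                       # condition 1: epoch divergence
  elif [ "$unclosed_count" -gt 1 ]; then
    echo possible                       # condition 2: unclosed epoch count > 1
  else
    echo none
  fi
}
```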


Injection Details

Epoch divergence (→ risk=detected)

UPDATE leadership_state
SET current_epoch = current_epoch + 99,
    holder_id = 'val16-injected-node'
WHERE id = 1;

This simultaneously advances the epoch counter far beyond the in-process value and changes the holder ID to a foreign node. The HA server’s next split-brain check (triggered by the API poll) compares its in-process epoch and holder to the DB row and raises risk=detected.

Ghost-node injection (→ risk=possible)

INSERT INTO leader_epochs (epoch, holder_id, acquired_at)
SELECT current_epoch + 1000, 'val16-ghost-a', NOW() FROM leadership_state WHERE id = 1;

INSERT INTO leader_epochs (epoch, holder_id, acquired_at)
SELECT current_epoch + 1001, 'val16-ghost-b', NOW() FROM leadership_state WHERE id = 1;

Two rows with resigned_at IS NULL for foreign holder IDs simulate the condition where multiple past leaders never wrote a clean resign. The HA server counts unclosed-epoch rows and raises risk=possible.

Ghost-node self-clearing

UPDATE leader_epochs
SET resigned_at = NOW()
WHERE holder_id IN ('val16-ghost-a', 'val16-ghost-b')
  AND resigned_at IS NULL;

Stamping resigned_at causes the unclosed count to drop to 0, returning risk=none without any HA server restart.
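Because the risk level only changes on the HA server's next check, the lab polls until the expected level appears. A minimal polling sketch in the spirit of wait_for_split_brain_risk() — the real helper's signature is not reproduced here, and all names are illustrative:

```shell
# Poll a probe command until it prints the expected risk level.
# $1 = expected risk, $2 = command printing the current risk, $3 = max attempts
v16_wait_for_risk() {
  local expected="$1" probe="$2" attempts="${3:-20}" i risk
  for ((i = 1; i <= attempts; i++)); do
    risk="$(eval "$probe")"
    [ "$risk" = "$expected" ] && return 0
    sleep 0.2
  done
  echo "timeout waiting for risk=$expected (last: $risk)" >&2
  return 1
}
```

A caller would pass a probe such as a curl-plus-JSON-extraction one-liner against /v1/ha/split-brain.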


Data Integrity Probe

A val16_probe table row (id=1, note='pre-inject') is inserted while the HA server holds leadership:

CREATE TABLE IF NOT EXISTS val16_probe (id INT PRIMARY KEY, note TEXT);
INSERT INTO val16_probe (id, note) VALUES (1, 'pre-inject')
ON CONFLICT (id) DO UPDATE SET note = EXCLUDED.note;

After promote-leader recovery, the row is read back:

SELECT note FROM val16_probe WHERE id = 1;

Expected: note = 'pre-inject' (unchanged). This proves that split-brain recovery operations — which only write to leadership_state and leader_epochs — do not affect user data tables.
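The read-back assertion can be sketched as below. The function name is illustrative; the note value would come from whatever psql wrapper the lab uses (e.g. docker exec against val16-pg-primary).

```shell
# Assert the probe row survived recovery untouched.
# $1 = the note value read back, e.g. from:
#   SELECT note FROM val16_probe WHERE id = 1;
v16_assert_probe_unchanged() {
  if [ "$1" = "pre-inject" ]; then
    echo "VAL16-06 data_integrity_after_recovery: PASS"
  else
    echo "VAL16-06 data_integrity_after_recovery: FAIL (note='$1')"
    return 1
  fi
}
```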


VAL16 10-Check Matrix

| Check | Name | Threshold | Scenario |
| --- | --- | --- | --- |
| VAL16-01 | baseline_risk_none | risk = none before any injection | Baseline |
| VAL16-02 | epoch_inject_detected | risk = detected after epoch-divergence SQL injection | Epoch injection |
| VAL16-03 | detection_idempotent | Repeated API call returns the same risk = detected | Idempotency |
| VAL16-04 | dry_run_reconcile_ok | ha split-brain recover --strategy manual-reconcile exits 0 and a follow-up check shows risk still detected | Dry-run |
| VAL16-05 | promote_leader_recovers | ha split-brain recover --strategy promote-leader exits 0; risk = none after | Execute recovery |
| VAL16-06 | data_integrity_after_recovery | val16_probe.note = 'pre-inject' after recovery | User data |
| VAL16-07 | ghost_node_possible | risk = possible after ghost-node epoch row insertion | Ghost-node injection |
| VAL16-08 | ghost_node_self_clears | risk = none after resigned_at is stamped on ghost rows | Ghost-node clearing |
| VAL16-09 | audit_events_captured | ≥ 1 ha.split_brain.detected event and ≥ 1 ha.split_brain.recovered event (actor val16-operator), both after slice start_time | Audit |
| VAL16-10 | post_chaos_stability | /v1/ha/status confirms holder_id contains cp-val16-node | Final stability |


Pass/Fail Criteria

| Outcome | Condition |
| --- | --- |
| PASS | All 10 checks pass |
| PARTIAL | Checks 2, 5, and 6 pass (epoch detection, promote-leader recovery, data integrity) but at least one other check fails |
| FAIL | Check 5 fails (promote-leader did not clear detected risk) OR check 6 fails (user data corrupted) |

Two checks are mandatory: VAL16-05 (promote-leader recovery), because detection without a working recovery path is a critical defect, and VAL16-06 (data integrity), which proves recovery is scoped to leadership metadata only.
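The verdict logic above can be sketched as a pure function. Names are illustrative, and the case where VAL16-02 fails while 5 and 6 pass is not defined by the table, so the sketch treats it conservatively as FAIL.

```shell
# Map check results to an overall verdict per the pass/fail criteria.
# $1 = number of passing checks (0-10); $2/$3/$4 = PASS|FAIL for
# VAL16-02, VAL16-05, VAL16-06.
v16_verdict() {
  local pass_count="$1" c02="$2" c05="$3" c06="$4"
  if [ "$c05" != PASS ] || [ "$c06" != PASS ]; then
    echo FAIL      # mandatory checks: recovery and data integrity
  elif [ "$pass_count" -eq 10 ]; then
    echo PASS
  elif [ "$c02" = PASS ]; then
    echo PARTIAL   # detection, recovery, and integrity hold; others failed
  else
    echo FAIL      # conservative default for the undefined case
  fi
}
```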


Evidence Files

| File | Description |
| --- | --- |
| val16-ha-server.log | HA server log (single normal session throughout) |
| val16-probe-setup.txt | Docker psql output: CREATE TABLE + INSERT for the probe row |
| val16-01-baseline.json | /v1/ha/split-brain JSON at baseline (risk=none) |
| val16-01-baseline.txt | ha split-brain detect CLI output at baseline |
| val16-02-epoch-inject.txt | SQL UPDATE output (epoch divergence injection) |
| val16-02-detected.json | /v1/ha/split-brain JSON after injection (risk=detected) |
| val16-02-detected.txt | ha split-brain detect CLI output after injection |
| val16-03-detect-repeat.json | Second /v1/ha/split-brain call (idempotency check) |
| val16-04-recover-dry-run.txt | ha split-brain recover --strategy manual-reconcile stdout+stderr |
| val16-04-risk-after-dry-run.json | /v1/ha/split-brain JSON after dry-run (risk still detected) |
| val16-04-risk-check.txt | Python assertion result: risk_after_dry_run=detected |
| val16-05-recover-execute.txt | ha split-brain recover --strategy promote-leader stdout+stderr |
| val16-05-recovered.json | /v1/ha/split-brain JSON after promote-leader (risk=none) |
| val16-05-recovered.txt | ha split-brain detect CLI output after recovery |
| val16-06-probe-after-recovery.txt | SQL SELECT: note from val16_probe WHERE id=1 |
| val16-07-ghost-inject.txt | SQL INSERT output (ghost-node epoch rows) |
| val16-07-possible.json | /v1/ha/split-brain JSON after ghost injection (risk=possible) |
| val16-07-possible.txt | ha split-brain detect CLI output for risk=possible |
| val16-08-ghost-clear.txt | SQL UPDATE output (stamp resigned_at on ghost rows) |
| val16-08-cleared.json | /v1/ha/split-brain JSON after clearing (risk=none) |
| val16-09-audit-detected.json | audit query --event-type ha.split_brain.detected --start-time <slice_start> JSON result |
| val16-09-audit-recovered.json | audit query --event-type ha.split_brain.recovered --actor val16-operator --start-time <slice_start> JSON result |
| val16-09-audit-check.txt | Python assertion result: slice-scoped event counts for both types |
| val16-10-final-status.json | /v1/ha/status JSON confirming cp-val16-node holds leadership |
| val16-10-stability-check.txt | Python assertion result: holder_id check |
| val16-report.txt | Human-readable 10-check PASS/FAIL report |
| val16-report.json | Machine-readable JSON report with pass_count and scenario results |
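The machine-readable report's schema is not specified beyond pass_count; a minimal sketch of emitting the summary fields, where fail_count and total are assumed field names:

```shell
# Emit a minimal JSON summary in the shape val16-report.json might use.
# Field names other than pass_count are assumptions.
v16_write_json_report() {
  local pass="$1" fail="$2"
  printf '{"pass_count": %d, "fail_count": %d, "total": %d}\n' \
    "$pass" "$fail" "$((pass + fail))"
}
```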


Known Failure Modes

| Failure | Likely Cause | Mitigation |
| --- | --- | --- |
| VAL16-01 FAIL: baseline not none | Prior test left stale state in leadership_state; isolation failure | Check that _v16_cleanup ran before infrastructure setup; verify val16-ha-net is freshly created |
| VAL16-02 FAIL: risk stays none after injection | HA server Campaign ran and overwrote the injected epoch before the poll detected it | Increase the --campaign interval; verify the advisory lock is still held (check val16-ha-server.log for acquired leadership and no subsequent resigned) |
| VAL16-05 FAIL: risk stays detected after promote-leader | promote-leader wrote the wrong epoch or wrong holder to the DB | Check val16-05-recover-execute.txt for an error; verify the CLI received the correct --orchestrator-url |
| VAL16-06 FAIL: probe row missing or mutated | Recovery accidentally truncated/dropped user tables | Check val16-05-recover-execute.txt for SQL output; this would be a critical defect in promote_leader.go |
| VAL16-07 FAIL: risk stays none after ghost injection | Ghost epoch rows have a holder_id that matches the live HA server's holder_id | Verify the ghost insert used distinct names (val16-ghost-a/b); check val16-07-ghost-inject.txt |
| VAL16-09 FAIL: no slice-scoped audit events | AUTONOMY_AUDIT_DIR not set, HA server started before the env var export, or the recovery actor/start-time filters do not match this run | Verify AUTONOMY_AUDIT_DIR, val16_start_time, and val16-operator; check val16-ha-server.log for the audit emitter startup message |
| Docker not available | CI environment without a Docker daemon | Function prints SKIP and exits 0; VAL16 not counted in the pass total |


Final Report Template

# VAL 16 — Split-Brain Chaos Validation

Generated:     <timestamp>
Node:          cp-val16-node:19002

## Scenario Results
VAL16-01 baseline_risk_none:        PASS
VAL16-02 epoch_inject_detected:     PASS
VAL16-03 detection_idempotent:      PASS
VAL16-04 dry_run_reconcile_ok:      PASS
VAL16-05 promote_leader_recovers:   PASS
VAL16-06 data_integrity_after_recovery: PASS
VAL16-07 ghost_node_possible:       PASS
VAL16-08 ghost_node_self_clears:    PASS
VAL16-09 audit_events_captured:     PASS
VAL16-10 post_chaos_stability:      PASS

## Summary
pass=10  fail=0  total=10

Split-Brain Recovery Readiness Assessment:

  • PASS requires VAL16-05 (promote-leader recovery) + VAL16-06 (data integrity) + VAL16-02 (detection works)

  • Repeat VAL16 after any change to split_brain.go, promote_leader.go, or the leadership_state/leader_epochs schema

  • VAL16-04 (manual-reconcile dry-run) confirms the planning path is safe for operator use before committing a recovery action