Split-Brain Detection and Recovery¶
Audience: operators managing an AutonomyOps HA control-plane cluster.
What is a split-brain condition?¶
A split-brain occurs when two or more control-plane nodes both believe they hold write authority simultaneously. The AutonomyOps epoch-fence mechanism prevents this from causing actual data corruption (stale-epoch writes are rejected), but the cluster must be brought back to a single authoritative leader before normal operation can resume.
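The epoch fence can be pictured as a compare-on-write guard: every write carries the epoch its writer believes is current, and anything stamped with an older epoch is refused. A minimal sketch of that idea (the `EpochFence` class and its methods are illustrative, not the actual AutonomyOps implementation):

```python
from dataclasses import dataclass

@dataclass
class EpochFence:
    """Illustrative epoch fence: writes tagged with a stale epoch are rejected."""
    current_epoch: int = 0

    def open_epoch(self) -> int:
        # A newly elected leader bumps the epoch before its first write.
        self.current_epoch += 1
        return self.current_epoch

    def write(self, writer_epoch: int, payload: str) -> bool:
        # Reject any write stamped with an epoch older than the current one.
        if writer_epoch < self.current_epoch:
            return False  # stale-epoch write rejected
        return True

fence = EpochFence()
old = fence.open_epoch()   # epoch 1: first leader
new = fence.open_epoch()   # epoch 2: a second leader takes over

print(fence.write(new, "row-a"))  # True: current leader's write accepted
print(fence.write(old, "row-b"))  # False: deposed leader's write rejected
```

This is why split-brain is a liveness problem rather than a corruption problem: the deposed leader's writes bounce, but the cluster still needs an operator to restore a single authoritative leader.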
The system detects split-brain using three criteria:

- **Epoch divergence** — two nodes disagree about the current epoch or its leader, as reported via the leader health endpoint.
- **Holder mismatch** — the `leadership_state` row's `holder_id` does not match the node that currently holds the advisory lock.
- **Unclosed epochs** — more than one epoch row in `leader_epochs` has `resigned_at = NULL`, indicating multiple leaders started epochs without clean closeout.
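The three checks compose with a simple OR: any one firing is enough to flag a split-brain. A sketch of that logic (the `ClusterView` shape is hypothetical; field names follow the tables and endpoints mentioned on this page):

```python
from dataclasses import dataclass

@dataclass
class ClusterView:
    reports: dict          # node_id -> (epoch, claimed leader) from each node's leader health endpoint
    lock_holder: str       # node currently holding the advisory lock
    state_holder: str      # holder_id in the leadership_state row
    unclosed_epochs: int   # leader_epochs rows with resigned_at = NULL

def detect_split_brain(view: ClusterView) -> dict:
    # Criterion 1: nodes disagree on the current epoch or its leader.
    epoch_divergence = len(set(view.reports.values())) > 1
    # Criterion 2: advisory-lock holder differs from the recorded holder_id.
    holder_mismatch = view.lock_holder != view.state_holder
    # Criterion 3: more than one epoch was never cleanly closed out.
    unclosed = view.unclosed_epochs > 1
    return {
        "split_brain_detected": epoch_divergence or holder_mismatch or unclosed,
        "epoch_divergence": epoch_divergence,
        "holder_mismatch": holder_mismatch,
        "unclosed_epochs": view.unclosed_epochs,
    }

# Mirrors the worked example below: both nodes claim epoch 8, the lock sits
# on node-2 while leadership_state names node-1, and two epochs are unclosed.
view = ClusterView(
    reports={"node-1": (8, "node-1"), "node-2": (8, "node-2")},
    lock_holder="node-2",
    state_holder="node-1",
    unclosed_epochs=2,
)
print(detect_split_brain(view)["split_brain_detected"])  # True
```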
1. Detect a split-brain condition¶
autonomy ha split-brain detect \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"
Example — split-brain detected:
Split-Brain Status
Split-Brain Detected: true
Detected At: 2026-03-18T11:00:00Z
Criteria
Epoch Divergence: true (node-1 and node-2 both report epoch 8, each claiming leadership of it)
Holder Mismatch: true (lock held by node-2; leadership_state.holder_id = node-1)
Unclosed Epochs: true (2 epochs have resigned_at = NULL)
Explanation:
node-1 started epoch 8 but crashed before writing resigned_at.
node-2 campaigned and also acquired epoch 8 (epoch counter not yet incremented).
With --output json:
autonomy ha split-brain detect \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--output json | jq .
{
"split_brain_detected": true,
"detected_at": "2026-03-18T11:00:00Z",
"epoch_divergence": true,
"holder_mismatch": true,
"unclosed_epochs": 2
}
If `split_brain_detected` is `false`, the cluster is healthy and no further action is needed.
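The JSON form is the easier one to branch on in scripts. A minimal sketch that parses the detect output and returns a shell-style exit code (field names are taken from the example above; feeding it live CLI output is left to the caller):

```python
import json
import sys

def check(raw: str) -> int:
    """Return 0 when the cluster is healthy, 1 when a split-brain is detected."""
    status = json.loads(raw)
    if status.get("split_brain_detected"):
        # Surface the detection timestamp for the operator before failing.
        print(f"split-brain detected at {status.get('detected_at')}", file=sys.stderr)
        return 1
    return 0

# In practice, raw would be the stdout of
# `autonomy ha split-brain detect ... --output json`.
print(check('{"split_brain_detected": false}'))  # 0
```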
2. Choose a recovery strategy¶
Two strategies are available:
| Strategy | When to use |
|---|---|
| `promote-leader` | One node is clearly the correct leader (most recent epoch, replication is healthy from it). Force-resigns the stale node and confirms the correct node as leader. |
| `manual-reconcile` | The situation is ambiguous — both nodes have made writes since divergence, or you need to inspect the epoch history manually before choosing. Resets the elector state without promoting either node. |
Decision guidance:
Does one node clearly have the higher epoch number and a healthy replication stream?
│
├─ YES → Use promote-leader, targeting the node with the correct epoch.
│
└─ NO (ambiguous, both nodes at same epoch, both have writes)
→ Use manual-reconcile. Then inspect leader_epochs and promotion_decisions
before manually triggering a failover from the chosen node.
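The decision tree above can be sketched as a function (the node-summary fields here are hypothetical inputs an operator would gather, not a real API):

```python
def choose_strategy(node_a: dict, node_b: dict):
    """Encode the decision tree: promote the node with the clearly higher
    epoch and healthy replication; otherwise fall back to manual-reconcile."""
    a_clear = node_a["epoch"] > node_b["epoch"] and node_a["replication_healthy"]
    b_clear = node_b["epoch"] > node_a["epoch"] and node_b["replication_healthy"]
    if a_clear:
        return ("promote-leader", node_a["id"])
    if b_clear:
        return ("promote-leader", node_b["id"])
    # Same epoch, or no clearly healthy higher-epoch node: reconcile first,
    # inspect leader_epochs and promotion_decisions, then fail over manually.
    return ("manual-reconcile", None)

print(choose_strategy(
    {"id": "node-1", "epoch": 8, "replication_healthy": False},
    {"id": "node-2", "epoch": 9, "replication_healthy": True},
))  # ('promote-leader', 'node-2')
```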
3. Apply recovery: promote-leader¶
autonomy ha split-brain recover \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--operator "$AUTONOMY_OPERATOR" \
--strategy promote-leader \
--reason "node-2 has higher epoch and healthy replication; node-1 stale"
For a confirmed split-brain where the stale node may resist the resign, add --force:
autonomy ha split-brain recover \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--operator "$AUTONOMY_OPERATOR" \
--strategy promote-leader \
--reason "node-1 crash-recovered; node-2 is the correct leader" \
--force
Expected response:
Split-Brain Recovery
Strategy: promote-leader
Recovered At: 2026-03-18T11:10:00Z
Message: stale leader resigned; node-2 confirmed as leader for epoch 9
Safety note on `--force`: the force-resign uses a conditional `WHERE holder_id = $1`
clause, so it is safe to call from a stale leader without overwriting the new leader's
`holder_id` record. Even so, always confirm the new leader's state after using `--force`.
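The guarded resign described in the safety note is a compare-and-set pattern. A runnable sketch using SQLite in place of the real store (the table shape is assumed from the fields this page mentions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leadership_state (id INTEGER PRIMARY KEY, holder_id TEXT)")
conn.execute("INSERT INTO leadership_state (id, holder_id) VALUES (1, 'node-2')")

def force_resign(conn, stale_holder: str) -> int:
    # The WHERE holder_id = ? guard makes the resign a no-op unless the row
    # still names the stale holder, so a new leader's record is never clobbered.
    cur = conn.execute(
        "UPDATE leadership_state SET holder_id = NULL WHERE holder_id = ?",
        (stale_holder,),
    )
    return cur.rowcount  # 1 if the stale holder was resigned, 0 if already replaced

print(force_resign(conn, "node-1"))  # 0: node-2 holds the row, nothing changes
print(force_resign(conn, "node-2"))  # 1: the named holder is resigned
```

The returned row count is why checking state afterward still matters: a `0` can mean either "already replaced" or "holder name mistyped", and only a follow-up status query distinguishes them.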
4. Apply recovery: manual-reconcile¶
autonomy ha split-brain recover \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--operator "$AUTONOMY_OPERATOR" \
--strategy manual-reconcile \
--reason "investigating both nodes before choosing leader"
Expected response:
Split-Brain Recovery
Strategy: manual-reconcile
Recovered At: 2026-03-18T11:05:00Z
Message: elector state reset; both nodes in standby mode; trigger failover manually
After `manual-reconcile`, neither node holds write authority. You must:

1. Inspect the epoch history to determine which node's writes are authoritative:

   SELECT epoch, holder_id, acquired_at, resigned_at
   FROM leader_epochs
   ORDER BY epoch DESC
   LIMIT 10;

2. Inspect promotion decisions for any decisions made during the split period:

   SELECT pd.*, le.holder_id AS decided_by_leader
   FROM promotion_decisions pd
   JOIN leader_epochs le ON pd.epoch = le.epoch
   WHERE pd.decided_at > '<split-start-time>'
   ORDER BY pd.decided_at;

3. Manually trigger a failover from the correct node:

   autonomy ha failover trigger \
     --orchestrator-url "http://cp-node-2:8888" \
     --operator "$AUTONOMY_OPERATOR" \
     --reason "post-reconcile: node-2 elected as correct leader"
5. Post-recovery verification¶
# Run detect again to confirm split-brain is resolved
autonomy ha split-brain detect \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"
# Expect: split_brain_detected: false
# Confirm single write-ready leader
autonomy ha status \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"
# Expect: Role: leader, Write-Ready: true, Session lock held: true
Check the audit log for the recovery event:
autonomy audit query \
--audit-dir "$AUTONOMY_AUDIT_DIR" \
--category ha \
--limit 10
# Expect: ha.split_brain.recovered event
Known gaps¶
- **No automated split-brain detection alert:** the detection endpoint is synchronous and must be polled. A background monitor that emits `ha.split_brain.detected` on a timer is a follow-on item.
- **No automatic recovery:** all recovery strategies require operator invocation. Automatic promotion of the node with the highest epoch is not implemented.
- **Epoch divergence between two running nodes requires querying both:** the detection endpoint queries only the single node it's connected to. To confirm divergence, the operator must also query the other node's `/v1/health/leader` endpoint directly.
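Until a cross-node check ships, confirming divergence means fetching both nodes' leader health and comparing. A sketch of that comparison (the response fields `epoch` and `holder_id` are assumptions about the health payload, not a documented schema):

```python
import json
from urllib.request import urlopen

def fetch_leader_health(base_url: str) -> dict:
    # Query one node's /v1/health/leader endpoint directly.
    with urlopen(f"{base_url}/v1/health/leader", timeout=5) as resp:
        return json.load(resp)

def epochs_diverge(report_a: dict, report_b: dict) -> bool:
    # Divergence: the two nodes disagree on the current epoch or its holder.
    return (report_a["epoch"], report_a["holder_id"]) != \
           (report_b["epoch"], report_b["holder_id"])

# With canned reports (in practice, fetch_leader_health from both nodes):
a = {"epoch": 8, "holder_id": "node-1"}
b = {"epoch": 8, "holder_id": "node-2"}
print(epochs_diverge(a, b))  # True
```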