Split-Brain Detection and Recovery

Audience: operators managing an AutonomyOps HA control-plane cluster.

What is a split-brain condition?

A split-brain occurs when two or more control-plane nodes simultaneously believe they hold write authority. The AutonomyOps epoch-fence mechanism prevents this from causing actual data corruption (stale-epoch writes are rejected), but the cluster must be brought back to a single authoritative leader before normal operation can resume.
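The fencing behavior can be illustrated with a small simulation. This is a sketch of the rejection rule only, not the real write path (the actual check runs inside the orchestrator's storage layer), and the epoch values are illustrative:

```shell
# Simulated epoch fence: a write is accepted only if the writer's epoch
# is at least the current epoch. Values here are illustrative.
current_epoch=9

attempt_write() {
  writer_epoch=$1
  if [ "$writer_epoch" -lt "$current_epoch" ]; then
    echo "rejected: stale epoch $writer_epoch (current epoch is $current_epoch)"
  else
    echo "accepted: write at epoch $writer_epoch"
  fi
}

attempt_write 8   # a stale leader's write is fenced off
attempt_write 9   # the current leader's write goes through
```

A leader that crashed and recovered still carries its old epoch, so its writes match the `rejected` branch until it rejoins as a standby.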

The system detects split-brain using three criteria:

  1. Epoch divergence — two nodes report different current epochs via the leader health endpoint.

  2. Holder mismatch — the leadership_state row’s holder_id does not match the node that currently holds the advisory lock.

  3. Unclosed epochs — more than one epoch row in leader_epochs has resigned_at = NULL, indicating multiple leaders started epochs without clean closeout.
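How the criteria combine into a verdict can be sketched as follows. The OR combination is an assumption (any single criterion flags a split-brain), and the flag values are illustrative:

```shell
# Sketch of the detection verdict (assumed OR semantics; illustrative values).
epoch_divergence=false
holder_mismatch=true
unclosed_epoch_count=2

split_brain_detected=false
if [ "$epoch_divergence" = true ] || [ "$holder_mismatch" = true ] \
   || [ "$unclosed_epoch_count" -gt 1 ]; then
  split_brain_detected=true
fi
echo "split_brain_detected: $split_brain_detected"
```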


1. Detect a split-brain condition

autonomy ha split-brain detect \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"

Example — split-brain detected:

Split-Brain Status
  Split-Brain Detected: true
  Detected At:          2026-03-18T11:00:00Z

Criteria
  Epoch Divergence:     false  (node-1 and node-2 both report epoch 8)
  Holder Mismatch:      true   (lock held by node-2; leadership_state.holder_id = node-1)
  Unclosed Epochs:      true   (2 epochs have resigned_at = NULL)

  Explanation:
    node-1 started epoch 8 but crashed before writing resigned_at.
    node-2 campaigned and also acquired epoch 8 (epoch counter not yet incremented).

With --output json:

autonomy ha split-brain detect \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --output json | jq .
{
  "split_brain_detected": true,
  "detected_at": "2026-03-18T11:00:00Z",
  "epoch_divergence": false,
  "holder_mismatch": true,
  "unclosed_epochs": 2
}

If split_brain_detected: false — the cluster is healthy. No further action needed.
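Until a background monitor exists (see Known gaps), the JSON form makes it easy to script a periodic check. A minimal sketch, with the detect call stubbed by a captured sample response so the gating logic is visible:

```shell
# Stubbed response; a real check would instead run:
#   resp=$(autonomy ha split-brain detect \
#     --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" --output json)
resp='{"split_brain_detected": true, "detected_at": "2026-03-18T11:00:00Z"}'

case "$resp" in
  *'"split_brain_detected": true'*) verdict="ALERT: split-brain detected" ;;
  *)                                verdict="cluster healthy" ;;
esac
echo "$verdict"
```

A production check would parse the JSON with jq rather than pattern-matching, and page on the ALERT branch.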


2. Choose a recovery strategy

Two strategies are available:

  promote-leader
    One node is clearly the correct leader (most recent epoch, replication
    is healthy from it). Force-resigns the stale node and confirms the
    correct node as leader.

  manual-reconcile
    The situation is ambiguous — both nodes have made writes since
    divergence, or you need to inspect the epoch history manually before
    choosing. Resets the elector state without promoting either node.

Decision guidance:

Does one node clearly have the higher epoch number and a healthy replication stream?
│
├─ YES → Use promote-leader, targeting the node with the correct epoch.
│
└─ NO (ambiguous, both nodes at same epoch, both have writes)
    → Use manual-reconcile. Then inspect leader_epochs and promotion_decisions
      before manually triggering a failover from the chosen node.
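The decision tree above reduces to an epoch comparison. A sketch with hard-coded epochs (in practice each value would come from the respective node's leader health endpoint, which must be queried per node — see Known gaps; replication health must still be checked separately):

```shell
# Illustrative epochs; epoch alone is not sufficient -- also confirm the
# candidate's replication stream is healthy before promoting.
node1_epoch=8
node2_epoch=9

if [ "$node1_epoch" -eq "$node2_epoch" ]; then
  decision="manual-reconcile (equal epochs; ambiguous)"
elif [ "$node2_epoch" -gt "$node1_epoch" ]; then
  decision="promote-leader targeting node-2 (epoch $node2_epoch)"
else
  decision="promote-leader targeting node-1 (epoch $node1_epoch)"
fi
echo "$decision"
```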

3. Apply recovery: promote-leader

autonomy ha split-brain recover \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --operator "$AUTONOMY_OPERATOR" \
  --strategy promote-leader \
  --reason "node-2 has higher epoch and healthy replication; node-1 stale"

For a confirmed split-brain where the stale node may resist the resign, add --force:

autonomy ha split-brain recover \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --operator "$AUTONOMY_OPERATOR" \
  --strategy promote-leader \
  --reason "node-1 crash-recovered; node-2 is the correct leader" \
  --force

Expected response:

Split-Brain Recovery
  Strategy:     promote-leader
  Recovered At: 2026-03-18T11:10:00Z
  Message:      stale leader resigned; node-2 confirmed as leader for epoch 9

Safety note on --force: The force-resign uses a conditional WHERE holder_id = $1 clause. It is safe to call from a stale leader without overwriting the new leader’s holder_id record. However, always confirm the new leader’s state after using --force.
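The conditional-resign semantics can be illustrated with a small simulation. This models only the WHERE holder_id = $1 guard; the real statement runs inside the orchestrator:

```shell
# leadership_state.holder_id after the new leader took over:
holder_id="node-2"
# The force-resign issued against the stale node:
resign_target="node-1"

# The guard: only clear holder_id if it still names the resign target,
# so a new leader's record is never overwritten.
if [ "$holder_id" = "$resign_target" ]; then
  holder_id=""
  echo "resigned $resign_target; holder_id cleared"
else
  echo "no-op: holder_id is $holder_id, not $resign_target"
fi
```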


4. Apply recovery: manual-reconcile

autonomy ha split-brain recover \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --operator "$AUTONOMY_OPERATOR" \
  --strategy manual-reconcile \
  --reason "investigating both nodes before choosing leader"

Expected response:

Split-Brain Recovery
  Strategy:     manual-reconcile
  Recovered At: 2026-03-18T11:05:00Z
  Message:      elector state reset; both nodes in standby mode; trigger failover manually

After manual-reconcile, neither node holds write authority. You must:

  1. Inspect epoch history to determine which node’s writes are authoritative:

    SELECT epoch, holder_id, acquired_at, resigned_at
    FROM leader_epochs ORDER BY epoch DESC LIMIT 10;
    
  2. Inspect promotion decisions for any decisions made during the split period:

    SELECT pd.*, le.holder_id AS decided_by_leader
    FROM promotion_decisions pd
    JOIN leader_epochs le ON pd.epoch = le.epoch
    WHERE pd.decided_at > '<split-start-time>'
    ORDER BY pd.decided_at;
    
  3. Manually trigger a failover from the correct node:

    autonomy ha failover trigger \
      --orchestrator-url "http://cp-node-2:8888" \
      --operator "$AUTONOMY_OPERATOR" \
      --reason "post-reconcile: node-2 elected as correct leader"
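For orientation when reading the step-1 epoch-history query, a split-brain typically shows up as two open epochs. Illustrative output only — the exact column layout is assumed from the schema references above:

```
 epoch | holder_id | acquired_at          | resigned_at
-------+-----------+----------------------+-------------
     8 | node-2    | 2026-03-18T10:59:40Z | NULL
     8 | node-1    | 2026-03-18T10:58:02Z | NULL
```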
    

5. Post-recovery verification

# Run detect again to confirm split-brain is resolved
autonomy ha split-brain detect \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"
# Expect: Split-Brain Detected: false

# Confirm single write-ready leader
autonomy ha status \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"
# Expect: Role: leader, Write-Ready: true, Session lock held: true

Check the audit log for the recovery event:

autonomy audit query \
  --audit-dir "$AUTONOMY_AUDIT_DIR" \
  --category ha \
  --limit 10
# Expect: ha.split_brain.recovered event

Known gaps

  • No automated split-brain detection alert: The detection endpoint is synchronous and must be polled. A background monitor that emits ha.split_brain.detected on a timer is a follow-on item.

  • No automatic recovery: All recovery strategies require operator invocation. Automatic promotion of the node with the highest epoch is not implemented.

  • Epoch divergence between two running nodes requires querying both: The detection endpoint queries the single node it’s connected to. To confirm divergence, the operator must also query the other node’s /v1/health/leader endpoint directly.