Manual Failover Procedure¶
Audience: operators managing an AutonomyOps control-plane in HA mode (primary + streaming replica).
When to use this runbook¶
Planned maintenance on the current leader node (OS patching, hardware replacement).
The current leader is showing elevated latency or errors and the operator wants to shift write authority to a replica.
As part of a split-brain recovery (see Split-Brain Recovery).
Prerequisites¶
Running HA control-plane with at least one streaming replica.
Replication lag is not in the critical range (> 10s by default). The failover endpoint rejects the request if lag is critical to prevent data loss.
AUTONOMY_ORCHESTRATOR_URL pointing at the current leader node.
AUTONOMY_OPERATOR set.
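The exact endpoint and operator identifier are deployment-specific; the following is a minimal sketch with placeholder values (the URL scheme, hostname, and port are assumptions about your environment):
export AUTONOMY_ORCHESTRATOR_URL="https://cp-node-1.example.internal:8443"   # placeholder endpoint
export AUTONOMY_OPERATOR="jdoe"                                              # placeholder operator name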
1. Pre-failover health check¶
Always verify the cluster state before triggering a failover.
autonomy ha status \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"
Expected output (healthy leader):
Role: leader
Session lock held: true
Write-Ready: true
Queried At: 2026-03-18T10:00:00Z
Leader
Holder ID: cp-node-1
Epoch: 7
Acquired At: 2026-03-18T08:30:00Z
Quorum
Write-Ready: true
Sync Replicas: 1
Replication
Lag Status: ok
Max Write Lag: 12ms
Max Replay Lag: 18ms
Replicas: 1
Stop if:
Session lock held is false — this node is not the leader; connect to the actual leader.
Replication / Lag Status is critical — lag is too high; fix replication first.
Quorum / Sync Replicas is 0 and your cluster requires synchronous writes.
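The first two of these checks can also be scripted against the health endpoints used later in this runbook. The sketch below relies only on the holder_id and lag_status fields shown in the failure branches; anything beyond that is an assumption.
#!/usr/bin/env bash
# Pre-failover gate: report the current leader and abort if replication lag is critical.
set -euo pipefail

holder=$(curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/leader" | jq -r .holder_id)
lag=$(curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" | jq -r .lag_status)

echo "leader holder: ${holder}, lag status: ${lag}"

if [ "${lag}" = "critical" ]; then
  echo "replication lag is critical; fix replication before failing over" >&2
  exit 1
fi
Confirm the session lock and quorum fields with autonomy ha status as shown above; they are not part of this sketch.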
2. Trigger the failover¶
autonomy ha failover trigger \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--operator "$AUTONOMY_OPERATOR" \
--reason "planned maintenance: OS patching on cp-node-1"
Expected output:
Failover triggered
Previous Holder: cp-node-1
Previous Epoch: 7
Resigned At: 2026-03-18T10:05:00Z
Message: leader resigned gracefully; successor will campaign for epoch 8
Or with --output json:
{
"success": true,
"previous_holder": "cp-node-1",
"previous_epoch": 7,
"resigned_at": "2026-03-18T10:05:00Z",
"message": "leader resigned gracefully; successor will campaign for epoch 8"
}
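For automation, the JSON form can be gated on the success field shown above. A minimal sketch, assuming jq is available on the operator's machine:
autonomy ha failover trigger \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --operator "$AUTONOMY_OPERATOR" \
  --reason "planned maintenance: OS patching on cp-node-1" \
  --output json | jq -e '.success' > /dev/null \
  && echo "failover accepted" \
  || echo "failover rejected; see failure branches below" >&2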
3. Post-failover checklist¶
New leader reports Write-Ready: true.
Epoch incremented by exactly 1 (no gap).
Replication lag returns to ok after the brief transition window.
Rollout BatchPromoter on the new leader is running (check logs for cprollout: batch promoter started).
Audit event ha.failover.completed appears in the slog output.
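The first two items can be polled from a shell. The loop below is a sketch that greps the human-readable output of autonomy ha status using the field names from step 1; NEW_LEADER_URL is a placeholder for the successor's endpoint, and the attempt count and sleep interval are arbitrary choices.
# Poll the successor until it reports write readiness (NEW_LEADER_URL is a placeholder).
for i in $(seq 1 60); do
  if autonomy ha status --orchestrator-url "$NEW_LEADER_URL" | grep -q "Write-Ready: true"; then
    echo "new leader is write-ready"
    break
  fi
  sleep 2
done

# Confirm the new holder and that the epoch advanced by exactly one.
autonomy ha status --orchestrator-url "$NEW_LEADER_URL" | grep -E "Holder ID|Epoch"
Log and audit checks (batch promoter start line, ha.failover.completed) depend on where your deployment ships logs and are not covered by this sketch.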
Failure branches¶
409 — precondition not met: not the leader¶
The node you targeted is not the current leader.
# Find the current leader
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/leader" | jq .holder_id
Re-run the failover command against the leader’s endpoint.
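If node IDs map onto API endpoints by a predictable naming convention, re-pointing the URL can be scripted. The host and port mapping below is an assumption about your deployment, not something the control plane provides:
# Hypothetical mapping from holder ID to endpoint; adjust to your environment.
leader_id=$(curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/leader" | jq -r .holder_id)
export AUTONOMY_ORCHESTRATOR_URL="https://${leader_id}.example.internal:8443"

autonomy ha failover trigger \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --operator "$AUTONOMY_OPERATOR" \
  --reason "planned maintenance: OS patching on cp-node-1"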
409 — precondition not met: replication lag critical¶
The replica is too far behind. Wait for lag to recover, then retry. To check the current lag status:
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" | jq '{lag_status, max_replay_lag}'
Default thresholds: warn > 5s, critical > 10s. These are configurable via
ReplicationLagWarn / ReplicationLagMax in the control-plane config.
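Rather than retrying by hand, the retry can be gated on the lag status leaving the critical range. A sketch using the replication health endpoint shown above (poll interval is an arbitrary choice):
# Wait for replication lag to leave the critical range, then retry the failover.
while [ "$(curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" | jq -r .lag_status)" = "critical" ]; do
  echo "replication lag still critical; waiting..."
  sleep 5
done

autonomy ha failover trigger \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --operator "$AUTONOMY_OPERATOR" \
  --reason "retry after lag recovery"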
Successor does not acquire leadership within 30 seconds¶
Check that at least one replica has an active Campaign loop running.
Check PostgreSQL connectivity from the replica to the primary.
Check pg_stat_activity for lingering sessions from the old leader holding the advisory lock:
SELECT pid, application_name, state FROM pg_stat_activity WHERE application_name LIKE 'cp-%';
A session-level advisory lock (pg_try_advisory_lock) is released automatically when the PostgreSQL session disconnects. If the old leader crashed, terminate its session (see the sketch below).
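One way to terminate the lingering session is pg_terminate_backend. The sketch below assumes the old leader's application_name matches its node ID (cp-node-1 in this example) and that CONTROL_PLANE_DSN points at the control-plane database; both are placeholders for your environment.
# Terminate only the old leader's lingering session so its advisory lock is released.
psql "$CONTROL_PLANE_DSN" -c "
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE application_name = 'cp-node-1'
    AND pid <> pg_backend_pid();"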
Known gaps¶
No --force flag: The failover endpoint rejects requests when replication lag is critical. A --force override for emergency failover during a lag spike is a follow-on item.
No automated failover: Leader failure does not automatically trigger a Campaign on a replica — it relies on the next Campaign loop cycle. Automated failover (triggered by keepalive failure detection) is a follow-on item.
Write gap during transition: There is a brief period between Resign and the successor's Campaign (typically < 1s) during which the cluster has no write authority. Callers should retry protected writes with backoff during this window.
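A generic retry wrapper for that window might look like the following sketch; the protected write at the end is a placeholder request, not a real control-plane endpoint, and the attempt count and delays are arbitrary.
# Retry a command with exponential backoff to ride out the brief write gap.
retry_with_backoff() {
  local attempt=1 max_attempts=5 delay=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${max_attempts} attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}

# Example usage (placeholder write; substitute the caller's actual request).
retry_with_backoff curl -sf -X POST "${AUTONOMY_ORCHESTRATOR_URL}/v1/some-protected-write" -d '{}'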