Manual Failover Procedure

Audience: operators managing an AutonomyOps control-plane in HA mode (primary + streaming replica).

When to use this runbook

  • Planned maintenance on the current leader node (OS patching, hardware replacement).

  • The current leader is showing elevated latency or errors and the operator wants to shift write authority to a replica.

  • As part of a split-brain recovery (see Split-Brain Recovery).

Prerequisites

  • Running HA control-plane with at least one streaming replica.

  • Replication lag is not in the critical range (> 10s by default). The failover endpoint rejects the request when lag is critical, to prevent data loss.

  • AUTONOMY_ORCHESTRATOR_URL is set and points at the current leader node.

  • AUTONOMY_OPERATOR is set (see the export example after this list).
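
A minimal setup for these variables is shown below; the host, port, and operator name are example values and must be adjusted for your deployment.

# Example values only: point the CLI at the current leader node.
export AUTONOMY_ORCHESTRATOR_URL="http://cp-node-1:8888"

# Recorded with the failover request as the --operator value.
export AUTONOMY_OPERATOR="jane.doe"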


1. Pre-failover health check

Always verify the cluster state before triggering a failover.

autonomy ha status \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"

Expected output (healthy leader):

Role:               leader
Session lock held:  true
Write-Ready:        true
Queried At:         2026-03-18T10:00:00Z

Leader
  Holder ID:        cp-node-1
  Epoch:            7
  Acquired At:      2026-03-18T08:30:00Z

Quorum
  Write-Ready:      true
  Sync Replicas:    1

Replication
  Lag Status:       ok
  Max Write Lag:    12ms
  Max Replay Lag:   18ms
  Replicas:         1

Stop if any of the following holds (a scripted version of these checks is sketched after the list):

  • Session lock held is false — this node is not the leader; connect to the actual leader.

  • Replication / Lag Status is critical — lag is too high; fix replication first.

  • Quorum / Sync Replicas is 0 and your cluster requires synchronous writes.
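
The same stop conditions can be checked in a script against the health endpoints used later in this runbook. A minimal sketch, assuming jq is installed and that the /v1/health/leader and /v1/health/replication responses contain the fields shown in the examples below (session_lock_held, current_epoch, lag_status):

#!/usr/bin/env bash
# Preflight: abort unless this node holds the session lock and replication lag is ok.
set -euo pipefail

leader=$(curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/leader")
repl=$(curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication")

jq -e '.session_lock_held == true' <<<"$leader" >/dev/null \
  || { echo "not the leader; aborting" >&2; exit 1; }
jq -e '.lag_status == "ok"' <<<"$repl" >/dev/null \
  || { echo "replication lag not ok; aborting" >&2; exit 1; }

echo "preflight ok: epoch $(jq -r '.current_epoch' <<<"$leader")"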


2. Trigger the failover

autonomy ha failover trigger \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --operator "$AUTONOMY_OPERATOR" \
  --reason "planned maintenance: OS patching on cp-node-1"

Expected output:

Failover triggered
  Previous Holder: cp-node-1
  Previous Epoch:  7
  Resigned At:     2026-03-18T10:05:00Z
  Message:         leader resigned gracefully; successor will campaign for epoch 8

Or with --output json:

{
  "success": true,
  "previous_holder": "cp-node-1",
  "previous_epoch": 7,
  "resigned_at": "2026-03-18T10:05:00Z",
  "message": "leader resigned gracefully; successor will campaign for epoch 8"
}
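
To keep the resignation details for the verification step, the JSON output can be captured and parsed. A minimal sketch, assuming jq is installed:

# Capture the failover response; the successor is expected to campaign for previous_epoch + 1.
autonomy ha failover trigger \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --operator "$AUTONOMY_OPERATOR" \
  --reason "planned maintenance: OS patching on cp-node-1" \
  --output json > failover.json

prev_epoch=$(jq -r '.previous_epoch' failover.json)
echo "expecting new epoch: $((prev_epoch + 1))"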

3. Confirm the new leader has acquired write authority

The successor will campaign for the advisory lock within its campaign interval (typically < 1 second). Query the new leader’s endpoint, or use a load-balancer VIP that routes to any live node and check which node reports session_lock_held: true.

# Poll until a new leader appears
watch -n 2 'autonomy ha status \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" 2>&1 | head -6'

Success criteria:

  • A node other than cp-node-1 reports Session lock held: true.

  • Epoch has incremented by 1 (from 7 to 8 in this example).

  • Write-Ready is true on the new leader.

Alternatively, query the health endpoint directly:

curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/leader" | jq .
# Expect: "holder_id": "cp-node-2", "current_epoch": 8, "session_lock_held": true

4. Verify the old leader released authority

On the old leader node (or via its endpoint if it is still reachable):

curl -sf "http://cp-node-1:8888/v1/health/write-ready" | jq .
# Expect: {"ready": false}

5. Post-failover checklist

  • New leader reports Write-Ready: true.

  • Epoch incremented by exactly 1 (no gap).

  • Replication lag returns to ok after the brief transition window.

  • Rollout BatchPromoter is running on the new leader (check logs for cprollout: batch promoter started; a log-check sketch follows this list).

  • Audit event ha.failover.completed appears in the slog output.
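
If the control plane runs under systemd, the last two checks can be done with journalctl; the unit name below is a placeholder for whatever your deployment uses.

# Unit name is a placeholder; substitute your service name.
journalctl -u autonomy-control-plane --since "10 minutes ago" \
  | grep -E 'cprollout: batch promoter started|ha.failover.completed'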


Failure branches

409 — precondition not met: not the leader

The node you targeted is not the current leader.

# Find the current leader
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/leader" | jq .holder_id

Re-run the failover command against the leader’s endpoint.
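
A sketch that resolves the holder and retargets the command, assuming node IDs resolve as hostnames reachable on port 8888 (as in the examples elsewhere in this runbook):

# Resolve the current holder and retry the failover against its endpoint.
leader_id=$(curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/leader" | jq -r '.holder_id')
autonomy ha failover trigger \
  --orchestrator-url "http://${leader_id}:8888" \
  --operator "$AUTONOMY_OPERATOR" \
  --reason "planned maintenance: OS patching on cp-node-1"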

409 — precondition not met: replication lag critical

The replica is too far behind. Wait for lag to recover, then retry. To check the current lag status:

curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" | jq '{lag_status, max_replay_lag}'

Default thresholds: warn > 5s, critical > 10s. These are configurable via ReplicationLagWarn / ReplicationLagMax in the control-plane config.
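
To wait for lag to return to the ok range before retrying, the replication endpoint can be polled. A minimal sketch:

# Poll until lag_status reports ok, then retry the failover command.
until curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" \
    | jq -e '.lag_status == "ok"' >/dev/null; do
  echo "replication lag still elevated; waiting..."
  sleep 5
done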

Successor does not acquire leadership within 30 seconds

  1. Check that at least one replica has an active Campaign loop running.

  2. Check PostgreSQL connectivity from the replica to the primary.

  3. Check pg_stat_activity for lingering sessions from the old leader holding the advisory lock:

    SELECT pid, application_name, state FROM pg_stat_activity
    WHERE application_name LIKE 'cp-%';
    

    A session-level advisory lock (pg_try_advisory_lock) is released automatically when the PostgreSQL session disconnects. If the old leader crashed, terminate its session (see the sketch below).
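
    A sketch for terminating the lingering session with pg_terminate_backend; the application_name filter is an assumption about how the old leader identifies itself, so verify the pid against the query above first.

    -- Terminate lingering session(s) from the crashed old leader.
    -- Adjust the application_name filter (or use the pid directly) as appropriate.
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE application_name LIKE 'cp-node-1%';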


Known gaps

  • No --force flag: The failover endpoint rejects requests when replication lag is critical. A --force override for emergency failover during a lag spike is a follow-on item.

  • No automated failover: Leader failure does not immediately trigger a Campaign on a replica; recovery waits for the next Campaign loop cycle. Automated failover (triggered by keepalive failure detection) is a follow-on item.

  • Write gap during transition: There is a brief period between Resign and the successor’s Campaign (typically < 1s) during which the cluster has no write authority. Callers should retry protected writes with backoff during this window (a minimal retry sketch follows).
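
A client-side retry sketch for that window; the endpoint and payload below are placeholders for illustration and are not part of the control-plane API.

# Retry a protected write with simple exponential backoff across the transition window.
# /v1/example/write and the payload are placeholders only.
delay=1
for attempt in 1 2 3 4 5; do
  if curl -sf -X POST "${AUTONOMY_ORCHESTRATOR_URL}/v1/example/write" \
       -H 'Content-Type: application/json' -d '{"key": "value"}'; then
    break
  fi
  echo "write rejected (attempt ${attempt}); retrying in ${delay}s..." >&2
  sleep "$delay"
  delay=$((delay * 2))
done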