Quorum-Loss Recovery

Audience: operators managing an AutonomyOps HA control-plane cluster.

What is quorum loss?

The AutonomyOps control-plane defines quorum as the set of conditions under which the cluster can safely accept protected writes. Quorum holds when all of the following are true:

  1. Write-readiness: The leader node holds the advisory lock and is connected to the PostgreSQL primary.

  2. Sync replica count: The number of synchronous streaming replicas meets the configured MinSyncReplicas threshold.

  3. Replication lag: Maximum replay lag across all replicas is below the warning threshold.

The background QuorumMonitor (default poll interval: 30s) detects quorum loss and emits an ha.quorum.lost audit event; when quorum is restored it emits ha.quorum.restored.
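Each condition maps to a check you can run independently. As a quick sketch (using the health endpoints and PostgreSQL views referenced later in this runbook; compare the replica count against your own MinSyncReplicas setting):

# 1. Write-readiness (leader holds the advisory lock, connected to the primary)
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/write-ready" | jq .ready

# 2. Sync replica count
psql "$POSTGRES_URL" -Atc \
  "SELECT count(*) FROM pg_stat_replication WHERE sync_state = 'sync';"

# 3. Replication lag
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" | jq .lag_status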


1. Check current quorum status

autonomy ha quorum status \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"

Expected output (healthy quorum):

Quorum Status
  Write-Ready:    true
  Sync Replicas:  1
  Lag Status:     ok
  Queried At:     2026-03-18T12:00:00Z

Expected output (quorum lost):

Quorum Status
  Write-Ready:    false
  Sync Replicas:  0
  Lag Status:     critical
  Queried At:     2026-03-18T12:05:00Z

  Warning: quorum conditions are not met. Protected writes are blocked.

With --output json:

autonomy ha quorum status \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --output json | jq .
{
  "write_ready": false,
  "sync_replica_count": 0,
  "lag_status": "critical",
  "queried_at": "2026-03-18T12:05:00Z"
}
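The JSON form is convenient for scripting. For example, a minimal pre-flight gate (a sketch using only the flags and fields shown above) can refuse to proceed while protected writes are blocked:

if [ "$(autonomy ha quorum status \
      --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
      --output json | jq -r '.write_ready')" != "true" ]; then
  echo "quorum lost: protected writes are blocked" >&2
  exit 1
fi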

2. Diagnose the cause of quorum loss

Cause A: Sync replica disconnected

# Check replication status from PostgreSQL
psql "$POSTGRES_URL" -c "
  SELECT application_name, state, sync_state,
         write_lag, flush_lag, replay_lag
  FROM pg_stat_replication;
"

If the output is empty or sync_state is not sync, the synchronous replica is disconnected. Likely causes:

  • Network partition between primary and replica.

  • Replica process crashed (check replica system logs).

  • The replica fell too far behind; the WAL it needed was removed or its replication slot was invalidated, forcing a disconnect.

Resolution:

  1. Check replica health:

    ssh cp-replica "systemctl status postgresql"
    
  2. If the replica process is stopped, restart it and verify it reconnects:

    -- After restart, on primary:
    SELECT application_name, state, sync_state FROM pg_stat_replication;
    -- Expect: sync_state = sync
    
  3. If the replication slot is retaining a large amount of WAL (restart_lsn far behind pg_current_wal_lsn()):

    SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
    FROM pg_replication_slots;
    

    If lag is extreme (many GB), you may need to rebuild the replica with pg_basebackup.
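
    A rebuild with pg_basebackup typically looks like the following sketch. The
    hostname (cp-primary), data directory, and replication user are
    placeholders; substitute your environment's values and double-check the
    data directory before deleting anything.

    # On the replica:
    sudo systemctl stop postgresql
    sudo -u postgres bash -c 'rm -rf /var/lib/postgresql/data/*'   # destructive: clears the old data directory
    sudo -u postgres pg_basebackup \
      -h cp-primary -U replicator \
      -D /var/lib/postgresql/data \
      -X stream -R -P
    sudo systemctl start postgresql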

Cause B: Replication lag in critical range

curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" \
  | jq '{lag_status, max_write_lag, max_flush_lag, max_replay_lag}'

If lag_status is critical (replay lag above the default 10s threshold), the cluster blocks protected writes to protect data integrity.

Resolution:

  • Identify the source of lag: heavy write load on the primary, network congestion, or slow disk on the replica (see the check after this list).

  • Reduce write load if possible and wait for replica to catch up.

  • Monitor lag convergence:

    watch -n 5 'curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" | jq .max_replay_lag'
    
  • Default lag thresholds: warn > 5s, critical > 10s. Both are configurable.
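
To tell network delay apart from slow replay, compare the receive and replay positions on the standby. This is a sketch; $REPLICA_POSTGRES_URL is a placeholder for a connection string that points at the replica:

psql "$REPLICA_POSTGRES_URL" -c "
  SELECT pg_last_wal_receive_lsn() AS received,
         pg_last_wal_replay_lsn()  AS replayed,
         pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                         pg_last_wal_replay_lsn()) AS replay_backlog_bytes;
"
# A large replay_backlog_bytes while WAL keeps arriving points at slow disk or
# CPU on the replica; little or no incoming WAL points at the network or primary.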

Cause C: Leader node lost write authority (not the primary)

curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/write-ready" | jq .
# If ready: false, check why the leader lost write authority
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/leader" | jq .

If session_lock_held is false, the leader lost its advisory lock. A new Campaign cycle should re-acquire it within the KeepaliveInterval (typically 5s). If it does not:

  • Check PostgreSQL connectivity from the leader:

    psql "$POSTGRES_URL" -c "SELECT 1;"
    
  • Check the leader's keepalive failure counter (and the surrounding log events):

    cp.leader.keepalive_failures_total  — should be 0 or very low in steady state
    
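If the lock does not come back, you can check from PostgreSQL whether any session currently holds an advisory lock. This is a sketch; the specific lock key used by the control-plane is deployment-specific:

psql "$POSTGRES_URL" -c "
  SELECT pid, locktype, classid, objid, granted
  FROM pg_locks
  WHERE locktype = 'advisory';
"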

3. Monitor quorum restoration

The QuorumMonitor polls every 30s (configurable) and emits an audit event whenever the quorum state changes. Watch the audit log for ha.quorum.restored:

autonomy audit query \
  --audit-dir "$AUTONOMY_AUDIT_DIR" \
  --category ha \
  --limit 20

Or poll quorum status directly:

watch -n 10 'autonomy ha quorum status \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" 2>&1'
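
For unattended waiting, a small loop over the JSON status (a sketch using only the flags and fields shown in section 1) blocks until write readiness returns:

until [ "$(autonomy ha quorum status \
        --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
        --output json | jq -r '.write_ready')" = "true" ]; do
  sleep 10
done
echo "quorum restored"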

4. Emergency: operating with degraded quorum

If quorum loss is prolonged and the release deployment must continue:

  1. Assess risk: While writes are blocked, stage promotions, publication of new rollout plans, and HA leader changes are all unavailable.

  2. Do not lower MinSyncReplicas to 0 in production without explicit approval — this removes the synchronous replication guarantee and risks data loss on primary failure.

  3. Consider a read-only degraded mode: The control-plane read endpoints (GET /v1/rollouts, GET /v1/health/*) continue to work without write authority. Operators can inspect state but cannot modify it (see the check after this list).

  4. Escalate: If quorum cannot be restored within the SLO window, escalate to infrastructure for PostgreSQL primary failover.
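
To confirm the read-only degraded mode described in step 3, the read endpoints should still answer while writes are blocked (a sketch using endpoints already referenced in this runbook):

curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts" | jq .
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" | jq .lag_status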


Known gaps

  • No automated quorum-loss recovery: The QuorumMonitor detects state changes and emits audit events, but does not take automatic corrective action. All recovery is manual.

  • No alerting integration: The monitor emits slog events only. Wiring to a Prometheus alert (e.g. cp_quorum_write_ready == 0 for > 1m) is a follow-on item.

  • Single detection criterion for MinSyncReplicas: The system evaluates sync replica count but does not distinguish between a single offline replica (recoverable) and a permanent replica decommission. Both present as quorum loss.