Quorum-Loss Recovery¶
Audience: operators managing an AutonomyOps HA control-plane cluster.
What is quorum loss?¶
The AutonomyOps control-plane defines quorum as the condition under which the cluster can safely accept protected writes. Quorum is evaluated based on:
Write-readiness: The leader node holds the advisory lock and is connected to the PostgreSQL primary.
Sync replica count: The number of synchronous streaming replicas meets the configured MinSyncReplicas threshold (see the spot-check after this list).
Replication lag: Maximum replay lag across all replicas is below the warning threshold.
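The sync-replica condition can be spot-checked directly against PostgreSQL. A minimal sketch, assuming the primary is reachable via $POSTGRES_URL (as in the diagnosis steps below) and MinSyncReplicas is 1:
# Count connected synchronous replicas and compare against
# MinSyncReplicas (assumed to be 1 here).
count=$(psql "$POSTGRES_URL" -tAc \
  "SELECT count(*) FROM pg_stat_replication WHERE sync_state = 'sync';")
if [ "$count" -lt 1 ]; then
  echo "sync replica count $count is below the MinSyncReplicas threshold" >&2
fi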
Quorum loss is detected by the background QuorumMonitor (default poll interval: 30s), which emits an ha.quorum.lost audit event. Restoration emits ha.quorum.restored.
1. Check current quorum status¶
autonomy ha quorum status \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"
Expected output (healthy quorum):
Quorum Status
Write-Ready: true
Sync Replicas: 1
Lag Status: ok
Queried At: 2026-03-18T12:00:00Z
Expected output (quorum lost):
Quorum Status
Write-Ready: false
Sync Replicas: 0
Lag Status: critical
Queried At: 2026-03-18T12:05:00Z
Warning: quorum conditions are not met. Protected writes are blocked.
With --output json:
autonomy ha quorum status \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--output json | jq .
{
"write_ready": false,
"sync_replica_count": 0,
"lag_status": "critical",
"queried_at": "2026-03-18T12:05:00Z"
}
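The JSON form makes it easy to gate automation on quorum health. A minimal sketch using the write_ready field shown above (jq assumed available):
# Abort a deployment script early if protected writes are blocked.
write_ready=$(autonomy ha quorum status \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --output json | jq -r '.write_ready')

if [ "$write_ready" != "true" ]; then
  echo "quorum lost: protected writes are blocked" >&2
  exit 1
fi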
2. Diagnose the cause of quorum loss¶
Cause A: Sync replica disconnected¶
# Check replication status from PostgreSQL
psql "$POSTGRES_URL" -c "
SELECT application_name, state, sync_state,
write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
"
If the output is empty or sync_state is not sync, the synchronous replica is
disconnected. Likely causes:
Network partition between primary and replica.
Replica process crashed (check replica system logs).
PostgreSQL replication slot filled up, causing replica to lag too far and disconnect.
Resolution:
Check replica health:
ssh cp-replica "systemctl status postgresql"
If the replica process is stopped, restart it and verify it reconnects:
-- After restart, on primary:
SELECT application_name, state, sync_state
FROM pg_stat_replication;
-- Expect: sync_state = sync
If the replication slot is bloated (its restart_lsn is far behind pg_current_wal_lsn()):
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
If lag is extreme (many GB), you may need to rebuild the replica with
pg_basebackup.
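A minimal rebuild sketch, run on the replica host. The primary hostname (cp-primary), replication role (replicator), slot name (cp_replica_slot), and data directory are illustrative assumptions; substitute your deployment's values:
# On the replica: stop PostgreSQL and set aside the stale data directory.
sudo systemctl stop postgresql
sudo mv /var/lib/postgresql/data /var/lib/postgresql/data.old

# Take a fresh base backup from the primary, streaming WAL (-X stream),
# writing standby configuration (-R), and reusing the replication slot (-S).
sudo -u postgres pg_basebackup \
  -h cp-primary -U replicator \
  -D /var/lib/postgresql/data \
  -X stream -R -P \
  -S cp_replica_slot

sudo systemctl start postgresql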
Cause B: Replication lag in critical range¶
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" \
| jq '{lag_status, max_write_lag, max_flush_lag, max_replay_lag}'
If lag_status is critical (replay lag exceeds the 10s default), the cluster is preventing writes to protect data integrity.
Resolution:
Identify the source of lag: heavy write load on primary, network congestion, or slow disk on replica.
Reduce write load if possible and wait for replica to catch up.
Monitor lag convergence:
watch -n 5 'curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" | jq .max_replay_lag'
Default lag thresholds: warn > 5s, critical > 10s. Both are configurable.
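How the thresholds are set is deployment-specific. As an illustrative sketch only (the flag names and binary below are hypothetical; check your orchestrator's configuration reference):
# Hypothetical flag names, for illustration only.
autonomy-orchestrator \
  --ha-lag-warn-threshold 5s \
  --ha-lag-critical-threshold 10s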
3. Monitor quorum restoration¶
The QuorumMonitor polls every 30s (configurable) and emits an event on each state change. Monitor the
audit log for ha.quorum.restored:
autonomy audit query \
--audit-dir "$AUTONOMY_AUDIT_DIR" \
--category ha \
--limit 20
Or poll quorum status directly:
watch -n 10 'autonomy ha quorum status \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" 2>&1'
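For unattended waiting, a small loop can block until write readiness returns. A sketch built on the JSON output from step 1:
# Poll every 10s until quorum is restored.
until autonomy ha quorum status \
    --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
    --output json | jq -e '.write_ready == true' > /dev/null; do
  echo "$(date -u +%FT%TZ) quorum still lost; waiting..."
  sleep 10
done
echo "quorum restored"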
4. Emergency: operating with degraded quorum¶
If quorum loss is prolonged and the release deployment must continue:
Assess risk: Write-blocked means no stage promotions, no new rollout plans can be published, and no HA leader changes can be written.
Do not lower MinSyncReplicas to 0 in production without explicit approval: this removes the synchronous replication guarantee and risks data loss on primary failure.
Consider a read-only degraded mode: The control-plane read endpoints (GET /v1/rollouts, GET /v1/health/*) continue to work without write authority. Operators can inspect state but cannot modify it; see the sketch after this list.
Escalate: If quorum cannot be restored within the SLO window, escalate to infrastructure for PostgreSQL primary failover.
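A sketch of the read-only inspection path while writes are blocked; both endpoints are documented above, and response shapes depend on your deployment:
# Read endpoints remain available without write authority.
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts" | jq .
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" | jq .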
Known gaps¶
No automated quorum-loss recovery: The QuorumMonitor detects state changes and emits audit events, but does not take automatic corrective action. All recovery is manual.
No alerting integration: The monitor emits slog events only. Wiring it to a Prometheus alert (e.g. cp_quorum_write_ready == 0 for > 1m) is a follow-on item; see the sketch after this list.
Single detection criterion for MinSyncReplicas: The system evaluates sync replica count but does not distinguish between a single offline replica (recoverable) and a permanent replica decommission. Both present as quorum loss.
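A sketch of that follow-on alert rule, assuming the write-ready gauge is exported to Prometheus under the metric name given above:
groups:
  - name: autonomy-ha
    rules:
      - alert: ControlPlaneQuorumLost
        # Fires when write readiness has been false for a full minute.
        expr: cp_quorum_write_ready == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "AutonomyOps control-plane quorum lost"
          description: "Protected writes are blocked. Follow the quorum-loss recovery runbook."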