Quorum-Loss Recovery

Audience: operators managing an AutonomyOps HA control-plane cluster.

What is quorum loss?

The AutonomyOps control-plane defines quorum as the set of conditions under which the cluster can safely accept protected writes. Quorum holds when all of the following are true:

  1. Write-readiness: The leader node holds the advisory lock and is connected to the PostgreSQL primary.

  2. Sync replica count: The number of synchronous streaming replicas meets the configured MinSyncReplicas threshold.

  3. Replication lag: Maximum replay lag across all replicas is below the warning threshold.

The background QuorumMonitor (default poll interval: 30s) detects quorum loss and emits an ha.quorum.lost audit event; when quorum is restored it emits ha.quorum.restored.
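Each condition maps to a check you can run independently. As a quick sketch (using the health endpoints and PostgreSQL views referenced later in this runbook; compare the replica count against your own MinSyncReplicas setting):

# 1. Write-readiness (leader holds the advisory lock, connected to the primary)
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/write-ready" | jq .ready

# 2. Sync replica count
psql "$POSTGRES_URL" -Atc \
  "SELECT count(*) FROM pg_stat_replication WHERE sync_state = 'sync';"

# 3. Replication lag
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" | jq .lag_status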


1. Check current quorum status

autonomy ha quorum status \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"

Expected output (healthy quorum):

Quorum Status
  Write-Ready:    true
  Sync Replicas:  1
  Lag Status:     ok
  Queried At:     2026-03-18T12:00:00Z

Expected output (quorum lost):

Quorum Status
  Write-Ready:    false
  Sync Replicas:  0
  Lag Status:     critical
  Queried At:     2026-03-18T12:05:00Z

  Warning: quorum conditions are not met. Protected writes are blocked.

With --output json:

autonomy ha quorum status \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --output json | jq .
{
  "write_ready": false,
  "sync_replica_count": 0,
  "lag_status": "critical",
  "queried_at": "2026-03-18T12:05:00Z"
}
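The JSON form is convenient for scripting. For example, a minimal pre-flight gate (a sketch using only the flags and fields shown above) can refuse to proceed while protected writes are blocked:

if [ "$(autonomy ha quorum status \
      --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
      --output json | jq -r '.write_ready')" != "true" ]; then
  echo "quorum lost: protected writes are blocked" >&2
  exit 1
fi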

2. Diagnose the cause of quorum loss

Cause A: Sync replica disconnected

# Check replication status from PostgreSQL
psql "$POSTGRES_URL" -c "
  SELECT application_name, state, sync_state,
         write_lag, flush_lag, replay_lag
  FROM pg_stat_replication;
"

If the output is empty or sync_state is not sync, the synchronous replica is disconnected. Likely causes:

  • Network partition between primary and replica.

  • Replica process crashed (check replica system logs).

  • The replica fell too far behind; the WAL it needed was removed or its replication slot was invalidated, forcing a disconnect.

Resolution:

  1. Check replica health:

    ssh cp-replica "systemctl status postgresql"
    
  2. If the replica process is stopped, restart it and verify it reconnects:

    -- After restart, on primary:
    SELECT application_name, state, sync_state FROM pg_stat_replication;
    -- Expect: sync_state = sync
    
  3. If the replication slot is retaining a large amount of WAL (restart_lsn far behind pg_current_wal_lsn()):

    SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
    FROM pg_replication_slots;
    

    If lag is extreme (many GB), you may need to rebuild the replica with pg_basebackup.
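
    A rebuild with pg_basebackup typically looks like the following sketch. The
    hostname (cp-primary), data directory, and replication user are
    placeholders; substitute your environment's values and double-check the
    data directory before deleting anything.

    # On the replica:
    sudo systemctl stop postgresql
    sudo -u postgres bash -c 'rm -rf /var/lib/postgresql/data/*'   # destructive: clears the old data directory
    sudo -u postgres pg_basebackup \
      -h cp-primary -U replicator \
      -D /var/lib/postgresql/data \
      -X stream -R -P
    sudo systemctl start postgresql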

Cause B: Replication lag in critical range

curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" \
  | jq '{lag_status, max_write_lag, max_flush_lag, max_replay_lag}'

If lag_status is critical (replay lag above the default 10s threshold), the cluster blocks protected writes to protect data integrity.

Resolution:

  • Identify the source of lag: heavy write load on the primary, network congestion, or slow disk on the replica (see the check after this list).

  • Reduce write load if possible and wait for replica to catch up.

  • Monitor lag convergence:

    watch -n 5 'curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" | jq .max_replay_lag'
    
  • Default lag thresholds: warn > 5s, critical > 10s. Both are configurable.
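
To tell network delay apart from slow replay, compare the receive and replay positions on the standby. This is a sketch; $REPLICA_POSTGRES_URL is a placeholder for a connection string that points at the replica:

psql "$REPLICA_POSTGRES_URL" -c "
  SELECT pg_last_wal_receive_lsn() AS received,
         pg_last_wal_replay_lsn()  AS replayed,
         pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                         pg_last_wal_replay_lsn()) AS replay_backlog_bytes;
"
# A large replay_backlog_bytes while WAL keeps arriving points at slow disk or
# CPU on the replica; little or no incoming WAL points at the network or primary.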

Cause C: Leader node lost write authority (not the primary)

curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/write-ready" | jq .
# If ready: false, check why the leader lost write authority
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/leader" | jq .

If session_lock_held is false, the leader lost its advisory lock. A new Campaign cycle should re-acquire it within the KeepaliveInterval (typically 5s). If it does not:

  • Check PostgreSQL connectivity from the leader:

    psql "$POSTGRES_URL" -c "SELECT 1;"
    
  • Check the leader's keepalive failure counter (and the surrounding log events):

    cp.leader.keepalive_failures_total  — should be 0 or very low in steady state
    
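If the lock does not come back, you can check from PostgreSQL whether any session currently holds an advisory lock. This is a sketch; the specific lock key used by the control-plane is deployment-specific:

psql "$POSTGRES_URL" -c "
  SELECT pid, locktype, classid, objid, granted
  FROM pg_locks
  WHERE locktype = 'advisory';
"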

3. Monitor quorum restoration

The QuorumMonitor polls every 30s (configurable) and emits an audit event whenever the quorum state changes. Watch the audit log for ha.quorum.restored:

autonomy audit query \
  --audit-dir "$AUTONOMY_AUDIT_DIR" \
  --category ha \
  --limit 20

Or poll quorum status directly:

watch -n 10 'autonomy ha quorum status \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" 2>&1'
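
For unattended waiting, a small loop over the JSON status (a sketch using only the flags and fields shown in section 1) blocks until write readiness returns:

until [ "$(autonomy ha quorum status \
        --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
        --output json | jq -r '.write_ready')" = "true" ]; do
  sleep 10
done
echo "quorum restored"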

4. Emergency: operating with degraded quorum

If quorum loss is prolonged and the release deployment must continue:

  1. Assess risk: While writes are blocked, stage promotions, publication of new rollout plans, and HA leader changes are all unavailable.

  2. Do not lower MinSyncReplicas to 0 in production without explicit approval — this removes the synchronous replication guarantee and risks data loss on primary failure.

  3. Consider a read-only degraded mode: The control-plane read endpoints (GET /v1/rollouts, GET /v1/health/*) continue to work without write authority. Operators can inspect state but cannot modify it (see the check after this list).

  4. Escalate: If quorum cannot be restored within the SLO window, escalate to infrastructure for PostgreSQL primary failover.
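
To confirm the read-only degraded mode described in step 3, the read endpoints should still answer while writes are blocked (a sketch using endpoints already referenced in this runbook):

curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts" | jq .
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/health/replication" | jq .lag_status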


Known gaps

  • No automated quorum-loss recovery: The QuorumMonitor detects state changes and emits audit events, but does not take automatic corrective action. All recovery is manual.

  • No alerting integration: The monitor emits slog events only. Wiring to a Prometheus alert (e.g. cp_quorum_write_ready == 0 for > 1m) is a follow-on item.

  • Single detection criterion for MinSyncReplicas: The system evaluates sync replica count but does not distinguish between a single offline replica (recoverable) and a permanent replica decommission. Both present as quorum loss.