Emergency Rollback Procedure

Audience: on-call operators responding to a production incident requiring immediate rollback of one or more AutonomyOps subsystems.

Quick-start: unified orchestrator

PR-29-followup-b introduces autonomy rollback as a single entry point for rollout, HA, and relay rollback actions. Use it when you want a consistent command surface and automatic audit record emission.

# 1. Preview safety class and limitations before acting
autonomy rollback preview --target rollout_plan
autonomy rollback preview --target ha_leader_resign

# 2. Execute (emits rollback.executed audit record)
export AUTONOMY_OPERATOR="oncall@example.com"

autonomy rollback execute \
  --target rollout_plan \
  --resource "$PLAN_ID" \
  --strategy rollback \
  --reason "INCIDENT-$(date +%Y%m%d): elevated error rate" \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"

# 3. Verify result in audit trail
autonomy audit query \
  --audit-dir "$AUTONOMY_AUDIT_DIR" \
  --event-type rollback.executed \
  --output json

Orchestrated targets: rollout_plan, rollout_stage, ha_leader_resign. Manual targets (use edgectl): relay_deadletter — see Phase 3 below.


Overview

AutonomyOps rollback operations span three subsystems, each with different safety classifications:

Subsystem

Operation

Safety Class

Reversible?

Fleet Rollout

Cancel/rollback plan

Terminal

No — rolled_back is permanent

Fleet Rollout

Halt a stage

Terminal (stage only)

No — stage is permanently halted

HA Control-Plane

Resign leader

Reversible

Yes — a successor can campaign

Edge Relay

Force deadletter

Idempotent

Safe to repeat; retry required to resume

Key guarantee: Safety class terminal means the system will never return to the pre-rollback state through the normal operation path. Terminal actions require explicit operator confirmation before proceeding.


Phase 0: Triage and scoping (< 2 minutes)

Before executing any rollback:

  1. Identify the affected subsystem(s):

    • Fleet rollout stuck or deploying bad artifact?

    • Control-plane write authority lost?

    • Edge relay delivery completely blocked?

  2. Collect current state (generate a support bundle):

    
    

autonomy support-bundle generate
–output /tmp/incident-$(date +%Y%m%d-%H%M%S).tar.gz
–orchestrator-url “$AUTONOMY_ORCHESTRATOR_URL”
–config-file /etc/autonomy/config.yaml
–audit-dir “$AUTONOMY_AUDIT_DIR”


3. **Confirm who is on-call and has authorized fleet visibility** (the current
HTTP/CLI RBAC model gates these recovery surfaces on a known operator identity
with `fleet:read`, or on an explicitly audited break-glass session).

---

## Phase 1: Fleet rollout emergency stop

### 1a. Stop a harmful deployment (rollback — terminal)

Use when: canary metrics indicate the deployed artifact is causing harm and you need an
immediate record of the rollback decision.

```bash
# Identify the active plan
autonomy rollout plan list \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--output json | jq '.[] | select(.phase == "active") | .plan_id'
# Roll back the active plan (terminal — cannot be restarted)
PLAN_ID="<plan-id-from-above>"
autonomy rollout recover \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id "$PLAN_ID" \
  --strategy rollback \
  --reason "INCIDENT-$(date +%Y%m%d): elevated error rate on canary nodes"

Expected: "new_phase": "rolled_back".

Limitation: Nodes that already activated the artifact continue running it. The control-plane does not push revert commands. Manual node remediation may be required.

1b. Cancel an unpublished or non-activated plan

If the plan has not yet activated on any node:

autonomy rollout plan cancel \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id "$PLAN_ID"

1c. Halt the current stage only

If the issue is contained to the current deployment stage and you want to stop expansion without rolling back the entire plan:

curl -sf -X POST \
  "${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts/${PLAN_ID}/halt" \
  -H "Content-Type: application/json" \
  -d '{"reason": "INCIDENT: halting stage expansion pending investigation"}' | jq .

Phase 2: Control-plane emergency actions

2a. Emergency leader resignation (HA failover)

Use when: the current leader node is unhealthy and you need write authority to move to a healthy replica immediately.

autonomy ha failover trigger \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --operator "$AUTONOMY_OPERATOR" \
  --reason "INCIDENT: emergency failover — leader node memory pressure"

Safety: This is reversible. The old leader can re-campaign once it is healthy. The write gap during transition is typically < 1 second.

2b. Split-brain resolution (if detected during incident)

# Step 1: detect
autonomy ha split-brain detect \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"

# Step 2: recover with the appropriate strategy
autonomy ha split-brain recover \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --operator "$AUTONOMY_OPERATOR" \
  --strategy promote-leader \
  --reason "INCIDENT: split-brain during leader host failure" \
  --force

See Split-Brain Recovery for full decision tree.


Phase 3: Edge relay emergency actions

3a. Suspend relay delivery (bandwidth throttle to zero)

If relay delivery is causing harm (e.g. flooding a constrained link):

edgectl relay config set-bandwidth \
  --bytes-per-second 1 \
  --socket /run/edged/rpc.sock

Setting rate to 1 byte/second effectively suspends delivery without terminating the relay subsystem. Restore with:

edgectl relay config set-bandwidth --bytes-per-second 0 --socket /run/edged/rpc.sock

3b. Force-deadletter a segment that is causing repeated failures

If a specific segment is blocking the delivery queue:

# Inspect the entry first
edgectl relay deadletter inspect SEG-ID PEER-ID --socket /run/edged/rpc.sock

# Force into deadletter (requires the entry to already be in a non-terminal state)
# This happens automatically via ForceDeadletter for SEGMENT_MISSING paths,
# but can also be triggered by stopping delivery and letting TransitionFailed exhaust retries.

For immediate purge of a harmful deadletter entry:

edgectl relay deadletter purge \
  --segment-id SEG-ID \
  --peer-id PEER-ID \
  --force \
  --socket /run/edged/rpc.sock

Safety: Purge is permanent for the delivery record. The segment data in local storage is not removed — only the outbound tracking record.


Phase 4: Post-incident verification

After executing rollback actions, verify the system is in a known stable state:

# 1. Confirm rollout plan is in expected terminal phase
autonomy rollout plan describe \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id "$PLAN_ID" \
  --output json | jq .phase

# 2. Confirm HA cluster has a single write-ready leader
autonomy ha status \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"

# 3. Confirm quorum is restored
autonomy ha quorum status \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"

# 4. Confirm relay delivery queue is draining
edgectl relay status --socket /run/edged/rpc.sock

# 5. Query audit trail for rollback events
autonomy audit query \
  --audit-dir "$AUTONOMY_AUDIT_DIR" \
  --category rollout \
  --limit 10

autonomy audit query \
  --audit-dir "$AUTONOMY_AUDIT_DIR" \
  --category ha \
  --limit 10

Phase 5: Incident documentation

Generate a final support bundle after stabilization:

  autonomy support-bundle generate \
  --output /tmp/incident-$(date +%Y%m%d-%H%M%S)-post-recovery.tar.gz \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --config-file /etc/autonomy/config.yaml \
  --audit-dir "$AUTONOMY_AUDIT_DIR"

Export the audit log covering the incident window:

autonomy audit export \
  --audit-dir "$AUTONOMY_AUDIT_DIR" \
  --format json \
  > /tmp/incident-audit-$(date +%Y%m%d).json

Rollback safety reference

Operation

Kind

Safety Class

Reversible

Repeatable

rollout plan cancel

KindRolloutPlan

Terminal

No

No (already terminal)

rollout recover --strategy rollback

KindRolloutPlan

Terminal

No

No (already terminal)

rollout halt

KindRolloutStage

Terminal (stage)

No

Idempotent if already halted

ha failover trigger

KindHALeaderResign

Reversible

Yes

Yes (new Campaign)

relay deadletter retry

KindRelayDeadletter

Idempotent

Yes (safe to repeat)

relay deadletter purge

Permanent

No


Known gaps

  • No automated incident detection: All rollback triggers are operator-initiated. The system does not auto-rollback on health gate failure or error rate threshold breach. Automated rollback is a follow-on item.

  • relay_deadletter not orchestrated: autonomy rollback execute does not dispatch relay deadletter actions. Use edgectl relay deadletter retry|purge directly. autonomy rollback preview --target relay_deadletter shows safety information.

  • Artifact revert not implemented: Rolled-back rollout plans stop further deployment but do not push revert artifacts to nodes that already activated. Node remediation must be performed out of band.

  • Relay suspension is config-only: Setting bytes-per-second 1 throttles delivery but does not cleanly drain inflight workers. For a hard stop, edged must be restarted. Inflight records become ABANDONED on restart and re-enter Scheduled.