Emergency Rollback Procedure¶

Audience: on-call operators responding to a production incident requiring immediate rollback of one or more AutonomyOps subsystems.

Quick-start: unified orchestrator¶

PR-29-followup-b introduces autonomy rollback as a single entry point for rollout, HA, and relay rollback actions. Use it when you want a consistent command surface and automatic audit record emission.

# 1. Preview safety class and limitations before acting
autonomy rollback preview --target rollout_plan
autonomy rollback preview --target ha_leader_resign

# 2. Execute (emits rollback.executed audit record)
export AUTONOMY_OPERATOR="oncall@example.com"

autonomy rollback execute \
  --target rollout_plan \
  --resource "$PLAN_ID" \
  --strategy rollback \
  --reason "INCIDENT-$(date +%Y%m%d): elevated error rate" \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"

# 3. Verify result in audit trail
autonomy audit query \
  --audit-dir "$AUTONOMY_AUDIT_DIR" \
  --event-type rollback.executed \
  --output json

Orchestrated targets: rollout_plan, rollout_stage, ha_leader_resign. Manual targets (use edgectl): relay_deadletter — see Phase 3 below.

Overview¶

AutonomyOps rollback operations span three subsystems, each with different safety classifications:

Subsystem	Operation	Safety Class	Reversible?
Fleet Rollout	Cancel/rollback plan	Terminal	No — `rolled_back` is permanent
Fleet Rollout	Halt a stage	Terminal (stage only)	No — stage is permanently halted
HA Control-Plane	Resign leader	Reversible	Yes — a successor can campaign
Edge Relay	Force deadletter	Idempotent	Safe to repeat; retry required to resume

Key guarantee: Safety class terminal means the system will never return to the pre-rollback state through the normal operation path. Terminal actions require explicit operator confirmation before proceeding.

Phase 0: Triage and scoping (< 2 minutes)¶

Before executing any rollback:

Identify the affected subsystem(s):
- Fleet rollout stuck or deploying bad artifact?
- Control-plane write authority lost?
- Edge relay delivery completely blocked?
Collect current state (generate a support bundle):

autonomy support-bundle generate
–output /tmp/incident-$(date +%Y%m%d-%H%M%S).tar.gz
–orchestrator-url “$AUTONOMY_ORCHESTRATOR_URL”
–config-file /etc/autonomy/config.yaml
–audit-dir “$AUTONOMY_AUDIT_DIR”

3. **Confirm who is on-call and has authorized fleet visibility** (the current
HTTP/CLI RBAC model gates these recovery surfaces on a known operator identity
with `fleet:read`, or on an explicitly audited break-glass session).

---

## Phase 1: Fleet rollout emergency stop

### 1a. Stop a harmful deployment (rollback — terminal)

Use when: canary metrics indicate the deployed artifact is causing harm and you need an
immediate record of the rollback decision.

```bash
# Identify the active plan
autonomy rollout plan list \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--output json | jq '.[] | select(.phase == "active") | .plan_id'

# Roll back the active plan (terminal — cannot be restarted)
PLAN_ID="<plan-id-from-above>"
autonomy rollout recover \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id "$PLAN_ID" \
  --strategy rollback \
  --reason "INCIDENT-$(date +%Y%m%d): elevated error rate on canary nodes"

Expected: "new_phase": "rolled_back".

Limitation: Nodes that already activated the artifact continue running it. The control-plane does not push revert commands. Manual node remediation may be required.

1b. Cancel an unpublished or non-activated plan¶

If the plan has not yet activated on any node:

autonomy rollout plan cancel \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id "$PLAN_ID"

1c. Halt the current stage only¶

If the issue is contained to the current deployment stage and you want to stop expansion without rolling back the entire plan:

curl -sf -X POST \
  "${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts/${PLAN_ID}/halt" \
  -H "Content-Type: application/json" \
  -d '{"reason": "INCIDENT: halting stage expansion pending investigation"}' | jq .

Phase 2: Control-plane emergency actions¶

2a. Emergency leader resignation (HA failover)¶

Use when: the current leader node is unhealthy and you need write authority to move to a healthy replica immediately.

autonomy ha failover trigger \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --operator "$AUTONOMY_OPERATOR" \
  --reason "INCIDENT: emergency failover — leader node memory pressure"

Safety: This is reversible. The old leader can re-campaign once it is healthy. The write gap during transition is typically < 1 second.

2b. Split-brain resolution (if detected during incident)¶

# Step 1: detect
autonomy ha split-brain detect \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"

# Step 2: recover with the appropriate strategy
autonomy ha split-brain recover \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --operator "$AUTONOMY_OPERATOR" \
  --strategy promote-leader \
  --reason "INCIDENT: split-brain during leader host failure" \
  --force

See Split-Brain Recovery for full decision tree.

Phase 3: Edge relay emergency actions¶

3a. Suspend relay delivery (bandwidth throttle to zero)¶

If relay delivery is causing harm (e.g. flooding a constrained link):

edgectl relay config set-bandwidth \
  --bytes-per-second 1 \
  --socket /run/edged/rpc.sock

Setting rate to 1 byte/second effectively suspends delivery without terminating the relay subsystem. Restore with:

edgectl relay config set-bandwidth --bytes-per-second 0 --socket /run/edged/rpc.sock

3b. Force-deadletter a segment that is causing repeated failures¶

If a specific segment is blocking the delivery queue:

# Inspect the entry first
edgectl relay deadletter inspect SEG-ID PEER-ID --socket /run/edged/rpc.sock

# Force into deadletter (requires the entry to already be in a non-terminal state)
# This happens automatically via ForceDeadletter for SEGMENT_MISSING paths,
# but can also be triggered by stopping delivery and letting TransitionFailed exhaust retries.

For immediate purge of a harmful deadletter entry:

edgectl relay deadletter purge \
  --segment-id SEG-ID \
  --peer-id PEER-ID \
  --force \
  --socket /run/edged/rpc.sock

Safety: Purge is permanent for the delivery record. The segment data in local storage is not removed — only the outbound tracking record.

Phase 4: Post-incident verification¶

After executing rollback actions, verify the system is in a known stable state:

# 1. Confirm rollout plan is in expected terminal phase
autonomy rollout plan describe \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id "$PLAN_ID" \
  --output json | jq .phase

# 2. Confirm HA cluster has a single write-ready leader
autonomy ha status \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"

# 3. Confirm quorum is restored
autonomy ha quorum status \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"

# 4. Confirm relay delivery queue is draining
edgectl relay status --socket /run/edged/rpc.sock

# 5. Query audit trail for rollback events
autonomy audit query \
  --audit-dir "$AUTONOMY_AUDIT_DIR" \
  --category rollout \
  --limit 10

autonomy audit query \
  --audit-dir "$AUTONOMY_AUDIT_DIR" \
  --category ha \
  --limit 10

Phase 5: Incident documentation¶

Generate a final support bundle after stabilization:

  autonomy support-bundle generate \
  --output /tmp/incident-$(date +%Y%m%d-%H%M%S)-post-recovery.tar.gz \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --config-file /etc/autonomy/config.yaml \
  --audit-dir "$AUTONOMY_AUDIT_DIR"

Export the audit log covering the incident window:

autonomy audit export \
  --audit-dir "$AUTONOMY_AUDIT_DIR" \
  --format json \
  > /tmp/incident-audit-$(date +%Y%m%d).json

Rollback safety reference¶

Operation	Kind	Safety Class	Reversible	Repeatable
`rollout plan cancel`	KindRolloutPlan	Terminal	No	No (already terminal)
`rollout recover --strategy rollback`	KindRolloutPlan	Terminal	No	No (already terminal)
`rollout halt`	KindRolloutStage	Terminal (stage)	No	Idempotent if already halted
`ha failover trigger`	KindHALeaderResign	Reversible	Yes	Yes (new Campaign)
`relay deadletter retry`	KindRelayDeadletter	Idempotent	—	Yes (safe to repeat)
`relay deadletter purge`	—	Permanent	No	—

Known gaps¶

No automated incident detection: All rollback triggers are operator-initiated. The system does not auto-rollback on health gate failure or error rate threshold breach. Automated rollback is a follow-on item.
relay_deadletter not orchestrated: autonomy rollback execute does not dispatch relay deadletter actions. Use edgectl relay deadletter retry|purge directly. autonomy rollback preview --target relay_deadletter shows safety information.
Artifact revert not implemented: Rolled-back rollout plans stop further deployment but do not push revert artifacts to nodes that already activated. Node remediation must be performed out of band.
Relay suspension is config-only: Setting bytes-per-second 1 throttles delivery but does not cleanly drain inflight workers. For a hard stop, edged must be restarted. Inflight records become ABANDONED on restart and re-enter Scheduled.