Emergency Rollback Procedure¶
Audience: on-call operators responding to a production incident requiring immediate rollback of one or more AutonomyOps subsystems.
Quick-start: unified orchestrator¶
PR-29-followup-b introduces autonomy rollback as a single entry point for
rollout, HA, and relay rollback actions. Use it when you want a consistent
command surface and automatic audit record emission.
# 1. Preview safety class and limitations before acting
autonomy rollback preview --target rollout_plan
autonomy rollback preview --target ha_leader_resign
# 2. Execute (emits rollback.executed audit record)
export AUTONOMY_OPERATOR="oncall@example.com"
autonomy rollback execute \
--target rollout_plan \
--resource "$PLAN_ID" \
--strategy rollback \
--reason "INCIDENT-$(date +%Y%m%d): elevated error rate" \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"
# 3. Verify result in audit trail
autonomy audit query \
--audit-dir "$AUTONOMY_AUDIT_DIR" \
--event-type rollback.executed \
--output json
Orchestrated targets: rollout_plan, rollout_stage, ha_leader_resign.
Manual targets (use edgectl): relay_deadletter — see Phase 3 below.
Overview¶
AutonomyOps rollback operations span three subsystems, each with different safety classifications:
Subsystem |
Operation |
Safety Class |
Reversible? |
|---|---|---|---|
Fleet Rollout |
Cancel/rollback plan |
Terminal |
No — |
Fleet Rollout |
Halt a stage |
Terminal (stage only) |
No — stage is permanently halted |
HA Control-Plane |
Resign leader |
Reversible |
Yes — a successor can campaign |
Edge Relay |
Force deadletter |
Idempotent |
Safe to repeat; retry required to resume |
Key guarantee: Safety class terminal means the system will never return to the
pre-rollback state through the normal operation path. Terminal actions require explicit
operator confirmation before proceeding.
Phase 0: Triage and scoping (< 2 minutes)¶
Before executing any rollback:
Identify the affected subsystem(s):
Fleet rollout stuck or deploying bad artifact?
Control-plane write authority lost?
Edge relay delivery completely blocked?
Collect current state (generate a support bundle):
autonomy support-bundle generate
–output /tmp/incident-$(date +%Y%m%d-%H%M%S).tar.gz
–orchestrator-url “$AUTONOMY_ORCHESTRATOR_URL”
–config-file /etc/autonomy/config.yaml
–audit-dir “$AUTONOMY_AUDIT_DIR”
3. **Confirm who is on-call and has authorized fleet visibility** (the current
HTTP/CLI RBAC model gates these recovery surfaces on a known operator identity
with `fleet:read`, or on an explicitly audited break-glass session).
---
## Phase 1: Fleet rollout emergency stop
### 1a. Stop a harmful deployment (rollback — terminal)
Use when: canary metrics indicate the deployed artifact is causing harm and you need an
immediate record of the rollback decision.
```bash
# Identify the active plan
autonomy rollout plan list \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--output json | jq '.[] | select(.phase == "active") | .plan_id'
# Roll back the active plan (terminal — cannot be restarted)
PLAN_ID="<plan-id-from-above>"
autonomy rollout recover \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--plan-id "$PLAN_ID" \
--strategy rollback \
--reason "INCIDENT-$(date +%Y%m%d): elevated error rate on canary nodes"
Expected: "new_phase": "rolled_back".
Limitation: Nodes that already activated the artifact continue running it. The control-plane does not push revert commands. Manual node remediation may be required.
1b. Cancel an unpublished or non-activated plan¶
If the plan has not yet activated on any node:
autonomy rollout plan cancel \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--plan-id "$PLAN_ID"
1c. Halt the current stage only¶
If the issue is contained to the current deployment stage and you want to stop expansion without rolling back the entire plan:
curl -sf -X POST \
"${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts/${PLAN_ID}/halt" \
-H "Content-Type: application/json" \
-d '{"reason": "INCIDENT: halting stage expansion pending investigation"}' | jq .
Phase 2: Control-plane emergency actions¶
2a. Emergency leader resignation (HA failover)¶
Use when: the current leader node is unhealthy and you need write authority to move to a healthy replica immediately.
autonomy ha failover trigger \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--operator "$AUTONOMY_OPERATOR" \
--reason "INCIDENT: emergency failover — leader node memory pressure"
Safety: This is reversible. The old leader can re-campaign once it is healthy. The write gap during transition is typically < 1 second.
2b. Split-brain resolution (if detected during incident)¶
# Step 1: detect
autonomy ha split-brain detect \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"
# Step 2: recover with the appropriate strategy
autonomy ha split-brain recover \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--operator "$AUTONOMY_OPERATOR" \
--strategy promote-leader \
--reason "INCIDENT: split-brain during leader host failure" \
--force
See Split-Brain Recovery for full decision tree.
Phase 3: Edge relay emergency actions¶
3a. Suspend relay delivery (bandwidth throttle to zero)¶
If relay delivery is causing harm (e.g. flooding a constrained link):
edgectl relay config set-bandwidth \
--bytes-per-second 1 \
--socket /run/edged/rpc.sock
Setting rate to 1 byte/second effectively suspends delivery without terminating the relay subsystem. Restore with:
edgectl relay config set-bandwidth --bytes-per-second 0 --socket /run/edged/rpc.sock
3b. Force-deadletter a segment that is causing repeated failures¶
If a specific segment is blocking the delivery queue:
# Inspect the entry first
edgectl relay deadletter inspect SEG-ID PEER-ID --socket /run/edged/rpc.sock
# Force into deadletter (requires the entry to already be in a non-terminal state)
# This happens automatically via ForceDeadletter for SEGMENT_MISSING paths,
# but can also be triggered by stopping delivery and letting TransitionFailed exhaust retries.
For immediate purge of a harmful deadletter entry:
edgectl relay deadletter purge \
--segment-id SEG-ID \
--peer-id PEER-ID \
--force \
--socket /run/edged/rpc.sock
Safety: Purge is permanent for the delivery record. The segment data in local storage is not removed — only the outbound tracking record.
Phase 4: Post-incident verification¶
After executing rollback actions, verify the system is in a known stable state:
# 1. Confirm rollout plan is in expected terminal phase
autonomy rollout plan describe \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--plan-id "$PLAN_ID" \
--output json | jq .phase
# 2. Confirm HA cluster has a single write-ready leader
autonomy ha status \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"
# 3. Confirm quorum is restored
autonomy ha quorum status \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"
# 4. Confirm relay delivery queue is draining
edgectl relay status --socket /run/edged/rpc.sock
# 5. Query audit trail for rollback events
autonomy audit query \
--audit-dir "$AUTONOMY_AUDIT_DIR" \
--category rollout \
--limit 10
autonomy audit query \
--audit-dir "$AUTONOMY_AUDIT_DIR" \
--category ha \
--limit 10
Phase 5: Incident documentation¶
Generate a final support bundle after stabilization:
autonomy support-bundle generate \
--output /tmp/incident-$(date +%Y%m%d-%H%M%S)-post-recovery.tar.gz \
--orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
--config-file /etc/autonomy/config.yaml \
--audit-dir "$AUTONOMY_AUDIT_DIR"
Export the audit log covering the incident window:
autonomy audit export \
--audit-dir "$AUTONOMY_AUDIT_DIR" \
--format json \
> /tmp/incident-audit-$(date +%Y%m%d).json
Rollback safety reference¶
Operation |
Kind |
Safety Class |
Reversible |
Repeatable |
|---|---|---|---|---|
|
KindRolloutPlan |
Terminal |
No |
No (already terminal) |
|
KindRolloutPlan |
Terminal |
No |
No (already terminal) |
|
KindRolloutStage |
Terminal (stage) |
No |
Idempotent if already halted |
|
KindHALeaderResign |
Reversible |
Yes |
Yes (new Campaign) |
|
KindRelayDeadletter |
Idempotent |
— |
Yes (safe to repeat) |
|
— |
Permanent |
No |
— |
Known gaps¶
No automated incident detection: All rollback triggers are operator-initiated. The system does not auto-rollback on health gate failure or error rate threshold breach. Automated rollback is a follow-on item.
relay_deadletter not orchestrated:
autonomy rollback executedoes not dispatch relay deadletter actions. Useedgectl relay deadletter retry|purgedirectly.autonomy rollback preview --target relay_deadlettershows safety information.Artifact revert not implemented: Rolled-back rollout plans stop further deployment but do not push revert artifacts to nodes that already activated. Node remediation must be performed out of band.
Relay suspension is config-only: Setting
bytes-per-second 1throttles delivery but does not cleanly drain inflight workers. For a hard stop,edgedmust be restarted. Inflight records becomeABANDONEDon restart and re-enterScheduled.