Fleet Rollout Recovery¶
Audience: operators managing fleet software deployments via the AutonomyOps control-plane.
When to use this runbook¶
A rollout plan is stuck and has not advanced for longer than the expected stage window.
A health gate has permanently blocked a stage and no auto-resolution is expected.
Canary metrics indicate the release is harmful and the plan must be halted.
An operator needs to force-skip a non-critical stage to unblock downstream stages.
Prerequisites¶
AUTONOMY_ORCHESTRATOR_URL set, or the --orchestrator-url flag available.
AUTONOMY_OPERATOR set to your operator identity (required for audit records).
curl and jq if you need to inspect raw API responses.
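In practice the prerequisites reduce to two exported variables. The values below are placeholders, not real endpoints or identities:

```shell
# Placeholder values -- substitute your real orchestrator endpoint and identity.
export AUTONOMY_ORCHESTRATOR_URL="https://orchestrator.example.internal"
export AUTONOMY_OPERATOR="alice"

# Fail fast if either is missing (":" expands the parameter without running anything).
: "${AUTONOMY_ORCHESTRATOR_URL:?must be set}"
: "${AUTONOMY_OPERATOR:?must be set}"
```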
1. Identify stuck plans¶
Using the stuck-detection endpoint¶
curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts/stuck" | jq .
Example response:
{
  "stuck_plans": [
    {
      "plan_id": "plan-abc123",
      "phase": "active",
      "current_stage_id": "stage-03",
      "stuck_since": "2026-03-18T08:00:00Z",
      "stuck_reason": "health gate has not passed for 45 minutes"
    }
  ]
}
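For scripting, the stuck plan IDs can be pulled out of that response with a short jq filter. The sketch below feeds a trimmed sample payload inline instead of calling the endpoint:

```shell
# Extract the plan IDs from a stuck-detection response.
# The inline sample stands in for the live /v1/rollouts/stuck call.
response='{"stuck_plans":[{"plan_id":"plan-abc123","stuck_reason":"health gate has not passed for 45 minutes"}]}'
echo "$response" | jq -r '.stuck_plans[].plan_id'
# -> plan-abc123
```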
Using the CLI to list all plans¶
autonomy rollout plan list --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"
# Inspect a specific plan's current status
autonomy rollout plan describe \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id plan-abc123
2. Decision tree¶
Is the current release harmful (crash rate elevated, canary failing)?
│
├─ YES → Strategy: rollback (terminal — plan transitions to rolled_back)
│ See §3.1. No new nodes will activate. Already-activated nodes keep running.
│
└─ NO → Is the blocking gate expected to clear on its own?
│
├─ YES (transient telemetry gap, evaluator hiccup)
│ → Strategy: retry (reversible — plan resets to active, re-evaluates)
│ See §3.2.
│
└─ NO (non-critical gate, operator has manually validated the release)
→ Strategy: skip_failed (terminal for the current stage — force-promotes it)
See §3.3.
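The tree above can be sketched as a small helper that maps the two operator judgments onto a --strategy value (the function name is illustrative, not part of the CLI):

```shell
# Map the two decision-tree questions to a recovery strategy.
# $1: is the release harmful? (yes/no)
# $2: is the blocking gate expected to clear on its own? (yes/no)
choose_strategy() {
  if [ "$1" = "yes" ]; then
    echo rollback       # terminal: plan transitions to rolled_back
  elif [ "$2" = "yes" ]; then
    echo retry          # reversible: plan resets to active
  else
    echo skip_failed    # terminal for the current stage
  fi
}

choose_strategy no yes   # -> retry
```

The result can feed straight into the CLI, e.g. `--strategy "$(choose_strategy no yes)"`.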
3. Recovery strategies¶
All three strategies are available through autonomy rollout recover
and via POST /v1/rollouts/{plan_id}/recover.
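A raw-API equivalent of the CLI looks like the following. The request-body field names (strategy, reason) are inferred from the CLI flags, so confirm them against your API reference before relying on them:

```shell
# Raw-API equivalent of `autonomy rollout recover`.
# Body field names are inferred from the CLI flags; verify before use.
plan_id="plan-abc123"
body='{"strategy":"retry","reason":"telemetry gap resolved"}'

# Guarded so the sketch is a no-op when no orchestrator is configured.
if [ -n "${AUTONOMY_ORCHESTRATOR_URL:-}" ]; then
  curl -sf -X POST \
    "${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts/${plan_id}/recover" \
    -H "Content-Type: application/json" \
    -d "$body" | jq .
fi
```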
3.1 Rollback (terminal)¶
Safety class: Terminal. The plan transitions to rolled_back and cannot be restarted.
Nodes that already activated the artifact continue running it — the control-plane does
not push a revert command.
autonomy rollout recover \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id plan-abc123 \
  --strategy rollback \
  --reason "canary error rate exceeded 5% threshold for 10 minutes"
Expected response:
{
  "plan_id": "plan-abc123",
  "strategy": "rollback",
  "previous_phase": "active",
  "new_phase": "rolled_back",
  "reason": "canary error rate exceeded 5% threshold for 10 minutes",
  "recovered_at": "2026-03-18T09:15:00Z"
}
Verify:
autonomy rollout plan describe \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id plan-abc123
# Phase should now be: rolled_back
3.2 Retry (reversible)¶
Safety class: Reversible. The plan phase resets to active and the BatchPromoter
re-evaluates the current stage on its next tick. Previously promoted stages remain
promoted. Use when the blocking condition is transient.
autonomy rollout recover \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id plan-abc123 \
  --strategy retry \
  --reason "telemetry gap resolved; re-evaluating stage-03"
Expected response:
{
  "plan_id": "plan-abc123",
  "strategy": "retry",
  "previous_phase": "active",
  "new_phase": "active",
  "reason": "telemetry gap resolved; re-evaluating stage-03",
  "recovered_at": "2026-03-18T09:20:00Z"
}
Monitor: Watch the plan phase change as the BatchPromoter re-evaluates:
watch -n 10 'curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts/plan-abc123" | jq .phase'
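If watch is unavailable, a plain polling loop does the same job. The sketch below caps attempts so it cannot hang, and runs only when the orchestrator URL is set:

```shell
# Poll the plan phase until it leaves "active" or we give up after 30 tries.
poll_phase() {
  for _ in $(seq 1 30); do
    phase=$(curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts/plan-abc123" | jq -r '.phase')
    echo "phase: $phase"
    if [ "$phase" != "active" ]; then
      return 0
    fi
    sleep 10
  done
  return 1
}

# Guarded so the sketch is safe to source without a configured orchestrator.
if [ -n "${AUTONOMY_ORCHESTRATOR_URL:-}" ]; then
  poll_phase
fi
```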
3.3 Skip failed stage (terminal for stage)¶
Safety class: Terminal for the current stage. The stuck stage is force-promoted and the next stage (or plan completion) is triggered. Use only when the operator has independently validated the release.
autonomy rollout recover \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id plan-abc123 \
  --strategy skip_failed \
  --reason "stage-03 gate is a non-critical SLO check; release validated via external dashboard"
Expected response:
{
  "plan_id": "plan-abc123",
  "strategy": "skip_failed",
  "previous_phase": "active",
  "new_phase": "active",
  "skipped_stage_id": "stage-03",
  "next_stage_id": "stage-04",
  "reason": "...",
  "recovered_at": "2026-03-18T09:25:00Z"
}
If the skipped stage was the last stage, new_phase will be completed.
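A script consuming this response can branch on whether the plan completed. The sketch below uses an inline sample modelling the final-stage case:

```shell
# Decide the follow-up action from a skip_failed response.
# The inline sample models skipping the last stage (plan completes).
resp='{"plan_id":"plan-abc123","strategy":"skip_failed","new_phase":"completed","skipped_stage_id":"stage-04","next_stage_id":null}'
phase=$(echo "$resp" | jq -r '.new_phase')
if [ "$phase" = "completed" ]; then
  echo "plan finished; no further stages"
else
  echo "promoted to $(echo "$resp" | jq -r '.next_stage_id')"
fi
```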
4. Halting a stage manually¶
To halt the current stage without rolling back the entire plan (e.g. to pause activation):
curl -sf -X POST \
  "${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts/plan-abc123/halt" \
  -H "Content-Type: application/json" \
  -d '{"reason": "pausing activation while investigating telemetry anomaly"}' | jq .
Note: halted is a terminal stage phase. The stage cannot be re-opened. To resume
the plan, use the retry recovery strategy (which resets the plan phase but does not
revert already-halted stages).
5. Cancelling a plan (full terminal stop)¶
autonomy rollout plan cancel issues DELETE /v1/rollouts/{plan_id} which transitions
the plan to cancelled (terminal).
autonomy rollout plan cancel \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id plan-abc123
Use this instead of rollback when the release was never activated on any node and you
simply want to discard the plan without the rollback audit trail.
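That distinction can be encoded as a small guard: if no node ever activated the release, cancel; otherwise roll back. The function name and the activated-node count are illustrative stand-ins for whatever your inventory reports:

```shell
# Pick the terminal-stop verb based on whether any node activated the release.
# $1: number of nodes that have activated the artifact (from your inventory).
terminal_stop() {
  if [ "$1" -eq 0 ]; then
    echo "autonomy rollout plan cancel"                  # nothing ran; discard quietly
  else
    echo "autonomy rollout recover --strategy rollback"  # leave an audit trail
  fi
}

terminal_stop 0   # -> autonomy rollout plan cancel
```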
6. Precondition failures¶
Error |
Cause |
Resolution |
|---|---|---|
|
Plan is already |
No action needed; plan is already stopped |
|
For |
|
|
For |
Check plan status; may already be at completion |
Known gaps¶
No automatic rollback: The system does not automatically trigger rollback when health gates fail beyond a threshold. All recovery is operator-initiated.
Artifact revert not implemented: Rolling back a plan does not push revert commands to nodes that already activated the artifact. Those nodes continue running the deployed version.