Fleet Rollout Recovery

Audience: operators managing fleet software deployments via the AutonomyOps control-plane.

When to use this runbook

  • A rollout plan is stuck and has not advanced for longer than the expected stage window.

  • A health gate has permanently blocked a stage and no auto-resolution is expected.

  • Canary metrics indicate the release is harmful and the plan must be halted.

  • An operator needs to force-skip a non-critical stage to unblock downstream stages.

Prerequisites

  • AUTONOMY_ORCHESTRATOR_URL set or --orchestrator-url flag available.

  • AUTONOMY_OPERATOR set to your operator identity (required for audit records).

  • curl and jq if you need to inspect raw API responses.


1. Identify stuck plans

Using the stuck-detection endpoint

curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts/stuck" | jq .

Example response:

{
  "stuck_plans": [
    {
      "plan_id": "plan-abc123",
      "phase": "active",
      "current_stage_id": "stage-03",
      "stuck_since": "2026-03-18T08:00:00Z",
      "stuck_reason": "health gate has not passed for 45 minutes"
    }
  ]
}
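The stuck-plan IDs and reasons can be pulled out of this response with jq and fed into the recovery commands below. This is a sketch against the example payload above, assuming the production response has the same shape:

```shell
# Extract one "<plan_id><tab><stuck_reason>" line per stuck plan from a
# /v1/rollouts/stuck response. The payload here is the example above; in
# production, pipe the curl output into the same jq filter.
response='{
  "stuck_plans": [
    {
      "plan_id": "plan-abc123",
      "phase": "active",
      "current_stage_id": "stage-03",
      "stuck_since": "2026-03-18T08:00:00Z",
      "stuck_reason": "health gate has not passed for 45 minutes"
    }
  ]
}'

printf '%s' "$response" \
  | jq -r '.stuck_plans[] | "\(.plan_id)\t\(.stuck_reason)"'
```

The tab-separated output is convenient to feed into a `while read` loop or `cut` when several plans are stuck at once.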

Using the CLI to list all plans

autonomy rollout plan list --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL"
# Inspect a specific plan's current status
autonomy rollout plan describe \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id plan-abc123

2. Decision tree

Is the current release harmful (crash rate elevated, canary failing)?
│
├─ YES → Strategy: rollback (terminal — plan transitions to rolled_back)
│        See §3.1. No new nodes will activate. Already-activated nodes keep running.
│
└─ NO → Is the blocking gate expected to clear on its own?
         │
         ├─ YES (transient telemetry gap, evaluator hiccup)
         │   → Strategy: retry (reversible — plan resets to active, re-evaluates)
         │     See §3.2.
         │
         └─ NO (non-critical gate, operator has manually validated the release)
             → Strategy: skip_failed (terminal for the current stage — force-promotes it)
               See §3.3.
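The tree above can be encoded as a small helper so the mapping is explicit. The two inputs remain operator judgment calls, and `choose_strategy` is a hypothetical name, not part of the autonomy CLI:

```shell
# Encode the decision tree: harmful release -> rollback; otherwise a
# transient gate -> retry; otherwise -> skip_failed.
choose_strategy() {
  harmful="$1"      # "yes" or "no": is the release harmful?
  transient="$2"    # "yes" or "no": will the gate clear on its own? (ignored when harmful=yes)
  if [ "$harmful" = "yes" ]; then
    echo rollback
  elif [ "$transient" = "yes" ]; then
    echo retry
  else
    echo skip_failed
  fi
}

choose_strategy yes no    # -> rollback
choose_strategy no  yes   # -> retry
choose_strategy no  no    # -> skip_failed
```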

3. Recovery strategies

All three strategies are available through autonomy rollout recover and via POST /v1/rollouts/{plan_id}/recover.

3.1 Rollback (terminal)

Safety class: Terminal. The plan transitions to rolled_back and cannot be restarted. Nodes that already activated the artifact continue running it — the control-plane does not push a revert command.

autonomy rollout recover \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id plan-abc123 \
  --strategy rollback \
  --reason "canary error rate exceeded 5% threshold for 10 minutes"

Expected response:

{
  "plan_id": "plan-abc123",
  "strategy": "rollback",
  "previous_phase": "active",
  "new_phase": "rolled_back",
  "reason": "canary error rate exceeded 5% threshold for 10 minutes",
  "recovered_at": "2026-03-18T09:15:00Z"
}

Verify:

autonomy rollout plan describe \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id plan-abc123
# Phase should now be: rolled_back

3.2 Retry (reversible)

Safety class: Reversible. The plan phase resets to active and the BatchPromoter re-evaluates the current stage on its next tick. Previously promoted stages remain promoted. Use when the blocking condition is transient.

autonomy rollout recover \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id plan-abc123 \
  --strategy retry \
  --reason "telemetry gap resolved; re-evaluating stage-03"

Expected response:

{
  "plan_id": "plan-abc123",
  "strategy": "retry",
  "previous_phase": "active",
  "new_phase": "active",
  "reason": "telemetry gap resolved; re-evaluating stage-03",
  "recovered_at": "2026-03-18T09:20:00Z"
}

Monitor: Watch the plan phase change as the BatchPromoter re-evaluates:

watch -n 10 'curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts/plan-abc123" | jq .phase'
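For a bounded wait instead of an open-ended watch, a polling loop with a timeout can be sketched. `wait_for_phase` and `fake_fetch` are hypothetical helpers; in production the fetcher would wrap the curl + jq pipeline above:

```shell
# Poll a phase-fetching command until the plan reaches the target phase or
# the timeout (in seconds) expires. The fetcher is passed in by name so the
# loop can be exercised without a live orchestrator.
wait_for_phase() {
  target="$1"; timeout_s="$2"; fetch="$3"
  waited=0
  while [ "$waited" -le "$timeout_s" ]; do
    phase=$("$fetch") || phase=""
    if [ "$phase" = "$target" ]; then
      echo "reached phase: $target"
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "timed out waiting for phase: $target" >&2
  return 1
}

# In production the fetcher would wrap the endpoint above, e.g.:
#   fetch_phase() {
#     curl -sf "${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts/plan-abc123" | jq -r .phase
#   }
fake_fetch() { echo active; }   # stand-in fetcher for illustration
wait_for_phase active 5 fake_fetch
```

A nonzero exit on timeout makes the helper usable in scripts that should escalate rather than wait forever.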

3.3 Skip failed stage (terminal for stage)

Safety class: Terminal for the current stage. The stuck stage is force-promoted and the next stage (or plan completion) is triggered. Use only when the operator has independently validated the release.

autonomy rollout recover \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id plan-abc123 \
  --strategy skip_failed \
  --reason "stage-03 gate is a non-critical SLO check; release validated via external dashboard"

Expected response:

{
  "plan_id": "plan-abc123",
  "strategy": "skip_failed",
  "previous_phase": "active",
  "new_phase": "active",
  "skipped_stage_id": "stage-03",
  "next_stage_id": "stage-04",
  "reason": "...",
  "recovered_at": "2026-03-18T09:25:00Z"
}

If the skipped stage was the last stage, new_phase will be completed.
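A small helper for reading this response shape; `summarize_skip` is illustrative only, and it assumes next_stage_id is absent once the plan completes:

```shell
# Summarize a skip_failed recovery response, covering both the mid-plan case
# (next_stage_id present) and the last-stage case (new_phase == "completed").
summarize_skip() {
  jq -r 'if .new_phase == "completed"
         then "skipped \(.skipped_stage_id); plan completed"
         else "skipped \(.skipped_stage_id); next stage \(.next_stage_id)"
         end'
}

echo '{"skipped_stage_id":"stage-03","next_stage_id":"stage-04","new_phase":"active"}' \
  | summarize_skip
# -> skipped stage-03; next stage stage-04
```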


4. Halting a stage manually

To halt the current stage without rolling back the entire plan (e.g. to pause activation):

curl -sf -X POST \
  "${AUTONOMY_ORCHESTRATOR_URL}/v1/rollouts/plan-abc123/halt" \
  -H "Content-Type: application/json" \
  -d '{"reason": "pausing activation while investigating telemetry anomaly"}' | jq .

Note: halted is a terminal stage phase. The stage cannot be re-opened. To resume the plan, use the retry recovery strategy (which resets the plan phase but does not revert already-halted stages).


5. Cancelling a plan (full terminal stop)

autonomy rollout plan cancel issues DELETE /v1/rollouts/{plan_id}, which transitions the plan to cancelled (terminal).

autonomy rollout plan cancel \
  --orchestrator-url "$AUTONOMY_ORCHESTRATOR_URL" \
  --plan-id plan-abc123

Use this instead of rollback when the release was never activated on any node and you simply want to discard the plan without the rollback audit trail.


6. Precondition failures

  • Error: precondition not met: plan is in terminal phase
    Cause: the plan is already rolled_back, cancelled, or completed.
    Resolution: no action needed; the plan is already stopped.

  • Error: precondition not met: plan is paused
    Cause: retry requires the plan to be resumed first.
    Resolution: POST /v1/rollouts/{id}/resume, then retry the recovery.

  • Error: precondition not met: no active stage to skip
    Cause: skip_failed requires a current open stage, and the plan has none.
    Resolution: check the plan status; it may already be at completion.
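These mappings fit naturally into a case statement for scripted triage. `suggest_fix` is a hypothetical wrapper; the matched substrings are the error messages listed above:

```shell
# Map a precondition error message to the suggested next step from the
# table above. Unrecognized errors are passed through for manual triage.
suggest_fix() {
  case "$1" in
    *"plan is in terminal phase"*) echo "no action: plan is already stopped" ;;
    *"plan is paused"*)            echo "resume first: POST /v1/rollouts/{id}/resume" ;;
    *"no active stage to skip"*)   echo "check plan status: may already be at completion" ;;
    *)                             echo "unrecognized error: $1" ;;
  esac
}

suggest_fix "precondition not met: plan is paused"
# -> resume first: POST /v1/rollouts/{id}/resume
```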


Known gaps

  • No automatic rollback: The system does not automatically trigger rollback when health gates fail beyond a threshold. All recovery is operator-initiated.

  • Artifact revert not implemented: Rolling back a plan does not push revert commands to nodes that already activated the artifact. Those nodes continue running the deployed version.