AutonomyOps Operator Runbooks

These runbooks document the currently implemented behavior of the AutonomyOps ADK operator surfaces. Every command, flag, and expected output here reflects code that is deployed and testable. Known gaps where manual intervention is still required are explicitly marked.

Index

| #  | Runbook | Surface | Scope |
|----|---------|---------|-------|
| 01 | Fleet Rollout Recovery | autonomy rollout | Stuck plan recovery, stage halt, rollback strategy |
| 02 | Gate Approval Workflow | autonomy rollout gate + API | Manual gate approval for blocked stages |
| 03 | Manual Failover Procedure | autonomy ha | Graceful leader resignation |
| 04 | Split-Brain Detection and Recovery | autonomy ha split-brain | Detect divergence, apply recovery strategy |
| 05 | Quorum-Loss Recovery | autonomy ha quorum | Evaluate quorum status, restore sync replicas |
| 06 | Deadletter Inspection and Retry | edgectl relay deadletter | List, inspect, retry, purge deadletter entries |
| 07 | Bandwidth Troubleshooting | edgectl relay config | Diagnose throttle events, adjust limits |
| 08 | Certificate Rotation Procedure (self-hosted tier) | autonomy cert | Issue, rotate, and inspect edge certificates |
| 09 | RBAC Role Assignment | autonomy rbac | Create roles, assign operators, enforce permissions |
| 10 | Support Bundle Generation | autonomy support-bundle | Collect diagnostics for incident triage |
| 11 | Emergency Rollback Procedure | All surfaces | Cross-subsystem emergency stop and rollback |

Prerequisites (all runbooks)

  • CLI binaries: autonomy (control-plane operations) and edgectl (edge node operations)

  • Environment variables (or equivalent flags):

    • AUTONOMY_ORCHESTRATOR_URL — base URL of the control-plane HTTP API

    • AUTONOMY_OPERATOR — operator identity for audit records

    • AUTONOMY_RBAC_DIR — path to RBAC store when RBAC enforcement is enabled

  • jq for JSON output formatting (optional but recommended)
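
A quick preflight check can catch a missing prerequisite before a runbook is started. The sketch below is one possible way to do that in POSIX shell. The function names check_binaries and check_env are hypothetical helpers, not part of the ADK, and the rule that AUTONOMY_RBAC_DIR is required unless AUTONOMY_RBAC_ENFORCEMENT is exactly 0 is an assumption based on the default-on enforcement noted under the known limitations.

```shell
# Preflight sketch (hypothetical helpers, not shipped with the ADK).

# Report missing CLI binaries; jq is optional, so it only produces a note.
check_binaries() {
  rc=0
  for bin in autonomy edgectl; do
    if ! command -v "$bin" >/dev/null 2>&1; then
      echo "missing binary: $bin"
      rc=1
    fi
  done
  if ! command -v jq >/dev/null 2>&1; then
    echo "note: jq not found (optional)"
  fi
  return "$rc"
}

# Report unset or empty environment variables from the prerequisites list.
check_env() {
  rc=0
  for var in AUTONOMY_ORCHESTRATOR_URL AUTONOMY_OPERATOR; do
    # Indirect lookup of the variable named in $var (names are fixed above).
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing env: $var"
      rc=1
    fi
  done
  # Assumption: the RBAC store path matters unless enforcement is explicitly 0.
  if [ "${AUTONOMY_RBAC_ENFORCEMENT:-1}" != "0" ] && [ -z "${AUTONOMY_RBAC_DIR:-}" ]; then
    echo "missing env: AUTONOMY_RBAC_DIR"
    rc=1
  fi
  return "$rc"
}
```

Running check_binaries && check_env at the start of a session returns non-zero if anything required is missing. If the equivalent flags are passed on the command line instead of environment variables, check_env will report false positives and can be skipped.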

Known global limitations

  • RBAC enforcement is on by default. Set AUTONOMY_RBAC_ENFORCEMENT=0 only for a temporary migration or break-fix scenario, and restore enforcement immediately after.

  • Audit events are written to slog and, on control-plane paths that initialize the PostgreSQL audit emitter at startup, to the audit_events table (append-only, INV-AUDIT-01). Use autonomy audit query --pg-url or autonomy audit export --pg-url (or set AUTONOMY_AUDIT_PG_URL) for database-backed operator queries, and autonomy audit prune --older-than Nd for operator-initiated retention enforcement. When no --pg-url is set, the file-backed emitter (AUTONOMY_AUDIT_DIR) remains active as a fallback.
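
The audit backend fallback described in the last item can be sketched as a small shell helper. This is an illustration under stated assumptions: audit_backend and audit_prune_cmd are hypothetical names, the helper only inspects environment variables (an explicit --pg-url flag would not be visible to it), and reading Nd as a positive whole number followed by the letter d is an assumption. audit_prune_cmd prints the command for review instead of running it.

```shell
# Sketch only: hypothetical helpers, not part of the autonomy CLI.

# Print which audit store operator queries would target, based on environment.
audit_backend() {
  if [ -n "${AUTONOMY_AUDIT_PG_URL:-}" ]; then
    echo "postgres"   # database-backed audit_events table
  elif [ -n "${AUTONOMY_AUDIT_DIR:-}" ]; then
    echo "file"       # file-backed emitter fallback
  else
    echo "unset"      # neither variable is configured
  fi
}

# Print (do not run) a prune command for a retention window.
# Assumption: "Nd" means a positive integer, no leading zero, followed by "d".
audit_prune_cmd() {
  case "$1" in
    ''|*[!0-9]*|0*)
      echo "usage: audit_prune_cmd <days>" >&2
      return 2
      ;;
  esac
  echo "autonomy audit prune --older-than ${1}d"
}
```

For example, audit_prune_cmd 90 prints autonomy audit prune --older-than 90d, which can be reviewed and then run by hand.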