Failure Modes and Recovery

This page documents the failure modes that operators are most likely to encounter, their exact symptoms, and the correct recovery steps. Each scenario follows the same structure:

  • Symptom — what the operator sees at the terminal

  • Cause — what is actually broken

  • Recovery — concrete steps, in order

All recovery paths are fail-closed: partial or ambiguous state is never silently accepted.


1. OCI Registry / ORAS Unavailable

Symptom

bundle pull: bundle: pull ghcr.io/org/robot-behavior:v1.2.0: ...
  dial tcp: lookup ghcr.io: no such host
  — or —
  connection refused
  — or —
  TLS handshake timeout

bundle verify produces a similar error if the registry is unreachable during digest resolution (Step 2 of the 4-step pipeline).

Cause

The registry host is not reachable from the current network. Common causes:

Situation                                Likely Error Substring
Air-gapped host, no route to registry    no such host / i/o timeout
Registry container stopped               connection refused
mTLS misconfiguration                    TLS handshake / certificate
Wrong port or path                       404 Not Found

Recovery

Option A — use a pre-pulled tarball (recommended for air-gapped / offline)

If the bundle tarball was pulled on a connected machine and transferred to the target host, stage it directly into the local bundle store without a registry round-trip:

autonomy bundle stage /path/to/robot-behavior-v1.2.0.tar

This copies the tarball into $AUTONOMY_BUNDLE_STORE (default ~/.local/share/autonomy/bundles) under the version declared in the bundle’s manifest.json. No OCI registry contact is made.
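To confirm the stage landed, list the bundle store. The per-version directory layout shown in the comment is an assumption based on the store path above, not a documented contract:

# Confirm the staged bundle is present in the local store
# (exact layout may differ; adjust to what you see)
ls -l "${AUTONOMY_BUNDLE_STORE:-$HOME/.local/share/autonomy/bundles}"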

Option B — debug registry connectivity

# Verify DNS resolution
nslookup ghcr.io

# Verify TCP connectivity
curl -v https://ghcr.io/v2/

# For a local insecure registry (e.g. during development)
autonomy bundle pull localhost:5000/demo-bundle:v0.1 --allow-insecure-registry
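If the error substring points at TLS (see the table above), inspecting the certificate chain directly often isolates the problem. openssl is standard tooling, not part of the ADK, and client.crt / client.key below are hypothetical paths for an mTLS setup with an RSA client key:

# Inspect the certificate chain the registry presents
openssl s_client -connect ghcr.io:443 -servername ghcr.io </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates

# For mTLS registries, confirm the client certificate and key match
openssl x509 -noout -modulus -in client.crt | openssl md5
openssl rsa  -noout -modulus -in client.key | openssl md5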

Option C — air-gap build workflow

Pull and package on a connected machine, then transfer:

# On connected machine:
autonomy bundle pull ghcr.io/org/robot-behavior:v1.2.0 --out robot-v1.2.0.tar

# Transfer to air-gapped host (scp, USB, etc.), then:
autonomy bundle stage robot-v1.2.0.tar

Supply-chain integrity after staging

bundle stage does not run cosign verification. On the connected machine, run bundle verify before transferring:

autonomy bundle verify ghcr.io/org/robot-behavior:v1.2.0 \
    --pub-key ./keys/cosign.pub --require-lock --require-policy

Only transfer bundles that pass bundle verify.
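bundle verify covers signature and policy checks; a plain checksum additionally guards against corruption in transit. sha256sum is standard tooling, and the filenames are the examples from the workflow above:

# On the connected machine, after verify passes:
sha256sum robot-v1.2.0.tar > robot-v1.2.0.tar.sha256

# On the air-gapped host, before staging:
sha256sum -c robot-v1.2.0.tar.sha256    # must print: robot-v1.2.0.tar: OK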


2. Docker Daemon Unavailable

Symptom

demo gazebo: Docker daemon not reachable — start Docker or use --local for in-process simulation

or, for ROS2 governed execution:

run ros2.launch: neither Docker nor the ros2 binary is available.
  Install Docker for full governance, or install ROS2 for native mode.

or:

run ros2.launch: container path selected but --image not set.
  Provide a versioned adk-ros2-runtime image, e.g.:
    --image ghcr.io/autonomyops/adk-ros2-runtime:v1.0.0

Cause

The Docker daemon is not running, the docker client binary is not in PATH, or the daemon socket is not accessible to the current user. The ADK uses dualpath.Resolve() to detect Docker at runtime — it probes the Docker socket before every container-mode operation.
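You can reproduce roughly what that probe does from the shell. The /_ping endpoint is part of the Docker Engine API; the socket path below is the Linux default (Docker Desktop setups may differ):

# Probe the Docker socket directly (Linux default socket path)
curl --unix-socket /var/run/docker.sock http://localhost/_ping && echo    # prints OK when the daemon is up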

Recovery

Option A — start Docker

# Linux (systemd)
sudo systemctl start docker

# macOS / Docker Desktop
open -a Docker

Option B — use in-process / local mode (no Docker required)

All demo commands support --local to run the full scenario in-process:

autonomy demo policy         # always in-process; no Docker needed
autonomy demo gazebo --local # in-process Gazebo simulation, no containers

For autonomy run ros2.launch, the native fallback runs if ros2 is in PATH:

# Install ROS 2 Humble (Ubuntu 22.04)
sudo apt-get install ros-humble-ros-base

# Re-run — dualpath falls back to native ros2 automatically
autonomy run ros2.launch my_package my_launch.py

Option C — verify Docker socket access

autonomy status reports orchestrator health only, not the execution mode. To check whether Docker is reachable directly:

docker info          # prints daemon info when Docker is up
docker ps            # should succeed without sudo

If docker ps requires sudo, add the current user to the docker group:

sudo usermod -aG docker $USER
newgrp docker        # apply in the current shell without logout

Degraded-mode warning

When ROS2 governed execution falls back to the native path, the runtime emits:

[WARN] ros2: REDUCED-GOVERNANCE native mode — container isolation and resource limits not enforced

This is expected in development environments. Production deployments must use container mode (--image set and Docker running).
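A container-mode invocation that satisfies both requirements, reusing the launch target from Option B and the example image tag from the error message above:

# Full-governance container mode (Docker running, versioned image pinned)
autonomy run ros2.launch my_package my_launch.py \
    --image ghcr.io/autonomyops/adk-ros2-runtime:v1.0.0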


3. WAL Corruption or Partial WAL Read

The WAL (Write-Ahead Log) stores telemetry events durably. It is append-only and fsynced on every write. The recovery path is fail-hard by design: the runtime refuses to start when WAL invariants are violated rather than silently accepting corrupt data.
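For a quick look at what the WAL currently holds, the sketch below assumes line-delimited JSON frames, consistent with the error messages in this section; the actual on-disk framing is not guaranteed to be line-delimited, so treat this as diagnostic only:

# Peek at the newest frames (assumes one JSON frame per line)
tail -n 3 /var/lib/autonomy/telemetry/telemetry.wal | jq .

# Show the committed watermark alongside it
cat /var/lib/autonomy/telemetry/telemetry.safe_seq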

Symptom

Invariant violation (fail-hard)

telemetry/wal: SEQ_GAP — frame Seq=3 follows Seq=1 (expected 2)
  — or —
telemetry/wal: SAFESEQ_GT_MAXSEQ — safe_seq=7 but maxSeq=5
  — or —
telemetry/wal: WAL_CORRUPT_INVALID_JSON — frame at Seq=4 contains invalid JSON

The structured log also emits:

{"level":"ERROR","msg":"WAL_RECOVERY_FAIL_HARD","cause":"SEQ_GAP","dir":"/var/lib/autonomy/telemetry","error":"..."}

Missing safe_seq in steady-state

telemetry/wal: safe_seq missing in steady-state at "/var/lib/autonomy/telemetry/telemetry.safe_seq";
  to upgrade from a pre-Phase-1 WAL set AUTONOMYOPS_WAL_LEGACY_UPGRADE=1;
  to accept telemetry loss after deleting telemetry.wal set AUTONOMYOPS_WAL_OPERATOR_RESET=1

Partial tail (non-fatal)

A truncated frame at EOF (crash mid-write) is silently repaired by truncating to the last complete frame. This is not a fail-hard condition and requires no operator action.

Cause Codes

Code                       Invariant   Meaning
SEQ_GAP                    R2          Frame sequence numbers are not contiguous in the committed range
FIRST_SEQ_NOT_ONE          R1          First WAL frame Seq ≠ 1
SAFESEQ_NOT_FOUND          R4          Frame at safe_seq not found in the WAL
SAFESEQ_GT_MAXSEQ          R3          safe_seq > maxSeq (reorder bug)
WAL_CORRUPT_INVALID_JSON   (none)      Structurally valid frame but unparseable JSON payload
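For SEQ_GAP specifically, you can locate the discontinuity before deciding on a reset. This sketch makes the same line-delimited-JSON assumption as above, with a Seq field matching the error messages; if the framing differs, use the autonomy wal inspect diagnosis command below instead:

# List Seq values and report the first gap
jq -r '.Seq' /var/lib/autonomy/telemetry/telemetry.wal \
  | awk 'NR>1 && $1 != prev+1 {print "gap after Seq=" prev ", next=" $1; exit} {prev=$1}'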

Recovery

Path 1 — Legacy upgrade (WAL exists, telemetry.safe_seq missing)

This applies when upgrading from a pre-Phase-1 build that did not write telemetry.safe_seq. All existing WAL frames are treated as committed:

AUTONOMYOPS_WAL_LEGACY_UPGRADE=1 autonomy-orchestrator serve

Unset the env var after the first successful start. The runtime will write telemetry.safe_seq and subsequent starts will not require the override.

Path 2 — Operator reset (accept telemetry data loss)

Use this path when telemetry.wal is corrupt and the invariant violation cannot be repaired. This discards all telemetry history.

# 1. Stop the runtime.
sudo systemctl stop autonomy-orchestrator

# 2. Back up the corrupt WAL for post-mortem analysis.
cp -r /var/lib/autonomy/telemetry /var/lib/autonomy/telemetry.corrupt-$(date +%Y%m%d)

# 3. Delete the WAL and safe_seq files.
rm /var/lib/autonomy/telemetry/telemetry.wal
rm /var/lib/autonomy/telemetry/telemetry.safe_seq

# 4. Restart with the reset override.
AUTONOMYOPS_WAL_OPERATOR_RESET=1 autonomy-orchestrator serve

Unset the env var after the first successful start. The runtime emits a WAL_RESET_BY_OPERATOR structured log entry at WARN level as a mandatory audit marker.
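To confirm the audit marker was written, search the logs after the restart. This assumes the orchestrator logs to the systemd journal under the unit name used in step 1:

# Surface the mandatory audit marker after an operator reset
journalctl -u autonomy-orchestrator --since "1 hour ago" | grep WAL_RESET_BY_OPERATOR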

Important: The operator reset affects telemetry only. All activation invariants (lock fingerprint, policy evaluation, RBAC enforcement) remain in effect. No policy or rollout state is discarded.

Path 3 — Suspected storage corruption (check hardware before resetting)

If you suspect filesystem corruption (not a software bug), check the underlying storage first before resetting:

# Check filesystem integrity
sudo fsck -n /dev/sdXY

# Check for storage hardware errors
sudo dmesg | grep -E "I/O error|ata|scsi|blk_update_request"

After confirming storage health, use Path 2 to reset.

Diagnosis commands

autonomy wal status
autonomy wal inspect --since 1h --kind error

4. Orchestrator Unreachable

Symptom

During demos (autonomy demo gazebo container mode):

demo gazebo: docker compose up: exit status 1
hint: if Docker Compose fails to start the orchestrator service, re-run with --local for an in-process demo

During fleet/rollout commands:

Get "http://localhost:8080/v1/fleet/summary": dial tcp 127.0.0.1:8080: connect: connection refused
  — or —
Get "http://localhost:8080/v1/ha/status": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

During log streaming (autonomy logs --follow):

logs: stream interrupted (connection reset by peer)
hint: re-run 'autonomy logs --follow' to reconnect

Cause

The autonomy-orchestrator process is not running, or is listening on a different address than the CLI is configured to use.

Recovery

Check orchestrator health

# Default listen address is 0.0.0.0:8888
curl http://localhost:8888/v1/health

# Use the CLI health check (reads AUTONOMY_ORCHESTRATOR_URL or config)
autonomy status

# Custom address
AUTONOMY_ORCHESTRATOR_URL=http://myhost:8888 autonomy ha status
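When scripting against the orchestrator (for example in CI or provisioning), a small wait-for-health loop avoids racing a slow startup. This is plain shell against the /v1/health endpoint shown above:

# Block until the orchestrator answers /v1/health (30 s timeout)
for i in $(seq 1 30); do
    curl -fsS http://localhost:8888/v1/health >/dev/null 2>&1 && break
    sleep 1
done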

Start the orchestrator

The orchestrator uses SQLite by default (no --pg-url flag exists). Data is written to --data-dir (default: $XDG_CACHE_HOME/autonomyops/orchestrator).

# Standalone (development) — listens on 0.0.0.0:8888
autonomy-orchestrator serve --data-dir /var/lib/autonomy

# Custom listen address
autonomy-orchestrator serve --listen 127.0.0.1:8888 --data-dir /var/lib/autonomy

# Systemd-managed
sudo systemctl start autonomy-orchestrator
sudo systemctl status autonomy-orchestrator

Demo-specific: use in-process mode

For demos, the orchestrator is always optional — run in-process with --local:

autonomy demo gazebo --local   # full Gazebo policy scenario, no containers, no orchestrator
autonomy demo policy           # always in-process; needs no external services

Check AUTONOMY_ORCHESTRATOR_URL

Fleet, rollout, logs, HA, and audit commands use AUTONOMY_ORCHESTRATOR_URL to locate the orchestrator (default: none — must be configured explicitly):

# Check what the CLI resolves to
autonomy status

# Set it explicitly
export AUTONOMY_ORCHESTRATOR_URL=http://localhost:8888

# Or persist it
autonomy config set orchestrator.url http://localhost:8888

AUTONOMY_RUNTIME_URL is a different env var

AUTONOMY_RUNTIME_URL is injected into subprocess environments by autonomy run and autonomy ros2 run as the governance callback address for the policy runtime service (default listen: 127.0.0.1:7777). It has no relation to the orchestrator. Check it only when debugging autonomy run or autonomy ros2 run subprocess governance callbacks — not for orchestrator connectivity issues.
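To confirm what was actually injected into a governed subprocess, read its environment from /proc (Linux only; the pgrep pattern is a hypothetical match for the launch file used earlier, adjust to your workload):

# Find the governed subprocess and inspect its injected environment
PID=$(pgrep -f my_launch.py | head -n1)    # hypothetical match pattern
tr '\0' '\n' < "/proc/$PID/environ" | grep AUTONOMY_RUNTIME_URL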

HA cluster: verify leader is healthy

In a multi-node HA cluster, the active leader handles all write requests. A connection refused to a follower node is expected when the leader has moved. Check which node holds the leader token:

autonomy ha status

Then direct CLI calls to the current leader address.
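Assuming ha status reports the leader's address (its exact output format is not shown here), redirecting subsequent calls is a one-liner; the follow-up command is a hypothetical read against the fleet endpoint named in the symptom above:

# Point the CLI at the current leader for this shell session
export AUTONOMY_ORCHESTRATOR_URL=http://leader-host:8888    # substitute the address from 'autonomy ha status'
autonomy status                                             # should now report the leader as healthy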


See Also