Failure Modes and Recovery¶
This page documents the failure modes that operators are most likely to encounter, their exact symptoms, and the correct recovery steps. Each scenario follows the same structure:
Symptom — what the operator sees at the terminal
Cause — what is actually broken
Recovery — concrete steps, in order
All recovery paths are fail-closed: partial or ambiguous state is never silently accepted.
3. WAL Corruption or Partial WAL Read¶
The WAL (Write-Ahead Log) stores telemetry events durably. It is append-only and fsynced on every write. The recovery path is fail-hard by design: the runtime refuses to start when WAL invariants are violated rather than silently accepting corrupt data.
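The exact frame format is internal to the runtime, but the durability claim rests on a simple ordering: encode the frame, append it, then fsync before the write is acknowledged. The sketch below illustrates that pattern only; the newline-delimited JSON layout and the walFrame type are assumptions made for this example, not the real on-disk format.

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// walFrame is a hypothetical frame layout used only for illustration;
// the real on-disk format is internal to the runtime.
type walFrame struct {
	Seq     uint64
	Payload json.RawMessage
}

// appendFrame appends one frame and fsyncs before returning, so a frame
// only counts as durable once both the write and the sync have succeeded.
func appendFrame(f *os.File, frame walFrame) error {
	buf, err := json.Marshal(frame)
	if err != nil {
		return err
	}
	if _, err := f.Write(append(buf, '\n')); err != nil {
		return err
	}
	// fsync on every write: a crash after this point cannot lose the frame.
	return f.Sync()
}

func main() {
	f, err := os.OpenFile("telemetry.wal", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	if err := appendFrame(f, walFrame{Seq: 1, Payload: json.RawMessage("{\"kind\":\"event\"}")}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}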
Symptom¶
Invariant violation (fail-hard)
telemetry/wal: SEQ_GAP — frame Seq=3 follows Seq=1 (expected 2)
— or —
telemetry/wal: SAFESEQ_GT_MAXSEQ — safe_seq=7 but maxSeq=5
— or —
telemetry/wal: WAL_CORRUPT_INVALID_JSON — frame at Seq=4 contains invalid JSON
The structured log also emits:
{"level":"ERROR","msg":"WAL_RECOVERY_FAIL_HARD","cause":"SEQ_GAP","dir":"/var/lib/autonomy/telemetry","error":"..."}
Missing safe_seq in steady-state
telemetry/wal: safe_seq missing in steady-state at "/var/lib/autonomy/telemetry/telemetry.safe_seq";
to upgrade from a pre-Phase-1 WAL set AUTONOMYOPS_WAL_LEGACY_UPGRADE=1;
to accept telemetry loss after deleting telemetry.wal set AUTONOMYOPS_WAL_OPERATOR_RESET=1
Partial tail (non-fatal)
A truncated frame at EOF (crash mid-write) is repaired automatically by truncating the log back to the last complete frame. This is not a fail-hard condition and requires no operator action.
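A rough sketch of that repair, continuing the illustrative frame layout from the sketch above: scan frames in order, fail hard on a sequence gap, and truncate anything after the last complete frame. The real scan also has to distinguish a torn tail from mid-file corruption (which stays fail-hard); this example does not attempt that.

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// walFrame mirrors the illustrative layout from the earlier sketch.
type walFrame struct {
	Seq     uint64
	Payload json.RawMessage
}

// recoverWAL scans frames in order, fails hard on a sequence gap, and
// truncates a partially written final frame so the next start sees only
// complete frames.
func recoverWAL(path string) error {
	f, err := os.OpenFile(path, os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer f.Close()

	var (
		offset  int64  // byte offset just past the last complete frame
		lastSeq uint64 // sequence number of the last complete frame
	)
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Bytes()
		var frame walFrame
		if err := json.Unmarshal(line, &frame); err != nil {
			break // torn frame at EOF: stop here and truncate below
		}
		if lastSeq != 0 && frame.Seq != lastSeq+1 {
			return fmt.Errorf("SEQ_GAP: frame Seq=%d follows Seq=%d", frame.Seq, lastSeq)
		}
		lastSeq = frame.Seq
		offset += int64(len(line)) + 1 // +1 for the newline delimiter
	}
	if err := scanner.Err(); err != nil {
		return err
	}
	// Drop anything after the last complete frame (crash mid-write).
	return f.Truncate(offset)
}

func main() {
	if err := recoverWAL("telemetry.wal"); err != nil {
		fmt.Fprintln(os.Stderr, "fail-hard:", err)
		os.Exit(1)
	}
}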
Cause Codes¶
| Code | Invariant | Meaning |
|---|---|---|
| SEQ_GAP | R2 | Frame sequence numbers are not contiguous in the committed range |
| | R1 | First WAL frame |
| | R4 | Frame at |
| | R3 | |
| WAL_CORRUPT_INVALID_JSON | — | Structurally valid frame but unparseable JSON payload |
Recovery¶
Path 1 — Legacy upgrade (WAL exists, telemetry.safe_seq missing)
This applies when upgrading from a pre-Phase-1 build that did not write telemetry.safe_seq.
All existing WAL frames are treated as committed:
AUTONOMYOPS_WAL_LEGACY_UPGRADE=1 autonomy-orchestrator serve
Unset the env var after the first successful start. The runtime will write
telemetry.safe_seq and subsequent starts will not require the override.
Path 2 — Operator reset (accept telemetry data loss)
Use this path when telemetry.wal is corrupt and the invariant violation cannot be
repaired. This discards all telemetry history.
# 1. Stop the runtime.
sudo systemctl stop autonomy-orchestrator
# 2. Back up the corrupt WAL for post-mortem analysis.
cp -r /var/lib/autonomy/telemetry /var/lib/autonomy/telemetry.corrupt-$(date +%Y%m%d)
# 3. Delete the WAL and safe_seq files.
rm /var/lib/autonomy/telemetry/telemetry.wal
rm /var/lib/autonomy/telemetry/telemetry.safe_seq
# 4. Restart with the reset override.
AUTONOMYOPS_WAL_OPERATOR_RESET=1 autonomy-orchestrator serve
Unset the env var after the first successful start. The runtime emits a
WAL_RESET_BY_OPERATOR structured log entry at WARN level as a mandatory
audit marker.
Important: The operator reset affects telemetry only. All activation invariants (lock fingerprint, policy evaluation, RBAC enforcement) remain in effect. No policy or rollout state is discarded.
Path 3 — WAL invariant violation with data intact
If you suspect filesystem corruption (not a software bug), check the underlying storage before resetting:
# Check filesystem integrity
sudo fsck -n /dev/sdXY
# Check for storage hardware errors
sudo dmesg | grep -E "I/O error|ata|scsi|blk_update_request"
After confirming storage health, use Path 2 to reset.
Diagnosis commands
autonomy wal status
autonomy wal inspect --since 1h --kind error
4. Orchestrator Unreachable¶
Symptom¶
During demos (autonomy demo gazebo container mode):
demo gazebo: docker compose up: exit status 1
hint: if Docker Compose fails to start the orchestrator service, re-run with --local for an in-process demo
During fleet/rollout commands:
Get "http://localhost:8080/v1/fleet/summary": dial tcp 127.0.0.1:8080: connect: connection refused
— or —
Get "http://localhost:8080/v1/ha/status": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
During log streaming (autonomy logs --follow):
logs: stream interrupted (connection reset by peer)
hint: re-run 'autonomy logs --follow' to reconnect
Cause¶
The autonomy-orchestrator process is not running, is listening on a different address
than the CLI is configured to use, or is reachable but not responding (which surfaces
as a timeout or a dropped stream rather than an outright connection refusal).
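The two HTTP failures in the symptoms point at different conditions: "connection refused" means nothing is listening at that address, while "context deadline exceeded" means the address is reachable but nothing is answering within the timeout. A short-timeout probe of the health endpoint makes the distinction visible outside the CLI; the sketch below is illustrative, and the 3-second timeout and the localhost:8888 fallback are choices made for this example (the fallback matches the orchestrator's default listen port shown in the recovery steps, not a CLI default).

package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// Probe the orchestrator health endpoint with a short timeout.
// A refused connection means nothing is listening at that address;
// a timeout means the address is reachable but nothing is answering
// (hung process, wrong host, or a firewall silently dropping packets).
func main() {
	base := os.Getenv("AUTONOMY_ORCHESTRATOR_URL")
	if base == "" {
		base = "http://localhost:8888" // orchestrator default listen port, used here as a fallback
	}
	client := &http.Client{Timeout: 3 * time.Second}
	resp, err := client.Get(base + "/v1/health")
	if err != nil {
		fmt.Fprintln(os.Stderr, "orchestrator unreachable:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println("orchestrator answered:", resp.Status)
}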
Recovery¶
Check orchestrator health
# Default listen address is 0.0.0.0:8888
curl http://localhost:8888/v1/health
# Use the CLI health check (reads AUTONOMY_ORCHESTRATOR_URL or config)
autonomy status
# Custom address
AUTONOMY_ORCHESTRATOR_URL=http://myhost:8888 autonomy ha status
Start the orchestrator
The orchestrator uses SQLite by default (no --pg-url flag exists).
Data is written to --data-dir (default: $XDG_CACHE_HOME/autonomyops/orchestrator).
# Standalone (development) — listens on 0.0.0.0:8888
autonomy-orchestrator serve --data-dir /var/lib/autonomy
# Custom listen address
autonomy-orchestrator serve --listen 127.0.0.1:8888 --data-dir /var/lib/autonomy
# Systemd-managed
sudo systemctl start autonomy-orchestrator
sudo systemctl status autonomy-orchestrator
Demo-specific: use in-process mode
For demos, the orchestrator is always optional — run in-process with --local:
autonomy demo gazebo --local # full Gazebo policy scenario, no containers, no orchestrator
autonomy demo policy # always in-process; needs no external services
Check AUTONOMY_ORCHESTRATOR_URL
Fleet, rollout, logs, HA, and audit commands use AUTONOMY_ORCHESTRATOR_URL to
locate the orchestrator (default: none — must be configured explicitly):
# Check what the CLI resolves to
autonomy status
# Set it explicitly
export AUTONOMY_ORCHESTRATOR_URL=http://localhost:8888
# Or persist it
autonomy config set orchestrator.url http://localhost:8888
AUTONOMY_RUNTIME_URL is a different env var
AUTONOMY_RUNTIME_URL is injected into subprocess environments by autonomy run
and autonomy ros2 run as the governance callback address for the policy runtime
service (default listen: 127.0.0.1:7777). It has no relation to the orchestrator.
Check it only when debugging autonomy run or autonomy ros2 run subprocess
governance callbacks — not for orchestrator connectivity issues.
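The mechanism is ordinary environment inheritance: the parent sets the variable in the child's environment and the child (and anything it spawns) reads it. The sketch below shows that pattern only; it is not the actual autonomy run implementation, and the env command is a stand-in for the governed workload.

package main

import (
	"os"
	"os/exec"
)

// Sketch of handing a governance callback address to a child process.
// The child sees AUTONOMY_RUNTIME_URL in its environment; the parent's
// AUTONOMY_ORCHESTRATOR_URL setting is unrelated and is not needed here.
func main() {
	cmd := exec.Command("env") // stand-in for the governed workload
	cmd.Env = append(os.Environ(),
		"AUTONOMY_RUNTIME_URL=http://127.0.0.1:7777") // documented default policy runtime listen address
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		os.Exit(1)
	}
}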
HA cluster: verify leader is healthy
In a multi-node HA cluster, the active leader handles all write requests. A refused connection on a follower node is expected after the leader has moved. Check which node holds the leader token:
autonomy ha status
Then direct CLI calls to the current leader address.
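When scripting against a cluster, the same idea can be applied mechanically: sweep the candidate node addresses, skip the ones that refuse the connection, and confirm the survivor with autonomy ha status. The sketch below is illustrative; the node names and port are placeholders for your cluster's real addresses, and it only checks reachability of the documented fleet summary endpoint, not leadership itself.

package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// Sweep candidate orchestrator nodes and report the first one that accepts
// a request. A refused connection on a follower is expected after the
// leader moves; confirm actual leadership with "autonomy ha status".
func main() {
	nodes := []string{ // placeholder addresses; substitute your cluster's nodes
		"http://node-a:8888",
		"http://node-b:8888",
		"http://node-c:8888",
	}
	client := &http.Client{Timeout: 3 * time.Second}
	for _, base := range nodes {
		resp, err := client.Get(base + "/v1/fleet/summary")
		if err != nil {
			fmt.Fprintf(os.Stderr, "%s: %v\n", base, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("accepting requests: %s (%s)\n", base, resp.Status)
		return
	}
	fmt.Fprintln(os.Stderr, "no node reachable")
	os.Exit(1)
}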