Tutorial 03 — Crash and Recovery: WAL + Safe-Point¶
Objective: Demonstrate that the autonomy runtime’s Write-Ahead Log (WAL) preserves all telemetry events through process crashes, collector outages, and incomplete drains — with deterministic fail-hard behavior on corruption.
What you will demonstrate:
Accumulate tool-call events in the WAL while the OTLP collector is offline
Verify that events are not deleted on failed drain attempts (durability invariant)
Restart the runtime after a forced kill and confirm WAL recovery
Drain events in priority order (errors → decisions → lifecycle) after recovery
Trigger WAL fail-hard by corrupting the safe-point file and observe the error
Time: ~15 minutes
Architecture¶
Runtime Process
│
│ POST /v1/tool
▼
┌───────────┐
│ Tool │
│ Server │
└─────┬─────┘
│ Append (fsync)
▼
┌────────────────┐
│ telemetry.wal │ ◄── length-prefixed JSONL frames
│ (on disk) │ 4-byte BE uint32 length + JSON
└────────────────┘
│
│ (when collector available)
▼
┌────────────────┐
│ telemetry.pos │ ◄── consumer-only sequence marker
└────────────────┘
│
▼
┌────────────────────┐
│ OTLP Collector │ ◄── may be offline
│ (or otel-sink) │
└────────────────────┘
Recovery path (on OpenWAL):
┌────────────────────┐
│ telemetry.safe_seq │ ◄── 8-byte LE uint64 (last durable sequence)
└─────────┬──────────┘
│ Truncate WAL to safe_seq boundary
▼
┌───────────────────────────────────────┐
│ Fail-hard if: │
│ · safe_seq file missing (not first-run)│
│ · sequence gap detected │
│ · first seq ≠ 1 │
│ · invalid JSON frame │
└───────────────────────────────────────┘
WAL Frame Format¶
The WAL uses a length-prefixed binary format (not raw JSONL):
┌────────────────────┬───────────────────────────────────────────┐
│ 4 bytes (BE uint32)│ N bytes (JSON payload) │
│ frame length │ {"seq":N,"written_at":"...","event":{...}}│
└────────────────────┴───────────────────────────────────────────┘
Each Append() call fsyncs the file descriptor before returning, guaranteeing that
events are durable before the tool call response is sent to the caller.
Evidence:
`telemetry/wal.go:Append()`, `telemetry/wal.go:OpenWAL()`
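For reference, here is a minimal, self-contained Go sketch of a writer for this layout. The field names follow the sample payload above; the function and type names are illustrative, not the actual API in `telemetry/wal.go`:

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"os"
	"time"
)

// frame mirrors the JSON payload shown above.
type frame struct {
	Seq       uint64          `json:"seq"`
	WrittenAt string          `json:"written_at"`
	Event     json.RawMessage `json:"event"`
}

// appendFrame writes one length-prefixed frame (4-byte big-endian length +
// JSON payload) and fsyncs before returning, so the event is durable before
// the caller gets its response.
func appendFrame(f *os.File, fr frame) error {
	payload, err := json.Marshal(fr)
	if err != nil {
		return err
	}
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(payload)))
	if _, err := f.Write(hdr[:]); err != nil {
		return err
	}
	if _, err := f.Write(payload); err != nil {
		return err
	}
	return f.Sync() // fsync: the durability guarantee described above
}

func main() {
	f, err := os.OpenFile("telemetry.wal", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	ev := json.RawMessage(`{"kind":"autonomy.decision","attrs":{"decision":"allow"}}`)
	if err := appendFrame(f, frame{Seq: 1, WrittenAt: time.Now().UTC().Format(time.RFC3339), Event: ev}); err != nil {
		panic(err)
	}
}
```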
Step 0: Start the Stack and Load Policy¶
make demo-up
make demo-run-unsigned # build bundle, load policy, restart runtime
Confirm the runtime is healthy:
curl -s http://localhost:7777/health | jq .
Step 1: Drill — OTLP Collector Offline, WAL Accumulates¶
Stop the OTLP sink (simulates collector unreachable):
docker compose -f demo/docker-compose.yml stop otel-sink
Generate 5 tool calls that will accumulate in the WAL:
# 3 × allow (tool.echo)
for i in 1 2 3; do
curl -s -X POST http://localhost:7777/v1/tool \
-H 'Content-Type: application/json' \
-d "{\"kind\":\"tool.echo\",\"params\":{\"message\":\"offline-call-$i\"}}" | jq .decision
done
# 2 × deny (tool.shell) — these generate EventKindDecision with decision=deny
for i in 1 2; do
curl -s -X POST http://localhost:7777/v1/tool \
-H 'Content-Type: application/json' \
-d '{"kind":"tool.shell","params":{"command":"id"}}' | jq .decision
done
Expected output (interleaved):
"allow"
"allow"
"allow"
"deny"
"deny"
Check WAL accumulation:
docker compose -f demo/docker-compose.yml exec -T runtime \
autonomy telemetry export --dir /data/wal --out -
Expected: 5+ JSONL entries (each line is a WAL entry with seq, written_at, event.kind):
{"seq":1,"written_at":"2026-...","event":{"kind":"autonomy.decision","attrs":{"decision":"allow","kind":"tool.echo",...}}}
{"seq":2,"written_at":"2026-...","event":{"kind":"autonomy.decision","attrs":{"decision":"allow",...}}}
...
{"seq":5,"written_at":"2026-...","event":{"kind":"autonomy.decision","attrs":{"decision":"deny","kind":"tool.shell",...}}}
Check WAL event count:
docker compose -f demo/docker-compose.yml exec -T runtime \
autonomy telemetry export --dir /data/wal --out - | grep -c '"seq"'
Expected: 5 (or more, depending on lifecycle events from the runtime startup)
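If you want to inspect the file without the CLI, the export step can be approximated by decoding the length-prefixed frames directly. A small sketch, illustrative only and not the real `autonomy telemetry export` implementation:

```go
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.Open("telemetry.wal")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	r := bufio.NewReader(f)
	for {
		// Each frame: 4-byte big-endian length, then the JSON payload.
		var hdr [4]byte
		if _, err := io.ReadFull(r, hdr[:]); err == io.EOF {
			return // clean end of WAL
		} else if err != nil {
			panic(err)
		}
		payload := make([]byte, binary.BigEndian.Uint32(hdr[:]))
		if _, err := io.ReadFull(r, payload); err != nil {
			panic(err)
		}
		fmt.Println(string(payload)) // one JSONL entry per frame
	}
}
```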
Step 2: Attempt Drain While Collector Is Offline¶
Try draining to the stopped collector:
docker compose -f demo/docker-compose.yml exec -T runtime \
autonomy telemetry drain \
--dir /data/wal \
--endpoint http://otel-sink:4318
Expected output (drain fails, but does NOT delete events):
[drain] sending batch 1 of 1...
[drain] OTLP error: connection refused
drain failed: could not reach endpoint
Verify events are still in the WAL after the failed drain:
docker compose -f demo/docker-compose.yml exec -T runtime \
autonomy telemetry export --dir /data/wal --out - | grep -c '"seq"'
Expected: still 5 — events are NOT deleted on drain failure.
Durability invariant: The WAL writer (`telemetry/wal.go:Append()`) fsyncs before returning. `autonomy telemetry drain` (`cmd/autonomy/commands/telemetry.go`) reads from `telemetry.pos` and advances `pos` only after a successful drain. A failed send leaves `pos` unchanged, so the next drain attempt starts from the same sequence. Evidence: `TestWALSurvivesCollectorDown` in `telemetry/buffer_test.go`, Drill 3 in `demo/scripts/05_failure_drills.sh`
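A compact sketch of that invariant, with the position, WAL, and OTLP plumbing injected as function values so the example stays self-contained. These helper names are placeholders, not the real code in `cmd/autonomy/commands/telemetry.go`:

```go
package main

import "fmt"

// drainOnce sketches the invariant: the consumer marker advances only after
// the collector acknowledges the batch. The four function parameters stand in
// for the real pos/WAL/OTLP plumbing.
func drainOnce(
	loadPos func() (uint64, error),
	readFrom func(seq uint64) (batch [][]byte, last uint64, err error),
	send func(batch [][]byte) error,
	savePos func(seq uint64) error,
) error {
	pos, err := loadPos() // last acknowledged sequence (telemetry.pos)
	if err != nil {
		return err
	}
	batch, last, err := readFrom(pos + 1) // frames not yet acknowledged
	if err != nil {
		return err
	}
	if len(batch) == 0 {
		return nil
	}
	if err := send(batch); err != nil {
		// Failed send: pos is untouched and no frames are deleted,
		// so the next drain retries from the same sequence.
		return fmt.Errorf("drain failed: %w", err)
	}
	return savePos(last) // advance telemetry.pos only on success
}

func main() {
	pos := uint64(0)
	err := drainOnce(
		func() (uint64, error) { return pos, nil },
		func(seq uint64) ([][]byte, uint64, error) { return [][]byte{[]byte("{}")}, seq, nil },
		func([][]byte) error { return fmt.Errorf("connection refused") },
		func(seq uint64) error { pos = seq; return nil },
	)
	fmt.Println(err, "| pos still:", pos) // drain failed: connection refused | pos still: 0
}
```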
Step 3: Restart the Collector and Drain in Priority Order¶
Restart the OTLP sink:
docker compose -f demo/docker-compose.yml start otel-sink
sleep 3 # wait for sink to accept connections
Drain in priority order (errors → decisions → lifecycle):
docker compose -f demo/docker-compose.yml exec -T runtime \
autonomy telemetry drain \
--dir /data/wal \
--endpoint http://otel-sink:4318
Expected output:
[drain] priority queue: 5 events (0 errors, 5 decisions, 0 lifecycle)
[drain] sending batch 1 (5 events)...
[drain] batch 1 accepted: 5 events
drain complete: 5 events sent
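The priority labels in that output amount to a stable sort over event kinds. A sketch, assuming kind strings; only `autonomy.decision` appears verbatim in this tutorial, the other two names are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// Drain priority: errors first, then decisions, then lifecycle.
var priority = map[string]int{
	"autonomy.error":     0, // assumed kind name
	"autonomy.decision":  1, // matches the WAL output in Step 1
	"autonomy.lifecycle": 2, // assumed kind name
}

func main() {
	kinds := []string{"autonomy.lifecycle", "autonomy.decision", "autonomy.error", "autonomy.decision"}
	sort.SliceStable(kinds, func(i, j int) bool { return priority[kinds[i]] < priority[kinds[j]] })
	fmt.Println(kinds) // [autonomy.error autonomy.decision autonomy.decision autonomy.lifecycle]
}
```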
Verify events reached the control plane:
curl -s "http://localhost:8888/v1/events?event_type=ai.policy.decision&limit=10" | jq '{count: .count}'
Expected: {"count": 5}
Step 4: Drill — Forced Kill and WAL Recovery¶
Simulate a crash by killing the runtime process mid-operation:
# Start a tool call that will succeed:
curl -s -X POST http://localhost:7777/v1/tool \
-H 'Content-Type: application/json' \
-d '{"kind":"tool.echo","params":{"message":"pre-crash"}}' | jq .decision
# Kill the runtime container (SIGKILL):
docker compose -f demo/docker-compose.yml kill runtime
The WAL is durable — the pre-crash event was fsynced before the response was returned.
Restart the runtime:
docker compose -f demo/docker-compose.yml start runtime
sleep 3
curl -s http://localhost:7777/health | jq .
Expected:
{"status":"ok","mode":"normal"}
The runtime restarted cleanly. Internally, `OpenWAL()` ran the recovery logic:
Read `telemetry.safe_seq` (the last durable sequence number)
Scan frames up to `safe_seq`, verifying sequence continuity
Truncate the WAL to the safe-point boundary
Re-open for appending
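A self-contained sketch of those four steps, using the file names from this tutorial; error messages and helper names are illustrative rather than the actual `OpenWAL()` code:

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

func recoverWAL(dir string) error {
	// 1. Read telemetry.safe_seq: 8-byte little-endian uint64, the last durable sequence.
	raw, err := os.ReadFile(filepath.Join(dir, "telemetry.safe_seq"))
	if err != nil {
		return fmt.Errorf("safe_seq unreadable: %w", err) // fail-hard (unless first run / operator override)
	}
	if len(raw) != 8 {
		return fmt.Errorf("safe_seq corrupt: want 8 bytes, got %d", len(raw))
	}
	safeSeq := binary.LittleEndian.Uint64(raw)

	walPath := filepath.Join(dir, "telemetry.wal")
	f, err := os.Open(walPath)
	if err != nil {
		return err
	}
	defer f.Close()

	// 2. Scan frames up to safeSeq, verifying sequence continuity.
	var boundary int64
	for want := uint64(1); want <= safeSeq; want++ {
		var hdr [4]byte
		if _, err := io.ReadFull(f, hdr[:]); err != nil {
			return fmt.Errorf("truncated frame header at seq %d: %w", want, err)
		}
		payload := make([]byte, binary.BigEndian.Uint32(hdr[:]))
		if _, err := io.ReadFull(f, payload); err != nil {
			return fmt.Errorf("truncated frame payload at seq %d: %w", want, err)
		}
		var fr struct {
			Seq uint64 `json:"seq"`
		}
		if err := json.Unmarshal(payload, &fr); err != nil {
			return fmt.Errorf("invalid JSON frame at seq %d: %w", want, err) // fail-hard
		}
		if fr.Seq != want {
			return fmt.Errorf("sequence gap: got %d, want %d", fr.Seq, want) // fail-hard
		}
		boundary += int64(4 + len(payload))
	}

	// 3. Truncate the WAL to the safe-point boundary; 4. the caller re-opens it for appending.
	return os.Truncate(walPath, boundary)
}

func main() {
	if err := recoverWAL("/data/wal"); err != nil {
		fmt.Fprintln(os.Stderr, "wal recovery failed:", err)
		os.Exit(1)
	}
}
```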
No events were lost — the pre-crash event is still in the WAL. Verify:
docker compose -f demo/docker-compose.yml exec -T runtime \
autonomy telemetry export --dir /data/wal --out - | python3 -c "
import sys, json
for line in sys.stdin:
    e = json.loads(line)
    print(f'seq={e[\"seq\"]} kind={e[\"event\"][\"kind\"]} written={e[\"written_at\"][:19]}')"
Step 5: WAL Fail-Hard Behavior on Corruption¶
The WAL is designed to fail-hard (not silently skip) on structural corruption. This prevents silent data loss in adversarial or storage-fault scenarios.
Error causes that trigger fail-hard:
| Constant | Trigger |
|---|---|
| `SAFESEQ_NOT_FOUND` | `telemetry.safe_seq` file is missing and this is not a first run |
| | A frame's `Seq` is not exactly the previous `Seq` + 1 (sequence gap) |
| | First frame `Seq` is not 1 |
| | Frame payload is not valid JSON |
| | Safe-point references a sequence that doesn't exist in the WAL |
Recovery environment variables (operators only):
`AUTONOMYOPS_WAL_LEGACY_UPGRADE=1`: allow first-run when `safe_seq` is missing (one-shot migration path for pre-safe-seq WALs)
`AUTONOMYOPS_WAL_OPERATOR_RESET=1`: allow reset when both the WAL and `safe_seq` are deleted (manual disaster recovery only)
These are escape hatches, not normal operating modes. Evidence: `telemetry/wal.go:legacyUpgradeEnvVar`, `operatorResetEnvVar`
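A small sketch of how the legacy-upgrade escape hatch gates the missing-`safe_seq` case (the operator-reset path is omitted). The env var name and the `SAFESEQ_NOT_FOUND` message come from this page; the surrounding logic is illustrative:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// loadSafeSeq sketches the fail-hard decision when telemetry.safe_seq is absent.
// firstRun=true means the WAL may start empty (legacy upgrade or a genuine first run).
func loadSafeSeq(dir string) (uint64, bool, error) {
	raw, err := os.ReadFile(filepath.Join(dir, "telemetry.safe_seq"))
	if errors.Is(err, os.ErrNotExist) {
		if os.Getenv("AUTONOMYOPS_WAL_LEGACY_UPGRADE") == "1" {
			return 0, true, nil // one-shot migration path for pre-safe-seq WALs
		}
		return 0, false, fmt.Errorf("SAFESEQ_NOT_FOUND: telemetry.safe_seq missing and AUTONOMYOPS_WAL_LEGACY_UPGRADE not set")
	}
	if err != nil {
		return 0, false, err
	}
	if len(raw) != 8 {
		return 0, false, fmt.Errorf("safe_seq corrupt: want 8 bytes, got %d", len(raw))
	}
	return binary.LittleEndian.Uint64(raw), false, nil
}

func main() {
	seq, firstRun, err := loadSafeSeq("/data/wal")
	fmt.Println("safe_seq:", seq, "first run:", firstRun, "err:", err)
}
```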
To observe fail-hard behavior:
# Simulate missing safe_seq file:
docker compose -f demo/docker-compose.yml stop runtime
# Find the WAL directory bind-mounted into the container:
ls demo/data/wal/
# Rename (not delete) the safe_seq file to simulate corruption:
mv demo/data/wal/telemetry.safe_seq demo/data/wal/telemetry.safe_seq.BAK
# Try to start runtime — it must fail:
docker compose -f demo/docker-compose.yml start runtime
sleep 2
docker compose -f demo/docker-compose.yml logs runtime | grep -i "wal\|safe_seq\|SAFESEQ"
Expected log output (runtime fails to start):
FATAL wal recovery failed: SAFESEQ_NOT_FOUND: telemetry.safe_seq missing and AUTONOMYOPS_WAL_LEGACY_UPGRADE not set
Restore and restart:
mv demo/data/wal/telemetry.safe_seq.BAK demo/data/wal/telemetry.safe_seq
docker compose -f demo/docker-compose.yml start runtime
sleep 3
curl -s http://localhost:7777/health | jq .status
Expected: "ok"
Step 6: Supply-Chain Tamper Detection (Drill 4)¶
This drill demonstrates that a tampered lock artifact is detected before policy is loaded — preventing the runtime from running with unverified binaries.
# The drill is scripted — run it directly:
bash demo/scripts/05_failure_drills.sh
Drill 4 within that script:
Pulls the lock sidecar from the OCI registry
Flips one hex nibble in `agent_artifact.digest`
Re-attaches the tampered lock (without re-signing)
Runs `autonomy verify --require-lock`, which must fail with `ErrDigestMismatch` or a signature validation error
Enables `AUTONOMY_STRICT_MODE=1` on the runtime container
Confirms tool calls return `"decision":"deny"` in strict mode
Expected drill output:
[DRILL PASS] supply-chain: tampered lock artifact rejected by verify
[DRILL PASS] strict mode: tool calls denied when verification fails
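The heart of the tamper check is a digest comparison. A hedged sketch: `ErrDigestMismatch` is the error named above, while the file path, digest value, and function here are assumptions, not the actual `autonomy verify` code:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"os"
)

// ErrDigestMismatch mirrors the error name quoted in the drill description.
var ErrDigestMismatch = errors.New("digest mismatch")

// verifyArtifact recomputes the artifact's sha256 and compares it to the digest
// recorded in the lock. Flipping a single hex nibble in the lock makes this fail.
func verifyArtifact(artifactPath, lockedDigest string) error {
	data, err := os.ReadFile(artifactPath)
	if err != nil {
		return err
	}
	sum := sha256.Sum256(data)
	if hex.EncodeToString(sum[:]) != lockedDigest {
		return fmt.Errorf("%w: artifact %s does not match locked digest", ErrDigestMismatch, artifactPath)
	}
	return nil
}

func main() {
	// Hypothetical path and digest, for illustration only.
	if err := verifyArtifact("agent.bin", "deadbeef"); err != nil {
		fmt.Fprintln(os.Stderr, "verify failed:", err)
		os.Exit(1)
	}
}
```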
Strict mode behavior: When `AUTONOMY_STRICT_MODE=1`, the runtime starts with `denyAllEvaluator{}` regardless of loaded policy bundles, and returns `"mode":"strict"` in `/health`. This is the fail-closed posture. Evidence: `cmd/autonomy/commands/runtime.go`, `runtime/server.go:handleHealth()`
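A hedged sketch of that fail-closed fallback: `denyAllEvaluator` is the type named in the evidence; the evaluator interface and the selection logic around it are illustrative:

```go
package main

import (
	"fmt"
	"os"
)

// evaluator is an illustrative stand-in for the runtime's policy evaluation interface.
type evaluator interface {
	Evaluate(kind string) string // returns "allow" or "deny"
}

// denyAllEvaluator denies everything, regardless of any loaded policy bundle.
type denyAllEvaluator struct{}

func (denyAllEvaluator) Evaluate(string) string { return "deny" }

// policyEvaluator stands in for an evaluator backed by a loaded bundle.
type policyEvaluator struct{}

func (policyEvaluator) Evaluate(kind string) string {
	if kind == "tool.echo" {
		return "allow"
	}
	return "deny"
}

// selectEvaluator shows the fail-closed posture: strict mode wins over any bundle.
func selectEvaluator() evaluator {
	if os.Getenv("AUTONOMY_STRICT_MODE") == "1" {
		return denyAllEvaluator{}
	}
	return policyEvaluator{}
}

func main() {
	fmt.Println(selectEvaluator().Evaluate("tool.echo")) // "deny" when strict mode is set
}
```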
Automated Version¶
The full WAL + drill suite:
make demo-up
make demo-run-unsigned # load policy
make demo-offline-drain # offline accumulation + priority drain
make demo-drills # all 5 failure-injection drills
Troubleshooting¶
| Symptom | Cause | Fix |
|---|---|---|
| WAL export shows 0 entries | Runtime WAL dir is wrong | Check the bind mount: host `demo/data/wal` → container `/data/wal` |
| Drain still fails after otel-sink restart | Sink not yet accepting connections | Wait a few seconds (`sleep 3`) and retry the drain |
| `SAFESEQ_NOT_FOUND` | Missing `telemetry.safe_seq` | Restore the backup or set `AUTONOMYOPS_WAL_LEGACY_UPGRADE=1` |
| Events not appearing in control plane | OTel pipeline not connected | Check collector logs: `docker compose -f demo/docker-compose.yml logs otel-sink` |
| | Event type filter too narrow | Try a broader `event_type` filter, or omit it |
What Just Happened¶
Stopped the OTLP collector and confirmed that tool-call events accumulated durably in the WAL (fsynced before each response)
Confirmed that a failed drain attempt does NOT delete events from the WAL
Restarted the collector and drained 5 events in priority order (errors first)
Demonstrated WAL recovery after a forced kill (safe-point truncation)
Triggered WAL fail-hard by moving `telemetry.safe_seq` aside and observed the fatal error
Ran Drill 4 to show that a tampered supply chain is detected before policy load
Evidence Links¶
| Claim | File | Symbol |
|---|---|---|
| WAL frame format | `telemetry/wal.go` | `Frame` struct, `Append()` |
| Safe-point durability | `telemetry/wal.go` | |
| Fail-hard causes | `telemetry/wal.go` | |
| Drain does not delete on failure | `cmd/autonomy/commands/telemetry.go` | |
| WAL durability test | `telemetry/buffer_test.go` | `TestWALSurvivesCollectorDown` |
| Drill 3 (OTLP failure) | `demo/scripts/05_failure_drills.sh` | Drill 3 section |
| Tamper detection | `demo/scripts/05_failure_drills.sh` | Drill 4 section |
| Strict mode | `cmd/autonomy/commands/runtime.go`, `runtime/server.go` | `handleHealth()` |
Next Tutorial¶
Tutorial 04 — OS Replacement Survival and Mission Runtime Reconstruction