Tutorial 03 — Crash and Recovery: WAL + Safe-Point

Objective: Demonstrate that the autonomy runtime’s Write-Ahead Log (WAL) preserves all telemetry events through process crashes, collector outages, and incomplete drains — with deterministic fail-hard behavior on corruption.

What you will demonstrate:

  • Accumulate tool-call events in the WAL while the OTLP collector is offline

  • Verify that events are not deleted on failed drain attempts (durability invariant)

  • Restart the runtime after a forced kill and confirm WAL recovery

  • Drain events in priority order (errors → decisions → lifecycle) after recovery

  • Trigger WAL fail-hard by corrupting the safe-point file and observe the error

Time: ~15 minutes


Architecture

                     Runtime Process
                           │
                           │ POST /v1/tool
                           ▼
                     ┌───────────┐
                     │  Tool     │
                     │  Server   │
                     └─────┬─────┘
                           │ Append (fsync)
                           ▼
                    ┌────────────────┐
                    │  telemetry.wal │ ◄── length-prefixed JSON frames
                    │  (on disk)     │     4-byte BE uint32 length + JSON
                    └────────────────┘
                           │
                           │ (when collector available)
                           ▼
                    ┌────────────────┐
                    │ telemetry.pos  │ ◄── consumer-only sequence marker
                    └────────────────┘
                           │
                           ▼
                   ┌────────────────────┐
                   │  OTLP Collector    │ ◄── may be offline
                   │  (or otel-sink)    │
                   └────────────────────┘

Recovery path (run by OpenWAL() at startup):
   ┌────────────────────┐
   │ telemetry.safe_seq │ ◄── 8-byte LE uint64 (last durable sequence)
   └─────────┬──────────┘
             │ Truncate WAL to safe_seq boundary
             ▼
   ┌──────────────────────────────────────────┐
   │ Fail-hard if:                            │
   │  · safe_seq file missing (not first-run) │
   │  · sequence gap detected                 │
   │  · first seq ≠ 1                         │
   │  · invalid JSON frame                    │
   └──────────────────────────────────────────┘

WAL Frame Format

The WAL uses a length-prefixed binary format (not raw JSONL):

┌─────────────────────┬────────────────────────────────────────────┐
│  4 bytes (BE uint32)│  N bytes (JSON payload)                    │
│  frame length       │  {"seq":N,"written_at":"...","event":{...}}│
└─────────────────────┴────────────────────────────────────────────┘

Each Append() call fsyncs the file descriptor before returning, guaranteeing that events are durable before the tool call response is sent to the caller.

Evidence: telemetry/wal.go:Append(), telemetry/wal.go:OpenWAL()
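
For orientation, here is a minimal Go sketch of that append path, using only the frame fields shown above (seq, written_at, event). It is illustrative, not the runtime's code; the authoritative writer is telemetry/wal.go:Append().

// Illustrative sketch only; the real writer is telemetry/wal.go:Append().
package main

import (
	"encoding/binary"
	"encoding/json"
	"os"
	"time"
)

// frame mirrors the payload shown above: {"seq":N,"written_at":"...","event":{...}}
type frame struct {
	Seq       uint64          `json:"seq"`
	WrittenAt time.Time       `json:"written_at"`
	Event     json.RawMessage `json:"event"`
}

// appendFrame writes one length-prefixed frame and fsyncs before returning,
// so the event is durable before the tool-call response goes back to the caller.
func appendFrame(f *os.File, fr frame) error {
	payload, err := json.Marshal(fr)
	if err != nil {
		return err
	}
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(payload))) // 4-byte BE uint32 length prefix
	if _, err := f.Write(append(hdr[:], payload...)); err != nil {
		return err
	}
	return f.Sync() // fsync: the frame is on disk before we return
}

func main() {
	f, err := os.OpenFile("telemetry.wal", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	ev := json.RawMessage(`{"kind":"autonomy.decision","attrs":{"decision":"allow","kind":"tool.echo"}}`)
	if err := appendFrame(f, frame{Seq: 1, WrittenAt: time.Now().UTC(), Event: ev}); err != nil {
		panic(err)
	}
}

The fsync before returning is what makes the offline and crash drills below safe: once a tool call has been answered, its event is already on disk.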


Step 0: Start the Stack and Load Policy

make demo-up
make demo-run-unsigned  # build bundle, load policy, restart runtime

Confirm the runtime is healthy:

curl -s http://localhost:7777/health | jq .

Step 1: Drill — OTLP Collector Offline, WAL Accumulates

Stop the OTLP sink (simulates collector unreachable):

docker compose -f demo/docker-compose.yml stop otel-sink

Generate 5 tool calls that will accumulate in the WAL:

# 3 × allow (tool.echo)
for i in 1 2 3; do
  curl -s -X POST http://localhost:7777/v1/tool \
    -H 'Content-Type: application/json' \
    -d "{\"kind\":\"tool.echo\",\"params\":{\"message\":\"offline-call-$i\"}}" | jq .decision
done

# 2 × deny (tool.shell) — these generate EventKindDecision with decision=deny
for i in 1 2; do
  curl -s -X POST http://localhost:7777/v1/tool \
    -H 'Content-Type: application/json' \
    -d '{"kind":"tool.shell","params":{"command":"id"}}' | jq .decision
done

Expected output (in order):

"allow"
"allow"
"allow"
"deny"
"deny"

Check WAL accumulation:

docker compose -f demo/docker-compose.yml exec -T runtime \
  autonomy telemetry export --dir /data/wal --out -

Expected: 5+ JSONL entries (each line is a WAL entry with seq, written_at, event.kind):

{"seq":1,"written_at":"2026-...","event":{"kind":"autonomy.decision","attrs":{"decision":"allow","kind":"tool.echo",...}}}
{"seq":2,"written_at":"2026-...","event":{"kind":"autonomy.decision","attrs":{"decision":"allow",...}}}
...
{"seq":5,"written_at":"2026-...","event":{"kind":"autonomy.decision","attrs":{"decision":"deny","kind":"tool.shell",...}}}

Check WAL event count:

docker compose -f demo/docker-compose.yml exec -T runtime \
  autonomy telemetry export --dir /data/wal --out - | grep -c '"seq"'

Expected: 5 (or more, depending on lifecycle events from the runtime startup)


Step 2: Attempt Drain While Collector Is Offline

Try draining to the stopped collector:

docker compose -f demo/docker-compose.yml exec -T runtime \
  autonomy telemetry drain \
  --dir /data/wal \
  --endpoint http://otel-sink:4318

Expected output (drain fails, but does NOT delete events):

[drain] sending batch 1 of 1...
[drain] OTLP error: connection refused
drain failed: could not reach endpoint

Verify events are still in the WAL after the failed drain:

docker compose -f demo/docker-compose.yml exec -T runtime \
  autonomy telemetry export --dir /data/wal --out - | grep -c '"seq"'

Expected: still 5 — events are NOT deleted on drain failure.

Durability invariant: The WAL writer (telemetry/wal.go:Append()) fsyncs before returning. autonomy telemetry drain (cmd/autonomy/commands/telemetry.go) reads from telemetry.pos and advances pos only after a successful drain. A failed send leaves pos unchanged, so the next drain attempt starts from the same sequence. Evidence: TestWALSurvivesCollectorDown in telemetry/buffer_test.go, Drill 3 in demo/scripts/05_failure_drills.sh
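
A minimal sketch of that invariant follows. It assumes telemetry.pos stores a plain decimal sequence number and takes a hypothetical sendBatch callback; neither detail is specified by this tutorial, and the authoritative logic is the drain command itself.

// Illustrative sketch only; the real drain is cmd/autonomy/commands/telemetry.go.
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// loadPos reads the consumer position; a missing file means nothing has been drained yet.
func loadPos(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return 0, nil
	}
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

func storePos(path string, seq uint64) error {
	return os.WriteFile(path, []byte(strconv.FormatUint(seq, 10)), 0o644)
}

// drainOnce sends every event after pos and advances pos only on success, so a
// failed send leaves both the WAL and the position marker untouched.
func drainOnce(posPath string, lastSeq uint64, sendBatch func(from, to uint64) error) error {
	pos, err := loadPos(posPath)
	if err != nil {
		return err
	}
	if pos >= lastSeq {
		return nil // nothing to drain
	}
	if err := sendBatch(pos+1, lastSeq); err != nil {
		return fmt.Errorf("drain failed, pos stays at %d: %w", pos, err)
	}
	return storePos(posPath, lastSeq) // advance only after the collector accepted the batch
}

func main() {
	// Simulate a collector that is still offline: pos never moves.
	err := drainOnce("telemetry.pos", 5, func(from, to uint64) error {
		return errors.New("connection refused")
	})
	fmt.Println(err)
}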


Step 3: Restart the Collector and Drain in Priority Order

Restart the OTLP sink:

docker compose -f demo/docker-compose.yml start otel-sink
sleep 3  # wait for sink to accept connections

Drain in priority order (errors → decisions → lifecycle):

docker compose -f demo/docker-compose.yml exec -T runtime \
  autonomy telemetry drain \
  --dir /data/wal \
  --endpoint http://otel-sink:4318

Expected output:

[drain] priority queue: 5 events (0 errors, 5 decisions, 0 lifecycle)
[drain] sending batch 1 (5 events)...
[drain] batch 1 accepted: 5 events
drain complete: 5 events sent

Verify events reached the control plane:

curl -s "http://localhost:8888/v1/events?event_type=ai.policy.decision&limit=10" | jq '{count: .count}'

Expected: {"count": 5}
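
For orientation, here is a sketch of the priority ordering reported in the drain output above. The error and lifecycle kind strings are assumptions (only autonomy.decision appears in this tutorial); the authoritative ordering lives in cmd/autonomy/commands/telemetry.go.

// Illustrative sketch of errors → decisions → lifecycle ordering.
package main

import (
	"fmt"
	"sort"
)

type event struct {
	Seq  uint64
	Kind string
}

// priorityClass maps an event kind to its drain class: errors first, then
// decisions, then lifecycle. The error/lifecycle kind strings are assumed.
func priorityClass(kind string) int {
	switch kind {
	case "autonomy.error":
		return 0
	case "autonomy.decision":
		return 1
	default:
		return 2 // lifecycle and anything else drains last
	}
}

// orderForDrain sorts a batch by priority class while keeping WAL sequence
// order within each class (stable sort).
func orderForDrain(events []event) {
	sort.SliceStable(events, func(i, j int) bool {
		return priorityClass(events[i].Kind) < priorityClass(events[j].Kind)
	})
}

func main() {
	batch := []event{
		{Seq: 1, Kind: "autonomy.lifecycle"},
		{Seq: 2, Kind: "autonomy.decision"},
		{Seq: 3, Kind: "autonomy.error"},
	}
	orderForDrain(batch)
	for _, e := range batch {
		fmt.Printf("seq=%d kind=%s\n", e.Seq, e.Kind)
	}
}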


Step 4: Drill — Forced Kill and WAL Recovery

Simulate a crash by killing the runtime process mid-operation:

# Start a tool call that will succeed:
curl -s -X POST http://localhost:7777/v1/tool \
  -H 'Content-Type: application/json' \
  -d '{"kind":"tool.echo","params":{"message":"pre-crash"}}' | jq .decision

# Kill the runtime container (SIGKILL):
docker compose -f demo/docker-compose.yml kill runtime

The WAL is durable — the pre-crash event was fsynced before the response was returned.

Restart the runtime:

docker compose -f demo/docker-compose.yml start runtime
sleep 3
curl -s http://localhost:7777/health | jq .

Expected:

{"status":"ok","mode":"normal"}

The runtime restarted cleanly. Internally, OpenWAL() ran the recovery logic:

  1. Read telemetry.safe_seq — the last durable sequence number

  2. Scan frames up to safe_seq, verifying sequence continuity

  3. Truncate the WAL to the safe-point boundary

  4. Re-open for appending
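
The sketch below walks those four steps under the formats described earlier (8-byte LE uint64 safe-point, 4-byte BE length-prefixed JSON frames) and uses the error names from Step 5. It is illustrative only; the authoritative recovery is telemetry/wal.go:OpenWAL().

// Illustrative sketch; the real recovery is telemetry/wal.go:OpenWAL().
package main

import (
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"os"
)

var (
	errFirstSeqNotOne  = errors.New("FIRST_SEQ_NOT_ONE")
	errSeqGap          = errors.New("SEQ_GAP")
	errInvalidJSON     = errors.New("WAL_CORRUPT_INVALID_JSON")
	errSafeSeqGtMaxSeq = errors.New("SAFESEQ_GT_MAXSEQ")
)

// readSafeSeq reads telemetry.safe_seq: an 8-byte little-endian uint64.
func readSafeSeq(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err // a missing file is the SAFESEQ_NOT_FOUND case handled by the caller
	}
	if len(b) != 8 {
		return 0, errors.New("safe_seq file is not 8 bytes")
	}
	return binary.LittleEndian.Uint64(b), nil
}

// recoverOffset scans frames up to safeSeq, verifying sequence continuity, and
// returns the byte offset at which the WAL must be truncated before re-opening.
func recoverOffset(walPath string, safeSeq uint64) (int64, error) {
	f, err := os.Open(walPath)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	var offset int64
	var prev uint64
	for prev < safeSeq {
		var hdr [4]byte
		if _, err := io.ReadFull(f, hdr[:]); err != nil {
			if err == io.EOF {
				return 0, errSafeSeqGtMaxSeq // safe-point beyond the last frame
			}
			return 0, err
		}
		payload := make([]byte, binary.BigEndian.Uint32(hdr[:]))
		if _, err := io.ReadFull(f, payload); err != nil {
			return 0, err
		}
		var fr struct {
			Seq uint64 `json:"seq"`
		}
		if err := json.Unmarshal(payload, &fr); err != nil {
			return 0, errInvalidJSON
		}
		switch {
		case prev == 0 && fr.Seq != 1:
			return 0, errFirstSeqNotOne
		case prev != 0 && fr.Seq != prev+1:
			return 0, errSeqGap
		}
		prev = fr.Seq
		offset += int64(4 + len(payload))
	}
	return offset, nil
}

func main() {
	safeSeq, err := readSafeSeq("telemetry.safe_seq")
	if err != nil {
		panic(err) // fail hard: no silent skip
	}
	offset, err := recoverOffset("telemetry.wal", safeSeq)
	if err != nil {
		panic(err)
	}
	fmt.Printf("truncate WAL to %d bytes, then re-open for append\n", offset)
}

Anything after the safe-point (for example a torn frame from the kill above) is discarded by the truncation; anything at or before it must be intact, or the runtime refuses to start.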

No events were lost — the pre-crash event is still in the WAL. Verify:

docker compose -f demo/docker-compose.yml exec -T runtime \
  autonomy telemetry export --dir /data/wal --out - | python3 -c "
import sys, json
for line in sys.stdin:
    e = json.loads(line)
    print(f'seq={e[\"seq\"]} kind={e[\"event\"][\"kind\"]} written={e[\"written_at\"][:19]}')"

Step 5: WAL Fail-Hard Behavior on Corruption

The WAL is designed to fail-hard (not silently skip) on structural corruption. This prevents silent data loss in adversarial or storage-fault scenarios.

Error causes that trigger fail-hard:

Constant                   Trigger
SAFESEQ_NOT_FOUND          telemetry.safe_seq file missing (not first-run)
SEQ_GAP                    A frame’s Seq is not exactly prev_seq + 1
FIRST_SEQ_NOT_ONE          First frame Seq is not 1
WAL_CORRUPT_INVALID_JSON   Frame payload is not valid JSON
SAFESEQ_GT_MAXSEQ          Safe-point references a sequence that doesn’t exist in the WAL

Recovery environment variables (operators only):

  • AUTONOMYOPS_WAL_LEGACY_UPGRADE=1 — Allow first-run when safe_seq is missing (one-shot migration path for pre-safe-seq WALs)

  • AUTONOMYOPS_WAL_OPERATOR_RESET=1 — Allow reset when both WAL and safe_seq are deleted (manual disaster recovery only)

These are escape hatches, not normal operating modes. Evidence: telemetry/wal.go:legacyUpgradeEnvVar, operatorResetEnvVar
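
As a sketch of how such a gate might be evaluated: the env var names below match the evidence above, while the helper name and walExists parameter are assumptions for illustration.

// Illustrative sketch; only the env var names are taken from the tutorial.
package main

import (
	"errors"
	"fmt"
	"os"
)

const (
	legacyUpgradeEnvVar = "AUTONOMYOPS_WAL_LEGACY_UPGRADE"
	operatorResetEnvVar = "AUTONOMYOPS_WAL_OPERATOR_RESET"
)

var errSafeSeqNotFound = errors.New("SAFESEQ_NOT_FOUND")

// resolveMissingSafeSeq decides what a missing telemetry.safe_seq means: a
// permitted one-shot migration, a permitted operator reset (only when the WAL
// itself is also gone), or a fatal error.
func resolveMissingSafeSeq(walExists bool) error {
	switch {
	case os.Getenv(legacyUpgradeEnvVar) == "1":
		return nil // first-run / migration of a pre-safe-seq WAL
	case !walExists && os.Getenv(operatorResetEnvVar) == "1":
		return nil // manual disaster recovery from an empty directory
	default:
		return fmt.Errorf("%w: telemetry.safe_seq missing and %s not set",
			errSafeSeqNotFound, legacyUpgradeEnvVar)
	}
}

func main() {
	// With neither variable set this fails hard, matching the fatal log below.
	fmt.Println(resolveMissingSafeSeq(true))
}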

To observe fail-hard behavior:

# Simulate missing safe_seq file:
docker compose -f demo/docker-compose.yml stop runtime

# Find the WAL directory bind-mounted into the container:
ls demo/data/wal/

# Rename (not delete) the safe_seq file to simulate corruption:
mv demo/data/wal/telemetry.safe_seq demo/data/wal/telemetry.safe_seq.BAK

# Try to start runtime — it must fail:
docker compose -f demo/docker-compose.yml start runtime
sleep 2
docker compose -f demo/docker-compose.yml logs runtime | grep -i "wal\|safe_seq\|SAFESEQ"

Expected log output (runtime fails to start):

FATAL wal recovery failed: SAFESEQ_NOT_FOUND: telemetry.safe_seq missing and AUTONOMYOPS_WAL_LEGACY_UPGRADE not set

Restore and restart:

mv demo/data/wal/telemetry.safe_seq.BAK demo/data/wal/telemetry.safe_seq
docker compose -f demo/docker-compose.yml start runtime
sleep 3
curl -s http://localhost:7777/health | jq .status

Expected: "ok"


Step 6: Supply-Chain Tamper Detection (Drill 4)

This drill demonstrates that a tampered lock artifact is detected before policy is loaded — preventing the runtime from running with unverified binaries.

# The drill is scripted — run it directly:
bash demo/scripts/05_failure_drills.sh

Drill 4 within that script:

  1. Pulls the lock sidecar from the OCI registry

  2. Flips one hex nibble in agent_artifact.digest

  3. Re-attaches the tampered lock (without re-signing)

  4. Runs autonomy verify --require-lock — must fail with ErrDigestMismatch or signature validation error

  5. Enables AUTONOMY_STRICT_MODE=1 on the runtime container

  6. Confirms tool calls return "decision":"deny" in strict mode
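
For intuition, here is a minimal sketch of the digest comparison that makes the tampered lock fail in step 4. File names, the lock JSON shape, and this error value are assumptions; the authoritative check is autonomy verify --require-lock.

// Illustrative sketch only; not the verify command's actual code.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

var ErrDigestMismatch = errors.New("agent artifact digest does not match lock")

// lockFile models only the field Drill 4 tampers with.
type lockFile struct {
	AgentArtifact struct {
		Digest string `json:"digest"` // "sha256:<hex>"
	} `json:"agent_artifact"`
}

func verifyArtifact(artifactPath, lockPath string) error {
	raw, err := os.ReadFile(lockPath)
	if err != nil {
		return err
	}
	var lock lockFile
	if err := json.Unmarshal(raw, &lock); err != nil {
		return err
	}
	data, err := os.ReadFile(artifactPath)
	if err != nil {
		return err
	}
	sum := sha256.Sum256(data)
	got := "sha256:" + hex.EncodeToString(sum[:])
	if got != lock.AgentArtifact.Digest {
		// A single flipped hex nibble in the lock is enough to land here.
		return fmt.Errorf("%w: got %s want %s", ErrDigestMismatch, got, lock.AgentArtifact.Digest)
	}
	return nil
}

func main() {
	if err := verifyArtifact("agent.bundle", "lock.json"); err != nil {
		fmt.Println("verify failed:", err)
		os.Exit(1)
	}
	fmt.Println("verify ok")
}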

Expected drill output:

[DRILL PASS] supply-chain: tampered lock artifact rejected by verify
[DRILL PASS] strict mode: tool calls denied when verification fails

Strict mode behavior: When AUTONOMY_STRICT_MODE=1, the runtime starts with denyAllEvaluator{} regardless of loaded policy bundles, and returns "mode":"strict" in /health. This is the fail-closed posture. Evidence: cmd/autonomy/commands/runtime.go, runtime/server.go:handleHealth()
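
A minimal sketch of that fail-closed selection, assuming an illustrative evaluator interface and decision shape; the authoritative wiring is cmd/autonomy/commands/runtime.go and runtime/server.go.

// Illustrative sketch of the strict-mode fail-closed posture.
package main

import (
	"fmt"
	"os"
)

type decision struct {
	Decision string
	Reason   string
}

type evaluator interface {
	Evaluate(kind string, params map[string]any) decision
}

// denyAllEvaluator refuses every tool call, regardless of any loaded policy bundle.
type denyAllEvaluator struct{}

func (denyAllEvaluator) Evaluate(string, map[string]any) decision {
	return decision{Decision: "deny", Reason: "strict mode: verification failed"}
}

// chooseEvaluator fails closed: under AUTONOMY_STRICT_MODE=1 the policy
// evaluator is ignored entirely.
func chooseEvaluator(policyEval evaluator) evaluator {
	if os.Getenv("AUTONOMY_STRICT_MODE") == "1" {
		return denyAllEvaluator{}
	}
	return policyEval
}

func main() {
	os.Setenv("AUTONOMY_STRICT_MODE", "1")
	ev := chooseEvaluator(nil) // the real policy evaluator never matters in strict mode
	fmt.Println(ev.Evaluate("tool.shell", map[string]any{"command": "id"}).Decision) // deny
}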


Automated Version

The full WAL + drill suite:

make demo-up
make demo-run-unsigned    # load policy
make demo-offline-drain   # offline accumulation + priority drain
make demo-drills          # all 5 failure-injection drills

Troubleshooting

Symptom                                      Cause                                 Fix
WAL export shows 0 entries                   Runtime WAL dir is wrong              Check bind mount: docker compose exec runtime ls /data/wal/
Drain still fails after otel-sink restart    Sink not yet accepting connections    sleep 5 then retry
SAFESEQ_NOT_FOUND                            Missing telemetry.safe_seq            Restore backup or set AUTONOMYOPS_WAL_LEGACY_UPGRADE=1 for first-run
Events not appearing in control plane        OTel pipeline not connected           Check collector logs: docker compose logs otel-collector
count: 0 after drain                         Event type filter too narrow          Try GET /v1/events?limit=20 (no filter)


What Just Happened

  • Stopped the OTLP collector and confirmed that tool-call events accumulated durably in the WAL (fsynced before each response)

  • Confirmed that a failed drain attempt does NOT delete events from the WAL

  • Restarted the collector and drained 5 events in priority order (errors first)

  • Demonstrated WAL recovery after a forced kill (safe-point truncation)

  • Triggered WAL fail-hard by moving telemetry.safe_seq aside and observed the fatal error

  • Ran Drill 4 to show that a tampered supply chain is detected before policy load

Next Tutorial

Tutorial 04 — OS Replacement Survival and Mission Runtime Reconstruction