Tutorial 03 — Crash and Recovery: WAL + Safe-Point

Objective: Demonstrate that the autonomy runtime’s Write-Ahead Log (WAL) preserves all telemetry events through process crashes, collector outages, and incomplete drains — with deterministic fail-hard behavior on corruption.

What you will demonstrate:

  • Accumulate tool-call events in the WAL while the OTLP collector is offline

  • Verify that events are not deleted on failed drain attempts (durability invariant)

  • Restart the runtime after a forced kill and confirm WAL recovery

  • Drain events in priority order (errors → decisions → lifecycle) after recovery

  • Trigger WAL fail-hard by corrupting the safe-point file and observe the error

Time: ~15 minutes


Architecture

                     Runtime Process
                           │
                           │ POST /v1/tool
                           ▼
                     ┌───────────┐
                     │  Tool     │
                     │  Server   │
                     └─────┬─────┘
                           │ Append (fsync)
                           ▼
                    ┌────────────────┐
                    │  telemetry.wal │ ◄── length-prefixed JSON frames
                    │  (on disk)     │     4-byte BE uint32 length + JSON
                    └────────────────┘
                           │
                           │ (when collector available)
                           ▼
                    ┌────────────────┐
                    │ telemetry.pos  │ ◄── consumer-only sequence marker
                    └────────────────┘
                           │
                           ▼
                   ┌────────────────────┐
                   │  OTLP Collector    │ ◄── may be offline
                   │  (or otel-sink)    │
                   └────────────────────┘

Recovery path (run by OpenWAL() at startup):
   ┌────────────────────┐
   │ telemetry.safe_seq │ ◄── 8-byte LE uint64 (last durable sequence)
   └─────────┬──────────┘
             │ Truncate WAL to safe_seq boundary
             ▼
   ┌──────────────────────────────────────────┐
   │ Fail-hard if:                            │
   │  · safe_seq file missing (not first-run) │
   │  · sequence gap detected                 │
   │  · first seq ≠ 1                         │
   │  · invalid JSON frame                    │
   └──────────────────────────────────────────┘

WAL Frame Format

The WAL uses a length-prefixed binary format (not raw JSONL):

┌─────────────────────┬────────────────────────────────────────────┐
│  4 bytes (BE uint32)│  N bytes (JSON payload)                    │
│  frame length       │  {"seq":N,"written_at":"...","event":{...}}│
└─────────────────────┴────────────────────────────────────────────┘

Each Append() call fsyncs the file descriptor before returning, guaranteeing that events are durable before the tool call response is sent to the caller.

Evidence: telemetry/wal.go:Append(), telemetry/wal.go:OpenWAL()
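
For orientation, here is a minimal Go sketch of that append path, using only the frame fields shown above (seq, written_at, event). It is illustrative, not the runtime's code; the authoritative writer is telemetry/wal.go:Append().

// Illustrative sketch only; the real writer is telemetry/wal.go:Append().
package main

import (
	"encoding/binary"
	"encoding/json"
	"os"
	"time"
)

// frame mirrors the payload shown above: {"seq":N,"written_at":"...","event":{...}}
type frame struct {
	Seq       uint64          `json:"seq"`
	WrittenAt time.Time       `json:"written_at"`
	Event     json.RawMessage `json:"event"`
}

// appendFrame writes one length-prefixed frame and fsyncs before returning,
// so the event is durable before the tool-call response goes back to the caller.
func appendFrame(f *os.File, fr frame) error {
	payload, err := json.Marshal(fr)
	if err != nil {
		return err
	}
	var hdr [4]byte
	binary.BigEndian.PutUint32(hdr[:], uint32(len(payload))) // 4-byte BE uint32 length prefix
	if _, err := f.Write(append(hdr[:], payload...)); err != nil {
		return err
	}
	return f.Sync() // fsync: the frame is on disk before we return
}

func main() {
	f, err := os.OpenFile("telemetry.wal", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	ev := json.RawMessage(`{"kind":"autonomy.decision","attrs":{"decision":"allow","kind":"tool.echo"}}`)
	if err := appendFrame(f, frame{Seq: 1, WrittenAt: time.Now().UTC(), Event: ev}); err != nil {
		panic(err)
	}
}

The fsync before returning is what makes the offline and crash drills below safe: once a tool call has been answered, its event is already on disk.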


Step 0: Start the Stack and Load Policy

make demo-up
make demo-run-unsigned  # build bundle, load policy, restart runtime

Confirm the runtime is healthy:

curl -s http://localhost:7777/health | jq .

Step 1: Drill — OTLP Collector Offline, WAL Accumulates

Stop the OTLP sink (simulates collector unreachable):

docker compose -f demo/docker-compose.yml stop otel-sink

Generate 5 tool calls that will accumulate in the WAL:

# 3 × allow (tool.echo)
for i in 1 2 3; do
  curl -s -X POST http://localhost:7777/v1/tool \
    -H 'Content-Type: application/json' \
    -d "{\"kind\":\"tool.echo\",\"params\":{\"message\":\"offline-call-$i\"}}" | jq .decision
done

# 2 × deny (tool.shell) — these generate EventKindDecision with decision=deny
for i in 1 2; do
  curl -s -X POST http://localhost:7777/v1/tool \
    -H 'Content-Type: application/json' \
    -d '{"kind":"tool.shell","params":{"command":"id"}}' | jq .decision
done

Expected output (in order):

"allow"
"allow"
"allow"
"deny"
"deny"

Check WAL accumulation:

docker compose -f demo/docker-compose.yml exec -T runtime \
  autonomy telemetry export --dir /data/wal --out -

Expected: 5+ JSONL entries (each line is a WAL entry with seq, written_at, event.kind):

{"seq":1,"written_at":"2026-...","event":{"kind":"autonomy.decision","attrs":{"decision":"allow","kind":"tool.echo",...}}}
{"seq":2,"written_at":"2026-...","event":{"kind":"autonomy.decision","attrs":{"decision":"allow",...}}}
...
{"seq":5,"written_at":"2026-...","event":{"kind":"autonomy.decision","attrs":{"decision":"deny","kind":"tool.shell",...}}}

Check WAL event count:

docker compose -f demo/docker-compose.yml exec -T runtime \
  autonomy telemetry export --dir /data/wal --out - | grep -c '"seq"'

Expected: 5 (or more, depending on lifecycle events from the runtime startup)


Step 2: Attempt Drain While Collector Is Offline

Try draining to the stopped collector:

docker compose -f demo/docker-compose.yml exec -T runtime \
  autonomy telemetry drain \
  --dir /data/wal \
  --endpoint http://otel-sink:4318

Expected output (drain fails, but does NOT delete events):

[drain] sending batch 1 of 1...
[drain] OTLP error: connection refused
drain failed: could not reach endpoint

Verify events are still in the WAL after the failed drain:

docker compose -f demo/docker-compose.yml exec -T runtime \
  autonomy telemetry export --dir /data/wal --out - | grep -c '"seq"'

Expected: still 5 — events are NOT deleted on drain failure.

Durability invariant: The WAL writer (telemetry/wal.go:Append()) fsyncs before returning. autonomy telemetry drain (cmd/autonomy/commands/telemetry.go) reads from telemetry.pos and advances pos only after a successful drain. A failed send leaves pos unchanged, so the next drain attempt starts from the same sequence. Evidence: TestWALSurvivesCollectorDown in telemetry/buffer_test.go, Drill 3 in demo/scripts/05_failure_drills.sh
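
A minimal sketch of that invariant follows. It assumes telemetry.pos stores a plain decimal sequence number and takes a hypothetical sendBatch callback; neither detail is specified by this tutorial, and the authoritative logic is the drain command itself.

// Illustrative sketch only; the real drain is cmd/autonomy/commands/telemetry.go.
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// loadPos reads the consumer position; a missing file means nothing has been drained yet.
func loadPos(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return 0, nil
	}
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

func storePos(path string, seq uint64) error {
	return os.WriteFile(path, []byte(strconv.FormatUint(seq, 10)), 0o644)
}

// drainOnce sends every event after pos and advances pos only on success, so a
// failed send leaves both the WAL and the position marker untouched.
func drainOnce(posPath string, lastSeq uint64, sendBatch func(from, to uint64) error) error {
	pos, err := loadPos(posPath)
	if err != nil {
		return err
	}
	if pos >= lastSeq {
		return nil // nothing to drain
	}
	if err := sendBatch(pos+1, lastSeq); err != nil {
		return fmt.Errorf("drain failed, pos stays at %d: %w", pos, err)
	}
	return storePos(posPath, lastSeq) // advance only after the collector accepted the batch
}

func main() {
	// Simulate a collector that is still offline: pos never moves.
	err := drainOnce("telemetry.pos", 5, func(from, to uint64) error {
		return errors.New("connection refused")
	})
	fmt.Println(err)
}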


Step 3: Restart the Collector and Drain in Priority Order

Restart the OTLP sink:

docker compose -f demo/docker-compose.yml start otel-sink
sleep 3  # wait for sink to accept connections

Drain in priority order (errors → decisions → lifecycle):

docker compose -f demo/docker-compose.yml exec -T runtime \
  autonomy telemetry drain \
  --dir /data/wal \
  --endpoint http://otel-sink:4318

Expected output:

[drain] priority queue: 5 events (0 errors, 5 decisions, 0 lifecycle)
[drain] sending batch 1 (5 events)...
[drain] batch 1 accepted: 5 events
drain complete: 5 events sent

Verify events reached the control plane:

curl -s "http://localhost:8888/v1/events?event_type=ai.policy.decision&limit=10" | jq '{count: .count}'

Expected: {"count": 5}
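
For orientation, here is a sketch of the priority ordering reported in the drain output above. The error and lifecycle kind strings are assumptions (only autonomy.decision appears in this tutorial); the authoritative ordering lives in cmd/autonomy/commands/telemetry.go.

// Illustrative sketch of errors → decisions → lifecycle ordering.
package main

import (
	"fmt"
	"sort"
)

type event struct {
	Seq  uint64
	Kind string
}

// priorityClass maps an event kind to its drain class: errors first, then
// decisions, then lifecycle. The error/lifecycle kind strings are assumed.
func priorityClass(kind string) int {
	switch kind {
	case "autonomy.error":
		return 0
	case "autonomy.decision":
		return 1
	default:
		return 2 // lifecycle and anything else drains last
	}
}

// orderForDrain sorts a batch by priority class while keeping WAL sequence
// order within each class (stable sort).
func orderForDrain(events []event) {
	sort.SliceStable(events, func(i, j int) bool {
		return priorityClass(events[i].Kind) < priorityClass(events[j].Kind)
	})
}

func main() {
	batch := []event{
		{Seq: 1, Kind: "autonomy.lifecycle"},
		{Seq: 2, Kind: "autonomy.decision"},
		{Seq: 3, Kind: "autonomy.error"},
	}
	orderForDrain(batch)
	for _, e := range batch {
		fmt.Printf("seq=%d kind=%s\n", e.Seq, e.Kind)
	}
}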


Step 4: Drill — Forced Kill and WAL Recovery

Simulate a crash by killing the runtime process mid-operation:

# Start a tool call that will succeed:
curl -s -X POST http://localhost:7777/v1/tool \
  -H 'Content-Type: application/json' \
  -d '{"kind":"tool.echo","params":{"message":"pre-crash"}}' | jq .decision

# Kill the runtime container (SIGKILL):
docker compose -f demo/docker-compose.yml kill runtime

The WAL is durable — the pre-crash event was fsynced before the response was returned.

Restart the runtime:

docker compose -f demo/docker-compose.yml start runtime
sleep 3
curl -s http://localhost:7777/health | jq .

Expected:

{"status":"ok","mode":"normal"}

The runtime restarted cleanly. Internally, OpenWAL() ran the recovery logic:

  1. Read telemetry.safe_seq — the last durable sequence number

  2. Scan frames up to safe_seq, verifying sequence continuity

  3. Truncate the WAL to the safe-point boundary

  4. Re-open for appending
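
The sketch below walks those four steps under the formats described earlier (8-byte LE uint64 safe-point, 4-byte BE length-prefixed JSON frames) and uses the error names from Step 5. It is illustrative only; the authoritative recovery is telemetry/wal.go:OpenWAL().

// Illustrative sketch; the real recovery is telemetry/wal.go:OpenWAL().
package main

import (
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"os"
)

var (
	errFirstSeqNotOne  = errors.New("FIRST_SEQ_NOT_ONE")
	errSeqGap          = errors.New("SEQ_GAP")
	errInvalidJSON     = errors.New("WAL_CORRUPT_INVALID_JSON")
	errSafeSeqGtMaxSeq = errors.New("SAFESEQ_GT_MAXSEQ")
)

// readSafeSeq reads telemetry.safe_seq: an 8-byte little-endian uint64.
func readSafeSeq(path string) (uint64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err // a missing file is the SAFESEQ_NOT_FOUND case handled by the caller
	}
	if len(b) != 8 {
		return 0, errors.New("safe_seq file is not 8 bytes")
	}
	return binary.LittleEndian.Uint64(b), nil
}

// recoverOffset scans frames up to safeSeq, verifying sequence continuity, and
// returns the byte offset at which the WAL must be truncated before re-opening.
func recoverOffset(walPath string, safeSeq uint64) (int64, error) {
	f, err := os.Open(walPath)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	var offset int64
	var prev uint64
	for prev < safeSeq {
		var hdr [4]byte
		if _, err := io.ReadFull(f, hdr[:]); err != nil {
			if err == io.EOF {
				return 0, errSafeSeqGtMaxSeq // safe-point beyond the last frame
			}
			return 0, err
		}
		payload := make([]byte, binary.BigEndian.Uint32(hdr[:]))
		if _, err := io.ReadFull(f, payload); err != nil {
			return 0, err
		}
		var fr struct {
			Seq uint64 `json:"seq"`
		}
		if err := json.Unmarshal(payload, &fr); err != nil {
			return 0, errInvalidJSON
		}
		switch {
		case prev == 0 && fr.Seq != 1:
			return 0, errFirstSeqNotOne
		case prev != 0 && fr.Seq != prev+1:
			return 0, errSeqGap
		}
		prev = fr.Seq
		offset += int64(4 + len(payload))
	}
	return offset, nil
}

func main() {
	safeSeq, err := readSafeSeq("telemetry.safe_seq")
	if err != nil {
		panic(err) // fail hard: no silent skip
	}
	offset, err := recoverOffset("telemetry.wal", safeSeq)
	if err != nil {
		panic(err)
	}
	fmt.Printf("truncate WAL to %d bytes, then re-open for append\n", offset)
}

Anything after the safe-point (for example a torn frame from the kill above) is discarded by the truncation; anything at or before it must be intact, or the runtime refuses to start.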

No events were lost — the pre-crash event is still in the WAL. Verify:

docker compose -f demo/docker-compose.yml exec -T runtime \
  autonomy telemetry export --dir /data/wal --out - | python3 -c "
import sys, json
for line in sys.stdin:
    e = json.loads(line)
    print(f'seq={e[\"seq\"]} kind={e[\"event\"][\"kind\"]} written={e[\"written_at\"][:19]}')"

Step 5: WAL Fail-Hard Behavior on Corruption

The WAL is designed to fail-hard (not silently skip) on structural corruption. This prevents silent data loss in adversarial or storage-fault scenarios.

Error causes that trigger fail-hard:

Constant                   Trigger
SAFESEQ_NOT_FOUND          telemetry.safe_seq file missing (not first-run)
SEQ_GAP                    A frame’s Seq is not exactly prev_seq + 1
FIRST_SEQ_NOT_ONE          First frame Seq is not 1
WAL_CORRUPT_INVALID_JSON   Frame payload is not valid JSON
SAFESEQ_GT_MAXSEQ          Safe-point references a sequence that doesn’t exist in the WAL

Recovery environment variables (operators only):

  • AUTONOMYOPS_WAL_LEGACY_UPGRADE=1 — Allow first-run when safe_seq is missing (one-shot migration path for pre-safe-seq WALs)

  • AUTONOMYOPS_WAL_OPERATOR_RESET=1 — Allow reset when both WAL and safe_seq are deleted (manual disaster recovery only)

These are escape hatches, not normal operating modes. Evidence: telemetry/wal.go:legacyUpgradeEnvVar, operatorResetEnvVar
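
As a sketch of how such a gate might be evaluated: the env var names below match the evidence above, while the helper name and walExists parameter are assumptions for illustration.

// Illustrative sketch; only the env var names are taken from the tutorial.
package main

import (
	"errors"
	"fmt"
	"os"
)

const (
	legacyUpgradeEnvVar = "AUTONOMYOPS_WAL_LEGACY_UPGRADE"
	operatorResetEnvVar = "AUTONOMYOPS_WAL_OPERATOR_RESET"
)

var errSafeSeqNotFound = errors.New("SAFESEQ_NOT_FOUND")

// resolveMissingSafeSeq decides what a missing telemetry.safe_seq means: a
// permitted one-shot migration, a permitted operator reset (only when the WAL
// itself is also gone), or a fatal error.
func resolveMissingSafeSeq(walExists bool) error {
	switch {
	case os.Getenv(legacyUpgradeEnvVar) == "1":
		return nil // first-run / migration of a pre-safe-seq WAL
	case !walExists && os.Getenv(operatorResetEnvVar) == "1":
		return nil // manual disaster recovery from an empty directory
	default:
		return fmt.Errorf("%w: telemetry.safe_seq missing and %s not set",
			errSafeSeqNotFound, legacyUpgradeEnvVar)
	}
}

func main() {
	// With neither variable set this fails hard, matching the fatal log below.
	fmt.Println(resolveMissingSafeSeq(true))
}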

To observe fail-hard behavior:

# Simulate missing safe_seq file:
docker compose -f demo/docker-compose.yml stop runtime

# Find the WAL directory bind-mounted into the container:
ls demo/data/wal/

# Rename (not delete) the safe_seq file to simulate corruption:
mv demo/data/wal/telemetry.safe_seq demo/data/wal/telemetry.safe_seq.BAK

# Try to start runtime — it must fail:
docker compose -f demo/docker-compose.yml start runtime
sleep 2
docker compose -f demo/docker-compose.yml logs runtime | grep -i "wal\|safe_seq\|SAFESEQ"

Expected log output (runtime fails to start):

FATAL wal recovery failed: SAFESEQ_NOT_FOUND: telemetry.safe_seq missing and AUTONOMYOPS_WAL_LEGACY_UPGRADE not set

Restore and restart:

mv demo/data/wal/telemetry.safe_seq.BAK demo/data/wal/telemetry.safe_seq
docker compose -f demo/docker-compose.yml start runtime
sleep 3
curl -s http://localhost:7777/health | jq .status

Expected: "ok"


Step 6: Supply-Chain Tamper Detection (Drill 4)

This drill demonstrates that a tampered lock artifact is detected before policy is loaded — preventing the runtime from running with unverified binaries.

# The drill is scripted — run it directly:
bash demo/scripts/05_failure_drills.sh

Drill 4 within that script:

  1. Pulls the lock sidecar from the OCI registry

  2. Flips one hex nibble in agent_artifact.digest

  3. Re-attaches the tampered lock (without re-signing)

  4. Runs autonomy verify --require-lock — must fail with ErrDigestMismatch or signature validation error

  5. Enables AUTONOMY_STRICT_MODE=1 on the runtime container

  6. Confirms tool calls return "decision":"deny" in strict mode
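
For intuition, here is a minimal sketch of the digest comparison that makes the tampered lock fail in step 4. File names, the lock JSON shape, and this error value are assumptions; the authoritative check is autonomy verify --require-lock.

// Illustrative sketch only; not the verify command's actual code.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

var ErrDigestMismatch = errors.New("agent artifact digest does not match lock")

// lockFile models only the field Drill 4 tampers with.
type lockFile struct {
	AgentArtifact struct {
		Digest string `json:"digest"` // "sha256:<hex>"
	} `json:"agent_artifact"`
}

func verifyArtifact(artifactPath, lockPath string) error {
	raw, err := os.ReadFile(lockPath)
	if err != nil {
		return err
	}
	var lock lockFile
	if err := json.Unmarshal(raw, &lock); err != nil {
		return err
	}
	data, err := os.ReadFile(artifactPath)
	if err != nil {
		return err
	}
	sum := sha256.Sum256(data)
	got := "sha256:" + hex.EncodeToString(sum[:])
	if got != lock.AgentArtifact.Digest {
		// A single flipped hex nibble in the lock is enough to land here.
		return fmt.Errorf("%w: got %s want %s", ErrDigestMismatch, got, lock.AgentArtifact.Digest)
	}
	return nil
}

func main() {
	if err := verifyArtifact("agent.bundle", "lock.json"); err != nil {
		fmt.Println("verify failed:", err)
		os.Exit(1)
	}
	fmt.Println("verify ok")
}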

Expected drill output:

[DRILL PASS] supply-chain: tampered lock artifact rejected by verify
[DRILL PASS] strict mode: tool calls denied when verification fails

Strict mode behavior: When AUTONOMY_STRICT_MODE=1, the runtime starts with denyAllEvaluator{} regardless of loaded policy bundles, and returns "mode":"strict" in /health. This is the fail-closed posture. Evidence: cmd/autonomy/commands/runtime.go, runtime/server.go:handleHealth()
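
A minimal sketch of that fail-closed selection, assuming an illustrative evaluator interface and decision shape; the authoritative wiring is cmd/autonomy/commands/runtime.go and runtime/server.go.

// Illustrative sketch of the strict-mode fail-closed posture.
package main

import (
	"fmt"
	"os"
)

type decision struct {
	Decision string
	Reason   string
}

type evaluator interface {
	Evaluate(kind string, params map[string]any) decision
}

// denyAllEvaluator refuses every tool call, regardless of any loaded policy bundle.
type denyAllEvaluator struct{}

func (denyAllEvaluator) Evaluate(string, map[string]any) decision {
	return decision{Decision: "deny", Reason: "strict mode: verification failed"}
}

// chooseEvaluator fails closed: under AUTONOMY_STRICT_MODE=1 the policy
// evaluator is ignored entirely.
func chooseEvaluator(policyEval evaluator) evaluator {
	if os.Getenv("AUTONOMY_STRICT_MODE") == "1" {
		return denyAllEvaluator{}
	}
	return policyEval
}

func main() {
	os.Setenv("AUTONOMY_STRICT_MODE", "1")
	ev := chooseEvaluator(nil) // the real policy evaluator never matters in strict mode
	fmt.Println(ev.Evaluate("tool.shell", map[string]any{"command": "id"}).Decision) // deny
}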


Automated Version

The full WAL + drill suite:

make demo-up
make demo-run-unsigned    # load policy
make demo-offline-drain   # offline accumulation + priority drain
make demo-drills          # all 5 failure-injection drills

Troubleshooting

Symptom                                      Cause                                 Fix
WAL export shows 0 entries                   Runtime WAL dir is wrong              Check bind mount: docker compose exec runtime ls /data/wal/
Drain still fails after otel-sink restart    Sink not yet accepting connections    sleep 5 then retry
SAFESEQ_NOT_FOUND                            Missing telemetry.safe_seq            Restore backup or set AUTONOMYOPS_WAL_LEGACY_UPGRADE=1 for first-run
Events not appearing in control plane        OTel pipeline not connected           Check collector logs: docker compose logs otel-collector
count: 0 after drain                         Event type filter too narrow          Try GET /v1/events?limit=20 (no filter)


What Just Happened

  • Stopped the OTLP collector and confirmed that tool-call events accumulated durably in the WAL (fsynced before each response)

  • Confirmed that a failed drain attempt does NOT delete events from the WAL

  • Restarted the collector and drained 5 events in priority order (errors first)

  • Demonstrated WAL recovery after a forced kill (safe-point truncation)

  • Triggered WAL fail-hard by moving telemetry.safe_seq aside and observed the fatal error

  • Ran Drill 4 to show that a tampered supply chain is detected before policy load

Next Tutorial

Tutorial 04 — OS Replacement Survival and Mission Runtime Reconstruction