Offline-first Runbook: Buffer Then Drain

What you’re proving

Events are durably buffered when the collector or send path is down.
Drain is retried while the collector is unreachable, and buffered entries replay after reconnect.
autonomy telemetry drain advances the consumer cursor (telemetry.pos) only after successful delivery.
Offline sender failure does not delete buffered entries.
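
The cursor and retention guarantees above can be sketched in a few lines of Go. This is a minimal in-memory illustration, not the code in telemetry/wal.go or telemetry/exporter.go: the names Buffer, Sender, Drain, and offline are hypothetical, and Pos only stands in for the persisted telemetry.pos cursor.

```go
package main

import (
	"errors"
	"fmt"
)

// Sender delivers one buffered entry. Hypothetical type, for illustration only.
type Sender func(entry string) error

// Buffer is an in-memory stand-in for the WAL plus its consumer cursor.
type Buffer struct {
	Entries []string // buffered events, oldest first
	Pos     int      // index of the next undelivered entry (stand-in for telemetry.pos)
}

// Drain sends entries starting at Pos and advances Pos only after a
// successful send. On the first error it stops; nothing is deleted.
func (b *Buffer) Drain(send Sender) error {
	for b.Pos < len(b.Entries) {
		if err := send(b.Entries[b.Pos]); err != nil {
			return fmt.Errorf("drain stopped at entry %d: %w", b.Pos, err)
		}
		b.Pos++
	}
	return nil
}

var errOffline = errors.New("collector unreachable")

// offline simulates a sender whose collector is down.
func offline(string) error { return errOffline }

func main() {
	b := &Buffer{Entries: []string{"e1", "e2", "e3"}}

	// Collector down: the drain fails, the cursor stays put, entries remain.
	err := b.Drain(offline)
	fmt.Println("offline:", err, "pos:", b.Pos, "buffered:", len(b.Entries))

	// Collector back: entries replay in order and the cursor catches up.
	var delivered []string
	err = b.Drain(func(e string) error {
		delivered = append(delivered, e)
		return nil
	})
	fmt.Println("online:", err, "pos:", b.Pos, "delivered:", delivered)
}
```

Because the cursor moves only after a send returns without error, a crash between a successful send and the cursor update would resend that entry on the next drain. That is the at-least-once behavior this runbook assumes.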

Prereqs

Repo root: <repo-root>
Go toolchain available

Steps

Run the telemetry offline-first tests from the repo root.

GOCACHE=/tmp/go-build go test ./telemetry \
  -run 'TestWALSurvivesCollectorDown|TestWALReplayAfterReconnect|TestDrainWithOfflineSender' -v

Inspect the captured output (first 180 lines).

sed -n '1,180p' docs/_generated/test-outputs/offline-drain-output.txt

Expected output (from a real run)

=== RUN   TestWALSurvivesCollectorDown
--- PASS: TestWALSurvivesCollectorDown
=== RUN   TestWALReplayAfterReconnect
WARN telemetry/exporter: drain failed, will retry ...
--- PASS: TestWALReplayAfterReconnect
=== RUN   TestDrainWithOfflineSender
--- PASS: TestDrainWithOfflineSender
PASS

Verification

Exit code is 0.
Retry warnings appear during the replay test (TestWALReplayAfterReconnect).
All three tests pass.
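
If you want to check a captured log mechanically rather than by eye, a small helper along these lines could work. It is a sketch under assumptions: passed and missing are hypothetical helpers, not part of the repo, and the matching relies only on the `--- PASS: <name>` line format that `go test -v` prints.

```go
package main

import (
	"fmt"
	"strings"
)

// passed reports whether output contains a "--- PASS: <name>" line for
// exactly this test name (an optional timing suffix is allowed).
func passed(output, name string) bool {
	prefix := "--- PASS: " + name
	for _, line := range strings.Split(output, "\n") {
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, prefix) {
			continue
		}
		rest := line[len(prefix):]
		// Reject longer test names that merely share the prefix.
		if rest == "" || rest[0] == ' ' {
			return true
		}
	}
	return false
}

// missing returns the test names that have no PASS line in output.
func missing(output string, names []string) []string {
	var out []string
	for _, n := range names {
		if !passed(output, n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// Sample text mirrors the expected output in this runbook.
	sample := `=== RUN   TestWALSurvivesCollectorDown
--- PASS: TestWALSurvivesCollectorDown
=== RUN   TestWALReplayAfterReconnect
--- PASS: TestWALReplayAfterReconnect
=== RUN   TestDrainWithOfflineSender
--- PASS: TestDrainWithOfflineSender
PASS`
	names := []string{
		"TestWALSurvivesCollectorDown",
		"TestWALReplayAfterReconnect",
		"TestDrainWithOfflineSender",
	}
	fmt.Println("missing PASS lines:", missing(sample, names))
}
```

The check deliberately ignores how many retry warnings appear, since that count varies with timing.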

Failure modes

Go build cache permission errors: point GOCACHE at a writable directory, e.g. GOCACHE=/tmp/go-build.
Slow environments can add retry timing noise; verify by pass/fail status, not by the exact number of retries or backoff intervals.

Non-goals

This runbook does not prove control-plane push or orchestration.
This does not claim global exactly-once semantics across fleets; delivery is at-least-once and receivers must deduplicate by stable event identifiers.
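
Receiver-side deduplication under at-least-once delivery might look like the sketch below. The Event shape, the ID field, and the Deduper type are assumptions for illustration; this runbook does not specify the real event schema or the receiver implementation.

```go
package main

import "fmt"

// Event carries a stable identifier assigned by the producer.
// Field names are hypothetical.
type Event struct {
	ID      string
	Payload string
}

// Deduper remembers which event IDs have already been accepted.
// A real receiver would bound or persist this set; this sketch keeps it in memory.
type Deduper struct {
	seen map[string]struct{}
}

// NewDeduper returns an empty Deduper.
func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// Accept reports whether the event is new. Replayed events with an
// already-seen ID return false and should be dropped by the caller.
func (d *Deduper) Accept(e Event) bool {
	if _, dup := d.seen[e.ID]; dup {
		return false
	}
	d.seen[e.ID] = struct{}{}
	return true
}

func main() {
	d := NewDeduper()
	// After a reconnect, a replay may deliver evt-1 a second time.
	incoming := []Event{
		{ID: "evt-1", Payload: "boot"},
		{ID: "evt-2", Payload: "tick"},
		{ID: "evt-1", Payload: "boot"},
	}
	for _, e := range incoming {
		fmt.Println(e.ID, "accepted:", d.Accept(e))
	}
}
```

Keying on a producer-assigned ID rather than on payload contents keeps the check stable even when two distinct events carry identical data.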

Evidence

telemetry/wal.go
telemetry/exporter.go
telemetry/buffer_test.go (TestWALSurvivesCollectorDown, TestWALReplayAfterReconnect)
telemetry/integration_test.go (TestDrainWithOfflineSender)
docs/_generated/test-outputs/offline-drain-output.txt