Offline-first Runbook: Buffer Then Drain

What you’re proving

  • Events are durably buffered when the collector or send path is down.

  • Drain retries occur, then buffered entries replay after reconnect.

  • autonomy telemetry drain advances the consumer cursor (telemetry.pos) only after successful delivery.

  • Offline sender failure does not delete buffered entries.
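
The properties above can be illustrated with a minimal in-memory sketch. It is not the code in telemetry/wal.go or telemetry/exporter.go: the Buffer type, its pos field (standing in for telemetry.pos), and the Sender signature are hypothetical names chosen for illustration, and a real WAL would persist both entries and cursor to disk.

```go
package main

import (
	"errors"
	"fmt"
)

// Sender delivers one buffered entry; a non-nil error means "not delivered".
// (Hypothetical signature, not the project's API.)
type Sender func(entry string) error

// Buffer is a hypothetical in-memory stand-in for the on-disk WAL plus cursor.
type Buffer struct {
	entries []string // appended events; a failed drain never removes them
	pos     int      // consumer cursor: index of the next undelivered entry
}

// Append records an event whether or not the collector is reachable.
func (b *Buffer) Append(entry string) { b.entries = append(b.entries, entry) }

// Drain sends entries in order starting at the cursor and advances the cursor
// only after each successful send. It stops at the first failure and leaves
// the remaining entries in place for a later retry.
func (b *Buffer) Drain(send Sender) error {
	for b.pos < len(b.entries) {
		if err := send(b.entries[b.pos]); err != nil {
			return fmt.Errorf("drain stopped at pos %d: %w", b.pos, err)
		}
		b.pos++
	}
	return nil
}

// Pending reports how many entries are still awaiting delivery.
func (b *Buffer) Pending() int { return len(b.entries) - b.pos }

var errOffline = errors.New("collector unreachable")

func main() {
	b := &Buffer{}
	b.Append("e1")
	b.Append("e2")

	// Collector down: the drain fails, the cursor stays put, nothing is lost.
	offline := func(string) error { return errOffline }
	fmt.Println(b.Drain(offline) != nil, b.pos, b.Pending()) // true 0 2

	// Collector back: buffered entries replay in order and the cursor advances.
	var delivered []string
	online := func(e string) error { delivered = append(delivered, e); return nil }
	fmt.Println(b.Drain(online), b.pos, b.Pending(), delivered) // <nil> 2 0 [e1 e2]
}
```

Advancing the cursor after each send, not before, is what makes a failed drain safe to retry. The cost is that an entry can be sent twice if the process stops between a successful send and the cursor update.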

Prereqs

  • Repo root: <repo-root>

  • Go toolchain available

Steps

  1. Run telemetry offline-first tests.

GOCACHE=/tmp/go-build go test ./telemetry \
  -run 'TestWALSurvivesCollectorDown|TestWALReplayAfterReconnect|TestDrainWithOfflineSender' -v

  2. Inspect captured output.

sed -n '1,180p' docs/_generated/test-outputs/offline-drain-output.txt

Expected output (from a real run)

=== RUN   TestWALSurvivesCollectorDown
--- PASS: TestWALSurvivesCollectorDown
=== RUN   TestWALReplayAfterReconnect
WARN telemetry/exporter: drain failed, will retry ...
--- PASS: TestWALReplayAfterReconnect
=== RUN   TestDrainWithOfflineSender
--- PASS: TestDrainWithOfflineSender
PASS

Verification

  • Exit code is 0.

  • Retry warnings ("drain failed, will retry") appear during the replay test (TestWALReplayAfterReconnect).

  • All three tests pass.
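
The checks above can also be scripted against the captured output. The following is a minimal sketch, assuming the file contains verbose go test output like the sample shown earlier; missingPasses is a hypothetical helper, not part of the repo, and the exit code of go test remains the primary signal.

```go
package main

import (
	"fmt"
	"strings"
)

// missingPasses returns the names in want that have no "--- PASS: <name>"
// line in verbose go test output. Any trailing duration such as "(0.01s)"
// is ignored because only the third whitespace-separated field is compared.
func missingPasses(output string, want []string) []string {
	passed := make(map[string]bool)
	for _, line := range strings.Split(output, "\n") {
		f := strings.Fields(line)
		if len(f) >= 3 && f[0] == "---" && f[1] == "PASS:" {
			passed[f[2]] = true
		}
	}
	var missing []string
	for _, name := range want {
		if !passed[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Sample copied from the expected output in this runbook.
	sample := `=== RUN   TestWALSurvivesCollectorDown
--- PASS: TestWALSurvivesCollectorDown
=== RUN   TestWALReplayAfterReconnect
WARN telemetry/exporter: drain failed, will retry ...
--- PASS: TestWALReplayAfterReconnect
=== RUN   TestDrainWithOfflineSender
--- PASS: TestDrainWithOfflineSender
PASS`
	want := []string{
		"TestWALSurvivesCollectorDown",
		"TestWALReplayAfterReconnect",
		"TestDrainWithOfflineSender",
	}
	fmt.Println(len(missingPasses(sample, want)) == 0)                // true
	fmt.Println(strings.Contains(sample, "drain failed, will retry")) // true
}
```

To check a real run, read docs/_generated/test-outputs/offline-drain-output.txt with os.ReadFile and pass its contents in place of the sample.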

Failure modes

  • Go cache permission errors: point GOCACHE at a writable directory, as step 1 does with GOCACHE=/tmp/go-build.

  • A slow environment can add noise to retry timing; verify by pass/fail status, not by the exact number of backoff attempts.

Non-goals

  • This runbook does not prove control-plane push or orchestration.

  • This does not claim global exactly-once semantics across fleets; delivery is at-least-once and receivers must deduplicate by stable event identifiers.
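
Receiver-side deduplication under at-least-once delivery can be sketched as follows. This is an illustration only: the Event and Receiver types and the ID field are hypothetical, and the seen set here grows without bound, which a real receiver would have to cap or persist.

```go
package main

import "fmt"

// Event carries a stable identifier that stays the same across replays.
// (Hypothetical shape; the real event schema is not shown in this runbook.)
type Event struct {
	ID      string
	Payload string
}

// Receiver is a hypothetical collector-side handler that ignores IDs it has
// already applied, so replayed events have no additional effect.
type Receiver struct {
	seen    map[string]struct{}
	applied []Event
}

// NewReceiver returns a Receiver with an empty seen set.
func NewReceiver() *Receiver {
	return &Receiver{seen: make(map[string]struct{})}
}

// Accept applies ev the first time its ID is seen and reports true;
// it reports false and does nothing for a duplicate.
func (r *Receiver) Accept(ev Event) bool {
	if _, dup := r.seen[ev.ID]; dup {
		return false
	}
	r.seen[ev.ID] = struct{}{}
	r.applied = append(r.applied, ev)
	return true
}

func main() {
	r := NewReceiver()
	// A replay after a lost acknowledgement delivers evt-1 a second time.
	batch := []Event{{"evt-1", "a"}, {"evt-2", "b"}, {"evt-1", "a"}}
	for _, ev := range batch {
		fmt.Println(ev.ID, r.Accept(ev))
	}
	fmt.Println(len(r.applied))
	// prints:
	// evt-1 true
	// evt-2 true
	// evt-1 false
	// 2
}
```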

Evidence

  • telemetry/wal.go

  • telemetry/exporter.go

  • telemetry/buffer_test.go (TestWALSurvivesCollectorDown, TestWALReplayAfterReconnect)

  • telemetry/integration_test.go (TestDrainWithOfflineSender)

  • docs/_generated/test-outputs/offline-drain-output.txt