Compliance Audit — Workplan v1.6 §3 + Phase 1 Stop Conditions

Date: 2026-03-04

Auditor: automated (Claude Code)

Scope: telemetry/wal.go, all telemetry/*_test.go files, scripts/portability/, CI workflows

Verdict: COMPLIANT (Phase 1 stop conditions S1–S12 closed; prior WS-1..WS-4 addressed)

Closure provenance: WS-1..WS-4 were implemented in PR #37 (telemetry(portability): complete Phase 1 WAL contract and close WS-1 - WS-4).


1. Normative Clause Matrix

Each row maps one normative requirement from v1.6 §3 to the production code that enforces it, the tests that prove it, and the CI / evidence mechanism that captures it at scale.

§3.3 Frame Format & Structural Integrity

| Clause | Requirement | Enforcing Code | Tests | CI / Evidence |
| --- | --- | --- | --- | --- |
| §3.3 frame format | [4B big-endian uint32 len][JSON payload]; no newline delimiter | wal.go:289–292 (write), wal.go:333–337 (read), wal.go:587 (scan) | All ReadFrom_*, TestScanFramesWithSafe_* (structural) | Every go test ./... run; wal_verify.py R1–R4 checks each frame |
| §3.3 structural tail | Partial frame at EOF is a crash artifact; silently repaired by truncation | scanFramesWithSafe → errTruncatedTail (wal.go:581–604); OpenWAL truncation (wal.go:233–244) | TestScanFramesWithSafe_TruncatedTail_Header, TestScanFramesWithSafe_TruncatedTail_Payload, TestReadFrom_PartialTailFrame_StopsCleanly | portability-matrix step 2 (crash harness injects phantom then reopens) |
| §3.3 R1 (v1.x) | First frame Seq MUST equal 1 | scanFramesWithSafe R1 check (wal.go:607–614); causes FIRST_SEQ_NOT_ONE | TestScanFramesWithSafe_FirstSeqNotOne, TestOpenWAL_FirstSeqNotOne_FailHard | recovery_scan_test.go runs in test-go CI; harness TestKnownGoodWALFixture verifies R1 indirectly |
| §3.3 R2 | Seq values strictly contiguous in committed range; gaps beyond safe-point are phantoms (not errors) | scanFramesWithSafe R2 (wal.go:619–629); phantom-range early-exit (wal.go:622–625) | TestScanFramesWithSafe_SeqGap, TestOpenWAL_SeqGap_FailHard, TestScanFramesWithSafe_SeqGap_BeyondSafePoint | All 20 crash-harness iterations verify monotonic recovery |
| §3.3 R3 | safe_seq ≤ maxSeq | scanFramesWithSafe R3 post-scan check (wal.go:649–652); causes SAFESEQ_GT_MAXSEQ | TestScanFramesWithSafe_SafeSeqGtMaxSeq, TestOpenWAL_SafeSeqGtMaxSeq_FailHard, TestReorderBug_WouldBeDetected | portability-matrix; wal_verify.json R3_SAFESEQ_LE_MAXSEQ field |
| §3.3 R4 | Frame with Seq == safe_seq MUST exist when safe_seq > 0 | scanFramesWithSafe R4 check (wal.go:655–658); causes SAFESEQ_NOT_FOUND | TestScanFramesWithSafe_SafeSeqNotFound | wal_verify.json R4_SAFESEQ_FRAME_EXISTS field |
| §3.3 fail-hard log | Invariant violations logged as WAL_RECOVERY_FAIL_HARD with structured cause and dir | OpenWAL (wal.go:152–158): slog.Error("WAL_RECOVERY_FAIL_HARD", "cause", re.cause, ...) | All *_FailHard tests assert error type/cause; log output is to structured slog | autonomy.log in evidence dir captures slog output |

§3.4 Steady-State Detection & Policy

| Clause | Requirement | Enforcing Code | Tests | CI / Evidence |
| --- | --- | --- | --- | --- |
| §3.4 steady-state detection | steady = validBytes > 0 \|\| otherStateExists(dir) | otherStateExists (wal.go:511–522): checks lock.fingerprint, activation/, runtime.db | TestOpenWAL_SteadyState_WALHasFrames_SafeSeqMissing_FailHard, TestOpenWAL_SteadyState_OtherState_SafeSeqMissing_FailHard | openwal_steady_test.go runs in every test-go CI job |
| §3.4 telemetry.pos excluded | telemetry.pos MUST NOT be used for steady-state or recovery | wal.go:163: explicit comment; otherStateExists does NOT check posFileName | Implicit: TestOpenWAL_FirstRun_SafeSeqMissing_OK passes with only .pos present if steady=false | S11 stop condition checked; verified in operator reset tests |
| §3.4 first-run | No safe_seq + no other state → OK (no error, no warning) | OpenWAL switch case !steady: (wal.go:168–170): silent pass | TestOpenWAL_FirstRun_SafeSeqMissing_OK | First make portability-matrix run starts from a fresh dir |
| §3.4 steady-state fail-hard | Steady + no safe_seq + no override → fmt.Errorf(...) | OpenWAL default case (wal.go:199–208) | TestOpenWAL_SteadyState_WALHasFrames_SafeSeqMissing_FailHard, TestOpenWAL_SteadyState_OtherState_SafeSeqMissing_FailHard | test-go CI job |
| §3.4 legacy upgrade | AUTONOMYOPS_WAL_LEGACY_UPGRADE=1 + steady → WARN + OK | OpenWAL (wal.go:172–178): slog.Warn(...) with hint to unset | TestOpenWAL_LegacyUpgrade_SafeSeqMissing_OK (openwal_steady_test.go), TestOpenWAL_SafeSeqZero_LegacyUpgrade_TruncatesStructuralTail | S7 stop condition closed |
| §3.4.1 operator reset | AUTONOMYOPS_WAL_OPERATOR_RESET=1 + empty WAL + no safe_seq → WARN WAL_RESET_BY_OPERATOR + OK | OpenWAL (wal.go:180–197): slog.Warn("WAL_RESET_BY_OPERATOR", ...) with activation_invariants note | TestOperatorReset_EmitsMarkerAndStartsClean (asserts slog msg via captureSlogHandler), TestOperatorReset_WithoutEnvVar_FailsHard, TestOperatorReset_WalHasFrames_IsNotReset | S12 stop condition closed; slog capture is tested |
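
The §3.4 policy for a missing safe_seq reads as a four-way decision, and a small sketch may make the precedence easier to review. The case order follows the wal.go line ranges cited in the table (first-run, legacy upgrade, operator reset, default). The openState type, the decide function, the error value, and the "WAL_LEGACY_UPGRADE" marker string are all hypothetical stand-ins; only WAL_RESET_BY_OPERATOR and the two environment variable names come from the audit.

```go
package main

import (
	"errors"
	"fmt"
)

// openState gathers the inputs the §3.4 table names. Field names are
// illustrative, not the real OpenWAL locals.
type openState struct {
	validBytes    int64 // bytes of structurally valid frames in the WAL
	otherState    bool  // result of an otherStateExists-style check
	safeSeqFound  bool  // safe_seq file present and well-formed
	legacyUpgrade bool  // AUTONOMYOPS_WAL_LEGACY_UPGRADE=1
	operatorReset bool  // AUTONOMYOPS_WAL_OPERATOR_RESET=1
}

var errSteadyNoSafeSeq = errors.New("steady state but safe_seq missing")

// decide returns a warning marker ("" means silent pass) or a fail-hard error.
func decide(s openState) (string, error) {
	if s.safeSeqFound {
		return "", nil // normal recovery path, outside this policy
	}
	steady := s.validBytes > 0 || s.otherState
	switch {
	case !steady:
		return "", nil // first run: no error, no warning
	case s.legacyUpgrade:
		return "WAL_LEGACY_UPGRADE", nil // hypothetical marker name
	case s.operatorReset && s.validBytes == 0:
		return "WAL_RESET_BY_OPERATOR", nil
	default:
		return "", errSteadyNoSafeSeq
	}
}

func main() {
	cases := []openState{
		{},                                        // fresh directory
		{validBytes: 128},                         // frames but no safe_seq
		{otherState: true, legacyUpgrade: true},   // one-shot migration
		{otherState: true, operatorReset: true},   // empty WAL, operator reset
		{validBytes: 128, operatorReset: true},    // reset refused: WAL has frames
	}
	for _, c := range cases {
		marker, err := decide(c)
		fmt.Printf("marker=%q err=%v\n", marker, err)
	}
}
```

The last case reflects TestOperatorReset_WalHasFrames_IsNotReset: under the sketch's reading, the reset override applies only when the WAL holds no frames.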

§3.5 Option B — Safe-Point Mechanism

| Clause | Requirement | Enforcing Code | Tests | CI / Evidence |
| --- | --- | --- | --- | --- |
| §3.5 encoding | safe_seq = 8-byte little-endian uint64 (NOT big-endian) | encodeSafeSeq (wal.go:398–402): binary.LittleEndian.PutUint64; decodeSafeSeq (wal.go:407–412) | TestEncodeSafeSeq_RoundTrip, TestDecodeSafeSeq_WrongLength_Errors, TestSafeSeq_DiskEncoding_LittleEndian (explicitly verifies bytes[0]=seq, bytes[7]=0) | wal_verify.py reads with struct.unpack("<Q") (LE) matching the Go encoding |
| §3.5 corrupt safe_seq | Wrong-length file → fail-hard (never silently tolerated) | readSafeSeq (wal.go:483–498): decodeSafeSeq returns error on len != 8; surfaced as fmt.Errorf(...) from OpenWAL:139 | TestReadSafeSeq_Corrupt_WrongLength (safe_seq_test.go), TestReadSafeSeq_Missing_ReturnsNotFound | S4 stop condition closed |
| §3.5 temp-then-rename atomicity | writeSafeSeq writes safe_seq.tmp, fsyncs it, then renames to safe_seq | writeSafeSeq (wal.go:439–470): steps 1–4 explicit; temp and target MUST be in same dir | TestWriteSafeSeq_AtomicRename_OldValuePersistedIfTmpExists (safe_seq_test.go) | Core matrix step 4 (atomic rename check); TestCrash_SafeSeqTmpLeftover_OpenWAL_Succeeds |
| §3.5 MUST ordering step 1 | WAL fsync MUST complete before writeSafeSeq is called | Append (wal.go:300–306): w.file.Sync() then w.writeSafeSeq(e.Seq) in strict sequence | TestCrash_BetweenWALFsyncAndSafeSeqWrite (S3 closure: phantom discarded on reopen) | S8 stop condition closed (see WS-1 below re: ordering coverage) |
| §3.5 MUST ordering step 2 | tmp write → tmp fsync | writeSafeSeq (wal.go:452–455): tmp.Sync() before Close() | TestWriteSafeSeq_AtomicRename_OldValuePersistedIfTmpExists (proves write+fsync completed) | Implicit in every crash harness iteration |
| §3.5 MUST ordering step 3 | tmp fsync → rename | writeSafeSeq (wal.go:461–464): os.Rename(tmpPath, target) after tmp.Close() | Atomic rename test in core matrix | core_matrix.sh step 4 |
| §3.5 monotonicity | safe_seq only ever increases; never written with lower value | writeSafeSeq called with e.Seq which is monotonically incremented by w.seq.Add(1) | TestSafeSeq_Monotonic_AcrossAppends, TestSafeSeq_NeverExceedsMaxSeq_AfterPhantomFrame | Crash harness invariant: safe_seq_on_disk == entries_committed |
| §3.5 safe_seq after recovery | On reopened WAL, w.seq.Store(maxSeq) ensures next Append continues from maxSeq+1 | OpenWAL (wal.go:265): w.seq.Store(maxSeq); maxSeq is the post-truncation committed seq | TestAppend_SafeSeqPersistsAcrossReopen, TestCrashHarness_Randomized (each iter reopens after phantom injection) | Crash harness per-iteration assertion |
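
The encoding row is easy to get wrong because the frame header is big-endian while safe_seq is little-endian. The following is a minimal sketch of the encode/decode pair as the table describes it (8 bytes, little-endian, wrong length is an error). The function names match those cited in the audit, but the bodies and the error text are assumptions rather than the wal.go source.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeSafeSeq renders seq as 8 little-endian bytes.
func encodeSafeSeq(seq uint64) []byte {
	b := make([]byte, 8)
	binary.LittleEndian.PutUint64(b, seq)
	return b
}

// decodeSafeSeq rejects any input that is not exactly 8 bytes, so a short or
// padded file surfaces as an error instead of a silently wrong value.
func decodeSafeSeq(b []byte) (uint64, error) {
	if len(b) != 8 {
		return 0, fmt.Errorf("safe_seq: want 8 bytes, got %d", len(b))
	}
	return binary.LittleEndian.Uint64(b), nil
}

func main() {
	b := encodeSafeSeq(3)
	fmt.Println(b) // least significant byte first: [3 0 0 0 0 0 0 0]

	if _, err := decodeSafeSeq(b[:5]); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

The byte layout matches what TestSafeSeq_DiskEncoding_LittleEndian is described as checking (bytes[0] carries the low byte, bytes[7] is zero for small values).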

§3.6 Directory Fsync Policy

| Clause | Requirement | Enforcing Code | Tests | CI / Evidence |
| --- | --- | --- | --- | --- |
| §3.6 dir fsync after rename | Parent directory MUST be fsynced after safe_seq rename to guarantee rename durability | fsyncDir (wal.go:417–427): opens dir, calls d.Sync(), closes; called from writeSafeSeq (wal.go:465–468) | TestSafeSeq_DiskEncoding_LittleEndian (reads back value, proves fsync completed); TestCrash_SafeSeqTmpLeftover_OpenWAL_Succeeds (proves dir fsync doesn’t interfere with tmp) | S9 stop condition closed; fsyncDir called in every writeSafeSeq invocation |
| §3.6 NOT after WAL append | No dir fsync after w.file.Write; only w.file.Sync() (file fsync, not dir) | Append (wal.go:300): only w.file.Sync(); no fsyncDir call in append path | Implicit: append path does not call fsyncDir | Correct by code inspection; no superfluous dir fsyncs measured |

§3.5 / §3.4 ReadFrom Correctness

| Clause | Requirement | Enforcing Code | Tests | CI / Evidence |
| --- | --- | --- | --- | --- |
| R5 (no phantoms) | After OpenWAL truncation, ReadFrom never returns entries with Seq > safe_seq | Truncation (wal.go:240–244) removes phantom bytes before ReadFrom is ever called | TestOpenWAL_TruncatesToSafePoint_NotStructuralBoundary (verifies file size = bytesAtSafe), TestCrash_BetweenWALFsyncAndSafeSeqWrite | Crash harness: every iteration calls ReadFrom(0) and asserts len == entries_committed |
| §3.5 corrupt JSON fail-hard | Structurally valid frame with unparseable JSON in committed range → WAL_CORRUPT_INVALID_JSON error (never silently skipped) | ReadFrom (wal.go:346–357): json.Unmarshal failure → return nil, fmt.Errorf("... WAL_CORRUPT_INVALID_JSON ...") | TestReadFrom_MalformedJSON_FailHard | S5 stop condition closed |
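
R5 depends on truncating at the safe point rather than at the last structurally complete frame. A minimal in-memory sketch of that idea follows. It assumes a JSON payload with a "seq" field (the real entry schema is not shown in this audit), handles only safe_seq > 0, and uses hypothetical helpers frame and truncateToSafePoint.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// entry is an assumed payload shape; only the Seq field matters here.
type entry struct {
	Seq uint64 `json:"seq"`
}

// frame builds one [4B big-endian len][JSON] frame for the given seq.
func frame(seq uint64) []byte {
	p, _ := json.Marshal(entry{Seq: seq})
	out := make([]byte, 4, 4+len(p))
	binary.BigEndian.PutUint32(out, uint32(len(p)))
	return append(out, p...)
}

// truncateToSafePoint returns the prefix of wal that ends with the frame whose
// Seq equals safeSeq. ok is false when no such frame is found.
func truncateToSafePoint(wal []byte, safeSeq uint64) (committed []byte, ok bool) {
	off := 0
	for off+4 <= len(wal) {
		n := int(binary.BigEndian.Uint32(wal[off:]))
		end := off + 4 + n
		if end > len(wal) {
			break // structural tail: partial frame at EOF
		}
		var e entry
		if err := json.Unmarshal(wal[off+4:end], &e); err != nil {
			return nil, false // corrupt payload is not handled in this sketch
		}
		off = end
		if e.Seq == safeSeq {
			return wal[:off], true // everything after this is a phantom
		}
	}
	return nil, false
}

func main() {
	// Three complete frames on disk, but only two were committed.
	wal := append(append(frame(1), frame(2)...), frame(3)...)
	committed, ok := truncateToSafePoint(wal, 2)
	fmt.Println(ok, len(wal), len(committed))
}
```

Frame 3 is structurally valid yet is dropped, which is the difference between a safe-point boundary and a structural boundary that S2 calls out.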


2. Phase 1 Stop Conditions Verification

| ID | Condition | Status | Test | Code Location |
| --- | --- | --- | --- | --- |
| S1 | safe_seq written after every Append | ✅ CLOSED | TestAppend_SafePointAdvancesPerEntry | Append (wal.go:306): w.writeSafeSeq(e.Seq) |
| S2 | Recovery truncates to safe-point, not structural boundary | ✅ CLOSED | TestOpenWAL_TruncatesToSafePoint_NotStructuralBoundary | OpenWAL (wal.go:228–231): truncateTo = bytesAtSafe |
| S3 | No phantom entries survive recovery | ✅ CLOSED | TestCrash_BetweenWALFsyncAndSafeSeqWrite, TestCrashHarness_Randomized (20 iterations × seeded) | Truncation before open-for-append |
| S4 | Corrupt safe_seq → fail-hard | ✅ CLOSED | TestReadSafeSeq_Corrupt_WrongLength | readSafeSeq + decodeSafeSeq |
| S5 | Malformed JSON in valid frame → ReadFrom error | ✅ CLOSED | TestReadFrom_MalformedJSON_FailHard | ReadFrom (wal.go:346–357) |
| S6 | Seq gap in committed range → OpenWAL error | ✅ CLOSED | TestOpenWAL_SeqGap_FailHard, TestScanFramesWithSafe_SeqGap | scanFramesWithSafe R2 |
| S7 | AUTONOMYOPS_WAL_LEGACY_UPGRADE=1 → WARN + ok | ✅ CLOSED | TestOpenWAL_LegacyUpgrade_SafeSeqMissing_OK | OpenWAL (wal.go:172–178) |
| S8 | WAL fsync BEFORE safe_seq rename | ✅ CLOSED* | TestCrash_BetweenWALFsyncAndSafeSeqWrite (indirect) | Append sequential code (wal.go:300, then wal.go:306) |
| S9 | Dir fsync after safe_seq rename | ✅ CLOSED | TestSafeSeq_DiskEncoding_LittleEndian (indirect), code inspection | writeSafeSeq (wal.go:465–468) |
| S10 | All pre-existing wal_test.go tests pass unmodified | ✅ CLOSED | 82/82 tests pass | Entire test suite |
| S11 | telemetry.pos unchanged (consumer marker only) | ✅ CLOSED | Implicit; otherStateExists excludes posFileName | wal.go:163, wal.go:366–382 |
| S12 | Operator reset: WAL_RESET_BY_OPERATOR emitted + Seq=1 | ✅ CLOSED | TestOperatorReset_EmitsMarkerAndStartsClean | OpenWAL (wal.go:192–197) |

* The S8 row lists only the indirect test; see WS-1 below for the direct ordering test that was added at closure.


3. Weak Spots & Gaps (Status)

Ordered by severity: P1 = blocking for release; P2 = should fix before Phase 2 launch; P3 = documentation or minor.

WS-1 (P2) — S8 ordering is not directly asserted at the unit level (CLOSED)

What: S8 requires WAL fsync to complete before writeSafeSeq is called. The code is correct (wal.go:300 before wal.go:306) but the only test covering this is indirect: TestCrash_BetweenWALFsyncAndSafeSeqWrite proves that a crash at the S3 window leaves the right recovery state; it does NOT assert that w.file.Sync() returns before writeSafeSeq starts.

Risk: A future refactor that reorders or batches these calls (e.g., goroutine-based safe-point batching in Phase 2) could silently violate the MUST ordering.

Resolution: Added test-only hooks in WAL (onWalSynced, onSafeSeqAdvanced) and direct ordering test TestAppend_Ordering_WALSyncBeforeSafeSeq asserting exact order ["sync","safe"].
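
The shape of such a hook-based ordering check can be sketched with a toy type. The hook names onWalSynced and onSafeSeqAdvanced come from the resolution above; their signatures, the stand-in wal struct, and the recordOrder helper are assumptions for illustration and do not reproduce the real Append.

```go
package main

import "fmt"

// wal is a toy stand-in with two optional test-only hooks.
type wal struct {
	onWalSynced       func() // fired after the WAL file fsync returns
	onSafeSeqAdvanced func() // fired after safe_seq has been written
}

// syncFile stands in for w.file.Sync().
func (w *wal) syncFile() error {
	if w.onWalSynced != nil {
		w.onWalSynced()
	}
	return nil
}

// writeSafeSeq stands in for the temp-then-rename safe_seq update.
func (w *wal) writeSafeSeq(seq uint64) error {
	if w.onSafeSeqAdvanced != nil {
		w.onSafeSeqAdvanced()
	}
	return nil
}

// Append keeps the MUST ordering: fsync first, then advance the safe point.
func (w *wal) Append(seq uint64) error {
	if err := w.syncFile(); err != nil {
		return err
	}
	return w.writeSafeSeq(seq)
}

// recordOrder runs one Append and returns the observed hook order.
func recordOrder() []string {
	var order []string
	w := &wal{
		onWalSynced:       func() { order = append(order, "sync") },
		onSafeSeqAdvanced: func() { order = append(order, "safe") },
	}
	_ = w.Append(1)
	return order
}

func main() {
	fmt.Println(recordOrder())
}
```

A test in this style fails as soon as a refactor swaps the two calls or moves one of them to another goroutine without synchronization, which is the regression the risk statement describes.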


WS-2 (P2) — scanFramesWithSafe misidentifies root cause when corrupt JSON appears before safe_seq (CLOSED)

What: If a frame with an invalid JSON payload appears within the committed range (Seq ≤ safe_seq), scanFramesWithSafe treats it as errTruncatedTail and breaks the scan loop (wal.go:600–604). After the loop, maxSeq < safeSeq, which triggers SAFESEQ_GT_MAXSEQ even though the actual fault is a corrupt frame in the committed range.

Trace:

WAL: [Seq=1 OK][Seq=2 BAD_JSON][Seq=3][safe_seq=3]
scan: reads Seq=1 → ok. reads Seq=2 → bad JSON → err=errTruncatedTail, break
post-scan: safeSeq=3, maxSeq=1, safeSeq > maxSeq → SAFESEQ_GT_MAXSEQ

The error message says "safe_seq=3 > maxSeq=1" which obscures the true cause (corrupt frame 2).

Risk: Operator confusion during incident response. The operator might incorrectly diagnose a safe_seq file problem when the actual issue is storage corruption.

Resolution: Scanner now fail-hards immediately on malformed JSON in a structurally complete frame with cause WAL_CORRUPT_INVALID_JSON, before SAFESEQ_GT_MAXSEQ post-checks. Added TestScanFramesWithSafe_CorruptJSON_FailHardSpecificCause.


WS-3 (P3) — telemetry.safe_seq.tmp leftover never proactively cleaned (CLOSED)

What: If writeSafeSeq crashes between os.OpenFile(tmp) and os.Rename(tmp, target), the .tmp file is left on disk and persists across reboots. TestCrash_SafeSeqTmpLeftover_OpenWAL_Succeeds proves this is safe (the stale .tmp does not corrupt recovery). However, the .tmp file is never cleaned up.

Risk: Minor: disk clutter in long-running deployments; potential confusion for operators checking the WAL directory.

Resolution: OpenWAL now best-effort removes safe_seq.tmp only when canonical safe_seq exists. Added tests:

  • TestOpenWAL_RemovesTmpWhenCanonicalExists

  • TestOpenWAL_LeavesTmpWhenCanonicalMissing


WS-4 (P3) — otherStateExists extension protocol not documented; implicit closed list (CLOSED)

What: otherStateExists checks exactly 3 paths: lock.fingerprint, activation/, runtime.db. If a future component adds a new significant runtime state file (e.g., policy.db, activation.lock), the function must be manually updated or the steady-state guarantee weakens (first-run detection becomes incorrect for that state).

Risk: Silent regression. Once a new state file is added without updating the function, a WAL could be mistakenly opened in first-run mode during a recovery, so a missing safe_seq would not be detected as a steady-state violation.

Resolution: Added extension-protocol guidance directly above otherStateExists in wal.go, explicitly requiring new persisted runtime state checks to be registered there and reiterating that telemetry.pos is excluded.


4. Non-Blocking Observations (Informational)

These are not weak spots but are worth noting for review context.

OBS-1: safe_seq=0 is a valid committed value. readSafeSeq explicitly documents this (wal.go:482: “Zero (seq=0) is a valid committed safe-point”). The distinction between “found=false” (missing file) and “found=true, seq=0” is correctly preserved. Tests: TestReadSafeSeq_ZeroSeq_IsValid.

OBS-2: R3 guard has an intentional asymmetry. wal.go:649: if safeSeq > 0 && maxSeq > 0 && safeSeq > maxSeq. When maxSeq == 0 (empty WAL) and safeSeq > 0, the R3 check is skipped; R4 (bytesAtSafe == 0) fires instead. This is correct: an empty WAL with a non-zero safe_seq is best reported as SAFESEQ_NOT_FOUND (the frame doesn’t exist), not SAFESEQ_GT_MAXSEQ (a reorder bug). Documented in TestScanFramesWithSafe_SafeSeqNotFound.
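
The asymmetry in OBS-2 is easiest to see as a pair of post-scan checks. The R3 condition is copied from the expression quoted above; the R4 condition follows the "bytesAtSafe == 0" description. The function name checkSafePoint and the string return type are hypothetical.

```go
package main

import "fmt"

// checkSafePoint applies R3 then R4 and returns a cause, or "" when both hold.
func checkSafePoint(safeSeq, maxSeq uint64, bytesAtSafe int64) string {
	// R3: skipped on an empty WAL (maxSeq == 0) so that R4 reports instead.
	if safeSeq > 0 && maxSeq > 0 && safeSeq > maxSeq {
		return "SAFESEQ_GT_MAXSEQ"
	}
	// R4: a non-zero safe_seq must point at a frame that was actually seen.
	if safeSeq > 0 && bytesAtSafe == 0 {
		return "SAFESEQ_NOT_FOUND"
	}
	return ""
}

func main() {
	fmt.Println(checkSafePoint(3, 0, 0))  // empty WAL, safe_seq=3
	fmt.Println(checkSafePoint(3, 1, 0))  // WAL ends at Seq=1, safe_seq=3
	fmt.Println(checkSafePoint(2, 3, 26)) // safe_seq frame found, one phantom after it
}
```

In this sketch the empty-WAL case yields SAFESEQ_NOT_FOUND and the short-WAL case yields SAFESEQ_GT_MAXSEQ, matching the reporting preference described in OBS-2.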

OBS-3: Append increments seq before write. If Write or Sync fails, seq was already incremented but no frame was written. The next successful Append would produce a seq gap. However, OpenWAL resets w.seq.Store(maxSeq) on reopen, so the gap only persists within a single crashed WAL instance. Standard practice is to close and reopen after any failed Append. Not a new concern; pre-existing behavior.
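
The gap described in OBS-3 can be reproduced with a toy model. It assumes the counter is an atomic.Uint64 (consistent with the w.seq.Add(1) and w.seq.Store(maxSeq) calls cited in this audit, and requiring Go 1.19 or later); the wal struct, the failNext switch, and the written slice are hypothetical stand-ins for the real file I/O.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// wal is a toy model: written records the Seq of each frame that reached disk.
type wal struct {
	seq      atomic.Uint64
	failNext bool     // when set, the next write fails once
	written  []uint64 // stands in for frames in the WAL file
}

// Append increments seq before attempting the write, as OBS-3 describes.
func (w *wal) Append() error {
	seq := w.seq.Add(1)
	if w.failNext {
		w.failNext = false
		return errors.New("simulated write failure")
	}
	w.written = append(w.written, seq)
	return nil
}

func main() {
	w := &wal{}
	_ = w.Append() // Seq=1 written
	w.failNext = true
	fmt.Println("failed append:", w.Append()) // Seq=2 consumed, nothing written
	_ = w.Append()                            // Seq=3 written, leaving a gap
	fmt.Println("same instance:", w.written)

	// Reopening resets the counter to the highest Seq that is actually on disk.
	reopened := &wal{written: []uint64{1}}
	reopened.seq.Store(1)
	_ = reopened.Append()
	fmt.Println("after reopen:", reopened.written)
}
```

The model shows why continuing to use an instance after a failed Append is the risky path, and why closing and reopening restores contiguity.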

OBS-4: legacyUpgradeEnvVar performs no gap check on the legacy WAL. When AUTONOMYOPS_WAL_LEGACY_UPGRADE=1, all frames up to validBytes are treated as committed with safePoint=0; gap checking is not performed on pre-Phase-1 WAL content. A genuinely gapped legacy WAL would be accepted. This is acceptable for Phase 1 (the env var is a one-shot migration escape hatch), but Phase 2 should consider adding a WARNING-level gap scan on the legacy path.

OBS-5: WAL rotation not implemented. maxSize = 64 MiB is tracked but no rotation logic exists. If exceeded, writes continue. This is documented as deferred to Phase 2. No compliance impact for Phase 1.


5. Proposed Follow-Up Actions (Minimal, PR-ready)

| ID | Priority | Type | Action | Estimated Size |
| --- | --- | --- | --- | --- |
| FU-1 | P2 | Test | Direct Append ordering assertion (sync then safe) | CLOSED |
| FU-2 | P2 | Code | Specific scanner cause for malformed JSON (WAL_CORRUPT_INVALID_JSON) | CLOSED |
| FU-3 | P3 | Code | OpenWAL conditional stale safe_seq.tmp cleanup | CLOSED |
| FU-4 | P3 | Docs | Extension protocol note above otherStateExists | CLOSED |

FU-1 through FU-4 are implemented; no remaining weak-spot follow-ups are required for Phase 1 closure.


6. CI Coverage Summary

| Coverage Dimension | Mechanism | Gap? |
| --- | --- | --- |
| Unit correctness | go test ./... in ci.yml → go-test job | None |
| Race detector | go test -race in ci.yml → go-race job | None |
| Randomized crash recovery | TestCrashHarness_Randomized (20 iter, seeded) via portability-crash-harness | None |
| Cross-arch (amd64, arm64) | portability-nightly.yml jobs | riscv64 is QEMU / informational only |
| Cross-FS (ext4, xfs) | portability-nightly.yml loop device job | xfs requires root; informational on non-root CI |
| Evidence artifact | evidence/<arch>/<fs>/run-N/wal_verify.json | Correct; schema documents all invariants |
| Release gate | portability-nightly.yml → release-gate job aggregates wal_verify.json pass fields | Only gates on tag push; nightly is advisory |


This document was generated from static analysis of telemetry/wal.go and telemetry/*_test.go. Last verified against commit state 2026-03-04.