Compliance Audit — Workplan v1.6 §3 + Phase 1 Stop Conditions

Date: 2026-03-04
Auditor: automated (Claude Code)
Scope: telemetry/wal.go, all telemetry/*_test.go files, scripts/portability/, CI workflows
Verdict: COMPLIANT (Phase 1 stop conditions S1–S12 closed; prior WS-1..WS-4 addressed)
Closure provenance: WS-1..WS-4 were implemented in PR #37 (telemetry(portability): complete Phase 1 WAL contract and close WS-1 - WS-4).

1. Normative Clause Matrix

Each row maps one normative requirement from v1.6 §3 to the production code that enforces it, the tests that prove it, and the CI / evidence mechanism that captures it at scale.

§3.3 Frame Format & Structural Integrity

Clause	Requirement	Enforcing Code	Tests	CI / Evidence
§3.3 frame format	`[4B big-endian uint32 len][JSON payload]`; no newline delimiter	`wal.go:289–292` (write), `wal.go:333–337` (read), `wal.go:587` (scan)	All `TestReadFrom_*` and `TestScanFramesWithSafe_*` tests (structural)	Every `go test ./...` run; `wal_verify.py` R1–R4 checks each frame
§3.3 structural tail	Partial frame at EOF is a crash artifact; silently repaired by truncation	`scanFramesWithSafe` → `errTruncatedTail` (`wal.go:581–604`); `OpenWAL` truncation (`wal.go:233–244`)	`TestScanFramesWithSafe_TruncatedTail_Header`, `TestScanFramesWithSafe_TruncatedTail_Payload`, `TestReadFrom_PartialTailFrame_StopsCleanly`	`portability-matrix` step 2 (crash harness injects phantom then reopens)
§3.3 R1 (v1.x)	First frame `Seq` MUST equal 1	`scanFramesWithSafe` R1 check (`wal.go:607–614`); causes `FIRST_SEQ_NOT_ONE`	`TestScanFramesWithSafe_FirstSeqNotOne`, `TestOpenWAL_FirstSeqNotOne_FailHard`	`recovery_scan_test.go` runs in `test-go` CI; harness `TestKnownGoodWALFixture` verifies R1 indirectly
§3.3 R2	`Seq` values strictly contiguous in committed range; gaps beyond safe-point are phantoms (not errors)	`scanFramesWithSafe` R2 (`wal.go:619–629`); phantom-range early-exit (`wal.go:622–625`)	`TestScanFramesWithSafe_SeqGap`, `TestOpenWAL_SeqGap_FailHard`, `TestScanFramesWithSafe_SeqGap_BeyondSafePoint`	All 20 crash-harness iterations verify monotonic recovery
§3.3 R3	`safe_seq` ≤ `maxSeq`	`scanFramesWithSafe` R3 post-scan check (`wal.go:649–652`); causes `SAFESEQ_GT_MAXSEQ`	`TestScanFramesWithSafe_SafeSeqGtMaxSeq`, `TestOpenWAL_SafeSeqGtMaxSeq_FailHard`, `TestReorderBug_WouldBeDetected`	`portability-matrix`; `wal_verify.json` R3_SAFESEQ_LE_MAXSEQ field
§3.3 R4	Frame with `Seq == safe_seq` MUST exist when `safe_seq > 0`	`scanFramesWithSafe` R4 check (`wal.go:655–658`); causes `SAFESEQ_NOT_FOUND`	`TestScanFramesWithSafe_SafeSeqNotFound`	`wal_verify.json` R4_SAFESEQ_FRAME_EXISTS field
§3.3 fail-hard log	Invariant violations logged as `WAL_RECOVERY_FAIL_HARD` with structured `cause` and `dir`	`OpenWAL` (`wal.go:152–158`): `slog.Error("WAL_RECOVERY_FAIL_HARD", "cause", re.cause, ...)`	All `*_FailHard` tests assert error type/cause; log output is to structured slog	`autonomy.log` in evidence dir captures slog output
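
The frame layout and the structural-tail rule in the table above can be illustrated with a small sketch. This is not the code in wal.go: the entry type, the encode and scanAll helpers, and the definition of errTruncatedTail are assumptions made for illustration, and a real reader would also bound the length header before allocating.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

// entry is a hypothetical payload; the real WAL entry type is not shown in this audit.
type entry struct {
	Seq uint64 `json:"seq"`
}

// errTruncatedTail reuses the name from the audit; this definition is assumed.
var errTruncatedTail = errors.New("truncated tail")

// encode writes one [4B big-endian uint32 len][JSON payload] frame per seq,
// with no newline delimiter between frames.
func encode(seqs ...uint64) []byte {
	var buf bytes.Buffer
	for _, s := range seqs {
		payload, _ := json.Marshal(entry{Seq: s}) // cannot fail for this type
		var hdr [4]byte
		binary.BigEndian.PutUint32(hdr[:], uint32(len(payload)))
		buf.Write(hdr[:])
		buf.Write(payload)
	}
	return buf.Bytes()
}

// scanAll counts complete frames. It returns nil at a clean frame boundary
// and errTruncatedTail when the header or the payload is cut short.
func scanAll(data []byte) (int, error) {
	r := bytes.NewReader(data)
	n := 0
	for {
		var hdr [4]byte
		if _, err := io.ReadFull(r, hdr[:]); err == io.EOF {
			return n, nil // no bytes left: clean end of log
		} else if err != nil {
			return n, errTruncatedTail // partial header
		}
		payload := make([]byte, binary.BigEndian.Uint32(hdr[:]))
		if _, err := io.ReadFull(r, payload); err != nil {
			return n, errTruncatedTail // partial payload
		}
		var e entry
		if err := json.Unmarshal(payload, &e); err != nil {
			return n, fmt.Errorf("invalid JSON in complete frame: %w", err)
		}
		n++
	}
}

func main() {
	data := encode(1, 2)
	fmt.Println(scanAll(data))               // both frames complete
	fmt.Println(scanAll(data[:len(data)-3])) // second frame cut mid-payload
}
```

The sketch separates a partial frame (a crash artifact that recovery may truncate) from a complete frame whose payload fails to parse, which is the distinction the clause matrix relies on.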

§3.4 Steady-State Detection & Policy

Clause	Requirement	Enforcing Code	Tests	CI / Evidence
§3.4 steady-state detection	`steady = validBytes > 0 || otherStateExists(dir)`	`otherStateExists` (`wal.go:511–522`): checks `lock.fingerprint`, `activation/`, `runtime.db`	`TestOpenWAL_SteadyState_WALHasFrames_SafeSeqMissing_FailHard`, `TestOpenWAL_SteadyState_OtherState_SafeSeqMissing_FailHard`	`openwal_steady_test.go` runs in every `test-go` CI job
§3.4 `telemetry.pos` excluded	`telemetry.pos` MUST NOT be used for steady-state or recovery	`wal.go:163`: explicit comment; `otherStateExists` does NOT check `posFileName`	Implicit: `TestOpenWAL_FirstRun_SafeSeqMissing_OK` passes with only `.pos` present if steady=false	S11 stop condition checked; verified in operator reset tests
§3.4 first-run	No safe_seq + no other state → OK (no error, no warning)	`OpenWAL` switch `case !steady:` (`wal.go:168–170`): silent pass	`TestOpenWAL_FirstRun_SafeSeqMissing_OK`	First `make portability-matrix` run starts from a fresh dir
§3.4 steady-state fail-hard	Steady + no safe_seq + no override → `fmt.Errorf(...)`	`OpenWAL` default case (`wal.go:199–208`)	`TestOpenWAL_SteadyState_WALHasFrames_SafeSeqMissing_FailHard`, `TestOpenWAL_SteadyState_OtherState_SafeSeqMissing_FailHard`	`test-go` CI job
§3.4 legacy upgrade	`AUTONOMYOPS_WAL_LEGACY_UPGRADE=1` + steady → WARN + OK	`OpenWAL` (`wal.go:172–178`): `slog.Warn(...)` with hint to unset	`TestOpenWAL_LegacyUpgrade_SafeSeqMissing_OK` (`openwal_steady_test.go`), `TestOpenWAL_SafeSeqZero_LegacyUpgrade_TruncatesStructuralTail`	S7 stop condition closed
§3.4.1 operator reset	`AUTONOMYOPS_WAL_OPERATOR_RESET=1` + empty WAL + no safe_seq → WARN `WAL_RESET_BY_OPERATOR` + OK	`OpenWAL` (`wal.go:180–197`): `slog.Warn("WAL_RESET_BY_OPERATOR", ...)` with `activation_invariants` note	`TestOperatorReset_EmitsMarkerAndStartsClean` (asserts slog msg via `captureSlogHandler`), `TestOperatorReset_WithoutEnvVar_FailsHard`, `TestOperatorReset_WalHasFrames_IsNotReset`	S12 stop condition closed; slog capture is tested
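
The §3.4 rows above amount to a decision table for the case where safe_seq is missing. A minimal sketch of that decision follows, assuming the branch order implied by the cited line ranges (first-run, legacy upgrade, operator reset, then fail-hard). The openState struct, its field names, and the returned labels other than WAL_RESET_BY_OPERATOR are hypothetical; the real OpenWAL reads the environment variables directly and logs through slog.

```go
package main

import (
	"errors"
	"fmt"
)

// openState collects the inputs the §3.4 policy depends on.
// All field names are hypothetical.
type openState struct {
	SafeSeqFound  bool  // safe_seq file present and decodable
	ValidBytes    int64 // bytes of structurally valid frames in the WAL
	OtherState    bool  // result of otherStateExists(dir)
	LegacyUpgrade bool  // AUTONOMYOPS_WAL_LEGACY_UPGRADE=1
	OperatorReset bool  // AUTONOMYOPS_WAL_OPERATOR_RESET=1
}

var errSteadyNoSafeSeq = errors.New("steady state but safe_seq is missing")

// decide returns a label for the branch taken, or an error for fail-hard.
func decide(s openState) (string, error) {
	if s.SafeSeqFound {
		return "ok", nil // the missing-safe_seq policy does not apply
	}
	steady := s.ValidBytes > 0 || s.OtherState
	switch {
	case !steady:
		return "first-run", nil // silent pass: no error, no warning
	case s.LegacyUpgrade:
		return "warn-legacy-upgrade", nil // WARN + OK
	case s.OperatorReset && s.ValidBytes == 0:
		return "WAL_RESET_BY_OPERATOR", nil // WARN + OK, empty WAL only
	default:
		return "", errSteadyNoSafeSeq
	}
}

func main() {
	fmt.Println(decide(openState{}))
	fmt.Println(decide(openState{OtherState: true, OperatorReset: true}))
	fmt.Println(decide(openState{ValidBytes: 128, OperatorReset: true}))
}
```

Note that the operator-reset branch requires an empty WAL, which matches TestOperatorReset_WalHasFrames_IsNotReset: a WAL that still holds frames falls through to the fail-hard default.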

§3.5 Option B — Safe-Point Mechanism

Clause	Requirement	Enforcing Code	Tests	CI / Evidence
§3.5 encoding	`safe_seq` = 8-byte little-endian uint64 (NOT big-endian)	`encodeSafeSeq` (`wal.go:398–402`): `binary.LittleEndian.PutUint64`; `decodeSafeSeq` (`wal.go:407–412`)	`TestEncodeSafeSeq_RoundTrip`, `TestDecodeSafeSeq_WrongLength_Errors`, `TestSafeSeq_DiskEncoding_LittleEndian` (explicitly verifies bytes[0]=seq, bytes[7]=0)	`wal_verify.py` reads with `struct.unpack("<Q")` (LE) matching the Go encoding
§3.5 corrupt safe_seq	Wrong-length file → fail-hard (never silently tolerated)	`readSafeSeq` (`wal.go:483–498`): `decodeSafeSeq` returns error on `len != 8`; surfaced as `fmt.Errorf(...)` from `OpenWAL:139`	`TestReadSafeSeq_Corrupt_WrongLength` (`safe_seq_test.go`), `TestReadSafeSeq_Missing_ReturnsNotFound`	S4 stop condition closed
§3.5 temp-then-rename atomicity	`writeSafeSeq` writes `safe_seq.tmp`, fsyncs it, then renames to `safe_seq`	`writeSafeSeq` (`wal.go:439–470`): steps 1–4 explicit; temp and target MUST be in same dir	`TestWriteSafeSeq_AtomicRename_OldValuePersistedIfTmpExists` (`safe_seq_test.go`)	Core matrix step 4 (atomic rename check); `TestCrash_SafeSeqTmpLeftover_OpenWAL_Succeeds`
§3.5 MUST ordering step 1	WAL `fsync` MUST complete before `writeSafeSeq` is called	`Append` (`wal.go:300–306`): `w.file.Sync()` then `w.writeSafeSeq(e.Seq)` in strict sequence	`TestCrash_BetweenWALFsyncAndSafeSeqWrite` (S3 closure: phantom discarded on reopen)	S8 stop condition closed (see WS-1 below re: ordering coverage)
§3.5 MUST ordering step 2	tmp write → tmp fsync	`writeSafeSeq` (`wal.go:452–455`): `tmp.Sync()` before `Close()`	`TestWriteSafeSeq_AtomicRename_OldValuePersistedIfTmpExists` (proves write+fsync completed)	Implicit in every crash harness iteration
§3.5 MUST ordering step 3	tmp fsync → rename	`writeSafeSeq` (`wal.go:461–464`): `os.Rename(tmpPath, target)` after `tmp.Close()`	Atomic rename test in core matrix	`core_matrix.sh` step 4
§3.5 monotonicity	`safe_seq` only ever increases; never written with lower value	`writeSafeSeq` called with `e.Seq` which is monotonically incremented by `w.seq.Add(1)`	`TestSafeSeq_Monotonic_AcrossAppends`, `TestSafeSeq_NeverExceedsMaxSeq_AfterPhantomFrame`	Crash harness invariant: `safe_seq_on_disk == entries_committed`
§3.5 safe_seq after recovery	On reopened WAL, `w.seq.Store(maxSeq)` ensures next Append continues from `maxSeq+1`	`OpenWAL` (`wal.go:265`): `w.seq.Store(maxSeq)` — `maxSeq` is the post-truncation committed seq	`TestAppend_SafeSeqPersistsAcrossReopen`, `TestCrashHarness_Randomized` (each iter reopens after phantom injection)	Crash harness per-iteration assertion
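
The §3.5 encoding row can be made concrete with a short sketch of an 8-byte little-endian round trip that rejects any other length. The function names match those cited in the table, but the signatures and error text here are assumptions, not a copy of wal.go.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeSafeSeq returns seq as 8 little-endian bytes (assumed signature).
func encodeSafeSeq(seq uint64) []byte {
	b := make([]byte, 8)
	binary.LittleEndian.PutUint64(b, seq)
	return b
}

// decodeSafeSeq rejects any input that is not exactly 8 bytes, so a
// wrong-length file is never silently tolerated.
func decodeSafeSeq(b []byte) (uint64, error) {
	if len(b) != 8 {
		return 0, fmt.Errorf("safe_seq: want 8 bytes, got %d", len(b))
	}
	return binary.LittleEndian.Uint64(b), nil
}

func main() {
	b := encodeSafeSeq(5)
	fmt.Printf("% x\n", b) // low-order byte comes first
	fmt.Println(decodeSafeSeq(b))
	fmt.Println(decodeSafeSeq(b[:3])) // wrong length: error
}
```

For small values the low-order byte lands in b[0] and b[7] stays zero, which is the property TestSafeSeq_DiskEncoding_LittleEndian is described as checking.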

§3.6 Directory Fsync Policy

Clause	Requirement	Enforcing Code	Tests	CI / Evidence
§3.6 dir fsync after rename	Parent directory MUST be fsynced after `safe_seq` rename to guarantee rename durability	`fsyncDir` (`wal.go:417–427`): opens dir, calls `d.Sync()`, closes; called from `writeSafeSeq` (`wal.go:465–468`)	`TestSafeSeq_DiskEncoding_LittleEndian` (reads back value, proves fsync completed); `TestCrash_SafeSeqTmpLeftover_OpenWAL_Succeeds` (proves dir fsync doesn’t interfere with tmp)	S9 stop condition closed; `fsyncDir` called in every `writeSafeSeq` invocation
§3.6 NOT after WAL append	No dir fsync after `w.file.Write` — only `w.file.Sync()` (file fsync, not dir)	`Append` (`wal.go:300`): only `w.file.Sync()`; no `fsyncDir` call in append path	Implicit: append path does not call `fsyncDir`	Correct by code inspection; no superfluous dir fsyncs measured
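
Taken together, the §3.5 ordering rows and the §3.6 directory-fsync row describe a write, fsync, rename, fsync-directory sequence. The sketch below shows one way to arrange those steps. The file names safe_seq and safe_seq.tmp follow the table, while the function signatures, the file mode, and the demoWrite helper are assumptions. Calling Sync on a directory handle works on Linux and other Unix-like systems and may fail elsewhere.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"os"
	"path/filepath"
)

// fsyncDir opens the directory and fsyncs it so that a preceding rename
// becomes durable.
func fsyncDir(dir string) error {
	d, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer d.Close()
	return d.Sync()
}

// writeSafeSeq writes the temp file, fsyncs it, renames it over the target,
// then fsyncs the parent directory. Temp and target share one directory.
func writeSafeSeq(dir string, seq uint64) error {
	tmpPath := filepath.Join(dir, "safe_seq.tmp")
	target := filepath.Join(dir, "safe_seq")
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], seq)

	tmp, err := os.OpenFile(tmpPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o600)
	if err != nil {
		return err
	}
	if _, err := tmp.Write(buf[:]); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // tmp fsync before rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Rename(tmpPath, target); err != nil {
		return err
	}
	return fsyncDir(dir) // make the rename itself durable
}

// demoWrite is a hypothetical helper: it writes seq into a scratch directory
// and reports the value read back and whether the temp file is gone.
func demoWrite(seq uint64) (uint64, bool, error) {
	dir, err := os.MkdirTemp("", "safeseq")
	if err != nil {
		return 0, false, err
	}
	defer os.RemoveAll(dir)
	if err := writeSafeSeq(dir, seq); err != nil {
		return 0, false, err
	}
	b, err := os.ReadFile(filepath.Join(dir, "safe_seq"))
	if err != nil {
		return 0, false, err
	}
	if len(b) != 8 {
		return 0, false, fmt.Errorf("unexpected length %d", len(b))
	}
	_, statErr := os.Stat(filepath.Join(dir, "safe_seq.tmp"))
	return binary.LittleEndian.Uint64(b), os.IsNotExist(statErr), nil
}

func main() {
	fmt.Println(demoWrite(7))
}
```

A read-back like demoWrite only shows that the value reached the file; it cannot prove that an fsync happened, which is why the audit treats S9 as verified partly by code inspection.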

§3.5 / §3.4 ReadFrom Correctness

Clause	Requirement	Enforcing Code	Tests	CI / Evidence
R5 (no phantoms)	After `OpenWAL` truncation, `ReadFrom` never returns entries with `Seq > safe_seq`	Truncation (`wal.go:240–244`) removes phantom bytes before `ReadFrom` is ever called	`TestOpenWAL_TruncatesToSafePoint_NotStructuralBoundary` (verifies file size = bytesAtSafe), `TestCrash_BetweenWALFsyncAndSafeSeqWrite`	Crash harness: every iteration calls `ReadFrom(0)` and asserts `len == entries_committed`
§3.5 corrupt JSON fail-hard	Structurally valid frame with unparseable JSON in committed range → `WAL_CORRUPT_INVALID_JSON` error (never silently skipped)	`ReadFrom` (`wal.go:346–357`): `json.Unmarshal` failure → `return nil, fmt.Errorf("... WAL_CORRUPT_INVALID_JSON ...")`	`TestReadFrom_MalformedJSON_FailHard`	S5 stop condition closed
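
R5 depends on recovery truncating at the safe-point rather than at the last structurally valid frame. A rough sketch of computing that offset is shown here. The frameInfo type and this bytesAtSafe function are hypothetical stand-ins for what the scanner tracks internally, and the sketch assumes safe_seq > 0 (the zero case is handled separately in the real code).

```go
package main

import "fmt"

// frameInfo describes one scanned frame (hypothetical type).
type frameInfo struct {
	Seq  uint64
	Size int64 // 4-byte header plus payload length
}

// bytesAtSafe returns the byte offset just past the frame whose Seq equals
// safeSeq. Everything after that offset is phantom data to be truncated.
// The second result is false when no such frame exists.
func bytesAtSafe(frames []frameInfo, safeSeq uint64) (int64, bool) {
	var off int64
	for _, f := range frames {
		off += f.Size
		if f.Seq == safeSeq {
			return off, true
		}
	}
	return 0, false
}

func main() {
	// Three structurally valid frames; only seq 1 and 2 were committed.
	frames := []frameInfo{{1, 13}, {2, 15}, {3, 13}}
	off, ok := bytesAtSafe(frames, 2)
	fmt.Println(off, ok) // truncate the WAL file to off, dropping seq 3
}
```

With the offset in hand, a call such as os.Truncate on the WAL path would remove the phantom frame before any reader sees it.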

2. Phase 1 Stop Conditions Verification

ID	Condition	Status	Test	Code Location
S1	`safe_seq` written after every `Append`	✅ CLOSED	`TestAppend_SafePointAdvancesPerEntry`	`Append` (`wal.go:306`): `w.writeSafeSeq(e.Seq)`
S2	Recovery truncates to safe-point, not structural boundary	✅ CLOSED	`TestOpenWAL_TruncatesToSafePoint_NotStructuralBoundary`	`OpenWAL` (`wal.go:228–231`): `truncateTo = bytesAtSafe`
S3	No phantom entries survive recovery	✅ CLOSED	`TestCrash_BetweenWALFsyncAndSafeSeqWrite`, `TestCrashHarness_Randomized` (20 iterations × seeded)	Truncation before open-for-append
S4	Corrupt `safe_seq` → fail-hard	✅ CLOSED	`TestReadSafeSeq_Corrupt_WrongLength`	`readSafeSeq` + `decodeSafeSeq`
S5	Malformed JSON in valid frame → `ReadFrom` error	✅ CLOSED	`TestReadFrom_MalformedJSON_FailHard`	`ReadFrom` (`wal.go:346–357`)
S6	Seq gap in committed range → `OpenWAL` error	✅ CLOSED	`TestOpenWAL_SeqGap_FailHard`, `TestScanFramesWithSafe_SeqGap`	`scanFramesWithSafe` R2
S7	`AUTONOMYOPS_WAL_LEGACY_UPGRADE=1` → WARN + ok	✅ CLOSED	`TestOpenWAL_LegacyUpgrade_SafeSeqMissing_OK`	`OpenWAL` (`wal.go:172–178`)
S8	WAL `fsync` BEFORE `safe_seq` rename	✅ CLOSED*	`TestCrash_BetweenWALFsyncAndSafeSeqWrite` (indirect)	`Append` sequential code (`wal.go:300`, then `wal.go:306`)
S9	Dir fsync after `safe_seq` rename	✅ CLOSED	`TestSafeSeq_DiskEncoding_LittleEndian` (indirect), code inspection	`writeSafeSeq` (`wal.go:465–468`)
S10	All pre-existing `wal_test.go` tests pass unmodified	✅ CLOSED	82/82 tests pass	Entire test suite
S11	`telemetry.pos` unchanged (consumer marker only)	✅ CLOSED	Implicit; `otherStateExists` excludes `posFileName`	`wal.go:163`, `wal.go:366–382`
S12	Operator reset: `WAL_RESET_BY_OPERATOR` emitted + Seq=1	✅ CLOSED	`TestOperatorReset_EmitsMarkerAndStartsClean`	`OpenWAL` (`wal.go:192–197`)

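* S8 was originally covered only by the indirect crash test listed above. WS-1 below (now closed) added the direct ordering test TestAppend_Ordering_WALSyncBeforeSafeSeq.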

3. Weak Spots & Gaps (Status)

Ordered by severity: P1 = blocking for release; P2 = should fix before Phase 2 launch; P3 = documentation or minor.

WS-1 (P2) — S8 ordering is not directly asserted at the unit level (CLOSED)

What: S8 requires WAL fsync to complete before writeSafeSeq is called. The code is correct (wal.go:300 before wal.go:306) but the only test covering this is indirect: TestCrash_BetweenWALFsyncAndSafeSeqWrite proves that a crash at the S3 window leaves the right recovery state; it does NOT assert that w.file.Sync() returns before writeSafeSeq starts.

Risk: A future refactor that reorders or batches these calls (e.g., goroutine-based safe-point batching in Phase 2) could silently violate the MUST ordering.

Resolution: Added test-only hooks in WAL (onWalSynced, onSafeSeqAdvanced) and direct ordering test TestAppend_Ordering_WALSyncBeforeSafeSeq asserting exact order ["sync","safe"].
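
The hook-based ordering check described in the resolution could look roughly like the following. The hook names onWalSynced and onSafeSeqAdvanced come from the resolution; the hook signatures, the wal stand-in type, and the stubbed syncFile and writeSafeSeq methods are assumptions for illustration only.

```go
package main

import "fmt"

// wal is a stand-in for the real WAL type; only the hooks matter here.
type wal struct {
	onWalSynced       func() // test-only: fired after the WAL fsync returns
	onSafeSeqAdvanced func() // test-only: fired after safe_seq is written
}

// syncFile and writeSafeSeq are placeholders for w.file.Sync() and the
// real safe-point write.
func (w *wal) syncFile() error               { return nil }
func (w *wal) writeSafeSeq(seq uint64) error { return nil }

// Append performs the two durability steps in the MUST order and fires a
// hook after each one completes.
func (w *wal) Append(seq uint64) error {
	if err := w.syncFile(); err != nil {
		return err
	}
	if w.onWalSynced != nil {
		w.onWalSynced()
	}
	if err := w.writeSafeSeq(seq); err != nil {
		return err
	}
	if w.onSafeSeqAdvanced != nil {
		w.onSafeSeqAdvanced()
	}
	return nil
}

// recordOrder installs hooks that log their firing order.
func recordOrder() ([]string, error) {
	var order []string
	w := &wal{
		onWalSynced:       func() { order = append(order, "sync") },
		onSafeSeqAdvanced: func() { order = append(order, "safe") },
	}
	err := w.Append(1)
	return order, err
}

func main() {
	fmt.Println(recordOrder())
}
```

Because the hooks fire only after each step returns, a refactor that moved the safe-point write ahead of the fsync, or ran it concurrently, would change the recorded order and fail the assertion.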

WS-2 (P2) — `scanFramesWithSafe` misidentifies root cause when corrupt JSON appears before `safe_seq` (CLOSED)

What: If a frame with invalid JSON payload appears within the committed range (Seq ≤ safe_seq), scanFramesWithSafe treats it as errTruncatedTail and breaks the scan loop (wal.go:600–604). After the loop: maxSeq < safeSeq, triggering SAFESEQ_GT_MAXSEQ; but the actual fault is a corrupt frame in the committed range.

Trace:

WAL: [Seq=1 OK][Seq=2 BAD_JSON][Seq=3][safe_seq=3]
scan: reads Seq=1 → ok. reads Seq=2 → bad JSON → err=errTruncatedTail, break
post-scan: safeSeq=3, maxSeq=1, safeSeq > maxSeq → SAFESEQ_GT_MAXSEQ

The error message says "safe_seq=3 > maxSeq=1" which obscures the true cause (corrupt frame 2).

Risk: Operator confusion during incident response. The operator might incorrectly diagnose a safe_seq file problem when the actual issue is storage corruption.

Resolution: The scanner now fails hard immediately on malformed JSON in a structurally complete frame, with cause WAL_CORRUPT_INVALID_JSON, before the SAFESEQ_GT_MAXSEQ post-checks run. Added TestScanFramesWithSafe_CorruptJSON_FailHardSpecificCause.
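
To make the trace above reproducible, here is a simplified sketch of a scanner that reports the specific cause. It is not scanFramesWithSafe: the frames and scanCause helpers are hypothetical, the R1, R2, and R4 checks are omitted, and the sketch rejects malformed JSON in any complete frame without modelling the committed-range boundary.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// frames builds a WAL image from raw payload strings, each prefixed with a
// 4-byte big-endian length.
func frames(payloads ...string) []byte {
	var out []byte
	for _, p := range payloads {
		var hdr [4]byte
		binary.BigEndian.PutUint32(hdr[:], uint32(len(p)))
		out = append(out, hdr[:]...)
		out = append(out, p...)
	}
	return out
}

// scanCause returns "" when the image passes, or a cause string.
func scanCause(wal []byte, safeSeq uint64) string {
	var maxSeq uint64
	for len(wal) > 0 {
		if len(wal) < 4 {
			break // partial header: structural tail
		}
		n := binary.BigEndian.Uint32(wal[:4])
		if uint64(len(wal)-4) < uint64(n) {
			break // partial payload: structural tail
		}
		payload := wal[4 : 4+int(n)]
		var e struct {
			Seq uint64 `json:"seq"`
		}
		if err := json.Unmarshal(payload, &e); err != nil {
			// Complete frame, unparseable payload: report it directly
			// instead of letting the post-scan checks mask it.
			return "WAL_CORRUPT_INVALID_JSON"
		}
		maxSeq = e.Seq
		wal = wal[4+int(n):]
	}
	if safeSeq > 0 && maxSeq > 0 && safeSeq > maxSeq {
		return "SAFESEQ_GT_MAXSEQ"
	}
	return ""
}

func main() {
	img := frames(`{"seq":1}`, `{bad`, `{"seq":3}`)
	fmt.Println(scanCause(img, 3))
}
```

Running the trace input through this sketch yields the corrupt-JSON cause rather than the misleading safe_seq comparison, while a genuinely cut-off final frame still reaches the post-scan check.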

WS-3 (P3) — `telemetry.safe_seq.tmp` leftover never proactively cleaned (CLOSED)

What: If writeSafeSeq crashes between os.OpenFile(tmp) and os.Rename(tmp, target), the .tmp file is left on disk and persists across reboots. TestCrash_SafeSeqTmpLeftover_OpenWAL_Succeeds proves this is safe (the stale .tmp does not corrupt recovery), but before this fix nothing ever removed the file.

Risk: Minor: disk clutter in long-running deployments; potential confusion for operators checking the WAL directory.

Resolution: OpenWAL now removes safe_seq.tmp on a best-effort basis, and only when the canonical safe_seq file exists. Added tests:

TestOpenWAL_RemovesTmpWhenCanonicalExists
TestOpenWAL_LeavesTmpWhenCanonicalMissing
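
The conditional cleanup rule covered by these two tests can be sketched as follows. The cleanupStaleTmp and demoCleanup helpers are hypothetical, and the file names are taken from the resolution text; the real OpenWAL may structure this differently.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// cleanupStaleTmp removes safe_seq.tmp only when the canonical safe_seq
// file exists. Failures are ignored (best effort). It reports whether a
// file was removed.
func cleanupStaleTmp(dir string) bool {
	if _, err := os.Stat(filepath.Join(dir, "safe_seq")); err != nil {
		return false // canonical file missing or unreadable: leave tmp alone
	}
	return os.Remove(filepath.Join(dir, "safe_seq.tmp")) == nil
}

// demoCleanup creates a scratch directory with a stale tmp file, optionally
// adds the canonical file, runs the cleanup, and reports whether the tmp
// file still exists afterwards.
func demoCleanup(withCanonical bool) (bool, error) {
	dir, err := os.MkdirTemp("", "waltmp")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)
	tmp := filepath.Join(dir, "safe_seq.tmp")
	if err := os.WriteFile(tmp, []byte("stale"), 0o600); err != nil {
		return false, err
	}
	if withCanonical {
		canonical := filepath.Join(dir, "safe_seq")
		if err := os.WriteFile(canonical, make([]byte, 8), 0o600); err != nil {
			return false, err
		}
	}
	cleanupStaleTmp(dir)
	_, statErr := os.Stat(tmp)
	return statErr == nil, nil
}

func main() {
	fmt.Println(demoCleanup(true))
	fmt.Println(demoCleanup(false))
}
```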

WS-4 (P3) — `otherStateExists` extension protocol not documented; implicit closed list (CLOSED)

What: otherStateExists checks exactly 3 paths: lock.fingerprint, activation/, runtime.db. If a future component adds a new significant runtime state file (e.g., policy.db, activation.lock), the function must be manually updated or the steady-state guarantee weakens (first-run detection becomes incorrect for that state).

Risk: Silent regression. If a new state file is added without updating the function, a recovering WAL could be opened in first-run mode, and a missing safe_seq would not be detected as a steady-state violation.

Resolution: Added extension-protocol guidance directly above otherStateExists in wal.go, explicitly requiring new persisted runtime state checks to be registered there and reiterating that telemetry.pos is excluded.
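
One way to keep the closed list visible is to hold the checked paths in a single slice, as in the sketch below. The three path names come from the finding above; the steadyStateMarkers variable, the function signature, and the demoOtherState helper are assumptions, and the actual wal.go may check each path individually.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// steadyStateMarkers lists every persisted runtime state path that implies
// a previous run. New state files must be registered here. telemetry.pos is
// deliberately absent: it is a consumer marker only.
var steadyStateMarkers = []string{"lock.fingerprint", "activation", "runtime.db"}

// otherStateExists reports whether any registered marker exists in dir.
func otherStateExists(dir string) bool {
	for _, name := range steadyStateMarkers {
		if _, err := os.Stat(filepath.Join(dir, name)); err == nil {
			return true
		}
	}
	return false
}

// demoOtherState creates a scratch directory containing one empty file with
// the given name and reports what otherStateExists returns for it.
func demoOtherState(name string) (bool, error) {
	dir, err := os.MkdirTemp("", "walstate")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, name), nil, 0o600); err != nil {
		return false, err
	}
	return otherStateExists(dir), nil
}

func main() {
	fmt.Println(demoOtherState("runtime.db"))
	fmt.Println(demoOtherState("telemetry.pos"))
}
```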

4. Non-Blocking Observations (Informational)

These are not weak spots but are worth noting for review context.

OBS-1: safe_seq=0 is a valid committed value. readSafeSeq explicitly documents this (wal.go:482: “Zero (seq=0) is a valid committed safe-point”). The distinction between “found=false” (missing file) and “found=true, seq=0” is correctly preserved. Tests: TestReadSafeSeq_ZeroSeq_IsValid.

OBS-2: R3 guard has an intentional asymmetry. wal.go:649: if safeSeq > 0 && maxSeq > 0 && safeSeq > maxSeq. When maxSeq == 0 (empty WAL) and safeSeq > 0, the R3 check is skipped; R4 (bytesAtSafe == 0) fires instead. This is correct: an empty WAL with a non-zero safe_seq is best reported as SAFESEQ_NOT_FOUND (the frame doesn’t exist), not SAFESEQ_GT_MAXSEQ (a reorder bug). Documented in TestScanFramesWithSafe_SafeSeqNotFound.
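
The asymmetry in OBS-2 is easiest to see as a pure function over the post-scan values. The R3 condition below is copied from the quoted guard; the postScanCause function itself and the form of the R4 check are a sketch under the assumption that R4 tests bytesAtSafe == 0.

```go
package main

import "fmt"

// postScanCause applies R3 and then R4 to the values a scan produces.
// It returns "" when both checks pass.
func postScanCause(safeSeq, maxSeq uint64, bytesAtSafe int64) string {
	// R3: skipped when maxSeq == 0 so that an empty WAL falls through to R4.
	if safeSeq > 0 && maxSeq > 0 && safeSeq > maxSeq {
		return "SAFESEQ_GT_MAXSEQ"
	}
	// R4: the frame with Seq == safeSeq was never seen.
	if safeSeq > 0 && bytesAtSafe == 0 {
		return "SAFESEQ_NOT_FOUND"
	}
	return ""
}

func main() {
	fmt.Println(postScanCause(3, 0, 0))  // empty WAL, non-zero safe_seq
	fmt.Println(postScanCause(3, 1, 0))  // safe_seq ahead of the WAL
	fmt.Println(postScanCause(2, 3, 40)) // safe frame found
}
```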

OBS-3: Append increments seq before write. If Write or Sync fails, seq has already been incremented but no frame was written, so the next successful Append would produce a seq gap. However, OpenWAL resets the counter via w.seq.Store(maxSeq) on reopen, so the gap only persists within a single crashed WAL instance. Standard practice is to close and reopen after any failed Append. This is pre-existing behavior, not a new concern.

OBS-4: legacyUpgradeEnvVar performs no gap check on the legacy WAL. When AUTONOMYOPS_WAL_LEGACY_UPGRADE=1, all frames up to validBytes are treated as committed with safePoint=0; gap checking is not performed on pre-Phase-1 WAL content. A genuinely gapped legacy WAL would be accepted. This is acceptable for Phase 1 (the env var is a one-shot migration escape hatch), but Phase 2 should consider adding a WARNING-level gap scan on the legacy path.

OBS-5: WAL rotation not implemented. maxSize = 64 MiB is tracked but no rotation logic exists. If exceeded, writes continue. This is documented as deferred to Phase 2. No compliance impact for Phase 1.

5. Proposed Follow-Up Actions (Minimal, PR-ready)

ID	Priority	Type	Action	Status
FU-1	P2	Test	Direct Append ordering assertion (`sync` then `safe`)	CLOSED
FU-2	P2	Code	Specific scanner cause for malformed JSON (`WAL_CORRUPT_INVALID_JSON`)	CLOSED
FU-3	P3	Code	`OpenWAL` conditional stale `safe_seq.tmp` cleanup	CLOSED
FU-4	P3	Docs	Extension protocol note above `otherStateExists`	CLOSED

FU-1 through FU-4 are implemented; no remaining weak-spot follow-ups are required for Phase 1 closure.

6. CI Coverage Summary

Coverage Dimension	Mechanism	Gap?
Unit correctness	`go test ./...` in `ci.yml` → `go-test` job	None
Race detector	`go test -race` in `ci.yml` → `go-race` job	None
Randomized crash recovery	`TestCrashHarness_Randomized` (20 iter, seeded) via `portability-crash-harness`	None
Cross-arch (amd64, arm64)	`portability-nightly.yml` jobs	riscv64 is QEMU / informational only
Cross-FS (ext4, xfs)	`portability-nightly.yml` loop device job	xfs requires root; informational on non-root CI
Evidence artifact	`evidence/<arch>/<fs>/run-N/wal_verify.json`	Correct; schema documents all invariants
Release gate	`portability-nightly.yml` → `release-gate` job aggregates `wal_verify.json` `pass` fields	Only gates on tag push; nightly is advisory

This document was generated from static analysis of telemetry/wal.go and telemetry/*_test.go. Last verified against commit state 2026-03-04.