Edge Relay Deadletter Lab

Audience: operators and reviewers who want a reproducible local lab for the PR-14, PR-15, PR-16, and PR-23 relay operator surfaces in edged and edgectl.

This lab proves the following operator workflows and enforcement behaviours against a live daemon:

  1. edgectl relay deadletter list

  2. edgectl relay deadletter inspect <segment-id> <peer-id>

  3. edgectl relay deadletter retry <segment-id> <peer-id>

  4. edgectl relay deadletter purge --segment-id ... --peer-id ... [--force]

  5. PR-16 bandwidth enforcement under sustained relay load

  6. PR-16 daily quota reset evidence using the deterministic limiter clock path

  7. edgectl relay status [--output text|json]

It mirrors the HA evidence style:

  • one reusable lab entry point

  • one dated evidence/ bundle from a real run

  • raw captured command outputs stored in the repo

1. What the Lab Does

The lab runner:

  • builds edged and edgectl

  • generates a deterministic edge.toml and short-lived lab TLS material

  • seeds the relay BoltDB ledger with deadlettered entries and attempt history

  • starts edged with a real Unix control socket

  • runs edgectl relay deadletter list

  • runs edgectl relay deadletter inspect seg-dead-1 peer-a

  • runs edgectl relay status and captures both text and JSON output

  • runs edgectl relay deadletter retry seg-dead-1 peer-a

  • captures relay status after retry and purge to prove queue-depth changes

  • captures post-retry ledger state from the real relay BoltDB

  • runs edgectl relay deadletter purge in dry-run and --force modes

  • captures post-purge ledger state from the real relay BoltDB

  • optionally seeds bandwidth-limited relay work to a real mTLS peer

  • captures live relay throttling evidence and post-run ledger state

  • captures relay status with live bandwidth status populated

  • captures the deterministic 24-hour quota reset test output

  • writes all outputs into an evidence directory
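
For orientation, a condensed sketch of the capture sequence the runner performs (the $EVIDENCE variable and redirections are illustrative; the real runner also handles build, seeding, TLS, daemon lifecycle, and the optional bandwidth phase):

EVIDENCE="$PWD/evidence/pr14-edge-deadletter-local-$(date +%F)"

edgectl relay deadletter list                      > "$EVIDENCE/relay-deadletter-list.txt"
edgectl relay deadletter inspect seg-dead-1 peer-a > "$EVIDENCE/relay-deadletter-inspect.txt"
edgectl relay status                               > "$EVIDENCE/relay-status-initial.txt"
edgectl relay status --output json                 > "$EVIDENCE/relay-status-initial.json"
edgectl relay deadletter retry seg-dead-1 peer-a   > "$EVIDENCE/relay-deadletter-retry.txt"

# Purge pair is selected by the runner; shown here as variables.
edgectl relay deadletter purge --segment-id "$SEG" --peer-id "$PEER"         > "$EVIDENCE/relay-deadletter-purge-dry-run.txt"
edgectl relay deadletter purge --segment-id "$SEG" --peer-id "$PEER" --force > "$EVIDENCE/relay-deadletter-purge-force.txt"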

Seeded deadletter entries and fixtures:

  • seg-dead-1 / peer-a with 3 failed attempts

  • seg-dead-2 / peer-b with 1 failed attempt

  • the retry target segment is persisted in the local store so the retried entry is actually executable

  • the retry peer is configured with a loopback address that refuses connections, making retry pickup deterministic

Inspect target:

  • seg-dead-1 / peer-a

2. Run the Lab

From the repository root:

export GOROOT=/home/ubuntu/.local/go1.25.7
export PATH="$GOROOT/bin:$PATH"
export GOTOOLCHAIN=local

bash scripts/labs/run_edge_deadletter_lab.sh

Bandwidth verification mode:

bash scripts/labs/run_edge_deadletter_lab.sh \
  "$PWD/evidence/pr16-edge-bandwidth-local-$(date +%F)" \
  --with-bandwidth

Optional custom evidence directory:

bash scripts/labs/run_edge_deadletter_lab.sh \
  "$PWD/evidence/pr14-edge-deadletter-local-$(date +%F)"

3. Generated Artifacts

The runner writes these files into the evidence directory:

  • edge.toml

  • seed-manifest.json

  • edgectl-status.txt

  • edged.log

  • relay-deadletter-list.txt

  • relay-deadletter-list-limit-1.txt

  • relay-deadletter-inspect.txt

  • relay-status-initial.txt

  • relay-status-initial.json

  • relay-deadletter-retry.txt

  • audit-relay-deadletter-retry.log

  • retry-ledger-state.json

  • relay-deadletter-list-after-retry.txt

  • relay-status-after-retry.txt

  • relay-status-after-retry.json

  • relay-deadletter-purge-dry-run.txt

  • relay-deadletter-purge-force.txt

  • audit-relay-deadletter-purge.log

  • purge-ledger-state.json

  • relay-deadletter-list-after-purge.txt

  • relay-status-after-purge.txt

  • relay-status-after-purge.json

Bandwidth mode adds:

  • bandwidth-peer.log

  • bandwidth-peer-received.json

  • relay-bandwidth-set.txt

  • audit-relay-bandwidth-set.log

  • relay-bandwidth-config.txt

  • relay-status-with-bandwidth.txt

  • relay-status-with-bandwidth.json

  • seg-bandwidth-01-ledger-state.json through seg-bandwidth-06-ledger-state.json

  • daily-quota-reset-test.txt

4. Expected Results

relay deadletter list

Expected shape:

segment=seg-dead-1 peer=peer-a attempts=3 first_queued=... last_updated=...
segment=seg-dead-2 peer=peer-b attempts=1 first_queued=... last_updated=...
showing 2 deadletter entries

Expected truncation proof:

segment=seg-dead-1 peer=peer-a attempts=3 first_queued=... last_updated=...
showing 1 of 2 deadletter entries

relay deadletter inspect

Expected shape:

Segment:      seg-dead-1
Peer:         peer-a
State:        deadletter
Attempts:     3
First Queued: ...
Last Updated: ...

Attempt History:
  #1  outcome=FAILED ...
  #2  outcome=FAILED ...
  #3  outcome=FAILED ...

relay deadletter retry

Expected evidence shape:

  • the retry command reports retried=true

  • audit-relay-deadletter-retry.log contains audit.event=relay.deadletter.retried

  • the retried entry leaves the deadletter list

  • retry-ledger-state.json shows the same pair still exists in the ledger with:

    • state: "failed" or state: "deadletter" depending on retry budget

    • a larger attempt_count than the pre-retry deadletter fixture

    • an additional trailing failed attempt showing the executor actually picked it up
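
These claims can be spot-checked from inside the evidence directory. The jq paths assume retry-ledger-state.json exposes the state and attempt_count fields named above at the top level, which is an assumption about its layout:

grep "audit.event=relay.deadletter.retried" audit-relay-deadletter-retry.log
jq '.state, .attempt_count' retry-ledger-state.json
grep -c "seg-dead-1" relay-deadletter-list-after-retry.txt   # expect 0: the entry left the list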

relay status

Expected evidence shape:

  • relay-status-initial.txt shows relay enabled, configured worker count, success condition, and queue-depth totals that match the seeded ledger state

  • relay-status-after-retry.* shows the deadletter queue shrinking after the retried entry is picked up by the executor

  • relay-status-after-purge.* shows the purged entry removed from the live queue totals

  • relay-status-with-bandwidth.* includes non-nil bandwidth status with the configured rate/quota and non-zero throttle activity when bandwidth mode is used
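
The queue-depth movement can be diffed straight from the JSON captures. The queue_depth path mirrors the queue_depth.scheduled field referenced in section 10 (S-C); the surrounding JSON layout is otherwise an assumption:

jq '.queue_depth' relay-status-initial.json
jq '.queue_depth' relay-status-after-retry.json
jq '.queue_depth' relay-status-after-purge.json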

relay deadletter purge

Expected evidence shape:

  • dry-run reports would purge 1 deadletter entries

  • force purge reports purged 1 deadletter entries

  • audit-relay-deadletter-purge.log contains audit.event=relay.deadletter.purged

  • purge-ledger-state.json shows exists: false, zero remaining attempts, and zero segment records for that pair

  • the final deadletter list no longer contains the purged entry
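
Minimal spot checks, with the exists field taken from the description above:

grep "audit.event=relay.deadletter.purged" audit-relay-deadletter-purge.log
jq '.exists' purge-ledger-state.json   # expect false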

PR-16 bandwidth enforcement

Expected bandwidth-mode evidence shape:

  • relay-bandwidth-set.txt reports applied=true

  • audit-relay-bandwidth-set.log contains audit.event=relay.bandwidth.configured

  • relay-bandwidth-config.txt shows the configured limits and non-zero throttle_count

  • bandwidth-peer-received.json shows only the number of segments that fit within the configured live rate/quota window

  • later bandwidth-segment ledger dumps show state: "failed" with bandwidth_rate_limited in the trailing attempts
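
Hedged spot checks, assuming bandwidth-peer-received.json is a JSON array of received segments (its exact shape is not specified here):

jq 'length' bandwidth-peer-received.json                             # only segments that fit the window
grep -l "bandwidth_rate_limited" seg-bandwidth-0*-ledger-state.json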

PR-16 daily quota reset after 24h

Expected evidence shape:

  • daily-quota-reset-test.txt captures TestBandwidthLimiter_DailyQuota_ResetsAfter24h

  • the output shows an initial quota-exceeded event followed by a passing test once the injected clock advances by 24 hours
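
The captured test can be re-run directly. Selecting by test name avoids hardcoding the package path, which this document does not specify:

go test ./... -run 'TestBandwidthLimiter_DailyQuota_ResetsAfter24h' -v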

5. Current Evidence Bundle

Reference local runs live in:

  • evidence/pr14-edge-deadletter-local-2026-03-17/README.md

  • evidence/pr15-edge-deadletter-management-local-2026-03-17/README.md

  • evidence/pr16-edge-bandwidth-local-2026-03-17/README.md

  • evidence/pr23-edge-relay-status-local-2026-03-18/README.md

6. Implementation Notes

  • Runner script: scripts/labs/run_edge_deadletter_lab.sh

  • Lab setup helper: scripts/labs/edge_deadletter_lab_setup.go

  • Bandwidth peer helper: scripts/labs/edge_deadletter_lab_peer.go

The helper still seeds the relay ledger directly for determinism, but it now also persists the backing segment payloads and configured peers. That lets the PR-15 retry proof exercise a real executor scan instead of stopping at the mutation RPC.

In bandwidth mode, the same lab also seeds a real relay batch to a local mTLS peer and captures the deterministic 24-hour reset proof from the package test.

7. Sandbox Caveat

This lab uses the edged Unix control socket. If your environment blocks local Unix sockets, the runner may fail with a socket permission error. In that case, rerun it outside the restricted sandbox.

8. VAL19 — Relay Local-Network Impairment Validation

VAL19 is not a mode of this runner. It validates relay store-and-forward correctness and performance characterisation under real transport impairment (outage, bandwidth constraint, latency, mid-transfer disconnect) using a custom TCP impairment proxy (scripts/labs/relay_impairment_proxy.go).

The impairment lab is driven by a separate standalone script:

bash scripts/labs/run_relay_impairment_val19_lab.sh [EVIDENCE_DIR]

Why separate from this runner:

  • Requires two new Go binaries (relay_impairment_proxy.go, edge_relay_impairment_setup.go) built at runtime

  • Routes edged through a proxy address (different edge.toml than this lab)

  • Runtime is 3–4 minutes due to outage cycles + bandwidth + latency scenarios

  • Reuses edge_deadletter_lab_peer.go and edge_deadletter_lab_dump.go as helpers (unchanged)

The VAL19 formal plan, 10-check matrix, proxy control API reference, and throughput characterisation method are documented in relay-impairment-validation.md.

9. VAL20 — Relay Throughput Benchmark

VAL20 is not a mode of this runner. It measures edge relay executor throughput (segments/sec and bytes/sec) across five workload tiers and demonstrates backpressure behaviour under a 1 Mbps bandwidth constraint.

The throughput benchmark is driven by a separate standalone script:

bash scripts/labs/run_relay_throughput_val20_lab.sh [EVIDENCE_DIR]

Why separate from this runner:

  • Requires a new setup binary (edge_relay_throughput_setup.go) that seeds segments to PENDING state (not deadletter) for direct executor pickup

  • Reuses relay_impairment_proxy.go from VAL19 for bandwidth control

  • Runs five distinct tiers (N=1/10/100 × 64 B; N=10 × 128 KB clean/constrained) with fresh edged instances per tier — too slow to embed as a phase here

  • Ports isolated to 19040–19043 to avoid conflicts with VAL19

The VAL20 formal plan, 10-check matrix, tier definitions, throughput characterisation method, and report template are documented in relay-throughput-validation.md.

10. VAL21 — Relay Queue Depth and Overflow Validation

VAL21 is not a mode of this runner. It validates relay queue correctness under depth pressure, LRU eviction-induced relay failure, and relay status accuracy.

The overflow lab is driven by a separate standalone script:

bash scripts/labs/run_relay_overflow_val21_lab.sh [EVIDENCE_DIR]

Three scenarios:

  1. S-A — Deep queue drain (N=200 × 64 B): confirms correctness and monotone queue depth drain at scale.

  2. S-B — LRU eviction interaction (N=10 × 64 KB, ceiling=5×64 KB): confirms that evicted segments fail relay gracefully (deadletter — no panic, no silent data loss) and that all 10 segments are accounted for (delivered + deadletter = 10).

  3. S-C — Relay status accuracy (N=10 × 64 B): confirms that edgectl relay status --output json queue_depth.scheduled equals the seeded count at t+0.4 s, and drops to 0 after all segments are acknowledged.
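
The S-C assertion can be reproduced by hand using the status field it names; the 0.4 s sleep mirrors the t+0.4 s sample point:

sleep 0.4
edgectl relay status --output json | jq '.queue_depth.scheduled'   # expect the seeded count, then 0 after drain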

Why separate from this runner:

  • S-B requires a low disk ceiling + aggressive eviction threshold (--ceiling-bytes 327680 --eviction-threshold 0.50) that would corrupt concurrent operation with any other scenario

  • S-A (N=200) adds 3–5 minutes to any embedded runner

  • Ports isolated to 19044–19047 to avoid conflicts with VAL19/VAL20

The VAL21 formal plan, 10-check matrix, LRU eviction policy explanation, and relay status field reference are documented in relay-overflow-validation.md.

11. VAL22 — Deadletter Workflow Validation

VAL22 is not a mode of this runner. It re-validates the complete deadletter operator workflow with controlled failure injection and verified delivery outcomes — closing the gap in this runner where retry success is never confirmed against a live peer.

The deadletter workflow lab is driven by:

bash scripts/labs/run_relay_deadletter_val22_lab.sh [EVIDENCE_DIR]

What this runner already covers vs what VAL22 adds:

  Feature                            This runner            VAL22
  relay deadletter list + inspect    ✓                      re-validated
  relay deadletter retry command     ✓ (command exits 0)    delivery confirmed
  Retry success rate                 —                      ✓ 4/8 = 50% (Group R: 100%, Group U: 0%)
  Failure injection (outage proxy)   —                      ✓ Group U re-deadletters after outage
  Retention across restart           —                      ✓ BoltDB persistence verified
  relay deadletter purge             ✓                      re-validated

Why separate: this runner hardcodes peer-a at a refusing address — retry always fails. VAL22 requires a live peer server for delivery confirmation. Ports isolated to 19050–19053.

The VAL22 formal plan, 10-check matrix, workflow matrix, failure injection plan, and retry success rate definition are documented in deadletter-workflow-validation.md.

12. VAL23 — Relay Bandwidth Management Validation

VAL23 is not a mode of this runner. It re-validates the relay bandwidth controls with isolated scenarios that separate rate-limit enforcement from daily-quota enforcement — closing the gap in this runner’s --with-bandwidth mode where both are always applied together and counter accuracy is not asserted.

The bandwidth validation lab is driven by:

bash scripts/labs/run_relay_bandwidth_val23_lab.sh [EVIDENCE_DIR]

What this runner already covers vs what VAL23 adds:

  Feature                                  This runner (--with-bandwidth)   VAL23
  relay config set-bandwidth command       ✓                                re-validated
  relay.bandwidth.configured audit event   ✓                                re-validated
  Rate + quota combined throttle           ✓ (always both)                  — (S-B and S-C isolate each)
  Rate-only throttle                       —                                ✓ S-B: throttle_count >= 4, quota_drop_count = 0
  Quota-only enforcement                   —                                ✓ S-C: exactly 2/6 segments delivered, quota_drop_count >= 4
  Hot-reload (set-bandwidth live)          —                                ✓ S-D: applied=true + delivery resumes
  Unlimited baseline (status check)        —                                ✓ S-A: unlimited=true before any config
  Daily quota reset                        ✓ unit test                      ✓ S-E: unit test re-run in isolation

Why separate: this runner’s --with-bandwidth mode always sets bytes_per_second and daily_quota together — it is impossible to determine from the evidence whether throttling is due to the rate limit or quota exhaustion. Ports isolated to 19054–19055 (no proxy needed).

The VAL23 formal plan, 10-check matrix, scenario matrix, token-bucket timing reference, and daily-quota drop path are documented in relay-bandwidth-validation.md.

13. VAL24 — Relay 30-Day Soak Validation

VAL24 is not a mode of this runner. It is a 30-day continuous relay soak framework that validates sustained message processing correctness, zero message loss, and deadletter retry recovery over 1,440 rounds.

The soak is driven by three scripts following the VAL12/VAL18 pattern:

bash scripts/labs/run_soak_val24_setup.sh [SOAK_DIR]    # once — installs cron
bash scripts/labs/run_soak_val24_round.sh  <SOAK_DIR>   # per-round (cron-driven)
bash scripts/labs/run_soak_val24_report.sh <SOAK_DIR> [status|final|...]  # any time

Architecture difference from VAL12/VAL18: relay BoltDB has no live injection API — edged restarts once per round to allow direct BoltDB seeding via edge_relay_soak_val24_setup.go --mode seed. The proxy and peer server stay running for the full 30-day soak.
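
1,440 rounds over 30 days works out to 48 rounds per day, i.e. one round every 30 minutes. The setup script installs a cron entry to that effect; the line below illustrates the cadence and is not the script's literal output:

*/30 * * * * bash /path/to/repo/scripts/labs/run_soak_val24_round.sh /path/to/soak-dir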

What this runner covers vs what VAL24 adds:

  Feature                        This runner          VAL24
  Relay delivery (single run)    ✓                    sustained 30 days
  Deadletter list/inspect        ✓                    per-round
  Retry recovery                 ✓ (single run)       ✓ ~30 outage+recovery cycles
  BoltDB persistence             ✓ (single restart)   ✓ 1,440 restart cycles
  Message-loss accounting        —                    ✓ seeded = delivered + deadletter + loss
  Long-duration proof artifact   —                    reports/final.json with Gate D

Gate D (all four must hold for final PASS):

  1. Total rounds ≥ 1,440

  2. Clean-round delivery rate ≥ 0.990

  3. Retry recovery rate ≥ 0.990 (retried deadletter entries that delivered)

  4. Message loss = 0 (no silent loss)
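
A hypothetical one-liner for evaluating Gate D against reports/final.json; every field name below is an assumption about the report schema, not a documented contract:

jq '(.total_rounds >= 1440)
    and (.clean_round_delivery_rate >= 0.990)
    and (.retry_recovery_rate >= 0.990)
    and (.message_loss == 0)' reports/final.json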

Ports isolated to 19060–19063.

The VAL24 formal plan, 10-check matrix, traffic schedule, failure injection plan, Gate D criteria, monitoring plan, and evidence retention plan are documented in relay-soak-validation.md.

14. VAL27 — Relay Proof Report Generator

VAL27 is a report generator (not a test runner) that reads evidence from VAL19–VAL23 and optionally VAL24 soak results, producing a consolidated relay reliability proof report.

VAL27 auto-discovers the latest evidence directory for each relay VAL under the repo evidence/ root — no manual path configuration needed:

bash scripts/labs/run_relay_proof_report_val27.sh evidence/

Output is written to evidence/val27/.

Readiness levels:

  Level                     Gate
  Relay Design Partner      VAL19–VAL23 all pass + key metrics met (zero loss, Group R = 1.000, bandwidth correct)
  Relay GA                  Design Partner PLUS VAL24 Gate D pass + multi-peer + production hardware throughput
  Relay Public Production   GA PLUS BoltDB crash-consistency + multi-peer soaks + production observability

Beta-marked claims (explicitly labelled in the report):

  • Throughput figures are single-host local runs only

  • Bandwidth daily reset validated by unit test (injected clock) only

  • Impairment throughput is informational (no hard threshold)

  • 30-day soak reliability claims are provisional until VAL24 Gate D passes

The formal plan, metric definitions, beta claim resolution paths, and 10-check matrix are documented in relay-proof-report-validation.md.