Edge Relay Deadletter Lab¶
Audience: operators and reviewers who want a reproducible local lab for the PR-14, PR-15, PR-16, and PR-23 relay operator surfaces in edged and edgectl.
This lab proves the following operator workflows against a live daemon:

- `edgectl relay deadletter list`
- `edgectl relay deadletter inspect <segment-id> <peer-id>`
- `edgectl relay deadletter retry <segment-id> <peer-id>`
- `edgectl relay deadletter purge --segment-id ... --peer-id ... [--force]`
- PR-16 bandwidth enforcement under sustained relay load
- PR-16 daily quota reset evidence using the deterministic limiter clock path
- `edgectl relay status [--output text|json]`
It mirrors the HA evidence style:

- one reusable lab entry point
- one dated evidence/ bundle from a real run
- raw captured command outputs stored in the repo
1. What the Lab Does¶
The lab runner:

- builds `edged` and `edgectl`
- generates a deterministic `edge.toml` and short-lived lab TLS material
- seeds the relay BoltDB ledger with deadlettered entries and attempt history
- starts `edged` with a real Unix control socket
- runs `edgectl relay deadletter list`
- runs `edgectl relay deadletter inspect seg-dead-1 peer-a`
- runs `edgectl relay status` and captures both text and JSON output
- runs `edgectl relay deadletter retry seg-dead-1 peer-a`
- captures `relay status` after retry and purge to prove queue-depth changes
- captures post-retry ledger state from the real relay BoltDB
- runs `edgectl relay deadletter purge` in dry-run and `--force` modes
- captures post-purge ledger state from the real relay BoltDB
- optionally seeds bandwidth-limited relay work to a real mTLS peer
- captures live relay throttling evidence and post-run ledger state
- captures `relay status` with live bandwidth status populated
- captures the deterministic 24-hour quota reset test output
- writes all outputs into an evidence directory
Seeded deadletter entries and fixtures:

- `seg-dead-1`/`peer-a` with 3 failed attempts
- `seg-dead-2`/`peer-b` with 1 failed attempt
- the retry target segment is persisted in the local store so the retried entry is actually executable
- the retry peer is configured with a loopback address that refuses connections, making retry pickup deterministic
Inspect target:
seg-dead-1/peer-a
2. Run the Lab¶
From the repository root:
```
export GOROOT=/home/ubuntu/.local/go1.25.7
export PATH="$GOROOT/bin:$PATH"
export GOTOOLCHAIN=local
bash scripts/labs/run_edge_deadletter_lab.sh
```
Bandwidth verification mode:
```
bash scripts/labs/run_edge_deadletter_lab.sh \
  "$PWD/evidence/pr16-edge-bandwidth-local-$(date +%F)" \
  --with-bandwidth
```
Optional custom evidence directory:
```
bash scripts/labs/run_edge_deadletter_lab.sh \
  "$PWD/evidence/pr14-edge-deadletter-local-$(date +%F)"
```
3. Generated Artifacts¶
The runner writes these files into the evidence directory:
- `edge.toml`
- `seed-manifest.json`
- `edgectl-status.txt`
- `edged.log`
- `relay-deadletter-list.txt`
- `relay-deadletter-list-limit-1.txt`
- `relay-deadletter-inspect.txt`
- `relay-status-initial.txt`
- `relay-status-initial.json`
- `relay-deadletter-retry.txt`
- `audit-relay-deadletter-retry.log`
- `retry-ledger-state.json`
- `relay-deadletter-list-after-retry.txt`
- `relay-status-after-retry.txt`
- `relay-status-after-retry.json`
- `relay-deadletter-purge-dry-run.txt`
- `relay-deadletter-purge-force.txt`
- `audit-relay-deadletter-purge.log`
- `purge-ledger-state.json`
- `relay-deadletter-list-after-purge.txt`
- `relay-status-after-purge.txt`
- `relay-status-after-purge.json`
Bandwidth mode adds:
- `bandwidth-peer.log`
- `bandwidth-peer-received.json`
- `relay-bandwidth-set.txt`
- `audit-relay-bandwidth-set.log`
- `relay-bandwidth-config.txt`
- `relay-status-with-bandwidth.txt`
- `relay-status-with-bandwidth.json`
- `seg-bandwidth-01-ledger-state.json` through `seg-bandwidth-06-ledger-state.json`
- `daily-quota-reset-test.txt`
4. Expected Results¶
relay deadletter list¶
Expected shape:
```
segment=seg-dead-1 peer=peer-a attempts=3 first_queued=... last_updated=...
segment=seg-dead-2 peer=peer-b attempts=1 first_queued=... last_updated=...
showing 2 deadletter entries
```
Expected truncation proof:
```
segment=seg-dead-1 peer=peer-a attempts=3 first_queued=... last_updated=...
showing 1 of 2 deadletter entries
```
relay deadletter inspect¶
Expected shape:
```
Segment: seg-dead-1
Peer: peer-a
State: deadletter
Attempts: 3
First Queued: ...
Last Updated: ...
Attempt History:
  #1 outcome=FAILED ...
  #2 outcome=FAILED ...
  #3 outcome=FAILED ...
```
relay deadletter retry¶
Expected evidence shape:
- the retry command reports `retried=true`
- `audit-relay-deadletter-retry.log` contains `audit.event=relay.deadletter.retried`
- the retried entry leaves the deadletter list
- `retry-ledger-state.json` shows the same pair still exists in the ledger with:
    - `state: "failed"` or `state: "deadletter"` depending on retry budget
    - a larger `attempt_count` than the pre-retry deadletter fixture
    - an additional trailing failed attempt showing the executor actually picked it up
relay status¶
Expected evidence shape:
- `relay-status-initial.txt` shows relay enabled, configured worker count, success condition, and queue-depth totals that match the seeded ledger state
- `relay-status-after-retry.*` shows the deadletter queue shrinking after the retried entry is picked up by the executor
- `relay-status-after-purge.*` shows the purged entry removed from the live queue totals
- `relay-status-with-bandwidth.*` includes non-nil bandwidth status with the configured rate/quota and non-zero throttle activity when bandwidth mode is used
relay deadletter purge¶
Expected evidence shape:
- dry-run reports `would purge 1 deadletter entries`
- force purge reports `purged 1 deadletter entries`
- `audit-relay-deadletter-purge.log` contains `audit.event=relay.deadletter.purged`
- `purge-ledger-state.json` shows `exists: false`, zero remaining attempts, and zero segment records for that pair
- the final deadletter list no longer contains the purged entry
PR-16 bandwidth enforcement¶
Expected bandwidth-mode evidence shape:
- `relay-bandwidth-set.txt` reports `applied=true`
- `audit-relay-bandwidth-set.log` contains `audit.event=relay.bandwidth.configured`
- `relay-bandwidth-config.txt` shows the configured limits and non-zero `throttle_count`
- `bandwidth-peer-received.json` shows only the number of segments that fit within the configured live rate/quota window
- later bandwidth-segment ledger dumps show `state: "failed"` with `bandwidth_rate_limited` in the trailing attempts
PR-16 daily quota reset after 24h¶
Expected evidence shape:
- `daily-quota-reset-test.txt` captures `TestBandwidthLimiter_DailyQuota_ResetsAfter24h`
- the output shows an initial quota-exceeded event followed by a passing test once the injected clock advances by 24 hours
5. Current Evidence Bundle¶
Reference local runs live in:
- `evidence/pr14-edge-deadletter-local-2026-03-17/README.md`
- `evidence/pr15-edge-deadletter-management-local-2026-03-17/README.md`
- `evidence/pr16-edge-bandwidth-local-2026-03-17/README.md`
- `evidence/pr23-edge-relay-status-local-2026-03-18/README.md`
6. Implementation Notes¶
- Runner script: `scripts/labs/run_edge_deadletter_lab.sh`
- Lab setup helper: `scripts/labs/edge_deadletter_lab_setup.go`
- Bandwidth peer helper: `scripts/labs/edge_deadletter_lab_peer.go`
The helper still seeds the relay ledger directly for determinism, but it now also persists the backing segment payloads and configured peers. That lets the PR-15 retry proof exercise a real executor scan instead of stopping at the mutation RPC.
In bandwidth mode, the same lab also seeds a real relay batch to a local mTLS peer and captures the deterministic 24-hour reset proof from the package test.
7. Sandbox Caveat¶
This lab uses the edged Unix control socket. If your environment blocks local
Unix sockets, the runner may fail with a socket permission error. In that case,
rerun it outside the restricted sandbox.
8. VAL19 — Relay Local-Network Impairment Validation¶
VAL19 is not a mode of this runner. It validates relay store-and-forward
correctness and performance characterisation under real transport impairment
(outage, bandwidth constraint, latency, mid-transfer disconnect) using a custom
TCP impairment proxy (scripts/labs/relay_impairment_proxy.go).
The impairment lab is driven by a separate standalone script:
```
bash scripts/labs/run_relay_impairment_val19_lab.sh [EVIDENCE_DIR]
```
Why separate from this runner:
- Requires two new Go binaries (`relay_impairment_proxy.go`, `edge_relay_impairment_setup.go`) built at runtime
- Routes `edged` through a proxy address (a different `edge.toml` than this lab's)
- Runtime is 3–4 minutes due to outage cycles + bandwidth + latency scenarios
- Reuses `edge_deadletter_lab_peer.go` and `edge_deadletter_lab_dump.go` as helpers (unchanged)
The VAL19 formal plan, 10-check matrix, proxy control API reference, and throughput characterisation method are documented in relay-impairment-validation.md.
9. VAL20 — Relay Throughput Benchmark¶
VAL20 is not a mode of this runner. It measures edge relay executor throughput (segments/sec and bytes/sec) across five workload tiers and demonstrates backpressure behaviour under a 1 Mbps bandwidth constraint.
The throughput benchmark is driven by a separate standalone script:
```
bash scripts/labs/run_relay_throughput_val20_lab.sh [EVIDENCE_DIR]
```
Why separate from this runner:
- Requires a new setup binary (`edge_relay_throughput_setup.go`) that seeds segments to PENDING state (not deadletter) for direct executor pickup
- Reuses `relay_impairment_proxy.go` from VAL19 for bandwidth control
- Runs five distinct tiers (N=1/10/100 × 64 B; N=10 × 128 KB clean/constrained) with fresh `edged` instances per tier — too slow to embed as a phase here
- Ports isolated to 19040–19043 to avoid conflicts with VAL19
The VAL20 formal plan, 10-check matrix, tier definitions, throughput characterisation method, and report template are documented in relay-throughput-validation.md.
10. VAL21 — Relay Queue Depth and Overflow Validation¶
VAL21 is not a mode of this runner. It validates relay queue correctness under depth pressure, LRU eviction-induced relay failure, and relay status accuracy.
The overflow lab is driven by a separate standalone script:
```
bash scripts/labs/run_relay_overflow_val21_lab.sh [EVIDENCE_DIR]
```
Three scenarios:
- S-A — Deep queue drain (N=200 × 64 B): confirms correctness and monotone queue depth drain at scale.
- S-B — LRU eviction interaction (N=10 × 64 KB, ceiling=5×64 KB): confirms that evicted segments fail relay gracefully (deadletter — no panic, no silent data loss) and that all 10 segments are accounted for (delivered + deadletter = 10).
- S-C — Relay status accuracy (N=10 × 64 B): confirms that `queue_depth.scheduled` in `edgectl relay status --output json` equals the seeded count at t+0.4 s, and drops to 0 after all segments are acknowledged.
Why separate from this runner:
- S-B requires a low disk ceiling + aggressive eviction threshold (`--ceiling-bytes 327680 --eviction-threshold 0.50`) that would corrupt concurrent operation with any other scenario
- S-A (N=200) adds 3–5 minutes to any embedded runner
- Ports isolated to 19044–19047 to avoid conflicts with VAL19/VAL20
The VAL21 formal plan, 10-check matrix, LRU eviction policy explanation, and relay status field reference are documented in relay-overflow-validation.md.
11. VAL22 — Deadletter Workflow Validation¶
VAL22 is not a mode of this runner. It re-validates the complete deadletter operator workflow with controlled failure injection and verified delivery outcomes — closing the gap in this runner where retry success is never confirmed against a live peer.
The deadletter workflow lab is driven by:
```
bash scripts/labs/run_relay_deadletter_val22_lab.sh [EVIDENCE_DIR]
```
What this runner already covers vs what VAL22 adds:
| Feature | This runner | VAL22 |
|---|---|---|
| | ✓ | ✓ re-validated |
| | ✓ (command exits 0) | ✓ delivery confirmed |
| Retry success rate | — | ✓ 4/8 = 50% (Group R: 100%, Group U: 0%) |
| Failure injection (outage proxy) | — | ✓ Group U re-deadletters after outage |
| Retention across restart | — | ✓ BoltDB persistence verified |
| | ✓ | ✓ re-validated |
Why separate: this runner hardcodes peer-a at a refusing address —
retry always fails. VAL22 requires a live peer server for delivery
confirmation. Ports isolated to 19050–19053.
The VAL22 formal plan, 10-check matrix, workflow matrix, failure injection plan, and retry success rate definition are documented in deadletter-workflow-validation.md.
12. VAL23 — Relay Bandwidth Management Validation¶
VAL23 is not a mode of this runner. It re-validates the relay bandwidth
controls with isolated scenarios that separate rate-limit enforcement from
daily-quota enforcement — closing the gap in this runner’s --with-bandwidth
mode where both are always applied together and counter accuracy is not
asserted.
The bandwidth validation lab is driven by:
```
bash scripts/labs/run_relay_bandwidth_val23_lab.sh [EVIDENCE_DIR]
```
What this runner already covers vs what VAL23 adds:
| Feature | This runner (`--with-bandwidth`) | VAL23 |
|---|---|---|
| | ✓ | ✓ re-validated |
| | ✓ | ✓ re-validated |
| Rate + quota combined throttle | ✓ (always both) | — (S-B and S-C isolate each) |
| Rate-only throttle | — | ✓ S-B: throttle_count >= 4, quota_drop_count = 0 |
| Quota-only enforcement | — | ✓ S-C: exactly 2/6 segments delivered, quota_drop_count >= 4 |
| Hot-reload (set-bandwidth live) | — | ✓ S-D: applied=true + delivery resumes |
| Unlimited baseline (status check) | — | ✓ S-A: unlimited=true before any config |
| Daily quota reset | ✓ unit test | ✓ S-E: unit test re-run in isolation |
Why separate: this runner’s --with-bandwidth mode always sets
bytes_per_second and daily_quota together — it is impossible to determine
from the evidence whether throttling is due to the rate limit or quota
exhaustion. Ports isolated to 19054–19055 (no proxy needed).
The VAL23 formal plan, 10-check matrix, scenario matrix, token-bucket timing reference, and daily-quota drop path are documented in relay-bandwidth-validation.md.
13. VAL24 — Relay 30-Day Soak Validation¶
VAL24 is not a mode of this runner. It is a 30-day continuous relay soak framework that validates sustained message processing correctness, zero message loss, and deadletter retry recovery over 1,440 rounds.
The soak is driven by three scripts following the VAL12/VAL18 pattern:
```
bash scripts/labs/run_soak_val24_setup.sh [SOAK_DIR]                      # once — installs cron
bash scripts/labs/run_soak_val24_round.sh <SOAK_DIR>                      # per-round (cron-driven)
bash scripts/labs/run_soak_val24_report.sh <SOAK_DIR> [status|final|...]  # any time
```
Architecture difference from VAL12/VAL18: relay BoltDB has no live injection
API — edged restarts once per round to allow direct BoltDB seeding via
edge_relay_soak_val24_setup.go --mode seed. The proxy and peer server stay
running for the full 30-day soak.
What this runner covers vs what VAL24 adds:
| Feature | This runner | VAL24 |
|---|---|---|
| Relay delivery (single run) | ✓ | ✓ sustained 30 days |
| Deadletter list/inspect | ✓ | ✓ per-round |
| Retry recovery | ✓ (single run) | ✓ ~30 outage+recovery cycles |
| BoltDB persistence | ✓ (single restart) | ✓ 1,440 restart cycles |
| Message-loss accounting | — | ✓ seeded = delivered + deadletter + loss |
| Long-duration proof artifact | — | ✓ |
Gate D (all four must hold for final PASS):

- Total rounds ≥ 1,440
- Clean-round delivery rate ≥ 0.990
- Retry recovery rate ≥ 0.990 (retried deadletter entries that delivered)
- Message loss = 0 (no silent loss)
Ports isolated to 19060–19063.
The VAL24 formal plan, 10-check matrix, traffic schedule, failure injection plan, Gate D criteria, monitoring plan, and evidence retention plan are documented in relay-soak-validation.md.
14. VAL27 — Relay Proof Report Generator¶
VAL27 is a report generator (not a test runner) that reads evidence from VAL19–VAL23 and optionally VAL24 soak results, producing a consolidated relay reliability proof report.
VAL27 auto-discovers the latest evidence directory for each relay VAL under the
repo evidence/ root — no manual path configuration needed:
```
bash scripts/labs/run_relay_proof_report_val27.sh evidence/
```
Output is written to evidence/val27/.
Readiness levels:
| Level | Gate |
|---|---|
| Relay Design Partner | VAL19–VAL23 all pass + key metrics met (zero loss, Group R = 1.000, bandwidth correct) |
| Relay GA | Design Partner PLUS VAL24 Gate D pass + multi-peer + production hardware throughput |
| Relay Public Production | GA PLUS BoltDB crash-consistency + multi-peer soaks + production observability |
Beta-marked claims (explicitly labelled in the report):
- Throughput figures are single-host local runs only
- Bandwidth daily reset validated by unit test (injected clock) only
- Impairment throughput is informational (no hard threshold)
- 30-day soak reliability claims are provisional until VAL24 Gate D passes
The formal plan, metric definitions, beta claim resolution paths, and 10-check matrix are documented in relay-proof-report-validation.md.