Edge Relay Deadletter Lab

Audience: operators and reviewers who want a reproducible local lab for the PR-14, PR-15, PR-16, and PR-23 relay operator surfaces in edged and edgectl.

This lab proves the following operator workflows and enforcement behaviours against a live daemon:

  1. edgectl relay deadletter list

  2. edgectl relay deadletter inspect <segment-id> <peer-id>

  3. edgectl relay deadletter retry <segment-id> <peer-id>

  4. edgectl relay deadletter purge --segment-id ... --peer-id ... [--force]

  5. PR-16 bandwidth enforcement under sustained relay load

  6. PR-16 daily quota reset evidence using the deterministic limiter clock path

  7. edgectl relay status [--output text|json]

It mirrors the HA evidence style:

  • one reusable lab entry point

  • one dated evidence/ bundle from a real run

  • raw captured command outputs stored in the repo

1. What the Lab Does

The lab runner:

  • builds edged and edgectl

  • generates a deterministic edge.toml and short-lived lab TLS material

  • seeds the relay BoltDB ledger with deadlettered entries and attempt history

  • starts edged with a real Unix control socket

  • runs edgectl relay deadletter list

  • runs edgectl relay deadletter inspect seg-dead-1 peer-a

  • runs edgectl relay status and captures both text and JSON output

  • runs edgectl relay deadletter retry seg-dead-1 peer-a

  • captures relay status after retry and purge to prove queue-depth changes

  • captures post-retry ledger state from the real relay BoltDB

  • runs edgectl relay deadletter purge in dry-run and --force modes

  • captures post-purge ledger state from the real relay BoltDB

  • optionally seeds bandwidth-limited relay work to a real mTLS peer

  • captures live relay throttling evidence and post-run ledger state

  • captures relay status with live bandwidth status populated

  • captures the deterministic 24-hour quota reset test output

  • writes all outputs into an evidence directory
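
For orientation, a condensed sketch of the capture sequence the runner performs (the $EVIDENCE variable and redirections are illustrative; the real runner also handles build, seeding, TLS, daemon lifecycle, and the optional bandwidth phase):

EVIDENCE="$PWD/evidence/pr14-edge-deadletter-local-$(date +%F)"

edgectl relay deadletter list                      > "$EVIDENCE/relay-deadletter-list.txt"
edgectl relay deadletter inspect seg-dead-1 peer-a > "$EVIDENCE/relay-deadletter-inspect.txt"
edgectl relay status                               > "$EVIDENCE/relay-status-initial.txt"
edgectl relay status --output json                 > "$EVIDENCE/relay-status-initial.json"
edgectl relay deadletter retry seg-dead-1 peer-a   > "$EVIDENCE/relay-deadletter-retry.txt"

# Purge pair is selected by the runner; shown here as variables.
edgectl relay deadletter purge --segment-id "$SEG" --peer-id "$PEER"         > "$EVIDENCE/relay-deadletter-purge-dry-run.txt"
edgectl relay deadletter purge --segment-id "$SEG" --peer-id "$PEER" --force > "$EVIDENCE/relay-deadletter-purge-force.txt"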

Seeded deadletter entries and fixtures:

  • seg-dead-1 / peer-a with 3 failed attempts

  • seg-dead-2 / peer-b with 1 failed attempt

  • the retry target segment is persisted in the local store so the retried entry is actually executable

  • the retry peer is configured with a loopback address that refuses connections, making retry pickup deterministic

Inspect target:

  • seg-dead-1 / peer-a

2. Run the Lab

From the repository root:

export GOROOT=/home/ubuntu/.local/go1.25.7
export PATH="$GOROOT/bin:$PATH"
export GOTOOLCHAIN=local

bash scripts/labs/run_edge_deadletter_lab.sh

Bandwidth verification mode:

bash scripts/labs/run_edge_deadletter_lab.sh \
  "$PWD/evidence/pr16-edge-bandwidth-local-$(date +%F)" \
  --with-bandwidth

Optional custom evidence directory:

bash scripts/labs/run_edge_deadletter_lab.sh \
  "$PWD/evidence/pr14-edge-deadletter-local-$(date +%F)"

3. Generated Artifacts

The runner writes these files into the evidence directory:

  • edge.toml

  • seed-manifest.json

  • edgectl-status.txt

  • edged.log

  • relay-deadletter-list.txt

  • relay-deadletter-list-limit-1.txt

  • relay-deadletter-inspect.txt

  • relay-status-initial.txt

  • relay-status-initial.json

  • relay-deadletter-retry.txt

  • audit-relay-deadletter-retry.log

  • retry-ledger-state.json

  • relay-deadletter-list-after-retry.txt

  • relay-status-after-retry.txt

  • relay-status-after-retry.json

  • relay-deadletter-purge-dry-run.txt

  • relay-deadletter-purge-force.txt

  • audit-relay-deadletter-purge.log

  • purge-ledger-state.json

  • relay-deadletter-list-after-purge.txt

  • relay-status-after-purge.txt

  • relay-status-after-purge.json

Bandwidth mode adds:

  • bandwidth-peer.log

  • bandwidth-peer-received.json

  • relay-bandwidth-set.txt

  • audit-relay-bandwidth-set.log

  • relay-bandwidth-config.txt

  • relay-status-with-bandwidth.txt

  • relay-status-with-bandwidth.json

  • seg-bandwidth-01-ledger-state.json through seg-bandwidth-06-ledger-state.json

  • daily-quota-reset-test.txt

4. Expected Results

relay deadletter list

Expected shape:

segment=seg-dead-1 peer=peer-a attempts=3 first_queued=... last_updated=...
segment=seg-dead-2 peer=peer-b attempts=1 first_queued=... last_updated=...
showing 2 deadletter entries

Expected truncation proof:

segment=seg-dead-1 peer=peer-a attempts=3 first_queued=... last_updated=...
showing 1 of 2 deadletter entries

relay deadletter inspect

Expected shape:

Segment:      seg-dead-1
Peer:         peer-a
State:        deadletter
Attempts:     3
First Queued: ...
Last Updated: ...

Attempt History:
  #1  outcome=FAILED ...
  #2  outcome=FAILED ...
  #3  outcome=FAILED ...

relay deadletter retry

Expected evidence shape:

  • the retry command reports retried=true

  • audit-relay-deadletter-retry.log contains audit.event=relay.deadletter.retried

  • the retried entry leaves the deadletter list

  • retry-ledger-state.json shows the same pair still exists in the ledger with:

    • state: "failed" or state: "deadletter" depending on retry budget

    • a larger attempt_count than the pre-retry deadletter fixture

    • an additional trailing failed attempt showing the executor actually picked it up
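
These claims can be spot-checked from inside the evidence directory. The jq paths assume retry-ledger-state.json exposes the state and attempt_count fields named above at the top level, which is an assumption about its layout:

grep "audit.event=relay.deadletter.retried" audit-relay-deadletter-retry.log
jq '.state, .attempt_count' retry-ledger-state.json
grep -c "seg-dead-1" relay-deadletter-list-after-retry.txt   # expect 0: the entry left the list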

relay status

Expected evidence shape:

  • relay-status-initial.txt shows relay enabled, configured worker count, success condition, and queue-depth totals that match the seeded ledger state

  • relay-status-after-retry.* shows the deadletter queue shrinking after the retried entry is picked up by the executor

  • relay-status-after-purge.* shows the purged entry removed from the live queue totals

  • relay-status-with-bandwidth.* includes non-nil bandwidth status with the configured rate/quota and non-zero throttle activity when bandwidth mode is used
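
The queue-depth movement can be diffed straight from the JSON captures. The queue_depth path mirrors the queue_depth.scheduled field referenced in section 10 (S-C); the surrounding JSON layout is otherwise an assumption:

jq '.queue_depth' relay-status-initial.json
jq '.queue_depth' relay-status-after-retry.json
jq '.queue_depth' relay-status-after-purge.json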

relay deadletter purge

Expected evidence shape:

  • dry-run reports would purge 1 deadletter entries

  • force purge reports purged 1 deadletter entries

  • audit-relay-deadletter-purge.log contains audit.event=relay.deadletter.purged

  • purge-ledger-state.json shows exists: false, zero remaining attempts, and zero segment records for that pair

  • the final deadletter list no longer contains the purged entry
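
Minimal spot checks, with the exists field taken from the description above:

grep "audit.event=relay.deadletter.purged" audit-relay-deadletter-purge.log
jq '.exists' purge-ledger-state.json   # expect false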

PR-16 bandwidth enforcement

Expected bandwidth-mode evidence shape:

  • relay-bandwidth-set.txt reports applied=true

  • audit-relay-bandwidth-set.log contains audit.event=relay.bandwidth.configured

  • relay-bandwidth-config.txt shows the configured limits and non-zero throttle_count

  • bandwidth-peer-received.json shows only the number of segments that fit within the configured live rate/quota window

  • later bandwidth-segment ledger dumps show state: "failed" with bandwidth_rate_limited in the trailing attempts
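
Hedged spot checks, assuming bandwidth-peer-received.json is a JSON array of received segments (its exact shape is not specified here):

jq 'length' bandwidth-peer-received.json                             # only segments that fit the window
grep -l "bandwidth_rate_limited" seg-bandwidth-0*-ledger-state.json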

PR-16 daily quota reset after 24h

Expected evidence shape:

  • daily-quota-reset-test.txt captures TestBandwidthLimiter_DailyQuota_ResetsAfter24h

  • the output shows an initial quota-exceeded event followed by a passing test once the injected clock advances by 24 hours
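
The captured test can be re-run directly. Selecting by test name avoids hardcoding the package path, which this document does not specify:

go test ./... -run 'TestBandwidthLimiter_DailyQuota_ResetsAfter24h' -v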

5. Current Evidence Bundle

Reference local runs live in:

  • evidence/pr14-edge-deadletter-local-2026-03-17/README.md

  • evidence/pr15-edge-deadletter-management-local-2026-03-17/README.md

  • evidence/pr16-edge-bandwidth-local-2026-03-17/README.md

  • evidence/pr23-edge-relay-status-local-2026-03-18/README.md

6. Implementation Notes

  • Runner script: scripts/labs/run_edge_deadletter_lab.sh

  • Lab setup helper: scripts/labs/edge_deadletter_lab_setup.go

  • Bandwidth peer helper: scripts/labs/edge_deadletter_lab_peer.go

The helper still seeds the relay ledger directly for determinism, but it now also persists the backing segment payloads and configured peers. That lets the PR-15 retry proof exercise a real executor scan instead of stopping at the mutation RPC.

In bandwidth mode, the same lab also seeds a real relay batch to a local mTLS peer and captures the deterministic 24-hour reset proof from the package test.

7. Sandbox Caveat

This lab uses the edged Unix control socket. If your environment blocks local Unix sockets, the runner may fail with a socket permission error. In that case, rerun it outside the restricted sandbox.

8. VAL19 — Relay Local-Network Impairment Validation

VAL19 is not a mode of this runner. It validates relay store-and-forward correctness and performance characterisation under real transport impairment (outage, bandwidth constraint, latency, mid-transfer disconnect) using a custom TCP impairment proxy (scripts/labs/relay_impairment_proxy.go).

The impairment lab is driven by a separate standalone script:

bash scripts/labs/run_relay_impairment_val19_lab.sh [EVIDENCE_DIR]

Why separate from this runner:

  • Requires two new Go binaries (relay_impairment_proxy.go, edge_relay_impairment_setup.go) built at runtime

  • Routes edged through a proxy address (different edge.toml than this lab)

  • Runtime is 3–4 minutes due to outage cycles + bandwidth + latency scenarios

  • Reuses edge_deadletter_lab_peer.go and edge_deadletter_lab_dump.go as helpers (unchanged)

The VAL19 formal plan, 10-check matrix, proxy control API reference, and throughput characterisation method are documented in relay-impairment-validation.md.

9. VAL20 — Relay Throughput Benchmark

VAL20 is not a mode of this runner. It measures edge relay executor throughput (segments/sec and bytes/sec) across five workload tiers and demonstrates backpressure behaviour under a 1 Mbps bandwidth constraint.

The throughput benchmark is driven by a separate standalone script:

bash scripts/labs/run_relay_throughput_val20_lab.sh [EVIDENCE_DIR]

Why separate from this runner:

  • Requires a new setup binary (edge_relay_throughput_setup.go) that seeds segments to PENDING state (not deadletter) for direct executor pickup

  • Reuses relay_impairment_proxy.go from VAL19 for bandwidth control

  • Runs five distinct tiers (N=1/10/100 × 64 B; N=10 × 128 KB clean/constrained) with fresh edged instances per tier — too slow to embed as a phase here

  • Ports isolated to 19040–19043 to avoid conflicts with VAL19

The VAL20 formal plan, 10-check matrix, tier definitions, throughput characterisation method, and report template are documented in relay-throughput-validation.md.

10. VAL21 — Relay Queue Depth and Overflow Validation

VAL21 is not a mode of this runner. It validates relay queue correctness under depth pressure, LRU eviction-induced relay failure, and relay status accuracy.

The overflow lab is driven by a separate standalone script:

bash scripts/labs/run_relay_overflow_val21_lab.sh [EVIDENCE_DIR]

Three scenarios:

  1. S-A — Deep queue drain (N=200 × 64 B): confirms correctness and monotone queue depth drain at scale.

  2. S-B — LRU eviction interaction (N=10 × 64 KB, ceiling=5×64 KB): confirms that evicted segments fail relay gracefully (deadletter — no panic, no silent data loss) and that all 10 segments are accounted for (delivered + deadletter = 10).

  3. S-C — Relay status accuracy (N=10 × 64 B): confirms that edgectl relay status --output json queue_depth.scheduled equals the seeded count at t+0.4 s, and drops to 0 after all segments are acknowledged.
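
The S-C assertion can be reproduced by hand using the status field it names; the 0.4 s sleep mirrors the t+0.4 s sample point:

sleep 0.4
edgectl relay status --output json | jq '.queue_depth.scheduled'   # expect the seeded count, then 0 after drain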

Why separate from this runner:

  • S-B requires a low disk ceiling + aggressive eviction threshold (--ceiling-bytes 327680 --eviction-threshold 0.50) that would corrupt concurrent operation with any other scenario

  • S-A (N=200) adds 3–5 minutes to any embedded runner

  • Ports isolated to 19044–19047 to avoid conflicts with VAL19/VAL20

The VAL21 formal plan, 10-check matrix, LRU eviction policy explanation, and relay status field reference are documented in relay-overflow-validation.md.

11. VAL22 — Deadletter Workflow Validation

VAL22 is not a mode of this runner. It re-validates the complete deadletter operator workflow with controlled failure injection and verified delivery outcomes — closing the gap in this runner where retry success is never confirmed against a live peer.

The deadletter workflow lab is driven by:

bash scripts/labs/run_relay_deadletter_val22_lab.sh [EVIDENCE_DIR]

What this runner already covers vs what VAL22 adds:

  Feature                            This runner            VAL22
  relay deadletter list + inspect    ✓                      re-validated
  relay deadletter retry command     ✓ (command exits 0)    delivery confirmed
  Retry success rate                 —                      ✓ 4/8 = 50% (Group R: 100%, Group U: 0%)
  Failure injection (outage proxy)   —                      ✓ Group U re-deadletters after outage
  Retention across restart           —                      ✓ BoltDB persistence verified
  relay deadletter purge             ✓                      re-validated

Why separate: this runner hardcodes peer-a at a refusing address — retry always fails. VAL22 requires a live peer server for delivery confirmation. Ports isolated to 19050–19053.

The VAL22 formal plan, 10-check matrix, workflow matrix, failure injection plan, and retry success rate definition are documented in deadletter-workflow-validation.md.

12. VAL23 — Relay Bandwidth Management Validation

VAL23 is not a mode of this runner. It re-validates the relay bandwidth controls with isolated scenarios that separate rate-limit enforcement from daily-quota enforcement — closing the gap in this runner’s --with-bandwidth mode where both are always applied together and counter accuracy is not asserted.

The bandwidth validation lab is driven by:

bash scripts/labs/run_relay_bandwidth_val23_lab.sh [EVIDENCE_DIR]

What this runner already covers vs what VAL23 adds:

  Feature                                  This runner (--with-bandwidth)   VAL23
  relay config set-bandwidth command       ✓                                re-validated
  relay.bandwidth.configured audit event   ✓                                re-validated
  Rate + quota combined throttle           ✓ (always both)                  — (S-B and S-C isolate each)
  Rate-only throttle                       —                                ✓ S-B: throttle_count >= 4, quota_drop_count = 0
  Quota-only enforcement                   —                                ✓ S-C: exactly 2/6 segments delivered, quota_drop_count >= 4
  Hot-reload (set-bandwidth live)          —                                ✓ S-D: applied=true + delivery resumes
  Unlimited baseline (status check)        —                                ✓ S-A: unlimited=true before any config
  Daily quota reset                        ✓ unit test                      ✓ S-E: unit test re-run in isolation

Why separate: this runner’s --with-bandwidth mode always sets bytes_per_second and daily_quota together — it is impossible to determine from the evidence whether throttling is due to the rate limit or quota exhaustion. Ports isolated to 19054–19055 (no proxy needed).

The VAL23 formal plan, 10-check matrix, scenario matrix, token-bucket timing reference, and daily-quota drop path are documented in relay-bandwidth-validation.md.

13. VAL24 — Relay 30-Day Soak Validation

VAL24 is not a mode of this runner. It is a 30-day continuous relay soak framework that validates sustained message processing correctness, zero message loss, and deadletter retry recovery over 1,440 rounds.

The soak is driven by three scripts following the VAL12/VAL18 pattern:

bash scripts/labs/run_soak_val24_setup.sh [SOAK_DIR]    # once — installs cron
bash scripts/labs/run_soak_val24_round.sh  <SOAK_DIR>   # per-round (cron-driven)
bash scripts/labs/run_soak_val24_report.sh <SOAK_DIR> [status|final|...]  # any time

Architecture difference from VAL12/VAL18: relay BoltDB has no live injection API — edged restarts once per round to allow direct BoltDB seeding via edge_relay_soak_val24_setup.go --mode seed. The proxy and peer server stay running for the full 30-day soak.
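
1,440 rounds over 30 days works out to 48 rounds per day, i.e. one round every 30 minutes. The setup script installs a cron entry to that effect; the line below illustrates the cadence and is not the script's literal output:

*/30 * * * * bash /path/to/repo/scripts/labs/run_soak_val24_round.sh /path/to/soak-dir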

What this runner covers vs what VAL24 adds:

  Feature                        This runner          VAL24
  Relay delivery (single run)    ✓                    sustained 30 days
  Deadletter list/inspect        ✓                    per-round
  Retry recovery                 ✓ (single run)       ✓ ~30 outage+recovery cycles
  BoltDB persistence             ✓ (single restart)   ✓ 1,440 restart cycles
  Message-loss accounting        —                    ✓ seeded = delivered + deadletter + loss
  Long-duration proof artifact   —                    reports/final.json with Gate D

Gate D (all four must hold for final PASS):

  1. Total rounds ≥ 1,440

  2. Clean-round delivery rate ≥ 0.990

  3. Retry recovery rate ≥ 0.990 (retried deadletter entries that delivered)

  4. Message loss = 0 (no silent loss)
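
A hypothetical one-liner for evaluating Gate D against reports/final.json; every field name below is an assumption about the report schema, not a documented contract:

jq '(.total_rounds >= 1440)
    and (.clean_round_delivery_rate >= 0.990)
    and (.retry_recovery_rate >= 0.990)
    and (.message_loss == 0)' reports/final.json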

Ports isolated to 19060–19063.

The VAL24 formal plan, 10-check matrix, traffic schedule, failure injection plan, Gate D criteria, monitoring plan, and evidence retention plan are documented in relay-soak-validation.md.

14. VAL27 — Relay Proof Report Generator

VAL27 is a report generator (not a test runner) that reads evidence from VAL19–VAL23 and optionally VAL24 soak results, producing a consolidated relay reliability proof report.

VAL27 auto-discovers the latest evidence directory for each relay VAL under the repo evidence/ root — no manual path configuration needed:

bash scripts/labs/run_relay_proof_report_val27.sh evidence/

Output is written to evidence/val27/.

Readiness levels:

  Level                     Gate
  Relay Design Partner      VAL19–VAL23 all pass + key metrics met (zero loss, Group R = 1.000, bandwidth correct)
  Relay GA                  Design Partner PLUS VAL24 Gate D pass + multi-peer + production hardware throughput
  Relay Public Production   GA PLUS BoltDB crash-consistency + multi-peer soaks + production observability

Beta-marked claims (explicitly labelled in the report):

  • Throughput figures are single-host local runs only

  • Bandwidth daily reset validated by unit test (injected clock) only

  • Impairment throughput is informational (no hard threshold)

  • 30-day soak reliability claims are provisional until VAL24 Gate D passes

The formal plan, metric definitions, beta claim resolution paths, and 10-check matrix are documented in relay-proof-report-validation.md.