VAL24 — Relay 30-Day Soak Validation
Audience: engineers and reviewers who want a reproducible long-duration relay soak framework proving sustained message processing correctness, deadletter and retry tracking, and zero message loss over 30 days.
1. Scope
VAL24 validates four operational goals over a 30-day continuous run:
- Long-duration message processing: 10 segments delivered every 30 minutes for 30 days; total ≥ 14,400 segments processed.
- Message-loss accounting: every seeded segment is accounted for, so that delivered (peer receipt evidence) + deadletter (relay deadletter list) + loss = seeded. Gate D target: loss = 0 (no silent loss).
- Deadletter and retry tracking: one outage injection per day causes segments to exhaust max_retry_count and enter DEADLETTER; the following recovery round explicitly retries all deadletter entries and confirms recovery from peer receipt evidence under a clean proxy. Retry recovery rate target: ≥ 99.0%.
- Final proof artifact: reports/final.json with the Gate D assessment (all four criteria) and the complete 10-check matrix result.
Branch rule: coverage by existing runner

| Layer | Existing asset | Coverage |
|---|---|---|
| Rollout 30-day soak | | ✓ CP rollout plans, not relay |
| HA 30-day soak | | ✓ HA failover cycles, not relay |
| Relay point-in-time | | ✓ single-run, not soak |
| Relay impairment | | ✓ single-run, not soak |
| Relay deadletter workflow | | ✓ single-run, not soak |

New 3-script framework required. Key architectural difference from VAL12/VAL18: relay BoltDB has no live injection API, so edged must restart per round to allow direct BoltDB seeding (proxy + peer stay running across rounds).
Out of scope
- Multi-peer relay soaks
- Bandwidth soak (covered by VAL23 + the existing --with-bandwidth mode)
- Relay throughput characterisation at scale (VAL20)
- Automatic reconnect after proxy restart (edged restarts per round anyway)
2. Architecture
edged (restarts per round)
→ relay_impairment_proxy:19061 ← clean (normal rounds) or outage (injection rounds)
→ edge_deadletter_lab_peer:19062 ← stays running; accumulates all deliveries
proxy ctrl API: 19063
edged listen addr: 127.0.0.1:19060
edged control socket: $SOAK_DIR/ctl.sock
Proxy and peer run continuously for 30 days (restarted automatically if detected dead at round start). edged restarts once per round — this is a feature, not a limitation: it tests restart stability and BoltDB persistence across 1,440 restart cycles.
Port assignments (isolated from VAL19–VAL23)
| Component | Address |
|---|---|
| edged | 127.0.0.1:19060 |
| proxy (edged→) | :19061 |
| peer server | :19062 |
| proxy ctrl API | :19063 |

3. Traffic Schedule
| Parameter | Value |
|---|---|
| Round interval | 30 minutes (cron: `*/30 * * * *`) |
| Segments per round | 10 × 64B |
| Total rounds target | 1,440 (30 days × 48 rounds/day) |
| Total segments target | 14,400 |
| Outage injection interval | Every 48 rounds (every 24 hours) |
| Outage duration | 1 round per injection |
| Recovery round | 1 round immediately after each outage |
| Outage events over 30 days | ~30 |

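
The schedule arithmetic can be checked with a short sketch. The function name `classify_round` and the phase of the outage within each 48-round window are assumptions made for illustration (the document only states that an outage occurs every 48 rounds and is followed immediately by a recovery round); the phase below was picked so that a full 1,440-round run contains 30 complete outage/recovery pairs.

```python
from collections import Counter

ROUNDS_PER_DAY = 48    # one round every 30 minutes
TOTAL_ROUNDS = 1440    # 30 days x 48 rounds/day
SEGMENTS_PER_ROUND = 10


def classify_round(round_no: int, interval: int = ROUNDS_PER_DAY) -> str:
    """Classify a 1-based round number as 'outage', 'recovery' or 'clean'.

    Assumption (not from the source): the outage is round 47 of each
    48-round window and the recovery round is round 48 of that window.
    """
    if round_no % interval == interval - 1:
        return "outage"
    if round_no % interval == 0:
        return "recovery"
    return "clean"


counts = Counter(classify_round(r) for r in range(1, TOTAL_ROUNDS + 1))
total_segments = TOTAL_ROUNDS * SEGMENTS_PER_ROUND
print(dict(counts), total_segments)
```

Under this assumed phase the run splits into 1,380 clean, 30 outage, and 30 recovery rounds, with 14,400 segments seeded in total.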
Segment ID scheme
val24-r{ROUND:06d}-{IDX:03d} — e.g. val24-r000042-007 for segment 7
of round 42. The round number prefix allows per-round delivery accounting
from the cumulative peer-received.json without requiring a separate output
file per round.
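
As an illustration of the ID scheme, the following sketch builds and parses segment IDs and filters a cumulative ID list by round. The helper names (`segment_id`, `parse_segment_id`, `ids_for_round`) are hypothetical, and the structure of peer-received.json is not assumed here; the functions operate on plain lists of ID strings.

```python
import re

# Matches IDs of the form val24-r{ROUND:06d}-{IDX:03d}
SEGMENT_ID_RE = re.compile(r"^val24-r(\d{6})-(\d{3})$")


def segment_id(round_no: int, idx: int) -> str:
    """Format a segment ID for the given round and index."""
    return f"val24-r{round_no:06d}-{idx:03d}"


def parse_segment_id(seg_id: str):
    """Return (round_no, idx), or raise ValueError for a malformed ID."""
    match = SEGMENT_ID_RE.match(seg_id)
    if match is None:
        raise ValueError(f"not a VAL24 segment ID: {seg_id!r}")
    return int(match.group(1)), int(match.group(2))


def ids_for_round(all_ids, round_no: int):
    """Select the IDs belonging to one round from a cumulative list."""
    prefix = f"val24-r{round_no:06d}-"
    return [s for s in all_ids if s.startswith(prefix)]


print(segment_id(42, 7))
```

Because the round number is zero-padded to a fixed width, a simple prefix match is enough to separate one round's deliveries from the cumulative evidence.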
4. Failure Injection Plan

| Injection | Mechanism | Effect |
|---|---|---|
| Outage injection (every 24h) | | TCP connections to peer refused. Segments fail all 3 retry attempts (max_retry_count=3; last attempt at ~7s → DEADLETTER). |
| Recovery round | | All deadletter entries from the outage round are retried; proxy is clean → delivery succeeds. |

max_retry_count=3 is chosen deliberately:
- With backoff_base_seconds=1, attempt 3 fires at ~7s (1+2+4).
- Within the 60s round wait, segments go to DEADLETTER deterministically.
- This ensures every outage round produces measurable deadletter entries for the retry recovery check, without relying on long exponential backoff windows.
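
The timing claim above can be reproduced with a small calculation. This is a sketch under simplifying assumptions: pure doubling backoff with no jitter, and attempts that fail instantly. The function name `retry_offsets` is hypothetical and does not describe the relay's actual implementation.

```python
def retry_offsets(max_retry_count: int = 3, backoff_base_seconds: float = 1.0):
    """Return the elapsed time (seconds) at which each retry attempt fires.

    Assumes delays of base * 2**n between attempts (1s, 2s, 4s, ...),
    no jitter, and zero time spent in each failed attempt.
    """
    offsets = []
    elapsed = 0.0
    for attempt in range(max_retry_count):
        elapsed += backoff_base_seconds * (2 ** attempt)
        offsets.append(elapsed)
    return offsets


ROUND_WAIT_SECONDS = 60
offsets = retry_offsets()
print(offsets)                            # [1.0, 3.0, 7.0]
print(offsets[-1] < ROUND_WAIT_SECONDS)   # True
```

With these settings the last attempt lands well inside the 60s round wait, which is why the outage round can expect every segment to have reached DEADLETTER before edged is stopped.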
5. Message-Loss Accounting Invariant
For every completed round:
segments_seeded == segments_delivered + segments_deadletter + segments_loss
- segments_delivered: segments whose IDs appear in peer-received.json
- segments_deadletter: segments whose IDs appear in relay deadletter list
- segments_loss: segments unaccounted for (should be 0)
The invariant holds because:
- With max_retry_count=3 and a 60s wait, all segments reach a terminal state (DELIVERED or DEADLETTER) within the round window.
- Segments still INFLIGHT at the stop signal complete their current attempt before edged exits gracefully (SIGTERM).
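
A minimal sketch of the per-round accounting, using set arithmetic over segment IDs. The function name `account_round` and the returned field names are hypothetical. One further assumption: an ID that appears in both the peer evidence and the deadletter list is counted as delivered, so that no segment is counted twice.

```python
def account_round(seeded_ids, delivered_ids, deadletter_ids):
    """Partition one round's seeded IDs into delivered, deadletter and loss."""
    seeded = set(seeded_ids)
    # Only IDs seeded in this round are relevant; cumulative evidence
    # may contain IDs from other rounds.
    delivered = seeded & set(delivered_ids)
    # Assumption: delivered takes precedence over deadletter.
    deadletter = (seeded & set(deadletter_ids)) - delivered
    lost = seeded - delivered - deadletter
    return {
        "segments_seeded": len(seeded),
        "segments_delivered": len(delivered),
        "segments_deadletter": len(deadletter),
        "segments_loss": len(lost),
        "lost_ids": sorted(lost),
    }


seeded = [f"val24-r000042-{i:03d}" for i in range(10)]
result = account_round(seeded, seeded[:8], seeded[8:9])
print(result["segments_loss"], result["lost_ids"])
```

In this example eight segments are delivered, one is in the deadletter list, and one is unaccounted for, so the round reports a loss of 1 and names the missing ID. By construction the three counts always sum to segments_seeded.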
6. Gate D Criteria
All four criteria must hold for the final report to pass:
| Criterion | Threshold | Metric |
|---|---|---|
| Rounds completed | ≥ 1,440 | |
| Clean-round delivery rate | ≥ 0.990 | |
| Retry recovery rate | ≥ 0.990 | |
| Message loss | = 0 | |

The clean-round delivery rate is computed only over non-outage, non-recovery rounds. The 0.990 threshold (not 1.000) allows for rare transient failures (e.g., scheduler timing edge cases) without causing the soak to fail. The retry criterion is sampled only after at least one outage/recovery cycle has produced real retried entries.
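
The four criteria can be combined as in the sketch below. The `RoundMetrics` record, its `kind` label, and the `gate_d` function are hypothetical names for illustration; the report script's real metric names are not shown in this document. The sketch treats the retry criterion as not yet evaluable (and therefore not passing) when no entries have been retried.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RoundMetrics:
    kind: str            # hypothetical label: "clean", "outage" or "recovery"
    seeded: int
    delivered: int
    loss: int
    retried: int = 0     # deadletter entries retried in a recovery round
    recovered: int = 0   # retried entries later confirmed by peer evidence


def gate_d(rounds: Iterable[RoundMetrics],
           min_rounds: int = 1440,
           min_rate: float = 0.990) -> dict:
    rounds = list(rounds)
    # Delivery rate is computed over clean rounds only.
    clean = [r for r in rounds if r.kind == "clean"]
    clean_seeded = sum(r.seeded for r in clean)
    clean_delivered = sum(r.delivered for r in clean)
    delivery_rate = clean_delivered / clean_seeded if clean_seeded else 0.0

    retried = sum(r.retried for r in rounds)
    recovered = sum(r.recovered for r in rounds)
    # None means no retried entries exist yet, so the rate is undefined.
    retry_rate: Optional[float] = recovered / retried if retried else None

    loss = sum(r.loss for r in rounds)
    passed = (
        len(rounds) >= min_rounds
        and delivery_rate >= min_rate
        and retry_rate is not None
        and retry_rate >= min_rate
        and loss == 0
    )
    return {
        "rounds": len(rounds),
        "delivery_rate": delivery_rate,
        "retry_rate": retry_rate,
        "loss": loss,
        "passed": passed,
    }
```

Keeping the outage and recovery rounds out of the delivery-rate denominator matters: an outage round delivers nothing by design, so including it would push the rate below the threshold even for a healthy relay.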
7. 10-Check Matrix

| ID | When | Description | Pass criterion |
|---|---|---|---|
| VAL24-01 | Setup | edged + proxy + peer start cleanly; round 1 completes | Round 1 delivery_rate = 1.0 (all seeded segments delivered) |
| VAL24-02 | Day 1 | First clean round: all 10 segments delivered | |
| VAL24-03 | Day 1 | First clean round: deadletter count = 0 | |
| VAL24-04 | Day 1 | Outage injection: segments enter DEADLETTER state | Any outage round has |
| VAL24-05 | Day 1 | Recovery: deadletter entries retried and observed delivered | At least 1 recovery round with |
| VAL24-06 | Day 7 | Clean-round delivery rate ≥ 0.990 after 7 days | |
| VAL24-07 | Day 7 | Retry recovery rate ≥ 0.990 after 7 days | |
| VAL24-08 | Day 14 | Message loss rate = 0.000 (no silent loss) | |
| VAL24-09 | Day 30 | Total rounds ≥ 1,440 | |
| VAL24-10 | Day 30 | Gate D: all four criteria pass | delivery ≥ 0.990 AND retry ≥ 0.990 AND loss = 0 AND rounds ≥ 1440 |

Evidence files per round

| File | What it contains |
|---|---|
| | Per-round metrics (seeded, delivered, deadletter, loss, proxy_mode, duration_ms) |
| | Exact segment IDs seeded in this round |
| | edged log for this round |
| | |
| | |
| | edged log during recovery-round retry session |
| | Deadletter list before retry sweep |
| | |
| | Cumulative peer delivery evidence used to confirm recovered segment IDs after retry |
| | Daily aggregate report |
| | 7-day checkpoint with Gate D partial assessment |
| | 14-day checkpoint |
| | 30-day final report with full Gate D assessment |

8. Run the Soak
Start
export GOROOT=/home/ubuntu/.local/go1.25.7
export PATH="$GOROOT/bin:$PATH"
export GOTOOLCHAIN=local
bash scripts/labs/run_soak_val24_setup.sh \
"$PWD/evidence/val24-relay-soak-$(date +%F)"
The setup script installs a cron job (*/30 * * * *) and runs round 1
immediately to validate the environment.
Monitor
# Current status (any time)
bash scripts/labs/run_soak_val24_report.sh <SOAK_DIR> status
# Generate a checkpoint report
bash scripts/labs/run_soak_val24_report.sh <SOAK_DIR> checkpoint-7d
# Generate final report
bash scripts/labs/run_soak_val24_report.sh <SOAK_DIR> final
Stop
# Remove the cron job
crontab -l | grep -v run_soak_val24_round.sh | crontab -
# Stop any running edged instance (if mid-round)
kill "$(cat <SOAK_DIR>/edged.pid)" 2>/dev/null || true
9. Monitoring and Alerting Plan
The round script logs each round result to cron.log. Operators should
monitor:
| Signal | Location | Alert condition |
|---|---|---|
| Silent loss appears | | Immediate investigation: indicates a relay or BoltDB invariant violation |
| Delivery rate drops | | Investigate edged.log for this round |
| Deadletter entries grow | | Expected after outage rounds; alert if it persists after the recovery round |
| edged crash | | Check log, restart manually if needed |
| Proxy/peer crash | | Round script auto-restarts both; alert if recurring |

Round-level evidence retention: rounds/ directory grows at ~5 KB/round
(log + JSON files). Over 30 days: ~1,440 rounds × 5 KB = ~7 MB total.
Re-running run_soak_val24_setup.sh with the same SOAK_DIR is a resume
operation: it preserves the existing TLS/config state and keeps the long-lived
proxy/peer processes if they are already running.
10. Evidence Retention Plan

| Artifact | Retention |
|---|---|
| | Keep full 30-day history (≈7 MB) |
| | Keep; final count = proof of total delivered |
| | Keep 30 daily reports |
| | Keep 2 checkpoint reports |
| | Keep; primary Gate D proof artifact |
| | Keep for 30 days; trim after final report if needed |

11. Final Report Format
VAL24 — Relay 30-Day Soak Validation
Generated: <YYYY-MM-DD>
Soak dir: <path>
Elapsed: 30 day(s) (1440/1440 rounds)
Traffic summary:
Total seeded: 14400
Total delivered: 14356
Total deadletter: 44
Total loss (silent): 0
Clean rounds: 1380 delivery rate: 0.9996
Outage rounds: 30
Recovery rounds: 30 retried: 44 recovered: 44
Retry recovery rate: 1.0000
10-check matrix:
VAL24-01 PASS edged+proxy+peer started cleanly (round 1 delivered)
VAL24-02 PASS first clean round delivery_rate=1.0
VAL24-03 PASS first clean round deadletter=0
VAL24-04 PASS outage_rounds=30, any_deadletter=True
VAL24-05 PASS recovery_rounds=30, retried=44, recovered=44
VAL24-06 PASS clean_delivery_rate=0.9996 >= 0.990
VAL24-07 PASS retry_recovery_rate=1.0000 >= 0.990
VAL24-08 PASS loss_count=0, loss_rate=0.0000
VAL24-09 PASS total_rounds=1440 >= 1440
VAL24-10 PASS Gate D: delivery=0.9996, retry=1.0000, loss=0
Overall: PASS=10 FAIL=0
Gate D: PASS (delivery=0.9996, retry=1.0000, loss=0)
12. Tooling
| File | Role |
|---|---|
| run_soak_val24_setup.sh | One-time setup (build, init, cron install, round 1) |
| run_soak_val24_round.sh | Per-round execution (cron-driven) |
| run_soak_val24_report.sh | Report generation (daily/checkpoint/final/status) |
| | Setup binary: |
| | Reused from VAL19: clean/outage proxy |
| | Reused from PR-14: mTLS peer server |
| | Reused: ledger state dump |