VAL24 — Relay 30-Day Soak Validation

Audience: engineers and reviewers who want a reproducible long-duration relay soak framework proving sustained message processing correctness, deadletter and retry tracking, and zero message loss over 30 days.

1. Scope

VAL24 validates four operational goals over a 30-day continuous run:

  1. Long-duration message processing — 10 segments delivered every 30 minutes for 30 days; total ≥ 14,400 segments processed.

  2. Message-loss accounting — every seeded segment is accounted for: delivered (peer receipt evidence) + deadletter (relay deadletter list) + loss = seeded. Gate D target: loss = 0 (no silent loss).

  3. Deadletter and retry tracking — one outage injection per day causes segments to exhaust max_retry_count and enter DEADLETTER; the following recovery round explicitly retries all deadletter entries and confirms recovery from peer receipt evidence under a clean proxy. Retry recovery rate target: ≥ 99.0%.

  4. Final proof artifact: reports/final.json with the Gate D assessment (all four criteria) and the complete 10-check matrix result.

Branch rule: coverage by existing runner

| Layer | Existing asset | Coverage |
| --- | --- | --- |
| Rollout 30-day soak | run_soak_val12_*.sh | ✓ CP rollout plans, not relay |
| HA 30-day soak | run_soak_val18_*.sh | ✓ HA failover cycles, not relay |
| Relay point-in-time | run_edge_deadletter_lab.sh | ✓ single-run, not soak |
| Relay impairment | run_relay_impairment_val19_lab.sh | ✓ single-run, not soak |
| Relay deadletter workflow | run_relay_deadletter_val22_lab.sh | ✓ single-run, not soak |

None of the existing runners covers a relay soak, so a new three-script framework is required. The key architectural difference from VAL12/VAL18: relay BoltDB has no live injection API, so edged must restart every round to allow direct BoltDB seeding (the proxy and peer stay running across rounds).

Out of scope

  • Multi-peer relay soaks

  • Bandwidth soak (covered by VAL23 + the existing --with-bandwidth mode)

  • Relay throughput characterisation at scale (VAL20)

  • Automatic reconnect after proxy restart (edged restarts per round anyway)

2. Architecture

edged (restarts per round)
  → relay_impairment_proxy:19061   ← clean (normal rounds) or outage (injection rounds)
      → edge_deadletter_lab_peer:19062  ← stays running; accumulates all deliveries
proxy ctrl API: 19063
edged listen addr: 127.0.0.1:19060
edged control socket: $SOAK_DIR/ctl.sock

Proxy and peer run continuously for 30 days (restarted automatically if detected dead at round start). edged restarts once per round — this is a feature, not a limitation: it tests restart stability and BoltDB persistence across 1,440 restart cycles.

Port assignments (isolated from VAL19–VAL23)

| Component | Address |
| --- | --- |
| edged | 127.0.0.1:19060 |
| proxy (edged→) | 127.0.0.1:19061 |
| peer server | 127.0.0.1:19062 |
| proxy ctrl API | 127.0.0.1:19063 |

3. Traffic Schedule

| Parameter | Value |
| --- | --- |
| Round interval | 30 minutes (cron: */30 * * * *) |
| Segments per round | 10 × 64B |
| Total rounds target | 1,440 (30 days × 48 rounds/day) |
| Total segments target | 14,400 |
| Outage injection interval | Every 48 rounds (every 24 hours) |
| Outage duration | 1 round per injection |
| Recovery round | 1 round immediately after each outage |
| Outage events over 30 days | ~30 |
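The schedule arithmetic above can be checked with a short sketch. The position of the outage round inside each 48-round cycle is not stated in this plan, so `OUTAGE_OFFSET` below is a hypothetical value chosen only for illustration, as is the function name `round_kind`.

```python
ROUNDS_PER_DAY = 48        # 24 h at one round every 30 minutes
TOTAL_DAYS = 30
SEGMENTS_PER_ROUND = 10
OUTAGE_OFFSET = 24         # ASSUMPTION: which round of each 48-round cycle is the outage


def round_kind(round_no: int, offset: int = OUTAGE_OFFSET) -> str:
    """Classify a 1-based round number as 'outage', 'recovery', or 'clean'."""
    pos = round_no % ROUNDS_PER_DAY
    if pos == offset:
        return "outage"
    if pos == (offset + 1) % ROUNDS_PER_DAY:
        return "recovery"      # the round immediately after an outage
    return "clean"


total_rounds = TOTAL_DAYS * ROUNDS_PER_DAY
total_segments = total_rounds * SEGMENTS_PER_ROUND
kinds = [round_kind(r) for r in range(1, total_rounds + 1)]
counts = {k: kinds.count(k) for k in ("clean", "outage", "recovery")}
print(total_rounds, total_segments, counts)
```

With this offset the 1,440 rounds split into 30 outage rounds, 30 recovery rounds, and 1,380 clean rounds. An offset that places an outage on the very last round would leave that outage without a recovery round inside the 30-day window.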

Segment ID scheme

val24-r{ROUND:06d}-{IDX:03d} — e.g. val24-r000042-007 for segment 7 of round 42. The round number prefix allows per-round delivery accounting from the cumulative peer-received.json without requiring a separate output file per round.
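As a minimal sketch of the ID scheme, the helpers below format an ID, parse it back, and filter a cumulative list of received IDs down to one round. The function names are illustrative and are not taken from the lab scripts.

```python
import re

# Matches IDs of the form val24-r{ROUND:06d}-{IDX:03d}
_SEGMENT_ID = re.compile(r"^val24-r(\d{6})-(\d{3})$")


def segment_id(round_no: int, idx: int) -> str:
    """Build the segment ID for segment `idx` of round `round_no`."""
    return f"val24-r{round_no:06d}-{idx:03d}"


def parse_segment_id(seg_id: str) -> tuple[int, int]:
    """Return (round_no, idx), or raise ValueError for a malformed ID."""
    m = _SEGMENT_ID.match(seg_id)
    if m is None:
        raise ValueError(f"not a VAL24 segment ID: {seg_id!r}")
    return int(m.group(1)), int(m.group(2))


def ids_for_round(received_ids: list[str], round_no: int) -> list[str]:
    """Select one round's IDs from a cumulative list by round prefix."""
    prefix = f"val24-r{round_no:06d}-"
    return [s for s in received_ids if s.startswith(prefix)]


print(segment_id(42, 7))  # val24-r000042-007
```

Because the round number is zero-padded to a fixed width, a plain prefix match is enough to isolate a round; no per-round output file is needed.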

4. Failure Injection Plan

| Injection | Mechanism | Effect |
| --- | --- | --- |
| Outage injection (every 24h) | PUT /mode {"type":"outage"} to proxy ctrl API | TCP connections to peer refused. Segments fail all 3 retry attempts (max_retry_count=3; last attempt at ~7s → DEADLETTER). |
| Recovery round | PUT /mode {"type":"clean"} + relay deadletter retry for each entry | All deadletter entries from the outage round are retried; proxy is clean → delivery succeeds. |

max_retry_count=3 is chosen deliberately:

  • With backoff_base_seconds=1, attempt 3 fires at ~7s (1+2+4).

  • Within the 60s round wait, segments go to DEADLETTER deterministically.

  • This ensures every outage round produces measurable deadletter entries for the retry recovery check, without relying on long exponential backoff windows.
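The timing argument above can be worked through in a few lines. This sketch assumes pure doubling backoff with no jitter and negligible time spent in each failed attempt; the relay's real scheduler may differ, and `attempt_times` is an illustrative name.

```python
def attempt_times(max_retry_count: int, backoff_base_seconds: float) -> list[float]:
    """Cumulative time (seconds) at which each attempt fires.

    ASSUMPTION: the delay before attempt n is base * 2**(n-1), with no jitter.
    """
    times = []
    elapsed = 0.0
    for attempt in range(max_retry_count):
        elapsed += backoff_base_seconds * 2 ** attempt
        times.append(elapsed)
    return times


ROUND_WAIT_SECONDS = 60
times = attempt_times(max_retry_count=3, backoff_base_seconds=1.0)
print(times)                           # [1.0, 3.0, 7.0]
print(times[-1] < ROUND_WAIT_SECONDS)  # True
```

Under these assumptions the final attempt fires at about 7 s, which leaves more than 50 s of margin inside the 60 s round wait.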

5. Message-Loss Accounting Invariant

For every completed round:

segments_seeded == segments_delivered + segments_deadletter + segments_loss

  • segments_delivered: segments whose IDs appear in peer-received.json

  • segments_deadletter: segments whose IDs appear in relay deadletter list

  • segments_loss: segments unaccounted for (should be 0)

The invariant holds because:

  • With max_retry_count=3 and a 60s wait, all segments reach a terminal state (DELIVERED or DEADLETTER) within the round window.

  • Segments still INFLIGHT at the stop signal complete their current attempt before edged exits gracefully (SIGTERM).
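The accounting can be sketched with set arithmetic over segment IDs. The function name and the rule that a segment found in both the peer evidence and the deadletter list counts as delivered are assumptions made for this illustration, not behaviour confirmed by the round script.

```python
def account_round(seeded_ids, peer_received_ids, deadletter_ids):
    """Split one round's seeded IDs into delivered, deadletter, and loss."""
    seeded = set(seeded_ids)
    delivered = seeded & set(peer_received_ids)
    # ASSUMPTION: peer receipt evidence wins if an ID appears in both places.
    deadletter = (seeded & set(deadletter_ids)) - delivered
    loss = seeded - delivered - deadletter
    result = {
        "segments_seeded": len(seeded),
        "segments_delivered": len(delivered),
        "segments_deadletter": len(deadletter),
        "segments_loss": len(loss),
    }
    # The invariant holds by construction: the three sets partition `seeded`.
    assert result["segments_seeded"] == (
        result["segments_delivered"]
        + result["segments_deadletter"]
        + result["segments_loss"]
    )
    return result


seeded = [f"val24-r000001-{i:03d}" for i in range(10)]
print(account_round(seeded, seeded[:8], seeded[8:9]))
```

In the example, eight segments are delivered, one is in the deadletter list, and one is unaccounted for, so segments_loss is 1 and the round would need investigation.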

6. Gate D Criteria

All four criteria must hold for the final report to pass:

| Criterion | Threshold | Metric |
| --- | --- | --- |
| Rounds completed | ≥ 1,440 | total_rounds |
| Clean-round delivery rate | ≥ 0.990 | clean_delivered / clean_seeded |
| Retry recovery rate | ≥ 0.990 | total_recovered / total_retried with total_retried > 0 |
| Message loss | = 0 | total_loss |

The clean-round delivery rate is computed only over non-outage, non-recovery rounds. The 0.990 threshold (not 1.000) allows for rare transient failures (e.g., scheduler timing edge cases) without causing the soak to fail. The retry criterion is sampled only after at least one outage/recovery cycle has produced real retried entries.
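A minimal sketch of the Gate D evaluation follows. The `Totals` container and `gate_d` function are hypothetical; only the metric names and thresholds come from the criteria above, and the sample values are illustrative rather than measured.

```python
from dataclasses import dataclass


@dataclass
class Totals:
    total_rounds: int
    clean_seeded: int
    clean_delivered: int
    total_retried: int
    total_recovered: int
    total_loss: int


def gate_d(t: Totals) -> dict:
    """Evaluate the four Gate D criteria against aggregated totals."""
    delivery = t.clean_delivered / t.clean_seeded if t.clean_seeded else 0.0
    retry = t.total_recovered / t.total_retried if t.total_retried else 0.0
    criteria = {
        "rounds": t.total_rounds >= 1440,
        "delivery": delivery >= 0.990,
        # The retry criterion only counts once real retried entries exist.
        "retry": t.total_retried > 0 and retry >= 0.990,
        "loss": t.total_loss == 0,
    }
    return {
        "delivery_rate": delivery,
        "retry_rate": retry,
        "criteria": criteria,
        "pass": all(criteria.values()),
    }


sample = Totals(1440, 13800, 13795, 44, 44, 0)  # illustrative values only
print(gate_d(sample)["pass"])  # True
```

Guarding the two divisions keeps the evaluation usable early in the soak, before any clean round or retry has been recorded.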

7. 10-Check Matrix

| ID | When | Description | Pass criterion |
| --- | --- | --- | --- |
| VAL24-01 | Setup | edged + proxy + peer start cleanly; round 1 completes | Round 1 delivery_rate = 1.0 (all seeded segments delivered) |
| VAL24-02 | Day 1 | First clean round: all 10 segments delivered | clean_summaries[0].delivery_rate = 1.0 |
| VAL24-03 | Day 1 | First clean round: deadletter count = 0 | clean_summaries[0].segments_deadletter = 0 |
| VAL24-04 | Day 1 | Outage injection: segments enter DEADLETTER state | Any outage round has segments_deadletter > 0 |
| VAL24-05 | Day 1 | Recovery: deadletter entries retried and observed delivered | At least 1 recovery round with recovery_retried > 0 and recovery_delivered > 0 |
| VAL24-06 | Day 7 | Clean-round delivery rate ≥ 0.990 after 7 days | clean_delivery_rate ≥ 0.990 (or elapsed < 7 days → skip) |
| VAL24-07 | Day 7 | Retry recovery rate ≥ 0.990 after 7 days | retry_recovery_rate ≥ 0.990 with total_retried > 0 (or elapsed < 7 days → skip) |
| VAL24-08 | Day 14 | Message loss rate = 0.000 (no silent loss) | total_loss = 0 (or elapsed < 14 days → skip) |
| VAL24-09 | Day 30 | Total rounds ≥ 1,440 | total_rounds ≥ 1440 (or elapsed < 30 days → skip) |
| VAL24-10 | Day 30 | Gate D: all four criteria pass | delivery ≥ 0.990 AND retry ≥ 0.990 AND loss = 0 AND rounds ≥ 1440 |
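The "or elapsed < N days → skip" rule used by checks VAL24-06 through VAL24-09 can be sketched as a small helper. The name `gated_check` and the result strings are assumptions for illustration; the report script may represent skipped checks differently.

```python
def gated_check(elapsed_days: float, required_days: int, criterion_met: bool) -> str:
    """Return 'SKIP' before the check's day is reached, else 'PASS' or 'FAIL'."""
    if elapsed_days < required_days:
        return "SKIP"
    return "PASS" if criterion_met else "FAIL"


# VAL24-06 evaluated on day 3 and on day 7 with an illustrative rate.
clean_delivery_rate = 0.9996
print(gated_check(3, 7, clean_delivery_rate >= 0.990))  # SKIP
print(gated_check(7, 7, clean_delivery_rate >= 0.990))  # PASS
```

Skipping rather than failing lets a checkpoint report be generated at any point in the soak without producing false failures for checks whose window has not yet elapsed.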

Evidence files per round

| File | What it contains |
| --- | --- |
| rounds/YYYY-MM-DD/round-HHMMSS/round-summary.json | Per-round metrics (seeded, delivered, deadletter, loss, proxy_mode, duration_ms) |
| rounds/.../seed-manifest.json | Exact segment IDs seeded in this round |
| rounds/.../edged.log | edged log for this round |
| rounds/.../deadletter-list.txt | relay deadletter list output at round end |
| rounds/.../relay-status.json | relay status --output json at round end |
| rounds/.../edged-recovery.log | edged log during recovery-round retry session |
| rounds/.../deadletter-pre-recovery.txt | Deadletter list before retry sweep |
| rounds/.../retry-recovery.log | relay deadletter retry output per entry |
| peer-received.json | Cumulative peer delivery evidence used to confirm recovered segment IDs after retry |
| reports/daily-YYYY-MM-DD.json | Daily aggregate report |
| reports/checkpoint-7d.json | 7-day checkpoint with Gate D partial assessment |
| reports/checkpoint-14d.json | 14-day checkpoint |
| reports/final.json | 30-day final report with full Gate D assessment |
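For orientation, the per-round directory layout can be derived from a round's start time as sketched below. Whether the round script uses local time or UTC for these names is not stated here, so the timestamp handling is an assumption, as is the helper name `round_dir`.

```python
from datetime import datetime
from pathlib import PurePosixPath


def round_dir(soak_dir: str, started_at: datetime) -> PurePosixPath:
    """Build rounds/YYYY-MM-DD/round-HHMMSS under the soak directory."""
    day = started_at.strftime("%Y-%m-%d")
    clock = started_at.strftime("%H%M%S")
    return PurePosixPath(soak_dir) / "rounds" / day / f"round-{clock}"


d = round_dir("/tmp/soak", datetime(2025, 1, 2, 3, 30, 0))
print(d)                         # /tmp/soak/rounds/2025-01-02/round-033000
print(d / "round-summary.json")
```

Grouping rounds by date keeps each day's 48 round directories together, which makes daily aggregation a single directory scan.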

8. Run the Soak

Start

export GOROOT=/home/ubuntu/.local/go1.25.7
export PATH="$GOROOT/bin:$PATH"
export GOTOOLCHAIN=local

bash scripts/labs/run_soak_val24_setup.sh \
  "$PWD/evidence/val24-relay-soak-$(date +%F)"

The setup script installs a cron job (*/30 * * * *) and runs round 1 immediately to validate the environment.

Monitor

# Current status (any time)
bash scripts/labs/run_soak_val24_report.sh <SOAK_DIR> status

# Generate a checkpoint report
bash scripts/labs/run_soak_val24_report.sh <SOAK_DIR> checkpoint-7d

# Generate final report
bash scripts/labs/run_soak_val24_report.sh <SOAK_DIR> final

Stop

# Remove the cron job
crontab -l | grep -v run_soak_val24_round.sh | crontab -

# Stop any running edged instance (if mid-round)
kill "$(cat <SOAK_DIR>/edged.pid)" 2>/dev/null || true

9. Monitoring and Alerting Plan

The round script logs each round result to cron.log. Operators should monitor:

| Signal | Location | Alert condition |
| --- | --- | --- |
| Silent loss appears | round-summary.json: segments_loss > 0 | Immediate investigation: indicates a relay or BoltDB invariant violation |
| Delivery rate drops | round-summary.json: delivery_rate < 1.0 on a clean round | Investigate edged.log for this round |
| Deadletter entries grow | relay-status.json: queue_depth.deadletter | Expected after outage rounds; alert if it persists after the recovery round |
| edged crash | edged.pid stale + edged.log has panic | Check log, restart manually if needed |
| Proxy/peer crash | proxy.pid / peer.pid stale | Round script auto-restarts both; alert if recurring |
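The first three signals can be evaluated mechanically from a round's JSON evidence. The sketch below assumes the files have already been loaded into dictionaries and that the caller knows the round kind; the `round_alerts` helper and its message strings are illustrative, while the keys segments_loss, delivery_rate, and queue_depth.deadletter are the ones named in the table.

```python
def round_alerts(kind: str, summary: dict, relay_status: dict) -> list[str]:
    """Return alert messages for one round.

    kind: 'clean', 'outage', or 'recovery' (supplied by the caller).
    summary: parsed round-summary.json; relay_status: parsed relay-status.json.
    """
    alerts = []
    if summary.get("segments_loss", 0) > 0:
        alerts.append("silent loss: investigate immediately")
    if kind == "clean" and summary.get("delivery_rate", 1.0) < 1.0:
        alerts.append("clean-round delivery rate below 1.0: check edged.log")
    deadletter_depth = relay_status.get("queue_depth", {}).get("deadletter", 0)
    # Deadletter entries are expected right after an outage round, so only
    # alert when they are still present at the end of a recovery round.
    if kind == "recovery" and deadletter_depth > 0:
        alerts.append("deadletter entries persist after recovery round")
    return alerts


print(round_alerts("clean", {"segments_loss": 0, "delivery_rate": 0.9},
                   {"queue_depth": {"deadletter": 0}}))
```

The crash signals (stale PID files, panics in edged.log) depend on process state and log content, so they are left out of this sketch.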

Round-level evidence retention: rounds/ directory grows at ~5 KB/round (log + JSON files). Over 30 days: ~1,440 rounds × 5 KB = ~7 MB total.

Re-running run_soak_val24_setup.sh with the same SOAK_DIR is a resume operation: it preserves the existing TLS/config state and keeps the long-lived proxy/peer processes if they are already running.

10. Evidence Retention Plan

| Artifact | Retention |
| --- | --- |
| rounds/ directory | Keep full 30-day history (≈7 MB) |
| peer-received.json | Keep; final count = proof of total delivered |
| reports/daily-*.json | Keep 30 daily reports |
| reports/checkpoint-*.json | Keep 2 checkpoint reports |
| reports/final.json | Keep; primary Gate D proof artifact |
| edged.log per round | Keep for 30 days; trim after final report if needed |

11. Final Report Format

VAL24 — Relay 30-Day Soak Validation
Generated:    <YYYY-MM-DD>
Soak dir:     <path>
Elapsed:      30 day(s)  (1440/1440 rounds)

Traffic summary:
  Total seeded:           14400
  Total delivered:        14356
  Total deadletter:       44
  Total loss (silent):    0
  Clean rounds:           1380  delivery rate: 0.9996
  Outage rounds:          30
  Recovery rounds:        30  retried: 44  recovered: 44
  Retry recovery rate:    1.0000

10-check matrix:
  VAL24-01 PASS  edged+proxy+peer started cleanly (round 1 delivered)
  VAL24-02 PASS  first clean round delivery_rate=1.0
  VAL24-03 PASS  first clean round deadletter=0
  VAL24-04 PASS  outage_rounds=30, any_deadletter=True
  VAL24-05 PASS  recovery_rounds=30, retried=44, recovered=44
  VAL24-06 PASS  clean_delivery_rate=0.9996 >= 0.990
  VAL24-07 PASS  retry_recovery_rate=1.0000 >= 0.990
  VAL24-08 PASS  loss_count=0, loss_rate=0.0000
  VAL24-09 PASS  total_rounds=1440 >= 1440
  VAL24-10 PASS  Gate D: delivery=0.9996, retry=1.0000, loss=0

Overall: PASS=10 FAIL=0
Gate D:  PASS  (delivery=0.9996, retry=1.0000, loss=0)

12. Tooling

| File | Role |
| --- | --- |
| scripts/labs/run_soak_val24_setup.sh | One-time setup (build, init, cron install, round 1) |
| scripts/labs/run_soak_val24_round.sh | Per-round execution (cron-driven) |
| scripts/labs/run_soak_val24_report.sh | Report generation (daily/checkpoint/final/status) |
| scripts/labs/edge_relay_soak_val24_setup.go | Setup binary: --mode init (TLS/config) + --mode seed (per-round BoltDB seeding) |
| scripts/labs/relay_impairment_proxy.go | Reused from VAL19: clean/outage proxy |
| scripts/labs/edge_deadletter_lab_peer.go | Reused from PR-14: mTLS peer server |
| scripts/labs/edge_deadletter_lab_dump.go | Reused: ledger state dump |