VAL 18 — HA 30-Day Soak Validation¶
Status: Implemented

Scripts:

- `scripts/labs/run_soak_val18_setup.sh` — one-time environment setup and safe resume without deleting persistent soak state
- `scripts/labs/run_soak_val18_round.sh` — per-round workload (cron every 2 hours)
- `scripts/labs/run_soak_val18_report.sh` — aggregated report generator
Evidence dir: operator-chosen SOAK_DIR (e.g. evidence/soak-val18/)
Ports: cp-val18-node1 → 19010 · cp-val18-node2 → 19011 · Docker PG → host port 5488
Gap reference: Gap HA-005 · Wave 5.10
Purpose¶
Validates the HA control-plane subsystem over a sustained 30-day runtime window:

- Confirms both HA nodes remain healthy across ≥ 360 rounds (2-hour cron cadence)
- Schedules and times at least 3 leader failovers over the soak period (workplan target ≥ 3)
- Verifies failover completion ≤ 10,000 ms per the workplan claim (< 10 s average)
- Verifies data continuity across every failover (probe row readable on the new leader)
- Tracks HA cluster uptime ≥ 99.9% (Gate D criterion)
- Provides a 30-day final report with a Gate D HA readiness assessment
Branch-Specific Rule Application¶

| Question | Answer |
|---|---|
| Is this covered by an existing LAB? | No. |
| Which LAB/evidence bundle is extended? | Three new standalone scripts following the VAL12 pattern (`run_soak_val18_setup.sh`, `run_soak_val18_round.sh`, `run_soak_val18_report.sh`) |
| New evidence files | Per-round `round-summary.json` files and progressive reports under the operator-chosen `SOAK_DIR` |
| Tutorial/runbook docs updated |  |
| Reason new runner required | 30-day continuous execution with cron scheduling, PID tracking across invocations, node restart recovery, and progressive report generation cannot be expressed as a single LAB |
Soak Environment Plan¶
val18-ha-net (Docker bridge network, isolated)
│
val18-pg-primary (postgres:16, host port 5488 → internal 5432)
│
├── cp-val18-node1:19010 (orchestrator_ha_server binary)
└── cp-val18-node2:19011 (orchestrator_ha_server binary)
Both HA nodes connect to
postgres://postgres:val18pass@127.0.0.1:5488/autonomy. The PostgreSQL
container exposes port 5488 on the host, providing a stable connection address
that survives Docker network IP reassignment. Both HA nodes run as host
processes (not containers) with advisory lock election (--campaign 200ms).
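A minimal provisioning sketch of this environment as the setup script might perform it. The network, image, port, password, and database name come from the plan above; the volume name `val18-pg-data` and the probe table/column names are assumptions for illustration:

```bash
# Create the isolated network, a persistent volume, and the PG primary (idempotent-ish).
docker network create val18-ha-net 2>/dev/null || true
docker volume  create val18-pg-data 2>/dev/null || true

docker run -d --name val18-pg-primary \
  --network val18-ha-net \
  -p 5488:5432 \
  -e POSTGRES_PASSWORD=val18pass \
  -e POSTGRES_DB=autonomy \
  -v val18-pg-data:/var/lib/postgresql/data \
  postgres:16

# Wait for PG to accept connections before starting the HA nodes.
until docker exec val18-pg-primary pg_isready -U postgres >/dev/null 2>&1; do sleep 1; done

# Insert the soak probe row once (table/column names are assumptions).
docker exec val18-pg-primary psql -U postgres -d autonomy -c \
  "CREATE TABLE IF NOT EXISTS soak_probe (id int PRIMARY KEY, note text);
   INSERT INTO soak_probe VALUES (1, 'val18-soak') ON CONFLICT (id) DO NOTHING;"
```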
Key flags:

- `--min-sync-replicas 0` — quorum requires only primary PG connectivity; no streaming standby
- `--campaign 200ms` — fast leader election, typical failover ≤ 500 ms
- `--quorum-monitor-interval 500ms` — sub-second loss detection
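A hedged sketch of starting both nodes as host processes with these flags. Only `--min-sync-replicas`, `--campaign`, and `--quorum-monitor-interval` are taken from this plan; the binary path, the `--node-id`/`--listen`/`--pg-url` flag names, and the PID-file layout are assumptions:

```bash
PG_URL="postgres://postgres:val18pass@127.0.0.1:5488/autonomy"
mkdir -p ha-logs

start_node() {
  local name="$1" port="$2"
  # Flag names other than the three documented above are illustrative assumptions.
  nohup ./orchestrator_ha_server \
    --node-id "${name}:${port}" \
    --listen "127.0.0.1:${port}" \
    --pg-url "$PG_URL" \
    --min-sync-replicas 0 \
    --campaign 200ms \
    --quorum-monitor-interval 500ms \
    >> "ha-logs/${name}.log" 2>&1 &
  echo $! > "${name}.pid"   # PID tracked so later cron rounds can signal/restart the node
}

start_node cp-val18-node1 19010
start_node cp-val18-node2 19011
```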
Failover Schedule Strategy¶

| Parameter | Value | Rationale |
|---|---|---|
| Round interval | Every 2 hours (cron) | 360 rounds over 30 days; low overhead |
| Failover interval | Every 12 rounds | 1 failover/24 h → ~30 total over 30 days |
| Failover mechanism | SIGTERM current leader PID | Graceful resign; deterministic timing; same mechanism as VAL13 |
| Post-failover restart | Killed node restarted immediately as follower | Keeps a 2-node cluster; avoids single-point-of-failure accumulation |
| Minimum required | ≥ 3 total over 30 days (workplan target) | ~30 with default schedule — 10× the minimum |
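A sketch of how this cadence could be installed via cron; the minute offsets, the repository path, and calling the report script with no arguments for the nightly report are assumptions, only the 2-hour round interval and script names come from this plan:

```bash
# Append the soak entries to the operator's crontab (offsets are illustrative).
SOAK_DIR="$HOME/evidence/soak-val18"
( crontab -l 2>/dev/null
  echo "0 */2 * * * SOAK_DIR=$SOAK_DIR /path/to/repo/scripts/labs/run_soak_val18_round.sh"   # 360 rounds
  echo "15 0 * * * SOAK_DIR=$SOAK_DIR /path/to/repo/scripts/labs/run_soak_val18_report.sh"   # nightly report
) | crontab -
```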
Failover timing method:
1. start_ms = python3: int(time.time()*1000)
2. kill -TERM <leader_pid>
3. Poll follower /v1/ha/status every 50 ms until holder_id changes
4. end_ms = python3: int(time.time()*1000)
5. failover_ms = end_ms - start_ms (timeout: 10,000 ms = SOAK_FAILOVER_TIMEOUT_MS)
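A minimal sketch of these five steps as the round script might implement them. `holder_id`, the 50 ms poll, and the 10 s budget come from this plan; the PID-file path, the assumption that node1 is the current leader, and the `soak_probe` table name are illustrative:

```bash
now_ms() { python3 -c 'import time; print(int(time.time()*1000))'; }
holder() { curl -fsS "http://127.0.0.1:$1/v1/ha/status" 2>/dev/null \
  | python3 -c 'import json,sys; print(json.load(sys.stdin).get("holder_id",""))' 2>/dev/null; }

old_holder=$(holder 19011)                       # follower's view of the current holder
start_ms=$(now_ms)                               # step 1
kill -TERM "$(cat cp-val18-node1.pid)"           # step 2: graceful resign of the leader

timeout_ms="${SOAK_FAILOVER_TIMEOUT_MS:-10000}"
while :; do                                      # step 3: poll every 50 ms for a new holder
  cur=$(holder 19011)
  [ -n "$cur" ] && [ "$cur" != "$old_holder" ] && break
  [ $(( $(now_ms) - start_ms )) -ge "$timeout_ms" ] && break
  sleep 0.05
done

failover_ms=$(( $(now_ms) - start_ms ))          # steps 4–5

# Data continuity: the probe row must still be readable after the handover
# (the "soak_probe" table name is an assumption).
docker exec val18-pg-primary psql -U postgres -d autonomy -tAc \
  'SELECT count(*) FROM soak_probe;' | grep -q '[1-9]' \
  && continuity_ok=true || continuity_ok=false

echo "failover_ms=${failover_ms} data_continuity_ok=${continuity_ok}"
```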
Observability and Monitoring Plan¶
Per-round evidence (rounds/YYYY-MM-DD/round-HHMMSS/)¶

| File | Content |
|---|---|
| `round-summary.json` | All round metrics (see schema below) |
|  |  |
|  |  |
|  | Pre-failover leader state (failover rounds only) |
|  |  |
round-summary.json schema¶
{
"timestamp": "<RFC3339>",
"round_date": "YYYY-MM-DD",
"round_time": "HHMMSS",
"round_count": N,
"health": "ok"|"degraded"|"failed",
"restarted_node1": true|false,
"restarted_node2": true|false,
"pg_restarted": true|false,
"leader_node": "node1"|"node2"|"none",
"holder_id": "cp-val18-nodeX:19010",
"quorum_health": "healthy"|"degraded"|"lost",
"probe_row_ok": true|false,
"total_failovers": N,
"failover_triggered": true|false,
"failover_ms": N|null,
"failover_timing_ok": true|false|null,
"data_continuity_ok": true|false|null
}
Persistent log files (ha-logs/)¶
`ha-logs/node1.log` and `ha-logs/node2.log` are append-only logs from both HA
server processes. They are not rotated per round. They contain "acquired
leadership" and "resigned" events, quorum monitor transitions, and audit
emitter startup messages.
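A quick way to spot-check leadership churn from these logs, assuming the quoted phrases appear verbatim in the log lines:

```bash
# Count leadership acquisitions and resignations per node.
for n in node1 node2; do
  printf '%s: acquired=%s resigned=%s\n' "$n" \
    "$(grep -c 'acquired leadership' "ha-logs/$n.log")" \
    "$(grep -c 'resigned' "ha-logs/$n.log")"
done
```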
Alert log (alerts.log)¶
Append-only alerts written by the round script on threshold breaches:

- `no_leader` — no leader identified in round
- `quorum_lost` — `quorum_health=lost` during round
- `probe_row_missing` — probe row not readable
- `failover_slow` — `failover_ms > SOAK_FAILOVER_TIMEOUT_MS`
- `data_continuity_fail` — probe row not readable after failover
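A hedged sketch of how the round script can derive these alerts from the fields in `round-summary.json`; the field names follow the schema above, while the alert line format and the `ROUND_DIR`/`SOAK_DIR` variables are assumptions:

```bash
SUMMARY="$ROUND_DIR/round-summary.json"
ALERTS="$SOAK_DIR/alerts.log"

# Print one field from the summary; python renders JSON booleans as True/False.
field() { python3 -c 'import json,sys; print(json.load(open(sys.argv[1])).get(sys.argv[2],""))' "$SUMMARY" "$1"; }
alert() { echo "$(date -u +%FT%TZ) $1 round=$(field round_count)" >> "$ALERTS"; }

[ "$(field leader_node)" = "none" ]         && alert no_leader
[ "$(field quorum_health)" = "lost" ]       && alert quorum_lost
[ "$(field probe_row_ok)" = "False" ]       && alert probe_row_missing
[ "$(field failover_timing_ok)" = "False" ] && alert failover_slow
[ "$(field data_continuity_ok)" = "False" ] && alert data_continuity_fail
```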
Reports¶

| Report | When | Content |
|---|---|---|
| Daily | Nightly (cron) | JSON only: day’s rounds, failovers, uptime |
| 7-day checkpoint | Weekly (cron) | 7-day window aggregation; VAL18-06/07/08 checks |
| 14-day checkpoint | Day 15 (cron) | 14-day window; VAL18-08 uptime check |
| 30-day final | Day 30 (manual) | All 30 days; Gate D assessment |
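A minimal aggregation sketch in the spirit of these reports; it assumes `jq` is available and treats uptime as the fraction of healthy rounds, which is an interpretation rather than the report script's documented formula:

```bash
cd "$SOAK_DIR"
# Slurp every round summary and compute the headline report numbers.
jq -s '{
  total_rounds:     length,
  healthy_rounds:   ([.[] | select(.health == "ok")] | length),
  uptime_pct:       (([.[] | select(.health == "ok")] | length) * 100 / length),
  total_failovers:  ([.[] | select(.failover_triggered == true)] | length),
  max_failover_ms:  ([.[] | .failover_ms | select(. != null)] | max),
  continuity_fails: ([.[] | select(.data_continuity_ok == false)] | length)
}' rounds/*/*/round-summary.json
```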
Evidence Retention Plan¶

| Evidence | Retention | Notes |
|---|---|---|
| Per-round directories (`rounds/`) | Permanent | ~360 directories × ≤ 5 files each |
| HA node logs (`ha-logs/`) | Permanent | Append-only, expected ≤ 50 MB over 30 days |
| `alerts.log` | Permanent | Append-only |
| Daily reports | Permanent | 30 files, < 1 KB each |
| Checkpoint reports | Permanent | 2–4 files |
| 30-day final report | Permanent | Gate D evidence package |
| Docker PG volume | Persistent for 30 days | Contains the probe table; setup reuses the existing volume/container on a normal rerun instead of deleting soak state |
| Audit store | Permanent | Shared with the audit emitter; accumulates events |
VAL18 10-Check Matrix¶

| Check | Name | Verified At | Threshold |
|---|---|---|---|
| VAL18-01 | framework_provisioned | Setup | Both nodes healthy; leader elected; probe row inserted |
| VAL18-02 | initial_round_success | Setup | First round exits 0; leader and quorum identified |
| VAL18-03 | failover_triggered | Day 1 (round 12) | First scheduled SIGTERM failover completes; new leader elected |
| VAL18-04 | failover_timing_baseline | Day 1 (round 12) | First `failover_ms` ≤ 10,000 ms |
| VAL18-05 | data_continuity_baseline | Day 1 (round 12) | Probe row readable on the new leader after the first failover |
| VAL18-06 | failover_count_threshold | Day 7 checkpoint | `total_failovers` ≥ 1 |
| VAL18-07 | failover_timing_maintained | Day 7 checkpoint | Every `failover_ms` in the window ≤ 10,000 ms |
| VAL18-08 | ha_uptime_maintained | Day 14 checkpoint | `ha_uptime_pct` ≥ 99.9% |
| VAL18-09 | total_failovers_final | Day 30 final | `total_failovers` ≥ 3 |
| VAL18-10 | data_continuity_final | Day 30 final | `data_continuity_rate` = 1.0 |
Pass/Fail Criteria¶

| Outcome | Condition |
|---|---|
| PASS | All 10 checks pass |
| PARTIAL | Checks 3, 4, 5, 7, 10 pass (failover occurred, timing OK, data intact) |
| FAIL | Check 10 fails (data loss after any failover) OR check 4 fails (any failover exceeded 10 s) |
The two mandatory checks are VAL18-04 (failover timing) and VAL18-10 (data continuity). The workplan’s public claims are “sub-10 s failover” and “no data loss”; both require these checks to pass over the full 30-day window.
Gate D HA Assessment (Gap HA-005)¶
The final report includes a Gate D assessment block:
## Gate D HA Assessment
PASS: Gate D HA criteria met (failovers=30, max_failover_ms=412 ≤ 10000,
data_continuity=1.0, uptime=99.9%)
Gate D passes when all four hold:

- `total_failovers ≥ 3` over 30 days
- `max_failover_ms ≤ 10,000` (no failover exceeded 10 s)
- `data_continuity_rate = 1.0` (probe row accessible after every failover)
- `ha_uptime_pct ≥ 99.9%`
Checkpoint policy:

- 7-day checkpoint: `total_failovers ≥ 1`
- 30-day final Gate D: `total_failovers ≥ 3`
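A hedged sketch of that four-way check, assuming the report script has already placed the aggregated values in shell variables named after the criteria:

```bash
gate_d=PASS
[ "$total_failovers" -ge 3 ]                                || gate_d=FAIL
[ "$max_failover_ms" -le 10000 ]                            || gate_d=FAIL
awk -v r="$data_continuity_rate" 'BEGIN{exit !(r == 1.0)}'  || gate_d=FAIL  # exit 0 only if rate == 1.0
awk -v u="$ha_uptime_pct"        'BEGIN{exit !(u >= 99.9)}' || gate_d=FAIL  # exit 0 only if uptime >= 99.9

echo "${gate_d}: Gate D HA criteria (failovers=${total_failovers}, max_failover_ms=${max_failover_ms}, data_continuity=${data_continuity_rate}, uptime=${ha_uptime_pct}%)"
```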
Evidence Files¶

| Path | Description |
|---|---|
| `setup-summary.txt` | VAL18-01/02 results, node PIDs, leader at setup, cron instructions |
|  | Docker psql output |
|  | Shared configuration (sourced by round and report scripts) |
|  | Current round number (incremented each round) |
|  | Running failover count |
|  | Rounds since last scheduled failover |
| `ha-logs/node1.log` | HA node1 append-only log |
| `ha-logs/node2.log` | HA node2 append-only log |
| `alerts.log` | Threshold breach alerts |
| `rounds/YYYY-MM-DD/round-HHMMSS/round-summary.json` | Per-round JSON (≈360 files) |
|  | Daily aggregate (30 files) |
|  | 7-day window checkpoint |
|  | 14-day window checkpoint |
|  | 30-day final report with Gate D assessment |
Known Failure Modes¶

| Failure | Likely Cause | Mitigation |
|---|---|---|
| VAL18-01 FAIL: no leader elected | Both nodes started but neither won advisory lock; PG not ready | Check … |
| VAL18-03/04 FAIL: no failover in first 24h |  | Check … |
| VAL18-04 FAIL: `failover_ms` exceeds 10,000 ms | Follower campaign interval too slow; PG advisory lock not released | Check … |
| VAL18-05/10 FAIL: probe row missing | Docker PG container recreated without the persistent volume (data lost) | Never recreate the PG container without its persistent volume; setup reuses the existing volume/container on rerun |
| VAL18-08 FAIL: uptime < 99.9% | Repeated process crashes; quorum-lost rounds counted as failed | Check … |
| Node restarted every round | OS killed HA process (OOM, signal); binary crashing | Check … |
| Final 30-day report not generated | Must be run manually at day 30 | Run `run_soak_val18_report.sh` at day 30 |
Setup Safety Note¶
run_soak_val18_setup.sh is intended for initial provisioning and safe
operator resume. It reuses the persistent Docker volume and container when they
already exist, rather than deleting the 30-day soak state on a normal rerun.
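A small sketch of that reuse-not-delete guard; `provision_pg_primary` is a hypothetical helper standing in for the first-time provisioning shown in the environment plan:

```bash
# Reuse-not-delete on rerun: keep the existing container and its volume if present.
if docker ps -a --format '{{.Names}}' | grep -qx val18-pg-primary; then
  docker start val18-pg-primary >/dev/null   # 30-day soak state preserved
else
  provision_pg_primary                       # hypothetical first-time-setup helper
fi
```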
Final Report Template¶
# VAL 18 — HA 30-Day Soak Report (30-day final)
Generated: <timestamp>
Total rounds: 360 (healthy: 359, failed: 1)
HA uptime: 99.9% (target ≥ 99.9%)
## Cluster Health
Quorum-lost rounds: 0
Probe fail rounds: 0
Node1 restart rounds: 1
Node2 restart rounds: 0
PG restart rounds: 0
## Failover Statistics
Total failovers: 30 (target ≥ 3)
Timing OK count: 30 (≤ 10,000 ms each)
Data continuity OK: 30 (probe row accessible after each failover)
Data continuity rate: 1.0 (target = 1.000)
p50 failover_ms: 312 ms
p95 failover_ms: 487 ms
p99 failover_ms: 612 ms
max failover_ms: 831 ms (target ≤ 10,000 ms)
## VAL18 Check Results
VAL18-01 framework_provisioned: PASS (recorded in setup-summary.txt)
VAL18-02 initial_round_success: PASS (recorded in setup-summary.txt)
VAL18-03 failover_triggered: PASS (recorded in round summaries)
VAL18-04 failover_timing_baseline: PASS (recorded in round summaries)
VAL18-05 data_continuity_baseline: PASS (recorded in round summaries)
VAL18-06 failover_count_threshold: PASS (total_failovers=30, target=1 per 7d)
VAL18-07 failover_timing_maintained: PASS (max_p99=612ms, target=10000ms)
VAL18-08 ha_uptime_maintained: PASS (uptime=99.9%, target=99.9%)
VAL18-09 total_failovers_final: PASS (total=30, target≥3)
VAL18-10 data_continuity_final: PASS (rate=1.000, target=1.000)
## Gate D HA Assessment
PASS: Gate D HA criteria met (failovers=30, max_failover_ms=831 ≤ 10000,
data_continuity=1.0, uptime=99.9%)
Gate D HA Readiness Assessment:
- Record `max_failover_ms`, `data_continuity_rate`, and `ha_uptime_pct` as evidence values in the Gate D sign-off document
- PASS requires VAL18-09 (≥ 3 failovers) + VAL18-10 (data_continuity = 1.0) + VAL18-07 (timing maintained)
- All three checkpoint reports should be retained as intermediate evidence alongside the final report