VAL 18 — HA 30-Day Soak Validation

Status: Implemented

Scripts:

  • scripts/labs/run_soak_val18_setup.sh — one-time environment setup and safe resume without deleting persistent soak state

  • scripts/labs/run_soak_val18_round.sh — per-round workload (cron every 2 hours)

  • scripts/labs/run_soak_val18_report.sh — aggregated report generator

Evidence dir: operator-chosen SOAK_DIR (e.g. evidence/soak-val18/)
Ports: cp-val18-node1 → 19010 · cp-val18-node2 → 19011 · Docker PG → host port 5488
Gap reference: Gap HA-005 · Wave 5.10


Purpose

Validates the HA control-plane subsystem over a sustained 30-day runtime window:

  • Confirms both HA nodes remain healthy across ≥ 360 rounds (2-hour cron cadence)

  • Schedules and times at least 3 leader failovers over the soak period (workplan target ≥ 3)

  • Verifies each failover completes in ≤ 10,000 ms, supporting the workplan claim of a sub-10 s average

  • Verifies data continuity across every failover (probe row readable on new leader)

  • Tracks HA cluster uptime ≥ 99.9% (Gate D criterion)

  • Provides a 30-day final report with Gate D HA readiness assessment


Branch-Specific Rule Application

| Question | Answer |
| --- | --- |
| Is this covered by an existing LAB? | No. run_quorum_lab(), run_ha_lab(), and run_ha_failover_val13_lab() are single-session functions. VAL13 measures failover timing in a short burst (3 × SIGTERM cycles). None run continuously for 30 days, accumulate round evidence over time, or provide the Gate D “HA soak completed” public claim. |
| Which LAB/evidence bundle is extended? | Three new standalone scripts following the VAL12 pattern (run_soak_val12_{setup,round,report}.sh). Not a slice of run_cli_audit_lab.sh; the 30-day lifecycle cannot be expressed as a function within a single-shot script. |
| New evidence files | Per-round round-summary.json in $SOAK_DIR/rounds/YYYY-MM-DD/round-HHMMSS/; daily summaries, checkpoint reports, and the final report. See the Evidence Files table. |
| Tutorial/runbook docs updated | docs/tutorials/cli-audit-lab.md §8 VAL18 note added (separate runner, like VAL12). |
| Reason new runner required | 30-day continuous execution with cron scheduling, PID tracking across invocations, node restart recovery, and progressive report generation cannot be expressed as a run_cli_audit_lab.sh slice. The same reasoning applied for VAL12. |


Soak Environment Plan

val18-ha-net  (Docker bridge network, isolated)
     │
val18-pg-primary  (postgres:16, host port 5488 → internal 5432)
     │
     ├── cp-val18-node1:19010   (orchestrator_ha_server binary)
     └── cp-val18-node2:19011   (orchestrator_ha_server binary)

Both HA nodes connect to postgres://postgres:val18pass@127.0.0.1:5488/autonomy. The PostgreSQL container exposes port 5488 on the host, providing a stable connection address that survives Docker network IP reassignment. Both HA nodes run as host processes (not containers) with advisory lock election (--campaign 200ms).
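
A minimal provisioning sketch, assuming the setup script creates the network, volume, and container roughly as below; the actual run_soak_val18_setup.sh may differ in ordering and readiness checks, and it reuses the existing volume/container on rerun (the POSTGRES_DB name is inferred from the connection string above):

docker network create val18-ha-net 2>/dev/null || true   # reuse the bridge network if present
docker volume create val18-pg-vol                        # persistent storage for the probe table
docker run -d --name val18-pg-primary \
  --network val18-ha-net \
  -e POSTGRES_PASSWORD=val18pass \
  -e POSTGRES_DB=autonomy \
  -v val18-pg-vol:/var/lib/postgresql/data \
  -p 5488:5432 \
  postgres:16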

Key flags (a node-launch sketch follows the list):

  • --min-sync-replicas 0 — quorum requires only primary PG connectivity; no streaming standby

  • --campaign 200ms — fast leader election, typical failover ≤ 500 ms

  • --quorum-monitor-interval 500ms — sub-second loss detection
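
A node-launch sketch using the flags above. Only --min-sync-replicas, --campaign, and --quorum-monitor-interval are taken from this plan; --node-id, --listen, and --postgres-url are illustrative placeholders for however orchestrator_ha_server actually receives its identity, port, and connection string:

PG_URL="postgres://postgres:val18pass@127.0.0.1:5488/autonomy"
# Placeholder flags: --node-id / --listen / --postgres-url are assumptions, not confirmed options.
nohup orchestrator_ha_server \
  --node-id cp-val18-node1 \
  --listen 127.0.0.1:19010 \
  --postgres-url "$PG_URL" \
  --min-sync-replicas 0 \
  --campaign 200ms \
  --quorum-monitor-interval 500ms \
  >> "$SOAK_DIR/ha-logs/node1.log" 2>&1 &
echo $! > "$SOAK_DIR/node1.pid"   # PID file lets later rounds find and restart the node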


Failover Schedule Strategy

| Parameter | Value | Rationale |
| --- | --- | --- |
| Round interval | Every 2 hours (cron: 0 */2 * * *) | 360 rounds over 30 days; low overhead |
| Failover interval | Every 12 rounds (SOAK_FAILOVER_INTERVAL_ROUNDS=12) | 1 failover/24 h → ~30 total over 30 days |
| Failover mechanism | SIGTERM current leader PID | Graceful resign; deterministic timing; same mechanism as VAL13 |
| Post-failover restart | Killed node restarted immediately as follower | Keeps a 2-node cluster; avoids single-point-of-failure accumulation |
| Minimum required | ≥ 3 total over 30 days (workplan target) | ~30 with default schedule — 10× the minimum |

Failover timing method (a measurement sketch follows these steps):

1. start_ms = python3: int(time.time()*1000)
2. kill -TERM <leader_pid>
3. Poll follower /v1/ha/status every 50 ms until holder_id changes
4. end_ms = python3: int(time.time()*1000)
5. failover_ms = end_ms - start_ms  (timeout: 10,000 ms = SOAK_FAILOVER_TIMEOUT_MS)
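
A bash sketch of the same measurement, assuming the follower's /v1/ha/status response is JSON carrying the holder_id field shown in the schema below; the leader PID lookup is illustrative (the round script tracks PIDs in its own state files):

leader_pid=$(cat "$SOAK_DIR/node1.pid")
follower_status="http://127.0.0.1:19011/v1/ha/status"
get_holder() { curl -fsS "$follower_status" | python3 -c 'import sys,json; print(json.load(sys.stdin)["holder_id"])'; }

old_holder=$(get_holder)
start_ms=$(python3 -c 'import time; print(int(time.time()*1000))')
kill -TERM "$leader_pid"

failover_ms=""
while :; do
  now_ms=$(python3 -c 'import time; print(int(time.time()*1000))')
  holder=$(get_holder 2>/dev/null || true)
  if [ -n "$holder" ] && [ "$holder" != "$old_holder" ]; then
    failover_ms=$((now_ms - start_ms))
    break
  fi
  if [ $((now_ms - start_ms)) -ge "${SOAK_FAILOVER_TIMEOUT_MS:-10000}" ]; then
    break                                  # timing_ok recorded as false for this round
  fi
  sleep 0.05                               # poll every 50 ms
done
echo "failover_ms=${failover_ms:-TIMEOUT}"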

Observability and Monitoring Plan

Per-round evidence (rounds/YYYY-MM-DD/round-HHMMSS/)

| File | Content |
| --- | --- |
| round-summary.json | All round metrics (see schema below) |
| leader.txt | round=N leader=nodeX holder_id=... |
| health.txt | quorum_health=... probe_ok=... |
| failover-pre.txt | Pre-failover leader state (failover rounds only) |
| failover-result.txt | failover_ms=N new_holder=... timing_ok=... data_ok=... |

round-summary.json schema

{
  "timestamp":          "<RFC3339>",
  "round_date":         "YYYY-MM-DD",
  "round_time":         "HHMMSS",
  "round_count":        N,
  "health":             "ok"|"degraded"|"failed",
  "restarted_node1":    true|false,
  "restarted_node2":    true|false,
  "pg_restarted":       true|false,
  "leader_node":        "node1"|"node2"|"none",
  "holder_id":          "cp-val18-nodeX:19010",
  "quorum_health":      "healthy"|"degraded"|"lost",
  "probe_row_ok":       true|false,
  "total_failovers":    N,
  "failover_triggered": true|false,
  "failover_ms":        N|null,
  "failover_timing_ok": true|false|null,
  "data_continuity_ok": true|false|null
}

Persistent log files (ha-logs/)

ha-logs/node1.log and ha-logs/node2.log are append-only logs from both HA server processes. They are not rotated per round. They record "acquired leadership" and "resigned" events, quorum monitor transitions, and audit emitter startup messages.
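
A quick way to count those leadership events over the soak so far, assuming the quoted phrases appear verbatim in the log lines:

# Per-file counts of leadership acquisitions and resignations across the soak.
grep -c "acquired leadership" "$SOAK_DIR"/ha-logs/node1.log "$SOAK_DIR"/ha-logs/node2.log
grep -c "resigned"            "$SOAK_DIR"/ha-logs/node1.log "$SOAK_DIR"/ha-logs/node2.log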

Alert log (alerts.log)

Append-only alerts written by the round script on threshold breaches (a helper sketch follows the list):

  • no_leader — no leader identified in round

  • quorum_lost — quorum_health=lost during round

  • probe_row_missing — probe row not readable

  • failover_slow — failover_ms > SOAK_FAILOVER_TIMEOUT_MS

  • data_continuity_fail — probe row not readable after failover
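
A minimal helper in the spirit of the round script; the real alert line format may differ, and the shell variables below are illustrative stand-ins for the round metrics, but the alert names match the list above:

# Append one line per threshold breach to the append-only alert log.
alert() {  # usage: alert <name> <detail>
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) round=${ROUND_COUNT:-?} alert=$1 $2" >> "$SOAK_DIR/alerts.log"
}

[ "$leader_node" = "none" ]   && alert no_leader "no leader identified in round"
[ "$quorum_health" = "lost" ] && alert quorum_lost "quorum_health=lost"
[ "$probe_row_ok" = "false" ] && alert probe_row_missing "probe row not readable"
if [ -n "$failover_ms" ] && [ "$failover_ms" -gt "${SOAK_FAILOVER_TIMEOUT_MS:-10000}" ]; then
  alert failover_slow "failover_ms=$failover_ms"
fi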

Reports

| Report | When | Content |
| --- | --- | --- |
| daily/YYYY-MM-DD-summary.json | Nightly (cron 0 1 * * *) | JSON only: day’s rounds, failovers, uptime |
| checkpoints/checkpoint-7d.{json,txt} | Weekly (cron 0 1 * * 0) | 7-day window aggregation; VAL18-06/07/08 checks |
| checkpoints/checkpoint-14d.{json,txt} | Day 15 (cron 0 1 1 * *) | 14-day window; VAL18-08 uptime check |
| final-report.{json,txt} | Day 30 (manual) | All 30 days; Gate D assessment |
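
An illustrative operator crontab wiring the cadences above. The authoritative entries are emitted into setup-summary.txt by the setup script; the paths and the non-final --type values here are assumptions (only --type final is shown elsewhere in this plan):

# m h dom mon dow   command
SOAK_DIR=/abs/path/to/evidence/soak-val18
0 */2 * * *   bash /abs/path/to/scripts/labs/run_soak_val18_round.sh  "$SOAK_DIR"
0 1   * * *   bash /abs/path/to/scripts/labs/run_soak_val18_report.sh "$SOAK_DIR" --type daily
0 1   * * 0   bash /abs/path/to/scripts/labs/run_soak_val18_report.sh "$SOAK_DIR" --type checkpoint-7d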


Evidence Retention Plan

| Evidence | Retention | Notes |
| --- | --- | --- |
| rounds/YYYY-MM-DD/round-HHMMSS/ | Permanent | ~360 directories × ≤ 5 files each |
| ha-logs/node{1,2}.log | Permanent | Append-only, expected ≤ 50 MB over 30 days |
| alerts.log | Permanent | Append-only |
| daily/*.json | Permanent | 30 files, < 1 KB each |
| checkpoints/*.{json,txt} | Permanent | 2–4 files |
| final-report.{json,txt} | Permanent | Gate D evidence package |
| Docker PG volume (val18-pg-vol) | Persistent for 30 days | Contains the probe table; setup reuses the existing volume/container on a normal rerun instead of deleting soak state |
| Audit store (audit-store/) | Permanent | Shared with the shared audit emitter; accumulates events |


VAL18 10-Check Matrix

| Check | Name | Verified At | Threshold |
| --- | --- | --- | --- |
| VAL18-01 | framework_provisioned | Setup | Both nodes healthy; leader elected; probe row inserted; setup-summary.txt shows val18_01=PASS |
| VAL18-02 | initial_round_success | Setup | First round exits 0; leader and quorum identified |
| VAL18-03 | failover_triggered | Day 1 (round 12) | First scheduled SIGTERM failover completes; new leader elected |
| VAL18-04 | failover_timing_baseline | Day 1 (round 12) | First failover_ms ≤ 10,000 |
| VAL18-05 | data_continuity_baseline | Day 1 (round 12) | Probe row note='soak-init' readable on the new leader after the first failover |
| VAL18-06 | failover_count_threshold | Day 7 checkpoint | total_failovers ≥ 1 within the 7-day checkpoint window |
| VAL18-07 | failover_timing_maintained | Day 7 checkpoint | max_failover_ms ≤ 10,000 across all recorded failovers |
| VAL18-08 | ha_uptime_maintained | Day 14 checkpoint | healthy_rounds / total_rounds ≥ 99.9% |
| VAL18-09 | total_failovers_final | Day 30 final | total_failovers ≥ 3 over full 30-day window |
| VAL18-10 | data_continuity_final | Day 30 final | data_continuity_rate = 1.000 (probe row readable after every failover) |


Pass/Fail Criteria

| Outcome | Condition |
| --- | --- |
| PASS | All 10 checks pass |
| PARTIAL | Checks 3, 4, 5, 7, 10 pass (failover occurred, timing OK, data intact) |
| FAIL | Check 10 fails (data loss after any failover) OR check 4 fails (any failover exceeded 10 s) |

The two mandatory checks are VAL18-04 (failover timing) and VAL18-10 (data continuity). The workplan’s public claims are “sub-10 s failover” and “no data loss”; both require these checks to pass over the full 30-day window.


Gate D HA Assessment (Gap HA-005)

The final report includes a Gate D assessment block:

## Gate D HA Assessment
PASS: Gate D HA criteria met (failovers=30, max_failover_ms=412 ≤ 10000,
      data_continuity=1.0, uptime=99.9%)

Gate D passes when all four hold:

  • total_failovers ≥ 3 over 30 days

  • max_failover_ms ≤ 10,000 (no failover exceeded 10 s)

  • data_continuity_rate = 1.0 (probe row accessible after every failover)

  • ha_uptime_pct ≥ 99.9%

Checkpoint policy:

  • 7-day checkpoint: total_failovers ≥ 1

  • 30-day final Gate D: total_failovers ≥ 3
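
A sketch of how the final report could derive these four values from the accumulated round summaries; field names follow the round-summary.json schema above, and the real report script may compute them differently:

# Aggregate all round summaries and apply the four Gate D criteria.
python3 - "$SOAK_DIR" <<'PY'
import glob, json, sys

soak_dir = sys.argv[1]
rounds = [json.load(open(p)) for p in sorted(glob.glob(f"{soak_dir}/rounds/*/*/round-summary.json"))]

total = len(rounds)
healthy = sum(1 for r in rounds if r.get("health") == "ok")
failovers = [r for r in rounds if r.get("failover_triggered")]
timings = [r["failover_ms"] for r in failovers if r.get("failover_ms") is not None]

uptime_pct = 100.0 * healthy / total if total else 0.0
continuity = (sum(1 for r in failovers if r.get("data_continuity_ok")) / len(failovers)) if failovers else 0.0
max_ms = max(timings) if timings else None

gate_d = bool(len(failovers) >= 3 and max_ms is not None and max_ms <= 10000
              and continuity == 1.0 and uptime_pct >= 99.9)
print(f"failovers={len(failovers)} max_failover_ms={max_ms} "
      f"data_continuity={continuity:.3f} uptime={uptime_pct:.1f}% -> {'PASS' if gate_d else 'FAIL'}")
PY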


Evidence Files

| Path | Description |
| --- | --- |
| setup-summary.txt | VAL18-01/02 results, node PIDs, leader at setup, cron instructions |
| probe-setup.txt | Docker psql output: CREATE TABLE + INSERT for probe row |
| config.env | Shared configuration (sourced by round and report scripts) |
| round-count | Current round number (incremented each round) |
| total-failovers | Running failover count |
| rounds-since-failover | Rounds since last scheduled failover |
| ha-logs/node1.log | HA node1 append-only log |
| ha-logs/node2.log | HA node2 append-only log |
| alerts.log | Threshold breach alerts |
| rounds/YYYY-MM-DD/round-HHMMSS/round-summary.json | Per-round JSON (≈360 files) |
| daily/YYYY-MM-DD-summary.json | Daily aggregate (30 files) |
| checkpoints/checkpoint-7d.{json,txt} | 7-day window checkpoint |
| checkpoints/checkpoint-14d.{json,txt} | 14-day window checkpoint |
| final-report.{json,txt} | 30-day final report with Gate D assessment |


Known Failure Modes

| Failure | Likely Cause | Mitigation |
| --- | --- | --- |
| VAL18-01 FAIL: no leader elected | Both nodes started but neither won advisory lock; PG not ready | Check ha-logs/node1.log for acquired leadership; verify PG health on port 5488 |
| VAL18-03/04 FAIL: no failover in first 24 h | SOAK_FAILOVER_INTERVAL_ROUNDS not reached; round count file corrupted | Check rounds-since-failover; verify cron ran correctly |
| VAL18-04 FAIL: failover_ms > 10000 | Follower campaign interval too slow; PG advisory lock not released | Check --campaign 200ms; verify advisory lock release in node logs |
| VAL18-05/10 FAIL: probe row missing | Docker PG container recreated without the persistent volume (data lost) | Never docker rm -v val18-pg-primary; setup is designed to reuse the existing volume/container on a normal rerun |
| VAL18-08 FAIL: uptime < 99.9% | Repeated process crashes; quorum-lost rounds counted as failed | Check alerts.log for patterns and verify cron plus PID tracking |
| Node restarted every round | OS killed HA process (OOM, signal); binary crashing | Check ha-logs/node{1,2}.log for crash traces |
| final-report not generated | Final report is generated manually at day 30, not by cron | Run bash run_soak_val18_report.sh $SOAK_DIR --type final |
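
A few quick triage commands for the table above, using only the ports and paths defined in this plan (pg_isready ships with the PostgreSQL client tools):

pg_isready -h 127.0.0.1 -p 5488 -U postgres          # is the soak PG answering on the host port?
curl -fsS http://127.0.0.1:19010/v1/ha/status; echo   # node1 view of leadership/quorum
curl -fsS http://127.0.0.1:19011/v1/ha/status; echo   # node2 view of leadership/quorum
tail -n 20 "$SOAK_DIR"/alerts.log                     # latest threshold breaches
tail -n 40 "$SOAK_DIR"/ha-logs/node1.log              # recent node1 activity / crash traces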

Setup Safety Note

run_soak_val18_setup.sh is intended for initial provisioning and safe operator resume. It reuses the persistent Docker volume and container when they already exist, rather than deleting the 30-day soak state on a normal rerun.


Final Report Template

# VAL 18 — HA 30-Day Soak Report (30-day final)

Generated:            <timestamp>
Total rounds:         360  (healthy: 360  failed: 0)
HA uptime:            100.0%  (target ≥ 99.9%)

## Cluster Health
Quorum-lost rounds:   0
Probe fail rounds:    0
Node1 restart rounds: 1
Node2 restart rounds: 0
PG restart rounds:    0

## Failover Statistics
Total failovers:      30  (target ≥ 3)
Timing OK count:      30  (≤ 10,000 ms each)
Data continuity OK:   30  (probe row accessible after each failover)
Data continuity rate: 1.0  (target = 1.000)
p50 failover_ms:      312 ms
p95 failover_ms:      487 ms
p99 failover_ms:      612 ms
max failover_ms:      831 ms  (target ≤ 10,000 ms)

## VAL18 Check Results
VAL18-01 framework_provisioned:    PASS  (recorded in setup-summary.txt)
VAL18-02 initial_round_success:    PASS  (recorded in setup-summary.txt)
VAL18-03 failover_triggered:       PASS  (recorded in round summaries)
VAL18-04 failover_timing_baseline: PASS  (recorded in round summaries)
VAL18-05 data_continuity_baseline: PASS  (recorded in round summaries)
VAL18-06 failover_count_threshold: PASS  (total_failovers=30, target ≥ 1 per 7-day window)
VAL18-07 failover_timing_maintained: PASS  (max_failover_ms=831ms, target ≤ 10000ms)
VAL18-08 ha_uptime_maintained:     PASS  (uptime=100.0%, target ≥ 99.9%)
VAL18-09 total_failovers_final:    PASS  (total=30, target≥3)
VAL18-10 data_continuity_final:    PASS  (rate=1.000, target=1.000)

## Gate D HA Assessment
PASS: Gate D HA criteria met (failovers=30, max_failover_ms=831 ≤ 10000,
      data_continuity=1.0, uptime=100.0%)

Gate D HA Readiness Assessment:

  • Record max_failover_ms, data_continuity_rate, and ha_uptime_pct as evidence values in the Gate D sign-off document

  • PASS requires VAL18-09 (≥3 failovers) + VAL18-10 (data_continuity=1.0) + VAL18-07 (timing maintained)

  • All three checkpoint reports should be retained as intermediate evidence alongside the final report