VAL 18 — HA 30-Day Soak Validation

Status: Implemented

Scripts:

  • scripts/labs/run_soak_val18_setup.sh — one-time environment setup and safe resume without deleting persistent soak state

  • scripts/labs/run_soak_val18_round.sh — per-round workload (cron every 2 hours)

  • scripts/labs/run_soak_val18_report.sh — aggregated report generator

Evidence dir: operator-chosen SOAK_DIR (e.g. evidence/soak-val18/)
Ports: cp-val18-node1 → 19010 · cp-val18-node2 → 19011 · Docker PG → host port 5488
Gap reference: Gap HA-005 · Wave 5.10


Purpose

Validates the HA control-plane subsystem over a sustained 30-day runtime window:

  • Confirms both HA nodes remain healthy across ≥ 360 rounds (2-hour cron cadence)

  • Schedules and times at least 3 leader failovers over the soak period (workplan target ≥ 3)

  • Verifies each failover completes in ≤ 10,000 ms, supporting the workplan claim of a sub-10 s average

  • Verifies data continuity across every failover (probe row readable on new leader)

  • Tracks HA cluster uptime ≥ 99.9% (Gate D criterion)

  • Provides a 30-day final report with Gate D HA readiness assessment


Branch-Specific Rule Application

| Question | Answer |
| --- | --- |
| Is this covered by an existing LAB? | No. run_quorum_lab(), run_ha_lab(), and run_ha_failover_val13_lab() are single-session functions. VAL13 measures failover timing in a short burst (3 × SIGTERM cycles). None run continuously for 30 days, accumulate round evidence over time, or provide the Gate D “HA soak completed” public claim. |
| Which LAB/evidence bundle is extended? | Three new standalone scripts following the VAL12 pattern (run_soak_val12_{setup,round,report}.sh). Not a slice of run_cli_audit_lab.sh; the 30-day lifecycle cannot be expressed as a function within a single-shot script. |
| New evidence files | Per-round round-summary.json in $SOAK_DIR/rounds/YYYY-MM-DD/round-HHMMSS/; daily summaries, checkpoint reports, and the final report. See the Evidence Files table. |
| Tutorial/runbook docs updated | docs/tutorials/cli-audit-lab.md §8 VAL18 note added (separate runner, like VAL12). |
| Reason new runner required | 30-day continuous execution with cron scheduling, PID tracking across invocations, node restart recovery, and progressive report generation cannot be expressed as a run_cli_audit_lab.sh slice. The same reasoning applied for VAL12. |


Soak Environment Plan

val18-ha-net  (Docker bridge network, isolated)
     │
val18-pg-primary  (postgres:16, host port 5488 → internal 5432)
     │
     ├── cp-val18-node1:19010   (orchestrator_ha_server binary)
     └── cp-val18-node2:19011   (orchestrator_ha_server binary)

Both HA nodes connect to postgres://postgres:val18pass@127.0.0.1:5488/autonomy. The PostgreSQL container exposes port 5488 on the host, providing a stable connection address that survives Docker network IP reassignment. Both HA nodes run as host processes (not containers) with advisory lock election (--campaign 200ms).
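
A minimal provisioning sketch, assuming the setup script creates the network, volume, and container roughly as below; the actual run_soak_val18_setup.sh may differ in ordering and readiness checks, and it reuses the existing volume/container on rerun (the POSTGRES_DB name is inferred from the connection string above):

docker network create val18-ha-net 2>/dev/null || true   # reuse the bridge network if present
docker volume create val18-pg-vol                        # persistent storage for the probe table
docker run -d --name val18-pg-primary \
  --network val18-ha-net \
  -e POSTGRES_PASSWORD=val18pass \
  -e POSTGRES_DB=autonomy \
  -v val18-pg-vol:/var/lib/postgresql/data \
  -p 5488:5432 \
  postgres:16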

Key flags (a node-launch sketch follows the list):

  • --min-sync-replicas 0 — quorum requires only primary PG connectivity; no streaming standby

  • --campaign 200ms — fast leader election, typical failover ≤ 500 ms

  • --quorum-monitor-interval 500ms — sub-second loss detection
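
A node-launch sketch using the flags above. Only --min-sync-replicas, --campaign, and --quorum-monitor-interval are taken from this plan; --node-id, --listen, and --postgres-url are illustrative placeholders for however orchestrator_ha_server actually receives its identity, port, and connection string:

PG_URL="postgres://postgres:val18pass@127.0.0.1:5488/autonomy"
# Placeholder flags: --node-id / --listen / --postgres-url are assumptions, not confirmed options.
nohup orchestrator_ha_server \
  --node-id cp-val18-node1 \
  --listen 127.0.0.1:19010 \
  --postgres-url "$PG_URL" \
  --min-sync-replicas 0 \
  --campaign 200ms \
  --quorum-monitor-interval 500ms \
  >> "$SOAK_DIR/ha-logs/node1.log" 2>&1 &
echo $! > "$SOAK_DIR/node1.pid"   # PID file lets later rounds find and restart the node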


Failover Schedule Strategy

| Parameter | Value | Rationale |
| --- | --- | --- |
| Round interval | Every 2 hours (cron: 0 */2 * * *) | 360 rounds over 30 days; low overhead |
| Failover interval | Every 12 rounds (SOAK_FAILOVER_INTERVAL_ROUNDS=12) | 1 failover/24 h → ~30 total over 30 days |
| Failover mechanism | SIGTERM current leader PID | Graceful resign; deterministic timing; same mechanism as VAL13 |
| Post-failover restart | Killed node restarted immediately as follower | Keeps a 2-node cluster; avoids single-point-of-failure accumulation |
| Minimum required | ≥ 3 total over 30 days (workplan target) | ~30 with default schedule — 10× the minimum |

Failover timing method (a measurement sketch follows these steps):

1. start_ms = python3: int(time.time()*1000)
2. kill -TERM <leader_pid>
3. Poll follower /v1/ha/status every 50 ms until holder_id changes
4. end_ms = python3: int(time.time()*1000)
5. failover_ms = end_ms - start_ms  (timeout: 10,000 ms = SOAK_FAILOVER_TIMEOUT_MS)
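
A bash sketch of the same measurement, assuming the follower's /v1/ha/status response is JSON carrying the holder_id field shown in the schema below; the leader PID lookup is illustrative (the round script tracks PIDs in its own state files):

leader_pid=$(cat "$SOAK_DIR/node1.pid")
follower_status="http://127.0.0.1:19011/v1/ha/status"
get_holder() { curl -fsS "$follower_status" | python3 -c 'import sys,json; print(json.load(sys.stdin)["holder_id"])'; }

old_holder=$(get_holder)
start_ms=$(python3 -c 'import time; print(int(time.time()*1000))')
kill -TERM "$leader_pid"

failover_ms=""
while :; do
  now_ms=$(python3 -c 'import time; print(int(time.time()*1000))')
  holder=$(get_holder 2>/dev/null || true)
  if [ -n "$holder" ] && [ "$holder" != "$old_holder" ]; then
    failover_ms=$((now_ms - start_ms))
    break
  fi
  if [ $((now_ms - start_ms)) -ge "${SOAK_FAILOVER_TIMEOUT_MS:-10000}" ]; then
    break                                  # timing_ok recorded as false for this round
  fi
  sleep 0.05                               # poll every 50 ms
done
echo "failover_ms=${failover_ms:-TIMEOUT}"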

Observability and Monitoring Plan

Per-round evidence (rounds/YYYY-MM-DD/round-HHMMSS/)

| File | Content |
| --- | --- |
| round-summary.json | All round metrics (see schema below) |
| leader.txt | round=N leader=nodeX holder_id=... |
| health.txt | quorum_health=... probe_ok=... |
| failover-pre.txt | Pre-failover leader state (failover rounds only) |
| failover-result.txt | failover_ms=N new_holder=... timing_ok=... data_ok=... |

round-summary.json schema

{
  "timestamp":          "<RFC3339>",
  "round_date":         "YYYY-MM-DD",
  "round_time":         "HHMMSS",
  "round_count":        N,
  "health":             "ok"|"degraded"|"failed",
  "restarted_node1":    true|false,
  "restarted_node2":    true|false,
  "pg_restarted":       true|false,
  "leader_node":        "node1"|"node2"|"none",
  "holder_id":          "cp-val18-nodeX:19010",
  "quorum_health":      "healthy"|"degraded"|"lost",
  "probe_row_ok":       true|false,
  "total_failovers":    N,
  "failover_triggered": true|false,
  "failover_ms":        N|null,
  "failover_timing_ok": true|false|null,
  "data_continuity_ok": true|false|null
}

Persistent log files (ha-logs/)

ha-logs/node1.log and ha-logs/node2.log are append-only logs from both HA server processes. They are not rotated per round. They record "acquired leadership" and "resigned" events, quorum monitor transitions, and audit emitter startup messages.
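
A quick way to count those leadership events over the soak so far, assuming the quoted phrases appear verbatim in the log lines:

# Per-file counts of leadership acquisitions and resignations across the soak.
grep -c "acquired leadership" "$SOAK_DIR"/ha-logs/node1.log "$SOAK_DIR"/ha-logs/node2.log
grep -c "resigned"            "$SOAK_DIR"/ha-logs/node1.log "$SOAK_DIR"/ha-logs/node2.log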

Alert log (alerts.log)

Append-only alerts written by the round script on threshold breaches (a helper sketch follows the list):

  • no_leader — no leader identified in round

  • quorum_lost — quorum_health=lost during round

  • probe_row_missing — probe row not readable

  • failover_slow — failover_ms > SOAK_FAILOVER_TIMEOUT_MS

  • data_continuity_fail — probe row not readable after failover
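
A minimal helper in the spirit of the round script; the real alert line format may differ, and the shell variables below are illustrative stand-ins for the round metrics, but the alert names match the list above:

# Append one line per threshold breach to the append-only alert log.
alert() {  # usage: alert <name> <detail>
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) round=${ROUND_COUNT:-?} alert=$1 $2" >> "$SOAK_DIR/alerts.log"
}

[ "$leader_node" = "none" ]   && alert no_leader "no leader identified in round"
[ "$quorum_health" = "lost" ] && alert quorum_lost "quorum_health=lost"
[ "$probe_row_ok" = "false" ] && alert probe_row_missing "probe row not readable"
if [ -n "$failover_ms" ] && [ "$failover_ms" -gt "${SOAK_FAILOVER_TIMEOUT_MS:-10000}" ]; then
  alert failover_slow "failover_ms=$failover_ms"
fi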

Reports

| Report | When | Content |
| --- | --- | --- |
| daily/YYYY-MM-DD-summary.json | Nightly (cron 0 1 * * *) | JSON only: day’s rounds, failovers, uptime |
| checkpoints/checkpoint-7d.{json,txt} | Weekly (cron 0 1 * * 0) | 7-day window aggregation; VAL18-06/07/08 checks |
| checkpoints/checkpoint-14d.{json,txt} | Day 15 (cron 0 1 1 * *) | 14-day window; VAL18-08 uptime check |
| final-report.{json,txt} | Day 30 (manual) | All 30 days; Gate D assessment |
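
An illustrative operator crontab wiring the cadences above. The authoritative entries are emitted into setup-summary.txt by the setup script; the paths and the non-final --type values here are assumptions (only --type final is shown elsewhere in this plan):

# m h dom mon dow   command
SOAK_DIR=/abs/path/to/evidence/soak-val18
0 */2 * * *   bash /abs/path/to/scripts/labs/run_soak_val18_round.sh  "$SOAK_DIR"
0 1   * * *   bash /abs/path/to/scripts/labs/run_soak_val18_report.sh "$SOAK_DIR" --type daily
0 1   * * 0   bash /abs/path/to/scripts/labs/run_soak_val18_report.sh "$SOAK_DIR" --type checkpoint-7d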


Evidence Retention Plan

| Evidence | Retention | Notes |
| --- | --- | --- |
| rounds/YYYY-MM-DD/round-HHMMSS/ | Permanent | ~360 directories × ≤ 5 files each |
| ha-logs/node{1,2}.log | Permanent | Append-only, expected ≤ 50 MB over 30 days |
| alerts.log | Permanent | Append-only |
| daily/*.json | Permanent | 30 files, < 1 KB each |
| checkpoints/*.{json,txt} | Permanent | 2–4 files |
| final-report.{json,txt} | Permanent | Gate D evidence package |
| Docker PG volume (val18-pg-vol) | Persistent for 30 days | Contains the probe table; setup reuses the existing volume/container on a normal rerun instead of deleting soak state |
| Audit store (audit-store/) | Permanent | Shared with the shared audit emitter; accumulates events |


VAL18 10-Check Matrix

| Check | Name | Verified At | Threshold |
| --- | --- | --- | --- |
| VAL18-01 | framework_provisioned | Setup | Both nodes healthy; leader elected; probe row inserted; setup-summary.txt shows val18_01=PASS |
| VAL18-02 | initial_round_success | Setup | First round exits 0; leader and quorum identified |
| VAL18-03 | failover_triggered | Day 1 (round 12) | First scheduled SIGTERM failover completes; new leader elected |
| VAL18-04 | failover_timing_baseline | Day 1 (round 12) | First failover_ms ≤ 10,000 |
| VAL18-05 | data_continuity_baseline | Day 1 (round 12) | Probe row note='soak-init' readable on the new leader after the first failover |
| VAL18-06 | failover_count_threshold | Day 7 checkpoint | total_failovers ≥ 1 within the 7-day checkpoint window |
| VAL18-07 | failover_timing_maintained | Day 7 checkpoint | max_failover_ms ≤ 10,000 across all recorded failovers |
| VAL18-08 | ha_uptime_maintained | Day 14 checkpoint | healthy_rounds / total_rounds ≥ 99.9% |
| VAL18-09 | total_failovers_final | Day 30 final | total_failovers ≥ 3 over full 30-day window |
| VAL18-10 | data_continuity_final | Day 30 final | data_continuity_rate = 1.000 (probe row readable after every failover) |


Pass/Fail Criteria

| Outcome | Condition |
| --- | --- |
| PASS | All 10 checks pass |
| PARTIAL | Checks 3, 4, 5, 7, 10 pass (failover occurred, timing OK, data intact) |
| FAIL | Check 10 fails (data loss after any failover) OR check 4 fails (any failover exceeded 10 s) |

The two mandatory checks are VAL18-04 (failover timing) and VAL18-10 (data continuity). The workplan’s public claims are “sub-10 s failover” and “no data loss”; both require these checks to pass over the full 30-day window.


Gate D HA Assessment (Gap HA-005)

The final report includes a Gate D assessment block:

## Gate D HA Assessment
PASS: Gate D HA criteria met (failovers=30, max_failover_ms=412 ≤ 10000,
      data_continuity=1.0, uptime=99.9%)

Gate D passes when all four hold:

  • total_failovers ≥ 3 over 30 days

  • max_failover_ms ≤ 10,000 (no failover exceeded 10 s)

  • data_continuity_rate = 1.0 (probe row accessible after every failover)

  • ha_uptime_pct ≥ 99.9%

Checkpoint policy:

  • 7-day checkpoint: total_failovers ≥ 1

  • 30-day final Gate D: total_failovers ≥ 3
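
A sketch of how the final report could derive these four values from the accumulated round summaries; field names follow the round-summary.json schema above, and the real report script may compute them differently:

# Aggregate all round summaries and apply the four Gate D criteria.
python3 - "$SOAK_DIR" <<'PY'
import glob, json, sys

soak_dir = sys.argv[1]
rounds = [json.load(open(p)) for p in sorted(glob.glob(f"{soak_dir}/rounds/*/*/round-summary.json"))]

total = len(rounds)
healthy = sum(1 for r in rounds if r.get("health") == "ok")
failovers = [r for r in rounds if r.get("failover_triggered")]
timings = [r["failover_ms"] for r in failovers if r.get("failover_ms") is not None]

uptime_pct = 100.0 * healthy / total if total else 0.0
continuity = (sum(1 for r in failovers if r.get("data_continuity_ok")) / len(failovers)) if failovers else 0.0
max_ms = max(timings) if timings else None

gate_d = bool(len(failovers) >= 3 and max_ms is not None and max_ms <= 10000
              and continuity == 1.0 and uptime_pct >= 99.9)
print(f"failovers={len(failovers)} max_failover_ms={max_ms} "
      f"data_continuity={continuity:.3f} uptime={uptime_pct:.1f}% -> {'PASS' if gate_d else 'FAIL'}")
PY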


Evidence Files

| Path | Description |
| --- | --- |
| setup-summary.txt | VAL18-01/02 results, node PIDs, leader at setup, cron instructions |
| probe-setup.txt | Docker psql output: CREATE TABLE + INSERT for probe row |
| config.env | Shared configuration (sourced by round and report scripts) |
| round-count | Current round number (incremented each round) |
| total-failovers | Running failover count |
| rounds-since-failover | Rounds since last scheduled failover |
| ha-logs/node1.log | HA node1 append-only log |
| ha-logs/node2.log | HA node2 append-only log |
| alerts.log | Threshold breach alerts |
| rounds/YYYY-MM-DD/round-HHMMSS/round-summary.json | Per-round JSON (≈360 files) |
| daily/YYYY-MM-DD-summary.json | Daily aggregate (30 files) |
| checkpoints/checkpoint-7d.{json,txt} | 7-day window checkpoint |
| checkpoints/checkpoint-14d.{json,txt} | 14-day window checkpoint |
| final-report.{json,txt} | 30-day final report with Gate D assessment |


Known Failure Modes

| Failure | Likely Cause | Mitigation |
| --- | --- | --- |
| VAL18-01 FAIL: no leader elected | Both nodes started but neither won advisory lock; PG not ready | Check ha-logs/node1.log for acquired leadership; verify PG health on port 5488 |
| VAL18-03/04 FAIL: no failover in first 24 h | SOAK_FAILOVER_INTERVAL_ROUNDS not reached; round count file corrupted | Check rounds-since-failover; verify cron ran correctly |
| VAL18-04 FAIL: failover_ms > 10000 | Follower campaign interval too slow; PG advisory lock not released | Check --campaign 200ms; verify advisory lock release in node logs |
| VAL18-05/10 FAIL: probe row missing | Docker PG container recreated without the persistent volume (data lost) | Never docker rm -v val18-pg-primary; setup is designed to reuse the existing volume/container on a normal rerun |
| VAL18-08 FAIL: uptime < 99.9% | Repeated process crashes; quorum-lost rounds counted as failed | Check alerts.log for patterns and verify cron plus PID tracking |
| Node restarted every round | OS killed HA process (OOM, signal); binary crashing | Check ha-logs/node{1,2}.log for crash traces |
| final-report not generated | Final report is generated manually at day 30, not by cron | Run bash run_soak_val18_report.sh $SOAK_DIR --type final |
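
A few quick triage commands for the table above, using only the ports and paths defined in this plan (pg_isready ships with the PostgreSQL client tools):

pg_isready -h 127.0.0.1 -p 5488 -U postgres          # is the soak PG answering on the host port?
curl -fsS http://127.0.0.1:19010/v1/ha/status; echo   # node1 view of leadership/quorum
curl -fsS http://127.0.0.1:19011/v1/ha/status; echo   # node2 view of leadership/quorum
tail -n 20 "$SOAK_DIR"/alerts.log                     # latest threshold breaches
tail -n 40 "$SOAK_DIR"/ha-logs/node1.log              # recent node1 activity / crash traces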

Setup Safety Note

run_soak_val18_setup.sh is intended for initial provisioning and safe operator resume. It reuses the persistent Docker volume and container when they already exist, rather than deleting the 30-day soak state on a normal rerun.


Final Report Template

# VAL 18 — HA 30-Day Soak Report (30-day final)

Generated:            <timestamp>
Total rounds:         360  (healthy: 360  failed: 0)
HA uptime:            100.0%  (target ≥ 99.9%)

## Cluster Health
Quorum-lost rounds:   0
Probe fail rounds:    0
Node1 restart rounds: 1
Node2 restart rounds: 0
PG restart rounds:    0

## Failover Statistics
Total failovers:      30  (target ≥ 3)
Timing OK count:      30  (≤ 10,000 ms each)
Data continuity OK:   30  (probe row accessible after each failover)
Data continuity rate: 1.0  (target = 1.000)
p50 failover_ms:      312 ms
p95 failover_ms:      487 ms
p99 failover_ms:      612 ms
max failover_ms:      831 ms  (target ≤ 10,000 ms)

## VAL18 Check Results
VAL18-01 framework_provisioned:    PASS  (recorded in setup-summary.txt)
VAL18-02 initial_round_success:    PASS  (recorded in setup-summary.txt)
VAL18-03 failover_triggered:       PASS  (recorded in round summaries)
VAL18-04 failover_timing_baseline: PASS  (recorded in round summaries)
VAL18-05 data_continuity_baseline: PASS  (recorded in round summaries)
VAL18-06 failover_count_threshold: PASS  (total_failovers=30, target ≥ 1 per 7-day window)
VAL18-07 failover_timing_maintained: PASS  (max_failover_ms=831ms, target ≤ 10000ms)
VAL18-08 ha_uptime_maintained:     PASS  (uptime=100.0%, target ≥ 99.9%)
VAL18-09 total_failovers_final:    PASS  (total=30, target≥3)
VAL18-10 data_continuity_final:    PASS  (rate=1.000, target=1.000)

## Gate D HA Assessment
PASS: Gate D HA criteria met (failovers=30, max_failover_ms=831 ≤ 10000,
      data_continuity=1.0, uptime=100.0%)

Gate D HA Readiness Assessment:

  • Record max_failover_ms, data_continuity_rate, and ha_uptime_pct as evidence values in the Gate D sign-off document

  • PASS requires VAL18-09 (≥3 failovers) + VAL18-10 (data_continuity=1.0) + VAL18-07 (timing maintained)

  • All three checkpoint reports should be retained as intermediate evidence alongside the final report