VAL 14 — HA Replication Lag Baseline

Status: Implemented
Runner: run_ha_replication_lag_val14_lab() in scripts/labs/run_cli_audit_lab.sh
Evidence dir: $EVIDENCE_DIR/val14/
Port: cp-val14-node → 18999


Purpose

Benchmarks PostgreSQL streaming replication lag under the autonomyops HA architecture:

  • Characterises lag distribution at idle and under light and heavy write loads

  • Measures WAL LSN drain time after writes committed with synchronous_commit=off

  • Validates that quorum health responds correctly to standby availability changes

  • Derives formula-based alerting thresholds anchored to observed behaviour

  • Provides an evidence-backed replication health baseline for post-deployment monitoring


Branch-Specific Rule Application

| Question | Answer |
|----------|--------|
| Is this covered by an existing LAB? | No. run_quorum_lab() validates quorum health states; run_ha_lab() validates graceful failover. Neither measures write-path lag, LSN drain time, or derives alerting thresholds from observed data. |
| Which LAB/evidence bundle is extended? | run_cli_audit_lab.sh — new function run_ha_replication_lag_val14_lab() appended. Reuses the start_ha_server(), wait_for_http(), wait_for_log(), wait_for_quorum_health(), and wait_for_pg_container() helpers defined in the same file. |
| New evidence files | 22 files in $EVIDENCE_DIR/val14/ — see the Evidence Files table below. |
| Tutorial/runbook docs updated | docs/tutorials/cli-audit-lab.md §4 (slice 24), §5 (val14/ files), §6 (expected results), §8 (scope). |
| Reason a new runner function is required | None of the existing functions set up a streaming-replication PostgreSQL pair, measure write_lag from pg_stat_replication, or derive alerting thresholds. A dedicated function keeps the infrastructure isolated and avoids disrupting the complex multi-step flows in run_ha_lab() and run_quorum_lab(). |


Benchmark Design

Five workload phases, executed sequentially:

| Phase | Name | Write Load | synchronous_commit | Measurement |
|-------|------|------------|--------------------|-------------|
| 1 | Replication baseline | None | n/a | pg_stat_replication.state = streaming + idle LSN gap = 0 |
| 2 | Idle lag sampling | None | n/a | 10 × write_lag_ms samples at 200 ms interval |
| 3 | Light write load | 100 rows × 500 bytes | off | 5 lag samples during active drain + LSN drain time |
| 4 | Heavy write load | 500 rows × 2000 bytes (~1 MB WAL) | off | 5 lag samples during active drain + LSN drain time |
| 5 | Post-drain / restart catch-up | None | n/a | Verify the LSN gap returns to 0 within 10 s and that the replay backlog created while the standby was offline drains after restart |

synchronous_commit=off is used for write phases so the INSERT returns immediately without waiting for the standby. This allows lag to accumulate and be measured. The primary’s global synchronous_commit = 'remote_apply' does not block the write; the session-level override takes precedence.
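The override can be exercised per session; a minimal SQL sketch of the phase-3 write as it might be issued (table name and row sizes per the design above; the seq numbering is illustrative):

-- session-level override: wins over the global remote_apply without changing it
SET synchronous_commit = off;

-- phase 3 light burst: 100 rows × 500 bytes
INSERT INTO val14_bench (seq, payload)
SELECT g, repeat('x', 500) FROM generate_series(1, 100) AS g;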


Tooling and Setup

Environment

val14-ha-net  (Docker bridge network, isolated)
     │
val14-pg-primary  (postgres:16, wal_level=replica, max_wal_senders=10)
     │  streaming replication (replication slot val14_slot)
val14-pg-standby  (postgres:16, hot_standby=on, basebackup-provisioned)
     │
cp-val14-node:18999  (orchestrator_ha_server binary)

  • synchronous_standby_names = 'FIRST 1 (val14_standby)' — named sync standby

  • synchronous_commit = 'remote_apply' — global setting; overridden to off in write sessions

  • HA server: --min-sync-replicas 1 → quorum healthy only when standby is streaming
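A minimal sketch of how provisioning step 7 might apply these settings on the primary (the ALTER SYSTEM route is an assumption; the runner could instead edit postgresql.conf directly):

-- assumed mechanism: ALTER SYSTEM + reload, values per the diagram above
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (val14_standby)';
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
SELECT pg_reload_conf();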

Port Allocation

| Resource | Port | Notes |
|----------|------|-------|
| cp-val14-node HTTP | 18999 | Not used by any other lab function |
| Docker PostgreSQL primary | 5432 (internal) | Reachable via Docker network IP |
| Docker PostgreSQL standby | 5432 (internal) | Reachable within the Docker network only |

Provisioning Steps

  1. Create Docker network val14-ha-net + volumes val14-pg-primary-vol / val14-pg-standby-vol

  2. Start val14-pg-primary (postgres:16 with WAL streaming parameters)

  3. Create replicator role; add pg_hba.conf entry for replication

  4. Create benchmark table val14_bench(seq INT PRIMARY KEY, payload TEXT)

  5. Run pg_basebackup -Xs -P -R -C -S val14_slot inside a transient container → seeds val14-pg-standby-vol

  6. Start val14-pg-standby (reads postgresql.auto.conf written by pg_basebackup -R)

  7. Configure synchronous_standby_names + synchronous_commit on primary; pg_reload_conf()

  8. Start HA server at 127.0.0.1:18999; wait for leader election
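As a sketch, steps 1 and 5 might look like the following (the pg_basebackup flags are taken from step 5; the replicator password handling and mount path are assumptions):

# step 1: isolated network plus one volume per node
docker network create val14-ha-net
docker volume create val14-pg-primary-vol
docker volume create val14-pg-standby-vol

# step 5: seed the standby volume via a transient container
docker run --rm --network val14-ha-net \
  -v val14-pg-standby-vol:/var/lib/postgresql/data \
  -e PGPASSWORD='<replicator password>' \
  postgres:16 \
  pg_basebackup -h val14-pg-primary -U replicator \
    -D /var/lib/postgresql/data -Xs -P -R -C -S val14_slot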


Metrics Collected

| Metric | Source | Description |
|--------|--------|-------------|
| write_lag_ms | pg_stat_replication.write_lag via EXTRACT(EPOCH FROM write_lag)*1000 | Time between primary write and standby acknowledgement |
| LSN_gap | (pg_current_wal_lsn() - write_lsn)::bigint from pg_stat_replication | Bytes of unacknowledged WAL; 0 when fully caught up |
| drain_time_ms | Elapsed time from end of write transaction until LSN_gap = 0 | End-to-end replication flush latency |
| idle_p95_ms | 95th percentile of 10 idle write_lag_ms samples | Baseline noise floor |
| light_p95_ms | 95th percentile of 5 light-load lag samples captured before full drain | Typical steady-state lag during a light burst |
| heavy_p99_ms | 99th percentile of 5 heavy-load lag samples | Peak lag under sustained write burst |
| healthy_thresh_ms | Derived (see Analysis Method) | Normal operating envelope |
| degraded_thresh_ms | Derived (see Analysis Method) | Standby notably behind; investigate |
| alert_thresh_ms | Derived (see Analysis Method) | Action required; risk of data loss on failover |
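Both live metrics come from a single primary-side query; a sketch of the form implied by the Source column above:

SELECT state,
       COALESCE(EXTRACT(EPOCH FROM write_lag) * 1000, 0)::bigint AS write_lag_ms,
       (pg_current_wal_lsn() - write_lsn)::bigint                AS lsn_gap_bytes
  FROM pg_stat_replication
 WHERE state = 'streaming';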


Analysis Method

Lag Sampling

_v14_sample_lag() {                # usage: _v14_sample_lag <outfile> <N> <interval_s>
  local outfile=$1 n=$2 interval=$3 lag max=0
  for ((i = 1; i <= n; i++)); do
    # psql connection details (user/db) are assumptions in this sketch
    lag=$(docker exec val14-pg-primary psql -U postgres -Atc \
      "SELECT COALESCE(EXTRACT(EPOCH FROM write_lag)*1000, 0)::bigint
         FROM pg_stat_replication WHERE state = 'streaming'")
    echo "${lag:-0}" >> "$outfile"   # one integer per line
    (( ${lag:-0} > max )) && max=${lag:-0}
    sleep "$interval"
  done
  echo "$max"                        # max lag seen across the N samples
}

write_lag is null when the standby is fully caught up (no pending writes); COALESCE(…, 0) maps null → 0 so idle samples register as 0 ms, not absent data.

LSN Drain Measurement

_v14_drain_ms() {                  # returns elapsed ms until the LSN gap reaches 0
  local start_ms=$(( $(date +%s%N) / 1000000 )) gap
  for _ in $(seq 1 200); do        # poll up to 200 × 50 ms ≈ 10 s
    gap=$(docker exec val14-pg-primary psql -U postgres -Atc \
      "SELECT (pg_current_wal_lsn() - write_lsn)::bigint
         FROM pg_stat_replication WHERE state = 'streaming'")
    if [ "$gap" = "0" ]; then
      echo $(( $(date +%s%N) / 1000000 - start_ms )); return 0
    fi
    sleep 0.05
  done
  echo 99999                        # timeout sentinel
}

The drain clock starts after the INSERT transaction commits (with synchronous_commit=off). Gap = 0 indicates the standby has acknowledged all WAL up to the primary’s current position.
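Tying the two helpers together, a hedged sketch of how phase 4 might drive them (evidence file names from the Evidence Files section; the backgrounded sampler is an assumption that keeps the drain clock anchored to commit time):

# phase 4: heavy burst, async commit (seq range avoiding phase-3 keys is illustrative)
docker exec val14-pg-primary psql -U postgres -c \
  "SET synchronous_commit = off;
   INSERT INTO val14_bench (seq, payload)
   SELECT g, repeat('y', 2000) FROM generate_series(1001, 1500) AS g;" \
  > "$EVIDENCE_DIR/val14/val14-heavy-write.txt"

# sample lag in the background while the drain clock runs from commit time
_v14_sample_lag "$EVIDENCE_DIR/val14/val14-04-heavy-lag-samples.txt" 5 0.2 &
heavy_drain_ms=$(_v14_drain_ms)
wait
echo "heavy_drain_ms=${heavy_drain_ms}  threshold=5000" \
  > "$EVIDENCE_DIR/val14/val14-04-heavy-result.txt"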

Threshold Derivation Formula

observed_p95 = max(idle_p95_ms, light_p95_ms)

healthy_thresh  = max(observed_p95 × 3 + 1, 10)
degraded_thresh = max(healthy_thresh × 10, 100)
alert_thresh    = max(healthy_thresh × 50, 500)
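In shell arithmetic the derivation reduces to a few lines; a minimal sketch (variable names are assumptions):

observed_p95=$(( idle_p95_ms > light_p95_ms ? idle_p95_ms : light_p95_ms ))

healthy_thresh=$(( observed_p95 * 3 + 1 ))
(( healthy_thresh < 10 ))   && healthy_thresh=10
degraded_thresh=$(( healthy_thresh * 10 ))
(( degraded_thresh < 100 )) && degraded_thresh=100
alert_thresh=$(( healthy_thresh * 50 ))
(( alert_thresh < 500 ))    && alert_thresh=500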

Rationale:

  • 3 × multiplier: absorbs jitter and burst spikes without excessive false positives

  • +1: guarantees healthy > observed_p95 (no immediate alert at measured peak)

  • Minimums (10, 100, 500 ms): ensure meaningful thresholds on local Docker (near-zero observed lag)

  • 10 × / 50 × factors: match conventional “warning” and “critical” tiers relative to baseline

Example outputs for local Docker (observed_p95 ≈ 0 ms):

| Threshold | Value |
|-----------|-------|
| healthy | 10 ms |
| degraded | 100 ms |
| alert | 500 ms |

Example outputs for production (observed_p95 ≈ 20 ms):

| Threshold | Value |
|-----------|-------|
| healthy | 61 ms |
| degraded | 610 ms |
| alert | 3050 ms |


VAL14 10-Check Matrix

| Check | Name | Threshold | Phase |
|-------|------|-----------|-------|
| VAL14-01 | replication_streaming | pg_stat_replication.state = 'streaming' | Baseline |
| VAL14-02 | idle_lsn_gap_zero | LSN_gap = 0 at rest | Baseline / idle sampling |
| VAL14-03 | light_load_drain | drain_ms ≤ 2000 | Light write (100 rows × 500 bytes) |
| VAL14-04 | heavy_load_drain | drain_ms ≤ 5000 | Heavy write (500 rows × 2000 bytes) |
| VAL14-05 | post_drain_gap_closed | LSN_gap = 0 within 10 s after heavy drain | Post-drain verification |
| VAL14-06 | ha_replication_endpoint | /v1/health/replication responds 200 | Replication-health surface |
| VAL14-07 | standby_stop_degraded | quorum_health = degraded within 30 s of docker stop | Standby unavailability |
| VAL14-08 | standby_start_healthy | quorum_health = healthy after docker start | Standby recovery |
| VAL14-09 | catchup_after_restart | LSN backlog created while the standby is offline drains to 0 within 10 s of standby restart | Post-recovery catch-up |
| VAL14-10 | threshold_report | healthy_thresh > 0 AND degraded_thresh > 0 AND alert_thresh > 0 | Threshold derivation |
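Checks VAL14-06 through VAL14-08 can be probed straight off the HA server's HTTP surface; a hedged sketch (the response shape is assumed from the evidence file descriptions):

# VAL14-06: replication-health endpoint must answer 200
curl -s -o "$EVIDENCE_DIR/val14/val14-06-ha-endpoint.json" \
     -w '%{http_code}\n' http://127.0.0.1:18999/v1/health/replication

# VAL14-07: stop the standby, then poll quorum for up to 30 s
docker stop val14-pg-standby
for _ in $(seq 1 30); do
  curl -s http://127.0.0.1:18999/v1/ha/quorum | grep -q degraded && break
  sleep 1
done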


Pass/Fail Criteria

| Outcome | Condition |
|---------|-----------|
| PASS | All 10 checks pass |
| PARTIAL | Checks 1, 3, 4, 7, 8 pass (streaming confirmed, both drain windows met, quorum responds to standby state change) |
| FAIL | Check 1 fails (no replication) OR both check 3 and check 4 fail |

Primary workplan claims addressed by this validation:

  • Replication established and measurable (VAL14-01 + VAL14-02)

  • Lag drains within bounded time under realistic write loads (VAL14-03 + VAL14-04)

  • HA quorum correctly reflects standby availability (VAL14-07 + VAL14-08)

  • Practical alerting thresholds derived from observed data (VAL14-10)


Evidence Files

| File | Description |
|------|-------------|
| val14-pg-setup.txt | Role create, pg_hba update, table create, synchronous config |
| val14-pg-basebackup.txt | pg_basebackup output (progress + replication slot creation) |
| val14-ha-server.log | HA server log (leader election + quorum monitor) |
| val14-01-replication.txt | pg_stat_replication output (should show state=streaming) |
| val14-02-idle-lsn-gap.txt | LSN gap at rest (should be 0) |
| val14-02-idle-lag-samples.txt | 10 idle write_lag_ms samples (one integer per line) |
| val14-light-write.txt | INSERT output for 100-row light write |
| val14-03-light-lag-samples.txt | 5 lag samples during/after light write |
| val14-03-light-result.txt | light_drain_ms=<N>  threshold=2000 |
| val14-heavy-write.txt | INSERT output for 500-row heavy write |
| val14-04-heavy-lag-samples.txt | 5 lag samples during/after heavy write |
| val14-04-heavy-result.txt | heavy_drain_ms=<N>  threshold=5000 |
| val14-05-post-drain-samples.txt | 5 lag samples after heavy drain (should all be 0) |
| val14-05-drain-result.txt | post_drain_gap=<N> (should be 0) |
| val14-06-ha-endpoint.json | /v1/health/replication JSON response |
| val14-07-quorum-degraded.json | /v1/ha/quorum JSON showing quorum_health=degraded |
| val14-08-quorum-healthy.json | /v1/ha/quorum JSON showing quorum_health=healthy |
| val14-09-offline-write.txt | INSERT output for the write burst generated while the standby is offline |
| val14-09-post-restart-replication.txt | pg_stat_replication after standby restart |
| val14-09-catchup-result.txt | catchup_drain_ms=<N>  ok=true/false |
| val14-report.txt | Human-readable report with all 10 check results + derived thresholds |
| val14-report.json | Machine-readable JSON report |


Known Failure Modes

| Failure | Likely Cause | Mitigation |
|---------|--------------|------------|
| VAL14-01 FAIL: standby not streaming | pg_basebackup did not write standby.signal + postgresql.auto.conf | Verify val14-pg-basebackup.txt; check that the -R flag was passed to pg_basebackup |
| VAL14-03/04 FAIL: drain_ms=99999 | Standby not consuming WAL (replication lag > 10 s) | Check pg_stat_replication.state; verify the standby container is running |
| VAL14-07 FAIL: quorum stays healthy after standby stop | --min-sync-replicas 1 not applied or quorum monitor interval too long | Verify HA server startup args; check /v1/ha/quorum manually |
| VAL14-09 FAIL: catchup never drains | Replication slot val14_slot accumulated WAL during standby downtime; large WAL to replay | Increase drain poll window; check pg_stat_replication.write_lsn after restart |
| Docker not available | CI environment without Docker daemon | Function prints SKIP and exits 0; VAL14 not counted in pass total |
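The last row implies a guard at the top of the runner; a sketch of what it might look like (the message text is illustrative):

if ! command -v docker > /dev/null 2>&1 || ! docker info > /dev/null 2>&1; then
  echo "SKIP: VAL14 requires a Docker daemon; not counted in pass total"
  return 0
fi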


Final Report Template

# VAL 14 — HA Replication Lag Baseline

Generated:     <timestamp>
Nodes:         cp-val14-node:18999 (primary: val14-pg-primary  standby: val14-pg-standby)

## Lag Measurements
Idle p95 write_lag_ms:          <N> ms
Light load p95 write_lag_ms:    <N> ms  (100 rows × 500 bytes)
Light load drain time:          <N> ms
Heavy load p99 write_lag_ms:    <N> ms  (500 rows × 2000 bytes)
Heavy load drain time:          <N> ms

## Derived Alerting Thresholds
Healthy threshold:              <N> ms  (3 × observed p95 + 1, min 10)
Degraded threshold:             <N> ms  (10 × healthy, min 100)
Alert threshold:                <N> ms  (50 × healthy, min 500)

## Benchmark Checks
VAL14-01 replication_streaming:       PASS
VAL14-02 idle_lsn_gap_zero:           PASS
VAL14-03 light_load_drain:            PASS  (drain_ms=<N>, threshold=2000)
VAL14-04 heavy_load_drain:            PASS  (drain_ms=<N>, threshold=5000)
VAL14-05 post_drain_gap_closed:       PASS
VAL14-06 ha_replication_endpoint:     PASS
VAL14-07 standby_stop_degraded:       PASS
VAL14-08 standby_start_healthy:       PASS
VAL14-09 catchup_after_restart:       PASS
VAL14-10 threshold_report:            PASS

## Summary
pass=10  fail=0  total=10

Replication Lag Baseline Assessment:

  • Record idle_p95_ms, light_p95_ms, heavy_p99_ms, light_drain_ms, heavy_drain_ms as baseline values in the production monitoring configuration

  • Record healthy_thresh_ms, degraded_thresh_ms, alert_thresh_ms as the initial alerting thresholds

  • Repeat VAL14 after any significant infrastructure change (PG version upgrade, network topology change, hardware migration) to rebase the thresholds