VAL 14 — HA Replication Lag Baseline¶
- Status: Implemented
- Runner: `run_ha_replication_lag_val14_lab()` in `scripts/labs/run_cli_audit_lab.sh`
- Evidence dir: `$EVIDENCE_DIR/val14/`
- Port: `cp-val14-node` → 18999
Purpose¶
Benchmarks PostgreSQL streaming replication lag under the autonomyops HA architecture:
- Characterises lag distribution at idle and under light and heavy write loads
- Measures WAL LSN drain time after writes committed with `synchronous_commit=off`
- Validates that quorum health responds correctly to standby availability changes
- Derives formula-based alerting thresholds anchored to observed behaviour
- Provides an evidence-backed replication health baseline for post-deployment monitoring
Branch-Specific Rule Application¶
| Question | Answer |
|---|---|
| Is this covered by an existing LAB? | No. |
| Which LAB/evidence bundle is extended? | |
| New evidence files | 22 files in |
| Tutorial/runbook docs updated | |
| Reason new runner function required | None of the existing functions set up a streaming-replication PostgreSQL pair, measure |
Benchmark Design¶
Five workload phases, executed sequentially:
| Phase | Name | Write Load | | Measurement |
|---|---|---|---|---|
| 1 | Replication baseline | None | n/a | |
| 2 | Idle lag sampling | None | n/a | 10 × |
| 3 | Light write load | 100 rows × 500 bytes | | 5 lag samples during active drain + LSN drain time |
| 4 | Heavy write load | 500 rows × 2000 bytes (~1 MB WAL) | | 5 lag samples during active drain + LSN drain time |
| 5 | Post-drain / restart catch-up | None | n/a | Verify LSN gap returns to 0 within 10 s and replay backlog created while standby was offline drains after restart |
`synchronous_commit=off` is used for the write phases so the INSERT returns
immediately without waiting for the standby. This allows lag to accumulate and
be measured. The primary's global `synchronous_commit = 'remote_apply'` does
not block the write; the session-level override takes precedence.
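The override is applied per session. A minimal sketch of the light-load burst follows; the container name, credentials, and exact INSERT statement are assumptions, only the table shape and payload sizes match this document:

```bash
# Light-load write burst with a session-level synchronous_commit override.
# Container name, user, and statement shape are illustrative assumptions.
docker exec val14-pg-primary psql -U postgres -d postgres -c "
  SET synchronous_commit = off;                 -- session override beats the global 'remote_apply'
  INSERT INTO val14_bench (seq, payload)
  SELECT g, repeat('x', 500) FROM generate_series(1, 100) AS g;
"
```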
Tooling and Setup¶
Environment¶
```
val14-ha-net (Docker bridge network, isolated)
│
val14-pg-primary (postgres:16, wal_level=replica, max_wal_senders=10)
│ streaming replication (replication slot val14_slot)
val14-pg-standby (postgres:16, hot_standby=on, basebackup-provisioned)
│
cp-val14-node:18999 (orchestrator_ha_server binary)
```
- `synchronous_standby_names = 'FIRST 1 (val14_standby)'` — named sync standby
- `synchronous_commit = 'remote_apply'` — global setting; overridden to `off` in write sessions
- HA server: `--min-sync-replicas 1` → quorum healthy only when standby is streaming
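A quick check that the environment matches this description (a sketch; the container name and the expectation that the standby connects with `application_name=val14_standby` are assumptions):

```bash
# Confirm the standby is attached and registered as the named synchronous standby.
docker exec val14-pg-primary psql -U postgres -Atc \
  "SELECT application_name, state, sync_state FROM pg_stat_replication;"
# Expected output: val14_standby|streaming|sync
```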
Port Allocation¶
| Resource | Port | Notes |
|---|---|---|
| cp-val14-node HTTP | 18999 | Not used by any other lab function |
| Docker PostgreSQL primary | 5432 (internal) | Reachable via Docker network IP |
| Docker PostgreSQL standby | 5432 (internal) | Reachable within Docker network only |
Provisioning Steps¶
1. Create Docker network `val14-ha-net` + volumes `val14-pg-primary-vol` / `val14-pg-standby-vol`
2. Start `val14-pg-primary` (postgres:16 with WAL streaming parameters)
3. Create `replicator` role; add `pg_hba.conf` entry for replication
4. Create benchmark table `val14_bench` (`seq INT PRIMARY KEY, payload TEXT`)
5. Run `pg_basebackup -Xs -P -R -C -S val14_slot` inside a transient container → seeds `val14-pg-standby-vol`
6. Start `val14-pg-standby` (reads `postgresql.auto.conf` written by `pg_basebackup -R`)
7. Configure `synchronous_standby_names` + `synchronous_commit` on primary; `pg_reload_conf()`
8. Start HA server at `127.0.0.1:18999`; wait for leader election
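A condensed sketch of the container-side portion of this flow, assuming default credentials and the stock postgres:16 entrypoint; the runner's actual commands, users, and passwords may differ:

```bash
docker network create val14-ha-net
docker volume create val14-pg-primary-vol
docker volume create val14-pg-standby-vol

# Primary with streaming-replication parameters.
docker run -d --name val14-pg-primary --network val14-ha-net \
  -e POSTGRES_PASSWORD=postgres \
  -v val14-pg-primary-vol:/var/lib/postgresql/data \
  postgres:16 -c wal_level=replica -c max_wal_senders=10

# Seed the standby volume; -C -S creates and uses replication slot val14_slot,
# -R writes primary_conninfo into postgresql.auto.conf.
docker run --rm -u postgres --network val14-ha-net \
  -v val14-pg-standby-vol:/var/lib/postgresql/data \
  -e PGPASSWORD=replicator \
  postgres:16 pg_basebackup -h val14-pg-primary -U replicator \
  -D /var/lib/postgresql/data -Xs -P -R -C -S val14_slot

# Standby replays WAL streamed from the primary (hot_standby defaults to on).
docker run -d --name val14-pg-standby --network val14-ha-net \
  -v val14-pg-standby-vol:/var/lib/postgresql/data \
  postgres:16
```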
Metrics Collected¶
| Metric | Source | Description |
|---|---|---|
| | | Time between primary write and standby acknowledgement |
| | | Bytes of unacknowledged WAL; 0 when fully caught up |
| | Elapsed time from end of write transaction until | End-to-end replication flush latency |
| | 95th percentile of 10 idle | Baseline noise floor |
| | 95th percentile of 5 light-load lag samples captured before full drain | Typical steady-state lag during a light burst |
| | 99th percentile of 5 heavy-load lag samples | Peak lag under sustained write burst |
| | Derived (see §Analysis) | Normal operating envelope |
| | Derived (see §Analysis) | Standby notably behind; investigate |
| | Derived (see §Analysis) | Action required; risk of data loss on failover |
Analysis Method¶
Lag Sampling¶
```
_v14_sample_lag(outfile, N, interval):
    for i in 1..N:
        lag_ms = SELECT COALESCE(EXTRACT(EPOCH FROM write_lag)*1000, 0)::bigint
                 FROM pg_stat_replication WHERE state='streaming'
        append lag_ms to outfile
        sleep interval
    return max(lag_ms seen)
```
write_lag is null when the standby is fully caught up (no pending writes); COALESCE(…, 0) maps null → 0 so idle samples register as 0 ms, not absent data.
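A plausible bash rendering of this helper (container name, credentials, and psql flags are assumptions; the real function in `run_cli_audit_lab.sh` may differ):

```bash
_v14_sample_lag() {
  # Usage: _v14_sample_lag <outfile> <N> <interval-seconds>; prints the max lag seen.
  local outfile="$1" n="$2" interval="$3" max=0 lag
  for ((i = 1; i <= n; i++)); do
    lag=$(docker exec val14-pg-primary psql -U postgres -Atc \
      "SELECT COALESCE(EXTRACT(EPOCH FROM write_lag) * 1000, 0)::bigint
         FROM pg_stat_replication WHERE state = 'streaming';")
    lag=${lag:-0}
    echo "$lag" >> "$outfile"
    (( lag > max )) && max=$lag
    sleep "$interval"
  done
  echo "$max"
}
```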
LSN Drain Measurement¶
```
_v14_drain_ms():
    start_ms = current_time_ms
    repeat up to 200 × 50ms:
        gap = SELECT (pg_current_wal_lsn() - write_lsn)::bigint
              FROM pg_stat_replication WHERE state='streaming'
        if gap == 0: return (current_time_ms - start_ms)
    return 99999  # timeout sentinel
```
The drain clock starts after the INSERT transaction commits (with synchronous_commit=off). Gap = 0 indicates the standby has acknowledged all WAL up to the primary’s current position.
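A matching bash sketch of the drain poll (GNU date with millisecond precision is assumed, as are the container name and psql flags):

```bash
_v14_drain_ms() {
  # Polls the primary-to-standby write LSN gap every 50 ms, up to 200 times (10 s).
  local start_ms gap
  start_ms=$(date +%s%3N)
  for _ in $(seq 1 200); do
    gap=$(docker exec val14-pg-primary psql -U postgres -Atc \
      "SELECT (pg_current_wal_lsn() - write_lsn)::bigint
         FROM pg_stat_replication WHERE state = 'streaming';")
    if [[ "${gap:-1}" -eq 0 ]]; then
      echo $(( $(date +%s%3N) - start_ms ))
      return 0
    fi
    sleep 0.05
  done
  echo 99999   # timeout sentinel
}
```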
Threshold Derivation Formula¶
```
observed_p95    = max(idle_p95_ms, light_p95_ms)
healthy_thresh  = max(observed_p95 × 3 + 1, 10)
degraded_thresh = max(healthy_thresh × 10, 100)
alert_thresh    = max(healthy_thresh × 50, 500)
```
Rationale:
- `3 ×` multiplier: absorbs jitter and burst spikes without excessive false positives
- `+1`: guarantees healthy > observed_p95 (no immediate alert at measured peak)
- Minimums (`10`, `100`, `500` ms): ensure meaningful thresholds on local Docker (near-zero observed lag)
- `10 ×` / `50 ×` factors: match conventional “warning” and “critical” tiers relative to baseline
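The derivation is simple enough to reproduce in shell arithmetic. A sketch (function name and the millisecond inputs are illustrative) that reproduces both example tables below:

```bash
derive_thresholds() {
  # Args: <idle_p95_ms> <light_p95_ms>; prints the three derived thresholds.
  local idle=$1 light=$2
  local observed=$(( idle > light ? idle : light ))
  local healthy=$(( observed * 3 + 1 ));  (( healthy  < 10  )) && healthy=10
  local degraded=$(( healthy * 10 ));     (( degraded < 100 )) && degraded=100
  local alert=$(( healthy * 50 ));        (( alert    < 500 )) && alert=500
  echo "healthy=${healthy}ms degraded=${degraded}ms alert=${alert}ms"
}
derive_thresholds 0 0    # local Docker:   healthy=10ms degraded=100ms alert=500ms
derive_thresholds 5 20   # production-like: healthy=61ms degraded=610ms alert=3050ms
```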
Example outputs for local Docker (observed_p95 ≈ 0 ms):
| Threshold | Value |
|---|---|
| healthy | 10 ms |
| degraded | 100 ms |
| alert | 500 ms |
Example outputs for production (observed_p95 ≈ 20 ms):
| Threshold | Value |
|---|---|
| healthy | 61 ms |
| degraded | 610 ms |
| alert | 3050 ms |
VAL14 10-Check Matrix¶
| Check | Name | Threshold | Phase |
|---|---|---|---|
| VAL14-01 | replication_streaming | | Baseline |
| VAL14-02 | idle_lsn_gap_zero | | Baseline / idle sampling |
| VAL14-03 | light_load_drain | | Light write (100 rows × 500 bytes) |
| VAL14-04 | heavy_load_drain | | Heavy write (500 rows × 2000 bytes) |
| VAL14-05 | post_drain_gap_closed | | Post-drain verification |
| VAL14-06 | ha_replication_endpoint | | Replication-health surface |
| VAL14-07 | standby_stop_degraded | | Standby unavailability |
| VAL14-08 | standby_start_healthy | | Standby recovery |
| VAL14-09 | catchup_after_restart | LSN backlog created while standby is offline drains to 0 within 10 s of standby restart | Post-recovery catch-up |
| VAL14-10 | threshold_report | | Threshold derivation |
Pass/Fail Criteria¶
| Outcome | Condition |
|---|---|
| PASS | All 10 checks pass |
| PARTIAL | Checks 1, 3, 4, 7, 8 pass (streaming confirmed, both drain windows met, quorum responds to standby state change) |
| FAIL | Check 1 fails (no replication) OR both check 3 and check 4 fail |
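Sketched as shell logic (the `ok_*` flags are hypothetical names for individual check results, not variables from the runner):

```bash
# Outcome logic mirroring the table above.
if (( pass_count == 10 )); then
  outcome=PASS
elif (( ok_01 && ok_03 && ok_04 && ok_07 && ok_08 )); then
  outcome=PARTIAL
elif (( ! ok_01 )) || (( ! ok_03 && ! ok_04 )); then
  outcome=FAIL
fi
```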
Primary workplan claims addressed by this lab:

- Replication established and measurable (VAL14-01 + VAL14-02)
- Lag drains within bounded time under realistic write loads (VAL14-03 + VAL14-04)
- HA quorum correctly reflects standby availability (VAL14-07 + VAL14-08)
- Practical alerting thresholds derived from observed data (VAL14-10)
Evidence Files¶
| File | Description |
|---|---|
| | Role create, pg_hba update, table create, synchronous config |
| | |
| | HA server log (leader election + quorum monitor) |
| | |
| | LSN gap at rest (should be |
| | 10 idle |
| | INSERT output for 100-row light write |
| | 5 lag samples during/after light write |
| | |
| | INSERT output for 500-row heavy write |
| | 5 lag samples during/after heavy write |
| | |
| | 5 lag samples after heavy drain (should all be 0) |
| | |
| | |
| | |
| | |
| | INSERT output for the write burst generated while standby is offline |
| | |
| | |
| | Human-readable report with all 10 check results + derived thresholds |
| | Machine-readable JSON report |
Known Failure Modes¶
| Failure | Likely Cause | Mitigation |
|---|---|---|
| VAL14-01 FAIL: standby not streaming | | Verify |
| VAL14-03/04 FAIL: | Standby not consuming WAL (replication lag > 10 s) | Check |
| VAL14-07 FAIL: quorum stays healthy after standby stop | | Verify HA server startup args; check |
| VAL14-09 FAIL: catchup never drains | Replication slot | Increase drain poll window; check |
| Docker not available | CI environment without Docker daemon | Function prints SKIP and exits 0; VAL14 not counted in pass total |
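Useful first-look diagnostics for the failure modes above (container names match this lab; the specific queries are suggestions rather than the script's own checks):

```bash
# Is the standby attached, and how far behind is it?
docker exec val14-pg-primary psql -U postgres -c \
  "SELECT application_name, state, sync_state, write_lag FROM pg_stat_replication;"

# Does the replication slot exist, and is it active?
docker exec val14-pg-primary psql -U postgres -c \
  "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"

# Standby-side recovery / startup messages.
docker logs --tail 50 val14-pg-standby
```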
Final Report Template¶
```
# VAL 14 — HA Replication Lag Baseline
Generated: <timestamp>
Nodes: cp-val14-node:18999 (primary: val14-pg-primary standby: val14-pg-standby)

## Lag Measurements
Idle p95 write_lag_ms: <N> ms
Light load p95 write_lag_ms: <N> ms (100 rows × 500 bytes)
Light load drain time: <N> ms
Heavy load p99 write_lag_ms: <N> ms (500 rows × 2000 bytes)
Heavy load drain time: <N> ms

## Derived Alerting Thresholds
Healthy threshold: <N> ms (3 × observed p95 + 1, min 10)
Degraded threshold: <N> ms (10 × healthy, min 100)
Alert threshold: <N> ms (50 × healthy, min 500)

## Benchmark Checks
VAL14-01 replication_streaming: PASS
VAL14-02 idle_lsn_gap_zero: PASS
VAL14-03 light_load_drain: PASS (drain_ms=<N>, threshold=2000)
VAL14-04 heavy_load_drain: PASS (drain_ms=<N>, threshold=5000)
VAL14-05 post_drain_gap_closed: PASS
VAL14-06 ha_replication_endpoint: PASS
VAL14-07 standby_stop_degraded: PASS
VAL14-08 standby_start_healthy: PASS
VAL14-09 catchup_after_restart: PASS
VAL14-10 threshold_report: PASS

## Summary
pass=10 fail=0 total=10
```
Replication Lag Baseline Assessment:
- Record `idle_p95_ms`, `light_p95_ms`, `heavy_p99_ms`, `light_drain_ms`, `heavy_drain_ms` as baseline values in the production monitoring configuration
- Record `healthy_thresh_ms`, `degraded_thresh_ms`, `alert_thresh_ms` as the initial alerting thresholds
- Repeat VAL14 after any significant infrastructure change (PG version upgrade, network topology change, hardware migration) to rebase the thresholds