VAL 14 — HA Replication Lag Baseline

Status: Implemented
Runner: run_ha_replication_lag_val14_lab() in scripts/labs/run_cli_audit_lab.sh
Evidence dir: $EVIDENCE_DIR/val14/
Port: cp-val14-node → 18999


Purpose

Benchmarks PostgreSQL streaming replication lag under the autonomyops HA architecture:

  • Characterises lag distribution at idle and under light and heavy write loads

  • Measures WAL LSN drain time after writes committed with synchronous_commit=off

  • Validates that quorum health responds correctly to standby availability changes

  • Derives formula-based alerting thresholds anchored to observed behaviour

  • Provides an evidence-backed replication health baseline for post-deployment monitoring


Branch-Specific Rule Application

| Question | Answer |
|----------|--------|
| Is this covered by an existing LAB? | No. run_quorum_lab() validates quorum health states; run_ha_lab() validates graceful failover. Neither measures write-path lag, LSN drain time, or derives alerting thresholds from observed data. |
| Which LAB/evidence bundle is extended? | run_cli_audit_lab.sh — new function run_ha_replication_lag_val14_lab() appended. Reuses the start_ha_server(), wait_for_http(), wait_for_log(), wait_for_quorum_health(), and wait_for_pg_container() helpers defined in the same file. |
| New evidence files | 22 files in $EVIDENCE_DIR/val14/ — see the Evidence Files table below. |
| Tutorial/runbook docs updated | docs/tutorials/cli-audit-lab.md §4 (slice 24), §5 (val14/ files), §6 (expected results), §8 (scope). |
| Reason a new runner function is required | None of the existing functions set up a streaming-replication PostgreSQL pair, measure write_lag from pg_stat_replication, or derive alerting thresholds. A dedicated function keeps the infrastructure isolated and avoids disrupting the complex multi-step flows in run_ha_lab() and run_quorum_lab(). |


Benchmark Design

Five workload phases, executed sequentially:

| Phase | Name | Write Load | synchronous_commit | Measurement |
|-------|------|------------|--------------------|-------------|
| 1 | Replication baseline | None | n/a | pg_stat_replication.state = streaming + idle LSN gap = 0 |
| 2 | Idle lag sampling | None | n/a | 10 × write_lag_ms samples at 200 ms interval |
| 3 | Light write load | 100 rows × 500 bytes | off | 5 lag samples during active drain + LSN drain time |
| 4 | Heavy write load | 500 rows × 2000 bytes (~1 MB WAL) | off | 5 lag samples during active drain + LSN drain time |
| 5 | Post-drain / restart catch-up | None | n/a | Verify the LSN gap returns to 0 within 10 s and that the replay backlog created while the standby was offline drains after restart |

synchronous_commit=off is used for write phases so the INSERT returns immediately without waiting for the standby. This allows lag to accumulate and be measured. The primary’s global synchronous_commit = 'remote_apply' does not block the write; the session-level override takes precedence.
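The override can be exercised per session; a minimal SQL sketch of the phase-3 write as it might be issued (table name and row sizes per the design above; the seq numbering is illustrative):

-- session-level override: wins over the global remote_apply without changing it
SET synchronous_commit = off;

-- phase 3 light burst: 100 rows × 500 bytes
INSERT INTO val14_bench (seq, payload)
SELECT g, repeat('x', 500) FROM generate_series(1, 100) AS g;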


Tooling and Setup

Environment

val14-ha-net  (Docker bridge network, isolated)
     │
val14-pg-primary  (postgres:16, wal_level=replica, max_wal_senders=10)
     │  streaming replication (replication slot val14_slot)
val14-pg-standby  (postgres:16, hot_standby=on, basebackup-provisioned)
     │
cp-val14-node:18999  (orchestrator_ha_server binary)

  • synchronous_standby_names = 'FIRST 1 (val14_standby)' — named sync standby

  • synchronous_commit = 'remote_apply' — global setting; overridden to off in write sessions

  • HA server: --min-sync-replicas 1 → quorum healthy only when standby is streaming
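A minimal sketch of how provisioning step 7 might apply these settings on the primary (the ALTER SYSTEM route is an assumption; the runner could instead edit postgresql.conf directly):

-- assumed mechanism: ALTER SYSTEM + reload, values per the diagram above
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (val14_standby)';
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
SELECT pg_reload_conf();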

Port Allocation

| Resource | Port | Notes |
|----------|------|-------|
| cp-val14-node HTTP | 18999 | Not used by any other lab function |
| Docker PostgreSQL primary | 5432 (internal) | Reachable via Docker network IP |
| Docker PostgreSQL standby | 5432 (internal) | Reachable within the Docker network only |

Provisioning Steps

  1. Create Docker network val14-ha-net + volumes val14-pg-primary-vol / val14-pg-standby-vol

  2. Start val14-pg-primary (postgres:16 with WAL streaming parameters)

  3. Create replicator role; add pg_hba.conf entry for replication

  4. Create benchmark table val14_bench(seq INT PRIMARY KEY, payload TEXT)

  5. Run pg_basebackup -Xs -P -R -C -S val14_slot inside a transient container → seeds val14-pg-standby-vol

  6. Start val14-pg-standby (reads postgresql.auto.conf written by pg_basebackup -R)

  7. Configure synchronous_standby_names + synchronous_commit on primary; pg_reload_conf()

  8. Start HA server at 127.0.0.1:18999; wait for leader election
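As a sketch, steps 1 and 5 might look like the following (the pg_basebackup flags are taken from step 5; the replicator password handling and mount path are assumptions):

# step 1: isolated network plus one volume per node
docker network create val14-ha-net
docker volume create val14-pg-primary-vol
docker volume create val14-pg-standby-vol

# step 5: seed the standby volume via a transient container
docker run --rm --network val14-ha-net \
  -v val14-pg-standby-vol:/var/lib/postgresql/data \
  -e PGPASSWORD='<replicator password>' \
  postgres:16 \
  pg_basebackup -h val14-pg-primary -U replicator \
    -D /var/lib/postgresql/data -Xs -P -R -C -S val14_slot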


Metrics Collected

| Metric | Source | Description |
|--------|--------|-------------|
| write_lag_ms | pg_stat_replication.write_lag via EXTRACT(EPOCH FROM write_lag)*1000 | Time between primary write and standby acknowledgement |
| LSN_gap | (pg_current_wal_lsn() - write_lsn)::bigint from pg_stat_replication | Bytes of unacknowledged WAL; 0 when fully caught up |
| drain_time_ms | Elapsed time from end of write transaction until LSN_gap = 0 | End-to-end replication flush latency |
| idle_p95_ms | 95th percentile of 10 idle write_lag_ms samples | Baseline noise floor |
| light_p95_ms | 95th percentile of 5 light-load lag samples captured before full drain | Typical steady-state lag during a light burst |
| heavy_p99_ms | 99th percentile of 5 heavy-load lag samples | Peak lag under sustained write burst |
| healthy_thresh_ms | Derived (see Analysis Method) | Normal operating envelope |
| degraded_thresh_ms | Derived (see Analysis Method) | Standby notably behind; investigate |
| alert_thresh_ms | Derived (see Analysis Method) | Action required; risk of data loss on failover |
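Both live metrics come from a single primary-side query; a sketch of the form implied by the Source column above:

SELECT state,
       COALESCE(EXTRACT(EPOCH FROM write_lag) * 1000, 0)::bigint AS write_lag_ms,
       (pg_current_wal_lsn() - write_lsn)::bigint                AS lsn_gap_bytes
  FROM pg_stat_replication
 WHERE state = 'streaming';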


Analysis Method

Lag Sampling

_v14_sample_lag() {                # usage: _v14_sample_lag <outfile> <N> <interval_s>
  local outfile=$1 n=$2 interval=$3 lag max=0
  for ((i = 1; i <= n; i++)); do
    # psql connection details (user/db) are assumptions in this sketch
    lag=$(docker exec val14-pg-primary psql -U postgres -Atc \
      "SELECT COALESCE(EXTRACT(EPOCH FROM write_lag)*1000, 0)::bigint
         FROM pg_stat_replication WHERE state = 'streaming'")
    echo "${lag:-0}" >> "$outfile"   # one integer per line
    (( ${lag:-0} > max )) && max=${lag:-0}
    sleep "$interval"
  done
  echo "$max"                        # max lag seen across the N samples
}

write_lag is null when the standby is fully caught up (no pending writes); COALESCE(…, 0) maps null → 0 so idle samples register as 0 ms, not absent data.

LSN Drain Measurement

_v14_drain_ms() {                  # returns elapsed ms until the LSN gap reaches 0
  local start_ms=$(( $(date +%s%N) / 1000000 )) gap
  for _ in $(seq 1 200); do        # poll up to 200 × 50 ms ≈ 10 s
    gap=$(docker exec val14-pg-primary psql -U postgres -Atc \
      "SELECT (pg_current_wal_lsn() - write_lsn)::bigint
         FROM pg_stat_replication WHERE state = 'streaming'")
    if [ "$gap" = "0" ]; then
      echo $(( $(date +%s%N) / 1000000 - start_ms )); return 0
    fi
    sleep 0.05
  done
  echo 99999                        # timeout sentinel
}

The drain clock starts after the INSERT transaction commits (with synchronous_commit=off). Gap = 0 indicates the standby has acknowledged all WAL up to the primary’s current position.
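Tying the two helpers together, a hedged sketch of how phase 4 might drive them (evidence file names from the Evidence Files section; the backgrounded sampler is an assumption that keeps the drain clock anchored to commit time):

# phase 4: heavy burst, async commit (seq range avoiding phase-3 keys is illustrative)
docker exec val14-pg-primary psql -U postgres -c \
  "SET synchronous_commit = off;
   INSERT INTO val14_bench (seq, payload)
   SELECT g, repeat('y', 2000) FROM generate_series(1001, 1500) AS g;" \
  > "$EVIDENCE_DIR/val14/val14-heavy-write.txt"

# sample lag in the background while the drain clock runs from commit time
_v14_sample_lag "$EVIDENCE_DIR/val14/val14-04-heavy-lag-samples.txt" 5 0.2 &
heavy_drain_ms=$(_v14_drain_ms)
wait
echo "heavy_drain_ms=${heavy_drain_ms}  threshold=5000" \
  > "$EVIDENCE_DIR/val14/val14-04-heavy-result.txt"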

Threshold Derivation Formula

observed_p95 = max(idle_p95_ms, light_p95_ms)

healthy_thresh  = max(observed_p95 × 3 + 1, 10)
degraded_thresh = max(healthy_thresh × 10, 100)
alert_thresh    = max(healthy_thresh × 50, 500)
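In shell arithmetic the derivation reduces to a few lines; a minimal sketch (variable names are assumptions):

observed_p95=$(( idle_p95_ms > light_p95_ms ? idle_p95_ms : light_p95_ms ))

healthy_thresh=$(( observed_p95 * 3 + 1 ))
(( healthy_thresh < 10 ))   && healthy_thresh=10
degraded_thresh=$(( healthy_thresh * 10 ))
(( degraded_thresh < 100 )) && degraded_thresh=100
alert_thresh=$(( healthy_thresh * 50 ))
(( alert_thresh < 500 ))    && alert_thresh=500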

Rationale:

  • 3 × multiplier: absorbs jitter and burst spikes without excessive false positives

  • +1: guarantees healthy > observed_p95 (no immediate alert at measured peak)

  • Minimums (10, 100, 500 ms): ensure meaningful thresholds on local Docker (near-zero observed lag)

  • 10 × / 50 × factors: match conventional “warning” and “critical” tiers relative to baseline

Example outputs for local Docker (observed_p95 ≈ 0 ms):

| Threshold | Value |
|-----------|-------|
| healthy | 10 ms |
| degraded | 100 ms |
| alert | 500 ms |

Example outputs for production (observed_p95 ≈ 20 ms):

| Threshold | Value |
|-----------|-------|
| healthy | 61 ms |
| degraded | 610 ms |
| alert | 3050 ms |


VAL14 10-Check Matrix

| Check | Name | Threshold | Phase |
|-------|------|-----------|-------|
| VAL14-01 | replication_streaming | pg_stat_replication.state = 'streaming' | Baseline |
| VAL14-02 | idle_lsn_gap_zero | LSN_gap = 0 at rest | Baseline / idle sampling |
| VAL14-03 | light_load_drain | drain_ms ≤ 2000 | Light write (100 rows × 500 bytes) |
| VAL14-04 | heavy_load_drain | drain_ms ≤ 5000 | Heavy write (500 rows × 2000 bytes) |
| VAL14-05 | post_drain_gap_closed | LSN_gap = 0 within 10 s after heavy drain | Post-drain verification |
| VAL14-06 | ha_replication_endpoint | /v1/health/replication responds 200 | Replication-health surface |
| VAL14-07 | standby_stop_degraded | quorum_health = degraded within 30 s of docker stop | Standby unavailability |
| VAL14-08 | standby_start_healthy | quorum_health = healthy after docker start | Standby recovery |
| VAL14-09 | catchup_after_restart | LSN backlog created while the standby is offline drains to 0 within 10 s of standby restart | Post-recovery catch-up |
| VAL14-10 | threshold_report | healthy_thresh > 0 AND degraded_thresh > 0 AND alert_thresh > 0 | Threshold derivation |
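Checks VAL14-06 through VAL14-08 can be probed straight off the HA server's HTTP surface; a hedged sketch (the response shape is assumed from the evidence file descriptions):

# VAL14-06: replication-health endpoint must answer 200
curl -s -o "$EVIDENCE_DIR/val14/val14-06-ha-endpoint.json" \
     -w '%{http_code}\n' http://127.0.0.1:18999/v1/health/replication

# VAL14-07: stop the standby, then poll quorum for up to 30 s
docker stop val14-pg-standby
for _ in $(seq 1 30); do
  curl -s http://127.0.0.1:18999/v1/ha/quorum | grep -q degraded && break
  sleep 1
done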


Pass/Fail Criteria

| Outcome | Condition |
|---------|-----------|
| PASS | All 10 checks pass |
| PARTIAL | Checks 1, 3, 4, 7, 8 pass (streaming confirmed, both drain windows met, quorum responds to standby state change) |
| FAIL | Check 1 fails (no replication) OR both check 3 and check 4 fail |

Primary workplan claims addressed by this validation:

  • Replication established and measurable (VAL14-01 + VAL14-02)

  • Lag drains within bounded time under realistic write loads (VAL14-03 + VAL14-04)

  • HA quorum correctly reflects standby availability (VAL14-07 + VAL14-08)

  • Practical alerting thresholds derived from observed data (VAL14-10)


Evidence Files

| File | Description |
|------|-------------|
| val14-pg-setup.txt | Role create, pg_hba update, table create, synchronous config |
| val14-pg-basebackup.txt | pg_basebackup output (progress + replication slot creation) |
| val14-ha-server.log | HA server log (leader election + quorum monitor) |
| val14-01-replication.txt | pg_stat_replication output (should show state=streaming) |
| val14-02-idle-lsn-gap.txt | LSN gap at rest (should be 0) |
| val14-02-idle-lag-samples.txt | 10 idle write_lag_ms samples (one integer per line) |
| val14-light-write.txt | INSERT output for 100-row light write |
| val14-03-light-lag-samples.txt | 5 lag samples during/after light write |
| val14-03-light-result.txt | light_drain_ms=<N>  threshold=2000 |
| val14-heavy-write.txt | INSERT output for 500-row heavy write |
| val14-04-heavy-lag-samples.txt | 5 lag samples during/after heavy write |
| val14-04-heavy-result.txt | heavy_drain_ms=<N>  threshold=5000 |
| val14-05-post-drain-samples.txt | 5 lag samples after heavy drain (should all be 0) |
| val14-05-drain-result.txt | post_drain_gap=<N> (should be 0) |
| val14-06-ha-endpoint.json | /v1/health/replication JSON response |
| val14-07-quorum-degraded.json | /v1/ha/quorum JSON showing quorum_health=degraded |
| val14-08-quorum-healthy.json | /v1/ha/quorum JSON showing quorum_health=healthy |
| val14-09-offline-write.txt | INSERT output for the write burst generated while the standby is offline |
| val14-09-post-restart-replication.txt | pg_stat_replication after standby restart |
| val14-09-catchup-result.txt | catchup_drain_ms=<N>  ok=true/false |
| val14-report.txt | Human-readable report with all 10 check results + derived thresholds |
| val14-report.json | Machine-readable JSON report |


Known Failure Modes

| Failure | Likely Cause | Mitigation |
|---------|--------------|------------|
| VAL14-01 FAIL: standby not streaming | pg_basebackup did not write standby.signal + postgresql.auto.conf | Verify val14-pg-basebackup.txt; check that the -R flag was passed to pg_basebackup |
| VAL14-03/04 FAIL: drain_ms=99999 | Standby not consuming WAL (replication lag > 10 s) | Check pg_stat_replication.state; verify the standby container is running |
| VAL14-07 FAIL: quorum stays healthy after standby stop | --min-sync-replicas 1 not applied or quorum monitor interval too long | Verify HA server startup args; check /v1/ha/quorum manually |
| VAL14-09 FAIL: catchup never drains | Replication slot val14_slot accumulated WAL during standby downtime; large WAL to replay | Increase drain poll window; check pg_stat_replication.write_lsn after restart |
| Docker not available | CI environment without Docker daemon | Function prints SKIP and exits 0; VAL14 not counted in pass total |
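The last row implies a guard at the top of the runner; a sketch of what it might look like (the message text is illustrative):

if ! command -v docker > /dev/null 2>&1 || ! docker info > /dev/null 2>&1; then
  echo "SKIP: VAL14 requires a Docker daemon; not counted in pass total"
  return 0
fi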


Final Report Template

# VAL 14 — HA Replication Lag Baseline

Generated:     <timestamp>
Nodes:         cp-val14-node:18999 (primary: val14-pg-primary  standby: val14-pg-standby)

## Lag Measurements
Idle p95 write_lag_ms:          <N> ms
Light load p95 write_lag_ms:    <N> ms  (100 rows × 500 bytes)
Light load drain time:          <N> ms
Heavy load p99 write_lag_ms:    <N> ms  (500 rows × 2000 bytes)
Heavy load drain time:          <N> ms

## Derived Alerting Thresholds
Healthy threshold:              <N> ms  (3 × observed p95 + 1, min 10)
Degraded threshold:             <N> ms  (10 × healthy, min 100)
Alert threshold:                <N> ms  (50 × healthy, min 500)

## Benchmark Checks
VAL14-01 replication_streaming:       PASS
VAL14-02 idle_lsn_gap_zero:           PASS
VAL14-03 light_load_drain:            PASS  (drain_ms=<N>, threshold=2000)
VAL14-04 heavy_load_drain:            PASS  (drain_ms=<N>, threshold=5000)
VAL14-05 post_drain_gap_closed:       PASS
VAL14-06 ha_replication_endpoint:     PASS
VAL14-07 standby_stop_degraded:       PASS
VAL14-08 standby_start_healthy:       PASS
VAL14-09 catchup_after_restart:       PASS
VAL14-10 threshold_report:            PASS

## Summary
pass=10  fail=0  total=10

Replication Lag Baseline Assessment:

  • Record idle_p95_ms, light_p95_ms, heavy_p99_ms, light_drain_ms, heavy_drain_ms as baseline values in the production monitoring configuration

  • Record healthy_thresh_ms, degraded_thresh_ms, alert_thresh_ms as the initial alerting thresholds

  • Repeat VAL14 after any significant infrastructure change (PG version upgrade, network topology change, hardware migration) to rebase the thresholds