Data · Official · Verified

Replication Lag Debugger

by Skill Me

Diagnoses read-replica lag and the stale-read bugs it causes, then applies read-after-write consistency strategies to fix them. Use when a read returns data older than a just-committed write ("the row I just saved disappeared"), when replica lag monitoring alerts or grows after a backfill, or when deciding which reads must hit the primary versus a replica.

replication, read-replica, consistency, postgres, mysql

SKILL.md preview

---
name: Replication Lag Debugger
description: Diagnoses read-replica lag and the stale-read bugs it causes, then applies read-after-write consistency strategies to fix them. Use when a read returns data older than a just-committed write ("the row I just saved disappeared"), when replica lag monitoring alerts or grows after a backfill, or when deciding which reads must hit the primary versus a replica.
---
# Replication Lag Debugger

Stop stale-read bugs caused by reads racing ahead of replication. A read replica is eventually consistent: a write committed on the primary is not instantly visible on replicas, so most "the data disappeared after I saved it" reports are reads hitting a replica that hasn't caught up — not data loss.

## Workflow

1. **Confirm it's a lag bug, not data loss.** Re-read the same row from the primary. If it's present on the primary but missing/stale on a replica, it's replication lag. If it's missing on the primary too, this is the wrong skill — it's a write/transaction bug.
2. **Measure the lag with real numbers — never prescribe a fix without this.**
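   - Postgres: on the primary, capture `pg_current_wal_lsn()`; on each replica, read `pg_last_wal_replay_lsn()` and `pg_last_xact_replay_timestamp()`; back on the primary, inspect `pg_stat_replication` for the write/flush/replay LSN gaps (`write_lsn`, `flush_lsn`, `replay_lsn`) and the matching `write_lag`/`flush_lag`/`replay_lag` intervals.
   - MySQL: read `Seconds_Behind_Source` from `SHOW REPLICA STATUS` (MySQL 8.0.22+; older versions expose `Seconds_Behind_Master` via `SHOW SLAVE STATUS`). Treat it as approximate: it can sit at 0 and then jump, and it is `NULL` when the SQL thread is not running.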
   - Record the actual gap (LSN delta and time) and whether it's steady or spiking.
3. **Split send lag from replay lag.** Compare the send gap (primary's current LSN minus the replica's flush LSN; network/send) against the replay gap (the replica's flush LSN minus its replay LSN; replica busy applying). The cause and fix differ; don't guess.
4. **Find why it lags**, matched to the split above:
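   - Single-threaded replay can't keep up with a write-heavy primary. On MySQL, enable parallel apply (`replica_parallel_workers`). Postgres replays WAL in a single startup process, so the options there are limited (for example, `recovery_prefetch` on Postgres 15+ to reduce I/O stalls during replay).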
   - A long-running query on the replica conflicts with WAL replay and pauses it (Postgres `hot_standby_feedback` / `max_standby_streaming_delay` tradeoff).
   - A bulk write, backfill migration, or VACUUM floods the WAL stream — throttle the backfill into smaller batches.
   - Cross-region network saturation adds send lag.
5. **Fix the stale read with read-after-write, cheapest first:**
   - Route a user's reads to the primary for a short window after their write (session stickiness keyed by user).
   - Or always read a user's own data from the primary and serve other users' data from replicas.
   - Or use LSN/GTID tracking: capture the write's LSN, then have the read wait until the chosen replica's replay LSN passes it (Postgres poll `pg_last_wal_replay_lsn()`; MySQL `WAIT_FOR_EXECUTED_GTID_SET`).
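
Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: `parse_lsn`, `LagSplit`, and `split_lag` are hypothetical helper names, and the three LSN strings are assumed to have been read already (the primary's `pg_current_wal_lsn()` plus the replica's `flush_lsn` and `replay_lsn` from `pg_stat_replication`). A Postgres LSN is printed as two hex halves of a 64-bit byte position, so the gaps come out in bytes.

```python
from dataclasses import dataclass


def parse_lsn(lsn: str) -> int:
    """Convert a Postgres LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass
class LagSplit:
    send_bytes: int    # primary's current LSN minus the replica's flush LSN
    replay_bytes: int  # replica's flush LSN minus its replay LSN

    @property
    def dominant(self) -> str:
        """Name the larger gap; a non-zero tie is reported as 'replay'."""
        if self.send_bytes == 0 and self.replay_bytes == 0:
            return "caught-up"
        return "send" if self.send_bytes > self.replay_bytes else "replay"


def split_lag(current_lsn: str, flush_lsn: str, replay_lsn: str) -> LagSplit:
    """Split total lag into a send gap and a replay gap, both in bytes."""
    current, flushed, replayed = (
        parse_lsn(value) for value in (current_lsn, flush_lsn, replay_lsn)
    )
    return LagSplit(send_bytes=current - flushed, replay_bytes=flushed - replayed)


if __name__ == "__main__":
    # Illustrative values only, not taken from a real cluster.
    lag = split_lag("16/B374D848", "16/B374D000", "16/A0000000")
    print(lag.send_bytes, lag.replay_bytes, lag.dominant)
```

If `dominant` comes back as `"send"`, look at the network causes in step 4. If it comes back as `"replay"`, look at apply throughput and replica query conflicts. On the server side, `pg_wal_lsn_diff()` computes the same byte difference.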

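Two of the read-after-write options in step 5 can be sketched as follows. This is a minimal illustration under stated assumptions: `StickyRouter` and `wait_for_replay` are hypothetical names, the last-write timestamps live in process memory (a multi-server deployment would need a shared store or a cookie instead), and `fetch_replay_lsn` stands in for whatever callable returns the replica's current replay position as an integer.

```python
import time
from typing import Callable, Dict


class StickyRouter:
    """Send a user's reads to the primary for a short window after their write."""

    def __init__(self, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock
        self._last_write: Dict[str, float] = {}

    def record_write(self, user_id: str) -> None:
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id: str) -> str:
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.window_seconds:
            return "primary"
        return "replica"


def wait_for_replay(write_lsn: int,
                    fetch_replay_lsn: Callable[[], int],
                    timeout: float = 1.0,
                    interval: float = 0.01,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the replica has replayed past write_lsn.

    Returns False on timeout so the caller can fall back to the primary.
    """
    deadline = clock() + timeout
    while True:
        if fetch_replay_lsn() >= write_lsn:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    router = StickyRouter(window_seconds=5.0)
    router.record_write("user-1")
    print(router.target_for_read("user-1"))  # inside the window
    print(router.target_for_read("user-2"))  # no recent write
```

Pick the sticky window from the lag measured in step 2 (comfortably above the observed peak) rather than a guess. When `wait_for_replay` times out, routing that one read to the primary keeps the request correct at the cost of some primary load.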