Database Replication

Reads scale across many machines. Writes do not.

Reads and writes are not the same

In most apps, reads outnumber writes by a lot. On Twitter, you scroll your feed about 100 times for every tweet you post. On Instagram, every post gets thousands of likes. On Reddit, most users only read.

A common ratio is about 90 percent reads and 10 percent writes.

Reads are usually simple. Fetch a row. Return it.

Writes are harder. They take locks. They update indexes. They touch related rows. They have to survive a crash.

If you can scale reads on their own, separate from writes, you unlock a much bigger system.

One database cannot take 200 reads per second

Your app gets 200 reads per second plus a handful of writes. A single database handles 40 r/s. It is overloaded by 5x.

You cannot just add another database. The two would hold different data. Which one is right?

You need a way to keep multiple databases in sync. That is replication.

One writer. Many readers

Promote one database to be the primary. It is the only one that takes writes. The other databases become read replicas. They are read-only copies.

When you write to the primary, it streams those changes over the network to all the replicas. The replicas apply the changes in the same order. They end up with the same data as the primary.

For reads, your app can hit any of the replicas. They all have the same data, with a tiny delay.

Send reads to replicas. Send writes to the primary

Your app code, or a proxy like PgBouncer, routes traffic two ways.

All writes (INSERT, UPDATE, DELETE) go to the primary. All reads (SELECT) go to one of the replicas, often picked round-robin.

Add more replicas and your reads scale in a straight line. 5 replicas at 40 r/s each gives 200 r/s of read capacity. Match your traffic.

Writes still all land on the primary. But writes are 10 times rarer than reads, so the primary stays healthy.

The tradeoff. Replication lag

Replication is not instant. A write takes milliseconds to reach all replicas. Under heavy load it can take seconds.

Picture this. A user updates their profile name. The write hits the primary. The app reads from a replica right after. The replica has not received the new name yet. The user sees their old name. They wonder if their save even worked.

This is called replication lag. It means you have an eventually consistent system, not a strictly consistent one.

There are a couple of fixes. You can read your own writes from the primary for a short window. Or you can accept the small staleness, which is what most apps do.

What happens if the primary dies

Now the failure mode. If the primary crashes, writes have nowhere to go. Reads still work because the replicas are alive. But the whole app turns read-only.

The fix is called failover. You detect that the primary is dead. You promote one of the replicas to be the new primary. You point the app at the new one.

Managed databases like RDS and Cloud SQL do this in seconds, automatically. In a setup you run yourself, it is manual and risky. There is a window where two replicas might both think they are the primary. This is called split brain. Both can accept conflicting writes.

This is its own big topic. For now, know that one primary is a single point of failure. The answer is automatic failover.

Now build it yourself →