Disaster Recovery

The plan for the worst day. You hope you never need it. You build it anyway.

The disasters

When people say "high availability," they usually mean the system survives losing a single server or one availability zone.

Disaster recovery is for bigger failures.

A whole AWS region goes offline for hours; us-east-1 has had several major outages. A bad deploy corrupts your database. A ransomware attack encrypts all your servers. A misconfigured delete drops a critical table. A natural disaster physically takes out a data center.

These are rare. They are also catastrophic when they happen. Companies have gone out of business from a single bad backup strategy.

RTO and RPO. The two numbers

You define your DR plan with two numbers.

RTO (Recovery Time Objective) is how fast you have to be back up. "We can be down for 4 hours at most." A smaller RTO costs more. It usually means hot standbys and fast failover.

RPO (Recovery Point Objective) is how much data you can lose. "We can lose up to 15 minutes of writes." A smaller RPO also costs more. It means continuous replication and frequent backups.

A bank might have an RTO of minutes and an RPO of zero. A photo-sharing app might have an RTO of hours and an RPO of an hour.

These numbers shape your architecture. RPO of zero needs synchronous cross-region replication, which is expensive and adds a cross-region round trip to every write. RPO of an hour is fine with hourly backups.

Decide these numbers for your business first. Then build the plan that meets them.
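
A minimal sketch of the RPO arithmetic, to make the trade-off concrete. The function names are made up for this example. It assumes backups and replication are the only protection, and it ignores the time it takes to take and copy a backup.

```python
from datetime import timedelta
from typing import Optional


def worst_case_data_loss(
    backup_interval: timedelta,
    replication_lag: Optional[timedelta] = None,
) -> timedelta:
    """Largest window of writes you can lose in a disaster.

    Backups only: the disaster hits just before the next backup runs,
    so you lose up to one full backup interval.
    With replication: you lose whatever had not reached the replica yet.
    """
    if replication_lag is not None:
        return min(replication_lag, backup_interval)
    return backup_interval


def meets_rpo(
    rpo: timedelta,
    backup_interval: timedelta,
    replication_lag: Optional[timedelta] = None,
) -> bool:
    """True if the worst-case loss fits inside the RPO."""
    return worst_case_data_loss(backup_interval, replication_lag) <= rpo
```

With hourly backups, meets_rpo(timedelta(hours=1), timedelta(hours=1)) passes, but a 15-minute RPO fails until you add replication with a lag below 15 minutes.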

Backups. The simplest DR

The minimum bar is to take backups. Daily and weekly full snapshots of the database. Copy them somewhere that is not in the same region as production.

Test them. A backup you cannot restore is not a backup. It is just a file. Companies regularly find out their backups have been quietly broken for years.
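
A restore test can be automated. The sketch below uses SQLite from the Python standard library as a stand-in for a real production database. The function names and the row-count check are assumptions for this example; a real test would restore into a scratch instance of your actual database and run checks that matter to your application.

```python
import sqlite3


def backup_database(source_path: str, backup_path: str) -> None:
    """Copy a live SQLite database into a backup file."""
    src = sqlite3.connect(source_path)
    dst = sqlite3.connect(backup_path)
    try:
        src.backup(dst)  # online backup API, available since Python 3.7
    finally:
        dst.close()
        src.close()


def verify_backup(backup_path: str, expected_counts: dict) -> bool:
    """Open the backup, check its integrity, and compare row counts.

    expected_counts maps table name -> number of rows we expect to find.
    Table names come from our own config here, never from user input.
    """
    conn = sqlite3.connect(backup_path)
    try:
        status = conn.execute("PRAGMA integrity_check").fetchone()[0]
        if status != "ok":
            return False
        for table, expected in expected_counts.items():
            (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
            if count != expected:
                return False
        return True
    except sqlite3.DatabaseError:
        # Unreadable file, missing table, or corrupt pages: the backup is not usable.
        return False
    finally:
        conn.close()
```

Run a check like this on a schedule and alert when it fails. That is how you find a broken backup in days, not years.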

Better. Use continuous backups. Most modern managed databases like RDS and Cloud SQL offer point-in-time recovery. You can restore to any second within the retention window, which you configure yourself (RDS allows up to 35 days). This is the best protection against human error, like running DELETE FROM users without a WHERE clause.
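
Conceptually, point-in-time recovery is a base snapshot plus a replayed log of changes. The toy model below shows the idea with an in-memory dict. The log format and the function name are invented for this example and do not reflect any managed service's actual interface.

```python
from datetime import datetime


def restore_to_point_in_time(snapshot, snapshot_time, change_log, target_time):
    """Rebuild state as of target_time from a snapshot plus a change log.

    snapshot:   dict of key -> value, captured at snapshot_time
    change_log: iterable of (timestamp, op, key, value), op is "put" or "delete"
    """
    if target_time < snapshot_time:
        raise ValueError("target is older than the snapshot; use an earlier snapshot")

    state = dict(snapshot)  # never mutate the stored snapshot
    for ts, op, key, value in sorted(change_log, key=lambda entry: entry[0]):
        if ts <= snapshot_time:
            continue  # already included in the snapshot
        if ts > target_time:
            break  # everything after the target is what we want to undo
        if op == "put":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state
```

If the bad DELETE ran at 10:00:00, you restore to 09:59:59 and keep every write made before the mistake.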

Keep several generations. If you only keep yesterday's backup and the corruption happened 3 days ago, you are out of luck.
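
One common way to keep several generations is a tiered retention policy. The sketch below keeps every backup from the last few days, then one backup per week further back. The function name and the default windows are assumptions for this example; pick windows that fit how long corruption could go unnoticed in your system.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, today, daily=7, weekly=4):
    """Return the sorted backup dates to retain; everything else can be deleted.

    daily:  keep every backup taken within the last `daily` days
    weekly: going back `weekly` weeks from today, keep the newest
            backup of each ISO week that is older than the daily window
    """
    daily_cutoff = today - timedelta(days=daily)
    weekly_cutoff = today - timedelta(weeks=weekly)

    keep = set()
    newest_per_week = {}
    for d in sorted(backup_dates):
        if d > daily_cutoff:
            keep.add(d)
        elif d > weekly_cutoff:
            week = tuple(d.isocalendar()[:2])  # (ISO year, ISO week number)
            newest_per_week[week] = d  # ascending order, so the last one wins
    keep.update(newest_per_week.values())
    return sorted(keep)
```

With this policy, corruption that happened 3 days ago is still covered by a daily backup, and corruption from 3 weeks ago is covered by a weekly one.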

Multi-region. The gold standard

For services that cannot afford to be down, you run the whole stack in a second region.

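Active-passive. us-east-1 takes all traffic. us-west-2 holds a continuously updated standby. When us-east-1 goes down, you flip DNS and traffic moves to us-west-2. The switch is not instant. It takes as long as your health checks need to detect the failure, plus the time for DNS caches to expire.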
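
The failover decision itself is simple logic. Below is a minimal sketch of it, assuming health checks arrive one at a time and that failing back to the primary is a manual step. The class name and the threshold of 3 are hypothetical, and the actual DNS update is left as a comment because it depends on your provider.

```python
class FailoverController:
    """Tracks primary health checks and decides which region serves traffic."""

    def __init__(self, primary: str, standby: str, failure_threshold: int = 3):
        self.primary = primary
        self.standby = standby
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.active = primary

    def record_health_check(self, primary_healthy: bool) -> str:
        """Record one check result and return the region that should be active."""
        if self.active != self.primary:
            # Already failed over. Failing back is deliberate and manual in this sketch.
            return self.active

        if primary_healthy:
            self.consecutive_failures = 0  # one good check resets the count
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = self.standby  # this is where you would update DNS
        return self.active
```

Requiring several consecutive failures avoids flipping regions on a single dropped health check. The price is a slower failover, which counts against your RTO.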

Active-active. Both regions take traffic at the same time. East coast users hit us-east-1. West coast users hit us-west-2. If one region dies, the other absorbs everything, so each region needs enough capacity to carry the full load on its own. No failover delay.

The cost is high. You pay for twice the infrastructure. You handle complex cross-region data sync. You handle edge cases like "a user signs up in east, the signup is slow to propagate, and their next request lands on west, which has not seen them yet." But for products where minutes of downtime cost millions (financial services, large e-commerce), this is required.

You do not have to be Netflix-sized to need this. Even a small SaaS might have a contract that requires it. Plan for the disaster before it happens.
