Failover
How systems survive losing critical pieces without waking someone up at 3am.
Active-passive. The standby pattern
The classic setup. One component is active and takes traffic. Another is passive, a warm standby that is ready to take over. The active one streams its state to the passive one all the time.
A primary database streams writes to a standby. The standby applies them and stays current. It serves no traffic. It just waits.
When the primary fails, an automated process notices (a health check fails for several seconds), promotes the standby to be the new primary, and reroutes traffic.
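A minimal sketch of that promote-and-reroute step. The `Node` and `Cluster` classes and the names `db-a` and `db-b` are made up for illustration. A real system would call out to the database and to DNS or a proxy instead of flipping fields in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    role: str  # "primary", "standby", or "failed"
    healthy: bool = True


@dataclass
class Cluster:
    primary: Node
    standby: Node
    route: str = field(init=False)  # name of the node that receives traffic

    def __post_init__(self):
        self.route = self.primary.name

    def failover_if_needed(self) -> bool:
        """Promote the standby and reroute traffic if the primary is unhealthy."""
        if self.primary.healthy:
            return False
        if not self.standby.healthy:
            raise RuntimeError("no healthy standby to promote")
        # Promote: the standby becomes the new primary.
        self.primary.role = "failed"
        self.standby.role = "primary"
        self.primary, self.standby = self.standby, self.primary
        # Reroute: point traffic at the new primary.
        self.route = self.primary.name
        return True


cluster = Cluster(Node("db-a", "primary"), Node("db-b", "standby"))
cluster.primary.healthy = False  # simulate a crash noticed by a health check
cluster.failover_if_needed()
print(cluster.route)  # db-b
```

The old primary stays in the cluster, marked as failed. It still has to be repaired and resynced before it can serve as the next standby.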
The time to fail over is usually 30 seconds to a few minutes. Users might see a short blip. Compared to "wait hours for a human to debug and restore," it is a huge improvement.
Detection. How do you know it died
Failover starts with detection. There are two common ways.
Health checks. Every few seconds you probe the primary. N failed probes in a row mean "primary is down."
Heartbeats. The primary sends "I am alive" pings. No ping for X seconds means "primary is dead."
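Both detectors fit in a few lines. This is a sketch, and the class names, the threshold, and the timeout are assumptions. Time is passed in explicitly so the logic is easy to test. In production you would likely pass `time.monotonic()`.

```python
class HealthCheckDetector:
    """Declares the primary down after N consecutive failed probes."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result; return True if the primary counts as down."""
        if probe_ok:
            self.consecutive_failures = 0  # any success resets the streak
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold


class HeartbeatDetector:
    """Declares the primary dead if no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout: float, now: float):
        self.timeout = timeout
        self.last_seen = now

    def beat(self, now: float) -> None:
        self.last_seen = now

    def is_dead(self, now: float) -> bool:
        return now - self.last_seen > self.timeout


hc = HealthCheckDetector(threshold=3)
results = [hc.record(ok) for ok in [False, False, True, False, False, False]]
print(results)  # [False, False, False, False, False, True]

hb = HeartbeatDetector(timeout=5.0, now=0.0)
hb.beat(now=2.0)
print(hb.is_dead(now=6.0))  # False, only 4 seconds of silence
print(hb.is_dead(now=8.0))  # True, 6 seconds of silence
```

Requiring a streak of failures trades detection speed for fewer false alarms. One dropped probe does not trigger a failover.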
The hard problem is distinguishing "down" from "slow" from "partitioned." A primary might be alive but unreachable from the failover monitor. If the monitor promotes the standby anyway, you now have two primaries that both think they are in charge. Both accept writes. They drift apart. The result is conflicting data that may be impossible to reconcile.
This is called split brain. The classic distributed systems nightmare.
There are several common defenses. Quorum-based promotion, where a majority of monitors must agree before failover. STONITH ("shoot the other node in the head"), which force-kills the old primary before promoting the standby. Fencing more generally, which revokes the old primary's ability to write, for example by cutting its access to shared storage.
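Two of these defenses are easy to sketch. The quorum check is a strict majority vote. The fencing example uses a fencing token, which is one way to revoke write permission: each promotion gets a higher number, and storage rejects anything older. The function and class names here are illustrative, not from any particular system.

```python
def quorum_agrees(votes: list[bool]) -> bool:
    """True only if a strict majority of monitors report the primary as down."""
    return sum(votes) > len(votes) // 2


class FencedStore:
    """Storage that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False  # this writer has been fenced off
        self.highest_token = token
        self.data[key] = value
        return True


print(quorum_agrees([True, True, False]))         # True, 2 of 3
print(quorum_agrees([True, True, False, False]))  # False, a tie is not a majority

store = FencedStore()
store.write(1, "owner", "old-primary")         # accepted
store.write(2, "owner", "new-primary")         # accepted, new primary holds token 2
print(store.write(1, "owner", "old-primary"))  # False, stale token rejected
print(store.data["owner"])                     # new-primary
```

With the token check in place, an old primary that wakes up after a partition cannot overwrite the new primary's data, even if it still believes it is in charge.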
Active-active. No standby. Both serve
Another option. Two (or more) active nodes. Both take traffic. If one dies, the other absorbs everything. No failover delay. No warm-up.
For stateless services like web servers and workers, active-active is trivial. Just put a load balancer in front of N copies.
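A toy round-robin balancer shows why this is easy. It is a sketch with made-up backend names. Real load balancers add health probing, connection draining, and weighting.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Rotates through backends, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
lb.mark_down("web-2")  # one copy dies
picks = [lb.pick() for _ in range(4)]
print(picks)  # ['web-1', 'web-3', 'web-1', 'web-3']
```

No state moves when a copy dies. The survivors just take a bigger share of the requests.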
For stateful services like databases, active-active is hard. Both nodes accept writes. They have to coordinate to avoid conflicts. You either use a consensus protocol like Raft or Paxos, or you accept eventual consistency.
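To make "accept eventual consistency" concrete, here is one simple conflict-resolution rule, last-write-wins. It is a sketch, not a recommendation. The `Versioned` type and its fields are assumptions, and the rule silently discards one of two concurrent writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float
    node_id: str


def merge(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: higher timestamp wins, node_id breaks ties."""
    return max(a, b, key=lambda v: (v.timestamp, v.node_id))


on_node_a = Versioned("blue", timestamp=10.0, node_id="node-a")
on_node_b = Versioned("green", timestamp=12.0, node_id="node-b")

# Both nodes reach the same answer no matter which side merges first.
print(merge(on_node_a, on_node_b).value)  # green
print(merge(on_node_b, on_node_a).value)  # green
```

The tie-breaker matters. Without it, two nodes holding writes with equal timestamps could each keep their own value and never converge.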
DynamoDB (with global tables), Cassandra, and multi-region Spanner are all active-active databases. They are also harder to run than primary-replica setups. Pick based on how much downtime you can tolerate and how much operational complexity you can afford.
Test it. For real.
The single most important thing about failover: if you do not test it, it does not work.
A failover plan you have never run is wishful thinking. Real failures happen at 3am, in a state you did not see coming, when something else is also broken.
Companies that take reliability seriously practice chaos engineering. Netflix built Chaos Monkey. Gremlin and AWS Fault Injection Service (formerly Fault Injection Simulator) offer similar tooling. These tools kill instances or inject faults during business hours, while engineers are awake to react. If a failover does not work, the team finds out before customers do.
A lower bar. Run failover drills every quarter. Promote the standby. Confirm the system survives. Roll back. Fix what did not work.
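A drill can be as simple as a script that runs each step and writes down what broke. This sketch uses placeholder functions and an in-memory `state` dict. In a real drill each step would call your actual promotion, verification, and rollback tooling.

```python
def run_failover_drill(steps):
    """Run every (name, action) step in order and return the ones that failed.

    A failing step does not stop the drill, so the rollback is always attempted.
    """
    failed = []
    for name, action in steps:
        try:
            ok = action()
        except Exception as exc:
            ok = False
            name = f"{name} ({exc})"
        if not ok:
            failed.append(name)
    return failed


# Placeholder steps acting on in-memory state, for illustration only.
state = {"primary": "db-a"}


def promote_standby():
    state["primary"] = "db-b"
    return True


def confirm_serving():
    return state["primary"] == "db-b"


def roll_back():
    state["primary"] = "db-a"
    return True


failed = run_failover_drill([
    ("promote standby", promote_standby),
    ("confirm system survives", confirm_serving),
    ("roll back", roll_back),
])
print(failed)  # [] means every step passed
```

Whatever ends up in `failed` is the list of things to fix before the next drill.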
Untested code is broken code. The same is true for failover.