Single Point of Failure (SPOF)

Every system has them. Knowing where they are is half the battle.

What a SPOF is

A single point of failure is any piece of your system whose failure takes down the whole thing. It does not matter how reliable the other parts are.

If you have 10 redundant servers but they all share a single load balancer, the load balancer is a SPOF. When it dies, the 10 servers might as well not exist.
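The load balancer example can be checked with a little arithmetic. The sketch below assumes components fail independently, which real systems only approximate, and the availability figures (99% per server, 99.9% per load balancer) are made up for illustration.

```python
def parallel(availabilities):
    """Redundant components: the group is down only if every member is down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down

def serial(availabilities):
    """A chain of components: the chain is up only if every link is up."""
    up = 1.0
    for a in availabilities:
        up *= a
    return up

# Hypothetical figures, assuming independent failures.
servers = parallel([0.99] * 10)           # ten redundant servers
one_lb = serial([0.999, servers])         # a single load balancer in front
two_lbs = serial([parallel([0.999, 0.999]), servers])  # a redundant pair in front

print(f"10 servers alone:     {servers:.6f}")
print(f"behind one LB:        {one_lb:.6f}")
print(f"behind two LBs:       {two_lbs:.6f}")
```

Under these assumptions the ten servers are almost never all down at once, so the single load balancer sets the ceiling for the whole path. Adding an eleventh server changes nothing; adding a second load balancer does.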

The same goes for a single database, a single DNS provider, a single AWS region, a single cache cluster, or a single CI/CD system that every deploy goes through.

The goal is not to remove every SPOF. That is impossible. There is always something. The goal is to know they exist, understand how they fail, and have a plan.

Common SPOFs to look for

Map out your system. Any component that all traffic must pass through, with no alternative path around it, is a likely SPOF.

The load balancer. Run two. Use DNS or Anycast to route around a dead one.

The database primary. Have a standby and a plan to fail over (the next concept).

DNS. Use a second DNS provider as a backup.

A single AWS region. Use a multi-region setup to survive a region-wide outage.

A single shared cache cluster. Lose Redis and every server gets slow, or breaks if the code assumes the cache is there.

A config service. If every server reads its config from one central service on startup, that service is critical.

The CI/CD pipeline. Cannot deploy a hotfix? You are stuck with the bug.

The auth service. If every request authenticates through one service and it dies, every request fails.
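The cache case is worth a closer look, because the difference between "slow" and "broken" is usually a few lines of code. Here is a minimal sketch of a read path that treats the cache as optional. `FlakyCache`, `CacheDown`, and `load_user` are hypothetical names for illustration, not a real Redis client API.

```python
class CacheDown(Exception):
    """Raised by the stand-in cache client when the cluster is unreachable."""

class FlakyCache:
    """In-memory stand-in for a cache client (hypothetical, for illustration)."""
    def __init__(self, available=True):
        self.available = available
        self.store = {}

    def get(self, key):
        if not self.available:
            raise CacheDown("cache unreachable")
        return self.store.get(key)

    def set(self, key, value):
        if not self.available:
            raise CacheDown("cache unreachable")
        self.store[key] = value

def load_user(user_id, cache, query_db):
    """Cache-aside read that falls back to the database if the cache is down."""
    key = f"user:{user_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except CacheDown:
        pass  # degrade: slower, but still correct
    value = query_db(user_id)
    try:
        cache.set(key, value)
    except CacheDown:
        pass  # failing to populate the cache must not fail the request
    return value

# With the cache down, the request still succeeds via the database.
user = load_user(42, FlakyCache(available=False), lambda uid: {"id": uid})
print(user)
```

This only keeps requests correct. Whether the database can absorb the extra load while the cache is gone is a separate question that needs its own answer.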

The exercise. Walk through a request from the client to the data and back. Every step is a possible SPOF.
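The walk-through can also be done mechanically. The sketch below models the request path as a small directed graph and reports every component whose removal cuts the client off from the data. The topology and the names in it are hypothetical.

```python
from collections import deque

def reachable(graph, start, goal, removed=None):
    """Breadth-first search: can goal be reached from start while `removed` is down?"""
    if start == removed or goal == removed:
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def find_spofs(graph, start, goal):
    """Return components whose individual failure cuts every path from start to goal."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(n for n in candidates
                  if not reachable(graph, start, goal, removed=n))

# Hypothetical topology: one load balancer, two app servers, one database primary.
request_path = {
    "client": ["load_balancer"],
    "load_balancer": ["app_1", "app_2"],
    "app_1": ["db_primary"],
    "app_2": ["db_primary"],
    "db_primary": ["data"],
}

spofs = find_spofs(request_path, "client", "data")
print(spofs)  # ['db_primary', 'load_balancer']
```

The two app servers do not show up because either one can carry the request. This only finds SPOFs you have drawn on the map, so the map has to include things like DNS, auth, and config to be useful.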

Removing the obvious ones

Some SPOFs have well-known fixes.

Two load balancers behind a shared IP or DNS round-robin. Managed load balancers such as AWS's ALB already run nodes in every Availability Zone you enable. You do not see the redundancy, but it is there.

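Database replication and failover. A primary plus a standby. If the primary dies, the standby is promoted. Managed databases like RDS and Cloud SQL do this automatically when configured for high availability, typically within a minute or two, though the exact time depends on the engine and the workload.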

Multi-AZ deployments. Spread servers across several availability zones, which are physically separate data centers in the same region. If one data center loses power, the others keep serving.

Multi-region deployments. Spread across regions like us-east-1 and us-west-2. This protects you against whole-region outages, which are rare but real: AWS has had them.

Cost. Each layer of redundancy roughly doubles your bill for that piece, so you cannot make everything redundant. Find the most critical SPOFs, remove those, and accept that lower-priority components may still cause outages.

The sneaky ones

Beyond the obvious infrastructure SPOFs, some sneaky ones bite real systems.

A shared utility library. Every service depends on it. A bad release ships everywhere and breaks the whole company at once.

A single on-call person. If only one engineer understands the system, that engineer going on vacation is a SPOF.

A third-party API. Stripe, Twilio, or SendGrid going down takes your features down with it, unless you planned for that with queuing, retries, or fallback paths.
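One way to plan for a provider outage is to retry a few times with backoff and, if that still fails, park the work in a queue to be replayed later. The sketch below shows the idea. `send_with_fallback`, `ProviderError`, and the in-memory `outbox` are hypothetical; a real system would use the provider's own error types and a durable queue.

```python
import time
from collections import deque

class ProviderError(Exception):
    """Stand-in for whatever error the third-party client raises (hypothetical)."""

def send_with_fallback(send, message, outbox, retries=3, base_delay=0.1,
                       sleep=time.sleep):
    """Try the provider with exponential backoff; queue the message if it stays down."""
    for attempt in range(retries):
        try:
            send(message)
            return "sent"
        except ProviderError:
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)  # wait 0.1s, 0.2s, ...
    outbox.append(message)  # fallback path: deliver later instead of failing now
    return "queued"

def always_down(message):
    """Simulates a provider outage."""
    raise ProviderError("provider unavailable")

outbox = deque()
delays = []  # record the backoff delays instead of actually sleeping
status = send_with_fallback(always_down, "welcome email", outbox,
                            sleep=delays.append)
print(status, list(outbox))  # queued ['welcome email']
```

Queuing only suits work that can be delayed, such as emails or notifications. Something like a payment authorization needs a different fallback, and retries are only safe if the call is idempotent.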

A single feature flag service. If your kill switches live in one central service and it goes down, you cannot turn features off when they are melting.
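A common mitigation is to keep the last known flag values locally and to give operators an out-of-band kill switch that does not depend on the flag service at all. The sketch below is one way to do that. The `KILL_<FLAG>` environment variable convention, `FlagServiceDown`, and `is_enabled` are assumptions made for illustration, not any particular vendor's API.

```python
import os

class FlagServiceDown(Exception):
    """Raised by the hypothetical fetch function when the flag service is unreachable."""

def is_enabled(name, fetch, last_known, environ=os.environ):
    """Resolve a flag: local kill switch first, then the service, then the last known value."""
    # 1. Out-of-band kill switch, e.g. KILL_NEW_CHECKOUT=1 set on the host.
    if environ.get(f"KILL_{name.upper()}") == "1":
        return False
    # 2. Ask the central service and remember the answer.
    try:
        value = fetch(name)
        last_known[name] = value
        return value
    except FlagServiceDown:
        # 3. Service is down: reuse the last value seen, defaulting to off.
        return last_known.get(name, False)

def service_down(name):
    """Simulates a flag service outage."""
    raise FlagServiceDown("flag service unreachable")

last_known = {"new_checkout": True}
print(is_enabled("new_checkout", service_down, last_known, environ={}))
print(is_enabled("new_checkout", service_down, last_known,
                 environ={"KILL_NEW_CHECKOUT": "1"}))
```

The first call keeps the feature on through the outage using the remembered value. The second shows that the local switch can still turn it off while the service is unreachable.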

The deployment pipeline. CI is down, so you cannot ship a fix, so the outage drags on.

Reliability is not just about hardware. It is about spotting every place your system quietly depends on something else "just being there."
