Circuit Breaker

A failure-isolation pattern borrowed from electrical engineering. A must-have for distributed systems.

The cascading failure problem

Service A depends on Service B. B is slow today. Every call takes 30 seconds instead of 50 ms.

A's threads pile up waiting for B. Each in-flight request to A holds a thread, some memory, and often a database connection. As more requests arrive, A runs out of threads and becomes slow too. Service C, which depends on A, starts backing up in the same way. The whole system falls over.
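A rough back-of-the-envelope calculation shows how quickly this happens. By Little's law, the number of requests in flight is roughly the arrival rate times the time each request takes. The 50 ms and 30 second latencies come from the example above. The arrival rate of 100 requests per second and the pool of 200 threads are made-up numbers, chosen only for illustration.

```python
# Little's law: requests in flight ~= arrival rate * time per request.
# ARRIVAL_RATE and POOL_SIZE are assumed values, not measurements.

ARRIVAL_RATE = 100      # requests per second reaching Service A (assumed)
POOL_SIZE = 200         # worker threads available in Service A (assumed)


def threads_needed(arrival_rate: float, latency_seconds: float) -> float:
    """Average number of threads busy waiting on B at steady state."""
    return arrival_rate * latency_seconds


healthy = threads_needed(ARRIVAL_RATE, 0.050)   # B answers in 50 ms
degraded = threads_needed(ARRIVAL_RATE, 30.0)   # B answers in 30 s

# Once B turns slow, every new request takes a thread and keeps it for
# 30 s, so the pool drains at roughly the arrival rate.
seconds_to_exhaustion = POOL_SIZE / ARRIVAL_RATE

print(f"healthy:  {healthy:.0f} threads busy")
print(f"degraded: {degraded:.0f} threads needed, pool has {POOL_SIZE}")
print(f"pool exhausted after about {seconds_to_exhaustion:.0f} s")
```

Under these assumptions A needs only a handful of threads when B is healthy, but far more than it has when B is slow, and the pool is gone within seconds.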

This is a cascading failure. One slow downstream service causes everything upstream to lock up. It is one of the most common causes of large outages in microservice systems.

The fix is simple to say. Notice that B is failing. Stop calling B. Fail fast on the requests that need it.

How a circuit breaker works

A circuit breaker sits between your code and a downstream service. It has three states.

Closed is the normal state. Requests pass through. The breaker counts failures.

Open means too many recent failures. The breaker "trips." Requests fail right away without calling the downstream. The breaker returns an error or a fallback value.

Half-open comes after a cooldown. The breaker lets one trial request through. If it succeeds, the breaker closes and returns to normal. If it fails, the breaker goes back to open and the cooldown starts over.
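The three states can be sketched as a small state machine. This is a minimal, single-threaded illustration, not a production implementation: there is no locking, and a real breaker would have to make sure only one trial call runs at a time under concurrency. The names CircuitBreaker, CircuitOpenError, failure_threshold, and cooldown_seconds are hypothetical, and tripping on a count of consecutive failures is an assumption made to keep the sketch short.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half-open"


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures to trip
        self.cooldown_seconds = cooldown_seconds    # how long to stay open
        self.clock = clock                          # injectable for testing
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.cooldown_seconds:
                # Still cooling down: fail fast, do not touch the downstream.
                raise CircuitOpenError("breaker is open, failing fast")
            # Cooldown is over: let this one call through as the trial.
            self.state = State.HALF_OPEN
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including the half-open trial) closes the breaker.
        self.failures = 0
        self.state = State.CLOSED

    def _on_failure(self):
        self.failures += 1
        # A failed trial reopens immediately; otherwise trip at the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()  # restart the cooldown
```

A caller would wrap each downstream call, for example breaker.call(fetch_profile, user_id) where fetch_profile is whatever function talks to the downstream, and catch CircuitOpenError to decide what to return instead.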

The threshold is up to you. A common one is "if more than 50 percent of the last 100 calls failed, open the breaker for 30 seconds."
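A rule like this one can be tracked with a fixed-size window of recent outcomes. The sketch below assumes a count-based window; the class name FailureRateWindow and its parameters are hypothetical. Waiting for the window to fill before tripping is also an assumption, added so that a single early failure (1 of 1, or 100 percent) does not open the breaker.

```python
from collections import deque


class FailureRateWindow:
    """Tracks the outcomes of the most recent calls (count-based window)."""

    def __init__(self, window_size=100, failure_rate_threshold=0.5):
        self.window_size = window_size
        self.failure_rate_threshold = failure_rate_threshold
        # A deque with maxlen drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=window_size)

    def record(self, success: bool) -> None:
        """Record one call outcome: True for success, False for failure."""
        self.outcomes.append(success)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def should_open(self) -> bool:
        # Assumption: do not judge until the window is full.
        if len(self.outcomes) < self.window_size:
            return False
        # "More than 50 percent" means strictly greater than the threshold.
        return self.failure_rate() > self.failure_rate_threshold
```

The breaker would call record after every request and check should_open after each failure. The 30-second open period is handled separately by the breaker's cooldown timer.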

The point is that instead of waiting 30 seconds for every call to time out, you fail in about 1 ms. Your service stays responsive. The downstream gets a break to recover.

Fallbacks. What to do when the breaker is open

When the breaker is open, you have to return something. These are the common choices.

Return an error. Simplest. "The recommendations service is not available. Please refresh in a moment." The user sees an error but the rest of the page works.

Return a default. "We could not load your recommendations, here is the trending list instead." Degraded but useful.

Return cached data. Serve a stale version. Often fine. Better than nothing.

Queue it for later. "We will get to this when we can." Good for non-critical paths.

The principle is to fail gracefully. A partial page is better than no page. A "could not load comments" line is better than the whole post failing to load.

This is the idea of graceful degradation. Do not take the whole system down because one feature is broken.

Libraries that do this for you

Building circuit breakers correctly is hard. Use a well-tested library.

Hystrix from Netflix (now in maintenance mode) made this pattern famous. Resilience4j for Java is its modern successor. Polly does the same job for .NET, and gobreaker for Go. Envoy, a proxy, and Istio, a service mesh built on Envoy, do circuit breaking at the network layer for any language.

Service meshes like Istio and Linkerd have become a common way to add circuit breaking, retries, timeouts, and traffic policies across all your services without changing the app code. The mesh sidecar handles it.

Whether you use a library or a mesh, the principle is the same. Do not let a slow downstream service tie up your threads. Detect failure fast. Fail fast. Recover when it heals.
