Retries & Backoff

Naive retries cause more outages than they prevent. Here is how to do them right.

Why retries exist

Networks are unreliable. Servers sometimes fail. A call that worked a second ago might fail now for a purely transient reason, such as a dropped packet or a brief overload, and the very next call would succeed.

The fix is obvious. If a call fails, try again.

This is built into HTTP clients, RPC frameworks, and queue consumers. Everywhere. It works. Many of the "errors" you see in distributed systems are temporary.

But naive retries, where you try again right away and keep going until it works, are one of the most common causes of big outages. Here is why.

The retry storm

Service B is overloaded. It takes 5 seconds to respond instead of 50 ms. Service A times out and retries right away, while B is still working on the first attempt. Now there are two requests in flight to B for every original request. B is even more overloaded. A retries again. Now three. Now five. Now twenty.

This is a retry storm. The retries multiply the load on a service that is already struggling. Instead of helping, you have made it far harder for the service to recover.

If 100 upstream services all do naive retries when one downstream slows, you get a huge spike of traffic. The downstream goes from "slow" to "dead." Everything cascades.
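The arithmetic behind that spike can be sketched in a few lines. This is a simplified model, not a measurement. It assumes every attempt fails while the downstream is overloaded and that retries fire immediately. The function name and the traffic numbers are made up for illustration.

```python
def downstream_requests(callers: int, requests_per_caller: int, max_attempts: int) -> int:
    """Total requests hitting the downstream if every attempt fails.

    Assumption: the downstream is failing, so each original request
    is tried max_attempts times (1 initial call plus the retries).
    """
    return callers * requests_per_caller * max_attempts


# 100 upstream services, each sending 50 requests per second.
normal_load = downstream_requests(100, 50, max_attempts=1)
storm_load = downstream_requests(100, 50, max_attempts=4)  # 1 call + 3 immediate retries

print(normal_load)  # 5000
print(storm_load)   # 20000
```

Under these assumptions, three immediate retries turn 5,000 requests into 20,000 at exactly the moment the downstream can least afford it.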

This is the dark side of retries. Doing them wrong is worse than not retrying at all.

Exponential backoff. Spread out the retries

The fix is exponential backoff.

After the first failure, wait 100 ms before retrying. After the second, wait 200 ms. After the third, wait 400 ms. Then 800, then 1600, then 3200, doubling each time, capped at some maximum.

Each retry is further apart than the last. If the downstream is overwhelmed, retries spread out over seconds and minutes. That gives it time to recover.
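The schedule above can be written as a one-line function. This is a minimal sketch. The function name, the 0-based attempt numbering, and the 30-second cap are assumptions chosen for illustration.

```python
def backoff_delay_ms(attempt: int, base_ms: int = 100, cap_ms: int = 30_000) -> int:
    """Delay before retry number `attempt` (0-based), doubling each time up to cap_ms."""
    return min(cap_ms, base_ms * 2 ** attempt)


schedule = [backoff_delay_ms(n) for n in range(10)]
print(schedule)
# [100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600, 30000]
```

The last entry shows the cap at work. Without it the tenth delay would be 51,200 ms.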

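Most mature HTTP clients and SDKs support this, though not always out of the box. The AWS SDKs retry with exponential backoff by default. gRPC and the Stripe client libraries support it too, but depending on the language and version you may have to enable or configure it. You typically set the max retries (usually 3 to 5) and the max delay (often tens of seconds).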

Jitter. Do not retry in sync

One more refinement. Picture 1,000 clients that all hit a service and get errors at the same time. They all start their backoff. Wait 100 ms. Retry.

100 ms later, all 1,000 retry at the same instant. The service gets the same spike all over again.

The fix is jitter. Add random noise to the backoff delay. Instead of waiting exactly 100 ms, wait somewhere between 50 and 150 ms. Spread the retries across time.
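A small simulation makes the difference visible. It is a rough sketch, not a benchmark. It assumes 1,000 clients that all fail at time zero, groups their retries into 10 ms windows, and reports the busiest window. The function name, the window size, and the seed are arbitrary choices.

```python
import random
from collections import Counter


def peak_retries_per_bucket(clients: int, jitter: bool, seed: int = 0, bucket_ms: int = 10) -> int:
    """Largest number of retries that land in any one bucket_ms-wide window."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    buckets = Counter()
    for _ in range(clients):
        # Without jitter every client waits exactly 100 ms;
        # with jitter each waits somewhere between 50 and 150 ms.
        delay_ms = rng.uniform(50, 150) if jitter else 100.0
        buckets[int(delay_ms // bucket_ms)] += 1
    return max(buckets.values())


print(peak_retries_per_bucket(1000, jitter=False))  # 1000: everyone retries together
print(peak_retries_per_bucket(1000, jitter=True))   # far smaller; exact value depends on the seed
```

With jitter, the 1,000 retries are spread over roughly ten windows, so the busiest one sees on the order of a tenth of the traffic.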

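A common formula. delay = random(0, min(cap, base * 2^attempt)). Here base is the first delay, cap is the maximum delay, and attempt counts from 0. This is called "full jitter," a name popularized by the AWS Architecture Blog. Unlike the 50 to 150 ms example, full jitter can pick any delay from zero up to the current backoff ceiling. AWS, Stripe, and Google all recommend or implement backoff with some form of jitter.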
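The formula translates almost directly into code. This is a minimal sketch. The function name, the millisecond units, and the default base and cap are assumptions, and the optional rng parameter exists only so the randomness can be seeded.

```python
import random
from typing import Optional


def full_jitter_delay_ms(attempt: int, base_ms: float = 100, cap_ms: float = 30_000,
                         rng: Optional[random.Random] = None) -> float:
    """delay = random(0, min(cap, base * 2^attempt)), with attempt counted from 0."""
    source = rng if rng is not None else random
    ceiling = min(cap_ms, base_ms * 2 ** attempt)  # plain exponential backoff, capped
    return source.uniform(0, ceiling)              # pick anywhere below the ceiling


rng = random.Random(42)  # seeded only to make the demo repeatable
for attempt in range(5):
    ceiling = min(30_000, 100 * 2 ** attempt)
    print(attempt, ceiling, round(full_jitter_delay_ms(attempt, rng=rng), 1))
```

Each printed delay falls somewhere between 0 and the ceiling for that attempt, so two clients on the same attempt number rarely wake at the same moment.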

Retrying with exponential backoff and jitter is the right answer. Memorize it. Use it everywhere you retry.
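Put together, a retry helper might look like the sketch below. It is one possible shape, not a reference implementation. TransientError, call_with_retries, and the default limits are hypothetical names and values, and the sleep parameter is injectable only so the demo does not have to wait.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retries(operation, max_attempts=4, base_s=0.1, cap_s=30.0,
                      sleep=time.sleep, rng=random):
    """Call operation(); on TransientError, back off with full jitter and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            ceiling = min(cap_s, base_s * 2 ** attempt)  # exponential backoff, capped
            sleep(rng.uniform(0, ceiling))               # full jitter


calls = {"count": 0}


def flaky():
    """Fails twice, then succeeds, to imitate a short blip."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("temporary failure")
    return "ok"


result = call_with_retries(flaky, sleep=lambda seconds: None)  # skip real sleeping in the demo
print(result, calls["count"])  # ok 3
```

The helper gives up after a bounded number of attempts and re-raises the last error, so a long outage still surfaces to the caller instead of being retried forever.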

Bonus. Combine this with circuit breakers from the previous concept. Retries handle short blips. Circuit breakers handle long outages. Together they keep distributed systems alive.
