Rate Limiting
Without rate limiting, one bad client can take down your whole service.
Why it exists
Several everyday failure modes show up when nothing caps request volume.
A buggy client retries in a tight loop and sends 10,000 requests per second at your API. A scraper downloads every page on your site as fast as it can. An attacker tries to brute-force a login endpoint with millions of password guesses. A regular user accidentally clicks "submit" 50 times.
Each of these can overload your servers, exhaust your database, or rack up big bills if you pay for downstream APIs.
Rate limiting caps how many requests a given client can send in a given time window. Extra requests get a polite "slow down" response (429 Too Many Requests) and never reach your real backend.
How it works: the token bucket
The most common algorithm is the token bucket.
Each client has a bucket that holds, say, 10 tokens. Tokens refill at a fixed rate, like 1 per second, up to the bucket's capacity. Every request takes one token. If the bucket is empty, the request is rejected with a 429.
This shape allows short bursts (you can spend all 10 tokens at once) while limiting sustained load (you can only sustain 1 request per second over time).
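The bucket described above can be sketched in a few lines of Python. This is a minimal single-process illustration, not a production limiter: the class name TokenBucket, its parameters, and the injectable clock are all assumptions made for the example, and the numbers (10 tokens, 1 per second) are the ones used in the text.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `refill_rate` tokens per second."""

    def __init__(self, capacity=10, refill_rate=1.0, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock              # injectable so the example can use a fake clock
        self.tokens = float(capacity)   # start full, so a fresh client can burst
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Add the tokens earned since the last call, capped at capacity.
        earned = (now - self.last) * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + earned)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True    # serve the request
        return False       # caller should answer 429 Too Many Requests


# Demo with a fake clock (hypothetical values).
fake_now = [0.0]
bucket = TokenBucket(capacity=10, refill_rate=1.0, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(12)]   # 12 requests at the same instant
fake_now[0] += 3.0                            # three seconds pass
later = [bucket.allow() for _ in range(4)]
```

In the demo, the burst spends all 10 tokens and the last 2 requests are rejected. Three seconds later, 3 tokens have refilled, so 3 of the next 4 requests pass. A real deployment would keep one bucket per client and usually store it somewhere shared, such as a cache, rather than in process memory.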
There are other algorithms too. The leaky bucket smooths bursts by queuing requests and processing them at a fixed rate. The fixed window counts requests inside each 1-minute window and resets on the minute. The sliding window is a more accurate version of that. Each trades off differently. For example, a fixed window can let through up to twice its limit when a burst straddles the reset.
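To make the window edge case concrete, here is a small comparison of a fixed-window counter and one sliding-window variant (a timestamp log). Both classes and their names are illustrative assumptions, and the sliding-window log is only one of several ways to implement a sliding window.

```python
from collections import deque


class FixedWindow:
    """Counts requests per aligned window and resets when a new window starts."""

    def __init__(self, limit, window=60.0):
        self.limit, self.window = limit, window
        self.current_window = None
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window)
        if window_id != self.current_window:   # a new minute began: reset
            self.current_window, self.count = window_id, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLog:
    """Keeps timestamps of accepted requests from the last `window` seconds."""

    def __init__(self, limit, window=60.0):
        self.limit, self.window = limit, window
        self.timestamps = deque()

    def allow(self, now):
        # Drop timestamps that have aged out of the trailing window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False


# 5 requests at t=59s and 5 more at t=61s: two seconds apart, but in different minutes.
times = [59.0] * 5 + [61.0] * 5
fixed, sliding = FixedWindow(limit=5), SlidingWindowLog(limit=5)
fixed_allowed = sum(fixed.allow(t) for t in times)
sliding_allowed = sum(sliding.allow(t) for t in times)
```

With a limit of 5 per minute, the fixed window accepts all 10 requests because the counter resets between the two bursts, while the sliding log accepts only the first 5. The price of the log is memory: it stores one timestamp per accepted request.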
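The token bucket is widely used, but it is not universal. AWS API Gateway documents token-bucket throttling. NGINX's limit_req module uses the leaky bucket method. Cloudflare has described a sliding-window counter approach.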
What "per client" means
What do you rate-limit by? You have a few choices.
Per IP address. Simple. But many users can share one IP, for example behind a corporate NAT or a mobile carrier. Block one scraper and you punish hundreds of innocent users.
Per API key. Clean. Every customer gets their own key and their own limit. Standard for B2B APIs.
Per user ID. Needs the user to be logged in. You can set different limits per tier. Free users get 100 per hour. Paid users get 10,000 per hour.
Per endpoint. Cap /api/login at 5 per minute to block brute-force attempts. Allow /api/search at 1,000 per minute.
Real systems combine all of these. A per-IP burst limit, a per-user sustained limit, a per-endpoint strict limit. Stacked rate limiters.
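One way to picture stacked limiters is a function that checks several keys for the same request and rejects if any of them is over its limit. The sketch below uses simple in-memory fixed-window counters to stay short. The tier and endpoint limits are the ones quoted above, while the per-IP limit, the key format, the choice to scope the endpoint limit per IP, and the function name check are assumptions for illustration.

```python
import time
from collections import defaultdict

# (max requests, window in seconds)
ENDPOINT_LIMITS = {"/api/login": (5, 60), "/api/search": (1000, 60)}
TIER_LIMITS = {"free": (100, 3600), "paid": (10_000, 3600)}
IP_LIMIT = (50, 10)   # assumed per-IP burst cap; not a value from the text

# (key, window_id) -> count. Never cleaned up here; a real system would expire entries.
_counters = defaultdict(int)


def check(ip, user_id, tier, endpoint, now=None):
    """Return (allowed, violated_key). Counts the request only if every rule passes."""
    now = time.time() if now is None else now
    rules = [(f"ip:{ip}", IP_LIMIT), (f"user:{user_id}", TIER_LIMITS[tier])]
    if endpoint in ENDPOINT_LIMITS:
        # Assumption: the endpoint limit is applied per IP.
        rules.append((f"endpoint:{endpoint}:ip:{ip}", ENDPOINT_LIMITS[endpoint]))

    slots = [((key, int(now // window)), limit) for key, (limit, window) in rules]
    for slot, limit in slots:
        if _counters[slot] >= limit:
            return False, slot[0]     # name the rule that tripped
    for slot, _ in slots:
        _counters[slot] += 1
    return True, None


# Six login attempts in the same minute from one address (documentation IP range).
results = [check("203.0.113.7", "u1", "free", "/api/login", now=0.0) for _ in range(6)]
search = check("203.0.113.7", "u1", "free", "/api/search", now=0.0)
```

The sixth login attempt is rejected by the endpoint rule even though the per-IP and per-user limits still have room, and a search request from the same client is still allowed. Checking every rule before incrementing any counter means a rejected request does not eat into the other limits, which is a design choice rather than a requirement.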
The downstream effects
When the limit is hit and you return a 429, what should the client do?
Good clients respect Retry-After headers and back off. Many real clients do not. They retry immediately, get rejected again, and stay pinned at their limit.
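A well-behaved client might look roughly like this. The sketch assumes a hypothetical send callable that returns a status code and a headers dict, handles only the numeric (seconds) form of Retry-After, and falls back to exponential backoff with jitter when the header is missing. The function name and default values are illustrative, not taken from any particular HTTP library.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=30.0,
                      sleep=time.sleep):
    """Call send() until it stops returning 429 or attempts run out.

    `send` is a hypothetical callable returning (status_code, headers_dict).
    """
    status = None
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        if attempt == max_attempts - 1:
            break   # out of attempts; no point sleeping again
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)   # the server said how long to wait
        else:
            # No usable header: exponential backoff with full jitter.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
        sleep(delay)
    return status


# Demo against canned responses, recording waits instead of really sleeping.
responses = iter([(429, {"Retry-After": "2"}), (429, {"Retry-After": "1"}), (200, {})])
waits = []
final_status = call_with_backoff(lambda: next(responses), sleep=waits.append)
```

In the demo the client waits 2 seconds, then 1 second, then succeeds. The jitter in the fallback path matters when many clients are throttled at once: without it they all retry at the same moment and hit the limit together again.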
There is also the fail-open versus fail-closed question: what should happen if the rate limiter itself is down or slow?
Fail open lets all traffic through. Safer for users, but you lose protection at the exact moment you need it most. Fail closed rejects all traffic. Safer for your backend, but you lose customers during a rate limiter outage.
Many real systems fail open and watch their dashboards very closely. If the limiter dies, traffic spikes to the backends, but monitoring should surface it right away.
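The fail-open or fail-closed decision usually comes down to what a wrapper does when the limiter call raises. A minimal sketch, assuming a hypothetical limiter_check callable that raises when its backend is unreachable or times out:

```python
import logging

logger = logging.getLogger("ratelimit")


def allow_request(limiter_check, client_id, fail_open=True):
    """Ask the limiter; if the limiter itself fails, fall back to the chosen policy.

    `limiter_check` is a hypothetical callable: it returns True or False,
    and raises if the limiter backend is down or too slow.
    """
    try:
        return limiter_check(client_id)
    except Exception:
        # Log loudly so the outage shows up in monitoring.
        logger.exception("rate limiter unavailable, fail_open=%s", fail_open)
        return fail_open


def broken_limiter(client_id):
    raise ConnectionError("limiter backend unreachable")


open_result = allow_request(broken_limiter, "client-1")                      # fail open
closed_result = allow_request(broken_limiter, "client-1", fail_open=False)   # fail closed
```

With fail_open=True the request goes through during the outage. With fail_open=False it is rejected. The "slow" case only reaches this code path if the limiter call has a timeout of its own, so that a hung limiter turns into an exception instead of a hung request.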
Rate limiting is not just defense. It also makes your system fair. No single user can starve everyone else.