Load Balancer
When one server can't handle the traffic, put a traffic cop in front of it.
Start with one server
A web server is just a computer. It runs a program that listens for requests. When a request arrives, the server does some work. It might look up data in a database. It might build a webpage. Then it sends the answer back.
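To make "a program that listens for requests" concrete, here is a minimal sketch using Python's standard library. The names HelloHandler and make_server, the port number, and the page content are made up for illustration. A real server would do far more work per request.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Answers every GET request with the same tiny webpage."""

    def do_GET(self):
        # "Does some work": here we just build a fixed page.
        body = b"<h1>Hello</h1>"
        # "Sends the answer back": status line, headers, then the page.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8000):
    """Create a server that listens on this machine (port is an assumption)."""
    return HTTPServer(("127.0.0.1", port), HelloHandler)
```

Calling make_server().serve_forever() starts the listening loop. Visiting http://127.0.0.1:8000/ in a browser would then show the page.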
Every server has a limit. The exact number depends on the hardware and on how much work each request needs. In this example, our server starts to struggle at around 40 requests per second. Past that, the computer slows down. The CPU is busy. The memory fills up. Responses that used to come back in 100 milliseconds now take 30 seconds.
When traffic is low, like 25 requests per second, the server is fine. The capacity bar in the picture stays green.
Now traffic spikes
Imagine your product gets featured somewhere big, like Hacker News. Suddenly traffic jumps from 25 to 100 requests per second. Overnight.
Your one server cannot keep up. Requests pile up. Memory fills. The capacity bar in the picture turns red and starts pulsing. That is the server telling you it is in trouble. New requests get thrown away because there is no room to handle them. You will see them disappear in the simulation.
Users see error pages or blank screens. Every second the site is down, you are losing money and trust.
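The pile-up described above can be sketched in a few lines. This is a rough model, not the simulation from the picture. It assumes the server finishes at most 40 requests each second and can hold at most 100 waiting requests. The 100 is a made-up number chosen for illustration.

```python
def simulate(arrival_rate, capacity=40, queue_limit=100, seconds=5):
    """Step one second at a time.

    Returns a list of (waiting_requests, dropped_requests), one per second.
    capacity comes from the text; queue_limit is an assumed value.
    """
    queue = 0
    history = []
    for _ in range(seconds):
        queue += arrival_rate              # new requests arrive
        queue -= min(queue, capacity)      # the server finishes what it can
        dropped = max(0, queue - queue_limit)
        queue -= dropped                   # no room left: these are thrown away
        history.append((queue, dropped))
    return history


calm = simulate(25)    # normal traffic
spike = simulate(100)  # the Hacker News spike
```

With 25 requests per second, nothing ever waits and nothing is dropped. With 100 per second, the waiting line fills up within two seconds. After that, the server throws away 60 requests every second, which is exactly the traffic above its limit.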
Put a load balancer in front
A load balancer is a program that sits in front of your servers. Its only job is to take incoming requests and pass each one along to one of your servers.
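Load balancers that companies really use include NGINX, HAProxy, the AWS Application Load Balancer (ALB), and Cloudflare's load balancing service. They are very fast, because forwarding a request is much less work than building a response. A single load balancer can handle 100,000 requests per second or more, depending on the hardware and setup. That is far more than one of our servers can.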
The user never knows there are several servers behind the scenes. From their side, they are just talking to one address.
But here is the catch. Just adding a load balancer does not solve anything yet. If there is still only one server behind it, the load balancer is just passing requests to the same overloaded server. We need more servers.
Add more servers
Now add two more servers. Identical copies of the first one.
The load balancer picks one server for each request. The simplest way is called round-robin. The first request goes to server 1. The next to server 2. The next to server 3. Then back to server 1. And so on. Each server gets about a third of the work.
100 requests per second split across 3 servers is about 33 each. That is comfortably below each server's 40 per second limit. The bars stay green. Users get fast responses.
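Here is a small sketch of round-robin that also checks the arithmetic. The server names are placeholders, and the function distribute is made up for this example. It is not how any particular load balancer is implemented.

```python
from collections import Counter
from itertools import cycle

servers = ["server-1", "server-2", "server-3"]


def distribute(num_requests, servers):
    """Hand out requests in round-robin order and count them per server."""
    picker = cycle(servers)  # server-1, server-2, server-3, server-1, ...
    return Counter(next(picker) for _ in range(num_requests))


# One second of traffic during the spike.
counts = distribute(100, servers)
```

Here counts ends up as 34 requests for server-1 and 33 each for the other two. Every server stays under the limit of 40.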
That is the whole idea. It is called horizontal scaling. Instead of buying one bigger computer, you add more of the same kind.
Other ways to pick a server
Round-robin is the simplest way. But there are other ways the load balancer can choose.
Least-connections. The load balancer keeps track of how many requests each server is still working on and sends the next request to the server with the fewest. This works better when some requests take much longer than others.
IP hash. The load balancer runs the user's IP address through a hash function and uses the result to pick a server. The same user keeps landing on the same server as long as the set of servers stays the same. This is useful when a server has cached data specific to that user.
Weighted. Maybe one of your servers is more powerful than the others. With weighted routing, you can send it more traffic.
Pick a strategy based on your needs. Round-robin works well for most cases and is usually the default.
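The three strategies above can each be sketched as a short function. The function names, the server names, and the way the connection counts and weights are stored are all assumptions made for this example. Real load balancers implement these ideas in their own ways.

```python
import hashlib
import random


def least_connections(active):
    """active maps server name -> number of requests still in progress."""
    return min(active, key=active.get)


def ip_hash(client_ip, servers):
    """Map an IP address to a server; the same address gives the same server."""
    # hashlib is used because Python's built-in hash() for strings
    # changes between program runs.
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def weighted(weights, rng=random):
    """weights maps server name -> relative share of traffic."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


servers = ["server-1", "server-2", "server-3"]
least_busy = least_connections({"server-1": 5, "server-2": 2, "server-3": 7})
sticky = ip_hash("203.0.113.7", servers)
pick = weighted({"big-server": 3, "small-server": 1})
```

In this sketch least_busy is server-2, because it has the fewest requests in progress. The weighted picker is random, so big-server gets about three out of every four requests over time rather than in a fixed pattern. Note that ip_hash depends on the length of the server list. Adding or removing a server moves many users to a different server.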
Wait, the load balancer is the new weak link
Notice something? Every request now goes through that one load balancer. If it crashes, your whole site goes down. Even if all your servers are perfectly healthy.
In real systems, companies run at least two load balancers, not one. If the first one fails, traffic automatically goes to the second. Managed services like the AWS Application Load Balancer handle this for you behind the scenes.
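The failover idea can be sketched like this. It is only a picture of the logic. In practice the switch is usually made by the network or by DNS, not by a function like this one. The names lb-primary and lb-standby and the health table are invented for the example.

```python
def pick_load_balancer(balancers, is_healthy):
    """Return the first healthy load balancer, checked in priority order."""
    for lb in balancers:
        if is_healthy(lb):
            return lb
    raise RuntimeError("no healthy load balancer available")


balancers = ["lb-primary", "lb-standby"]

# Pretend health check results: the primary has crashed.
health = {"lb-primary": False, "lb-standby": True}
chosen = pick_load_balancer(balancers, lambda lb: health[lb])
```

Here chosen is lb-standby. When the primary is healthy again, the same function goes back to picking it.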
We are not going to simulate this today, but here is a rule worth remembering. Any single thing that all your traffic passes through is a weak link. Engineers call it a single point of failure. The fix is almost always the same. Add another one.