Auto-scaling

Pay only for the servers you need right now, not the ones you might need at peak.

Why we want this in the first place

Most apps have spiky traffic. A news website is dead at 3am and crowded at 9am. An online store is sleepy on a Tuesday afternoon and on fire during a flash sale.

You have two bad options if you have to pick a fixed number of servers.

If you keep enough servers running for peak traffic, you are wasting money whenever traffic is below peak, which is most of the time. Those servers sit there doing nothing while you pay for them.

If you only keep enough for average traffic, your site falls over when the spike hits.

Auto-scaling solves this. The platform watches how busy your servers are and adds or removes them automatically. You might have 50 servers at 9am and 5 servers at 3am. You did not have to do anything. The platform handled it.
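To make the trade-off concrete, here is a rough sketch that compares server-hours for one day. The hourly demand numbers are invented for illustration (they only borrow the 5-at-night and 50-at-peak figures from above), and the auto-scaled case assumes the platform tracks demand perfectly, which real systems only approximate.

```python
import math

# Hypothetical demand for one day: servers needed in each of 24 hours.
# The shape is made up: quiet overnight, busiest around 9am.
demand = [5, 5, 5, 5, 5, 8, 15, 30, 45, 50, 45, 40,
          35, 30, 30, 25, 25, 30, 35, 30, 20, 12, 8, 5]

# Option 1: fixed fleet sized for peak. Never short, but mostly idle.
peak_fleet = max(demand)
peak_server_hours = peak_fleet * len(demand)

# Option 2: fixed fleet sized for average. Cheap, but short during busy hours.
average_fleet = math.ceil(sum(demand) / len(demand))
hours_short = sum(1 for needed in demand if needed > average_fleet)

# Option 3: idealized auto-scaling. Run exactly what each hour needs.
scaled_server_hours = sum(demand)

print(f"peak-sized fleet:    {peak_server_hours} server-hours")
print(f"average-sized fleet: short of capacity in {hours_short} of 24 hours")
print(f"auto-scaled fleet:   {scaled_server_hours} server-hours")
```

With this made-up curve, the peak-sized fleet pays for more than twice the server-hours that the auto-scaled fleet uses, and the average-sized fleet is short for about half the day.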

Every major cloud provider has this built in. AWS calls it Auto Scaling Groups. Google calls it Managed Instance Groups. Kubernetes has the Horizontal Pod Autoscaler, which does the same job for pods instead of whole servers. They all work on the same basic idea.

How it actually works under the hood

Auto-scaling is just a feedback loop. Every minute or so, the platform does three things.

It looks at some metric across all your servers. Usually that metric is average CPU usage, but it could be request rate, queue depth, or some custom value you provide.

If the metric is above a threshold for a few minutes, like average CPU above 70 percent, it scales out. That means launching more servers.

If the metric is below a low threshold, like average CPU below 30 percent, it scales in. That means terminating some servers.
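The three steps above can be sketched as one small decision function. This is a minimal illustration, not any provider's actual algorithm. The 70 and 30 percent thresholds come from the text. The names and the choice of three consecutive samples as "a few minutes" are assumptions, and the sketch applies the sustained check to both directions for simplicity.

```python
from statistics import mean

SCALE_OUT_ABOVE = 70.0   # percent CPU, from the example above
SCALE_IN_BELOW = 30.0    # percent CPU, from the example above
SUSTAINED_SAMPLES = 3    # assumption: 3 one-minute checks count as "a few minutes"


def fleet_average(per_server_cpu):
    """Step 1: collapse per-server CPU readings into one fleet-wide metric."""
    return mean(per_server_cpu)


def decide(cpu_history):
    """Steps 2 and 3: pick an action from recent fleet averages (newest last)."""
    recent = cpu_history[-SUSTAINED_SAMPLES:]
    if len(recent) < SUSTAINED_SAMPLES:
        return "hold"  # not enough data yet to call it sustained
    if all(sample > SCALE_OUT_ABOVE for sample in recent):
        return "scale_out"
    if all(sample < SCALE_IN_BELOW for sample in recent):
        return "scale_in"
    return "hold"


history = [fleet_average([62, 70]), fleet_average([75, 85]),
           fleet_average([80, 90]), fleet_average([88, 92])]
print(decide(history))
```

A single hot reading does not trigger anything here. The metric has to stay past the threshold for every sample in the window.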

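Scaling out is usually set up to be faster than scaling in. You want to react quickly when load grows, because too few servers means failed requests. You can take your time when things are getting quiet, because a few extra servers only cost a little money.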

There are usually cool-down periods to prevent flapping. You do not want the system adding servers and then immediately removing them because it overcorrected.
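A cool-down can be sketched as a small gate that refuses new actions until enough time has passed since the last one. The class name and the wait times (60 seconds before another scale-out, 300 before a scale-in) are assumptions chosen for illustration. Time is passed in as plain seconds so the example stays deterministic.

```python
class Cooldown:
    """Blocks scaling actions that come too soon after the previous one."""

    def __init__(self, scale_out_wait=60.0, scale_in_wait=300.0):
        # Assumption: scale-in waits longer than scale-out, so the group
        # grows quickly and shrinks cautiously.
        self.wait = {"scale_out": scale_out_wait, "scale_in": scale_in_wait}
        self.last_action_at = None

    def allow(self, action, now):
        """Return True and record the action if it may run at time `now`."""
        if action not in self.wait:
            return False  # "hold" or anything unknown never runs
        if self.last_action_at is not None:
            if now - self.last_action_at < self.wait[action]:
                return False  # still cooling down
        self.last_action_at = now
        return True


gate = Cooldown()
print(gate.allow("scale_out", now=0))    # first action always runs
print(gate.allow("scale_in", now=100))   # too soon to undo the scale-out
print(gate.allow("scale_in", now=400))   # enough time has passed
```

The second call is the flapping case: the system just added servers and is already asking to remove them. The gate makes it wait.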

The set of running servers is called a scaling group. You define a minimum size (always-on baseline) and a maximum size (cost cap). For example, you might say "never fewer than 2 servers, never more than 50."
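In code, the minimum and maximum are just a clamp applied to whatever size the scaling logic asks for. The function name is hypothetical. The defaults of 2 and 50 are the example values from the text.

```python
def clamp_desired(desired, min_size=2, max_size=50):
    """Keep the requested group size inside the configured bounds."""
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    return max(min_size, min(desired, max_size))


print(clamp_desired(0))    # quiet night: the baseline still stays up
print(clamp_desired(10))   # normal request: passed through unchanged
print(clamp_desired(80))   # huge spike: the cost cap wins
```

The maximum is a trade-off. It protects your bill, but during a real spike it also means the group stops growing even if the metric says you need more.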

When auto-scaling does not save you

Auto-scaling assumes any new server can immediately take a share of the load. That assumption breaks in a few real situations.

Cold start is slow. If a new server takes 5 minutes to boot, download code, warm up caches, and pass its health check, your traffic spike might be over by the time the new server is ready to help.
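A quick back-of-the-envelope calculation shows why. The sketch below assumes a simple timeline: the spike starts at minute zero, the platform needs some minutes to notice it, and then the new server needs its boot time before it can help. The function name and the 3-minute detection delay are assumptions. The 5-minute boot time is the figure from the text.

```python
def useful_minutes(spike_minutes, detect_minutes, boot_minutes):
    """Minutes of the spike that a newly launched server actually helps with."""
    ready_at = detect_minutes + boot_minutes
    return max(0, spike_minutes - ready_at)


# A 6-minute flash spike: the new server arrives after it is over.
print(useful_minutes(spike_minutes=6, detect_minutes=3, boot_minutes=5))

# A 30-minute rush: the new server misses the start but covers most of it.
print(useful_minutes(spike_minutes=30, detect_minutes=3, boot_minutes=5))
```

The detection delay counts too. Waiting "a few minutes" before scaling out is added on top of the boot time.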

The bottleneck is somewhere else. Adding 50 web servers does not help if they all need to talk to the same database, and the database is the one struggling. Auto-scaling the wrong tier just makes the real bottleneck worse, because every new web server typically opens its own connections to that database and sends it more queries.
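A simplified model makes the point: the system can only go as fast as its slowest tier. All the numbers below (200 requests per second per web server, a database that tops out at 2000, a pool of 20 connections per web server) are hypothetical, and the model assumes each request needs exactly one database query.

```python
def system_throughput(web_servers, per_web_rps, db_max_rps):
    """Requests per second the whole system can serve: the slower tier wins."""
    return min(web_servers * per_web_rps, db_max_rps)


def db_connections(web_servers, pool_size):
    """Connections held open against the database, one pool per web server."""
    return web_servers * pool_size


for servers in (5, 10, 50):
    rps = system_throughput(servers, per_web_rps=200, db_max_rps=2000)
    conns = db_connections(servers, pool_size=20)
    print(f"{servers:>2} web servers: {rps} req/s, {conns} db connections")
```

In this model, going from 10 to 50 web servers adds no throughput at all. It only multiplies the number of connections the struggling database has to hold.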

The servers actually have state. If a user had cached data on Server B and the auto-scaler decides Server B is no longer needed, that user loses their cached data and has to start fresh.

Solutions you see in real systems include warm pools (servers pre-booted and ready to add to the group instantly), provisioned concurrency (the AWS Lambda equivalent), scaling on leading indicators like queue depth instead of CPU, and always making sure the actual bottleneck is something that can scale.
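Scaling on queue depth can be sketched as a direct calculation instead of a threshold check: work out how many workers it would take to drain the current backlog within a target time. This is one possible approach under stated assumptions, not a specific provider's formula. The function name, the per-worker rate, and the 60-second drain target are all hypothetical.

```python
import math


def desired_workers(queue_depth, msgs_per_worker_per_sec, target_drain_seconds,
                    min_size=2, max_size=50):
    """Workers needed to drain the backlog in time, kept within group bounds."""
    if msgs_per_worker_per_sec <= 0 or target_drain_seconds <= 0:
        raise ValueError("rate and drain target must be positive")
    per_worker_capacity = msgs_per_worker_per_sec * target_drain_seconds
    needed = math.ceil(queue_depth / per_worker_capacity)
    return max(min_size, min(needed, max_size))


# Assumed: each worker handles 10 messages per second, backlog should clear in 60s.
print(desired_workers(0, 10, 60))          # empty queue: stay at the minimum
print(desired_workers(6000, 10, 60))       # backlog building: scale to match it
print(desired_workers(1_000_000, 10, 60))  # flood: capped at the maximum
```

Queue depth is a leading indicator because the backlog grows the moment work arrives faster than it is handled. CPU only rises once the servers are already busy.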
