Latency vs Throughput
Speed is two numbers, not one. Mixing them up will mislead every optimization you try.
Think of a pipe with water in it
Picture water flowing through a pipe.
Latency is how long one drop takes to travel from one end of the pipe to the other. It is measured in time. Milliseconds. Seconds.
Throughput is how many drops arrive at the other end every second. It is measured in volume over time. Requests per second. Megabits per second.
A long, thin pipe has high latency (each drop has far to travel) and low throughput (few drops arrive per second). A short, fat pipe has low latency and high throughput. A long, fat pipe has high latency but high throughput. Traditional satellite internet is like this. The signal travels far, so each request is slow. But the pipe is wide, so you can stream HD video.
Networks have both numbers. Servers have both numbers. They are two different things. You can be good at one and bad at the other.
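To make the two numbers concrete, here is a minimal sketch that measures both for the same workload. The names `measure` and `fake_request` are made up for illustration, and the 10 millisecond sleep is a stand-in for real work.

```python
import time

def measure(task, n_requests):
    """Run `task` n_requests times, one after another, and report both numbers."""
    latencies = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        task()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    avg_latency = sum(latencies) / len(latencies)  # seconds per request
    throughput = n_requests / elapsed              # requests per second
    return avg_latency, throughput

def fake_request():
    # Hypothetical stand-in for real work: pretend each request takes ~10 ms.
    time.sleep(0.01)

if __name__ == "__main__":
    avg_latency, throughput = measure(fake_request, 20)
    print(f"latency:    {avg_latency * 1000:.1f} ms per request")
    print(f"throughput: {throughput:.1f} requests per second")
```

Because this loop handles one request at a time, throughput can never exceed one divided by the latency. The two numbers only come apart once requests overlap.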
Why latency matters
Latency is what users feel. When you click a button and the page sits there for two seconds before doing anything, that delay is latency.
A lot of things add to latency.
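Physical distance is one. A network packet from New York to London takes about 70 milliseconds for a round trip. Most of that is the time the signal needs to cover the distance, and light in optical fiber travels at only about two-thirds of its speed in a vacuum. You cannot beat physics.
Network hops and protocol round trips are another. Every router along the way adds a little delay. Before any real data moves, there is usually a DNS lookup, a TCP handshake, and a TLS handshake. Each one can add another 10 to 100 milliseconds, depending on how far away the other end is.
Then there is server processing time: how long the server takes to think about your request and prepare an answer.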
People start to notice delays above 100 milliseconds. One full second feels slow. Ten seconds feels broken. The main way to cut network latency is to be physically closer to the user. That is why content delivery networks (CDNs) exist.
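The pieces above can be added up as a rough latency budget. This is a sketch under stated assumptions, not a measurement. The straight-line distance of about 5,570 km, the two-thirds fiber speed, the 20 ms DNS lookup, and the 50 ms of server time are all illustrative numbers. The model also assumes a TCP handshake and a TLS 1.3 handshake cost one round trip each on a fresh connection.

```python
SPEED_OF_LIGHT_KM_S = 299_792   # speed of light in a vacuum
FIBER_FRACTION = 2 / 3          # assumed: light in fiber is about a third slower
NY_LONDON_KM = 5_570            # assumed straight-line distance

def propagation_round_trip_ms(distance_km, fraction=FIBER_FRACTION):
    """Best-case round trip from distance alone, in milliseconds."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * fraction)
    return 2 * one_way_s * 1000

def first_response_ms(rtt_ms, dns_ms, server_ms):
    """Time to the first response on a fresh connection, in milliseconds."""
    tcp_handshake = rtt_ms      # one round trip before data can be sent
    tls_handshake = rtt_ms      # one more round trip (assuming TLS 1.3)
    request_response = rtt_ms   # the request itself and its answer
    return dns_ms + tcp_handshake + tls_handshake + request_response + server_ms

if __name__ == "__main__":
    print(f"{propagation_round_trip_ms(NY_LONDON_KM):.0f} ms")     # roughly 56 ms
    print(first_response_ms(rtt_ms=70, dns_ms=20, server_ms=50))   # → 280
    print(first_response_ms(rtt_ms=10, dns_ms=20, server_ms=50))   # → 100
```

Real cables do not run in a straight line, which is one reason the measured round trip is higher than the best case. The last two lines show why being closer helps so much: the round trip time is paid three times, so shrinking it from 70 ms to 10 ms takes the total from 280 ms to 100 ms.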
Why throughput matters
Throughput is how much work your system completes per second, which sets how many users you can serve at the same time. If you have a million users and each one sends ten requests per second, your system needs to handle ten million requests per second total.
Every server has a ceiling. In our simulations, each server handles about 40 requests per second. To go past that, you put a load balancer in front and add more servers. That is called scaling horizontally.
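The arithmetic for sizing a fleet is short. Here is a sketch using the numbers from the text. The function name `servers_needed` is made up, and the model assumes the load balancer spreads requests perfectly evenly with no spare capacity.

```python
def servers_needed(users, requests_per_user_per_s, per_server_rps):
    """Smallest number of servers whose combined ceiling covers the load."""
    total_rps = users * requests_per_user_per_s
    # Integer ceiling division: round up, since a fraction of a server is not an option.
    return -(-total_rps // per_server_rps)

if __name__ == "__main__":
    print(servers_needed(1_000_000, 10, 40))  # → 250000
```

A real system would add headroom on top of this figure so that a traffic spike or a failed server does not push the rest past their ceiling.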
The way to fix throughput is to be bigger. More servers. More bandwidth. More workers in parallel.
One thing to be careful about. Adding more servers gives you more throughput, but it does not make any single request faster. If a single page load takes 800 milliseconds of work on one server, it still takes 800 milliseconds with ten servers. You just have ten times as many of them happening at once. The one exception is a server that was overloaded. There, requests were also waiting in line, and extra servers do shorten that wait.
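A tiny simulation makes the point. This is a simplified model, not a benchmark. It assumes all requests arrive at once, every request needs exactly 800 milliseconds of work, and each request goes to whichever server frees up first. The function name `simulate` is made up.

```python
import heapq

def simulate(n_requests, n_servers, service_ms):
    """Return (processing time per request in ms, throughput in requests/s)."""
    free_at = [0] * n_servers   # time (ms) at which each server becomes idle
    heapq.heapify(free_at)
    longest_processing = 0
    finish = 0
    for _ in range(n_requests):
        start = heapq.heappop(free_at)   # the next server to become free
        end = start + service_ms
        heapq.heappush(free_at, end)
        longest_processing = max(longest_processing, end - start)
        finish = max(finish, end)
    throughput = n_requests / (finish / 1000)
    return longest_processing, throughput

if __name__ == "__main__":
    print(simulate(100, 1, 800))    # → (800, 1.25)
    print(simulate(100, 10, 800))   # → (800, 12.5)
```

Ten servers finish ten times as many requests per second, but the work for each request still takes 800 milliseconds. The model reports processing time only and leaves out time spent waiting in line.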
Sometimes you can trade one for the other
In some cases you can give up one to gain the other.
Batching means waiting a tiny bit and grouping several requests into one. Each request takes longer (worse latency) but the total number per second goes up (better throughput). Databases often write to disk this way.
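Here is a rough cost model of that trade. All numbers are made up for illustration: each write costs a fixed 10 ms (think of a disk flush) plus 1 ms per item, and a new item arrives every 5 ms. The function name `batch_stats` is hypothetical.

```python
def batch_stats(batch_size, arrival_interval_ms=5, fixed_ms=10, per_item_ms=1):
    """Return (latency of the first item in a batch in ms, max items per second)."""
    # The first item has to wait for the rest of its batch to arrive.
    wait_ms = (batch_size - 1) * arrival_interval_ms
    # The fixed cost is paid once per batch, not once per item.
    write_ms = fixed_ms + per_item_ms * batch_size
    latency_ms = wait_ms + write_ms
    max_items_per_s = batch_size * 1000 / write_ms
    return latency_ms, max_items_per_s

if __name__ == "__main__":
    for size in (1, 10):
        latency, capacity = batch_stats(size)
        print(f"batch of {size}: {latency} ms latency, {capacity:.0f} items/s")
```

In this model, a batch of ten raises the first item's latency from 11 ms to 65 ms, while the number of items the writer can absorb per second goes from about 91 to 500.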
Pipelining means sending the next request before the previous answer comes back. You stack requests in flight. Throughput goes up but failures get harder to recover from.
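The gain from pipelining can be estimated with simple arithmetic. This is an idealized sketch. It assumes every request pays the same round trip, the server handles requests one at a time in order, and nothing fails. The 70 ms round trip and 5 ms of server work are illustrative numbers, and both function names are made up.

```python
def sequential_ms(n_requests, rtt_ms, service_ms):
    """Wait for each answer before sending the next request."""
    return n_requests * (rtt_ms + service_ms)

def pipelined_ms(n_requests, rtt_ms, service_ms):
    """Send all requests back to back; the round trip is paid only once."""
    return rtt_ms + n_requests * service_ms

if __name__ == "__main__":
    print(sequential_ms(10, rtt_ms=70, service_ms=5))  # → 750
    print(pipelined_ms(10, rtt_ms=70, service_ms=5))   # → 120
```

Ten requests finish in 120 ms instead of 750 ms, so throughput over the connection rises about sixfold. The time for a lone request is unchanged at 75 ms.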
Parallelism means splitting one request into smaller pieces and running them at the same time. The request finishes faster (better latency) but you use more CPU and memory.
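A small sketch of the idea, using threads from the Python standard library. The function `handle_piece` is a made-up stand-in, and its 50 ms sleep imitates waiting on something like a database or another service. In CPython, threads help with this kind of waiting. CPU-heavy pieces would need processes instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_piece(piece):
    time.sleep(0.05)   # hypothetical stand-in for I/O, such as a database call
    return piece * 2

def handle_sequential(pieces):
    """Process the pieces of one request one after another."""
    return [handle_piece(p) for p in pieces]

def handle_parallel(pieces):
    """Process all pieces of one request at the same time, one thread each."""
    with ThreadPoolExecutor(max_workers=len(pieces)) as pool:
        return list(pool.map(handle_piece, pieces))

def timed(fn, pieces):
    t0 = time.perf_counter()
    result = fn(pieces)
    return result, time.perf_counter() - t0

if __name__ == "__main__":
    pieces = [1, 2, 3, 4]
    _, seq_s = timed(handle_sequential, pieces)
    _, par_s = timed(handle_parallel, pieces)
    print(f"sequential: {seq_s * 1000:.0f} ms, parallel: {par_s * 1000:.0f} ms")
```

Both versions return the same answer. The parallel one finishes sooner but holds four threads for the duration, which is the extra resource cost the text mentions.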
The right trade depends on which number matters for what you are building.
Which one matters more depends on what you are building
Video streaming cares about throughput. You need enough bandwidth to push HD video to everyone. A 200 millisecond delay before playback starts is fine.
Online gaming cares about latency. A 50 millisecond delay can be the difference between a hit and a miss. The amount of data per second is small.
Background data sync cares about throughput. You are copying terabytes. Nobody minds if each chunk takes 100 milliseconds.
Real-time trading cares about latency in the extreme. Firms put their servers in the same building as the stock exchange to save microseconds.
Whenever someone says "make it faster," stop and ask. Do you mean lower latency? Or higher throughput? They are different problems with different fixes.