Metrics & Alerting

Logs tell you what happened. Metrics tell you how often, how slow, and how much.

Logs vs metrics

Logs are events. "User 42 logged in at 3:14pm." They are great for debugging one specific incident.

Metrics are aggregated numbers over time. Examples:

requests_per_second, sampled every 10 seconds.
p99_latency_ms, the 99th-percentile request latency.
error_rate_percent, errors divided by total requests.
db_pool_active_connections.
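To make the difference concrete, here is a minimal sketch of how individual request events could be rolled up into metrics, one row per 10-second window. The RequestEvent shape, the aggregate function, and the sample data are all made up for illustration. Real metrics libraries do this for you.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestEvent:
    """One log-style event: a single finished request (hypothetical shape)."""
    timestamp: float  # seconds since some starting point
    status: int       # HTTP status code


def aggregate(events, window_seconds=10):
    """Roll individual events up into one metrics row per time window."""
    buckets = defaultdict(list)
    for event in events:
        window_start = int(event.timestamp // window_seconds) * window_seconds
        buckets[window_start].append(event)

    rows = {}
    for window_start, batch in sorted(buckets.items()):
        errors = sum(1 for e in batch if e.status >= 500)
        rows[window_start] = {
            "requests_per_second": len(batch) / window_seconds,
            "error_rate_percent": 100.0 * errors / len(batch),
        }
    return rows


# Illustrative events only: four requests in the first window, one in the second.
events = [
    RequestEvent(timestamp=0.5, status=200),
    RequestEvent(timestamp=3.2, status=500),
    RequestEvent(timestamp=7.9, status=200),
    RequestEvent(timestamp=9.1, status=200),
    RequestEvent(timestamp=12.4, status=200),
]
metrics = aggregate(events)
```

Notice what is lost. The rows say one request in four failed in the first window, but not which one. That detail lives in the logs.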

You cannot debug one specific user's problem from metrics. But you can answer questions like "Is the system healthy right now?", "How does today compare to last week?", and "Is latency creeping up over the last hour?"

Metrics are how you spot trends, set alerts, and run dashboards. Every real system needs both logs and metrics.

The RED metrics

Three metrics matter for almost every service. Memorize them.

Rate. Requests per second.
Errors. The error rate (percent of requests that failed).
Duration. How long each request took. Usually p50, p95, and p99 latency.

These three answer the basic questions. Is the service alive? Healthy? Fast?
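As a rough sketch, all three can be computed from the requests seen in one time window. The red_summary and percentile helpers below are hypothetical names, and the percentile uses the simple nearest-rank method. Production systems usually estimate percentiles from histograms instead.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests, window_seconds):
    """requests is a list of (status_code, duration_ms) pairs seen in the window."""
    durations = sorted(duration for _, duration in requests)
    failures = sum(1 for status, _ in requests if status >= 500)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_percent": 100.0 * failures / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


# Made-up window: 100 requests in 10 seconds, durations 1..100 ms, the two slowest failed.
sample = [(500 if ms > 98 else 200, float(ms)) for ms in range(1, 101)]
summary = red_summary(sample, window_seconds=10)
```

The percentiles are why Duration is reported as three numbers. An average would hide the slow tail that p95 and p99 expose.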

RED describes request-driven services. For the resources behind them, such as worker pools, queues, CPUs, and disks, the companion method is USE. Utilization, Saturation, Errors.

For databases, you watch query rate, error rate, connection pool usage, and slow query count.

You can graph these for every service in your system. If they all look normal, the system is probably healthy. If one of them spikes, you have a starting point.

Alerting. When to wake someone up

Metrics feed alerts. An alert is a rule. "If p99 latency goes above 1 second for 5 minutes, page the on-call engineer."
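The "for 5 minutes" part matters as much as the threshold. Here is a minimal sketch of such a rule, assuming one p99 sample per minute measured in seconds. AlertRule and its fields are illustrative names, not any particular tool's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    threshold: float    # fire when the metric is above this value...
    for_seconds: float  # ...continuously for at least this long
    breach_started_at: Optional[float] = None

    def observe(self, timestamp: float, value: float) -> bool:
        """Feed one sample; return True when the rule should page."""
        if value <= self.threshold:
            self.breach_started_at = None  # recovered, reset the clock
            return False
        if self.breach_started_at is None:
            self.breach_started_at = timestamp
        return timestamp - self.breach_started_at >= self.for_seconds


# "If p99 latency goes above 1 second for 5 minutes, page."
rule = AlertRule(threshold=1.0, for_seconds=300)

# (timestamp_seconds, p99_seconds) samples; values are made up.
samples = [(0, 0.4), (60, 1.3), (120, 1.5), (180, 0.8), (240, 1.2),
           (300, 1.4), (360, 1.6), (420, 1.3), (480, 1.7), (540, 1.9)]
fired = [t for t, p99 in samples if rule.observe(t, p99)]
```

The short spike at 60 to 120 seconds never pages, because latency recovers at 180. Only the breach that starts at 240 and is still going at 540 does.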

The hard part is not setting alerts. It is setting good ones.

Bad alerts.

Too sensitive. "Page on any 5xx." You get woken up every night for stray errors that do not matter.
Too noisy. 100 alerts a day. People ignore them all. Real alerts get lost.
Not actionable. "CPU above 80 percent." So what? If it self-recovers, why was someone woken up?

Good alerts.

Symptom-based, not cause-based. "Users are seeing errors" is a symptom. "Server A is at 90 percent CPU" is a cause. Symptoms always need attention. Causes are useful for diagnosis.
Actionable. The alert says "this is broken, here is the runbook."
Tuned. The threshold is where the system actually starts hurting users. Not where it is slightly different from normal.
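To see the difference, the sketch below runs a "page on any 5xx" rule and a tuned error-rate rule over the same ten minutes of traffic. The function names, the 5 percent threshold, and the 3-minute window are assumptions chosen for illustration. The right values depend on where your users start to notice.

```python
def page_on_any_5xx(minutes):
    """Too sensitive: pages for every minute that saw at least one 5xx."""
    return [i for i, (_total, errors) in enumerate(minutes) if errors > 0]


def page_on_error_rate(minutes, threshold_percent, sustained_minutes):
    """Tuned: pages once, after the error rate stays above the threshold long enough."""
    pages, streak = [], 0
    for i, (total, errors) in enumerate(minutes):
        rate = 100.0 * errors / total if total else 0.0
        streak = streak + 1 if rate > threshold_percent else 0
        if streak == sustained_minutes:
            pages.append(i)
    return pages


# (total_requests, failed_requests) per minute; values are made up.
traffic = [(1000, 0), (1000, 1), (1000, 0), (1000, 2), (1000, 1),
           (1000, 80), (1000, 120), (1000, 150), (1000, 90), (1000, 0)]

noisy = page_on_any_5xx(traffic)
tuned = page_on_error_rate(traffic, threshold_percent=5.0, sustained_minutes=3)
```

On this data the first rule pages in seven different minutes, four of them for one or two stray errors out of a thousand requests. The second pages once, in the third minute of the real outage.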

A rule of thumb. Every alert should be something a human must act on within that alert's expected response time. For a page, that means right now. Everything else is a dashboard.

Observability. Putting it together

The modern name for "logging plus metrics plus tracing" is observability.

Logs are detailed events. Good for debugging one specific failure.
Metrics are aggregated numbers. Good for trends and alerts.
Traces follow one request across services. "This user request went to service A, then service B, then the database, and was slow in B."

Common distributed tracing tools are Jaeger, Zipkin, Datadog APM, and AWS X-Ray.
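A trace is a tree of timed spans, one per hop. The sketch below shows how the "slow in B" answer could fall out of that data. The Span fields are a simplified, hypothetical shape, not the format of any tool named here, and the calculation assumes child spans do not overlap in time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    start_ms: float
    end_ms: float


def self_times(spans):
    """Time each span spent on its own work, excluding time spent in its child spans."""
    result = {s.span_id: s.end_ms - s.start_ms for s in spans}
    for s in spans:
        if s.parent_id is not None:
            result[s.parent_id] -= s.end_ms - s.start_ms
    return result


# One request: service A calls service B, which calls the database. Timings are made up.
trace = [
    Span("1", None, "service-a", 0, 480),
    Span("2", "1", "service-b", 20, 460),
    Span("3", "2", "database", 30, 70),
]
times = self_times(trace)
slowest = max(trace, key=lambda s: times[s.span_id])
```

Service A's span is the longest overall, but almost all of that time is spent waiting on B. Subtracting child time is what points at B rather than at A.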

The three together are called the "three pillars of observability."

In real life, you set alerts on metrics. You get paged. You look at the metrics dashboards to confirm. You look at traces to find the slow component. You look at logs to see the exact error. Each tool answers a different question.

Building this stack from scratch is a project on its own. Most teams use managed services like Datadog, Honeycomb, or New Relic so they can focus on what to watch rather than how to watch.
