Logging

When something breaks at 3am, logs are how you find out what happened.

What logging is

A log is a record of events your code emitted, in order. Things like "user signed in," "database query took 200 ms," or "payment failed: insufficient funds."

Most applications emit logs continuously: to stdout, to files, to central services. They are your debugger in production, where you cannot attach a real one.

The common log levels you will see:

DEBUG is very chatty. For development. Off in production.

INFO is normal operation. "Request received," "job completed."

WARN is something unusual but not broken. "Retry succeeded after 2 attempts."

ERROR means something failed. "Database connection lost."

FATAL means the app cannot continue. About to crash.

The volume difference is big. DEBUG can be thousands of lines per second. INFO might be hundreds. ERROR should ideally be close to zero.
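To see how a level threshold cuts volume, here is a minimal sketch using Python's standard logging module. The logger name and messages are made up for illustration. Note that Python spells the two highest levels WARNING and CRITICAL rather than WARN and FATAL.

```python
import io
import logging


def emit_all_levels(threshold: int) -> list[str]:
    """Emit one message per level and return the lines that passed the threshold."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    logger = logging.getLogger("levels_demo")  # hypothetical logger name
    logger.handlers = [handler]                # replace handlers from earlier calls
    logger.propagate = False                   # keep output out of the root logger
    logger.setLevel(threshold)

    logger.debug("cache lookup key=user:42")
    logger.info("request received")
    logger.warning("retry succeeded after 2 attempts")
    logger.error("database connection lost")
    logger.critical("cannot continue, shutting down")

    return buffer.getvalue().splitlines()


dev_lines = emit_all_levels(logging.DEBUG)   # development: everything passes
prod_lines = emit_all_levels(logging.INFO)   # production: DEBUG is dropped
```

With the threshold at INFO, the DEBUG call is discarded before it reaches the handler, which is where the production volume savings come from.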

Structured logging

The old style is a plain string. "User 42 logged in from IP 1.2.3.4".

The modern style is structured JSON. { "event": "user_login", "user_id": 42, "ip": "1.2.3.4", "ts": "2024-..." }

This matters a lot at scale. To query plain strings, you need regular expressions, and they break whenever the wording of a message changes. To query JSON, you can index and filter the fields directly.
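A small sketch of the difference, using the two login examples above. The timestamp field is left out here because its value is elided in the original example.

```python
import json
import re

plain_line = "User 42 logged in from IP 1.2.3.4"
json_line = '{"event": "user_login", "user_id": 42, "ip": "1.2.3.4"}'

# Plain string: the query is a pattern that must track the exact wording.
match = re.search(r"User (\d+) logged in from IP (\S+)", plain_line)
user_from_text = int(match.group(1)) if match else None

# Structured JSON: parse once, then read fields by name.
record = json.loads(json_line)
user_from_json = record["user_id"]
```

If someone rewords the plain message to "User 42 signed in", the pattern silently stops matching. The JSON field name stays stable.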

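Modern logging libraries like winston, pino, and structlog are built for structured output. pino and winston emit JSON by default. structlog prints a human-readable console format by default and emits JSON once you configure its JSON renderer. Most aggregation systems parse JSON lines without custom patterns.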
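You do not need a third-party library to try this. The following is a minimal sketch of a JSON formatter on top of Python's standard logging module. The class name JsonFormatter and the "fields" key passed through extra= are choices made for this sketch, not part of any library.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes an attribute on
        # the record; "fields" is a name invented for this sketch.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

json_logger = logging.getLogger("json_demo")  # hypothetical logger name
json_logger.handlers = [handler]
json_logger.propagate = False
json_logger.setLevel(logging.INFO)

json_logger.info("user_login", extra={"fields": {"user_id": 42, "ip": "1.2.3.4"}})
emitted = json.loads(stream.getvalue())
```

In a real service you would more likely reach for one of the libraries named above. The point of the sketch is only that each log call produces one parseable object with named fields.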

Add context to every log. Which service emitted it. Which trace ID it belongs to. Which user, if any. Without context, a log is "something happened somewhere." With context, it is "this request from this user hit this error."
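One way to attach context without repeating it at every call site is a logging filter that reads request-scoped values. This sketch uses Python's contextvars and standard logging. The service name, trace ID format, and handle_request function are all hypothetical.

```python
import contextvars
import io
import json
import logging

SERVICE_NAME = "checkout"  # hypothetical service name
trace_id_var = contextvars.ContextVar("trace_id", default=None)
user_id_var = contextvars.ContextVar("user_id", default=None)


class ContextFilter(logging.Filter):
    """Copy request-scoped context onto every record that passes through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.service = SERVICE_NAME
        record.trace_id = trace_id_var.get()
        record.user_id = user_id_var.get()
        return True  # never drop the record, only enrich it


class ContextJsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "event": record.getMessage(),
            "service": record.service,
            "trace_id": record.trace_id,
            "user_id": record.user_id,
        })


context_stream = io.StringIO()
context_handler = logging.StreamHandler(context_stream)
context_handler.addFilter(ContextFilter())
context_handler.setFormatter(ContextJsonFormatter())

context_logger = logging.getLogger("context_demo")  # hypothetical logger name
context_logger.handlers = [context_handler]
context_logger.propagate = False
context_logger.setLevel(logging.INFO)


def handle_request(trace_id: str, user_id: int) -> None:
    # Set the context once at the edge of the request.
    trace_id_var.set(trace_id)
    user_id_var.set(user_id)
    # Deeper code logs without knowing about the trace or the user.
    context_logger.error("payment_failed")


handle_request("trace-abc123", 42)
context_record = json.loads(context_stream.getvalue())
```

The log call itself says only "payment_failed". The filter turns it into "this request from this user hit this error."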

Central aggregation

Logs that stay on individual servers are close to useless at scale. With 100 servers, you cannot SSH into each one to read logs.

The standard pattern is to ship every server's logs to a central log aggregator. Search across the whole fleet from one place.

Common tools:

The ELK stack (Elasticsearch, Logstash, Kibana) is open source, self-hosted, and powerful.

Datadog Logs, Splunk, and Sumo Logic are commercial offerings, available as managed services.

CloudWatch Logs (AWS) and Google Cloud Logging are built into their cloud platforms.

The aggregator indexes the JSON fields. You can ask questions like "show me every ERROR log from the checkout service in the last hour where status was 500."
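The question above amounts to a filter over fields plus a time window. Here is a toy sketch over in-memory JSON lines. The sample records, field names, and the search function are invented for illustration. A real aggregator answers the same question from an index rather than a linear scan.

```python
import json
from datetime import datetime, timedelta, timezone


def search(lines, *, level, service, status, since):
    """Return parsed records that match every field filter and are newer than `since`."""
    hits = []
    for line in lines:
        record = json.loads(line)
        if (
            record.get("level") == level
            and record.get("service") == service
            and record.get("status") == status
            and datetime.fromisoformat(record["ts"]) >= since
        ):
            hits.append(record)
    return hits


# Fixed "current time" so the example is reproducible.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)


def sample(minutes_ago, level, service, status):
    """Build one made-up log line, standing in for a record shipped from a server."""
    ts = (now - timedelta(minutes=minutes_ago)).isoformat()
    return json.dumps({"ts": ts, "level": level, "service": service, "status": status})


fleet_logs = [
    sample(5, "ERROR", "checkout", 500),   # matches
    sample(10, "ERROR", "checkout", 404),  # wrong status
    sample(20, "INFO", "checkout", 200),   # wrong level
    sample(30, "ERROR", "search", 500),    # wrong service
    sample(90, "ERROR", "checkout", 500),  # older than one hour
]

hits = search(
    fleet_logs,
    level="ERROR",
    service="checkout",
    status=500,
    since=now - timedelta(hours=1),
)
```

None of this works on plain strings without a parser per message format, which is why structured fields and central aggregation go together.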

Without central logging, you cannot do production work at any real scale. It is table stakes.

What to log, what not to log

Log every request that comes in, with a trace ID, user ID, and response code. Log errors with full stack traces. Log important state changes. "User upgraded." "Order shipped." "Config reloaded." Log slow operations that go over a threshold.
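A rough sketch of a request wrapper that covers three of these rules: one record per request, stack traces on errors, and a flag on slow operations. The function name, field names, and the 200 ms threshold are assumptions for illustration, and emit stands in for whatever writes the line out.

```python
import json
import time
import traceback

SLOW_THRESHOLD_MS = 200.0  # assumed threshold; tune per endpoint


def handle_with_logging(handler, trace_id, user_id, emit, slow_ms=SLOW_THRESHOLD_MS):
    """Run `handler`, then emit exactly one JSON log line describing the request."""
    record = {"event": "request", "trace_id": trace_id, "user_id": user_id}
    start = time.perf_counter()
    try:
        record["status"] = handler()  # handler returns a response code
        record["level"] = "INFO"
    except Exception:
        record["status"] = 500
        record["level"] = "ERROR"
        record["stack"] = traceback.format_exc()  # full stack trace
    duration_ms = (time.perf_counter() - start) * 1000
    record["duration_ms"] = round(duration_ms, 1)
    record["slow"] = duration_ms > slow_ms
    if record["slow"] and record["level"] == "INFO":
        record["level"] = "WARN"  # unusual but not broken
    emit(json.dumps(record))
    return record["status"]


def failing_handler():
    raise RuntimeError("payment failed: insufficient funds")


request_lines = []
handle_with_logging(lambda: 200, "trace-1", 42, request_lines.append)
handle_with_logging(failing_handler, "trace-2", 42, request_lines.append)
```

The wrapper swallows the exception to keep the sketch short. Real middleware would usually log and then let the framework turn the error into a response.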

Do not log passwords, API keys, or credit card numbers. Ever. Logs are one of the most common leak paths for secrets.

Do not log personal info you do not need, like full names or addresses. Less personal data in logs makes GDPR and CCPA compliance easier.

Do not run DEBUG-level logs in production. Volume costs money and drowns out the signal.
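As a safety net for the rule about secrets, some teams run a redaction pass before a record is written. The sketch below is one simple way to do that. The key list and the card-number pattern are illustrative assumptions and will miss things, so it is a backstop, not a substitute for keeping secrets out of log calls in the first place.

```python
import re

# Illustrative key names; a real list depends on your own payloads.
SENSITIVE_KEYS = {"password", "api_key", "card_number", "authorization"}

# 13 to 16 digits, optionally separated by spaces or hyphens.
# Crude on purpose: it can also match other long digit runs.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

REDACTED = "[REDACTED]"


def redact(record: dict) -> dict:
    """Return a copy of `record` with sensitive keys and card-like numbers masked."""
    clean = {}
    for key, value in record.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = REDACTED
        elif isinstance(value, dict):
            clean[key] = redact(value)  # recurse into nested objects
        elif isinstance(value, str):
            clean[key] = CARD_PATTERN.sub(REDACTED, value)
        else:
            clean[key] = value
    return clean


safe = redact({
    "event": "payment_failed",
    "password": "hunter2",
    "note": "card 4111 1111 1111 1111 declined",
    "user": {"id": 42, "api_key": "abc123"},
})
```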

Storage gets expensive at scale. Hot logs, the ones you can query within seconds, are often kept for around 30 days. Older logs move to cold storage: cheap object storage kept for long-term audit.
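The hot and cold split comes down to an age check. A tiny sketch, assuming the 30-day window mentioned above. The function name and tier labels are made up, and managed services usually expose this as a retention setting rather than code you write.

```python
from datetime import datetime, timedelta, timezone

HOT_RETENTION = timedelta(days=30)  # assumed window; set by budget and audit needs


def storage_tier(log_ts: datetime, now: datetime) -> str:
    """Return "hot" for logs inside the retention window, "cold" otherwise."""
    return "hot" if now - log_ts <= HOT_RETENTION else "cold"


reference_time = datetime(2024, 3, 1, tzinfo=timezone.utc)  # fixed for reproducibility
recent_tier = storage_tier(reference_time - timedelta(days=3), reference_time)
old_tier = storage_tier(reference_time - timedelta(days=45), reference_time)
```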

Logs are write-once. Once they ship, they are evidence. Treat them that way.
