Observability

Concept

A foundational idea to recognize and understand.

Context

Your software is running in production. Users are using it. But you can’t see inside it. You know what goes in (requests) and what comes out (responses), but the internal state (why a request was slow, why a recommendation was wrong, why a queue is growing) is opaque. This is a tactical pattern that bridges the gap between deployed software and the humans (or agents) responsible for it.

Observability complements Testing, which verifies behavior before deployment. Observability gives you visibility after deployment, when real users and real data are involved.

Problem

Software in production behaves differently than software in testing. Real data is messier, real load is higher, and real users find paths you never anticipated. When something goes wrong, or just behaves unexpectedly, you need to understand why it happened, not just that it did. But production systems are complex, and adding visibility after the fact is expensive and disruptive. How do you design systems so that you can understand their internal behavior from the outside?

Forces

  • You can’t debug what you can’t see.
  • Adding logging and instrumentation after problems appear is reactive and often insufficient.
  • Too much logging creates noise that buries the signal.
  • Sensitive data must not leak into logs or metrics.
  • Observability infrastructure (log aggregation, metrics dashboards, tracing systems) has real cost.

Solution

Design your software so that its internal state can be inferred from its external outputs. The three pillars of observability are:

Logs: timestamped records of discrete events. “User 42 placed order 789 at 14:32:07.” Logs tell you what happened. Good logs are structured (key-value pairs, not free-form text), include context (request IDs, user IDs), and use consistent severity levels.
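A minimal sketch of what structured logging looks like in practice, using Python's standard logging module. The field names here (request_id, user_id, order_id) are illustrative, not a fixed schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON object instead of free-form text."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge structured context attached via the `extra=` mechanism.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# "User 42 placed order 789" as a searchable record, not a sentence:
logger.info(
    "order placed",
    extra={"context": {"request_id": "abc-123", "user_id": 42, "order_id": 789}},
)
```

Because every entry is machine-parseable JSON, queries like "all errors for request abc-123" become trivial for both humans and tooling.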

Metrics: numerical measurements over time. “Request latency p99 is 230ms. Error rate is 0.3%. Queue depth is 47.” Metrics tell you how the system is performing. They’re cheap to collect and good for alerting on thresholds.

Traces: records of a request’s path through the system, showing which services it touched, how long each step took, and where it spent the most time. Traces tell you where time goes. They’re necessary for diagnosing performance problems in distributed systems.
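The essence of a trace is a set of named, timed spans. Real tracers (OpenTelemetry, for example) also propagate trace and span IDs across service boundaries; this in-process sketch only captures the timing structure:

```python
import time
from contextlib import contextmanager

SPANS = []  # in a real system these would be exported to a tracing backend

@contextmanager
def span(name):
    """Record how long the enclosed operation took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append({"name": name, "duration_ms": (time.perf_counter() - start) * 1000})

with span("checkout"):
    with span("payment"):
        time.sleep(0.01)   # stand-in for the payment service call
    with span("inventory"):
        time.sleep(0.005)  # stand-in for the inventory update

# The slowest leaf span points at where the time went.
slowest = max(SPANS[:-1], key=lambda s: s["duration_ms"])
print(slowest["name"])
```

Here the "payment" span accounts for most of the checkout's duration, which is exactly the kind of answer a tracing dashboard gives you for the scenario described under "How It Plays Out."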

The point is that observability isn’t something you bolt on; it’s something you design in. Every significant operation should emit enough information that someone investigating a problem six months from now can reconstruct what happened.

How It Plays Out

An e-commerce site experiences intermittent slow checkouts. Without observability, the team would guess, deploy changes, and hope. With observability, they open the tracing dashboard, find a slow checkout request, and see that the payment service call took 8 seconds instead of the usual 200 milliseconds. They check the payment service metrics and see a spike in database connection wait time. The root cause, connection pool exhaustion, is identified in minutes, not days.

In agentic workflows, observability enables agents to monitor and maintain deployed systems. An agent can watch metrics, detect anomalies, and investigate using logs and traces, all programmatically. “Alert: error rate exceeded 1%. Investigate.” The agent queries recent error logs, identifies the most common error, traces it to a recent deployment, and reports its findings. This kind of automated investigation is only possible when the system is observable.
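The "identify the most common error" step of such an investigation is straightforward once logs are structured. A sketch, assuming JSON-lines logs with illustrative field names (level, error, request_id):

```python
import json
from collections import Counter

# Hypothetical structured log lines, as an agent might fetch from a log store:
log_lines = [
    '{"level": "ERROR", "error": "ConnectionTimeout", "request_id": "r1"}',
    '{"level": "INFO",  "message": "order placed",    "request_id": "r2"}',
    '{"level": "ERROR", "error": "ConnectionTimeout", "request_id": "r3"}',
    '{"level": "ERROR", "error": "ValidationError",   "request_id": "r4"}',
]

def most_common_error(lines):
    """Return (error_name, count) for the most frequent error, or None."""
    errors = Counter(
        entry["error"]
        for entry in map(json.loads, lines)
        if entry.get("level") == "ERROR"
    )
    return errors.most_common(1)[0] if errors else None

print(most_common_error(log_lines))  # ('ConnectionTimeout', 2)
```

The same query against free-form log sentences would require fragile text parsing, which is the point of the tip below.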

Tip

Structure your logs as key-value pairs (or JSON), not free-form sentences. Structured logs are searchable by machines, including AI agents, while “Something went wrong with the order” is useful to nobody.

Example Prompt

“Add structured JSON logging to the checkout flow. Each log entry should include a request_id, the step name, the duration in milliseconds, and any error details. Replace the existing print statements.”

Consequences

Observable systems are easier to operate, debug, and improve. Problems are found faster, root causes are identified more reliably, and the team spends less time guessing. Observability data also serves as a foundation for Performance Envelope definition: you can’t set performance targets without measuring actual performance.

The costs are real: storage for logs and metrics, network overhead for telemetry, engineering time to instrument code, and the risk of exposing sensitive data in logs. Treat observability as a feature that requires design and review, not an afterthought you sprinkle on.

  • Complements: Test – tests verify behavior before deployment; observability verifies behavior after.
  • Enables: Failure Mode detection – you can’t detect failure modes you can’t observe.
  • Enables: Silent Failure detection – observability is the primary defense against silent failures.
  • Informs: Performance Envelope – metrics data defines and monitors the envelope.
  • Contrasts with: Test Oracle – oracles verify correctness in test; observability reveals behavior in production.
  • Contrasts with: Fixture – fixtures control inputs for testing, while observability captures outputs in production.
  • Related: Shadow Agent – shadow agents evade all observability systems.
  • Related: Premature Optimization – measurement must precede optimization.
  • Related: Technical Debt – debt hides in unmonitored code.