Failure Mode

Concept

A foundational idea to recognize and understand.

Context

Every system can fail. The question isn’t whether but how. A failure mode is a specific, identifiable way that a system can break or degrade. Understanding failure modes is a tactical pattern; it operates at the level of individual components and their interactions, and it informs how you design, test, and operate software.

Failure modes connect to Invariants (what must not break), Tests (how you verify it doesn’t break), and Observability (how you detect it breaking in production).

Problem

When you build software, you naturally think about how it should work. But reliable software requires thinking equally hard about how it will fail. A database will become unavailable. A network call will time out. A disk will fill up. A user will submit unexpected input. Each of these is a failure mode, and each demands a different response. If you haven’t thought about how your system fails, you’ll discover its failure modes in production, from your users.

Forces

  • There are more ways for a system to fail than to succeed.
  • Not all failures are equally likely or equally damaging.
  • Handling every conceivable failure is impractical and makes code complex.
  • Unhandled failures tend to cascade: one component’s failure becomes another’s input.
  • Users and operators need to understand what went wrong, not just that something did.

Solution

Systematically identify and categorize the ways your system can fail, then decide how to handle each one. For each component or interaction, ask: “What happens when this goes wrong?”

Common failure modes include:

  • Crash — the process terminates unexpectedly.
  • Timeout — an operation takes too long and is abandoned.
  • Resource exhaustion — memory, disk, connections, or threads run out.
  • Data corruption — stored data becomes inconsistent or invalid.
  • Dependency failure — a service or library the system relies on stops working.
  • Byzantine failure — a component produces incorrect results but doesn’t report an error.

For each identified failure mode, choose a response: retry, fall back to a default, degrade gracefully, alert an operator, or fail fast and clearly. The worst response is no response at all, which lets the failure propagate silently.
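One way to make these decisions explicit is to record them as data. The sketch below is illustrative only: the names `FailureMode`, `Response`, `POLICY`, `choose_response`, and `unhandled_modes` are hypothetical, and the specific pairings in `POLICY` are assumptions, since the right response for each mode depends on the system.

```python
from enum import Enum, auto


class FailureMode(Enum):
    CRASH = auto()
    TIMEOUT = auto()
    RESOURCE_EXHAUSTION = auto()
    DATA_CORRUPTION = auto()
    DEPENDENCY_FAILURE = auto()
    BYZANTINE = auto()


class Response(Enum):
    RETRY = auto()
    FALL_BACK = auto()
    DEGRADE = auto()
    ALERT = auto()
    FAIL_FAST = auto()


# Hypothetical policy: these pairings are examples, not recommendations.
POLICY = {
    FailureMode.TIMEOUT: Response.RETRY,
    FailureMode.DEPENDENCY_FAILURE: Response.FALL_BACK,
    FailureMode.RESOURCE_EXHAUSTION: Response.DEGRADE,
    FailureMode.DATA_CORRUPTION: Response.ALERT,
    FailureMode.BYZANTINE: Response.ALERT,
    FailureMode.CRASH: Response.FAIL_FAST,
}


def choose_response(mode):
    """Return the planned response; a mode with no plan fails fast, never silently."""
    return POLICY.get(mode, Response.FAIL_FAST)


def unhandled_modes():
    """Return the failure modes for which no decision has been recorded."""
    return set(FailureMode) - set(POLICY)
```

A check such as `unhandled_modes()` can run in a test suite, so that adding a new failure mode without deciding how to handle it is caught early.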

Document your failure modes. A failure mode catalog for a system is like a medical chart: it tells you what can go wrong, what the symptoms look like, and what to do about it.
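A catalog like this can live in a document, but it can also be kept as structured data next to the code. The following is a minimal sketch under that assumption; `CatalogEntry`, `CATALOG`, `lookup`, and `render` are hypothetical names, and the entries are illustrative examples built from the failures named earlier, not a recommended set.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CatalogEntry:
    failure_mode: str  # what can go wrong
    symptoms: str      # what it looks like when it happens
    response: str      # what to do about it


# Illustrative entries only; a real catalog is specific to the system.
CATALOG: List[CatalogEntry] = [
    CatalogEntry(
        "database unavailable",
        "connection errors on every query",
        "serve cached reads and alert an operator",
    ),
    CatalogEntry(
        "disk full",
        "writes fail with out-of-space errors",
        "stop accepting uploads and alert an operator",
    ),
    CatalogEntry(
        "network call timeout",
        "requests exceed their deadline",
        "retry with a limit, then fall back to a default",
    ),
]


def lookup(failure_mode: str) -> Optional[CatalogEntry]:
    """Find the entry for a named failure mode, or None if it is not cataloged."""
    for entry in CATALOG:
        if entry.failure_mode == failure_mode:
            return entry
    return None


def render(catalog: List[CatalogEntry]) -> str:
    """Format the catalog as one line per failure mode for operators to read."""
    return "\n".join(
        f"{e.failure_mode}: {e.symptoms} -> {e.response}" for e in catalog
    )
```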

How It Plays Out

A weather application depends on a third-party API for forecast data. The team identifies three failure modes for this dependency: the API could be down (timeout), it could return stale data (data quality), or it could return an error (explicit failure). For timeouts, the app shows the last known forecast with a “data may be outdated” banner. For stale data, it checks the timestamp and warns the user. For errors, it falls back to a simplified forecast from a secondary source. None of these responses is perfect, but all are better than crashing or showing nothing.
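The three responses in the weather example could be sketched roughly as follows. Everything here is an assumption made for illustration: the `Forecast` type, the `ApiError` exception, the one-hour `MAX_AGE_SECONDS` threshold, the banner texts, and the idea that the API client raises Python's built-in `TimeoutError`. The behavior when a timeout occurs with no cached forecast is not specified in the scenario; this sketch uses the secondary source in that case.

```python
import time
from dataclasses import dataclass, replace
from typing import Callable, Optional

MAX_AGE_SECONDS = 3600.0  # assumed staleness threshold (one hour)


class ApiError(Exception):
    """Hypothetical error raised when the forecast API returns an error response."""


@dataclass(frozen=True)
class Forecast:
    summary: str
    fetched_at: float              # Unix timestamp of the underlying data
    warning: Optional[str] = None  # banner text to show the user, if any


def get_forecast(
    primary: Callable[[], Forecast],
    secondary: Callable[[], Forecast],
    last_known: Optional[Forecast],
    now: Optional[float] = None,
) -> Forecast:
    now = time.time() if now is None else now
    try:
        forecast = primary()
    except TimeoutError:
        # Timeout: show the last known forecast with a banner.
        if last_known is not None:
            return replace(last_known, warning="data may be outdated")
        # Assumption: with nothing cached, use the secondary source instead.
        return secondary()
    except ApiError:
        # Explicit error: simplified forecast from the secondary source.
        return secondary()
    # Stale data: check the timestamp and warn the user.
    if now - forecast.fetched_at > MAX_AGE_SECONDS:
        return replace(forecast, warning="forecast may be stale")
    return forecast


# Stubs that simulate two of the failure modes, for use in tests.
def primary_times_out() -> Forecast:
    raise TimeoutError("forecast API did not answer in time")


def primary_errors() -> Forecast:
    raise ApiError("forecast API returned an error")
```

Passing the data sources in as callables keeps each failure mode easy to simulate: a test swaps in a stub like `primary_times_out` and asserts on the fallback.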

In agentic workflows, failure mode analysis applies to the agent itself. An AI agent can fail in ways that resemble software failures: it can run out of a resource (context window exhaustion), produce corrupted output (hallucination), or silently do the wrong thing (a misunderstood instruction). Treating the agent as a component with known failure modes, and designing safeguards accordingly, makes agentic workflows more reliable. For example, always validate agent output against Tests before accepting it.
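A validation gate for agent output might look like the sketch below. It assumes, purely for illustration, that the agent is expected to return a JSON object with a `summary` field; the function and check names (`accept_agent_output`, `is_json_object`, `has_summary_field`, `CHECKS`) are hypothetical. A check that raises an exception is counted as a failure, so a malformed output cannot slip through by crashing the checker.

```python
import json
from typing import Callable, Iterable, List, Tuple

Check = Tuple[str, Callable[[str], bool]]


def accept_agent_output(output: str, checks: Iterable[Check]) -> Tuple[bool, List[str]]:
    """Run every check; accept only if all pass. Returns (accepted, failed check names)."""
    failures = []
    for name, check in checks:
        try:
            ok = check(output)
        except Exception:
            # A check that cannot even run is treated as a failed check.
            ok = False
        if not ok:
            failures.append(name)
    return (not failures, failures)


def is_json_object(text: str) -> bool:
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def has_summary_field(text: str) -> bool:
    # Raises on invalid JSON; accept_agent_output treats that as a failure.
    return "summary" in json.loads(text)


# Assumed expectations for this example only.
CHECKS = [
    ("is a JSON object", is_json_object),
    ("has a summary field", has_summary_field),
]
```

Returning the names of the failed checks, not just a boolean, gives the operator (or a retry prompt) something specific to act on.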

Note

The most dangerous failure modes aren’t the obvious ones (crash, timeout) but the subtle ones: data that is almost correct, responses that are slightly wrong, processes that succeed but produce garbage. These are the failures that survive testing and reach users.
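Plausibility checks are one partial defense against output that parses correctly but is wrong. The sketch below assumes a list of forecast temperatures in degrees Celsius; the bounds and the name `check_forecast` are illustrative assumptions. A check like this catches only gross errors: a value that is wrong but still plausible passes.

```python
import math

# Assumed plausibility bounds in degrees Celsius, for illustration only.
MIN_TEMP_C = -90.0
MAX_TEMP_C = 60.0


def check_forecast(temps_c):
    """Return a list of problems found; an empty list means no check failed."""
    problems = []
    if not temps_c:
        problems.append("no data points")
    for i, t in enumerate(temps_c):
        if math.isnan(t):
            problems.append(f"point {i}: not a number")
        elif not MIN_TEMP_C <= t <= MAX_TEMP_C:
            problems.append(f"point {i}: {t} outside plausible range")
    return problems
```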

Example Prompt

“List the failure modes for our dependency on the weather API: timeout, stale data, error response, rate limiting. For each mode, implement a fallback behavior and add a test that simulates the failure.”

Consequences

Explicit failure mode analysis makes systems more reliable and easier to operate. When something goes wrong, the team isn’t surprised; they’ve already considered this scenario and have a response ready. It also improves Observability, because each failure mode implies specific signals to monitor.
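As a small illustration of a failure mode implying a signal, the sketch below counts occurrences per named mode and reports which ones have crossed a threshold. The class name `FailureSignals` and the mode names are hypothetical, and a real system would more likely send these counts to its existing metrics tooling than keep them in memory.

```python
from collections import Counter
from typing import List


class FailureSignals:
    """Counts occurrences of each named failure mode."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, mode: str) -> None:
        self._counts[mode] += 1

    def count(self, mode: str) -> int:
        return self._counts[mode]

    def over_threshold(self, threshold: int) -> List[str]:
        """Return modes seen at least `threshold` times, in alphabetical order."""
        return sorted(m for m, n in self._counts.items() if n >= threshold)
```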

The cost is analysis time and code complexity. Handling failure modes adds conditional logic, fallback paths, and monitoring. There’s a judgment call in how many failure modes to handle explicitly. Focus on the most likely and most damaging modes first; pragmatism beats completeness.

  • Detected by: Observability — you need visibility to detect failure modes in production.
  • Tested against: Test — test each failure mode explicitly.
  • Includes: Silent Failure — a particularly dangerous category of failure mode.
  • Bounded by: Performance Envelope — operating outside the envelope triggers failure modes.
  • Relates to: Invariant — a violated invariant is often the mechanism of a failure mode.