Data Normalization / Denormalization

Pattern

A reusable solution you can apply to your work.

Also known as: Normal Forms (normalization), Materialized Views (denormalization)

Understand This First

  • Schema (Database) – normalization and denormalization are techniques for schema design.
  • Source of Truth – denormalized copies must have a clear authoritative source.
  • DRY – normalization is DRY applied to data; denormalization is a controlled violation of DRY.

Context

When designing a Schema for a Database, you face a design choice about how to organize your tables and fields. Normalization means structuring data so that each fact is stored exactly once — the DRY principle applied to database design. Denormalization means intentionally duplicating data so that certain queries become faster. This is an architectural pattern because it shapes the performance, consistency guarantees, and maintenance burden of everything built on the database.

Problem

How do you structure stored data to minimize inconsistency without sacrificing the performance of the queries your application actually needs?

A fully normalized database stores each fact once. If a customer’s name appears in the customers table, it doesn’t also appear in the orders table; the order just references the customer by ID. This is clean and consistent, but displaying an order summary now requires joining two tables, which is slower than reading a single row. A fully denormalized database stores everything together. Each order row includes the customer’s name, address, and phone number. That’s fast to read, but updating a customer’s name requires finding and changing every order they ever placed.
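
The contrast can be sketched with a minimal in-memory database (the table and column names here are illustrative, not taken from any particular system):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: the customer's name lives in exactly one place.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         total REAL);
    -- Denormalized: the name is copied onto every order row.
    CREATE TABLE orders_denorm (id INTEGER PRIMARY KEY,
                                customer_name TEXT, total REAL);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.0)")
conn.execute("INSERT INTO orders_denorm VALUES (10, 'Ada', 99.0)")

# Normalized read: an order summary needs a join...
row = conn.execute("""
    SELECT c.name, o.total FROM orders o
    JOIN customers c ON c.id = o.customer_id""").fetchone()

# ...while the denormalized read is a single-table lookup.
row2 = conn.execute(
    "SELECT customer_name, total FROM orders_denorm").fetchone()
print(row, row2)  # ('Ada', 99.0) ('Ada', 99.0)
```

Both queries return the same facts; the difference is where the name is stored and, therefore, how many places must change when it changes.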

Forces

  • Storing each fact once (DRY) prevents update anomalies. You can’t forget to update a copy you didn’t know existed.
  • Read-heavy workloads benefit from denormalization, which keeps data pre-joined and ready to serve.
  • Write-heavy workloads benefit from normalization, where updates touch one row instead of many.
  • The complexity of keeping denormalized copies in sync can offset the performance gains.

Solution

Start normalized. Store each fact once, reference related data by ID, and let the database join tables at query time. This is the safe default because it prevents an entire category of bugs: the kind where two copies of the same fact disagree.

Denormalize selectively, when you have evidence that specific read operations are too slow and the cost of maintaining redundant copies is acceptable. Common denormalization strategies include adding computed columns (storing an order total instead of recalculating it from line items), creating summary tables (a monthly_sales table updated by a background job), and embedding related data (storing the customer name directly on the order row for display purposes).

When you denormalize, document which data is authoritative and which is derived. A denormalized copy should always have a clear upstream Source of Truth and a defined mechanism for staying in sync, whether that’s a database trigger, a background job, or application logic.
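
A database trigger is the most self-contained of those sync mechanisms. A minimal SQLite sketch (schema names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER,
                         customer_name TEXT);  -- derived copy

    -- customers.name is the source of truth; the trigger keeps
    -- every denormalized copy in step with it.
    CREATE TRIGGER sync_customer_name
    AFTER UPDATE OF name ON customers
    BEGIN
        UPDATE orders SET customer_name = NEW.name
        WHERE customer_id = NEW.id;
    END;
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders VALUES (10, 1, 'Ada')")

conn.execute("UPDATE customers SET name = 'Ada Lovelace' WHERE id = 1")
print(conn.execute(
    "SELECT customer_name FROM orders").fetchone()[0])  # Ada Lovelace
```

The tradeoff among the three mechanisms is latency versus coupling: a trigger syncs within the same transaction, while a background job tolerates a window of staleness in exchange for cheaper writes.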

How It Plays Out

A social media application stores posts and user profiles in separate, normalized tables. The feed page — which shows posts alongside author names and avatars — requires joining the two tables for every post. Under heavy load, this join becomes the bottleneck. The team denormalizes by copying the author’s name and avatar URL onto each post row. Reads become fast, but now when a user changes their avatar, a background job must update thousands of post rows. The team accepts this tradeoff because avatar changes are rare and feed reads are constant.
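
The background-job half of that tradeoff might look like the following sketch (the `sync_profile` function and all schema names are hypothetical; a real job would batch the update and run asynchronously):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY,
                        name TEXT, avatar_url TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER,
                        body TEXT,
                        author_name TEXT, avatar_url TEXT);  -- denormalized
""")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'a.png')")
conn.executemany(
    "INSERT INTO posts VALUES (?, 1, 'hello', 'Ada', 'a.png')",
    [(i,) for i in range(3)])

def sync_profile(conn, user_id):
    """Background job: fan a profile change out to every post row."""
    name, avatar = conn.execute(
        "SELECT name, avatar_url FROM users WHERE id = ?",
        (user_id,)).fetchone()
    conn.execute(
        "UPDATE posts SET author_name = ?, avatar_url = ? WHERE user_id = ?",
        (name, avatar, user_id))

# One cheap write to the source of truth...
conn.execute("UPDATE users SET avatar_url = 'b.png' WHERE id = 1")
# ...then the job repays the debt across many post rows.
sync_profile(conn, 1)
rows = conn.execute("SELECT DISTINCT avatar_url FROM posts").fetchall()
print(rows)  # [('b.png',)]
```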

When an AI agent generates database code, it often defaults to either extreme: heavily normalized (many small tables joined at query time) or heavily denormalized (a single JSON blob). Guiding the agent with explicit instructions like “normalize by default, but store the order total as a computed column for fast access” produces a practical design that balances both concerns.

Note

There is no single “correct” level of normalization. The right answer depends on your read/write ratio, your consistency requirements, and how willing you are to maintain synchronization logic. Start normalized and denormalize only where measurements show a real need.

Example Prompt

“The feed page is slow because it joins posts with user profiles on every request. Add a denormalized author_name and avatar_url to the posts table, and create a background job that syncs these fields when a user updates their profile.”

Consequences

Normalization gives you consistency and flexibility. You can change a fact in one place, and queries always reflect the current truth. It simplifies writes and reduces storage. But it can make reads slower, especially for dashboards and reports that aggregate data from many tables.

Denormalization gives you read speed and simpler queries at the cost of write complexity and the ongoing risk of stale data. Every denormalized copy is a consistency liability that must be managed. Over-denormalization leads to the exact problem normalization was invented to solve: update anomalies, where one copy says the customer lives in New York and another says Chicago.

  • Uses / Depends on: Schema (Database) — normalization and denormalization are techniques for schema design.
  • Uses / Depends on: Source of Truth — denormalized copies must have a clear authoritative source.
  • Uses / Depends on: DRY — normalization is DRY applied to data; denormalization is a controlled violation of DRY.
  • Enables: Consistency — normalization reduces the surface area for inconsistency.
  • Enables: CRUD — the normalization level affects the complexity of CRUD operations.