Test Oracle

Pattern

A reusable solution you can apply to your work.

Context

You have a Test that runs your code and produces an output. Now you need to decide: is that output correct? The thing that answers this question is called an oracle. This is a tactical pattern that sits at the heart of every testing strategy.

Without an oracle, a test is just a program that runs code. It can tell you the code didn’t crash, but it can’t tell you the code did the right thing.

Problem

Knowing whether software produced the right answer is often harder than producing the answer in the first place. For simple functions (add two numbers, sort a list) the expected output is obvious. But for complex systems (a recommendation engine, a layout algorithm, a natural language response) defining “correct” is genuinely difficult. How do you establish a reliable source of truth for your tests?

Forces

  • Simple oracles (hardcoded expected values) are easy to write but only cover specific cases.
  • Complex systems produce outputs that are hard to verify precisely.
  • Some behaviors have multiple valid outputs, making exact comparison impossible.
  • The oracle itself can be wrong, creating false confidence.
  • Maintaining oracles adds cost as the system evolves.

Solution

Choose a source of truth appropriate to what you’re testing. The most common oracles, from simplest to most sophisticated:

Expected values. You hardcode the correct output for specific inputs. This is the bread and butter of unit testing: assert add(2, 3) == 5. Simple, clear, and fragile if the expected behavior changes.
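A minimal sketch of an expected-value oracle, building on the `add` example above:

```python
def add(a, b):
    return a + b

def test_add():
    # Expected-value oracle: the correct output is hardcoded per input.
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
    assert add(0, 0) == 0
```

Each assertion is its own tiny oracle: trivial to write, unambiguous to check, and invalidated the moment the intended behavior of `add` changes.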

Reference implementations. You compare your code’s output against a trusted alternative: a known-good library, a previous version, or a deliberately simple (but slow) implementation. This works well for algorithmic code where correctness is well-defined.
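A sketch of a reference-implementation oracle. Here `fast_sort` stands in for a hypothetical implementation under test, and Python's built-in `sorted` plays the role of the trusted alternative:

```python
import random

def fast_sort(items):
    # Stand-in for the implementation under test (hypothetical);
    # in real code this would be your own algorithm.
    return sorted(items)

def test_against_reference():
    for _ in range(100):
        data = [random.randint(-1000, 1000) for _ in range(50)]
        # Reference-implementation oracle: the trusted built-in
        # defines what "correct" means for any random input.
        assert fast_sort(data) == sorted(data)
```

Because the oracle computes the expected output itself, the test works on randomly generated inputs that no one hardcoded.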

Property checks. Instead of checking for an exact value, you check that the output satisfies certain properties. “The sorted list has the same elements as the input” and “each element is less than or equal to the next” together define correctness for sorting without hardcoding any specific output.
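The two sorting properties above can be sketched as a property-check oracle:

```python
import random
from collections import Counter

def check_sort_properties(inp, out):
    # Property 1: the output is a permutation of the input
    # (same elements, same multiplicities).
    assert Counter(out) == Counter(inp)
    # Property 2: each element is <= the next.
    assert all(a <= b for a, b in zip(out, out[1:]))

# The properties hold for any input, so random data works fine.
for _ in range(100):
    data = [random.randint(0, 99) for _ in range(20)]
    check_sort_properties(data, sorted(data))
```

No specific expected output is hardcoded anywhere; the two properties together pin down correct sorting completely.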

Human judgment. For subjective or complex outputs (UI rendering, generated text, design choices) a human reviews the result and decides whether it’s acceptable. This doesn’t scale, but it’s sometimes the only honest oracle.

How It Plays Out

A team building a search engine can’t hardcode expected results for every query. Instead, they use property-based oracles: every returned result must contain the search term, results must be sorted by relevance score, and the top result must score above a threshold. These properties hold for any query, so the tests work even as the index changes.
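The search team's oracle might look like the following sketch, assuming a hypothetical result shape of `(text, score)` pairs:

```python
def check_search_properties(query, results, threshold=0.5):
    """Property-based oracle for search results.

    `results` is assumed to be a list of (text, score) pairs;
    the shape and the 0.5 threshold are illustrative assumptions.
    """
    # Property 1: every result contains the search term.
    assert all(query.lower() in text.lower() for text, _ in results)
    # Property 2: results are sorted by relevance score, descending.
    scores = [score for _, score in results]
    assert scores == sorted(scores, reverse=True)
    # Property 3: the top result scores above the threshold.
    if results:
        assert results[0][1] > threshold

check_search_properties("apple", [("apple pie", 0.9), ("apple tart", 0.7)])
```

These checks survive re-indexing precisely because they constrain the shape of any valid answer rather than the answer itself.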

In agentic coding, the oracle problem becomes acute. When an AI agent generates code, you need to verify the output. If you have a test suite with clear oracles (expected values, property checks, reference outputs) the agent can run the tests and self-correct. But if the only oracle is “a human reads the code and decides if it looks right,” the agent can’t iterate autonomously. Investing in machine-checkable oracles is what makes agentic workflows scalable.

Tip

When you can’t define an exact oracle, define properties. “The output is valid JSON,” “the response is under 200ms,” “the total matches the sum of the line items” — partial oracles still catch real bugs.
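A sketch of the "total matches the sum of the line items" partial oracle, using a hypothetical invoice payload:

```python
import json

def check_invoice_response(raw):
    # Partial oracle 1: the output parses as valid JSON at all.
    data = json.loads(raw)
    # Partial oracle 2: the total matches the sum of the line items.
    # The field names here are illustrative assumptions.
    assert data["total"] == sum(item["amount"] for item in data["line_items"])

check_invoice_response(
    '{"line_items": [{"amount": 40}, {"amount": 2}], "total": 42}'
)
```

Neither check proves the invoice is right, but either one failing proves something is wrong, which is exactly what a partial oracle promises.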

Example Prompt

“The search results can’t be hardcoded, so write property-based tests instead. Every returned result must contain the search term, results must be sorted by score descending, and the top result’s score must exceed 0.5.”

Consequences

A well-chosen oracle makes tests trustworthy. When a test fails, you know something is genuinely wrong, not just different. This trust is what makes a test suite valuable.

The risk is oracle rot: the oracle itself becomes outdated or wrong, and tests pass even when the code is broken. This is especially dangerous with hardcoded expected values that someone copy-pasted without verifying. Review your oracles as carefully as you review your code.

Sources

  • William E. Howden coined the term “test oracle” in Theoretical and Empirical Studies of Program Testing (ICSE 1978; IEEE Transactions on Software Engineering, July 1978), introducing the vocabulary used throughout this entry.
  • Elaine J. Weyuker’s On Testing Non-Testable Programs (The Computer Journal, 1982) formalized the case where an oracle is pragmatically unattainable — the “oracle problem” that drives the choice between expected values, reference implementations, properties, and human judgment.
  • Koen Claessen and John Hughes introduced property-based testing in QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs (ICFP 2000), the origin of the property-check approach described in the Solution.
  • Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo’s The Oracle Problem in Software Testing: A Survey (IEEE Transactions on Software Engineering, 2015) is the standard modern reference mapping the landscape of oracle techniques, including metamorphic testing and pseudo-oracles.