Checkpoint
A checkpoint is a gate in an agentic workflow where the agent pauses, verifies that conditions are met, and proceeds only if they pass.
Understand This First
- Verification Loop – checkpoints use verification to decide whether work should continue.
- Plan Mode – planning produces the stages that checkpoints enforce.
Context
This is an agentic pattern. You’ve asked an agent to do something that takes multiple steps: build a feature, run a migration, restructure a module. The agent works through the steps, and you hope each one finishes correctly before the next one starts. But hope isn’t a mechanism. Without explicit stopping points, the agent charges ahead, and a mistake in step two becomes the foundation for steps three through seven.
A checkpoint is a deliberate pause between stages. The agent stops, runs a defined check, and either moves forward or halts and reports. It’s the difference between a workflow that assumes success and one that verifies it.
The concept has roots in manufacturing and aviation, where checkpoints prevent small errors from propagating into large failures. In agentic coding, the same logic applies. Models are confident but fallible, and catching an error at step two costs far less than unwinding six steps of work built on a broken assumption.
Problem
How do you prevent an agent from building on top of broken work when a multi-step task fails partway through?
An agent working through a plan will generate plausible output at every stage. If step three produces code that compiles but violates a business rule, the agent doesn’t notice. It has no internal signal that says “this is wrong.” Steps four and five layer more work on top of the violation. By the time a human reviews the result, the error is buried under several layers of changes, and rolling back means losing everything, not just the broken step.
Forces
- Agents don’t doubt their own output. A model that just generated broken code will cheerfully build the next step on top of it.
- Checking everything after every change is expensive. Running the full test suite between each step slows the workflow to a crawl.
- Checking nothing leaves you with no safety net. You discover problems only at the end, when fixing them costs the most.
- Some steps are cheap to verify (does it compile? do the types check?) while others need heavier validation (does this match the spec? does it handle edge cases?). One-size-fits-all checking wastes effort.
- Human review at every step defeats the purpose of using an agent. The whole point is that the agent handles sequences of work without constant supervision.
Solution
Break the workflow into stages and place a verification gate between each one. At each gate, the agent runs a defined check before moving to the next stage. If the check passes, work continues. If it fails, the agent either retries the current stage or stops and surfaces the failure.
Match the check to the risk of the stage. Lightweight checks (compilation, type checking, linting) cost almost nothing and belong everywhere. Heavier checks (running tests, validating against acceptance criteria, comparing output to a spec) belong at stages where a missed error would be expensive. Not every checkpoint needs the same rigor.
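One way to encode this tiering is a small lookup that maps a stage’s risk to the checks run at its gate. This is a minimal sketch; the commands and risk labels are illustrative, not prescribed by any framework:

```python
# Hypothetical check tiers: cheap checks run at every gate; heavy checks
# are added only where a missed error would be expensive to unwind.
CHEAP_CHECKS = ["ruff check .", "mypy ."]                      # seconds
HEAVY_CHECKS = ["pytest tests/", "pytest tests/acceptance/"]   # minutes

def checks_for(stage_risk: str) -> list[str]:
    """Pick the gate's command list based on the stage's risk level."""
    if stage_risk == "high":
        return CHEAP_CHECKS + HEAVY_CHECKS
    return CHEAP_CHECKS
```

A low-risk stage gets only the cheap tier; a high-risk stage pays for both.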
A practical checkpoint structure for a feature-building workflow:
- Spec review. The agent reads the requirements and produces a summary of what it plans to build. Gate: does the summary match the spec? (This can be a human review or an automated comparison.)
- Implementation. The agent writes the code. Gate: does it compile? Do the types check? Do existing tests still pass?
- Testing. The agent writes tests for the new code. Gate: do the new tests pass? Do they cover the acceptance criteria?
- Integration. The agent verifies the new code works with the rest of the system. Gate: does the full test suite pass? Are there regressions?
Each gate is a decision point with three outcomes: proceed, retry, or stop. Proceed means the check passed and the workflow advances. Retry means the agent takes another attempt at the current stage, with the failure information added to its context. Stop means the failure is beyond what the agent can fix on its own, and a human needs to step in.
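The three-outcome gate logic can be sketched in a few lines of Python. This is a hypothetical harness, not any particular framework’s API; the `Stage` fields and retry policy are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    run: Callable[[dict], dict]                 # does the work, returns updated state
    check: Callable[[dict], tuple[bool, str]]   # gate: (passed, failure detail)
    max_retries: int = 2

def run_with_checkpoints(stages: list[Stage], state: dict) -> dict:
    for stage in stages:
        for attempt in range(stage.max_retries + 1):
            state = stage.run(state)
            passed, detail = stage.check(state)
            if passed:
                break  # proceed: advance to the next stage
            # retry: surface the failure so the next attempt sees it in context
            state["last_failure"] = f"{stage.name}: {detail}"
        else:
            # stop: retries exhausted, escalate to a human
            raise RuntimeError(f"checkpoint failed at {stage.name}: {detail}")
    return state
```

The key detail is that a failed check feeds its diagnosis back into the state before the retry, so the agent isn’t repeating the same attempt blind.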
Checkpointing also means saving state. When the agent passes a gate, the current work should be preserved so that a failure at a later stage doesn’t require starting over from scratch. In code-based workflows, a Git Checkpoint at each gate handles this: commit after each passed gate, and any later failure can roll back to the last good state rather than the very beginning.
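Wiring the Git Checkpoint into the gate takes two git calls. A minimal sketch, assuming the workflow runs inside a git repository; the commit message format is arbitrary:

```python
import subprocess

def git_checkpoint(stage_name: str) -> None:
    """Commit all current work after a gate passes."""
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(
        ["git", "commit", "-m", f"checkpoint: {stage_name} passed"],
        check=True,
    )

def rollback_to_last_checkpoint() -> None:
    """Discard uncommitted work, i.e. everything since the last passed gate."""
    subprocess.run(["git", "reset", "--hard", "HEAD"], check=True)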
Some teams take this further by spinning up ephemeral environments at each checkpoint. The agent works in a disposable sandbox, and only the artifacts that pass the gate get promoted to the next stage. If a stage fails, the environment is torn down with no cleanup needed. This pairs well with CI pipelines where each gate runs in its own isolated container.
Workflow frameworks like LangGraph formalize checkpointing by attaching a checkpointer to the execution graph. Every completed stage writes a snapshot keyed to the session. If the process crashes or the agent fails mid-task, the next invocation resumes from the last snapshot rather than restarting. The pattern is the same whether you implement it with a framework or with discipline: save state at gates, verify before advancing.
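The discipline version can be as simple as one JSON snapshot per completed stage, keyed by session. This sketch is framework-agnostic and not LangGraph’s actual API; the file layout and class name are illustrative:

```python
import json
from pathlib import Path

class FileCheckpointer:
    """Persist one JSON snapshot per completed stage, keyed by session id,
    so an interrupted run can resume from the last saved state."""

    def __init__(self, root: str, session_id: str):
        self.dir = Path(root) / session_id
        self.dir.mkdir(parents=True, exist_ok=True)

    def save(self, stage_index: int, state: dict) -> None:
        (self.dir / f"{stage_index:04d}.json").write_text(json.dumps(state))

    def latest(self) -> tuple[int, dict]:
        """Return (next stage index, state); (0, {}) if nothing is saved yet."""
        snapshots = sorted(self.dir.glob("*.json"))
        if not snapshots:
            return 0, {}
        last = snapshots[-1]
        return int(last.stem) + 1, json.loads(last.read_text())
```

On restart, the harness calls `latest()` and skips every stage before the returned index instead of rerunning the whole plan.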
When writing a plan for an agent, define the checkpoints explicitly: “After implementing the API endpoints, run the integration tests before writing the frontend. If tests fail, fix the endpoints before proceeding.” The agent can’t infer where the gates should be unless you tell it.
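Gates become hardest to skip when the plan is data rather than prose. A sketch of one such plan object, with hypothetical stage names and checks, plus a helper that renders it into explicit instructions:

```python
# A hypothetical plan: each stage names the check that gates it and what
# to do on failure. A harness (or the agent itself) walks this list.
PLAN = [
    {"stage": "implement the API endpoints",
     "gate": "run the integration tests",
     "on_fail": "fix the endpoints before proceeding"},
    {"stage": "write the frontend forms",
     "gate": "run the frontend tests and the full suite",
     "on_fail": "stop and report; likely an API contract mismatch"},
]

def render_plan(plan: list[dict]) -> str:
    """Render the plan as numbered instructions for the agent's prompt."""
    lines = []
    for i, step in enumerate(plan, 1):
        lines.append(f"{i}. {step['stage']}. Gate: {step['gate']}. "
                     f"If it fails: {step['on_fail']}.")
    return "\n".join(lines)
```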
How It Plays Out
A developer asks an agent to add a payment processing feature. The plan has four stages: database schema changes, API endpoints, payment provider integration, and frontend forms. Without checkpoints, the agent writes all four in sequence. The schema migration has a subtle bug: a column type is wrong. The API endpoints build queries against that wrong type. The payment integration works around it with type coercion. The frontend renders garbage. The developer reviews the final result and has to untangle four layers of compensating errors to find the root cause.
With checkpoints, the agent runs the migration and then executes the migration tests. The column type error surfaces immediately. The agent retries the migration, gets it right, and the remaining three stages build on a correct foundation. Twenty minutes of retry at stage one costs less than two hours of forensics at stage four.
A team runs a nightly workflow where an agent audits documentation against the current codebase. The workflow visits each module, compares the docs to the code, and proposes updates. They add a checkpoint after each module: did the proposed doc changes render correctly? Does the updated documentation still link to valid references? One night, a module rename breaks every cross-reference in the docs for that module. The checkpoint catches it, the agent fixes the references, and the remaining modules process cleanly. Without the checkpoint, broken references would have cascaded through the rest of the documentation.
Consequences
Checkpoints catch errors close to their source. A bug found at the gate where it was introduced costs minutes to fix. The same bug found five stages later costs hours, because the agent and the human reviewing the result must trace backward through layers of work to find the root cause.
The tradeoff is speed. Every gate adds verification time, and a workflow with too many checkpoints feels sluggish. The right density depends on the risk: high-stakes workflows (production deployments, data migrations, security-sensitive changes) warrant more gates. Low-stakes exploratory work can use fewer. Calibrate by asking: if this stage fails silently, how expensive is the cleanup?
Checkpoints also enable resumability. When state is saved at each gate, an interrupted workflow can pick up where it left off instead of restarting. This matters for long-running agent tasks where context window limits, API timeouts, or session boundaries would otherwise force a restart from scratch. The checkpoint becomes both a quality gate and a save point.
The discipline cost is real but front-loaded. Defining the stages, writing the gate conditions, and wiring up the state-saving happens once per workflow type. After that, every execution benefits. Teams that skip the upfront work pay the same cost in debugging time, distributed unpredictably across every run.
Related Patterns
- Depends on: Verification Loop – each checkpoint gate runs a verification cycle.
- Depends on: Plan Mode – the plan defines the stages that checkpoints enforce.
- Uses: Git Checkpoint – git commits at each gate provide rollback points for code-based workflows.
- Complements: Progress Log – the progress log records what passed each gate and what didn’t; checkpoints decide whether to proceed.
- Uses: Acceptance Criteria – acceptance criteria define what each gate checks for.
- Informed by: Feedback Sensor – the checks at each gate are feedback sensors applied at workflow boundaries.
- Complements: Worktree Isolation – ephemeral environments at each checkpoint keep failed stages from contaminating the main workspace.
- Contrasts with: Compaction – compaction manages context window limits; checkpoints manage workflow correctness.
Sources
- LangGraph’s checkpointing system (LangChain, 2024-2025) formalized the pattern for agent workflow frameworks. Every node in the execution graph writes state to a checkpointer, enabling pause, resume, replay, and human-in-the-loop review at any stage.
- The Hugging Face agentic coding implementation guide (2026) codified the principle that no long-running agent should operate without an explicit plan object with per-step verification gates.
- AWS Kiro (GA November 2025) enforced checkpoints as part of its three-phase spec workflow, requiring acceptance criteria in EARS notation at each stage boundary before the agent can advance.
- Martin Fowler’s harness engineering essays (2026) described feedforward and feedback controls that map directly to checkpoint gates: feedforward controls constrain what the agent attempts, feedback controls verify what it produced.