Model

Concept

A foundational idea to recognize and understand.

Context

At the agentic level, the model is the foundation everything else rests on. A model (specifically, a large language model or LLM) is the inference engine that powers agents, coding assistants, and every other agentic workflow. When you interact with an AI coding assistant, the model is the part that reads your prompt, processes it within a context window, and produces a response.

Understanding what a model is and isn’t helps you work with it effectively. A model isn’t a database, a search engine, or a compiler. At its foundation, it’s a neural network trained on vast amounts of text and code that has learned statistical patterns in language. But that undersells what modern models actually do. Frontier models decompose multi-step problems, plan solutions, self-correct when they notice errors, and generate working code for tasks they’ve never seen expressed in exactly that form. The “just predicts the next word” framing is like saying a chess engine “just evaluates board positions.” Technically accurate, practically misleading.

Problem

How do you develop an accurate mental model of the model itself, so you can anticipate its strengths and weaknesses when directing it?

People new to agentic coding often treat the model as either a magic oracle (it knows everything) or a simple autocomplete (it just predicts the next word). Both framings lead to poor results. The oracle framing leads to uncritical acceptance of output. The autocomplete framing leads to underusing the model’s genuine capabilities for reasoning, planning, and synthesis.

Forces

  • Fluency makes model output sound authoritative regardless of correctness.
  • Training data shapes what the model “knows,” but that knowledge has a cutoff date and reflects the biases and errors of its sources.
  • Scale gives models broad competence across languages, frameworks, and domains, but depth varies.
  • Stochasticity means the same prompt can produce different outputs on different runs. Agent harnesses often drop the temperature to near zero to reduce variance on deterministic-feeling tasks, but bit-for-bit reproducibility is rarely achievable in practice. GPU floating-point ordering, tie-breaking at the top logit, and serving-layer batching each leak small amounts of non-determinism even at temperature zero (the sampler sketch after this list shows why). As of late 2025, a known engineering recipe (batch-invariant kernels combined with deterministic serving stacks like SGLang) can deliver bit-identical output across runs, but most production APIs still do not enable it.
  • Capability spectrum means no single model is best at everything. Fast models, reasoning models, and specialized coding models each suit different tasks.
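
A toy sampler makes the stochasticity point concrete. This is an illustration, not any vendor's implementation; the function and the example logits are invented for the sketch. It shows why temperature zero means greedy decoding rather than determinism: the randomness that remains lives upstream, in the logits themselves.

```python
import math
import random

def sample_token(logits: list[float], temperature: float) -> int:
    """Toy sampler over a vocabulary, indexed by token position."""
    if temperature == 0.0:
        # Greedy decoding: sampling adds no randomness at temperature 0,
        # but if the logits themselves wobble across runs (GPU reduction
        # order, batching), a near-tie at the top can still flip.
        return max(range(len(logits)), key=lambda i: logits[i])
    weights = [math.exp(l / temperature) for l in logits]
    return random.choices(range(len(logits)), weights=weights)[0]

# A perturbation of ~1e-7, well within kernel-level noise, flips the
# greedy choice between two near-tied tokens.
print(sample_token([2.0000001, 2.0, -1.0], temperature=0.0))  # -> 0
print(sample_token([2.0, 2.0000001, -1.0], temperature=0.0))  # -> 1
```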

Solution

Think of the model as a highly capable but context-dependent collaborator. It has broad knowledge but no persistent memory across sessions (unless you provide memory mechanisms). It reasons well within its context window but can’t access information outside that window. It generates plausible output by default and correct output when given sufficient context and clear constraints.

Properties worth internalizing:

Models are stateless between calls. Each request starts fresh. The model doesn’t remember your last conversation unless previous context is explicitly included. This is why instruction files and memory patterns exist.
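
A minimal sketch of what statelessness means in practice. The message shape mirrors the common chat-API convention; `call_model` is a stub standing in for a real client, not any vendor's API.

```python
# "Memory" is just context re-sent on every call.
history: list[dict[str, str]] = []

def call_model(messages: list[dict[str, str]]) -> str:
    # A real client call would go here, e.g.
    # client.chat.completions.create(model=..., messages=messages)
    return f"(model saw {len(messages)} messages this call)"

def ask(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    reply = call_model(history)  # the model sees ONLY what is passed here
    history.append({"role": "assistant", "content": reply})
    return reply

print(ask("Rename the config loader."))   # model saw 1 message
print(ask("Now update its call sites."))  # model saw 3 messages
```

Drop `history` from the call and the model has no recollection of the prior exchange; instruction files and memory patterns are mechanisms for rebuilding that list.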

Models have knowledge cutoffs. They were trained on data up to a specific date. They don’t know about libraries released last week or APIs that changed last month. In agentic settings, tools partially compensate: an agent with web search, file reading, and documentation retrieval can look up current information rather than relying on stale training data. But the model still can’t know what it doesn’t know, so providing current documentation for recent technologies remains good practice.
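
One common compensation is simply injecting current documentation into the prompt. A minimal sketch; the helper name and delimiters are invented, the pattern is the point.

```python
from pathlib import Path

def prompt_with_docs(task: str, doc_paths: list[str]) -> str:
    """Bridge a knowledge cutoff by pasting current docs into the prompt."""
    docs = "\n\n".join(Path(p).read_text() for p in doc_paths)
    return (
        "Use ONLY the documentation below; your training data may "
        "predate this library.\n\n"
        f"=== DOCS ===\n{docs}\n=== END DOCS ===\n\n"
        f"Task: {task}"
    )
```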

Models optimize for plausibility. When uncertain, a model produces the most likely-sounding response, not an admission of uncertainty. This is why AI smells exist and why verification loops matter.
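
The antidote is easiest to see as a loop. A sketch of a minimal verification loop, assuming pytest as the external check; `generate_patch` and `apply_patch` are hypothetical stand-ins for an agent harness.

```python
import subprocess

def generate_patch(task: str) -> str:
    raise NotImplementedError  # model call: plausible output by default

def apply_patch(patch: str) -> None:
    raise NotImplementedError  # write files, or pipe through `git apply`

def tests_pass() -> bool:
    # External ground truth: plausibility ends where the test suite begins.
    return subprocess.run(["pytest", "-q"]).returncode == 0

def solve(task: str, max_attempts: int = 3) -> bool:
    for _ in range(max_attempts):
        apply_patch(generate_patch(task))
        if tests_pass():
            return True
    return False  # escalate to a human instead of trusting the output
```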

Models respond to framing. The same question asked differently produces responses of different quality. This is the entire basis of prompt engineering and context engineering.
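
An invented contrast makes this concrete: both prompts ask for the same fix, but the second supplies the constraints the model needs.

```python
# Illustrative prompts only; the contrast is the point, not the content.

# Weak framing: underspecified, invites a plausible generic answer.
vague = "Why is my code slow?"

# Strong framing: names the symptom, the scale, and the constraints.
framed = (
    "This function is quadratic on ~1M-item lists (profiler output below). "
    "Propose an O(n log n) or O(n) rewrite, keep the public signature, "
    "and explain the trade-offs."
)
```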

Models process more than text. Frontier models universally accept images alongside text. Several (including GPT-5 and Gemini 2.5) accept native audio and video as well, though support varies by vendor: Claude Opus 4.5, for example, handles text and images but not audio or video. For agentic coding, this means a model can examine screenshots of a broken UI, read diagrams and architecture sketches, inspect visual test output, and (when the chosen model supports it) listen to a developer’s recorded explanation or watch a screencast of a failing test. Multimodal input expands what you can communicate in a prompt beyond what words alone can express.
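
A sketch of attaching a screenshot to a prompt. The content-block layout below follows one common vendor pattern (base64 image blocks); exact field names vary by API, so treat the shape as illustrative.

```python
import base64
from pathlib import Path

def image_message(prompt: str, screenshot: str) -> list[dict]:
    """Build a message pairing text with a base64-encoded screenshot."""
    data = base64.b64encode(Path(screenshot).read_bytes()).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image", "source": {   # field names vary by vendor
                "type": "base64",
                "media_type": "image/png",
                "data": data,
            }},
        ],
    }]

messages = image_message("This button renders off-screen. Why?", "ui_bug.png")
```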

Models differ and the differences matter. The frontier has converged on hybrid models that combine a fast mode and an extended-thinking mode in the same model, with a router or an effort parameter selecting per call. GPT-5 has a runtime router and a reasoning_effort API knob. Claude Opus 4.5 ships hybrid reasoning with an effort parameter. Gemini 2.5 exposes a thinkingBudget. Smaller and older models still ship as separate fast and reasoning SKUs, and specialized coding models can still beat general-purpose models on cost or local-deployment constraints (though on raw capability the gap has narrowed: Claude Opus 4.5 hit 80.9% on SWE-bench Verified at launch). Matching effort to task remains a practical skill. Spending high reasoning effort on string formatting wastes time and money; using minimal effort on a tricky concurrency bug wastes attempts.
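
Matching effort to task can be as mundane as a lookup table. A sketch with an invented task taxonomy and a placeholder `call_model`; the actual knob is the vendor parameter named above (reasoning_effort, effort, or thinkingBudget).

```python
EFFORT_BY_TASK = {
    "format_string":       "minimal",   # mechanical edit: pay for speed
    "rename_symbol":       "minimal",
    "write_unit_test":     "medium",
    "concurrency_bug":     "high",      # tricky reasoning: pay for thinking
    "architecture_review": "high",
}

def call_model(prompt: str, effort: str) -> str:
    """Placeholder: translate `effort` into your vendor's parameter."""
    raise NotImplementedError

def solve(task_kind: str, prompt: str) -> str:
    effort = EFFORT_BY_TASK.get(task_kind, "medium")
    return call_model(prompt, effort=effort)
```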

How It Plays Out

A developer asks a model to implement a sorting algorithm. The model produces a clean, correct quicksort. Encouraged, the developer asks it to integrate with a proprietary internal API. The model produces confident-looking code that calls endpoints and uses data structures that don’t exist. It has no knowledge of this private API. The developer learns to provide API documentation in the context when asking for integration work.

A team uses a model to review a pull request. The model identifies a potential race condition that three human reviewers missed, because it systematically traced the concurrent access paths. The same model, in the same review, suggests a “best practice” that’s actually outdated advice from a deprecated framework. The team learns that model output requires verification even when parts of it are excellent.

Example Prompt

“I need you to integrate with our internal inventory API. Here is the full API documentation — read it before generating any code, because you won’t have training data on this private system.”

Consequences

Understanding the model’s nature lets you work with it productively rather than fighting its limitations. You learn to provide the context it needs, verify the output it produces, and choose the right model for each task.

The cost is that you must maintain a dual awareness: appreciating the model’s capabilities while remaining skeptical of any individual output. This is a cognitive skill that takes practice to develop. Over time, it becomes second nature, similar to how experienced developers learn to trust a compiler’s output while distrusting their own assumptions.

Sources

  • The concept of the large language model traces to Vaswani et al., “Attention Is All You Need” (2017), which introduced the transformer architecture underlying all modern LLMs.
  • Jason Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (2022), demonstrated that models can perform multi-step reasoning when prompted appropriately, challenging the “just predicts the next word” framing.
  • OpenAI’s release of o1 (September 2024) marked the emergence of dedicated reasoning models that spend compute on extended thinking before responding, establishing the fast-vs-reasoning model distinction as a practical concern for practitioners. The split it defined was later subsumed by hybrid models (GPT-5 in August 2025, Claude Opus 4.5 in November 2025, Gemini 2.5) that combine both modes in a single model with a runtime router or an effort dial.
  • Bartosz Mikulski, “The Temperature=0 Myth: Why Your LLM Still Isn’t Deterministic (And How to Fix It)”, explains why temperature zero gives greedy sampling rather than true determinism, and catalogs the non-determinism sources (GPU floating-point ordering, batching, mixture-of-experts routing) that persist below the sampling layer.
  • Horace He and Thinking Machines Lab, “Defeating Nondeterminism in LLM Inference” (September 2025), identified batch-invariance (not floating-point ordering) as the dominant practical cause of non-determinism in LLM inference, and shipped a companion library of batch-invariant kernels for matmul, RMSNorm, and attention that achieved bit-identical output across 1,000 runs even under dynamic batching.