Context Engineering: The Discipline That Matters More Than Prompting
Most engineers building LLM systems spend the first few weeks obsessing over their prompts. They A/B test phrasing, argue about whether to use XML tags or JSON, and iterate on system prompt wording until the model outputs something that looks right. Then they hit production, add real data, memory, and tool calls — and the model starts misbehaving in ways that no amount of prompt tuning can fix. The problem was never the prompt.
The real bottleneck in production LLM systems is context — what information is present in the model's input, in what order, how much of it there is, and whether it's relevant to the decision the model is about to make. Context engineering is the discipline of designing and managing that input space as a first-class system concern. It subsumes prompt engineering the same way software architecture subsumes variable naming: the smaller skill still matters, but it doesn't drive outcomes at scale.
Why Prompts Are a Subset, Not a Strategy
Prompt engineering asks: "How should I phrase this instruction?" Context engineering asks: "What does the model need to know right now to make a good decision?"
The distinction is not semantic. A language model can only work with tokens in its context window. It cannot access external state, remember previous interactions, or intuit what you forgot to tell it. When a model fails, the first question should not be "how do I rewrite the instruction?" but "did the model have the information required to succeed?"
This reframing shifts responsibility from the prompt to the system. Reliable AI comes from architecture, not from clever phrasing. If you're repeatedly adjusting prompt wording while getting inconsistent results, the model almost certainly never had adequate context in the first place.
The mental model shift is simple:
- Prompt engineering: "What should I say?"
- Context engineering: "What should the model know?"
Both matter, but they operate at different leverage points. A well-designed context makes a mediocre prompt work. A poorly designed context makes even the best prompt unreliable.
The Physics of Context Windows
To understand why context engineering is hard, you need to understand a few concrete failure modes that emerge from transformer architecture.
Context rot is the degradation of model performance as input length increases. A 2025 study tested 18 frontier models — including GPT-4.1, Claude, and Gemini — and found that every single one performed worse as context grew. The degradation is not gradual; it accelerates. Models are trained primarily on shorter sequences and have fewer specialized parameters for context-wide dependencies.
The lost-in-the-middle problem compounds this. LLMs pay the most attention to tokens at the beginning and end of their input. Information buried in the middle of a long context can suffer accuracy drops of more than 30% compared to identical information placed at the boundaries. In a 100,000-token context, you cannot assume the model processed everything evenly — it didn't.
Attention dilution is the architectural root cause. Transformer attention is quadratic: every token attends to every other token, creating billions of pairwise relationships. As context grows, that attention budget gets spread thinner and thinner. Adding irrelevant content doesn't just waste tokens — it competes for the model's limited attention capacity.
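The quadratic growth is easy to see with a little arithmetic. This snippet counts raw pairwise attention relationships at different context lengths; it is illustrative only, since production models use optimizations such as KV caching and sparse attention variants, but vanilla self-attention scores every token pair:

```python
# Rough count of pairwise attention relationships per layer.
# Illustrative arithmetic only: real inference stacks optimize this,
# but the budget being divided still grows quadratically.
def attention_pairs(tokens: int) -> int:
    return tokens * tokens  # every token attends to every token (incl. itself)

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_pairs(n):>15,} pairs")
```

At 100,000 tokens that is ten billion pairwise relationships, which is why each additional irrelevant token measurably thins the attention paid to the relevant ones.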
Distractor interference is the practical manifestation. Semantically similar but irrelevant content misleads models in ways that are hard to detect. If your context contains four documents that discuss database architecture and only one is relevant to the current query, the model may synthesize across all four, producing an answer that sounds plausible but draws from the wrong documents.
These are not bugs to be patched. They are fundamental properties of the architecture that context engineering must work around.
The Anatomy of a Production Context
A production context is not a prompt with some data appended. It is a dynamically assembled pipeline with distinct layers, each with different update frequencies and relevance criteria.
System instructions form the foundation. They should hit the right altitude — specific enough to guide behavior, flexible enough to let the model reason independently. Brittle if-else logic in system prompts creates fragile agents. Vague system prompts create unpredictable ones. The goal is a concise set of principles that constrain behavior without specifying every case.
Retrieved knowledge provides the factual grounding the model cannot have from training alone. This is where retrieval-augmented generation lives. The quality of retrieval — not the quality of the retrieval prompt — determines whether the model has what it needs. Reranking retrieved chunks and placing the most relevant at the top is not optional; it directly mitigates the lost-in-the-middle problem.
Persistent memory connects current reasoning to prior state. This includes user preferences, prior decisions, and task history. Without explicit memory management, agents either repeat mistakes or demand the user re-explain context on every interaction.
Conversation history is the most dangerous layer. It grows unboundedly, accumulates noise, and is the primary driver of context rot in production systems. Most teams don't manage it deliberately — they just append turns until the context overflows, then truncate from the top. Both choices are wrong.
Tool definitions occupy tokens that cannot hold other content. Every tool you add reduces the space available for everything else. Tool schemas should be minimal, non-overlapping, and unambiguous. Ambiguous tool sets force the model to waste context reasoning about which tool to invoke instead of reasoning about the actual task.
Just-in-Time Context vs. Pre-Loading
A common mistake is pre-loading all potentially relevant information at the start of an agent run. This approach optimizes for convenience over performance.
The alternative — just-in-time context retrieval — keeps the initial context small and loads information dynamically as the agent needs it. The agent maintains lightweight identifiers (file paths, database keys, document IDs) and fetches full content only when required. This mirrors how humans work: we don't memorize everything that might be relevant; we remember where to look.
For multi-step agents, this pattern is essential. At step one, the agent needs the task description and available tools. At step five, it needs the intermediate results from steps two through four, not the full original document set. Context should track the current state of the task, not the full history of how you got there.
Managing History: Compaction and Structured Notes
The most underengineered component in most agentic systems is conversation history management. Left unmanaged, it becomes the primary driver of context rot.
Compaction is the practice of summarizing conversation history before reaching context limits. A compaction pass reduces accumulated turns into a dense representation that preserves critical decisions, constraints, and outputs while discarding tool call details, intermediate reasoning, and redundant acknowledgments. Done well, compaction can maintain task coherence across sessions that would otherwise exceed context limits entirely.
The key to good compaction is knowing what to keep. Architectural decisions, user preferences, confirmed facts, and blocking constraints belong in the summary. The exact wording of a previous turn, intermediate API responses, and superseded hypotheses do not.
Structured note-taking is a complementary approach where agents write persistent notes outside the context window entirely — to files, databases, or scratch pads — and selectively load them on subsequent turns. This is how Claude Code works during long sessions: the model writes todo lists and notes to files, reads them at the start of each turn, and avoids filling its context with a full session history.
Sub-Agent Architectures and Clean Contexts
The cleanest solution to context pollution in complex agentic systems is architectural: give each sub-agent a focused task with a fresh context window.
An orchestrating agent decomposes a task, dispatches sub-agents with narrow scopes, and receives their condensed outputs (typically 1,000–2,000 tokens). The sub-agents do the heavy lifting in clean, uncluttered contexts. The orchestrator never accumulates the raw intermediate outputs — it works only with the summaries.
This pattern has direct parallels in software engineering. You don't call one function with the entire application state as arguments; you pass exactly what that function needs. Context engineering for multi-agent systems is the same discipline applied to token space.
State object isolation is a related technique: the agent's runtime state is designed as a structured schema, with different fields exposed to different parts of the reasoning process. Only the fields relevant to the current step enter the context window. Other fields are held outside and loaded on demand.
What This Means for How You Build
The practical implication is that context design should happen before prompt writing, not after. Before you write a single instruction, ask:
- What information does the model need at each step of this task?
- What is the maximum context length I can afford at each step, given latency and cost constraints?
- How will conversation history accumulate, and when will I compact it?
- Which tools are essential, and which can I eliminate?
- What does retrieval return today, and does the most relevant content end up where the model will actually attend to it?
Treat context as a system with its own architecture, lifecycle, and constraints — not as a string you append things to. Version control your retrieval pipelines with the same rigor you apply to code. Evaluate how your context management strategies affect task completion as sessions grow longer, not just on clean single-turn benchmarks.
The insight that unlocks production reliability is simple: the model is only as good as what you put in front of it. Prompt engineering gives you the instruction. Context engineering gives the model everything it needs to follow it.
Most systems fail not because the prompt was wrong, but because the model was asked to reason without the information required to succeed. Fix the context, and you often find the prompt was fine all along.
Sources
- https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- https://neo4j.com/blog/agentic-ai/context-engineering-vs-prompt-engineering/
- https://blog.jetbrains.com/research/2025/12/efficient-context-management/
- https://www.morphllm.com/context-rot
- https://redis.io/blog/context-rot/
- https://www.firecrawl.dev/blog/context-engineering
- https://blog.langchain.com/context-engineering-for-agents/
- https://www.comet.com/site/blog/context-engineering/
