Context Engineering: Why What You Feed the LLM Matters More Than How You Ask
Most LLM quality problems aren't prompt problems. They're context problems.
You spend hours crafting the perfect system prompt. You add XML tags, chain-of-thought instructions, and careful persona definitions. You test it on a handful of inputs and it looks great. Then you ship it, and two weeks later you're staring at a ticket where the agent confidently told a user the wrong account balance — because it retrieved the previous user's transaction history. The model understood the instructions perfectly. It just had the wrong inputs.
This is the core distinction between prompt engineering and context engineering. Prompt engineering asks: "How should I phrase this?" Context engineering asks: "What does the model need to know right now, and how do I make sure it gets exactly that?" One is copywriting. The other is systems architecture.
Andrej Karpathy put it well: context engineering is "the delicate art and science of filling the context window with just the right information for the next step." That framing implies something that most teams discover the hard way — if you keep refining prompts and still get inconsistent results, the model probably never received the right context in the first place.
The Four Ways Context Goes Wrong
Before looking at solutions, it helps to name the failure modes precisely. Most production LLM incidents fall into four categories:
Context poisoning: Incorrect or hallucinated information enters the context and gets reused. An agent writes a factual error to its scratchpad. It then retrieves that note in a later step and treats it as ground truth. The error compounds. By the time a human sees it, the agent has built several reasoning steps on top of a lie.
Context distraction: The context contains too much history or too many retrieved documents. The model's attention dilutes across irrelevant content. It starts echoing old patterns instead of reasoning freshly about the current state. This is common in long-running conversations or multi-step agent tasks where no compaction strategy exists.
Context confusion: The context contains too many tools, too many retrieved chunks, or overlapping documents with similar-sounding content. The model can't confidently determine which source to trust or which tool to use. It picks one somewhat arbitrarily. You see high variance in outputs across runs with identical inputs.
Context clash: Two sources in the same context say contradictory things. The agent has a retrieved document from 2023 about a product feature that was deprecated in 2024, alongside a more recent document that doesn't mention the old behavior. The model attempts to reconcile them, usually incorrectly.
Understanding which failure mode you're hitting tells you immediately what to fix. Poisoning requires validation before writing to memory. Distraction requires compaction or trimming. Confusion requires reducing scope — fewer tools, more targeted retrieval. Clash requires recency filtering and source prioritization.
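The poisoning fix in particular lends itself to a concrete guard. Here's a toy sketch of validating a claim against trusted source text before persisting it to a scratchpad — the `grounded`/`write_note` names and the token-overlap heuristic are illustrative, not from any library; a production system would use an entailment check or a verifier model.

```python
# Toy sketch: guard scratchpad writes against context poisoning by
# requiring each stored claim to overlap substantially with a trusted
# source snippet. Names and thresholds here are illustrative.

def grounded(claim: str, sources: list[str], min_overlap: float = 0.5) -> bool:
    """True if enough of the claim's tokens appear in at least one source."""
    claim_tokens = set(claim.lower().split())
    if not claim_tokens:
        return False
    for src in sources:
        src_tokens = set(src.lower().split())
        if len(claim_tokens & src_tokens) / len(claim_tokens) >= min_overlap:
            return True
    return False

scratchpad: list[str] = []

def write_note(claim: str, sources: list[str]) -> bool:
    """Persist a note only if it can be grounded; otherwise drop it."""
    if grounded(claim, sources):
        scratchpad.append(claim)
        return True
    return False
```

The point isn't the specific heuristic — it's that writes to memory get a validation gate at all, so one hallucinated note can't become next step's ground truth.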
The Four Primitives: Write, Select, Compress, Isolate
There's a clean framework for thinking about context management that maps to most real production patterns. Every context engineering decision is one of four operations:
Write — persist information outside the context window for later retrieval. This includes scratchpads agents write during task execution, long-term memory stores that persist across sessions, and structured knowledge bases. The key constraint is that not everything deserves to be written. Filter before storing. Write selectively, with explicit validity windows where appropriate.
Select — pull relevant information back into the context window at the moment it's needed. This is retrieval: embedding-based semantic search, BM25 keyword matching, knowledge graph traversal, or combinations. The goal isn't to retrieve everything potentially relevant — it's to retrieve exactly what's needed for the current step and nothing else. Research consistently shows that a focused 300-token context outperforms an unfocused 100k-token context for most tasks.
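A toy sketch of "select": score candidates against the current step's query and keep only the top k, rather than stuffing everything in. Real systems would use embeddings and/or BM25; this version uses cosine similarity over raw token counts, purely for illustration:

```python
# Toy "select" primitive: rank documents by cosine similarity over
# bag-of-words token counts and keep only the top k.
import math
from collections import Counter

def score(query: str, doc: str) -> float:
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

def select(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return only the k best-matching documents for the current step."""
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]
```

The `k` parameter is the lever: it's the difference between a focused context and an unfocused one.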
Compress — reduce what's already in the context to its essential signal. Summarize tool outputs. Drop older conversation turns. Run the agent's reasoning trace through a distillation step before passing it to the next step. Claude Code implements auto-compact at 95% context window usage — a useful engineering reference point for where to trigger compaction in your own systems. Well-implemented compression can cut token usage by up to 80% while preserving task-relevant information.
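A sketch of a compaction trigger modeled on that 95% threshold. Here `count_tokens` and `summarize` are crude stand-ins — a real system would use the model's tokenizer and an LLM summarization call:

```python
# Sketch: trigger compaction when context usage crosses a threshold,
# replacing older turns with a compressed summary. count_tokens and
# summarize are toy stand-ins for a tokenizer and a model call.

CONTEXT_LIMIT = 200_000
COMPACT_AT = 0.95  # trigger point, per the reference above

def count_tokens(text: str) -> int:
    return len(text.split())  # crude proxy for a real tokenizer

def summarize(turns: list[str], budget: int) -> str:
    # Stand-in: keep the first sentence of each turn up to the budget.
    out, used = [], 0
    for t in turns:
        head = t.split(".")[0]
        n = count_tokens(head)
        if used + n > budget:
            break
        out.append(head)
        used += n
    return " | ".join(out)

def maybe_compact(turns: list[str], limit: int = CONTEXT_LIMIT) -> list[str]:
    total = sum(count_tokens(t) for t in turns)
    if total < limit * COMPACT_AT:
        return turns
    # Replace all but the most recent turn with a compressed summary.
    return ["[summary] " + summarize(turns[:-1], budget=limit // 10), turns[-1]]
```

The structural decision — which turns survive compaction verbatim and which get summarized — matters more than the summarization quality itself.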
Isolate — split context across separate agents or processes. In a multi-agent system, each subagent handles a focused subtask with a clean context window. The subagent returns a condensed summary (typically 1,000–2,000 tokens) rather than its full working context. This prevents one task's context from contaminating another's. The tradeoff is real — multi-agent architectures can use 15x more tokens than single-agent approaches — so isolation should be the last tool you reach for, not the first.
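Structurally, isolation looks like this sketch: each subtask runs against its own fresh context, and only a capped summary flows back to the orchestrator. `run_subagent` is a stand-in for a real model call; the names are illustrative:

```python
# Sketch of context isolation: per-task contexts, capped summaries upward.

SUMMARY_CAP = 1500  # tokens, in the 1,000-2,000 range mentioned above

def run_subagent(task: str, context: list[str]) -> str:
    # Stand-in: a real implementation would call a model with `context`.
    return f"result for {task}: " + " ".join(context)

def orchestrate(tasks: dict[str, list[str]]) -> dict[str, str]:
    results = {}
    for task, task_context in tasks.items():
        # Fresh, per-task context: nothing leaks between subtasks.
        raw = run_subagent(task, task_context)
        results[task] = " ".join(raw.split()[:SUMMARY_CAP])  # cap what flows up
    return results
```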
Why Long Context Windows Don't Eliminate Retrieval
With models advertising 1M+ token context windows, some teams have concluded that RAG is obsolete. The thinking goes: if you can just put the entire knowledge base in the context, why bother with retrieval infrastructure?
The benchmarks disagree. RAG achieves an 82% win rate over direct long-context prompting in structured comparisons. Two phenomena explain why.
The first is the effective context window problem. The advertised context limit and the window over which the model actually pays attention diverge sharply as inputs grow. Research measuring the Minimum Effective Context Window (MECW) found performance can degrade to effectively 1% of the advertised limit on complex tasks, due to KV cache constraints and attention degradation at scale. Even Gemini's best models maintain only 77% accuracy at full 1M-token load.
The second is the lost-in-the-middle problem. LLMs have a U-shaped attention curve: they pay most attention to content at the beginning and end of the context, with performance dropping 30%+ for information positioned in the middle. This isn't a bug that will be patched — it's a structural property of how positional embeddings work. The implication is that dumping large volumes of retrieved content into the middle of a prompt is reliably worse than retrieving fewer, higher-quality documents.
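One common mitigation follows directly from the U-shaped curve: order retrieved documents so the highest-scoring ones land at the start and end of the prompt, pushing the weakest matches into the middle. A minimal sketch, assuming the input list is already sorted best-first:

```python
# Reorder best-first documents so top hits sit at the context's edges,
# where attention is strongest, and weak hits fall into the middle.

def edge_order(docs_by_relevance: list[str]) -> list[str]:
    """docs_by_relevance is sorted best-first; interleave toward the edges."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

For five documents ranked d1 (best) through d5 (worst), this yields d1, d3, d5, d4, d2 — the top two hits occupy the high-attention edges and the worst hit sits in the middle.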
Long context does win for tasks that require holistic document understanding, or where the relevant boundaries aren't known in advance. But for most agent tasks — where you're looking for specific facts, recent data, or user-specific information — targeted retrieval still wins on quality, cost, and latency.
Chunking Is Higher-Leverage Than Most Teams Think
If you're doing RAG, your chunking strategy affects retrieval quality more than almost any other parameter. The core tradeoff: smaller chunks give better retrieval precision but strip out surrounding context; larger chunks preserve context but produce noisier embeddings and consume more of your context budget.
A few concrete data points from 2025-2026 benchmarks:
- Recursive chunking with 512-token chunks achieved 69% accuracy across a 50-paper academic benchmark, making it the most accurate strategy tested — and it's computationally cheap since it requires no embedding at chunk time.
- Semantic chunking, despite its appeal, landed at 54% accuracy in the same benchmark, producing fragments averaging 43 tokens — too small to be useful. The computational cost is also significant: chunking a 10,000-word document semantically requires generating 200–300 embeddings just for the segmentation step.
- Adaptive chunking aligned to logical topic boundaries hit 87% accuracy in a clinical decision support study, versus 13% for fixed-size baselines. But adaptive chunking requires domain-specific segmentation logic, so the gain comes with implementation cost.
The practical heuristic: start with recursive 512-token chunking. It's the right default for most use cases. Add overlap (typically 10-20% of chunk size) to avoid splitting reasoning across boundaries. Move to adaptive or semantic approaches only when you have clear evidence that document structure matters — technical documentation, legal documents, clinical notes.
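Here's a hand-rolled sketch of that default — recursive splitting on paragraph, then sentence, then word boundaries, with greedy merging back up toward the token budget, plus an overlap pass. Libraries like LangChain ship production versions of this; `n_tokens` is a whitespace proxy for a real tokenizer:

```python
# Sketch of recursive chunking with overlap. Splits on coarse separators
# first, recursing to finer ones only when a piece exceeds the budget.

SEPARATORS = ["\n\n", ". ", " "]

def n_tokens(text: str) -> int:
    return len(text.split())  # crude proxy for a real tokenizer

def recursive_chunk(text: str, max_tokens: int = 512, seps=SEPARATORS) -> list[str]:
    if n_tokens(text) <= max_tokens:
        return [text] if text.strip() else []
    if not seps:  # separators exhausted: fall back to a hard word split
        words = text.split()
        return [" ".join(words[i:i + max_tokens])
                for i in range(0, len(words), max_tokens)]
    sep, rest = seps[0], seps[1:]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_chunk(part, max_tokens, rest))
    # Greedily merge small neighbours back up toward the token budget.
    merged, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if n_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged

def with_overlap(chunks: list[str], overlap_tokens: int = 64) -> list[str]:
    """Prepend the previous chunk's tail to each chunk (~10-20% of 512)."""
    out = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            tail = " ".join(chunks[i - 1].split()[-overlap_tokens:])
            chunk = tail + " " + chunk
        out.append(chunk)
    return out
```

The overlap pass is what prevents a chain of reasoning from being split cleanly across a chunk boundary and lost to retrieval.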
One often-overlooked option: for short, single-purpose documents like FAQs or product descriptions, no chunking at all is usually best. The overhead of retrieval against small documents isn't worth the complexity.
Memory Architecture: What Needs to Persist and What Doesn't
Agents that work well across sessions treat memory as a deliberately managed resource, not an append-only log. There are three distinct memory types, each with different persistence and retrieval characteristics:
Episodic memory captures past interactions and user preferences. Useful for personalization, avoiding repetition, and maintaining conversational coherence. Decays fastest — user preferences from three months ago may actively harm quality today.
Semantic memory stores domain knowledge, facts about the world, and structured information. Appropriate for longer-term persistence, but requires versioning: a semantic memory entry about a product's pricing that hasn't been updated becomes a liability.
Procedural memory encodes how to execute workflows — essentially learned agent behavior. The most dangerous to store incorrectly, since errors in procedural memory get executed repeatedly until someone notices.
The engineering decisions that matter most:
Filter before writing. An agent turn that consisted of "ok, noted" doesn't need a memory entry. Selective writing prevents memory pollution — the gradual accumulation of low-signal entries that dilute retrieval quality and fill context with noise.
Prune periodically. Most teams build memory write paths and never build memory expiration. The result is agents that develop stale beliefs about users, products, and systems that persist for months. Schedule garbage collection based on recency and explicit validity windows.
Compress across sessions. When a context window fills during a long task, summarize the full conversation and restart with a compressed representation. The right compression strategy maximizes recall first (preserve everything that might matter), then iterates toward precision (discard what definitively doesn't).
The System Prompt Is Architecture
System prompts aren't just instructions — they're the fixed, always-present portion of the context. Every token in the system prompt is a token taken from the budget for retrieved documents, tool outputs, and conversation history. This creates a design constraint most teams ignore until they're optimizing token costs.
The common failure modes at either extreme: too prescriptive (hardcoded logic that creates fragility and maintenance debt) or too vague (high-level guidance with no concrete signal). The right altitude is specific enough to guide behavior in known edge cases, flexible enough to generalize across inputs the system hasn't seen.
Practical guidance: organize prompts into labeled sections with XML tags or Markdown headers so both you and the model can navigate them. Start minimal. Add instructions based on observed failure modes from production, not speculation about hypotheticals. Use few-shot examples rather than exhaustive edge-case prose — for LLMs, examples are genuinely worth a thousand words.
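The section structure can be as simple as assembling labeled XML-tagged blocks. A minimal sketch — the section names and contents here are illustrative, not a prescribed schema:

```python
# Assemble a system prompt from labeled XML-tagged sections so both
# humans and the model can navigate it.

def build_system_prompt(sections: dict[str, str]) -> str:
    parts = [f"<{name}>\n{body.strip()}\n</{name}>"
             for name, body in sections.items()]
    return "\n\n".join(parts)

prompt = build_system_prompt({
    "role": "You are a billing-support assistant.",
    "constraints": "Never state an account balance without a fresh lookup.",
    "examples": "user: What's my balance?\nassistant: [calls get_balance first]",
})
```

Keeping the sections in a dict also makes the minimal-start discipline concrete: each new key added to the prompt should trace back to an observed production failure.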
The same discipline applies to tools. A bloated tool set with overlapping functionality creates context confusion. If a human engineer can't definitively choose between two tools given a specific situation, the agent can't either. When tool counts grow large, apply RAG to tool descriptions themselves — retrieve the three most relevant tools for a given subtask rather than exposing all 40.
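A toy sketch of that tool-retrieval idea: rank tools by overlap between the subtask and each description, and expose only the top k. A real system would embed the descriptions; the tool names and descriptions here are made up for illustration:

```python
# RAG over tool descriptions: surface only the k most relevant tools
# for the current subtask instead of exposing the full catalog.

def top_tools(subtask: str, tools: dict[str, str], k: int = 3) -> list[str]:
    query = set(subtask.lower().split())
    def overlap(name: str) -> int:
        return len(query & set(tools[name].lower().split()))
    return sorted(tools, key=overlap, reverse=True)[:k]

TOOLS = {
    "get_invoice": "fetch an invoice by id for billing questions",
    "refund_order": "issue a refund for an order",
    "search_docs": "search product documentation",
    "create_ticket": "open a support ticket for a human agent",
}
```

With four tools this is overkill; with forty, it's the difference between a confident tool call and a coin flip.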
Monitoring Context as Infrastructure
Most observability setups monitor outputs: response quality, latency, error rates. Few monitor context composition. This is a gap — because most quality failures originate in what went into the context, not how the model processed it.
The minimal context monitoring stack:
Track token consumption per component. Know how much of your context budget is consumed by the system prompt, conversation history, retrieved documents, and tool outputs respectively. When quality degrades, the first question is whether the context budget shifted — more history crowding out retrieved documents, for example.
Alert on retrieval quality, not just retrieval latency. A retrieval call that returns in 50ms but pulls the wrong documents is worse than one that takes 200ms and returns the right ones. Add relevance scoring to your retrieval pipeline and monitor distribution shifts.
Log full context for sampled requests. Not every request — storage is expensive — but enough to diagnose production failures. When an agent does something unexpected, the first debugging step is reading exactly what it saw.
The Gradient of Complexity
Context engineering is not a binary choice between "simple RAG" and "full multi-agent architecture." There's a progression:
Start with the simplest approach that could work: a minimal system prompt, a good model, and a single retrieval step. Many problems are solvable here. Resist the urge to build infrastructure before you understand where your actual failure modes are.
Add hybrid retrieval (semantic + BM25) once you have evidence that pure embedding-based retrieval is failing on keyword-specific queries. Hybrid consistently outperforms either approach alone.
Add compaction once context window usage becomes a constraint. Monitor this proactively — running out of context window mid-task is a hard failure.
Add multi-agent isolation only when single-agent context management becomes genuinely intractable. The token cost is real, and the coordination overhead adds latency and complexity.
The trend as models improve is worth noting: better models are better at navigating context they're given, which means less human curation of exactly what goes in. Over-engineering context management for a 2023-era model may become a liability when you upgrade. Build for today's requirements, but don't make the context management logic so rigid that it fights better models.
The goal, as one engineering team put it: "the smallest possible set of high-signal tokens that maximize the likelihood of the desired outcome." That's a harder target than it sounds, but it's the right one.
Sources

- https://weaviate.io/blog/context-engineering
- https://blog.langchain.com/context-engineering-for-agents/
- https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- https://www.elastic.co/search-labs/blog/context-engineering-vs-prompt-engineering
- https://www.firecrawl.dev/blog/context-engineering
- https://weaviate.io/blog/chunking-strategies-for-rag
- https://www.firecrawl.dev/blog/best-chunking-strategies-rag
- https://www.getmaxim.ai/articles/context-window-management-strategies-for-long-context-ai-agents-and-chatbots/
- https://www.getmaxim.ai/articles/solving-the-lost-in-the-middle-problem-advanced-rag-techniques-for-long-context-llms/
- https://byteiota.com/rag-vs-long-context-2026-retrieval-debate/
- https://pmc.ncbi.nlm.nih.gov/articles/PMC12649634/
- https://blog.bytebytego.com/p/a-guide-to-context-engineering-for
- https://redis.io/blog/context-window-management-llm-apps-developer-guide/
