The Context Stuffing Antipattern: Why More Context Makes LLMs Worse
When 1M-token context windows shipped, many teams took it as permission to stop thinking about context design. The reasoning was intuitive: if the model can see everything, just give it everything. Dump the document. Pass the full conversation history. Forward every tool output to the next agent call. Let the model sort it out.
This is the context stuffing antipattern, and it produces a characteristic failure mode: systems that work fine in early demos, then hit a reliability ceiling in production that no amount of prompt tweaking seems to fix. Accuracy degrades on questions that should be straightforward. Answers become hedged and non-committal. Agents start hallucinating joins between documents that aren't related. The model "saw" all the right information — it just couldn't find it.
The Lost-in-the-Middle Problem Is Structural
The reliability ceiling has a well-documented cause. Research on how language models actually use long contexts found a consistent U-shaped performance curve: models attend strongly to information at the beginning and end of their context window, and attention degrades sharply for content in the middle.
This isn't a bug in any specific model — it reflects how transformer architectures work. Positional encoding schemes create a primacy bias toward early tokens and a recency bias toward late tokens. At 100K tokens, attention between tokens separated by 20,000 to 80,000 positions is so diluted that the model functionally ignores large stretches of content. Multi-document QA experiments show accuracy dropping 30%+ when the relevant document moves from position 1 to position 10 in a 20-document context. The information was there. The model just couldn't retrieve it.
A 2025 benchmark testing 18 frontier models across increasing input lengths confirmed that every single model shows accuracy degradation as context grows. Not most models — all of them. Some hold steady until a threshold and then nosedive. Others degrade gradually from the start. Claude Sonnet showed the most graceful degradation curve, staying under 5% accuracy drop across its full 200K range. Most others are reliable to about 60-70% of their advertised context window, not 100%.
The advertised context window is a capacity limit, not a performance guarantee.
Why You Can't Measure the Damage Until It's Too Late
The failure mode is insidious because it doesn't surface on your evaluation set. Standard evals test whether the model can answer a question — they rarely test position sensitivity. If your eval set includes 50 questions and the relevant information always lands in the first or last 20% of context, you'll get high scores while your production system quietly degrades on users whose queries land in the middle.
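One way to close that blind spot is to make position an explicit variable in the eval itself: rotate the gold document through every slot in the context and score accuracy per position. The sketch below assumes a hypothetical `ask_model(query, docs)` callable wrapping your actual model and a `score(answer, expected)` grader; both names are placeholders, not any real API.

```python
# Sketch of a position-sensitivity eval. `ask_model` and `score` are
# hypothetical stand-ins for your model call and answer grader.
def build_position_variants(gold_doc, distractors):
    """Yield (position, docs) with the gold document placed at each slot."""
    n_slots = len(distractors) + 1
    for pos in range(n_slots):
        docs = distractors[:pos] + [gold_doc] + distractors[pos:]
        yield pos, docs

def position_accuracy(questions, ask_model, score):
    """questions: list of (query, gold_doc, distractors, expected).
    Returns accuracy keyed by the gold document's position."""
    hits = {}  # position -> (correct, total)
    for query, gold, distractors, expected in questions:
        for pos, docs in build_position_variants(gold, distractors):
            answer = ask_model(query, docs)
            correct, total = hits.get(pos, (0, 0))
            hits[pos] = (correct + score(answer, expected), total + 1)
    return {pos: c / t for pos, (c, t) in hits.items()}
```

A flat curve across positions means your context assembly is robust; a U-shape means your eval scores are only as good as your luck in document ordering.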
A few behavioral signals indicate context stuffing is hurting you before you've set up the right metrics:
Increased hedging. The model starts adding qualifiers — "based on the information provided," "I'm not certain but" — on queries where you'd expect confident answers. This is often a sign the model is pattern-matching uncertainty rather than retrieving a clear answer from a cluttered context.
Token bloat with no accuracy improvement. One study compared a context-stuffed system with a focused retrieval approach on identical queries. The stuffed version consumed 3,729 tokens. The retrieval version used 67 tokens. Same answer. When adding more context stops improving output quality, you've passed the saturation point.
Latency inflation. A 70B parameter model showed 719% latency increase when serving stuffed versus curated contexts. If your time-to-first-token climbs as users have longer sessions, context growth is the probable cause.
Sub-agent confusion in multi-agent chains. If a root agent passes its full 50K-token conversation history to a sub-agent, and that sub-agent does the same to its child, you can easily reach 150K tokens of context in three hops — most of it irrelevant to the leaf-level task. Multi-agent systems compound context bloat at every delegation hop.
The right measurement framework tracks the token-to-answer ratio (tokens sent per unit of answer quality) and accuracy as a function of where the relevant document sits in the context. Most teams only start tracking the latter after they've already hit the ceiling.
What Budget-Aware Context Curation Actually Looks Like
The alternative to context stuffing isn't just "use less context." It's deliberately allocating your context window like a budget with a return expectation on every token.
Relevance filtering before loading. Don't retrieve first and then hope the model ignores the irrelevant chunks. Filter before sending. Run a semantic similarity pass against the user's query and exclude documents that fall below a relevance threshold. If you're doing RAG, your pre-retrieval step should cut the candidate pool significantly before anything reaches the model.
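A minimal sketch of that pre-load filter, assuming you already have embeddings for the query and each candidate chunk (from whatever embedding model you use). The 0.75 threshold is an arbitrary illustration; tune it against your own retrieval evals.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filter_by_relevance(query_emb, chunks, threshold=0.75):
    """chunks: list of (text, embedding) pairs. Drop everything below
    the threshold BEFORE it reaches the model; return survivors
    most-relevant first."""
    scored = [(cosine(query_emb, emb), text) for text, emb in chunks]
    kept = [(s, t) for s, t in scored if s >= threshold]
    kept.sort(reverse=True)
    return [t for _, t in kept]
```

The point is the ordering of operations: scoring and cutting happen in your pipeline, so the model never has to "ignore" anything.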
Position-aware document ordering. Given the U-shaped attention curve, put your most critical content at the start and end of context, not in the middle. If you have five retrieved documents, the most relevant one should be first. The second most relevant should be last. Don't sort by recency or arbitrary retrieval score without considering where in the window each chunk will land.
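That placement rule can be implemented as a simple re-ordering pass over documents already sorted by relevance — a sketch, not a library function:

```python
def order_for_attention(docs_by_relevance):
    """Place the most relevant documents at the edges of the context
    and push the least relevant toward the middle, the inverse of the
    U-shaped attention curve. Input must be sorted best-first."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

For five documents ranked 1 (best) through 5, this yields the order 1, 3, 5, 4, 2: rank 1 opens the context, rank 2 closes it, and rank 5 lands in the attention trough where it costs the least.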
Extractive summarization for tool outputs. Raw tool outputs — JSON API responses, database records, intermediate computation results — are notoriously noisy. Before passing a tool result back into context, run an extraction step that pulls only the fields relevant to the current task. A REST API response with 40 fields, where only 3 are relevant, shouldn't be passed wholesale. Summarize it, or extract the relevant keys explicitly.
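For structured responses, the extraction step can be as simple as a field whitelist applied before the payload enters context. The response shape below is invented for illustration:

```python
def extract_fields(payload, keep):
    """Forward only whitelisted fields from a raw tool response."""
    return {k: payload[k] for k in keep if k in payload}

# Hypothetical API response: many fields, three of which matter here.
raw = {
    "id": "ord_991", "status": "shipped", "eta_days": 2,
    "warehouse_id": "W7", "internal_notes": "carrier swap approved",
    "audit_log": [],
}
context_fragment = extract_fields(raw, ["id", "status", "eta_days"])
```

For unstructured or free-text tool outputs, the equivalent step is an LLM summarization call with an instruction like "extract only the facts relevant to <task>" — more expensive, same principle.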
Sliding windows for conversation history. Conversation memory is the fastest path to context bloat. A session with 30 turns can easily accumulate 20K tokens before the user has asked anything particularly complex. The mitigation is a sliding window: keep the last N turns verbatim, and summarize everything older into a compact state representation. Key decisions, open questions, and confirmed facts get preserved. Filler and repetition get discarded.
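A sliding window over a turn list might look like the following sketch. In practice `summarize` would be an LLM call prompted to preserve decisions, open questions, and confirmed facts; here it's a trivial placeholder so the structure is visible.

```python
def compact_history(turns, keep_last=6, summarize=None):
    """Keep the last `keep_last` turns verbatim; fold everything older
    into a single summary turn. `summarize` would normally be an LLM
    call; the default is a placeholder."""
    if len(turns) <= keep_last:
        return list(turns)
    older, recent = turns[:-keep_last], turns[-keep_last:]
    if summarize is None:
        summarize = lambda ts: f"[{len(ts)} earlier turns compacted]"
    summary_turn = {"role": "system",
                    "content": "Conversation so far: " + summarize(older)}
    return [summary_turn] + recent
```

The window size and the summarization prompt are the tuning knobs: too small a window loses recent nuance, too large defeats the purpose.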
Token zone allocation. Treat the context window as four zones: system prompt, few-shot examples, user query plus retrieved content, and response buffer. Allocate explicit token budgets to each zone and enforce them programmatically. Never fill the context window to capacity — reserve at least 20-30% for the model's output. A 200K context window should be planned to max out around 150K of input.
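Enforcing the zones programmatically can be as simple as a budget table plus a guard at context-assembly time. The specific numbers below are illustrative assumptions for a 200K window, not recommendations from any provider, and the 4-characters-per-token estimate is a crude heuristic — substitute your tokenizer's real count.

```python
# Illustrative zone budgets for a 200K-token window (assumed split).
CONTEXT_CAPACITY = 200_000
ZONE_BUDGETS = {
    "system_prompt": 3_000,
    "few_shot_examples": 7_000,
    "query_and_retrieved": 140_000,
    "response_reserve": 50_000,  # ~25% held back for the model's output
}
assert sum(ZONE_BUDGETS.values()) == CONTEXT_CAPACITY

def estimate_tokens(text):
    """Crude heuristic: ~4 characters per token. Use a real tokenizer
    in production."""
    return len(text) // 4 + 1

def check_zones(zones):
    """zones: zone name -> assembled text. Raise if any zone busts
    its budget, so overflows fail loudly instead of silently degrading."""
    for name, text in zones.items():
        used = estimate_tokens(text)
        if used > ZONE_BUDGETS[name]:
            raise ValueError(f"{name}: {used} tokens exceeds "
                             f"budget {ZONE_BUDGETS[name]}")
```

Failing loudly at assembly time is the point: a silent overflow just pushes content into the attention trough, where you won't see the damage until eval scores slip.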
Multi-Agent Systems Require Explicit Context Contracts
Standard context management is hard enough. Multi-agent systems make it harder because every agent hand-off is an opportunity for context to expand without oversight.
The pattern that fails: Agent A accumulates 30K tokens of context over a long task. It spawns Agent B, passing its full conversation history "so B has all the context." B does the same when it spawns C. Each agent has nominally complete information but practically can't use any of it — and you're paying for all of it.
The pattern that works: define explicit context contracts for each agent boundary. When a root agent delegates a task to a sub-agent, it should pass only what that sub-agent needs for its specific task: a task description, the relevant subset of retrieved data, and any output constraints. Not the full history. The analogy is function arguments: you wouldn't pass your entire program state to a function that only needs two parameters.
The implementation is a summarization step at each delegation boundary. Before spawning a sub-agent, the parent produces a structured handoff — goal, constraints, available data (filtered to what's actually needed). The sub-agent starts with a clean, focused context rather than inheriting a bloated conversation. This adds a step but reliably avoids the exponential blowup.
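One way to make that contract concrete is a small handoff structure built from the parent's working state — a sketch with invented field names, not any framework's API:

```python
from dataclasses import dataclass

@dataclass
class Handoff:
    """Explicit context contract for a delegation boundary.
    Field names are illustrative, not from any agent framework."""
    goal: str
    constraints: list
    data: dict  # pre-filtered to what the sub-agent actually needs

def make_handoff(parent_state, goal, needed_keys, constraints=None):
    """Build a sub-agent handoff from the parent's working state,
    forwarding only the named data keys — never the full history."""
    return Handoff(
        goal=goal,
        constraints=list(constraints or []),
        data={k: parent_state[k] for k in needed_keys if k in parent_state},
    )
```

The type signature does the enforcement: because `Handoff` has no field for conversation history, passing it wholesale stops being the path of least resistance.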
The Cost Asymmetry You're Missing
Reliability aside, context stuffing has a direct economic cost that scales with usage. Processing 1M tokens costs between $2 and $15 on frontier models depending on provider and tier. At that rate, the difference between a 67-token focused retrieval and a 3,729-token stuffed context is roughly 55× the per-query cost. On a system handling 10,000 queries per day, that compounds quickly.
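The arithmetic, using the token counts quoted above. The $3-per-million-token input price is an assumption within the stated range; substitute your provider's actual rate.

```python
# Back-of-envelope cost comparison. The input price is an assumption.
PRICE_PER_TOKEN = 3.00 / 1_000_000  # $3 per 1M input tokens (assumed)

stuffed_tokens, focused_tokens = 3_729, 67
queries_per_day = 10_000

ratio = stuffed_tokens / focused_tokens  # per-query token ratio
stuffed_daily = stuffed_tokens * queries_per_day * PRICE_PER_TOKEN
focused_daily = focused_tokens * queries_per_day * PRICE_PER_TOKEN
print(f"{ratio:.1f}x per query: "
      f"${stuffed_daily:.2f}/day vs ${focused_daily:.2f}/day")
```

At that assumed rate the stuffed pipeline runs roughly $112 a day against about $2 for the focused one — on a single system, before any growth in traffic.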
A rough heuristic from empirical cost comparisons is a 5K-token threshold: below roughly 5K tokens of context per query, the engineering cost of building good retrieval may exceed the savings from token efficiency. Above it, focused retrieval is almost always more economical, and usually more accurate.
The business case for context curation isn't just about reliability. It's about the cost structure of your system at scale.
Where to Start If Your System Already Has This Problem
The context stuffing antipattern is usually layered throughout a codebase by the time someone notices it. Rather than trying to fix everything at once, audit from the highest-impact point:
Start by measuring your actual token usage per query, broken down by source: how many tokens come from the system prompt, retrieved documents, conversation history, and tool outputs. Tools like LangSmith, Arize, or a simple logging middleware can surface this breakdown. Look for any single source that accounts for more than 50% of your total context — that's your first target.
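If you'd rather not adopt a tracing tool yet, the logging middleware can be a few lines wrapped around your existing context assembly. A minimal sketch (the per-source names and the default token estimate are assumptions):

```python
from collections import defaultdict

class TokenLedger:
    """Minimal middleware: tally tokens per context source and flag
    any source that dominates the window."""
    def __init__(self, estimate=lambda t: len(t) // 4 + 1):
        self.estimate = estimate  # swap in a real tokenizer's count
        self.usage = defaultdict(int)

    def record(self, source, text):
        """Tally and pass the text through, so this can wrap
        existing context-assembly calls unchanged."""
        self.usage[source] += self.estimate(text)
        return text

    def dominant_sources(self, threshold=0.5):
        """Sources that account for more than `threshold` of all tokens."""
        total = sum(self.usage.values()) or 1
        return [s for s, n in self.usage.items() if n / total > threshold]
```

Run it for a day of traffic and `dominant_sources()` hands you the audit's first target directly.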
Then check whether your retrieval results are filtered before loading. If you're stuffing entire documents rather than relevant passages, that's the easiest win: chunk, embed, retrieve the top-k passages, and limit k to what fits in a fraction of your context budget rather than filling it.
Finally, audit your agent hand-offs. If sub-agents are receiving full parent conversation histories, replace those with structured handoff summaries. This is usually a small code change with a large impact on both token count and reliability.
Context engineering is ultimately about treating attention as a scarce resource. Every token you put in context is competing with every other token for the model's effective attention. The teams that ship reliable LLM applications aren't the ones with the biggest context windows — they're the ones who spend the most carefully.
- https://aclanthology.org/2024.tacl-1.9/
- https://research.trychroma.com/context-rot
- https://www.morphllm.com/context-rot
- https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- https://www.marktechpost.com/2026/02/24/rag-vs-context-stuffing-why-selective-retrieval-is-more-efficient-and-reliable-than-dumping-all-data-into-the-prompt/
- https://www.copilotkit.ai/blog/rag-vs-context-window-in-gpt-4
- https://www.sitepoint.com/optimizing-token-usage-context-compression-techniques/
- https://eval.16x.engineer/blog/llm-context-management-guide
- https://oneuptime.com/blog/post/2026-01-30-context-compression/view
