The Context Stuffing Antipattern: Why More Context Makes LLMs Worse
When 1M-token context windows shipped, many teams took it as permission to stop thinking about context design. The reasoning was intuitive: if the model can see everything, just give it everything. Dump the document. Pass the full conversation history. Forward every tool output to the next agent call. Let the model sort it out.
This is the context stuffing antipattern, and it produces a characteristic failure mode: systems that work fine in early demos, then hit a reliability ceiling in production that no amount of prompt tweaking seems to fix. Accuracy degrades on questions that should be straightforward. Answers become hedged and non-committal. Agents start hallucinating joins between documents that aren't related. The model "saw" all the right information — it just couldn't find it.
The Lost-in-the-Middle Problem Is Structural
The reliability ceiling has a well-documented cause. Research on how language models actually use long contexts found a consistent U-shaped performance curve: models attend strongly to information at the beginning and end of their context window, and attention degrades sharply for content in the middle.
This isn't a bug in any specific model; it reflects how transformer architectures work. Positional encoding schemes create a primacy bias toward early tokens and a recency bias toward late tokens. In a 100K-token context, attention to content sitting roughly between tokens 20,000 and 80,000 is so diluted that the model functionally ignores large stretches of it. Multi-document QA experiments show accuracy dropping by more than 30% when the relevant document moves from position 1 to position 10 in a 20-document context. The information was there. The model just couldn't retrieve it.
A 2025 benchmark testing 18 frontier models across increasing input lengths confirmed that every single model shows accuracy degradation as context grows. Not most models — all of them. Some hold steady until a threshold and then nosedive. Others degrade gradually from the start. Claude Sonnet showed the most graceful degradation curve, staying under 5% accuracy drop across its full 200K range. Most others are reliable to about 60-70% of their advertised context window, not 100%.
The advertised context window is a capacity limit, not a performance guarantee.
Why You Can't Measure the Damage Until It's Too Late
The failure mode is insidious because it doesn't surface on your evaluation set. Standard evals test whether the model can answer a question; they rarely test position sensitivity. If your eval set includes 50 questions and the relevant information always lands in the first or last 20% of context, you'll get high scores while your production system quietly degrades for users whose relevant information lands in the middle.
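One way to test position sensitivity is to re-run each eval question with the relevant document moved through every slot in the context. The sketch below only builds those orderings; the function names are hypothetical, and the model call and scoring are left to whatever harness you already use.

```python
from typing import List


def build_position_sweep(relevant_doc: str, distractors: List[str]) -> List[List[str]]:
    """Return one document ordering per possible slot for the relevant document."""
    orderings = []
    for slot in range(len(distractors) + 1):
        # Insert the relevant document at index `slot`, keeping distractor order fixed.
        orderings.append(distractors[:slot] + [relevant_doc] + distractors[slot:])
    return orderings


def relative_position(ordering: List[str], relevant_doc: str) -> float:
    """Position of the relevant document as a fraction: 0.0 = first, 1.0 = last."""
    if len(ordering) == 1:
        return 0.0
    return ordering.index(relevant_doc) / (len(ordering) - 1)


if __name__ == "__main__":
    distractors = [f"distractor {i}" for i in range(4)]
    for ordering in build_position_sweep("relevant passage", distractors):
        frac = relative_position(ordering, "relevant passage")
        print(f"{frac:.2f}", ordering)
```

Scoring every ordering separately, rather than averaging over them, is what exposes a dip in the middle slots.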
A few behavioral signals indicate context stuffing is hurting you before you've set up the right metrics:
Increased hedging. The model starts adding qualifiers — "based on the information provided," "I'm not certain but" — on queries where you'd expect confident answers. This is often a sign the model is pattern-matching uncertainty rather than retrieving a clear answer from a cluttered context.
Token bloat with no accuracy improvement. One study compared a context-stuffed system with a focused retrieval approach on identical queries. The stuffed version consumed 3,729 tokens. The retrieval version used 67 tokens. Same answer. When adding more context stops improving output quality, you've passed the saturation point.
Latency inflation. A 70B-parameter model showed a 719% latency increase when serving stuffed versus curated contexts. If your time-to-first-token climbs as users have longer sessions, context growth is the probable cause.
Sub-agent confusion in multi-agent chains. If a root agent passes its full 50K-token conversation history to a sub-agent, and that sub-agent does the same to its child, you can easily reach 150K tokens of context in three hops, most of it irrelevant to the leaf-level task. Multi-agent systems compound context bloat with every hop.
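A rough way to see how forwarded history piles up across hops, compared with handing each sub-agent a short task brief. The growth model (each agent contributes about as much history as the root and forwards everything it received) and the 2,000-token brief are illustrative assumptions, not measurements.

```python
from typing import List


def forwarded_context_sizes(history_tokens: int, hops: int) -> List[int]:
    """Context size at each hop when every agent forwards all prior history
    plus its own (assumed to be `history_tokens` per agent)."""
    return [history_tokens * hop for hop in range(1, hops + 1)]


def scoped_context_sizes(brief_tokens: int, hops: int) -> List[int]:
    """Context size at each hop when each sub-agent receives only a task brief."""
    return [brief_tokens for _ in range(hops)]


if __name__ == "__main__":
    print(forwarded_context_sizes(50_000, 3))  # [50000, 100000, 150000]
    print(scoped_context_sizes(2_000, 3))      # [2000, 2000, 2000]
```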
The right measurement framework tracks two things: the token-to-answer ratio (tokens sent per unit of answer quality) and accuracy at different document positions in the context. Most teams start tracking the latter only after they've already hit the ceiling.
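Both metrics can be computed from ordinary eval logs. The following is a minimal sketch, assuming each logged record carries the prompt token count, the relative position of the relevant document, and a correctness flag. `EvalRecord` and the start/middle/end buckets are hypothetical, and "answer quality" is approximated here as simple correctness.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalRecord:
    prompt_tokens: int
    relative_position: float  # 0.0 = start of context, 1.0 = end
    correct: bool


def bucket_for(position: float) -> str:
    """Split the context into first 20%, last 20%, and everything between."""
    if position < 0.2:
        return "start"
    if position > 0.8:
        return "end"
    return "middle"


def accuracy_by_position(records: List[EvalRecord]) -> Dict[str, float]:
    """Fraction of correct answers per position bucket."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for record in records:
        bucket = bucket_for(record.relative_position)
        totals[bucket] += 1
        hits[bucket] += int(record.correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


def tokens_per_correct_answer(records: List[EvalRecord]) -> float:
    """Token-to-answer ratio, using correctness as a proxy for answer quality."""
    total_tokens = sum(record.prompt_tokens for record in records)
    correct = sum(int(record.correct) for record in records)
    return float("inf") if correct == 0 else total_tokens / correct
```

A "middle" bucket that scores noticeably below "start" and "end" is the position effect showing up in your own data. A rising tokens-per-correct-answer figure suggests added context is no longer paying for itself.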
What Budget-Aware Context Curation Actually Looks Like
The alternative to context stuffing isn't just "use less context." It's deliberately allocating your context window like a budget with a return expectation on every token.
Relevance filtering before loading. Don't retrieve first and then hope the model ignores the irrelevant chunks. Filter before sending. Run a semantic similarity pass against the user's query and exclude documents that fall below a relevance threshold. If you're doing RAG, your pre-retrieval step should cut the candidate pool significantly before anything reaches the model.
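The filtering step can be sketched in a few lines. This is an illustration under stated assumptions, not a production retriever: `embed` is a toy bag-of-words stand-in for a real embedding model, and the `threshold` and `max_docs` values are hypothetical defaults you would tune against your own data.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_relevant(
    query: str, docs: List[str], threshold: float = 0.2, max_docs: int = 5
) -> List[Tuple[float, str]]:
    """Keep documents scoring at or above the threshold, best first, capped at max_docs."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(doc)), doc) for doc in docs]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:max_docs]


if __name__ == "__main__":
    docs = [
        "refund policy for annual plans",
        "office holiday party schedule",
        "how to request a refund",
    ]
    for score, doc in filter_relevant("refund policy", docs):
        print(f"{score:.2f}  {doc}")
```

Only the documents returned by `filter_relevant` would be placed in the prompt. Anything below the threshold never reaches the model, so there is nothing for it to ignore.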
- https://aclanthology.org/2024.tacl-1.9/
- https://research.trychroma.com/context-rot
- https://www.morphllm.com/context-rot
- https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- https://www.marktechpost.com/2026/02/24/rag-vs-context-stuffing-why-selective-retrieval-is-more-efficient-and-reliable-than-dumping-all-data-into-the-prompt/
- https://www.copilotkit.ai/blog/rag-vs-context-window-in-gpt-4
- https://www.sitepoint.com/optimizing-token-usage-context-compression-techniques/
- https://eval.16x.engineer/blog/llm-context-management-guide
- https://oneuptime.com/blog/post/2026-01-30-context-compression/view
