Context Windows Aren't Free Storage: The Case for Explicit Eviction Policies
Most engineering teams treat the LLM context window the way early web developers treated global variables: throw everything in, fix it later. The context is full of the last 40 conversation turns, three entire files from the repository, a dozen retrieved documents, and a system prompt that's grown by committee over six months. It works — until it doesn't, and by then it's hard to tell what's causing the degradation.
The context window is not heap memory. It is closer to a CPU register file: finite, expensive per unit, and its contents directly affect every computation the model performs. When you treat registers as scratch space and forget to manage them, programs crash in creative ways. When you treat context windows as scratch space, LLMs degrade silently and expensively.
Why Stuffing Everything In Feels Correct
The intuition behind context stuffing is reasonable on its face. More information should mean better answers. The model can't use what it hasn't seen. Retrieval is latency; just keep everything around. These arguments were partially valid when context windows were 4K tokens and scarcity forced discipline. As windows expanded to 128K, 200K, and beyond, the scarcity pressure evaporated — and so did the discipline.
What replaced it was a different intuition: bigger is safe. A 200K token window can hold an entire codebase. Why not? The model will figure out what's relevant.
It won't, reliably. And the consequences compound in three ways that most teams aren't measuring.
Cost scales quadratically, not linearly. Transformer self-attention is O(n²) in sequence length: doubling your context quadruples the attention compute, and an 8x increase multiplies it by 64. A system that passes 8,000 tokens of file content to answer a question pays 64x the attention cost of one that passes 1,000 tokens. (Per-token API pricing is linear, but latency and serving capacity track the underlying compute.) Teams that see a billing spike and attribute it to "more usage" are often looking at a context inflation problem dressed up as a scale problem.
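A back-of-the-envelope sketch of that scaling. This counts only the quadratic attention term; feed-forward layers scale linearly, so real end-to-end ratios land somewhere between linear and quadratic:

```python
# Rough sketch of how self-attention compute scales with context length.
# Counts only the O(n^2) attention term; feed-forward layers scale linearly,
# so treat these ratios as an upper bound on the real cost multiplier.
def attention_cost_ratio(tokens_a: int, tokens_b: int) -> float:
    """Relative self-attention compute between two context sizes."""
    return (tokens_a / tokens_b) ** 2

print(attention_cost_ratio(8_000, 1_000))   # 64.0: the 8K-vs-1K example above
print(attention_cost_ratio(16_000, 8_000))  # 4.0: doubling quadruples attention cost
```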
Quality degrades before you hit the limit. Chroma's 2025 research on "context rot", tested across 18 frontier models including GPT-4.1, Claude Opus 4, and Gemini 2.5 Pro, shows output quality degrading measurably well before the context window fills. The degradation isn't uniform: attention follows a U-shaped curve in which tokens at the very start and very end of the context receive disproportionate weight, while tokens in the middle get partially ignored. The "Lost in the Middle" study of multi-document question answering (Liu et al., 2023) found accuracy drops of 30% or more when the relevant document moved from position 1 to position 10 in a 20-document context, and GPT-3.5-Turbo sometimes performed worse with additional context than with none at all.
Effective capacity is lower than advertised. A model claiming 200K tokens has an effective reliable range closer to 120K–140K. At 60–70% of the marketed window, accuracy starts degrading in ways that are hard to detect in aggregate metrics because wrong answers look like right answers until someone checks.
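One practical consequence: derive your ceiling from the effective range, not the marketed one. A minimal sketch, where the 0.65 factor is an illustrative midpoint of the 60–70% range above, not a measured constant:

```python
# Derive a conservative per-request ceiling from the marketed window.
# The 0.65 factor is an assumed midpoint of the 60-70% range above.
def effective_context_limit(advertised_tokens: int, safety_factor: float = 0.65) -> int:
    """Conservative token ceiling below which quality holds up."""
    return int(advertised_tokens * safety_factor)

print(effective_context_limit(200_000))  # 130000
```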
None of these show up in request latency dashboards or error rate monitors. They appear as subtle quality drift — output that's technically coherent but missing the nuance, making the wrong tradeoff, or confidently ignoring a relevant constraint that was buried on page 3 of the retrieved document.
The Measurement Gap
The reason most teams absorb these costs without realizing it is that context utilization is rarely instrumented. Teams monitor token counts (often as a cost signal, not a quality signal) and log latency. Almost nobody tracks the following (a sketch of this instrumentation follows the list):
- Context utilization efficiency: what fraction of the tokens in context were referenced by the model's output? If you pass 20K tokens and the answer uses information from 500 of them, you have a 97.5% waste rate you're paying for.
- Context position of relevant content: where in the context window does the information the model actually used appear? Consistently in the middle is a warning sign.
- Quality vs. context size correlation: does output quality (as measured by evaluation, not vibes) degrade as context grows? This is the most direct signal of context rot and almost no one runs this experiment.
- Per-query context breakdown: how many tokens come from system prompt vs. conversation history vs. retrieved documents vs. tool results? Which of these grew since last quarter?
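A minimal sketch of what that instrumentation could look like. The field names are illustrative assumptions, not any particular library's schema, and estimating which tokens the output actually referenced requires an evaluation step of your own:

```python
# Minimal instrumentation sketch. Field names are illustrative assumptions;
# estimating referenced_tokens requires an evaluation step of your own
# (e.g. attributing output claims back to spans of the input context).
from dataclasses import dataclass

@dataclass
class ContextMetrics:
    system_prompt_tokens: int
    history_tokens: int
    retrieved_tokens: int
    tool_result_tokens: int
    referenced_tokens: int  # tokens the output actually drew on

    @property
    def total_tokens(self) -> int:
        return (self.system_prompt_tokens + self.history_tokens
                + self.retrieved_tokens + self.tool_result_tokens)

    @property
    def waste_rate(self) -> float:
        """Fraction of context tokens the output never used."""
        return 1.0 - self.referenced_tokens / max(self.total_tokens, 1)

m = ContextMetrics(system_prompt_tokens=2_000, history_tokens=12_000,
                   retrieved_tokens=5_500, tool_result_tokens=500,
                   referenced_tokens=500)
print(f"{m.waste_rate:.1%}")  # 97.5%: the 20K-in, 500-used example above
```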
Without this instrumentation, context stuffing is invisible. The bill goes up, quality goes down, and the diagnosis is "the model got worse" or "users changed" rather than "we started passing the entire chat history plus three support docs into every call."
Context Budgeting as Engineering Practice
The response to register scarcity is not to demand a bigger register file; it is to manage allocation explicitly, through discipline and tooling, which is exactly what a compiler's register allocator does. The same approach applies to context windows.
A context budget is a per-request allocation of token capacity, broken down by source. A concrete example for an agent handling a coding task (a sketch of enforcing it follows the list):
- System prompt: 2,000 tokens (fixed, reviewed quarterly)
- Task description and current intent: 500 tokens
- Conversation history (summarized): 1,500 tokens
- Retrieved code context: 4,000 tokens
- Tool call results from this turn: 2,000 tokens
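Those allocations total 10,000 tokens per request: an explicit ceiling, with plenty of headroom below the range where quality starts to degrade. A minimal sketch of enforcing such a budget, where whitespace splitting stands in for real tokenization (swap in your model's tokenizer, such as tiktoken, for accurate counts):

```python
# Sketch of enforcing the budget above. Whitespace splitting stands in for
# real tokenization; section names mirror the list above.
BUDGET = {
    "system_prompt": 2_000,
    "task": 500,
    "history_summary": 1_500,
    "retrieved_code": 4_000,
    "tool_results": 2_000,
}

def enforce_budget(sections: dict[str, str]) -> dict[str, str]:
    """Trim each context section to its token allocation."""
    trimmed = {}
    for name, text in sections.items():
        limit = BUDGET[name]  # KeyError on an unbudgeted section is deliberate
        tokens = text.split()  # stand-in for tokenizer output
        # Keep the tail of conversation history (recency matters most there);
        # keep the head of everything else (documents front-load relevance).
        kept = tokens[-limit:] if name == "history_summary" else tokens[:limit]
        trimmed[name] = " ".join(kept)
    return trimmed
```

The trimming rule is the eviction policy: you decide, per source, what gets dropped when the allocation is exceeded, rather than letting window overflow decide for you.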
References
- https://arxiv.org/abs/2307.03172
- https://diffray.ai/blog/context-dilution/
- https://arxiv.org/html/2601.11564v1
- https://www.morphllm.com/context-rot
- https://redis.io/blog/llm-token-optimization-speed-up-apps/
- https://aclanthology.org/2025.findings-acl.1274/
- https://introl.com/blog/long-context-llm-infrastructure-million-token-windows-guide
- https://www.getmaxim.ai/articles/reduce-llm-cost-and-latency-a-comprehensive-guide-for-2026/
