Context Windows Aren't Free Storage: The Case for Explicit Eviction Policies
Most engineering teams treat the LLM context window the way early web developers treated global variables: throw everything in, fix it later. The context is full of the last 40 conversation turns, three entire files from the repository, a dozen retrieved documents, and a system prompt that's grown by committee over six months. It works — until it doesn't, and by then it's hard to tell what's causing the degradation.
The context window is not heap memory. It is closer to a CPU register file: finite, expensive per unit, and its contents directly affect every computation the model performs. When you treat registers as scratch space and forget to manage them, programs crash in creative ways. When you treat context windows as scratch space, LLMs degrade silently and expensively.
Why Stuffing Everything In Feels Correct
The intuition behind context stuffing is reasonable on its face. More information should mean better answers. The model can't use what it hasn't seen. Retrieval adds latency; just keep everything around. These arguments were partially valid when context windows were 4K tokens and scarcity forced discipline. As windows expanded to 128K, 200K, and beyond, the scarcity pressure evaporated — and so did the discipline.
What replaced it was a different intuition: bigger is safe. A 200K token window can hold an entire codebase. Why not? The model will figure out what's relevant.
It won't, reliably. And the consequences compound in three ways that most teams aren't measuring.
Attention cost scales quadratically, not linearly. Transformer self-attention is O(n²) in sequence length. Doubling your context doesn't double the attention compute; it quadruples it. A system that passes 8,000 tokens of file content does 64x the attention work of one that passes 1,000 tokens to answer the same question, and while API pricing is typically linear per input token, the extra tokens still land directly on the bill. Teams that see a billing spike and attribute it to "more usage" are often looking at a context inflation problem dressed up as a scale problem.
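The back-of-envelope arithmetic is easy to check. A minimal sketch of the quadratic relationship:

```python
def attention_cost_ratio(tokens_a: int, tokens_b: int) -> float:
    """Relative self-attention compute for two sequence lengths,
    assuming the O(n^2) scaling of standard transformer attention."""
    return (tokens_a / tokens_b) ** 2

# 8,000 tokens of context vs. 1,000 tokens for the same question:
print(attention_cost_ratio(8_000, 1_000))  # 64.0
```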
Quality degrades before you hit the limit. Chroma's 2025 research on "context rot" — tested across 18 frontier models including GPT-4.1, Claude Opus 4, and Gemini 2.5 Pro — shows output quality degrading measurably well before the context window fills. The degradation isn't uniform: it follows a U-shaped attention curve where tokens at the very start and very end of context receive disproportionate attention, and tokens in the middle get partially ignored. Research measuring multi-document question answering found accuracy drops of 30% or more when the relevant document moved from position 1 to position 10 in a 20-document context. GPT-3.5-Turbo sometimes performed worse with additional context than with none at all.
Effective capacity is lower than advertised. A model claiming 200K tokens often has an effective reliable range closer to 120K–140K: at 60–70% of the marketed window, accuracy starts degrading in ways that are hard to detect in aggregate metrics, because wrong answers look like right answers until someone checks.
None of these show up in request latency dashboards or error rate monitors. They appear as subtle quality drift — output that's technically coherent but missing the nuance, making the wrong tradeoff, or confidently ignoring a relevant constraint that was buried on page 3 of the retrieved document.
The Measurement Gap
The reason most teams absorb these costs without realizing it is that context utilization is rarely instrumented. Teams monitor token counts (often as a cost signal, not a quality signal), and they log latency. Almost nobody tracks:
- Context utilization efficiency: what fraction of the tokens in context were referenced by the model's output? If you pass 20K tokens and the answer uses information from 500 of them, you have a 97.5% waste rate you're paying for.
- Context position of relevant content: where in the context window does the information the model actually used appear? Consistently in the middle is a warning sign.
- Quality vs. context size correlation: does output quality (as measured by evaluation, not vibes) degrade as context grows? This is the most direct signal of context rot and almost no one runs this experiment.
- Per-query context breakdown: how many tokens come from system prompt vs. conversation history vs. retrieved documents vs. tool results? Which of these grew since last quarter?
Without this instrumentation, context stuffing is invisible. The bill goes up, quality goes down, and the diagnosis is "the model got worse" or "users changed" rather than "we started passing the entire chat history plus three support docs into every call."
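A first version of that instrumentation can be a thin logging wrapper around request assembly. A minimal sketch, where `count_tokens` is a crude stand-in for your model's real tokenizer:

```python
from dataclasses import dataclass, asdict
import json

def count_tokens(text: str) -> int:
    """Stand-in tokenizer: roughly 4 characters per token.
    Swap in your model's actual tokenizer in production."""
    return max(1, len(text) // 4)

@dataclass
class ContextBreakdown:
    """Per-request token counts by source, the breakdown worth tracking."""
    system_prompt: int
    history: int
    retrieved: int
    tool_results: int

    @property
    def total(self) -> int:
        return self.system_prompt + self.history + self.retrieved + self.tool_results

def log_breakdown(system_prompt: str, history: str,
                  retrieved: str, tools: str) -> ContextBreakdown:
    b = ContextBreakdown(
        system_prompt=count_tokens(system_prompt),
        history=count_tokens(history),
        retrieved=count_tokens(retrieved),
        tool_results=count_tokens(tools),
    )
    # One structured log line per request; aggregate offline by week.
    print(json.dumps({**asdict(b), "total": b.total}))
    return b
```

Aggregating these log lines by week is enough to answer "which category grew since last quarter" without any new infrastructure.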
Context Budgeting as Engineering Practice
The response to register scarcity in compilers is not to demand more registers — it is to manage register allocation explicitly through discipline and tooling. The same approach applies to context windows.
A context budget is a per-request allocation of token capacity broken down by source. A concrete example for an agent handling a coding task:
- System prompt: 2,000 tokens (fixed, reviewed quarterly)
- Task description and current intent: 500 tokens
- Conversation history (summarized): 1,500 tokens
- Retrieved code context: 4,000 tokens
- Tool call results from this turn: 2,000 tokens
- Reserved for model output: 2,000 tokens
- Total: 12,000 tokens
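One way to make the budget concrete in code is named constants rather than magic numbers. A sketch mirroring the breakdown above (the names and the `enforce` helper are illustrative, not a particular library's API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextBudget:
    """Per-request token allocation by source. Limits mirror the
    example breakdown above; tune per task."""
    system_prompt: int = 2_000   # fixed, reviewed quarterly
    task: int = 500              # task description and current intent
    history: int = 1_500         # conversation history, summarized
    retrieved: int = 4_000       # retrieved code context
    tool_results: int = 2_000    # tool call results from this turn
    output_reserve: int = 2_000  # reserved for model output

    @property
    def total(self) -> int:
        return (self.system_prompt + self.task + self.history
                + self.retrieved + self.tool_results + self.output_reserve)

def enforce(budget: ContextBudget, used: dict[str, int]) -> list[str]:
    """Return the categories that overran their allocation. An overrun is
    a contract violation: some other category must shrink before the call."""
    return [name for name, n in used.items() if n > getattr(budget, name)]
```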
The budget is not a ceiling — it is a contract. When one category overruns, something else must shrink. This constraint forces every content source to justify its presence.
Budgeting also exposes categories that grew silently. A system prompt that started at 500 tokens and is now 2,500 tokens after six months of "just adding one more instruction" is consuming budget that should go to retrieved context. A conversation history that isn't summarized passes raw turns indefinitely; the 40th turn adds nearly zero information over a good summary of turns 1–40 but costs the same as any other 100-token block.
Priority-Ranked Eviction
Once you have a budget, you need an eviction policy for when content exceeds it. Operating systems have developed several eviction strategies — LRU, LFU, cost-aware eviction — that translate well to context management, but with modifications specific to LLM inference.
Recency isn't always the right axis. LRU works in caches because recently accessed data is statistically likely to be accessed again. In conversation context, the most recent turns are always kept, but the oldest turns may still be highly relevant — the user's original goal, stated constraints, or prior decisions that everything else depends on. Evicting purely by recency discards the anchor while keeping the noise.
Relevance to current intent is the primary signal. The right eviction question is: given what the model is doing right now, which context items are least likely to affect the output? For a multi-step agent, the task decomposition from step 1 may be critical through step 10; intermediate tool results from step 3 may be safely summarized by step 6. Relevance-ranked eviction requires either a lightweight classifier or a simpler heuristic — distance from current semantic intent.
Compression before eviction. Before removing a context item entirely, compress it. A 2K-token tool result can often be summarized to 200 tokens without meaningful loss. Compression preserves the signal while recovering budget. Prompt compression techniques applied to retrieved documents show 50–70% token reductions while maintaining 98% verbatim accuracy on downstream tasks — and in some benchmarks, compressed context outperforms raw context at twice the length because compression forces relevance.
Hard floors for critical content. Some context items must survive any eviction: the task description, the user's latest message, and any explicit constraints the user stated. These get reserved capacity that no other category can encroach on. Without hard floors, systems have a failure mode where the context fills with retrieved documents and the model answers a question the user stopped asking three turns ago.
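The three rules above — relevance-ranked eviction, compression before eviction, hard floors — compose into a single pass over the context items. A minimal sketch under loose assumptions: `relevance` is whatever score your classifier or heuristic produces, and `summarize` stands in for a real compression step such as an LLM summarization call:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ContextItem:
    text: str
    tokens: int
    relevance: float      # 0..1, closeness to current intent
    pinned: bool = False  # hard floor: task, latest message, constraints

def summarize(item: ContextItem, ratio: float = 0.1) -> ContextItem:
    """Stand-in compressor: truncation here; an LLM or extractive
    summarizer in production."""
    n = max(1, int(item.tokens * ratio))
    return replace(item, text=item.text[: n * 4], tokens=n)

def fit_to_budget(items: list[ContextItem], budget: int) -> list[ContextItem]:
    # Least relevant, unpinned items are considered for eviction first.
    out = sorted(items, key=lambda i: (i.pinned, i.relevance))
    total = sum(i.tokens for i in out)
    for idx, item in enumerate(out):
        if total <= budget:
            break
        if item.pinned:
            continue  # hard floor survives any eviction
        compressed = summarize(item)
        if total - item.tokens + compressed.tokens <= budget:
            total += compressed.tokens - item.tokens
            out[idx] = compressed  # compression recovered enough budget
        else:
            total -= item.tokens
            out[idx] = None        # compression wasn't enough; evict
    return [i for i in out if i is not None]
```

The policy choice in the sketch — compress if that alone gets under budget, otherwise evict outright — is one of several reasonable variants; another is to compress everything evictable first and only then remove items.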
RAG Context Construction as a First-Class Concern
Retrieval-augmented systems make context management decisions implicitly with every retrieval call. The default behavior — retrieve top-k chunks and concatenate — produces high token counts and variable quality. Better approaches treat context construction as an explicit assembly step:
Retrieve more, pass less. Retrieve 10–15 candidates with a fast lexical-semantic hybrid (BM25 + embedding). Rerank to 3–5 with a cross-encoder. Pass only the reranked set to the LLM. This sequence typically cuts input tokens by 60–70% compared to naive top-k while improving precision because the reranker has a global view of relevance across candidates.
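The retrieve-then-rerank sequence is a few lines of orchestration. A sketch with deliberately naive stand-in scorers: in practice `hybrid_retrieve` would be BM25 plus an embedding index, and `cross_encoder_score` a real cross-encoder (e.g. a sentence-transformers `CrossEncoder`):

```python
def hybrid_retrieve(query: str, corpus: list[str], k: int = 15) -> list[str]:
    """Stand-in for BM25 + embedding hybrid retrieval: naive term overlap."""
    q_terms = set(query.lower().split())
    return sorted(corpus,
                  key=lambda d: len(q_terms & set(d.lower().split())),
                  reverse=True)[:k]

def cross_encoder_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder; here, Jaccard similarity of terms."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(1, len(q | d))

def build_context(query: str, corpus: list[str],
                  retrieve_k: int = 15, pass_k: int = 4) -> list[str]:
    candidates = hybrid_retrieve(query, corpus, k=retrieve_k)  # retrieve more
    reranked = sorted(candidates,
                      key=lambda d: cross_encoder_score(query, d),
                      reverse=True)
    return reranked[:pass_k]                                   # pass less
```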
Chunk for the question, not for the document. Most RAG pipelines chunk documents at fixed sizes for indexing. For retrieval, the relevant unit is smaller — often a paragraph or a few sentences around the key fact. Extracting the minimum relevant excerpt rather than the full chunk reduces noise and improves position placement.
Position relevant content deliberately. Given the U-shaped attention curve, the highest-value retrieved content should appear either at the very beginning or the very end of the assembled context, not buried in the middle. When you have five retrieved chunks of roughly equal relevance, the order matters more than most teams realize.
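Given chunks already sorted by relevance, edge placement is a small deterministic step. One possible sketch alternates the strongest chunks between the front and the back of the assembled list, leaving the weakest in the middle:

```python
def place_for_u_curve(chunks_by_relevance: list[str]) -> list[str]:
    """Order chunks so the most relevant sit at the start and end of the
    context, where the U-shaped attention curve weights them most.
    Input must be sorted most-relevant-first."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]  # back is reversed so rank 2 sits last
```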
Track retrieval precision in production. If retrieved documents are rarely referenced in model outputs — which you can measure by checking whether the output contains claims that trace back to retrieved content — the retrieval quality is low and no amount of context engineering will fix it. Retrieval precision is a separate dial from context construction, and both require instrumentation.
Getting Started Without a Rewrite
Context budgeting and eviction policies sound like infrastructure work, but the first versions are just discipline and logging:
- Audit your current context composition. For one week, log the token count breakdown by source (system prompt, history, retrieved, tools) for a sample of requests. This alone usually reveals one category that has grown out of control.
- Set a per-category budget. Based on the audit, assign token limits to each source. Make the limits visible in code — not as magic numbers but as named constants with comments explaining the tradeoff.
- Implement summarization for conversation history. Replace raw history with a rolling summary after 5–10 turns. This is the highest-leverage change for most chat-style systems and can be implemented in a day.
- Add reranking to your RAG pipeline. A cross-encoder reranker applied to top-10 retrieval candidates before passing to the LLM is a single additional API call that consistently improves quality and cuts context size.
- Run a quality-vs-context-size experiment. Sample 5% of requests, vary context size, and compare output quality. If quality peaks well below your maximum context allocation, you've found budget to reclaim.
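The rolling-summary step above can be sketched as a wrapper that keeps the last few turns verbatim and folds everything older into a single summary slot. `summarize_turns` is a stand-in for an LLM summarization call:

```python
def summarize_turns(turns: list[str]) -> str:
    """Stand-in: in production, one LLM call that compresses older turns
    into a short paragraph of goals, constraints, and decisions."""
    return "[summary of %d earlier turns]" % len(turns)

def rolling_history(turns: list[str], keep_verbatim: int = 6) -> list[str]:
    """Context-ready history: one summary of older turns, then the most
    recent turns verbatim."""
    if len(turns) <= keep_verbatim:
        return list(turns)
    older, recent = turns[:-keep_verbatim], turns[-keep_verbatim:]
    return [summarize_turns(older)] + recent
```

In a real system the summary would be refreshed incrementally rather than recomputed from scratch, but the shape of the interface is the same.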
Context windows will keep growing. The temptation to stuff more into them will grow with them. But the cost and quality dynamics don't get better just because the window is larger — they shift the degradation point further out while keeping the penalty curve the same. The teams that treat context as a scarce, explicitly managed resource will build systems that remain coherent and predictable as they scale. The teams that treat it as free storage will spend the next few years debugging quality regressions they can't locate and paying inference bills they can't explain.
- https://arxiv.org/abs/2307.03172
- https://diffray.ai/blog/context-dilution/
- https://arxiv.org/html/2601.11564v1
- https://www.morphllm.com/context-rot
- https://redis.io/blog/llm-token-optimization-speed-up-apps/
- https://aclanthology.org/2025.findings-acl.1274/
- https://introl.com/blog/long-context-llm-infrastructure-million-token-windows-guide
- https://www.getmaxim.ai/articles/reduce-llm-cost-and-latency-a-comprehensive-guide-for-2026/
