The Right-Edge Accuracy Drop: Why the Last 20% of Your Context Window Is a Trap
A 200K-token context window is not a 200K-token context window. Fill it to the brim and the model you just paid for quietly becomes a worse version of itself — not at the middle, where "lost in the middle" would predict, but at the right edge, exactly where recency bias was supposed to save you. The label on the box sold you headroom; the silicon sells you a cliff.
This is a different failure mode from the one most teams have internalized. "Lost in the middle" trained a generation of prompt engineers to stuff the critical instruction at the top and the critical question at the bottom, confident that primacy and recency would carry the signal through. That heuristic silently breaks when utilization approaches the claimed window. The drop-off is not gradual, not linear, and not symmetric with how the model behaves at half-fill. Past a utilization threshold that varies by model, you are operating in a different regime, and the prompt shape that worked at 30K fails at 180K.
The economic temptation makes it worse. If you just paid for a million-token window, the pressure to use it is enormous — dump the entire repo, feed it every support ticket, hand it the quarterly filings and let it figure out what matters. That is how you get a confidently wrong answer that looks well-reasoned on the surface and disintegrates on audit.
Beyond "lost in the middle": a different failure mode
The original Lost in the Middle paper from 2023 found a U-shaped performance curve: models were reliable at the start (primacy bias) and at the end (recency bias) of a context, and sagged in between. That finding was robust enough to become prompt-engineering folklore. Put the system instruction first. Put the question last. Fill the middle with whatever you must.
What recent work shows is that the U-shape holds only while the window is less than half full. Positional Biases Shift as Inputs Approach Context Window Limits (Veseli et al., 2025) traced how the curve deforms as you push toward the advertised ceiling. Past 50% utilization, the primacy bias weakens materially. Past 80%, the curve stops looking like a U at all. What you are left with is a raw recency gradient — and even that gradient operates on a lower absolute performance floor than the short-context baseline.
Translation: you do not get to keep the "information at the end is safe" heuristic past the halfway mark. The end is still the least bad region, but "least bad" is not "good." The whole performance surface shifts down.
The Chroma Context Rot study reinforced this with a blunt empirical claim across eighteen frontier models — Claude Opus 4, Sonnet 4, GPT-4.1, GPT-4o, Gemini 2.5 Pro/Flash, Qwen3 variants, and more. Every single model degraded as input length grew. Not some. Every one. Tasks that held near-perfect accuracy at 1K tokens fell off unpredictably as the window filled. Some models dropped gently and late; some — notably Gemini variants — exhibited wild variance much earlier. Claude decayed slowest, though it sometimes substituted a refusal for an answer as contexts got large, which is its own kind of failure.
What the benchmarks actually say
Aggregate numbers are worth pulling out of the papers because the gap between marketing and reality is larger than most teams assume.
- RULER (NVIDIA) tested seventeen long-context models and found that while every model claimed at least 32K tokens of context, only four cleared the qualitative performance threshold at 32K. The rest fell below the bar well before reaching their advertised maximum. The paper coined the useful term effective context length: the point past which your model's real-task accuracy is no longer reliable.
- NoLiMa removed the literal keyword-match shortcut that needle-in-a-haystack tests accidentally reward and retested thirteen models that all claim at least 128K. At 32K, eleven of the thirteen had dropped below 50% of their short-context baseline. GPT-4o — one of the best performers — still fell from 99.3% at short context to 69.7% at 32K. Most models never come close to using the window they advertise.
- Chroma's context rot analysis showed that coherent, logically-ordered documents actually hurt retrieval accuracy more than shuffled versions of the same content, across all eighteen models tested. Coherence concentrates attention in ways that make distractors more seductive. Randomness breaks the spell.
- Long-context RAG benchmarks from Databricks showed named models degrading at very specific thresholds: Llama-3.1-405B started decaying past 32K, GPT-4-0125-preview held until 64K, and newer Claude and GPT variants pushed the knee out but did not eliminate it.
Pattern across all of these: the degradation is not gradual. Performance stays near-perfect, and then it falls. A model rated for 200K is often unreliable past roughly 130K on non-trivial tasks — roughly 65% of the advertised ceiling. A model rated for 1M is often unreliable past roughly 200K–300K. The last chunk of the advertised window is the part you are paying for that does not work.
Why the right edge specifically fails
Three mechanisms stack up to produce the right-edge drop, and understanding them is what lets you predict how your particular deployment will behave.
Attention dilution. Softmax attention normalizes to a probability distribution over all tokens. As the window fills, each individual token's weight shrinks. A single relevant sentence at position 950K competes with 999,999 tokens of distraction. The signal does not grow — the noise floor rises. The relevant chunk is still in there; the model just can't find it through the haze. This is why coherent documents hurt: coherence creates dense, plausible distractors that look like the answer.
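The arithmetic is easy to make concrete. In the toy calculation below, one token's attention logit beats every distractor by a fixed margin; the margin and logit values are invented for illustration, not measured from any real model.

```python
import numpy as np

# Illustrative softmax dilution: one "signal" token whose attention logit
# beats every distractor by a fixed margin. The values are made up for
# illustration; real attention heads are messier.
signal_logit = 6.0
distractor_logit = 0.0

for n_tokens in [1_000, 10_000, 100_000, 1_000_000]:
    logits = np.full(n_tokens, distractor_logit)
    logits[0] = signal_logit
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    print(f"{n_tokens:>9} tokens: signal gets {weights[0]:.2%} of attention")
```

The signal logit never changes; only the denominator does. That is the rising noise floor in one line of algebra.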
Positional encoding decay. Most modern open models use some form of Rotary Position Embedding (RoPE). RoPE has a built-in long-term decay: the further apart two tokens are, the weaker the positional signal tying them together. Frontier labs mitigate this with interpolation and extrapolation tricks to extend their claimed context windows, but those tricks do not cleanly extend the effective window. The extension is sometimes cosmetic: the model processes the tokens, it just cannot reliably associate them with one another.
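The decay is visible directly in RoPE's frequency schedule. A minimal sketch, assuming the standard base-10000 frequencies and all-ones query/key vectors, which reduce the rotated dot product to a sum of cosines:

```python
import numpy as np

# RoPE long-term decay, illustrated. With all-ones query/key vectors the
# rotated dot product at relative distance d reduces to 2 * sum(cos(d * theta_i)).
# head_dim and the base of 10000 follow the original RoPE convention.
head_dim = 128
theta = 10000.0 ** (-np.arange(0, head_dim, 2) / head_dim)

for dist in [1, 16, 256, 4_096, 65_536]:
    score = 2 * np.sum(np.cos(dist * theta))
    print(f"relative distance {dist:>6}: raw score {score:8.2f}")
```

The raw score oscillates, but its magnitude trends toward zero as the relative distance grows. That shrinking envelope is what the window-extension tricks are fighting, and why processing a token is not the same as associating it.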
Bias shift near the ceiling. This is the Veseli et al. finding: once you approach the limit, the primacy bias that usually protects the top of the prompt fades, and the model's effective attention collapses toward the tail. Combine that with attention dilution across the tail itself, and the last ten to twenty percent of the window, the zone recency bias was supposed to make safe, becomes far less reliable than the model's short-context behavior would suggest. The end is still relatively better than the middle, but the middle at 180K is already worse than the middle at 50K, so "better than the middle" is not a high bar.
The practical upshot: the right-edge drop is not one bug, it is three failure modes converging. You cannot fix it with a single prompt trick.
Safety margins by task type
You should have a budget in your head for how much of the advertised window you will actually use, stratified by task difficulty. The specific numbers will move as models improve, but the ordering is stable and the rough magnitudes are defensible today.
- Exact-string retrieval ("find this token"). The closest-to-perfect long-context task. Assume you can use roughly 80% of the advertised window on frontier models before accuracy starts wobbling. This is the benchmark most model cards are implicitly reporting.
- Semantic retrieval (find the passage that answers this question, when no keyword in the passage matches the question). This is what NoLiMa tests. Budget more like 40–50% of the advertised window. Many models halve their short-context accuracy well before they hit 32K on this kind of task.
- Multi-hop reasoning (combine facts from three different regions of the context). Budget 25–35%. Every hop compounds the retrieval failure rate, so errors stack multiplicatively: three hops at 90% per-hop retrieval is already 0.9³ ≈ 73% end to end. A 200K window becomes a 50K-ish working ceiling.
- Aggregation and counting ("how many times does X happen in this log"). The most punishing category. Accuracy can collapse past 10–20% of the advertised window. If you have a 1M token log to count events in, a model is probably the wrong tool — SQL or regex are what you want.
- Code understanding across large repos. Task-dependent, but assume 30–40%. Code has dense cross-references that trigger multi-hop failure modes even when the user-facing task sounds like retrieval.
These are conservative defaults for production. Move them as your evals justify. Do not move them because your vendor's marketing page moved.
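If you would rather carry these defaults as code than as folklore, a minimal sketch follows. The task names and ratios are this post's conservative defaults, not an established API; tune them against your own evals.

```python
# Conservative fractions of the advertised window to actually use,
# taken from the defaults above. Move them as your evals justify.
SAFETY_MARGINS = {
    "exact_retrieval": 0.80,
    "semantic_retrieval": 0.45,
    "multi_hop": 0.30,
    "aggregation": 0.15,
    "code_understanding": 0.35,
}

def usable_budget(advertised_window: int, task: str) -> int:
    """Tokens you should plan on using, not the tokens the vendor claims."""
    return int(advertised_window * SAFETY_MARGINS[task])

# A "200K" model leaves about 60K of working room for multi-hop reasoning.
print(usable_budget(200_000, "multi_hop"))  # -> 60000
```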
The corollary: if your real working budget on a 200K model is 50–70K of usable context for multi-hop work, it is often cheaper and more accurate to use a smaller model with aggressive retrieval than a larger model with a stuffed window. The bigger window has value as a buffer against context-management bugs, not as a replacement for context management.
Prompt restructuring when you genuinely need the window
Sometimes you do need to push into the upper half of the window — long legal documents, consolidated agent state, dense session histories. A few restructuring techniques recover usable accuracy, and they are worth applying as a matter of discipline even when you have margin.
Restate the question twice. Put the instruction at the top — so the model knows what to look for while it reads — and restate it at the very bottom, after the context. At high utilization the primacy bias has weakened, so the top-only placement you used at short context is no longer sufficient. The tail restatement exploits the one bias that stays strong near the limit.
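A sketch of the two-placement layout; the function and variable names are hypothetical, not from any framework:

```python
def build_long_context_prompt(instruction: str, documents: list[str],
                              question: str) -> str:
    # Instruction up top: the model knows what to look for while it reads.
    # Instruction and question restated at the bottom: recency is the one
    # bias that survives near the window limit.
    header = ("You are answering a question about the documents below.\n"
              f"Task: {instruction}\n\n")
    body = "\n\n".join(documents)
    footer = ("\n\nRepeating the task now that you have read everything.\n"
              f"Task: {instruction}\n"
              f"Question: {question}\nAnswer:")
    return header + body + footer
```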
Shorten the tail. If you must fill the window, keep the very last chunk — roughly the last 5–10% of tokens — reserved for small, high-precision content: the question, the schema, the output format, the short list of allowed tools. Do not let bulk documents push into the last slice. The tail is precious bandwidth; treat it as such.
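One way to enforce that discipline is to budget the tail before any document is admitted. A minimal sketch, assuming a count_tokens helper from whatever tokenizer you already use:

```python
def pack_with_reserved_tail(documents, tail_content, window,
                            count_tokens, tail_frac=0.08):
    """Fill the window with documents, but never let them spill into the
    last tail_frac of tokens, which is reserved for the question, schema,
    and output format. tail_frac=0.08 follows the 5-10% guidance above."""
    budget = window - int(window * tail_frac) - count_tokens(tail_content)
    kept, used = [], 0
    for doc in documents:
        cost = count_tokens(doc)
        if used + cost > budget:
            break  # bulk content stops here; the tail stays clean
        kept.append(doc)
        used += cost
    return "\n\n".join(kept) + "\n\n" + tail_content
```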
Break coherence deliberately. The Chroma finding that shuffled haystacks outperform coherent ones is counterintuitive and useful. If the model keeps locking onto the wrong plausible-looking passage, consider inserting clear section boundaries (--- DOCUMENT 14 ---), or even randomizing document order inside a retrieval bundle. The goal is to keep the narrative of the context from hijacking the model's attention away from the actual task.
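A sketch of the boundary-plus-shuffle treatment; the marker format mirrors the example above, and the seed keeps runs reproducible:

```python
import random

def bundle_documents(documents: list[str], seed: int = 0) -> str:
    """Wrap each document in an explicit boundary marker and shuffle the
    order, so no accidental narrative forms across documents."""
    docs = list(enumerate(documents, start=1))
    random.Random(seed).shuffle(docs)
    return "\n\n".join(f"--- DOCUMENT {i} ---\n{text}" for i, text in docs)
```

Numbering before the shuffle is deliberate: DOCUMENT 14 always refers to the same source document, so citations in the model's answer stay auditable.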
Prefer structured over prose. At long contexts, JSON with labeled fields, markdown tables, and XML-tagged sections meaningfully outperform plain prose of the same information density. The structural tokens act as anchors that the attention mechanism can latch onto when the positional signal is decaying. This effect is larger at 100K than at 10K.
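Concretely, here is the same record in both shapes; the content is invented for illustration:

```python
# Same information, two shapes. At long context the tagged version gives
# attention explicit anchors; the prose version leans on decaying position.
prose = "Invoice 4417 from Acme for $12,300 was disputed on March 3."

structured = """<invoice>
  <id>4417</id>
  <vendor>Acme</vendor>
  <amount_usd>12300</amount_usd>
  <status>disputed</status>
  <status_date>2025-03-03</status_date>
</invoice>"""
```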
Summarize and re-summarize recursively. For multi-turn agent loops, do not let raw history grow without bound. Compact older turns into terse state descriptions. The tokens you save move your working region back into the left half of the window, where the whole performance surface is higher. This is what Anthropic's context engineering guidance is really about: the goal is not to use the window, it is to keep your working region in the part of the window that works.
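In code, the compaction step might look like the sketch below, where llm_summarize stands in for whatever summarization call you have; it is a hypothetical function, not a real API. Run it every turn and older summaries get folded in and re-summarized as they age.

```python
def compact_history(turns, count_tokens, llm_summarize,
                    keep_recent=6, budget=40_000):
    """Keep the most recent turns verbatim and fold everything older into
    one terse state summary whenever history outgrows its budget. The
    budget should sit well inside the left half of your effective window."""
    if sum(count_tokens(t) for t in turns) <= budget or len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = llm_summarize("\n\n".join(old))  # hypothetical call
    return [f"[compacted state of {len(old)} earlier turns]\n{summary}"] + recent
```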
Run an eval on your actual task at three lengths. Short (1K–4K), medium (25K–50K), and long (80% of claimed window). If you cannot explain the knee in your own accuracy curve, you do not know where your effective context ends, and any budget number in this post is just someone else's guess applied to your problem.
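A skeleton for that three-point eval; ask_model, count_tokens, and the filler corpus are placeholders you would supply, and the lengths follow the buckets above:

```python
def eval_at_lengths(cases, ask_model, count_tokens, filler,
                    window=200_000, lengths=(4_000, 50_000, None)):
    """Run the same cases at short, medium, and long context by padding
    each prompt with irrelevant filler. `None` means 80% of the claimed
    window. ask_model(prompt) -> str and count_tokens(text) -> int are
    whatever client and tokenizer you already use.
    cases: list of (question, needle_passage, expected_answer)."""
    results = {}
    for target in lengths:
        target = target or int(window * 0.8)
        correct = 0
        for question, needle, expected in cases:
            pad = filler
            while count_tokens(pad) < target:
                pad += "\n" + filler
            mid = len(pad) // 2  # bury the needle mid-context
            prompt = (f"{pad[:mid]}\n{needle}\n{pad[mid:]}\n\n"
                      f"Question: {question}\nAnswer:")
            correct += expected.lower() in ask_model(prompt).lower()
        results[target] = correct / len(cases)
    return results  # the knee is wherever accuracy drops between points
```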
The takeaway
Context windows grew roughly 30x per year between mid-2023 and mid-2025. That is genuinely useful — 1M-token windows exist and do things that 4K-token windows could not. But treating the advertised ceiling as your usable ceiling is the most common and most expensive mistake in production LLM work right now. The label tells you what the model processes; the effective context tells you what it uses.
Budget accordingly. Measure your own knee. Keep the tail of your prompt clean and precious. And if you find yourself about to feed a 900K-token blob into a model because the invoice says you paid for it — remember that the last 20% of the window is the part most likely to lie to you.
- Veseli et al., Positional Biases Shift as Inputs Approach Context Window Limits: https://arxiv.org/abs/2508.07479
- NoLiMa: Long-Context Evaluation Beyond Literal Matching: https://arxiv.org/abs/2502.05167
- RULER: What's the Real Context Size of Your Long-Context Language Models?: https://arxiv.org/abs/2404.06654
- Chroma, Context Rot: https://www.trychroma.com/research/context-rot
- Lost in the Middle: https://arxiv.org/abs/2307.03172
- Anthropic, Effective Context Engineering for AI Agents: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- Databricks, Long Context RAG Performance of LLMs: https://www.databricks.com/blog/long-context-rag-performance-llms
