
The Right-Edge Accuracy Drop: Why the Last 20% of Your Context Window Is a Trap

11 min read
Tian Pan
Software Engineer

A 200K-token context window is not a 200K-token context window. Fill it to the brim and the model you just paid for quietly becomes a worse version of itself — not at the middle, where "lost in the middle" would predict, but at the right edge, exactly where recency bias was supposed to save you. The label on the box sold you headroom; the silicon sells you a cliff.

This is a different failure mode from the one most teams have internalized. "Lost in the middle" trained a generation of prompt engineers to stuff the critical instruction at the top and the critical question at the bottom, confident that primacy and recency would carry the signal through. That heuristic silently breaks when utilization approaches the claimed window. The drop-off is not gradual, not linear, and not symmetric with how the model behaves at half-fill. Past a utilization threshold that varies by model, you are operating in a different regime, and the prompt shape that worked at 30K fails at 180K.

The economic temptation makes it worse. If you just paid for a million-token window, the pressure to use it is enormous — dump the entire repo, feed it every support ticket, hand it the quarterly filings and let it figure out what matters. That is how you get a confidently wrong answer that looks well-reasoned on the surface and disintegrates on audit.

Beyond "lost in the middle": a different failure mode

The original Lost in the Middle paper from 2023 found a U-shaped performance curve: models were reliable at the start (primacy bias) and at the end (recency bias) of a context, and sagged in between. That finding was robust enough to become prompt-engineering folklore. Put the system instruction first. Put the question last. Fill the middle with whatever you must.

What recent work shows is that the U-shape holds only while the window is less than half full. Positional Biases Shift as Inputs Approach Context Window Limits (Veseli et al., 2025) traced how the curve deforms as you push toward the advertised ceiling. Past 50% utilization, the primacy bias weakens materially. Past 80%, the curve stops looking like a U at all. What you are left with is a raw recency gradient — and even that gradient operates on a lower absolute performance floor than the short-context baseline.

Translation: you do not get to keep the "information at the end is safe" heuristic past the halfway mark. The end is still the least bad region, but "least bad" is not "good." The whole performance surface shifts down.

The Chroma Context Rot study reinforced this with a blunt empirical claim across eighteen frontier models — Claude Opus 4, Sonnet 4, GPT-4.1, GPT-4o, Gemini 2.5 Pro/Flash, Qwen3 variants, and more. Every single model degraded as input length grew. Not some. Every one. Tasks that held near-perfect accuracy at 1K tokens fell off unpredictably as the window filled. Some models dropped gently and late; some — notably Gemini variants — exhibited wild variance much earlier. Claude decayed slowest, though it sometimes substituted a refusal for an answer as contexts got large, which is its own kind of failure.

What the benchmarks actually say

Aggregate numbers are worth pulling out of the papers because the gap between marketing and reality is larger than most teams assume.

  • RULER (NVIDIA) tested seventeen long-context models and found that while every model claimed at least 32K tokens of context, only four cleared the qualitative performance threshold at 32K. The rest fell below the bar well before reaching their advertised maximum. The paper coined the useful term effective context length: the point past which your model's real-task accuracy is no longer reliable.
  • NoLiMa removed the literal keyword-match shortcut that needle-in-a-haystack tests accidentally reward and retested thirteen models that all claim at least 128K. At 32K, eleven of the thirteen had dropped below 50% of their short-context baseline. GPT-4o — one of the best performers — still fell from 99.3% at short context to 69.7% at 32K. Most models never come close to using the window they advertise.
  • Chroma's context rot analysis showed that coherent, logically-ordered documents actually hurt retrieval accuracy more than shuffled versions of the same content, across all eighteen models tested. Coherence concentrates attention in ways that make distractors more seductive. Randomness breaks the spell.
  • Long-context RAG benchmarks from Databricks showed named models degrading at very specific thresholds: Llama-3.1-405B started decaying past 32K, GPT-4-0125-preview held until 64K, and newer Claude and GPT variants pushed the knee out but did not eliminate it.

Pattern across all of these: the degradation is not gradual. Performance stays near-perfect, and then it falls. A model rated for 200K is often unreliable past roughly 130K on non-trivial tasks — roughly 65% of the advertised ceiling. A model rated for 1M is often unreliable past roughly 200K–300K. The last chunk of the advertised window is the part you are paying for that does not work.
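One way to act on that pattern is to stop treating the advertised number as the real budget. Below is a minimal guardrail sketch, assuming a ~65% effective-window ratio derived from the rough figures above; the window sizes in the table and the `check_utilization` helper are illustrative placeholders, not constants from any vendor SDK or benchmark.

```python
# Minimal guardrail sketch: treat only a fraction of the advertised window
# as dependable and flag prompts that spill past it. The 0.65 ratio and the
# per-model sizes below are illustrative assumptions, not published constants.

ADVERTISED_WINDOW = {              # tokens, per vendor marketing
    "claude-sonnet-4": 200_000,
    "gpt-4.1": 1_000_000,
    "gemini-2.5-pro": 1_000_000,
}

EFFECTIVE_RATIO = 0.65             # assume ~65% of the label is dependable


def check_utilization(model: str, prompt_tokens: int) -> None:
    advertised = ADVERTISED_WINDOW[model]
    budget = int(advertised * EFFECTIVE_RATIO)
    if prompt_tokens > advertised:
        raise ValueError(
            f"{prompt_tokens:,} tokens exceeds the hard limit "
            f"of {advertised:,} for {model}"
        )
    if prompt_tokens > budget:
        print(
            f"warning: {model} at {prompt_tokens / advertised:.0%} utilization; "
            f"accuracy past ~{budget:,} tokens is unreliable on non-trivial tasks"
        )


check_utilization("claude-sonnet-4", prompt_tokens=180_000)
```

A 180K prompt against a 200K label passes the hard limit but lands deep in the unreliable tail, which is exactly the case the benchmarks above say to avoid.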

Why the right edge specifically fails

Three mechanisms stack up to produce the right-edge drop, and understanding them is what lets you predict how your particular deployment will behave.

Attention dilution. Softmax attention normalizes to a probability distribution over all tokens. As the window fills, each individual token's weight shrinks. A single relevant sentence at position 950K competes with 999,999 tokens of distraction. The signal does not grow — the noise floor rises. The relevant chunk is still in there; the model just can't find it through the haze. This is why coherent documents hurt: coherence creates dense, plausible distractors that look like the answer.
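A toy calculation makes the dilution concrete. Nothing below models a real attention head; it just assumes one token with a fixed pre-softmax score and a growing crowd of distractors with a lower fixed score, and shows how the normalization alone erodes the relevant token's share.

```python
import math


def relevant_token_weight(n_distractors: int,
                          relevant_logit: float = 4.0,
                          distractor_logit: float = 0.0) -> float:
    """Softmax weight of one relevant token competing with n_distractors,
    assuming fixed pre-softmax scores. Real attention scores vary per head
    and per query; this only illustrates the normalization effect."""
    relevant = math.exp(relevant_logit)
    noise = n_distractors * math.exp(distractor_logit)
    return relevant / (relevant + noise)


for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,d} distractors -> weight {relevant_token_weight(n):.5f}")
# The relevant token's score never changed, but its share of the attention
# budget collapses as the denominator grows.
```

With the relevant token's score held constant, its weight falls from roughly 5% against a thousand distractors to roughly 0.005% against a million: the signal did not change, the denominator did.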

Positional encoding decay. Most modern open models use some form of Rotary Position Embedding (RoPE). RoPE has a built-in long-term decay: the further apart two tokens are, the weaker the positional signal tying them together. Frontier labs mitigate this with interpolation and extrapolation tricks to extend their claimed context windows, but those tricks do not cleanly extend the effective window. The extension is sometimes cosmetic: the model processes the tokens, but it cannot reliably associate them with one another.
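To see the decay directly, the sketch below applies textbook RoPE (base 10000, pairwise rotations) to an identical query and key and scores them at growing relative offsets. It is a minimal reimplementation for illustration, skipping the learned projections a real model applies, and the decay it shows is oscillatory rather than monotonic; the point is that the same content scores progressively weaker as the positional gap widens.

```python
import numpy as np


def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply Rotary Position Embedding to a vector at position `pos`.
    Pairs of dimensions (2i, 2i+1) are rotated by pos * base**(-2i/d),
    the standard RoPE formulation; no batching, no learned projections."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)   # theta_i
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out


rng = np.random.default_rng(0)
d = 128
content = rng.standard_normal(d)       # identical content for query and key,
                                       # so only the relative offset varies

for dist in (0, 64, 1_024, 16_384, 131_072):
    q = rope_rotate(content, pos=0)
    k = rope_rotate(content, pos=dist)
    print(f"relative distance {dist:>7,d}: score {float(q @ k):8.2f}")
# Same vector at both positions; the raw attention score decays toward
# zero-mean noise as the positional gap grows.
```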
