
The Middle-Context Blindness Your Retrieval Pipeline Never Measured

8 min read
Tian Pan
Software Engineer

The retrieval logs are clean. Recall@10 against your hand-labeled query set has not regressed in months. The answer-quality dashboard says faithfulness is holding above 90%. Then a customer pastes a question into your support agent, the gold passage is right there at position 7 of 12 in the assembled prompt, and the model answers as if it were never retrieved.

The retrieval team will tell you the chunk was there. The prompt team will tell you the prompt was correct. Both are technically right. The model attended to the first thousand tokens, attended to the last thousand tokens, and skimmed the middle band where the answer lived. Your pipeline is hitting a positional attention bias that neither team owns, neither dashboard tracks, and neither benchmark catches.

The U-shaped curve no one drew on your dashboard

The "lost in the middle" effect was named by Liu et al. in 2023. The setup was simple: place a single answer-bearing document somewhere in a list of 10, 20, or 30 documents and ask a frontier model to answer from it. Accuracy plotted against the gold document's position formed a U: strong at position 1, strong at the final position, sagging in between. A model with a nominal 16K window was answering as if it had a 4K window with a deaf zone in the middle.
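The sweep itself is easy to reproduce against your own model. The following is a minimal sketch of the input side only, assuming one gold document and a list of distractors; `position_sweep`, `format_prompt`, and the "Document [n]" layout are illustrative names and choices, not the paper's code.

```python
from typing import Iterator


def position_sweep(gold: str, distractors: list[str]) -> Iterator[tuple[int, list[str]]]:
    """Yield (gold_slot, documents) with the gold document placed at every slot in turn."""
    for slot in range(len(distractors) + 1):
        yield slot, distractors[:slot] + [gold] + distractors[slot:]


def format_prompt(documents: list[str], question: str) -> str:
    """Render a numbered document list followed by the question (hypothetical layout)."""
    body = "\n".join(f"Document [{i + 1}]: {doc}" for i, doc in enumerate(documents))
    return f"{body}\n\nQuestion: {question}\nAnswer:"
```

Send each prompt to the model, grade the answer with whatever grader you already trust, and plot accuracy against `gold_slot`. The shape of that plot is the curve described above.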

Three years on, the curve is still there. Recent benchmarks across 128K+ context models show the same pattern, sometimes flattened, never gone. The usual explanation is structural: causal masking concentrates attention mass at the start of the sequence, and training on mostly shorter sequences pulls attention back toward the tail. The middle band is where the math weakens, not a region the model chooses to ignore.

Your RAG pipeline is built on the opposite assumption. Retrieval scores treat all slots in the assembled prompt as equivalent. The reranker maximizes relevance, then the assembler appends the passages in score order, or sometimes in original-document order. The system prompt sits at the top. The user question sits at the bottom. When passages are laid out in original-document order, the top-1 passage by retrieval score very often lands two or three slots into the middle band, where the curve says the model's attention has thinned out.

The retrieval scored that passage. The model answered from a different one.
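One way to see where a passage actually lands is to log its span as a fraction of the assembled prompt. This is a rough sketch, assuming the prompt is laid out as system prompt, then passages, then question, and using whitespace splitting as a stand-in for your model's tokenizer; `locate_passages` is a hypothetical helper.

```python
def approx_tokens(text: str) -> int:
    """Crude token count via whitespace splitting; swap in your real tokenizer."""
    return len(text.split())


def locate_passages(system_prompt: str, passages: list[str], question: str) -> list[tuple[float, float]]:
    """Return each passage's (start, end) as fractions of the total prompt length.

    Assumes the layout: system prompt, passages in the given order, question.
    """
    lengths = [approx_tokens(p) for p in passages]
    total = approx_tokens(system_prompt) + sum(lengths) + approx_tokens(question)
    if total == 0:
        return []
    spans = []
    cursor = approx_tokens(system_prompt)  # passages start after the system prompt
    for length in lengths:
        spans.append((cursor / total, (cursor + length) / total))
        cursor += length
    return spans
```

A passage whose span sits near 0.5 is in the band where the curve sags, whatever its retrieval score was.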

Why retrieval@10 stays healthy while answer@10 quietly drifts

A two-stage RAG pipeline gives you two natural eval surfaces. Retrieval@K asks: did the gold chunk appear anywhere in the top-K? Faithfulness asks: did every claim in the answer trace back to a retrieved chunk? Both can read green while the user experience degrades.

Retrieval@10 is computed against the candidate list, not against what the model actually used. As long as the chunk is in the top-10, retrieval is "correct" — even when the model answered from chunk 1 because it never read chunk 7. Faithfulness scores claims in the response against the retrieved set. When the model improvises a plausible-sounding answer from the high-attention passages, faithfulness still rates the response as grounded because the supporting text is in the candidate set somewhere. The fact that the supporting text is the wrong one is a semantic gap your evaluation does not measure.
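A small sketch makes the blind spot concrete. It assumes each eval record carries the ranked chunk ids, the gold chunk id, and a correctness flag from your own answer grader; `EvalRecord` and its fields are hypothetical, and "answer@K" is taken here to mean the fraction of queries where the gold chunk was retrieved in the top K and the final answer was also correct.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    retrieved_ids: list[str]  # ranked chunk ids as passed to the prompt assembler
    gold_id: str              # hand-labeled gold chunk
    answer_correct: bool      # verdict from your own answer grader (assumed to exist)


def retrieval_at_k(records: list[EvalRecord], k: int = 10) -> float:
    """Fraction of queries whose gold chunk appears anywhere in the top k."""
    if not records:
        return 0.0
    hits = sum(r.gold_id in r.retrieved_ids[:k] for r in records)
    return hits / len(records)


def answer_at_k(records: list[EvalRecord], k: int = 10) -> float:
    """Fraction of queries with the gold chunk in the top k AND a correct answer."""
    if not records:
        return 0.0
    good = sum(r.gold_id in r.retrieved_ids[:k] and r.answer_correct for r in records)
    return good / len(records)
```

`retrieval_at_k` never looks at where in the list the gold chunk sits, so it returns the same value whether the chunk is first or seventh. Only the second number moves when the model stops reading the middle.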

The first place this surfaces is in user complaints that look random. The retrieval looks correct, the prompt looks correct, the answer looks confident, and the answer is wrong. Postmortems blame the model. The actual cause is a deterministic property of the prompt assembly that no one is measuring.

What the right metric looks like

A per-position metric closes the gap. Instead of computing recall against the whole retrieved set, you compute answer accuracy as a function of where the gold chunk landed in the assembled prompt. Bucket the prompts by gold-chunk slot — top, middle, bottom — and track accuracy per bucket. If the middle-bucket accuracy is 12 points below the top and bottom buckets, the U-curve is alive in your pipeline and your reranker is feeding the model a slot it cannot read.
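The bucketing can be sketched in a few lines. This version assumes you log, for every graded query, the gold chunk's zero-based slot, the number of passage slots in the prompt, and whether the answer was correct; splitting the prompt into equal thirds is an illustrative choice, and `slot_bucket`, `accuracy_by_bucket`, and `middle_gap` are hypothetical names.

```python
from collections import defaultdict
from typing import Optional


def slot_bucket(slot: int, n_slots: int) -> str:
    """Map a zero-based slot to 'top', 'middle', or 'bottom' by equal thirds."""
    if n_slots <= 0 or not 0 <= slot < n_slots:
        raise ValueError("slot must satisfy 0 <= slot < n_slots")
    center = (slot + 0.5) / n_slots  # midpoint of the slot as a fraction of the list
    if center < 1 / 3:
        return "top"
    if center > 2 / 3:
        return "bottom"
    return "middle"


def accuracy_by_bucket(records: list[tuple[int, int, bool]]) -> dict[str, float]:
    """records: (gold_slot, n_slots, answer_correct) per graded query."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for slot, n_slots, is_correct in records:
        bucket = slot_bucket(slot, n_slots)
        totals[bucket] += 1
        correct[bucket] += int(is_correct)
    return {bucket: correct[bucket] / totals[bucket] for bucket in totals}


def middle_gap(accuracy: dict[str, float]) -> Optional[float]:
    """Accuracy lost in the middle relative to the weaker edge bucket, or None if a bucket is empty."""
    if not {"top", "middle", "bottom"} <= accuracy.keys():
        return None
    return min(accuracy["top"], accuracy["bottom"]) - accuracy["middle"]
```

With 12 slots, the gold passage at position 7 (slot 6) falls in the middle bucket. A `middle_gap` of 0.12 corresponds to the 12-point deficit described above.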
