RAG Position Bias: Why Chunk Order Changes Your Answers
You've spent weeks tuning your embedding model. Your retrieval precision looks solid. Chunk size, overlap, metadata filters — all dialed in. And yet users keep reporting that the system "ignores" information it clearly has access to. The relevant passage is in the top-5 retrieved results every time. The model just doesn't seem to use it.
The culprit is often position bias: a systematic tendency for language models to over-rely on information at the beginning and end of their context window, while dramatically under-attending to content in the middle. In controlled experiments, moving a relevant passage from position 1 to position 10 in a 20-document context produces accuracy drops of 30–40 percentage points. Your retriever found the right content. The ordering killed it.
The U-Shaped Attention Curve
The phenomenon has a specific shape. Give a model 20 retrieved passages and ask it a question that one of them answers. If the relevant passage is first, accuracy is high — typically 85–95% on standard benchmarks. If it's last, accuracy is also high, around 85%. But if it's in the middle, accuracy can collapse to 40–60%.
This U-shaped curve isn't a bug in any one model. It emerges from two architectural features present in every autoregressive transformer:
Primacy bias comes from causal masking. Because every later token can attend back to the earliest tokens, early content accumulates influence over the model's internal activations throughout the sequence, making it structurally over-represented regardless of whether it is actually relevant.
Recency bias comes from residual connections and the way attention weights are computed near the end of a sequence. The last few thousand tokens remain in the model's effective "working memory" more reliably than middle sections.
The result: a model reading 20 chunks doesn't treat them equally. It reads the first few carefully, skims the middle, and catches the last few. Anything critical buried in positions 5 through 15 is structurally disadvantaged.
What This Looks Like in Production
The failure mode is subtle. Your eval metrics look fine because evals often test retrieval precision — whether the right chunk appears in the top-k results — not whether the model actually uses it once it's there. You can have 90% retrieval recall and still get badly wrong answers if the relevant chunk consistently lands in the middle.
A few scenarios where position bias shows up acutely:
Long context with many chunks. Retrieval pipelines that return 10–20 chunks to maximize recall create deep middle zones. A relevant passage at position 8 of 20 sits squarely in the lowest-attention region.
Fixed ordering based on embedding similarity. Pure cosine similarity rankings don't account for the model's positional preferences. The second or third most similar chunk might be more useful if placed first.
Multi-hop questions. When the answer requires synthesizing information from multiple passages, chunks that happen to land in the middle get under-weighted even when they contain necessary context.
Semantic search for rare but critical facts. A single highly relevant passage that your retriever ranks fourth or fifth might sit in the attention dead zone while three lower-relevance passages dominate the model's response.
The Distractor Problem Complicates Everything
Before reaching for an ordering fix, it's worth understanding what the research actually shows about production impact.
A 2025 study that examined real RAG scenarios rather than controlled lab settings found that position effects, while real, are often swamped by a different problem: distracting passages. In real retrieval results, over 60% of queries return at least one highly distracting passage — content that is topically adjacent but incorrect, confidently stated, and likely to mislead the model.
When you try to exploit primacy bias by placing your most relevant chunks first, you often elevate distracting chunks into positions 2 and 3 as well, and the gain from position optimization is cancelled by the noise you promoted. Across five different ordering strategies (sequential, inverse, random shuffle, max-relevance, and min-distraction), accuracy differences in realistic conditions were not statistically significant.
This doesn't mean position doesn't matter. It means retrieval quality dominates position effects when your retrieved set contains distractors. Fix the retrieval first. Then address ordering.
Reordering Strategies That Work
When your retrieval pipeline is solid and you're still seeing position-related degradation, these approaches produce consistent gains:
Cross-encoder reranking. A cross-encoder model reads the query and each passage together, producing a relevance score that captures semantic interaction — not just vector proximity. Unlike embedding similarity (which encodes query and passage independently), cross-encoders see the pairing. In benchmarks, reranked results reduce hallucination rates by around 35% compared to raw embedding rankings. The cost is latency: cross-encoders are slower than embedding lookups, so they're typically applied to a pre-filtered top-20 set.
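The reranking step itself reduces to scoring each query-passage pair jointly and sorting. A minimal sketch, with a toy word-overlap scorer standing in for a real cross-encoder model (the function names are illustrative, not any specific library's API):

```python
def rerank(query, passages, score_fn, top_n=5):
    """Sort passages by a joint (query, passage) score, highest first."""
    return sorted(passages, key=lambda p: score_fn(query, p), reverse=True)[:top_n]

def overlap_score(query, passage):
    """Toy stand-in for a cross-encoder: count of shared words.
    A real pipeline would run a trained cross-encoder over each pair."""
    return len(set(query.lower().split()) & set(passage.lower().split()))
```

In production the expensive joint scorer runs only over the embedding-ranked top 20 or so, which is what keeps the latency cost bounded.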
Placing high-relevance chunks at the boundaries. Given the U-shaped curve, the pragmatic intervention is mechanical: put your highest-confidence chunks first, your second-tier chunks last, and accept that the middle will be under-attended. For a 10-chunk context, positions 1–2 and 9–10 are your most valuable real estate.
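The placement itself is a few lines. A minimal sketch, assuming chunks arrive already sorted by descending relevance (the function name is illustrative):

```python
def boundary_order(ranked_chunks):
    """Alternate chunks between the front and the back of the context,
    so the best-ranked land at the edges and the worst in the middle."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For five chunks ranked c1 through c5, this yields the order c1, c3, c5, c4, c2: the two strongest chunks occupy the first and last slots, and the weakest lands in the middle.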
Reducing context length aggressively. Every passage you add to the context extends the middle zone. Cutting from 20 chunks to 6 chunks eliminates most of the attention dead zone. The tradeoff is recall, but if your retriever is precise, smaller is often better.
Instruction prompting. Explicitly telling the model to read all passages carefully before answering can partially compensate for positional bias. This works better on instruction-tuned models and degrades with model tier — smaller models respond less reliably to meta-instructions about their own attention.
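One way to wire this in is a prompt template. The wording below is only an example of such an instruction, not a tested incantation:

```python
def build_prompt(question, chunks):
    """Assemble a RAG prompt that explicitly asks the model to read
    every passage before answering (instruction wording is an example)."""
    passages = "\n\n".join(
        f"[Passage {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Read ALL of the passages below carefully before answering. "
        "Relevant information may appear anywhere, including the middle.\n\n"
        f"{passages}\n\nQuestion: {question}\nAnswer:"
    )
```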
The "Found in the Middle" Calibration Approach
A more principled mitigation emerged from research on attention score calibration. The core idea: if you can model how much of any attention score is attributable to position versus relevance, you can subtract the positional component.
The technique, applied at inference time, decomposes the attention weight the model assigns to each passage into a relevance term and a positional bias term. Removing the positional component makes the model attend to passages more uniformly. In benchmarks, this yields accuracy improvements of roughly 15 percentage points on question-answering tasks and Recall@3 gains of nearly 50 points on document ranking, without changing the model weights or the retrieval pipeline.
This approach is available as a post-processing layer on attention scores, which means it can be applied to any transformer-based LLM without fine-tuning. The practical limitation is that it requires access to intermediate attention activations, which not all inference APIs expose.
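Conceptually the calibration looks like the sketch below: estimate a per-position attention prior (for instance by measuring attention to identical dummy passages placed at each slot), subtract it, and renormalize. This is a simplified illustration of the idea, not the paper's exact procedure:

```python
import numpy as np

def calibrate(raw_attention, position_prior):
    """Remove an estimated positional component from per-passage
    attention mass, then renormalize to a probability distribution."""
    debiased = np.clip(raw_attention - position_prior, 1e-9, None)
    return debiased / debiased.sum()
```

With a U-shaped prior, a mid-context passage whose raw attention only slightly exceeds the prior comes out on top after calibration, even though the raw scores favor the boundary positions.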
How to Measure Position Bias in Your System
Most teams don't know whether position bias is affecting them because they're measuring the wrong thing. Retrieval metrics (precision@k, recall@k, NDCG) don't capture whether the model uses what it retrieves. End-to-end accuracy metrics don't isolate ordering effects from retrieval quality.
A targeted measurement protocol:
Controlled position permutation test. For a fixed set of test queries with known-correct answers, retrieve your standard top-k chunks. Run the query 4–5 times, shuffling the relevant chunk to different positions (first, middle, last) while holding the distractor chunks constant. If accuracy varies by more than 10–15 percentage points across positions, you have a measurable ordering problem.
eRAG document-level scoring. For each retrieved document, run the model using only that document as context and score the response quality. Compare these per-document scores to the ordering your pipeline produces. If high-scoring documents consistently land in middle positions, your ordering is misaligned with model preference.
Distractor density audit. For a sample of queries, manually label which retrieved passages are genuinely relevant versus distracting. If your top-10 results regularly contain 3–4 distractors, ordering optimization will have limited effect until retrieval is improved.
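The permutation test above is easy to mechanize. A sketch of the harness, where `answer_fn` is a placeholder for your actual query-the-model call:

```python
def permutation_contexts(relevant_chunk, distractors, positions):
    """Yield (position, ordered_chunks) with the relevant chunk inserted
    at each test position; distractors keep their relative order."""
    for pos in positions:
        chunks = list(distractors)
        chunks.insert(pos, relevant_chunk)
        yield pos, chunks

def accuracy_by_position(queries, answer_fn, positions=(0, 4, 9)):
    """queries: iterable of (query, relevant_chunk, distractors, gold_answer).
    Returns {position: fraction of queries answered correctly}."""
    hits = {p: 0 for p in positions}
    total = 0
    for query, relevant, distractors, gold in queries:
        total += 1
        for pos, chunks in permutation_contexts(relevant, distractors, positions):
            if answer_fn(query, chunks) == gold:
                hits[pos] += 1
    return {p: hits[p] / total for p in positions}
```

A spread of more than 10-15 percentage points between the best and worst positions in the returned dictionary is the signal that ordering is costing you accuracy.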
The Model Evolution Caveat
Position bias is not static. Models trained on more varied positional distributions, or with techniques like attention calibration baked into training, show reduced sensitivity. Gemini 1.5, later Claude versions, and GPT-4o all exhibit less pronounced lost-in-the-middle degradation than their predecessors, particularly in longer contexts.
This suggests a trajectory: position bias is a current engineering constraint, not a permanent architectural law. As model training improves, the practical impact will diminish. For now, though, most deployed LLMs — especially open-weight models that many teams run on-premises — still exhibit the classic U-shaped curve.
The engineering conclusion is straightforward: measure whether ordering affects your system's accuracy, fix retrieval quality first if distractors are present, then apply cross-encoder reranking and boundary placement for the remaining gap. Don't assume that retrieving the right content is enough to guarantee the model uses it.
The attention mechanism doesn't care that you worked hard to retrieve the right passage. It cares where you put it.
- https://arxiv.org/abs/2307.03172
- https://arxiv.org/html/2406.16008v1
- https://arxiv.org/html/2505.15561v1
- https://arxiv.org/html/2512.14313v1
- https://arxiv.org/html/2603.10123
- https://pinecone.io/learn/series/rag/rerankers/
- https://developer.nvidia.com/blog/enhancing-rag-pipelines-with-re-ranking/
- https://toloka.ai/blog/rag-evaluation-a-technical-guide-to-measuring-retrieval-augmented-generation/
