RAG Position Bias: Why Chunk Order Changes Your Answers
You've spent weeks tuning your embedding model. Your retrieval precision looks solid. Chunk size, overlap, metadata filters — all dialed in. And yet users keep reporting that the system "ignores" information it clearly has access to. The relevant passage is in the top-5 retrieved results every time. The model just doesn't seem to use it.
The culprit is often position bias: a systematic tendency for language models to over-rely on information at the beginning and end of their context window, while dramatically under-attending to content in the middle. In controlled experiments, moving a relevant passage from position 1 to position 10 in a 20-document context produces accuracy drops of 30–40 percentage points. Your retriever found the right content. The ordering killed it.
The U-Shaped Attention Curve
The phenomenon has a specific shape. Give a model 20 retrieved passages and ask it a question that one of them answers. If the relevant passage is first, accuracy is high — typically 85–95% on standard benchmarks. If it's last, accuracy is also high, around 85%. But if it's in the middle, accuracy can collapse to 40–60%.
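You can reproduce this curve on your own stack. Below is a minimal sketch, assuming a hypothetical `ask_llm(prompt)` wrapper around whatever model you're testing and an eval set where each item carries a question, a short gold answer, the gold passage, and a pool of distractor passages (all names here are illustrative):

```python
import random

def build_prompt(question, passages):
    # Concatenate passages in their given order, then ask the question.
    context = "\n\n".join(f"Document {i+1}: {p}" for i, p in enumerate(passages))
    return f"{context}\n\nQuestion: {question}\nAnswer:"

def accuracy_by_position(qa_items, ask_llm, n_docs=20):
    """For each slot 0..n_docs-1, place the gold passage there among
    distractors and measure answer accuracy at that position."""
    correct = [0] * n_docs
    for item in qa_items:
        distractors = random.sample(item["distractors"], n_docs - 1)
        for pos in range(n_docs):
            passages = distractors[:pos] + [item["gold"]] + distractors[pos:]
            reply = ask_llm(build_prompt(item["question"], passages))
            # Crude substring grading; swap in your own grader.
            if item["answer"].lower() in reply.lower():
                correct[pos] += 1
    return [c / len(qa_items) for c in correct]
```

Plotting the returned list against position should trace the U: high at both ends, a sag in the middle.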
This U-shaped curve isn't a bug in any one model. It emerges from two architectural features present in every autoregressive transformer:
Primacy bias comes from causal masking. Tokens at the beginning of a sequence influence all subsequent tokens during attention computation, making early content structurally over-represented in the model's internal activations — independent of whether that content is actually relevant.
Recency bias comes from positional encodings and learned attention patterns that both favor tokens near the end of a sequence. The last few thousand tokens remain in the model's effective "working memory" more reliably than middle sections.
The result: a model reading 20 chunks doesn't treat them equally. It reads the first few carefully, skims the middle, and catches the last few. Anything critical buried in positions 5 through 15 is structurally disadvantaged.
What This Looks Like in Production
The failure mode is subtle. Your eval metrics look fine because evals often test retrieval recall, i.e. whether the right chunk appears in the top-k results, not whether the model actually uses it once it's there. You can have 90% retrieval recall and still get badly wrong answers if the relevant chunk consistently lands in the middle.
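A cheap way to surface this gap is to track not just whether the gold chunk was retrieved but where it landed. A sketch, assuming your eval set labels which chunk id actually contains the answer:

```python
from collections import Counter

def gold_position_report(eval_runs):
    """eval_runs: list of (retrieved_chunk_ids, gold_chunk_id) pairs.
    Separates 'did we retrieve it?' from 'where did it land?'."""
    positions = Counter()
    hits = 0
    for retrieved_ids, gold_id in eval_runs:
        if gold_id in retrieved_ids:
            hits += 1
            positions[retrieved_ids.index(gold_id)] += 1
    hit_rate = hits / len(eval_runs)
    # A high hit rate with most mass piled into middle positions is the
    # signature of "retrieval looks fine, answers are wrong".
    return hit_rate, dict(sorted(positions.items()))
```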
A few scenarios where position bias shows up acutely:
Long context with many chunks. Retrieval pipelines that return 10–20 chunks to maximize recall create deep middle zones. A relevant passage at position 8 of 20 sits squarely in the lowest-attention region.
Fixed ordering based on embedding similarity. Pure cosine similarity rankings don't account for the model's positional preferences. The second or third most similar chunk might be more useful if placed first (a reordering sketch follows this list).
Multi-hop questions. When the answer requires synthesizing information from multiple passages, chunks that happen to land in the middle get under-weighted even when they contain necessary context.
Semantic search for rare but critical facts. A single highly relevant passage that your retriever ranks fourth or fifth might sit in the attention dead zone while three lower-relevance passages dominate the model's response.
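The reordering sketch promised above: keep the retriever's relevance ranking, but interleave chunks so the strongest ones sit at the two ends of the context and the weakest fall in the low-attention middle. This is the same idea as LangChain's LongContextReorder transformer; here is a dependency-free version:

```python
def sandwich_order(chunks_by_relevance):
    """Given chunks sorted most-relevant-first, interleave them so the
    best chunks sit at the start and end of the context and the least
    relevant ones land in the low-attention middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        if i % 2 == 0:
            front.append(chunk)   # ranks 1, 3, 5, ... near the start
        else:
            back.append(chunk)    # ranks 2, 4, 6, ... near the end
    return front + back[::-1]

# Example: relevance ranks 1..6 become [1, 3, 5, 6, 4, 2]
print(sandwich_order([1, 2, 3, 4, 5, 6]))
```

The top result stays first, the second-best anchors the end, and everything weak drifts toward the middle where it costs the least.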
The Distractor Problem Complicates Everything
Before reaching for an ordering fix, it's worth understanding what the research actually shows about production impact.
A 2025 study that examined real RAG scenarios rather than controlled lab settings found that position effects, while real, are often swamped by a different problem: distracting passages. In real retrieval results, over 60% of queries return at least one highly distracting passage — content that is topically adjacent but incorrect, confidently stated, and likely to mislead the model.
When you try to exploit primacy bias by placing your most relevant chunks first, you often end up placing distracting chunks in positions 2 and 3 as well. The effect of position optimization gets cancelled by accidentally elevating noise. Across five ordering strategies (sequential, inverse, random shuffle, max-relevance, and min-distraction), accuracy differences in realistic conditions were not statistically significant.
This doesn't mean position doesn't matter. It means retrieval quality dominates position effects when your retrieved set contains distractors. Fix the retrieval first. Then address ordering.
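In code, "fix retrieval first, then order" might look like the sketch below: score candidates with a cross-encoder reranker to drop likely distractors, then hand the survivors to a position-aware ordering such as the sandwich reorder above. The sentence-transformers CrossEncoder API is real; the checkpoint name and the 0.0 score cutoff are just illustrative defaults:

```python
from sentence_transformers import CrossEncoder

# A common public reranking baseline; any cross-encoder checkpoint works.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_and_filter(query, chunks, keep=5, min_score=0.0):
    """Score each (query, chunk) pair, drop low-scoring likely
    distractors, and keep only the top few. Filtering happens here;
    ordering (e.g. sandwich_order) happens afterwards."""
    scores = reranker.predict([(query, c) for c in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    # min_score is a heuristic cutoff, not a calibrated threshold.
    return [c for c, s in ranked[:keep] if s >= min_score]
```

Keeping fewer, cleaner chunks also shrinks the middle zone, which reduces the position problem you were trying to order your way around in the first place.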
Further Reading
- https://arxiv.org/abs/2307.03172
- https://arxiv.org/html/2406.16008v1
- https://arxiv.org/html/2505.15561v1
- https://arxiv.org/html/2512.14313v1
- https://arxiv.org/html/2603.10123
- https://pinecone.io/learn/series/rag/rerankers/
- https://developer.nvidia.com/blog/enhancing-rag-pipelines-with-re-ranking/
- https://toloka.ai/blog/rag-evaluation-a-technical-guide-to-measuring-retrieval-augmented-generation/
