The RAG Eval Antipattern That Hides Retriever Bugs
There's a failure mode common in RAG systems that goes undetected for months: your retriever is returning the wrong documents, but your generator is good enough at improvising that end-to-end quality scores stay green. You keep tuning the prompt. You upgrade the model. Nothing helps. The bug is three layers upstream and your metrics are invisible to it.
This is the retriever eval antipattern — evaluating your entire RAG pipeline as a single unit, which lets the generator absorb and hide retrieval failures. The result is a system where you cannot distinguish between "the generator failed" and "the retriever failed," making systematic improvement nearly impossible.
The fix is not complicated, but it requires stepping back from end-to-end eval and building a standalone harness that tests your retriever on its own terms. This post explains how.
Why End-to-End Eval Hides Retriever Bugs
When you measure only final answer quality — whether via LLM-as-judge, human rating, or exact match — you're collapsing a four-stage pipeline (indexing, retrieval, reranking, generation) into a single signal. That signal is dominated by your generator's capabilities.
Modern LLMs are remarkably good at generating plausible answers even when the retrieved context is wrong, incomplete, or entirely missing. A GPT-4-class model confronted with irrelevant context will often fall back on its parametric knowledge and produce something that looks correct to a human reviewer, especially for common factual queries. Research in 2025 formalized this as "context neglect" — models generating from priors rather than retrieved content, even when correct context is present.
The implication: a retriever with 40% recall can drive a system with 80% end-to-end answer quality if your generator is strong enough and your queries are close to training data. You'd conclude the pipeline is working. It isn't. Edge cases, uncommon queries, and domain-specific content will all fail silently.
There's a second, compounding problem. Retrievers are systematically biased toward documents with surface-level characteristics: shorter documents, documents where the answer appears early, documents with heavy lexical overlap with the query, and documents that repeat the query's entities. None of these correlates reliably with actual relevance. Your end-to-end eval can't measure this bias because the generator smooths over it.
The Metrics That Actually Matter
A retriever-only eval harness needs a small number of well-chosen metrics. The right choice depends on your use case.
Recall@K answers the question: "Did we find the relevant documents at all?" It measures what fraction of all relevant documents appear in the top K results. If you have 10 relevant documents for a query and your retriever puts 6 of them in the top 10, your Recall@10 is 0.6. This metric is not rank-aware — it only counts presence. That's a feature, not a bug: if your downstream generator uses all K results, retrieval order matters less than coverage.
Use Recall@K when your use case is completeness-sensitive. Legal research, medical literature review, compliance checking — anywhere that missing a relevant document is a hard failure. A Recall@K below 0.5–0.6 is a strong signal your retriever is the bottleneck.
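As a concrete sketch, Recall@K reduces to a set intersection over the top K results (the function and variable names here are illustrative, not from any particular library):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of all relevant documents that appear in the top-k results.
    Not rank-aware by design: presence in the top k is all that counts."""
    if not relevant_ids:
        return 0.0  # no relevance labels for this query
    hits = set(retrieved_ids[:k]) & relevant_ids
    return len(hits) / len(relevant_ids)
```

With the example from above (10 relevant documents, 6 of them in the top 10), this returns 0.6.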
Precision@K answers: "How much noise are we sending the generator?" It measures the fraction of your top K results that are actually relevant. If you retrieve 10 documents and only 4 are relevant, Precision@10 is 0.4. High noise hurts generators in two ways: it crowds out relevant context (especially with the "lost in the middle" effect discussed below), and it introduces contradictory information that increases hallucination risk. Precision@K below 0.4 suggests your retriever is adding too much noise.
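Precision@K is the complementary set computation. A minimal sketch (this version divides by the number of results actually returned, so a retriever that returns fewer than K documents isn't penalized for the shortfall):

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved results that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0  # nothing retrieved at all
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)
```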
MRR (Mean Reciprocal Rank) is the right metric for systems where one correct answer exists — Q&A assistants, chatbots, search interfaces. It measures how high the first relevant result ranks, averaged across queries. An MRR below 0.3 means the correct answer is, on average, below position 3 or 4 — far enough down that most users or rerankers won't find it. MRR is simple and interpretable, which makes it a good first metric to instrument.
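A minimal MRR implementation, assuming each query's run is a ranked list of retrieved IDs plus its relevant-ID set:

```python
def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average of 1/rank of the first relevant result across queries.
    A query with no relevant result in the ranking contributes 0."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0
```

For example, a query whose first relevant hit is at rank 2 contributes 0.5; one with a hit at rank 1 contributes 1.0.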
NDCG@K (Normalized Discounted Cumulative Gain) is the most informative metric when documents have graded relevance — when some documents are highly relevant, others partially relevant, and others irrelevant. It incorporates both relevance and ranking position, with a logarithmic discount for lower-ranked results. If you're building over technical documentation, scientific literature, or any corpus where document quality varies significantly, NDCG should be your primary metric.
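A sketch of NDCG@K for graded labels, assuming relevance grades are stored as a `doc_id -> grade` mapping with 0 meaning irrelevant:

```python
import math

def ndcg_at_k(retrieved_ids: list[str], relevance: dict[str, float], k: int = 10) -> float:
    """NDCG@k: log-discounted gain of the actual ranking, normalized by the
    gain of the ideal ranking (all documents sorted by grade descending)."""
    def dcg(gains):
        # position i (0-based) is discounted by log2(i + 2)
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    actual = dcg(relevance.get(doc_id, 0.0) for doc_id in retrieved_ids[:k])
    ideal = dcg(sorted(relevance.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

A perfect ordering scores 1.0; putting a partially relevant document above a highly relevant one pulls the score down in proportion to the rank swap.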
For most production RAG systems, start with Recall@10 and Precision@10. Add MRR if you're building a Q&A system. Reach for NDCG when you've collected graded relevance labels.
Building the Eval Harness
The harness has three components: a test set, a ground-truth mapping, and a runner.
The test set is a collection of queries representative of what your system will actually receive. The critical property is representativeness — queries from the same distribution as production traffic. This is harder than it sounds. Synthetic queries generated by asking an LLM to produce questions from your documents are easy to collect but heavily skewed toward simple factual lookups. Studies show ~95% of naive LLM-generated queries fall into single-fact categories, creating unrealistically high performance expectations that mask failures on complex queries.
Better approaches for building the test set:
- Sample real user queries from production logs, if you have them. Even 100 labeled examples from production traffic are worth more than 10,000 synthetic queries.
- For common question types you know exist, write them manually. 50 hand-crafted queries are cheap and highly representative.
- For augmenting a small seed set, use an LLM to generate variations at different abstraction levels: direct factual questions, multi-hop questions that require combining information from multiple documents, and paraphrased versions of seed queries.
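The augmentation step above can be sketched as prompt construction. The helper name and prompt wording are illustrative, not a prescribed template; feed the returned strings to whatever LLM client you use:

```python
def variation_prompts(seed_query: str, chunk_text: str) -> dict[str, str]:
    """Build one prompt per abstraction level for LLM-based query generation.
    Keys name the level; values are complete prompts including the passage."""
    instructions = {
        "factual": "Write a direct factual question answered by this passage.",
        "multi_hop": ("Write a question that can only be answered by combining "
                      "this passage with information from another document."),
        "paraphrase": ("Rewrite the following question so it asks the same "
                       f"thing in different words: {seed_query}"),
    }
    return {level: f"{text}\n\nPassage:\n{chunk_text}"
            for level, text in instructions.items()}
```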
Ground-truth mapping labels which documents are relevant to each query. This is the expensive step. For each query, you need to know which document IDs should be retrieved, and ideally at what relevance grade. Build your ground truth incrementally: start with 50–100 queries with binary relevance labels (relevant/not relevant). Expand gradually, adding graded labels (highly relevant / partially relevant / not relevant) once you've identified where binary labels are insufficient.
For internal corpora where you're starting from scratch, use an LLM to do initial relevance labeling, then have a human review the uncertain cases. LLM-as-judge for relevance labeling is accurate enough for most applications when the query-document pairs are clear-cut, and flagging borderline cases for human review is cheap.
The runner is straightforward: for each query in your test set, run the retriever and collect the top K document IDs. Compare against ground truth and compute your metrics. Run this as part of your CI pipeline so retriever regressions are caught before they reach production.
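A minimal runner, assuming the retriever is a callable from query string to ranked document IDs and the test set is a list of (query, relevant-ID-set) pairs (names are illustrative):

```python
def run_retrieval_eval(retriever, test_set, k=10):
    """For each query, retrieve the top k IDs and score against ground truth.
    Returns aggregate Recall@k and MRR; gate CI on minimum thresholds."""
    recalls, rranks = [], []
    for query, relevant in test_set:
        retrieved = retriever(query)[:k]
        recalls.append(len(set(retrieved) & relevant) / len(relevant))
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        rranks.append(rr)
    n = len(test_set)
    return {"recall@k": sum(recalls) / n, "mrr": sum(rranks) / n}
```

In CI, a single assertion such as `assert results["recall@k"] >= 0.6` is enough to block a regressing change.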
Diagnosing the Bottleneck
Once you have metrics, the question becomes: is the retriever actually your bottleneck? Three diagnostic patterns:
Pattern 1: Low recall, high generation quality. Your generator is compensating with parametric knowledge. The retriever bug is invisible end-to-end but will surface on out-of-distribution queries, recent events, or domain-specific content your model wasn't trained on. Fix the retriever.
Pattern 2: High recall, low generation quality. You're retrieving the right documents but your generator isn't using them. Check for the "lost in the middle" problem (see below), and verify that your retrieved documents are actually being passed correctly to the generator.
Pattern 3: High precision, mediocre recall. Your retriever is conservative — when it returns something, it's usually relevant, but it misses a lot. This pattern often indicates a query-document mismatch problem: your queries are phrased differently than the documents. Hybrid search (combining dense and sparse retrieval) typically fixes this.
The lost in the middle diagnostic: Stanford and University of Washington research established that LLM performance degrades sharply — over 30% in some settings — when relevant information is in the middle of a long context rather than at the beginning or end. If you suspect this pattern, run a controlled experiment: retrieve 3 documents vs. 15 documents on the same queries, with the same ground truth. If performance drops with more context, you don't have a retrieval quantity problem — you have a position bias problem. Fix it by reordering retrieved documents to put highest-scoring results at the start and end of the context, not sorted by score descending.
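One simple way to implement that reordering is to interleave a best-first ranking so the strongest documents land at both ends of the context and the weakest in the middle (a sketch; the function name is mine):

```python
def reorder_for_position_bias(docs_best_first: list) -> list:
    """Given documents sorted best-first, alternate them between the front
    and the back of the context so top results sit at the ends and the
    lowest-scoring results end up in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_best_first):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

For a ranking `[1, 2, 3, 4, 5]` (1 = best), this produces `[1, 3, 5, 4, 2]`: the top two results occupy the first and last positions, where the model attends most reliably.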
Synthetic Data for Private Corpora
A common objection: "We don't have labeled queries for our internal knowledge base." This is the normal case, not the exception. Most enterprise RAG systems are built over proprietary content with no pre-existing query-document pairs.
For generating synthetic test data at scale, a reliable pattern is:
- Sample a diverse set of document chunks across your corpus, stratified by document type and length to avoid over-representing any subset.
- For each chunk, prompt an LLM to generate 3–5 queries at different complexity levels: a direct question about a fact in the chunk, a more abstract question requiring inference, and a question that can only be answered by combining this chunk with another.
- Use the LLM to label relevance for query-document pairs across the broader corpus (not just the source chunk) to identify hard negatives — documents that look relevant but aren't.
Hard negatives are the most important part of this process. A test set without hard negatives will show artificially high precision because the retriever just needs to avoid obvious mismatches. Add documents that share vocabulary and topic with the query but don't actually answer it, and your metrics will start reflecting real-world difficulty.
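A cheap way to surface hard-negative candidates is to rank non-relevant documents by lexical overlap with the query; the top hits share vocabulary without answering it. This sketch uses Jaccard overlap on whitespace tokens purely for illustration (a BM25 or embedding-based scorer would work the same way):

```python
def mine_hard_negatives(query: str, corpus: dict[str, str],
                        relevant_ids: set[str], n: int = 5) -> list[str]:
    """Return the n non-relevant doc IDs with the highest token overlap
    with the query -- candidates for hard negatives in the test set."""
    q_tokens = set(query.lower().split())
    scored = []
    for doc_id, text in corpus.items():
        if doc_id in relevant_ids:
            continue  # only non-relevant documents qualify as negatives
        d_tokens = set(text.lower().split())
        union = q_tokens | d_tokens
        overlap = len(q_tokens & d_tokens) / len(union) if union else 0.0
        scored.append((overlap, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:n]]
```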
For sensitive data where you cannot send document contents to an external LLM, differentially private query generation frameworks exist — they work by clustering documents, extracting statistical patterns from clusters, and generating synthetic queries from those patterns without exposing individual document contents.
Choosing Your K
The K in Recall@K and Precision@K should match how your pipeline actually consumes retrieved documents.
If you display the top 5 results to users, evaluate at K=5. If you pass the top 10 to a reranker that selects 3 for the final context, evaluate Recall@10 (did you retrieve the relevant docs at all?) and Precision@3 after reranking (did the reranker surface them?). Choosing K=100 to pad your recall numbers is a common mistake that produces metrics uncorrelated with production behavior.
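For the retrieve-then-rerank case, the two K values can be measured in one pass. A sketch, assuming the retriever maps a query to ranked IDs and the reranker maps (query, candidate IDs) to a reordered list:

```python
def eval_retrieve_then_rerank(retriever, reranker, test_set):
    """Match eval K to pipeline K: Recall@10 on the retriever's candidate
    pool, Precision@3 on the documents the reranker hands the generator.
    test_set: list of (query, relevant_id_set) pairs."""
    recall10, prec3 = [], []
    for query, relevant in test_set:
        candidates = retriever(query)[:10]
        recall10.append(len(set(candidates) & relevant) / len(relevant))
        final = reranker(query, candidates)[:3]
        prec3.append(sum(1 for doc_id in final if doc_id in relevant) / 3)
    n = len(test_set)
    return {"recall@10": sum(recall10) / n, "precision@3": sum(prec3) / n}
```

A high Recall@10 with a low Precision@3 localizes the problem to the reranker rather than the retriever.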
Snowflake's engineering team studying finance RAG found that retrieval and chunking strategies were larger determinants of answer quality than the choice of generator model — even with long-context models supporting 100K+ tokens. That finding generalizes: in domain-specific deployments, your retriever will be the primary bottleneck more often than your generator.
Closing the Loop
Building a retriever eval harness takes a few days. Running it against a production retriever before any significant pipeline change takes minutes. The asymmetry makes the investment obvious, but it requires accepting that end-to-end metrics will not catch retriever regressions.
Start minimal: 50 labeled queries, binary relevance labels, Recall@10 and MRR. Run the harness in CI. Expand coverage based on the failure modes you actually observe. Once you have retrieval metrics you trust, decisions about chunking strategy, embedding model selection, reranking, and hybrid search become experiments with measurable outcomes rather than guesses.
The goal isn't perfect retrieval — it's distinguishing between "this is a retriever problem" and "this is a generator problem." Once you can answer that question reliably, the path to fixing each is clear.
- https://www.getmaxim.ai/articles/rag-evaluation-a-complete-guide-for-2025/
- https://deconvoluteai.com/blog/rag/metrics-retrieval
- https://towardsdatascience.com/how-to-evaluate-retrieval-quality-in-rag-pipelines-precisionk-recallk-and-f1k/
- https://developer.nvidia.com/blog/evaluating-retriever-for-enterprise-grade-rag/
- https://www.pinecone.io/learn/series/vector-databases-in-production-for-busy-engineers/rag-evaluation/
- https://arxiv.org/html/2601.03258
- https://www.snowflake.com/en/engineering-blog/impact-retrieval-chunking-finance-rag/
- https://pmc.ncbi.nlm.nih.gov/articles/PMC12649634/
- https://www.getmaxim.ai/articles/solving-the-lost-in-the-middle-problem-advanced-rag-techniques-for-long-context-llms/
- https://weaviate.io/blog/chunking-strategies-for-rag
- https://www.iguazio.com/blog/best-rag-evaluation-tools/
