
The RAG Eval Antipattern That Hides Retriever Bugs

· 10 min read
Tian Pan
Software Engineer

There's a failure mode common in RAG systems that goes undetected for months: your retriever is returning the wrong documents, but your generator is good enough at improvising that end-to-end quality scores stay green. You keep tuning the prompt. You upgrade the model. Nothing helps. The bug is three layers upstream, invisible to your metrics.

This is the retriever eval antipattern — evaluating your entire RAG pipeline as a single unit, which lets the generator absorb and hide retrieval failures. The result is a system where you cannot distinguish between "the generator failed" and "the retriever failed," making systematic improvement nearly impossible.

The fix is not complicated, but it requires stepping back from end-to-end eval and building a standalone harness that tests your retriever on its own terms. This post explains how.

Why End-to-End Eval Hides Retriever Bugs

When you measure only final answer quality — whether via LLM-as-judge, human rating, or exact match — you're collapsing a four-stage pipeline (indexing, retrieval, reranking, generation) into a single signal. That signal is dominated by your generator's capabilities.

Modern LLMs are remarkably good at generating plausible answers even when the retrieved context is wrong, incomplete, or entirely missing. A GPT-4-class model confronted with irrelevant context will often fall back on its parametric knowledge and produce something that looks correct to a human reviewer, especially for common factual queries. Research in 2025 formalized this as "context neglect" — models generating from priors rather than retrieved content, even when correct context is present.

The implication: a retriever with 40% recall can drive a system with 80% end-to-end answer quality if your generator is strong enough and your queries are close to training data. You'd conclude the pipeline is working. It isn't. Edge cases, uncommon queries, and domain-specific content will all fail silently.

There's a second compounding problem. Retrievers systematically bias toward documents with surface-level characteristics: shorter documents, documents where the answer appears early, documents with lexical overlap with the query, and documents that repeat the query's entities. None of these correlate reliably with actual relevance. Your end-to-end eval can't measure this bias because the generator smooths over it.

The Metrics That Actually Matter

A retriever-only eval harness needs a small number of well-chosen metrics. The right choice depends on your use case.

Recall@K answers the question: "Did we find the relevant documents at all?" It measures what fraction of all relevant documents appear in the top K results. If you have 10 relevant documents for a query and your retriever puts 6 of them in the top 10, your Recall@10 is 0.6. This metric is not rank-aware — it only counts presence. That's a feature, not a bug: if your downstream generator uses all K results, retrieval order matters less than coverage.
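A minimal sketch of the computation, assuming your retriever returns a ranked list of document ids and you have a set of ground-truth relevant ids per query (the names here are illustrative, not from any particular library):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k results.

    Deliberately not rank-aware: a relevant document at position 1 and one
    at position k count the same.
    """
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)
```

With 10 relevant documents and 6 of them in the top 10, this returns 0.6, matching the example above.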

Use Recall@K when your use case is completeness-sensitive. Legal research, medical literature review, compliance checking — anywhere that missing a relevant document is a hard failure. A Recall@K below 0.5–0.6 is a strong signal your retriever is the bottleneck.

Precision@K answers: "How much noise are we sending the generator?" It measures the fraction of your top K results that are actually relevant. If you retrieve 10 documents and only 4 are relevant, Precision@10 is 0.4. High noise hurts generators in two ways: it crowds out relevant context (especially with the "lost in the middle" effect discussed below), and it introduces contradictory information that increases hallucination risk. Precision@K below 0.4 suggests your retriever is adding too much noise.
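The companion metric, in the same illustrative style. Dividing by k rather than by the number of results returned follows the standard convention, so a retriever that returns fewer than k documents is penalized:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids) / k
```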

MRR (Mean Reciprocal Rank) is the right metric for systems where one correct answer exists — Q&A assistants, chatbots, search interfaces. It measures how high the first relevant result ranks, averaged across queries. An MRR below 0.3 means the correct answer is, on average, below position 3 or 4 — far enough down that most users or rerankers won't find it. MRR is simple and interpretable, which makes it a good first metric to instrument.
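A per-query sketch under the same assumptions; averaging it across your test queries gives MRR:

```python
def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1/rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```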

NDCG@K (Normalized Discounted Cumulative Gain) is the most informative metric when documents have graded relevance — when some documents are highly relevant, others partially relevant, and others irrelevant. It incorporates both relevance and ranking position, with a logarithmic discount for lower-ranked results. If you're building over technical documentation, scientific literature, or any corpus where document quality varies significantly, NDCG should be your primary metric.
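A sketch with graded labels, assuming grades are stored as a dict from document id to a numeric relevance (e.g. 2 = highly relevant, 1 = partially relevant, 0 or absent = irrelevant):

```python
import math

def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain: each gain is discounted by log2(position + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(retrieved_ids: list[str], grades: dict[str, float], k: int) -> float:
    """NDCG@k: DCG of the actual ranking, normalized by the ideal ranking's DCG."""
    gains = [grades.get(doc_id, 0.0) for doc_id in retrieved_ids]
    ideal_dcg = dcg_at_k(sorted(grades.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```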

For most production RAG systems, start with Recall@10 and Precision@10. Add MRR if you're building a Q&A system. Reach for NDCG when you've collected graded relevance labels.

Building the Eval Harness

The harness has three components: a test set, a ground-truth mapping, and a runner.

The test set is a collection of queries representative of what your system will actually receive. The critical property is representativeness — queries from the same distribution as production traffic. This is harder than it sounds. Synthetic queries generated by asking an LLM to produce questions from your documents are easy to collect but heavily skewed toward simple factual lookups. Studies show ~95% of naive LLM-generated queries fall into single-fact categories, creating unrealistically high performance expectations that mask failures on complex queries.

Better approaches for building the test set:

  • Sample real user queries from production logs, if you have them. Even 100 labeled examples from production traffic are worth more than 10,000 synthetic queries.
  • For common question types you know exist, write them manually. 50 hand-crafted queries are cheap and highly representative.
  • For augmenting a small seed set, use an LLM to generate variations at different abstraction levels: direct factual questions, multi-hop questions that require combining information from multiple documents, and paraphrased versions of seed queries.
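To make the shape of the harness concrete, here is a minimal runner sketch tying the metrics above to a test set and ground-truth mapping. The `retrieve` callable is a placeholder for whatever wraps your retriever; everything else is illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    query: str
    relevant_ids: set[str]  # ground-truth mapping: query -> relevant doc ids

def run_eval(
    cases: list[EvalCase],
    retrieve: Callable[[str, int], list[str]],  # placeholder for your retriever
    k: int = 10,
) -> dict[str, float]:
    """Run the retriever over every test query and average the metrics."""
    recalls, precisions, rrs = [], [], []
    for case in cases:
        retrieved = retrieve(case.query, k)
        recalls.append(recall_at_k(retrieved, case.relevant_ids, k))
        precisions.append(precision_at_k(retrieved, case.relevant_ids, k))
        rrs.append(reciprocal_rank(retrieved, case.relevant_ids))
    n = len(cases)
    return {f"recall@{k}": sum(recalls) / n,
            f"precision@{k}": sum(precisions) / n,
            "mrr": sum(rrs) / n}
```

Run this on every retriever change. A drop in recall@10 while end-to-end scores stay flat is exactly the silent failure this post is about.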