Testing the Retrieval-Generation Seam: The Integration Test Gap in RAG Systems
Your retriever returns the right documents 94% of the time. Your LLM correctly answers questions given good context 96% of the time. Ship it. What could go wrong?
Multiply those numbers: 0.94 × 0.96 ≈ 0.90. You've lost 10% of your queries before accounting for any edge cases, prompt formatting issues, token truncation, or the distractor documents your retriever surfaces alongside the correct ones. But the deeper problem isn't the arithmetic — it's that your unit tests will never catch this. The retriever passes its tests in isolation. The generator passes its tests in isolation. The thing that fails is the composition, and most teams have no tests for that.
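The compounding generalizes: assuming independent per-stage failures, end-to-end success is the product of the stage rates, so every stage you add eats into reliability. A quick sketch (the rates are the illustrative ones above, not measurements):

```python
def pipeline_success(stage_rates):
    """End-to-end success probability, assuming independent stage failures."""
    p = 1.0
    for rate in stage_rates:
        p *= rate
    return p

# Retriever succeeds 94% of the time; generator succeeds 96% given good context.
end_to_end = pipeline_success([0.94, 0.96])
print(f"{end_to_end:.4f}")  # 0.9024 -> roughly 1 in 10 queries lost
```

Add a reranker at 97% and the product drops again — which is why component-level pass rates alone tell you little about the pipeline.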
This is the retrieval-generation seam: the interface between what your retriever hands off and what your generator can actually use. It's the most under-tested boundary in production RAG systems, and it's where most failures originate.
Why Unit Tests Don't Catch Seam Failures
The standard RAG testing playbook looks like this: write a set of queries, check that the retriever returns relevant documents, then evaluate generation quality on a hand-curated set of (query, context) pairs. Both pass. You deploy. Users start filing bugs.
The problem is that unit tests evaluate each component against inputs you control. In production, the generator receives the context your retriever actually constructs — including its failures. The gap between "retriever test context" and "production retriever context" is where compound failures live.
Consider these seam-specific failure patterns:
- Truncation silencing the answer. Your retriever returns a 2,000-token document that contains the answer in the last 400 tokens. You configured your context window to 1,500 tokens. The document passes the retriever's relevance check. The answer never reaches the generator.
- Distractor pollution. You retrieve top-5 documents. Four are correct; one is semantically similar but factually contradictory. Your generator, trying to be comprehensive, blends them. The retriever test measured precision@5 and called it a pass.
- Structural mismatch. Your generator was fine-tuned on clean, well-formatted documents. Your retriever returns raw HTML fragments, table cells out of context, or mid-sentence chunk boundaries. Neither component test caught this because the retriever was tested against clean documents and the generator was tested with clean context.
- Attribution erasure. The generator produces a correct-sounding answer that isn't actually grounded in the retrieved context — it's using parametric memory. Your unit test for the generator checked answer quality, not groundedness.
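The truncation pattern is the easiest of these to reproduce directly. A minimal sketch, approximating tokens as whitespace-split words (a real system would count with the model's tokenizer):

```python
def truncate_context(text, max_tokens):
    """Naive head-truncation: keep only the first max_tokens tokens."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

# A 'relevant' document: 1,600 tokens of filler, with the answer in the tail.
doc = " ".join(["filler"] * 1600) + " the answer is 42"
window = truncate_context(doc, 1500)

# The document passes any relevance check -- it contains the answer --
# but the answer never survives the cut into the generator's context.
assert "42" not in window
```

A seam test that checks "does the answer text survive into the assembled context window?" catches this class of failure; neither a retriever test nor a generator test will.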
The "Seven Failure Points in RAG" framework identifies these systematically: missing content, missed top-ranked documents, documents excluded at consolidation, failure to extract relevant information, wrong format, incorrect specificity, and incomplete answers. Notice how cleanly they split between retrieval failures, seam failures, and generation failures — yet most teams instrument only the endpoints.
Designing Seam-Level Tests
Testing the seam requires fixtures that simulate realistic retrieval output, not idealized test contexts. The goal is to expose your generator to the actual distribution of contexts your retriever will produce.
Synthetic context injection is the foundational technique. Instead of testing generation on hand-crafted gold-standard contexts, you pipe your retriever's actual output into your generation evaluator. This means running your retriever against a representative query set, capturing the full retrieved context (including rank, scores, and any post-retrieval compression), and feeding that exact context to your generation evaluator.
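The capture step can be sketched as follows. Here `retriever.search()` and the shape of its hits are placeholders for your own retrieval stack — only the pattern (freeze real retriever output, replay it against the generator evaluator) is the point:

```python
import json

def capture_seam_fixtures(retriever, queries, path):
    """Run the real retriever and freeze its exact output as fixtures,
    so generation tests see production-shaped context, not gold context."""
    fixtures = []
    for query in queries:
        # Assumed interface: search() returns [{"text", "score", "rank"}, ...]
        hits = retriever.search(query)
        fixtures.append({
            "query": query,
            "context": [h["text"] for h in hits],
            "scores": [h["score"] for h in hits],
        })
    with open(path, "w") as f:
        json.dump(fixtures, f, indent=2)
    return fixtures
```

Checking the frozen fixtures into version control also gives you regression detection for free: a change to chunking or embeddings shows up as a fixture diff before it shows up as a quality drop.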
Three fixture categories cover most seam failures:
- Correct context. The retriever found the answer. Verify the generator extracts it faithfully. This is your baseline — if you fail here, you have a generation problem, not a seam problem.
- Relevant but insufficient context. The retriever found topically related documents, but none contain the specific answer. Verify the generator says "I don't know" rather than hallucinating. Most teams skip this fixture type entirely.
- Contradictory context. The retriever found documents with conflicting claims. Verify the generator acknowledges uncertainty or resolves the conflict explicitly, rather than silently picking one.
Each fixture category tells you something different. Failures in category 1 implicate your prompt or model. Failures in category 2 implicate your confidence calibration and refusal behavior. Failures in category 3 implicate your context structuring and generation prompt.
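The three categories can be encoded directly in the fixture format, with a per-category expectation the test harness asserts against. The field names and expectation labels here are illustrative, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class SeamFixture:
    query: str
    context: list = field(default_factory=list)  # chunks as actually assembled
    category: str = "correct"      # "correct" | "insufficient" | "contradictory"
    expected: str = "extract"      # "extract" | "refuse" | "flag_conflict"

FIXTURES = [
    SeamFixture("What year was the policy enacted?",
                ["The policy was enacted in 2019."],
                category="correct", expected="extract"),
    SeamFixture("What year was the policy enacted?",
                ["The policy covers remote workers."],
                category="insufficient", expected="refuse"),
    SeamFixture("What year was the policy enacted?",
                ["Enacted in 2019.", "Enacted in 2021."],
                category="contradictory", expected="flag_conflict"),
]
```

Keeping the category on the fixture means a failure report can be sliced by category — which, per the mapping above, tells you whether to look at the prompt, the refusal behavior, or the context structuring.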
A seam test is only useful if it fails separately from its component tests. If your retriever-only tests pass and your seam tests fail, you've localized the problem to the handoff. That's actionable. If everything fails together, you've learned nothing.
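The localization logic is mechanical enough to write down. A sketch, where the three booleans stand in for the pass/fail results of your retriever-only, seam, and generator-only suites:

```python
def localize_failure(retrieval_ok, seam_ok, generation_ok):
    """Map component/seam test outcomes to the layer that needs attention."""
    if retrieval_ok and not seam_ok:
        return "handoff"      # components fine in isolation; composition broken
    if not retrieval_ok:
        return "retrieval"    # fix embeddings/chunking before anything else
    if not generation_ok:
        return "generation"   # context is fine; prompt or model is the problem
    return "pass"

assert localize_failure(True, False, True) == "handoff"
```

The "handoff" branch is the one unit tests can never reach, because it requires running both suites and comparing.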
The Metric Split: Retrieval Recall vs. Grounded Accuracy
The most common evaluation mistake is using a single end-to-end accuracy metric to evaluate the whole pipeline. When it drops, you don't know whether to tune your embeddings or your prompt.
The metric split that reliably locates blame:
Retrieval-side metrics:
- Context Recall: Does the retrieved set contain all the information needed to answer the question? Low context recall means your embedding model or chunking strategy is the problem — not your generator.
- Context Precision: Do the relevant chunks rank above the irrelevant ones? Low precision means you're handing the generator too much noise.
- Recall@K / Precision@K: Traditional IR metrics, useful for comparing embedding models and rerankers in A/B tests.
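The IR metrics above are cheap to implement against a labeled query set. The standard definitions, as a sketch:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / k

retrieved = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d2", "d4"}
print(recall_at_k(retrieved, relevant, 5))     # 2 of 3 relevant docs found -> 2/3
print(precision_at_k(retrieved, relevant, 5))  # 2 of 5 results relevant -> 0.4
```

Run the same query set through both candidate embedding models and compare these numbers; that is the A/B comparison the bullet refers to.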
Seam-specific metrics:
- Context Relevance: Of the chunks actually passed to the generator (post-truncation, post-reranking), what fraction are relevant? This differs from Context Recall — it measures the quality of the context window as assembled, not just the retrieval candidates.
- Answer Attribution Rate: For claims in the generated answer, what fraction can be traced back to a specific passage in the context? Low attribution at high answer quality means your model is using parametric knowledge instead of retrieved context — fine for some use cases, dangerous for others.
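A crude but useful proxy for Answer Attribution Rate checks each answer claim for lexical support in the context. Production systems typically use an LLM judge or an NLI model for this; the word-overlap heuristic below is only a sketch, and the `threshold` value is an arbitrary assumption:

```python
def attribution_rate(claims, context_passages, threshold=0.5):
    """Fraction of claims whose words mostly appear in some context passage.
    Lexical-overlap heuristic only; real systems use an NLI model or LLM judge."""
    supported = 0
    for claim in claims:
        words = set(claim.lower().split())
        for passage in context_passages:
            overlap = words & set(passage.lower().split())
            if words and len(overlap) / len(words) >= threshold:
                supported += 1
                break
    return supported / len(claims) if claims else 0.0

claims = ["the policy started in 2019", "it applies to contractors"]
context = ["The policy started in 2019 for all employees."]
print(attribution_rate(claims, context))  # first claim supported, second not -> 0.5
```

A high-quality answer with a low attribution rate is exactly the "attribution erasure" failure described earlier: the model answered from parametric memory, not from your documents.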
Generation-side metrics:
- Faithfulness: What fraction of claims in the answer are supported (and not contradicted) by the provided context? This is the primary hallucination detector for RAG.
- Answer Relevancy: Does the generated output address what was actually asked? Surprisingly separable from faithfulness — a model can be faithful to bad context and still irrelevant.
References
- https://arxiv.org/html/2401.05856v1
- https://www.confident-ai.com/blog/rag-evaluation-metrics-answer-relevancy-faithfulness-and-more
- https://www.confident-ai.com/blog/how-to-evaluate-rag-applications-in-ci-cd-pipelines-with-deepeval
- https://docs.ragas.io/en/stable/concepts/metrics/
- https://www.evidentlyai.com/llm-guide/rag-evaluation
- https://www.getmaxim.ai/articles/rag-evaluation-a-complete-guide-for-2025/
- https://qdrant.tech/blog/rag-evaluation-guide/
- https://developers.redhat.com/articles/2026/02/23/synthetic-data-rag-evaluation-why-your-rag-system-needs-better-testing
- https://labelstud.io/blog/seven-ways-your-rag-system-could-be-failing-and-how-to-fix-them/
- https://deconvoluteai.com/blog/rag/failure-modes
