Testing the Retrieval-Generation Seam: The Integration Test Gap in RAG Systems
Your retriever returns the right documents 94% of the time. Your LLM correctly answers questions given good context 96% of the time. Ship it. What could go wrong?
Multiply those numbers: 0.94 × 0.96 = 0.90. You've lost 10% of your queries before accounting for any edge cases, prompt formatting issues, token truncation, or the distractor documents your retriever surfaces alongside the correct ones. But the deeper problem isn't the arithmetic — it's that your unit tests will never catch this. The retriever passes its tests in isolation. The generator passes its tests in isolation. The thing that fails is the composition, and most teams have no tests for that.
This is the retrieval-generation seam: the interface between what your retriever hands off and what your generator can actually use. It's the most under-tested boundary in production RAG systems, and it's where most failures originate.
Why Unit Tests Don't Catch Seam Failures
The standard RAG testing playbook looks like this: write a set of queries, check that the retriever returns relevant documents, then evaluate generation quality on a hand-curated set of (query, context) pairs. Both pass. You deploy. Users start filing bugs.
The problem is that unit tests evaluate each component against inputs you control. In production, the generator receives the context your retriever actually constructs — including its failures. The gap between "retriever test context" and "production retriever context" is where compound failures live.
Consider these seam-specific failure patterns:
- Truncation silencing the answer. Your retriever returns a 2,000-token document that contains the answer in the last 400 tokens. You configured your context window to 1,500 tokens. The document passes the retriever's relevance check. The answer never reaches the generator.
- Distractor pollution. You retrieve top-5 documents. Four are correct; one is semantically similar but factually contradictory. Your generator, trying to be comprehensive, blends them. The retriever test measured precision@5 and called it a pass.
- Structural mismatch. Your generator was fine-tuned on clean, well-formatted documents. Your retriever returns raw HTML fragments, table cells out of context, or mid-sentence chunk boundaries. Neither component test caught this because the retriever was tested against clean documents and the generator was tested with clean context.
- Attribution erasure. The generator produces a correct-sounding answer that isn't actually grounded in the retrieved context — it's using parametric memory. Your unit test for the generator checked answer quality, not groundedness.
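The first of these patterns is easy to reproduce in a few lines. A minimal sketch, using whitespace word counts as a stand-in for a real tokenizer:

```python
def fit_to_window(text, budget_tokens):
    """Naive head-truncation: keep the first `budget_tokens` tokens.
    Whitespace splitting stands in for a real tokenizer here."""
    tokens = text.split()
    return " ".join(tokens[:budget_tokens])

# A ~2,000-token document whose answer lives in the tail.
doc = "padding " * 1600 + "the answer is 42"
context = fit_to_window(doc, budget_tokens=1500)

# The document passed the relevance check; the answer never arrives.
assert "42" not in context
```

The document as a whole is relevant, so every retrieval metric passes; only a test that inspects the context actually handed to the generator catches the loss.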
The "Seven Failure Points in RAG" framework identifies these systematically: missing content, missed top-ranked documents, documents excluded at consolidation, failure to extract relevant information, wrong format, incorrect specificity, and incomplete answers. Notice how cleanly they split between retrieval failures, seam failures, and generation failures — yet most teams instrument only the endpoints.
Designing Seam-Level Tests
Testing the seam requires fixtures that simulate realistic retrieval output, not idealized test contexts. The goal is to expose your generator to the actual distribution of contexts your retriever will produce.
Synthetic context injection is the foundational technique. Instead of testing generation on hand-crafted gold-standard contexts, you pipe your retriever's actual output into your generation evaluator. This means running your retriever against a representative query set, capturing the full retrieved context (including rank, scores, and any post-retrieval compression), and feeding that exact context to your generation evaluator.
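One way to capture such fixtures, sketched below. The retriever interface (a callable returning scored documents) and the toy word-overlap retriever are illustrative assumptions, not a specific library's API:

```python
def capture_seam_fixtures(retriever, queries, top_k=5):
    """Record exactly what the generator would see for each query:
    ranked chunks plus scores, ready to replay against an evaluator."""
    fixtures = []
    for query in queries:
        hits = sorted(retriever(query), key=lambda h: h[1], reverse=True)[:top_k]
        fixtures.append({
            "query": query,
            "context": [doc for doc, _ in hits],
            "scores": [score for _, score in hits],
        })
    return fixtures

# Toy retriever for illustration: scores documents by word overlap.
def toy_retriever(query):
    corpus = ["refunds take 5 business days", "shipping is free over $50"]
    return [(doc, len(set(query.split()) & set(doc.split()))) for doc in corpus]

fixtures = capture_seam_fixtures(toy_retriever, ["how long do refunds take"], top_k=2)
```

Persisting these fixtures (rather than regenerating them per run) also lets you diff context quality between retriever versions.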
Three fixture categories cover most seam failures:
- Correct context. The retriever found the answer. Verify the generator extracts it faithfully. This is your baseline — if you fail here, you have a generation problem, not a seam problem.
- Relevant but insufficient context. The retriever found topically related documents, but none contain the specific answer. Verify the generator says "I don't know" rather than hallucinating. Most teams skip this fixture type entirely.
- Contradictory context. The retriever found documents with conflicting claims. Verify the generator acknowledges uncertainty or resolves the conflict explicitly, rather than silently picking one.
Each fixture category tells you something different. Failures in category 1 implicate your prompt or model. Failures in category 2 implicate your confidence calibration and refusal behavior. Failures in category 3 implicate your context structuring and generation prompt.
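The three categories can be encoded as a small table-driven suite. Everything below — the fixture contents, the toy generator, and the string-match pass predicates — is illustrative; a real suite would score answers with an LLM judge rather than substring checks:

```python
SEAM_FIXTURES = [
    # (category, retrieved context, question, expectation)
    ("correct",       ["Refunds are issued within 5 business days."],
                      "How fast are refunds?", "answer_grounded"),
    ("insufficient",  ["Our returns portal is open 24/7."],
                      "How fast are refunds?", "refuses"),
    ("contradictory", ["Refunds take 5 days.", "Refunds take 30 days."],
                      "How fast are refunds?", "flags_conflict"),
]

# One pass predicate per expectation type.
CHECKS = {
    "answer_grounded": lambda ans: "5 business days" in ans,
    "refuses":         lambda ans: "don't know" in ans.lower(),
    "flags_conflict":  lambda ans: "conflict" in ans.lower(),
}

def run_seam_suite(generate, fixtures):
    """Run every fixture through the generator and score it."""
    return {cat: CHECKS[exp](generate(q, ctx)) for cat, ctx, q, exp in fixtures}

# Toy generator that behaves correctly on all three categories.
def toy_generate(question, context):
    if len(context) > 1:
        return "The sources conflict: 5 days vs. 30 days."
    if "5 business days" in context[0]:
        return "Refunds are issued within 5 business days."
    return "I don't know based on the provided context."

results = run_seam_suite(toy_generate, SEAM_FIXTURES)
```

The per-category result dict is the point: a failure in `insufficient` alone tells you to work on refusal behavior, not retrieval.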
A seam test is only useful if it fails separately from its component tests. If your retriever-only tests pass and your seam tests fail, you've localized the problem to the handoff. That's actionable. If everything fails together, you've learned nothing.
The Metric Split: Retrieval Recall vs. Grounded Accuracy
The most common evaluation mistake is scoring the entire pipeline with a single end-to-end accuracy metric. When it drops, you don't know whether to tune your embeddings or your prompt.
The metric split that reliably locates blame:
Retrieval-side metrics:
- Context Recall: Does the retrieved set contain all the information needed to answer the question? Low context recall means your embedding model or chunking strategy is the problem — not your generator.
- Context Precision: Do the relevant chunks rank above the irrelevant ones? Low precision means you're handing the generator too much noise.
- Recall@K / Precision@K: Traditional IR metrics, useful for comparing embedding models and rerankers in A/B tests.
Seam-specific metrics:
- Context Relevance: Of the chunks actually passed to the generator (post-truncation, post-reranking), what fraction are relevant? This differs from Context Recall — it measures the quality of the context window as assembled, not just the retrieval candidates.
- Answer Attribution Rate: For claims in the generated answer, what fraction can be traced back to a specific passage in the context? Low attribution at high answer quality means your model is using parametric knowledge instead of retrieved context — fine for some use cases, dangerous for others.
Generation-side metrics:
- Faithfulness: What fraction of claims in the answer are supported (and not contradicted) by the provided context? This is the primary hallucination detector for RAG.
- Answer Relevancy: Does the generated output address what was actually asked? Surprisingly separable from faithfulness — a model can be faithful to bad context and still irrelevant.
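As a concrete (if deliberately crude) illustration of the seam-side metric, here is a lexical stand-in for answer attribution rate. Production implementations verify each claim with an LLM judge; the sentence splitting and overlap threshold below are simplifying assumptions, but the shape of the computation is the same:

```python
def attribution_rate(answer, context, overlap_threshold=0.5):
    """Fraction of claims (here: sentences) whose content words overlap
    some retrieved chunk by at least `overlap_threshold`. A lexical
    stand-in for the LLM-judged metric."""
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    attributed = 0
    for claim in claims:
        words = {w.lower() for w in claim.split()}
        for chunk in context:
            chunk_words = {w.lower() for w in chunk.split()}
            if words and len(words & chunk_words) / len(words) >= overlap_threshold:
                attributed += 1
                break
    return attributed / max(len(claims), 1)

ctx = ["Refunds are issued within 5 business days of approval."]
rate = attribution_rate(
    "Refunds arrive within 5 business days. We value you deeply.", ctx)
```

Here `rate` is 0.5: the first sentence traces back to the chunk, the second (pure filler) does not, and that unattributed half is exactly what the metric is designed to surface.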
Run these metrics together on the same test set. The pattern of failure tells you where to look:
| Pattern | Diagnosis |
|---|---|
| Low context recall, high faithfulness | Retrieval problem — fix chunking or embeddings |
| High context recall, low context relevance | Reranking or context assembly problem |
| High context relevance, low faithfulness | Generation problem — fix prompt or model |
| High faithfulness, low answer relevancy | Prompt instruction following problem |
| Low attribution rate, acceptable faithfulness | Model using parametric knowledge — evaluate whether this is acceptable |
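The table above can be mechanized as a first-pass triage function. The thresholds and diagnosis strings are illustrative; calibrate the cutoffs on your own pipeline before trusting them:

```python
def localize_blame(m, low=0.5, high=0.8):
    """Map a metric snapshot to a diagnosis, following the pattern table.
    `m` is a dict of metric name -> score in [0, 1]."""
    if m["context_recall"] < low and m["faithfulness"] >= high:
        return "retrieval: fix chunking or embeddings"
    if m["context_recall"] >= high and m["context_relevance"] < low:
        return "seam: reranking or context assembly"
    if m["context_relevance"] >= high and m["faithfulness"] < low:
        return "generation: fix prompt or model"
    if m["faithfulness"] >= high and m["answer_relevancy"] < low:
        return "prompt instruction following"
    if m["attribution_rate"] < low and m["faithfulness"] >= high:
        return "parametric knowledge: decide whether acceptable"
    return "no single dominant failure: inspect traces"

snapshot = {"context_recall": 0.9, "context_relevance": 0.3,
            "faithfulness": 0.9, "answer_relevancy": 0.9,
            "attribution_rate": 0.9}
diagnosis = localize_blame(snapshot)
```

With the snapshot above (recall is fine, but the assembled window is mostly noise), the function points at the seam rather than at either endpoint.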
Building the Test Infrastructure
The tooling landscape has matured enough that you don't need to build this from scratch. The practical question is how to layer the tools.
RAGAS is purpose-built for RAG evaluation. It calculates context recall, context precision, faithfulness, and answer relevancy without requiring labeled ground truth for most metrics. Its test data generator (`TestsetGenerator`) creates (question, context, answer) triplets through a multi-stage pipeline: topic extraction, question generation, question evolution, and answer generation. This gives you a realistic evaluation dataset from your own documents without manual annotation.
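A minimal RAGAS batch-evaluation sketch. The column names and metric imports follow the classic RAGAS API and may differ across versions; actually running it requires `ragas` and `datasets` installed plus credentials for a judge model, so the evaluation call is kept inside a function:

```python
# One evaluation row; the column names are the classic RAGAS schema.
EVAL_ROWS = [{
    "question": "How fast are refunds?",
    "contexts": ["Refunds are issued within 5 business days."],
    "answer": "Refunds arrive within 5 business days.",
    "ground_truth": "Refunds are issued within 5 business days.",
}]

def run_ragas(rows):
    """Requires: pip install ragas datasets, plus judge-model credentials."""
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import (answer_relevancy, context_precision,
                               context_recall, faithfulness)
    return evaluate(Dataset.from_list(rows),
                    metrics=[faithfulness, answer_relevancy,
                             context_precision, context_recall])
```

Note that `contexts` should hold the chunks as actually assembled for the generator, not the raw retrieval candidates — otherwise you are back to testing the endpoints instead of the seam.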
DeepEval integrates with pytest, making it natural to add RAG evaluation gates to your CI pipeline. You define metric thresholds (e.g., faithfulness ≥ 0.7, context recall ≥ 0.8) and the test suite fails if any metric drops below the threshold. This is the most important investment teams skip — a CI gate that catches quality regressions before they reach production.
TruLens instruments your application in-flight, applying evaluation functions after each LLM call in production. This gives you the production distribution of context quality and generation faithfulness, which is different from your offline test distribution in ways that matter.
The recommended layering:
- Development: DeepEval unit tests for each component plus seam-level integration tests
- Staging: RAGAS batch evaluation over a representative query set before each release
- Production: TruLens (or LangSmith) for continuous monitoring with alerting on metric degradation
The key architectural decision: run component tests and seam tests in the same CI job, with separate pass/fail thresholds. When a seam test fails but component tests pass, you've caught a real integration regression.
Localizing Blame at Scale
In practice, you'll have queries that fail end-to-end and need to determine where to spend engineering time. A systematic blame localization protocol:
- Run the failing query through your retriever in isolation. Does the answer exist in the retrieved documents? If no → retrieval failure. If yes → continue.
- Inject the exact retrieved context into a clean generation call. Does the generator produce the correct answer? If yes → the seam is the problem (context assembly, truncation, ordering). If no → generation failure.
- Inject a gold-standard context you wrote manually. Does the generator produce the correct answer? If yes → the generation model is fine, the problem is in what the retriever hands it. If no → you have a generation model problem that needs a prompt change or a model upgrade.
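The protocol is mechanical enough to script. The callables (`retrieve`, `generate`, `answer_in`) are assumptions about your pipeline's interfaces, and the substring check for a correct answer is a simplification:

```python
def localize_failure(query, gold_answer, retrieve, generate, gold_context, answer_in):
    """Three-step blame localization for a query that failed end-to-end.
    `answer_in(docs)` decides whether the answer exists in retrieved docs."""
    docs = retrieve(query)
    if not answer_in(docs):                        # step 1: retriever in isolation
        return "retrieval failure"
    if gold_answer in generate(query, docs):       # step 2: exact retrieved context
        # Clean call succeeds -> the production path mangled the handoff.
        return "seam failure (assembly, truncation, ordering)"
    if gold_answer in generate(query, gold_context):  # step 3: gold context
        return "retrieval handed unusable context"
    return "generation failure"

# Toy stand-ins demonstrating a seam diagnosis.
retrieve = lambda q: ["refunds take 5 business days"]
generate = lambda q, docs: docs[0] if docs else "I don't know"
result = localize_failure(
    query="refund speed?", gold_answer="5 business days",
    retrieve=retrieve, generate=generate,
    gold_context=["refunds take 5 business days"],
    answer_in=lambda docs: any("5 business days" in d for d in docs),
)
```

In the toy run the answer is retrievable and a clean generation call succeeds, so the diagnosis lands on the seam: something between retrieval and generation in the production path is at fault.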
This three-step protocol reduces "the RAG system gave a bad answer" to a specific, actionable diagnosis in under ten minutes. Teams that don't follow it spend weeks tuning embeddings when the actual problem was a system prompt that told the model to be creative.
The other critical practice: preserve traces. Every production RAG query should log the retrieved documents, their scores, the assembled context (as actually passed to the model), and the generated response. Without this trace, you're doing forensics on a crime scene that's been cleaned up. With it, you can replay any production failure in your offline test environment and understand exactly what happened.
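A trace record can be as simple as one JSON line per query. The field names below are illustrative, not a standard schema; the essential property is that `assembled_context` is the string actually sent to the model, not the retrieval candidates:

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class RagTrace:
    """One production query, captured with enough detail to replay offline."""
    query: str
    retrieved_docs: list      # (doc_id, score) pairs, pre-assembly
    assembled_context: str    # the context string actually sent to the model
    response: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    ts: float = field(default_factory=time.time)

    def to_jsonl(self):
        return json.dumps(asdict(self))

trace = RagTrace(
    query="refund speed?",
    retrieved_docs=[("doc-17", 0.82), ("doc-03", 0.41)],
    assembled_context="refunds take 5 business days",
    response="Refunds take 5 business days.",
)
line = trace.to_jsonl()
```

Append `line` to a log file per query and the offline replay described above becomes a matter of reloading the record and re-running the generator on `assembled_context`.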
What Gets Teams in Trouble
A few anti-patterns show up repeatedly in production RAG post-mortems:
Evaluating on your development documents. Your chunking strategy looks excellent when measured against the documents you built it for. Different document types — PDFs with tables, code files, multilingual content — often have dramatically different seam failure patterns. Test on a representative sample of your production document types, not just the clean ones you started with.
Tuning retrieval and generation simultaneously. When you change your embedding model and your system prompt in the same sprint and your metrics improve, you don't know which change helped. Make isolated changes and measure after each one. This slows development but gives you a reliable model of what actually works in your pipeline.
Treating faithfulness as a binary. A faithfulness score of 0.8 means 20% of claims in your generated answers aren't supported by the retrieved context. For a customer support bot answering questions about your product, that's potentially 20% of claims being fabrications or parametric memory. Decide what faithfulness threshold is acceptable for your use case before you build the system, not after users start complaining.
Ignoring the "I don't know" case. Most RAG evaluation suites are built from answerable questions. Production queries include many questions your corpus can't answer. If your system isn't evaluated on these, you don't know how it behaves when it should refuse — and it's almost always worse than you'd expect.
The Continuous Evaluation Requirement
RAG systems degrade in ways traditional monitoring doesn't catch. Your embedding model gets silently updated. Your document corpus grows and changes. A new document type gets added to the knowledge base. None of these trigger error alerts. All of them can drop your faithfulness score by 15 points.
The implication: RAG evaluation isn't a pre-deployment gate that you run once. It's a continuous process that monitors the component metrics, the seam metrics, and the end-to-end metrics in parallel — and alerts when any of them degrades.
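A minimal shape for that alerting, assuming you log per-query metric scores: compare a rolling mean against a pinned baseline. The tolerance and window are illustrative, and production systems typically add proper drift tests and per-segment breakdowns:

```python
def degraded(history, baseline, tolerance=0.05, window=50):
    """Alert when the rolling mean of a metric drops more than
    `tolerance` below its pinned baseline."""
    recent = history[-window:]
    if not recent:
        return False
    return (baseline - sum(recent) / len(recent)) > tolerance

# A corpus change silently drags faithfulness down mid-stream.
faithfulness_history = [0.9] * 40 + [0.7] * 50
alarm = degraded(faithfulness_history, baseline=0.9)
```

Running this per metric (component, seam, and end-to-end) is what turns "the system started giving wrong answers" into "context relevance degraded after the ingestion change."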
Teams that ship RAG without this infrastructure are flying blind. They see increased support tickets or hear from users that the system "started giving wrong answers" sometime after a deployment, with no trace of what changed. Teams that instrument the seam have a specific signal: context relevance dropped after the document ingestion pipeline was updated, localized to a new document type, fixable with a chunking rule.
The seam is where your system's reliability is really determined. Unit tests passing on both sides of it doesn't tell you that it works. Only tests at the seam do.
References
- https://arxiv.org/html/2401.05856v1
- https://www.confident-ai.com/blog/rag-evaluation-metrics-answer-relevancy-faithfulness-and-more
- https://www.confident-ai.com/blog/how-to-evaluate-rag-applications-in-ci-cd-pipelines-with-deepeval
- https://docs.ragas.io/en/stable/concepts/metrics/
- https://www.evidentlyai.com/llm-guide/rag-evaluation
- https://www.getmaxim.ai/articles/rag-evaluation-a-complete-guide-for-2025/
- https://qdrant.tech/blog/rag-evaluation-guide/
- https://developers.redhat.com/articles/2026/02/23/synthetic-data-rag-evaluation-why-your-rag-system-needs-better-testing
- https://labelstud.io/blog/seven-ways-your-rag-system-could-be-failing-and-how-to-fix-them/
- https://deconvoluteai.com/blog/rag/failure-modes
