RAG's Dirty Secret: Your Retrieval Succeeds but Your Answers Are Still Wrong
Most teams building RAG systems think they have two failure modes: retrieval fails to find the relevant document, or the LLM hallucinates despite having it. The first is measured obsessively — recall@K, MRR, NDCG. The second is treated as the model's problem. Neither framing is complete.
There's a third failure mode that sits between them: retrieval succeeds (the relevant document ranks in the top-K), but the retrieved context doesn't actually contain enough information to answer the question correctly. The model gets confident, generates a plausible answer, and gets it wrong. Research on frontier models including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 shows this happens at rates above 50% on multi-step queries — and most production systems have no instrumentation to detect it.
