The 'What Changed' Query Is the RAG Question Your Index Can't Answer
A user asks your assistant, "what changed about our refund policy this quarter?" The system returns a confident, well-formatted summary of the current refund policy. The user nods, closes the chat, and acts on information that has nothing to do with the question they asked. Nothing in your eval suite caught this. Nothing in your faithfulness metric flagged it. The retrieval looked perfect — it returned highly-relevant chunks. The synthesis looked perfect — it cited every chunk it used. The only problem is that the question was about change, and your index has no concept of change.
This is the failure mode that vector-similarity retrieval cannot fix by tuning. Two versions of the same document have nearly-identical embeddings — that is what good embeddings do, they collapse semantically equivalent text into the same neighborhood. So when you ask "what changed," the retriever returns one of the versions, the LLM summarizes that version, and the answer is silently a hallucination of nothing-changed. The user cannot tell. Your eval set probably cannot tell either, because your eval set is built around "what is X" questions, not "what's different about X now."
