The Production Retrieval Stack: Why Pure Vector Search Fails and What to Do Instead
Most RAG systems ship with a vector database, a few thousand embeddings, and the assumption that semantic similarity is close enough to correctness. It is not. That gap between "semantically similar" and "actually correct" is why 73% of RAG systems fail in production, and almost all of those failures happen at the retrieval stage, before the LLM ever generates a word.
The standard playbook ("embed your documents, query with cosine similarity, pass top-k to the LLM") works in demos because demo queries are designed to work. Production queries are not. Users search for product IDs, invoice numbers, regulation codes, misspelled competitor names, and multi-constraint questions that a single embedding vector cannot geometrically satisfy. Dense vector search is not wrong; it is incomplete. Building a retrieval stack that actually works in production requires understanding why it falls short and layering in the components that compensate.
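To make the failure mode concrete, here is a minimal sketch of the cosine-similarity top-k step from that playbook. Everything in it is an illustrative assumption: the document texts, the query, and especially the three-dimensional vectors, which are hand-picked stand-ins rather than the output of any real embedding model. They are chosen so that two invoices differing only by their number sit almost on top of each other, which is the situation the paragraph above describes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical corpus: two near-identical invoices and one unrelated document.
DOCS = {
    "invoice-1041": "Invoice INV-1041 for Acme, net 30",
    "invoice-1042": "Invoice INV-1042 for Acme, net 30",
    "refund-policy": "Refunds are issued within 14 days",
}

# Hand-picked toy vectors (NOT from a real model). The two invoices are
# almost indistinguishable because they "mean" nearly the same thing.
DOC_VECS = {
    "invoice-1041": [0.90, 0.10, 0.05],
    "invoice-1042": [0.90, 0.10, 0.04],
    "refund-policy": [0.10, 0.90, 0.20],
}

# Toy vector for the hypothetical query "show me invoice INV-1042".
QUERY_VEC = [0.88, 0.12, 0.05]

# Dense retrieval: ranks by similarity, with no notion of the exact identifier.
ranked = top_k(QUERY_VEC, DOC_VECS, k=2)

# Exact lookup: a plain substring check on the identifier the user typed.
exact = [doc_id for doc_id, text in DOCS.items() if "INV-1042" in text]

print("dense top-2:", ranked)
print("exact match:", exact)
```

With these toy numbers the two invoices score within a fraction of a percent of each other, and the wrong one comes first. Both are "semantically similar" to the query, but only one is correct, and the similarity score alone cannot tell them apart. The substring check is not a proposed solution, only a way to show that the identifier carries a signal the dense ranking did not use.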
