The Domain Expert Bottleneck in RAG: Why Knowledge Curation Breaks Production AI
Most teams building RAG systems spend their first month on the pipeline — chunking strategy, embedding model selection, vector store configuration, retrieval tuning. They get that working. The demo passes. Stakeholders are impressed.
Then six months later, the system starts quietly degrading. Support tickets reference wrong procedures. The bot cites a pricing tier that was retired in Q3. A customer gets a confident answer about a product feature that was deprecated before they even signed up. The pipeline is fine. The knowledge base is the problem.
This is the pattern that 72% of enterprise RAG deployments hit in their first year. Teams discover, after the fact, that building the RAG system was the easy part. The hard part is the ongoing human process of creating, validating, and retiring knowledge content — and almost nobody plans for it in the initial architecture.
The Ingestion Pipe Fallacy
The path of least resistance for knowledge base setup is to point an ingestion pipeline at whatever documentation exists. Notion export, Confluence dump, internal wiki, PDF folder — load it all, chunk it, embed it, ship it.
This works until it doesn't. The ingestion pipe has no opinion about whether the content is correct, current, or contradictory. It faithfully converts whatever you give it into vectors. A 2022 onboarding guide gets embedded alongside a 2025 process update. A deprecated API reference coexists with the current one. Both get retrieved. The model picks one without flagging the conflict.
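To see why, look at what the path of least resistance actually is. The sketch below is a deliberately minimal version of that pipeline; the `docs/` folder, the chunk sizes, and the embedding model are illustrative assumptions, not recommendations. Notice what the loop never asks about each document.

```python
from pathlib import Path

from sentence_transformers import SentenceTransformer  # assumed embedding model

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size character chunks with overlap: the usual first pass."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text), 1), step)]

index = []  # stand-in for whatever vector store you ship with
for path in Path("docs/").rglob("*.md"):
    for piece in chunk(path.read_text()):
        index.append({
            "text": piece,
            "source": str(path),
            "embedding": model.encode(piece),
            # Note what is NOT here: no effective date, no owner, no
            # superseded-by link, no review status. A 2022 guide and its
            # 2025 replacement enter the index as equals.
        })
```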
The result is confident answers grounded in stale definitions. A RAG pipeline scoring 0.95 on faithfulness (meaning it faithfully synthesized what it retrieved) can still return wrong business answers when the index it retrieves from is eight months out of date. Standard evaluation metrics don't measure whether retrieved content is correct in the business sense, only whether the model accurately represented what it found.
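One way to make that gap visible is to score freshness alongside faithfulness. A minimal sketch, assuming each retrieved chunk carries a `last_reviewed` date (exactly the metadata an ingestion-only pipeline never captures) and treating 180 days as the review window; both are placeholders to adapt:

```python
from datetime import date

def staleness(retrieved: list[dict], today: date, max_age_days: int = 180) -> float:
    """Fraction of retrieved chunks past their review window.
    Faithfulness asks: did the model stick to the context?
    This asks: was the context still valid?"""
    if not retrieved:
        return 0.0
    stale = [c for c in retrieved if (today - c["last_reviewed"]).days > max_age_days]
    return len(stale) / len(retrieved)

# A retrieval a faithfulness metric would score highly (the answer quotes
# the chunk verbatim), but the chunk predates the pricing change.
retrieved = [{"text": "Pro tier: $49/mo", "last_reviewed": date(2024, 11, 1)}]
print(staleness(retrieved, today=date(2025, 9, 1)))  # 1.0, fully stale
```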
Three Failure Modes When Curation Is Skipped
Authoritative-sounding answers from stale sources. This is the most insidious failure because it looks healthy from a metrics standpoint. The model isn't hallucinating; it's faithfully synthesizing documents that happen to be wrong. A support agent using a RAG system might quote a refund policy from a version of the terms of service that was superseded three months ago. The answer is internally coherent and cites a real document. The document is just no longer accurate.
Stanford research on legal RAG found that even in high-stakes domains where accuracy is paramount, RAG systems produce hallucinations that are "substantial, wide-ranging, and potentially insidious." The irony is that the citation display, the very mechanism meant to build trust, can actually increase false confidence: it creates the appearance of verified information without the substance.
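Mechanically, guarding against this failure mode means retiring superseded content at retrieval time rather than hoping the model notices. A minimal sketch, assuming a curation process has stamped each chunk with a `superseded_by` pointer and effective-date fields (names invented here for illustration, and absent by definition from an ingestion-only pipeline):

```python
from datetime import date

def filter_current(candidates: list[dict], today: date) -> list[dict]:
    """Drop superseded or out-of-window documents before they reach the model."""
    return [
        c for c in candidates
        if c.get("superseded_by") is None
        and c.get("effective_from", date.min) <= today <= c.get("effective_until", date.max)
    ]

candidates = [
    {"text": "Refunds within 30 days", "superseded_by": "tos-v9",
     "effective_from": date(2023, 1, 1)},
    {"text": "Refunds within 14 days", "superseded_by": None,
     "effective_from": date(2025, 6, 1)},
]
# Only the current 14-day policy survives the filter.
print(filter_current(candidates, today=date(2025, 9, 1)))
```

The filter is trivial; the hard part is the human process that keeps the metadata true, which is the whole point.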
Low retrieval precision from noise content. When the knowledge base accumulates outdated procedures, duplicate explanations, drafts that were never finalized, and articles that were superseded but not removed, retrieval quality degrades measurably. A query that should surface two precise answers instead retrieves seven documents, five of which are tangentially relevant noise. The model receives contradictory signals and hedges, or worse, picks the wrong document confidently.
Every irrelevant document retrieved consumes context-window tokens and dilutes the signal from the relevant ones. Retrieval precision, the fraction of retrieved documents that actually help, sets the ceiling on generation quality. No prompt engineering or model upgrade overcomes a knowledge base that's 40% noise.
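Precision is cheap to measure once you have even a small set of labeled query-to-document judgments. A minimal sketch of precision@k; the document IDs and relevance labels are illustrative:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that are actually relevant."""
    top = retrieved_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant_ids) / len(top)

# The scenario above: seven documents retrieved, only two relevant.
print(precision_at_k(["d1", "d2", "d3", "d4", "d5", "d6", "d7"],
                     relevant_ids={"d2", "d6"}, k=7))  # ~0.29
```

Tracking that number over time is how noise accumulation shows up before users complain.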
Silent coverage gaps. The third failure mode is the one that's hardest to detect: queries where no document covers the actual question. Knowledge bases accumulate content organically, which means they accumulate content about what someone thought to write about, not about what users actually ask. When users ask about a procedure nobody documented, the RAG system retrieves the closest-sounding documents and generates a plausible answer that may be entirely fabricated. There's no obvious failure signal — the model doesn't say "I don't know." It answers confidently from approximate context.
Identifying coverage gaps requires actively mapping the query distribution against the knowledge base — something that never happens in ingestion-only pipelines.
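One way to do that mapping: embed a sample of real user queries, take each query's best-match similarity against the knowledge base, and flag the queries that nothing matches well. A minimal sketch, assuming L2-normalized embeddings on both sides and a similarity threshold (0.45 here) you would tune against labeled examples:

```python
import numpy as np

def coverage_gaps(queries: list[str], query_vecs: np.ndarray,
                  kb_vecs: np.ndarray, threshold: float = 0.45) -> list[str]:
    """Flag logged queries whose best match in the knowledge base is weak:
    a proxy for 'nobody wrote the document this question needs'.
    Assumes both matrices hold L2-normalized embeddings, one row per item."""
    sims = query_vecs @ kb_vecs.T   # cosine similarities, queries x chunks
    best = sims.max(axis=1)         # strongest match per query
    return [q for q, s in zip(queries, best) if s < threshold]
```

Whatever the flagged queries cluster around is the documentation nobody wrote. That list is a work queue for domain experts, not for the pipeline.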
Sources
- https://arxiv.org/html/2401.05856v1
- https://www.evidentlyai.com/llm-guide/rag-evaluation
- https://ragaboutit.com/the-knowledge-decay-problem-how-to-build-rag-systems-that-stay-fresh-at-scale/
- https://nstarxinc.com/blog/the-next-frontier-of-rag-how-enterprise-knowledge-systems-will-evolve-2026-2030/
- https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Hallucinations.pdf
- https://pub.towardsai.net/rag-in-practice-exploring-versioning-observability-and-evaluation-in-production-systems-85dc28e1d9a8
- https://venturebeat.com/data/karpathy-shares-llm-knowledge-base-architecture-that-bypasses-rag-with-an/
