
The Knowledge Half-Life Problem: Why Your RAG System Is Already Wrong

9 min read
Tian Pan
Software Engineer

Your RAG system passed all the retrieval benchmarks. Precision looks solid. The LLM-as-judge eval scores are green. And yet, somewhere in your index, there are documents describing an API endpoint that was deprecated eight months ago, a pricing tier that no longer exists, and a compliance policy that was superseded by new regulations in Q3. Your retriever has no idea. Semantic similarity has no concept of time.

This is the knowledge half-life problem: the silent failure mode where RAG systems appear healthy on every metric you're measuring while serving increasingly stale decisions to users. Seventy-three percent of organizations report accuracy degradation in RAG deployments within 90 days — not from poor retrieval architecture or embedding model quality, but from knowledge staleness that no one modeled as a reliability concern.

Why Semantic Similarity Cannot Save You

The fundamental issue is architectural. Vector similarity measures how closely a query matches a document's meaning. It does not measure when that document was last accurate. A chunk describing a deprecated authentication flow from 18 months ago will score 0.92 against a user's question about logging in — because the content is semantically aligned. The retriever returns it. The LLM incorporates it. The user gets wrong information delivered with full confidence.

The paradox deepens at scale. A corpus of 1,000 documents can sustain sub-hour freshness relatively easily. At 100,000 documents, nightly batch reindexing creates 12-hour staleness windows as a floor. At a million documents, multi-day delays become the norm. And most teams compound this by running every domain on the same pipeline cadence — the regulatory guidance that expires in months gets the same nightly job as the competitive pricing data that expires in hours.

This matters because the failure mode looks exactly like hallucination to end users and to the LLM-as-judge evals designed to catch it. When a model states that SAML is supported in the enterprise plan because that's what the indexed documentation says — and SAML was removed in v3 — no retrieval grader will catch it. The document is semantically relevant. The information was once true. The eval passes. Only a user who tries to configure SAML in production discovers the problem.

Documents Do Not All Age at the Same Rate

The first step toward fixing this is accepting that freshness is a domain property, not a system property. Different content types decay at fundamentally different rates:

  • Pricing and configuration data becomes unreliable within days. An agent quoting last week's per-seat pricing may be quoting a price that no longer exists.
  • API documentation and release notes have a window of roughly two to four weeks before version drift introduces meaningful inaccuracies.
  • Regulatory guidance and compliance policies operate on monthly or quarterly cycles, but individual updates can invalidate large sections of the corpus immediately.
  • Architectural overviews and foundational concepts have half-lives measured in years — reindexing them weekly is wasted compute.

Running all of these through a uniform nightly pipeline is not a neutral choice. It means you're simultaneously over-indexing stable content (burning compute) and under-indexing volatile content (serving stale answers). The mismatch is systematic.

The operational fix is to assign freshness classes at ingest time, not at query time. When a document enters the pipeline, a classification step — whether rule-based on document path and type, or a lightweight classifier trained on your corpus — should tag it with a freshness tier and an acceptable staleness threshold. From that point forward, the refresh scheduler uses freshness class as its primary input, not corpus-wide schedules.
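
A minimal sketch of what that ingest-time tagging might look like, assuming a rule-based lookup keyed on document path; the tier names, path rules, and thresholds here are illustrative, not recommendations:

```python
from dataclasses import dataclass
from datetime import timedelta

# Illustrative freshness tiers and staleness thresholds; tune to your corpus.
FRESHNESS_TIERS = {
    "hot": timedelta(minutes=5),
    "warm": timedelta(days=1),
    "cold": timedelta(days=90),
}

@dataclass
class FreshnessTag:
    tier: str
    max_staleness: timedelta

def classify_freshness(doc_path: str) -> FreshnessTag:
    """Rule-based tagging keyed on document path; a lightweight classifier
    trained on your corpus could replace this lookup with the same interface."""
    path = doc_path.lower()
    if "pricing" in path or "advisories" in path:
        tier = "hot"
    elif "api" in path or "release-notes" in path or "guide" in path:
        tier = "warm"
    else:
        tier = "cold"
    return FreshnessTag(tier, FRESHNESS_TIERS[tier])

# Stored alongside each chunk at ingest so the refresh scheduler and
# query-time filters can read it later.
tag = classify_freshness("docs/pricing/enterprise-tiers.md")
chunk_metadata = {
    "freshness_tier": tag.tier,
    "max_staleness_seconds": tag.max_staleness.total_seconds(),
}
```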

The Three-Tier Architecture That Actually Works

Teams that have solved this in production typically converge on a multi-layer architecture with distinct update cadences per tier.

The hot layer handles content where staleness tolerance is measured in minutes: live pricing APIs, feature flags, security advisories, anything fed from a webhook or change data capture pipeline. This layer uses streaming ingestion rather than batch jobs: changes flow from source to index within seconds via message queues. The operational overhead is significant (roughly three times the complexity of a batch pipeline), but for this category of content, moving from 24-hour staleness to sub-minute freshness is worth that cost.
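
As a rough sketch of the hot path, the handler below applies a single change event to the index; `embedder` and `vector_index` are placeholders for whatever embedding model and vector store clients your stack actually uses:

```python
def handle_change_event(event: dict, embedder, vector_index) -> None:
    """Apply one webhook / change-data-capture event to the index.
    `embedder` and `vector_index` are placeholder clients, not a real API."""
    if event["op"] == "delete":
        # Deletions are first-class events: retire the vectors immediately.
        vector_index.delete(ids=[event["doc_id"]])
        return
    # Create/update path: re-embed only this document and overwrite its entry.
    vector = embedder.embed(event["content"])
    vector_index.upsert(
        ids=[event["doc_id"]],
        vectors=[vector],
        metadata=[{"freshness_tier": "hot", "source_version": event.get("version")}],
    )
```

Wiring this handler to a queue consumer is what keeps source-to-index latency in seconds rather than the hours a nightly batch job implies.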

The warm layer handles content where hourly to daily refresh is appropriate: API documentation, feature guides, process documentation. This is where most incremental indexing logic lives — change detection via timestamps or content hashing, re-embedding only modified chunks, explicit deletion tracking to retire vectors for removed content.

The key insight here is to never do full reindexing when you can do incremental updates. Full reindexing that touches 100,000 chunks to update 200 changed documents is not just wasteful; it creates a consistency window where old and new versions of the same content coexist in the index.
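
One lightweight way to find the 200 changed documents without touching the other 99,800 is a content-hash comparison against metadata recorded at the last index run; this sketch assumes you store that hash per document:

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_documents(source_docs: dict[str, str],
                      indexed_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or modified since the last index
    run, so only their chunks get re-embedded. `indexed_hashes` is assumed to
    hold the hash recorded in chunk metadata at last index time."""
    return [
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
```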

The cold layer covers content that changes rarely: architectural documentation, historical materials, foundational guidelines. These get monthly or quarterly rebuild cycles. The compute cost here is minimal, and running them on the same hot path as volatile content would dilute your streaming pipeline's capacity.

The practical implementation requires routing logic at query time: agents and retrieval pipelines need to know which tier to query based on the type of question being asked. A question about current pricing routes to the hot layer. A question about architecture tradeoffs routes to the cold layer. This routing can be explicit (the application knows what it's asking about) or classifier-driven (a lightweight intent model infers freshness requirements from the query itself).
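
A minimal routing sketch, assuming a keyword heuristic as the fallback; the keywords and tier names are illustrative, and a small intent classifier could replace the heuristic without changing the interface:

```python
def route_query(query: str) -> str:
    """Pick which freshness tier to search. Explicit routing from the
    application should win when available; this heuristic is a fallback."""
    q = query.lower()
    if any(k in q for k in ("price", "pricing", "cost per seat", "current plan")):
        return "hot"
    if any(k in q for k in ("api", "endpoint", "configure", "how do i")):
        return "warm"
    return "cold"

tier = route_query("What does the enterprise plan cost per seat?")  # -> "hot"
# results = retriever.search(query, filter={"freshness_tier": tier})
```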

Detecting Staleness Before It Becomes a User Problem

Freshness tiers solve the refresh scheduling problem. But they don't catch the staleness that already exists in your index, or the cases where source systems change faster than your classification model predicted.

The staleness detection problem has two levels. At the document level, you can track it with metadata: every indexed chunk carries a last-verified timestamp, a version identifier, and an acceptable freshness threshold. Query-time retrieval logic checks these fields alongside semantic similarity scores. A document that would score 0.90 on relevance but is 47 days past its freshness window can be deprioritized or flagged before it enters the LLM's context.
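
One way to express that deprioritization is a score adjustment applied after vector search; the decay formula below is an assumption to illustrate the idea, not a tuned recommendation:

```python
from datetime import datetime, timezone

def staleness_adjusted_score(similarity: float,
                             last_verified: datetime,
                             max_staleness_seconds: float,
                             decay: float = 0.5) -> float:
    """Deprioritize documents past their freshness window. The `decay`
    factor is illustrative and should be tuned against your eval set."""
    age = (datetime.now(timezone.utc) - last_verified).total_seconds()
    if age <= max_staleness_seconds:
        return similarity
    # The further past its window a document is, the harder it gets pushed down.
    overdue_ratio = age / max_staleness_seconds
    return similarity / (1.0 + decay * (overdue_ratio - 1.0))
```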

At the answer level, catching staleness-induced errors requires treating them like any other hallucination vector. The approaches with the best production track records combine token similarity (BLEU/ROUGE scoring between retrieved context and generated answer, which is fast and catches obvious divergences) with LLM-based self-verification for complex cases. Neither method is reliable in isolation: token similarity misses paraphrased outdated facts, and LLM-based detection misses cases where the model doesn't recognize that a cited fact was once true but no longer is.
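
A hedged sketch of combining the two checks, where `llm_verify` stands in for whatever self-verification prompt you run and the overlap threshold is illustrative:

```python
def token_overlap(answer: str, context: str) -> float:
    """Crude unigram overlap, standing in for a proper BLEU/ROUGE score."""
    a, c = set(answer.lower().split()), set(context.lower().split())
    return len(a & c) / max(len(a), 1)

def flag_possible_staleness(answer: str, context: str, llm_verify) -> bool:
    """Return True when the answer deserves review. `llm_verify` is a
    placeholder expected to return True when the answer is supported by
    current, non-expired context."""
    if token_overlap(answer, context) < 0.2:  # illustrative threshold
        return True  # answer diverges from retrieved context
    return not llm_verify(answer=answer, context=context)
```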

The monitoring setup that catches staleness drift before it becomes critical:

  • Track the age distribution of documents appearing in retrieval results, not just documents in the index. If your top-10 retrievals for a category of queries are consistently pulling from documents more than 60 days old, that's a signal regardless of whether those documents are past their nominal freshness threshold.
  • Run a fixed evaluation set monthly on your highest-stakes query categories: questions about pricing, compliance, and product features, answered by documents of known ages. Accuracy degradation on this set is your staleness signal.
  • Set automated alerts when the percentage of indexed documents past their freshness window exceeds a threshold. This is an operational health metric at the same level as p99 latency or error rate; a minimal sketch of the check follows below.
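
Here is that sketch, assuming each chunk's metadata carries the last-verified timestamp and staleness threshold described above; the 5% alert threshold is illustrative, not a recommendation:

```python
from datetime import datetime, timezone

def freshness_violation_rate(index_metadata: list[dict]) -> float:
    """Fraction of indexed chunks past their declared freshness window.
    Each record is assumed to carry `last_verified` (a timezone-aware
    datetime) and `max_staleness_seconds` (float), set at ingest time."""
    now = datetime.now(timezone.utc)
    expired = sum(
        1 for m in index_metadata
        if (now - m["last_verified"]).total_seconds() > m["max_staleness_seconds"]
    )
    return expired / max(len(index_metadata), 1)

# Export as a gauge next to latency and error-rate metrics and alert on it.
ALERT_THRESHOLD = 0.05
```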

The Deletion Problem Almost No One Addresses

Every discussion of RAG freshness focuses on updating content. The deletion problem gets far less attention and causes proportionally more damage.

When a document is removed from a source system — a deprecated product page, a retired API version, a policy that was superseded — batch pipelines typically leave orphaned vectors in the index for days. The document no longer exists at the source. The vector persists. Users get search results pointing to content that 404s, or retrieval results drawn from a knowledge base that's authoritative about something that no longer exists.

The fix is not complicated, but it requires treating deletions as first-class events in the ingestion pipeline. Every document that exits the source corpus needs a corresponding delete operation against the vector index. Change data capture pipelines handle this naturally when the source is a database. For file-based or crawled knowledge bases, the pipeline needs an explicit reconciliation step that identifies vectors with no corresponding live source document and removes them on a schedule. This is not glamorous engineering, but a team that skips it will eventually debug a customer complaint to a vector pointing at a document that was removed six months ago.
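
For file-based or crawled sources, that reconciliation step can be as small as a set difference between live document IDs and indexed IDs; the vector-store methods below are assumed placeholders, not a real client API:

```python
def reconcile_deletions(live_doc_ids: set[str], vector_index) -> list[str]:
    """Remove vectors whose source document no longer exists. The
    `vector_index` client is a placeholder assumed to expose
    `list_document_ids()` and `delete(ids=...)`."""
    indexed_ids = set(vector_index.list_document_ids())
    orphaned = sorted(indexed_ids - live_doc_ids)
    if orphaned:
        vector_index.delete(ids=orphaned)
    return orphaned  # log these so deletions stay auditable
```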

Treating Freshness as Infrastructure, Not Backlog

The meta-pattern here is a governance problem as much as an engineering one. Knowledge bases in production RAG systems require the same operational discipline as databases: clear ownership, monitoring, and incident procedures for freshness failures.

In practice, this means explicitly assigning document staleness responsibility to someone on the engineering or data team — not leaving it as a shared assumption. Include freshness metrics in your reliability dashboards alongside latency and error rates. Escalate freshness violations as incidents, not backlog items, when they affect time-sensitive business logic.

Sixty percent of enterprise RAG deployments fail at scale not because the retrieval architecture is wrong, but because no one treated the knowledge base as infrastructure that requires continuous maintenance. The initial build of a RAG system is relatively straightforward. Keeping it accurate at month three, month nine, and month eighteen — when documents have drifted, policies have changed, and the product has evolved — is the harder and more important problem.

The teams that get this right build freshness classification into their ingest pipeline from day one, operate tiered update schedules matched to content decay rates, run staleness detection as a continuous background process, and treat deletion events with the same rigor as updates. The teams that don't will spend their time debugging why their RAG system — which worked perfectly in the demo — is producing confidently wrong answers about things that were true six months ago.
