Stale Retrieval: The Data Quality Problem Your RAG Pipeline Is Hiding

· 10 min read
Tian Pan
Software Engineer

Your RAG system is lying to you about the past. When a user asks about current pricing, active security policies, or a feature that shipped last quarter, the retrieval pipeline returns the most semantically similar document in the index — not the most recent one. An 18-month-old pricing page and this morning's update look identical to cosine similarity. Nothing in the standard RAG stack has any concept of whether the retrieved document is still true.

This is stale retrieval, and it fails differently than hallucination. The model isn't inventing anything. It accurately summarizes real content that once existed. Standard evaluation metrics — faithfulness, groundedness, context precision — all pass. The system is confidently correct about a fact that stopped being correct months ago.

Teams that discover this problem typically discover it the hard way: a customer service agent citing discontinued products, a compliance assistant referencing retired policies, a technical chatbot describing a deprecated API method with full confidence. By then, the damage is done.

Why Embedding-Based Retrieval Is Temporally Blind

The architectural cause is simple. Embedding models map text into a geometric space where similar meanings land near each other. Nothing in that mapping encodes when the document was written, whether it was deleted from the source, or whether a newer version superseded it.

This creates a class of failure that's easy to miss in offline evaluation: you test with a curated query set, your reference documents are current, metrics look fine. The retrieval corpus drifts in production, but your eval suite doesn't track that drift. Six months later, accuracy at your core question types has collapsed by a third, and you're debugging it as a model problem.

The corpus drifts faster than teams expect. Research tracking LangChain's documentation corpus found it shrank 67% — from over 11,000 to under 3,700 documents — in a single year. Content doesn't disappear cleanly; it moves, fragments, gets reorganized. Your index holds the old shape while the source has moved on.

Three Failure Modes That Standard Evals Miss

1. Zombie Documents

When a document is deleted from the source system — a CMS, an S3 bucket, a database table — nothing tells the vector store. The embedding lives on in the index, indistinguishable from live content, retrievable on every relevant query.

This is not just a quality problem. Documents deleted for GDPR reasons can persist in the vector store for months after deletion from the source. Safety manuals in multiple versions — current and superseded — can exist simultaneously in the index, with retrieval returning whichever one scores highest on any given query. The retrieval system has no way to prefer the current version; it picks the one closest in embedding space.

The fix is a deletion pipeline, not just a deletion API. Modern vector databases support deletion by ID or metadata filter. The operational gap is building the event pipeline that calls the API reliably when source documents change. Options include CDC connectors (Debezium, Kafka-based) for database-backed corpora, webhook integrations for CMS platforms, and scheduled reconciliation jobs for corpora that update less frequently. The right choice depends on how fast your source content changes and how much staleness you can tolerate.
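For corpora that can tolerate some delay, the scheduled-reconciliation option reduces to a set difference between source IDs and index IDs. The sketch below assumes hypothetical names (`reconcile_deletions`, a delete-by-ID call); a real version would page through both systems rather than load every ID into memory.

```python
# Minimal sketch of a scheduled reconciliation pass for zombie documents.
# In production you would page through IDs rather than load them all at once,
# and pass the result to your vector store's delete-by-ID API in batches.

def reconcile_deletions(source_ids: set[str], index_ids: set[str]) -> set[str]:
    """IDs still present in the vector index but gone from the source system."""
    return index_ids - source_ids

orphans = reconcile_deletions(
    source_ids={"doc-1", "doc-2"},
    index_ids={"doc-1", "doc-2", "doc-3"},  # doc-3 was deleted at the source
)
# orphans == {"doc-3"}
```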

2. Silent Version Drift

Your pricing page updates. Your policy document gets a quarterly revision. Your API docs get restructured with a major version bump. If you're not propagating updates to the vector store, you now have both versions indexed. Retrieval returns whichever one scores higher for the query — with no signal about which version is current.

The insidious part is that this failure compounds gradually. In week one, the old and new versions are semantically close, so the answer is only mildly wrong whichever one wins. By month three, the stored embeddings and the source content have diverged enough that retrieval behaves unpredictably on exactly the queries where accuracy matters most.

Content hashing is a reliable detection pattern: store a hash of the source document content at ingestion time, then run periodic reconciliation that re-fetches source documents and compares hashes. Any mismatch triggers a re-embed and upsert. This is less elegant than real-time change propagation but works for corpora where you can't instrument the source system for events.
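The hashing side of that pattern is small enough to sketch directly; the helper names here are illustrative, not from any particular library:

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable hash of document content, stored alongside the embedding at ingest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reembed(stored_hash: str, current_text: str) -> bool:
    """True when the re-fetched source no longer matches what was indexed."""
    return content_hash(current_text) != stored_hash
```

The reconciliation job re-fetches each source document, calls `needs_reembed` against the hash stored at ingestion, and queues a re-embed and upsert on any mismatch.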

3. Embedding Model Version Mismatch

A subtler variant: you upgrade your embedding model. Query embeddings now use the new model. Document embeddings in the index still use the old one. They inhabit different geometric spaces. Cosine similarity between them is no longer meaningful.

This failure mode is particularly dangerous because it degrades gradually and the symptoms look like a model problem. Teams that have hit this during a migration from one OpenAI embedding model to a newer one report cosine similarity scores dropping from 0.85+ to around 0.65, with precision metrics that look acceptable in dev but fail in production. Diagnosis takes days, not hours, because nothing in the standard stack alerts on a version mismatch between query and document embeddings.

The fix is straightforward: track the embedding model version for every document at ingestion time, and alert when the query-path model version diverges from the stored document version. Make full re-indexing a first-class migration operation, not an afterthought.
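That check is a one-liner once the version field exists. A sketch, with made-up model version strings:

```python
def check_embedding_versions(query_model: str, doc_models: set[str]) -> set[str]:
    """Return document-side model versions that diverge from the query path."""
    return {m for m in doc_models if m != query_model}

# Example: part of the corpus was indexed before the model upgrade.
mismatched = check_embedding_versions(
    query_model="embed-v2",
    doc_models={"embed-v1", "embed-v2"},
)
if mismatched:
    # alert, or fail the deploy: those documents need re-indexing
    pass
```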

The Stale Retrieval vs. Hallucination Distinction

This distinction matters for diagnosis, not just for labeling. When a model hallucinates, it invents content not present in its training data or retrieved context. When retrieval is stale, the model accurately describes a retrieved source — but that source is no longer current.

The practical consequence: you cannot fix stale retrieval by improving your generation model or your prompting. You cannot catch it with faithfulness metrics or groundedness checks. The generated answer is grounded; the grounding itself is outdated. Fixing it requires addressing the retrieval pipeline, not the LLM layer.

Research on self-correcting retrieval systems (CRAG) found that adding a lightweight evaluator between retrieval and generation — one that scores retrieved document quality and triggers a fresh web search when it detects low-quality or outdated content — improved accuracy on real-world question answering by 14–37% across different domains. The key insight is that intervention between retrieval and generation is more effective than trying to compensate at generation time.

Making Freshness a First-Class Metadata Attribute

The most durable fix is treating document freshness as a first-class schema attribute at ingestion time, not as an afterthought.

Every document ingested into a vector store should carry:

  • indexed_at: When the embedding was created (not when the source document was created — these are different)
  • source_last_modified: The modification timestamp from the source system
  • ttl_days: The expected lifespan of this document's validity
  • content_hash: Hash of the document content at indexing time
  • embedding_model_version: Which model generated this embedding

The ttl_days value should differ by content type. Pricing information might expire in seven days. API reference documentation in two weeks. Compliance policies in six months. Archival content might carry no expiry. The appropriate TTL is a product decision, not a technical one — but engineers need to surface it as a configurable property, not bury it in deployment configuration.

Time-Aware Scoring at Query Time

Once freshness metadata exists, you can apply temporal decay at query time. The simplest form is a half-life decay applied to retrieval scores:

score(query, doc) = α · cosine_similarity(query, doc) + (1 − α) · 0.5^(age_days / half_life)

With α around 0.7 and a half-life of 14 days, research shows this formula achieves essentially perfect retrieval accuracy on freshness-sensitive queries — up from near-zero with semantic-only retrieval. The temporal signal is not marginal; at lower α values the improvement is dramatic.
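The formula translates directly into code. A minimal sketch with the α = 0.7, 14-day half-life defaults mentioned above:

```python
def freshness_score(cos_sim: float, age_days: float,
                    alpha: float = 0.7, half_life_days: float = 14.0) -> float:
    """Blend semantic similarity with an exponential freshness decay."""
    decay = 0.5 ** (age_days / half_life_days)
    return alpha * cos_sim + (1 - alpha) * decay

# A fresh document keeps the full temporal term...
fresh = freshness_score(cos_sim=0.80, age_days=0)    # 0.7*0.8 + 0.3*1.0 = 0.86
# ...while at one half-life the temporal term has halved.
aged = freshness_score(cos_sim=0.80, age_days=14)    # 0.7*0.8 + 0.3*0.5 = 0.71
```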

Production systems like Ragie implement a simpler bucket approach: weight documents differently based on age brackets (last hour gets 1.0, last day gets 0.9, last week gets 0.8, and so on), then compute a hybrid score within each bucket. This is less mathematically precise than continuous decay but more interpretable and easier to tune.
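A bucketed variant might look like the sketch below. The hour/day/week weights come from the description above; the month-and-older weights are illustrative assumptions, not Ragie's actual values.

```python
# (max age in days, weight) pairs, checked in order.
AGE_BUCKETS = [
    (1 / 24, 1.0),   # last hour
    (1, 0.9),        # last day
    (7, 0.8),        # last week
    (30, 0.6),       # last month (assumed weight)
]

def bucket_weight(age_days: float, floor: float = 0.4) -> float:
    for max_age_days, weight in AGE_BUCKETS:
        if age_days <= max_age_days:
            return weight
    return floor     # everything older (assumed weight)

def hybrid_score(cos_sim: float, age_days: float, alpha: float = 0.7) -> float:
    return alpha * cos_sim + (1 - alpha) * bucket_weight(age_days)
```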

For the highest-freshness domains, pre-filtering is more appropriate than scoring. Applying a time-window WHERE clause at the database level — before ranking occurs — eliminates stale documents from the candidate set entirely, rather than letting them enter the ranking pipeline with lower weights. A deprecated compliance document ranked first by semantic similarity never enters the results if it's filtered out in the query.

Retrieval Regression Testing in CI/CD

Most RAG teams run offline evals against a static snapshot of the corpus. This misses the class of failure introduced by corpus drift. A retrieval regression suite should include a freshness slice: a curated set of time-sensitive queries with known ground truth, run against the live production corpus on every deploy.

This requires:

  • Queries whose correct answers are known and time-sensitive (pricing, policy references, API versions)
  • Expected answers that you update when the source truth changes
  • Automated comparison that fails the build if freshness recall drops below a threshold
  • A dashboard tracking the age distribution of retrieved chunks over time

The age distribution metric is particularly useful. If you plot the 25th/50th/75th percentile age of retrieved chunks over time, you'll see corpus drift before it affects user-facing metrics. A gradual shift toward older retrieved content is a leading indicator of freshness degradation.
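Computing those percentiles needs nothing beyond sorting and linear interpolation; a minimal sketch over the ages (in days) of chunks served during some window:

```python
def age_percentiles(ages_days: list[float]) -> tuple[float, float, float]:
    """25th/50th/75th percentile chunk age, via linear interpolation."""
    s = sorted(ages_days)
    def pct(p: float) -> float:
        k = (len(s) - 1) * p
        lo = int(k)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (k - lo)
    return pct(0.25), pct(0.5), pct(0.75)

p25, p50, p75 = age_percentiles([1, 2, 3, 4, 5])   # -> (2.0, 3.0, 4.0)
```

Plotting these three numbers per day is enough to see the distribution creeping older before any user-facing metric moves.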

The full observability stack should log chunk provenance on every retrieval — document ID, indexing timestamp, source modification timestamp, embedding model version — so you can query these fields in your monitoring system and alert on conditions like "retrieved content older than 180 days constitutes more than 20% of served results."
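The quoted alert condition reduces to a single ratio over the logged provenance fields. A sketch, assuming chunk ages have already been extracted from the retrieval logs:

```python
def stale_fraction(retrieved_ages_days: list[float],
                   threshold_days: float = 180.0) -> float:
    """Share of served chunks older than the threshold.

    Alert when this exceeds 0.20 over a monitoring window.
    """
    if not retrieved_ages_days:
        return 0.0
    stale = sum(1 for age in retrieved_ages_days if age > threshold_days)
    return stale / len(retrieved_ages_days)
```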

High-Risk Content Categories

Not all content ages at the same rate, and not all staleness has the same impact. The highest-risk categories, where stale retrieval creates the most damage:

  • Pricing and commercial terms: Legally sensitive; citing outdated pricing creates liability exposure
  • Security advisories and CVE references: A months-old advisory may describe a patched vulnerability as current
  • Compliance and regulatory policies: Stale policy content can constitute a compliance violation in regulated industries
  • Active product documentation: Features added, deprecated, or renamed after the indexing date
  • Organizational information: Personnel, org charts, and reporting structures change continuously

For these categories, a stricter approach is justified: more frequent re-indexing cycles, shorter TTLs, pre-filtering at query time, and dedicated freshness monitoring rather than relying on aggregate metrics.

Building the Pipeline

The operational pattern for a production-grade freshness pipeline has four components:

Ingestion with metadata: Store freshness fields at ingest time. Make ttl_days a required field, not optional. Enforce it at the ingestion API boundary.

Change propagation: Build the event pipeline that notifies the vector store when source documents change or are deleted. The right architecture depends on your source system — CDC for databases, webhooks for CMS platforms, scheduled reconciliation for static file stores.

Query-time scoring or filtering: Apply temporal decay or pre-filtering at query time based on document age and content type. Make the decay parameters configurable without a code deploy.

Freshness monitoring: Log chunk age on every retrieval. Alert on age distribution shifts. Include a freshness regression suite in your CI/CD pipeline.

None of these components is particularly difficult to build. The failure mode is treating them as optional, assuming the retrieval corpus will stay accurate without active maintenance. It won't. Corpus drift is the default state of any production system that hasn't been explicitly designed to prevent it.

A RAG system that doesn't model time is incomplete. Semantic similarity gets you close to the right document; temporal awareness tells you whether that document still describes the world correctly.
