RAG Against a Phantom Inventory: When Your Corpus Describes Features Your Product Removed
A customer asks your support agent how to do something. The agent retrieves three documentation chunks with high relevance scores, synthesizes a confident answer, and walks the customer through a five-step procedure that ends on a button that hasn't existed for four months. The customer files a ticket. The on-call engineer pulls the eval suite, finds it green, pulls the retrieval traces, finds them green too — the model didn't hallucinate, it faithfully quoted documentation describing a feature your product team renamed in the last quarterly release.
This is the failure mode I want to name: not a hallucination, not a retrieval miss, but a phantom inventory problem. Your retrieval corpus is a snapshot of a product surface that no longer exists. The vector store doesn't know the product changed. The eval suite doesn't know either. The only system that consistently catches it is the support ticket queue, and by the time a ticket is filed the customer has already been told to click a button that isn't there.
The problem hides in plain sight because every component is functioning correctly when measured against its own contract. The indexer indexed the docs it was given. The retriever surfaced the chunks most semantically similar to the query. The model grounded its response in the retrieved context exactly as instructed. The eval suite checked "did the model use the retrieved context faithfully" and got a green answer. Each contract was honored. The contract that nobody wrote — "the retrieved context describes a product surface that currently exists" — failed silently.
Why Semantic Similarity Doesn't Care That Your Product Changed
Vector retrieval has no concept of time and no concept of product state. A doc chunk written eighteen months ago about a feature you removed last quarter will score as close to a relevant query today as it did when the feature shipped. Worse, the deprecated doc often scores higher than the current doc, because the deprecated content was written when the feature was the centerpiece of a marketing push and contains rich, query-aligned language ("step-by-step", "click here", "to enable this"), while the replacement feature's doc is newer, shorter, and uses different terminology that the user's natural-language query doesn't match.
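To see the blindness concretely, consider the scoring loop at the heart of vector retrieval, sketched below with hypothetical chunk records. The score is a function of two vectors and nothing else; a chunk's age and the product's current state never enter the ranking, so any time-awareness has to be bolted on explicitly.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Pure geometry: the score depends only on the two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_chunks(query_vec: np.ndarray, chunks: list[dict]) -> list[dict]:
    # `chunks` are hypothetical records: {"text": ..., "vec": ..., "last_verified": ...}.
    # Note what this function never reads: chunk["last_verified"].
    # A chunk about a feature removed last quarter ranks purely on how
    # well its language matches the query.
    return sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
```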
Industry write-ups put the scale of the problem on a depressing footing: one analysis reports that roughly 73% of enterprise RAG deployments degrade meaningfully within their first year, with knowledge staleness consistently cited as the top reason. Another traced a regulatory compliance bot at a bank that retrieved a Basel III document from 2022 and confidently quoted capital adequacy thresholds that had been updated twice since indexing. The model didn't hallucinate; the corpus was a time capsule.
The bank case is illustrative because the failure mode is generalizable to anyone running a product-docs RAG: the corpus is a derivative artifact of upstream state, and that upstream state is changing on a cadence the corpus owner doesn't observe. Vector search behaves like a librarian who hasn't checked whether any of the books on the shelf still describe a building that exists. The librarian's job, as written, is to fetch books matching a query. The contract is honored. The customer gets a tour of a wing the renovation team demolished.
The Evals That Pass While the Feature Burns
The reason this failure persists for months in production is that the standard RAG eval surface doesn't grade against reality. Look at what most eval suites actually check:
- Retrieval relevance: did the retriever surface chunks semantically related to the query? Yes, the chunks about the deprecated feature are highly related to the query about how to do the thing that feature did.
- Groundedness: did the model's response cite the retrieved chunks rather than inventing claims? Yes, the response is a faithful summary of the deprecated documentation.
- Answer correctness against a labeled set: did the model's response match the expected answer in the eval test cases? Yes, because the eval cases were labeled six months ago when the feature still existed.
Every metric the team watches turns green while the customer walks into a phantom button. The eval suite is grading the model's behavior within the retrieved context, not whether the retrieved context is still valid. These are different questions, and the team has answered only one of them.
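A hypothetical eval record makes the gap visible: three checks computed and green, one check never computed at all. Everything below is illustrative, not a specific framework's schema.

```python
# Hypothetical eval record for the phantom-button query.
case = {
    "query": "How do I enable the export button?",
    "retrieved": ["chunk describing the removed export button"],
    "response": "Go to Settings and click the Export button.",
}

checks = {
    # 1. Retrieval relevance: chunks are semantically on-topic. PASS.
    "relevance": True,
    # 2. Groundedness: the response faithfully cites the chunks. PASS.
    "groundedness": True,
    # 3. Answer match: agrees with a label written when the feature existed. PASS.
    "answer_match": True,
    # 4. Current validity: does the cited surface still exist in the product?
    #    Nobody computes this; it requires a feed from product engineering.
    "currently_valid": None,  # <- the unwritten contract
}
```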
Worse, the eval set itself is often staler than the corpus. A practitioner who builds a labeled eval set in March rarely goes back in November to ask "are the expected answers in this eval set still the right expected answers, given that the product changed three times since March?" The eval cases become a frozen snapshot of an outdated product surface, and the team ends up measuring "how well does the system reproduce the answers that were correct when we wrote the eval" rather than "how well does the system describe the product as it exists today."
The support ticket queue, in contrast, is graded against reality, because the customer is interacting with the live product. The information is there — it's in Zendesk, not in the eval dashboard. The org seam that prevents support tickets from feeding back into the eval suite is the same seam that prevents the corpus from being reconciled against the product roadmap. Both are the same problem in different forms: the data that would correct the system lives in a team whose tooling is disconnected from the team that operates the system.
Why "Just Re-index More Often" Doesn't Fix It
The intuitive response from an engineering team is "we'll re-index nightly" or "we'll watch the docs repo for commits and re-index on push." This solves a narrow version of the problem — embedding staleness, where the doc has changed but the index hasn't caught up. It does nothing for the actual failure mode, because:
- The deprecated doc often hasn't been deleted. It's still in the docs site, perhaps at a slightly different URL, perhaps with a small banner that says "this feature has been replaced by X" that the chunker doesn't preserve. Re-indexing it doesn't make it disappear from the corpus; it re-indexes the same stale content.
- The deletion is asymmetric. The product team removed the feature in a sprint. The docs team got a ticket to update the docs, prioritized it against twenty other tickets, and the deprecation page is in the queue behind a launch. The corpus reflects the docs team's backlog, not the product team's release notes.
- The chunking pipeline strips deprecation banners. Even when the docs team adds a "deprecated" banner to a page, the chunker that splits the page into 500-token windows often drops the banner from chunks two through eight. The retriever surfaces chunk four, the banner is gone, and the model has no signal that the content is dated (a banner-propagating chunker is sketched after this list).
- Implicit expiration is the norm. Most documents describing a feature don't say "this content is valid until 2025-08-15." They describe the feature as if it will exist forever. There is no scalar value the indexer can read to know the content is rotting. The expiration lives in a separate system (the product roadmap, the deprecation log, the engineering Jira) that the corpus doesn't subscribe to.
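On the banner-stripping point, the fix is mechanical once named. Here is a minimal sketch of a chunker that re-attaches a page-level deprecation banner to every window instead of only the first; the banner convention, the window size, and the word-as-token approximation are all assumptions, not a prescription:

```python
DEPRECATION_MARKER = "[DEPRECATED]"  # hypothetical banner convention

def chunk_page(page_text: str, window: int = 500) -> list[str]:
    """Split a docs page into fixed-size windows (words as a crude token
    proxy), re-attaching any deprecation banner to every chunk."""
    lines = page_text.splitlines()
    # Assume the banner, if present, is the first line of the page.
    banner = lines[0] if lines and DEPRECATION_MARKER in lines[0] else None
    body = "\n".join(lines[1:]) if banner else page_text

    words = body.split()
    chunks = [" ".join(words[i:i + window]) for i in range(0, len(words), window)]
    if banner:
        # Every chunk now carries the signal, so the retriever can't
        # surface chunk four with the banner silently stripped.
        chunks = [f"{banner}\n{c}" for c in chunks]
    return chunks
```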
A team that responds to phantom-inventory bugs by tuning re-indexing cadence is treating a structural defect with a knob calibration. The frequency of re-indexing is downstream of a much bigger problem: the corpus is decoupled from the source of truth about what features currently exist.
What Has to Change at the Org Level
The fix is not primarily a retrieval engineering project. The fix is recognizing that the RAG corpus is a derivative artifact of product state, and treating that dependency the way you'd treat any other production data dependency — with ownership, a contract, and an alert on the contract.
Corpus ownership belongs in product engineering, not the AI team. The team that owns the answer to "what features exist this week" should also own the answer to "what's in the index." In most orgs today, docs are owned by technical writing, the index is owned by the AI team, the product roadmap is owned by PM, and the reconciliation between them is owned by no one. The fix is to assign a named owner whose performance review includes the question "did the corpus describe features that no longer exist?"
Every feature deprecation triggers a corpus reconciliation step. Add a checkbox to the deprecation ticket template: "removed from RAG corpus, confirmed by retrieval probe." A retrieval probe is a saved query that should not return results about the deprecated feature after deprecation — the team runs the probe, confirms zero or banner-only results, and signs off. This converts deprecation from a docs-only task to a docs-and-index task.
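A probe might look like the following sketch. The retriever interface, the metadata fields, and the probe registry are assumptions about your stack, not a specific library's API:

```python
# Hypothetical probe registry: one canonical query per deprecated feature.
PROBES = {
    "legacy-export-button": "how do I enable the export button",
}

def run_probe(retriever, feature_id: str, query: str, k: int = 10) -> bool:
    """Return True if the probe passes: no live (non-banner) chunks about
    the deprecated feature are retrievable for its canonical query."""
    hits = retriever.search(query, top_k=k)
    offenders = [
        h for h in hits
        if h.metadata.get("feature_id") == feature_id
        and not h.metadata.get("deprecation_banner", False)
    ]
    return len(offenders) == 0
```

Running the probe set on every index rebuild, not just at deprecation sign-off, turns the checkbox into a regression guard.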
Surface freshness in the prompt, not just the index. When a chunk is retrieved, the model should see a freshness label — last_verified: 2026-03-12, freshness_class: fast_decay — and the system prompt should instruct the model to surface uncertainty when freshness is below threshold ("the documentation I'm citing may describe a deprecated feature; please verify in the current product"). This won't catch every case, but it converts silent failures into hedged answers that the customer can challenge.
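One way to plumb that through, sketched under assumed field names and staleness thresholds:

```python
from datetime import date

STALENESS_DAYS = {"fast_decay": 90, "slow_decay": 365}  # assumed thresholds

def render_context(chunks: list[dict], today: date) -> str:
    """Prefix each retrieved chunk with its freshness label so the model
    can hedge instead of asserting. Field names are illustrative."""
    blocks = []
    for c in chunks:
        age = (today - c["last_verified"]).days
        stale = age > STALENESS_DAYS.get(c["freshness_class"], 365)
        header = f"[last_verified: {c['last_verified']} | freshness_class: {c['freshness_class']}]"
        if stale:
            header += " [STALE: may describe a removed or renamed feature]"
        blocks.append(f"{header}\n{c['text']}")
    return "\n\n".join(blocks)
```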
Add a deprecated-feature eval slice. Maintain a list of features your product has removed in the last twelve months and run an eval query for each one: "how do I do X?" where X is the deprecated capability. The expected answer is "X is no longer supported; here is the replacement." The pass rate on this slice is a tracked metric. When the slice regresses, you know the corpus contains content that should have been removed.
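A minimal version of that slice, with a crude keyword check standing in for an LLM judge and a hypothetical rag_answer entry point:

```python
# Hypothetical slice: one query per feature removed in the last 12 months.
DEPRECATED = [
    {"feature": "export button", "replacement": "the Share > Download flow"},
    {"feature": "legacy API keys", "replacement": "scoped service tokens"},
]

def eval_deprecated_slice(rag_answer) -> float:
    """Pass iff the answer acknowledges removal. Returns the slice pass
    rate, which is the metric to track and alert on."""
    passed = 0
    for case in DEPRECATED:
        answer = rag_answer(f"How do I use the {case['feature']}?").lower()
        if any(p in answer for p in ("no longer", "deprecated", "removed", "replaced")):
            passed += 1
    return passed / len(DEPRECATED)
```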
Source freshness from the product team, not the index. The freshness label on a chunk should be derived from when the underlying feature was last verified to exist in the product, not when the doc was last edited or when the index was last refreshed. This requires a pipeline where deprecation events in the product team's tooling propagate to chunk-level metadata in the index — a real integration, not a quarterly audit.
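Concretely, that integration can be as small as an event handler. The event payload and the update_metadata call below are stand-ins for whatever your product tooling and vector store actually expose:

```python
# Hypothetical handler: a deprecation event from the product team's tooling
# (webhook, queue message) propagates to chunk-level metadata in the index.
def on_deprecation_event(event: dict, index) -> None:
    """event = {"feature_id": ..., "removed_in": ..., "replacement": ...}
    `index.update_metadata` is a stand-in for your vector store's API."""
    index.update_metadata(
        filter={"feature_id": event["feature_id"]},
        patch={
            "deprecated": True,
            "removed_in": event["removed_in"],
            "replacement": event.get("replacement"),
            "last_verified": None,  # no longer verified to exist
        },
    )
```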
The Architectural Realization
The reason this failure mode is so persistent is that engineering teams have a strong intuition that a search index is a self-contained system: bytes in, bytes out, refresh on a schedule. A RAG corpus is not that kind of system. It is a projection of upstream product state into an indexable form, and a projection is only as fresh as the events that update it. If the upstream state changes via a roadmap decision made in a Notion doc that nobody plumbed into the corpus pipeline, the projection drifts. The drift compounds. The model serves it confidently. The customer follows it into a wall.
Two architectural mental shifts are required to fix it. The first is to model the corpus as a derived dataset with an upstream dependency, the same way you'd model a materialized view in a database — and to alert on the dependency the same way. The second is to recognize that the source of truth about "does this feature exist?" lives in product engineering, not in the AI org, and that any pipeline that doesn't subscribe to product engineering's deprecation events is going to fail this test. Both shifts move the work upstream of where the failures surface, which is uncomfortable but necessary.
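The materialized-view framing suggests the alert almost writes itself: a corpus-level staleness ratio with a budget, sketched below with assumed field names and thresholds.

```python
from datetime import date

def corpus_staleness(chunks: list[dict], today: date, max_age_days: int = 90) -> float:
    """Fraction of chunks not verified against the live product recently.
    Assumes each chunk carries a `last_verified` date (or None)."""
    if not chunks:
        return 0.0
    stale = sum(
        1 for c in chunks
        if c["last_verified"] is None
        or (today - c["last_verified"]).days > max_age_days
    )
    return stale / len(chunks)

def check_freshness_slo(chunks: list[dict], today: date, budget: float = 0.05) -> None:
    ratio = corpus_staleness(chunks, today)
    if ratio > budget:
        # Page the corpus owner, the same way a broken materialized view would.
        raise RuntimeError(f"Corpus staleness {ratio:.1%} exceeds budget {budget:.0%}")
```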
The team that doesn't make these shifts will keep shipping a feature whose accuracy ceiling is set by the staleness of someone else's roadmap — a ceiling they will never raise by tuning the retrieval stack, because the retrieval stack isn't the problem. The team that makes the shifts gets a system where the answer to "is the corpus current?" has a clear owner, a tracked metric, and a deterministic alert. That's the only version of RAG that survives contact with a product that ships.
Sources
- https://ragaboutit.com/the-knowledge-decay-problem-how-to-build-rag-systems-that-stay-fresh-at-scale/
- https://ragaboutit.com/the-rag-freshness-paradox-why-your-enterprise-agents-are-making-decisions-on-yesterdays-data/
- https://glenrhodes.com/data-freshness-rot-as-the-silent-failure-mode-in-production-rag-systems-and-treating-document-shelf-life-as-a-first-class-reliability-concern-3/
- https://hackernoon.com/embedding-staleness-is-probably-corrupting-your-rag-system-right-now
- https://arxiv.org/html/2510.08109
- https://arxiv.org/pdf/2503.15231
- https://arxiv.org/pdf/2509.19376
- https://medium.com/@leeladesai/knowledge-drift-the-silent-ai-killer-in-rag-models-034eb35c7af4
- https://towardsdatascience.com/rag-is-blind-to-time-i-built-a-temporal-layer-to-fix-it-in-production/
- https://www.brainfishai.com/blog/rag-accuracy-degradation-in-production
- https://www.infoq.com/articles/domain-driven-rag/
- https://www.informationweek.com/data-management/nobody-told-legal-about-your-rag-pipeline-why-that-s-a-problem
