The Embedding That Aged Out of Meaning
You embedded the knowledge base eighteen months ago. The model has not changed. The chunks have not changed. The index is healthy, the latency is fine, the recall dashboard is a flat line at 0.86. And yet support is quietly pasting the wrong article links into ticket replies, the sales bot keeps surfacing a deprecated SKU when a prospect asks about the new one, and an internal user just told you the assistant "feels dumber" without being able to say why.
Nothing broke. Your embeddings aged. The word "post" used to mean a blog post in your domain; now half the queries you receive use it for a Slack post, a forum post, or a job posting, and your eighteen-month-old vectors still treat it as one concept. The model that encoded those vectors never saw the new senses, never saw the new product names, never saw the rebrand, never saw the regulation that introduced three new terms your customers now use without thinking. The retrieval system answers the question it knows how to answer, which is no longer the question your users are asking.
This failure mode is not the one the literature warns you about. The well-known story is that you upgrade your embedding model and the coordinate system shifts under you, so v1 vectors and v2 vectors become incomparable and you have to reindex. That problem is loud — the migration is on a Jira ticket, somebody owns it, the runbook says alias swap. The failure described here is the opposite: nobody upgraded anything, nobody migrated anything, and the system silently drifted into wrongness because the world outside your stack kept editing what its own words mean.
Embeddings encode a snapshot of meaning that the world keeps editing
An embedding is a frozen opinion about which words live near which. The model's pretraining corpus had a distribution of contexts for every token, and the resulting vector encodes that distribution. The moment you embed a document, you have committed your retrieval system to that opinion. The vector does not update when usage shifts. The vector does not know that "agent" now defaults to "LLM agent" in your industry rather than "browser user-agent" or "real-estate agent." It still positions the chunk wherever the 2024 distribution said it belonged.
Hamilton, Leskovec, and Jurafsky's diachronic-embedding work showed two patterns in how language changes over time that are worth keeping in mind for production retrieval. Rare words change meaning faster than common ones, and polysemous words — words that already carry multiple senses — change faster than monosemous ones, even after controlling for frequency. In a B2B product, the exact words that change fastest are the ones you care about most: jargon, acronyms, product names, regulatory shorthand. These are low-frequency in the open web (so the embedding model has weak priors on them) and highly polysemous in your domain (so they are exactly the words whose sense the world will reshape).
The corpus you embedded is fixed. The query distribution flowing into the system is not. Every week your users write queries in the vocabulary of the current world. Every week the gap between how your users name things now and how your vectors named things eighteen months ago gets a little wider. This is not a model bug. It is not an index bug. It is a calendar problem.
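One rough way to put a number on that widening gap is to track, week by week, the share of queries containing at least one token that never appeared in the corpus at embedding time. The sketch below is illustrative only: the regex tokenizer is a stand-in for whatever tokenization suits your domain, the function names are hypothetical, and the sample chunks and queries are invented for the example.

```python
import re
from collections import defaultdict
from datetime import date

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; a crude stand-in for a real tokenizer."""
    return set(TOKEN_RE.findall(text.lower()))


def corpus_vocabulary(chunks: list[str]) -> set[str]:
    """Every token present in the corpus when it was embedded."""
    vocab: set[str] = set()
    for chunk in chunks:
        vocab |= tokenize(chunk)
    return vocab


def novel_query_share_by_week(queries, vocab):
    """queries: iterable of (date, query_text) pairs.

    Returns {(iso_year, iso_week): share of that week's queries that
    contain at least one token missing from the corpus vocabulary}.
    """
    totals = defaultdict(int)
    novel = defaultdict(int)
    for day, text in queries:
        iso = day.isocalendar()
        week = (iso[0], iso[1])
        totals[week] += 1
        if tokenize(text) - vocab:  # any token the corpus never contained
            novel[week] += 1
    return {week: novel[week] / totals[week] for week in sorted(totals)}


# Invented sample data, for illustration only.
vocab = corpus_vocabulary([
    "How to publish a blog post",
    "Reset your password from the account page",
])
shares = novel_query_share_by_week(
    [
        (date(2025, 1, 6), "publish a blog post"),
        (date(2025, 1, 7), "reset password"),
        (date(2025, 6, 2), "pin a slack post"),
        (date(2025, 6, 3), "reset your password"),
    ],
    vocab,
)
```

A rising series does not prove retrieval is failing, since a new token can be a typo or a one-off. It is a cheap leading indicator that the query vocabulary is moving away from the one the vectors were built on.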
Recall dashboards are graded by yesterday's notion of similar
Here is the part that hides the bug. You measure recall against a golden set. The golden set was built when the corpus was embedded, or shortly after, by humans writing queries in the vocabulary that existed then. Those queries match the vectors well — of course they do; they were written in the same dialect. Recall stays flat.
Meanwhile real users write queries in today's vocabulary, get retrievals that are technically high-similarity to their query embedding but semantically off the topic they meant, and either rephrase or give up. Neither rephrasing nor giving up shows up in your retrieval metric. The dashboard reports the health of a population of queries that no longer exists. This is the same shape of bug as a fraud detector graded on last year's fraud patterns — the score is real, the score is meaningless.
A handful of patterns make this visible:
- Run a current-vocabulary canary: a small set of queries written from scratch every quarter by someone fluent in how customers currently talk, scored against the same retrieval pipeline. Compare its hit rate to the original golden set's hit rate. The gap is your drift.
- Watch no-click and rephrase rates per query, segmented by query novelty. A query containing tokens that did not appear in the corpus at embedding time is a query the index probably cannot serve well; if these are climbing as a share of traffic, the index is aging out from under you.
- Pull a zero-overlap sample: queries whose top-1 retrieved chunk shares no salient tokens with the query. Some of those will be legitimate paraphrase wins; many will be the system grabbing a vaguely-near vector because nothing in the index actually matches the new term.
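Two of the checks above, the canary gap and the zero-overlap sample, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `retrieve` is a hypothetical callable wrapping your own pipeline that returns `(doc_id, chunk_text)` pairs, each labeled query is assumed to have exactly one relevant document, and the stopword list and the "salient token" definition are placeholders you would tune.

```python
import re
from typing import Callable

TOKEN_RE = re.compile(r"[a-z0-9]+")
# Illustrative stopword list; what counts as "salient" is an assumption to tune.
STOPWORDS = frozenset(
    {"a", "an", "the", "to", "of", "in", "on", "for", "and", "or", "is",
     "how", "do", "i", "my"}
)

# Hypothetical retriever interface: (query, k) -> [(doc_id, chunk_text), ...]
Retriever = Callable[[str, int], list[tuple[str, str]]]


def salient_tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens minus stopwords."""
    return set(TOKEN_RE.findall(text.lower())) - STOPWORDS


def hit_rate(labeled: list[tuple[str, str]], retrieve: Retriever, k: int = 5) -> float:
    """Fraction of (query, relevant_doc_id) pairs whose relevant doc is in the top k."""
    if not labeled:
        raise ValueError("labeled set is empty")
    hits = 0
    for query, relevant_id in labeled:
        retrieved_ids = [doc_id for doc_id, _ in retrieve(query, k)]
        hits += relevant_id in retrieved_ids
    return hits / len(labeled)


def drift_gap(golden, canary, retrieve: Retriever, k: int = 5) -> float:
    """Golden-set hit rate minus current-vocabulary canary hit rate."""
    return hit_rate(golden, retrieve, k) - hit_rate(canary, retrieve, k)


def zero_overlap_sample(queries: list[str], retrieve: Retriever) -> list[tuple[str, str]]:
    """Queries whose top-1 chunk shares no salient token with the query."""
    flagged = []
    for query in queries:
        results = retrieve(query, 1)
        if not results:
            continue
        doc_id, chunk_text = results[0]
        if not (salient_tokens(query) & salient_tokens(chunk_text)):
            flagged.append((query, doc_id))
    return flagged
```

A positive `drift_gap` means the pipeline serves the old dialect better than the current one. The zero-overlap list still needs a human pass, since some entries will be legitimate paraphrase matches and not misses.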
References
- https://arxiv.org/abs/1605.09096
- https://medium.com/@anindyasinghobi/embedding-drift-the-quiet-killer-of-retrieval-quality-in-rag-systems-b5d46bee3bba
- https://decompressed.io/learn/embedding-drift
- https://apxml.com/courses/optimizing-rag-for-production/chapter-6-advanced-rag-evaluation-monitoring/monitoring-retrieval-drift-rag
- https://ragaboutit.com/7-new-rag-evaluation-metrics-that-catch-hidden-accuracy-gaps/
- https://www.digitalapplied.com/blog/rag-anti-patterns-7-failure-modes-2026-engineering-guide
- https://safjan.com/version-your-vectors-index-versioning-as-the-missing-layer-in-rag/
- https://redis.io/blog/10-techniques-to-improve-rag-accuracy/
