Embedding Drift: The Silent Degradation Killing Your Long-Lived RAG System
Your RAG system is running fine. Latency is normal. Error rate is zero. But a user asking about "California employment law" keeps getting results about real estate — and your logs show nothing wrong.
This is embedding drift in action: the retrieval failure mode that doesn't throw exceptions, doesn't spike error rates, and doesn't show up in standard observability dashboards. It happens when your vector store accumulates embeddings produced under different conditions — different model versions, different chunking rules, different preprocessing pipelines — and the vectors start pointing in incompatible directions. The system keeps serving requests, but the semantic coordinates are no longer aligned, and retrieval quality erodes quietly over weeks or months.
Long-lived RAG systems are uniquely vulnerable. A system you deployed eighteen months ago may have indexed its initial corpus with one embedding model, added new documents with a slightly different preprocessing pipeline, and migrated query encoding to a newer model — all without anyone explicitly deciding to mix embedding spaces. Each individual change seemed reasonable at the time. Together, they've produced a vector store that can no longer reliably rank relevant content above irrelevant content.
Why Mixed Embedding Spaces Break Retrieval
An embedding model transforms text into a point in high-dimensional space. Two texts that are semantically similar should produce nearby points. Two texts that are unrelated should produce distant points. The entire premise of cosine similarity — the similarity function underlying almost all vector search — depends on this property holding consistently across all vectors in your index.
The problem is that "nearby" and "distant" are only meaningful within a single model's coordinate system. Different embedding models carve up vector space differently. A vector produced by text-embedding-ada-002 and a vector produced by text-embedding-3-large exist in fundamentally incompatible spaces. Comparing them with cosine similarity is like comparing GPS coordinates from two different projections — the numbers look similar, but they don't point to the same place.
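The comparison itself is easy to state; a minimal sketch of cosine similarity in plain Python (no vector library) makes the failure mode concrete. The vectors here are toy three-dimensional examples, not real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Within one model's space, the score is meaningful:
doc_vec = [0.12, 0.98, 0.05]
query_vec = [0.10, 0.95, 0.08]
print(round(cosine_similarity(doc_vec, query_vec), 3))  # ~0.999: same space, similar direction

# A vector produced by a different model has no shared coordinate
# system with these -- even if the dimensionality happens to match,
# the resulting score is noise, not a relevance signal.
```

The math never fails; it happily returns a number for any two same-length vectors, which is exactly why mixed spaces go undetected.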
This is obvious when you explicitly switch embedding models and forget to re-embed your corpus. It's less obvious in the cases that actually hurt teams in production:
- You updated a preprocessing step that strips HTML artifacts more aggressively. Old documents were embedded with the noise; new documents weren't, so the same content now lands in slightly different positions in the space.
- You changed your chunk size from 512 to 256 tokens to improve precision. Old chunks have broader contextual windows; new chunks have narrower ones. The boundary conditions are different.
- Your data pipeline has an inconsistency: development strips trailing whitespace and normalizes Unicode variants; production doesn't. Documents embedded in dev and prod end up in slightly different positions.
- You ran a partial re-embedding when a new model came out, covering documents added in the last six months. Older documents are still on the original model.
Any of these changes produces a mixed vector store. The mismatch degrades recall gradually: results don't disappear overnight, but increasingly the documents that surface at the top of the ranking are the ones that happen to have been embedded recently or under the current pipeline.
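One way to catch mixing before it degrades retrieval is to tag every stored vector with the model and pipeline version that produced it, then audit the index for mixed configurations. A sketch, assuming hypothetical `embedding_model` and `pipeline_version` metadata fields — substitute whatever metadata your vector store actually exposes:

```python
from collections import Counter

def audit_embedding_versions(records):
    """Count stored vectors by (model, pipeline) version tag.

    `records` is assumed to be an iterable of dicts carrying
    'embedding_model' and 'pipeline_version' keys alongside each
    vector -- hypothetical field names for illustration.
    """
    counts = Counter(
        (r["embedding_model"], r["pipeline_version"]) for r in records
    )
    if len(counts) > 1:
        print(f"WARNING: {len(counts)} distinct embedding configurations in index:")
        for (model, pipeline), n in counts.most_common():
            print(f"  {model} / {pipeline}: {n} vectors")
    return counts
```

Running an audit like this on a schedule turns "we accidentally mixed spaces" from a weeks-later discovery into a same-day alert.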
What Degraded Retrieval Actually Looks Like
Embedding drift produces a specific pattern of failure that's easy to miss if you're not looking for it. With a single consistent model, a query and a highly relevant document will typically score above 0.85 in cosine similarity (the exact range varies by model). In a mixed vector store, cross-model comparisons routinely land around 0.65 for the same semantic relationship. That's not a catastrophic failure — the system doesn't crash — but it's enough to scramble the ranking.
Retrieval recall in affected systems can drop from, say, 0.92 to around 0.74 without any corresponding alert. If you're not running regular retrieval benchmarks against a baseline query set, you won't see this. Standard monitoring — CPU, memory, latency, error rate — shows nothing unusual.
The failure mode compounds over time. A document corpus embedded with the original model becomes progressively less discoverable as the query encoder diverges. Newer documents, embedded under the current pipeline, dominate results even when older documents are more relevant. Users learn to add more specific qualifiers to their queries, which partially compensates — until they can't anymore.
Six Signals That Tell You Drift Has Started
Catching embedding drift early requires metrics that go beyond infrastructure health. The most reliable diagnostic is cosine distance on identical text: periodically re-embed a sample of documents from your corpus and compare the new vectors against the stored ones. In a stable system, the distance between a document's current embedding and its freshly computed embedding should be near zero — typically 0.0001 to 0.005. When you start seeing distances above 0.05, something in your embedding pipeline has changed in a way that affects vector positions.
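A sketch of that identity check, with `embed_fn` standing in for your current embedding call and `stored_vectors` for the vectors persisted at index time (both names are placeholders):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; near zero for identical embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def check_identity_drift(sample_docs, stored_vectors, embed_fn, threshold=0.05):
    """Re-embed a sample of documents and compare against the index.

    sample_docs: iterable of (doc_id, text) pairs drawn from the corpus.
    stored_vectors: doc_id -> vector as persisted at index time.
    embed_fn: the embedding call currently in production (placeholder).
    Returns the documents whose fresh embedding has moved past the threshold.
    """
    drifted = []
    for doc_id, text in sample_docs:
        distance = cosine_distance(embed_fn(text), stored_vectors[doc_id])
        if distance > threshold:
            drifted.append((doc_id, distance))
    return drifted
```

A nonempty result means the pipeline that embeds text today no longer reproduces the vectors sitting in the index — exactly the condition described above.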
A complementary check is nearest-neighbor stability: run the same set of benchmark queries weekly and measure how much the result set changes. In a stable system, 85–95% of results should be the same from week to week. When overlap drops below 70%, retrieval quality is actively degrading.
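The stability check reduces to a set-overlap computation, assuming you persist each benchmark query's top-k result IDs from week to week (the structure below is illustrative):

```python
def result_overlap(previous, current):
    """Fraction of last week's top-k result IDs still present this week."""
    prev, curr = set(previous), set(current)
    if not prev:
        return 1.0
    return len(prev & curr) / len(prev)

def stability_report(benchmark_results, threshold=0.70):
    """Flag benchmark queries whose result sets have churned.

    benchmark_results: query -> (last_week_ids, this_week_ids).
    Returns the queries whose overlap fell below the threshold.
    """
    alerts = {}
    for query, (last_week, this_week) in benchmark_results.items():
        overlap = result_overlap(last_week, this_week)
        if overlap < threshold:
            alerts[query] = overlap
    return alerts
```

Some churn is expected as new documents are ingested; the signal is overlap falling below the threshold on queries whose relevant documents haven't changed.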
Beyond these two primary signals:
- Embedding norm distribution: track the statistical distribution of L2 norms across your stored vectors. Shifts in variance or mean reveal that new vectors are occupying a different region of the space than old ones.
- nDCG@k trend: if you have any kind of relevance judgments or click data, track normalized discounted cumulative gain over time. A downward trend on a stable query set is the clearest signal that ranking has degraded.
- Retrieval recall benchmarks: maintain a small evaluation set with known relevant documents and run it regularly. Recall declining while latency stays constant is a hallmark of drift.
- Vector count divergence: compare the count of vectors in your index against the count of source documents. Unexplained deltas reveal ingestion failures that produce incomplete coverage and compound over time.
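The norm-distribution signal from the list above takes only a few lines to track; this sketch compares newly ingested vectors against the existing index (the 10% tolerance is illustrative, not a recommendation):

```python
import math
import statistics

def norm_distribution(vectors):
    """Mean and standard deviation of L2 norms across a set of vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return statistics.mean(norms), statistics.stdev(norms)

def norms_shifted(old_vectors, new_vectors, tolerance=0.10):
    """Flag when newly ingested vectors occupy a different region of
    the space than the existing index, measured as relative shift in
    mean L2 norm."""
    old_mean, _ = norm_distribution(old_vectors)
    new_mean, _ = norm_distribution(new_vectors)
    return abs(new_mean - old_mean) / old_mean > tolerance
```

Note that many embedding APIs return unit-normalized vectors, in which case norms alone won't move; per-dimension means or variances give the same kind of signal for those models.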
