The Embedding Deprecation That Halved Your Retrieval Recall Without a Deploy
The most expensive embedding bug a RAG system can ship is the one where nothing in your repository changes. Your retrieval code is the same. Your index is the same. Your query path is the same. And one Tuesday in week six, somebody notices that the answers used to be better.
The provider posted a sunset notice for the embedding family your index was built against twelve months ago. The platform team filed it in a deprecations dashboard with a year of runway and moved on. The sunset path wasn't a hard cutoff — it was a quiet quality regression where the deprecated endpoint started routing to a "compatibility" successor that returned vectors in the same dimensionality and a subtly different semantic geometry. Query embeddings began drifting against the corpus you embedded a year ago. Recall@10 on your standing eval slid by 47% over six weeks. The team only traced it back when an unrelated quality dashboard crossed a threshold, dragging a senior engineer into a root-cause exercise that ended at an embedding endpoint no one on the call had touched in a year.
This post is about the architectural mistake underneath that incident: treating embedding endpoints as fungible URLs instead of as versioned dependencies pinned to the corpus they produced. Providers do not always own up to behavior changes with a version bump, deprecation timelines are upper bounds rather than firm dates, and your retrieval recall is being renegotiated by a vendor on a cadence your eval cannot see.
Why "compatibility" is the most dangerous word in a deprecation notice
A version bump from embeddings-v1 to embeddings-v2 is loud. Your client code changes, your index documentation changes, your tickets pile up, and somebody runs the eval. The system has a chance to surface the regression at the moment of the change.
A "compatibility" successor is the opposite. The provider keeps the URL, keeps the dimensionality, and keeps the response envelope. The only thing that changes is the function from text to vector. Same input, slightly different output, same shape. To every line of your client code, the call looks identical to what it looked like yesterday.
That is exactly the problem. Cosine similarity and dot products are only meaningful when both sides of the comparison live in the same space. The moment your query embeddings come from a different model than your document embeddings, the geometry that makes neighborhoods meaningful breaks. Practitioners writing about this describe it as "comparing apples to oranges": the numbers still come back, the index still returns ten neighbors, and most of the neighbors are now wrong in ways nothing in your stack will detect.
The damage is bounded by how much the two models disagree, which is bounded by how aggressively the provider tuned the successor. Some "compatibility" successors are calibrated to be near-isomorphic; others are not. You do not get to choose, and you usually do not get told.
Deprecation as a behavior surface, not a date
Most teams treat a deprecation date as a deadline: at time T, the endpoint stops working. That mental model lets you fund the migration by burning down to T. It is a useful model for hard cutoffs.
It is the wrong model for soft cutoffs. A provider managing a fleet across millions of customers will optimize for availability — they would rather keep your client receiving 200 responses than break you with a 410. So the deprecation path becomes a curve: between the announcement and the official retirement, the endpoint's behavior is renegotiated to make the underlying infrastructure cheaper to operate. Quality of returned vectors is one of the knobs available.
Providers tend to describe these curves in the language of "feature parity adjustments" or "infrastructure improvements." The release notes are technically accurate. They are also useless as a signal that your retrieval system just got worse. The actual contract you have with the endpoint, in the deprecation window, is "we will return a vector of the same shape" — not "we will return a vector from the same semantic distribution."
The team that owns retrieval needs to internalize that announced sunset dates are upper bounds, and that the contract during the runway is shape-stable, not semantics-stable. Major providers now turn over models on a twelve-to-eighteen month cadence, and several have shipped two-to-four-week deprecation windows for accelerated retirements. The slack you think you have is shorter than the announcement implies, and the behavior is moving inside the window.
The asymmetric eval blindspot
Almost every team has an eval that benchmarks an embedding model. Almost none of them have an eval that benchmarks a deployed index. The difference is what makes this failure mode silent.
Standard embedding evaluation looks like this: take a labeled query set, embed the queries with model X, embed the corpus with model X, run retrieval, score recall@k. When you upgrade to model Y, you re-embed both sides with Y and re-score. The eval correctly tells you whether Y is better than X for retrieval — when both sides live in Y-space.
That is not what happens in production. In production, your document embeddings were written to the index a year ago, at the cost of whatever it cost to embed your entire corpus that quarter. Re-embedding the corpus is a project. So the "eval" that actually matches production is: queries embedded by today's endpoint, against documents embedded by last year's endpoint, scored by today's labels. Most teams never run that eval, because they never set up the infrastructure to embed queries from "today's endpoint" against vectors from "last year's endpoint" — they have one endpoint, and they trust it.
When the provider quietly migrates that endpoint to a compatibility successor, "today's endpoint" stops being "last year's endpoint" — but only at query time. The corpus side is frozen in the index. Your eval, if you run it the standard way (re-embed both sides), will look fine, because it puts both sides back in the same space. Your production traffic, which can only re-embed one side, will degrade.
The asymmetry between corpus-side embeddings (paid for once, hard to redo) and query-side embeddings (paid for per request, automatically reflects today's endpoint) is exactly where a silent provider migration hides. Any eval that does not preserve that asymmetry is testing the wrong system.
Patterns that pin the contract you actually need
If the contract you need is "query and document embeddings come from the same model," and the provider only contracts to "vector of the same shape," then the gap is yours to close. A few patterns are worth standing up before you need them.
Pin the embedding model version at the corpus. Every document in your index gets a sidecar field: embedding_model_id, embedding_model_version, embedding_endpoint_url, and ideally a content hash of a fixed canary string embedded by that endpoint at indexing time. Your write path refuses to insert a vector without these fields. Your read path refuses to score query embeddings against documents whose model identifier does not match the one your query came from. The error is loud and immediate, not a silent recall slide.
Treat the canary string as a contract test. Pick a small set of fixed strings — twenty to fifty of them — embed each one at the time you build the index, and store the resulting vectors as a contract artifact. At query time, on a sampled fraction of requests, re-embed one of the canary strings, take its cosine similarity to the stored vector, and assert it is above a threshold like 0.9999. The moment the provider's endpoint starts returning materially different vectors for known-fixed input, the assertion fires. This is the cheapest behavior-change detector you can deploy against a provider you do not control.
Monitor recall@k on slope, not just on level. A standing labeled query set (two hundred to five hundred query-document pairs is enough to be useful) should run every night against the live retrieval path — query embeddings from today's endpoint, documents from the index as it actually exists — and report recall@k. Alert not on the absolute number but on the multi-day rolling slope. A 47% slide over six weeks is invisible to any threshold-based alert if the threshold was set when the system was healthy; it is glaring on a trend.
Run a forced-cutover register, not a deprecation tracker. A deprecations dashboard that records "provider said this is going away on date T" rewards procrastination. A cutover register that records "we will fully migrate off this endpoint by date T-minus-90" puts the burden on your timeline. The forced cutover date should be calculated backward from the provider's date with a margin that accounts for the corpus you have to re-embed, the eval you have to re-run, and the index you have to swap. If you cannot meet the date, you find out early enough to negotiate with the provider, not late enough to be the one calling for a hotfix.
Plan the re-embedding project as infrastructure, not a sprint task. Re-embedding a corpus that took a quarter to build will take comparable time and cost. Modern guidance treats this as a major infrastructure event: a parallel index, a dual-write phase, a shadow query phase that compares old-index and new-index results, a cutover with a rollback plan, and a deprecation of the old index only after the new one has carried full traffic for a week. The work is large enough that if you discover the need on a Friday because recall just collapsed, you are already in trouble.
For teams that genuinely cannot re-embed in time, recent research has explored learnable transformation layers that map new-model query embeddings into the legacy index's space, recovering most of the recall of a full re-embed at the cost of a small latency overhead. These are useful as bridges during a migration but not as a substitute for one — they paper over the geometry mismatch rather than fixing it.
The architectural lesson under the incident
There is a generalization worth taking from this: the contract surface of a provider API is everything the provider can change without changing the URL. For embedding endpoints, that surface includes the actual mapping from text to vector, which is precisely the thing your retrieval system depends on. Treating the URL as the dependency is the bug. The dependency is the mapping, and the URL is just the way you call it.
The same generalization applies to anything you index against a model's output: re-rankers whose scores you cache, classifiers whose labels you persist, summarizers whose outputs are joined to a downstream pipeline. Each of these is a place where you wrote the model's output to a slow store, and your query path is reading from a fast endpoint that owes you only shape, not semantics. Provider behavior changes inside the window between announcement and retirement are not unusual — they are the rule. The question is whether your system can tell when they happen.
Retrieval systems that get this right have three things in common. They tag every stored vector with the model identity that produced it. They run a behavior-change canary that fires on geometry drift, not on uptime. And they treat their own forced-cutover date as the deadline that matters, with the provider's sunset date as a soft upper bound. Teams that do not have those three things are running a recall floor that a vendor is renegotiating on a cadence the team's eval cannot see — and a year from now, somebody on the team will spend a long afternoon tracing a quality regression back to a URL that has not changed in any commit.
- https://platform.openai.com/docs/deprecations
- https://community.openai.com/t/any-plans-of-deprecating-text-embedding-ada-002/700561
- https://hackernoon.com/your-embedding-model-will-deprecate-heres-what-to-do
- https://medium.com/data-science-collective/different-embedding-models-different-spaces-the-hidden-cost-of-model-upgrades-899db24ad233
- https://arxiv.org/pdf/2509.23471
- https://arxiv.org/html/2510.13406
- https://arxiv.org/pdf/2506.00037
- https://arxiv.org/pdf/2112.02805
- https://community.fabric.microsoft.com/t5/Data-Science-Community-Blog/When-Document-and-Query-Embeddings-Don-t-Match-A-Practical-Guide/ba-p/4993140
- https://decompressed.io/learn/rag-observability-postmortem
- https://medium.com/@anindyasinghobi/embedding-drift-the-quiet-killer-of-retrieval-quality-in-rag-systems-b5d46bee3bba
- https://medium.com/@mariem.jabloun/dont-break-your-rag-why-you-must-use-the-same-embedding-model-for-retrieval-and-indexing-7b0a3e536acd
- https://dev.to/dowhatmatters/embedding-drift-the-quiet-killer-of-retrieval-quality-in-rag-systems-4l5m
- https://tensorops.ai/blog/the-gpt-41-deprecation-forces-organizations-to-change
- https://docs.together.ai/docs/deprecations
- https://www.digitalapplied.com/blog/rag-system-metrics-recall-precision-faithfulness-2026
