
The Embedding Model Rotation That Shadowed Your A/B Test for a Quarter

Tian Pan
Software Engineer

You ran the experiment cleanly. Two arms, one feature flag, a clear metric, and a design the stats team blessed. Twelve weeks later you ship the winner, and the lift quietly evaporates within a sprint. The post-mortem turns up nothing in the code, nothing in the flag rollout, nothing on the analytics side. The thing that moved was something nobody on your experimentation list owned: the hosted embedding model behind your retrieval call returned a slightly different vector for the same query in week three, in week seven, and again on the morning of your readout meeting. Your A/B test was real. The substrate it ran on was not.

This is the failure mode every team running retrieval-augmented generation eventually walks into and the one almost nobody designs against. The embedding endpoint is treated as a stable substrate the way Postgres is treated as a stable substrate. It is not. It is a model with a release cadence the vendor controls, a changelog you do not read, and a behavior surface that can shift without changing the dimension count, the SLA, or the API contract you signed against. The experiment you thought was measuring a feature change was measuring a retrieval regime change with the feature flag noise on top.

The Confound the Experimentation Stack Cannot See

A well-instrumented experiment platform logs the user bucket, the flag value, the request path, the latency, the downstream business event. It does not log the version of the embedding model that produced the vector behind the retrieval call. There is no field for it, and there is nobody on either side of the wall whose job it is to add one. The experimentation team treats the model layer as infrastructure. The AI team treats the experimentation platform as someone else's stack. The embedding version sits in the gap.
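A minimal sketch of what closing that gap could look like, assuming your exposure log is a JSON line per request. The `ExposureEvent` record and its `embedding_fingerprint` field are hypothetical names, not part of any particular experimentation platform, and the fingerprint value itself is whatever identifier you can derive for the embedding state (a vendor-reported model name, the date of your last drift check, or both).

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExposureEvent:
    """One experiment exposure, as a typical platform might log it."""
    user_bucket: str
    flag_value: str
    request_path: str
    latency_ms: float
    # Hypothetical extra field: identifies the embedding state that produced
    # the vector behind this request's retrieval call.
    embedding_fingerprint: str


def to_log_line(event: ExposureEvent) -> str:
    """Serialize the event as one JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


event = ExposureEvent(
    user_bucket="treatment",
    flag_value="on",
    request_path="/search",
    latency_ms=182.4,
    embedding_fingerprint="vendor-v1@2024-03-11",  # illustrative value only
)
print(to_log_line(event))
```

With the field in place, a readout can be segmented by fingerprint, and an experiment whose window spans more than one value can be flagged before anyone acts on its confidence interval.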

The result is that when the vendor rotates the model, the rotation is invisible to your decision-making pipeline. Your dashboards keep producing confidence intervals against a substrate that is no longer the substrate from which they were derived. The confound is not detectable from the experiment data alone because the dimension along which the world changed is not a dimension your data has.

The asymmetry matters. A code change goes through review, gets a flag, gets monitored, leaves an audit trail. A vendor-side embedding upgrade leaves a release note and changes your numbers. One of these is a first-class event in your experimentation discipline. The other is weather. You cannot run controlled experiments in a wind tunnel where the wind is also being adjusted.

How Vendor Changelogs Hide the Thing That Matters

The vendor's release notes talk about "improved relevance" and "better handling of multilingual queries." Both are true. Neither tells you anything actionable about how your specific corpus, your specific query distribution, your specific downstream LLM will behave after the change. The marginal cases your A/B was sized to detect are exactly the cases most likely to move when the embedding shifts, because the marginal cases are the ones living near retrieval boundaries.

Worse, vendors often ship these changes silently under the same endpoint name. The contract is "v1" but the weights behind v1 are upgraded continuously, on the theory that v1 is a schema and the model is an implementation. From the vendor's perspective, this is hygiene. From your perspective, it means the endpoint you benchmarked on Tuesday is not the endpoint serving on Thursday.
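One way to notice the Tuesday-to-Thursday change yourself is a canary: keep a small fixed set of probe texts with the vectors you recorded at benchmark time, re-embed them on a schedule, and compare. The sketch below assumes `embed` is your own wrapper around the vendor call (hypothetical here) and that a cosine threshold of 0.999 is a reasonable tolerance. Both are assumptions to tune, since some endpoints may not return bit-identical vectors from one call to the next.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drifted_probes(
    baseline: Dict[str, Vector],
    embed: Callable[[str], Vector],  # hypothetical wrapper around the vendor endpoint
    min_cosine: float = 0.999,       # assumed tolerance, tune for your endpoint
) -> List[str]:
    """Return the probe texts whose fresh vector no longer matches the baseline."""
    drifted = []
    for text, old_vec in baseline.items():
        new_vec = embed(text)
        # A dimension change is treated as drift without computing similarity.
        if len(new_vec) != len(old_vec) or cosine(old_vec, new_vec) < min_cosine:
            drifted.append(text)
    return drifted
```

A non-empty result is the event your experimentation calendar is missing: a timestamped record that the substrate changed.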

Your eval suite, the one you ran during procurement and have been re-running monthly, probably measures task-level accuracy on a fixed test set. That number stays stable through small embedding shifts because the easy cases stay easy. The cases that move are the ones you would only notice by looking at retrieval distributions, not aggregate accuracy. By the time the aggregate metric moves, you have already shipped several experiment decisions on top of the drift.
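Looking at retrieval distributions can be as simple as comparing which documents come back for the same queries before and after a suspected shift. The sketch below uses mean Jaccard overlap of the top-k document IDs, which is one possible measure among several and not something the vendor or any library prescribes. The input shape (query text mapped to a ranked list of document IDs) is an assumption about how you store retrieval results.

```python
from typing import Dict, List


def jaccard_at_k(before: List[str], after: List[str], k: int) -> float:
    """Jaccard overlap between the top-k document IDs of two ranked lists."""
    a, b = set(before[:k]), set(after[:k])
    if not a and not b:
        return 1.0  # two empty result sets are treated as identical
    return len(a & b) / len(a | b)


def mean_topk_overlap(
    before: Dict[str, List[str]],
    after: Dict[str, List[str]],
    k: int = 5,
) -> float:
    """Average top-k overlap across the queries present in both runs."""
    shared = sorted(before.keys() & after.keys())
    if not shared:
        raise ValueError("no queries in common between the two runs")
    return sum(jaccard_at_k(before[q], after[q], k) for q in shared) / len(shared)
```

Task accuracy can hold steady while this number falls, because an easy query still finds an acceptable document even when the set around it has changed. The falling overlap is the earlier signal.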

The Snapshot-to-Production Gap Nobody Audits

A common pattern: the offline retrieval eval uses a snapshot of embeddings captured three months ago, because re-embedding the corpus is expensive and the eval pipeline runs against the local snapshot for speed. The online system uses today's embeddings, hitting the live endpoint. The snapshot was taken when the vendor's model was at one weights state. Today's endpoint is at another.

The gap between snapshot and production widens every time the vendor updates the weights, and nothing in the pipeline alarms on it. The offline eval keeps producing the same numbers because it is testing against itself. The online behavior keeps drifting because the substrate is drifting. The two only reconcile when someone manually re-embeds the snapshot and notices the numbers do not match what the offline test predicted.

The teams who avoid this make re-embedding cadence a function of the vendor's release cadence, not their own engineering convenience. The corpus is re-embedded whenever the production embedding endpoint is touched, and the snapshot is regenerated at the same time. The two stay in lockstep, or you have lost the ability to predict online retrieval behavior from offline numbers, which means you have lost the ability to evaluate before shipping.
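The lockstep rule can be enforced mechanically, assuming the snapshot carries a manifest that records which embedding state produced it. In this sketch the manifest key `embedding_fingerprint`, the exception name, and the way you obtain the production fingerprint are all hypothetical. The point is only that the offline eval refuses to run against a stale snapshot instead of quietly testing against itself.

```python
class StaleSnapshotError(RuntimeError):
    """Raised when the offline snapshot and production disagree on embedding state."""


def require_lockstep(snapshot_manifest: dict, production_fingerprint: str) -> None:
    """Fail fast if the snapshot was embedded under a different state than production."""
    recorded = snapshot_manifest.get("embedding_fingerprint")  # hypothetical manifest key
    if recorded != production_fingerprint:
        raise StaleSnapshotError(
            f"snapshot embedded under {recorded!r}, "
            f"production is at {production_fingerprint!r}; re-embed before evaluating"
        )
```

Called at the top of the eval pipeline, this turns a silent divergence into a failed run that someone has to look at.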

The Hybrid Index That Is Already a Frankenstein

If you have been running for more than a year, your vector index is almost certainly a mixture. Documents ingested in March were embedded with one version of the vendor's model. Documents ingested in October were embedded with whatever the vendor was serving by then. Documents re-ingested after a corpus refresh were embedded with whatever was current that week. Queries are embedded with today's model and compared against all of them.
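If each document's metadata recorded the embedding state at ingest time, measuring the mixture would be a counting exercise. A sketch under that assumption: the `embedding_fingerprint` metadata key is hypothetical, and documents ingested before you started recording it fall into an `unknown` bucket.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def embedding_census(
    docs: Iterable[Mapping[str, str]],
    key: str = "embedding_fingerprint",  # hypothetical per-document metadata key
) -> Dict[str, float]:
    """Return the share of indexed documents under each recorded embedding state."""
    counts = Counter(doc.get(key, "unknown") for doc in docs)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {fingerprint: n / total for fingerprint, n in counts.items()}


docs = [
    {"id": "1", "embedding_fingerprint": "march"},
    {"id": "2", "embedding_fingerprint": "october"},
    {"id": "3"},  # ingested before the field existed
    {"id": "4", "embedding_fingerprint": "march"},
]
print(embedding_census(docs))
```

A large `unknown` share is itself a finding: it is the part of the index whose provenance you can no longer reconstruct.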
