The Embedding Model Rotation That Shadowed Your A/B Test for a Quarter
You ran the experiment cleanly: two arms, one feature flag, a clear metric, and a design the stats team blessed. Twelve weeks later you ship the winner, and the lift quietly evaporates within a sprint. The post-mortem turns up nothing in the code, nothing in the flag rollout, nothing on the analytics side. The thing that moved was something nobody on your experimentation list owned: the hosted embedding model behind your retrieval call returned a slightly different vector for the same query in week three, in week seven, and again on the morning of your readout meeting. Your A/B test was real. The substrate it ran on was not.
This is the failure mode every team running retrieval-augmented generation eventually walks into, and the one almost nobody designs against. The embedding endpoint is treated as a stable substrate the way Postgres is treated as a stable substrate. It is not. It is a model with a release cadence the vendor controls, a changelog you do not read, and a behavior surface that can shift without changing the dimension count, the SLA, or the API contract you signed. The experiment you thought was measuring a feature change was measuring a retrieval regime change, with the feature flag's effect as noise on top.
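
One way to make that kind of silent shift visible is a canary check: store the vectors for a small, fixed set of queries once, re-embed the same queries on a schedule, and compare. The sketch below is a minimal illustration under stated assumptions, not a description of any vendor's API. The `embed` callable stands in for whatever client function wraps your hosted endpoint, the names `detect_drift` and `cosine_similarity` are hypothetical, and the `0.999` threshold is an arbitrary placeholder you would tune against the call-to-call variation your own endpoint shows.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal dimension."""
    if len(a) != len(b):
        # A dimension change is the loud failure; the silent one keeps the size.
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def detect_drift(
    baseline: Dict[str, Vector],
    embed: Callable[[str], Vector],  # hypothetical wrapper around your endpoint
    threshold: float = 0.999,        # assumed tolerance, not a recommended value
) -> List[str]:
    """Return the canary queries whose fresh embedding moved away from baseline."""
    drifted = []
    for query, old_vec in baseline.items():
        new_vec = embed(query)
        if cosine_similarity(old_vec, new_vec) < threshold:
            drifted.append(query)
    return drifted


if __name__ == "__main__":
    # Toy two-dimensional vectors standing in for real embeddings.
    baseline = {"refund policy": [1.0, 0.0], "reset password": [0.0, 1.0]}

    # Simulated endpoint after a silent model update: one query shifts slightly.
    updated = {"refund policy": [1.0, 0.0], "reset password": [0.1, 1.0]}

    print(detect_drift(baseline, lambda q: updated[q]))
```

A threshold is used instead of exact equality because floating-point results may differ slightly between calls even when the model has not changed. Logging the flagged queries with a timestamp alongside the experiment gives a readout something to check before attributing a lift to the feature flag.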
