Your A/B Test Can't Tell Two Models Apart When the Metric Is Vibes

Tian Pan
Software Engineer

You shipped a model swap behind an experiment. Two weeks pass, the dashboard moves a tenth of a percent, and the readout says "no significant difference." You conclude the new model is roughly the same as the old one and move on.

It is not the same. Your metric was never sensitive enough to tell.

This is the quiet failure mode of A/B testing for AI features. The statistical machinery works exactly as designed. The randomization is clean. The sample size is large. The p-value is honest. And the experiment still cannot distinguish a model your users love from one they merely tolerate, because the only thing it is measuring is whether they clicked.

A/B testing AI is a measurement problem before it is a statistics problem. Most teams skip straight to the statistics.

Click-Through Rate Is Almost Flat Across Models Humans Rank Very Differently

Hand two model outputs to a panel of users and ask which one they prefer. You will get a clear winner, often by a wide margin. Now run the same comparison through your engagement dashboard. Click-through rate, task completion rate, session length — they barely move.

The reason is that the things click-through rate measures are upstream of model quality. A user clicks because they had a question, not because the answer was good. They complete the task because they came in intending to. They stay in the session because they have nowhere else to be in the next thirty seconds. The act of engaging happened before the model produced anything; the engagement metric is a function of intent, not output.

A bad answer and a good answer can both produce a click. They differ in what happens next, and "what happens next" is exactly the part most product analytics suites do not instrument. You see the click and, later, the next session. The gap between them contains the user's actual experience of the model, and nothing inside that gap was measured with enough resolution to pick up the signal.

This is the same trap that traditional product experimentation hit with content quality changes. A redesigned article page does not move bounce rate much because bounce rate measures whether the reader arrived, not whether they got what they came for. AI features sit deeper inside that gap. The metric you are running the experiment on is even further removed from the thing you are actually trying to change.
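To make the contrast at the start of this section concrete, here is a minimal sketch comparing the two readouts with a standard two-proportion z-test and a sign test on pairwise preferences (normal approximation). Every count below is a made-up illustration, not data from a real experiment, and the function names are hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def click_through_test(clicks_a, users_a, clicks_b, users_b):
    """Two-proportion z-test on click-through rate between two arms."""
    rate_a = clicks_a / users_a
    rate_b = clicks_b / users_b
    pooled = (clicks_a + clicks_b) / (users_a + users_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    z = (rate_b - rate_a) / std_err
    return z, two_sided_p(z)


def preference_test(wins_b, pairs):
    """Sign test (normal approximation): is B preferred more than 50% of the time?"""
    std_err = math.sqrt(0.25 / pairs)
    z = (wins_b / pairs - 0.5) / std_err
    return z, two_sided_p(z)


# Hypothetical dashboard: 100,000 users per arm, CTR moves by a tenth of a point.
ctr_z, ctr_p = click_through_test(30_000, 100_000, 30_100, 100_000)

# Hypothetical panel: 200 side-by-side comparisons, model B preferred in 130.
pref_z, pref_p = preference_test(130, 200)

print(f"CTR:        z = {ctr_z:.2f}, p = {ctr_p:.3f}")
print(f"Preference: z = {pref_z:.2f}, p = {pref_p:.6f}")
```

Under these assumed numbers, 200 paired judgments separate the models decisively while 200,000 users of click data do not. The statistics are the same in both cases; only the thing being measured changed.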

The Latency Between a Worse Answer and the Churn It Causes

When a model gets worse, users do not leave that day. They leave three weeks later, after the cumulative experience has reached whatever threshold each individual user has for "this is not working for me." The damage is real and the dollars are real, but they are smeared across a window long enough that no two-week A/B test will see them.

This is the lagging-indicator problem in its meanest form. The metric you trust most — retained users, paying users, users who return — is exactly the metric that takes the longest to register a change. By the time it moves, the experiment has been over for a month, the new model has been rolled out to everyone, and the only signal left is that the curve is bending downward and nobody can prove why.

The reverse is just as bad. You ship a better model, run the experiment, see no movement, and conclude there was no improvement. Three months later, retention is up two points. Nobody connects it back to the model swap, because the experiment readout said it did nothing.

This is not a statistical-power problem you can fix with more traffic. The signal is not buried in noise. The signal is genuinely not there yet at the moment you are reading the dashboard. Running the experiment for longer helps only if the lag is shorter than your patience, and for most B2C products it is not.
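A small sketch of why more traffic does not help. It assumes a deliberately crude model: the worse model adds a fixed amount of daily churn, but only after a lag of 21 days. The step-function lag, the churn rates, and the function names are all illustrative assumptions; real users cross their own thresholds at different times.

```python
def expected_retention(day, base_daily_churn, extra_daily_churn, lag_days):
    """Probability that a user is still active after `day` days.

    The extra churn caused by the worse model only applies from
    `lag_days` onward (an assumed simplification).
    """
    survival = 1.0
    for d in range(day):
        extra = extra_daily_churn if d >= lag_days else 0.0
        survival *= 1.0 - (base_daily_churn + extra)
    return survival


def retention_gap(day, base_daily_churn=0.005, extra_daily_churn=0.003, lag_days=21):
    """Expected retention of control minus treatment at `day`."""
    control = expected_retention(day, base_daily_churn, 0.0, lag_days)
    treatment = expected_retention(day, base_daily_churn, extra_daily_churn, lag_days)
    return control - treatment


for day in (14, 30, 60, 90):
    print(f"day {day:>2}: expected retention gap = {retention_gap(day):.4f}")
```

In this model the expected gap at day 14 is exactly zero, so no sample size can detect it: there is no effect to estimate yet. The gap only opens after the lag, which in this sketch is a week after a two-week experiment has already been read out.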

Proxy Signals That Move Faster Than Retention

The way out is not to abandon experiments. The way out is to stop running them on metrics that measure intent and start running them on metrics that measure dissatisfaction. Dissatisfaction is something models cause directly and users register immediately, often within the same session.

A short list of signals that actually move when the model changes:
