Variance Eats the Experiment: Why A/B Power Math Breaks for LLM Features

· 11 min read
Tian Pan
Software Engineer

The model team can demo the new feature and show ten convincing wins side by side. The growth team runs it as a two-week A/B test, gets p = 0.31, and the readout says "no significant effect." Both teams are right. The experiment is wrong.

This pattern repeats across every org that has bolted an LLM onto a product without rebuilding its experimentation stack. The math the growth team is using was designed for button colors, ranking changes, and pricing pages — features whose outputs are deterministic given a user and a context. LLM features break the two assumptions that math leans on, and the standard 80%-power, 5%-significance, two-week-ramp template ships systematically wrong calls in both directions: real wins read as null results, and noise reads as confident wins.

The cost is not just one experiment. It is that after a few rounds of "non-significant" readouts on features the model team can show qualitatively improve, the model team stops running experiments. The feedback loop between feature iteration and causal evidence quietly breaks, and the org ends up shipping AI features on vibes — not because anyone decided to, but because the experimentation stack stopped being useful for that surface.

The two assumptions that quietly break

Classical A/B math compresses a lot of subtlety into a sample-size formula that looks deceptively portable. Two things have to be true for it to give you the right number.

One: per-user variance is small relative to the treatment effect you are trying to detect. If the noise floor of an individual user's behavior is larger than the lift you are looking for, you need either a much larger sample or a longer experiment to separate signal from noise. For deterministic UI experiments, this assumption holds because the per-user variance is mostly about whether the user is in the mood to click — bounded, well-modeled, well-studied.

Two: each unit of measurement (a session, an event, a conversion) is a roughly iid sample from the user's behavior distribution. This is what lets you treat sessions as exchangeable and pool them into a per-user metric without much thought.
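For concreteness, the sample-size formula behind the standard template can be sketched as follows. This is a minimal illustration, assuming a two-sided z-test on a difference in means with equal arm sizes and a common standard deviation. The function name `sample_size_per_arm` and the example numbers are hypothetical, not taken from any particular experimentation platform.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Units per arm for a two-sided z-test on a difference in means.

    Assumes equal arm sizes, a common standard deviation `sigma`, and an
    absolute minimum detectable effect `delta` (same units as the metric).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical metric: standard deviation 1.0, target lift 0.1.
n_small_noise = sample_size_per_arm(sigma=1.0, delta=0.1)
# Same lift, but twice the standard deviation (four times the variance).
n_large_noise = sample_size_per_arm(sigma=2.0, delta=0.1)
print(n_small_noise, n_large_noise)
```

The requirement scales with `sigma ** 2 / delta ** 2`, so any variance the calculator does not know about translates directly into an undersized experiment.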

LLM features violate both assumptions, and they violate them in compounding ways:

  • Inter-user variance is higher than for deterministic features, because the input is now natural language. Two users asking "the same" question phrase it differently, push the model into different parts of its output distribution, and get answers with different latencies, lengths, and downstream behaviors. Your treatment and control arms are not just sampling users — they are sampling prompt distributions per user, and those distributions differ more across users than click-targets do.
  • Intra-user variance is high in a way deterministic features don't have at all, because the same user asking the same question across two sessions can get materially different answers. Temperature, sampling, retrieval freshness, tool-call ordering, and model-side nondeterminism all conspire to put an extra layer of variance inside the unit you used to treat as a fixed point. The "iid sample from a user's behavior distribution" framing now has a second source of randomness, the model's, stacked on top of the user's.
  • The treatment effect is often smaller than the within-user variance of either arm, because the changes the model team is shipping (a better prompt, a smarter router, a swap from a small model to a medium one) tend to produce 1–5% lifts on metrics whose per-session noise is 10–30%. The signal-to-noise ratio is not bad — it's structurally upside-down.
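The way these variance sources stack can be sketched with a toy model. The sketch below assumes each session's metric is a user baseline plus independent, additive Gaussian noise from the user and from the model; that additive structure, the function names, and all the numbers are illustrative assumptions, not measurements from a real product.

```python
import random
from statistics import fmean, variance


def per_user_mean_variance(between_user_var, user_session_var, model_var,
                           sessions_per_user):
    """Variance of a per-user mean under the additive, independent-noise model."""
    within_user_var = user_session_var + model_var
    return between_user_var + within_user_var / sessions_per_user


def simulate_per_user_means(n_users, sessions_per_user, between_user_sd,
                            user_session_sd, model_sd, seed=0):
    """Draw per-user means: baseline + user noise + model noise per session."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_users):
        baseline = rng.gauss(0.0, between_user_sd)
        sessions = [
            baseline + rng.gauss(0.0, user_session_sd) + rng.gauss(0.0, model_sd)
            for _ in range(sessions_per_user)
        ]
        means.append(fmean(sessions))
    return means


# Deterministic feature: no model-side noise.
deterministic = simulate_per_user_means(20_000, 4, 1.0, 1.0, 0.0)
# LLM feature: same users, plus model noise with variance 2.0 per session.
stochastic = simulate_per_user_means(20_000, 4, 1.0, 1.0, 2 ** 0.5)

print(variance(deterministic), per_user_mean_variance(1.0, 1.0, 0.0, 4))
print(variance(stochastic), per_user_mean_variance(1.0, 1.0, 2.0, 4))
```

In this toy setup the model-side term enters the per-user metric divided by the number of sessions per user, so it hurts most on surfaces where each user contributes only a handful of sessions.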

The Microsoft experimentation team published the foundational variance-reduction paper for this kind of problem in 2013, introducing the technique now commonly known as CUPED. The framing it built, that not all variance in an experiment is random and that pre-experiment data can absorb a lot of it, is exactly the right starting point. But the standard implementation assumes deterministic per-user behavior, and the LLM case adds a model-side variance component that pre-period covariates can't predict.
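A minimal sketch of that pre-period adjustment, usually referred to as CUPED, shows both what it absorbs and what it leaves behind. The simulated data, noise levels, and the helper name `cuped_adjust` are hypothetical; the adjustment itself is the standard single-covariate form, `post - theta * (pre - mean(pre))` with `theta = cov(pre, post) / var(pre)`.

```python
import random
from statistics import fmean, variance


def cuped_adjust(post, pre):
    """Return CUPED-adjusted values: post - theta * (pre - mean(pre))."""
    pre_mean = fmean(pre)
    post_mean = fmean(post)
    cov = sum((x - pre_mean) * (y - post_mean) for x, y in zip(pre, post))
    var = sum((x - pre_mean) ** 2 for x in pre)
    theta = cov / var  # regression slope of post on pre
    return [y - theta * (x - pre_mean) for x, y in zip(pre, post)]


rng = random.Random(0)
n_users = 5_000
baseline = [rng.gauss(0.0, 1.0) for _ in range(n_users)]

# Pre-period metric and a deterministic-feature post-period metric.
pre = [b + rng.gauss(0.0, 0.5) for b in baseline]
post_det = [b + rng.gauss(0.0, 0.5) for b in baseline]
# Assumed LLM case: the same metric plus model-side noise (variance 1.0)
# that is independent of anything observed in the pre-period.
post_llm = [y + rng.gauss(0.0, 1.0) for y in post_det]

det_adjusted = cuped_adjust(post_det, pre)
llm_adjusted = cuped_adjust(post_llm, pre)

print("deterministic:", variance(post_det), "->", variance(det_adjusted))
print("LLM feature:  ", variance(post_llm), "->", variance(llm_adjusted))
```

In this simulation the adjustment removes the same user-driven variance from both metrics, and the gap that remains between the two adjusted variances is approximately the model-side variance, which no pre-period covariate can explain.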

Why the dashboard says "non-significant" and the model team says "obviously better"

When a sample-size calculator is fed an effect size and a noise estimate from a deterministic-feature pre-period, it returns a sample size that is too small by roughly the ratio of true variance to assumed variance, because the required sample size scales linearly with variance. For an LLM feature with high intra-user variance, that ratio is often 2–5×. So the team needs 2–5× the sample size, or at steady traffic roughly 2–5× the experiment duration, to hit the same statistical power.
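The flip side is the power an experiment actually has when it runs at the originally planned sample size. The sketch below assumes a two-sided z-test and ignores the negligible rejection probability in the opposite tail; `realized_power` and `variance_ratio` are hypothetical names introduced for illustration.

```python
from statistics import NormalDist


def realized_power(variance_ratio, alpha=0.05, planned_power=0.80):
    """Power achieved when true variance is `variance_ratio` times the
    variance assumed at planning time and the sample size is unchanged."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(planned_power)
    # The standardized effect shrinks by the square root of the variance ratio.
    standardized_effect = (z_alpha + z_power) / variance_ratio ** 0.5
    return NormalDist().cdf(standardized_effect - z_alpha)


for ratio in (1, 2, 3, 5):
    print(f"variance ratio {ratio}x -> power {realized_power(ratio):.2f}")
```

Under these assumptions, a design planned for 80% power lands at roughly 50% when the true variance is 2× the assumed value and at roughly a quarter when it is 5×, which is the regime where real wins routinely read as null results.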

Most growth teams don't catch this because the variance estimate is implicit in their tooling. The pre-period is automatically computed from historical data, the calculator returns a sample size, and the experiment ramps. Nothing in the standard pipeline distinguishes "this metric has 10% per-user variance because users vary" from "this metric has 30% per-user variance because users vary AND the model varies." The number that comes out has the right units and the wrong magnitude.

The model team, meanwhile, is looking at curated qualitative wins — twenty paired examples where the new prompt is obviously better than the old one. Those examples are real. But they are not random samples; they are a hand-picked subset where the lift is large enough to see. The honest extrapolation is "the new feature is reliably better when the lift is visible, but the average lift across the full traffic distribution is small relative to the noise the experiment can detect."

Both readouts can be true at once. The new feature does work. The experiment also genuinely doesn't have the power to detect it. The answer is not "trust the model team" or "trust the growth team" — it is that the experimentation infrastructure needs to know it is measuring a stochastic feature.

What a stochastic-aware experimentation framework actually does differently
