Variance Eats the Experiment: Why A/B Power Math Breaks for LLM Features
The model team can demo the new feature and show ten convincing wins side by side. The growth team runs it as a two-week A/B test, gets p = 0.31, and the readout says "no significant effect." Both teams are right. The experiment is wrong.
This pattern repeats across every org that has bolted an LLM onto a product without rebuilding its experimentation stack. The math the growth team is using was designed for button colors, ranking changes, and pricing pages — features whose outputs are deterministic given a user and a context. LLM features break the two assumptions that math leans on, and the standard 80%-power, 5%-significance, two-week-ramp template ships systematically wrong calls in both directions: real wins read as null results, and noise reads as confident wins.
The cost is not just one experiment. It is that after a few rounds of "non-significant" readouts on features the model team can show qualitatively improve, the model team stops running experiments. The feedback loop between feature iteration and causal evidence quietly breaks, and the org ends up shipping AI features on vibes — not because anyone decided to, but because the experimentation stack stopped being useful for that surface.
The two assumptions that quietly break
Classical A/B math compresses a lot of subtlety into a sample-size formula that looks deceptively portable. Two things have to be true for it to give you the right number.
One: per-user variance is small relative to the treatment effect you are trying to detect. If the noise floor of an individual user's behavior is larger than the lift you are looking for, you need either a much larger sample or a longer experiment to separate signal from noise. For deterministic UI experiments, this assumption holds because the per-user variance is mostly about whether the user is in the mood to click — bounded, well-modeled, well-studied.
Two: each unit of measurement (a session, an event, a conversion) is a roughly iid sample from the user's behavior distribution. This is what lets you treat sessions as exchangeable and pool them into a per-user metric without much thought.
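To see where variance enters the math, here is the textbook two-sample formula that most sample-size calculators implement in some form. A minimal sketch with illustrative numbers; the function name and the metric scale are mine, not any particular vendor's tooling.
```python
from scipy.stats import norm

def n_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Textbook approximation: n per arm = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2.
    Required sample size scales with the variance, sigma squared."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2

# Illustrative: detecting a 2-point lift on a 0-100 quality metric.
print(n_per_arm(sigma=10, delta=2))  # per-session sd of 10 -> roughly 390 users per arm
print(n_per_arm(sigma=20, delta=2))  # per-session sd of 20 -> roughly 1570: 4x the variance, 4x the sample
```
The whole argument of this piece lives in that sigma-squared term: get sigma wrong by the amounts LLM features make easy, and every downstream number is wrong by the same factor.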
LLM features violate both assumptions, and they violate them in compounding ways:
- Inter-user variance is higher than for deterministic features, because the input is now natural language. Two users asking "the same" question phrase it differently, push the model into different parts of its output distribution, and get answers with different latencies, lengths, and downstream behaviors. Your treatment and control arms are not just sampling users — they are sampling prompt distributions per user, and those distributions differ more across users than click-targets do.
- Intra-user variance is high in a way deterministic features don't have at all, because the same user asking the same question across two sessions can get materially different answers. Temperature, sampling, retrieval freshness, tool-call ordering, and model-side nondeterminism all conspire to put an extra layer of variance inside the unit you used to treat as a fixed point. The "iid sample from a user's behavior distribution" framing now has a second source of randomness — the model's — stacked on top of the user's.
- The treatment effect is often smaller than the within-user variance of either arm, because the changes the model team is shipping (a better prompt, a smarter router, a swap from a small model to a medium one) tend to produce 1–5% lifts on metrics whose per-session noise is 10–30%. The signal-to-noise ratio is not merely low; it is structurally inverted, with per-session noise several times larger than the effect you are trying to detect.
The Microsoft experimentation team published the foundational variance-reduction paper for this kind of problem in 2013, and the framing it built — that a large share of a metric's variance is predictable from pre-experiment data and can be removed before comparing arms — is exactly the right starting point. But the standard implementation assumes deterministic per-user behavior, and the LLM case adds a model-side variance component that pre-period covariates can't predict.
Why the dashboard says "non-significant" and the model team says "obviously better"
When a sample-size calculator is fed an effect size and a noise estimate from a deterministic-feature pre-period, it returns a number that is wrong by roughly the ratio of true variance to assumed variance. For an LLM feature with high intra-user variance, that ratio is often 2–5×. So the team needs 2–5× the sample size, or 2–5× the experiment duration, to hit the same statistical power.
Most growth teams don't catch this because the variance estimate is implicit in their tooling. The pre-period is automatically computed from historical data, the calculator returns a sample size, and the experiment ramps. Nothing in the standard pipeline distinguishes "this metric has 10% per-user variance because users vary" from "this metric has 30% per-user variance because users vary AND the model varies." The number that comes out has the right units and the wrong magnitude.
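You can watch this happen in a simulation. The sketch below sizes a test at the variance the deterministic-feature pre-period implies, then runs it against the variance the LLM feature actually has; all numbers are illustrative assumptions, not measurements from a real system.
```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_power(n_per_arm, lift, sigma, n_sims=2000):
    """Fraction of simulated experiments whose two-sample z-test reaches p < 0.05."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(50.0, sigma, n_per_arm)
        treatment = rng.normal(50.0 + lift, sigma, n_per_arm)
        se = np.sqrt(control.var(ddof=1) / n_per_arm + treatment.var(ddof=1) / n_per_arm)
        hits += abs(treatment.mean() - control.mean()) / se > 1.96
    return hits / n_sims

# The calculator assumed sigma = 10 and sized the test at 400 users per arm for 80% power.
print(simulated_power(400, lift=2.0, sigma=10))  # ~0.80
# The feature's real per-session noise is sigma = 20.
print(simulated_power(400, lift=2.0, sigma=20))  # ~0.29
```
A real 2-point lift now reads as "non-significant" in roughly seven out of ten runs, which is exactly the readout that opened this piece.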
The model team, meanwhile, is looking at curated qualitative wins — twenty paired examples where the new prompt is obviously better than the old one. Those examples are real. But they are not random samples; they are a hand-picked subset where the lift is large enough to see. The honest extrapolation is "the new feature is reliably better when the lift is visible, but the average lift across the full traffic distribution is small relative to the noise the experiment can detect."
Both readouts can be true at once. The new feature does work. The experiment also genuinely doesn't have the power to detect it. The answer is not "trust the model team" or "trust the growth team" — it is that the experimentation infrastructure needs to know it is measuring a stochastic feature.
What a stochastic-aware experimentation framework actually does differently
There are four shifts that have to land for AI-feature experimentation to produce honest answers. None are exotic; all are mostly absent from the default toolchain.
Variance estimates that include both inter- and intra-user components. The pre-period analysis has to decompose variance into between-user and within-user pieces, ideally via a mixed-effects model where user is a random intercept. The intraclass correlation coefficient — the share of total variance that lives at the user level — tells you how much your effective sample size is shrunk by within-user repetition. For a deterministic feature on a per-conversion metric, ICC is often near 1 (a user converts or doesn't, with low session-to-session variance). For an LLM feature, ICC drops because each session is a fresh draw from a stochastic process, which counterintuitively increases your effective sample size per user — but only if you actually measure it, and only if your variance estimate isn't being inflated by misspecifying the model.
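A sketch of what that decomposition looks like with statsmodels, assuming a long-format table of per-session metric values keyed by user; the column names and the simulated variance components are illustrative, not a recipe for your metric pipeline.
```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Simulated sessions: a per-user offset (between-user variance) plus per-session
# noise that includes the model's own stochasticity (within-user variance).
n_users, sessions_per_user = 500, 8
user_effect = rng.normal(0, 5, n_users)
df = pd.DataFrame([
    {"user_id": u, "metric": 50 + user_effect[u] + rng.normal(0, 15)}
    for u in range(n_users) for _ in range(sessions_per_user)
])

# Random-intercept model: fixed mean, one random intercept per user.
fit = smf.mixedlm("metric ~ 1", df, groups=df["user_id"]).fit()
var_between = fit.cov_re.iloc[0, 0]   # user-level variance component
var_within = fit.scale                # session-level (residual) variance component
icc = var_between / (var_between + var_within)

# Effective sample size under clustering: sessions divided by the design effect.
n_sessions = len(df)
n_eff = n_sessions / (1 + (sessions_per_user - 1) * icc)
print(f"ICC ~ {icc:.2f}, effective sample size ~ {n_eff:.0f} of {n_sessions} sessions")
```
With these simulated components the ICC lands around 0.1, so eight sessions per user are worth closer to five independent observations, not one and not eight; that is the number the power calculation should be fed.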
Blocked designs that pair the same user across treatment and control. The single biggest variance reduction available to you is to give the same user both versions and analyze the paired difference. This collapses inter-user variance entirely and leaves you measuring within-user contrast, which is exactly the quantity you care about. The price is real — you need a UX that supports the blocking (twin-prompt designs, A/B exposure within session, off-peak shadow runs) — and not every feature can be blocked. But for prompt and model upgrades on tasks where the user issues multiple comparable requests, paired analysis turns "underpowered for the rest of the year" into "decisive in two weeks."
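When the UX does support it, the analysis is simple; a sketch assuming you can compute a per-user mean of the metric under each variant, with simulated numbers chosen so the user-level offset dwarfs the lift.
```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Per-user means under control and treatment. The large user-level baseline is
# shared by both arms; the true lift is small (+1.0 on a ~50-point metric).
n_users = 300
baseline = rng.normal(50, 10, n_users)
control = baseline + rng.normal(0, 4, n_users)
treatment = baseline + 1.0 + rng.normal(0, 4, n_users)

# Ignoring the pairing: user-to-user variance swamps the lift (typically not significant here).
print(stats.ttest_ind(treatment, control))

# Paired analysis: the shared baseline cancels in the per-user difference (typically significant here).
print(stats.ttest_rel(treatment, control))
```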
A consistency covariate that absorbs intra-user noise. Pre-experiment behavior is the standard CUPED covariate; for stochastic features, a more powerful covariate is a twin-prompt consistency score computed at experiment time. Run the same prompt through the model twice, measure how much the two outputs differ on the metric of interest, and use that as a per-session covariate in the regression. This is conceptually the same move as CUPED but moves the variance estimation from pre-period historical data to in-experiment paired sampling — closer to the augmentation framing that recent work has revisited as the right way to think about CUPED.
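The adjustment itself is a few lines; what changes is where the covariate comes from. A minimal sketch of the standard CUPED form, with a simulated covariate standing in for whatever per-session consistency score you log (the construction and numbers are illustrative; the one hard requirement is that the covariate is not affected by the treatment assignment).
```python
import numpy as np

rng = np.random.default_rng(3)

def cuped_adjust(y, x):
    """CUPED adjustment: y_adj = y - theta * (x - mean(x)), with theta = cov(x, y) / var(x).
    Only valid if x is independent of the treatment assignment."""
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - np.mean(x))

# Simulated stand-in for the per-session covariate: correlated with the outcome,
# unaffected by which arm the session landed in.
n = 5000
covariate = rng.normal(0, 1, n)
outcome = 50 + 3.0 * covariate + rng.normal(0, 2, n)

adjusted = cuped_adjust(outcome, covariate)
print(np.var(outcome, ddof=1), np.var(adjusted, ddof=1))
# Variance drops by roughly corr(x, y)^2: here from ~13 to ~4, i.e. the same power at ~a third of the sample.
```
How much this buys you is entirely a function of how well the consistency score tracks the outcome; a covariate uncorrelated with the metric adjusts nothing, which is why measuring that correlation is part of the platform work, not an afterthought.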
Longer ramp horizons, because the SNR is worse than the dashboard implied. This is the boring one, and it is the one teams skip. If the variance is 2–3× higher than the calculator assumed, the experiment needs 2–3× the duration. Telling a growth team to ramp an AI feature for six weeks instead of two will be unpopular. Telling them they have been running underpowered experiments for a year and shipping decisions on noise is worse.
The org failure mode that follows from getting this wrong
The technical mistake is recoverable. The org failure mode is not, and it compounds.
After the third or fourth time the experimentation team rules a model-team feature "non-significant," the model team draws an entirely rational conclusion: experimentation is not a tool that can measure their work. So they stop running experiments. They ship on offline evals, qualitative spot-checks, and confidence in the model upgrade. The feedback loop between iterations and causal evidence breaks, and once it breaks, the experimentation team has no traffic to learn from and no way to calibrate their machinery for stochastic features.
This is how an org ends up two years later with an AI roadmap that ships on vibes and an experimentation team that has never run a successful AI experiment, sitting next to each other in standup, not realizing they are the same problem. The experimentation team thinks the model team is ignoring the data; the model team thinks the experimentation team is gatekeeping with the wrong yardstick. Neither is exactly wrong, and the diagnosis — "the math doesn't fit the workload" — is something neither team owns.
The cost frame nobody surfaces
A stochastic-aware experimentation framework is platform investment. It needs a mixed-effects estimator in the metric-computation pipeline, a paired-design primitive in the assignment service, a twin-prompt sampling mechanism wired into the model gateway, and dashboards that surface ICC and effective sample size alongside p-values. Every one of those is six engineer-weeks of platform work that competes with feature work in a quarterly plan.
The experimentation team, looking at their roadmap, sees AI features as a small fraction of the experiment volume and rationally deprioritizes the work. By the time half the roadmap is AI features and the system is producing systematically wrong calls, the platform investment is two years behind, and the team is rebuilding under pressure with the model team breathing down their neck.
The hidden cost is not the engineering. It is the year of decisions made on misread experiments before anyone notices that the experiment system has been wrong all along.
Experimentation infrastructure for features whose output distribution is the thing you're changing
The deeper realization is that experimentation infrastructure built for deterministic UIs is asking the wrong question of LLM features. For a button-color test, the output is deterministic given the user and the assignment, and the experiment measures how user behavior shifts in response. For an LLM feature, the output is itself a sample from a distribution, and the experiment is comparing two distributions of outputs — not two fixed treatments.
Once you frame it that way, a lot of what feels weird about LLM A/B testing becomes obvious. Switchback designs make sense for features where the model's output distribution is the unit of comparison. Paired analysis is the default, not a clever variant. Twin-prompt sampling is a measurement instrument, not a hack. The framework you want is closer to clinical-trial-style mixed-effects analysis on noisy biomarkers than to the click-rate calculator the experimentation team has been maintaining for a decade.
The teams that will ship AI features with confidence in 2026 are the ones that took the variance problem seriously a year before they had to. The ones that didn't will spend 2026 reforecasting roadmaps off experiment readouts that were wrong from the day the AI surface launched, and wondering why the model team stopped showing up to the experiment review.
The math hasn't changed. The workload has. The experimentation stack that doesn't notice is the one you are running on right now.
- https://www.statsig.com/blog/llm-optimization-online-experimentation
- https://www.statsig.com/blog/cuped
- https://exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf
- https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/articles/deep-dive-into-variance-reduction/
- https://mlumiste.com/technical/ab-test-llm-evals/
- https://blog.growthbook.io/how-to-a-b-test-ai-a-practical-guide/
- https://aclanthology.org/2025.findings-emnlp.594.pdf
- https://openreview.net/forum?id=E2RyjrBMVZ
- https://arxiv.org/abs/2009.00148
- https://en.wikipedia.org/wiki/Intraclass_correlation
- https://www.stata.com/features/overview/intraclass-correlations-for-multilevel-models/
- https://vasishth.github.io/Freq_CogSci/from-the-paired-t-test-to-the-linear-mixed-model.html
- https://arxiv.org/html/2312.02935v1
