
A/B Testing AI Features When the Treatment Is Non-Deterministic

· 10 min read
Tian Pan
Software Engineer

Your team ships a new LLM-powered feature, runs a clean A/B test for two weeks, and sees a statistically significant improvement. You roll it out. Three weeks later, retention metrics are flat and support tickets are up. What went wrong? You ran a textbook experiment on a non-textbook treatment — and the textbook assumption that "the treatment is stable" broke silently.

Standard A/B testing was designed for deterministic or near-deterministic treatments: a button color change, a ranking algorithm with fixed parameters, a checkout flow. LLM features violate almost every assumption that makes classical frequentist experiments reliable. The treatment variance is high, the treatment itself mutates mid-experiment when providers push model updates, success is hard to operationalize, and novelty effects are strong enough to produce results that evaporate after users adapt.

This post is about the adjustments that make experimentation work anyway.

Why LLMs Break Classical A/B Testing

The problem starts with non-determinism. Even with temperature set to 0, inference kernels don't guarantee identical outputs across runs. When server load changes batch sizes, operations like RMSNorm and matrix multiplication produce numerically different results from identical inputs. Empirically, accuracy variations of up to 15% have been measured across repeated calls to the same model with the same prompt — and in some tasks the gap between best and worst runs approaches 70%. Temperature is not the variance dial engineers assume it is.

The second problem is treatment drift. LLM providers update their models continuously, often without versioning guarantees that match your experiment windows. If your control is "old model" and your treatment is "new prompt on old model," but the provider ships a silent patch three days into your two-week experiment, your treatment is no longer what it was at randomization. This isn't a theoretical concern — it's a documented source of confounded results.

The third problem is metric operationalization. With a UI experiment, "did the user click?" is a clean, binary outcome. With an LLM feature, "did the AI help?" is contested. Thumbs-up ratings are sparse and subject to politeness bias. Engagement metrics (did the user keep interacting?) confound helpfulness with novelty. Task completion is meaningful but hard to measure without instrumentation that most teams don't have in place before shipping the experiment.

Variance Reduction Is Not Optional

The single most impactful change you can make to LLM experiment design is applying CUPED (Controlled-Experiment Using Pre-Experiment Data) or an equivalent variance reduction technique. The core idea: collect a pre-experiment metric for each user, then subtract the portion of post-experiment variance that's predictable from that baseline.

The formula is straightforward. For each user, compute:

Y_cuped = Y_post - θ × (X_pre - mean(X_pre))

where θ is the regression coefficient between pre- and post-experiment outcomes. The resulting variance reduction is Var(Y) × (1 - ρ²), where ρ is the correlation between pre- and post-experiment periods. In practice, CUPED routinely achieves 20–40% variance reduction on business metrics like session duration, items per order, and revenue per user.
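A minimal NumPy sketch of the adjustment; the simulated pre- and post-period metrics below are stand-ins for real user-level data such as pre-experiment session duration:

```python
import numpy as np

def cuped_adjust(y_post, x_pre):
    """CUPED adjustment: subtract the component of the post-experiment
    metric that is linearly predictable from the pre-experiment covariate.
    theta is the OLS coefficient of y_post on x_pre, which minimizes the
    variance of the adjusted metric."""
    theta = np.cov(x_pre, y_post)[0, 1] / np.var(x_pre)
    return y_post - theta * (x_pre - x_pre.mean())

# Simulated users: pre-period behavior correlated with post-period outcome.
rng = np.random.default_rng(0)
x_pre = rng.normal(10, 3, size=5000)
y_post = 0.8 * x_pre + rng.normal(0, 2, size=5000)

y_adj = cuped_adjust(y_post, x_pre)
print(np.var(y_post), np.var(y_adj))  # adjusted variance is markedly lower
```

Note that the adjustment leaves the metric's mean untouched, so the treatment effect estimate is unchanged; only its standard error shrinks.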

For LLM features specifically, CUPED matters even more because LLM output variance stacks on top of user behavior variance. If your treatment metric is "task completion per session" and LLM outputs vary 15% between calls, that variance propagates into your experiment's standard error, inflating your minimum detectable effect (MDE). CUPED attacks the user behavior component; reducing per-call variance attacks the LLM component.

To reduce per-call variance, run the model 3–5 times per prompt during evaluation and average the scores. This sounds expensive, and it is — but for offline evaluation and metric calibration before running live experiments, it's the most reliable way to estimate true effect sizes rather than noisy samples. In live experiments, averaging isn't always feasible, which makes getting pre-experiment covariates right even more important.
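A sketch of the averaging loop, where `call_model` and `score` are placeholders for your inference client and grading function (the toy model here just returns a noisy quality score):

```python
import random
import statistics

def evaluate_prompt(call_model, score, prompt, n_runs=5):
    """Average an eval score over repeated model calls to damp per-call
    output variance; also return the spread as a noise estimate."""
    scores = [score(call_model(prompt)) for _ in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)

# Toy stand-in: a "model" whose output quality fluctuates run to run.
random.seed(1)
fake_model = lambda prompt: random.gauss(0.7, 0.1)  # hypothetical client
identity_score = lambda output: output              # hypothetical grader

mean_score, spread = evaluate_prompt(fake_model, identity_score, "prompt")
print(mean_score, spread)
```

Reporting the spread alongside the mean makes it obvious when two prompt variants differ by less than their own run-to-run noise.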

CUPED has two hard requirements: at least two weeks of pre-experiment data, and a strong correlation between pre- and post-experiment outcomes. It doesn't work for new users or metrics with no historical pattern. For those cases, stratification by user cohort (power users vs. new users, mobile vs. desktop) is the fallback — run separate analyses rather than letting high-variance segments dominate your pooled result.

Define Metrics Before You Randomize

The standard guidance — "define your success metric before looking at the data" — is especially important for LLM experiments, where the temptation to pick metrics post-hoc is high and the garden of forking paths is wide.

The hierarchy of metrics that work, from most trustworthy to noisiest:

Downstream business outcomes are the most trustworthy. Did the LLM feature drive a conversion, a purchase, a subscription renewal? These are hard to fake with novelty effects and resilient to output quality debates. The downside is low sensitivity — you need large samples and long windows to detect meaningful lifts.

Behavioral proxies sit in the middle. Edit rate (did the user accept or modify the output?), retry rate (did they regenerate?), copy rate, and downstream click-through all reflect revealed preference rather than stated preference. They're more sensitive than business outcomes and less gameable than explicit ratings. Instrument these before launching the experiment.
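As a sketch of what that instrumentation feeds, the proxy rates can be computed from a flat event log; the event schema below is hypothetical:

```python
from collections import Counter

# Hypothetical event log: one record per user action on an LLM output.
events = [
    {"session": "s1", "action": "generate"},
    {"session": "s1", "action": "edit"},
    {"session": "s2", "action": "generate"},
    {"session": "s2", "action": "regenerate"},
    {"session": "s2", "action": "copy"},
]

def proxy_rates(events):
    """Per-generation behavioral proxy rates from a flat event log.
    Regenerations count as generations too, since each produces an
    output the user reacted to."""
    counts = Counter(e["action"] for e in events)
    generations = counts["generate"] + counts["regenerate"]
    return {
        "edit_rate": counts["edit"] / generations,
        "retry_rate": counts["regenerate"] / generations,
        "copy_rate": counts["copy"] / generations,
    }

print(proxy_rates(events))
```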

Quality ratings are the noisiest signal. Thumbs-up rates suffer from positivity bias, and users who explicitly rate outputs are not representative of users who don't. Use ratings as a supplementary diagnostic, not as a primary metric.

Avoid composite scores that blend quality dimensions unless you've validated the weighting scheme on historical data. "Quality score" as a primary experiment metric is usually a number that everyone argues about after the experiment rather than a signal that drives decisions.

Handling Model Updates Mid-Experiment

If your LLM provider updates their model during your experiment window, you have a few options. The cleanest is to treat it as an experiment design failure — restart the experiment with a snapshot-pinned model version if your provider supports it, and add model version as an experiment parameter going forward.

If restarting isn't feasible, instrument your logging to capture model version per inference call, then run a difference-in-differences analysis: compare treatment vs. control separately before and after the update. If the update affected both arms equally, the difference-in-differences estimate is still valid. If it affected treatment and control differently (because, for example, the update was specifically relevant to your prompt strategy), the experiment is confounded and you should discard it.
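A sketch of that check, assuming you have aggregated per-arm outcome means on each side of the update:

```python
def did_estimate(means):
    """Difference-in-differences across a mid-experiment model update.

    `means` holds the outcome mean for each (arm, period) cell, where
    period is before/after the provider's model update. If the update
    shifted both arms equally, the treatment-control gap is unchanged
    and the DiD term is ~0; a large DiD term signals confounding."""
    pre_gap = means[("treatment", "before")] - means[("control", "before")]
    post_gap = means[("treatment", "after")] - means[("control", "after")]
    return post_gap - pre_gap

# Illustrative numbers: the update lowered both arms by the same amount.
means = {
    ("control", "before"): 0.50, ("treatment", "before"): 0.54,
    ("control", "after"): 0.47, ("treatment", "after"): 0.51,
}
print(did_estimate(means))  # ~0: no evidence of differential impact
```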

The operational implication: always log which model version responded to each user request, even in production systems that aren't running experiments. This turns an irreversible confound into a recoverable analysis problem.

Sequential Testing: Stop Early, Not Wrong

A common pressure in LLM feature experiments is cost. LLM inference at experiment scale is expensive, and teams want to stop experiments early when results look decisive. The problem is that standard frequentist tests are not valid if you peek at results and stop when p < 0.05 — doing so inflates your false positive rate to well above the nominal alpha.

Always-valid p-values (also called anytime-valid inference) solve this. The approach replaces traditional hypothesis tests with sequential statistics that remain valid regardless of when you look. You can continuously monitor your experiment, stop early if strong evidence emerges in either direction, and maintain confidence interval validity throughout. The price is slightly reduced statistical power compared to a fixed-sample test at the same final n — but that tradeoff is almost always worth it when inference costs are real.
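One concrete construction is the mixture sequential probability ratio test (mSPRT) with a normal mixing density. A sketch, assuming per-unit treatment-minus-control outcome differences with known variance (both the variance and the mixing parameter are simplifying assumptions here):

```python
import math
import random

def msprt_p_value(diffs, sigma2=1.0, tau2=1.0):
    """Always-valid p-value from the mixture SPRT under a normal model.

    `diffs` are per-unit treatment-minus-control differences with
    (assumed known) variance sigma2; tau2 is the variance of the normal
    mixing density over the effect size. The p-value may be checked
    after every observation without inflating the false positive rate."""
    n = len(diffs)
    xbar = sum(diffs) / n
    # Log of the mixture likelihood ratio against H0: effect = 0.
    log_lam = 0.5 * math.log(sigma2 / (sigma2 + n * tau2)) \
        + (n * n * tau2 * xbar * xbar) / (2 * sigma2 * (sigma2 + n * tau2))
    return min(1.0, math.exp(-log_lam))

# Under a real effect, the p-value shrinks as evidence accumulates.
random.seed(2)
stream = [random.gauss(0.5, 1.0) for _ in range(400)]
p_values = [msprt_p_value(stream[: n + 1]) for n in range(len(stream))]
print(p_values[0], p_values[-1])
```

The key property: stopping the first time the p-value crosses your alpha is a valid decision rule, which is exactly the "peek whenever you want" behavior teams need when inference costs are mounting.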

Several large-scale A/B testing platforms have implemented always-valid inference as their default method. If you're building experiment infrastructure for LLM features, this should be your baseline rather than an afterthought.

Interleaving for Ranking and Retrieval

If your LLM feature is a ranking or retrieval system — reranking search results, personalizing content feeds, ordering recommendations — consider interleaving before committing to a full A/B test.

Interleaving presents results from both control and treatment to the same user in the same session, using downstream user actions (clicks, dwell time, conversions) to determine which ranker the user implicitly preferred. Because users serve as their own controls, interleaving requires dramatically less traffic to reach statistically significant conclusions. Airbnb's experimentation team documented a 50x speedup compared to traditional A/B tests, with 82% directional alignment.

The limitation is important: interleaving tells you which ranker users prefer relative to the other, not the absolute business impact of deploying one over the other. Use it as an acceleration gate — if interleaving shows a clear winner, proceed to A/B validation for absolute impact measurement. If interleaving is inconclusive, skip the A/B test and iterate.
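A sketch of one common variant, team-draft interleaving, simplified so that a coin flip decides draft order on every round; clicks on an item credit the ranker that contributed it:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, rng=random):
    """Team-draft interleaving (simplified): each round, a coin flip
    decides which ranker drafts first; each drafts its best not-yet-shown
    item and is credited for it."""
    shown, credit = [], {}
    while len(shown) < k:
        order = ("A", "B") if rng.random() < 0.5 else ("B", "A")
        added = False
        for team in order:
            ranking = ranking_a if team == "A" else ranking_b
            doc = next((d for d in ranking if d not in credit), None)
            if doc is not None and len(shown) < k:
                shown.append(doc)
                credit[doc] = team
                added = True
        if not added:
            break  # both rankings exhausted
    return shown, credit

def score_clicks(credit, clicked_docs):
    """Count per-ranker wins from the user's clicks on credited items."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in credit:
            wins[credit[doc]] += 1
    return wins

random.seed(3)
shown, credit = team_draft_interleave(
    ["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"], k=4)
print(shown, credit)
```

Aggregated over many sessions, the win counts give a paired preference signal, which is why the traffic requirements are so much smaller than an A/B split.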

The Novelty Effect Problem

LLM features produce strong novelty effects. Users interact with a new AI-powered capability at higher rates in the first week simply because it's new, not because it's useful. This effect decays, sometimes to below baseline if the feature disappoints after initial enthusiasm.

Two-week experiment windows are often insufficient to distinguish genuine lift from novelty. The practical guidance:

  • Run for at least three to four weeks on production features.
  • Segment analysis by days-since-first-exposure. If your lift is concentrated in day 1–3 interactions and flat or negative after that, you have a novelty effect, not a feature improvement.
  • Separate new users (who have no novelty baseline) from returning users. New users will always show interaction patterns that look like novelty; mixing them into your treatment effect estimate obscures what's happening with your established user base.
  • If novelty effects are large, pre-register a planned analysis of week-over-week treatment effect decay. If the treatment effect is real, it should be stable or growing as users learn the feature. If it decays monotonically, be skeptical.
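The days-since-first-exposure segmentation above can be sketched as a bucketed lift computation; the record format is hypothetical:

```python
from collections import defaultdict

# Hypothetical per-interaction records:
# (arm, days_since_first_exposure, binary outcome)
records = [
    ("treatment", 1, 1), ("treatment", 1, 1), ("treatment", 8, 0),
    ("control", 1, 1), ("control", 1, 0), ("control", 8, 0),
    # ...in practice, rows from your event pipeline
]

def lift_by_exposure_bucket(records, buckets=((1, 3), (4, 7), (8, 28))):
    """Treatment-minus-control outcome rate per days-since-first-exposure
    bucket. A lift concentrated in the earliest bucket and absent later
    is the signature of a novelty effect."""
    sums = defaultdict(lambda: [0, 0])  # (arm, bucket) -> [total, count]
    for arm, day, outcome in records:
        for lo, hi in buckets:
            if lo <= day <= hi:
                cell = sums[(arm, (lo, hi))]
                cell[0] += outcome
                cell[1] += 1
    lifts = {}
    for lo, hi in buckets:
        t = sums[("treatment", (lo, hi))]
        c = sums[("control", (lo, hi))]
        if t[1] and c[1]:  # skip buckets missing either arm
            lifts[(lo, hi)] = t[0] / t[1] - c[0] / c[1]
    return lifts

print(lift_by_exposure_bucket(records))
```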

What a Mature LLM Experiment Stack Looks Like

Pulling the pieces together, a robust approach to LLM feature experimentation requires:

  • Versioned model snapshots or per-call model version logging, so model updates don't silently confound experiments.
  • Pre-experiment metric collection (at minimum two weeks) for every user-level metric you care about, enabling CUPED variance reduction.
  • Behavioral instrumentation (edit rate, retry rate, copy rate, downstream clicks) deployed before the experiment starts, not after.
  • Always-valid inference as the statistical method, so teams can monitor results and stop early without inflating false positive rates.
  • Novelty effect analysis as a standard post-experiment check — treatment effect by days-since-first-exposure should be part of every experiment report.
  • Interleaving as a fast first gate for ranking and retrieval features before committing to full A/B validation.

None of these are exotic. They're all techniques that large-scale experimentation teams have been using for years in traditional product contexts. The LLM-specific adaptation is recognizing that the variance budget is much tighter — non-deterministic outputs and noisy metrics mean the classical assumptions about stable, low-variance treatments no longer hold. Adjust accordingly, and your experiments will tell you something true.

The teams that run rigorous LLM experiments ship better features. Not because they're more conservative, but because they can actually tell the difference between a real effect and noise — and act on that difference with confidence.
