
A/B Testing Non-Deterministic AI Features: Why Your Experimentation Framework Assumes the Wrong Null Hypothesis

· 10 min read
Tian Pan
Software Engineer

Your A/B testing framework was built for a world where the same input produces the same output. Change a button color, measure click-through rate, compute a p-value. The variance comes from user behavior, not from the feature itself. But when you ship an AI feature — a chatbot, a summarizer, a code assistant — the treatment arm has its own built-in randomness. Run the same prompt twice, get two different answers. Your experimentation infrastructure was never designed for this, and the consequences are worse than you think.

Most teams discover the problem the hard way: experiments that never reach significance, or worse, experiments that reach significance on noise. The standard A/B testing playbook doesn't just underperform with non-deterministic features — it actively misleads.

The Variance You Didn't Budget For

In a traditional A/B test, you estimate the sample size needed to detect a minimum detectable effect (MDE) at some statistical power. The formula depends on the variance of your metric. For a button color change, that variance comes entirely from differences between users — some click more, some click less. The treatment itself is deterministic.

With an LLM-powered feature, you have two sources of variance stacked on top of each other:

  • Between-user variance: Different users have different needs, contexts, and behaviors. This is the variance your framework was built to handle.
  • Within-user, within-treatment variance: The same user, with the same input, gets different outputs on different invocations. This is the variance your framework ignores.

The total variance is the sum of both. In practice, the within-treatment variance of LLM outputs can be enormous — a summarizer might produce a crisp three-sentence summary on one call and a rambling five-paragraph response on the next.

When you plug this inflated variance into a standard power calculation, the required sample size balloons. A test that would need 10,000 users for a deterministic feature might need 50,000 or more for the same effect size on a non-deterministic one.
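To make the inflation concrete, here is a minimal sketch of the textbook two-sample power calculation with the variance split into its two components. The function name `users_per_arm`, the variance values, and the assumption that each user's metric is the mean of `calls_per_user` independent invocations are all illustrative, not measurements from a real system.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(mde, between_var, within_var=0.0, calls_per_user=1,
                  alpha=0.05, power=0.80):
    """Approximate users per arm for a two-sided, two-sample z-test on a
    per-user mean metric.

    Assumption: each user's metric is the mean of `calls_per_user`
    independent invocations, so its variance is
    between_var + within_var / calls_per_user.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the test
    z_beta = z.inv_cdf(power)            # quantile for the desired power
    per_user_var = between_var + within_var / calls_per_user
    return ceil(2 * (z_alpha + z_beta) ** 2 * per_user_var / mde ** 2)


# Hypothetical numbers: same effect size, with and without output randomness.
deterministic = users_per_arm(mde=0.05, between_var=1.0)
stochastic = users_per_arm(mde=0.05, between_var=1.0, within_var=4.0)
print(deterministic, stochastic, round(stochastic / deterministic, 2))
```

With a within-treatment variance four times the between-user variance, the required sample is roughly five times larger, which is the same order of inflation as the 10,000 versus 50,000 example. Under the same assumptions, observing more invocations per user shrinks only the within-treatment term, so the between-user variance remains a floor.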

Teams that don't adjust end up in one of two failure modes: they run experiments too short and declare "no significant difference" when a real effect exists, or they see a lucky streak of high-quality outputs in the treatment arm and ship a change that doesn't actually help.

Per-Session Randomization Creates Hidden Confounds

The standard approach to A/B testing assigns each user to a group — control or treatment — and keeps them there for the duration of the experiment. This works when each session is independent. But AI features break this assumption in two ways.

First, multi-turn interactions create temporal dependencies. A user who gets a bad first response from your AI assistant will behave differently for the rest of the session — asking simpler questions, providing more context, or abandoning the feature entirely. The quality of turn one leaks into the metrics of turn five. Research on multi-turn LLM conversations shows an average 39% performance degradation compared to single-turn interactions. Once a conversation goes off the rails, it rarely recovers.

Second, non-deterministic outputs mean that a user's experience of "treatment" is itself inconsistent, even when the assignment never changes. Imagine testing two prompt strategies for a customer support bot. User A gets assigned to treatment and has five conversations over the experiment period. Due to LLM non-determinism, two conversations are excellent, two are mediocre, and one is poor. User A's average experience is "okay," even though the treatment might genuinely be better. The non-determinism obscured the improvement within that single user's data.

The fix is to think carefully about the unit of randomization. For multi-turn features, randomize at the conversation level rather than the user level. This gives you more data points (each conversation is a sample) but introduces a different problem: the same user appears in both control and treatment, creating within-user correlation. Clustered standard errors or mixed-effects models become necessary, not optional.
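As a rough illustration of what clustering by user changes, the sketch below computes a difference in means over conversations with two standard errors: a naive one that treats every conversation as independent, and a cluster-robust one that sums each user's contributions before squaring. It uses the plain sandwich form with no small-sample correction, and the function name `diff_in_means_with_se` and the toy data are hypothetical. In practice you would likely use a statistics library rather than hand-rolling this.

```python
from collections import defaultdict
from math import sqrt


def diff_in_means_with_se(rows):
    """rows: iterable of (user_id, arm, metric), arm in {"control", "treatment"}.

    Returns (difference, naive_se, clustered_se) for treatment minus control.
    Both standard errors use the sandwich form without small-sample
    corrections (HC0 for naive, CR0 for clustered by user).
    """
    rows = list(rows)
    by_arm = defaultdict(list)
    for _, arm, y in rows:
        by_arm[arm].append(y)
    n = {arm: len(ys) for arm, ys in by_arm.items()}
    mean = {arm: sum(ys) / len(ys) for arm, ys in by_arm.items()}
    diff = mean["treatment"] - mean["control"]

    naive_var = 0.0
    user_sums = defaultdict(float)
    for user, arm, y in rows:
        sign = 1.0 if arm == "treatment" else -1.0
        # Each conversation's contribution to the estimated difference.
        influence = sign * (y - mean[arm]) / n[arm]
        naive_var += influence ** 2      # treats conversations as independent
        user_sums[user] += influence     # pools conversations from one user
    clustered_var = sum(s ** 2 for s in user_sums.values())
    return diff, sqrt(naive_var), sqrt(clustered_var)


# Toy data: conversation-level assignment, so users appear in both arms.
conversations = [
    ("u1", "treatment", 5.0), ("u1", "treatment", 6.0), ("u1", "control", 4.0),
    ("u2", "treatment", 1.0), ("u2", "control", 0.0), ("u2", "control", 2.0),
]
diff, naive_se, clustered_se = diff_in_means_with_se(conversations)
print(diff, naive_se, clustered_se)
```

Two users is far too few clusters for real inference; the toy data only makes the arithmetic easy to check. In this example the clustered standard error comes out smaller than the naive one, because each user contributes to both arms and their baseline level partly cancels. With other designs the correction can go the other way, which is why the naive number cannot be trusted in either direction.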

Why Standard Metrics Lie

The second trap is measuring the wrong thing. Traditional A/B tests optimize for behavioral metrics — click-through rate, conversion, time on page. These metrics are well-defined, easy to measure, and have known distributions.

AI features introduce a category of outcome that doesn't fit neatly into behavioral metrics: output quality. Whether a summary is "good," whether a code suggestion is "correct," whether a chatbot response is "helpful" — these are subjective, multi-dimensional, and expensive to measure.

Teams typically reach for one of three approaches, each with its own failure mode:

Proxy metrics. Measure something downstream and observable: Did the user accept the suggestion? Did they edit the summary? Did they thumbs-up the response? These are easy to collect but weakly correlated with actual quality. A user might accept a mediocre code suggestion because editing it is faster than writing from scratch. They might thumbs-down a correct but verbose response.

Automated evaluation. Use an LLM-as-judge to score outputs on dimensions like relevance, accuracy, and helpfulness. This scales but introduces its own variance — the judge model is also non-deterministic. You now have treatment variance stacked on evaluation variance. Research shows LLM judges exhibit prompt sensitivity and scoring biases that can mask or amplify real differences between variants.
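One way to see how much of the score spread comes from the judge rather than from the outputs is to score each output several times and compare the two variances. The sketch below assumes you already have repeated judge scores per output; the function name `judge_noise_report` and the scores are made up for illustration.

```python
from statistics import mean, variance


def judge_noise_report(scores_by_output):
    """scores_by_output: {output_id: [score, score, ...]} with at least two
    repeated judge scores per output and at least two outputs.

    Returns (judge_var, output_var):
      judge_var  - average within-output sample variance (judge noise)
      output_var - sample variance of the per-output mean scores
    """
    per_output = list(scores_by_output.values())
    judge_var = mean([variance(scores) for scores in per_output])
    output_var = variance([mean(scores) for scores in per_output])
    return judge_var, output_var


# Hypothetical repeated scores: the judge is stable on "a", noisy on "b".
repeated_scores = {
    "a": [4.0, 4.0, 4.0],
    "b": [2.0, 3.0, 4.0],
}
judge_var, output_var = judge_noise_report(repeated_scores)
print(judge_var, output_var)
```

Note that the variance of the per-output means still contains a share of judge noise (roughly the judge variance divided by the number of repeated scores), so it overstates the true spread between outputs. If the judge variance is comparable to the output variance, differences between variants will be hard to distinguish from evaluation noise.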

Human evaluation. Gold standard for quality, but expensive and slow. By the time you've collected enough human ratings to power your test, the model version has changed, the prompts have been updated, and the experiment is obsolete.
