
The AI A/B Test That Lied: Novelty, Carryover, and Anchoring Bias in LLM Experiments

10 min read
Tian Pan
Software Engineer

Your AI feature shipped with confidence. The A/B test showed a statistically significant 12% lift in user engagement. The confidence intervals didn't overlap. The sample size was adequate. The p-value was comfortably under 0.05. Six weeks later, the metric has flatlined back to baseline. Three months in, it's below baseline. The experiment told you the feature worked. The experiment lied.

This isn't a bug in your statistical tooling. It's a fundamental mismatch between what standard A/B testing measures and what happens when humans interact with probabilistic AI systems over time. Three specific biases — novelty inflation, anchoring, and carryover — conspire to inflate every AI feature experiment, and the standard remedy of adding a holdout group doesn't fix any of them.

The Novelty Effect: Why Your First Two Weeks Are Useless

Every experienced product engineer knows about the novelty effect — the documented tendency for users to engage more with any new interface element simply because it's new. What's less appreciated is how much more severe this effect is for AI features than for deterministic UI changes.

When you redesign a button, the novelty effect lasts days. When you ship a new LLM-powered response surface, it lasts weeks, and the magnitude is dramatically larger. A new AI writing assistant might produce responses that users genuinely find novel: the output varies, the phrasing feels fresh, and users explore the surface more than they would a static element. That exploration reads as engagement in your metrics.

The critical failure mode: teams run two-week experiments because that's standard practice. The novelty effect for AI features often peaks around day five and tapers through day fourteen, which is exactly the window the experiment covers. You're measuring the novelty spike, not the steady-state utility.

What you actually need to know is what happens in weeks three and four, after the novelty has faded and users have settled into their real usage patterns. Counterintuitively, research tracking users over eight months found that some AI features show significant usage declines over time, even after initially strong adoption, with trust erosion as the primary driver. The early signal pointed in the wrong direction entirely.

The fix isn't just running longer experiments. It's segmenting your experiment results by user cohort tenure: how do users who first encountered the feature on day one behave by day fourteen, compared to day seven, compared to day twenty-one? If the engagement curve is declining within your experiment window rather than stabilizing, you're looking at novelty, not utility.
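A minimal sketch of that tenure cut, assuming a pandas frame of per-user daily engagement events; the column names here are hypothetical, not a fixed schema:

```python
import numpy as np
import pandas as pd

def novelty_check(events: pd.DataFrame, tail_start: int = 7, tail_end: int = 14):
    """Engagement as a function of days since each user's first exposure.

    Assumes one row per user per active day, with hypothetical columns:
    user_id, date (datetime64), engaged (0/1 or an event count).
    """
    first_seen = events.groupby("user_id")["date"].min().rename("first_seen")
    df = events.join(first_seen, on="user_id")
    df["tenure_days"] = (df["date"] - df["first_seen"]).dt.days

    # Mean engagement at each tenure day, pooling entry cohorts by tenure.
    curve = df.groupby("tenure_days")["engaged"].mean()

    # Slope over the back half of the window: if it's still negative, the
    # curve is decaying rather than stabilizing -- novelty, not utility.
    tail = curve.loc[tail_start:tail_end]
    slope = np.polyfit(tail.index.to_numpy(dtype=float), tail.to_numpy(), 1)[0]
    return curve, slope
```

A flat tail slope is the signal you want before trusting the lift; a negative one says extend the experiment.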

Anchoring Bias: The Order Problem You Can't Randomize Away

Anchoring is one of the most robust biases in human cognition: the first piece of information a person receives disproportionately shapes all subsequent judgments. For AI experiments, this creates a specific failure mode that randomization doesn't fix.

In a standard A/B test, each user is randomly assigned to see either control or treatment. The logic is that if users are randomly split, any systematic biases will be equally distributed across groups, and the difference between groups reflects only the effect of your change. This holds for static UI. It doesn't hold when the artifact being evaluated is a probabilistic text output.

When a user first interacts with your AI feature, they form a baseline expectation for what "AI response quality" means. If treatment users happen to see the AI produce a particularly fluent first response (which is more likely with newer, better-prompted models), they anchor their quality expectations to that output. Every subsequent interaction is judged against that anchor. Users who saw a mediocre first output from the control variant have a lower anchor — so even equivalent outputs from treatment look better by comparison.
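You can get a cheap read on whether this is happening in your own data by checking whether the quality of a user's first response predicts how they rate everything afterward. A heuristic sketch, not a causal test (per-user output quality is autocorrelated), with hypothetical column names:

```python
import pandas as pd

def anchoring_signal(ratings: pd.DataFrame) -> float:
    """Correlation between each user's first-response rating and their
    mean rating of all later responses.

    Assumes hypothetical columns: user_id, ts, score. A near-zero value
    is what you'd hope for; a strongly positive one is consistent with
    the first output anchoring every later judgment.
    """
    ratings = ratings.sort_values(["user_id", "ts"])
    grouped = ratings.groupby("user_id")["score"]
    first = grouped.first()
    later = grouped.apply(lambda s: s.iloc[1:].mean())  # NaN for single-rating users
    return first.corr(later)  # corr() drops NaN pairs automatically
```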

Empirical studies of LLMs have found that anchoring indices remain statistically significant even when researchers instruct models to ignore prior inputs, and even after applying chain-of-thought prompting. Chain-of-thought doesn't reduce anchoring. Explicit anti-anchoring instructions don't reduce anchoring. The bias compounds in human-evaluation settings where users provide ratings or feedback scores, which is exactly how most AI quality experiments are instrumented.

The implications for experiment design:

  • Don't use within-subjects designs where users see both old and new AI outputs sequentially. The order effect is not randomizable.
  • Counterbalance output order in any evaluation study where users judge quality directly: some users see old-then-new, others see new-then-old (a deterministic assignment sketch follows this list).
  • Be suspicious of evaluation scores collected from users who already have strong usage history with your current system — their anchor is high.
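For the counterbalancing itself, a stable hash of the user ID keeps the order assignment deterministic across sessions. A minimal sketch; the experiment name and variant labels are illustrative:

```python
import hashlib

def presentation_order(user_id: str, experiment: str = "quality-eval-v1") -> tuple[str, str]:
    """Deterministic 50/50 split over which variant a rater sees first.

    Hashing user_id together with the experiment name gives each rater a
    stable order, and changing the experiment name reshuffles the split.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return ("old", "new") if digest[0] % 2 == 0 else ("new", "old")
```

At analysis time, compare the treatment-minus-control rating gap within each order arm; if the gap differs materially by order, the anchor, not the model, is driving your headline number.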

Carryover Bias: When the Control Group Is Contaminated

The most insidious of the three biases is carryover — and it's the one most teams don't have a mental model for at all.

Standard A/B testing assumes your control and treatment groups are genuinely independent. For most product experiments, this holds: the fact that a user in the control group used your app last month doesn't systematically contaminate their behavior in the experiment today. But AI features break this assumption in a specific way.

If you ran any AI experiments in the previous six months, some users in your current control group have already been exposed to an AI treatment variant. Their expectations, interaction patterns, and mental model of the feature have been shaped by that prior exposure. When they now interact with the "control" (your baseline non-AI or old-AI behavior), they're evaluating it against an implicit benchmark set by the prior treatment — not the baseline you think you're measuring.

This is most acute in product areas where:

  1. You've already run previous AI experiments and are reusing the same user population
  2. The product has a long session history, meaning users have accumulated substantial AI interaction patterns
  3. You're A/B testing subtle improvements to an existing AI feature rather than launching something new
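The cheapest guard is to screen the candidate population against your past assignment logs before randomizing. A sketch, assuming you keep a record of prior experiment arms; the table and column names are hypothetical:

```python
import pandas as pd

def unexposed_population(candidates: pd.Series,
                         past_assignments: pd.DataFrame,
                         lookback_days: int = 180) -> pd.Series:
    """Drop candidates who saw any AI treatment arm in the lookback window.

    Assumes past_assignments has hypothetical columns:
    user_id, arm, assigned_at (datetime64).
    """
    cutoff = pd.Timestamp.now() - pd.Timedelta(days=lookback_days)
    exposed = past_assignments.loc[
        (past_assignments["arm"] == "treatment")
        & (past_assignments["assigned_at"] >= cutoff),
        "user_id",
    ].unique()
    return candidates[~candidates.isin(exposed)]
```

If excluding exposed users leaves too little sample, stratify on prior exposure instead and report the strata separately; the contamination then shows up as a measurable gap rather than a hidden bias.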