The A/B Testing Trap: Why Standard Experiment Design Fails for AI Features
A team ships an improved LLM prompt. The A/B test runs for two weeks. The metric ticks up 1.2%, p=0.03. They call it a win and roll it out to everyone. Six months later, a customer audit reveals the new prompt had been producing subtly incorrect summaries all along — the kind of semantic drift that click-through rates and session lengths can't see. The A/B test didn't lie exactly. It measured the wrong thing with a methodology that was never designed for what LLMs do.
Standard A/B testing was built for deterministic systems: a button changes color, a page loads faster, a recommendation algorithm shifts a ranking. The output is stable given the same input, variance is small and well-understood, and your sample size calculation from a textbook works. None of those properties hold for LLM-powered features. When teams don't account for this, they're not running experiments — they're generating noise with statistical significance attached.
The Three Assumptions That Break
Classical experiment design rests on three assumptions that LLMs violate simultaneously.
Determinism. In a traditional feature test, showing a user variant B produces the same outcome for every identical input. LLMs don't work this way. Research on inference non-determinism found that even at temperature=0 — the setting teams reach for when they want "reliable" outputs — around 24% of GPT-4o-mini runs produced distinct outputs from the first run on the same prompt. This isn't a rounding error: continuous batching and prefix caching in inference infrastructure introduce randomness that prompt settings cannot eliminate. You're not testing a treatment; you're sampling from a distribution.
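One way to see how much of this applies to your own stack is to replay the same prompt several times and count how often the output differs from the first run. The sketch below assumes a hypothetical `generate` callable that wraps whatever model client you use; it is an illustration of the measurement, not the method used in the research cited above.

```python
from typing import Callable


def divergence_rate(generate: Callable[[str], str], prompt: str, runs: int = 20) -> float:
    """Fraction of repeat runs whose output differs from the first run.

    `generate` is a hypothetical stand-in for your model call
    (for example, a function that sends `prompt` at temperature=0).
    """
    if runs < 2:
        raise ValueError("need at least two runs to compare")
    outputs = [generate(prompt) for _ in range(runs)]
    baseline = outputs[0]
    # Exact string comparison: any difference at all counts as divergence.
    differing = sum(1 for out in outputs[1:] if out != baseline)
    return differing / (runs - 1)
```

Exact string comparison is the strictest possible definition of "distinct". If you only care about meaning-level differences, you would swap the `!=` check for a semantic comparison.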
Homogeneous variance. Standard power calculations assume variance is roughly constant across the range of inputs your treatment will encounter. LLM output variance is heteroskedastic: it scales with task difficulty. Simple lookups show 5–10% output divergence across runs; complex reasoning tasks show 40–60% divergence. Your sample size estimate, built on aggregate variance, will be wildly wrong for the hardest queries — exactly the ones where getting the answer right matters most.
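To check whether aggregate variance is hiding this in your own data, you can compare per-stratum variance against the pooled figure. The following is a minimal sketch; the stratum names and quality scores are invented for illustration and are not measurements.

```python
from statistics import variance


def variance_by_stratum(scores_by_stratum: dict[str, list[float]]) -> dict[str, float]:
    """Sample variance of quality scores within each difficulty stratum."""
    return {name: variance(scores) for name, scores in scores_by_stratum.items()}


def pooled_variance(scores_by_stratum: dict[str, list[float]]) -> float:
    """Sample variance over all scores, ignoring stratum membership."""
    all_scores = [s for scores in scores_by_stratum.values() for s in scores]
    return variance(all_scores)


# Illustrative (made-up) per-response quality scores in [0, 1].
example_scores = {
    "simple_lookup": [0.90, 0.92, 0.91, 0.89, 0.93],
    "complex_reasoning": [0.20, 0.80, 0.50, 0.90, 0.30],
}
per_stratum = variance_by_stratum(example_scores)
pooled = pooled_variance(example_scores)
```

In this toy data the pooled variance sits between the two strata, so a power calculation built on it would overstate the samples needed for simple lookups and understate them for complex reasoning.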
Independence. A/B test units are supposed to be independent: one user's experience shouldn't affect another's, and a user's treatment assignment should stay stable across their interactions. Conversational LLM systems strain both. Turns within a session are not independent observations, because a poor response in turn one shapes what the user asks in turn two. And if your assignment logic isn't deterministic across a whole session, the same user can experience both variants. That contamination violates the stable unit treatment value assumption (SUTVA) and biases your treatment effect estimate.
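A common way to keep assignment stable is to derive the variant from a hash of the user ID rather than from a per-request random draw. This is a minimal sketch; the function name, the experiment key format, and the two-variant default are assumptions made for illustration.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministically map a user to a variant for a given experiment.

    The same (user_id, experiment) pair always yields the same variant,
    so every turn of every session for that user sees one treatment.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as an integer bucket.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]
```

Including the experiment name in the hashed key means a user's bucket in one experiment doesn't determine their bucket in the next. Hashing on user ID rather than session ID or request ID is the design choice that prevents cross-variant exposure.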
The Variance Problem Is Worse Than You Think
Standard A/B tests for deterministic features need large samples, but those requirements scale predictably. For LLM features, the sample size needed for equivalent statistical power is typically 3–5x larger, because required sample size grows in proportion to outcome variance and LLM output variance is high. The math is just as unforgiving on effect size: halving the minimum detectable effect requires four times as many samples. Teams running two-week experiments on LLM features are almost always underpowered unless they've explicitly recalculated required sample sizes against measured LLM variance for their specific task.
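The scaling can be made concrete with the textbook normal-approximation formula for a two-sided, two-sample comparison of means, n per arm = 2(z_{α/2} + z_β)²σ²/δ². The sketch below assumes equal variance and equal allocation across arms, which is a simplification; the function name and defaults are illustrative.

```python
from math import ceil
from statistics import NormalDist


def samples_per_arm(sigma: float, mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate samples needed per arm to detect an absolute effect `mde`.

    sigma: standard deviation of the outcome metric (measure it on your task)
    mde:   minimum detectable effect, in the same units as the metric
    """
    if sigma <= 0 or mde <= 0:
        raise ValueError("sigma and mde must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)


baseline_n = samples_per_arm(sigma=1.0, mde=0.10)
halved_mde_n = samples_per_arm(sigma=1.0, mde=0.05)   # roughly 4x baseline_n
doubled_sigma_n = samples_per_arm(sigma=2.0, mde=0.10)  # also roughly 4x baseline_n
```

Because `sigma` enters squared, doubling the standard deviation costs as much as halving the detectable effect. That is why plugging in a variance estimate borrowed from a deterministic feature leaves an LLM experiment underpowered.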
The latency confound compounds this. A new model version or restructured prompt often trades quality for speed, or vice versa. A 20% improvement in answer quality means little if it comes with a 40% latency increase that drives user abandonment before they even see the response. Teams that optimize a single outcome metric without guardrail metrics (latency, token cost, refusal rate) routinely ship "improvements" that are net negatives.
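A guardrail check can be as simple as refusing to call a win when any secondary metric regresses past a budget. The following sketch uses hypothetical metric names and thresholds, and compares point estimates only; a real decision would also account for the statistical uncertainty in each metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    metric: str                   # metric where higher is worse, e.g. latency
    max_relative_increase: float  # 0.10 means "at most 10% worse than control"


def ship_decision(
    control: dict[str, float],
    treatment: dict[str, float],
    primary: str,
    guardrails: list[Guardrail],
) -> tuple[bool, list[str]]:
    """Return (ship?, violated guardrail metrics) from point estimates."""
    violations = []
    for g in guardrails:
        if control[g.metric] == 0:
            raise ValueError(f"control value for {g.metric} is zero")
        relative_change = (treatment[g.metric] - control[g.metric]) / control[g.metric]
        if relative_change > g.max_relative_increase:
            violations.append(g.metric)
    primary_improved = treatment[primary] > control[primary]
    return primary_improved and not violations, violations


# Illustrative numbers: quality up 20%, latency up 40%.
control = {"quality": 0.50, "latency_ms": 1000.0, "refusal_rate": 0.02}
treatment = {"quality": 0.60, "latency_ms": 1400.0, "refusal_rate": 0.02}
guardrails = [Guardrail("latency_ms", 0.10), Guardrail("refusal_rate", 0.25)]
ship, violated = ship_decision(control, treatment, "quality", guardrails)
```

Here the quality gain is real but the latency guardrail blocks the rollout, which is the outcome a single-metric analysis would miss.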
Surface Metrics Will Lie to You
The more insidious problem is that the outcomes teams measure don't capture what LLMs actually produce.
Consider click-through rate, session length, or thumbs-up feedback: the standard product metrics teams reach for in A/B tests. These metrics can stay flat or improve while the semantic quality of LLM outputs degrades significantly. A shorter, more confident-sounding response might get better engagement even if it's less accurate. A prompt change that produces more consistent outputs might lower ratings if users find the reduced variety boring.
The reverse is equally treacherous. Research documenting a "better prompts hurt" phenomenon found that a generalized "helpful assistant" wrapper prompt degraded task-specific extraction accuracy by 10% and RAG compliance by 13% on Llama 3 — while improving instruction-following scores by 13%. Aggregate metrics would call this a wash. Slice-level analysis by task type revealed the harm. Without semantic-level evaluation broken down by intent and content type, you can't distinguish improvements from tradeoffs.
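Slice-level analysis amounts to computing the treatment effect per task type instead of once over everything. This is a minimal sketch with invented records and slice names, chosen so the aggregate delta cancels out; the numbers are not taken from the research described above.

```python
from collections import defaultdict
from statistics import mean

# Each record: (slice name, variant, quality score). Values are illustrative only.
Record = tuple[str, str, float]


def delta_by_slice(records: list[Record]) -> dict[str, float]:
    """Mean treatment score minus mean control score, per slice."""
    grouped: dict[str, dict[str, list[float]]] = defaultdict(
        lambda: {"control": [], "treatment": []}
    )
    for slice_name, variant, score in records:
        grouped[slice_name][variant].append(score)
    return {
        name: mean(scores["treatment"]) - mean(scores["control"])
        for name, scores in grouped.items()
    }


def overall_delta(records: list[Record]) -> float:
    """The single aggregate delta a naive analysis would report."""
    control = [score for _, variant, score in records if variant == "control"]
    treatment = [score for _, variant, score in records if variant == "treatment"]
    return mean(treatment) - mean(control)


records = [
    ("extraction", "control", 0.80),
    ("extraction", "treatment", 0.70),
    ("instruction_following", "control", 0.60),
    ("instruction_following", "treatment", 0.70),
]
per_slice = delta_by_slice(records)
aggregate = overall_delta(records)
```

In this toy data the aggregate delta is zero while extraction drops and instruction following rises by the same amount, which is exactly the "wash" pattern that hides a tradeoff.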
Traditional string-matching metrics make this worse. Metrics like BLEU penalize semantically correct but differently-phrased outputs. A model that says "We'll process your order today" scores differently than one that says "Your order will be handled immediately" — even though these are equivalent responses. Embedding-based semantic similarity (cosine similarity over output embeddings) is the minimum viable alternative, not a research curiosity.
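The comparison itself is a few lines once you have vectors. The sketch below implements cosine similarity in plain Python; how you obtain the embeddings (which model, which client library) is left open, and the function assumes they arrive as equal-length lists of floats.

```python
from math import sqrt


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

In an experiment, you would embed each variant's output alongside a reference answer and compare similarity scores across arms, so two differently phrased but equivalent responses score close together instead of being penalized for word choice.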
- https://arxiv.org/html/2504.09723v1
- https://arxiv.org/html/2408.04667v5
- https://arxiv.org/html/2601.19934
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://www.braintrust.dev/articles/ab-testing-llm-prompts
- https://www.statsig.com/blog/llm-optimization-online-experimentation
- https://arxiv.org/html/2601.22025v1
- https://arxiv.org/html/2404.08008v1
- https://huggingface.co/blog/g-ronimo/semscore
- https://www.statsig.com/perspectives/shadow-testing-ai-model-evaluation
- https://www.traceloop.com/blog/the-definitive-guide-to-a-b-testing-llm-models-in-production
- https://langfuse.com/docs/prompt-management/features/a-b-testing
