8 posts tagged with "experimentation"

The AI A/B Test That Lied: Novelty, Carryover, and Anchoring Bias in LLM Experiments

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

Your AI feature shipped with confidence. The A/B test showed a statistically significant 12% lift in user engagement. The confidence intervals didn't overlap. The sample size was adequate. The p-value was comfortably under 0.05. Six weeks later, the metric has flatlined back to baseline. Three months in, it's actually below baseline. The experiment told you the feature worked. The experiment lied.

This isn't a bug in your statistical tooling. It's a fundamental mismatch between what standard A/B testing measures and what happens when humans interact with probabilistic AI systems over time. Three specific biases — novelty inflation, anchoring, and carryover — conspire to inflate every AI feature experiment, and the standard remedy of adding a holdout group doesn't fix any of them.
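To make the novelty pattern concrete, here is a minimal sketch of how a lift that decays week over week can still produce a healthy-looking pooled number. The function names, the 0.5 threshold, and every data point are assumptions invented for illustration; nothing here comes from a real experiment.

```python
from statistics import mean

def weekly_relative_lift(control_means, treatment_means):
    """Relative lift of treatment over control, one value per week."""
    return [(t - c) / c for c, t in zip(control_means, treatment_means)]

def lift_is_decaying(lifts, keep_fraction=0.5):
    """True when the final week's lift has fallen below keep_fraction of week one's."""
    return lifts[0] > 0 and lifts[-1] < keep_fraction * lifts[0]

# Synthetic weekly engagement means, invented purely for illustration.
control = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
treatment = [11.8, 11.2, 10.7, 10.3, 10.1, 10.0]

lifts = weekly_relative_lift(control, treatment)
# Pooling all weeks hides the trend that the per-week view exposes.
pooled_lift = mean(treatment) / mean(control) - 1

print([round(x, 3) for x in lifts])
print(round(pooled_lift, 3), lift_is_decaying(lifts))
```

In this toy series the pooled lift stays positive even though the final week shows none, which is why a per-week breakdown is worth reading before the headline number.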

Why AI Features Break A/B Testing (and the Causal Inference Methods That Don't Lie)

May 7, 2026 · 11 min read

Tian Pan

Software Engineer

You ship an AI-powered feature, run a clean two-week A/B test, see a 4% lift in engagement, and call it a win. Six months later, the feature is fully rolled out and engagement is flat or declining. The test wasn't noisy — it was measuring the wrong thing entirely.

![Why AI Features Break A/B Testing (and the Causal Inference Methods That Don't Lie)](https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=Why%20AI%20Features%20Break%20A%2FB%20Testing%20%28and%20the%20Causal%20Inference%20Methods%20That%20Don't%20Lie%29)

A/B tests were built for a world where users in a treatment group and users in a control group are statistically independent. AI features routinely violate that assumption. Users talk to each other, learn from each other's behavior, and share the outputs of AI tools. Treatment effects don't stabilize in two weeks when the real mechanism is long-horizon behavioral adaptation. When you ignore this, your experiment gives you a number that's internally consistent but causally meaningless.

The A/B Testing Trap: Why Standard Experiment Design Fails for AI Features

May 5, 2026 · 8 min read

Tian Pan

Software Engineer

A team ships an improved LLM prompt. The A/B test runs for two weeks. The metric ticks up 1.2%, p=0.03. They call it a win and roll it out to everyone. Six months later, a customer audit reveals the new prompt had been producing subtly incorrect summaries all along — the kind of semantic drift that click-through rates and session lengths can't see. The A/B test didn't lie exactly. It measured the wrong thing with a methodology that was never designed for what LLMs do.

Standard A/B testing was built for deterministic systems: a button changes color, a page loads faster, a recommendation algorithm shifts a ranking. The output is stable given the same input, variance is small and well-understood, and your sample size calculation from a textbook works. None of those properties hold for LLM-powered features. When teams don't account for this, they're not running experiments — they're generating noise with statistical significance attached.

Variance Eats the Experiment: Why A/B Power Math Breaks for LLM Features

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The model team can demo the new feature and show ten convincing wins side by side. The growth team runs it as a two-week A/B test, gets p = 0.31, and the readout says "no significant effect." Both teams are right. The experiment is wrong.

This pattern repeats across every org that has bolted an LLM onto a product without rebuilding its experimentation stack. The math the growth team is using was designed for button colors, ranking changes, and pricing pages — features whose outputs are deterministic given a user and a context. LLM features break the two assumptions that math leans on, and the standard 80%-power, 5%-significance, two-week-ramp template ships systematically wrong calls in both directions: real wins read as null results, and noise reads as confident wins.
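As a rough illustration of how the power math shifts, the sketch below applies the textbook per-arm sample size for a two-sided z-test on a difference in means, then repeats the calculation with extra variance in the treatment arm. The excerpt does not name the two broken assumptions, so treating model randomness as an independent additive variance term is an assumption made here, and `users_per_arm`, `sd_user`, `sd_model`, and `delta` are hypothetical names and values.

```python
from math import ceil, hypot
from statistics import NormalDist

def users_per_arm(delta, sd_control, sd_treatment, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided z-test on a difference in means."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(z ** 2 * (sd_control ** 2 + sd_treatment ** 2) / delta ** 2)

sd_user = 0.5    # hypothetical between-user standard deviation of the metric
sd_model = 0.5   # hypothetical extra spread from non-deterministic model output
delta = 0.02     # hypothetical minimum detectable difference in means

# Classic case: both arms share the same user-level spread.
deterministic = users_per_arm(delta, sd_user, sd_user)

# Assumption: model noise is independent of user noise, so the variances add
# in the treatment arm only.
llm_treatment = users_per_arm(delta, sd_user, hypot(sd_user, sd_model))

print(deterministic, llm_treatment)
```

Under these made-up numbers the required sample grows by roughly half, so a test sized with the deterministic formula would be underpowered for the same effect.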

The Missing Arm: Your AI Experiment Has No 'AI-Off' Control

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

Look at the last six experiment readouts your team shipped on an AI feature. What were the arms? Odds are good you tested "new prompt vs. old prompt," or "GPT-5 router vs. GPT-4 fallback," or "reasoning model vs. fast model," or "with retrieval vs. without retrieval." You reported lift on engagement, task completion, or session length. You called it product impact. A quarter rolled by. Inference spend climbed. Nobody paused to ask the question the CFO eventually will: what would have happened if the feature simply weren't there?

That question is the missing arm. The lift your experiments keep measuring is "better AI vs. worse AI," but the one your business runs on is "AI vs. nothing" — or more uncomfortably, "AI vs. the three-line heuristic we never wrote down." These are different experiments with different conclusions, and most AI product programs in 2026 have only ever run the first one. The second is the one that tells you whether the feature is earning its inference bill.
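One way to reserve that arm is to give a small, stable slice of users no AI feature at all. The sketch below uses deterministic hash bucketing; the arm names, the 10/45/45 split, and the experiment key are all hypothetical, and this is one common assignment pattern rather than the method the post itself prescribes.

```python
import hashlib

# Hypothetical arms and traffic shares; the shares should sum to 1.0.
ARMS = [("ai_off", 0.10), ("old_prompt", 0.45), ("new_prompt", 0.45)]

def assign_arm(user_id: str, experiment: str = "summarizer-v2") -> str:
    """Map a user to an arm deterministically, so assignment survives restarts."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2 ** 32  # roughly uniform in [0, 1)
    cumulative = 0.0
    for arm, share in ARMS:
        cumulative += share
        if bucket < cumulative:
            return arm
    return ARMS[-1][0]  # guard against floating-point shortfall in the shares

print(assign_arm("user-123"))
```

Comparing `new_prompt` against `old_prompt` answers the "better AI vs. worse AI" question, while comparing either against `ai_off` answers whether the feature earns its inference bill.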

A/B Testing AI Features When the Treatment Is Non-Deterministic

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your team ships a new LLM-powered feature, runs a clean A/B test for two weeks, and sees a statistically significant improvement. You roll it out. Three weeks later, retention metrics are flat and support tickets are up. What went wrong? You ran a textbook experiment on a non-textbook treatment — and the textbook assumption that "the treatment is stable" broke silently.

Standard A/B testing was designed for deterministic or near-deterministic treatments: a button color change, a ranking algorithm with fixed parameters, a checkout flow. LLM features violate almost every assumption that makes classical frequentist experiments reliable. The treatment variance is high, the treatment itself mutates mid-experiment when providers push model updates, success is hard to operationalize, and novelty effects are strong enough to produce results that evaporate after users adapt.

This post is about the adjustments that make experimentation work anyway.

Why A/B Tests Fail for AI Features (And What to Use Instead)

April 15, 2026 · 9 min read

Tian Pan

Software Engineer

Your AI feature shipped. The A/B test ran for two weeks. The treatment group looks better — 4% lift in engagement, p-value under 0.05. You ship it to everyone.

Six weeks later, the gains have evaporated. Engagement is back where it started, or lower. Your experiment said one thing; reality said another.

This is not a corner case. It is the default outcome when you apply standard two-sample A/B testing to AI-powered features without accounting for the ways these features break the assumptions baked into that methodology. The failure modes are structural, not statistical — you can run your experiment perfectly by the textbook and still get a wrong answer.

A/B Testing Non-Deterministic AI Features: Why Your Experimentation Framework Assumes the Wrong Null Hypothesis

April 12, 2026 · 10 min read

Tian Pan

Software Engineer

Your A/B testing framework was built for a world where the same input produces the same output. Change a button color, measure click-through rate, compute a p-value. The variance comes from user behavior, not from the feature itself. But when you ship an AI feature — a chatbot, a summarizer, a code assistant — the treatment arm has its own built-in randomness. Run the same prompt twice, get two different answers. Your experimentation infrastructure was never designed for this, and the consequences are worse than you think.

Most teams discover the problem the hard way: experiments that never reach significance, or worse, experiments that reach significance on noise. The standard A/B testing playbook doesn't just underperform with non-deterministic features — it actively misleads.
