
5 posts tagged with "experimentation"


Variance Eats the Experiment: Why A/B Power Math Breaks for LLM Features

11 min read
Tian Pan
Software Engineer

The model team can demo the new feature and show ten convincing wins side by side. The growth team runs it as a two-week A/B test, gets p = 0.31, and the readout says "no significant effect." Both teams are right. The experiment is wrong.

This pattern repeats across every org that has bolted an LLM onto a product without rebuilding its experimentation stack. The math the growth team is using was designed for button colors, ranking changes, and pricing pages — features whose outputs are deterministic given a user and a context. LLM features break the two assumptions that math leans on, and the standard 80%-power, 5%-significance, two-week-ramp template ships systematically wrong calls in both directions: real wins read as null results, and noise reads as confident wins.
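To see the size of the problem, here is a back-of-envelope sketch using the standard normal-approximation sample-size formula. The variance numbers are purely illustrative; the point is the ratio, since generation-level noise adds a variance term the classic template never budgets for.

```python
import math
from scipy.stats import norm

def n_per_arm(sigma_sq, delta, alpha=0.05, power=0.80):
    """Classic two-sample size for a difference in means (normal approximation):
    n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma_sq / delta ** 2)

# Deterministic feature: metric variance comes from users alone (illustrative units).
print(n_per_arm(sigma_sq=1.0, delta=0.05))        # ~6,280 users per arm

# LLM feature: generation randomness contributes its own variance term,
# so the same template quietly under-powers the test.
print(n_per_arm(sigma_sq=1.0 + 1.5, delta=0.05))  # ~15,698 users per arm
```

Everything downstream of that n, the runtime estimate, the ramp plan, the minimum detectable effect, inherits the inflation.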

The Missing Arm: Your AI Experiment Has No 'AI-Off' Control

9 min read
Tian Pan
Software Engineer

Look at the last six experiment readouts your team shipped on an AI feature. What were the arms? Odds are good you tested "new prompt vs. old prompt," or "GPT-5 router vs. GPT-4 fallback," or "reasoning model vs. fast model," or "with retrieval vs. without retrieval." You reported lift on engagement, task completion, or session length. You called it product impact. A quarter rolled by. Inference spend climbed. Nobody paused to ask the question the CFO eventually will: what would have happened if the feature simply weren't there?

That question is the missing arm. The lift your experiments keep measuring is "better AI vs. worse AI," but the one your business runs on is "AI vs. nothing" — or more uncomfortably, "AI vs. the three-line heuristic we never wrote down." These are different experiments with different conclusions, and most AI product programs in 2026 have only ever run the first one. The second is the one that tells you whether the feature is earning its inference bill.
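A concrete way to picture the missing arm: carve out a small true holdout alongside the heuristic baseline. The sketch below is hypothetical (arm names, weights, and the hash-bucketing scheme are all illustrative, not tied to any particular framework), but it shows how little plumbing the second experiment actually requires.

```python
import hashlib

# Illustrative arm layout: a true "AI-off" holdout plus the heuristic
# baseline, next to the production AI path. Names and weights are hypothetical.
ARM_WEIGHTS = [("ai_off", 0.10), ("heuristic_baseline", 0.10), ("ai_on", 0.80)]

def assign_arm(user_id: str, experiment: str = "ai_value_holdout") -> str:
    """Deterministic hash bucketing: a user lands in one arm and stays there."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    cumulative = 0.0
    for arm, weight in ARM_WEIGHTS:
        cumulative += weight
        if point < cumulative:
            return arm
    return ARM_WEIGHTS[-1][0]  # guard against float rounding

print(assign_arm("user_42"))
```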

A/B Testing AI Features When the Treatment Is Non-Deterministic

10 min read
Tian Pan
Software Engineer

Your team ships a new LLM-powered feature, runs a clean A/B test for two weeks, and sees a statistically significant improvement. You roll it out. Three weeks later, retention metrics are flat and support tickets are up. What went wrong? You ran a textbook experiment on a non-textbook treatment — and the textbook assumption that "the treatment is stable" broke silently.

Standard A/B testing was designed for deterministic or near-deterministic treatments: a button color change, a ranking algorithm with fixed parameters, a checkout flow. LLM features violate almost every assumption that makes classical frequentist experiments reliable. The treatment variance is high, the treatment itself mutates mid-experiment when providers push model updates, success is hard to operationalize, and novelty effects are strong enough to produce results that evaporate after users adapt.
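The mid-experiment mutation is the quietest failure of the four. One hedged sketch of a countermeasure: log enough provenance per exposure that a provider-side model swap is at least detectable after the fact. Field names here are illustrative; map them to whatever your provider's responses actually carry.

```python
import datetime
import json

def log_exposure(user_id: str, arm: str, response_meta: dict) -> str:
    """Per-exposure provenance record. Field names are illustrative;
    pull the real ones from your provider's response object."""
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user_id": user_id,
        "arm": arm,
        "model": response_meta.get("model"),  # resolved model snapshot, if exposed
        "system_fingerprint": response_meta.get("system_fingerprint"),
        "temperature": response_meta.get("temperature"),
    }
    return json.dumps(record)

# If the distribution of "model" or "system_fingerprint" shifts partway
# through the run, the treatment changed under you.
print(log_exposure("user_42", "ai_on",
                   {"model": "gpt-4o-2024-08-06", "system_fingerprint": "fp_abc123"}))
```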

This post is about the adjustments that make experimentation work anyway.

Why A/B Tests Fail for AI Features (And What to Use Instead)

9 min read
Tian Pan
Software Engineer

Your AI feature shipped. The A/B test ran for two weeks. The treatment group looks better — 4% lift in engagement, p-value under 0.05. You ship it to everyone.

Six weeks later, the gains have evaporated. Engagement is back where it started, or lower. Your experiment said one thing; reality said another.

This is not a corner case. It is the default outcome when you apply standard two-sample A/B testing to AI-powered features without accounting for the ways these features break the assumptions baked into that methodology. The failure modes are structural, not statistical — you can run your experiment perfectly by the textbook and still get a wrong answer.

A/B Testing Non-Deterministic AI Features: Why Your Experimentation Framework Assumes the Wrong Null Hypothesis

10 min read
Tian Pan
Software Engineer

Your A/B testing framework was built for a world where the same input produces the same output. Change a button color, measure click-through rate, compute a p-value. The variance comes from user behavior, not from the feature itself. But when you ship an AI feature — a chatbot, a summarizer, a code assistant — the treatment arm has its own built-in randomness. Run the same prompt twice, get two different answers. Your experimentation infrastructure was never designed for this, and the consequences are worse than you think.

Most teams discover the problem the hard way: experiments that never reach significance, or worse, experiments that reach significance on noise. The standard A/B testing playbook doesn't just underperform with non-deterministic features — it actively misleads.
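A quick simulation makes the "never reach significance" half concrete. Assume, purely for illustration, that each user's metric is unit-variance user noise plus generation noise from the model's own randomness; the same true lift, tested the same way, goes from usually detected to usually missed.

```python
import numpy as np

rng = np.random.default_rng(0)

def detection_rate(gen_sd, true_lift=0.10, n=2000, trials=2000):
    """Share of simulated experiments reaching p < 0.05 (two-sided z-test)
    when generation noise of scale gen_sd sits on top of unit user noise."""
    hits = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, n) + rng.normal(0.0, gen_sd, n)
        treated = rng.normal(true_lift, 1.0, n) + rng.normal(0.0, gen_sd, n)
        se = np.sqrt(control.var(ddof=1) / n + treated.var(ddof=1) / n)
        hits += abs((treated.mean() - control.mean()) / se) > 1.96
    return hits / trials

print(detection_rate(gen_sd=0.0))  # ~0.88: a real lift is usually detected
print(detection_rate(gen_sd=2.0))  # ~0.29: same lift, mostly missed
```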