4 posts tagged with "statistics"

View all tags

Variance Eats the Experiment: Why A/B Power Math Breaks for LLM Features

· 11 min read
Tian Pan
Software Engineer

The model team demos the new feature and shows ten convincing wins side by side. The growth team runs it as a two-week A/B test, gets p = 0.31, and the readout says "no significant effect." Both teams are right. The experiment is wrong.

This pattern repeats across every org that has bolted an LLM onto a product without rebuilding its experimentation stack. The math the growth team is using was designed for button colors, ranking changes, and pricing pages — features whose outputs are deterministic given a user and a context. LLM features break the two assumptions that math leans on, and the standard 80%-power, 5%-significance, two-week-ramp template ships systematically wrong calls in both directions: real wins read as null results, and noise reads as confident wins.
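As a rough illustration of the false-negative direction (my own numbers, not from the post), the classic two-sample power formula shows how per-response randomness from the LLM inflates the users-per-arm the standard template calls for:

```python
# Minimal sketch: required sample size under the 80%-power / 5%-significance
# template, as non-deterministic outputs add variance. Numbers are assumed.
from scipy.stats import norm

alpha, power = 0.05, 0.80
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)      # ~2.80

lift = 0.05        # true effect in outcome units (illustrative)
user_sd = 1.0      # outcome SD from user behavior alone (illustrative)

for llm_sd in (0.0, 0.5, 1.0):                     # extra SD from LLM randomness
    total_var = user_sd**2 + llm_sd**2
    n_per_arm = 2 * z**2 * total_var / lift**2     # classic two-sample formula
    print(f"LLM-added SD {llm_sd:.1f} -> ~{n_per_arm:,.0f} users per arm")
```

Every unit of variance the model adds is more users, or more weeks, that the two-week template never budgeted for.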

A/B Testing AI Features When the Treatment Is Non-Deterministic

· 10 min read
Tian Pan
Software Engineer

Your team ships a new LLM-powered feature, runs a clean A/B test for two weeks, and sees a statistically significant improvement. You roll it out. Three weeks later, retention metrics are flat and support tickets are up. What went wrong? You ran a textbook experiment on a non-textbook treatment — and the textbook assumption that "the treatment is stable" broke silently.

Standard A/B testing was designed for deterministic or near-deterministic treatments: a button color change, a ranking algorithm with fixed parameters, a checkout flow. LLM features violate almost every assumption that makes classical frequentist experiments reliable. The treatment variance is high, the treatment itself mutates mid-experiment when providers push model updates, success is hard to operationalize, and novelty effects are strong enough to produce results that evaporate after users adapt.

This post is about the adjustments that make experimentation work anyway.

Your LLM Eval Is Lying to You: The Statistical Power Problem

· 9 min read
Tian Pan
Software Engineer

Your team spent three days iterating on a system prompt. The eval score went from 82% to 85%. You ship it. Three weeks later, production metrics are flat. What happened?

The short answer: your eval lied to you. Not through malice, but through insufficient sample size and ignored variance. A 3-point accuracy lift on a 100-example test set is well within the noise floor of most LLM systems. You cannot tell signal from randomness at that scale — but almost no one does the math to verify this before acting on results.
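A quick back-of-the-envelope check (standard binomial math, my own illustration) shows why: the sampling error on a 100-example accuracy score is about as large as the lift being celebrated.

```python
# Noise floor of an accuracy score on n = 100 examples, treating each
# example as an independent Bernoulli trial.
import math
from scipy.stats import norm

n = 100
p_old, p_new = 0.82, 0.85

se_new = math.sqrt(p_new * (1 - p_new) / n)                       # ~0.036 -> ±3.6 points
se_diff = math.sqrt(p_old * (1 - p_old) / n + p_new * (1 - p_new) / n)  # ~0.052

z = (p_new - p_old) / se_diff
p_value = 2 * (1 - norm.cdf(abs(z)))
print(f"single-score SE ≈ {se_new:.3f}, z ≈ {z:.2f}, p ≈ {p_value:.2f}")
# -> z ≈ 0.57, p ≈ 0.57: the 82% -> 85% "win" is indistinguishable from noise.
```

Under these same independent-samples assumptions, detecting a 3-point gap with 80% power takes on the order of a couple thousand examples, not 100.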

This is the statistical power problem in LLM evaluation, and it is quietly corrupting the iteration loops of most teams building AI products.

A/B Testing Non-Deterministic AI Features: Why Your Experimentation Framework Assumes the Wrong Null Hypothesis

· 10 min read
Tian Pan
Software Engineer

Your A/B testing framework was built for a world where the same input produces the same output. Change a button color, measure click-through rate, compute a p-value. The variance comes from user behavior, not from the feature itself. But when you ship an AI feature — a chatbot, a summarizer, a code assistant — the treatment arm has its own built-in randomness. Run the same prompt twice, get two different answers. Your experimentation infrastructure was never designed for this, and the consequences are worse than you think.

Most teams discover the problem the hard way: experiments that never reach significance, or worse, experiments that reach significance on noise. The standard A/B testing playbook doesn't just underperform with non-deterministic features — it actively misleads.
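One concrete way the second failure mode can show up, sketched here with assumed numbers: if individual LLM responses are analyzed as independent observations when they are actually clustered within users, a test with no true effect "detects" one far more often than the nominal 5%.

```python
# Toy simulation: response-level t-test that ignores user-level clustering,
# with zero true treatment effect. All parameters are illustrative.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
N_SIMS, N_USERS, RESPONSES_PER_USER = 2000, 200, 10

def simulate_arm():
    """One arm with no true effect: user-level variation plus
    per-response LLM randomness, flattened to response level."""
    user_means = rng.normal(0.0, 1.0, size=N_USERS)
    response_noise = rng.normal(0.0, 0.5, size=(N_USERS, RESPONSES_PER_USER))
    return (user_means[:, None] + response_noise).ravel()

false_positives = sum(
    ttest_ind(simulate_arm(), simulate_arm()).pvalue < 0.05
    for _ in range(N_SIMS)
)
print(f"false-positive rate ≈ {false_positives / N_SIMS:.2f} (nominal: 0.05)")
```

Aggregating to one observation per user before testing brings the rate back to nominal in this toy setup.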