
3 posts tagged with "ab-testing"


Variance Eats the Experiment: Why A/B Power Math Breaks for LLM Features

· 11 min read
Tian Pan
Software Engineer

The model team can demo the new feature and show ten convincing wins side by side. The growth team runs it as a two-week A/B test, gets p = 0.31, and the readout says "no significant effect." Both teams are right. The experiment is wrong.

This pattern repeats across every org that has bolted an LLM onto a product without rebuilding its experimentation stack. The math the growth team is using was designed for button colors, ranking changes, and pricing pages — features whose outputs are deterministic given a user and a context. LLM features break the two assumptions that math leans on, and the standard 80%-power, 5%-significance, two-week-ramp template ships systematically wrong calls in both directions: real wins read as null results, and noise reads as confident wins.
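
To make the variance point concrete, here is a minimal sketch of the standard two-sample sample-size formula with and without the model's own generation variance folded in. The values for user variance, generation variance, and the minimum detectable effect are illustrative assumptions, not figures from the post.

```python
# Minimal sketch: per-arm sample size from the usual two-sided z-approximation.
# sigma_user, sigma_gen, and delta below are illustrative assumptions.
from scipy.stats import norm

def n_per_arm(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm n for a two-sample test at the given significance and power."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return int(round(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2))

sigma_user = 1.0   # spread driven by user behavior: the only term the template budgets for
sigma_gen = 1.0    # extra spread from sampling the model itself (assumed, for illustration)
delta = 0.1        # minimum detectable effect, in the metric's own units

print(n_per_arm(sigma_user, delta))                             # what the two-week ramp was sized for
print(n_per_arm((sigma_user**2 + sigma_gen**2) ** 0.5, delta))  # what the LLM feature actually needs
```

Under these assumed variances the required sample size doubles, which is one way a ramp that was 80%-powered on paper ends up reading a real win as p = 0.31.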

When Your Evals Disagree: A Signal Hierarchy for the Week the Numbers Contradict Each Other

· 12 min read
Tian Pan
Software Engineer

It's Tuesday morning, the week after a prompt change shipped to half your traffic. You open four dashboards. The held-out golden set scored by the LLM judge says +8%. The human-rater panel that samples production weekly says no change. The A/B test on downstream conversion says −2%. The thumbs-up rate is flat. Four signals, four verdicts, and a standup in fifteen minutes where someone is going to ask whether you ship the prompt or roll it back.

The temptation is to pick the number that confirms what you already wanted to do — and the team will, because nobody on the call has a written rule for which signal wins. The disagreement isn't a measurement bug. It's the predictable output of a system that bolted four evaluators together without a hierarchy, and the cost of not having one is that every release week becomes a debate about whose number to trust.
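
For a sense of what a written rule might look like, here is a hypothetical sketch in code. The signal names mirror the four dashboards above, but the ordering and thresholds are assumptions for illustration, not the hierarchy the post itself argues for.

```python
# Hypothetical sketch of a written precedence rule. The ordering and thresholds
# are illustrative assumptions, not the hierarchy proposed in the post.
from dataclasses import dataclass

@dataclass
class Readout:
    judge_golden_set: float    # LLM-judge delta on the held-out golden set (e.g. +0.08)
    human_panel: float | None  # weekly human-rater delta; None if the panel hasn't run
    ab_conversion: float       # downstream conversion delta from the A/B test
    ab_significant: bool       # whether the A/B delta cleared the significance bar
    thumbs_up: float           # in-product feedback delta

def decide(r: Readout) -> str:
    # 1. A significant regression in the business metric overrides everything upstream.
    if r.ab_significant and r.ab_conversion < 0:
        return "roll back"
    # 2. A significant win in the business metric ships, whatever the proxies say.
    if r.ab_significant and r.ab_conversion > 0:
        return "ship"
    # 3. With no significant A/B signal, calibrated humans outrank the LLM judge.
    if r.human_panel is not None and r.human_panel != 0:
        return "ship" if r.human_panel > 0 else "roll back"
    # 4. The offline judge and thumbs rate never decide a ship on their own.
    return "hold and extend"

# Tuesday's numbers from the scenario above.
print(decide(Readout(judge_golden_set=0.08, human_panel=0.0,
                     ab_conversion=-0.02, ab_significant=False, thumbs_up=0.0)))
```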

A/B Testing Non-Deterministic AI Features: Why Your Experimentation Framework Assumes the Wrong Null Hypothesis

· 10 min read
Tian Pan
Software Engineer

Your A/B testing framework was built for a world where the same input produces the same output. Change a button color, measure click-through rate, compute a p-value. The variance comes from user behavior, not from the feature itself. But when you ship an AI feature — a chatbot, a summarizer, a code assistant — the treatment arm has its own built-in randomness. Run the same prompt twice, get two different answers. Your experimentation infrastructure was never designed for this, and the consequences are worse than you think.

Most teams discover the problem the hard way: experiments that never reach significance, or worse, experiments that reach significance on noise. The standard A/B testing playbook doesn't just underperform with non-deterministic features — it actively misleads.
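
A minimal simulation of that failure mode, under assumed numbers (the lift, per-user variance, and per-response generation variance are all illustrative): the same true lift that a deterministic feature surfaces comfortably gets buried once the treatment arm carries its own sampling noise.

```python
# Minimal simulation: generation noise in the treatment arm inflates variance and
# hides a real lift. All effect sizes and variances are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000           # users per arm over the ramp (assumed)
lift = 0.05         # the real average improvement (assumed)
sigma_user = 1.0    # spread from user behavior, the only term the playbook expects
sigma_gen = 2.0     # extra per-response spread from sampling the model (assumed)

control = rng.normal(0.0, sigma_user, n)

# The feature the playbook was built for: same input, same output, variance from users only.
deterministic_treatment = rng.normal(lift, sigma_user, n)

# The LLM feature: every exposure also draws fresh generation noise.
llm_treatment = rng.normal(lift, sigma_user, n) + rng.normal(0.0, sigma_gen, n)

def t_stat(a: np.ndarray, b: np.ndarray) -> float:
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (b.mean() - a.mean()) / se

print(t_stat(control, deterministic_treatment))  # expected around 2.5: the lift clears the bar
print(t_stat(control, llm_treatment))            # expected around 1.4: the same lift reads as a null result
```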