
Your LLM Eval Is Lying to You: The Statistical Power Problem

· 9 min read
Tian Pan
Software Engineer

Your team spent three days iterating on a system prompt. The eval score went from 82% to 85%. You ship it. Three weeks later, production metrics are flat. What happened?

The short answer: your eval lied to you. Not through malice, but through insufficient sample size and ignored variance. A 3-point accuracy lift on a 100-example test set is well within the noise floor of most LLM systems. You cannot tell signal from randomness at that scale — but almost no one does the math to verify this before acting on results.

This is the statistical power problem in LLM evaluation, and it is quietly corrupting the iteration loops of most teams building AI products.

Why Most Eval Suites Are Too Small to Be Useful

Statistical power is the probability that your test will detect a real effect when one actually exists. Power depends on three things: sample size, the size of the effect you're trying to detect, and the variance of your measurements.

The standard target for power is 80%, meaning you accept a 1-in-5 chance of missing a real improvement. To hit 80% power at a 5% significance level, the required sample size grows with variance and with the inverse square of the effect size: halving the effect you want to detect quadruples the samples you need.

Here's a concrete calculation, with a code sketch after the list. Suppose your baseline accuracy is 82% and you want to detect a real improvement to 85%, a 3-point lift. Using a two-proportion z-test with α=0.05 and 80% power:

  • Variance for p=0.82: 0.82 × 0.18 ≈ 0.148
  • Variance for p=0.85: 0.85 × 0.15 ≈ 0.128
  • Required n per group ≈ 2,400 examples
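
A minimal sketch of that arithmetic, using the standard pooled-variance sample-size formula for a two-proportion z-test (scipy's `norm.ppf` supplies the critical values):

```python
from scipy.stats import norm

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Required examples per variant for a two-proportion z-test.

    n = [z_(1-alpha/2) * sqrt(2 * p_bar * (1 - p_bar))
         + z_power * sqrt(p1*(1-p1) + p2*(1-p2))]**2 / (p1 - p2)**2
    """
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_power = norm.ppf(power)          # 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    h0_term = z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
    h1_term = z_power * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5
    return (h0_term + h1_term) ** 2 / (p1 - p2) ** 2

print(n_per_group(0.82, 0.85))  # ≈ 2402 examples per variant
```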

You need roughly 2,400 labeled examples per model variant to reliably detect a 3-point gain. The average internal eval suite — 50 to 200 examples hand-curated during initial development — isn't even in the right ballpark. At 100 examples, your minimum detectable effect at 80% power is closer to 10–12 points, not 3.
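
To see where that minimum detectable effect comes from, hold the budget at 100 examples and grow the lift until the test is adequately powered. A brute-force sketch reusing `n_per_group` from the snippet above:

```python
# Smallest lift detectable at 80% power with only 100 examples per variant.
lift = 0.01
while n_per_group(0.82, 0.82 + lift) > 100:
    lift += 0.001
print(f"minimum detectable lift at n=100: {lift:.3f}")
# ≈ 0.13 two-sided; a one-sided test lands near 0.11
```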

This isn't a minor calibration issue. Teams regularly ship prompt changes based on 3-point eval improvements that are pure noise, while simultaneously failing to detect 8-point degradations because they only checked 80 examples.

Non-Determinism Inflates the Problem

Even if you accepted "we'll just use a few hundred examples," LLM outputs introduce a second source of variance that most teams ignore entirely: the model itself is non-deterministic.

Run the same input through a cloud-hosted LLM twice with identical parameters. You will not always get the same output. This happens because of probabilistic token sampling, floating-point arithmetic differences across distributed inference hardware, and indeterminate execution ordering in parallel systems. One study found statistically significant performance differences (p=0.013) between the same model hosted on two different cloud providers — identical model weights, different infrastructure, measurably different benchmark results.
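
You can verify this on your own stack with a handful of repeated calls. A sketch, where `call_model(prompt)` is a hypothetical stand-in for whatever inference client you use:

```python
from collections import Counter

def output_distribution(call_model, prompt, k=20):
    """Re-run one prompt k times and tally the distinct outputs.

    `call_model` is a placeholder for your inference client, invoked with
    identical parameters each time. A deterministic system would return a
    single (output, k) entry; most hosted LLMs will not.
    """
    return Counter(call_model(prompt) for _ in range(k)).most_common()
```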

This creates a two-component variance structure:

  • Between-question variance: some questions are harder than others; your 100-example sample may skew easy or hard by chance
  • Within-question variance: the same question, asked repeatedly, gets different answers from the same model

Both components inflate your measurement uncertainty. The practical consequence is that a single-pass eval score — run once, reported as a number — contains far more noise than the precision of "85.0%" implies. Reporting "accuracy: 85%" is like reporting a temperature to three decimal places using a mercury thermometer.
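
A small simulation makes the decomposition concrete. The data below are synthetic (per-question pass rates drawn from a Beta distribution, eight repeats per question, all sizes illustrative), but the same two lines of numpy apply to a real questions-by-repeats matrix of 0/1 scores:

```python
import numpy as np

rng = np.random.default_rng(0)
difficulty = rng.beta(2, 1, size=(100, 1))           # per-question pass rates
scores = rng.binomial(1, difficulty, size=(100, 8))  # 100 questions x 8 repeats

per_question = scores.mean(axis=1)
between_var = per_question.var(ddof=1)          # some questions are just harder
within_var = scores.var(axis=1, ddof=1).mean()  # same question, different answers

print(f"between-question variance: {between_var:.3f}")
print(f"within-question variance:  {within_var:.3f}")
```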

What You're Actually Doing When You Compare Two Models

When teams compare a baseline model to a candidate, they typically run both through the same eval set and compare the resulting accuracy numbers. If the candidate beats the baseline by 3 points, the candidate wins.

This is not a hypothesis test. It is a comparison of two noisy point estimates. Without confidence intervals, you have no information about whether the observed difference exceeds the noise floor.
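
The interval arithmetic shows how wide that noise floor is. A sketch of the 95% normal-approximation interval on the difference, at 100 examples per model on independent samples:

```python
import math

def diff_ci95(acc_a, acc_b, n):
    """95% normal-approximation CI on acc_b - acc_a (independent samples, size n each)."""
    se = math.sqrt(acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n)
    diff = acc_b - acc_a
    return diff - 1.96 * se, diff + 1.96 * se

print(diff_ci95(0.82, 0.85, 100))  # ≈ (-0.07, +0.13): the interval swallows the lift
```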

A better approach is paired analysis. When both models evaluate identical questions, you can analyze the per-question differences rather than aggregate scores. This exploits the fact that models tend to agree on question difficulty — if both models struggle on the hard questions and succeed on the easy ones, the covariance between their scores provides "free" variance reduction. Paired analysis on the same question set is systematically more sensitive than two-sample analysis on independent samples.
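
A sketch of the paired version on synthetic data, where both models share per-question difficulty; scipy's `ttest_rel` runs the paired test (McNemar's test on the discordant pairs is the standard alternative for 0/1 scores):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
difficulty = rng.beta(2, 1, size=500)  # shared per-question difficulty
baseline = rng.binomial(1, difficulty)
candidate = rng.binomial(1, np.clip(difficulty + 0.03, 0, 1))

diffs = candidate - baseline  # per-question differences, in {-1, 0, +1}
stat, p_value = stats.ttest_rel(candidate, baseline)
print(f"mean lift: {diffs.mean():+.3f}, p = {p_value:.3f}")
```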

But even paired analysis requires enough examples. The paired approach reduces variance; it doesn't eliminate the need for statistical power. For the 3-point lift detection scenario, a well-executed paired test might cut the required sample size in half — to roughly 1,200. Still not 100.
