
Your LLM Eval Is Lying to You: The Statistical Power Problem

9 min read
Tian Pan
Software Engineer

Your team spent three days iterating on a system prompt. The eval score went from 82% to 85%. You ship it. Three weeks later, production metrics are flat. What happened?

The short answer: your eval lied to you. Not through malice, but through insufficient sample size and ignored variance. A 3-point accuracy lift on a 100-example test set is well within the noise floor of most LLM systems. You cannot tell signal from randomness at that scale — but almost no one does the math to verify this before acting on results.

This is the statistical power problem in LLM evaluation, and it is quietly corrupting the iteration loops of most teams building AI products.

Why Most Eval Suites Are Too Small to Be Useful

Statistical power is the probability that your test will detect a real effect when one actually exists. Power depends on three things: sample size, the size of the effect you're trying to detect, and the variance of your measurements.

The standard target for power is 80% — meaning you accept a 1-in-5 chance of missing a real improvement. To hit 80% power at a 5% significance level, the required sample size grows with variance and shrinks inversely with the square of the effect size. Halving the effect you want to detect quadruples the samples you need.

Here's a concrete calculation. Suppose your baseline accuracy is 82% and you want to detect a real improvement to 85% — a 3-point lift. Using a two-proportion z-test with α=0.05 and 80% power:

  • Variance for p=0.82: 0.82 × 0.18 ≈ 0.148
  • Variance for p=0.85: 0.85 × 0.15 ≈ 0.128
  • Required n per group ≈ 2,400 examples

You need roughly 2,400 labeled examples per model variant to reliably detect a 3-point gain. The average internal eval suite — 50 to 200 examples hand-curated during initial development — isn't even in the right ballpark. At 100 examples, your minimum detectable effect at 80% power is closer to 10–12 points, not 3.
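You can reproduce this calculation in a few lines of Python. This is a sketch of the standard two-proportion sample-size formula using only the standard library; the exact numbers depend slightly on which approximation you use, so treat the output as a ballpark rather than a precise requirement:

```python
from statistics import NormalDist

def required_n(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for a two-sided two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p1 - p2) ** 2) + 1  # round up

print(required_n(0.82, 0.85))  # roughly 2,400 examples per variant
```

Plugging in smaller effects makes the inverse-square relationship concrete: a 6-point lift needs about a quarter of the samples a 3-point lift does.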

This isn't a minor calibration issue. Teams regularly ship prompt changes based on 3-point eval improvements that are pure noise, while simultaneously failing to detect 8-point degradations because they only checked 80 examples.

Non-Determinism Inflates the Problem

Even if you accepted "we'll just use a few hundred examples," LLM outputs introduce a second source of variance that most teams ignore entirely: the model itself is non-deterministic.

Run the same input through a cloud-hosted LLM twice with identical parameters. You will not always get the same output. This happens because of probabilistic token sampling, floating-point arithmetic differences across distributed inference hardware, and indeterminate execution ordering in parallel systems. One study found statistically significant performance differences (p=0.013) between the same model hosted on two different cloud providers — identical model weights, different infrastructure, measurably different benchmark results.

This creates a two-component variance structure:

  • Between-question variance: some questions are harder than others; your 100-example sample may skew easy or hard by chance
  • Within-question variance: the same question, asked repeatedly, gets different answers from the same model

Both components inflate your measurement uncertainty. The practical consequence is that a single-pass eval score — run once, reported as a number — contains far more noise than the precision of "85.0%" implies. Reporting "accuracy: 85%" is like reporting a temperature to three decimal places using a mercury thermometer.
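A small simulation makes the noise floor visible. Everything here is synthetic: each question gets a fixed hypothetical pass probability, standing in for within-question non-determinism, and we re-run the same single-pass eval several times:

```python
import random
from statistics import mean, stdev

random.seed(2)

# Synthetic setup: 100 questions, each with a fixed pass probability for the
# same model (a stand-in for within-question non-determinism).
pass_prob = [random.betavariate(8, 2) for _ in range(100)]  # mean pass rate ~0.8

def one_eval_run() -> float:
    """One single-pass eval: draw one outcome per question, report accuracy."""
    return mean(1 if random.random() < p else 0 for p in pass_prob)

scores = [one_eval_run() for _ in range(10)]
print([f"{s:.2f}" for s in scores])                 # same model, same questions
print(f"run-to-run std dev: {stdev(scores):.3f}")   # spread of single-pass scores
```

Nothing about the model or the questions changed between runs; only the sampling did. Any one of those scores, reported alone as "accuracy," would overstate its own precision.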

What You're Actually Doing When You Compare Two Models

When teams compare a baseline model to a candidate, they typically run both through the same eval set and compare the resulting accuracy numbers. If candidate beats baseline by 3 points, candidate wins.

This is not a hypothesis test. It is a comparison of two noisy point estimates. Without confidence intervals, you have no information about whether the observed difference exceeds the noise floor.

A better approach is paired analysis. When both models evaluate identical questions, you can analyze the per-question differences rather than aggregate scores. This exploits the fact that models tend to agree on question difficulty — if both models struggle on the hard questions and succeed on the easy ones, the covariance between their scores provides "free" variance reduction. Paired analysis on the same question set is systematically more sensitive than two-sample analysis on independent samples.

But even paired analysis requires enough examples. The paired approach reduces variance; it doesn't eliminate the need for statistical power. For the 3-point lift detection scenario, a well-executed paired test might cut the required sample size in half — to roughly 1,200. Still not 100.
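Here's a sketch of why pairing helps, on synthetic data. The key assumption is baked in explicitly: both models answer the same questions, and a shared per-question difficulty makes their outcomes positively correlated:

```python
import random
from statistics import mean, stdev

random.seed(0)

# Synthetic setup: both models answer the same 500 questions; a shared
# per-question difficulty makes their outcomes positively correlated.
n = 500
difficulty = [random.random() for _ in range(n)]
baseline  = [1 if random.random() > d else 0 for d in difficulty]
candidate = [1 if random.random() > d - 0.03 else 0 for d in difficulty]  # true +3 pts

# Paired analysis: standard error of the mean per-question difference.
diffs = [c - b for c, b in zip(candidate, baseline)]
se_paired = stdev(diffs) / n ** 0.5

# Unpaired analysis: treats the two score lists as independent samples,
# discarding the covariance that comes from shared question difficulty.
se_unpaired = (stdev(baseline) ** 2 / n + stdev(candidate) ** 2 / n) ** 0.5

print(f"observed lift: {mean(diffs):+.3f}")
print(f"paired SE:   {se_paired:.4f}")   # smaller: covariance subtracted out
print(f"unpaired SE: {se_unpaired:.4f}")
```

The paired standard error is smaller precisely because the models agree on which questions are hard; the stronger that agreement, the bigger the variance reduction.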

The Organizational Cost Nobody Calculates

The real damage from underpowered evals isn't any single bad decision. It's the feedback loop corruption that compounds over time.

Imagine your measurement error is ±5 points and you're iterating toward improvements of 2–4 points per cycle. Every "improvement" you observe could be noise. Every regression could be noise. You are essentially navigating with a compass that points in a random direction — but you're still building a map.

Teams in this situation show recognizable symptoms:

  • Prompt thrash: changes that "help" in eval get reverted weeks later because production doesn't reflect the gains
  • Framework distrust: engineers stop trusting evals and start making intuition-based decisions, bypassing the system entirely
  • False confidence at launch: a green eval dashboard creates pressure to ship, even when the eval cannot actually distinguish working from not working
  • Inability to attribute regressions: when production quality drops, you can't trace it to a specific change because no individual eval run had the power to detect the regression

The organizational cost is the engineering time spent iterating on a system whose measurement error exceeds the gains being chased. This is common. It is also preventable.

An Experiment Design That Produces Reliable Signal

The goal is not to run academic experiments — it's to build a minimum viable eval that actually informs decisions. Here's what that looks like in practice.

Set a minimum detectable effect before building your eval. Decide the smallest improvement that would actually change a product decision. If you wouldn't ship for a 2-point gain but would ship for a 5-point gain, your MDE is 5 points. Use this to back-calculate the required sample size before you collect examples. If the required sample size is infeasible, either accept lower power (and account for it in decisions) or increase the threshold.

Report uncertainty, not just accuracy. Every eval result should include a confidence interval. A result of "85% ± 4%" reads very differently than "85%" when the candidate and baseline are separated by 3 points. Wilson score intervals work well for binomial accuracy metrics on small samples where the normal approximation breaks down.
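The Wilson interval is short enough to implement directly. This is a minimal stdlib-only version; libraries like statsmodels offer the same interval if you'd rather not hand-roll it:

```python
from statistics import NormalDist

def wilson_interval(correct: int, total: int, confidence: float = 0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * (p * (1 - p) / total + z * z / (4 * total * total)) ** 0.5
    return center - half, center + half

lo, hi = wilson_interval(85, 100)
print(f"85/100 correct -> 95% CI ({lo:.3f}, {hi:.3f})")  # roughly (0.767, 0.907)
```

Note what the output says: at 100 examples, "85%" is really "somewhere between 77% and 91%" — an interval far wider than the 3-point differences teams routinely act on.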

Use continuous metrics where possible. Binary correctness (right/wrong) discards information. Token log-probability scores, preference win rates across multiple judges, or rubric-based 1–5 scores carry more signal per example. With continuous metrics, the same sample size produces narrower confidence intervals than it would with binary labels.

Run multiple seeds. A single eval pass gives you one noisy estimate. Running the same eval with three different random seeds and averaging the results reduces within-question variance by a factor of three for a modest cost increase. For API-based evals, this is often the highest-leverage improvement you can make without adding examples.
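Aggregating multiple seeds is mechanically simple: average per question first, then across questions. A toy sketch with hypothetical pass/fail results:

```python
from statistics import mean

# Hypothetical per-question pass/fail results from three eval runs (seeds)
# on the same 5-question set; 1 = correct.
runs = [
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 1, 1],
]

# Average across seeds per question, then across questions. The within-question
# variance component of the final score shrinks by a factor of len(runs).
per_question = [mean(col) for col in zip(*runs)]
score = mean(per_question)
print(per_question)  # a question passed in 2 of 3 runs scores 2/3
print(f"seed-averaged accuracy: {score:.3f}")
```

The per-question averages are also diagnostic in their own right: a question scoring 1/3 across seeds is flaky, not merely hard, and often points at an ambiguous prompt or grader.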

Don't ignore correlations. If your eval questions are not independent — they share the same document, are translations of each other, or belong to the same domain cluster — the usual i.i.d. standard error formula will be too optimistic. Cluster-robust standard errors can come out 3x larger than the naive estimate, dramatically changing which differences look significant.
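A minimal sketch of the cluster-robust calculation, on a hypothetical eval where questions drawn from the same source document tend to succeed or fail together:

```python
from statistics import mean

def naive_se(groups: list[list[int]]) -> float:
    """Standard error of the overall mean, assuming independent questions."""
    flat = [x for g in groups for x in g]
    n, xbar = len(flat), mean(flat)
    return (sum((x - xbar) ** 2 for x in flat) / (n - 1) / n) ** 0.5

def clustered_se(groups: list[list[int]]) -> float:
    """Cluster-robust standard error: sum residuals within each cluster first."""
    flat = [x for g in groups for x in g]
    n, xbar = len(flat), mean(flat)
    var = sum(sum(x - xbar for x in g) ** 2 for g in groups) / n ** 2
    return var ** 0.5

# Hypothetical eval: 5 questions per source document, correlated within doc.
groups = [[1, 1, 1, 1, 0], [0, 0, 0, 1, 0], [1, 1, 1, 1, 1], [0, 0, 1, 0, 0]]
print(f"naive SE     {naive_se(groups):.3f}")
print(f"clustered SE {clustered_se(groups):.3f}")  # larger: clusters correlate
```

With only 20 questions in 4 clusters, the effective sample size is closer to 4 than to 20, and the clustered standard error reflects that.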

What Adequate Evals Actually Look Like

Reaching statistical adequacy does not require building a benchmark at MMLU scale. A few structural choices move you most of the way there.

For most production use cases, 500–1,000 diverse, representative examples with paired analysis and proper uncertainty reporting give you 80% power to detect 5-point improvements. This is achievable. Teams often resist because curating labeled examples is expensive — but the cost of labeled examples is far lower than the cost of shipping prompt regressions because the eval couldn't detect them.

Automated LLM-as-judge scoring can expand your eval set cheaply, but it introduces its own variance. LLM judges are not deterministic either. If you use LLM-as-judge, run each judgment multiple times and average, or report judge agreement rates alongside scores.

Finally, treat eval development as a capability, not a one-time setup task. An eval built for your initial feature will not cover edge cases discovered six months in. The eval suite should grow alongside the system it measures, with examples added whenever a production failure reveals a new failure mode.

The Compass Problem

Statistical power matters because it determines whether you can learn anything from your experiments. An underpowered eval isn't a safety net with holes — it's a compass that points in a random direction. The danger isn't that you know it's broken. The danger is that it feels exactly like a working compass.

The fix isn't complicated: calculate required sample sizes before building eval sets, report confidence intervals instead of point estimates, run multiple seeds, and use paired analysis when comparing models. These are standard statistical practices borrowed directly from clinical trial design and A/B testing. They apply to LLM evaluation just as cleanly.

If your current eval suite can't tell the difference between signal and noise, the iteration work built on top of it cannot either. Fix the compass before drawing the map.
