Eval Selection Bias: Why Your Test Set Goes Blind to the Failures That Drove Users Away
There is a quiet failure mode in production-grade LLM evaluation that no leaderboard catches: your test set is built from the users who stayed, so it never asks the questions that made the others leave. Quarter over quarter the eval scores climb, the dashboards turn green, and net retention sags anyway. The team chases "is the eval gameable?" when the real story is simpler and harder. The eval distribution drifted toward survivors, and survivors are exactly the population least able to tell you why the others left.
This is the WWII bomber armor problem in a new costume. Abraham Wald looked at returning planes, noticed where the bullet holes clustered, and pointed out that the holes you should reinforce against are the ones on planes that didn't come back. Replace bombers with users, replace bullet holes with failed turns, and you have the central pathology of eval sets seeded from production traces.
How the loop quietly poisons itself
Most teams running LLM products converge on the same eval-set workflow. Production logs every interaction. A periodic job samples interesting traces. Engineers and annotators turn those into eval cases. The new cases get folded into the regression set. You re-run on every model upgrade, prompt change, and fine-tune.
On paper this is a virtuous feedback loop. Recent practitioner writing has been almost unanimous on the value of mining production for evals; a reliable feedback loop, the argument goes, comes from monitoring production behavior to turn new failure cases into test cases over time. The mechanism is correct. The blind spot is in the population that generates the traces.
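As a concrete illustration, here is a minimal sketch of that refresh job, with hypothetical trace and helper names (none of this is from a particular library). The survivorship problem enters at the sampling step: `sample_recent_traces` can only draw from the traffic that still exists when the job runs.

```python
# Minimal sketch of the usual eval-refresh job (hypothetical names throughout).
# The bias enters at sample_recent_traces(): it can only draw from users
# who are still generating traffic at refresh time.
import random
from dataclasses import dataclass

@dataclass
class Trace:
    user_id: str
    prompt: str
    response: str
    flagged_interesting: bool  # e.g. thumbs-down, retry, unusually long latency

def sample_recent_traces(production_log: list[Trace], k: int = 200) -> list[Trace]:
    """Sample 'interesting' traces from whatever traffic exists this quarter."""
    candidates = [t for t in production_log if t.flagged_interesting]
    return random.sample(candidates, min(k, len(candidates)))

def refresh_eval_set(eval_set: list[dict], production_log: list[Trace]) -> list[dict]:
    """Fold newly mined traces into the regression set."""
    for trace in sample_recent_traces(production_log):
        eval_set.append({"input": trace.prompt, "reference": trace.response})
    return eval_set
```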
Three filters compound:
- Onboarding filter. Users who hit a deal-breaking failure in their first session leave before generating more than a handful of traces. What little they leave behind is first turns; everything past turn three comes from people who already trusted the system.
- Tolerance filter. The users who do stay tend to have a higher tolerance for the agent's quirks. They have learned to work around its blind spots, which means their prompts adapt to the agent, not the other way around. Their traces look easier than they are.
- Refresh filter. Each quarter the eval set is updated with newer traces. Newer traces come from a population that has already been filtered by quarters one through n-1. The bias compounds geometrically.
By month nine you have an eval suite that mostly tests the agent on prompts written by people who have already accepted the parts of the agent that are broken.
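A toy simulation makes the compounding concrete. Assume, purely for illustration, that 60% of new users are churn-prone and half of them exit each quarter; the share of the trace pool that could still expose deal-breaking failures shrinks geometrically:

```python
# Toy simulation of the refresh filter: each quarter the eval set is rebuilt
# from users still active that quarter. All numbers are illustrative, not measured.
def survivor_share(quarters: int, churn_prone_frac: float = 0.6,
                   quarterly_exit_rate: float = 0.5) -> list[float]:
    """Fraction of the active population that is still churn-prone, per quarter."""
    prone, tolerant = churn_prone_frac, 1.0 - churn_prone_frac
    shares = []
    for _ in range(quarters):
        shares.append(prone / (prone + tolerant))
        prone *= (1.0 - quarterly_exit_rate)  # churn-prone users exit geometrically
    return shares

print([round(s, 2) for s in survivor_share(4)])
# [0.6, 0.43, 0.27, 0.16] -- the users most likely to expose failures
# shrink geometrically in the population that seeds each refresh
```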
What the rising score is actually measuring
When the eval distribution drifts toward survivors, the eval is still measuring something real. It just isn't measuring the thing the dashboards say it is.
A 95% pass rate on the current eval set means: among the kinds of interactions the people who didn't leave are sending us, we get 95% right. It does not mean: among the kinds of interactions a representative user sends, we get 95% right. The two diverge silently and there is no metric in the standard playbook that captures the gap.
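A hedged sketch of the gap, with made-up numbers: the same per-segment pass rates produce very different headline scores depending on whether you weight by the drifted eval mix or by a representative user mix.

```python
# Illustrative gap between "pass rate on the eval we have" and "pass rate on
# the users we wish it represented". Segment names and weights are hypothetical.
def pass_rate(results: dict, weights: dict) -> float:
    """results: pass rate per segment; weights: share of the set per segment."""
    return sum(results[s] * weights[s] for s in results) / sum(weights.values())

results = {"survivor_prompts": 0.97, "churned_style_prompts": 0.70}

eval_mix = {"survivor_prompts": 0.9, "churned_style_prompts": 0.1}  # drifted eval set
user_mix = {"survivor_prompts": 0.5, "churned_style_prompts": 0.5}  # representative traffic

print(round(pass_rate(results, eval_mix), 3))  # 0.943: what the dashboard shows
print(round(pass_rate(results, user_mix), 3))  # 0.835: what a representative user sees
```

Reweighting like this does not fix the eval, but it puts a number on how far the dashboard figure sits from the figure a representative user would experience.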
The diagnostic giveaway is when eval scores trend up while one of these moves the other way:
- New-user 7-day retention.
- First-session success-without-followup rate.
- Net Promoter or task-completion satisfaction from cohorts in their first month.
- Activation: percentage of new accounts that reach a "second productive session."
If you see these decline while eval pass-rate climbs, suspect the eval before suspecting the model. Eval drift, in the literature on the topic, is described as the silent degradation of agent output quality over time even when code and prompts have not changed. The variant we care about here is sneakier: code, prompts, and benchmark numbers all look fine, and the only thing that has changed is which users your benchmark represents.
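One way to operationalize the check, assuming you log both series at the same cadence (the metric names here are placeholders for whichever counter-metrics you track):

```python
# A coarse divergence check between the eval trend and a counter-metric trend.
def trend(series: list[float]) -> float:
    """Average period-over-period change across the series."""
    return (series[-1] - series[0]) / (len(series) - 1)

def survivorship_alarm(eval_pass_rate: list[float], counter_metric: list[float],
                       min_gap: float = 0.0) -> bool:
    """True when the eval climbs while the counter-metric falls."""
    return trend(eval_pass_rate) > min_gap and trend(counter_metric) < -min_gap

eval_scores  = [0.91, 0.93, 0.95, 0.96]  # quarterly eval pass rate
d7_retention = [0.42, 0.40, 0.37, 0.33]  # new-user 7-day retention
print(survivorship_alarm(eval_scores, d7_retention))  # True: suspect the eval first
```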
Churn-stratified sampling: weighting the eval against exit
The cleanest counterweight is to stop pretending that "users who churned" are out of scope for eval-set construction. They are the most important sample you have.
A churn-stratified eval set defines strata not by topic or persona but by exit signal (a sampling sketch follows the list):
- Stratum A: One-and-done. Accounts that opened a session, generated under N turns, and never returned. These traces are pure gold for adversarial evaluation because the user's first impression was unrecoverable.
- Stratum B: Slow churn. Accounts active for weeks who tapered to zero. Their last 3-5 sessions before silence often contain the failure that broke trust.
- Stratum C: Active. The traffic everyone already mines. Useful, but cannot be allowed to dominate.
- Stratum D: New-cohort holdout. A protected slice of new users whose traces are added to the eval set regardless of whether they retained. This stratum is structurally protected from refresh-time survivor filtering.
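A minimal sampler over these strata might look like the following, assuming traces are already tagged with their exit-signal stratum; the quota values are illustrative and the helper names are not from any particular library. The design point is that strata A, B, and D get fixed allocations that active traffic cannot crowd out at refresh time.

```python
# Churn-stratified sampling sketch: fixed quotas per exit-signal stratum.
import random
from collections import defaultdict

QUOTAS = {"A_one_and_done": 0.30, "B_slow_churn": 0.25,
          "C_active": 0.25, "D_new_cohort_holdout": 0.20}

def stratified_sample(traces: list[dict], n_total: int, seed: int = 0) -> list[dict]:
    """traces: dicts with a 'stratum' key matching QUOTAS; returns the eval draw."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for t in traces:
        by_stratum[t["stratum"]].append(t)
    sample = []
    for stratum, share in QUOTAS.items():
        pool = by_stratum.get(stratum, [])
        k = min(round(n_total * share), len(pool))
        sample.extend(rng.sample(pool, k))
    return sample
```

If a stratum cannot fill its quota (one-and-done traces are scarce by definition), leave the slots empty rather than backfilling from active traffic; backfilling quietly reintroduces the survivor bias the quotas exist to block.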
Sources
- https://iris-eval.com/learn/eval-drift
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://hamel.dev/blog/posts/evals-faq/
- https://latitude.so/blog/why-ai-agents-break-in-production
- https://www.confident-ai.com/blog/definitive-ai-agent-evaluation-guide
- https://www.statsig.com/perspectives/llm-evaluation-bias
- https://www.montecarlodata.com/blog-ai-data-drift/
- https://www.langchain.com/articles/llm-evals
- https://www.evalacademy.com/articles/stratified-random-sampling-in-evaluation
- https://www.braintrust.dev/articles/llm-evaluation-guide
