
Your Eval Set Only Has Problems You Already Solved

9 min read
Tian Pan
Software Engineer

Your eval score went from 0.81 to 0.87 over the last quarter. The team shipped a router, swapped in a stronger model on the hard intents, tuned the system prompt, and added forty new test cases harvested from "tickets that took more than a day to close." The dashboard says you got better. NPS is flat. Active users are down two percent.

There is a clean story that explains both numbers, and you don't want to hear it. Your eval set only contains problems you already solved. The queries that failed so badly the user never filed a ticket, never came back, and never showed up in any log you grep — those are not in your suite. They are not in anyone's suite. A rising eval score is consistent with getting better at the things you can see, and it is also consistent with getting better at the things you can see while staying exactly as bad at the things you cannot.

This is survivorship bias dressed up in MLOps clothes. You build the eval from production traces, you augment it with curated past failures, you weight it toward the hardest ten percent of inputs you spotted. Every part of that pipeline implicitly filters for users who stuck around long enough to leave a trace. The bombers that didn't make it home are not in the dataset of bullet holes that says "armor the wings."

The structural absence problem

Eval sets get built from three sources, and all three are filtered by survival.

The first is logged production traffic. By definition, a logged query is a query the user fired off into your system. If the product is sufficiently broken on intent X, the user does not patiently send you ten more queries of intent X so you can sample them. They send one, get garbage, and leave. Your logs contain a single specimen — easy to dismiss as a one-off — while the prevalence of intent X among non-users is invisible.

The second is curated past failures. A failure becomes "past" when someone noticed it: a support ticket, a customer escalation, a Slack thread, a flagged thumbs-down. So the working definition of a failure is not "things that went badly" but "things that went badly and that a person with the time, vocabulary, and motivation reported." Free-tier users with low expectations, non-English speakers who don't know the support form exists, agentic-pipeline operators whose retries hid the symptom: none of those feed your curation channel.

The third is synthetic adversarial examples. A red team writes prompts they suspect will fail. This produces excellent coverage of failures the red team can imagine, which is the exact set of failures you don't have a blind spot for. The failures in your blind spot are by definition the ones the red team didn't think to write down.

Add these together and the eval set is shaped like a flashlight. It illuminates a cone. The territory outside the cone is not empty — you just can't see it from where you're standing.

Why "we added the hard cases" is the comforting version of the lie

Every eval team I've talked to has a story about how they hardened their suite. The story is the same: we mined the support queue, we looked at low-rated traces, we asked the on-call to flag the gnarly ones from this week. We added them.

What you added is the hard cases that survived long enough to be noticed. Notice the survival filter is even tighter here than for logged traffic, because to make it into the "hard" set, a query had to (a) be logged, (b) be flagged, (c) be reviewable, (d) be reproducible enough to canonicalize. Each of those filters drops a generation of cases. The hard cases your eval set actually contains are the ones that walked through four doors. The hardest cases never made it to the first door, because the user who would have asked them concluded after one bad answer that the product wasn't for them.

This is why benchmark scores climb without users noticing. You did get better at things. You got better at the things in the cone. People outside the cone don't know your model improved because their experience of your product is "tried it, didn't work, told a friend it was a toy." They never came back to retest, and your retention dashboard quietly logs them as "low-engagement cohort, never converted." The eval suite and the retention curve are looking at disjoint populations.

The dark matter you can sometimes detect

You cannot directly observe the queries that never got asked. But you can observe their footprints, the way astronomers find dark matter by watching what its gravity does to the visible universe. Three techniques are worth being deliberate about.

Abandonment as a signal. Instrument the trajectory of a session, not just its final answer. A user who sends one query, reads the response, sends zero follow-ups, and never returns to the product within thirty days is telling you something. The shape of that drop-off conditional on intent type is one of the few honest measures of "how bad was the answer, really." A 4.2/5 rating from the users who stayed says little about the users who left mid-conversation. Most analytics stacks bucket "session ended" as success, because the session contained no error. Build a query-class–by–dropoff matrix and you will discover intents where your eval score is 0.9 and your conversion is 0.05. That gap is the dark matter.
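
One way to build that matrix, as a minimal sketch. The `Session` record, its field names, and the 0.8 and 0.5 thresholds are assumptions for illustration, not something your analytics stack already emits. The idea is only to count one-turn, never-returned sessions per intent and put that rate next to the eval score.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Session:
    user_id: str
    intent: str                 # hypothetical label from your own intent classifier
    turns: int                  # user messages sent in the session
    returned_within_30d: bool   # did this user start another session within 30 days?


def abandonment_by_intent(sessions):
    """Fraction of sessions per intent that were one turn with no return."""
    total = defaultdict(int)
    abandoned = defaultdict(int)
    for s in sessions:
        total[s.intent] += 1
        if s.turns == 1 and not s.returned_within_30d:
            abandoned[s.intent] += 1
    return {intent: abandoned[intent] / total[intent] for intent in total}


def high_score_high_abandonment(eval_scores, abandonment,
                                min_score=0.8, min_abandon=0.5):
    """Intents that look healthy in the eval suite but lose most first-time users."""
    return sorted(
        intent
        for intent, rate in abandonment.items()
        if eval_scores.get(intent, 0.0) >= min_score and rate >= min_abandon
    )
```

Intents the eval suite does not cover at all get a default score of 0.0 here and are never flagged; you may prefer to list those separately, since "no eval cases, high abandonment" is its own kind of blind spot.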

Cohort retention sliced by query type. Group users by the first intent they tried. Plot month-over-month retention for each group. Intents where the M3 retention is half of the average are almost certainly intents where the first-touch answer was bad enough to lose the user permanently — even though, by the time M3 rolls around, your model has been "fixed" three times and your eval score on that intent looks healthy. The Andreessen-Horowitz framing is that AI products see heavy early churn from "AI tourists" and only stabilize at M3. The corollary is that M3 retention by intent is one of the few metrics that hasn't been tampered with by survivorship.
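
A rough sketch of that slicing, under stated assumptions: events arrive as `(user_id, day, intent)` tuples, "M3" is read as any activity 60 to 89 days after the user's first event, and "the average" is the unweighted mean across intents. All three are choices made for illustration; swap in whatever your own retention definition is.

```python
from collections import defaultdict
from datetime import date


def m3_retention_by_first_intent(events, window=(60, 90)):
    """events: iterable of (user_id, day: date, intent).

    Returns {first_intent: fraction of users with any event between
    window[0] (inclusive) and window[1] (exclusive) days after their first one}.
    """
    by_user = defaultdict(list)
    for user_id, day, intent in events:
        by_user[user_id].append((day, intent))

    lo, hi = window
    cohort_size = defaultdict(int)
    retained = defaultdict(int)
    for user_events in by_user.values():
        user_events.sort(key=lambda e: e[0])      # earliest event first
        first_day, first_intent = user_events[0]
        cohort_size[first_intent] += 1
        if any(lo <= (day - first_day).days < hi for day, _ in user_events):
            retained[first_intent] += 1
    return {i: retained[i] / cohort_size[i] for i in cohort_size}


def lagging_intents(retention, ratio=0.5):
    """Intents whose retention is at most `ratio` times the unweighted mean."""
    if not retention:
        return []
    mean = sum(retention.values()) / len(retention)
    return sorted(i for i, r in retention.items() if r <= ratio * mean)
```

Note that the cohort is keyed on the first intent only. A user who opened with one intent and came back for a different one still counts as retained for the first, which is the point: you are measuring what the first-touch answer did to them.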

Deliberate sampling of low-confidence and high-latency tails. Most teams sample production traffic uniformly when they curate evals. That's the worst possible sampling distribution if you're trying to find the dark matter. Sample biased toward the bottom percentile of model confidence (self-consistency, log-probs, judge scores), toward the long tail of latency (long latencies are often retries hiding instability), and toward sessions that ended within one turn of starting. None of these are perfect, but each of them is correlated with the kind of failure the user did not bother reporting.
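
Here is one hedged way to bias the sampler. The `Trace` fields, the 10% tail size, and the `boost` of 5.0 are illustrative assumptions; `confidence` stands in for whichever proxy you actually have. The sampling itself uses the Efraimidis-Spirakis trick for weighted sampling without replacement: give each item the key `u ** (1 / weight)` for a uniform random `u`, then keep the `k` largest keys.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    trace_id: str
    confidence: float   # any proxy you have: judge score, self-consistency, log-probs
    latency_ms: float
    turns: int          # user messages in the session this trace belongs to


def tail_weights(traces, tail=0.1, boost=5.0):
    """Baseline weight 1.0, plus `boost` for each tail a trace falls into."""
    n = len(traces)
    m = max(1, int(tail * n))   # size of each tail, at least one trace
    by_conf = sorted(range(n), key=lambda i: traces[i].confidence)
    by_lat = sorted(range(n), key=lambda i: traces[i].latency_ms, reverse=True)
    low_conf = set(by_conf[:m])
    high_lat = set(by_lat[:m])

    weights = []
    for i, t in enumerate(traces):
        w = 1.0
        if i in low_conf:
            w += boost
        if i in high_lat:
            w += boost
        if t.turns <= 1:        # session ended within one turn of starting
            w += boost
        weights.append(w)
    return weights


def sample_tails(traces, k, seed=0, tail=0.1, boost=5.0):
    """Weighted sample of up to k traces, without replacement."""
    rng = random.Random(seed)
    weights = tail_weights(traces, tail=tail, boost=boost)
    keyed = [(rng.random() ** (1.0 / w), t) for t, w in zip(traces, weights)]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in keyed[:k]]
```

Keeping the baseline weight at 1.0 rather than 0 is deliberate: ordinary traces can still be drawn, so the curated set leans toward the tails without consisting only of them.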

The reframing nobody wants to do

Here is the uncomfortable rewrite of what your dashboard is telling you. Your eval score going up means one of two things: you got better at the problems represented in the eval, or you stopped hearing from the people for whom those problems were never representative. Both look identical to the dashboard. You cannot distinguish them by looking at the eval suite alone, because the discriminator — the no-show users — is precisely what the eval suite cannot see.

The reframing is to stop treating the eval suite as ground truth about quality and start treating it as ground truth about coverage. The eval suite tells you how good you are at the slice of users who stuck around. To know whether that slice is representative of the market you want, you need an orthogonal signal: usability studies on prospects who haven't converted, exit interviews with churned cohorts, qualitative review of low-engagement sessions, A/B tests on new-user conversion, not just power-user satisfaction. None of those produce a clean number you can plot. That is the point.

A team that treats the eval suite as the scoreboard will optimize until the score is saturated, and then assume they're done. A team that treats the eval suite as a coverage estimate will keep asking, this whole quarter, where are the users we're not hearing from. The first team's quarterly review looks better. The second team's product gets used by people the first team never met.

What to do on Monday

Make your eval suite tell you about itself. For every intent class in the suite, write down (a) how many cases came from logged traffic, (b) how many came from human-curated failures, (c) what fraction of users with that intent on first touch returned within a month. The last number is the survivorship correction. If it's low, your eval-set sample for that intent is selection-biased upward and your headline score on that intent is overstated.
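
That audit could look something like the sketch below. The input shapes, the source labels `"logged"` and `"curated"`, and the `low_return=0.3` cutoff are assumptions made up for this example; the right cutoff depends on your product's normal return rate.

```python
from collections import Counter, defaultdict


def audit_suite(cases, first_touch_returns, low_return=0.3):
    """Per-intent provenance and survivorship report for an eval suite.

    cases: iterable of (intent, source), source being "logged" or "curated".
    first_touch_returns: {intent: (users_returned_within_month,
                                   users_with_intent_on_first_touch)}.
    """
    sources = defaultdict(Counter)
    for intent, source in cases:
        sources[intent][source] += 1

    report = {}
    for intent, counts in sources.items():
        returned, total = first_touch_returns.get(intent, (0, 0))
        rate = returned / total if total else None   # None: no first-touch data
        report[intent] = {
            "logged": counts["logged"],
            "curated": counts["curated"],
            "first_touch_return_rate": rate,
            "likely_overstated": rate is not None and rate < low_return,
        }
    return report
```

An intent with no first-touch data gets a rate of `None` rather than a flag. That is not a clean bill of health, only an admission that the correction cannot be computed yet.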

Track tool-call abandonment and one-turn sessions as first-class quality signals, not just engagement signals. They are the closest thing you have to a "user gave up" event. Wire them into the same dashboard your eval scores live on, so the next time the eval score climbs while the abandonment rate on a specific intent climbs alongside it, somebody sees both numbers in the same field of view.
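
As a sketch of what "both numbers in the same field of view" might mean in practice, the check below compares two reporting periods and lists intents where the eval score and the abandonment rate rose together. The dictionary layout and key names are hypothetical; adapt them to whatever your dashboard stores.

```python
def diverging_intents(previous, current, min_delta=0.0):
    """Intents whose eval score and abandonment rate both rose between periods.

    previous, current: {intent: {"eval_score": float, "abandonment_rate": float}}.
    Only intents present in both periods are compared.
    """
    flagged = []
    for intent in sorted(previous.keys() & current.keys()):
        d_score = current[intent]["eval_score"] - previous[intent]["eval_score"]
        d_abandon = (current[intent]["abandonment_rate"]
                     - previous[intent]["abandonment_rate"])
        if d_score > min_delta and d_abandon > min_delta:
            flagged.append((intent, round(d_score, 4), round(d_abandon, 4)))
    return flagged
```

Raising `min_delta` above zero filters out noise-level movements; what counts as noise depends on how many sessions each intent sees per period.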

And once a quarter, run an explicit study of users who tried the product once and never came back. Sample twenty of them. Look at their first query. Look at the response. Score it. Ask, with no benchmark to grade you, whether the response was good. That handful of observations will not have statistical power. It will, occasionally, have epistemic power — the kind that tells you the lights have been on in the wrong room for a year.

The eval score is not the floor. It is the ceiling on what you currently know how to measure. The dark matter is heavier than the visible universe, in eval suites just like in cosmology. The honest move is to stop pretending the ceiling is the sky.
