The Distillation That Lost a Capability Your Eval Suite Never Measured
A team shrinks a 200B teacher into a 7B student because the eval suite — fifty thousand examples covering everything the product launched with — shows the student trailing the teacher by less than two points and inference cost dropping by an order of magnitude. The migration ships. The cost graph drops. The customer-satisfaction graph holds. Three weeks later, support starts seeing a class of failures the team cannot reproduce in eval.
The student no longer recognizes a corner-case input format the teacher had silently handled. It no longer recovers from a particular ambiguous instruction the teacher had reliably disambiguated. It no longer produces the rare-but-load-bearing "ask a clarifying question instead of guessing" behavior — because the eval set was scrubbed of ambiguous prompts on the grounds that they were "bad data."
The eval said the distillation was faithful. The eval was wrong about what faithfulness means.
Distillation Compresses What the Eval Measures, and Discards Everything Else
The mental model most teams have for distillation is "make a smaller model that does the same thing." The actual procedure is narrower: minimize a divergence between the student and the teacher over a finite sample of inputs, then verify with a finite eval. Anything the sample does not represent and anything the eval does not test is, from the optimizer's perspective, free entropy. The student is free to drop it. Most of the time it will.
This is not a defect of any particular distillation technique. It is what compression is. Knowledge distillation must account for what it loses, not just what the student retains on a primary metric, and recent surveys of the field explicitly name this: a student can preserve the teacher's headline score while losing the behaviors that made the teacher reliable. The capabilities most at risk are precisely the ones with the lowest base rate in your eval — and the lowest base rate tends to correlate with the highest user-facing cost when they finally trip.
What does "the lowest base rate" look like in practice? A short, incomplete list of behaviors that disappear from eval averages because they fire rarely:
- Calibrated abstention. The teacher said "I don't know" on three percent of inputs and was usually right to. The student says it on zero percent and confabulates instead. Average accuracy moves by less than one point. Trust collapses.
- Format robustness. A user pastes a malformed JSON with a trailing comma, or a date string in a regional format your eval never sampled. The teacher parsed it. The student rejects the whole request.
- Clarifying questions. Ambiguous instructions in the wild are common; in the eval they were filtered as noise. The student, trained on the filtered distribution, picks the most likely intent and runs with it.
- Multi-step plan repair. When a tool call fails halfway through a task, the teacher revised the plan. The student emits the same plan a second time and waits to be told it failed again.
- Refusal calibration on adversarial inputs. The teacher refused jailbreaks without over-refusing benign edge cases. The student inherits one half of that posture and not the other.
Each of these is a behavior whose frequency in any realistic eval is too low to move the headline number, and whose importance when it fails is high enough to define how the product feels.
The Eval Suite Is a Sample of the Contract; the Distillation Is Faithful to the Sample
A common organizational story: the AI team treats the eval as the contract, the product team treats the eval as the contract, and the distillation is judged faithful because the eval is satisfied. None of this is incorrect inside the world the eval describes. The bug is the unspoken assumption that the eval describes the contract.
It does not. Even a fifty-thousand-example eval is a sample of an open behavior space, and a sample built from logged production traffic will reflect the things the teacher already handled well — because those are the requests the product currently sees — while underrepresenting the things the teacher would handle well but the product has not yet asked it to handle. Your eval is biased toward the teacher's strengths the same way your customer base is. A distillation optimized against that eval inherits the bias and forgets the rest.
There is a sharper way to state this. The eval is a forward measurement of a finite set of test points. The deployed system encounters an unbounded set of real inputs that are correlated but not identical to the test points. The gap between those two sets is the gap where capabilities disappear. The gap is invisible inside the eval, by construction.
Behavior Coverage Is the Audit You Are Missing
Closing the gap does not mean abandoning aggregate evals — those still catch the cases where a distillation broke something obvious — but it does mean adding an audit layer that grades the eval itself rather than only the model.
A capability-coverage audit asks a different question than an accuracy eval. It asks: what is the space of behaviors the teacher exhibits, and what fraction of that space does the eval actually exercise? The audit is constructed in roughly four passes.
First, enumerate the teacher's behavior classes from the outside. This is not "what tasks does the model do" — that is the eval coverage you already have. It is "what does the model do along the way that the user benefits from": abstain, clarify, refuse, recover, re-plan, prefer one format over another, calibrate confidence, ground claims in retrieved context. Each class becomes a row in a coverage matrix.
Second, grade the existing eval against the matrix. For each behavior class, count how many eval items specifically test for it — not how many items happen to elicit it, but how many would be scored differently if the behavior changed. Most teams discover their eval has substantial coverage on three or four classes and effectively no coverage on a dozen more. The dozen others are not exotic; they are simply the parts of behavior the eval was not designed around.
Third, construct adversarial probes for the uncovered classes. Deliberately include ambiguous prompts (to measure clarifying), out-of-distribution formats (to measure robustness), partially-failed tool sequences (to measure recovery), prompts where the right answer is "I don't know" (to measure abstention), and prompts the eval was scrubbed of on the grounds they were noise. The point of an adversarial eval suite is to surface exactly the inputs that production produces and the headline eval omits. If your eval suite never surfaces the inputs that trigger failures, you are not testing your system — you are testing your assumptions about your system.
Fourth, run a behavior diff between teacher and student on identical real-traffic samples. Cluster the divergences by behavior class rather than by accuracy delta. This is the part most teams skip because it requires teacher inference to remain available at sample volume — but the alternative is shipping the student blind to the behaviors it lost.
Treat Behavior-Coverage Loss as a Release Gate
A coverage audit that runs once before launch is theater. The behavior matrix only does work when it becomes a release gate for every distilled successor.
The practical form is straightforward. Maintain a small set — a few hundred examples is enough — of "things the teacher does that the headline eval does not measure." Each example is tagged with the behavior class it exercises and the expected behavior, not the expected exact answer. Each candidate successor — distilled, fine-tuned, retrained, or simply a new checkpoint — is graded against this set separately from the headline eval. A regression on any behavior class blocks the deployment until either the behavior is recovered or the team takes a documented decision to trade it away.
This sounds heavyweight. It is not. The set is small precisely because each item is doing the work of standing in for a behavior class the eval otherwise ignores. The cost is dominated by the construction; the ongoing scoring is cheap. The benefit is that "we ran distillation and the numbers look fine" stops being a release decision a team can make blind to what numbers do not look at.
It is worth adding a quieter discipline alongside this: when the eval suite is curated, log what gets scrubbed and why. Items removed as "noise," "out of scope," "ambiguous," or "bad data" are exactly the items most likely to represent behaviors the headline eval will not measure and the distilled model will not learn. The scrub queue, kept around, becomes a free source of capability probes — the items the team already noticed were behaviorally interesting and then chose to forget.
The Cost Frame Is Asymmetric, and That Is the Real Trap
The cost graph of a successful distillation is immediate and on a dashboard. The cost graph of an unsuccessful one is dispersed across support tickets, churn, product-trust erosion, and capabilities the team eventually has to re-add to a different model. The first is legible to a quarterly review. The second is not legible until it has accumulated enough that someone notices a pattern in churn or an uptick in escalations and traces them back to the deployment that, on the dashboard, looked clean.
This asymmetry is what makes capability loss so persistent as a failure mode. The savings show up on the model owner's scorecard; the costs show up on someone else's. The team that owned the distillation often does not own the surfaces where its dropped behaviors fail. By the time the failures aggregate into a recognizable signal, the deployment is months old and the next distillation is already being planned. Without a coverage audit and a behavior-class release gate, the next one will lose a different capability — different because it depends on which behaviors the new eval underrepresents — and the cycle repeats.
There is a defensible architecture decision underneath all of this: an eval suite is a finite measurement of an infinite behavior space, and a distillation that scores well on the eval has guaranteed nothing about the behaviors the eval did not name. The team that did not measure capability coverage before compressing has not bought a smaller model that does the same thing. It has bought a smaller model that does the things it was tested on, and may have left the most useful behaviors behind on the teacher.
That is a fine trade if it is a chosen trade. It is a bad one if nobody noticed it was a trade at all.
- https://arxiv.org/abs/2504.14772
- https://arxiv.org/pdf/2503.03730
- https://arxiv.org/pdf/2505.14216
- https://link.springer.com/article/10.1007/s10462-025-11423-3
- https://arxiv.org/pdf/2502.06659
- https://arxiv.org/pdf/2110.08419
- https://arxiv.org/pdf/2412.15255
- https://glenrhodes.com/hot-take-rigorous-llm-evaluation-as-a-first-class-engineering-discipline-not-an-afterthought/
