Skip to main content

The Eval That Converges, Then Quietly Collapses

· 11 min read
Tian Pan
Software Engineer

Your weekly eval dashboard has gone flat. The line that used to wobble between 0.71 and 0.78 has tightened to a hairline around 0.84 for three release cycles. The team reads it as a ceiling — the model is as good as the rubric allows, and further work needs a harder eval. Someone schedules a planning meeting to "design eval v2."

That reading is plausible, and sometimes correct. But there is a second explanation that produces the same picture and quietly destroys your release-gating signal: your labelers, human or LLM-judge, have homogenized around the same opinions, and the eval is no longer measuring the model. It is measuring how well the model produces the shape of output your labelers have learned to call correct.

This failure mode is sneaky because every step that produced it looked like quality work. You refined the rubric. You added few-shot examples to your LLM-judge prompt to fix early disagreement. You rotated in a new cohort of human raters and calibrated them against the previous cohort's gold set. Each of those moves tightened the eval. Together, they collapsed it. The yardstick became frozen, and the team has been optimizing against a frozen yardstick for two quarters.

Why agreement is not the metric you think it is

The intuitive read of inter-rater agreement is "higher is better." A team launches an eval with Krippendorff's α around 0.4, treats that as a problem to be fixed, and works the rubric until α climbs past 0.8. The number going up reads as the eval becoming more reliable. In a well-specified, narrow task — does this string contain a phone number — that read is right. The ceiling is high because the answer is unambiguous, and rater convergence on the true answer is the whole point.

For the evals that gate frontier model releases, the answer is rarely unambiguous. "Did the model give a helpful response to this medical question" or "Did the agent follow the user's intent" or "Is this summary faithful to the source" are tasks where two competent labelers should sometimes disagree, because reasonable practitioners do disagree. A well-specified rubric on a genuinely ambiguous task produces a bounded band of disagreement, not zero disagreement. If you see α drift from 0.6 to 0.85 over six months without a corresponding change in the task definition or labeler training, the first hypothesis should not be that the labelers got better. It should be that the labelers got more alike.

The distinction matters because the two scenarios produce identical-looking dashboards but opposite truths. Real reliability improvement means the rubric finally describes the task crisply, raters converge on the genuinely correct answer, and the eval becomes more decisive. Labeler monoculture means the rubric became internalized as a set of stylistic conventions, raters now recognize the conventions instead of evaluating the underlying property, and the eval becomes more agreeable with itself while losing its grip on the property it was meant to measure.

How a working eval drifts into monoculture

Three forces push evals toward labeler monoculture, and most teams run all three simultaneously without noticing.

The first is rubric refinement. Every time a team encounters a confusing case, they add a clarifying clause to the rubric. After enough clauses, the rubric stops describing the task and starts describing a procedure for producing the rubric's preferred answer. Raters who follow the procedure agree with each other regardless of whether the procedure tracks the underlying property. Recent research on rubric drift specifically names this pattern: rubrics develop coherent, directional preference shifts that look like quality but are actually a manufactured consensus.

The second is few-shot judge prompts. When the LLM-judge disagrees with humans, the standard fix is to add few-shot examples — a high-scoring response, a medium-scoring response, a low-scoring response — that anchor the model's judgment. Each anchor pulls the judge toward the example. Anchor enough times and the judge is essentially pattern-matching against the anchors rather than applying the criteria the prompt describes. The judge's outputs are now coherent with the anchors and coherent with itself, which raises measured agreement against humans who were calibrated against the same anchors.

The third is cohort succession. Human raters rotate out. New raters are trained against gold-set decisions produced by the outgoing cohort. The new cohort is hired to agree with the gold set, not to independently apply the rubric to fresh cases. After two or three rotations, "the gold set" is no longer a record of correct judgments — it is a record of one specific group's interpretation of a rubric that has itself drifted. Every subsequent rater is trained to replicate that interpretation. Agreement metrics climb because the eval has become a closed loop training its own conformity.

None of these moves is wrong in isolation. Each is the textbook response to a real problem. The failure is in running them simultaneously without any countervailing force that punishes agreement when agreement should not exist.

Disagreement as a budget, not a defect

The discipline that catches collapse is treating inter-rater disagreement as a budget with an expected band rather than a defect to be minimized. The team writes down, before running the eval, the disagreement profile that a well-specified rubric on this task should produce. For a clear-cut task, that band might be α between 0.85 and 0.95. For a genuinely ambiguous one — like rating the helpfulness of a medical response with multiple acceptable answers — the band might be 0.55 to 0.75.

If the measured agreement is below the band, the rubric is under-specified or the raters are under-trained. That is the familiar failure mode and teams know how to fix it. If the measured agreement is above the band, the eval is collapsing. Same dashboard alert, opposite remediation. The remediation when α climbs out the top is harder than the remediation when α drops out the bottom, because the team has to introduce more disagreement, which feels wrong to operators trained to interpret disagreement as noise.

The mechanism that introduces healthy disagreement is a hold-out labeler cohort. A small group of raters — three to five is usually enough for a non-public eval — grades blind. They never see the rubric refresh. They never see the gold-set updates. They never train against the outgoing cohort. Their job is to apply the original task definition to whatever the eval throws at them, indefinitely. Their disagreement with the main cohort is the actual signal of whether the eval is still measuring the task or has drifted into measuring rubric compliance.

The hold-out cohort costs real money and produces a metric that the rest of the org will sometimes resent, because it complicates the headline number. That cost is the point. An eval whose health is verified by a cheaper, internal process is an eval whose health is conditional on the process not collapsing. The hold-out cohort is paid to be the thing the process cannot capture.

The adversarial sample that proves the eval still distinguishes

Disagreement profiles tell you whether the labelers are still independent. They do not tell you whether the eval can still tell good models from bad ones. A separate check is required for that, and it is one of the simplest discipline practices to run.

The team maintains an archive of model outputs they know are bad — a frozen snapshot of an old model checkpoint that the team agrees was meaningfully worse, plus a handful of deliberately broken outputs the team constructs by hand (truncated answers, hallucinated facts, refusal-when-help-was-correct, the standard catalog). Every eval cycle, the archive is mixed into the live eval set under a label the labelers do not see. The eval is healthy if it still separates the archive from current outputs by a margin the team can articulate. The eval has collapsed if it cannot.

This is a Goodhart-style test in reverse. Goodhart's law says metrics that become targets stop being measures. The adversarial archive turns that diagnostic into a procedure: hold a frozen sample of known-bad outputs constant, watch whether your eval's score-spread between known-bad and current can be detected. When the spread compresses, the eval has stopped measuring quality and started measuring a stylistic correlate. The team can then look at exactly which axes have collapsed — does the eval still detect truncation? still detect hallucination? still detect refusal failures? — and rebuild from the broken axis outward.

The archive itself has to be refreshed slowly, because the team will eventually want to retire outputs that have become trivially distinguishable. The refresh cadence should be slow enough that the eval is forced to keep working on stale samples. A quarterly refresh is reasonable. Monthly is too fast and creates the same closed-loop problem the rest of the methodology is trying to escape.

Why teams ship regressions against a collapsed eval

The cost of a collapsed eval is not visible until a model upgrade ships, and even then it can stay invisible if the field telemetry is wired to the same signals the eval is. A model upgrade ships, the eval says +1.2 points, the post-launch metrics that the eval was designed to predict also say +1.2 points, and the regression in some capability the eval used to measure but no longer does is invisible until a customer escalation surfaces it weeks later. The eval and the field telemetry are both downstream of the same labeler conventions, so they degrade together.

This is the pattern behind several public episodes from the last eighteen months where teams shipped what their internal evals reported as a clear improvement, then had to walk back claims after users surfaced specific capability regressions. The internal eval was not lying in the corrupt sense. It was reporting honestly against the property it had drifted to measure, which was no longer the property the team thought it was measuring. The cost is felt in the lag between launch and discovery: the longer the eval has been in its drifted state, the more regressions have accumulated under the eval's radar, and the more painful the catch-up is when a real regression surfaces a real capability gap.

The teams that avoid this are not the teams with the cleverest rubrics. They are the teams that treat the eval itself as a system that ages, decays, and requires maintenance with the same care as the production stack it is gating. An eval's health is not a property of any single release. It is a property of how the eval is being updated, who is grading it, and whether anyone is checking that the grading still tracks the thing the team cares about.

A health check you can run this week

If your eval has been quiet for a quarter or more and you cannot remember the last time the disagreement profile was reviewed, a single afternoon of work produces useful signal. Take twenty samples from this week's eval set. Have three raters from your team grade them independently against the original task definition — not the current rubric, the task definition the rubric was supposed to operationalize. Compare those grades to the rubric-driven grades the eval pipeline produced. If they disagree on more than two or three samples, the rubric and the task have already separated. If they agree on all twenty, you have either a very crisp task or a very advanced case of monoculture, and the adversarial archive check will tell you which.

The deeper architectural point is that the absolute score an eval produces is not its primary output. Its primary output is the disagreement profile it generates under a known-good labeling process. A team that watches the score and not the profile is operating an instrument it cannot calibrate. When the score eventually moves — or stops moving — the team will not know whether the model changed, the rubric changed, or the labelers changed, because the only number they kept was the one that cannot distinguish the three.

The fix is not to add another metric to the dashboard. The fix is to write down what the eval is supposed to measure, separately from how the rubric currently measures it, and to hold one labeler cohort accountable to the former while the rest of the pipeline serves the latter. The gap between those two readings, tracked over time, is the eval's actual health signal. Everything else is the eval grading itself.

References:Let's stay in touch and Follow me for more thoughts and updates