The Eval Rubric Pulled By Two Drift Vectors
Your composite eval score went up two points last quarter. Nobody can tell you whether the system got better, whether the human cohort that scores it got more lenient, or whether the judge model you upgraded in March started weighting verbosity differently. The number moved. The thing the number is supposed to measure did not necessarily move with it.
This is what happens when an eval rubric is read by two populations at once — humans and an LLM judge — and both populations drift on different axes for different reasons. The composite score blends their motion together, and unless you have a measurement protocol that holds one fixed while the other moves, you have shipped a metric whose changes are not attributable to anything.
Two Drift Vectors Pointing Different Directions
Human rater drift is the older problem. A human cohort that scores model outputs is not a constant. Annotators get tired, get faster, get pickier, or get replaced. In clinical research this has a name, rater drift, and it has been measured for decades. Inter-rater reliability erodes "over the life of a study," and raters drift both in how they administer the rubric (how carefully they read the prompt) and in how they score it (forgetting a particular convention they were trained on).
For LLM evals the cohort rotation is faster and harder to see. A vendor contract changes. A pool of part-time annotators turns over every six weeks. A new internal team takes over the labeling work because the old team was reassigned to a different project. Each of those events is a step change in the implicit rubric, because the explicit rubric (the document) does not capture the priors the previous cohort brought to it.
LLM-judge drift is the newer problem, and it moves faster still. The judge is a model. The judge model gets upgraded, and the upgrade does not come with a published release note titled "your rubric will now be interpreted differently." Studies in 2026 have measured the size of the effect: the same rubric and the same dataset, scored by a new judge model version, produce a mean shift of three to eight points and a narrower score distribution. Within sixty to ninety days, a judge can drift far enough that yesterday's pass threshold has lost its meaning, even if nobody touched the rubric.
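One way to make a judge upgrade visible is to score the same frozen items with the old and the new judge version and compare the two score lists. The sketch below is a minimal illustration, not a prescribed protocol: the function name `judge_version_shift` is hypothetical, and the scores are made-up numbers on a 0-100 scale chosen only to show a positive mean shift with a narrower spread.

```python
from statistics import mean, stdev

def judge_version_shift(old_scores, new_scores):
    """Compare two judge versions that scored the same frozen items.

    Returns the mean per-item shift and the ratio of score spreads
    (a ratio below 1.0 means the new judge's distribution is narrower).
    """
    if len(old_scores) != len(new_scores):
        raise ValueError("both judge versions must score the same items")
    deltas = [new - old for old, new in zip(old_scores, new_scores)]
    return {
        "mean_shift": mean(deltas),
        "spread_ratio": stdev(new_scores) / stdev(old_scores),
    }

# Illustrative, invented scores for eight frozen items.
old_judge = [52, 60, 71, 80, 45, 66, 90, 58]
new_judge = [60, 65, 73, 79, 58, 69, 84, 64]

report = judge_version_shift(old_judge, new_judge)
print(f"mean shift: {report['mean_shift']:+.2f} points")
print(f"spread ratio (new/old): {report['spread_ratio']:.2f}")
```

Because the items and the rubric are identical across the two runs, any shift reported here is attributable to the judge version and not to the system under test.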
The two vectors do not point the same way. Humans drift slowly, with a trajectory shaped by who is in the cohort. Judges drift in steps, with a trajectory shaped by what sits behind the model's version string. When you average them into a composite score, you are blending two unrelated random walks and reporting the position.
Why The Composite Hides Both
The seductive thing about a composite is that it looks stable. A 0.6 weight on the human rater plus a 0.4 weight on the judge gives you one number per release. The number can move two points up or two points down, and you have a story to tell about it.
The number can also stay nearly flat while both populations move in opposite directions, and then you have no story at all. A judge that started favoring verbose answers by five points can largely cancel a human cohort that started penalizing verbose answers by three points. The composite shows a small gain (one point under equal weighting, 0.2 points under the 0.6/0.4 split above) that has nothing to do with the system under test. The team ships, leadership reports a green metric, and the underlying drift accelerates because nothing was set up to notice it.
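The cancellation is plain arithmetic, and it can be checked directly. The sketch below assumes a linear weighted composite; the function name `composite_delta` is hypothetical. It applies a three-point human penalty and a five-point judge bonus to an unchanged system, then compares the result with a genuine improvement that both populations see.

```python
def composite_delta(d_human, d_judge, w_human=0.6, w_judge=0.4):
    """Change in a linear weighted composite given each population's change."""
    return w_human * d_human + w_judge * d_judge

# System unchanged: humans now penalize verbosity (-3), judge now rewards it (+5).
drift_only = composite_delta(-3.0, 5.0)                 # 0.6/0.4 weights
drift_only_equal = composite_delta(-3.0, 5.0, 0.5, 0.5)  # equal weights

# A real improvement that both populations score 0.2 points higher.
real_gain = composite_delta(0.2, 0.2)

print(f"drift only, 0.6/0.4 weights: {drift_only:+.2f}")
print(f"drift only, equal weights:   {drift_only_equal:+.2f}")
print(f"real gain, 0.6/0.4 weights:  {real_gain:+.2f}")
```

Under the 0.6/0.4 weights, the drift-only case and the real-gain case produce the same composite movement, which is why the composite alone cannot separate them. Reporting the two component deltas next to the composite is the smallest change that exposes the difference.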
The composite is not wrong as a reporting surface. It is wrong as a diagnostic surface. An eval team that uses it for both is publishing decisions whose uncertainty is set by the more drifted of the two populations, with no way to tell which one that is.
The Recalibration Trap
The natural reaction when you notice the two populations disagreeing is to recalibrate the judge against the human cohort. The judge prompt gets edited, the rubric examples get refreshed, the judge model gets re-evaluated against a fresh batch of human labels, and the agreement metric — Cohen's kappa, Krippendorff's alpha, weighted kappa for ordinal scales — comes back in range.
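For reference, Cohen's kappa compares observed agreement with the agreement two raters would reach by chance given their label frequencies: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal unweighted implementation for nominal labels; the label lists are invented for illustration, and the weighted and Krippendorff variants mentioned above are not covered.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented labels: humans and judge agree on 8 of 10 items.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail", "fail", "pass"]

kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

Here raw agreement is 0.80 but chance agreement is 0.52, so kappa lands near 0.58. In practice a maintained implementation such as scikit-learn's `cohen_kappa_score`, which also supports weighted variants for ordinal scales, is likely a better choice than hand-rolled code.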
This is the trap. Recalibrating the judge against the current human cohort locks the two drifts together. From this point forward, when the human cohort drifts, the judge's recalibration will move with it, because the judge is being re-tuned to match the moving target. The judge can no longer act as a stable anchor against which to detect human drift, and the human cohort can no longer act as a stable anchor against which to detect judge drift, because each of them is being measured against the other.
The composite score will appear smoother. The two populations will agree more consistently. The drift will continue, but it will be coherent across both vectors, and a coherent drift is one you cannot see by looking at agreement.
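A toy simulation makes the coherent-drift effect concrete. Everything in it is an assumption made for illustration: items have a fixed latent quality, a rater is modeled as a pass threshold, the human cohort's threshold loosens each period, and "recalibration" means picking the judge threshold that best matches the current human labels. A set of gold labels frozen at period 0 stands in for an anchor that neither population is allowed to move.

```python
# Fixed latent quality of eight frozen items (invented numbers).
QUALITY = [40, 48, 55, 61, 67, 72, 78, 85]

def label(threshold):
    """Model a rater as a pass threshold applied to latent quality."""
    return [q >= threshold for q in QUALITY]

def agreement(a, b):
    """Fraction of items on which two label lists match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def recalibrate_judge(human_labels, candidates=range(30, 91)):
    """Pick the judge threshold that best matches the current human labels."""
    return max(candidates, key=lambda t: agreement(label(t), human_labels))

GOLD = label(65)                      # anchor labels frozen at period 0
human_thresholds = [65, 60, 54, 47]   # the cohort grows more lenient over time

judge_vs_human, human_vs_gold = [], []
for period, h_thr in enumerate(human_thresholds):
    human = label(h_thr)
    judge = label(recalibrate_judge(human))  # judge re-tuned to the moving target
    judge_vs_human.append(agreement(judge, human))
    human_vs_gold.append(agreement(human, GOLD))
    print(period, judge_vs_human[-1], human_vs_gold[-1])
```

In this toy setup judge-human agreement stays perfect in every period while agreement with the frozen anchor falls steadily. The agreement metric reports health throughout; only the comparison against something that was not recalibrated shows the drift.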
Recalibration itself is not the wrong action. The wrong action is recalibrating in both directions at once and calling the agreement metric a measure of system quality.
Holding One Population Fixed
- https://futureagi.com/blog/evaluating-llm-judge-bias-mitigation-2026/
- https://futureagi.com/blog/llm-as-judge-best-practices-2026
- https://eval.qa/learn/llm-judge-calibration.html
- https://arize.com/blog/how-to-build-llm-as-a-judge-evaluators-that-hold-up-in-production/
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12863567/
- https://www.cogstate.com/blog/four-ways-to-identify-rater-drift-in-clinical-trials-remediation-strategies/
- https://arxiv.org/pdf/2511.21140
- https://arxiv.org/pdf/2602.13576
- https://venturebeat.com/infrastructure/monitoring-llm-behavior-drift-retries-and-refusal-patterns
