LLM-as-Judge Drift: When Your Evaluator Upgrades and All Your Numbers Move

11 min read
Tian Pan
Software Engineer

A regression suite that flips green-to-red without a single prompt change is usually one of three things: a broken test harness, a flaky retrieval store, or a judge that learned new taste over the weekend. The third one is the most common and the least debugged, because no commit in your repo caused it. The scoring model got a silent quality refresh, and every score you compare against last month's dashboard is now denominated in a different currency.

This is the uncomfortable part of LLM-as-judge: you have two moving models, not one. The candidate model is the thing you ship; the judge model is the thing that tells you how the candidate is doing. When both evolve independently, score deltas stop meaning what they used to, and the dashboard that your PM refreshes every morning quietly lies.

The lie is not random. It has a direction. When the judge moves in-family with the candidate (same provider, same generation, same style preferences), it tends to report quality improvements that look like wins but are actually the judge catching up to the candidate's idiom. When the judge moves out-of-family, it reports regressions that feel catastrophic but are actually a different rubric than the one that anchored the scores a quarter ago. Both movements get annotated in retros as "the new prompt worked" or "the new prompt broke things." Neither claim is safe.

Two Clocks, One Dashboard

The simplest mental model of LLM-as-judge evaluation has one clock: the candidate model's version pinned, the judge treated as a fixed ruler, and each prompt iteration producing a clean delta. Real deployments have at least two clocks. The candidate clock ticks every time someone edits a prompt, rebuilds a RAG index, or rotates a model pin. The judge clock ticks every time the vendor issues a behavioral refresh — which increasingly happens without a version bump, because providers treat quality improvements to production model aliases as non-breaking.

The effect is that a score of 4.2 in January and a score of 4.4 in March describe the agreement of two differently trained judges about two differently iterated candidates. The number 4.4 is not higher than 4.2 in any longitudinally meaningful sense. It is a different measurement on a different scale by a different instrument, reported against the same axis label. Your dashboard flattens both clocks into a single line, and the line looks like progress.

Research on LLM judge stability documents this precisely. Even holding everything else fixed, temperature settings introduce variance; even at temperature 0, provider-side sampling non-determinism produces score distributions that drift across runs. Add a judge-side behavioral update on top and the drift stops being Gaussian noise and starts being a systematic shift in what the rubric rewards. The 2024 survey of LLM-as-judge methods is blunt about the consequence: when you rely on proprietary judges, reproducibility is not a property you own — it is a favor the vendor can revoke on their next deploy cadence.

The Mechanism: Why Judges Drift in a Direction, Not a Noise Pattern

If judge drift were zero-mean noise, you could average it out. It isn't. Two mechanisms bend the direction.

The first is self-preference bias: models score outputs that look like their own generations higher. GPT-4 rates GPT-4-style outputs above Claude or Gemini outputs of comparable quality; Claude mirrors the behavior. The 2024 Self-Preference paper traced this to perplexity — judges rate lower-perplexity (more familiar-looking) text higher, regardless of who wrote it. When a judge gets upgraded, its notion of "familiar" updates. Outputs that were slightly outside its distribution last month are now inside it. Scores climb, and the climb has nothing to do with the candidate improving.

The second is family bias: judges score other members of their own model family higher than out-of-family peers of comparable quality. If your candidate is on GPT-4.1 and your judge is on GPT-4o, the judge does not merely reward GPT-style prose — it rewards stylistic markers specific to the OpenAI lineage. When your candidate upgrades to GPT-4.5 and your judge stays on GPT-4o, the family axis stays aligned and the improvement looks bigger than it is. When the judge upgrades to a different family (say, a vendor swap driven by cost), the family axis shifts and the candidate looks like it regressed overnight.

Positional and length biases add a third layer. In pairwise comparisons, GPT-4 shows a strong preference for the first-listed response; other models prefer the last. Longer outputs reliably score higher, regardless of content. Every judge upgrade potentially re-weights these biases. If the new judge rewards length more heavily, the RAG system you spent a quarter making concise suddenly trails verbose competitors it used to beat, and the rollback discussion begins.

Each of these biases, in isolation, is well-studied and mitigable. The problem is that a judge upgrade changes all of them at once, in unknown proportions, with no changelog from the vendor describing what moved.

Phantom Regressions Are the Easy Case. Phantom Wins Are Worse.

When a judge upgrade causes your dashboard to drop, at least someone panics and investigates. The pattern is familiar: green-to-red regression suite, bisect the prompt changes, find nothing, eventually someone notices the judge's provider quietly shipped an "improved" release. Painful, but recoverable — you re-baseline, annotate the discontinuity, move on.

Phantom wins are the failure mode nobody catches. The judge upgrade happens, scores trend up, the team ships the week's prompt change, the numbers look confirmatory, and the prompt ships to production. Two months later, user complaints spike on a behavior the judge stopped penalizing but users still hate. The dashboard and the user never agreed in the first place; the team only noticed the disagreement when the judge's new preferences drifted far enough from user preferences to become visible.

This is why "quality went up" after a judge update should trigger the same skepticism as "quality went down." Both are evidence of the judge moving, not the candidate. A disciplined team treats a score movement that coincides with a judge update as a trigger to rerun calibration, not as confirmation that a shipped change worked.

Pinning the Baseline Judge

The first and least negotiable discipline is pinning a baseline judge version in eval metadata. Every eval run records the judge's model ID, temperature, system prompt hash, and rubric version alongside the candidate's identifiers. Score comparisons across time must match on judge identity. If the judge column changes between two runs, the score delta is unspecified, not zero. Dashboards enforce this by refusing to draw a line through a judge-version discontinuity.
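To make the discipline concrete, here is a minimal sketch of what a pinned judge record and a comparison guard could look like. The JudgeConfig dataclass, its field names, and the score_delta helper are illustrative, not taken from any particular eval framework.

```python
from dataclasses import dataclass, asdict
import hashlib

@dataclass(frozen=True)
class JudgeConfig:
    """Fully-qualified judge identity, recorded alongside every eval run."""
    model_id: str              # snapshot ID like "gpt-4o-2024-08-06", never a bare alias
    temperature: float
    system_prompt_sha256: str
    rubric_version: str

def judge_fingerprint(cfg: JudgeConfig) -> str:
    """Stable hash of the judge configuration, stored as a column on each score row."""
    raw = "|".join(f"{k}={v}" for k, v in sorted(asdict(cfg).items()))
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

def score_delta(run_a: dict, run_b: dict) -> float:
    """Refuse to compare runs that were scored by different judges."""
    if run_a["judge_fingerprint"] != run_b["judge_fingerprint"]:
        raise ValueError("Judge identity changed between runs; the delta is unspecified, not zero.")
    return run_b["mean_score"] - run_a["mean_score"]
```

The point of the guard is cultural as much as technical: a dashboard or notebook that calls score_delta across a judge change fails loudly instead of silently drawing a line through the discontinuity.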

Pinning sounds simple. In practice it requires two things that teams resist. First, pinning to specific snapshot model IDs (e.g., gpt-4o-2024-08-06 rather than gpt-4o), because provider aliases are not stable. Second, accepting that the pinned judge will eventually deprecate, and budgeting for the migration in advance. A pinned-baseline-judge strategy without a scheduled re-pinning cadence is a time bomb: the day the deprecation email arrives, every historical score in your warehouse becomes unanchored.

The re-pinning procedure has a specific shape. Run both the old pinned judge and the new candidate judge on the same dataset (production traces, eval set, and a held-out meta-eval set) in parallel. Measure per-category agreement. For categories where agreement is high, rebase the baseline to the new judge and annotate the discontinuity. For categories where agreement is low, freeze them as known drift zones and do not compare scores across the version boundary until the rubric is re-calibrated.
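A rough sketch of that classification step, assuming the same traces have already been scored by both the old pinned judge and the candidate replacement, and that each trace carries a category label; the threshold values are placeholders to pre-register for your own rubric, not recommendations.

```python
from collections import defaultdict

AGREEMENT_FLOOR = 0.85   # pre-registered per-category agreement floor (placeholder)
SCORE_TOLERANCE = 0.5    # how close two scores must be to count as agreement (placeholder)

def repin_report(traces, old_scores, new_scores):
    """Classify each category as 'rebase' or 'freeze' for a judge migration.

    `traces`, `old_scores`, and `new_scores` are parallel lists: the same traces
    scored once by the old pinned judge and once by the candidate judge.
    """
    tallies = defaultdict(lambda: [0, 0])  # category -> [agreements, total]
    for trace, old, new in zip(traces, old_scores, new_scores):
        agreements, total = tallies[trace["category"]]
        tallies[trace["category"]] = [agreements + (abs(old - new) <= SCORE_TOLERANCE), total + 1]

    return {
        category: "rebase" if agreements / total >= AGREEMENT_FLOOR else "freeze (known drift zone)"
        for category, (agreements, total) in tallies.items()
    }
```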

Judge-vs-Judge Calibration Cadence

Pinning a baseline tells you what your old judge said. It does not tell you whether your old judge was right. For that, you need regular judge-vs-judge calibration — running an independent judge (ideally a different model family, definitely a different provider) on a sampled slice of the same traces the primary judge scores.

Agreement metrics matter here, and the choice is not academic. Cohen's Kappa and Krippendorff's Alpha are standard but misbehave in the skewed score distributions typical of production evals, where most traces land in the top bucket and rare failures land in the bottom. Gwet's AC2 is more robust to that skew. Rank correlations (Spearman, Kendall) sidestep the rubric-boundary problem entirely and answer the question you usually care about: do the two judges order traces the same way?
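A minimal agreement check along those lines, using the rank correlations available in scipy (Gwet's AC2 needs a dedicated package and is omitted here); the function name and inputs are illustrative.

```python
from scipy.stats import spearmanr, kendalltau

def judge_rank_agreement(primary_scores, calibration_scores):
    """Rank agreement between the primary judge and an out-of-family calibration judge.

    Rank correlations ignore rubric-boundary mismatches and answer the ordering
    question directly: do the two judges sort the same traces the same way?
    """
    rho, _ = spearmanr(primary_scores, calibration_scores)
    tau, _ = kendalltau(primary_scores, calibration_scores)
    return {"spearman_rho": rho, "kendall_tau": tau}
```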

The cadence matters. Running calibration once at onboarding and never again tells you nothing about drift. Running it on every eval run is expensive and noisy. A reasonable compromise: calibration on a fixed stratified sample weekly, with threshold-based alerting. Agreement stays above a pre-registered floor — keep shipping. Agreement drops below the floor — freeze eval-driven decisions until disagreement triage completes.
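One way the weekly job could look, assuming hypothetical score_primary, score_calibration, and alert callables supplied by your eval pipeline; the sample sizes and the floor are placeholders to pre-register, not recommendations.

```python
import random
from collections import defaultdict
from scipy.stats import spearmanr

AGREEMENT_FLOOR = 0.75   # pre-registered floor (placeholder)
PER_CATEGORY = 25        # stratified sample size per category (placeholder)

def stratified_sample(traces, seed=0):
    """Fixed-size sample per category so rare failure buckets are always represented."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for trace in traces:
        buckets[trace["category"]].append(trace)
    return [t for bucket in buckets.values()
            for t in rng.sample(bucket, min(PER_CATEGORY, len(bucket)))]

def weekly_calibration(traces, score_primary, score_calibration, alert):
    """Score a stratified sample with both judges; freeze decisions if agreement drops."""
    sample = stratified_sample(traces)
    primary = [score_primary(t) for t in sample]
    secondary = [score_calibration(t) for t in sample]
    rho, _ = spearmanr(primary, secondary)
    if rho < AGREEMENT_FLOOR:
        alert(f"Judge agreement {rho:.2f} below floor {AGREEMENT_FLOOR}; freeze eval-driven decisions.")
    return rho
```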

Disagreement triage is the unglamorous heart of this. Every disagreement is a trace where the primary judge and the calibration judge diverged. Read them. Usually one of three things is true: the rubric is ambiguous (fix the rubric), one judge has a bias the other does not (document it), or the traces genuinely sit on a boundary where reasonable judges disagree (mark the category as high-variance and require human review for shipping decisions). Each outcome updates the evaluation stack. None of them is "our judge is bad, upgrade it" — that reaction re-introduces the drift you're trying to catch.
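The triage queue itself is simple to build; what matters is that someone reads it. A sketch, with the gap threshold, the queue size, and the bucket labels as assumptions:

```python
def triage_queue(traces, primary_scores, calibration_scores, gap=1.5, top_k=20):
    """Surface the traces where the two judges diverged most, for human reading.

    After review, each trace lands in one of three buckets: ambiguous rubric,
    judge-specific bias, or genuine boundary case requiring human sign-off.
    """
    ranked = sorted(
        zip(traces, primary_scores, calibration_scores),
        key=lambda row: abs(row[1] - row[2]),
        reverse=True,
    )
    return [(trace, p, c) for trace, p, c in ranked if abs(p - c) >= gap][:top_k]
```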

Judge Versioning in Eval Metadata

The operational failure mode is not deciding to version judges. It's forgetting. A team standardizes on a pinned judge, ships the eval pipeline, and six months later the pinned model ID is hardcoded in three places, referenced in a Jupyter notebook someone forked, and rotated in a CI script nobody remembers editing.

The discipline that survives turnover is judge metadata as a first-class column in every eval artifact. The eval runner refuses to write a score row without a fully-qualified judge record: model ID with snapshot date, temperature, decoding parameters, rubric version hash, and the few-shot examples that were active at judgment time. Dashboards filter by judge identity, and the default view shows only the latest pinned judge. Comparisons across judges require explicit opt-in and render with a clear visual discontinuity.
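A write-time guard in the eval runner might look like the following; the field names, the date-suffix heuristic for detecting bare aliases, and the store interface are all assumptions, not any specific framework's API.

```python
import re

REQUIRED_JUDGE_FIELDS = {
    "judge_model_id",        # snapshot-dated ID, not a bare alias
    "judge_temperature",
    "judge_decoding_params",
    "judge_rubric_hash",
    "judge_fewshot_hash",
}
SNAPSHOT_SUFFIX = re.compile(r"\d{4}-\d{2}-\d{2}$")  # crude heuristic: date-suffixed snapshot IDs

def write_score_row(store, row: dict) -> None:
    """Refuse to persist a score row with an under-specified judge record."""
    missing = REQUIRED_JUDGE_FIELDS - row.keys()
    if missing:
        raise ValueError(f"Score row rejected; missing judge metadata: {sorted(missing)}")
    if not SNAPSHOT_SUFFIX.search(row["judge_model_id"]):
        raise ValueError(f"{row['judge_model_id']!r} looks like an alias, not a pinned snapshot")
    store.append(row)
```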

This is the same hygiene that saved the ML community from silent training-pipeline drift a decade ago. Model weights got versioned; training data got versioned; hyperparameters got logged. The judge is a model in your pipeline. It needs the same treatment, even though it sits behind an API and even though the vendor pretends the model string alone is a stable identifier.

Stop Believing the Dashboard Without Its Footnotes

The takeaway is not that LLM-as-judge is unreliable — it is reliable in the ways it was designed for, which is ranking similar candidates under a fixed rubric at a fixed point in time. The takeaway is that LLM-as-judge is not a time-invariant measurement instrument, and treating it as one converts ordinary provider updates into phantom wins and phantom regressions that no one on the team caused or can reproduce.

The fix is not more judges, or smarter rubrics, or a fancier eval framework. The fix is boring: pin judges by snapshot ID, version their configuration alongside the candidate's, calibrate judge-vs-judge on a schedule, triage disagreement instead of averaging it away, and refuse to draw longitudinal lines through judge discontinuities. When a dashboard number moves, the first question is not "what prompt changed?" — it is "did the ruler change?" Until you can answer that from metadata alone, the dashboard is telling you a story, not a measurement.

Teams that take this seriously end up with quality dashboards that move less and mean more. Teams that don't end up re-qualifying prompts every vendor release, chasing ghosts that their own instrumentation conjured, and quietly losing faith in the eval program they spent a year building. The drift was never in the candidate. It was in the thing measuring the candidate, and nobody was watching that clock.