
The LLM Judge Is a Versioned Dependency, Not Neutral Infrastructure

· 9 min read
Tian Pan
Software Engineer

Most teams treat their LLM judge the way they treat a unit-test runner: neutral infrastructure that produces a number you can trust. You write a rubric, point a model at your outputs, and the judge returns scores. The scores go on a dashboard. The dashboard's trendline drives the roadmap. Nobody thinks of the judge as a thing that has behavior, because the whole point of automation was to take behavior out of the loop.

But the judge is a model. It has a version. It has biases. And the day it changes — because your eval-platform team swapped it for something cheaper, or because the provider silently rolled the weights behind a -latest alias — every historical score it produced becomes incomparable to every new one. Your quarter-over-quarter quality trend is now denominated in two different currencies, and no one printed an exchange rate.

This is not a hypothetical edge case. It is the default outcome of using an LLM as a measurement instrument without versioning it like one.

Your eval history is a ledger, and the judge is its unit of account

When a judge upgrades, the new model does not grade harder or softer in some uniform, easily corrected way. It grades differently. It may have become stricter about hallucinated citations and more lenient about verbosity. It may have lost a position bias the old one had and gained a self-preference bias toward outputs from its own model family. The shift is multi-dimensional, and it lands unevenly across your eval slices.

So the failure mode is not "all scores went down 5%, subtract a constant." It is that the shape of your quality landscape changed. The slice that looked like your weakest is now your strongest, not because the product changed but because the ruler did. A team reading the dashboard sees a regression in one feature and an improvement in another, opens an investigation into the regression, and spends two weeks chasing a product bug that is actually a judge-strictness shift.

The deeper problem is epistemic. An eval score is only meaningful relative to other scores produced by the same instrument. A "7.2 average" means nothing in isolation; it means something compared to last month's "6.8." The moment the judge changes, that comparison silently breaks, but the number still renders on the dashboard in the same font, on the same axis, with the same color. Nothing in the visualization tells you that you just read across a discontinuity. The chart lies by omission, and it does so convincingly.

"Neutral infrastructure" is the assumption that hurts you

Why does this keep happening? Because of an organizational seam. The eval-platform team owns the judge. They are measured on cost and throughput, and upgrading the judge model — to a cheaper tier, to a newer release, to a faster endpoint — is a reasonable, even commendable, thing for them to do. From their seat it is an infrastructure optimization, invisible to consumers, like swapping a database connection pool.

The product teams, meanwhile, consume the scores as a continuous time series. They build roadmap arguments on the trendline. They set internal quality gates at "judge score ≥ 8." They never see the judge version, because the platform abstracted it away — which was the whole point of building a platform.

Both teams are behaving rationally. The bug is in the contract between them. The platform team thinks they are shipping a service; the product teams think they are reading a measurement. A service can be transparently upgraded. A measurement instrument cannot: recalibrating a scale is a visible, announced, documented event in every discipline that takes measurement seriously. Metrology has a word for the thing you measure against: a standard. Standards are not quietly swapped.

It gets worse with hosted models behind floating aliases. If your judge is pinned to a -latest endpoint, the provider can roll the weights under you with no version bump on your side at all. Your eval history develops a seam on a Tuesday and you find out, if ever, from a confused Slack thread three weeks later. The community has been asking providers for dated, pinnable snapshots precisely because reproducible evaluation is impossible without them — and even a pinned snapshot only tells you what you ran, not that it is safe forever, since snapshots still age out on deprecation schedules.

The disciplines that make a judge a real dependency

Treating the judge as a versioned dependency is not exotic. It is the same hygiene you already apply to any other dependency that can change behavior under you. Four practices carry most of the weight.

Pin a judge version per eval suite. The judge model, its exact snapshot ID, its decoding parameters, and the rubric prompt are all part of the instrument. Pin all of them, and record the pin alongside every score the suite produces. A score with no recorded judge identity is an unlabeled data point — you cannot tell which ruler produced it, so you cannot safely compare it to anything. If your judge only exists behind a floating alias, treat that as a known reliability gap, not a non-issue.
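One way to make the pin concrete is to treat the whole instrument as a single hashable record and stamp its fingerprint on every score. The sketch below is a minimal illustration, not a prescribed schema: the `JudgePin` class, its field names, the `record_score` helper, and the snapshot ID `example-judge-2024-01-01` are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class JudgePin:
    """Everything that defines the measuring instrument (hypothetical schema)."""
    model_snapshot: str   # exact dated snapshot ID, never a floating alias
    temperature: float    # decoding parameters are part of the instrument
    max_tokens: int
    rubric_prompt: str    # the rubric text itself, not just its name

    def fingerprint(self) -> str:
        """Short stable hash over every field; any change yields a new value."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def record_score(example_id: str, score: float, pin: JudgePin) -> dict:
    """Build a score row that carries the identity of the judge that produced it."""
    return {
        "example_id": example_id,
        "score": score,
        "judge_snapshot": pin.model_snapshot,
        "judge_fingerprint": pin.fingerprint(),
    }


# Same model, same decoding parameters, slightly different rubric:
# still a different instrument, so the fingerprints differ.
pin_a = JudgePin("example-judge-2024-01-01", 0.0, 512,
                 "Score 1-10 for factual accuracy.")
pin_b = JudgePin("example-judge-2024-01-01", 0.0, 512,
                 "Score 1-10 for factual accuracy and tone.")
row = record_score("ex-001", 7.0, pin_a)
```

Hashing the rubric text along with the model ID means a quiet rubric edit shows up as a new instrument, which is the behavior the pin is meant to enforce.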

Keep a frozen anchor set. Maintain a small, stable set of outputs (fifty to a few hundred) that you never change, spanning the full quality range from clearly bad to clearly excellent. This is your standard reference. It is the eval equivalent of a calibration weight kept in a vault.

Run a judge-migration protocol, not a judge swap. When you change the judge, do not just start producing new numbers. First re-score the entire frozen anchor set with both the old judge and the new one. The two score vectors give you a measured mapping between the rulers: maybe the new judge runs 0.6 points stricter on the mid-range and agrees at the extremes, maybe it reshuffles a whole quality band. That mapping is your exchange rate. With it, you can restate old scores into the new scale — or at least quantify, honestly, that some comparisons cannot be salvaged and must be re-baselined. Without it, you are guessing.
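The core of the protocol can be sketched in a few lines: score the same anchor items with both judges, then summarize the difference per quality band. This is an illustrative sketch under assumptions. The band cut-offs (4 and 7 on a 10-point scale), the function names, and the sample scores are invented for the example; a real migration would likely also look at rank agreement.

```python
from statistics import mean


def band_of(score: float) -> str:
    """Assign a quality band from the OLD judge's score (cut-offs are assumptions)."""
    if score < 4:
        return "low"
    if score < 7:
        return "mid"
    return "high"


def migration_mapping(old_scores: dict[str, float],
                      new_scores: dict[str, float]) -> dict[str, dict]:
    """Compare two judges on the same frozen anchor set, band by band."""
    if old_scores.keys() != new_scores.keys():
        raise ValueError("both judges must score exactly the same anchor items")
    deltas: dict[str, list[float]] = {}
    for item_id, old in old_scores.items():
        deltas.setdefault(band_of(old), []).append(new_scores[item_id] - old)
    return {
        band: {
            "n": len(d),
            "mean_delta": mean(d),                    # systematic shift in this band
            "max_abs_delta": max(abs(x) for x in d),  # large values hint at reshuffling
        }
        for band, d in deltas.items()
    }


# Invented anchor scores: the new judge agrees at the extremes
# and runs half a point stricter in the mid-range.
old = {"a1": 2.0, "a2": 3.0, "a3": 5.0, "a4": 6.0, "a5": 9.0, "a6": 9.5}
new = {"a1": 2.0, "a2": 3.0, "a3": 4.5, "a4": 5.5, "a5": 9.0, "a6": 9.5}
mapping = migration_mapping(old, new)
```

A small `max_abs_delta` next to a non-zero `mean_delta` suggests a band can be restated with an offset. A large `max_abs_delta` suggests the new judge reorders items within that band, which is the case where re-baselining is the honest answer.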

Make the seam visible. Every judge change gets a changelog entry: date, old version, new version, and the anchor-set delta. The dashboard draws a literal vertical line at every judge migration, the way a stock chart marks a split. The line is not decoration. It is an instruction to the reader: do not run your eye across this point. A trend that crosses an unmarked judge change is not a trend; it is two trends in a trenchcoat.
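In code, a visible seam amounts to a changelog record plus a rule that trend calculations never span one. The sketch below assumes a hypothetical `JudgeChange` record and a `split_at_seams` helper; the dates, version names, and scores are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class JudgeChange:
    """One changelog entry per judge migration (hypothetical schema)."""
    changed_on: date
    old_version: str
    new_version: str
    anchor_mean_delta: float  # measured on the frozen anchor set


def split_at_seams(points: list[tuple[date, float]],
                   changes: list[JudgeChange]) -> list[list[tuple[date, float]]]:
    """Split a score time series into segments that each used a single judge."""
    seams = sorted(c.changed_on for c in changes)
    segments: list[list[tuple[date, float]]] = [[] for _ in range(len(seams) + 1)]
    for day, score in points:
        # A point belongs to the segment after every seam on or before its date.
        index = sum(1 for seam in seams if day >= seam)
        segments[index].append((day, score))
    return [segment for segment in segments if segment]


changelog = [JudgeChange(date(2024, 2, 15), "judge-v1", "judge-v2", -0.6)]
series = [
    (date(2024, 1, 1), 6.8),
    (date(2024, 2, 1), 7.0),
    (date(2024, 3, 1), 6.1),
    (date(2024, 4, 1), 6.3),
]
segments = split_at_seams(series, changelog)
```

Each segment can then be trended on its own, and the dashboard can draw its vertical line at every `changed_on` date.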

Calibration is continuous, not a one-time launch

Pinning the judge is necessary but not sufficient, because even a pinned judge does not hold perfectly still. Studies of LLM judges keep finding the same uncomfortable result: state-of-the-art judges show large per-instance score variance under trivial prompt perturbations such as reordering rubric items, renaming the score field, or swapping which candidate is presented first. In pairwise comparisons, simply swapping presentation order can move accuracy by more than ten percentage points. The instrument has measurement noise, and the noise is not small.
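Presentation-order sensitivity is one noise source that is cheap to measure directly. The sketch below assumes a pairwise judge exposed as a function returning `"first"` or `"second"`; that interface and the two toy judges are stand-ins for illustration, not any provider's API.

```python
from typing import Callable, Sequence

# Assumed interface: the judge sees two candidates and names the better slot.
PairJudge = Callable[[str, str], str]  # returns "first" or "second"


def position_flip_rate(judge: PairJudge,
                       pairs: Sequence[tuple[str, str]]) -> float:
    """Fraction of pairs whose winner changes when presentation order is swapped."""
    if not pairs:
        raise ValueError("need at least one pair")
    flips = 0
    for a, b in pairs:
        winner_forward = a if judge(a, b) == "first" else b
        winner_backward = b if judge(b, a) == "first" else a
        if winner_forward != winner_backward:
            flips += 1
    return flips / len(pairs)


def always_first(a: str, b: str) -> str:
    """Toy judge with pure position bias: whatever is shown first wins."""
    return "first"


def longer_wins(a: str, b: str) -> str:
    """Toy judge with verbosity bias but no position bias."""
    return "first" if len(a) >= len(b) else "second"


pairs = [
    ("short answer", "a much longer and more detailed answer"),
    ("tiny", "considerably longer text"),
]
biased_rate = position_flip_rate(always_first, pairs)
stable_rate = position_flip_rate(longer_wins, pairs)
```

The two toy judges mark the extremes: the position-biased one flips on every pair, while the verbosity-biased one never flips even though it is wrong for a different reason. A low flip rate rules out one bias, not all of them.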

This means a pinned judge is a stable instrument, not a calibrated one. You still need to know how its scores map to ground truth — to human judgment — and that mapping drifts even when the version does not, because your input distribution drifts. The outputs your product generated six months ago are not the outputs it generates today, and a judge calibrated on the old distribution may quietly mis-grade the new one.

The practice that addresses this is a recurring calibration check: on a fixed cadence, re-run the judge against a human-labeled calibration set and measure agreement. If agreement decays, the judge needs re-grounding — sharper rubric anchors, few-shot examples for each score level, or bias corrections for the verbosity and self-preference effects that are well-documented and predictable. A well-calibrated judge can reach human-to-human agreement levels, but that number is earned and maintained, not granted at setup. Treating calibration as a launch checklist item rather than a standing process is how a judge silently rots into a number generator.
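The agreement measurement itself can be small. The sketch below uses Cohen's kappa, which is one common chance-corrected agreement statistic and a choice made here for illustration; the text above does not prescribe a specific metric. The labels and the `0.6` alert threshold are invented.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


ALERT_THRESHOLD = 0.6  # hypothetical: below this, schedule a re-grounding pass

human_labels = ["pass", "pass", "fail", "fail"]
judge_this_month = ["pass", "pass", "fail", "fail"]
judge_after_drift = ["pass", "fail", "pass", "fail"]

kappa_now = cohens_kappa(human_labels, judge_this_month)
kappa_drifted = cohens_kappa(human_labels, judge_after_drift)
needs_regrounding = kappa_drifted < ALERT_THRESHOLD
```

Running this on a fixed cadence against the same human-labeled set turns calibration into a tracked number, so decay shows up as a trend instead of a surprise.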

What to do on Monday

Start with an audit. For your most-watched eval dashboard, answer one question: which exact judge version produced each point on this line? If you cannot answer it for every point, you do not have a quality trend. You have a sequence of measurements from possibly different instruments arranged left to right, and any conclusion drawn across it is unsupported.
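If the scores already live in a table, the audit question reduces to a short query. The sketch below assumes each score row is a dict with an optional `judge_version` key; the field name, the helper, and the sample rows are hypothetical.

```python
def audit_judge_provenance(rows: list[dict]) -> dict:
    """Summarize which judge versions produced a series, and how many rows cannot say."""
    labeled = [r["judge_version"] for r in rows if r.get("judge_version")]
    versions = sorted(set(labeled))
    unlabeled = len(rows) - len(labeled)
    return {
        "total": len(rows),
        "unlabeled": unlabeled,
        "versions": versions,
        # Only a fully labeled, single-version series is safe to read as one trend.
        "single_instrument": unlabeled == 0 and len(versions) == 1,
    }


rows = [
    {"score": 6.8, "judge_version": "judge-v1"},
    {"score": 7.0, "judge_version": "judge-v1"},
    {"score": 7.2},  # no recorded judge identity
    {"score": 6.1, "judge_version": "judge-v2"},
]
report = audit_judge_provenance(rows)
```

Here `report["single_instrument"]` is `False` for two independent reasons: one row has no recorded judge, and the labeled rows span two versions.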

Then make the cheap fixes first. Pin the judge to a dated snapshot if your provider offers one. Build the frozen anchor set this week; it does not need to be large to be useful. Add the judge version as a recorded field on every score your pipeline writes from now on, so future-you is never in this position again.

The reframe worth carrying: an LLM judge is a dependency with a version, a behavior profile, and a release cadence — not a neutral oracle. Every other versioned dependency in your stack already has a changelog, a pinning strategy, and a migration protocol. The judge sits closer to the center of your decision-making than most of them, because it defines what "better" even means for your product. An eval history that does not record which judge produced each score is a ledger written in a currency that keeps getting redefined — and a roadmap built on that ledger is being steered by an exchange rate nobody computed.
