The Eval Harness Whose Judge Model Was Upgraded Silently
A six-point lift across every eval category arrives the same week you shipped a prompt change. The room reads it as proof the change worked. Three weeks later, someone notices the lift also showed up in categories the prompt change could not possibly have touched — a control set you keep specifically to detect this — and the lift is uniformly distributed, the kind of shape a real product improvement never has. The judge model was rolled out under the same endpoint name on a Tuesday. Your scores moved before your system did.
This is the failure mode that breaks LLM-as-a-judge eval pipelines more quietly than any of the failure modes the literature warns about. Not bias, not position effects, not self-preference — those are properties of a judge at a point in time, and your eval design probably already accounts for them. The one that gets you is the judge changing while you're not looking, while your endpoint name and your eval code and your dashboards all keep claiming nothing happened. The unit of measurement shifted under a stable label. Every comparison across the migration boundary is now confounded, and you cannot decompose the delta into "our system improved" and "the ruler got more generous" because you never built the instrument to do that decomposition.
The Judge Is a Measuring Instrument, Not a Library Call
Every other measurement tool in engineering treats calibration as a contract, not a matter of vibes. A torque wrench has a calibration certificate with a date. A pH meter gets re-anchored against buffer solutions on a schedule. A well-run A/B test framework versions its variance estimator and refuses to compare experiments that ran under different versions without a migration note. The judge model in your eval pipeline does none of this by default, and the API surface it presents actively encourages you to forget that calibration exists.
The judge produces a number. The number is reported to stakeholders. The stakeholders treat the number the way they treat a test pass rate, as something whose meaning is stable across time. That last assumption is false. The judge is a measuring instrument whose calibration is rented from a provider on terms the provider can change unilaterally, and the contract you have with that provider almost certainly does not include "we will tell you when we change how the judge scores." Quality improvements are a feature the provider ships. They are not a feature you want shipping mid-quarter into your only durable record of whether your system is getting better.
The mental model worth adopting: the judge is not part of your eval logic, it is part of your eval infrastructure, and infrastructure that changes without notice is a class of risk you already know how to handle in every other context. You version it, you pin it, and you treat any change to it as a migration with explicit before/after measurement, not as a transparent upgrade you should silently inherit.
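One way to make the "explicit before/after measurement" concrete is to score the same frozen system outputs with both the old and the new judge snapshot, then subtract the resulting shift from the delta you observed across the migration. The sketch below assumes the judge shift is roughly additive, which is a simplification, and the function names are hypothetical rather than part of any library.

```python
from statistics import mean
from typing import Sequence


def judge_shift(old_scores: Sequence[float], new_scores: Sequence[float]) -> float:
    """Mean paired difference (new judge minus old judge).

    Both sequences must score the SAME frozen outputs in the same order,
    so any difference is attributable to the judge, not the system.
    """
    if not old_scores or len(old_scores) != len(new_scores):
        raise ValueError("need two non-empty, equal-length score lists")
    return mean([new - old for old, new in zip(old_scores, new_scores)])


def decompose_delta(observed_delta: float, shift: float) -> dict[str, float]:
    """Split an observed score change into judge shift and system change.

    Assumption: the shift is additive and uniform, which real judges
    only approximate. Treat the result as an estimate.
    """
    return {"judge_shift": shift, "system_change": observed_delta - shift}
```

If the paired shift accounts for most of the observed delta, the dashboard moved because the ruler moved. The decomposition is only possible if you kept the frozen outputs and can still call the old snapshot, which is another reason to plan the migration rather than inherit it.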
Why the Endpoint Name Is Not a Version
The default integration looks like a single string — a model name, maybe an alias like latest or just the family name with no version suffix at all. Engineers reading the code see that string as a version, the same way python==3.11.4 is a version. It isn't. The string is a stable handle that may or may not resolve to a stable artifact. Providers will swap the underlying weights behind that handle to roll out quality improvements, safety patches, latency optimizations, and quantization changes, and they will do all of that without changing the string you pinned. Your "pin" is a name that resolves through a moving redirect.
The pinning that actually pins is a snapshot ID: a model version with a date or a hash that the provider commits to keeping immutable. Most provider APIs expose this; very few client integrations use it. The default examples in provider docs use the family alias because the alias lets readers pick up upgrades for free. That tradeoff makes sense for a chat product where freshness is good. It does not make sense for a measurement instrument where reproducibility is the whole point.
The fix is mechanical and cheap: at the API layer where your eval pipeline calls the judge, refuse to accept anything except a fully-qualified snapshot ID. Wrap the provider client in a thin adapter that raises if the caller passes a family alias. Make the snapshot ID a value committed to your eval config the same way you commit a dataset hash. Now an endpoint upgrade requires an explicit pull request that changes the snapshot ID, and the migration becomes a thing someone is responsible for instead of a thing that happens to you.
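A minimal sketch of such an adapter follows. It assumes snapshot IDs end in a date suffix (for example `-2025-01-15` or `-20250115`); that pattern, the model names in the usage note, and the `call_provider` callable are illustrative assumptions, so adapt them to whatever immutable-version scheme your provider documents.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Assumption: a pinned snapshot ID ends in a date suffix.
# Replace this pattern with your provider's documented scheme.
SNAPSHOT_PATTERN = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


class UnpinnedJudgeError(ValueError):
    """Raised when the judge is addressed by an alias instead of a snapshot."""


def require_snapshot_id(model_id: str) -> str:
    """Return model_id unchanged if it looks pinned; raise otherwise."""
    if not SNAPSHOT_PATTERN.search(model_id):
        raise UnpinnedJudgeError(
            f"{model_id!r} looks like a family alias; "
            "the eval pipeline only accepts a dated snapshot ID"
        )
    return model_id


@dataclass(frozen=True)
class PinnedJudge:
    """Thin wrapper that refuses to be constructed with an unpinned model."""

    snapshot_id: str
    # Hypothetical provider call: (model_id, prompt) -> numeric score.
    call_provider: Callable[[str, str], float]

    def __post_init__(self) -> None:
        require_snapshot_id(self.snapshot_id)

    def score(self, prompt: str) -> float:
        return self.call_provider(self.snapshot_id, prompt)
```

With this in place, constructing `PinnedJudge("judge-large-latest", ...)` fails at startup, while `PinnedJudge("judge-large-2025-01-15", ...)` succeeds, so the only way to change the judge is to change the committed string.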
The Calibration Anchor Set
Pinning the judge to a snapshot ID prevents the silent-upgrade case. It does not give you the ability to compare scores across the migration when you do upgrade — and you will, eventually, because the old snapshot will deprecate. For that you need a calibration anchor set: a small, immutable collection of reference examples whose expected scores you have manually graded and committed to, and whose actual scores any judge run must match within a tolerance before the broader eval is considered valid.
The anchor set is to your judge what the buffer solution is to your pH meter. It exists to answer one question: is this instrument, right now, returning the numbers it returned the last time we checked? You build it once with care — a few dozen examples per scoring dimension, spanning the full range you expect to score, each with a human-graded ground-truth score that the team has reviewed and signed off on. You version it. You never let it leak into the training set of any model your team builds. And you run the judge against it before every eval cycle.
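The pre-eval check described above can be sketched as a gate that runs the judge over every anchor and refuses to proceed when scores fall outside tolerance. The `Anchor` type, the default tolerance of 0.5, and the zero-failure default are assumptions chosen for illustration; pick values that match your scoring scale.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Anchor:
    anchor_id: str
    text: str
    human_score: float  # reviewed ground truth; never re-graded from judge output


def failing_anchors(
    anchors: Sequence[Anchor],
    judge: Callable[[str], float],
    tolerance: float,
) -> list[tuple[str, float, float]]:
    """Return (anchor_id, expected, actual) for anchors outside tolerance."""
    failures = []
    for anchor in anchors:
        actual = judge(anchor.text)
        if abs(actual - anchor.human_score) > tolerance:
            failures.append((anchor.anchor_id, anchor.human_score, actual))
    return failures


def assert_judge_calibrated(
    anchors: Sequence[Anchor],
    judge: Callable[[str], float],
    tolerance: float = 0.5,  # assumed value, tune to your scale
    max_failures: int = 0,   # assumed value, strictest setting
) -> None:
    """Raise before the broader eval runs if the judge has drifted."""
    failures = failing_anchors(anchors, judge, tolerance)
    if len(failures) > max_failures:
        raise RuntimeError(f"judge failed calibration on {len(failures)} anchors: {failures}")
```

Calling `assert_judge_calibrated` as the first step of every eval cycle turns drift into a hard failure with a list of offending anchors, rather than a number that quietly changes meaning.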
A few properties matter:
- Immutable: the anchor scores do not get re-graded based on new judge outputs. The whole point is that the anchor is fixed and the judge moves relative to it.
- Covers the score range: anchors clustered near the mean tell you nothing about whether the judge has gotten harsher at the tails, which is the most common shape of calibration drift.
- Includes near-boundary cases: examples where a one-point shift in the judge would flip the verdict are where drift surfaces first.
- Held privately: if the anchor set leaks into provider training data, it stops being an anchor.
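Two of these properties can be checked mechanically. The sketch below fingerprints the anchor set so that any edit to it is visible in review, and reports which points on the scale have no anchor. The record shape (`id`, `text`, `score` keys) and the 1 to 5 integer scale are assumptions for illustration only.

```python
import hashlib
import json
from typing import Sequence


def anchor_set_fingerprint(anchors: Sequence[dict]) -> str:
    """SHA-256 over a canonical JSON form of the anchor set.

    Sorting by id and by key makes the hash independent of file order,
    so it changes only when an anchor's content or score changes.
    """
    ordered = sorted(anchors, key=lambda a: str(a["id"]))
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def uncovered_scores(
    anchors: Sequence[dict],
    scale: Sequence[int] = (1, 2, 3, 4, 5),  # assumed scoring scale
) -> list[int]:
    """Return scale points that no anchor's human score covers."""
    seen = {a["score"] for a in anchors}
    return [s for s in scale if s not in seen]
```

Committing the fingerprint next to the judge snapshot ID means a re-graded anchor shows up as a diff, and a non-empty result from `uncovered_scores` flags a set that cannot detect drift at the tails.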
