The Eval Harness Whose Judge Model Was Upgraded Silently
A six-point lift across every eval category arrives the same week you shipped a prompt change. The room reads it as proof the change worked. Three weeks later, someone notices that the lift also showed up in categories the prompt change could not possibly have touched (a control set you keep specifically to detect this) and that it is uniform across categories, a shape a real product improvement never has. A new judge model had been rolled out under the same endpoint name on a Tuesday. Your scores moved before your system did.
This is the failure mode that breaks LLM-as-a-judge eval pipelines more quietly than any of the failure modes the literature warns about. Not bias, not position effects, not self-preference — those are properties of a judge at a point in time, and your eval design probably already accounts for them. The one that gets you is the judge changing while you're not looking, while your endpoint name and your eval code and your dashboards all keep claiming nothing happened. The unit of measurement shifted under a stable label. Every comparison across the migration boundary is now confounded, and you cannot decompose the delta into "our system improved" and "the ruler got more generous" because you never built the instrument to do that decomposition.
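One way such an instrument could look, as a minimal sketch rather than a definitive method: treat the control categories as a reading of the ruler itself. The sketch assumes the judge's shift is additive and roughly the same in every category, and that the system change cannot affect the control categories at all. The function name `decompose_lift`, the category names, and all scores are hypothetical, chosen only to mirror the six-point example above.

```python
from statistics import mean


def decompose_lift(before, after, control_categories):
    """Split per-category score deltas into judge drift and system effect.

    Assumption (not from any real pipeline): the judge's shift is additive
    and uniform, so the mean delta on control categories estimates it.
    """
    deltas = {cat: after[cat] - before[cat] for cat in before}
    control_deltas = [deltas[cat] for cat in control_categories]
    if not control_deltas:
        raise ValueError("need at least one control category")

    # Control categories cannot have been touched by the system change,
    # so any movement there is attributed to the judge.
    judge_drift = mean(control_deltas)

    # What remains in the other categories after removing the drift
    # is the estimate of the real system effect.
    system_effect = {
        cat: delta - judge_drift
        for cat, delta in deltas.items()
        if cat not in control_categories
    }
    return judge_drift, system_effect


# Hypothetical scores: every category, control or not, rises six points.
before = {"summarization": 70.0, "extraction": 64.0,
          "control_math": 58.0, "control_safety": 81.0}
after = {"summarization": 76.0, "extraction": 70.0,
         "control_math": 64.0, "control_safety": 87.0}

judge_drift, system_effect = decompose_lift(
    before, after, {"control_math", "control_safety"}
)
print(judge_drift)    # drift attributed to the judge
print(system_effect)  # residual attributed to the system change
```

With these made-up numbers the whole six points land on the judge and nothing is left for the prompt change. If the judge's shift is not uniform across categories, this subtraction no longer holds and the estimate should be read as a rough bound, not a measurement.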
