The Eval Pickle: When Your LLM Judge Gets Smarter Than the Model It Grades
A regression alert fires on Monday morning. Faithfulness on your held-out eval set dropped from 0.86 to 0.78 over the weekend. Nobody shipped a new model. Nobody touched the prompt. Nobody changed the retrieval index. The on-call engineer spends three hours digging before noticing the only thing that changed was the judge model — the auto-evaluator quietly rolled forward to a newer snapshot that catches subtle hedging the old one waved through. Same answers. Same model. Worse score. Real number, fake regression.
This is the eval pickle: as your LLM-as-judge gets sharper, your scores on a frozen system slide down, and the dashboard that's supposed to detect regressions starts manufacturing them. The team that doesn't notice spends quarters chasing "quality drift" that lives entirely in the ruler.
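One cheap defense is to keep a frozen anchor set — answers that never change — scored once under the pinned judge, and re-score it whenever the judge version moves. If scores shift on frozen answers, the ruler moved, not the model. A minimal sketch (the `judge_drift` helper, threshold, and score values are all hypothetical, not from any particular eval framework):

```python
from statistics import mean

def judge_drift(anchor_scores: dict[str, float],
                rescored: dict[str, float],
                threshold: float = 0.02) -> tuple[float, bool]:
    """Return (mean score delta on the anchor set, drift flag).

    anchor_scores: id -> score from the pinned judge version
    rescored:      id -> score for the *same frozen answers* from the
                   current judge version
    """
    deltas = [rescored[k] - anchor_scores[k] for k in anchor_scores]
    shift = mean(deltas)
    return shift, abs(shift) > threshold

# Illustrative numbers in the spirit of the Monday-morning scenario:
# identical answers, new judge snapshot.
pinned  = {"q1": 0.90, "q2": 0.85, "q3": 0.83}  # scored when judge was pinned
current = {"q1": 0.82, "q2": 0.76, "q3": 0.76}  # same answers, newer judge

shift, drifted = judge_drift(pinned, current)
# A nonzero shift on frozen answers means the apparent regression lives in
# the judge, and the dashboard baseline should be re-scored, not the model.
```

The design choice worth stressing: the anchor set must be genuinely frozen — cached model outputs, not fresh generations — or the check can't attribute the shift to either side.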
