LLM-as-Judge Adversarial Failures: When Your Eval Harness Gets Gamed
Your LLM-as-judge gave your new model a clean bill of health. Win rates are up, rubric scores improved across the board, and the automated eval pipeline ran green. Then you shipped — and user satisfaction dropped.
This is not an edge case. Researchers built constant-output "null models" that produce the exact same response regardless of input and gamed AlpacaEval 2.0 to an 86.5% length-controlled win rate. The verified state of the art at the time was 57.5%. When a model with no task capability at all can top your leaderboard, your eval harness has a problem that's worth understanding systematically.
