The LLM-as-Judge Ensemble That Agreed Because All Judges Were the Same Family
Your evaluation pipeline runs a three-judge ensemble against every model output. The judges are GPT-4 with a strict rubric, GPT-4 with a permissive rubric, and GPT-4 with a chain-of-thought rubric. They agree on 91% of cases. You report inter-judge agreement of 0.83 Krippendorff's alpha to the launch review committee. The number lands in the "substantial agreement" band that every methodology textbook treats as a green light. Three model upgrades ship against that number over six months.
An external auditor swaps one of the three judges for Claude using the same rubric and the agreement rate on hard cases drops to 64%. The eval score that justified the last three upgrades turns out to be a number that depends on which provider family you treat as ground truth. The upgrades were upgrades against GPT-4 family preferences, not against quality — because the judges were the model being judged's siblings.
