The Judge That Agreed With Itself Across A and B

May 25, 2026 · 10 min read

Software Engineer

You run an A/B test on two prompt variants. Your LLM judge — same vendor as your candidate model, because it was the easy default — consistently prefers variant A by a margin large enough to call statistically meaningful. You ship A. A week later your satisfaction metric is down, your refund queue is up, and nobody can explain it. Somebody finally re-runs the eval with a judge from a different model family. The preference flips.

The judge was not measuring quality. The judge was measuring how much the candidate sounded like the judge.

This is self-preference bias, and it is the quiet methodological flaw that turns an LLM-as-judge pipeline into a hall of mirrors. The paper that first quantified it on GPT-4 found a roughly 10% win rate inflation on the model's own outputs, with the bias traced to perplexity: judges systematically reward text they themselves find more probable, regardless of whether the user would call it correct. The corollary practitioners keep learning the hard way is that the bias does not require the judge and candidate to be the same model — siblings in the same family share enough of those priors that the within-family preference still shows up in the score.

What looks like an eval is actually a within-family taste poll. The team has built infrastructure to measure quality and is in fact measuring stylistic homogeneity.

Why Same-Family Judging Looks Right Until It Doesn't

The reason this failure mode is so persistent is that it has none of the symptoms teams are trained to watch for. The judge gives stable scores. The variance is low. The inter-run agreement is high. The dashboards look healthier than the team's last A/B. Every monitoring signal you set up to catch a broken eval pipeline reports green, because the pipeline is not broken — it is biased in a direction that produces clean, confident numbers.

The drift only becomes visible when the production signal disagrees. Satisfaction dips, complaint rates climb, the support team starts forwarding screenshots, and the team is left arguing about whether the production metric is noisy because the eval told them with a straight face that variant A was a clear improvement. Trust flows toward the metric with the smaller error bars, and that's the judge, so the team often spends weeks blaming the production telemetry before suspecting the eval itself.

There is a specific failure pattern hiding underneath. When the judge and the candidate are from the same family, the candidate gets rewarded for matching the judge's preferred formatting (bullet density, header style, hedge phrases, length distribution) and the judge's preferred reasoning shape (the order in which considerations get raised, the rhetorical tells of "let me think step by step"). None of that is correlated with what your users hired the product to do. The candidate has learned to sound like the judge instead of being right, and the judge has no way to tell the difference because the styles match its own priors.

Preference Leakage Is a Calibration Story, Not a Cheating Story

It is tempting to frame this as the judge cheating for its sibling. That framing is comforting but wrong. The judge is doing its job — assigning higher probability to text that fits its training distribution. The cheating is in how the team is interpreting those probabilities. They are treating a likelihood signal as a quality signal, and the two only happen to align on the slice of behavior where the judge's training distribution and the user's preferences overlap.

This matters because the fix is not "find a less biased judge." Every judge has priors, and those priors will lean toward something. The fix is to stop using a single judge as the arbiter and start treating the judging step as a measurement instrument that itself needs calibration. The question becomes not "which judge is right" but "what is the disagreement structure across judges telling me, and how does that structure track human verdicts on a held-out set."

Once the team makes that shift, the multi-judge ensemble stops looking like an expensive luxury and starts looking like the minimum viable architecture. A single judge gives you a point estimate with no uncertainty. Multiple judges drawn from different families give you a distribution, and the disagreement in that distribution is itself the most useful signal for routing borderline cases to a human reviewer.

The Multi-Vendor Ensemble Is a Cost Frame, Not a Technical Problem

The objection that always surfaces here is cost. Running three judges from three different vendors is 2–3x the per-evaluation spend of a single-vendor pipeline, and the engineer presenting the change needs to justify the line item. The reframe that lands is the decision frame: this evaluation is the input to a deployment decision that will affect every user who touches the feature. If variant A ships to a million sessions, the cost of getting that decision wrong dwarfs the cost of the judges by orders of magnitude. The multi-vendor judge is a hedge against a wrong-ship, and the cost is bounded and one-time per release cycle.

In practice, "different families" means picking judges whose training data, post-training recipe, and likely tokenizer behavior diverge enough to break the shared-prior pattern. A GPT family judge, an Anthropic family judge, and an open-weight family judge is a reasonable starting triple. What you are buying is independent error — the assumption is not that any single judge is correct, but that the kinds of mistakes one judge makes are uncorrelated with the kinds the other two make. When the three agree, the signal is strong. When they disagree, you have flagged exactly the cases where the eval is least trustworthy and a human should look.

Loading…

References:

Let's stay in touch and Follow me for more thoughts and updates

Twitter LinkedIn Telegram Discord 小红书

The Judge That Agreed With Itself Across A and B

Why Same-Family Judging Looks Right Until It Doesn't

Preference Leakage Is a Calibration Story, Not a Cheating Story

The Multi-Vendor Ensemble Is a Cost Frame, Not a Technical Problem

Recommended Reading

About Tian Pan

Why Same-Family Judging Looks Right Until It Doesn't​

Preference Leakage Is a Calibration Story, Not a Cheating Story​

The Multi-Vendor Ensemble Is a Cost Frame, Not a Technical Problem​

Recommended Reading

About Tian Pan

Why Same-Family Judging Looks Right Until It Doesn't

Preference Leakage Is a Calibration Story, Not a Cheating Story

The Multi-Vendor Ensemble Is a Cost Frame, Not a Technical Problem