
The LLM-as-Judge Ensemble That Agreed Because All Judges Were the Same Family

Tian Pan
Software Engineer

Your evaluation pipeline runs a three-judge ensemble against every model output. The judges are GPT-4 with a strict rubric, GPT-4 with a permissive rubric, and GPT-4 with a chain-of-thought rubric. They agree on 91% of cases. You report inter-judge agreement of 0.83 Krippendorff's alpha to the launch review committee. The number lands in the "substantial agreement" band that every methodology textbook treats as a green light. Three model upgrades ship against that number over six months.

An external auditor swaps one of the three judges for Claude using the same rubric, and the agreement rate on hard cases drops to 64%. The eval score that justified the last three upgrades turns out to be a number that depends on which provider family you treat as ground truth. The upgrades were upgrades against GPT-4 family preferences, not against quality, because the judges were siblings of the model being judged.

The mistake was not the rubric. The mistake was the sampling. An ensemble of three judges drawn from one family is one judge with three prompts attached, and the agreement metric is measuring the family's internal coherence, not the population's. The 91% was an artifact of provider selection that nobody in the room had named.

Why Three Judges From One Family Is One Judge

The standard intuition is that a judge ensemble is a sample from a distribution of "reasonable graders," and the agreement rate measures whether the rubric is well-defined enough that reasonable graders converge. That intuition smuggles in independence. If your three graders are three rubric variants of GPT-4, they are not three samples from "reasonable graders." They are three samples from "GPT-4 reading slightly different rubrics."

Models from the same family share training corpora, post-training preferences, formatting expectations, refusal styles, and stylistic priors. A 2025 paper called Preference Leakage formalized this as three types of relatedness between data generator and judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. All three types produce inflated agreement and inflated scores when the candidate model overlaps the judge's family.

Multiple studies have measured the magnitude. Self-preference adds a bias of roughly 10 to 25 percent when the same model is used as judge and candidate. Family bias, the same effect in slightly weaker form, survives even when the specific checkpoint differs but the lineage matches. A judge that prefers its sibling does not need to know it has a sibling. The lower perplexity of a same-family response is enough; the judge calls it "more fluent" or "better structured" or "more aligned to the rubric" without ever invoking the family it came from.

The implication for an all-same-family judge ensemble is brutal: your agreement rate is bounded below by family coherence and above by something close to 1.0, because the judges are inheriting the same priors. The number tells you the family is internally consistent. It does not tell you the judgments are correct.

What Your Agreement Number Was Actually Measuring

A high Krippendorff's alpha among three GPT-4 rubric variants is approximately the result you would get if you asked one person to grade an assignment three times under three slightly different prompts and computed the agreement across the three attempts. You would expect high agreement, because it is one person. You would not call that inter-rater reliability. You would call it intra-rater consistency.

Your ensemble was reporting intra-family consistency as if it were inter-judge reliability. The reviewer who saw 0.83 and thought "substantial agreement, ship it" was reading a metric whose denominator had been silently changed: the population of raters behind the number was one family, not the population of reasonable graders.

The fix is not to use a stricter rubric. The rubric is not the problem; the family is. A stricter rubric makes the three same-family judges agree slightly less because the rubric introduces more decision points, but the underlying priors that drive each decision are still shared. You get a slightly lower number that still measures the same thing.

The fix is to make the sampling actually sample. Krippendorff's alpha is appropriate for measuring agreement across many annotators, but it is only meaningful if the annotators are drawn from the population whose agreement you actually care about. For an LLM judge ensemble, that population is "models that will reasonably grade this task." A sample drawn entirely from one provider family is not representative of that population.
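To make the sampling point concrete, here is a minimal sketch of Krippendorff's alpha for nominal labels, written from the standard coincidence-matrix definition. The function name and the input shape (one list of judge labels per evaluated item) are assumptions for illustration, not part of any particular eval framework.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels that the
    judges assigned to one evaluated item (hypothetical input shape).
    """
    # Coincidence matrix: every ordered pair of labels within a unit,
    # weighted by 1 / (m - 1) where m is the number of labels in the unit.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit with a single rating cannot be paired
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable values.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        return float("nan")  # undefined when only one label is ever used
    return 1.0 - (n - 1) * observed / expected
```

The function has no notion of where the judges came from. Fed three rubric variants of one model, it returns a perfectly valid alpha for that family; whether the number describes the population you care about is decided entirely by who is in `units`.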

A Cross-Family Judge Is Not Optional, It Is the Anchor

The empirical literature is clear that diverse imperfect judges outperform correlated judges that are individually stronger. A 2026 default ensemble looks like one judge from Anthropic, one from OpenAI, and one from Google: three families, three lineages, three sets of training-data priors. The pairwise disagreement across families is the signal you were trying to measure all along.

This costs roughly 3x a single judge in API spend, and the cost is the point. A three-judge ensemble drawn from one family is cheaper per evaluation but produces a metric that is structurally unable to detect family-correlated mistakes. You are paying for redundancy, but redundancy in a single dimension. The cross-family ensemble pays for redundancy in the dimension that matters — the dimension where systematic bias actually lives.

The right way to budget this is to treat the cross-family judge as the anchor and the same-family judges as the variance reduction. One judge from a different family than the candidate model is the minimum viable independence. Two would be better. A composition policy that forbids ensembles drawn entirely from one family — and forbids ensembles that include the candidate model's own family at majority strength — is the cheapest defensible discipline.

If the team's instinct is to push back on cost, the question to put to them is: what does the eval score certify? If it certifies "GPT-4 family judges believe this output is good," that is the metric you have, and the model upgrades it justifies are upgrades against GPT-4 family preferences. If it is supposed to certify quality, the sampling has to be wide enough to measure quality.

The Aggregation Method Matters More Than You Think

Majority vote across an ensemble of three is the obvious aggregation, and it is the right default when the judges are independent. With three same-family judges, majority vote is a no-op — the majority is whatever GPT-4 thinks, plus or minus a rubric perturbation that almost never flips the decision.

Across a cross-family ensemble, the aggregation choice becomes load-bearing. Majority vote with three families means a single family can be overruled. Weighted vote — where the weights come from per-family calibration against a human gold set — is more defensible but requires keeping a calibration set fresh. Some teams adopt a stricter rule: a case is "approved" only if all three families agree, and disagreement triggers human review.
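The three aggregation rules described above can be sketched side by side. This is an illustrative sketch only: the function names, the family keys, and the `"needs_human_review"` sentinel are assumptions, and the weights are presumed to come from whatever per-family calibration the team maintains.

```python
from collections import Counter


def majority_vote(verdicts):
    """verdicts: dict mapping judge family -> label.

    Returns the label held by a strict majority, or None if there is none.
    """
    label, count = Counter(verdicts.values()).most_common(1)[0]
    return label if count > len(verdicts) / 2 else None


def weighted_vote(verdicts, weights):
    """Sum each family's weight behind its label and return the heaviest label.

    weights: dict mapping judge family -> calibration weight (hypothetical).
    """
    scores = Counter()
    for family, label in verdicts.items():
        scores[label] += weights[family]
    return max(scores, key=scores.get)


def strict_agreement(verdicts):
    """Approve a label only when every family agrees; otherwise escalate."""
    labels = set(verdicts.values())
    return labels.pop() if len(labels) == 1 else "needs_human_review"
```

On the same split verdict the three rules can return three different answers, which is why the choice of rule belongs in the written methodology rather than in an implementation detail.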

The strict-agreement rule is the most informative variant for evaluation work that gates a launch. The cases where Claude and Gemini agree but GPT-5.1 dissents are precisely the cases where the candidate model's family-internal preferences are diverging from the broader population's preferences. That is the signal an all-GPT ensemble could not produce. It is also the signal where prompt-injection vulnerabilities, factuality regressions, and stylistic over-fitting tend to hide.

A position-swap discipline composes well with cross-family aggregation. Run A-then-B and B-then-A orderings, count the verdict only when both orderings agree, and aggregate across families. This catches more positional bias than any other single fix, and it catches family-specific positional preferences that a same-family ensemble has no chance of seeing.
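A minimal sketch of the position-swap check, assuming each judge is a callable that takes two responses in presentation order and returns `"first"` or `"second"`. That interface and the function names are hypothetical stand-ins for whatever wraps the real API calls.

```python
def position_consistent_verdict(judge, a, b):
    """Return "a" or "b" if the judge picks the same response in both
    orderings, or None if the verdict flips when the order is swapped.
    """
    forward = judge(a, b)   # a is shown first
    backward = judge(b, a)  # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    return winner_forward if winner_forward == winner_backward else None


def cross_family_verdicts(judges, a, b):
    """judges: dict mapping family name -> judge callable.

    Returns one position-checked verdict per family; None marks a family
    whose verdict did not survive the swap and should not be counted.
    """
    return {
        family: position_consistent_verdict(judge, a, b)
        for family, judge in judges.items()
    }
```

The per-family dictionary can then be passed to whichever aggregation rule the team uses, after dropping or escalating the `None` entries.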

Calibrating Against Humans Is Not Optional Either

Inter-judge agreement is not the same as judgment quality. Two judges can agree perfectly and both be wrong. Three judges from one family can agree at 91% and all be over-rating their sibling. The only way to detect this is to anchor the ensemble against a human-labeled sample on a periodic cadence.

For categorical labels, Cohen's kappa or Krippendorff's alpha between the ensemble's verdicts and a human gold set is the metric to report alongside inter-judge agreement. The conventional "substantial agreement" threshold of roughly 0.6 (the lower edge of the Landis and Koch band for kappa) applies to the ensemble-vs-human comparison, not to the inter-judge comparison. A team that reports 0.83 inter-judge agreement without reporting the ensemble-vs-human number has reported the easier number and left the harder one unmeasured.
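For reference, Cohen's kappa between the ensemble's final verdicts and the human gold labels fits in a few lines. This sketch assumes two equal-length lists of categorical labels; the function and argument names are illustrative.

```python
from collections import Counter


def cohens_kappa(ensemble_labels, human_labels):
    """Cohen's kappa between two label sequences of equal length."""
    if len(ensemble_labels) != len(human_labels) or not ensemble_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(ensemble_labels)

    # Observed agreement: fraction of items where the two labels match.
    observed = sum(e == h for e, h in zip(ensemble_labels, human_labels)) / n

    # Chance agreement: product of each side's marginal label frequencies.
    freq_e, freq_h = Counter(ensemble_labels), Counter(human_labels)
    expected = sum(freq_e[label] * freq_h[label] for label in freq_e) / (n * n)

    if expected == 1.0:
        return float("nan")  # undefined when both sides use a single label
    return (observed - expected) / (1.0 - expected)
```

Reporting this number next to the inter-judge alpha makes the gap between "the judges agree with each other" and "the judges agree with people" visible in the same table.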

The per-family bias on the human-judged sample is the standing metadata that makes the eval interpretable. If GPT-5.1 over-rates GPT family outputs by 12 points on the gold set, that 12-point bias is a known quantity the eval can correct for. If you have never measured it, you are shipping upgrades whose magnitude is an unknown function of the bias you have not characterized. The team that reports the per-family bias as part of every eval result has built a system that can tell the difference between "this candidate is better" and "this candidate is more similar to the judges."
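One way to compute that standing metadata, sketched under the assumption that each gold-set record carries the judge's family, the candidate's family, the judge's score, and the human score on the same scale. The record keys and the `own_family` / `other_family` split are hypothetical naming choices.

```python
from collections import defaultdict
from statistics import mean


def per_family_bias(records):
    """Mean (judge score - human score), grouped by judge family and by
    whether the candidate belongs to the judge's own family.

    records: iterable of dicts with keys "judge_family",
    "candidate_family", "judge_score", "human_score" (assumed schema).
    """
    gaps = defaultdict(list)
    for record in records:
        same = record["judge_family"] == record["candidate_family"]
        relation = "own_family" if same else "other_family"
        key = (record["judge_family"], relation)
        gaps[key].append(record["judge_score"] - record["human_score"])
    return {key: mean(values) for key, values in gaps.items()}
```

A large gap between a family's `own_family` and `other_family` entries is the over-rating described above; a judge that is simply lenient across the board would show a similar offset in both.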

A Composition Policy Is the Cheapest Discipline

The architectural realization is that an LLM-as-judge ensemble is an opinion poll, and the team that polls three respondents from the same household has measured the household's opinion, not the population's. The eval whose confidence is set by family-internal agreement is an eval that ships the upgrades the family wants to ship.

The cheapest discipline that closes the gap is a written composition policy. The policy names provider diversity as a first-class property of any judge ensemble used for launch decisions. It forbids ensembles drawn entirely from one family. It requires that the candidate model's family be a minority of the ensemble. It mandates that inter-judge agreement be reported alongside ensemble-vs-human agreement on a periodic gold set, and that per-family bias be reported as standing metadata.
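The two composition rules are simple enough to enforce mechanically before an eval run starts. The sketch below is one possible check; the function name, the family strings, and the wording of the violation messages are assumptions.

```python
from collections import Counter


def check_composition(judge_families, candidate_family):
    """Return a list of policy violations; an empty list means the
    ensemble satisfies both composition rules.

    judge_families: list with one family name per judge in the ensemble.
    candidate_family: family of the model being evaluated.
    """
    violations = []
    counts = Counter(judge_families)

    # Rule 1: the ensemble must not be drawn entirely from one family.
    if len(counts) < 2:
        violations.append("ensemble is drawn entirely from one family")

    # Rule 2: the candidate's family must hold fewer than half the seats.
    if counts[candidate_family] * 2 >= len(judge_families):
        violations.append("candidate's family is not a minority of the ensemble")

    return violations
```

Running the check in CI, or at the top of the eval job, turns the written policy into something that fails loudly instead of something a reviewer has to remember.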

This policy costs nothing to write and a few thousand dollars per month in additional inference to enforce. Compared to the cost of shipping three model upgrades against a number that measured the wrong distribution, it is the cheapest insurance the eval pipeline can buy. The team that treats it as overhead is the team that will discover, on an external audit, that the eval was a mirror — and that what it reflected was not the candidate model's quality but the judges' family's preferences.

The 91% agreement was real. It just was not agreement about quality. It was agreement about familiarity, and familiarity is the property a judge ensemble most needs to look past if the eval score is going to mean what the launch committee thinks it means.
