
2 posts tagged with "ensemble"


The LLM-as-Judge Ensemble That Agreed Because All Judges Were the Same Family

· 10 min read
Tian Pan
Software Engineer

Your evaluation pipeline runs a three-judge ensemble against every model output. The judges are GPT-4 with a strict rubric, GPT-4 with a permissive rubric, and GPT-4 with a chain-of-thought rubric. They agree on 91% of cases. You report an inter-judge agreement of 0.83 (Krippendorff's alpha) to the launch review committee. The number lands in the "substantial agreement" band that every methodology textbook treats as a green light. Three model upgrades ship against that number over six months.

An external auditor swaps one of the three judges for Claude, keeping the same rubric, and the agreement rate on hard cases drops to 64%. The eval score that justified the last three upgrades turns out to depend on which provider family you treat as ground truth. The upgrades were upgrades against GPT-4 family preferences, not against quality, because the judges were siblings of the model being judged.
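One way to see the effect is to compute raw pairwise agreement twice: once for the same-family ensemble and once with a judge from another family swapped in. The sketch below is illustrative only. The judge names and labels are toy data invented for the example, not results from any real evaluation, and raw agreement is a simpler statistic than Krippendorff's alpha.

```python
from itertools import combinations


def pairwise_agreement(labels_by_judge):
    """Fraction of (item, judge-pair) comparisons where both judges gave the same label."""
    judges = list(labels_by_judge)
    n_items = len(labels_by_judge[judges[0]])
    matches = 0
    total = 0
    for a, b in combinations(judges, 2):
        for i in range(n_items):
            total += 1
            matches += labels_by_judge[a][i] == labels_by_judge[b][i]
    return matches / total


# Toy labels (hypothetical): three rubrics on one model family.
same_family = {
    "family_a_strict":     ["pass", "pass", "fail", "pass"],
    "family_a_permissive": ["pass", "pass", "fail", "pass"],
    "family_a_cot":        ["pass", "pass", "fail", "fail"],
}

# Same items, but the third judge is replaced by a judge from another family.
mixed_family = {
    "family_a_strict":     same_family["family_a_strict"],
    "family_a_permissive": same_family["family_a_permissive"],
    "family_b_strict":     ["pass", "fail", "pass", "fail"],
}

same_score = pairwise_agreement(same_family)
mixed_score = pairwise_agreement(mixed_family)
print(f"same family: {same_score:.2f}, mixed family: {mixed_score:.2f}")
```

If the two numbers diverge sharply on your own data, the original agreement figure was measuring shared family bias at least as much as shared judgment.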

Multi-Model Consensus: When One LLM Isn't Enough to Sign Off

· 11 min read
Tian Pan
Software Engineer

Your AI feature ships with 85% accuracy. Leadership is thrilled. Then a compliance audit finds that the 15% wrong answers cluster around a specific regulatory interpretation — one that every model in your provider's family gets wrong in the same way. You called one model. It failed. And because you never compared it to anything else, you had no signal that the failure was systematic.

Multi-model consensus architecture is the structural answer to this problem. Instead of trusting a single LLM, you fan out to multiple models from different provider families, aggregate their responses, and route based on agreement. The disagreement pattern itself becomes a first-class signal in your system, not just a debugging artifact.
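The fan-out, aggregate, and route steps can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a reference implementation: `consensus`, `ConsensusResult`, and the provider names are hypothetical, each model is represented by a plain callable that returns a short label, the calls run sequentially rather than in parallel, and aggregation is a simple majority vote.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ConsensusResult:
    answer: Optional[str]   # majority answer, or None when escalated
    agreement: float        # share of models that gave the majority answer
    votes: Dict[str, str]   # per-model answers, kept as a signal for later analysis
    route: str              # "accept" or "escalate"


def consensus(prompt: str,
              models: Dict[str, Callable[[str], str]],
              min_agreement: float = 2 / 3) -> ConsensusResult:
    """Call every model, take a majority vote, and route on the agreement level."""
    if not models:
        raise ValueError("at least one model is required")
    # Fan out (sequentially here; a real system would likely call in parallel).
    votes = {name: call(prompt).strip().lower() for name, call in models.items()}
    # Aggregate by majority vote over normalized answers.
    top_answer, top_count = Counter(votes.values()).most_common(1)[0]
    agreement = top_count / len(votes)
    # Route: accept when enough models agree, otherwise escalate (e.g. to a human).
    if agreement >= min_agreement:
        return ConsensusResult(top_answer, agreement, votes, "accept")
    return ConsensusResult(None, agreement, votes, "escalate")


# Stub models standing in for calls to three different provider families.
stub_models = {
    "provider_a": lambda prompt: "flag",
    "provider_b": lambda prompt: "Flag",
    "provider_c": lambda prompt: "clear",
}
result = consensus("Does this transaction need review?", stub_models)
print(result.route, result.answer, round(result.agreement, 2))
```

Returning the full `votes` mapping, not just the winner, is what lets the disagreement pattern be logged and analyzed as a signal in its own right.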

This approach costs 2–4× more per inference. For most use cases, that's obviously not worth it. But for a specific class of outputs — legal summaries, medical triage routing, financial risk flags, security assessments — the cost of a wrong answer so far exceeds the cost of extra inference that the math inverts almost immediately.
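The break-even point can be made concrete with a small calculation. Consensus pays for itself when the reduction in error rate, multiplied by the cost of one wrong answer, exceeds the extra inference spend. The sketch below assumes a simple expected-cost model; the function name and every number except the 15% single-model error rate are hypothetical, chosen only to show the arithmetic.

```python
def break_even_error_cost(extra_inference_cost: float,
                          single_error_rate: float,
                          consensus_error_rate: float) -> float:
    """Cost per wrong answer above which consensus is cheaper in expectation."""
    reduction = single_error_rate - consensus_error_rate
    if reduction <= 0:
        raise ValueError("consensus must reduce the error rate to ever pay off")
    return extra_inference_cost / reduction


base_cost = 0.01   # hypothetical cost of one single-model call, in dollars
multiplier = 3     # within the 2-4x range quoted above
extra_cost = base_cost * (multiplier - 1)

# 15% single-model error rate; the 10% consensus rate is an assumed value.
threshold = break_even_error_cost(extra_cost, 0.15, 0.10)
print(f"consensus pays off when a wrong answer costs more than ${threshold:.2f}")
```

Under these assumed numbers the threshold is well under a dollar per wrong answer, which is why high-stakes outputs clear it so easily while low-stakes ones may not.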