Ensemble vs. Debate: The Two Multi-Model Verification Paradigms and When Each Fails
When a single LLM gives you the wrong answer, the instinct is to ask more models. Run three in parallel and take the majority — that's ensemble. Or put them in a room and let them argue it out — that's debate. Both feel rigorous. Both have peer-reviewed results behind them. And both fail in exactly the same way when the conditions aren't right, which is the part practitioners rarely discuss.
The failure mode isn't subtle: when all your models learned from the same data, carry the same biases, or were trained by people with the same worldview, asking more of them doesn't give you more signal. It gives you more confident noise. Recent research has put a number on this: the pairwise error correlation between top frontier models sits around r = 0.77. That means roughly 60% of error variance is shared. Three models from different providers are effectively 1.3 independent models, not 3.0.
