
Debate Diversity Collapse: When Three Agents Vote 3-0 Because They Read the Same Internet

· 11 min read
Tian Pan
Software Engineer

The architecture diagram says "ensemble of three frontier models, debate-and-reconcile, majority vote." The trace says all three agents converged on the same answer in round one and spent two more rounds politely paraphrasing each other. The eval says +0.4 points over a single call. The bill says 4.2x. Somewhere in there, somebody decided the panel was working.

Multi-agent debate is sold as a way to get disagreement-driven reasoning: three minds arguing toward a better answer than any one of them would reach alone. It depends on the agents actually disagreeing. Frontier LLMs trained on overlapping web corpora, instruction-tuned against overlapping preference datasets, and aligned against overlapping safety taxonomies share priors more than the architecture diagrams admit. After a round of "let's reconcile," what you observe is not three perspectives converging on truth — it is three samples from one distribution converging on the mode they were never that far from.

The pattern has a name in the recent literature: when an ensemble's vote-disagreement rate trends to zero independent of question difficulty, you have debate diversity collapse. The panel is still voting. The vote no longer carries information.

The illusion of independent verification

The mental model behind multi-agent debate borrows from human institutions: juries, peer review, scientific replication. These work because the humans involved are independent enough that their errors are uncorrelated. Replication catches mistakes precisely when the second team has a different methodology, a different lab culture, a different prior on what is plausible.

LLM ensembles fail this independence test in ways that are easy to miss because the model names are different. Three checkpoints from the same lab share an instruction-tuning recipe, a refusal training set, a system-prompt convention, and large overlapping fractions of pretraining tokens. Even cross-lab ensembles often share the same publicly scraped Common Crawl snapshots, the same Reddit and Wikipedia priors, and the same RLHF judgment patterns trained against contractor pools whose demographics overlap. When such an ensemble is asked a question whose answer is a well-known fact, of course its members agree, and the agreement is informative. When it is asked a question whose answer is contested or out-of-distribution, its members often still agree, because their distributions were never that far apart, and the agreement is no longer informative.

Recent measurement work makes this concrete. In ensembles of homogeneous agents, accuracy improves at small agent counts and then plateaus: marginal gains drop toward zero as the additional agents produce trajectories that are increasingly redundant copies of each other. With only two genuinely diverse agents, performance can match or exceed that of sixteen homogeneous agents. Put differently, the second through sixteenth homogeneous agents pay full token cost for near-zero information.
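A toy calculation shows why redundant agents stop helping. The sketch below assumes a deliberately simple mixture model that is not taken from the measurement work referenced above: with probability `rho` every agent returns the same shared draw, and otherwise each agent is independently correct with probability `p`. The function name and both parameters are illustrative.

```python
from math import comb


def majority_accuracy(n_agents: int, p: float, rho: float) -> float:
    """Accuracy of a majority vote among an odd number of agents.

    Toy mixture model (an assumption for illustration only):
    - with probability rho, all agents return one shared draw, so the
      vote is right exactly when that single draw is right (probability p);
    - with probability 1 - rho, agents are independent and each is
      correct with probability p.
    """
    if n_agents % 2 == 0:
        raise ValueError("use an odd number of agents to avoid ties")
    votes_needed = n_agents // 2 + 1
    # Probability that a strict majority of independent agents is correct.
    independent = sum(
        comb(n_agents, k) * p**k * (1 - p) ** (n_agents - k)
        for k in range(votes_needed, n_agents + 1)
    )
    return rho * p + (1 - rho) * independent


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        row = [round(majority_accuracy(n, 0.7, rho), 3) for n in (1, 3, 15)]
        print(f"rho={rho}: accuracy at n=1, 3, 15 -> {row}")
```

Under these assumptions, three independent agents at `p = 0.7` reach 0.784, while fifteen agents at `rho = 0.9` stay below 0.73. Adding voters cannot recover the accuracy that shared draws take away.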

There is a subtler failure on top of this: sycophancy. LLMs tuned to be agreeable read peer outputs and update toward them, even when their own first-round answer was correct. In published debate transcripts, conformity (changing your answer to match a peer) consistently exceeds obstinacy (sticking to your prior). The system was supposed to use disagreement to find errors. Instead, the loudest or earliest answer wins because the other agents capitulate.
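Conformity and obstinacy can be counted directly from a debate trace. The sketch below assumes a hypothetical trace format, one dict per round mapping agent name to that round's answer, and counts what each agent does when at least one peer disagreed with it in the previous round. The `SwitchStats` and `switch_stats` names are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class SwitchStats:
    conformed: int = 0  # switched to an answer a peer held in the previous round
    held: int = 0       # kept its own answer despite a dissenting peer
    other: int = 0      # switched to an answer no peer had proposed


def switch_stats(rounds: list[dict[str, str]]) -> SwitchStats:
    """Count agent behaviour across consecutive debate rounds.

    `rounds` is an assumed trace format: rounds[i][agent] is the answer
    that agent gave in round i. Every agent must appear in every round.
    """
    stats = SwitchStats()
    for prev, curr in zip(rounds, rounds[1:]):
        for agent, before in prev.items():
            peer_answers = {ans for name, ans in prev.items() if name != agent}
            if peer_answers <= {before}:
                continue  # every peer already agreed, nothing to conform to
            after = curr[agent]
            if after == before:
                stats.held += 1
            elif after in peer_answers:
                stats.conformed += 1
            else:
                stats.other += 1
    return stats


if __name__ == "__main__":
    trace = [
        {"a": "X", "b": "Y", "c": "Y"},
        {"a": "Y", "b": "Y", "c": "Y"},
    ]
    print(switch_stats(trace))
```

Splitting the `conformed` count by whether the abandoned answer was correct, given labeled data, would show how often capitulation destroys a right answer.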

Why majority vote keeps looking like it works

The reason debate diversity collapse is hard to see in dashboards is that the topline metric usually does improve over a single zero-temperature call. The improvement is real. The mechanism is not what the diagram claims.

Most of the empirical gain attributed to multi-agent debate is actually attributable to ensembling — the same mechanism that makes self-consistency work with a single model sampled multiple times at high temperature. Variance reduction across independent samples raises accuracy on questions where the model is right on most draws. You do not need three different models to get this. You need three different samples.
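For comparison, a self-consistency baseline fits in a few lines. This is a minimal sketch: `sample_answer` is a hypothetical callable standing in for one model call at non-zero temperature that returns a normalized final answer, not a real client API.

```python
from collections import Counter
from typing import Callable


def self_consistency(
    sample_answer: Callable[[str], str],
    question: str,
    n_samples: int = 5,
) -> tuple[str, float]:
    """Sample one model several times and return the plurality answer.

    Returns the winning answer and the share of samples that voted for it.
    `sample_answer` is an assumed wrapper around a single sampled model call.
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer(question) for _ in range(n_samples)]
    # most_common breaks ties in favour of the answer seen first.
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_samples


if __name__ == "__main__":
    canned = iter(["42", "41", "42", "42", "17"])  # stand-in for sampled outputs
    print(self_consistency(lambda q: next(canned), "What is 6 * 7?"))
```

Running this baseline on the same eval set as the debate pipeline gives the number the debate has to beat to justify its extra cost.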

Once you decompose the gain this way, the cost structure looks ugly. A self-consistency baseline at temperature 0.7 with five samples from one model is often within noise of a three-agent debate using three frontier models, at a fraction of the cost and a fraction of the latency. The debate buys you one thing self-consistency cannot: the chance of catching errors that the single model makes on most draws. That chance scales with how different the other agents' priors actually are. If the diversity is fake, the debate is fake, and you are paying a premium for what self-consistency would have given you far more cheaply.

This is the cost frame the team building the system rarely puts in writing: the second and third agents are paying tokens to agree. If they are agreeing because the answer is obvious, you didn't need them. If they are agreeing because they share priors with the first, you have an expensive variance-reduction trick dressed up as deliberation.

Measuring whether the panel is actually deliberating

You cannot fix what you cannot measure, and "did the panel deliberate" is not a default field in any tracing tool. A few measurements separate real debate from theater:

  • Vote-disagreement rate over time. Track per-question, per-round disagreement at the level of intermediate plans, not just final answers. A healthy debate has high round-one disagreement on hard questions and convergence by the final round. A collapsed debate has zero disagreement from round one onward, regardless of difficulty.
  • Structural diversity on intermediate reasoning. Cosine distance over reasoning-trace embeddings catches the case where the agents reach the same answer via the same argument. If two agents' chains-of-thought cluster geometrically, the agents are not contributing independent information even when their final answers happen to differ.
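The two measurements above can be sketched in plain Python. This assumes per-round answers have been normalized to comparable strings and that reasoning traces have already been embedded by some sentence-embedding model of your choice; all function names here are illustrative.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def disagreement_rate(answers: Sequence[str]) -> float:
    """Fraction of agent pairs whose answers differ (0.0 means unanimous)."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def mean_pairwise_distance(embeddings: Sequence[Sequence[float]]) -> float:
    """Average cosine distance over all pairs of reasoning-trace embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    print(disagreement_rate(["A", "B", "A"]))        # two of three pairs differ
    print(mean_pairwise_distance([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

Logging both numbers per question and per round, bucketed by question difficulty, is one way to spot the signature described earlier: disagreement that sits near zero from round one regardless of how hard the question is.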