When Your Agents Disagree: Consensus and Arbitration in Multi-Agent Systems
Multi-agent systems are sold on a promise: multiple specialized agents, working in parallel, will produce better answers than any single agent could alone. That promise has a hidden assumption — that when agents produce different answers, you'll know how to reconcile them. Most teams discover too late that they won't.
The naive approach is to average outputs, or pick the majority answer, and move on. In practice, a multi-agent system where all agents share the same training distribution will amplify their shared errors through majority vote, not cancel them out. A system that always defers to the most confident agent will blindly follow the most overconfident one. And a system that runs every disagreement through an LLM judge will inherit twelve documented bias types from that judge. The arbitration problem is harder than it looks, and getting it wrong is how you end up with four production incidents in a week.
The Taxonomy of Multi-Agent Disagreement
Before reaching for a resolution strategy, it helps to classify what kind of disagreement you're actually looking at.
Stylistic disagreement: Agents agree on substance but differ in phrasing, emphasis, or format. This is safe to synthesize — pick one, or merge. No judgment call required.
Reasoning disagreement: Agents reach different conclusions via different paths. The outputs are semantically distinct, not just stylistically. This requires arbitration.
High-confidence disagreement: Multiple agents are each highly confident in mutually exclusive answers. This is the diagnostic case — it signals you've hit a genuinely ambiguous region, one where human reasoners would also disagree. Synthesizing a false consensus here actively harms users.
Adversarial disagreement: One or more agents have been manipulated (via prompt injection, poisoned context, or adversarial inputs) to push the system toward a specific wrong answer. Controlled experiments in healthcare AI showed that adversarial assistant agents can achieve 98-100% attack success rates by manufacturing false consensus — pushing a target agent toward harmful recommendations through repeated coordinated agreement. A single verifier agent anchored to external ground truth eliminated the attack entirely.
Identifying which category you're in determines which resolution strategy applies. Most teams skip this step and apply a single strategy regardless, which is why most multi-agent systems underperform their theoretical ceiling.
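The classification step can be surprisingly mechanical. Here is a minimal sketch, assuming each agent reports a normalized final answer and a (calibrated) confidence score; the `AgentOutput` type and the `high_conf` threshold are illustrative choices, not from any of the cited papers. Adversarial disagreement is deliberately out of scope — as the taxonomy notes, it requires a verifier with external ground truth, not inspection of the outputs themselves:

```python
from dataclasses import dataclass

@dataclass
class AgentOutput:
    answer: str        # normalized final answer
    confidence: float  # in [0, 1]; assumed already calibrated

def classify_disagreement(outputs, high_conf=0.85):
    """Route a set of agent outputs to a disagreement category.

    Adversarial disagreement cannot be detected from outputs alone;
    it needs a verifier anchored to external ground truth.
    """
    distinct = {o.answer for o in outputs}
    if len(distinct) == 1:
        # Agents agree on substance; any remaining variation is stylistic.
        return "stylistic"
    if all(o.confidence >= high_conf for o in outputs):
        # Mutually exclusive answers, each held with high confidence:
        # surface the disagreement rather than synthesize.
        return "high_confidence"
    return "reasoning"

outputs = [AgentOutput("42", 0.9), AgentOutput("17", 0.55), AgentOutput("42", 0.8)]
print(classify_disagreement(outputs))  # reasoning
```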
Majority Voting: Powerful Baseline, Predictable Failure
Self-consistency — sampling multiple reasoning paths from the same or different models and taking the majority answer — remains the most robust starting point. It's simple to implement, requires no additional model calls for arbitration, and genuinely improves performance on tasks with multiple valid solution paths.
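The mechanics are as simple as the description suggests — a sketch of the aggregation step, with sampled answers standing in for full reasoning paths:

```python
from collections import Counter

def majority_vote(samples):
    """Self-consistency aggregation: take the most common final answer
    across sampled reasoning paths. Ties break arbitrarily here;
    production code should treat a tie as non-consensus and escalate."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

# Final answers from five sampled reasoning paths (one or more models):
samples = ["B", "A", "B", "B", "C"]
print(majority_vote(samples))  # ('B', 0.6)
```

The returned vote share doubles as a crude disagreement signal — a 3/5 winner is a very different state than a 5/5 winner, which matters for the escalation decisions discussed later.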
The ceiling is lower than it appears. Majority voting breaks in three specific situations:
Shared systematic error: If all your agents were trained on similar data, they share the same blind spots. Majority vote amplifies shared errors rather than filtering them. On hard questions where most agents are wrong, the system confidently outputs the wrong answer. The fix is model heterogeneity — pairing agents from different model families, sizes, or specializations. This is the single highest-leverage architectural decision in multi-agent design.
Approval voting collapse: When agents are asked to vote for all acceptable answers rather than just one, sycophantic agents vote for everything. A 2025 study found this approach collapsed in 59% of evaluation runs, producing ties that made the system unusable. Binary choices (select one best answer) dramatically outperform multi-select schemes with LLM voters.
Task-type mismatch: Research published at ACL 2025 found that majority voting outperforms consensus protocols by 13.2% on reasoning tasks, but underperforms consensus by 2.8% on knowledge retrieval tasks. The mechanism differs: on reasoning tasks, diverse solution paths need to coexist; on knowledge tasks, requiring agent agreement catches hallucinations that slip through individual agents. Most systems apply one strategy to all task types, leaving performance on the table.
LLM-as-Judge: Useful Tool, Unreliable Arbiter
The LLM judge pattern — using a separate model to evaluate and select among agent outputs — is now standard practice. It's also misunderstood.
A comprehensive 2024 benchmark catalogued twelve distinct bias types in LLM judges. The four most impactful in production:
Position bias: Judges systematically favor outputs presented first (or last) in pairwise comparisons. Swapping the presentation order of two responses causes accuracy shifts exceeding 10% in code evaluation tasks. All tested judge models show this effect.
Self-preference bias: LLM judges assign higher scores to outputs that are statistically more "familiar" to their own model — those with lower perplexity under the judge's own policy. GPT-4 shows this significantly. Cross-model evaluations are systematically biased toward the judge's own model family.
Length bias: Judges prefer longer, more formal responses regardless of content quality. This is an artifact of RLHF training on human preference data, where humans often use length as a quality heuristic.
Domain expert gap: In specialized fields like dietetics or mental health, LLM judge agreement with human subject matter experts sits at 60-68%. The commonly targeted threshold for production readiness is Cohen's kappa > 0.8; most uncalibrated judge setups start near 0.3.
None of these biases are reasons to avoid LLM judges. They're reasons to calibrate them. Concretely: run your judge on a test set with known human labels and measure Cohen's kappa before deploying. Use few-shot examples drawn from real production failures rather than hypothetical cases. Use binary yes/no questions rather than numeric scores (LLMs lack natural numeric calibration — an "8 vs. 9" judgment is inconsistent across runs). And critically, monitor judge calibration over time — production distributions shift, and kappa degrades silently.
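The kappa check is a few lines of arithmetic. A sketch for the binary-verdict case the text recommends (yes/no judgments against human labels); the example labels are made up for illustration:

```python
def cohens_kappa(judge_labels, human_labels):
    """Agreement between binary judge verdicts and human labels,
    corrected for chance agreement."""
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    p_observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Chance agreement from each rater's marginal positive rate.
    p_j = sum(judge_labels) / n
    p_h = sum(human_labels) / n
    p_chance = p_j * p_h + (1 - p_j) * (1 - p_h)
    if p_chance == 1.0:
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)

judge = [1, 1, 0, 1, 0, 0, 1, 0]  # judge verdicts on a labeled test set
human = [1, 0, 0, 1, 0, 1, 1, 0]  # human ground-truth labels
print(round(cohens_kappa(judge, human), 2))  # 0.5 -- well below the 0.8 target
```

Running this on a few hundred labeled production traces before deployment — and on a rolling basis afterward — is what "monitor judge calibration over time" means in practice.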
One operationally validated heuristic: use 3-5 judge models with majority voting rather than a single judge. The same evaluation trace can pass on Tuesday and fail on Friday from a single judge due to sampling stochasticity. Ensemble judging adds robustness without the complexity of confidence calibration.
Debate Protocols: When and Why They Work
Multi-agent debate — where agents independently propose answers, then read each other's reasoning and revise over multiple rounds — shows genuine gains on specific problem types. But the conditions for those gains are more specific than the literature usually presents.
Debate improves performance when there is information asymmetry: when different agents have access to different relevant information, and the goal is for the agent with better information to convince the judge. This is the structural mechanism Irving et al. identified in 2018: a liar needs to construct false claims, while a truth-teller only needs to find a single flaw in those claims. The asymmetry favors truth.
Debate does not reliably improve performance when agents have symmetric information access. If all your agents are reading the same context window, debate becomes a persuasion contest rather than a truth-seeking mechanism — and agents are more persuasive when arguing positions they "believe," which may or may not correlate with accuracy.
A 2025 paper specifically targeting this limitation introduced an anti-conformity mechanism: rather than converging toward consensus over debate rounds, agents explicitly resist majority pressure and all intermediate outputs across rounds are scored, not just the final position. The approach requires only a single debate round, dramatically reducing token cost. On eight benchmarks, it outperformed standard multi-round debate while using fewer tokens.
The failure mode to watch is hallucinated consensus. When agents converge on fabricated information and mutually reinforce it, the result is not a single agent's hallucination — it's a polished, confident, mutually corroborating wrong answer with no internal dissent to signal the error. This is more dangerous than individual agent hallucination because the signal you'd normally use to detect errors (disagreement, hedging, low confidence) is absent. If one agent stores a hallucinated fact in shared memory, downstream agents treat it as verified truth.
Confidence Weighting: More Effective With Calibration
The ReConcile framework adds confidence scores to majority voting — each agent provides its answer and a confidence estimate, and votes are weighted accordingly. Across seven benchmarks, it showed gains up to 11.4% over standard majority voting, and outperformed GPT-4 on three datasets.
The critical implementation detail: LLMs are systematically overconfident. Without calibration, agents cluster at uniformly high confidence scores, compressing the weighting signal to near-zero differentiation. The confidence weights need calibration (temperature scaling, Platt scaling, or empirical binning) before they add value. Uncalibrated confidence weighting can perform worse than flat majority vote because the overconfident wrong agents receive outsized influence.
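A minimal sketch of the calibrate-then-weight pipeline. The temperature value `T` would be fit offline by minimizing negative log-likelihood on a held-out labeled set (not shown); the specific votes and `T=2.0` here are illustrative, and this is a simplification of ReConcile's actual protocol, not a reimplementation of it:

```python
import math
from collections import defaultdict

def temperature_scale(confidence, T):
    """Soften an overconfident probability by scaling its logit.
    T > 1 pulls confidences toward 0.5; T is fit on held-out data."""
    confidence = min(max(confidence, 1e-6), 1 - 1e-6)
    logit = math.log(confidence / (1 - confidence))
    return 1 / (1 + math.exp(-logit / T))

def weighted_vote(answers_with_conf, T=2.0):
    """Confidence-weighted voting: each agent's vote counts in
    proportion to its calibrated confidence."""
    weights = defaultdict(float)
    for answer, conf in answers_with_conf:
        weights[answer] += temperature_scale(conf, T)
    return max(weights, key=weights.get)

votes = [("A", 0.99), ("B", 0.97), ("B", 0.95), ("A", 0.98)]
print(weighted_vote(votes))  # A
```

Note what calibration buys you: the raw confidences above are all clustered at the top of the scale, exactly the compressed-signal regime the text describes, and temperature scaling restores usable differentiation between them.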
Calibration also degrades silently over time. Calibration performed on your training distribution erodes as the production distribution shifts. Tracking calibration drift — comparing predicted confidence distributions to actual accuracy rates over rolling windows — should be part of your monitoring stack.
When to Surface Disagreement Rather Than Synthesize
The reflex in most system designs is to hide agent disagreement behind a synthesized output. Sometimes that's correct. Often it's not.
Surface disagreement to users when:
- Semantic distance is high: The outputs are not stylistically different but represent different conclusions. Embedding-based semantic clustering of agent outputs gives you a principled measure of this.
- Domain is high-stakes: Medical diagnosis, legal analysis, security decisions — anywhere where a wrong synthesis is worse than acknowledged uncertainty.
- Agents disagree with high individual confidence: This specific pattern signals a genuinely contested question. Agents that each hold strong, incompatible views are not producing noise — they're detecting genuine ambiguity that human reasoners would also struggle with. Synthesizing here creates false confidence.
- Debate fails to converge: If agents do not reach agreement after a fixed number of rounds, a circuit-breaker should route to human escalation rather than forcing a synthesis. Deadlock is information.
Synthesize and hide disagreement when:
- The disagreement is stylistic, not substantive: Different phrasing of the same underlying answer.
- The question is verifiable: If agents disagree about something checkable — code correctness, database records, mathematical results — route to a verification step rather than an arbitration step.
- Volume is high and stakes are low: Surfacing agent uncertainty for every routine query creates cognitive overload without informing better decisions.
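The semantic-distance criterion above reduces to a pairwise similarity check over output embeddings. A sketch of the decision, using a toy bag-of-words embedding so the example is self-contained — in production you would swap `embed` for a real sentence-embedding model, and the `threshold` value is an assumption to tune, not a published constant:

```python
import math
from collections import Counter

def embed(text):
    """Placeholder embedding: bag-of-words counts. Replace with a
    real sentence-embedding model in production."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def should_surface(outputs, threshold=0.5):
    """Surface disagreement when any pair of agent outputs is
    semantically distant; otherwise treat the variation as stylistic
    and synthesize."""
    vecs = [embed(o) for o in outputs]
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if cosine(vecs[i], vecs[j]) < threshold:
                return True   # substantive disagreement: show divergent views
    return False              # stylistic only: safe to synthesize

print(should_surface(["the patch is safe to merge",
                      "the patch is safe to ship"]))          # False
print(should_surface(["the patch is safe to merge",
                      "this change introduces a race condition"]))  # True
```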
A practical UX pattern from the ACM DIS 2025 conference: provide a layered interface with a top-level synthesized answer alongside an expandable "divergent views" section showing agent reasoning traces. This satisfies users who want simplicity while preserving access to disagreement for those making high-stakes decisions.
Measuring Arbitration Quality Without Ground Truth
The hardest part of this problem is that you often cannot measure whether your arbitration is working. There is no ground truth to check against. The field has developed several proxies that are more reliable than they initially appear.
Vote entropy measures disagreement intensity: Shannon entropy over the vote distribution tells you how evenly split agents are, not just whether they disagree. A system where 4 of 5 agents agree is in a different state than one where all 5 are split. Embedding geometry extends this further — measuring the geometric distance between majority-position and minority-position embeddings as a calibrated uncertainty signal. A 2026 paper using this approach achieved AUROC of 0.802 for detecting genuine uncertainty, compared to 0.791 for LLM aggregator baselines, with substantially better calibration.
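Vote entropy is a one-liner worth having in your monitoring stack. A sketch distinguishing the unanimity, mild-dissent, and contested states described above:

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Shannon entropy (bits) of the vote distribution: 0.0 for
    unanimity, log2(k) for a k-way even split."""
    counts = Counter(votes)
    n = len(votes)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

print(round(vote_entropy(["A"] * 5), 3))                   # 0.0 -- unanimous
print(round(vote_entropy(["A", "A", "A", "A", "B"]), 3))   # 0.722 -- mild dissent
print(round(vote_entropy(["A", "A", "A", "B", "B"]), 3))   # 0.971 -- contested
```

A simple mode-count check would treat the 4-1 and 3-2 splits as the same event ("no unanimity"); the entropy separates them, which is what makes it usable as a graded escalation signal.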
Consistency under perturbation tests your judge rather than your agents. A well-calibrated judge should produce the same verdict when you:
- Shuffle the order of presented options (position bias check)
- Rephrase the question semantically identically
- Sample at different temperatures
Inconsistency under these perturbations reveals calibration problems before they manifest as production errors.
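The position-bias check in particular is cheap to automate. A sketch measuring the flip rate of a pairwise judge under order swaps — `judge` here is any callable returning `"first"` or `"second"`, and the hard-coded biased judge is a toy stand-in for a real model call:

```python
def position_flip_rate(judge, pairs):
    """Fraction of pairwise comparisons whose verdict changes when the
    two options swap positions. judge(a, b) returns "first" or
    "second"; a well-calibrated judge picks the same *content* either
    way, so "first" forward should become "second" backward."""
    flips = 0
    for a, b in pairs:
        forward = judge(a, b)
        backward = judge(b, a)
        consistent = (forward == "first") == (backward == "second")
        flips += not consistent
    return flips / len(pairs)

# Toy judge with maximal position bias: always prefers whatever comes first.
biased_judge = lambda a, b: "first"
pairs = [("response A", "response B"), ("response C", "response D")]
print(position_flip_rate(biased_judge, pairs))  # 1.0 -- every verdict flips
```

The same harness extends to the other two perturbations: pass in semantically rephrased questions, or re-run at different temperatures, and measure the verdict-change rate for each.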
Operational escalation rate is a leading indicator that practitioners under-monitor. In a well-calibrated system, the rate at which arbitration routes to human escalation should stay in the 10-15% range. Below 10% suggests the system is overconfident and under-escalating. Above 15% suggests the system is not automating enough to be useful. Drift in this rate — either direction — signals that something has changed in your input distribution or your agent behavior.
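Tracking this takes little more than a rolling window. A sketch, where the 10-15% band is the heuristic from the text and the window size is an assumption to tune per system:

```python
from collections import deque

class EscalationMonitor:
    """Rolling-window escalation-rate tracker for arbitration decisions."""

    def __init__(self, window=1000, low=0.10, high=0.15):
        self.events = deque(maxlen=window)  # True = escalated to a human
        self.low, self.high = low, high

    def record(self, escalated):
        self.events.append(bool(escalated))

    def status(self):
        if not self.events:
            return "no_data"
        rate = sum(self.events) / len(self.events)
        if rate < self.low:
            return "under_escalating"   # likely overconfident arbitration
        if rate > self.high:
            return "over_escalating"    # not automating enough to be useful
        return "healthy"

m = EscalationMonitor(window=100)
for i in range(100):
    m.record(i % 20 == 0)  # 5% of queries escalated
print(m.status())  # under_escalating
```

Alerting on status *transitions* rather than the raw rate catches the drift-in-either-direction signal described above without paging anyone on normal fluctuation.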
Task Decomposition Beats Arbitration
The most important lesson from production multi-agent systems: the best arbitration strategy is preventing conflicts from arising in the first place.
Anthropic's multi-agent research system — an orchestrator running Claude Opus 4 with Sonnet subagents researching in parallel — outperformed single-agent Claude Opus 4 by 90% on internal benchmarks. The architecture does not use sophisticated voting or debate. It uses detailed task boundaries that prevent overlapping claims between agents. The lead agent synthesizes rather than arbitrates, receiving subagent findings and determining what is sufficient. Disagreements rarely arise because the task decomposition leaves no contested territory.
This is the lesson that gets underweighted in the debate and voting literature: the cost of preventing conflicts through clear task decomposition is fixed at design time. The cost of resolving conflicts through arbitration is paid per query, in latency and tokens, every time your agents disagree. A system designed to rarely disagree will always outperform a system with sophisticated disagreement resolution, at lower cost.
Design the task allocation first. Build the arbitration layer for the residual disagreements that task decomposition cannot prevent. And monitor for adversarial collusion — manufactured false consensus is the one failure mode that neither good task design nor good arbitration addresses without external verification.
"What should the system do when agents disagree?" is ultimately the wrong frame. The right question is: what does your system do when the disagreement itself is the answer?
- https://arxiv.org/abs/2305.14325
- https://arxiv.org/abs/2309.13007
- https://arxiv.org/abs/2406.04692
- https://arxiv.org/html/2503.13657v1
- https://arxiv.org/abs/2502.14143
- https://arxiv.org/html/2502.19130v4
- https://arxiv.org/pdf/2502.08788
- https://arxiv.org/abs/2509.11035
- https://arxiv.org/abs/2410.21819
- https://arxiv.org/html/2512.03097v1
- https://arxiv.org/html/2603.20975
- https://arxiv.org/abs/2411.15594
- https://arxiv.org/html/2402.06782
- https://galileo.ai/blog/why-llm-as-a-judge-fails
- https://galileo.ai/blog/multi-agent-coordination-strategies
- https://www.anthropic.com/engineering/multi-agent-research-system
- https://www.together.ai/blog/together-moa
- https://arxiv.org/html/2511.14136v1
- https://aclanthology.org/2025.findings-acl.1141/
