
Temperature Governance in Multi-Agent Systems: Why Variance Is a First-Class Budget

11 min read
Tian Pan
Software Engineer

Most production multi-agent systems apply a single temperature value—copied from a tutorial, set once, never revisited—to every agent in the pipeline. The classifier, the generator, the verifier, and the formatter all run at 0.7 because that's what the README said. This is the equivalent of giving every database query the same timeout regardless of whether it's a point lookup or a full table scan. It feels fine until you start debugging failure modes that look like model errors but are actually sampling policy errors.

Temperature is not a global dial. It's a per-role policy decision, and getting it wrong creates distinct failure signatures depending on which direction you miss in.

What Temperature Actually Controls

Before designing a per-role policy, it helps to understand precisely what temperature does and doesn't do.

Temperature scales the logit values that emerge from the final transformer layer before they're converted into probabilities via softmax. At temperature=0, the model deterministically picks the highest-probability token at each step. At temperature=1.0, the distribution is used as-is. Higher values flatten the distribution, making low-probability tokens more competitive.
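
To make the mechanics concrete, here is a minimal sketch of temperature scaling in plain numpy—the same arithmetic a sampler performs, with illustrative logit values:

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Divide logits by temperature, then softmax. As temperature -> 0, approaches argmax."""
    scaled = logits / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = np.array([4.0, 3.5, 2.0, 0.5])
for t in (0.2, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t).round(3))
# At t=0.2 nearly all mass sits on the top token; at t=2.0 the tail becomes competitive.
```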

Two things temperature does not do: it doesn't remove tokens from consideration (that's top-k and top-p), and it doesn't guarantee deterministic output at zero (floating-point precision and batching effects introduce variance even then). More importantly for multi-agent design, it doesn't control what the model knows or doesn't know—it controls how consistently the model expresses what it knows.

The modern parameter stack goes further. Top-p (nucleus sampling) filters by cumulative probability mass. Top-k fixes the candidate pool size. Min-p, which gained traction after its ICLR 2025 oral presentation, adapts the threshold relative to the top token's probability—when the model is confident, it demands higher quality candidates; when uncertain, it relaxes. For production multi-agent systems, temperature combined with min-p is increasingly the standard approach for open-source deployments.
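
Min-p is simple enough to sketch over a probability vector. Real inference engines apply it to logits inside the sampler; the probabilities here are illustrative:

```python
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float = 0.1) -> np.ndarray:
    """Keep tokens whose probability is at least min_p times the top token's."""
    threshold = min_p * probs.max()        # the bar scales with model confidence
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()               # renormalize the survivors

print(min_p_filter(np.array([0.90, 0.05, 0.03, 0.02])))  # confident: tail pruned entirely
print(min_p_filter(np.array([0.30, 0.28, 0.22, 0.20])))  # uncertain: alternatives survive
```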

Repetition penalty is a separate mechanism entirely. It discourages re-using tokens already generated, operating on the logits before temperature is applied. Confusing repetition issues with temperature issues is one of the more common misdiagnoses in production debugging.
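
For reference, a sketch of the classic formulation (introduced in the CTRL paper)—note that it modifies logits before any temperature scaling:

```python
import numpy as np

def apply_repetition_penalty(logits: np.ndarray, generated_ids: set[int],
                             penalty: float = 1.2) -> np.ndarray:
    """Discourage tokens that already appear in the output (CTRL-style penalty)."""
    out = logits.copy()
    for tok in generated_ids:
        # Dividing a positive logit shrinks it; multiplying a negative logit
        # pushes it further down. Both make re-selection less likely.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```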

The Two Failure Modes Nobody Talks About Together

Most discussions of temperature focus on a single axis: too low is boring, too high is incoherent. But in multi-agent systems, the failure modes split cleanly by role, and you need to think about both ends simultaneously across different pipeline stages.

Over-confident classifiers emerge when you run classification agents at temperature=0 or very close to it. The model always picks its highest-probability answer, which sounds good until you consider borderline cases. A classifier that cannot express uncertainty will never flag a document for human review—it will make a definitive call even when the right answer is genuinely ambiguous. Research on multi-agent consensus frameworks found that moderate temperatures (0.1-0.3) allow models to occasionally select alternative classifications for borderline cases, which is exactly the uncertainty signal downstream components or human reviewers need.
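
One cheap way to surface that signal is to sample the classifier several times and treat disagreement as an escalation trigger. A sketch, where classify_once is a hypothetical wrapper around your model call:

```python
from collections import Counter

def classify_once(document: str, temperature: float) -> str:
    """Hypothetical single-call wrapper around your classification model."""
    ...

def classify_with_uncertainty(document: str, n_samples: int = 5,
                              agreement_threshold: float = 0.8) -> dict:
    votes = Counter(classify_once(document, temperature=0.2) for _ in range(n_samples))
    label, count = votes.most_common(1)[0]
    agreement = count / n_samples
    return {
        "label": label,
        "agreement": agreement,
        "needs_review": agreement < agreement_threshold,  # flag borderline cases
    }
```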

The other failure mode of ultra-low temperature classifiers is surface pattern overfitting in the sampling sense: the model collapses to the single most-probable interpretation of ambiguous inputs, rather than maintaining any distribution over plausible alternatives. This is different from training overfitting but creates similar symptoms—confident wrong answers on out-of-distribution inputs.

Under-creative generators appear when content-producing agents run at temperatures appropriate for structured tasks. A summarization agent at temperature=0.2 will produce grammatically correct, semantically accurate output that reads like a form letter—same sentence structures, same transition words, same paragraph rhythm every time. Users don't complain about accuracy; they complain the content feels robotic. Increasing temperature alone doesn't fully fix this—the fix requires both: higher temperature to explore the distribution, and frequency_penalty around 0.3-0.5 to discourage the model from reusing the same words and phrases as a generation unfolds.
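
In practice that looks like a generator configured with both knobs. A sketch using the OpenAI Python SDK (the model name and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()

def generate_summary(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize for a general audience."},
            {"role": "user", "content": text},
        ],
        temperature=0.9,        # explore the distribution, don't sit on the mode
        frequency_penalty=0.4,  # discourage reusing the same words and phrases
    )
    return response.choices[0].message.content
```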

Schema violations from high-temperature structured tasks are the third failure mode, and they interact with the others in multi-agent pipelines. When a generator at temperature=0.9 feeds output to a structured extraction agent also at temperature=0.9, and that agent needs to produce valid JSON, you've created a compounding problem. Raw JSON generation without constrained decoding fails 15-20% of the time even at moderate temperatures. Noticeable accuracy drops tend to appear somewhere past 1.5, though the exact threshold is model-dependent: a 2024 study on clinical information extraction found GPT-4o maintained 98.7-99.0% formatting accuracy from temperature 0.0 to 1.5, with the first significant degradation occurring above 1.75—but that's for a single model. In a chained multi-agent pipeline, each stage's failure probability compounds.
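
The compounding is worth doing back-of-envelope before you trust a pipeline. With illustrative numbers:

```python
# Even high per-stage formatting accuracy erodes across a chained pipeline.
per_stage_accuracy = 0.99
for stages in (1, 3, 5, 8):
    print(stages, round(per_stage_accuracy ** stages, 4))
# 1 0.99, 3 0.9703, 5 0.951, 8 0.9227 -> a 1% per-stage failure rate
# becomes a ~5% pipeline failure rate at five stages.
```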

Matching Sampling Policy to Role

The practical framework is straightforward once you separate roles (a consolidated configuration sketch follows the list):

Classifiers and routers should run at temperature 0.1-0.3. Not zero—you want the model to occasionally express uncertainty by sampling slightly off the mode. Zero creates false confidence. The goal is high determinism plus a small but usable uncertainty signal, not complete determinism.

Structured extractors and formatters should run at temperature 0.0-0.2, and wherever possible, use provider-native structured output APIs (constrained decoding, grammar sampling, or tool-use schemas that enforce output format at generation time). OpenAI's Structured Outputs API achieved near-100% schema compliance—up from under 40% with raw JSON prompting—by enforcing schemas at the token level rather than hoping the model self-corrects. Temperature below 0.2 plus schema constraints gives you both variance control and structure guarantees.

Generators producing narrative content should run at temperature 0.7-1.0. If you're seeing repetition at these temperatures, the fix is frequency_penalty (0.3-0.7), not lower temperature. Lowering temperature collapses the distribution to the mode, which collapses diversity and, at the extreme, invites the repetition loops characteristic of greedy decoding. Frequency penalty specifically targets already-generated tokens without reshaping the rest of the distribution.

Verifiers and reviewers need a moderate setting—0.3-0.7—because they're doing evaluation work that benefits from some exploratory reasoning, but shouldn't produce wildly different verdicts across runs. A verifier at temperature=0.9 might reverse its conclusion between two identical runs, which makes the pipeline non-deterministic in unpredictable ways.

Consensus and aggregation agents that synthesize outputs from multiple upstream agents should run at low temperature (0.1-0.2). By this stage in the pipeline, diversity has already been introduced upstream. The aggregator's job is to make a reliable final call, not to introduce additional variance. Research on multi-agent consensus in deductive coding found that temperature had minimal effect on consensus accuracy once reasonable temperatures were used—the design of the consensus mechanism mattered more than the precise temperature value.
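
One way to make the framework operational is a single policy table that every agent constructor reads from, instead of hard-coded values scattered across the codebase. A sketch with illustrative role names:

```python
SAMPLING_POLICY = {
    "classifier": {"temperature": 0.2, "frequency_penalty": 0.0},
    "extractor":  {"temperature": 0.1, "frequency_penalty": 0.0},  # plus schema-constrained output
    "generator":  {"temperature": 0.9, "frequency_penalty": 0.4},
    "verifier":   {"temperature": 0.5, "frequency_penalty": 0.0},
    "aggregator": {"temperature": 0.1, "frequency_penalty": 0.0},
}

def sampling_params(role: str) -> dict:
    """Fail loudly on unknown roles rather than silently defaulting to 0.7."""
    if role not in SAMPLING_POLICY:
        raise KeyError(f"no sampling policy defined for role: {role}")
    return SAMPLING_POLICY[role]
```

Putting the policy in one reviewable place also means temperature changes go through code review like any other reliability-affecting change.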

The Variance Budget Framework

A useful mental model for multi-agent temperature governance is the variance budget: the total allowable randomness in the system before accumulated uncertainty exceeds your error threshold.

Each agent running at temperature > 0 contributes variance. In a five-stage pipeline, five agents each contributing small amounts of variance can accumulate into a system that's unreliable in ways no single-agent analysis predicts. This is why debugging "the pipeline sometimes gives wrong answers" often reveals that no individual agent is broken—the failure is emergent from compounded sampling variance.
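
A sketch of what enforcing a variance budget can look like: measure each stage's run-to-run disagreement rate by replaying fixed inputs, then verify the compounded agreement stays above the system's reliability floor. All numbers here are illustrative:

```python
STAGE_DISAGREEMENT = {        # P(two identical runs differ), measured per stage
    "classifier": 0.02,
    "extractor":  0.01,
    "generator":  0.15,       # intentional: this is where diversity should live
    "verifier":   0.05,
    "aggregator": 0.01,
}

def within_budget(min_system_agreement: float = 0.75) -> bool:
    agreement = 1.0
    for disagreement in STAGE_DISAGREEMENT.values():
        agreement *= 1.0 - disagreement
    return agreement >= min_system_agreement

print(within_budget())  # 0.98 * 0.99 * 0.85 * 0.95 * 0.99 ≈ 0.776 -> True
```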
