Temperature Governance in Multi-Agent Systems: Why Variance Is a First-Class Budget
Most production multi-agent systems apply a single temperature value—copied from a tutorial, set once, never revisited—to every agent in the pipeline. The classifier, the generator, the verifier, and the formatter all run at 0.7 because that's what the README said. This is the equivalent of giving every database query the same timeout regardless of whether it's a point lookup or a full table scan. It feels fine until you start debugging failure modes that look like model errors but are actually sampling policy errors.
Temperature is not a global dial. It's a per-role policy decision, and getting it wrong creates distinct failure signatures depending on which direction you miss in.
What Temperature Actually Controls
Before designing a per-role policy, it helps to understand precisely what temperature does and doesn't do.
Temperature scales the logit values that emerge from the final transformer layer before they're converted into probabilities via softmax. At temperature=0, the model deterministically picks the highest-probability token at each step. At temperature=1.0, the distribution is used as-is. Higher values flatten the distribution, making low-probability tokens more competitive.
Two things temperature does not do: it doesn't remove tokens from consideration (that's top-k and top-p), and it doesn't guarantee deterministic output at zero (floating-point precision and batching effects introduce variance even then). More importantly for multi-agent design, it doesn't control what the model knows or doesn't know—it controls how consistently the model expresses what it knows.
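The scaling itself is simple enough to sketch directly. A minimal illustration with toy logits (real models operate over a full vocabulary, but the arithmetic is the same):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax.

    Temperature -> 0 concentrates probability mass on the argmax token;
    temperature > 1 flattens the distribution.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token logits: one clear favorite, two plausible alternatives.
logits = [4.0, 2.0, 1.0]

cold = softmax_with_temperature(logits, 0.2)   # near-deterministic
warm = softmax_with_temperature(logits, 1.0)   # distribution used as-is
hot  = softmax_with_temperature(logits, 2.0)   # flattened

# The top token's probability shrinks as temperature rises.
assert cold[0] > warm[0] > hot[0]
```

Note that the ranking of tokens never changes; only the gap between them does, which is exactly why temperature controls consistency rather than knowledge.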
The modern parameter stack goes further. Top-p (nucleus sampling) filters by cumulative probability mass. Top-k fixes the candidate pool size. Min-p, which gained traction after its ICLR 2025 oral presentation, adapts the threshold relative to the top token's probability—when the model is confident, it demands higher quality candidates; when uncertain, it relaxes. For production multi-agent systems, temperature combined with min-p is increasingly the standard approach for open-source deployments.
Repetition penalty is a separate mechanism entirely. It discourages re-using tokens already generated, operating on the logits before temperature is applied. Confusing repetition issues with temperature issues is one of the more common misdiagnoses in production debugging.
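Putting the pieces in order helps untangle those misdiagnoses: repetition penalty adjusts logits first, temperature scales them, min-p filters the resulting probabilities, and only then does sampling happen. A simplified single-step sketch (the divide/multiply penalty shown is one common CTRL-style formulation; real decoders operate on full vocabulary tensors):

```python
import math
import random

def sample_next_token(logits, generated, temperature=0.8,
                      min_p=0.05, repetition_penalty=1.2, rng=None):
    """One decoding step: repetition penalty -> temperature -> min-p -> sample."""
    rng = rng or random.Random(0)
    adjusted = list(logits)
    # 1. Repetition penalty acts on raw logits, before temperature.
    for tok in set(generated):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty
    # 2. Temperature scaling + softmax.
    scaled = [a / temperature for a in adjusted]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 3. Min-p: keep tokens whose probability >= min_p * top probability,
    #    so the candidate pool tightens when the model is confident.
    threshold = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= threshold]
    # 4. Sample from the renormalized surviving candidates.
    r = rng.random() * sum(p for _, p in kept)
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

The ordering is the point: tuning temperature will never fix a repetition problem, because the penalty has already fired (or failed to) before temperature touches anything.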
The Two Failure Modes Nobody Talks About Together
Most discussions of temperature focus on a single axis: too low is boring, too high is incoherent. But in multi-agent systems, the failure modes split cleanly by role, and you need to think about both ends simultaneously across different pipeline stages.
Over-confident classifiers emerge when you run classification agents at temperature=0 or very close to it. The model always picks its highest-probability answer, which sounds good until you consider borderline cases. A classifier that cannot express uncertainty will never flag a document for human review—it will make a definitive call even when the right answer is genuinely ambiguous. Research on multi-agent consensus frameworks found that moderate temperatures (0.1-0.3) allow models to occasionally select alternative classifications for borderline cases, which is exactly the uncertainty signal downstream components or human reviewers need.
The other failure mode of ultra-low temperature classifiers is surface pattern overfitting in the sampling sense: the model collapses to the single most-probable interpretation of ambiguous inputs, rather than maintaining any distribution over plausible alternatives. This is different from training overfitting but creates similar symptoms—confident wrong answers on out-of-distribution inputs.
Under-creative generators appear when content-producing agents run at temperatures appropriate for structured tasks. A summarization agent at temperature=0.2 will produce grammatically correct, semantically accurate output that reads like a form letter—same sentence structures, same transition words, same paragraph rhythm every time. Users don't complain about accuracy; they complain the content feels robotic. Increasing temperature on the generator doesn't fix this if you've also set frequency_penalty=0—the fix requires both: higher temperature to explore the distribution, and frequency_penalty around 0.3-0.5 to discourage repeating the same words and phrases across multiple generations.
Schema violations from high-temperature structured tasks are the third failure mode, and they interact with the others in multi-agent pipelines. When a generator at temperature=0.9 feeds output to a structured extraction agent also at temperature=0.9, and that agent needs to produce valid JSON, you've created a compounding problem. Raw JSON generation without constrained decoding fails 15-20% of the time even at moderate temperatures, and the first noticeable accuracy drops typically appear around temperature 1.5. A 2024 study on clinical information extraction found GPT-4o maintained 98.7-99.0% formatting accuracy from temperature 0.0 to 1.5, with significant degradation only above 1.75, but that is a single model measured in isolation. In a chained multi-agent pipeline, each stage's failure probability compounds.
Matching Sampling Policy to Role
The practical framework is straightforward once you separate roles:
Classifiers and routers should run at temperature 0.1-0.3. Not zero—you want the model to occasionally express uncertainty by sampling slightly off the mode. Zero creates false confidence. The goal is high determinism with minimal uncertainty signal, not complete determinism.
Structured extractors and formatters should run at temperature 0.0-0.2, and wherever possible, use provider-native structured output APIs (constrained decoding, grammar sampling, or tool-use schemas that enforce output format at generation time). OpenAI's Structured Outputs API achieved near-100% schema compliance—up from under 40% with raw JSON prompting—by enforcing schemas at the token level rather than hoping the model self-corrects. Temperature below 0.2 plus schema constraints gives you both variance control and structure guarantees.
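Where native constrained decoding isn't available, a validation-and-retry fallback is the minimum viable guardrail, and it doubles as the failure-rate counter you should be monitoring anyway. A minimal sketch, assuming the `generate` callable stands in for a low-temperature LLM call and the "schema" is just a required-key check:

```python
import json

def parse_structured_output(raw, required_keys):
    """Validate raw model output against a minimal schema.

    With provider-native constrained decoding this check should never fail;
    without it, this is the safety net.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not required_keys.issubset(obj):
        return None
    return obj

def extract_with_retry(generate, required_keys, max_attempts=3):
    """Call the generator until output parses, tracking failures.

    The failure count is the schema-violation metric worth alerting on.
    """
    failures = 0
    for _ in range(max_attempts):
        obj = parse_structured_output(generate(), required_keys)
        if obj is not None:
            return obj, failures
        failures += 1
    raise ValueError(f"schema validation failed {failures} times")
```

A real deployment would validate against a full JSON Schema rather than a key set, but the shape is the same: parse, validate, retry, count.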
Generators producing narrative content should run at temperature 0.7-1.0. If you're seeing repetition at these temperatures, the fix is frequency_penalty (0.3-0.7), not lower temperature. Lowering temperature only appears to cure repetition because it collapses the distribution toward the mode, which also collapses diversity. Frequency penalty specifically targets already-generated tokens without reshaping the rest of the distribution.
Verifiers and reviewers need a moderate setting—0.3-0.7—because they're doing evaluation work that benefits from some exploratory reasoning, but shouldn't produce wildly different verdicts across runs. A verifier at temperature=0.9 might reverse its conclusion between two identical runs, which makes the pipeline non-deterministic in unpredictable ways.
Consensus and aggregation agents that synthesize outputs from multiple upstream agents should run at low temperature (0.1-0.2). By this stage in the pipeline, diversity has already been introduced upstream. The aggregator's job is to make a reliable final call, not to introduce additional variance. Research on multi-agent consensus in deductive coding found that temperature had minimal effect on consensus accuracy once reasonable temperatures were used—the design of the consensus mechanism mattered more than the precise temperature value.
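Pulled together, the per-role guidance above can live as one explicit policy table instead of constants scattered across agent configs. A sketch in Python; the exact numbers are starting points drawn from the ranges above, not universal constants:

```python
# Per-role sampling policy. Values are tuning starting points, not gospel.
SAMPLING_POLICY = {
    "classifier": {"temperature": 0.2, "top_p": 0.9},
    "extractor":  {"temperature": 0.0, "top_p": 0.9},  # plus constrained decoding
    "generator":  {"temperature": 0.8, "top_p": 0.9, "frequency_penalty": 0.5},
    "verifier":   {"temperature": 0.5, "top_p": 0.9},
    "aggregator": {"temperature": 0.1, "top_p": 0.9},
}

def sampling_params(role):
    """Look up sampling parameters for a pipeline role.

    Fails loudly on unknown roles rather than silently inheriting
    a copy-paste default.
    """
    if role not in SAMPLING_POLICY:
        raise KeyError(f"no sampling policy defined for role {role!r}")
    return dict(SAMPLING_POLICY[role])
```

The point of the loud failure is organizational: a new agent added to the pipeline should force an explicit policy decision, not inherit 0.7 by accident.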
The Variance Budget Framework
A useful mental model for multi-agent temperature governance is the variance budget: the total allowable randomness in the system before accumulated uncertainty exceeds your error threshold.
Each agent running at temperature > 0 contributes variance. In a five-stage pipeline, five agents each contributing small amounts of variance can accumulate into a system that's unreliable in ways no single-agent analysis predicts. This is why debugging "the pipeline sometimes gives wrong answers" often reveals that no individual agent is broken—the failure is emergent from compounded sampling variance.
Variance budget allocation follows a few principles:
- Front-load variance, back-load consistency. Let generators and brainstormers run hot; let verifiers and formatters run cold. You want creative diversity early and reliable convergence late.
- Structured tasks consume zero variance budget. If a stage produces structured output that downstream code parses, it gets temperature 0.0-0.1 and constrained decoding. Spending variance budget on formatting is wasteful.
- Uncertainty signals are valuable, not just noise. A classifier running at temperature 0.2 that occasionally outputs a different label on the same input is telling you the input is borderline. Routing borderline cases to human review or a second-opinion agent is a feature, not a bug.
- Compounding is multiplicative, not additive. Five agents each at a 2% error rate produce a pipeline error rate of roughly 9.6% (1 - 0.98^5 ≈ 0.096), close to but not the same as the naive 10% sum, and correlated failures make it worse when upstream errors propagate downstream.
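The compounding arithmetic in that last bullet is worth making executable so it can gate design reviews. A minimal sketch, assuming independent stage failures (the optimistic case):

```python
def pipeline_error_rate(stage_error_rates):
    """Compound independent per-stage error rates into a pipeline rate.

    P(any stage fails) = 1 - prod(1 - p_i). Correlated failures, where
    upstream errors propagate downstream, make the real number worse.
    """
    ok = 1.0
    for p in stage_error_rates:
        ok *= (1.0 - p)
    return 1.0 - ok

# Five stages at 2% each: ~9.6%, not a simple 5 x 2% = 10% sum.
assert abs(pipeline_error_rate([0.02] * 5) - 0.096) < 0.001
```

Running this over a proposed pipeline before building it makes the variance budget concrete: add a stage, and you can see exactly what reliability it costs.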
Recent research on multi-agent robustness proposes formalized variance budgeting using adaptive two-stage sampling: an initial probe with a few samples to estimate variance, followed by a larger sample set sized according to the observed uncertainty. The principle translates to pipeline design: use cheap probes to detect high-variance stages, then apply tighter sampling constraints where variance is unexpectedly high.
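The two-stage idea can be sketched as follows. The probe sizes and variance ceiling here are illustrative assumptions, not values from the cited work, and `run_stage` stands in for one scored run of a pipeline stage:

```python
import statistics

def two_stage_probe(run_stage, probe_n=5, max_n=25, variance_ceiling=0.04):
    """Adaptive two-stage sampling: probe cheaply, sample more only if needed.

    run_stage: callable returning a scalar score for one run of the stage.
    Returns (mean score, observed variance).
    """
    probe = [run_stage() for _ in range(probe_n)]
    var = statistics.pvariance(probe)
    if var <= variance_ceiling:
        return statistics.mean(probe), var        # cheap path: stage is stable
    # High-variance stage: spend the larger sample budget here.
    full = probe + [run_stage() for _ in range(max_n - probe_n)]
    return statistics.mean(full), statistics.pvariance(full)
```

Stable stages cost five runs; only the stages that actually exhibit variance consume the full sample budget, which is the whole economic argument for probing.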
Common Misconfigurations in Production
Several patterns appear repeatedly in production systems with temperature problems:
The copy-paste temperature is temperature=0.7 applied uniformly because it was in a notebook somewhere. It's too high for classifiers and too low for creative generators. Almost no production multi-agent system should use a single temperature value across all components.
Fixing repetition with lower temperature is the most common misdiagnosis. If a generator repeats phrases, lowering temperature "fixes" repetition by collapsing diversity—but now the generator is worse at its actual job. The correct intervention is frequency_penalty.
Ignoring reasoning model constraints. Several frontier reasoning models (including some configurations of o1-series models) lock temperature to 1.0 and don't allow modification. If your pipeline routes some tasks to reasoning models, your temperature policy needs to account for stages where you have no control over the parameter.
Over-parameterizing with both temperature and top-p. The standard guidance is to alter temperature or top-p, not both simultaneously. When you use both, they create conflicting signals: temperature says "treat unlikely tokens as more plausible," while tight top-p says "only consider the most probable tokens." For most production use cases, set temperature to your target value and leave top-p at 0.9. If you need tighter diversity control, reduce top-p rather than reducing temperature further.
Schema violations from high-temperature extractors. This one is subtle: teams often introduce a structured extraction stage to "clean up" output from a creative generator, then run the extractor at the same temperature as the generator. The extractor should run at 0.0-0.1 with constrained decoding.
Building the Policy
In practice, documenting your temperature policy is as important as setting it. For each agent in your system:
- What role does this agent play? (Classify, generate, verify, format, aggregate)
- What's the acceptable error rate for this stage?
- Does this stage produce structured output that downstream code parses?
- Is this stage a creative generator where diversity matters?
From those answers, a per-role temperature table almost writes itself. The key insight is that temperature is a per-role resource allocation decision: how much variance does this stage need to do its job, and how much variance can it afford, given what comes after it?
Most teams set temperature once and move on. The teams that tune per-role policies tend to discover that their "model accuracy" problems were actually sampling policy problems all along—and those are dramatically cheaper to fix.
What to Measure
Once you have a per-role temperature policy, the monitoring question is whether each stage is behaving consistently with its setting.
For classifiers: track label variance across identical inputs (re-run the same input multiple times and measure the distribution of outputs). A classifier at temperature=0.1 should have very low variance; high variance on identical inputs at low temperature suggests the classification task is genuinely ambiguous and should be routed to human review.
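That label-variance check is cheap to implement directly. A minimal sketch, where `classify` stands in for one call to the classification agent:

```python
import math
from collections import Counter

def label_entropy(classify, text, n_runs=10):
    """Re-run a classifier callable on one input; report the label
    distribution and its Shannon entropy (0.0 = perfectly consistent).

    High entropy at low temperature flags a genuinely ambiguous input
    that should be routed to human review.
    """
    counts = Counter(classify(text) for _ in range(n_runs))
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in counts.values())
    return dict(counts), entropy
```

In production you'd sample re-runs on a small fraction of traffic rather than every input, but the entropy threshold that triggers review is the same.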
For generators: measure output diversity metrics (vocabulary breadth, n-gram diversity) across multiple runs. A generator at temperature=0.8 that produces identical outputs on repeated runs probably has its repetition penalty set too high.
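One standard diversity metric here is distinct-n: unique n-grams divided by total n-grams across a batch of outputs. A minimal whitespace-tokenized sketch:

```python
def distinct_n(texts, n=2):
    """Distinct-n diversity: unique n-grams / total n-grams across outputs.

    Values near 0 across repeated generator runs suggest collapsed
    diversity (temperature too low or penalties misconfigured); values
    near 1 indicate healthy variation.
    """
    all_ngrams = []
    for text in texts:
        tokens = text.split()
        all_ngrams.extend(tuple(tokens[i:i + n])
                          for i in range(len(tokens) - n + 1))
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)
```

Tracking this per generator, per release, turns "the content feels robotic" from a vague complaint into a regression you can alert on.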
For structured extractors: track schema validation failure rate as a primary metric, not an edge case. If constrained decoding is available, schema failures should be zero. If you're relying on prompt instructions to produce JSON, track the failure rate explicitly.
For the full pipeline: measure end-to-end variance by running identical inputs through the complete pipeline and measuring output consistency. Unexpected high variance at the system level, when individual agent temperatures are tuned correctly, often reveals unexpected state leakage between pipeline runs.
Temperature governance won't solve problems that are actually caused by poor prompts, bad retrieval, or wrong tool choices. But a surprising fraction of "the model is unreliable" complaints in multi-agent systems trace directly back to sampling policies that were never designed to match the pipeline's actual structure.
Sources
- https://arxiv.org/html/2506.07295v1
- https://arxiv.org/html/2507.11198
- https://pmc.ncbi.nlm.nih.gov/articles/PMC11731902/
- https://arxiv.org/abs/2407.01082
- https://arxiv.org/html/2507.04105
- https://arxiv.org/html/2502.05234v1
- https://www.promptfoo.dev/docs/guides/evaluate-llm-temperature/
- https://www.zenml.io/blog/llmops-in-production-457-case-studies-of-what-actually-works
- https://mbrenndoerfer.com/writing/repetition-penalties-language-model-generation/
- https://machinelearningplus.com/gen-ai/llm-temperature-top-p-top-k-explained/
