Temperature Is a Product Decision, Not a Model Knob

· 9 min read
Tian Pan
Software Engineer

When a new LLM feature ships, someone eventually asks: "what temperature should we use?" The answer is almost always the same: "I don't know, let's leave it at 0.7." Then the conversation moves on and nobody touches it again.

That's a product decision made by default. Temperature doesn't just control how "random" the model sounds — it shapes whether users trust outputs, whether they re-run queries, whether they feel helped or overwhelmed. Getting it right matters more than most teams realize, and getting it wrong in the wrong direction is hard to diagnose because the failure mode looks like bad model behavior rather than bad configuration.

What Temperature Actually Does (And Doesn't Do)

Temperature scales the logits before the softmax sampling step. Low values sharpen the probability distribution — the highest-probability token dominates and outputs become predictable. High values flatten it — more tokens become plausible, outputs become more varied.
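As a concrete sketch (plain Python, no model required): dividing the logits by the temperature before applying softmax is the whole mechanism. The logit values below are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply a numerically
    stable softmax. Low temperature sharpens the distribution;
    high temperature flattens it toward uniform."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # illustrative next-token logits
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 2.0)
# At T=0.2 the top token takes nearly all the probability mass;
# at T=2.0 the three tokens are much closer to equally likely.
```

The same three logits, two very different sampling distributions — that spread is all the knob changes.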

The common framing is "high temperature = creative, low temperature = factual," but research consistently shows this is too simple. A 2024 study on narrative generation found temperature is only weakly correlated with novelty, and showed no reliable relationship with cohesion or coherence. Separate work on problem-solving tasks found that temperature changes in the 0.0–1.0 range had no statistically significant effect on accuracy. Temperature doesn't make a model smarter, more creative, or better informed. It controls output variance — the spread of the distribution of possible outputs the model will produce.

That distinction matters because variance has direct product consequences. Variance is what determines whether users see the same answer twice, whether they discover outputs they didn't expect, whether they need to read carefully or can skim, and whether the system feels authoritative or exploratory.

Why Variance Is a Product Dimension

Consider two features built on the same base model:

  • A legal document summarizer for a compliance team
  • A brainstorming tool for a marketing team

The compliance team needs low variance. They're using the tool to confirm understanding, cross-check their own reading, and document decisions. If the same document produces two different summaries on two runs, they have a problem. They need to trust that the output is stable enough to share and cite. High variance actively undermines the feature's value proposition.

The marketing team needs some variance. They're trying to generate options — taglines, angles, framings they haven't considered. Giving them the same three taglines every time defeats the purpose. Unpredictability isn't a bug for ideation; it's the mechanism that creates value.

These aren't hypothetical differences. They translate directly to user behavior:

  • Low variance reduces decision fatigue. Users get one strong answer, act on it, and move on. This increases task completion rates in utility-focused workflows.
  • High variance creates something closer to variable-ratio reinforcement. Users run the feature again to see what else comes out. This increases engagement in exploratory workflows — and frustration in workflows that aren't exploratory.
  • Mismatched variance creates a category of failure that doesn't show up in accuracy evals: the feature produces correct outputs that feel wrong for the context.

A compliance tool that gives subtly different summaries each time will eventually get flagged as unreliable, even if every output is technically accurate. A brainstorming tool that gives the same three ideas every time will be abandoned, even if those ideas are genuinely good.

The Trust Calibration Problem

The most consequential effect of temperature on product experience is its influence on how users calibrate trust.

Low-variance outputs feel authoritative. The model commits to a position and repeats it consistently. This creates a perception of expertise — which is valuable in contexts where the model is likely to be correct (structured extraction, well-supported Q&A) and dangerous in contexts where it isn't (nuanced judgment calls, rapidly changing information domains).

High-variance outputs signal uncertainty. When the model gives different answers on different runs, sophisticated users read that as "the model isn't sure." This is often more honest, but it shifts cognitive load back to the user. They now need to evaluate competing outputs instead of accepting one.

Neither is universally better. The question is whether your output variance matches your product's epistemic contract with users. When a user asks a legal AI whether a clause is enforceable and the model gives three different answers on three runs, the user correctly concludes they can't rely on it — even if the "right" answer is somewhere in those three outputs. When a user asks a creative assistant for campaign concepts and gets the same three ideas every session, they correctly conclude the tool is less useful than they hoped.

The failure mode to watch for is misalignment between the variance your temperature setting produces and the variance users expect for that task type.

Common Mistakes

Accepting defaults without understanding them. Most API defaults sit around 0.7–1.0, tuned for general-purpose applications. These are defensible starting points, not optimal settings for specific use cases. Most teams never revisit them.

Using the same temperature across all features. A codebase that sets temperature in one shared config file and uses that setting for code completion, customer support responses, and content generation has made a product decision without realizing it. Those three features have different variance requirements.

Copying settings across model families. Temperature calibration is not standardized across models. The token probability distributions coming out of GPT-4, Claude, and Mistral differ in ways that make a temperature of 0.7 behave differently on each. A setting that works well on one model needs to be independently validated on another.

Treating low temperature as a reliability fix. Teams dealing with inconsistent outputs often lower temperature, hoping it will stabilize behavior. Sometimes it helps. More often, the inconsistency is caused by ambiguous prompts, insufficient context, or knowledge boundary violations — problems temperature doesn't fix. Lowering temperature on a broken prompt makes the output consistently broken.

Assuming temperature=0 means deterministic. Temperature=0 selects the highest-probability token at each step, which dramatically reduces variance, but doesn't eliminate it. Floating-point arithmetic on GPUs is non-associative. Mixture-of-experts models route tokens non-deterministically based on batch composition. OpenAI's documentation notes the system may silently increase temperature slightly to avoid pathological repetition loops. Temperature=0 is the first step toward consistency, not a guarantee of it.

A Decision Framework for Setting Temperature

Start by categorizing the feature's primary function:

Extraction and classification (entities, categories, structured fields from documents): use 0.0–0.2. The correct answer exists; variance is pure noise.

Factual Q&A (questions with documented answers, summarization, translation): use 0.2–0.4. Some phrasing variation is acceptable but the semantic content should be stable.

Conversational and support (customer service, interactive Q&A, chatbots): use 0.5–0.7. Natural language variety improves perceived warmth without introducing unreliability.

Generation with constraints (emails from a template, code with specific requirements, explanations with style guidelines): use 0.4–0.7. Enough variance to avoid mechanical outputs; enough constraint to stay on target.

Open-ended generation (brainstorming, creative writing, marketing copy exploration): use 0.8–1.2. Variance is the product. Watch for coherence degradation above 1.0 on most models.

These ranges are starting points, not rules. Model-specific calibration always matters more than generic guidance. Test your specific model on your specific inputs at several temperature settings before committing to one.
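One way to keep these starting points from drifting per call site is to encode them as data rather than scattering literals through the codebase. A minimal sketch — the category keys and single default values are my own simplification of the ranges above, not a standard:

```python
# Starting-point temperatures per task category, taken from the
# ranges discussed above; keys and defaults are illustrative.
TEMPERATURE_DEFAULTS = {
    "extraction": 0.1,              # range 0.0-0.2
    "factual_qa": 0.3,              # range 0.2-0.4
    "conversational": 0.6,          # range 0.5-0.7
    "constrained_generation": 0.5,  # range 0.4-0.7
    "open_ended": 0.9,              # range 0.8-1.2
}

def temperature_for(task_category: str) -> float:
    # Fail loudly rather than silently inheriting a shared default.
    if task_category not in TEMPERATURE_DEFAULTS:
        raise KeyError(f"No temperature calibrated for {task_category!r}")
    return TEMPERATURE_DEFAULTS[task_category]
```

The point of the explicit lookup is that adding a feature forces someone to categorize it — the decision gets made on purpose instead of by default.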

Testing Temperature Properly

The core challenge with temperature A/B tests is that LLM outputs have high inherent variance, which means you need larger sample sizes than you'd expect to reach statistical significance. Before running an experiment, do a power analysis. Decide what metric "better" means before running any tests — task completion, re-run rate, satisfaction score, and time-on-task all measure different things and can point in different directions.

Run at least a few temperature values (0.0, 0.3, 0.5, 0.7, 0.9) against a representative sample of real production inputs. For each, evaluate both the outputs themselves and the user behavior they produce. Measure whether users re-run the feature (high re-run on low temperature may signal distrust; high re-run on high temperature may signal the feature is working as a generator of options).
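A sketch of that sweep, assuming a hypothetical `generate(prompt, temperature)` wrapper around whatever model API you use:

```python
def temperature_sweep(generate, prompts,
                      temperatures=(0.0, 0.3, 0.5, 0.7, 0.9),
                      runs_per_prompt=5):
    """Collect several outputs per (temperature, prompt) pair so that
    downstream metrics can measure variance, not just average quality.
    `generate` is a hypothetical callable wrapping your model API."""
    results = {}
    for t in temperatures:
        results[t] = {
            p: [generate(p, t) for _ in range(runs_per_prompt)]
            for p in prompts
        }
    return results
```

Multiple runs per prompt are the important part: a single output per setting tells you nothing about the variance you're trying to calibrate.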

One practical shortcut: generate multiple outputs at your current temperature, compare them for semantic similarity, and ask whether the variance you see is the variance you want users to experience. If the outputs cluster tightly, users will never notice variance. If they diverge meaningfully, users will.
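That check can be approximated without any embedding service. The token-overlap (Jaccard) score below is a crude lexical stand-in for semantic similarity — swap in embedding cosine similarity for anything serious:

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Lexical-overlap similarity in [0, 1]; a rough proxy only."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def mean_pairwise_similarity(outputs):
    """Average similarity across all output pairs. Values near 1.0
    mean users will perceive 'the same answer every time'; low values
    mean the variance will be visible to them."""
    pairs = list(combinations(range(len(outputs)), 2))
    if not pairs:
        return 1.0
    return sum(jaccard(outputs[i], outputs[j]) for i, j in pairs) / len(pairs)
```

Run it on a handful of outputs at your current temperature and ask whether the number matches the experience you intend.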

Surfacing Variance to Users

Once you've chosen a temperature that produces appropriate variance for your feature, consider whether to surface that variance explicitly.

For features where some variance is expected and valuable, showing multiple candidate outputs is often better than showing one. Give users two or three options and let them choose. This converts variance from an inconsistency to a capability. It also shifts trust dynamics: instead of one authoritative answer the user might question, you're presenting a space of options and positioning the user as the decision-maker.

For features where variance exists but shouldn't be visible, build consistency mechanically. Cache responses to identical or near-identical inputs. Use structured outputs to constrain the shape of the response. Add a validation step that checks whether the current output diverges meaningfully from recent outputs on similar inputs — and re-generates if it does.
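A minimal sketch of the caching idea, with a hypothetical `generate` callable and lowercase/collapsed-whitespace normalization standing in for real near-duplicate detection:

```python
import hashlib

_response_cache = {}

def cached_generate(prompt, generate):
    """Serve repeats of the same (normalized) input from cache so users
    see a stable answer. `generate` is a hypothetical model call; the
    normalization here is deliberately crude — production near-duplicate
    matching needs more care (e.g. embedding-based lookup)."""
    normalized = " ".join(prompt.lower().split())
    key = hashlib.sha256(normalized.encode()).hexdigest()
    if key not in _response_cache:
        _response_cache[key] = generate(prompt)
    return _response_cache[key]
```

Beyond consistency, this also cuts cost and latency on repeated queries — a useful side effect of hiding variance mechanically.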

For features where you need determinism, build explicit tests for it. Don't assume temperature=0 is sufficient. Log production outputs and periodically sample pairs of identical queries to measure whether outputs match. If they don't, you have a variance problem that temperature isn't solving.
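Such a determinism check can be as simple as replaying the same query and counting exact matches — `completion_fn` here is any hypothetical callable that returns a string:

```python
def exact_match_rate(completion_fn, prompt, n=10):
    """Run the same prompt n times and report the fraction of outputs
    identical to the first. At temperature=0 this should be close to
    1.0, but as noted above it is not guaranteed to be exactly 1.0."""
    outputs = [completion_fn(prompt) for _ in range(n)]
    matches = sum(o == outputs[0] for o in outputs)
    return matches / n
```

Wiring this into a periodic job against production queries gives you an ongoing measurement rather than an assumption.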

When Model Updates Break Your Temperature Calibration

One underappreciated failure mode: when your LLM provider updates the underlying model, your temperature setting's effective behavior changes. The token probability distributions shift, meaning temperature=0.7 on model v1 produces a different variance profile than temperature=0.7 on model v2.

This is one of the reasons why many behavioral regressions after model upgrades look inexplicable — the prompts haven't changed, the temperature hasn't changed, but the output distribution has shifted. If your feature relies on a specific variance level, treat model upgrades as a trigger to re-validate temperature settings, not just output quality.

The Bottom Line

Temperature is a lever for controlling user experience, not just model behavior. The variance it introduces shapes trust, re-engagement, decision fatigue, and how users calibrate reliance on your feature.

The teams that get this right don't treat temperature as a technical detail they can leave to defaults. They treat it as a product choice with downstream UX consequences — one that gets tested like any other product variable, calibrated to the specific task and user context, and revisited when the model changes underneath it.

Pick the variance you want users to experience. Then test whether your temperature setting produces it.
