Determinism Budgets: Treat Randomness as a Per-Surface Allocation, Not a Global Knob

· 11 min read
Tian Pan
Software Engineer

The temperature debate is the most religious argument in AI engineering, and one of the least productive. Two camps form on every team: the determinists who want temperature pinned at zero everywhere because they cannot debug a flaky system, and the creatives who want it cranked up because the outputs feel more "alive." Both are wrong, because both are answering the question at the wrong level. Temperature is not a global setting. It is a budget — and like any budget, it should be allocated, not declared.

The productive frame is simple: every model call in your system has a purpose, and randomness either earns its keep at that surface or it does not. A planner deciding which tool to call next has nothing to gain from variation; an off-by-one tool selection is a debugging nightmare and there is no creative upside. A response-synthesis surface that summarizes a search result for ten thousand users gets robotic in a hurry if every user sees the same phrasing — and the SEO team will eventually flag the boilerplate. A brainstorming surface where the model proposes alternatives for a human to pick from is worse at temperature 0; the diversity is the feature.

If you cannot articulate what randomness is for at a given call site, you should not be paying for it.

The Surface Map: Where Randomness Earns Its Keep

The first practical move is to stop thinking about your system as "calls to an LLM" and start thinking about it as a collection of distinct surfaces, each with its own randomness purpose. A useful taxonomy:

  • Routing. The model picks one of N options — a tool, a sub-agent, a branch in a workflow. Temperature should be 0 (or as close as the provider allows). Variance here is pure cost: a flaky router produces a debugging trail that crosses three logs and four engineers.
  • Parsing. The model is asked to extract structured fields, normalize a value, or emit JSON. Temperature should be 0. Any non-zero setting on a structured-output call is a latent bug whose only manifestation is a flaky parser two months from now.
  • Synthesis. The model summarizes, rewrites, or composes a response from retrieved context. Temperature 0.2–0.5 is usually right. Low enough to be faithful to the source, high enough to avoid copy-paste robotics across thousands of identical queries.
  • Generation. The model produces a full creative output — marketing copy, an essay outline, a code skeleton. Temperature 0.5–0.8. The job is to be plausibly varied; pure greedy decoding flattens the work.
  • Exploration. The model proposes alternatives for a human to pick from, runs N samples for self-consistency, or feeds a downstream selector. Temperature 0.7+ and intentional sampling. Diversity is the entire point; treating these calls like routing calls is what makes "AI brainstorming" features feel like a thesaurus with a chatbot wrapper.

Notice that "creative writing" is not on this list as a single category. The marketing-copy and brainstorming surfaces look identical from outside but have different randomness purposes — one wants plausibly different outputs across users, the other wants intentionally diverse alternatives within a single user's session. Conflating them is the most common drift point.
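To make the taxonomy concrete, here is one shape the purpose-to-parameters mapping can take. A minimal sketch: the `Purpose` enum, the specific values, and the `sampling_params` helper are illustrative, not any framework's API; tune the numbers against your own evals.

```python
from enum import Enum

class Purpose(str, Enum):
    ROUTING = "routing"
    PARSING = "parsing"
    SYNTHESIS = "synthesis"
    GENERATION = "generation"
    EXPLORATION = "exploration"

# One place in the codebase owns the randomness budget.
# Values mirror the taxonomy above; adjust per eval, not per vibe.
SAMPLING_BUDGET: dict[Purpose, dict] = {
    Purpose.ROUTING:     {"temperature": 0.0},
    Purpose.PARSING:     {"temperature": 0.0},
    Purpose.SYNTHESIS:   {"temperature": 0.3},
    Purpose.GENERATION:  {"temperature": 0.7},
    Purpose.EXPLORATION: {"temperature": 0.9, "n": 5},  # diversity is the point
}

def sampling_params(purpose: Purpose) -> dict:
    """Call sites declare a purpose; they never name a temperature directly."""
    return dict(SAMPLING_BUDGET[purpose])
```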

Parsing Surfaces: Where Non-Zero Temperature Is a Bug

For the parsing surface, the rule is unambiguous: if the response will be parsed by code, the temperature should be zero. Practitioners argue about this because they have heard the counterintuitive result that, for some complex multi-section schemas, a small amount of temperature actually improves field completion — a strict greedy decoder can short-circuit before all fields are emitted, especially when the prompt has many parts.

The right response to that result is to fix the schema or the prompt, not to raise the temperature. A non-zero parser temperature papers over the symptom and creates a different problem: now your structured output is non-deterministic, which means your downstream parser has to handle a wider set of inputs, which means flaky tests, intermittent regressions, and a class of bugs that occur only in production when no engineer is looking. Solving "the model truncates fields" with sampling takes a deterministic problem and turns it into a probabilistic one.

When the schema or prompt fix is genuinely not available — vendor JSON mode, no constrained decoding, complex multi-section output — the right move is to wrap the call in retry-with-validation, not to raise temperature. Retry produces a deterministic call that converges to a deterministic result; raising temperature produces a probabilistic call that converges nowhere.
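A minimal sketch of that wrapper, assuming a `call_model` function that hits your provider and a `validate` step that raises on bad output; both names are placeholders for whatever your stack actually provides.

```python
import json

class InvalidOutput(Exception):
    pass

def validate(payload: dict) -> dict:
    # Placeholder: check required fields, types, enum values, etc.
    if "fields" not in payload:
        raise InvalidOutput("missing 'fields'")
    return payload

def parse_with_retry(prompt: str, call_model, max_attempts: int = 3) -> dict:
    """Retry on validation failure instead of raising temperature:
    the call stays at temperature 0 on every attempt."""
    last_error = None
    for attempt in range(max_attempts):
        raw = call_model(prompt, temperature=0.0)
        try:
            return validate(json.loads(raw))
        except (json.JSONDecodeError, InvalidOutput) as err:
            last_error = err  # log attempt number and error in a real system
    raise InvalidOutput(f"failed after {max_attempts} attempts: {last_error}")
```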

The Cost Frame Nobody Surfaces

Higher temperature is correlated with longer outputs because the model wanders. A "creative" synthesis surface that runs at temperature 0.9 will burn meaningfully more output tokens than the same surface at 0.3, with no quality lift you can measure on a held-out eval. This is the cost frame that practitioners rarely surface in their cost models because it crosses the boundary between two teams: the prompt team is responsible for quality, the platform team is responsible for tokens, and the temperature-output-length link sits in the gap.

Make the link visible. Plot per-surface output length against temperature. You will find at least one surface where the model is wandering — often a brainstorming or synthesis call where someone bumped the temperature once to fix a quality issue and never reset it after the prompt was rewritten. Dropping that surface from 0.9 to 0.5 typically shaves 20–30% off output tokens with no measurable quality regression. Compounded across millions of calls a month, that is real money for a single config change.
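One way to surface the link, sketched with pandas over a hypothetical call-log export; the file name and column names are assumptions about your logging schema.

```python
import pandas as pd

# Assumed schema: one row per model call, with the purpose tag from the
# taxonomy above already attached at the call site.
calls = pd.read_parquet("model_calls.parquet")  # columns: surface, temperature, output_tokens

audit = (
    calls.groupby(["surface", "temperature"])["output_tokens"]
         .agg(["count", "mean", "median"])
         .sort_values("mean", ascending=False)
)
# Surfaces where a high temperature rides along with long outputs
# float to the top of this table.
print(audit)
```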

The inverse case is more interesting. Raising temperature on a routing call looks "free" at the call site: same input length, same output length. But the cost shows up downstream as retry traffic when the wrong tool gets called and the agent has to back out. The expensive randomness is the kind that costs nothing at the call site and everything at the system level.

Sampling Variance Pollutes Your Eval Suite

The eval discipline is where most teams lose the thread. The pattern is: production runs at temperature 0.7, the eval suite also runs at temperature 0.7 because "we want to evaluate what production does," and then a model upgrade ships and the eval scores wobble by two points. Is that a regression, or is it sampling noise?

You cannot tell. And because you cannot tell, you cannot decide whether to roll back. Sampling noise has eaten your regression signal.

The fix is to split the question. Run regression evals at temperature 0 — even when production runs at non-zero — so a regression signal is not washed out by sampling. Separately, run a variance budget eval at production temperature with N samples per question, to confirm that production-time variance is within historical bounds. Two evals, two questions: "did the model get worse" and "is the model rolling different dice than usual." Recent research on noise sources in LLM evaluations confirms that prediction noise from sampling can dominate at higher temperatures, and that averaging across samples is where the statistical power comes from — not from running once at production temperature and squinting at the result.
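A sketch of the two-eval split, assuming a `call_model` client and a per-question `score` function; the sample count and the shape of the harness are illustrative.

```python
import statistics

def regression_eval(questions, call_model, score) -> float:
    """Question one: did the model get worse? Greedy decoding, one sample,
    so a score drop is regression signal, not sampling noise."""
    return statistics.mean(
        score(q, call_model(q, temperature=0.0)) for q in questions
    )

def variance_budget_eval(questions, call_model, score,
                         prod_temperature=0.7, n_samples=8) -> float:
    """Question two: is production-time variance within historical bounds?
    N samples per question at production temperature; report the spread."""
    spreads = []
    for q in questions:
        scores = [score(q, call_model(q, temperature=prod_temperature))
                  for _ in range(n_samples)]
        spreads.append(statistics.stdev(scores))
    return statistics.mean(spreads)  # compare against a historical baseline
```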

A subtler trap: if your eval judge is itself an LLM, the judge's temperature matters too. A judge at temperature 0.7 will rate the same answer differently across runs. The eval pipeline that runs the production model deterministically but the judge stochastically is the worst of both worlds — you have removed the noise you can analyze and kept the noise you cannot. Pin the judge at zero unless you have a specific reason to vary it.

Temperature 0 Is Not Determinism

Now the unwelcome truth: pinning everything to zero does not get you a deterministic system. It gets you a system whose probability mass is concentrated on the most likely token at each step — which is usually but not always the same token, because real inference runs in floating-point, and floating-point arithmetic is not associative. Two candidate tokens with very close logits can swap order based on the order of intermediate sums in a matrix reduction, and once one token changes, every subsequent step runs on a different prefix and the divergence compounds.

There is also the batch-invariance problem: production inference servers batch your request with whatever else arrived at the same time, and the result for your request can depend on the composition of the batch. Same prompt, same temperature, same model — different output, because a different set of other users happened to arrive in the same millisecond.

Practical implications:

  • Do not promise byte-identical outputs in your contract or your tests. Your retry-with-validation and your regression eval need to tolerate some surface-level variation even at temperature 0.
  • Do not chase "we ran it twice and got different answers at temperature 0" as a bug. It is a property of the substrate, not your prompt.
  • Treat reproducibility as a property of the system, not a property of the call. If you need replay for forensics, log the full request, response, and model fingerprint rather than assuming you can rerun the call and get the same thing (a minimal logging sketch follows this list).
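The record shape below is an assumption about what forensics needs; some providers do expose a backend fingerprint on the response (OpenAI's `system_fingerprint`, for example), which is worth capturing when present.

```python
import json
import time
import uuid

def log_model_call(request: dict, response: dict,
                   logfile: str = "model_calls.jsonl") -> None:
    """Forensics-grade record: enough to inspect what happened without
    assuming a rerun will reproduce it."""
    record = {
        "call_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "request": request,  # full prompt plus sampling parameters
        "response": response.get("content"),
        "model": response.get("model"),
        # Provider-dependent: a fingerprint of the backend configuration
        # that served this call, when the provider exposes one.
        "model_fingerprint": response.get("system_fingerprint"),
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(record) + "\n")
```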

The mental model that survives is this: temperature 0 is the purpose declaration, not a guarantee. You are telling the system "I want the most likely token here" — the system tries hard to give you that, but the floating-point and batching layers underneath are not under your control. The contract you actually have is "low variance," not "zero variance," and the system around the call has to be built for that.

The Org Failure Mode and the Discipline That Replaces It

The org failure mode is straightforward: someone, usually early in the project, copies a temperature value from a tutorial — often 0.7, the LangChain example default — and puts it in a shared config. Every team building on the platform inherits the value. Every prompt that gets cloned for a new use case also clones the temperature. Six months later, the parser surfaces are running at 0.7, the routing surfaces are running at 0.7, and the synthesis surfaces are running at 0.7, and the only surface where 0.7 is actually correct is the brainstorming feature that nobody has launched yet.

The discipline that replaces this is small but enforceable:

  • Annotate every model call with a purpose: routing, parsing, synthesis, generation, exploration. Five tags, one per call. The annotation lives next to the call site, not in a doc.
  • Set sampling parameters per purpose, not per provider. A routing call should have the same temperature whether it goes to Claude, GPT, or a local model. The provider-specific config layer translates purpose to provider parameters; the call site never names a temperature directly.
  • Contract-test the mapping. A test that fails when a routing or parsing call has a temperature above 0. Cheap, fast, catches the copy-paste bug at PR time instead of in production (a sketch follows this list).
  • Audit periodically. Every quarter, list the actual temperature values being used in production by purpose tag. Drift will show up. Someone will have raised a parser temperature to 0.3 to "fix" a flaky output and forgotten to drop it back. Catch it.
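The contract test, in the pytest idiom, building on the `Purpose` mapping sketched earlier; the `llm_config` module name is hypothetical.

```python
import pytest

# Assumes the Purpose enum and sampling_params mapping sketched earlier,
# living in a (hypothetical) shared config module.
from llm_config import Purpose, sampling_params

DETERMINISTIC = (Purpose.ROUTING, Purpose.PARSING)

@pytest.mark.parametrize("purpose", DETERMINISTIC)
def test_deterministic_surfaces_stay_pinned(purpose):
    # Catches the copy-paste bug at PR time: a routing or parsing call
    # must never inherit a creative temperature.
    assert sampling_params(purpose)["temperature"] == 0.0

def test_every_purpose_declares_a_budget():
    # No surface ships without an explicit randomness allocation.
    for purpose in Purpose:
        assert "temperature" in sampling_params(purpose)
```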

This is a small amount of process for what is, in practice, the difference between a system that feels alive where it should and predictable where it must, and a system that is randomly broken in places nobody can predict.

The Architectural Realization

Randomness is a tunable that should be designed surface-by-surface, with a stated purpose, not inherited as a framework default. The teams that get this right treat sampling parameters the same way they treat database isolation levels — boring, deeply consequential, and worth a small amount of upfront design. The teams that get it wrong treat sampling parameters the same way they treat log levels — copy-pasted, never reviewed, occasionally the source of an incident nobody traces to its actual cause.

The one-line takeaway: if you cannot say what randomness is for at a given call, you have an unbudgeted cost. Spend it on purpose, or stop spending it.
