Determinism Budgets: Treat Randomness as a Per-Surface Allocation, Not a Global Knob

· 11 min read
Tian Pan
Software Engineer

The temperature debate is the most religious argument in AI engineering, and one of the least productive. Two camps form on every team: the determinists who want temperature pinned at zero everywhere because they cannot debug a flaky system, and the creatives who want it cranked up because the outputs feel more "alive." Both are wrong, because both are answering the question at the wrong level. Temperature is not a global setting. It is a budget — and like any budget, it should be allocated, not declared.

The productive frame is simple: every model call in your system has a purpose, and randomness either earns its keep at that surface or it does not. A planner deciding which tool to call next has nothing to gain from variation; an off-by-one tool selection is a debugging nightmare and there is no creative upside. A response-synthesis surface that summarizes a search result for ten thousand users gets robotic in a hurry if every user sees the same phrasing — and the SEO team will eventually flag the boilerplate. A brainstorming surface where the model proposes alternatives for a human to pick from is worse at temperature 0; the diversity is the feature.

If you cannot articulate what randomness is for at a given call site, you should not be paying for it.

The Surface Map: Where Randomness Earns Its Keep

The first practical move is to stop thinking about your system as "calls to an LLM" and start thinking about it as a collection of distinct surfaces, each with its own randomness purpose. A useful taxonomy:

  • Routing. The model picks one of N options — a tool, a sub-agent, a branch in a workflow. Temperature should be 0 (or as close as the provider allows). Variance here is pure cost: a flaky router produces a debugging trail that crosses three logs and four engineers.
  • Parsing. The model is asked to extract structured fields, normalize a value, or emit JSON. Temperature should be 0. Any non-zero setting on a structured-output call is a latent bug whose only manifestation is a flaky parser two months from now.
  • Synthesis. The model summarizes, rewrites, or composes a response from retrieved context. Temperature 0.2–0.5 is usually right. Low enough to be faithful to the source, high enough to avoid copy-paste robotics across thousands of identical queries.
  • Generation. The model produces a full creative output — marketing copy, an essay outline, a code skeleton. Temperature 0.5–0.8. The job is to be plausibly varied; pure greedy decoding flattens the work.
  • Exploration. The model proposes alternatives for a human to pick from, runs N samples for self-consistency, or feeds a downstream selector. Temperature 0.7+ and intentional sampling. Diversity is the entire point; treating these calls like routing calls is what makes "AI brainstorming" features feel like a thesaurus with a chatbot wrapper.
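
One way to make the taxonomy above enforceable is to write it down as a table that code can check. The following is a minimal sketch, not a prescribed implementation: the names `Surface`, `Budget`, `BUDGETS`, and `check_temperature` are hypothetical, the ranges are copied from the list above, and the 1.0 ceiling for exploration is an assumption because the list only says "0.7+".

```python
from dataclasses import dataclass
from enum import Enum


class Surface(Enum):
    ROUTING = "routing"
    PARSING = "parsing"
    SYNTHESIS = "synthesis"
    GENERATION = "generation"
    EXPLORATION = "exploration"


@dataclass(frozen=True)
class Budget:
    min_temp: float
    max_temp: float
    purpose: str  # what randomness is for at this surface


# Ranges come from the taxonomy in the text. The 1.0 upper bound for
# exploration is an assumption; the text only specifies "0.7+".
BUDGETS = {
    Surface.ROUTING: Budget(0.0, 0.0, "none: variance is pure cost"),
    Surface.PARSING: Budget(0.0, 0.0, "none: output is parsed by code"),
    Surface.SYNTHESIS: Budget(0.2, 0.5, "avoid identical phrasing across users"),
    Surface.GENERATION: Budget(0.5, 0.8, "plausibly varied creative output"),
    Surface.EXPLORATION: Budget(0.7, 1.0, "diverse alternatives for a selector"),
}


def check_temperature(surface: Surface, temperature: float) -> float:
    """Return the temperature if it fits the surface's budget, else raise."""
    budget = BUDGETS[surface]
    if not budget.min_temp <= temperature <= budget.max_temp:
        raise ValueError(
            f"{surface.value}: temperature {temperature} outside "
            f"[{budget.min_temp}, {budget.max_temp}] ({budget.purpose})"
        )
    return temperature
```

Calling `check_temperature(Surface.PARSING, 0.2)` raises, which turns "someone bumped the temperature" from a silent config change into a failure that has to be justified against the surface's stated purpose.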

Notice that "creative writing" is not on this list as a single category. The marketing-copy and brainstorming surfaces look identical from outside but have different randomness purposes — one wants plausibly different outputs across users, the other wants intentionally diverse alternatives within a single user's session. Conflating them is the most common drift point.

Parsing Surfaces: Where Non-Zero Temperature Is a Bug

For the parsing surface, the rule is unambiguous: if the response will be parsed by code, the temperature should be zero. Practitioners argue about this because they have heard the counterintuitive result that, for some complex multi-section schemas, a small amount of temperature actually improves field completion — a strict greedy decoder can short-circuit before all fields are emitted, especially when the prompt has many parts.

The right response to that result is to fix the schema or the prompt, not to raise the temperature. A non-zero parser temperature solves the symptom and creates a different problem: now your structured output is non-deterministic, which means your downstream parser has to handle a wider set of inputs, which means flaky tests, intermittent regressions, and a class of bugs that occur only in production when no engineer is looking. Solving "the model truncates fields" with sampling is taking a deterministic problem and turning it into a probabilistic one.

When the schema or prompt fix is genuinely not available (vendor JSON mode, no constrained decoding, complex multi-section output), the right move is to wrap the call in retry-with-validation, not to raise temperature. A retry that feeds the validation error back into the prompt is still a deterministic call for a given input, so the sequence converges to a deterministic result; raising temperature produces a probabilistic call that converges nowhere.
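
A retry-with-validation wrapper might look roughly like the sketch below. Everything named here is an assumption for illustration: `call_model` stands in for whatever client function your provider exposes (taking a prompt and a temperature, returning text), and `REQUIRED_FIELDS` is a made-up schema. The point it shows is that every attempt runs at temperature 0 and each retry differs only by the validation error appended to the prompt.

```python
import json
from typing import Callable, Optional, Tuple

REQUIRED_FIELDS = ("title", "summary", "tags")  # hypothetical schema


def validate(raw: str) -> Tuple[Optional[dict], str]:
    """Return (parsed, "") on success or (None, error message) on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "top-level value must be an object"
    missing = [f for f in REQUIRED_FIELDS if f not in data]
    if missing:
        return None, f"missing fields: {', '.join(missing)}"
    return data, ""


def parse_with_retry(
    call_model: Callable[[str, float], str],  # hypothetical provider wrapper
    prompt: str,
    max_attempts: int = 3,
) -> dict:
    """Call the model at temperature 0, feeding validation errors back on retry."""
    current = prompt
    error = ""
    for _ in range(max_attempts):
        raw = call_model(current, 0.0)  # temperature stays pinned at zero
        data, error = validate(raw)
        if data is not None:
            return data
        # The retry prompt changes only by the appended, deterministic error.
        current = (
            f"{prompt}\n\nYour previous reply was rejected ({error}). "
            "Return the full JSON object."
        )
    raise ValueError(f"no valid output after {max_attempts} attempts: {error}")
```

Because the retry prompt is a pure function of the original prompt and the validation error, a failing case can be replayed attempt by attempt when debugging, which is not true of a call that succeeded only because sampling happened to land well.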

The Cost Frame Nobody Surfaces

Higher temperature is correlated with longer outputs because the model wanders. A "creative" synthesis surface that runs at temperature 0.9 will burn meaningfully more output tokens than the same surface at 0.3, with no quality lift you can measure on a held-out eval. This is the cost frame that practitioners rarely surface in their cost models because it crosses the boundary between two teams: the prompt team is responsible for quality, the platform team is responsible for tokens, and the temperature-output-length link sits in the gap.

Make the link visible. Plot per-surface output length against temperature. You will find at least one surface where the model is wandering — often a brainstorming or synthesis call where someone bumped the temperature once to fix a quality issue and never reset it after the prompt was rewritten. Dropping that surface from 0.9 to 0.5 typically shaves 20–30% off output tokens with no measurable quality regression. Compounded across millions of calls a month, that is real money for a single config change.
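
Before plotting anything, a grouped average is often enough to spot the wandering surface. The sketch below assumes call logs are available as dicts with the hypothetical keys `surface`, `temperature`, and `output_tokens`; the 20% threshold in `wandering_surfaces` is an arbitrary illustrative default, not a recommendation from any benchmark.

```python
from collections import defaultdict
from statistics import mean


def mean_output_tokens(records):
    """Average output tokens per (surface, temperature) bucket.

    Each record is a dict with hypothetical keys "surface", "temperature",
    and "output_tokens"; adapt these to whatever your logging emits.
    """
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["surface"], r["temperature"])].append(r["output_tokens"])
    return {key: mean(vals) for key, vals in sorted(buckets.items())}


def wandering_surfaces(records, threshold=0.2):
    """Flag surfaces whose highest-temperature bucket is longer than the
    lowest-temperature bucket by more than `threshold` (relative growth)."""
    by_surface = defaultdict(list)
    for (surface, temp), avg in mean_output_tokens(records).items():
        by_surface[surface].append((temp, avg))
    flagged = {}
    for surface, points in by_surface.items():
        points.sort()  # ascending by temperature
        low_avg, high_avg = points[0][1], points[-1][1]
        if len(points) > 1 and low_avg > 0:
            growth = (high_avg - low_avg) / low_avg
            if growth > threshold:
                flagged[surface] = round(growth, 2)
    return flagged
```

The table from `mean_output_tokens` is what you would feed to a plotting library; `wandering_surfaces` is just a quick filter for which surfaces deserve a closer look. Length growth alone does not say whether quality changed, so any temperature reduction still needs to be checked against an eval.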

The inverse case is more interesting. Raising temperature on a routing call looks free at the call site (same input length, same output length), but the cost shows up downstream as retry traffic when the wrong tool gets called and the agent has to back out. The expensive randomness is the kind that costs nothing at the call site and everything at the system level.
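
The system-level cost can be written as simple expected-value arithmetic. This is a deliberately crude sketch: the function name and all three parameters are hypothetical, and it assumes each misroute triggers exactly one recovery of fixed cost, which real agents will not obey.

```python
def expected_cost_per_request(call_cost, misroute_rate, recovery_cost):
    """Expected cost of one routed request.

    call_cost:      cost of the routing call itself (unchanged by temperature)
    misroute_rate:  fraction of requests where the wrong tool is chosen
    recovery_cost:  cost of backing out and retrying after a misroute
    All three are illustrative inputs; measure them from your own traces.
    """
    if not 0.0 <= misroute_rate <= 1.0:
        raise ValueError("misroute_rate must be between 0 and 1")
    return call_cost + misroute_rate * recovery_cost
```

The call-site term is identical at any temperature; only `misroute_rate` moves, and it is multiplied by the most expensive number in the expression.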

Sampling Variance Pollutes Your Eval Suite
