The Non-Determinism Tax: Building Reliable Pipelines on Probabilistic Infrastructure
Expecting reproducible outputs from temperature=0 is one of the most common misconceptions in production LLM engineering. The intuition is appealing: temperature controls randomness, so zero temperature means zero randomness. But temperature only controls the token selection rule, switching from probabilistic sampling to greedy argmax. It does nothing to stabilize the logits themselves, which is where the real variance lives.
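A toy sketch of the distinction (the logit values here are invented for illustration): the selection rule at temperature=0 is perfectly deterministic, but if the logits themselves wobble at the float level, the argmax can flip.

```python
def greedy_pick(logits):
    # temperature=0 decoding: always take the argmax; no randomness here
    return max(range(len(logits)), key=lambda i: logits[i])

# Two logit vectors for the "same" forward pass, differing only by float-level
# noise of the kind that batching and reduction order introduce.
run_a = [4.1000000, 4.0999999, 1.2]
run_b = [4.0999999, 4.1000001, 1.2]

token_a = greedy_pick(run_a)  # index 0 wins
token_b = greedy_pick(run_b)  # index 1 wins: same rule, different token
```

The selection rule never changed; the inputs to it did.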
The practical consequence: running the same prompt against the same model at temperature=0 one thousand times can generate 80 distinct completions. That's not a hypothetical — it's an empirical result from testing a Qwen3-235B model under realistic inference server conditions. Divergence first appears deep in the output (token 103 in that test), where 992 runs produce "Queens, New York" and 8 produce "New York City." Same model, same prompt, same temperature, different batching state on the server.
This post explains what actually drives that variance, how to measure it, and the three architectural patterns that let you build reliable systems on top of inherently probabilistic infrastructure.
Why Temperature=0 Isn't Deterministic
The core issue is floating-point arithmetic. Floating-point addition is not associative: (a + b) + c does not necessarily equal a + (b + c) because each operation introduces a rounding error. In GPU parallel execution, threads run in non-deterministic order, which means intermediate values accumulate in different sequences across runs. Those tiny rounding differences compound through the attention mechanism, layer norm, and softmax — occasionally shifting which token has the highest logit. When a second token edges out the first by a margin smaller than floating-point noise, output diverges.
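The classic demonstration runs anywhere Python does; the same effect, at far larger scale and in parallel, is what happens inside a GPU reduction:

```python
# Floating-point addition is not associative: each add rounds its result.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6

print(left == right)  # False: same numbers, different grouping

# A GPU reduction sums thousands of terms, and the grouping depends on the
# kernel's thread schedule, so the rounding path can vary the same way.
```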
That's the source of variance within a single machine. In production, you have several additional amplifiers.
Batch composition effects. Modern inference servers (vLLM, SGLang, TensorRT-LLM) use continuous batching to pack multiple requests together. The kernel reduction strategies for softmax, layer norm, and attention change their numerical path depending on how many items are in the batch. Your request processed alone produces different logits than the same request processed alongside seventeen others. Server load — which changes second to second — determines which batch you land in.
Tensor parallelism across GPUs. When inference is distributed across multiple GPUs, row-parallel layers split tensors and reduce across devices. Changing the number of GPUs changes the reduction order, which changes the floating-point rounding path. One paper showed that a Qwen3-8B model produced four distinct outputs across tensor-parallel sizes of 1, 2, 4, and 8 GPUs, with accuracy varying by more than 4 percentage points on a math benchmark — purely due to the TP size, not any model or prompt change.
Sparse mixture-of-experts routing. In MoE architectures, tokens from different sequences compete for fixed-size expert groups within a batch. The same token can be routed to different experts depending on which other tokens happen to be in the same batch. This means MoE models are only deterministic at the batch level, not the sequence level: changing which other requests share your batch changes which experts process your tokens.
Hardware and CUDA version differences. Different GPU generations and CUDA releases implement reductions slightly differently. Across provider regions — which may run different hardware or driver versions — the same seed produces different outputs. OpenAI's system_fingerprint field exists precisely because backend infrastructure changes "a few times a year" and each change breaks any seed-based reproducibility guarantees. The seed parameter is documented as "best effort" — not a contract.
How Bad Is It in Practice?
The gap between best-case and worst-case performance across natural runs is the number that should recalibrate your expectations.
A formal study tested five LLMs across eight tasks with ten runs each. Accuracy variation across naturally occurring runs reached 15% on some tasks. But the striking finding was the best possible vs. worst possible gap — the range between the luckiest and unluckiest run configurations. On college-level math:
- Mixtral-8x7b: 75% best vs. 3% worst — a 72 percentage point swing
- GPT-4o: 88% best vs. 44% worst — a 44 percentage point swing
- Llama-3-70B: 85% best vs. 22% worst — a 63 percentage point swing
These aren't different prompts or different models. They're the same model, same prompt, different run. A separate reproducibility audit of 85 LLM papers from two major software engineering conferences found that zero out of five executable studies achieved full reproducibility.
The production implications compound. Any A/B test comparing "model version A" vs. "model version B" is contaminated by this noise. If version A shows a 2% accuracy improvement but batch-induced variance alone can swing results by 15%, the comparison may be statistically meaningless. RLHF fine-tuning pipelines are silently vulnerable too: if the inference engine uses a different tensor-parallel configuration than the training engine, the two systems have numerically different logits for the same model weights, creating a covert off-policy distribution shift that can destabilize reward training without producing obvious error signals.
Measuring Reproducibility
Exact string match is the wrong metric for reproducibility. Small formatting variations inflate reported instability even when the semantic content is consistent, so exact match gives you a catastrophically pessimistic view. A model that always produces the correct answer but sometimes writes "A" and sometimes writes "Option A" looks completely unreliable by exact match.
Semantic equivalence testing is the practical alternative. Embed each response, compute pairwise cosine similarity, and report the standard deviation of those distances as a semantic stability score. Responses with cosine similarity above roughly 0.95 can be treated as semantically equivalent for most tasks. This translates to a concrete CI check: run N identical requests, embed all responses, and flag the output as unstable if pairwise similarity drops below your threshold.
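A minimal sketch of such a check, with a bag-of-words stand-in where a real embedding model would go (the function names here are mine, not any library's):

```python
import math
from collections import Counter

def embed(text):
    # Stand-in embedding: bag-of-words term counts. In production, call an
    # actual embedding model here (sentence-transformer, embeddings API, etc.).
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def stability_check(responses, threshold=0.95):
    # Run N identical requests upstream, then compare every pair of responses;
    # flag the prompt as unstable if any pair falls below the threshold.
    embs = [embed(r) for r in responses]
    sims = [cosine(embs[i], embs[j])
            for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return min(sims) >= threshold, sims

stable, sims = stability_check(["the borough is queens new york"] * 3)
# identical responses: every pairwise similarity is 1.0, so stable is True
```

The threshold is the knob that encodes your application's tolerance for variance.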
For structured outputs, you can go one level coarser: check that the parsed answer (the field values in a JSON object, or the letter choice in multiple-choice) matches across runs, ignoring surface formatting. This parsed agreement rate is usually much higher than raw string match and gives you a realistic read on whether variance is actually affecting decisions.
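A sketch of the parsed agreement rate for JSON outputs (the helper name and example payloads are illustrative):

```python
import json

def parsed_agreement(raw_responses, field):
    # Parse each run's output and compare the decision-relevant field only,
    # ignoring whitespace, key order, and extra commentary fields.
    values = []
    for raw in raw_responses:
        try:
            values.append(json.loads(raw).get(field))
        except json.JSONDecodeError:
            values.append(None)  # an unparseable run counts against agreement
    majority = max(set(values), key=values.count)
    return values.count(majority) / len(values)

# Three runs with different surface formatting but the same parsed answer:
runs = ['{"borough": "Queens"}',
        '{ "borough" : "Queens" }',
        '{"borough": "Queens", "confidence": "high"}']
rate = parsed_agreement(runs, "borough")  # 1.0, though raw string match is 0/3
```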
For capacity planning and regression detection, the most useful aggregate is: run 50 identical requests, count unique completions. A well-structured output format with a narrow output space should produce 1-3 unique completions. If you're seeing 10+, you have a variance problem that structured output or caching can contain.
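That probe is a few lines of code. Sketched here with a canned stand-in for the live endpoint (`call_model` is whatever client function you actually use):

```python
import itertools
from collections import Counter

def variance_probe(call_model, prompt, n=50):
    # Fire n identical requests and count how many distinct completions return.
    completions = [call_model(prompt) for _ in range(n)]
    counts = Counter(completions)
    return len(counts), counts

# Toy stand-in that cycles through canned outputs so the sketch runs offline;
# in practice this would be your real model client.
canned = itertools.cycle(["Queens, New York", "Queens, New York",
                          "New York City"])
fake_model = lambda prompt: next(canned)

unique, counts = variance_probe(fake_model, "Where is Shea Stadium?", n=9)
# unique == 2: within the 1-3 band, so no alarm under the rule of thumb above
```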
Three Patterns That Help
None of these patterns eliminate logit-level non-determinism — that requires infrastructure-level changes (batch-invariant kernels, which currently cost 30-60% throughput). Instead, they bound the impact of variance on system behavior.
Strict output schemas. Constrained decoding — JSON mode, function calling with strict schemas, tool use with typed parameters — works by masking invalid tokens at each generation step. Small logit variations can no longer push the model into structurally different output shapes. The model might produce "New York City" instead of "Queens, New York" in free text, but if your schema enforces {"borough": string}, both answers converge on the same field structure. Schema enforcement doesn't eliminate variance in the values; it eliminates variance in the shape. Combined with an enum constraint for fields with a bounded value set, it can eliminate variance in both.
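A sketch of what schema-plus-enum enforcement buys you. The schema is plain JSON Schema; the `conforms` helper below is illustrative only, since in practice the provider's constrained decoder enforces these rules token-by-token during generation:

```python
schema = {
    "type": "object",
    "properties": {
        # the enum bounds the value set, removing variance in the value itself
        "borough": {"type": "string",
                    "enum": ["Queens", "Brooklyn", "Manhattan",
                             "The Bronx", "Staten Island"]},
    },
    "required": ["borough"],
    "additionalProperties": False,
}

def conforms(obj):
    # Minimal manual check of the constraints above.
    return (isinstance(obj, dict)
            and set(obj) == {"borough"}
            and obj["borough"] in schema["properties"]["borough"]["enum"])

ok = conforms({"borough": "Queens"})               # shape and value both fixed
drift = conforms({"borough": "Queens, New York"})  # free-text variant blocked
```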
Semantic equivalence layers in testing and monitoring. Replace exact string matching in eval pipelines, regression tests, and output monitors with embedding-based similarity checks. This has two effects. First, it prevents false positive alerts when outputs vary superficially but not semantically. Second, it forces you to define what "same" means for your application — which is a useful forcing function for understanding your own tolerance for variance. Teams that go through this exercise often discover that many surface variations they were treating as failures are actually acceptable, and a smaller number of semantic variations they were ignoring are the real problem.
Idempotency caching. The most robust application-level pattern: cache model outputs keyed on a deterministic hash of (model_id, prompt_hash, parameters). For repeated requests with identical inputs, serve the cached response. This is trivially implementable with Redis and eliminates model-level variance for duplicate requests entirely — which covers a surprisingly large fraction of production traffic (retries, polling patterns, re-renders). For agent tool calls specifically, idempotency keys prevent duplicate side effects when retries occur. The limitation is that this doesn't help you when you genuinely need a fresh response to new input, but it does eliminate the class of failures where non-determinism causes your system to make two different decisions in response to the same event.
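A sketch of the pattern with an in-memory dict standing in for Redis (class and function names are mine; a production version would use Redis get/set with a TTL):

```python
import hashlib
import json

def cache_key(model_id, prompt, params):
    # Deterministic key over everything that defines the request; sort_keys
    # ensures dict ordering can't produce two keys for the same request.
    payload = json.dumps({"model": model_id, "prompt": prompt,
                          "params": params}, sort_keys=True)
    return "llm:" + hashlib.sha256(payload.encode()).hexdigest()

class IdempotentClient:
    def __init__(self, call_model):
        self.call_model = call_model
        self.store = {}  # stand-in for Redis; same get/set shape applies

    def complete(self, model_id, prompt, **params):
        key = cache_key(model_id, prompt, params)
        if key not in self.store:  # with Redis: set-if-absent, plus a TTL
            self.store[key] = self.call_model(model_id, prompt, **params)
        return self.store[key]

# A counting fake model shows the cache absorbing the duplicate request.
calls = {"n": 0}
def fake_model(model_id, prompt, **params):
    calls["n"] += 1
    return f"completion #{calls['n']}"

client = IdempotentClient(fake_model)
first = client.complete("m1", "Where is Shea Stadium?", temperature=0)
second = client.complete("m1", "Where is Shea Stadium?", temperature=0)
# first == second, and the underlying model was invoked exactly once
```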
The Deeper Design Principle
The mental model shift that underlies all three patterns: treat your LLM like a network call, not a function call. A function call with the same inputs always returns the same output. A network call may return different results depending on conditions you don't control. You wouldn't build a payment system that breaks when two identical HTTP requests come back with slightly different responses. You would build it to tolerate variation and make stable decisions in spite of it.
LLM outputs are the same. Forcing them into typed schemas, testing for semantic equivalence rather than string equality, and caching where possible are the same discipline as building idempotent HTTP clients. The underlying infrastructure is probabilistic. Your application logic does not have to be.
Infrastructure teams shipping batch-invariant kernels and deterministic attention implementations will eventually close much of the hardware-level gap — the throughput cost is dropping and the tooling is maturing. But even with deterministic kernels, cross-provider routing, hardware fleet evolution, and model version updates will keep introducing variance at the system level. The architectural discipline of building for semantic stability rather than string reproducibility will remain relevant regardless of how the infrastructure evolves.
The non-determinism tax is real, but it's also bounded. Know where it comes from, measure it in terms that match your reliability requirements, and design your application layer to absorb it.
- https://arxiv.org/html/2408.04667v5
- https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
- https://www.lmsys.org/blog/2025-09-22-sglang-deterministic/
- https://arxiv.org/html/2511.17826
- https://developers.openai.com/cookbook/examples/reproducible_outputs_with_the_seed_parameter
- https://arxiv.org/html/2510.25506v3
- https://www.zansara.dev/posts/2026-03-24-temp-0-llm/
- https://martynassubonis.substack.com/p/zero-temperature-randomness-in-llms
- https://docs.vllm.ai/en/latest/features/batch_invariance/
- https://community.openai.com/t/the-seed-option-for-gpt-does-not-increase-the-determinism-level/512892
