The Deterministic Seed Your Eval Suite Set That Your Provider Quietly Ignored
You set seed=42. You set temperature=0. You logged the run, posted the dashboard, signed off on the model swap. The next morning the rerun returned a different number on the same prompts, and the explanation you reached for — "must be sampling noise" — was wrong twice over: there was no sampling, and the noise was structural. The seed left your client, the gateway threw it away, the kernel batched your request next to seventeen unrelated ones, and the floating-point reduction order changed under you. Your "reproducible" benchmark was always within one batch of being a different benchmark.
This failure mode is quiet because every layer in the stack is technically correct. The SDK accepts the seed. The provider documents the seed. The model returns a system_fingerprint. The eval harness logs all three. Nothing 5xx's, nothing warns, nothing protests. The number on the dashboard just shifts, and the team rationalizes the shift as the kind of jitter that always existed — because they have no instrument that can tell them whether they're looking at stochastic decoding or at a backend rotation that invalidated three weeks of comparisons.
The seed has three places it can disappear, and none of them are loud
When you call a chat completion with seed=42, the integer travels through your client SDK, through whatever gateway or proxy sits between you and the model host, through the inference server, and finally into the sampler. Each hop has its own reason to drop it.
The SDK is usually fine. Where a provider's API defines a seed parameter, its modern SDK forwards the value verbatim. But the SDK isn't always the last code that touches your request. If you route through a vendor-agnostic gateway (LiteLLM, Portkey, an internal abstraction layer your platform team wrote), the seed lives or dies based on whether that gateway's adapter for your chosen provider knows about it. Adapter coverage is uneven. The OpenAI adapter forwards seed; the Bedrock adapter may not surface it; a custom proxy written six months ago against an older schema may have a hardcoded allowlist that excludes it. The request returns 200 either way. You have no negative signal.
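The allowlist case is easy to picture in code. The sketch below is hypothetical: `ALLOWED_FIELDS`, `forward_request`, and `dropped_fields` are illustrative names, not the API of LiteLLM, Portkey, or any real gateway. It only shows how a field can vanish without any error, and how a diff of sent versus forwarded fields would expose it.

```python
# Hypothetical adapter: forwards only the fields it was written to know about.
# The allowlist predates the seed parameter, so seed is silently removed.
ALLOWED_FIELDS = {"model", "messages", "temperature", "max_tokens"}


def forward_request(payload: dict) -> dict:
    """Return the payload the upstream provider would actually receive."""
    return {k: v for k, v in payload.items() if k in ALLOWED_FIELDS}


def dropped_fields(sent: dict, forwarded: dict) -> set:
    """Fields the caller sent that never reached the provider."""
    return set(sent) - set(forwarded)


request = {"model": "example-model", "messages": [], "temperature": 0, "seed": 42}
forwarded = forward_request(request)
print(dropped_fields(request, forwarded))  # the seed is the only casualty
```

If you own the gateway, a check like this per adapter is one way to manufacture the negative signal that the 200 response never gives you.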
The provider's own gateway is the second silent dropper. As of mid-2026, Anthropic's Opus 4.7 migration guide changed temperature and top_p semantics — non-default values now return a 400 — but the seed situation across providers remains a mix of "supported but best-effort" (OpenAI on certain models), "not supported, silently accepted" (some Anthropic endpoints historically), and "supported but undocumented" (various hosted open-weights). The frustrating cases are the silent-accept ones: the gateway strips the field on the way in, the model gets no seed, the response comes back normally, and your eval harness logs a seed value that influenced nothing.
The third place is the inference server itself, where the seed does arrive but where it controls only a sliver of the variance. vLLM users have spent the better part of two years filing issues that read "I set temperature=0, top_p=1, seed=42 and outputs still differ" — and the canonical answer is that the seed only governs the sampler's pseudorandom draws. When sampling is greedy, the seed is doing essentially nothing. The variance you see at temperature zero comes from somewhere else entirely.
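A toy sampler makes the point concrete. This is a minimal sketch, not vLLM's sampler: `pick_token` and the logits are invented for illustration. At temperature zero the function takes the argmax and never touches the random generator, so the seed cannot influence the result.

```python
import math
import random


def pick_token(logits: list, temperature: float, seed: int) -> int:
    """Return the index of the next token for a toy sampler."""
    if temperature == 0:
        # Greedy decoding: take the argmax. The seed is never consulted.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Stochastic decoding: the seed fixes the pseudorandom draw.
    rng = random.Random(seed)
    top = max(logits)
    weights = [math.exp((x - top) / temperature) for x in logits]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 2.5, 0.3, 2.4]
greedy = {pick_token(logits, 0, seed) for seed in range(100)}
sampled = {pick_token(logits, 1.0, seed) for seed in range(100)}
print("distinct greedy picks:", len(greedy))
print("distinct sampled picks:", len(sampled))
```

Every seed yields the same greedy pick, while the sampled picks vary with the seed. Whatever changes the output at temperature zero has to be changing the logits themselves.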
Batch invariance is the variance you couldn't see in your logs
The Thinking Machines team published a result in late 2025 that reframed how serious people talk about LLM determinism. Running Qwen3-235B at temperature 0 on a hosted vLLM, they sent 1,000 identical queries and got 80 unique completions. Not 80 with small variations — 80 distinct strings. The seed was set. Temperature was zero. Greedy decoding was on. The model still produced eighty answers.
The cause is batch-size dependency. Modern inference servers pack concurrent requests into the same forward pass to amortize GPU cost. The matrix multiplications, RMSNorms, and attention reductions that run inside that forward pass aren't batch-invariant: their numerical results change when the batch shape changes. Floating-point addition isn't associative, so the order in which a reduction accumulates partial sums affects the low-order bits of the result, and the order is determined by how the kernel tiles the batch. In their measurements, the output of a single matrix multiplication on the same inputs differed by as much as 1669.25 between batch size 1 and batch size 2048. Differences of this kind, propagated through dozens of layers, are enough to flip the argmax on at least one token, and the divergence cascades from there.
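The non-associativity is visible in plain Python floats. The sketch below is an analogy, not a GPU kernel: `tiled_sum` is a made-up helper that sums the same values in tiles of different sizes, standing in for a kernel whose reduction strategy depends on batch shape. The input values are deliberately extreme so the effect is large; in a real kernel the disagreement usually starts in the low-order bits.

```python
def tiled_sum(values, tile):
    """Sum values tile by tile, then sum the per-tile partials."""
    partials = []
    for start in range(0, len(values), tile):
        acc = 0.0
        for v in values[start:start + tile]:
            acc += v
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total


# Same numbers, same mathematical sum (2.0), different accumulation order.
values = [1e16, 1.0, -1e16, 1.0]
print(tiled_sum(values, tile=4))  # one tile: strict left-to-right accumulation
print(tiled_sum(values, tile=2))  # two tiles: partial sums combined afterwards

# The textbook three-term case behaves the same way.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))
```

Neither result is a bug in the arithmetic. Both orderings are valid IEEE 754 evaluations of the same expression, which is why a kernel that changes its tiling with batch size changes its answer too.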
This means the "seed" that actually controls your output isn't the integer you sent. It's the implicit batch composition at the moment your request hit the server — who else was calling the API, how the scheduler grouped you, whether the kernel chose a split-K strategy for your token count. None of that is in your control. None of it is in your logs.
The same paper showed that with batch-invariant kernels — replacing the matmul, RMSNorm, and attention paths with versions whose tile and reduction strategy is fixed regardless of batch shape — all 1,000 completions became identical. The cost was real: 42 seconds instead of 26 seconds for the run, about 60% slower. That's the price of true reproducibility, and the fact that no hosted provider charges it tells you something about what guarantee you're actually buying.
system_fingerprint is the only signal you have, and you probably aren't watching it
OpenAI ships a partial workaround for the bigger structural problem. Every chat completion response includes a system_fingerprint field, an opaque identifier for the combination of weights, infrastructure, and configuration that produced the output. If your fingerprint stays stable across runs, you have a chance of approximate determinism. If it rotates, you're comparing apples and oranges — even with the same seed, same prompt, same temperature, the backend has changed enough that OpenAI itself stops claiming reproducibility.
Backend rotations happen on the order of a few times per year for each model family, but they're not announced as such. They show up as a new fingerprint hash on some percentage of your traffic. If your eval harness records the fingerprint but doesn't gate on it, you'll see drift on Tuesday morning that looks like model regression and is actually a deployment rollover. If your eval harness doesn't record the fingerprint at all, you'll see drift that looks like nothing.
The defensive pattern most teams skip: bucket every eval run by system_fingerprint, report the per-bucket scores, and refuse to compare across buckets. If two runs were served by two different fingerprints, they're two different models for benchmarking purposes, even if the model name in the request is identical. Treat the fingerprint as part of the model identity. The provider doesn't do this for you because the provider's product story is "you called the same model both times." Your eval story has to be more specific.
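The bucketing rule can be sketched in a few lines. This assumes each eval result has already been reduced to a `(fingerprint, score)` pair; the function names and the `fp_a` / `fp_b` values are placeholders, not real fingerprints.

```python
from collections import defaultdict
from statistics import mean


def scores_by_fingerprint(results):
    """Group (fingerprint, score) pairs and return the mean score per fingerprint."""
    buckets = defaultdict(list)
    for fingerprint, score in results:
        buckets[fingerprint].append(score)
    return {fp: mean(scores) for fp, scores in buckets.items()}


def comparable(run_a, run_b):
    """Two runs are comparable only if both were served by the same single fingerprint."""
    fps_a = {fp for fp, _ in run_a}
    fps_b = {fp for fp, _ in run_b}
    return len(fps_a) == 1 and fps_a == fps_b


monday = [("fp_a", 1.0), ("fp_a", 0.0)]
tuesday = [("fp_a", 1.0), ("fp_b", 1.0)]
print(scores_by_fingerprint(monday + tuesday))
print(comparable(monday, tuesday))  # Tuesday straddles a rotation
```

Refusing the comparison outright is a stricter policy than some teams will want. A softer variant would compare only the subsets that share a fingerprint.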
Azure's documentation for the same parameter is more honest about the limit: it tells you to expect drift, monitor the fingerprint, and treat anything that doesn't match the previous value as a fresh evaluation. The non-Azure docs say roughly the same thing but bury it under "best-effort." Either way the operational behavior is the same — your benchmark is a number plus a fingerprint plus a date, and dropping any one of those three reduces the number to anecdote.
The provider-quirk variance that fakes itself as noise
Beyond the seed-drop and the batch-dependency, there is a third class of nondeterminism that masquerades as the first two. Across providers, the same logical request triggers different code paths depending on subtle inputs. GPT-4o has been reported to behave close to deterministically for fixed-text prompts when you set the seed; introduce an image attachment and the determinism collapses, even with seed and temperature locked. Mixture-of-Experts models route tokens differently under different batch compositions, and that routing isn't seeded; it's a function of the entire batch's tokens, so your single-prompt run depends on what else is in the batch. Speculative decoding adds another layer: how many draft tokens the verifier accepts, and therefore which kernels and batch shapes your request runs through, varies with traffic load.
This is the part that breaks the comparison instinct. If your eval has both vision and text prompts, the vision prompts will drift more even when the seed is identical. If your eval hits an MoE model during off-peak hours, batch shapes will be skinnier and the per-prompt outputs will differ from the on-peak run that gave you the original baseline. If you're evaluating a model that the provider has quietly enabled speculative decoding on, the same prompt at the same temperature will diverge run-to-run because the draft model occasionally guesses wrong. None of these are bugs. All of them invalidate the assumption that "same input, same seed" produces "same output."
A useful framing: there are three distinct kinds of variance you are absorbing, and they require different mitigations.
- Sampler variance. Controlled by seed and temperature. Easy to lock down.
- Batch variance. Controlled by batch-invariant kernels, which providers don't offer. Mitigated only by accepting it as the noise floor and running enough samples to estimate it.
- Backend variance. Controlled by system_fingerprint. Mitigated by bucketing and by refusing cross-fingerprint comparisons.
Most eval harnesses conflate all three into "did the score change." That conflation is where benchmark numbers go to die.
What an eval harness that respects this would look like
The minimum viable upgrade is not "use a better seed." It's making the harness honest about which kind of determinism it has.
Log five things per eval call: the seed you sent, the seed in the response payload (if echoed), the system_fingerprint, the wall-clock time, and the model name with its version suffix. When you write the report, group by fingerprint and emit one score per group. If a single run spans two fingerprints, surface that as a warning, not a footnote. Anything else hides the variance source from the reader.
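As a rough sketch, the five fields fit in one small record. The field names, the `fingerprint_warning` helper, and the example values are assumptions made for illustration, and whether a response echoes the seed at all depends on the provider.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EvalCallRecord:
    seed_sent: Optional[int]
    seed_echoed: Optional[int]          # None when the response does not echo a seed
    system_fingerprint: Optional[str]   # None when the provider returns no fingerprint
    wall_clock: float                   # Unix timestamp taken at request time
    model: str                          # model name including its version suffix


def fingerprint_warning(records):
    """Return a warning string if one run was served by more than one fingerprint."""
    fps = sorted({r.system_fingerprint or "<missing>" for r in records})
    if len(fps) > 1:
        return "WARNING: run spans %d fingerprints: %s" % (len(fps), ", ".join(fps))
    return None


run = [
    EvalCallRecord(42, None, "fp_example_1", time.time(), "example-model-v1"),
    EvalCallRecord(42, None, "fp_example_2", time.time(), "example-model-v1"),
]
print(json.dumps(asdict(run[0])))
print(fingerprint_warning(run))
```

Storing `seed_sent` and `seed_echoed` separately is deliberate: a mismatch, or a permanently empty echo, is itself evidence about where the seed went.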
Run each eval prompt three to five times against the same fingerprint and report variance, not just the mean. The Thinking Machines numbers — 80 unique completions out of 1,000 at temperature zero — imply a per-call disagreement rate that swamps the gap between most candidate models you'd compare. If you don't measure the noise floor, you'll attribute it to whichever change happened most recently.
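A minimal way to estimate that noise floor, assuming you already have repeated scores and raw completions per prompt. `noise_floor` and `disagreement_rate` are hypothetical helper names, and the numbers in the example are invented.

```python
from collections import Counter
from statistics import mean, stdev


def noise_floor(scores_per_prompt):
    """Map prompt id -> mean, sample standard deviation, and repeat count."""
    report = {}
    for prompt_id, scores in scores_per_prompt.items():
        report[prompt_id] = {
            "mean": mean(scores),
            # stdev needs at least two points; a single run has no measured spread.
            "stdev": stdev(scores) if len(scores) > 1 else 0.0,
            "n": len(scores),
        }
    return report


def disagreement_rate(completions):
    """Fraction of completions that differ from the most common completion."""
    most_common_count = Counter(completions).most_common(1)[0][1]
    return 1 - most_common_count / len(completions)


repeats = {"p1": [1.0, 1.0, 1.0], "p2": [0.0, 1.0]}
print(noise_floor(repeats))
print(disagreement_rate(["a", "a", "b", "a"]))
```

With only three to five repeats the standard deviation is itself a noisy estimate, so treat it as a lower bound on how much two runs can differ for no reason.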
Maintain a fingerprint history. When a new fingerprint appears, treat the previous baseline as expired and rebaseline against the new one. Some teams keep a "reference set" of 200 prompts they re-score after every fingerprint rotation, so the comparison they care about — model A versus model B — is always within a single backend snapshot rather than across snapshots.
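One possible shape for that history, as a sketch. `FingerprintHistory` and its `rescore` callback are invented for illustration; `rescore` stands in for whatever re-runs your reference set against the current backend.

```python
class FingerprintHistory:
    """Keeps one baseline score per fingerprint, in order of first appearance."""

    def __init__(self):
        self.baselines = {}  # fingerprint -> baseline score on the reference set
        self.order = []      # fingerprints in the order they were first seen

    def observe(self, fingerprint, rescore):
        """Return the baseline for this fingerprint, rescoring only if it is new."""
        if fingerprint not in self.baselines:
            # New backend snapshot: the previous baseline no longer applies.
            self.baselines[fingerprint] = rescore()
            self.order.append(fingerprint)
        return self.baselines[fingerprint]


calls = []


def fake_rescore():
    """Stand-in for re-running the reference prompts; records each invocation."""
    calls.append(1)
    return 0.80 + 0.01 * len(calls)


history = FingerprintHistory()
history.observe("fp_a", fake_rescore)
history.observe("fp_a", fake_rescore)  # known fingerprint: no rescore
history.observe("fp_b", fake_rescore)  # rotation: rebaseline
print(history.order, len(calls))
```

Keeping the old baselines rather than overwriting them lets you see afterwards how much each rotation moved the reference set.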
And probe the seed path. Once a week, send a batch of identical prompts with the same seed and different seeds and verify that the same-seed batch produces tighter distributions than the different-seed batch. If the two distributions look identical, your seed is being dropped somewhere upstream, and you should learn that from the probe rather than from a regression you can't reproduce.
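The probe can be sketched against any callable that takes a prompt and a seed. Everything here is an assumption made for illustration: `probe_seed_path`, `seed_looks_dropped`, the 0.8 tolerance, and the two fake backends. The probe is only informative with sampling enabled (temperature above zero), since greedy decoding ignores the seed anyway.

```python
import itertools


def probe_seed_path(complete, prompt, n=20, fixed_seed=42):
    """Call the backend n times with one seed and n times with n different seeds."""
    same = [complete(prompt, fixed_seed) for _ in range(n)]
    varied = [complete(prompt, fixed_seed + 1 + i) for i in range(n)]
    return {
        "same_seed_unique": len(set(same)),
        "varied_seed_unique": len(set(varied)),
    }


def seed_looks_dropped(result, tolerance=0.8):
    """Crude heuristic: same-seed outputs are about as varied as different-seed outputs."""
    varied = result["varied_seed_unique"]
    return varied > 1 and result["same_seed_unique"] >= tolerance * varied


# Fake backend that honors the seed: the output is a pure function of it.
def honoring(prompt, seed):
    return "completion-%d" % seed


# Fake backend that ignores the seed: every call returns something new.
_counter = itertools.count()


def dropping(prompt, seed):
    return "completion-%d" % next(_counter)


print(seed_looks_dropped(probe_seed_path(honoring, "hello")))
print(seed_looks_dropped(probe_seed_path(dropping, "hello")))
```

Counting unique strings is the bluntest possible comparison of two distributions. A real probe would likely compare scores or token-level agreement, and would need enough calls to rise above the batch variance described earlier.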
The number you report is a snapshot, not a measurement
The honest framing of an LLM eval number in 2026 is this: it is the score your prompts achieved against a specific model name, a specific system_fingerprint, on a specific date, under whatever batch conditions the provider was running at that time. The seed parameter narrows the variance contributed by sampling. It does nothing about the variance contributed by batching, by fingerprint rotation, by MoE routing, or by the provider's gateway quietly stripping fields that should have made it through.
The eval suite that sets a seed and calls it reproducible is not lying — it just thinks it's measuring something more stable than it is. The provider isn't lying either; the docs say "best effort." The gap is in the part of the stack that nobody owns: the contract between "I told you the seed" and "you used the seed," with three layers of silent failure in between.
The teams whose benchmark numbers hold up across weeks aren't the ones who picked the right seed. They're the ones who decided early that their unit of comparison was (model, fingerprint, batch-class) rather than (model). Everyone else is publishing snapshots and treating them as measurements, and the dashboard ticks up and down to the rhythm of someone else's deployment schedule.
- https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
- https://cookbook.openai.com/examples/reproducible_outputs_with_the_seed_parameter
- https://github.com/openai/openai-cookbook/issues/861
- https://www.keywordsai.co/blog/llm_consistency_2025
- https://arxiv.org/html/2506.09501v2
- https://arxiv.org/pdf/2511.07585
- https://github.com/vllm-project/vllm/discussions/17166
- https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/reproducible-output
- https://community.openai.com/t/question-about-the-use-of-seed-parameter-and-deterministic-outputs/773638
- https://www.vellum.ai/llm-parameters/seed
- https://lakefs.io/blog/toggle-openai-model-determinism/
