The Deterministic Seed Your Provider Treated as a Hint, Not a Contract

June 2, 2026 · 10 min read

Software Engineer

The CI test was a single assertion: same model, same temperature, same prompt, same seed, same output string. It passed on every developer's laptop, passed on the first hundred CI runs, and then flaked once every fifty runs for three weeks before anyone admitted the pattern was real. The first hypothesis was the obvious one — a non-deterministic dependency somewhere in the test harness — and three days of investigation found nothing. The actual cause was sitting in a footnote on the provider's API reference: "seed provides best-effort determinism." The team had read the parameter name and assumed a contract. The provider had documented a hint.

This is a specific failure mode of hosted inference that catches teams who design test infrastructure around a single mental model: the model is a pure function of its inputs, and the seed is what makes the function reproducible. Both halves of that model are wrong in production, and the gap between the API surface and the underlying physics is wide enough that teams build entire eval and regression-test stacks on top of an assumption their provider explicitly disclaimed.

Sampling Determinism Is Not Execution Determinism

The seed parameter, in every public LLM API that exposes one, controls the random number generator that picks tokens from the model's output distribution. Set the seed to a fixed integer and the sampler will, given an identical probability distribution, always pick the same token. That is a meaningful guarantee, but it is a guarantee about one specific layer of the stack: the sampler.

The layer beneath the sampler is the forward pass that produced the distribution. That layer does no sampling at all: it is matrix multiplication, attention, normalization, and sometimes mixture-of-experts routing, executed on a GPU shared with other requests. The seed has no influence on any of it. The distribution your sampler draws from depends on the kernel implementation, the batch composition at the moment your request arrived, the shard your request was routed to, and, on mixture-of-experts models, routing decisions that depend on which other requests happened to be co-batched with yours.

In other words, the seed makes the sampler deterministic given the distribution. The distribution itself is a function of variables you do not control and that your provider does not expose. Two requests with the same seed, same prompt, same parameters, on the same model version, can produce different outputs because they arrived in different batches.
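A minimal sketch of that split, using a toy three-token vocabulary. The function names and the logit values are illustrative assumptions, not any provider's implementation, and the numerical drift between the two runs is hypothetical. It shows that a seeded sampler repeats itself on an identical distribution, while a tiny change in the logits is enough to flip even temperature-zero decoding.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, seed):
    """Sample one token index using a sampler seeded with `seed`."""
    rng = random.Random(seed)
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_token(logits):
    """Temperature-zero decoding: pick the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Same prompt, two runs. The second run's logits differ by a tiny amount,
# standing in for drift introduced by the forward pass (hypothetical values).
logits_run_a = [2.0, 2.0000001, 0.5]
logits_run_b = [2.0000002, 2.0000001, 0.5]

# Same seed and same distribution: the sampler repeats itself.
print(sample_token(logits_run_a, seed=42) == sample_token(logits_run_a, seed=42))  # True

# Different distribution: the seed plays no part, and the top token flips.
print(greedy_token(logits_run_a), greedy_token(logits_run_b))  # 1 0
```

The seed only appears in `sample_token`. Nothing in the code that produces the logits ever sees it, which is the point of the distinction.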

Batch Size Is the Hidden Input

The clearest published analysis of this comes from work that traced the root cause of inference-time non-determinism to batch-invariance failures in three operations: normalization, matrix multiplication, and attention. Standard GPU kernel implementations are not batch-invariant — the numerical result for token i depends on how many other tokens are being processed alongside it, because the reduction order in the underlying summations changes with batch shape. Floating-point arithmetic is not associative, so a different reduction order produces a numerically different result, and downstream that difference cascades through the rest of the layers.
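The non-associativity point can be seen without a GPU. The sketch below is an exaggerated illustration rather than a model of any real kernel: `chunked_sum` is a hypothetical helper that reduces a list in fixed-size chunks, and the chunk size stands in for the way batch shape changes reduction order. The values are chosen so the effect is visible at a glance; in a real forward pass the differences are far smaller.

```python
def chunked_sum(values, chunk_size):
    """Sum `values` by reducing fixed-size chunks first, then the partial sums.

    Explicit loops are used instead of the built-in sum(), which may apply
    compensated summation to floats on newer Python versions.
    """
    partials = []
    for start in range(0, len(values), chunk_size):
        acc = 0.0
        for v in values[start:start + chunk_size]:
            acc += v
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total


# The classic three-term example: grouping changes the result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# Same four numbers, two reduction orders, two different answers.
values = [1e16, 1.0, -1e16, 1.0]
print(chunked_sum(values, 4))  # 1.0
print(chunked_sum(values, 2))  # 0.0
```

The inputs are identical in both calls. Only the grouping changed, and that is the role batch shape plays in a kernel that is not batch-invariant.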

For a typical hosted endpoint, the batch your request lands in is determined by how busy the server is at the moment your packet hit the gateway, what other requests are in the queue, and how the scheduler chose to group them. From the user's point of view, batch size is effectively a random variable driven by other users' traffic. Your request is being co-batched with strangers, and the numerical result of the forward pass depends on who they are.

The fix is non-trivial. The published reference implementation builds batch-invariant versions of the three operations and demonstrates 1,000 identical inputs producing 1,000 bit-identical outputs. The cost is significant, roughly 60% slower than the standard implementation in the initial release, and the engineering work is substantial enough that, as far as public documentation shows, no major commercial provider has rolled it into their general-purpose endpoints. The default assumption in production inference is that throughput matters more than reproducibility, and that the cost of reproducibility is a trade most customers have not asked for.

Mixture-of-Experts Adds Another Layer

For mixture-of-experts models, the batch-dependence problem compounds. MoE routing decides, per token, which experts to send the token to, and the routing operates under capacity constraints: each expert can only process so many tokens per batch. When an expert is at capacity, tokens compete for its slots, and the decision about which tokens get their preferred expert depends on what other tokens are in the batch.

The practical consequence is that a token's path through the network is not a function of the token alone. It is a function of the token plus the population of tokens it was co-batched with. The same input prompt, sent twice with the same seed and parameters, can be routed through different experts on the two runs because the co-batched population was different. This reasoning led earlier analyses to name MoE routing as the dominant source of non-determinism in earlier large models, and it still applies to current-generation MoE deployments.

For dense models the variance from batch-size effects is small and shows up as occasional token-level divergence. For MoE models it can be a larger effect because expert routing is a discrete decision and a single different routing choice on a single token can change the entire downstream trajectory.
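A toy version of capacity-constrained routing makes the dependence concrete. This is a sketch under simplifying assumptions, not how any production router works: `route_batch` is a hypothetical function, tokens are processed in batch order, and each token takes its most-preferred expert that still has a free slot.

```python
def route_batch(preferences, capacity, num_experts):
    """Assign each token to its most-preferred expert that still has room.

    preferences: list of (token_name, ranked_expert_ids), in batch order.
    Returns {token_name: expert_id}; a token with no free expert maps to None.
    """
    load = [0] * num_experts
    assignment = {}
    for token, ranked in preferences:
        assignment[token] = None
        for expert in ranked:
            if load[expert] < capacity:
                load[expert] += 1
                assignment[token] = expert
                break
    return assignment


# The same token, with the same preferences, in two different batches.
mine = ("my_token", [0, 1])
quiet_batch = [mine]
busy_batch = [("other_a", [0, 1]), ("other_b", [0, 1]), mine]

print(route_batch(quiet_batch, capacity=2, num_experts=2)["my_token"])  # 0
print(route_batch(busy_batch, capacity=2, num_experts=2)["my_token"])   # 1
```

Nothing about `my_token` changed between the two calls. Its expert changed because two strangers arrived first and filled expert 0.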

What "system_fingerprint" Actually Tells You

OpenAI's response payload includes a system_fingerprint field whose documented purpose is to let you detect when the backend changed in a way that might affect reproducibility. The documentation is precise about the contract: if the fingerprint, seed, and request parameters all match between two runs, the outputs "will mostly be identical." The word mostly is doing real work in that sentence. The fingerprint does not promise that two requests with the same fingerprint will produce the same output. It promises that if the fingerprint differs, you should expect drift.

This is the right way to use the field, and most teams that build regression tests on hosted inference are not using it this way. The common pattern is to assert on the exact output string and treat fingerprint changes as out-of-band events to investigate later. The pattern that actually matches the API contract is to log the fingerprint on every response, treat any fingerprint change as a reason to expect drift, and design the assertion to tolerate the small remaining variance even when the fingerprint matches.
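One way to encode that contract in a test helper is sketched below. It assumes each response has already been logged as a dict with `system_fingerprint` and `text` keys. The function name, the outcome labels, the fingerprint values, and the 0.9 threshold are all illustrative assumptions, and character-level similarity from the standard library's `difflib` is only a crude stand-in for whatever tolerance fits the task.

```python
from difflib import SequenceMatcher


def classify_pair(baseline, candidate, min_similarity=0.9):
    """Compare two logged responses and say how much to trust the comparison."""
    # A fingerprint change means drift is expected, so flag it before comparing text.
    if baseline["system_fingerprint"] != candidate["system_fingerprint"]:
        return "backend_changed"
    if baseline["text"] == candidate["text"]:
        return "identical"
    # Same fingerprint, different bytes: tolerate small residual variance.
    similarity = SequenceMatcher(None, baseline["text"], candidate["text"]).ratio()
    return "within_tolerance" if similarity >= min_similarity else "diverged"


baseline = {"system_fingerprint": "fp_aaa", "text": "The capital of France is Paris."}
same_backend = {"system_fingerprint": "fp_aaa", "text": "The capital of France is Paris"}
new_backend = {"system_fingerprint": "fp_bbb", "text": "The capital of France is Paris."}

print(classify_pair(baseline, same_backend))  # within_tolerance
print(classify_pair(baseline, new_backend))   # backend_changed
```

Only `diverged` should fail a test outright. `backend_changed` is a prompt to rebuild the baseline before trusting any comparison.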

Anthropic's Claude API does not expose a seed parameter at all in its public surface, and the documentation does not claim run-to-run reproducibility. Teams building regression tests against Claude that pin temperature to zero and assert on byte-exact output strings are operating outside the documented contract. The tests will mostly pass and will occasionally flake, and the flakes are not bugs in the test infrastructure — they are the provider behaving exactly as their documentation describes.

The Test Suite That Measures the Load Balancer

The deeper problem with byte-exact assertions on hosted inference is that the test is measuring something the test author did not intend to measure. The test was supposed to measure the model's behavior. What it actually measures is the model's behavior plus the provider's routing topology, plus the current batch composition, plus the current kernel build, plus any backend configuration the provider has tuned since the test was written.

When the test passes, the assertion was robust to all of those variables. When the test flakes, you do not know which variable changed. The provider does not tell you when they cycle a shard, when they roll out a new kernel build, when they change the batching strategy, when they shift traffic between datacenters. You see a flake in CI and have no signal about what actually moved.

On a particularly bad day, this makes the test a worse signal than no test at all, because it pulls engineering time into chasing flakes that have nothing to do with anything the team can fix. The code path under assertion is owned by a load balancer that belongs to someone else.

Patterns That Survive Provider Drift

The substitutions worth making, in roughly increasing order of engineering cost:

Assert on semantic content, not byte sequences. For regression tests, the question is almost never "did the model produce the same string." It is "did the model produce a string that means the same thing." LLM-as-a-judge with a fixed grading prompt and a deterministic rubric is more robust to small output drift than a string equality check, and it is what teams actually need from a regression test in most cases.
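A sketch of what such an assertion might look like. The rubric wording, the function names, and the `judge` callable are all assumptions made for illustration: `judge` stands in for whatever model call a team actually uses, and here it is replaced by a stub so the example runs offline.

```python
from typing import Callable

# Fixed rubric: the grading prompt never changes between runs.
RUBRIC = (
    "You are grading a regression test.\n"
    "Reply with exactly PASS if the CANDIDATE states the same facts as the "
    "REFERENCE, otherwise reply with exactly FAIL.\n"
)


def build_grading_prompt(reference: str, candidate: str) -> str:
    """Assemble the fixed rubric plus the two strings being compared."""
    return f"{RUBRIC}\nREFERENCE:\n{reference}\n\nCANDIDATE:\n{candidate}\n"


def semantically_equal(reference: str, candidate: str, judge: Callable[[str], str]) -> bool:
    """Return True if the judge says the candidate means the same as the reference."""
    if reference == candidate:
        return True  # byte-identical output needs no judge call
    verdict = judge(build_grading_prompt(reference, candidate)).strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"unparseable judge verdict: {verdict!r}")
    return verdict == "PASS"


def stub_judge(prompt: str) -> str:
    """Offline stand-in for a real model call (hypothetical grading logic)."""
    candidate_part = prompt.split("CANDIDATE:")[-1]
    return "PASS" if "Paris" in candidate_part else "FAIL"


reference = "The capital of France is Paris."
print(semantically_equal(reference, "Paris is France's capital.", stub_judge))  # True
print(semantically_equal(reference, "The capital of France is Lyon.", stub_judge))  # False
```

Constraining the verdict to two tokens keeps the judge's own output drift from leaking into the result. A judge reply that is neither PASS nor FAIL raises instead of silently counting as a failure.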

Log the fingerprint alongside every eval run. Treat the fingerprint as a confounding variable in your data, not as metadata to glance at when things go wrong. When a regression score moves, the first question to answer is whether the fingerprint moved. If it did, the regression has a known structural cause and the comparison across the change is not apples-to-apples without re-running the baseline on the new backend. For providers without a fingerprint field, log the model version string and the timestamp and use those as proxies.
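As a rough sketch, the fingerprint can be stored as a first-class field on each eval record and consulted before anyone interprets a score change. The `EvalRun` record, the outcome labels, the model and fingerprint strings, and the 0.01 tolerance are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalRun:
    model: str                  # model version string reported by the provider
    fingerprint: Optional[str]  # None for providers without a fingerprint field
    timestamp: str              # ISO 8601; a proxy when fingerprint is None
    score: float


def explain_score_change(baseline: EvalRun, current: EvalRun, tolerance: float = 0.01) -> str:
    """Answer 'did the backend move?' before anyone debugs a score change."""
    if abs(current.score - baseline.score) <= tolerance:
        return "stable"
    if baseline.model != current.model:
        return "moved_with_backend_change"
    if baseline.fingerprint is None or current.fingerprint is None:
        return "moved_backend_unknown"  # fall back to model string and timestamps
    if baseline.fingerprint != current.fingerprint:
        return "moved_with_backend_change"
    return "moved_on_same_backend"


baseline = EvalRun("model-v1", "fp_aaa", "2026-05-01T00:00:00Z", 0.90)
after_rollout = EvalRun("model-v1", "fp_bbb", "2026-06-01T00:00:00Z", 0.82)
same_backend = EvalRun("model-v1", "fp_aaa", "2026-06-01T00:00:00Z", 0.82)

print(explain_score_change(baseline, after_rollout))  # moved_with_backend_change
print(explain_score_change(baseline, same_backend))   # moved_on_same_backend
```

Only `moved_on_same_backend` points at the team's own prompts or code. The other two outcomes call for a fresh baseline first.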

Pin to a dedicated deployment when reproducibility actually matters. Hosted inference on shared infrastructure trades reproducibility for throughput. If your test suite, eval pipeline, or production replay needs run-to-run reproducibility as a hard requirement, the answer is a dedicated endpoint — either a provider-managed dedicated instance with a guaranteed deployment, or self-hosted inference with batch-invariant kernels. Both options cost more than the default endpoint, and the cost is what reproducibility actually costs.

Make determinism a procurement question. The teams that get burned by this are the teams who picked a provider on price and latency, signed the contract, and then discovered in month four that the SLA has nothing in it about determinism. The cheap time to ask is during evaluation. The right question is not "do you support a seed parameter" — every provider can answer yes to that. The right questions are: under what conditions does the same seed produce the same output, what backend changes invalidate that guarantee, and what is the contractual notice period before those changes ship.

The Architectural Reframe

The framing that survives the longest is this: determinism in hosted inference is a property of the deployment, not of the model. The model is a mathematical object whose output distribution could, in principle, be computed and sampled deterministically. The deployment is a distributed system with shared hardware, batched execution, and load-dependent routing, and that system has its own determinism properties that are independent of the model and that the model's API mostly cannot expose.

The team that treated the seed as a contract built a test suite whose reliability lives in someone else's load balancer. The team that treats the seed as a hint and the deployment as the actual source of reproducibility builds eval and regression infrastructure that measures the thing they wanted to measure. Both teams will see the same flakes for a while. Only one of them will be able to act on the signal.

The cheap version of the lesson is to stop asserting on byte-exact strings against hosted inference. The full version is to redesign the eval contract around the deployment you actually have, and to make the deployment a parameter of the eval result rather than an invisible constant. The seed is a knob on the sampler. The thing that produces the distribution is the system, and the system is not yours.

About Tian Pan
