
The Bug You Can't Reproduce Because the Model Picked a Different Token

· 10 min read
Tian Pan
Software Engineer

A user files a bug. The summary your agent generated dropped a critical paragraph, or the JSON came back malformed, or the answer was confidently wrong. You open the ticket, copy the request, and replay it. It works. You replay it again. Still works. You mark the ticket "cannot reproduce" and move on.

The bug is still there. It is still happening to real users. You just closed it because your debugging toolchain assumes that a fixed input produces a fixed output — and the component you are debugging samples from a probability distribution.

This is not a rare edge case. It is a structural mismatch between how LLM systems behave and how the entire discipline of debugging was built. Every tool you reach for during an incident — replaying the request, bisecting a regression, writing a regression test — silently assumes a deterministic function. When one of your components is a sampler, those tools don't break loudly. They lie quietly. They tell you the bug is gone when it isn't.

"Replay the request" is a deterministic-world habit

The replay reflex is decades old and almost always correct. A web handler, a SQL query, a pure function: feed it the same input, get the same output. If it misbehaved once and behaves now, either the input was different or some external state changed. Replay is how you find out which.

LLM inference quietly violates the premise. Even when you think you have pinned everything, you haven't. Setting temperature=0 feels like it should give you greedy decoding and a single deterministic answer. It mostly doesn't. Temperature zero only removes randomness from the sampling step — picking a token from the distribution. It does nothing about the distribution itself shifting underneath you.

In 2025, Thinking Machines Lab ran the same prompt through a popular open model 1,000 times at temperature 0 and got 80 unique completions. The culprit was not the sampler. It was a lack of batch invariance: production inference servers pack many users' requests into a shared batch, and the batch composition changes the order in which GPU reduction kernels accumulate floating-point numbers. Floating-point addition is not associative, so (a + b) + c and a + (b + c) can differ in the last bits. Three operations (RMSNorm, matrix multiplication, and attention) were enough to make "identical input" produce divergent logits.
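The non-associativity is easy to see without a GPU. The snippet below is a minimal sketch in plain Python, not the kernels from that experiment: it groups the same numbers two ways and compares the results. The specific values are chosen only for illustration.

```python
# Same three numbers, two groupings. Standard IEEE 754 double precision.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)          # False
print(abs(left - right))      # a difference in the last bits

# The effect gets dramatic when magnitudes differ: the small term is
# rounded away if it is added to the large one first.
big_first = (1e16 + 1.0) - 1e16    # 1.0 is lost when added to 1e16
small_first = 1.0 + (1e16 - 1e16)  # 1.0 survives
print(big_first, small_first)      # 0.0 1.0
```

A reduction kernel that changes its accumulation order with batch size is doing the same regrouping, just across thousands of terms.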

Here is why that last bit matters so much. Suppose the top two candidate tokens have probabilities that differ by 0.0000001. A microscopic numerical wobble flips the argmax. A different token gets emitted. And because each token feeds back into the context for the next one, that single flip doesn't stay small — the generation veers onto an entirely different path. One token of divergence near the start can mean a completely different answer by the end.
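A toy illustration of the flip, assuming two near-tied candidates with made-up scores. `greedy_pick` is a hypothetical helper standing in for the argmax step of greedy decoding, not any library's API.

```python
def greedy_pick(scores: dict[str, float]) -> str:
    """Greedy decoding step: return the token with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical scores for two candidate tokens that are almost tied.
scores = {"yes": 2.3000001, "no": 2.3000000}

# The same scores after a tiny numerical wobble on one candidate.
wobbled = {"yes": scores["yes"], "no": scores["no"] + 2e-7}

print(greedy_pick(scores))   # yes
print(greedy_pick(wobbled))  # no
```

The sketch shows only the first flip. In a real generation the emitted token is appended to the context, so every later step is computed from a different prefix.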

So when you replay a "cannot reproduce" ticket, you are not re-running the user's failure. You are drawing a fresh sample. The batch your request lands in is different. The other users sharing your GPU are different. Mixture-of-experts routing, which assigns tokens to experts in fixed-size groups, can route your tokens differently depending on who else is in the group. Your replay succeeding tells you almost nothing — it is one new draw from a distribution that produced a bad answer at some unknown rate.
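To put a number on "almost nothing", here is a back-of-the-envelope sketch. It assumes each replay is an independent draw that fails with some fixed probability; the 4% figure is purely illustrative, and both function names are hypothetical.

```python
import math


def chance_all_replays_pass(failure_rate: float, replays: int) -> float:
    """Probability that every one of `replays` independent draws succeeds."""
    return (1.0 - failure_rate) ** replays


def replays_needed(failure_rate: float, confidence: float) -> int:
    """Smallest number of draws that shows at least one failure
    with probability >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


# Two clean replays of a prompt that fails 4% of the time:
print(chance_all_replays_pass(0.04, 2))

# Draws needed to see the failure at least once with 95% probability:
print(replays_needed(0.04, 0.95))
```

Under these assumptions, two clean replays are what you should expect to see more than nine times out of ten, bug or no bug.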

Re-runnable versus re-rollable

The fix starts with giving up a goal you can't have and adopting one you can.

You probably cannot make a hosted model API bitwise reproducible. You don't control the batch, the kernels, the routing, or the fleet's hardware mix. Strict determinism is realistically available only when you run open-weights models on your own hardware with a pinned seed, temperature 0, deterministic kernels, and single-request batching — and even then it takes deliberate engineering. For most teams shipping on a vendor API, "re-run the exact failure" is off the table.

What you can have is re-rollability. You may not be able to reproduce the exact bad output, but you can reproduce the exact distribution it was drawn from. If you capture the complete input to that distribution, you can draw from it again — many times — and observe how often it fails.

The distinction is the whole game. A re-runnable bug gives you one deterministic failure to fix. A re-rollable bug gives you a failure rate you can measure, attack, and verify. For a probabilistic system, the failure rate is the real bug. "This prompt produces malformed JSON 4% of the time" is a tractable engineering target. "It broke once for one user" is a ghost story.
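Measuring that rate can be sketched in a few lines. This is one possible harness under stated assumptions, not a prescribed method: `draw` is a hypothetical callable that re-issues the captured request and returns the model's text, `fake_model` is a stand-in so the example runs offline, and the Wilson score interval is my choice for putting error bars on the estimate.

```python
import json
import math
import random
from typing import Callable


def is_malformed_json(text: str) -> bool:
    """Failure check for this example: does the output fail to parse?"""
    try:
        json.loads(text)
        return False
    except json.JSONDecodeError:
        return True


def failure_rate(
    draw: Callable[[], str], is_bad: Callable[[str], bool], n: int
) -> tuple[float, float, float]:
    """Draw n samples; return (rate, low, high) with a 95% Wilson interval."""
    failures = sum(1 for _ in range(n) if is_bad(draw()))
    p = failures / n
    z = 1.96  # two-sided 95%
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


# Stand-in for the real model call: truncated JSON about 4% of the time.
rng = random.Random(7)


def fake_model() -> str:
    return '{"ok": true' if rng.random() < 0.04 else '{"ok": true}'


rate, low, high = failure_rate(fake_model, is_malformed_json, 2000)
print(f"failure rate {rate:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Run the same harness before and after a prompt change. If the two intervals overlap heavily, you have not yet shown that the fix did anything.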

Capture the whole input to the distribution

To make a failure re-rollable, you need every input that shapes the distribution — not just the user's text. Most logging setups capture a fraction of this and quietly drop the rest. The full set:

  • The exact rendered prompt. Not the template — the final string after every variable, system message, and few-shot example was substituted in. Template plus separately-logged variables is not enough; whitespace, ordering, and truncation all move the distribution, and you want the bytes that actually went over the wire.
  • The full retrieved context. For a RAG or agent system, the chunks, tool outputs, and prior turns that landed in the context window. Retrieval is itself nondeterministic — an index update or an ANN search can return different neighbors — so the context at failure time is gone unless you logged it.
  • All sampling parameters. Temperature, top-p, top-k, frequency and presence penalties, max tokens, stop sequences, and any structured-output or grammar constraints.
  • Model and tokenizer version. "GPT-4o" or "Claude" is not a version. Providers ship silent updates; a pinned snapshot string is the only thing that means anything months later. The tokenizer matters too — re-tokenization differences shift the distribution before the model even runs.
  • The seed, if the API exposes one. It will not give you determinism by itself, but combined with the rest it narrows the space of explanations.
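
As a sketch, the list above maps onto one log record per model call. The class name, field names, and example values below are all hypothetical; adapt them to whatever your logging pipeline already uses.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class InferenceRecord:
    """Everything that shaped the distribution for one model call."""

    rendered_prompt: str           # final string sent over the wire
    retrieved_context: list[str]   # chunks, tool outputs, prior turns
    sampling_params: dict          # temperature, top_p, max_tokens, stops, ...
    model_version: str             # pinned snapshot string, not a family name
    tokenizer_version: str
    seed: Optional[int] = None     # only if the API exposes one

    def to_json(self) -> str:
        # sort_keys gives a stable serialization, so equal records hash equally.
        return json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)

    def fingerprint(self) -> str:
        """Short ID for grouping calls that drew from the same distribution."""
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


record = InferenceRecord(
    rendered_prompt="System: be concise.\nUser: summarize the ticket.",
    retrieved_context=["chunk from doc 12", "tool output: status=open"],
    sampling_params={"temperature": 0.0, "top_p": 1.0, "max_tokens": 512},
    model_version="example-model-2025-01-01",   # hypothetical snapshot name
    tokenizer_version="example-tokenizer-v3",   # hypothetical
    seed=1234,
)
print(record.fingerprint()[:12])
```

Hashing the serialized record gives you a key for counting failures per distribution: two calls with the same fingerprint were drawn from the same inputs, and a one-character difference in the prompt produces a different key.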