The Retry That Changed the Answer: Idempotency Keys for Nondeterministic LLM Calls
Every distributed system you have ever built leans on one quiet assumption: a retry after a timeout is safe. The operation is idempotent, so if the client gives up waiting and re-sends, the worst case is duplicate work that converges to the same state. Two PUTs land the same row. Two DELETEs leave the same absence. The retry is a no-op dressed as a second attempt.
LLM calls break this assumption, and they break it silently. A retry does not re-fetch the same answer; it samples a new one. When the provider finishes the generation but the response is lost in transit, the client times out at the network layer and its retry produces a second, different answer. Now two distinct outputs exist for one logical request, and nothing in your stack knows which one is canonical.
This is not a rare edge case. Practitioners running models behind timeouts report that 5–10% of requests hit the full timeout-plus-retry cycle even when the underlying call eventually succeeds. Every one of those is a coin flip your system was never designed to adjudicate.
Why "retry on timeout" stopped being safe
The retry primitive comes from a deterministic world. GET /users/42 returns the same user whether you call it once or five times. PUT with a full body is idempotent by construction. The whole reliability stack — load balancers, retry middleware, circuit breakers — is built on the premise that re-issuing a request either does nothing new or converges.
A sampled generation has none of that structure. Temperature above zero means the output is drawn from a distribution, not looked up. Even at temperature zero, you do not get a guarantee: floating-point non-associativity across GPU batches, mixture-of-experts routing that depends on which other requests share your batch, and silent provider-side model updates all mean "same input" does not imply "same output." The request is not a read. It is closer to a roll of dice that also costs money.
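The floating-point point is easy to see in miniature. The sketch below is a toy, not how any real inference kernel accumulates logits: it assumes a two-token vocabulary and a hypothetical `greedy_pick` helper, and it only shows that summing the same three numbers in a different order can flip a greedy (temperature-zero) choice when two candidates are nearly tied.

```python
def greedy_pick(logits):
    """Return the token with the highest logit (the first one wins on a tie)."""
    return max(logits, key=logits.get)

# The same three partial sums for token "A", accumulated in two different orders.
# In a real system the order could depend on batching; here it is just regrouped.
logit_a_order_1 = (0.1 + 0.2) + 0.3
logit_a_order_2 = 0.1 + (0.2 + 0.3)

# Token "B" has a fixed logit that sits right at the tie point.
pick_1 = greedy_pick({"B": 0.6, "A": logit_a_order_1})
pick_2 = greedy_pick({"B": 0.6, "A": logit_a_order_2})

print(logit_a_order_1 == logit_a_order_2)  # False
print(pick_1, pick_2)                      # A B
```

One flipped token early in a generation changes every token after it, which is why "temperature zero" is a tendency rather than a guarantee.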
So the timeout creates a three-way ambiguity the client cannot resolve from its side:
- The request never reached the model. Retrying is safe and necessary.
- The request reached the model and generation is still running. Retrying starts a second generation.
- The request completed but the response was lost on the way back. Retrying produces a different answer to a question already answered.
A bare retry-on-timeout treats all three identically. In the deterministic world that was fine, because cases two and three were harmless. In the nondeterministic world, cases two and three are where the money leaks and the bugs hatch.
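One way to tell the three cases apart is to attach an idempotency key to each logical request and keep a ledger on the far side of the unreliable hop. What follows is a minimal in-memory sketch under stated assumptions: `GenerationLedger`, `complete`, and `fake_generate` are hypothetical names, the ledger is assumed to live in a gateway that survives the client's timeout, and nothing here relies on any provider supporting idempotency keys natively.

```python
import threading
import uuid
from concurrent.futures import Future

class GenerationLedger:
    """Maps an idempotency key to the single generation allowed for that key."""

    def __init__(self, generate):
        self._generate = generate   # hypothetical callable: prompt -> text
        self._lock = threading.Lock()
        self._futures = {}          # key -> Future holding the canonical output

    def complete(self, key, prompt):
        with self._lock:
            future = self._futures.get(key)
            is_owner = future is None
            if is_owner:
                # Case 1: key never seen, so this caller runs the generation.
                future = Future()
                self._futures[key] = future
        if is_owner:
            try:
                future.set_result(self._generate(prompt))
            except Exception as exc:
                # A failed generation is forgotten so a later retry can run again.
                with self._lock:
                    self._futures.pop(key, None)
                future.set_exception(exc)
        # Case 2 (still running) blocks here until the owner finishes.
        # Case 3 (already finished) returns the stored answer immediately.
        return future.result()

calls = []

def fake_generate(prompt):
    """Stand-in for a sampled model call: a different answer every time."""
    calls.append(prompt)
    return f"answer-{uuid.uuid4().hex[:8]}"

ledger = GenerationLedger(fake_generate)
key = str(uuid.uuid4())  # minted once per logical request, reused on every retry
first = ledger.complete(key, "Summarize the ticket")
retry = ledger.complete(key, "Summarize the ticket")
print(first == retry, len(calls))  # True 1
```

The important design choice is that the client mints the key before the first attempt and reuses it on every retry. A key generated per attempt would make each retry look like a new request and change nothing.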
The failure modes are subtle and expensive
The damage from a duplicated generation rarely shows up as a crash. It shows up as a slow accumulation of wrongness that no single log line explains.
Double billing. The most direct one. The provider charges per generation, not per logical request. A client that retries through a lost response is billed twice for an answer the user asked for once. At a 5% retry rate this is a rounding error on a small app and a real line item at scale — and it is invisible, because both charges look legitimate in isolation.
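To put rough numbers on that, here is a back-of-the-envelope sketch. The request volumes and the price of $0.01 per generation are made-up figures for illustration, and the function assumes the worst case in which every retried request pays for exactly one extra generation.

```python
from decimal import Decimal

def duplicate_spend(requests, duplicate_rate, price_per_generation):
    """Money spent on generations that answered an already-answered request."""
    return requests * Decimal(duplicate_rate) * Decimal(price_per_generation)

# Hypothetical volumes at the 5% retry rate, priced at $0.01 per generation.
small_app = duplicate_spend(10_000, "0.05", "0.01")
at_scale = duplicate_spend(100_000_000, "0.05", "0.01")

print(f"small app: ${small_app:,.2f}")  # small app: $5.00
print(f"at scale:  ${at_scale:,.2f}")   # at scale:  $50,000.00
```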
Side effects fired twice. This is the dangerous one. In an agent loop, a generation often is the decision to act: the model emits a tool call, and the orchestrator executes it. If the transport for that generation is retried, the model is re-sampled, emits a tool call again — possibly a different one — and the orchestrator runs it. A refund gets issued twice. An email goes out twice. A row gets inserted twice. Agent frameworks are full of open issues describing exactly this: tools invoked twice after a retry, MCP calls duplicated four times under parallel subagents. The model layer cannot promise idempotency, so the side-effect layer inherits the coin flip.
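One hedge against this is to key tool execution on the logical step rather than on what the model emitted, since a re-sampled tool call may carry different arguments. The sketch below assumes a hypothetical orchestrator that knows a `run_id` and `step_index` for every tool call. `ToolExecutor` and `issue_refund` are illustrative names, and the in-memory dictionary stands in for what would need to be a durable store.

```python
class ToolExecutor:
    """Runs each (run_id, step_index) at most once, whatever the model re-emits."""

    def __init__(self, tools):
        self._tools = tools   # tool name -> callable
        self._done = {}       # step key -> result of the first execution

    def execute(self, run_id, step_index, tool_name, args):
        # The key comes from the step's position in the run, not from the
        # model's output, so a re-sampled call maps to the same key.
        step_key = f"{run_id}:{step_index}"
        if step_key in self._done:
            return self._done[step_key]
        result = self._tools[tool_name](**args)
        self._done[step_key] = result
        return result

refunds = []

def issue_refund(order_id, amount):
    """Stand-in side effect: records the refund instead of calling a payment API."""
    refunds.append((order_id, amount))
    return f"refunded {amount} on {order_id}"

executor = ToolExecutor({"issue_refund": issue_refund})

# First generation for step 3 asks for a refund of 40.
first = executor.execute("run-7", 3, "issue_refund", {"order_id": "A1", "amount": 40})
# The transport is retried; the re-sampled generation asks for 45 instead.
second = executor.execute("run-7", 3, "issue_refund", {"order_id": "A1", "amount": 45})

print(refunds)          # [('A1', 40)]
print(first == second)  # True
```

This sketch is not thread-safe and records the result only after the tool returns, so a crash mid-call still leaves the outcome unknown. Where the downstream API accepts its own idempotency key, passing the same step key through would close that gap.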
Divergent conversation history. When a retry produces a second assistant turn, your conversation log now contains two answers to the same user message. Whichever one you persist becomes ground truth; the other becomes a ghost. If the user saw the first (lost-in-transit) answer in a streamed UI but you stored the second, your transcript no longer matches the user's reality — and every downstream consumer of that transcript, including your eval pipeline, scores a turn the user never saw.
Eval pipelines scoring the wrong copy. If two outputs exist and your trace ingestion grabs whichever arrived last, your eval is grading an artifact that was never shipped. The dashboard moves, the cause is unfindable, and you spend a sprint chasing a regression in a response no human ever read.
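A sketch of the alternative to "whichever arrived last": record every attempt under the same request ID and let ingestion select the copy that was actually shown. The `Attempt` record, its `delivered` flag, and `canonical_by_request` are assumptions for illustration. In particular, something in the serving path has to set `delivered`, for example a client acknowledgement, and that mechanism is not shown here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attempt:
    request_id: str   # one logical request, shared by every retry
    attempt: int      # 1 for the original call, 2 and up for retries
    text: str
    delivered: bool   # True only for the copy the user was actually shown

def canonical_by_request(attempts):
    """Map each request_id to the single attempt marked as delivered."""
    canonical = {}
    for item in attempts:
        if not item.delivered:
            continue
        if item.request_id in canonical:
            # Two delivered copies means the transcript is already ambiguous.
            raise ValueError(f"two delivered attempts for {item.request_id}")
        canonical[item.request_id] = item
    return canonical

# Listed in arrival order: the retry arrived last but was never shown.
trace = [
    Attempt("req-1", 1, "Your refund was approved.", delivered=True),
    Attempt("req-1", 2, "Your refund is under review.", delivered=False),
]

to_score = canonical_by_request(trace)
print(to_score["req-1"].text)  # Your refund was approved.
```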
