Idempotency Is Not Optional in LLM Pipelines
A batch inference job finishes after six minutes. The network hiccups on the response. Your retry logic kicks in. Two minutes later the job finishes again — and your invoice doubles. This is the tamest version of what happens when you apply traditional idempotency thinking to LLM pipelines without adapting it to stochastic systems.
Most production teams discover the problem the hard way: a retry that was supposed to recover from a transient error triggers a second payment, sends a duplicate email, or writes a contradictory record to the database. The fix is not better retry logic — it is a different mental model of what idempotency even means when your core component is probabilistic.
Why the Classic Definition Breaks Down
In deterministic systems, idempotency is simple: running the same operation multiple times produces the same result as running it once. Safe to retry because the output is fixed.
LLMs violate this at the token level. Temperature-based sampling, top-p filtering, and provider-side batching effects mean that the same prompt can produce different output on each call. Even at temperature=0, floating-point non-determinism across hardware and silent provider-side model updates can shift results. The classic definition — same input, same output — simply does not apply.
This matters because most retry infrastructure was built for the deterministic world. Exponential backoff, idempotency keys, response caching — all assume that "retrying the same request" is a coherent concept. In LLM systems you need to be precise about what you mean by "same."
There are two fundamentally different retry intents, and conflating them is where production bugs are born.
The Two-Retry Trap
Retry for transient reliability means: the operation succeeded (or would have succeeded), but the response was lost in transit. A network timeout, a brief 503, a slow response that crossed your deadline — the underlying work already happened, and you want the result you're owed without triggering the operation again.
Retry to sample a different answer means: the LLM returned output that failed validation — wrong format, contradicted a constraint, triggered a guardrail — and you're hoping the next sample will be better.
These require opposite handling. For transient retries, you want strict idempotency: the exact same result returned from a cache, no recomputation. For sampling retries, you need the opposite — a genuinely new generation, which means your retry must have a different effective input (usually by appending the error feedback to the prompt).
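The two intents can be kept apart in code. The sketch below is illustrative, not tied to any particular framework: `call` stands in for the real inference client, and the cache is a plain dict. The transient path reuses the same payload and key; the sampling path changes the effective input on every attempt.

```python
import time
from typing import Callable

def retry_transient(call: Callable[[str], str], prompt: str,
                    cache: dict, key: str, attempts: int = 3) -> str:
    """Transient retry: identical payload, identical idempotency key.
    If an earlier attempt already succeeded, the cached result is
    returned instead of re-executing the call."""
    for i in range(attempts):
        if key in cache:
            return cache[key]
        try:
            cache[key] = call(prompt)
            return cache[key]
        except TimeoutError:
            time.sleep(2 ** i)  # exponential backoff before the next try
    raise RuntimeError("transient retries exhausted")

def retry_sampling(call: Callable[[str], str], prompt: str,
                   validate: Callable[[str], tuple[bool, str]],
                   attempts: int = 3) -> str:
    """Sampling retry: each attempt has a *different* effective input,
    because the previous validation error is appended to the prompt."""
    feedback = ""
    for _ in range(attempts):
        output = call(prompt + feedback)
        ok, error = validate(output)
        if ok:
            return output
        feedback = f"\n\nYour previous answer failed validation: {error}. Try again."
    raise ValueError("sampling retries exhausted")
```

Note that the sampling loop deliberately takes no cache: each attempt is a new logical operation, which is exactly why it must not share a key with its predecessors.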
The failure mode engineers hit is treating sampling retries like transient retries. An agent sends a tool call, the tool times out, the agent retries assuming the original call failed. But the first call completed — it just didn't report back in time. Now you have two tool executions, two side effects, one confused state machine.
The downstream consequences scale with what the tool does. Sending a Slack message twice is annoying. Charging a payment card twice is a customer service incident. Writing two contradictory records to a database is a debugging nightmare that can persist for weeks before someone notices the inconsistency.
The Idempotency Key Pattern and Where It Breaks
The standard fix is the idempotency key: include a stable unique identifier with every mutating operation. On the server side, store the result of the first execution against that key. On retry, return the cached result without re-executing. Stripe popularized this, and an IETF draft proposes standardizing it as the Idempotency-Key HTTP header.
For deterministic APIs this works well. For LLM pipelines there are three complications:
Keys must be scoped to logical operations, not API calls. An agent orchestration step might involve five LLM calls plus three tool calls. The idempotency key should cover the unit of work that must not be duplicated — typically the action with external side effects — not the individual inference request.
Sampling retries break key semantics. If you cache the result of a failed parse attempt under an idempotency key, subsequent retries return the same bad output. The key must be invalidated or the retry must use a different key. The cleanest pattern: include a retry count or a hash of the previous error in the key, so each sampling retry is treated as a distinct operation while still being recoverable if the network fails.
Provider-level caching is not idempotency. Prefix caching, prompt caching, and semantic caching all reduce cost and latency, but they are not designed as idempotency mechanisms. They may return cached responses, but they do not provide the delivery guarantees that idempotency keys do. Confusing the two creates false confidence that retries are safe.
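The second complication can be handled in the key derivation itself. A minimal sketch, with an illustrative key format: network retries of the same attempt reuse the key, while each sampling retry changes the attempt counter and error hash.

```python
import hashlib

def idempotency_key(workflow_id: str, step: str,
                    sampling_attempt: int = 0, last_error: str = "") -> str:
    # Network-level retries of the same attempt produce the same key,
    # so they hit the result cache. Each sampling retry changes the
    # attempt counter and error hash, so it is a distinct operation --
    # yet still recoverable if *its* network call fails.
    error_hash = hashlib.sha256(last_error.encode()).hexdigest()[:8]
    return f"{workflow_id}:{step}:a{sampling_attempt}:{error_hash}"
```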
Semantic Deduplication for Agent Actions
When the thing you are trying to deduplicate is not a raw API call but an agent action — "send an email," "create a ticket," "submit a form" — content hashing on the raw request is often insufficient. The agent might generate slightly different tool call parameters each time (different phrasings, equivalent IDs) while expressing the same intended action.
Semantic deduplication adds a similarity check before execution:
- Embed the proposed tool call (or a normalized representation of it) and store it with the idempotency key.
- When a new tool call arrives, check it against recent stored calls from the same agent session.
- If cosine similarity exceeds a threshold (typically 0.9 for near-identical, 0.85 for equivalent-intent), treat it as a duplicate.
Production deployments report 20-45% cache hit rates on real agent traffic using this approach — and the benefits compound in agentic loops where the same sub-goal is reached via different reasoning paths. The threshold calibration matters: too low and legitimate different actions get blocked; too high and you miss paraphrase duplicates.
The practical pattern is a funnel: fast string matching first, then embedding similarity for near-misses, then LLM-based verification only for borderline cases where you need semantic judgment. This keeps lookup latency under 50ms — the threshold at which deduplication overhead starts to erode its own benefits.
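The first two stages of the funnel are easy to sketch. The embedding function is injected (any embedding API would do), and the third LLM-verification stage for borderline scores is omitted; the 0.9 threshold follows the near-identical figure above.

```python
import math
from typing import Callable, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def is_duplicate(candidate: str,
                 recent: list[tuple[str, Sequence[float]]],
                 embed: Callable[[str], Sequence[float]],
                 threshold: float = 0.9) -> bool:
    """Dedup funnel against recent tool calls from the same session:
    cheap exact string match first, embedding similarity second."""
    # Stage 1: exact match on the normalized call text.
    if any(candidate == text for text, _ in recent):
        return True
    # Stage 2: embedding similarity for paraphrased near-misses.
    vec = embed(candidate)
    return any(cosine(vec, v) >= threshold for _, v in recent)
```

In practice `recent` would be a time-bounded store keyed by agent session, and a borderline band (say, 0.85-0.9) would be escalated to an LLM judge rather than decided by the threshold alone.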
What Corrupted State Machines Actually Look Like
Multi-agent systems are the highest-risk surface area for idempotency failures because failures at any step can cascade. A common pattern:
Agent A orchestrates a multi-step workflow: retrieve data, analyze it, write a summary, send a notification. Each step is handled by a specialist sub-agent. Agent A waits for completion signals. If Agent B's completion signal is delayed — the notification service was slow — Agent A retries the whole workflow. Agent B's notification is now in flight twice. The summary has been written twice. If the write operation was not guarded by an idempotency key, there are now two records.
The failure is invisible until a human or downstream system notices the duplicate. LLM outputs being slightly different on each run means the two summary records are not bitwise identical — a simple deduplication query won't catch them. This is why idempotency in multi-agent systems is a workflow-level concern, not just an API concern.
The saga pattern addresses this directly. Each agent step publishes a completion event and registers a compensating action — the rollback logic that undoes the step if a later step fails. On retry, the orchestrator replays only incomplete steps, not already-completed ones. Each step's handler is idempotent: it checks a completion record before executing, and skips if already done.
This is standard distributed systems engineering. The twist for LLM pipelines is that the "completion record" must track semantic completion, not just execution completion. A tool call that ran to completion but returned an unparseable response is not semantically complete — you want to re-execute it, not skip it on retry. The completion record needs a status field that distinguishes "ran" from "succeeded."
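A skip-if-done handler that tracks semantic completion might look like this sketch, with a dict standing in for a durable completion store:

```python
from enum import Enum
from typing import Callable

class StepStatus(Enum):
    SUCCEEDED = "succeeded"        # ran AND produced a valid result
    RAN_INVALID = "ran_invalid"    # ran, but the output failed validation

def run_step(store: dict, key: str, execute: Callable[[], str],
             validate: Callable[[str], bool]) -> str:
    """Idempotent step handler: skip only if *semantically* complete.
    A step that ran but returned an unparseable result is re-executed."""
    record = store.get(key)
    if record and record["status"] is StepStatus.SUCCEEDED:
        return record["result"]  # already semantically complete: skip
    result = execute()
    status = StepStatus.SUCCEEDED if validate(result) else StepStatus.RAN_INVALID
    store[key] = {"status": status, "result": result}
    if status is not StepStatus.SUCCEEDED:
        raise ValueError(f"step {key} ran but output failed validation")
    return result
```

The key detail is that `RAN_INVALID` records do not short-circuit the retry: "ran" and "succeeded" are distinct states, and only the latter is skippable.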
Designing for Idempotency from the Start
The patterns that hold up in production share a few characteristics.
Separate ingestion from processing. Webhook handlers, queue consumers, and API endpoints should acknowledge receipt immediately and push work onto an internal queue. Business logic runs asynchronously from that queue, with deduplication happening at the queue level rather than in the handler. This decouples reliability of acknowledgment from reliability of execution.
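As a minimal sketch of this separation, using an in-process queue (a real deployment would use a durable broker), the handler acknowledges unconditionally and deduplicates on the event ID at the queue boundary:

```python
import queue

class Ingest:
    """Acknowledge receipt immediately; run business logic later from
    a work queue, deduplicating on the event ID at enqueue time."""
    def __init__(self):
        self.work = queue.Queue()
        self.seen: set[str] = set()

    def handle_webhook(self, event_id: str, payload: dict) -> str:
        if event_id not in self.seen:   # dedup at the queue boundary
            self.seen.add(event_id)
            self.work.put((event_id, payload))
        return "202 Accepted"           # ack regardless: safe to redeliver

    def process_next(self, business_logic) -> None:
        event_id, payload = self.work.get()
        business_logic(payload)         # runs off the request path
```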
Make tool calls pure when possible. A tool call that reads state and returns a result is free — retry it as many times as you need. A tool call that writes state is expensive — guard it with an idempotency key. Designing agents to prefer reads over writes, and to push writes to the end of a workflow, naturally reduces the surface area where idempotency failures matter.
Log intent before execution. Before executing any irreversible action, write the intent to a durable log with the idempotency key. If the process crashes after execution but before recording completion, the intent log tells you what happened and whether to compensate or skip on recovery. Without this, recovery logic degenerates into "try again and hope."
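An intent log can be as simple as an append-only JSONL file; the sketch below is illustrative. On recovery, any intent without a matching completion record is an action that may or may not have executed before the crash.

```python
import json
import time

def log_intent(log_path: str, key: str, action: str, params: dict) -> None:
    """Append a durable intent record *before* the irreversible action."""
    with open(log_path, "a") as f:
        f.write(json.dumps({"key": key, "phase": "intent", "action": action,
                            "params": params, "ts": time.time()}) + "\n")

def log_completion(log_path: str, key: str) -> None:
    """Append a completion record after the action finishes."""
    with open(log_path, "a") as f:
        f.write(json.dumps({"key": key, "phase": "done",
                            "ts": time.time()}) + "\n")

def unresolved_intents(log_path: str) -> set[str]:
    """Recovery: intents with no completion record need investigation --
    compensate or skip, but never blindly re-execute."""
    intents, done = set(), set()
    for line in open(log_path):
        rec = json.loads(line)
        (intents if rec["phase"] == "intent" else done).add(rec["key"])
    return intents - done
```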
Distinguish retry policy from sampling policy. In your agent loop configuration, these should be separate knobs. The retry policy governs how many times you re-attempt a failed network call with the same payload. The sampling policy governs how many times you generate a new response to satisfy a quality constraint. Mixing them in a single "max_retries" parameter is the source of most of the failure patterns described here.
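In configuration terms, the separation might look like this sketch (the field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class RetryPolicy:
    """Re-attempts of a failed network call: same payload, same key."""
    max_network_retries: int = 3
    backoff_base_s: float = 1.0

@dataclass(frozen=True)
class SamplingPolicy:
    """Fresh generations to satisfy a quality constraint: each attempt
    gets a new effective input and a new idempotency key."""
    max_sampling_attempts: int = 2
    append_error_feedback: bool = True

@dataclass
class AgentLoopConfig:
    retry: RetryPolicy = field(default_factory=RetryPolicy)
    sampling: SamplingPolicy = field(default_factory=SamplingPolicy)
```

Two typed knobs make it impossible to accidentally spend a sampling budget on network failures, or vice versa, the way a single `max_retries` does.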
The Cost Multiplication Problem
One underappreciated consequence of idempotency failures is cost multiplication under load. During a traffic spike or a provider incident, retry storms are common. If your retry logic is not idempotent, every retry executes the full pipeline — including inference, tool calls, and any downstream API charges. Output tokens are priced higher than input tokens at most providers, so duplicated generations are the expensive kind of duplication. A $50 batch job that retries three times during a network hiccup costs $200.
At scale this compounds. Teams running agentic workflows at volume report that retry storms during incidents account for 30-60% of unexpected monthly inference spend. The fix is not more aggressive rate limiting — it is idempotency infrastructure that ensures retries return cached results rather than triggering fresh executions.
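The arithmetic is worth making explicit. Using the numbers from above, with a `dedup_hit_rate` parameter standing in for the fraction of retries served from cache by idempotency infrastructure:

```python
def retry_cost(base_cost: float, retries: int, dedup_hit_rate: float = 0.0) -> float:
    """Total spend when retries re-execute the full pipeline.
    dedup_hit_rate is the fraction of retries answered from cache
    instead of triggering fresh inference."""
    executed_retries = retries * (1.0 - dedup_hit_rate)
    return base_cost * (1 + executed_retries)
```

The $50 job retried three times with no deduplication yields `retry_cost(50.0, 3) == 200.0`; with all retries served from cache it stays at $50.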
The layered caching architecture — semantic cache first, then prefix cache, then full inference — provides some protection, but only if the cache hit logic is integrated with the idempotency key system. A cache hit should satisfy the idempotency guarantee: the client receives the same logical result it would have received had the original request succeeded. If caching and idempotency are separate systems with separate keys, retries can slip through the cache and still trigger fresh inference.
Practical Starting Point
For teams beginning to retrofit idempotency into existing pipelines, the priority order is:
- Tool calls with external side effects (payments, emails, records) — guard these first with idempotency keys at the tool layer, not the agent layer.
- Agent workflow steps — add completion records and skip-if-done logic before worrying about semantic deduplication.
- Sampling retries — ensure these use distinct keys or invalidate the cached result, and always append the previous error to the prompt.
- Monitoring — add a metric that counts "idempotency key hits" at each layer. A sudden drop in hit rate during a retry storm indicates your deduplication is not covering a code path.
The mental model shift is the hard part. Traditional idempotency is a property of operations. In LLM pipelines, idempotency is a property of workflows — and it requires being explicit about what "the same operation" means when the core component is stochastic.
