Agent Idempotency Is an Orchestration Contract, Not a Tool Property

10 min read
Tian Pan
Software Engineer

The support ticket arrives at 9:41 a.m.: "I was charged three times." The trace looks clean. One user message, one planner turn, three calls to charge_card — each with a distinct tool-use ID, each returning 200 OK, each writing a different Stripe charge. The tool has an idempotency key. The backend has a dedup table. The payment processor honors Idempotency-Key. Every layer is idempotent. The customer still paid three times.

This is the shape of the bug that will land on your desk if you build agents long enough. It is not a bug in any tool. It is a bug in the contract between the agent loop and the tools, and that contract almost always lives only in a senior engineer's head.

The reflex response is to add another retry key somewhere. The actual fix is to recognize that idempotency is not a property you bolt onto each tool. It is a protocol that has to be threaded through the orchestration boundary, and the orchestration boundary in an agent is unusually hostile territory: the entity deciding when to retry is a non-deterministic language model that has no memory of what it already did.

The tool is idempotent. The agent is not.

Tool-level idempotency is a well-understood discipline. You give the endpoint a client-supplied Idempotency-Key, you stash the key with the response in a dedup table, and any subsequent call with the same key returns the cached result instead of redoing the work. Stripe made this pattern famous. Every production engineer who has wired up a payment integration has implemented it at least once.

That discipline implicitly assumes that the client knows when it is retrying. In a classic HTTP retry loop, it does: the client caught the timeout, kept the original key, and called again. The key is stable across retries because the client is the same process with the same variable in memory.

An agent loop breaks that assumption. When the planner emits charge_card a second time, it is not retrying. It is deciding again. The model has no hidden variable holding "the key I used last time." It has the transcript, the system prompt, and whatever tokens it is about to sample. If the first tool result didn't make it back into context cleanly — the call timed out, the response got truncated, the user interrupted, a subagent crashed mid-step, an approval UI rendered and the user clicked approve twice — the model will cheerfully re-plan the same action and your orchestration layer will cheerfully execute it, with a brand new tool-use ID, because at the tool's contract level, a brand new ID means a brand new request.

Three charges, three Stripe idempotency keys, three dedup-table misses, three successful 200s. The tool is idempotent. The agent is not.

The orchestration boundary is where the key has to live

The fix is stated simply and implemented with difficulty: the idempotency key must be derived by the orchestrator, not minted per tool call. It must be stable across LLM re-plans, across crash-recovery replays, across human-in-the-loop interruptions, and it must be threaded into the tool invocation by the runtime rather than synthesized by the model.

The construction that works in practice is something like (agent_run_id, step_id, tool_name, business_scope). The agent_run_id pins the key to the specific user request. The step_id is the logical step, not a tool-use counter — two re-plans of "charge the customer" should collapse to the same step_id. The tool_name scopes it so that a refund_card in the same run doesn't collide with charge_card. The business_scope — customer ID, order ID, the thing the action is about — is the final guard against the agent deciding to invoke the same tool for a genuinely different purpose later in the run.
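That derivation can be sketched in a few lines. This is a minimal illustration, not a prescribed API — the function name and the field separator are hypothetical; the point is that every input comes from the orchestrator's structural state, none from the model.

```python
import hashlib

def derive_idempotency_key(agent_run_id: str, step_id: str,
                           tool_name: str, business_scope: str) -> str:
    """Derive a stable idempotency key from run-level structural state.

    None of the inputs are model-emitted tool arguments, so two re-plans
    of the same logical step collapse to the same key.
    """
    # Join with a separator that cannot appear in the fields, then hash
    # so the key is a fixed-width, header-safe token.
    material = "\x1f".join([agent_run_id, step_id, tool_name, business_scope])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

Because the key is a pure function of structural state, a crash-recovery replay that reconstructs the same run and step reconstructs the same key for free.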

Crucially, the key is not derived from the model's tool arguments. Hashing arguments is a seductive shortcut — "same args, same key" — and it fails the first time the model paraphrases its own plan. The user says "try again" and the model re-issues charge_card with the amount rounded differently, or with the currency in lowercase, or with a memo string that drifts by one token. New hash, new key, duplicate charge. The key has to come from the structural context of the agent run, not from any string the model emitted.
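The failure mode is easy to demonstrate. The sketch below (argument values invented for illustration) shows two model emissions of the same business action whose argument hashes disagree because the strings drifted between re-plans:

```python
import hashlib
import json

def key_from_args(tool_name: str, args: dict) -> str:
    # The seductive shortcut: hash the model's own arguments.
    payload = tool_name + json.dumps(args, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# The same charge, re-planned after the user says "try again":
plan_a = {"amount": "19.90", "currency": "USD", "memo": "renewal"}
plan_b = {"amount": "19.9", "currency": "usd", "memo": "subscription renewal"}

# Same business action, drifted strings -> different keys -> duplicate charge.
assert key_from_args("charge_card", plan_a) != key_from_args("charge_card", plan_b)
```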

The model is an unreliable client, so the runtime has to act like a reliable one

Once you accept that the orchestrator owns the key, the runtime picks up responsibilities that the model cannot discharge. It has to decide, when the model emits what looks like the same tool call twice, whether to:

  • Coalesce: treat the second emission as a retry of the first, suppress execution, and return the cached result to the model as if the call had just happened.
  • Refuse: fail the call back to the model with a structured "you already did this" error and let the planner recover.
  • Bypass: a human operator has explicitly requested re-execution, so mint a fresh key and accept the double side effect.

Which of these is correct is not something the model should vote on. It is a policy the system owner configured up front, probably per-tool. A charge_card policy is "coalesce aggressively, never bypass without human sign-off." A send_notification_to_self policy might be "always allow — the user asked twice because they wanted two pings."
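A per-tool policy table is small enough to sketch. The names below are hypothetical and the table is deliberately static configuration — the decision function consults it and an explicit human override, never the model:

```python
from enum import Enum

class DuplicatePolicy(Enum):
    COALESCE = "coalesce"   # suppress execution, return the cached result
    REFUSE = "refuse"       # structured "you already did this" error to the planner
    BYPASS = "bypass"       # mint a fresh key, accept the double side effect

# Hypothetical per-tool policy table, configured by the system owner up front.
TOOL_POLICIES = {
    "charge_card": DuplicatePolicy.COALESCE,
    "send_notification_to_self": DuplicatePolicy.BYPASS,
}

def resolve_duplicate(tool_name: str, human_override: bool = False) -> DuplicatePolicy:
    """Decide how to handle a key collision. The model never votes;
    only configuration and an explicit human sign-off do."""
    if human_override:
        return DuplicatePolicy.BYPASS
    # Unknown tools default to the safest option: refuse and let the
    # planner recover with full information.
    return TOOL_POLICIES.get(tool_name, DuplicatePolicy.REFUSE)
```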

The hard part is that the model will happily argue otherwise. Ask it "should I retry the charge?" and it will produce a plausible-sounding justification for whichever answer completes the narrative of the conversation. That is the anti-pattern: letting the acknowledgment-in-prose ("Sure, let me try that again.") act as authorization to mint a new key. The authorization has to live outside the language model, in the runtime's policy table, keyed by tool name.

Crash recovery, human approval, and subagents all break naive keys

Three specific situations turn a straightforward agent into a duplicate-side-effect factory.

Crash recovery. Durable-execution runtimes like Temporal and Restate pitch themselves as the answer to agent reliability precisely because they solve this one. On replay, a durable workflow re-executes its history up to the crash point; without idempotency, the tool calls fire again. With idempotency threaded through, the replay hits the dedup table, reads the original response, and moves on. This only works if the key is reconstructible from the workflow state — which is exactly the orchestrator-owned derivation above. A key stored only in LLM context vanishes with the LLM context on replay.

Human-in-the-loop approvals. Several agent frameworks have shipped bugs where a human approval flow produced duplicate tool results for a single tool call, or where an interrupt-and-resume cycle re-executed an entire tool node. The shape is always the same: the approval UI pauses the loop, something nudges the loop back into motion, and the "something" looks to the runtime like a fresh emission. If the key is derived from (run_id, step_id) rather than from the tool-use ID the model minted, the resume is a dedup hit instead of a duplicate charge.

Subagents and parallel tool calls. When a planner fans out into subagents, each subagent typically gets its own tool-use ID space. Without a correlation ID that spans the whole run, two subagents assigned overlapping responsibilities can both invoke the same tool for the same business purpose. The fix is the agent_run_id in the key, plus a propagation discipline so that every subagent inherits it. This is the same correlation-ID discipline microservices teams learned ten years ago, renamed for agents.
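The propagation discipline amounts to one rule: spawning a subagent copies the run-level correlation ID and changes only the lineage. A minimal sketch, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunContext:
    agent_run_id: str      # minted once per user request, never re-minted
    step_lineage: str = "" # parent/child path, useful for tracing

    def spawn_subagent(self, step_id: str) -> "RunContext":
        # The subagent inherits the run-level ID verbatim; only the
        # lineage extends. Keys derived from this context therefore
        # collide in the dedup table when two subagents target the
        # same tool and business scope.
        return RunContext(
            agent_run_id=self.agent_run_id,
            step_lineage=f"{self.step_lineage}/{step_id}",
        )
```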

The invariant to code against

The cleanest way to state the contract is as an invariant: at the orchestration boundary, "this tool was called twice" must be indistinguishable from "this tool was called once." The runtime, not the tool, is responsible for upholding it. The tool's idempotency guarantee becomes a backup — defense in depth against the runtime missing a case — not the primary mechanism.

Engineered against this invariant, a number of downstream decisions become easier. Retries at the runtime layer are safe because the key is already minted. Partial-failure recovery — the call succeeded at the backend but the response was lost — becomes a dedup read instead of a compensating transaction. Human-in-the-loop flows can be paused and resumed without duplicate side effects. Crash-recovery replays do the right thing. Most importantly, the agent loop's non-determinism stops compounding: even if the model re-plans the same action ten times, at most one of those plans produces a side effect.

Teams that skip this discipline usually discover it in production through a specific sequence of incidents. First, a duplicate payment after a tool timeout. Then a duplicate email after a crash-recovery replay. Then a duplicate order when two subagents both thought they owned checkout. Each incident gets a post-mortem blaming the specific tool, and each tool gets a beefier idempotency guard. The incidents keep happening because the bug was never at the tool layer.

What "good" looks like

A well-designed agent runtime treats the idempotency key as a first-class workflow artifact, not an HTTP header the tool adapter happens to set. Concretely:

  • The key is derived by the orchestrator from run-level state, not from model-emitted arguments.
  • The key is persisted before the tool call is issued, so a crash between "key minted" and "tool invoked" is recoverable.
  • Every tool adapter receives the key as a runtime parameter, not a model-controllable one, and passes it through to the backend.
  • The dedup lookup happens at the runtime boundary, before the tool adapter runs, so coalesced calls are effectively free.
  • Re-plans from the model are mapped onto existing keys whenever the structural context matches, and the model is given the cached result instead of a fresh execution.
  • Policy per tool decides whether missing-key or ambiguous-key cases should coalesce, refuse, or escalate to a human.
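Tied together, the persist-before-call and dedup-before-adapter points look roughly like this. It is a sketch, not a production implementation: `store` stands in for a durable table (a real system would use a database with a unique constraint on the key), and the function names are invented:

```python
def call_tool(store: dict, key: str, tool_fn, args: dict):
    """Runtime-boundary invocation.

    Order of operations matters: dedup lookup first, then persist the key
    as pending BEFORE the side effect fires, then record the result under
    the key. A crash between "key minted" and "tool invoked" leaves a
    pending record to recover from.
    """
    entry = store.get(key)
    if entry is not None and entry["status"] == "done":
        return entry["result"]                 # coalesced: effectively free

    store[key] = {"status": "pending", "result": None}   # persisted pre-call
    result = tool_fn(**args)                   # the only place a side effect fires
    store[key] = {"status": "done", "result": result}
    return result
```

Note that `tool_fn` receives the model's arguments but never the key as a model-controllable value; the runtime holds the key and the adapter passes it to the backend out of band.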

A healthy eval for this is boring by design. Run a scripted agent against a flaky tool backend that randomly returns timeouts, duplicates, and slow success. Measure: how many side effects did the external system observe, per user request? The answer must be exactly one, regardless of how many tool-use IDs the transcript contains, how many subagents participated, or how many times the workflow crashed and resumed. If that number ever exceeds one, the contract is broken somewhere, and no amount of per-tool idempotency will close the gap.
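Such an eval fits in a page. The sketch below is a toy version of the idea — the backend, its 40% drop rate, and the retry count are all invented — but it exercises the nastiest case: the backend did the work and the response was lost, so the runtime retries with the same stable key:

```python
import random

class FlakyBackend:
    """Backend that honors an idempotency key but randomly drops responses."""
    def __init__(self, rng: random.Random):
        self.rng = rng
        self.seen: dict[str, str] = {}   # key -> cached result (dedup table)
        self.side_effects = 0            # external side effects actually performed

    def charge(self, key: str) -> str:
        if key not in self.seen:
            self.side_effects += 1       # the real-world side effect
            self.seen[key] = "charged"
        if self.rng.random() < 0.4:
            raise TimeoutError           # work done, but the response was lost
        return self.seen[key]

def run_request(backend: FlakyBackend, key: str) -> None:
    # The runtime retries with the SAME orchestrator-derived key each time.
    for _ in range(10):
        try:
            backend.charge(key)
            return
        except TimeoutError:
            continue

backend = FlakyBackend(random.Random(7))
for run_id in range(100):
    run_request(backend, f"run-{run_id}:charge:order-{run_id}")

# Exactly one side effect per user request, however many retries occurred.
assert backend.side_effects == 100
```

Swap the stable key for a per-attempt key and the final assertion fails immediately, which is exactly the signal the eval exists to produce.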

Stop ending the conversation at "the tool is idempotent"

The single sentence that causes the most production damage in agent systems is "but the tool is idempotent." It ends design discussions early, it closes incident tickets prematurely, and it lets teams ship agents that look reliable in demos and produce duplicate side effects in production. The tool being idempotent is necessary. It was never sufficient.

Treat the agent loop as a non-deterministic client talking to an at-least-once execution layer, and build the idempotency protocol where it actually belongs: at the orchestration boundary, owned by the runtime, derived from structural state, threaded through every subagent and every resume. Do that, and "tool called twice" stops being an outage and starts being a line in the dedup table — which is where it belonged all along.
