
The Idempotency Problem in Agentic Tool Calling

· 11 min read
Tian Pan
Software Engineer

The scenario plays out the same way every time. Your agent is booking a hotel room, and a network timeout occurs right after the payment API call returns 200 but before the confirmation is stored. The agent framework retries. The payment runs again. The customer is charged twice, support escalates, and someone senior says the AI "hallucinated a double charge" — which is wrong but feels right because nobody wants to say their retry logic was broken from the start.

This isn't an AI problem. It's a distributed systems problem that the AI layer imported wholesale, without the decades of hard-won patterns that distributed systems engineers developed to handle it. Standard agent retry logic assumes operations are idempotent. Most tool calls are not.

Why Agent Retries Are Structurally Broken

Every major agent framework — LangChain, LlamaIndex, OpenAI's Agents SDK, Anthropic's Claude Agent SDK — includes automatic retry behavior for transient failures. That's correct. Transient failures happen constantly in distributed systems, and silently dropping a request is worse than retrying.

The problem is where the retry logic lives. Agent frameworks retry at the LLM layer: the model sees a tool call failed, decides to try again, and calls the tool again. Nothing in this loop tracks whether the tool already executed its side effects. The framework sees a timeout and reasons: "I didn't get a result, so I should try again." The tool may have already written to the database, charged the card, sent the email, or created the ticket.

This failure mode appears across industries. Agents managing CRM systems create duplicate support tickets from a single customer complaint. Inventory management agents double-deduct stock from the same order. Financial agents send duplicate refunds. Each case follows the same pattern: a timeout or transient error triggers a retry, the tool executes again, and the system ends up in a state the agent never intended.

The "fix the prompt" instinct kicks in at this point. Engineers add instructions like "only call the payment tool once" or "check if the order exists before creating it." This doesn't work. The agent that generated the duplicate charge was following instructions correctly — it genuinely didn't know the first call succeeded. The problem isn't the model's reasoning; it's the absence of external state that would let the tool report "I already did this."

What Idempotency Actually Means for Tool Calls

Idempotency means calling an operation multiple times produces the same result as calling it once. A GET request is naturally idempotent: reading a record never changes it, so reading it ten times is safe. A DELETE is idempotent in practice: deleting a record that doesn't exist returns the same logical result as deleting one that does. A POST that creates a record or charges a card is not idempotent by default — every call creates a new thing.

The pattern for making non-idempotent operations safe is well-established in payment APIs. When a client sends a request, it includes an idempotency key — a unique identifier the client generates and owns. The server stores the key alongside the operation result. On a retry with the same key, the server checks its store and returns the cached result without re-executing. The client gets the same response on the third retry as on the first.
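The server side of this pattern can be sketched in a few lines. This is a minimal illustration, not any real payment API: the `IdempotentServer` class, the `charge` method, and the in-memory result store are all hypothetical stand-ins for a service backed by a durable database.

```python
# Sketch of server-side idempotency-key handling, in the style of payment APIs.
# All names (IdempotentServer, charge) are illustrative, not a real API; a
# production service would persist the result store durably.
import threading

class IdempotentServer:
    def __init__(self):
        self._results = {}          # idempotency key -> cached result
        self._lock = threading.Lock()
        self.executions = 0         # how many times the real side effect ran

    def charge(self, idempotency_key, amount):
        with self._lock:
            if idempotency_key in self._results:
                # Replay: return the stored result without re-executing.
                return self._results[idempotency_key]
            # First time we see this key: execute the side effect once.
            self.executions += 1
            result = {"status": "succeeded", "amount": amount}
            self._results[idempotency_key] = result
            return result

server = IdempotentServer()
first = server.charge("order-42:charge", 99.00)
retry = server.charge("order-42:charge", 99.00)  # same key -> cached result
assert first == retry and server.executions == 1
```

The client can now retry freely: every call with the same key converges on the same response, and the charge runs exactly once.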

For agent tool calls, this pattern requires explicit design decisions across three layers:

The agent runtime layer must generate and maintain idempotency keys per workflow step. The right key is derived from durable state: {workflowRunId}:{stepId} works well in practice. This ensures keys survive restarts and are deterministic — the same key is regenerated on resume, not a new one generated on retry.
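Derivation of that key is deliberately boring. A sketch, assuming the runtime already has a persisted workflow run ID and step ID:

```python
# Deterministic key derivation from durable workflow state, matching the
# {workflowRunId}:{stepId} scheme above. The inputs are assumed to come
# from the agent runtime's persisted state, not generated per attempt.
def idempotency_key(workflow_run_id: str, step_id: str) -> str:
    return f"{workflow_run_id}:{step_id}"

# Regenerated on resume, the key is identical — not a fresh UUID per retry.
assert idempotency_key("run_8f2c", "charge_payment") == "run_8f2c:charge_payment"
```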

The tool execution layer must pass the idempotency key to the downstream service, check the deduplication store before executing, and cache results with sufficient TTL. If the key exists and the previous call succeeded, return the cached response. If the key exists and the previous call failed with a permanent error, return that error without re-executing.

The tool interface itself must accept idempotency keys and implement the deduplication logic. Tools that call external APIs should pass the key through to those APIs. Tools that write to internal databases should use the key as part of a unique constraint.

When all three layers cooperate, the agent framework can retry as aggressively as it wants. The economic result — one charge, one record, one email — matches the agent's intent.
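The tool execution layer's check-before-execute logic might look like the following sketch. `ToolExecutor`, `PermanentError`, and the in-memory store are hypothetical; a real deployment would use a shared store with a TTL, as noted above.

```python
# Sketch of the tool execution layer: check the dedup store before running,
# and cache both successes and permanent failures. All names are illustrative;
# a production store would be shared and TTL-bounded.
class PermanentError(Exception):
    pass

class ToolExecutor:
    def __init__(self):
        self._store = {}  # key -> ("ok", result) or ("permanent_error", msg)

    def execute(self, key, tool_fn, *args):
        if key in self._store:
            kind, value = self._store[key]
            if kind == "ok":
                return value                  # replay cached success
            raise PermanentError(value)       # replay cached permanent failure
        try:
            result = tool_fn(*args)
        except PermanentError as e:
            self._store[key] = ("permanent_error", str(e))
            raise
        self._store[key] = ("ok", result)
        return result

calls = []
def send_email(to):
    calls.append(to)          # the side effect we must not duplicate
    return f"sent to {to}"

ex = ToolExecutor()
ex.execute("run1:email", send_email, "a@example.com")
ex.execute("run1:email", send_email, "a@example.com")  # deduplicated
assert calls == ["a@example.com"]
```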

Saga Patterns for Multi-Step Workflows

Single-tool idempotency is the simpler problem. The harder case is multi-step workflows where an agent calls several tools in sequence and something fails partway through.

Consider an agent that processes an order: reserve inventory, charge the customer, send a confirmation email. Each step is idempotent in isolation. But if the payment succeeds and the confirmation fails, the customer has been charged without receiving a confirmation. If the agent retries the whole sequence, the charge is deduplicated (assuming you implemented idempotency at that layer), but the inventory was already reserved — a second reservation attempt might fail if stock is now zero.

This is the problem the saga pattern solves. A saga is a sequence of steps where each step has a corresponding compensating action that reverses its effects if a later step fails. Rather than atomic rollback (which requires distributed transactions and their associated costs), sagas implement eventual consistency through explicit compensation.

For the order processing workflow, the saga looks like this:

  • Reserve inventory → compensating action: release reservation
  • Charge payment → compensating action: issue refund
  • Send confirmation → compensating action: send cancellation notice

If the confirmation step fails permanently, the saga executor runs the compensating actions in reverse order: issue refund, release inventory. The customer sees a failed order, not a charged order with no confirmation.
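A minimal saga executor captures this reverse-order compensation. The step names and lambdas are illustrative; real steps would be idempotent tool calls.

```python
# Minimal saga executor: run steps in order; on failure, run the compensating
# actions of completed steps in reverse. Step names are illustrative.
def run_saga(steps):
    """steps: list of (name, action, compensation) tuples."""
    completed = []
    try:
        for name, action, compensation in steps:
            action()
            completed.append((name, compensation))
    except Exception:
        for name, compensation in reversed(completed):
            compensation()  # compensations must themselves be idempotent
        raise

log = []
def fail():
    raise RuntimeError("confirmation service down")

try:
    run_saga([
        ("reserve_inventory", lambda: log.append("reserved"),
                              lambda: log.append("released")),
        ("charge_payment",    lambda: log.append("charged"),
                              lambda: log.append("refunded")),
        ("send_confirmation", fail,
                              lambda: log.append("cancellation sent")),
    ])
except RuntimeError:
    pass

# Compensation ran in reverse order: refund before releasing inventory.
assert log == ["reserved", "charged", "refunded", "released"]
```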

The critical implementation detail is that compensating actions must themselves be idempotent. Issuing a refund on retry should not issue a second refund. This sounds obvious until you're debugging a refund loop at 2am.

Two implementation patterns exist for saga execution. The orchestration model uses a central coordinator — often a durable workflow engine or a dedicated orchestrator agent — that directs each step and triggers compensation on failure. This gives clear visibility into workflow state but creates a single point of coordination. The choreography model has each step emit events that trigger the next step, with compensation triggered by failure events. This is more loosely coupled but significantly harder to observe and debug.

For agentic workflows, orchestration generally wins. Agents need to track which tools they've called and with what results. A workflow engine that persists this state provides the foundation for deterministic retry and compensation without rebuilding it from scratch in the agent's prompt.

Designing Idempotent Tool Interfaces

The burden shouldn't fall entirely on the agent runtime. Tools themselves should be designed with idempotency in mind, and the design decisions are not obvious.

Accept client-generated idempotency keys. Don't generate them server-side. When the client owns the key, it can regenerate the same key on retry from the same input state. Server-generated keys require the client to store the key from the first response — which may not exist if the first request timed out before the response arrived.

Return unambiguous error signals. When an agent tool call fails, the agent needs to know whether to retry. Not all errors are retryable. A 500 status from a flaky network is retryable. A 400 status from a malformed request is not — retrying it will fail again. A 409 status from a duplicate key violation might mean the operation already succeeded. Each error type should signal its retry semantics explicitly, rather than leaving the agent to guess.
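The classification in the previous paragraph can be made explicit in a small mapping. `retry_semantics` and its return shape are hypothetical — the point is that the tool, not the agent, decides retryability.

```python
# Sketch of mapping HTTP-style status codes to explicit retry semantics so
# the agent never has to guess. The function name and return shape are
# illustrative, mirroring the classification in the prose.
def retry_semantics(status: int) -> dict:
    if status in (408, 429) or status >= 500:
        return {"retryable": True}            # transient: safe to retry
    if status == 409:
        # Duplicate key: the operation may have already succeeded —
        # fetch the existing resource instead of re-executing.
        return {"retryable": False, "check_existing": True}
    return {"retryable": False}               # other 4xx: retrying fails again

assert retry_semantics(503)["retryable"] is True
assert retry_semantics(400)["retryable"] is False
assert retry_semantics(409)["check_existing"] is True
```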

Implement preview endpoints for destructive operations. Before an agent deletes records, cancels subscriptions, or runs bulk updates, it should be able to ask "what would this do?" without committing. The preview endpoint returns a description of affected resources; the execution endpoint requires a token from the preview response. This pattern — common in infrastructure tooling — prevents agents from discovering side effects through irreversible execution.
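The preview/execute handshake can be sketched as follows. `DeleteTool` and its methods are invented for illustration; the essential property is that execution requires a token minted by a preview.

```python
# Sketch of the preview/execute pattern for destructive operations: the
# execute endpoint only accepts a single-use token minted by a preview call.
# DeleteTool and its record model are illustrative.
import secrets

class DeleteTool:
    def __init__(self, records):
        self.records = records
        self._previews = {}  # token -> record ids approved for deletion

    def preview(self, pattern):
        matched = [r for r in self.records if pattern in r]
        token = secrets.token_hex(8)
        self._previews[token] = matched
        return {"would_delete": matched, "token": token}

    def execute(self, token):
        if token not in self._previews:
            raise ValueError("no matching preview; call preview() first")
        to_delete = self._previews.pop(token)   # token is single-use
        self.records = [r for r in self.records if r not in to_delete]
        return {"deleted": to_delete}

tool = DeleteTool(["user_1", "user_2", "order_9"])
p = tool.preview("user")        # agent sees the blast radius before committing
tool.execute(p["token"])
assert tool.records == ["order_9"]
```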

Use natural business identifiers where possible. For operations tied to a business entity — an order, a subscription, a customer — using the entity ID as part of the idempotency key gives you natural deduplication. "Create notification for order ord_12345" is idempotent by construction if the tool checks for existing notifications for that order before creating a new one. This approach is more robust than UUID-based keys because it aligns the deduplication logic with business semantics.

Return operation IDs for async work. If a tool starts a long-running operation — a report generation, a data migration — it should return an operation ID immediately and let the caller poll for completion. This separates the question "did the operation start?" from "did the operation complete?" and allows the agent to recover from timeouts without risking duplicate execution.
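Combined with an idempotency key, the operation-ID pattern lets a timed-out caller rejoin the run it already started. `ReportService` and its methods are illustrative names.

```python
# Sketch of returning an operation ID for long-running work so a timed-out
# caller can poll instead of re-starting. ReportService is illustrative.
import uuid

class ReportService:
    def __init__(self):
        self._ops = {}  # idempotency key -> operation record

    def start_report(self, idempotency_key):
        # Same key -> same operation ID, so a retry after a timeout
        # rejoins the existing run instead of starting a second one.
        if idempotency_key not in self._ops:
            self._ops[idempotency_key] = {
                "id": str(uuid.uuid4()), "status": "running",
            }
        return self._ops[idempotency_key]["id"]

    def poll(self, op_id):
        for op in self._ops.values():
            if op["id"] == op_id:
                return op["status"]
        raise KeyError(op_id)

svc = ReportService()
op1 = svc.start_report("run1:report")
op2 = svc.start_report("run1:report")  # retry after timeout -> same operation
assert op1 == op2
assert svc.poll(op1) == "running"
```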

Delivery Semantics and the Exactly-Once Illusion

Distributed systems engineers distinguish three delivery guarantees:

At-most-once: The operation executes zero or one times. No duplicates, but possible data loss. Use this for operations where running twice is worse than not running at all.

At-least-once: The operation executes one or more times. No data loss, but possible duplicates. This is what agent retry logic provides by default.

Exactly-once: The operation executes precisely once. This is what everyone wants, and it's famously hard to achieve in distributed systems.

The way to approximate exactly-once semantics in practice is to combine at-least-once delivery with idempotent execution. The delivery mechanism retries until it gets an acknowledgment. The execution layer uses idempotency keys to ensure repeated delivery has no additional effect. The combination produces exactly-once effect from the system's perspective, even though the call may arrive multiple times.
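The combination can be demonstrated end to end in a toy model: a flaky channel that redelivers until acknowledged, and a consumer that deduplicates on a key. Everything here is a simulation, not a real queue client.

```python
# At-least-once delivery plus idempotent execution approximates exactly-once
# effects: the channel may deliver a message several times, but the consumer's
# dedup set ensures one logical effect. Simulated sketch, not a real queue.
import random

random.seed(7)
processed_keys = set()
effects = []

def handle(message):
    # Idempotent consumer: deduplicate on the message key.
    if message["key"] in processed_keys:
        return "ack"
    processed_keys.add(message["key"])
    effects.append(message["body"])
    return "ack"

def deliver_at_least_once(message):
    # Flaky channel: acks are sometimes lost, so the sender redelivers.
    while True:
        result = handle(message)        # may run more than once
        if result == "ack" and random.random() > 0.5:
            return                      # ack received; stop retrying

deliver_at_least_once({"key": "order-42", "body": "charge $99"})
assert effects == ["charge $99"]        # one effect, regardless of redelivery
```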

This is the architecture that message queues like Kafka advertise when they claim "exactly-once semantics." They don't eliminate duplicate messages at the network layer. They combine deduplication at the consumer layer with transactional commits to ensure each message produces one logical effect, regardless of how many times it arrives.

Agent frameworks are converging on the same architecture. The agent runtime delivers tool calls with at-least-once semantics (retrying on failure). The tool execution layer implements idempotency (checking keys before executing). The result is exactly-once business effects, which is what the agent intended.

What Production-Grade Agent Infrastructure Looks Like

The pattern emerging from teams running agents in production in 2025-2026 is consistent across companies and stacks:

  • Every tool call that produces side effects carries an idempotency key derived from durable workflow state.
  • The agent framework persists step results in a durable store, so on restart the agent can reconstruct what has already executed and what remains.
  • Compensating transactions exist for every step that has downstream dependencies.
  • Error responses include explicit retry guidance — retryable: true/false and retry_after_seconds — so the agent doesn't apply uniform retry logic to non-uniform failure modes.

Teams that skip this infrastructure eventually hit production incidents that look like AI failures but are actually distributed systems failures. An agent that "made up" a duplicate order didn't hallucinate — it retried a non-idempotent operation and lost track of the result. An agent that "couldn't cancel" a subscription after failing to confirm it didn't miss the instruction — it lacked compensating logic and left the workflow in an inconsistent state.

The framing matters because it determines the fix. Blaming the model leads to prompt engineering. Blaming the architecture leads to idempotency keys, saga executors, and tool interface redesigns — changes that actually solve the problem.

Getting Started Without Rewriting Everything

You don't need to implement Temporal or build a saga executor to make progress. The highest-leverage changes are:

Start with your most dangerous tools — the ones that write money, send communications, or modify records in ways that are hard to undo. Add idempotency key support to those tools first. Have your agent runtime generate keys from workflow IDs and step numbers. Test by running the tool twice with the same key and verifying the second call returns the same result without re-executing.

Add retry metadata to error responses from your tools. Even just distinguishing retryable: true from retryable: false prevents the agent from endlessly retrying permanent failures.

For multi-step workflows, identify the steps that have irreversible effects and write compensating actions for them — even simple ones. A compensation that logs "manual review required" is better than no compensation at all.

The core insight of idempotency is that state machines are more reliable than intent. An agent that reasons "I think I already charged this customer" is less reliable than a system that records "this idempotency key returned success on 2026-04-18T14:23:00Z." Replace reasoning about past actions with records of past actions, and your agent's retry behavior stops creating problems.
