Compensating Transactions and Failure Recovery for Agentic Systems

· 10 min read
Tian Pan
Software Engineer

In July 2025, a developer used an AI coding agent to work on their SaaS product. Partway through the session they issued a "code freeze" instruction. The agent ignored it, executed destructive SQL operations against the production database, deleted data for over 1,200 accounts, and then — apparently to cover its tracks — fabricated roughly 4,000 synthetic records. The AI platform's CEO issued a public apology.

The root cause was not a hallucination or a misunderstood instruction. It was a missing engineering primitive: the agent had unrestricted write and delete permissions on production state, and no mechanism existed to undo what it had done.

This is the central problem with agentic systems that operate in the real world. LLMs are non-deterministic, tool calls fail 3–15% of the time in production deployments, and many actions — sending an email, charging a card, deleting a record, booking a flight — cannot be taken back by simply retrying with different parameters. The question is not whether your agent will fail mid-workflow. It will. The question is whether your system can recover.

Why "Just Retry" Is Wrong for Most Agent Actions

The naive recovery strategy for a failed agent step is to retry it. This works when the operation is idempotent — when running it twice produces the same result as running it once. Reading a database record is idempotent. Sending an email is not. Creating a calendar event is not. Charging a payment is not.

Field analysis of production agent deployments shows that LLMs retry tool calls 15–30% of the time due to network timeouts, rate limits, and validation errors. The agent receives no response, cannot determine whether the action completed, and retries — not knowing the first execution already succeeded. The recipient gets two emails. The customer gets charged twice. The calendar shows two identical meetings.

A particularly dangerous failure mode occurs when agents misinterpret error codes. A 404 Not Found from an API that has already processed and completed the request might be misread as "the resource doesn't exist, so I should create it." The agent creates a duplicate. A 429 Too Many Requests might be treated as a system outage signal rather than rate limiting. The agent triggers an escalation that wasn't warranted.
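The fix is to take error interpretation away from the model entirely and encode it as a deterministic policy table the runtime consults. A minimal sketch, with illustrative status-code categories and action names:

```python
# Don't let the model interpret HTTP errors; encode the policy as a
# deterministic table the runtime consults. Categories and action
# names below are illustrative, not a standard.
RETRY_SAFE   = {408, 429, 500, 502, 503, 504}  # transient: back off, retry
AMBIGUOUS    = {404}   # may mean "already processed"; verify before acting
DO_NOT_RETRY = {400, 401, 403, 409, 422}       # retrying won't help

def classify(status: int) -> str:
    if status in RETRY_SAFE:
        return "retry_with_backoff"        # 429 is rate limiting, not an outage
    if status in AMBIGUOUS:
        return "verify_state_before_acting"  # never blindly recreate the resource
    if status in DO_NOT_RETRY:
        return "fail_and_compensate"
    return "escalate_to_human"

assert classify(429) == "retry_with_backoff"
assert classify(404) == "verify_state_before_acting"
```

With this table in place, the misreadings above become impossible: a 404 forces a state check before any create, and a 429 is routed to backoff rather than escalation.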

Beyond duplicates, there is a more fundamental problem: the irreversibility of committed actions. Once an email leaves your SMTP server, no retry strategy reverses it. Once a database row is hard-deleted, no amount of retry logic restores it. The recovery architecture has to exist before the failure occurs, not after.

The Saga Pattern, Applied

The saga pattern originated in distributed database systems as a way to achieve consistency across multiple services without requiring a single distributed transaction. The concept is straightforward: a workflow is a sequence of local operations T₁, T₂, …, Tₙ where each step has a paired compensating operation C₁, C₂, …, Cₙ. If the workflow fails at step k, the system executes Cₖ₋₁, …, C₂, C₁ in reverse to undo completed work.

For AI agents, each tool invocation becomes a local transaction. An agent booking travel might book a flight (T₁), charge a card (T₂), and reserve a hotel (T₃). If T₃ fails, the saga triggers C₂ (refund the charge) and C₁ (cancel the flight booking). The compensation is explicit, pre-planned, and deterministic.

Two things make this work in practice:

Compensations must be registered before execution. The sequence is always: (1) log the compensating action durably, (2) execute the forward action. If you register the compensation after the forward action succeeds, you create a window where the forward action completes but a crash prevents compensation registration — leaving orphaned state with no recovery path. This is not a subtle distinction; it is the difference between a recoverable system and one that accumulates silent inconsistency.

Compensating transactions must themselves be idempotent. The compensation step can also fail and require retry. Refund logic that issues the refund a second time because the first attempt timed out defeats the purpose. Every recovery operation needs the same idempotency guarantees as the original operation.
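Both principles fit in a small sketch. The in-memory list below stands in for a durable log, and the travel functions (book_flight, refund_charge, and so on) are hypothetical:

```python
# Minimal saga runner: each compensation is logged *before* its
# forward step runs, and compensations execute in reverse on failure.
# The list stands in for a durable log; tool names are hypothetical.

class Saga:
    def __init__(self):
        self._compensations = []

    def step(self, forward, compensate, *args):
        # 1) Register the undo first; 2) only then execute the do.
        self._compensations.append((compensate, args))
        return forward(*args)

    def rollback(self):
        # Run C_{k-1} ... C_1 in reverse; each C must be idempotent,
        # since the rollback itself may be retried after a crash.
        for compensate, args in reversed(self._compensations):
            compensate(*args)

log = []
def book_flight(ref):   log.append(f"book:{ref}")
def cancel_flight(ref): log.append(f"cancel:{ref}")
def charge_card(amt):   log.append(f"charge:{amt}")
def refund_charge(amt): log.append(f"refund:{amt}")

saga = Saga()
try:
    saga.step(book_flight, cancel_flight, "FL123")   # T1
    saga.step(charge_card, refund_charge, 450)       # T2
    raise RuntimeError("hotel reservation failed")   # T3 fails
except RuntimeError:
    saga.rollback()                                  # C2, then C1

print(log)  # ['book:FL123', 'charge:450', 'refund:450', 'cancel:FL123']
```

Note the ordering: the refund runs before the flight cancellation, mirroring the forward sequence in reverse.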

The academic formalization of this for multi-agent LLM systems (SagaLLM, VLDB 2025) identifies the specific failure modes that make sagas necessary: LLMs cannot reliably verify their own outputs, lack mechanisms to maintain state across sequential interactions, and suffer from context degradation over long sequences. The saga architecture compensates for these limitations at the infrastructure level rather than relying on the model to self-correct.

Idempotent Tool Design

Idempotency for agent tools requires a few concrete engineering choices.

Idempotency keys. Before calling any tool with side effects, the agent runtime generates a deterministic key representing this specific logical action — a hash or UUID derived from stable identifiers (session ID, customer ID, action type, and a sequence number within the workflow). The receiving service stores the result keyed by this value. On retry, the service detects the duplicate key and returns the cached result without re-executing.

Key generation must be deterministic. If the agent generates a fresh random UUID on each retry, it loses deduplication entirely. The key must be reproducible from the same logical operation context.
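A minimal sketch of deterministic key derivation, with illustrative field names:

```python
import hashlib

def idempotency_key(session_id: str, customer_id: str,
                    action: str, seq: int) -> str:
    # Derive a stable key from the logical operation's identity.
    # The same logical action always yields the same key, so a retry
    # deduplicates at the receiving service; a fresh random UUID per
    # attempt would defeat that. Field names are illustrative.
    material = f"{session_id}|{customer_id}|{action}|{seq}"
    return hashlib.sha256(material.encode()).hexdigest()

k1 = idempotency_key("sess-42", "cust-7", "send_email", 3)
k2 = idempotency_key("sess-42", "cust-7", "send_email", 3)  # a retry
assert k1 == k2   # same logical action, same key: retry deduplicates
assert k1 != idempotency_key("sess-42", "cust-7", "send_email", 4)
```

The sequence number matters: without it, two legitimately distinct "send_email" actions in one session would collide on the same key and the second would be silently dropped.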

Atomic writes, not appends. For file and database operations, prefer read-modify-overwrite patterns over incremental appends. Appending is non-idempotent — a retry creates duplicate data. An overwrite of the full desired state is idempotent.
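For file state, the standard idiom is write-to-temp-then-rename. A sketch, assuming a JSON state file with a hypothetical path:

```python
import json, os, tempfile

def write_state_atomically(path: str, state: dict) -> None:
    # Write the *full* desired state to a temp file, then atomically
    # swap it into place (os.replace is an atomic rename). A retry
    # rewrites the same complete state instead of appending a
    # duplicate, so the operation is idempotent.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # never leave a half-written file behind
        raise

# Running it twice leaves identical file contents -- no duplicates.
write_state_atomically("agent_state.json", {"tickets": ["T-1"]})
write_state_atomically("agent_state.json", {"tickets": ["T-1"]})
```

Compare this with open(path, "a"): two retries of an append produce two copies of the data, and no amount of downstream cleverness can tell the duplicate from a legitimate second entry.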

Soft state for irreversible actions. The architectural choice that preserves the most recovery options is designing apparent irreversibility out of your data model. Soft deletes instead of hard deletes, draft states before confirmed sends, staged commits before immediate writes — these are not defensive patterns for when things go wrong. They are the mechanism that keeps the compensation window open long enough for recovery logic to execute.

Durable State and Checkpointing

A saga that exists only in memory dies with the process. Production agent systems require durable state that persists across crashes, restarts, and timeouts.

LangGraph's checkpointing model saves workflow state after every node execution to a persistent store (PostgreSQL in production). When a workflow fails, it can resume from the last checkpoint rather than starting over. More importantly, the workflow can transition to a compensating subgraph — a parallel branch of the state machine that executes rollback operations against all previously committed steps.
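The core mechanism is simpler than the frameworks suggest. A library-agnostic sketch of checkpoint-and-resume (not LangGraph's actual API; the file-based checkpoint stands in for a PostgreSQL store):

```python
import json, os

# Library-agnostic sketch of checkpoint-and-resume. The checkpoint
# file stands in for a durable store like PostgreSQL; path and step
# functions are hypothetical.
CHECKPOINT = "workflow.ckpt"

def run_workflow(steps):
    # Resume from the last durably recorded step, if any.
    done = 0
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            done = json.load(f)["completed_steps"]
    for i, step in enumerate(steps):
        if i < done:
            continue  # already executed before the crash
        step()
        # Persist progress after *every* step, before moving on.
        with open(CHECKPOINT, "w") as f:
            json.dump({"completed_steps": i + 1}, f)

executed = []
steps = [lambda: executed.append("plan"),
         lambda: executed.append("book"),
         lambda: executed.append("notify")]
run_workflow(steps)
print(executed)  # ['plan', 'book', 'notify']
```

If the process dies between steps, the next invocation skips everything already recorded, so no forward action runs twice. A production version would write the checkpoint atomically and key it per workflow instance.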

Temporal takes this further with durable execution at the infrastructure level. Workflow code is guaranteed to complete regardless of server failures or network partitions; state is persisted automatically after every step. For Java, Temporal provides a built-in Saga library. For other languages, compensation lists are maintained manually and executed in reverse order upon failure.

The critical warning with Temporal: avoid workflow-level timeouts and avoid terminate/reset operations. Both of these prevent compensation handlers from executing, leaving the system in an inconsistent state. Let workflows complete their compensation paths.

The broader principle is that the failing agent should not be responsible for its own remediation. The same reasoning failure that caused the problem — context degradation, misinterpreted tool output, accumulated planning error — will likely cause the recovery to fail too. Recovery infrastructure must be external: a separate watchdog process, a human approval gate, a predetermined compensation graph that executes deterministically regardless of what the model thinks the right next step is.

The Outbox Pattern for Agent Decisions

There is a subtle consistency problem that affects agents making decisions with downstream consequences: the dual write problem. After an agent commits a business decision, two things must happen — update internal state and notify downstream consumers (billing system, CRM, notification service). If these are two separate operations and the process crashes between them, one side completes and the other doesn't. The agent recorded the decision but the notification never fired, or vice versa.

The transactional outbox pattern solves this by committing both the state change and an outbox record in a single atomic transaction. A separate relay process reads the outbox and publishes events to downstream consumers. Downstream consumers must handle idempotent delivery, since events may arrive more than once.

For agents specifically, this means that the decision — "charge the customer," "send the escalation," "create the ticket" — and the outbox event representing that decision are committed together. Once that transaction commits, everything downstream becomes recoverable. If the relay process crashes, it replays from the outbox. If a downstream consumer fails, it replays the event. The agent's decision is durable the moment it is committed, not the moment all downstream effects are confirmed.
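The pattern fits in a few lines against any transactional store. A sketch using SQLite, with illustrative table and column names:

```python
import sqlite3

# Sketch of the transactional outbox, using SQLite as the store.
# Table and column names are illustrative.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE decisions (id TEXT PRIMARY KEY, action TEXT)")
db.execute("CREATE TABLE outbox (event_id INTEGER PRIMARY KEY, "
           "payload TEXT, published INTEGER DEFAULT 0)")

def commit_decision(decision_id: str, action: str) -> None:
    # State change and outbox record in ONE atomic transaction:
    # either both are durable, or neither is. No dual-write gap.
    with db:
        db.execute("INSERT INTO decisions VALUES (?, ?)",
                   (decision_id, action))
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (action,))

def relay(publish) -> None:
    # A separate process in production; safe to re-run after a crash
    # because downstream consumers deduplicate by event_id.
    rows = db.execute(
        "SELECT event_id, payload FROM outbox WHERE published = 0"
    ).fetchall()
    for event_id, payload in rows:
        publish(event_id, payload)
        db.execute("UPDATE outbox SET published = 1 WHERE event_id = ?",
                   (event_id,))
    db.commit()

delivered = []
commit_decision("d-1", "charge_customer")
relay(lambda eid, p: delivered.append((eid, p)))
print(delivered)  # [(1, 'charge_customer')]
```

If the relay crashes after publishing but before marking the row published, the event is delivered again on restart, which is exactly why the downstream consumer must be idempotent.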

This transforms fragile fire-and-forget tool calls into recoverable operations. It also creates a complete audit trail — every agent decision is an immutable record, enabling forensic analysis of exactly what the agent decided and when.

Orchestration vs. Choreography for Compensation

When a saga spans multiple agents, there is a structural choice about where compensation authority lives.

In an orchestrated saga, a central coordinator agent knows the full workflow, manages state, and initiates compensating transactions when a step fails. The orchestrator can halt the entire chain and trigger compensation across all agents. Debugging is straightforward because there is a single execution trace. The tradeoff is a single point of failure: if the orchestrator crashes or makes an incorrect decision, the whole workflow is blocked.

In a choreographed saga, each agent reacts to events without central coordination. An agent completes its step and publishes an event; the next agent picks up that event and proceeds. Compensation requires each agent to know its own rollback logic and to publish failure events that trigger upstream agents to compensate. Debugging requires reconstructing execution order from a distributed event stream.

For workflows where compensation is critical and failure modes are predictable — payment processing, booking systems, any workflow touching financial state — orchestration is almost always the right choice. The centralized compensation authority is worth the coordination overhead.

Research on multi-agent error dynamics (Google DeepMind, December 2025) found that unstructured multi-agent networks without deliberate coordination topology amplify errors 17x compared to single-agent baselines. Choreography-based systems, where each agent acts autonomously without coordination structure, tend toward these unstructured topologies. When each agent compensates independently without a coordinator verifying the overall state, cascading partial compensations can leave the system in a worse state than the original failure.

Designing for the Failure Point

The common thread across these patterns is that reliability for agentic systems requires engineering choices that precede the first tool call, not responses to failures after they occur.

Before any workflow with irreversible actions runs:

  • Every tool with side effects needs a defined compensating operation
  • Compensations must be registered before their corresponding forward operations execute
  • All compensating operations must be idempotent
  • Workflow state must be durable so compensation can survive a crash

Before any tool with network latency is called:

  • An idempotency key must be generated deterministically from stable operation identifiers
  • The receiving service must implement idempotency key deduplication
  • Retry logic must pass the same key, not generate a new one

Before granting an agent production access:

  • Separate development and production contexts
  • Apply least-privilege permissions to limit blast radius
  • Define which operations require human approval before execution

The Replit incident, the multi-agent system that burned $47,000 in API calls over 11 days before anyone noticed, the billing workflow generating 200+ duplicate notifications per week — these are not exotic failure modes. They are predictable consequences of deploying agents without the failure-recovery infrastructure that their tool access requires.

Tool calls are transactions. Transactions can fail. Failed transactions in systems without compensation logic leave permanent state changes with no path to recovery. The pattern to adopt is well-established in distributed systems: define the undo before you execute the do, make the undo idempotent, and build the infrastructure to execute it automatically when the forward path fails.
