
Agent Disaster Recovery: When Working Memory Dies With the Region

12 min read
Tian Pan
Software Engineer

The DR runbook your team rehearses every quarter was written for a stack you no longer fully run. It says: promote the replica, repoint DNS, drain the queue. It assumes state lives in databases, queues, and object storage — places the SRE org has owned, named, and tested for a decade. Then last quarter you shipped an agent. Working memory now lives in the inference provider's session cache, scratchpad files on a worker's local disk, in-flight tool results that haven't been written back, and a partial plan-and-act trace that exists only in the prompt history of one model call. None of that is on the asset register. None of it is in the runbook.

When the region drops, the agent doesn't fail cleanly. It half-completes. The user sees a workflow that started but cannot be resumed in the failover region, the customer's invoice gets sent twice or not at all because the idempotency key lived on the dead worker, and the on-call engineer reads a Slack thread that begins "the orchestrator is up, but..." and ends six hours later with a credit-card chargeback queue.

This is the gap nobody named: agentic features have a state model the existing DR plan doesn't describe. The team that hasn't written that state surface down is one regional outage away from learning what their runbook's silence costs.

The State Surface Your Runbook Doesn't Cover

Walk through where an agent's state actually lives, by category, and ask which entries in your DR runbook cover them.

The first surface is working memory in the inference provider. Modern providers cache prompt prefixes, tool schemas, and session state to keep input-token costs down. A long-running agent leans on those caches; some providers even expose session affinity headers so requests with the same identifier route to the same model instance. When the region the agent is bound to fails, the cache state is lost. The failover region can serve traffic, but it serves it cold — and any agent logic that assumed conversational continuity now sees a stranger.

The second surface is the worker's local disk. The agent writes scratchpad files for plan-and-act traces, intermediate tool outputs that didn't fit in context, downloaded artifacts it was about to upload elsewhere. Local disk is faster than durable storage and cheaper than putting every artifact in S3. It is also gone the instant the worker dies. A failover region cannot read it.

The third surface is in-flight tool results. The agent called a payment API three minutes ago. The tool result came back. The result was passed to the model, which decided what to do next — but that decision and the tool result it depended on exist only inside an open inference call. The model returns. The worker accepting that return is dead. The result was never persisted. The failover worker has no idea the payment ran.

The fourth surface is the partial plan-and-act trace. The agent has executed three of seven planned steps. The plan exists as text in the prompt history, accumulated turn by turn. There is no structured representation of "what's done, what's next, what was decided." If a different worker tries to resume, it has to re-derive the plan from scratch — and a different model call, even with the same inputs, may decide differently than the first.
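To make that gap concrete, here is a minimal sketch of a structured step ledger the orchestrator could keep alongside the prompt history, so a resuming worker reads "what's done, what's next" instead of re-deriving it from prose. The names (`PlanStep`, `TaskPlan`) are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field
from enum import Enum


class StepStatus(Enum):
    PENDING = "pending"
    IN_FLIGHT = "in_flight"
    DONE = "done"


@dataclass
class PlanStep:
    ordinal: int
    action: str                      # e.g. "search_flights"
    status: StepStatus = StepStatus.PENDING
    result_ref: str | None = None    # pointer to a durably stored tool result


@dataclass
class TaskPlan:
    task_id: str
    steps: list[PlanStep] = field(default_factory=list)

    def next_step(self) -> PlanStep | None:
        """What a resuming worker executes next -- no re-deriving from prompt text."""
        return next((s for s in self.steps if s.status != StepStatus.DONE), None)
```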

Each of these surfaces is invisible to the SRE org because each was introduced as an implementation detail of the agent runtime, not as a stateful service in its own right. The DR plan can't fail over what it can't name.

Idempotency at the Task Level, Not the Request Level

Most service frameworks give you request-level idempotency: an HTTP request comes in with a key, the handler runs, the result gets cached, replays return the same response. That primitive doesn't extend to agents because an agent's "task" is not a request. It's a sequence of model calls and tool invocations that may take minutes or hours, span retries, and produce side effects at multiple steps along the way.

The discipline that has to land is idempotency keys generated at task creation, not within the agent. When the user submits "book the flight and email the itinerary," the orchestrator mints a task ID before the agent does anything. Every tool call the agent emits inherits that task ID and a step ordinal: task=abc, step=1, action=search_flights. The payment API and the email service are configured to deduplicate on (task_id, step_ordinal). If the agent dies after step 3 and the failover region picks up the task at step 3, the deduplication key is identical — the second attempt either no-ops or returns the cached result.

This sounds simple and it is, in principle. The trap is that practitioners ship agents where the idempotency key is generated inside the agent's plan-and-act loop — sampled from the model output, or hashed from the current prompt, or derived from the call's wall clock time. All three break under failover because the resumed task will produce a different key. The team finds out when the customer is billed twice.

The fix is to treat the task ID and the step ordinal as orchestrator-owned primitives the agent must use but cannot generate. The orchestrator hands the agent a key for each tool slot before the call is made. If the agent crashes mid-call, the resumer reuses the same key. The downstream API enforces dedup. The customer gets billed once.
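A minimal sketch of that contract, with a stand-in `PaymentClient` for any downstream API that deduplicates on an idempotency key. The key format and class names are illustrative, not a specific vendor's API:

```python
import uuid


def mint_task_id() -> str:
    # Minted once, at task creation -- never inside the agent loop.
    return f"task-{uuid.uuid4()}"


def step_key(task_id: str, step_ordinal: int) -> str:
    # Deterministic: a resumed task at the same step produces the same key.
    return f"{task_id}:step-{step_ordinal}"


class PaymentClient:
    """Stand-in for a downstream API that deduplicates on an idempotency key."""

    def __init__(self):
        self._seen: dict[str, dict] = {}

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]      # replay no-ops, returns cached result
        result = {"status": "charged", "amount_cents": amount_cents}
        self._seen[idempotency_key] = result
        return result


# The orchestrator hands the agent a key per tool slot; the agent cannot mint its own.
task_id = mint_task_id()
key = step_key(task_id, step_ordinal=3)
PaymentClient().charge(key, amount_cents=47000)   # safe to retry after failover
```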

Checkpoint Before Every Tool Call, Not at Session End

The naive checkpoint cadence is "save state when the session ends." That is the cadence chat-style products use, and it works for chat because the state is a transcript with no external side effects. It fails for agents because agents emit irreversible side effects mid-session — payments, emails, database writes, tickets opened — and the state between two side effects is the state you actually need to recover.

The correct cadence is durable checkpoints written before every tool call. Right before the agent invokes a tool, the orchestrator persists: the current plan, the prompt prefix, the tool name, the arguments, the idempotency key, and the step ordinal. If the worker dies at any point — before the call, mid-call, after the call but before the result is processed — the failover worker has enough state to either resume or, if the side effect's status is unknown, reconcile by querying the downstream API with the idempotency key.
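A sketch of that checkpoint write and the matching recovery read, assuming a durable, cross-region-replicated store behind a generic `put`/`get` interface; the interface and field names are illustrative:

```python
import json
import time


def checkpoint_before_tool_call(store, task_id, step_ordinal, plan, tool_name, args, idem_key):
    """Persist everything a failover worker needs *before* the side effect fires.

    `store` is any durable, cross-region-replicated KV/object store -- the choice
    (replicated Postgres, a global-table KV, object storage with replication) is up to the team.
    """
    record = {
        "task_id": task_id,
        "step_ordinal": step_ordinal,
        "plan": plan,                 # current structured plan
        "tool_name": tool_name,
        "args": args,
        "idempotency_key": idem_key,
        "status": "dispatched",       # written before the call, so outcome is unknown
        "written_at": time.time(),
    }
    store.put(f"{task_id}/checkpoints/{step_ordinal}", json.dumps(record))
    return record


def recover(store, task_id, last_ordinal):
    """On failover: read the last checkpoint and decide resume vs reconcile."""
    ckpt = json.loads(store.get(f"{task_id}/checkpoints/{last_ordinal}"))
    if ckpt["status"] == "dispatched":
        # Side-effect status unknown: query the downstream API with the
        # idempotency key before doing anything else.
        return ("reconcile", ckpt["idempotency_key"])
    return ("resume", ckpt)
```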

This pattern is what durable-execution frameworks like Temporal, Restate, and Azure's Durable Task runtime were built for, and the agent-platform vendors have started to converge on it. LangGraph's durable execution mode checkpoints every state transition; Temporal's agent SDKs make every tool call a "step" that is automatically recorded. Microsoft's Foundry Agent Service publishes explicit failover guidance: replicate the agent's persistent state across regions and document the manual steps to switch. The pattern is the same wherever you find it: the unit of durability is the tool call, not the session.

The cost frame nobody surfaces is that durable agent state is checkpoint overhead the team always under-budgets. Each pre-call checkpoint adds latency (tens to a hundred milliseconds depending on storage choice) and storage cost (every step written, kept for some retention window). The team that doesn't budget this in the design phase ships an agent that hits its latency SLO in dev and busts it in prod, then degrades the checkpoint frequency to "every N steps," then has its first regional incident and learns that the lost steps were the expensive ones.
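A back-of-envelope budget makes the overhead concrete; the numbers below are illustrative placeholders, not benchmarks:

```python
# Illustrative numbers only -- substitute your own measurements.
steps_per_task = 20
checkpoint_latency_ms = 60          # p50 write to a replicated store
checkpoint_size_kb = 8              # plan + args + prompt-prefix pointer
tasks_per_day = 50_000
retention_days = 30

added_latency_s = steps_per_task * checkpoint_latency_ms / 1000          # 1.2 s per task
daily_writes = steps_per_task * tasks_per_day                            # 1,000,000 writes/day
stored_gb = daily_writes * checkpoint_size_kb * retention_days / 1e6     # ~240 GB retained
print(added_latency_s, daily_writes, stored_gb)
```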

Failover Policy: Fail-Safe Abort Beats Fail-Forward Replay

When a partially completed agent task surfaces in the failover region, the orchestrator has two choices: replay the task forward from the last checkpoint, or abort the task and notify the user.

Replay is what frameworks default to and what looks elegant on a slide. The agent's last good state was step 4. The failover worker reads step 4, calls the model with the resumed prompt, executes step 5. Beautiful. The catch is that step 5's input depends on the result of step 4 — and if step 4 was an external side effect whose status is uncertain, replay makes assumptions the team will not vouch for at 3 a.m.

For agents whose side effects are external and irreversible, the safer policy is fail-safe abort with user notification. The failover region observes a half-finished task whose last action was "charged customer $470." Rather than auto-resuming and risking a second charge or a missed follow-up, the orchestrator marks the task aborted, surfaces a clear message to the user ("we encountered an issue completing your booking; here's what we know was charged; please confirm before we proceed"), and writes the partial state to a reconciliation queue for an operator to triage.

Replay is appropriate for tasks whose side effects are genuinely idempotent end-to-end, where step ordinals plus dedup keys make double-execution impossible, and where the team has chaos-tested that property. Replay is not appropriate as the default for any task that touches money, messaging, or third-party APIs the team doesn't control, because the team that defaults to replay finds out about the bad cases through customer support, not through their own monitoring.

The policy decision belongs in the agent's task-type metadata: each task type declares whether resume-on-failover is safe. New task types start as fail-safe-abort and graduate to replay only after they have an end-to-end idempotency proof and a chaos drill that demonstrated it.
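A sketch of what that metadata can look like; the task-type names, fields, and proof link are hypothetical:

```python
from dataclasses import dataclass
from enum import Enum


class FailoverPolicy(Enum):
    FAIL_SAFE_ABORT = "abort_and_notify"   # default for new task types
    REPLAY = "resume_from_checkpoint"      # only after idempotency proof + chaos drill


@dataclass
class TaskType:
    name: str
    failover_policy: FailoverPolicy = FailoverPolicy.FAIL_SAFE_ABORT
    idempotency_proof: str | None = None    # link to the end-to-end proof / drill report


TASK_TYPES = {
    "book_flight_and_email": TaskType("book_flight_and_email"),   # aborts by default
    "refresh_search_index": TaskType(
        "refresh_search_index",
        failover_policy=FailoverPolicy.REPLAY,
        idempotency_proof="https://wiki.example.internal/drills/2024-q3",  # hypothetical
    ),
}


def on_failover(task_type_name: str) -> str:
    policy = TASK_TYPES[task_type_name].failover_policy
    if policy is FailoverPolicy.REPLAY:
        return "replay_from_last_checkpoint"
    return "abort_notify_user_and_queue_reconciliation"
```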

The Drill: Chaos-Engineering Mid-Tool-Call

The DR plan that hasn't been drilled is a hope, not a plan. For agents, the right drill is one that maps to the failure mode the runbook silence is hiding: kill the worker mid-tool-call.

Concretely, the drill stages a representative agent task — one that touches at least one external side effect — and at a randomized point during execution, terminates the worker process. The orchestrator has to recover the task, and the assertion is binary: the recovery path produces either a single completion or a clean abort. Never a double side effect, never a silent partial.
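A sketch of the drill as a test, assuming a hypothetical orchestrator and a downstream test double; the method names are placeholders for whatever your stack actually exposes:

```python
import random


def drill_kill_mid_tool_call(orchestrator, downstream):
    """Chaos drill: kill the worker at a random point during a representative task.

    `orchestrator` and `downstream` are stand-ins; the assertion is deliberately binary.
    """
    task = orchestrator.submit("book_flight_and_email", user="drill-user")

    # Kill the worker at a random step, ideally while a tool call is outstanding.
    kill_at = random.randint(1, task.planned_steps)
    orchestrator.kill_worker_at_step(task.id, step=kill_at)

    orchestrator.run_failover_recovery(task.id)
    final = orchestrator.wait_for_terminal_state(task.id, timeout_s=300)

    side_effects = downstream.count_side_effects(task.id)
    if final == "completed":
        assert side_effects == task.planned_side_effects      # exactly once, never twice
    elif final == "aborted":
        assert downstream.reconciliation_queue_has(task.id)   # partial state was surfaced
    else:
        raise AssertionError(f"silent partial: task ended in state {final!r}")
```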

This pairs with broader region-failover testing. AWS's Fault Injection Service offers cross-region scenarios that can sever connectivity to a primary region; the Application Recovery Controller (ARC) orchestrates the regional switch; the agent platform's job is to be the layer that survives that switch with task-level guarantees. The drill that reveals the most is the one where the worker dies during a tool call whose outcome is uncertain — because that is the case the production code path was not designed for, and that is the case the regional outage will produce.

A related drill that practitioners under-run is prompt-cache eviction. The agent's latency budget assumes a warm prompt cache in the inference provider's region. Failover lands the agent in a cold region. Latency triples. Tool-call timeouts that worked in the warm region now fire in the cold region, producing spurious retries on top of an already-degraded path. Drill it: warm a cache, force the failover, measure the cold-start latency profile, and confirm timeouts and retry budgets are tuned for the cold case, not the warm one.
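A sketch of that drill; `run_task_warm_region` and `run_task_cold_region` stand in for however you execute the representative task against each region:

```python
import statistics
import time


def measure_latency_profile(run_task, n=20):
    """Run the same representative task n times and return (p50, p95) in seconds."""
    samples = []
    for _ in range(n):
        start = time.monotonic()
        run_task()
        samples.append(time.monotonic() - start)
    samples.sort()
    return statistics.median(samples), samples[int(0.95 * (n - 1))]


def drill_cold_cache(run_task_warm_region, run_task_cold_region, tool_call_timeout_s):
    warm_p50, warm_p95 = measure_latency_profile(run_task_warm_region)
    cold_p50, cold_p95 = measure_latency_profile(run_task_cold_region)
    print(f"warm p50/p95: {warm_p50:.1f}/{warm_p95:.1f}s  cold p50/p95: {cold_p50:.1f}/{cold_p95:.1f}s")
    # Timeouts and retry budgets must clear the *cold* p95, not the warm one.
    assert tool_call_timeout_s > cold_p95, "timeout tuned for the warm region will fire spuriously after failover"
```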

The Org Failure Mode: SRE and the AI Team Don't Share an Asset Register

The architectural problem has an organizational shadow. The SRE org owns DR for services. The AI team owns the agent runtime. The gap between them is the worker's local scratch directory, the inference provider's session cache, and the in-flight tool results — none of which were on the asset register when SRE wrote the DR plan, and none of which the AI team thought of as state because they thought of them as implementation details.

The fix is mechanical: a joint review where the agent runtime's state surfaces are added to the same asset register as databases, queues, and object stores. Each surface gets a row: where does it live, what's its replication strategy, what is the RPO and RTO, what's the failover procedure, who owns the runbook entry. Surfaces that fail this review get either elevated to durable status (move scratchpad to object storage, replicate session state) or explicitly demoted to "ephemeral, lost on failover" with an agent-level policy that handles tasks affected by that loss.
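One way to make each row concrete is to keep the register as structured data rather than prose; the fields and example entries below are illustrative:

```python
from dataclasses import dataclass


@dataclass
class StateSurface:
    name: str
    lives_in: str
    replication: str         # "none", "async cross-region", "object-store replication", ...
    rpo: str                 # how much of this state you accept losing
    rto: str                 # how long until it is usable in the failover region
    failover_procedure: str
    runbook_owner: str


REGISTER = [
    StateSurface("agent scratchpad", "worker local disk", "none",
                 rpo="entire contents", rto="n/a",
                 failover_procedure="declared ephemeral; affected tasks abort and reconcile",
                 runbook_owner="ai-platform"),
    StateSurface("task checkpoints", "replicated KV store", "async cross-region",
                 rpo="< 1 min", rto="< 5 min",
                 failover_procedure="promote replica, resume per task-type policy",
                 runbook_owner="sre"),
]
```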

The cost frame here is that this work feels like SRE process overhead to the AI team and AI implementation detail to the SRE team, so neither org reaches for it. It gets done after the first regional incident produces a six-figure customer-credit bill — the same tax every organization pays for every state surface it didn't put on the register the first time.

Closing: Name the State Surface Before the Outage Names It For You

The DR plan you have works for the stack you used to run. The agent your team shipped last quarter introduced state surfaces the plan doesn't describe, and those surfaces are now the dominant failure mode of any regional outage scenario. The order of operations is fixed: name the state, choose a durability strategy per surface, generate idempotency keys at task scope, checkpoint before every external side effect, default to fail-safe abort, and drill the worker-death-mid-tool-call case until the recovery path is binary.

None of this requires a new framework. The durable-execution patterns are decades old; what's new is applying them to a runtime whose state model the team didn't fully realize they were running. The team that names the state surface owns the runbook for it. The team that doesn't will read about it in a customer-support ticket.
