
Agent Disaster Recovery: When Working Memory Dies With the Region

· 12 min read
Tian Pan
Software Engineer

The DR runbook your team rehearses every quarter was written for a stack you no longer fully run. It says: promote the replica, repoint DNS, drain the queue. It assumes state lives in databases, queues, and object storage — places the SRE org has owned, named, and tested for a decade. Then last quarter you shipped an agent. Working memory now lives in the inference provider's session cache, scratchpad files on a worker's local disk, in-flight tool results that haven't been written back, and a partial plan-and-act trace that exists only in the prompt history of one model call. None of that is on the asset register. None of it is in the runbook.

When the region drops, the agent doesn't fail cleanly. It half-completes. The user sees a workflow that started but that the failover region cannot resume. The customer's invoice gets sent twice or not at all because the idempotency key lived on the dead worker. The on-call engineer reads a Slack thread that begins "the orchestrator is up, but..." and ends six hours later with a credit-card chargeback queue.

This is the gap nobody named: agentic features have a state model the existing DR plan doesn't describe. The team that hasn't written that state surface down is one regional outage away from learning what their runbook's silence costs.

The State Surface Your Runbook Doesn't Cover

Walk through where an agent's state actually lives, by category, and ask which entries in your DR runbook cover them.

The first surface is working memory in the inference provider. Modern providers cache prompt prefixes, tool schemas, and session state to keep input-token costs down. A long-running agent leans on those caches; some providers even expose session affinity headers so requests with the same identifier route to the same model instance. When the region the agent is bound to fails, the cache state is lost. The failover region can serve traffic, but it serves it cold — and any agent logic that assumed conversational continuity now sees a stranger.

The second surface is the worker's local disk. The agent writes scratchpad files for plan-and-act traces, intermediate tool outputs that didn't fit in context, and downloaded artifacts it was about to upload elsewhere. Local disk is faster than durable storage and cheaper than putting every artifact in S3. It is also gone the instant the worker dies. A failover region cannot read it.

The third surface is in-flight tool results. The agent called a payment API three minutes ago. The tool result came back. The result was passed to the model, which decided what to do next, but that decision and the tool result it depended on exist only inside an open inference call. The model returns, but the worker that should accept that return is dead. The result was never persisted. The failover worker has no idea the payment ran.

The fourth surface is the partial plan-and-act trace. The agent has executed three of seven planned steps. The plan exists as text in the prompt history, accumulated turn by turn. There is no structured representation of "what's done, what's next, what was decided." If a different worker tries to resume, it has to re-derive the plan from scratch — and a different model call, even with the same inputs, may decide differently than the first.

Each of these surfaces is invisible to the SRE org because each was introduced as an implementation detail of the agent runtime, not as a stateful service in its own right. The DR plan can't fail over what it can't name.
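To make the fourth surface concrete, here is a minimal sketch of what a structured record of "what's done, what's next, what was decided" could look like. The names PlanStep and PlanTrace are hypothetical, invented for illustration, and the sketch assumes the serialized record would be written to some durable store rather than kept on the worker.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PlanStep:
    ordinal: int                    # position in the plan, starting at 1
    action: str                     # e.g. "search_flights"
    status: str = "pending"         # "pending" or "done"
    decision: Optional[str] = None  # what the agent decided at this step


@dataclass
class PlanTrace:
    """Hypothetical structured plan, kept outside the prompt history."""
    task_id: str
    steps: List[PlanStep] = field(default_factory=list)

    def mark_done(self, ordinal: int, decision: str) -> None:
        step = self.steps[ordinal - 1]
        step.status = "done"
        step.decision = decision

    def next_step(self) -> Optional[PlanStep]:
        # First step that has not completed, or None if the plan is finished.
        return next((s for s in self.steps if s.status != "done"), None)

    def to_json(self) -> str:
        # Serialized form a durable store could hold (storage is assumed).
        return json.dumps(asdict(self))


# Seven planned steps, three executed, as in the scenario above.
trace = PlanTrace("abc", [PlanStep(i, f"step_{i}") for i in range(1, 8)])
for done in (1, 2, 3):
    trace.mark_done(done, decision="ok")
```

With a record like this, a different worker can call next_step() and learn that step 4 is next without asking a model to re-derive the plan.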

Idempotency at the Task Level, Not the Request Level

Most service frameworks give you request-level idempotency: an HTTP request comes in with a key, the handler runs, the result gets cached, replays return the same response. That primitive doesn't extend to agents because an agent's "task" is not a request. It's a sequence of model calls and tool invocations that may take minutes or hours, span retries, and produce side effects at multiple steps along the way.

The discipline that has to land is idempotency keys generated at task creation, not within the agent. When the user submits "book the flight and email the itinerary," the orchestrator mints a task ID before the agent does anything. Every tool call the agent emits inherits that task ID and a step ordinal: task=abc, step=1, action=search_flights. The payment API and the email service are configured to deduplicate on (task_id, step_ordinal). If the agent dies after step 3 and the failover region picks up the task at step 3, the deduplication key is identical — the second attempt either no-ops or returns the cached result.
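A minimal sketch of the downstream side of that contract follows. DedupingTool and charge_card are hypothetical stand-ins for a real payment API, and the in-memory dictionary stands in for whatever durable dedup table the real service would use.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple


@dataclass
class DedupingTool:
    """Hypothetical downstream tool that deduplicates on (task_id, step_ordinal)."""
    action: Callable[..., Any]
    _results: Dict[Tuple[str, int], Any] = field(default_factory=dict)
    executions: int = 0  # how many times the side effect actually ran

    def call(self, task_id: str, step_ordinal: int, **kwargs: Any) -> Any:
        key = (task_id, step_ordinal)
        if key in self._results:
            # Replay after failover: return the cached result, no side effect.
            return self._results[key]
        self.executions += 1
        result = self.action(**kwargs)
        self._results[key] = result
        return result


def charge_card(amount_cents: int) -> dict:
    # Stand-in for the real side effect.
    return {"status": "charged", "amount_cents": amount_cents}


payments = DedupingTool(action=charge_card)
first = payments.call("abc", 3, amount_cents=4200)   # original worker
replay = payments.call("abc", 3, amount_cents=4200)  # failover worker, same key
```

The replay returns the same result as the first call, and the side effect runs once.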

This sounds simple and it is, in principle. The trap is that practitioners ship agents where the idempotency key is generated inside the agent's plan-and-act loop: sampled from the model output, hashed from the current prompt, or derived from the call's wall-clock time. All three break under failover because the resumed task will produce a different key. The team finds out when the customer is billed twice.

The fix is to treat the task ID and the step ordinal as orchestrator-owned primitives the agent must use but cannot generate. The orchestrator hands the agent a key for each tool slot before the call is made. If the agent crashes mid-call, the resumer reuses the same key. The downstream API enforces dedup. The customer gets billed once.
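The difference between the two approaches can be sketched in a few lines. ToolSlot, issue_slot, and fragile_key are hypothetical names, and hashing the task ID and ordinal with SHA-256 is just one assumed way to build a stable key. Any deterministic function of those two values would do.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSlot:
    """Hypothetical slot the orchestrator hands to the agent before a tool call."""
    task_id: str
    step_ordinal: int

    @property
    def idempotency_key(self) -> str:
        # Depends only on orchestrator-owned values, so it survives failover.
        raw = f"{self.task_id}:{self.step_ordinal}".encode()
        return hashlib.sha256(raw).hexdigest()


def issue_slot(task_id: str, step_ordinal: int) -> ToolSlot:
    # Orchestrator side: the agent receives the slot, it never builds one.
    return ToolSlot(task_id, step_ordinal)


def fragile_key(prompt: str, wall_clock: float) -> str:
    """Anti-pattern: a key derived inside the agent loop from prompt and time."""
    return hashlib.sha256(f"{prompt}|{wall_clock}".encode()).hexdigest()


# Original worker and resumer ask the orchestrator for the same slot.
original = issue_slot("abc", 3)
resumed = issue_slot("abc", 3)
same_key = original.idempotency_key == resumed.idempotency_key

# The resumed call happens later, so a time-derived key changes.
before = fragile_key("charge the card", wall_clock=1000.0)
after = fragile_key("charge the card", wall_clock=1360.0)
```

The frozen dataclass is a small design choice here: the agent can read the key but cannot mutate the task ID or ordinal it was handed.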

Checkpoint Before Every Tool Call, Not at Session End

The naive checkpoint cadence is "save state when the session ends." That is the cadence chat-style products use, and it works for chat because the state is a transcript with no external side effects. It fails for agents because agents emit irreversible side effects mid-session — payments, emails, database writes, tickets opened — and the state between two side effects is the state you actually need to recover.

The correct cadence is durable checkpoints written before every tool call. Right before the agent invokes a tool, the orchestrator persists: the current plan, the prompt prefix, the tool name, the arguments, the idempotency key, and the step ordinal. If the worker dies at any point — before the call, mid-call, after the call but before the result is processed — the failover worker has enough state to either resume or, if the side effect's status is unknown, reconcile by querying the downstream API with the idempotency key.
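The cadence can be sketched as follows. Everything named here is hypothetical: CheckpointStore is an in-memory stand-in for a durable, cross-region store, and charge and lookup imitate a downstream API that can be queried by idempotency key. The demo simulates a worker that dies after the payment ran but before the result was recorded.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Optional


@dataclass
class Checkpoint:
    task_id: str
    step_ordinal: int
    plan: list
    prompt_prefix: str
    tool_name: str
    arguments: dict
    idempotency_key: str
    status: str = "pending"       # "pending" until the result is recorded
    result: Optional[dict] = None


class CheckpointStore:
    """In-memory stand-in for a durable store (assumption for this sketch)."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def save(self, cp: Checkpoint) -> None:
        self._rows[cp.idempotency_key] = json.dumps(asdict(cp))

    def load(self, key: str) -> Checkpoint:
        return Checkpoint(**json.loads(self._rows[key]))


def run_step(store: CheckpointStore, cp: Checkpoint, tool: Callable) -> dict:
    store.save(cp)  # persist BEFORE the side effect
    cp.result = tool(cp.idempotency_key, **cp.arguments)
    cp.status = "done"
    store.save(cp)
    return cp.result


def resume_step(store: CheckpointStore, key: str, tool: Callable,
                lookup: Callable) -> dict:
    cp = store.load(key)
    if cp.status == "done":
        return cp.result
    # Status unknown: ask the downstream API before calling the tool again.
    existing = lookup(key)
    cp.result = existing if existing is not None else tool(key, **cp.arguments)
    cp.status = "done"
    store.save(cp)
    return cp.result


ledger: Dict[str, dict] = {}  # fake downstream payment records


def charge(key: str, amount_cents: int) -> dict:
    return ledger.setdefault(key, {"status": "charged", "amount_cents": amount_cents})


def lookup(key: str) -> Optional[dict]:
    return ledger.get(key)


store = CheckpointStore()
cp = Checkpoint("abc", 3, ["search", "hold", "charge"], "You are a travel agent.",
                "charge", {"amount_cents": 4200}, "abc:3")
store.save(cp)                              # checkpoint written
charge(cp.idempotency_key, **cp.arguments)  # payment runs, then the worker dies
recovered = resume_step(store, "abc:3", charge, lookup)  # failover worker
```

The failover worker finds a pending checkpoint, learns from the lookup that the payment already ran, and records the result without charging again.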
