One post tagged with "disaster-recovery"

Agent Disaster Recovery: When Working Memory Dies With the Region

April 28, 2026 · 12 min read

Software Engineer

The DR runbook your team rehearses every quarter was written for a stack you no longer fully run. It says: promote the replica, repoint DNS, drain the queue. It assumes state lives in databases, queues, and object storage — places the SRE org has owned, named, and tested for a decade. Then last quarter you shipped an agent. Working memory now lives in the inference provider's session cache, scratchpad files on a worker's local disk, in-flight tool results that haven't been written back, and a partial plan-and-act trace that exists only in the prompt history of one model call. None of that is on the asset register. None of it is in the runbook.

When the region drops, the agent doesn't fail cleanly. It half-completes. The user sees a workflow that started but the failover region cannot resume, the customer's invoice gets sent twice or not at all because the idempotency key lived on the dead worker, and the on-call engineer reads a Slack thread that begins "the orchestrator is up, but..." and ends six hours later with a credit-card chargeback queue.

This is the gap nobody named: agentic features have a state model the existing DR plan doesn't describe. The team that hasn't written that state surface down is one regional outage away from learning what their runbook's silence costs.

About Tian Pan