Write-Ahead Logging for AI Agents: Borrowing Database Recovery Patterns for Crash-Safe Execution
Your agent is on step 7 of a 12-step workflow — it has already queried three APIs, written two files, and sent a Slack notification — when the process crashes. What happens next? If your answer is "restart from step 1," you're about to re-send that Slack message, re-write those files, and burn through your LLM token budget a second time. Databases solved this exact problem decades ago with write-ahead logging. The pattern translates to agent architectures with surprising fidelity.
The core insight is simple: before an agent executes any step, it records what it intends to do. Before it moves on, it records what happened. This append-only log becomes the single source of truth for recovery — not the agent's in-memory state, not a snapshot of the world, but a sequential record of intentions and outcomes that can be replayed deterministically.
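To make the intent-then-outcome discipline concrete, here is a minimal sketch. The `Journal` class, the JSON-lines file format, and the `run_step` helper are all illustrative assumptions, not part of any particular framework.

```python
import json
import os
import tempfile


class Journal:
    """Append-only execution journal stored as JSON lines (hypothetical format)."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Append one record and force it to disk before returning,
        # so the entry survives a crash of this process.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def read(self):
        # Return every record in the order it was written.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


def run_step(journal, step_id, action, args):
    # 1. Record what we intend to do, before doing it.
    journal.append({"step": step_id, "type": "intent", "args": args})
    # 2. Perform the action.
    result = action(**args)
    # 3. Record what happened, before moving on.
    journal.append({"step": step_id, "type": "outcome", "result": result})
    return result


path = os.path.join(tempfile.mkdtemp(), "journal.jsonl")
journal = Journal(path)
total = run_step(journal, 1, lambda a, b: a + b, {"a": 2, "b": 3})
records = journal.read()
```

After the call, the journal holds an intent record followed by an outcome record for step 1. The `fsync` on every append is the simplest way to get durability; a real system would likely batch writes.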
The Durability Gap in Agent Architectures
Most agent frameworks treat execution as ephemeral. A function runs, calls an LLM, invokes some tools, returns a result. If anything fails mid-execution, the framework retries the entire function from scratch. This works fine for stateless request-response patterns, but agents are not stateless.
A typical production agent workflow involves:
- Multiple LLM calls that build on each other's outputs
- Tool invocations with real-world side effects (sending emails, creating database records, calling external APIs)
- Branching logic that depends on intermediate results
- Human-in-the-loop approval gates that pause execution for hours or days
The compound reliability problem is brutal. If each step in a 10-step workflow has 99% reliability, the overall success rate drops to 90.4%. At 95% per step, you're at 59.9% for the full workflow. And that assumes failures are clean — in practice, partial failures are worse. They leave the system in an inconsistent state where some side effects have fired and others haven't.
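The arithmetic behind those figures is a single exponentiation, assuming step failures are independent. The helper name below is illustrative.

```python
def workflow_success_rate(per_step_reliability, steps):
    # Assumes each step fails independently of the others.
    return per_step_reliability ** steps


rate_99 = workflow_success_rate(0.99, 10)
rate_95 = workflow_success_rate(0.95, 10)

print(f"{rate_99:.1%}")  # 90.4%
print(f"{rate_95:.1%}")  # 59.9%
```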
Retry-from-scratch doesn't just waste tokens; it creates duplicate side effects. Your agent sends the email twice, creates duplicate records, charges the customer again. The database world has had an answer to this for decades. It's time the agent world caught up.
How Write-Ahead Logging Works (and Why It Maps to Agents)
In a database, WAL follows a simple discipline: before modifying any data page on disk, write the intended change to a sequential log. The log is append-only, durable, and ordered. If the database crashes mid-transaction, recovery reads the log forward, reapplies the changes of transactions that committed, and rolls back those that never did.
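A toy model of that recovery pass, under heavy simplifying assumptions: the log is a Python list, the "database" is a dict, and recovery is redo-only (uncommitted changes are simply never applied, where a real engine would also have to undo changes that had already reached disk). The record layout is invented for illustration.

```python
def recover(log):
    """Rebuild key-value state from a write-ahead log.

    Toy redo-only recovery: changes from transactions that have a commit
    record are reapplied in log order; all other changes are ignored.
    """
    committed = {rec["tx"] for rec in log if rec["op"] == "commit"}
    state = {}
    for rec in log:
        if rec["op"] == "set" and rec["tx"] in committed:
            state[rec["key"]] = rec["value"]
    return state


log = [
    {"op": "set", "tx": 1, "key": "balance", "value": 100},
    {"op": "commit", "tx": 1},
    {"op": "set", "tx": 2, "key": "balance", "value": 40},
    # Crash happens here: transaction 2 never wrote its commit record.
]

state = recover(log)
```

Transaction 2's write is discarded because the log contains no commit record for it, so the recovered balance is the value written by transaction 1.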
The mapping to agent execution is direct:
- Data pages become agent state (the accumulated context, intermediate results, and memory that the agent has built up)
- Transactions become workflow steps (each tool call, LLM invocation, or decision point)
- The log becomes an execution journal that records both the intent and the outcome of each step
The critical property is the same: the log is written before the action is taken, and the outcome is recorded before the agent moves to the next step. This gives you three recovery capabilities that retry-from-scratch cannot provide.
First, skip re-execution of completed steps. If step 5 already has an outcome recorded in the journal, you don't re-execute it; you return the recorded result and continue from step 6. This is how Temporal's event-sourcing model works: it replays the event history to reconstitute application state, using stored return values instead of re-executing activities.
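A sketch of that first capability, using a hand-rolled in-memory journal rather than Temporal's actual API. `InMemoryJournal`, `durable_step`, and `fetch_price` are hypothetical names, and the crash is simulated with an exception.

```python
class InMemoryJournal:
    """Stand-in for a durable journal: maps step id -> recorded result."""

    def __init__(self):
        self.outcomes = {}


def durable_step(journal, step_id, fn, *args):
    # If the journal already holds an outcome, return it without calling fn.
    if step_id in journal.outcomes:
        return journal.outcomes[step_id]
    result = fn(*args)
    journal.outcomes[step_id] = result
    return result


calls = []  # records every real execution of the tool


def fetch_price(symbol):
    calls.append(symbol)
    return {"symbol": symbol, "price": 101.5}


def workflow(journal, crash_after=None):
    a = durable_step(journal, "step-1", fetch_price, "ACME")
    if crash_after == 1:
        raise RuntimeError("simulated crash")
    b = durable_step(journal, "step-2", fetch_price, "INIT")
    return a["price"] + b["price"]


journal = InMemoryJournal()
try:
    workflow(journal, crash_after=1)  # first attempt dies after step 1
except RuntimeError:
    pass
total = workflow(journal)  # second attempt resumes from the journal
```

Across both attempts `fetch_price` runs once per step: the restart reads step 1's result from the journal and only executes step 2.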
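Second, effectively-once semantics for side effects. Because the journal records which tool calls completed successfully, recovery knows not to re-execute them. The side-effectful work (sending the Slack message, charging the credit card) is not repeated even if the surrounding workflow restarts multiple times. The one window the journal alone cannot close is a crash after the side effect fires but before its outcome is recorded; covering that case takes an idempotency key or a status check against the external system.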
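One common way to handle a step whose intent was journaled but whose outcome was not is to store an idempotency key with the intent and reuse it on retry. The sketch below assumes the external service deduplicates on that key; `FakeMessagingService` and `send_once` are hypothetical, and the journal is a plain dict standing in for durable storage.

```python
import uuid


class FakeMessagingService:
    """Hypothetical external service that deduplicates on an idempotency key."""

    def __init__(self):
        self.delivered = {}  # idempotency key -> message text

    def send(self, idempotency_key, text):
        # A repeated key is acknowledged but not delivered a second time.
        if idempotency_key not in self.delivered:
            self.delivered[idempotency_key] = text
        return {"status": "sent", "key": idempotency_key}


def send_once(journal, service, step_id, text, crash_before_outcome=False):
    entry = journal.setdefault(step_id, {})
    if "outcome" in entry:
        return entry["outcome"]  # step already completed
    # Journal the intent, including the key, before the side effect.
    # On a retry the same key is reused, never regenerated.
    if "key" not in entry:
        entry["key"] = str(uuid.uuid4())
    outcome = service.send(entry["key"], text)
    if crash_before_outcome:
        raise RuntimeError("simulated crash after send, before outcome record")
    entry["outcome"] = outcome
    return outcome


journal = {}
service = FakeMessagingService()
try:
    send_once(journal, service, "notify", "deploy finished", crash_before_outcome=True)
except RuntimeError:
    pass
outcome = send_once(journal, service, "notify", "deploy finished")
```

The retry calls the service again, but because it presents the key recorded with the original intent, the message is delivered only once.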
Third, deterministic recovery from non-deterministic operations. LLM calls are inherently non-deterministic — the same prompt can produce different outputs. Without a journal, retrying a workflow might take a completely different execution path the second time. With the journal, the stored LLM output is replayed during recovery, preserving the original execution path.
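The same idea applied to a non-deterministic decision, as a rough sketch. `fake_llm` stands in for a real model call by picking a label at random, and the journal is again a plain dict.

```python
import random


def fake_llm(prompt, rng):
    # Stand-in for a non-deterministic model call (assumption for illustration).
    return rng.choice(["approve", "escalate", "reject"])


def decide(journal, step_id, prompt, rng):
    # Only the first execution consults the model; recovery replays its answer.
    if step_id not in journal:
        journal[step_id] = fake_llm(prompt, rng)
    return journal[step_id]


journal = {}
first = decide(journal, "triage", "Classify this ticket", random.Random())

# Simulate 20 recoveries, each with a fresh random source.
replays = [
    decide(journal, "triage", "Classify this ticket", random.Random())
    for _ in range(20)
]
```

Every replay returns the decision recorded on the first run, so any branch taken downstream of it stays the same across restarts.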
Checkpoint Granularity: The Engineering Tradeoff
Not all checkpoint strategies are equal. The granularity at which you persist state determines your recovery precision, your storage costs, and your write overhead. There are three natural boundaries for agent workflows.
Per-tool-call checkpointing records the outcome after every individual tool invocation. This gives you the finest recovery granularity — you never re-execute a single tool call — but imposes the highest write overhead. For workflows where each tool call has significant side effects or high latency (external API calls, database writes), this is usually the right choice.
Per-plan-step checkpointing records state at logical boundaries in the agent's plan. If your agent decomposes a task into "research → draft → review → publish," you checkpoint between each phase. This reduces write overhead but means a crash during the "draft" phase requires re-executing all tool calls within that phase. It works well when individual tool calls are cheap and idempotent, but the overall workflow is expensive to restart.
Per-decision-point checkpointing records state only when the agent makes a branching decision based on LLM output. This is the coarsest useful granularity — it ensures you don't re-execute the LLM call that determined the execution path, but accepts re-execution of deterministic work between decisions. It's appropriate for workflows where most steps are fast and side-effect-free, with expensive LLM reasoning happening at key junctions.
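The tradeoff between the three granularities can be sketched by counting journal writes and re-executed tool calls over an event trace. The trace, the event names, and the rule that each policy checkpoints only on its own event kind are simplifying assumptions for illustration.

```python
from enum import Enum


class Granularity(Enum):
    PER_TOOL_CALL = "tool_call"
    PER_PLAN_STEP = "plan_step_end"
    PER_DECISION_POINT = "decision"


def checkpoint_writes(trace, granularity):
    # Number of journal writes the policy performs over the whole trace.
    return sum(1 for event in trace if event == granularity.value)


def tool_calls_to_redo(trace, granularity, crash_index):
    """Tool calls run since the last checkpoint, if we crash before trace[crash_index]."""
    redo = 0
    for event in trace[:crash_index]:
        if event == granularity.value:
            redo = 0  # checkpoint written; earlier work is safe
        elif event == "tool_call":
            redo += 1
    return redo


# Hypothetical two-phase workflow.
trace = [
    "tool_call", "tool_call", "decision", "tool_call", "plan_step_end",
    "tool_call", "tool_call", "decision", "tool_call", "plan_step_end",
]

writes = {g.name: checkpoint_writes(trace, g) for g in Granularity}
redo = {g.name: tool_calls_to_redo(trace, g, crash_index=7) for g in Granularity}
```

On this trace the per-tool-call policy writes six checkpoints and redoes nothing after the crash, while the two coarser policies write two checkpoints each and redo two or three tool calls.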
The right choice depends on your cost function. If re-executing a step costs far more than a checkpoint that costs $0.001 per write, checkpoint everything. If your steps are cheap but you're doing thousands of them, checkpoint at logical boundaries to avoid turning your journal into a bottleneck.
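One rough way to frame that cost function is to compare the expected re-execution cost a checkpoint avoids against the certain cost of writing it. The break-even rule and every figure below except the 0.001 write cost are hypothetical.

```python
def should_checkpoint(reexec_cost, write_cost, crash_probability):
    # Checkpoint when the expected loss avoided exceeds the cost of the write.
    return crash_probability * reexec_cost > write_cost


# Expensive step (hypothetical 0.50 to re-run), 1% chance of a crash.
expensive_step = should_checkpoint(reexec_cost=0.50, write_cost=0.001, crash_probability=0.01)

# Cheap step (hypothetical 0.0005 to re-run), same crash probability.
cheap_step = should_checkpoint(reexec_cost=0.0005, write_cost=0.001, crash_probability=0.01)
```

Under these numbers the expensive step is worth checkpointing and the cheap one is not. The model ignores the cost of duplicate side effects, which can justify a checkpoint on its own.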
Durable Execution: WAL Principles as a Runtime
- https://eunomia.dev/blog/2025/05/11/checkpointrestore-systems-evolution-techniques-and-applications-in-ai-agents/
- https://www.restate.dev/blog/durable-ai-loops-fault-tolerance-across-frameworks-and-without-handcuffs
- https://www.inngest.com/blog/durable-execution-key-to-harnessing-ai-agents
- https://temporal.io/blog/durable-execution-meets-ai-why-temporal-is-the-perfect-foundation-for-ai
- https://docs.langchain.com/oss/python/langgraph/persistence
- https://temporal.io/blog/error-handling-in-distributed-systems
