Write-Ahead Logging for AI Agents: Borrowing Database Recovery Patterns for Crash-Safe Execution
Your agent is on step 7 of a 12-step workflow — it has already queried three APIs, written two files, and sent a Slack notification — when the process crashes. What happens next? If your answer is "restart from step 1," you're about to re-send that Slack message, re-write those files, and burn through your LLM token budget a second time. Databases solved this exact problem decades ago with write-ahead logging. The pattern translates to agent architectures with surprising fidelity.
The core insight is simple: before an agent executes any step, it records what it intends to do. Before it moves on, it records what happened. This append-only log becomes the single source of truth for recovery — not the agent's in-memory state, not a snapshot of the world, but a sequential record of intentions and outcomes that can be replayed deterministically.
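The intent-then-outcome discipline fits in a few lines of code. Here is a minimal, illustrative sketch assuming a local JSON-lines file as the journal with `fsync` for durability; all names (`Journal`, `run_step`) are invented for this example, and a production system would use a database or a durable execution platform instead:

```python
import json
import os

class Journal:
    """Append-only log of step intentions and outcomes."""

    def __init__(self, path: str):
        self.path = path

    def append(self, record: dict) -> None:
        # Append one record and fsync so it survives a process crash.
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def records(self) -> list:
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f]

def run_step(journal: Journal, step_id: str, action):
    # Recovery: if this step already has a journaled outcome, replay it.
    for rec in journal.records():
        if rec["step"] == step_id and rec["type"] == "OUTCOME":
            return rec["result"]
    journal.append({"type": "INTENT", "step": step_id})   # written BEFORE acting
    result = action()
    journal.append({"type": "OUTCOME", "step": step_id, "result": result})
    return result
```

Calling `run_step` twice with the same `step_id` executes the action only once; the second call replays the journaled result, which is exactly the recovery behavior described above.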
The Durability Gap in Agent Architectures
Most agent frameworks treat execution as ephemeral. A function runs, calls an LLM, invokes some tools, returns a result. If anything fails mid-execution, the framework retries the entire function from scratch. This works fine for stateless request-response patterns, but agents are not stateless.
A typical production agent workflow involves:
- Multiple LLM calls that build on each other's outputs
- Tool invocations with real-world side effects (sending emails, creating database records, calling external APIs)
- Branching logic that depends on intermediate results
- Human-in-the-loop approval gates that pause execution for hours or days
The compound reliability problem is brutal. If each step in a 10-step workflow has 99% reliability, the overall success rate drops to 90.4%. At 95% per step, you're at 59.9% for the full workflow. And that assumes failures are clean — in practice, partial failures are worse. They leave the system in an inconsistent state where some side effects have fired and others haven't.
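The compound figures above follow directly from raising the per-step success probability to the number of steps, assuming steps fail independently:

```python
def workflow_success_rate(p: float, n: int) -> float:
    """Overall success rate of an n-step workflow where each step
    succeeds independently with probability p."""
    return p ** n

print(round(workflow_success_rate(0.99, 10) * 100, 1))  # 90.4
print(round(workflow_success_rate(0.95, 10) * 100, 1))  # 59.9
```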
Retry-from-scratch doesn't just waste tokens — it creates duplicate side effects. Your agent sends the email twice, creates duplicate records, charges the customer again. The database world solved this in the 1990s. It's time the agent world caught up.
How Write-Ahead Logging Works (and Why It Maps to Agents)
In a database, WAL follows a simple discipline: before modifying any data page on disk, write the intended change to a sequential log. The log is append-only, durable, and ordered. If the database crashes mid-transaction, recovery reads the log forward and either completes or rolls back each transaction based on what the log contains.
The mapping to agent execution is direct:
- Data pages become agent state (the accumulated context, intermediate results, and memory that the agent has built up)
- Transactions become workflow steps (each tool call, LLM invocation, or decision point)
- The log becomes an execution journal that records both the intent and the outcome of each step
The critical property is the same: the log is written before the action is taken, and the outcome is recorded before the agent moves to the next step. This gives you three recovery capabilities that retry-from-scratch cannot provide.
First, skip replay for completed steps. If step 5 already has an outcome recorded in the journal, you don't re-execute it — you replay the cached result and continue from step 6. This is how Temporal's event-sourcing model works: it replays the event history to reconstitute application state, using stored return values instead of re-executing activities.
Second, exactly-once semantics for side effects. Because the journal records which tool calls completed successfully, recovery knows not to re-execute them. The side-effectful work — sending the Slack message, charging the credit card — happens exactly once even if the surrounding workflow restarts multiple times.
Third, deterministic recovery from non-deterministic operations. LLM calls are inherently non-deterministic — the same prompt can produce different outputs. Without a journal, retrying a workflow might take a completely different execution path the second time. With the journal, the stored LLM output is replayed during recovery, preserving the original execution path.
Checkpoint Granularity: The Engineering Tradeoff
Not all checkpoint strategies are equal. The granularity at which you persist state determines your recovery precision, your storage costs, and your write overhead. There are three natural boundaries for agent workflows.
Per-tool-call checkpointing records the outcome after every individual tool invocation. This gives you the finest recovery granularity — you never re-execute a single tool call — but imposes the highest write overhead. For workflows where each tool call has significant side effects or high latency (external API calls, database writes), this is usually the right choice.
Per-plan-step checkpointing records state at logical boundaries in the agent's plan. If your agent decomposes a task into "research → draft → review → publish," you checkpoint between each phase. This reduces write overhead but means a crash during the "draft" phase requires re-executing all tool calls within that phase. It works well when individual tool calls are cheap and idempotent, but the overall workflow is expensive to restart.
Per-decision-point checkpointing records state only when the agent makes a branching decision based on LLM output. This is the coarsest useful granularity — it ensures you don't re-execute the LLM call that determined the execution path, but accepts re-execution of deterministic work between decisions. It's appropriate for workflows where most steps are fast and side-effect-free, with expensive LLM reasoning happening at key junctions.
The right choice depends on your cost function. If re-executing a step costs dollars in tokens and side-effect cleanup while a journal write costs a fraction of a cent, checkpoint everything. If your steps are cheap but you're doing thousands of them, checkpoint at logical boundaries to avoid turning your journal into a bottleneck.
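The three strategies can be expressed as a small policy over execution events. This sketch assumes three event types ('tool_call', 'plan_step', 'decision') and treats the granularities as nested from finest to coarsest; the names are illustrative, not from any particular framework:

```python
from enum import Enum

class Granularity(Enum):
    PER_TOOL_CALL = "per_tool_call"   # finest: persist after every tool call
    PER_PLAN_STEP = "per_plan_step"   # persist at logical phase boundaries
    PER_DECISION = "per_decision"     # coarsest: persist only at LLM branches

def should_checkpoint(event: str, granularity: Granularity) -> bool:
    """Decide whether to persist state after an execution event.

    Events: 'tool_call' (a tool finished), 'plan_step' (a logical
    phase boundary), 'decision' (an LLM-driven branch was taken).
    """
    if granularity is Granularity.PER_TOOL_CALL:
        return event in ("tool_call", "plan_step", "decision")
    if granularity is Granularity.PER_PLAN_STEP:
        return event in ("plan_step", "decision")
    return event == "decision"
```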
Durable Execution: WAL Principles as a Runtime
The WAL pattern has evolved into a full programming model called durable execution, now offered by platforms like Temporal, Restate, and Inngest. You write agent logic as normal sequential code, and the runtime transparently journals every non-deterministic operation. If the process crashes, the runtime replays the journal to reconstruct state and continues from the last completed step.
This cleanly separates concerns:
- Orchestration logic (the workflow) is deterministic and stateless. It makes decisions, branches, loops — but never directly calls external services.
- Side-effectful operations (activities, in Temporal's terminology) are wrapped in durable contexts that the runtime can journal, retry, and replay.
The practical impact is significant. Developers write agent code as if it runs forever without crashing — no manual checkpoint logic, no recovery handlers, no idempotency key management. The runtime handles all of it. When an LLM call returns a result, it's journaled. When a tool call completes, its output is journaled. Recovery replays these stored results instead of re-executing the operations.
This model also solves the human-in-the-loop problem elegantly. When an agent needs approval before proceeding, the workflow simply awaits a signal. The runtime persists the workflow state, and the process can be completely shut down. When the approval arrives — hours or days later — the runtime replays the journal to reconstruct the agent's state and continues execution. No long-running connections, no resource consumption during the wait.
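The suspend-and-resume behavior can be simulated with nothing but a journal and replay. This is a stdlib sketch of the pattern, not any platform's API; real runtimes like Temporal deliver the signal through their own mechanisms, and the `Pending` exception here is an invented stand-in for the runtime suspending the workflow:

```python
class Pending(Exception):
    """Raised when the workflow must wait for an external signal."""

def approval_workflow(journal: dict) -> str:
    # Step 1: draft (journaled, so replay skips it instead of re-executing).
    if "draft" not in journal:
        journal["draft"] = "quarterly report v1"
    # Gate: suspend until an approval signal has been journaled.
    if "approval" not in journal:
        raise Pending("waiting for human approval")
    # Step 2: publish only after the approval arrives.
    return f"published {journal['draft']} (approved by {journal['approval']})"

journal = {}
try:
    approval_workflow(journal)        # first run suspends at the gate
except Pending:
    pass                              # the process can shut down entirely here
journal["approval"] = "alice"         # days later, the signal is journaled
result = approval_workflow(journal)   # replay skips 'draft' and continues
```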
Side Effects Are the Hard Part
The WAL pattern guarantees that your agent recovers correctly, but it doesn't automatically make your side effects safe. The journal can only skip a tool call on replay if the original execution was recorded before the crash — and if the external API doesn't support idempotency keys, a call that fired without being journaled will simply execute again.
There's a narrow window of vulnerability: the agent executes a tool call, the side effect fires, and the process crashes before the outcome is journaled. This is the same problem databases face with torn writes, and the solutions are similar.
Idempotency keys are the primary defense. Every tool call should include a unique key derived from the workflow ID and step number. If the tool call is replayed due to a crash in the journaling window, the external service recognizes the duplicate key and returns the cached result instead of executing again. This pushes the exactly-once guarantee into the external service.
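Deriving the key from workflow ID and step number makes it stable across retries. A sketch of both sides of the contract, with an invented `PaymentService` standing in for an external service that honors idempotency keys:

```python
import hashlib

def idempotency_key(workflow_id: str, step: int) -> str:
    """Stable key: the same workflow step always produces the same key."""
    return hashlib.sha256(f"{workflow_id}:{step}".encode()).hexdigest()

class PaymentService:
    """Stand-in for an external service that deduplicates by idempotency key."""

    def __init__(self):
        self._seen: dict[str, str] = {}

    def charge(self, key: str, amount_cents: int) -> str:
        if key in self._seen:
            return self._seen[key]     # duplicate key: return the cached result
        receipt = f"receipt-{len(self._seen) + 1}"
        self._seen[key] = receipt      # execute once, remember the outcome
        return receipt

svc = PaymentService()
k = idempotency_key("wf-123", 7)
# A replayed call with the same key is a no-op returning the same receipt.
assert svc.charge(k, 500) == svc.charge(k, 500)
```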
Compensation actions handle the case where idempotency isn't possible. If your agent sent an email and then crashed, you can't un-send the email — but you can record the send in the journal and skip it on recovery. For truly non-idempotent operations, the saga pattern provides a structured way to define compensating actions (send a correction email, reverse a charge) that restore consistency.
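A minimal saga sketch: each completed step registers its compensating action, and on failure the compensations run newest-first. The class and step names are illustrative:

```python
class Saga:
    """Runs steps forward; on failure, runs compensations in reverse order."""

    def __init__(self):
        self._compensations = []

    def step(self, action, compensate):
        result = action()
        self._compensations.append(compensate)   # journal the undo alongside
        return result

    def rollback(self):
        while self._compensations:
            self._compensations.pop()()          # undo newest-first

log = []
saga = Saga()
saga.step(lambda: log.append("charge card"),
          lambda: log.append("refund card"))
saga.step(lambda: log.append("send email"),
          lambda: log.append("send correction email"))
# A later step fails; restore consistency via the compensations.
saga.rollback()
```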
Read-write separation simplifies recovery by categorizing tool calls. Read operations (fetching data, querying APIs) are inherently safe to retry. Write operations (sending messages, creating records) need idempotency protection. Structuring your agent's tools with this distinction makes recovery logic cleaner and reduces the surface area that needs careful handling.
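The read/write distinction can be captured with a one-line decorator, so recovery logic can check a tool's effect class instead of special-casing each tool. A sketch with invented tool names:

```python
def tool(effect: str):
    """Tag a tool as 'read' (safe to retry) or 'write' (needs idempotency)."""
    def decorate(fn):
        fn.effect = effect
        return fn
    return decorate

@tool("read")
def fetch_user(user_id: str) -> dict:
    return {"id": user_id}            # pure query: retrying is harmless

@tool("write")
def send_message(text: str) -> None:
    pass                              # side effect: must be journaled/idempotent

def safe_to_retry(fn) -> bool:
    # Default to 'write' (the cautious choice) if a tool is untagged.
    return getattr(fn, "effect", "write") == "read"
```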
What This Looks Like in Practice
The production ecosystem has converged on journal-based recovery as the standard approach. LangGraph persists graph state as checkpoints after every node execution, with PostgreSQL as the recommended production backend. Temporal records every activity call and return value in its event history. Restate wraps non-deterministic operations in durable contexts that are automatically journaled.
The implementation pattern across these platforms is remarkably consistent:
- Agent state is serialized and stored after each step
- Recovery replays stored results rather than re-executing operations
- Side-effectful operations are isolated in wrapped contexts with retry policies
- Workflow identity (thread ID, workflow ID) provides the correlation key for checkpoint retrieval
Teams adopting these patterns report 3-5x reduction in duplicate side effects during failure scenarios, and significant cost savings from not re-executing expensive LLM calls. The write overhead for journaling is typically negligible — a few kilobytes of serialized state per step, compared to the megabytes of tokens flowing through the LLM calls themselves.
When WAL Is Overkill
Not every agent needs durable execution. If your agent is a single LLM call with no side effects — a chatbot that answers questions — retry-from-scratch is perfectly fine. The WAL pattern earns its complexity when:
- Workflows span more than 3-5 steps
- Steps have expensive or non-idempotent side effects
- Execution time exceeds a few minutes
- Human-in-the-loop gates pause execution
- Failure costs are high (financial transactions, customer-facing actions)
For simple agents, a basic try-catch with exponential backoff is sufficient. The operational complexity of running a durable execution platform — additional infrastructure, learning curve, debugging journal replays — isn't justified until your failure costs exceed your infrastructure costs.
The decision framework is straightforward: estimate the cost of re-executing your entire workflow from scratch on failure (token costs + duplicate side effects + engineering time to handle inconsistencies). If that number is high enough to justify the infrastructure investment, adopt durable execution. If your workflows are short and idempotent, keep it simple.
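That break-even comparison can be written down directly. The numbers below are purely illustrative assumptions, not benchmarks:

```python
def worth_durable_execution(failure_rate: float,
                            reexec_cost: float,
                            runs_per_month: int,
                            infra_cost_per_month: float) -> bool:
    """True if the expected monthly cost of failures (re-executed tokens,
    duplicate side effects, cleanup) exceeds the platform's running cost."""
    expected_failure_cost = failure_rate * reexec_cost * runs_per_month
    return expected_failure_cost > infra_cost_per_month

# Assumed example: 5% failure rate, $2 per failed run in tokens and cleanup,
# 10,000 runs/month, $300/month to operate the durable execution platform.
print(worth_durable_execution(0.05, 2.00, 10_000, 300.0))  # True
```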
The Database Playbook, Applied
The database community spent decades perfecting crash recovery. Write-ahead logging, checkpointing, and replay-based recovery are battle-tested patterns that handle the exact failure modes agent architectures are now encountering for the first time. You don't need to implement WAL from scratch — durable execution platforms have already translated these patterns into agent-friendly runtimes.
The key insight to carry forward: your agent's execution journal is as critical as your database's transaction log. It's not a debugging aid or an observability luxury — it's the mechanism that makes your agent's promises to the outside world reliable. Treat it with the same seriousness, and your agents will recover from crashes the way databases have for decades: correctly, efficiently, and without duplicating the work that was already done.
Sources
- https://eunomia.dev/blog/2025/05/11/checkpointrestore-systems-evolution-techniques-and-applications-in-ai-agents/
- https://www.restate.dev/blog/durable-ai-loops-fault-tolerance-across-frameworks-and-without-handcuffs
- https://www.inngest.com/blog/durable-execution-key-to-harnessing-ai-agents
- https://temporal.io/blog/durable-execution-meets-ai-why-temporal-is-the-perfect-foundation-for-ai
- https://docs.langchain.com/oss/python/langgraph/persistence
- https://temporal.io/blog/error-handling-in-distributed-systems
