Deterministic Replay: How to Debug AI Agents That Never Run the Same Way Twice
Your agent failed in production last Tuesday. A customer reported a wrong answer. You pull up the logs, see the final output, maybe a few intermediate print statements — and then you're stuck. You can't re-run the agent and get the same failure because the model won't produce the same tokens, the API your tool called now returns different data, and the timestamp embedded in the prompt has moved forward. The bug is gone, and you're left staring at circumstantial evidence.
This is the fundamental debugging problem for AI agents: traditional software is deterministic, so you can reproduce bugs by recreating inputs. Agent systems are not. Every run is a unique snowflake of model sampling, live API responses, and time-dependent state. Without specialized tooling, post-mortem debugging becomes forensic guesswork.
Deterministic replay solves this by recording every source of non-determinism during execution and substituting those recordings during replay — turning your unreproducible agent run into something you can step through like a debugger.
Why "Just Set Temperature to Zero" Doesn't Work
The first instinct most engineers have is to eliminate randomness at the source: set temperature=0 and call it deterministic. This is a myth that has cost teams countless debugging hours.
Temperature only controls the token sampling step — setting it to zero collapses sampling to greedy decoding, which always selects the highest-probability token. But the logits feeding that selection are themselves variable. The primary culprit is batch composition. Cloud inference providers bundle multiple requests onto the same GPU for throughput. Your request shares computational resources with other users' requests, and the specific batch it lands in affects floating-point intermediate results. A token that was winning by a margin of 0.0000001 might lose in a different batch context. The argmax flips, a different word is chosen, and the rest of the generation diverges onto a completely new path.
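A toy example makes the floating-point sensitivity concrete: addition is not associative in floating point, so the order in which a GPU kernel reduces partial sums changes the result, and batch composition changes that order.

```python
# Floating-point addition is not associative: the grouping of the same
# three numbers produces different results. Batching changes the grouping
# inside GPU reductions, which is enough to flip a near-tie in the logits.
a, b, c = 0.1, 1e20, -1e20

left_to_right = (a + b) + c   # 0.1 is absorbed into 1e20 and lost entirely
right_to_left = a + (b + c)   # 1e20 cancels first, so 0.1 survives

print(left_to_right)  # 0.0
print(right_to_left)  # 0.1
```

The same numbers, summed in a different order, give 0.0 versus 0.1. Scale that effect across billions of accumulations in a matrix multiply and near-tied logits can flip.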
Hardware heterogeneity makes this worse. A cloud provider's GPU fleet includes multiple architectures — H100s, A100s, sometimes older cards. Different GPU architectures implement matrix operations with slightly different numerical behavior. Your request might hit an H100 on one run and an A100 on the next. Bit-for-bit reproducibility across a heterogeneous fleet is a practical impossibility.
Even OpenAI and Anthropic acknowledge this. OpenAI's seed parameter improves reproducibility but explicitly doesn't guarantee it. Anthropic's documentation states that even with temperature=0.0, results will not be fully deterministic. Research investigating non-determinism across five LLMs configured for deterministic output found accuracy variations up to 15% across runs, with a gap between best and worst possible performance reaching 70%.
For single LLM calls, this variability might be tolerable. For multi-step agent workflows — where each step's output becomes the next step's input — small divergences compound. A slightly different word choice in step 2 can lead to a completely different tool call in step 3, which feeds different data into step 4. By step 10, the two runs are unrecognizable.
The Architecture: Record Everything, Replay Exactly
Deterministic replay borrows from a well-established pattern in systems engineering: record-replay debugging. The same idea that game engines use for replays and distributed systems use for fault diagnosis applies directly to agent systems.
The core architecture has two modes:
Record mode runs the agent normally against real LLMs and real tools, but intercepts and logs every external interaction. Replay mode feeds those logged interactions back to the agent in exact sequence, replacing live calls with recorded responses.
What you need to record at each step:
- LLM interactions: The complete prompt sent, all sampling parameters (temperature, top_p, max_tokens), model identifier and version, and the exact response tokens returned
- Tool calls: The function name, arguments passed, and the complete response — including errors, timeouts, and partial results
- System state: Timestamps (so time-dependent logic replays identically), random seeds, and any environment variables the agent reads
- Decision metadata: If your agent has routing logic, planners, or classifiers, capture their inputs and outputs as distinct events
Each recorded event gets a monotonically incrementing step ID, a run ID for correlation, and a structured input/output payload. The format should be append-only and streamable — JSONL works well in practice.
The critical insight is that during replay, you don't re-invoke the model or the tools. You substitute deterministic stubs — proxy objects that return the recorded response verbatim when called with matching inputs. The model stub returns the exact tokens recorded. The tool stub returns the exact API response recorded. If the agent tries to make a call that wasn't in the recording, the replay engine fails loudly rather than silently falling through to a live system.
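One way to sketch such a stub, assuming event dicts shaped like the JSONL schema described above and a hypothetical complete(prompt) client interface:

```python
class ReplayMismatch(RuntimeError):
    """Raised when the agent makes a call that was never recorded."""

class ReplayLLM:
    """Stands in for the live LLM client during replay: returns recorded
    responses in step order and fails loudly on any divergence."""

    def __init__(self, events):
        # events: the recorded llm_call event dicts, in step order
        self._events = iter(events)

    def complete(self, prompt: str, **params) -> str:
        try:
            event = next(self._events)
        except StopIteration:
            raise ReplayMismatch("agent made more LLM calls than were recorded")
        if event["inputs"]["prompt"] != prompt:
            raise ReplayMismatch(f"prompt diverged at step {event['step_id']}")
        return event["outputs"]["response"]   # the exact recorded tokens
```

The input check is what makes the stub safe: a prompt that no longer matches the recording means the replayed code path has changed, and the engine surfaces that immediately instead of returning a stale answer.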
Checkpoint Replay vs. Full Replay
There are two flavors of deterministic replay, and the right choice depends on your debugging needs.
Full replay starts from the beginning of the agent run and replays every step sequentially. This gives you the complete execution path but can be slow for long-running agents — imagine replaying a 50-step research agent just to inspect the failure at step 47.
Checkpoint replay captures the complete agent state at each step boundary, letting you jump directly to any point in the execution. This is what systems like LangGraph's Time Travel implement: each state transition is persisted as a checkpoint, and you can fork execution from any saved checkpoint to explore alternative paths.
Checkpoint replay is more powerful for interactive debugging. You can:
- Jump directly to the failure point without replaying earlier steps
- Fork from a checkpoint and inject different inputs to test hypotheses
- Compare execution paths by branching from the same checkpoint with different model responses
The tradeoff is storage. Full replay only stores the event log — inputs and outputs of external calls. Checkpoint replay stores the complete agent state at every step, which can include large context windows, accumulated tool results, and internal planning state. For a RAG agent processing documents, a single checkpoint might be several megabytes of accumulated context.
In practice, a hybrid approach works well: store the complete event log (relatively compact) for every run, and store full checkpoints only for runs that triggered alerts, failed, or were flagged for review.
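A minimal checkpoint store with forking might look like the following. This is a sketch of the pattern, not LangGraph's actual API:

```python
import copy

class CheckpointStore:
    """Checkpoints keyed by (run_id, step_id). Forking deep-copies state so
    branches never share mutable data with the original run."""

    def __init__(self):
        self._checkpoints = {}   # (run_id, step_id) -> agent state dict

    def save(self, run_id: str, step_id: int, state: dict) -> None:
        self._checkpoints[(run_id, step_id)] = copy.deepcopy(state)

    def fork(self, run_id: str, step_id: int, new_run_id: str) -> dict:
        """Start a new branch from a saved checkpoint and return its state."""
        state = copy.deepcopy(self._checkpoints[(run_id, step_id)])
        self._checkpoints[(new_run_id, step_id)] = state
        return state
```

The deep copies are the point: mutating the forked branch to test a hypothesis must never corrupt the recorded original.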
The Storage Math: What Replay Actually Costs
A common objection to deterministic replay is storage cost. Let's do the math.
A typical agent step involves an LLM call with a prompt (maybe 2-4K tokens of context plus the new message) and a response (a few hundred tokens). In raw text, that's roughly 10-15 KB per step. Add tool call payloads (API request and response bodies) at maybe 1-5 KB per tool interaction. For a 10-step agent run, you're looking at 100-200 KB of trace data.
At cloud object-storage prices, that's nearly free — on the order of a few dollars per month per million agent runs, and less with compression. The real cost isn't raw storage but the observability infrastructure around it: indexing traces for search, retaining them for compliance windows, and running the replay engine itself.
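The arithmetic, spelled out. The per-step size comes from the estimate above; the $0.023 per GB-month price is an assumed nominal object-storage rate, not a quote:

```python
# Back-of-the-envelope trace storage cost.
kb_per_step = 15          # prompt + response + tool payloads, from the estimate above
steps_per_run = 10
runs = 1_000_000

total_gb = kb_per_step * steps_per_run * runs / 1_000_000   # KB -> GB, decimal
monthly_cost = total_gb * 0.023                             # assumed $/GB-month

print(total_gb)      # 150.0 GB for a million runs
print(monthly_cost)  # a few dollars per month, before compression
```

Text traces also compress well, so the real number is typically a fraction of this.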
The expensive cases are agents that process large documents or images. If your agent fetches a 50-page PDF and passes chunks through multiple LLM calls, the trace for a single run can hit several megabytes. The solution is reference-based recording: store the document once, and record a content-addressed hash in the trace rather than duplicating the full content at every step.
For teams already running observability platforms like Langfuse, LangSmith, or Arize, the marginal cost of enabling replay-grade tracing is modest. The data you need for replay (prompts, responses, tool calls) overlaps heavily with what you're already collecting for monitoring and evaluation. The gap is usually completeness — monitoring might sample or truncate, while replay requires verbatim capture.
Building a Replay System: Practical Patterns
If you're building replay into an existing agent system, here are the patterns that matter.
Separate recording from execution. Your agent code shouldn't know whether it's being recorded. Wrap your LLM client and tool interfaces with recording proxies that intercept calls transparently. This means no if recording: log(...) scattered through your agent logic — the instrumentation lives in the infrastructure layer.
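A recording proxy might look like this. The complete() method and the recorder interface are illustrative names, not a real client library's API:

```python
class RecordingLLM:
    """Transparent proxy: the agent sees the same interface as the live
    client, and every call is logged as a side effect."""

    def __init__(self, client, recorder):
        self._client = client       # any object with complete(prompt, **params)
        self._recorder = recorder   # any object with record(kind, inputs, outputs)

    def complete(self, prompt: str, **params) -> str:
        response = self._client.complete(prompt, **params)
        self._recorder.record(
            kind="llm_call",
            inputs={"prompt": prompt, "params": params},
            outputs={"response": response},
        )
        return response
```

The agent is constructed with either the live client, the recording proxy, or a replay stub — its own code is identical in all three modes.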
Use content-addressed storage for large payloads. Hash documents, images, and large API responses, store them once, and reference them by hash in your traces. This deduplicates across runs that process the same inputs and keeps your trace files compact.
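A sketch of content-addressed blob storage, using an in-memory dict where a production system would use object storage:

```python
import hashlib

class BlobStore:
    """Content-addressed storage: large payloads are stored once under their
    SHA-256 digest, and traces reference them by hash instead of inlining."""

    def __init__(self):
        self._blobs = {}   # digest -> bytes; in production, an object store

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)   # identical content dedupes itself
        return digest   # record this hash in the trace, not the bytes

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]
```

A 50-page PDF processed by twenty steps across a hundred runs is stored exactly once; every trace event just carries its 64-character digest.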
Version your trace format. Your trace schema will evolve. Include a version field in every trace file so your replay engine can handle old and new formats. Breaking changes in trace format that invalidate your historical replay library are painful.
Build divergence detection. When replaying, compare the agent's internal decisions against what was recorded. If the agent's routing logic or prompt construction is itself deterministic (given the same inputs), any divergence between the recorded path and the replayed path signals a code change that altered behavior. This turns replay into a regression testing tool — replay last week's production traffic against this week's code and flag any divergences.
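Divergence detection can be as simple as comparing the sequence of (tool, arguments) pairs from two runs. A sketch, not a full diff engine:

```python
def diff_tool_sequences(recorded, replayed):
    """Return the first step where two (tool_name, arguments) sequences
    diverge, or None if they match exactly."""
    for step, (old, new) in enumerate(zip(recorded, replayed)):
        if old != new:
            return {"step": step, "recorded": old, "replayed": new}
    if len(recorded) != len(replayed):
        return {"step": min(len(recorded), len(replayed)),
                "note": "runs have different lengths"}
    return None
```

Run this over last week's traces after every deploy: a non-None result pinpoints the first step where the new code behaves differently.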
Don't forget time. If your agent uses timestamps in prompts, logging, or decision logic, your replay engine must intercept system clock calls and substitute the recorded timestamps. A prompt that includes "Today is April 12, 2026" will produce different model behavior if replayed tomorrow with "Today is April 13, 2026."
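One way to make time injectable, sketched as a small clock object the agent uses instead of calling datetime.now() directly:

```python
import datetime

class ReplayClock:
    """Injectable clock: record mode hands out real timestamps, replay mode
    hands back the recorded ones so time-dependent prompts are identical."""

    def __init__(self, recorded_times=None):
        # recorded_times: iterable of datetimes captured during record mode
        self._recorded = iter(recorded_times) if recorded_times is not None else None

    def now(self) -> datetime.datetime:
        if self._recorded is not None:
            return next(self._recorded)   # replay: the recorded timestamp
        return datetime.datetime.now(datetime.timezone.utc)   # record: real time
```

Every clock.now() call is itself a recordable event, so replay consumes timestamps in the same order the original run produced them.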
Replay as Regression Testing
The most underappreciated use of deterministic replay isn't debugging — it's regression testing. Traditional software testing relies on unit tests and integration tests with deterministic assertions. But how do you test a system whose outputs are inherently variable?
Replay gives you a concrete answer: capture production runs as golden traces, then replay them against new code versions. You're not checking for exact output matches (the model might produce different tokens). Instead, you check for structural equivalence: Did the agent call the same tools in the same order? Did it extract the same key information? Did it reach the same conclusion?
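A structural-equivalence check might compare only the tool-call skeleton of two traces. The event shape here (dicts with kind and inputs keys) is an assumption carried over from the recording sketch earlier, not a standard format:

```python
def tool_sequence(trace):
    """Extract the ordered list of tool names from a list of event dicts."""
    return [e["inputs"]["name"] for e in trace if e["kind"] == "tool_call"]

def structurally_equivalent(golden, candidate) -> bool:
    """Token-level text may differ between runs; the structure should not."""
    return tool_sequence(golden) == tool_sequence(candidate)
```

Real regression suites layer more checks on top (same extracted entities, same final answer class), but the same-tools-same-order test alone catches a surprising share of prompt regressions.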
This approach catches a category of bugs that's nearly invisible otherwise: prompt regressions. You update a system prompt to improve one scenario and inadvertently break five others. Without replay-based regression testing, you'd only discover this when users report problems. With it, you replay your golden trace library after every prompt change and catch regressions before deployment.
The same pattern works for model upgrades. Before switching from one model version to another, replay a representative sample of production traces and compare behavioral metrics. This is more reliable than benchmark scores because it measures performance on your actual workload, not synthetic tasks.
What Replay Won't Save You From
Deterministic replay has real limitations worth understanding.
It can't replay what it didn't record. If your agent has untraced side effects — writing to a database, sending emails, modifying external state — those won't be captured or replayed. Comprehensive instrumentation is a prerequisite.
It doesn't explain why the model said what it said. Replay shows you the exact sequence of inputs and outputs but doesn't provide mechanistic interpretability. You'll know that step 5 produced a wrong answer, but not which attention heads or neurons drove that response.
Trace storage becomes a data governance problem. Your traces contain the full prompts and responses from production, which often include user data. Trace storage must comply with your data retention policies, access controls, and privacy regulations. In regulated industries, this is a feature (audit trail), but it requires deliberate handling.
Model deprecation breaks historical replay. If you recorded traces against GPT-4 and the model is later deprecated, you can still replay using the recorded outputs — but you can't do counterfactual replay (injecting new inputs and getting model responses) against a model that no longer exists. For long-lived compliance requirements, this means your recorded traces are the permanent record, not a reproducible experiment.
Despite these limitations, deterministic replay remains the closest thing we have to a debugger for non-deterministic systems. It won't give you gdb for LLMs, but it gives you something almost as valuable: the ability to say "this is exactly what happened, step by step" when your agent fails in production.
For teams running agents at scale, that capability transforms incident response from guesswork into engineering. Start by instrumenting your most failure-prone agent workflow, capture a week of production traces, and replay the failures. You'll learn more about your agent's behavior in that first week than months of log-reading ever taught you.
Sources
- https://www.sakurasky.com/blog/missing-primitives-for-trustworthy-ai-part-8/
- https://dev.to/sreeni5018/debugging-non-deterministic-llm-agents-implementing-checkpoint-based-state-replay-with-langgraph-5171
- https://mikulskibartosz.name/why-temperature-0-isnt-deterministic
- https://www.flowhunt.io/blog/defeating-non-determinism-in-llms/
- https://arxiv.org/html/2505.17716v1
- https://oneuptime.com/blog/post/2026-04-01-ai-workload-observability-cost-crisis/view
- https://www.getmaxim.ai/articles/the-5-best-agent-debugging-platforms-in-2026/
