Skip to main content

The Incident Ticket With No Repro Steps: Reproducibility as Something You Engineer

· 10 min read
Tian Pan
Software Engineer

The incident ticket is specific in the way only real incidents are. At 02:14 the support agent closed a customer account that should have been put on a 30-day grace period. The customer noticed. The ticket lands on your desk with a single line under "Steps to reproduce": unknown.

You open the trace. You can see the agent called close_account instead of set_grace_period. You can see the tool succeeded. What you cannot see is why the model chose that branch — and when you replay the same customer message through the same agent, it does the right thing. Twice. The postmortem now has a paragraph-shaped hole where the root cause should be, and the only honest thing you can write is "could not reproduce."

This is the quiet failure mode of agent systems in production. Not the outage — the outage you can see. It is the incident that happened exactly once, left a victim, and erased its own repro steps on the way out. Traditional software has a comfortable assumption baked into every debugging workflow: the same input produces the same output, so a bug you can describe is a bug you can re-trigger. Agents break that assumption, and most teams discover it the first time they try to write a postmortem and realize they have a log of what happened but no way to make it happen again.

"It Worked When I Tried It" Is The Default, Not The Exception

When you re-run the customer's message and the agent behaves, the instinct is relief. It should be alarm. A non-reproducible failure is not a failure that went away — it is a failure you have lost the ability to study.

Four things move independently between the moment of the incident and the moment you investigate, and any one of them is enough to make "I tried it and it was fine" meaningless.

The model is non-deterministic even when you think you have pinned it. Setting temperature=0 does not buy you determinism: production LLM inference is not batch-invariant, so the same prompt yields different numerical results depending on what else happened to share the GPU batch. One widely-discussed analysis found that a popular open model at temperature=0 produced 80 distinct completions across 1,000 identical requests — the variation coming entirely from batch-dependent floating-point reduction order in normalization, matrix multiply, and attention kernels. Determinism is achievable, but it costs roughly 60% more latency and you have to engineer it in. By default, "same input" does not mean "same output."

The retrieved context has moved on. The documents your RAG step pulled at 02:14 are not the documents it pulls now. A knowledge-base article was edited, a vector index was rebuilt, a customer record changed state. The agent at incident time saw a world that no longer exists.

The prompt has shipped twice since. Someone fixed an unrelated wording issue Tuesday and tightened a tool description Wednesday. The agent that made the bad call is not running anymore. You are debugging its successor.

And the conversation state — the memory, the prior turns, the tool outputs that conditioned the decision — was a specific arrangement that you did not save. You saved the final answer. You saved a log line. You did not save the situation.

Put those together and the conclusion is uncomfortable but freeing: reproducibility is not a property your stack hands you. It is a property you build, and if you do not build it on purpose, you do not have it.

A Log Tells You What Happened. A Record Lets You Do It Again.

This distinction is the whole game, and it is worth being precise about it.

A log is testimony. It says "the agent called close_account at 02:14." It is past-tense, lossy, and human-readable by design — which means it was already summarized before it was written down. You can read a log and form a theory. You cannot execute a log.

A record is a re-execution kit. It contains every non-deterministic input the agent consumed, captured at the moment it was consumed, in enough fidelity that you can feed those exact inputs back into the agent and watch the same decision unfold. You do not read a record. You run it.

The systems community has known this distinction for decades. Deterministic record-and-replay — the technique behind tools that capture a process execution and re-run it bit-for-bit later — works precisely because most of a program's behavior is deterministic given its inputs. You do not need to store every intermediate state; you only need to store the non-deterministic inputs (the bytes from the network, the timing, the random draws) and then re-execute. Replay debuggers built on this idea were designed specifically to catch "heisenbugs": rare production failures that vanish the moment you try to observe them. An agent's bad tool call is a heisenbug. It deserves the same treatment.

The mistake teams make is assuming their observability vendor already does this. Tracing dashboards are excellent at the log job — timeline, spans, latency, token counts. Very few of them capture a record. A trace that shows you the prompt as rendered, but not the prompt template version plus the variables that filled it, is a log. A trace that shows the retrieved chunks as text, but not the index snapshot they came from, is a log. Useful, but you still cannot run it.

What Goes In The Replayable Bundle

If reproducibility is engineered, here is the unit of engineering: a decision context bundle attached to every agent run — not every failed run, every run, because you do not know which run will become an incident until it is too late to start recording.

The bundle pins every input the agent saw:

  • Model identity — the exact model version and provider, the decoding parameters, and the seed if you set one. "GPT-class model, temperature 0" is not an identity. The specific versioned endpoint is.
  • Prompt provenance — not the rendered prompt string but the template version (a content hash or a version tag) and the full set of variables interpolated into it. The rendered string is a derived artifact; the template plus variables is the source.
  • Retrieved context, frozen — the actual chunks returned, plus the identity of the index or snapshot they came from and the query that retrieved them. If your knowledge base supports point-in-time reads, store the timestamp; if it does not, store the chunk content directly.
  • Tool I/O — every tool call's arguments and, critically, the exact response the tool returned. On replay you do not re-call the payment API. You replay its recorded response, because the external world has moved on and re-calling it is both unfaithful and dangerous.
  • Conversation and memory state — the full message history and whatever memory entries were in scope at decision time, captured as they were, not as they are now.

The bundle should be content-addressed and immutable. When the prompt ships twice, the old template version is still resolvable by hash. When the index is rebuilt, the old chunks are still pinned in the bundle. The incident at 02:14 becomes a thing you can hold in your hand, not a thing you have to reconstruct from memory and luck.

This is not free. Bundles have storage cost, and capturing tool responses verbatim raises real questions about PII and retention — you are now storing customer data inside a debugging artifact, and it needs the same access controls and expiry as any other copy. Treat that as a design constraint, not a reason to skip it. A 30-day retention window on bundles covers the overwhelming majority of incidents and bounds the exposure.

The Replay Environment Is Where The Bundle Pays Off

A bundle you cannot execute is just a more expensive log. The other half of the work is a replay environment: a mode in which the agent runs against pinned inputs instead of live ones.

In replay mode, the retrieval step does not query the live index — it serves the frozen chunks from the bundle. The tool layer does not hit live APIs — it returns recorded responses keyed by the call arguments. The prompt is rendered from the pinned template version, not the current one. The model is invoked at the version recorded in the bundle. The only thing you are allowed to vary is the thing you are testing.

That last point is what turns replay from forensics into a tool. Once you can pin everything, you can change exactly one variable and see what it does:

  • Replay the incident against the current prompt to confirm whether your fix actually closes the hole — or whether it only appeared to because the model rolled the dice differently.
  • Replay against a newer model version before you migrate, so a model upgrade stops being a leap of faith.
  • Replay with the retrieved context perturbed, to learn whether the bad decision depended on a specific document or was robust to the agent's judgment.

Determinism makes all of this sharper. You will not always get bit-for-bit identical model output on replay — and that is acceptable, as long as you are honest about it. If your replay reproduces the bad tool call nine times out of ten, you have a reproducible bug. If it reproduces it once in fifty, you have learned the failure is rare and probabilistic, which is itself a finding worth putting in the postmortem. Either way you have replaced "could not reproduce" with a number.

Reproducibility Is A Decision You Make Before The Incident

The hole in the postmortem is not a tooling gap you can close after the fact. By the time the ticket is filed, the inputs are already gone — the index moved, the prompt shipped, the conversation state was never saved. There is no retroactive fix. The only version of this that works is the version where the bundle already existed before 02:14, because you decided every run would carry one.

So the practical takeaway is a checklist you run now, not during the next incident:

  1. Capture a decision context bundle on every run, not just failures. Content-addressed, immutable, with a bounded retention window.
  2. Version your prompts and pin retrieval snapshots so "the prompt" and "the context" are resolvable identities, not moving targets.
  3. Record tool responses verbatim and replay them — never re-call live services during investigation.
  4. Build a replay mode that pins every input and lets you vary exactly one.
  5. Report reproducibility as a rate, not a yes/no. "Reproduces 9/10 against the pre-fix prompt" is a root cause. "Could not reproduce" is a hole.

Non-determinism is not going away; it is intrinsic to how these systems work, and fighting it everywhere is the wrong battle. The right battle is making your agents auditable in spite of it. A team that engineers reproducibility ships postmortems with causes in them. A team that inherits whatever its stack happens to provide ships postmortems with the word "unknown" — and quietly accepts that the next 02:14 will be a surprise too.

References:Let's stay in touch and Follow me for more thoughts and updates