Skip to main content

From a Bug to a Behavior Rate: The AI Postmortem Without a Reproducer

· 10 min read
Tian Pan
Software Engineer

A user files a ticket. The agent told a paying customer their refund would be processed in seven hours when the documented SLA is seven days. Screenshot attached. You pull the trace, find the exact prompt, the exact tool calls, the exact model and seed. You replay it. The model says seven days. You replay it again. Seven days. You replay it a hundred times. It says seven days ninety-eight times and "by end of day" twice, and never once says seven hours. The screenshot is unambiguous. The replay disagrees. The postmortem due Friday now has a "Root Cause" section and no root cause to put in it.

This is the shape of most AI incidents that reach a postmortem. Not the obvious outages — those have stack traces and 500-rate graphs and recover the way every SRE has been trained to expect. The hard ones are the single bad output that left a victim, erased its own conditions on the way out, and refuses to come back when you summon it. Every postmortem template you have ever used assumes a reproducer. Agents do not give you one.

The instinct from the SRE playbook is to keep digging until the replay matches the report. That instinct is wrong here. Same prompt plus same model plus same seed is not a sufficient condition for the same output, and the gap between "what the user saw" and "what I see when I run it again" is not a debugging failure — it is a property of the system you shipped. The postmortem you are writing has to absorb that property rather than fight it.

Why "Same Inputs, Same Outputs" Was Never True

The cleanest articulation of why is from the work on batch-invariance in LLM inference. The model does not see "your request." It sees a batch — yours plus whoever else's traffic the scheduler co-located in the same forward pass. Reduction kernels in matrix multiplication, RMSNorm, and attention are not associative under floating-point arithmetic, so the order in which a sum is accumulated changes the last few bits of the result. Change the batch size, change the bits. Change a few bits in the logits, occasionally change which token wins the argmax. Temperature zero closes the sampling-side randomness; it does not touch the execution-side randomness, and the execution side is a function of other users' load at the millisecond your request landed.

Stack on top of that: mixture-of-experts routing that depends on co-batched tokens, KV-cache warm/cold paths that produce subtly different intermediate states, fleet routing across GPU generations where the same logical model lives behind different kernels, provider-side patch updates that ship without a version bump in the API. The Thinking Machines work showed that with explicit batch-invariant kernels you can drive determinism to zero variance — and you pay roughly 60 percent in throughput for it. No commercial inference endpoint runs that way. Your production traffic shares a fleet with thousands of other tenants whose load profile you cannot see, and that load profile is one of the inputs to your output.

The honest version is this: when a user reports "the agent said X" and you cannot make it say X again, you have not failed to find the bug. You have correctly measured that X is a low-probability output of a distribution, and that the distribution happened to roll an X for that user at that moment. The bug, if there is one, is not located in the trace. It is located in the shape of the distribution.

The Postmortem Template Is Asking the Wrong Question

Look at any SRE postmortem template. There is a "Root Cause" field. There is a "How will we prevent this from happening again?" field. Both are phrased as if the incident is a defect with a location — a line of code, a config value, a misconfigured permission — and the fix is to change that location. That framing has carried our industry for two decades, and it does not survive contact with probabilistic systems.

For an AI incident with no reproducer, three of the template's load-bearing assumptions break at once. The first is that the cause is singular. It almost never is — the bad output was the joint result of the prompt, the retrieval state at that minute, the model's checkpoint that hour, the batch the request landed in, and the temperature of the sampler. The second is that the cause is in your code. Half the variables are inside the provider's fleet, not yours, and you do not have logs for them. The third is that "we fixed it" means "this exact failure cannot recur." For a probabilistic system, the failure is a rate, not a switch. You do not flip it off. You move it down.

The template has to absorb those three breakages or every postmortem ends in the same paragraph: "Could not reproduce. No action items." Which is a story, not an investigation, and the next incident with the same shape will be filed identically because nothing changed.

What an Investigatable AI Postmortem Looks Like

The shift is from "find the bug" to "characterize the behavior rate." It is a smaller mental move than it sounds, and it changes what you collect, what you measure, and what you write down.

Capture the provider response metadata, not just the response. The output text is what the user saw, but the metadata is what makes the incident investigatable. Every modern LLM API returns fields most teams discard: the system fingerprint or model version string at request time, the safety filter trip flags, finish reason, prompt-token and completion-token counts, cache hit indicators where the provider exposes them. Provider fleets ship silent updates — research on API stability across model patches has found that more than half of prompt-and-model combinations regress when a provider updates the underlying weights, and the API contract gives you no version bump to anchor against. If you have the system fingerprint from the incident window and the fingerprint from your replay, you have a hypothesis to chase. If you only have the response text, you do not.

Sample the distribution instead of replaying the point. A single replay is one draw from the underlying behavior. A useful investigation runs the same input fifty or a hundred times — varying nothing the user could vary — and reports the histogram. The question stops being "did this happen?" and becomes "how often does this happen, and is that frequency consistent with the report?" If the bad output appears once in a thousand replays, the user got unlucky, and the action item is making the long tail safer. If it appears once in ten, you have a much larger problem than the screenshot suggested and the rest of the user base is being silently exposed.

Treat one bad output as evidence of a rate, not a deterministic bug. This is the inversion. The report is a sample, not the population. The investigation's job is to estimate the rate that produced the sample. A confidence interval is more honest than a binary cause, and an action item phrased as "reduce the malformed-tool-call rate from 0.4 percent to under 0.1 percent" is more defensible — and more testable — than "fix the bug that caused this output."

Log the inputs you do not control. The retrieval index version that served the request. The vector store's freshness lag at that minute. The tool schema hash. The traffic mix in the surrounding window — was there a spike of multi-intent queries that pushed your scheduler into a regime it rarely hits? The minute-by-minute drift score on a held-out canary set. These are the variables that move silently inside an LLM stack, and a postmortem template that does not enumerate them will not catch them. Most teams do not capture them not because the engineering is hard, but because the SRE template they inherited never had fields for them.

The Section No Template Has Yet

Write a section called something like "What We Could Not Determine." Put in it: which variables you could not recover from the incident window, why, and what you changed in the capture pipeline so that next time you would have them. This is the section that breaks the story-vs-investigation deadlock. An incident that produced no reproducer is still useful if it produced a delta in your observability surface area. An incident that produced no reproducer and no observability change is a story you told the team, not a postmortem.

Then write a "What We Changed Anyway" section. The temptation is to defer action items until the cause is known. For AI incidents that means deferring forever. The action items that move the rate are usually independent of pinning the exact failure: tightening the system prompt against the failure mode, adding a deterministic guard rail downstream of the model that catches the malformed pattern, adding the bad output to the eval suite so future regressions surface, reducing the entropy of the surrounding context. None of these require knowing exactly why the bad output happened that one time. They lower the rate at which it can happen at all, and lowering the rate is the only thing that matters for a probabilistic system.

The final section is the one most teams forget: write down what would have to be true for this to recur, and what your monitoring would have to look like to catch the second instance before a user did. If you cannot answer the second part, your monitoring is not yet measuring the right thing, and that is itself an action item.

A Note on Honesty

There is a cultural failure mode worth naming. When a postmortem cannot find a root cause, the gravitational pull is to soften the language until the absence is invisible. "Investigation ongoing." "Likely a transient issue." "Working with the model provider to confirm." The blameless tradition exists to make it safe to write down what actually happened, and what actually happened in many AI incidents is that the system produced a low-probability bad output, you could not reproduce it, and you do not know why. Writing that sentence is the precondition for taking the system seriously as a probabilistic one. Hiding it keeps your team thinking they are debugging the wrong way.

The next decade of incident response is going to be built by teams who got comfortable saying "we could not reproduce this, and here is what we changed anyway." The teams who keep waiting for a deterministic root cause before they act are going to keep filing the same incident, with the same victim, and the same paragraph-shaped hole.

References:Let's stay in touch and Follow me for more thoughts and updates