
The Postmortem Template With No Row for the Model's Inference

· 11 min read
Tian Pan
Software Engineer

The first time an agent caused a real outage on my team, the postmortem author opened the template, scrolled past the timeline, stared at the "Root Cause" field for a long minute, and typed: "The runbook for queue-stuck recovery was incorrect." The runbook was fine. The agent had read the runbook, decided the queue's symptoms matched a different scenario, and run a recovery script for that other scenario instead. The action items that came out of that document — "tighten the runbook wording," "add a confirmation prompt to the recovery script" — were entirely useless against the actual failure mode, which was that an inferential system had inferred wrong and there was no field in the template that knew how to say so.

I've watched this exact failure repeat across teams since. The template is calibrated for deterministic systems. Code did the wrong thing, so you fix the code. Config was misset, so you fix the config. The schema of the postmortem document is the schema of the team's theory of failure, and when that theory cannot represent "the agent's plan was wrong," the document flattens the actual failure into the closest thing the template can represent — usually a documentation gap or a missing guardrail — and the action items chase a deterministic fix for a probabilistic failure. The same incident class then recurs, and the team writes it up the same way the next time.

This isn't a "we need better AI" problem. It's a documentation-of-organizational-learning problem. The postmortem template is the schema the learning has to fit into, and if the schema is wrong the learning never lands.

Why "fix the runbook" keeps showing up in the action items

The Google SRE postmortem template — the lineage almost every modern incident template descends from — has rows for trigger, contributing factors, root cause, detection, and what we'll change. Each row is shaped by an assumption: the failure was caused by a discrete thing in the system, and the discrete thing can be pointed at and fixed. The Five Whys ladder reinforces this. You start at the observable symptom and walk down a chain of "why" questions until you reach something the team can change. The whole structure is engineered to convert an incident into a list of code or config or process diffs.

That structure breaks the moment one of the actors in the incident is an inferential system whose output is sampled from a distribution. There is no "the code did the wrong thing" — the code was an LLM call that returned a plausible token sequence given its inputs. There is no "the config was misset" — the prompt was the same prompt that handled the previous hundred incidents correctly. The Five Whys chain runs aground at "why did the model output that?" because the answer is not a fact about your system; it's a fact about a distribution you don't own.

So the author does what the template asks them to do: they map the failure to the closest field that exists. "The model's inference was wrong" becomes "the runbook was ambiguous" or "the guardrail was missing." Those statements are also true, in some sense — the runbook could be less ambiguous, the guardrail could exist — but they were not the root cause, and the action items they generate do not prevent the next instance of the same failure class.

The Firetiger ingest outage in March 2026 ran into a version of this. The postmortem noted that their agent triage system "knew just what the problem was, and would have rolled back if they had given it the tools and connections to do so." That sentence is fascinating: the agent's correct inference was treated as a fact about the incident, not as a thing the template could grade. There was no field for "what the agent inferred and whether it was right." The agent's role appears as a footnote, not as a load-bearing component of the failure model.

What the missing rows look like

The fix is to extend the template so that an agent's role in an incident is a first-class part of the document, not a paragraph buried in the narrative. Concretely, four fields:

What the agent inferred. A direct statement of the agent's reading of the situation — "the operations agent diagnosed the queue stall as a deadlock." Not what it did, but what it believed. This is the LLM-equivalent of the "system state at the time of failure" row, and it has to be captured verbatim from the trace, not paraphrased.

What the correct inference would have been. The factual reading the agent missed — "the queue was stalled because the upstream producer had stopped publishing, not because of a deadlock." This is the gap the action items have to close.

What signal the agent missed. The specific piece of context that would have changed the inference if it had been visible — "the producer's heartbeat metric was not in the agent's tool surface." This is the actionable handle. It moves the conversation from "the model was wrong" (unactionable) to "the model was reasoning over an incomplete picture" (actionable).

What change to context or tool would have produced the correct inference. The intervention — "add the producer heartbeat to the queue-diagnostic tool's returned payload" — phrased as a tool or context change, not as a prompt tweak. "Improve the prompt" is the new "fix the runbook"; it shifts the work onto the model's reasoning rather than onto the information environment the model reasons in.

These four fields turn an unactionable lament ("the agent reasoned wrong") into a tractable engineering problem ("the agent reasoned over the wrong inputs"). They also do something subtler: they move the postmortem's center of gravity off the LLM call and onto the surrounding system. The model didn't fail in isolation; it failed inside a context pipeline, a tool surface, and an escalation policy that the team owns and can change.
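One way to keep these rows from being skipped is to give them a structure that can be checked. The sketch below is a minimal illustration, not a standard: the class name `AgentInferenceRows`, its field names, and the `blank_rows` helper are all assumptions made for this example, and the sample values are the ones quoted above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class AgentInferenceRows:
    """The four agent-specific rows; names are illustrative, not a standard."""

    agent_inferred: str          # verbatim from the trace, not paraphrased
    correct_inference: str       # the factual reading the agent missed
    missed_signal: str           # context that would have changed the inference
    context_or_tool_change: str  # intervention on tools or context, not a prompt tweak

    def blank_rows(self) -> List[str]:
        """Names of rows the author left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


rows = AgentInferenceRows(
    agent_inferred="the operations agent diagnosed the queue stall as a deadlock",
    correct_inference=(
        "the queue was stalled because the upstream producer had stopped "
        "publishing, not because of a deadlock"
    ),
    missed_signal="the producer's heartbeat metric was not in the agent's tool surface",
    context_or_tool_change="",  # left blank on purpose to show the check
)
print(rows.blank_rows())  # ['context_or_tool_change']
```

A check like this could run when the postmortem is submitted, so a document that names the wrong inference but proposes no tool or context change gets flagged before review.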

A four-way root-cause split that beats "AI did the thing"

Once the template has rows for the agent's inference, the next gap shows up in the categorization of the root cause. Most templates collapse anything LLM-related into one bucket — "AI failure" or, increasingly, "model error" — which is the categorical equivalent of writing "computer broke" in 1995. It hides four distinct failure modes that have four distinct fixes:

  • Agent failure. The model received correct context and the right tools and still produced the wrong plan. This is genuinely a model-capability gap and the right action item is an eval addition that catches the next instance, not a code change.
  • Tool failure. The model's plan was correct given the information it had, but the tool it called returned wrong data or didn't surface the relevant field. The fix is to the tool, not to the agent.
  • Data failure. The context the agent reasoned over was stale, incomplete, or corrupted. The fix is in the retrieval layer, the cache, or the upstream system of record.
  • Human-agent handoff failure. The agent escalated to a human (or failed to) at the wrong threshold, the human approved a plan they should have rejected, or a queue back-pressure issue meant the right human never saw it. The fix is in the escalation policy, the queue, or the reviewer's information surface.

A single incident often has more than one. The Firetiger writeup is a clean example: a race in CI, a build-attribution bug, and an agent that wasn't wired up to act on what it correctly inferred — three categories braided together. A template that forces you to attribute contribution across these four buckets produces action items that target the right layer. A template that lets you write "AI failure" produces a backlog item that nobody can close.
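As a rough sketch of what forcing attribution could look like, the snippet below models the four buckets as an enum and requires contribution weights that sum to 1. The `RootCause` enum, the `FIX_LAYER` mapping, the `attribute` function, and the idea of numeric weights are assumptions for illustration; the example incident and its weights are made up.

```python
import math
from enum import Enum
from typing import Dict, List, Tuple


class RootCause(Enum):
    AGENT = "agent failure"
    TOOL = "tool failure"
    DATA = "data failure"
    HANDOFF = "human-agent handoff failure"


# Where the fix lives for each category, paraphrasing the list above.
FIX_LAYER = {
    RootCause.AGENT: "eval addition",
    RootCause.TOOL: "the tool",
    RootCause.DATA: "retrieval layer, cache, or upstream system of record",
    RootCause.HANDOFF: "escalation policy, queue, or reviewer's information surface",
}


def attribute(weights: Dict[RootCause, float]) -> List[Tuple[RootCause, float, str]]:
    """Validate contribution weights and return them largest-first with the fix layer."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("contribution weights must be non-negative")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-6):
        raise ValueError("contribution weights must sum to 1")
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [(cause, w, FIX_LAYER[cause]) for cause, w in ranked if w > 0]


# Hypothetical incident; the weights are invented for illustration.
ranked = attribute({RootCause.TOOL: 0.5, RootCause.HANDOFF: 0.3, RootCause.AGENT: 0.2})
for cause, weight, layer in ranked:
    print(f"{cause.value}: {weight:.0%} -> fix in {layer}")
```

The numbers matter less than the constraint: an author who has to split the contribution cannot write a single undifferentiated "AI failure" and move on.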

Action items that close probabilistic gaps

The other place deterministic templates fall over is at the action-item stage. The implicit contract of an action item is that you can close it and the failure will not recur. That contract works for "deploy a hotfix to the rate limiter." It does not work for "the model gave a confidently wrong diagnosis," because no single change brings the failure rate of that class to zero — you're moving a distribution, not patching a bug.

The right shape for agent-failure action items is borrowed from how ML teams treat regression: you commit to measurable failure-rate reduction on a named eval, not to eliminating the failure. Three concrete patterns:

Eval coverage for the missed scenario. Take the trace from the incident, anonymize it, and add it to the eval suite. The action item closes when the suite, with this case included, has a green run against the model version you deploy. This is the LLM equivalent of a regression test, with one important difference: it doesn't guarantee the failure won't recur, only that the recurrence will be detected during evaluation rather than in production.
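A minimal sketch of the eval-coverage pattern, assuming the incident trace has already been anonymized and reduced to a dictionary of signals. `EvalCase`, `run_suite`, the signal names, and the case id are hypothetical, and `stub_agent` is a stand-in for whatever call actually invokes your agent.

```python
from typing import Callable, Dict, List, NamedTuple


class EvalCase(NamedTuple):
    name: str
    context: Dict[str, float]  # anonymized signals replayed from the incident trace
    expected_diagnosis: str    # the correct inference recorded in the postmortem


def run_suite(
    diagnose: Callable[[Dict[str, float]], str], cases: List[EvalCase]
) -> Dict[str, bool]:
    """Run every case through the agent and record pass/fail per case name."""
    return {c.name: diagnose(c.context) == c.expected_diagnosis for c in cases}


def stub_agent(context: Dict[str, float]) -> str:
    """Stand-in for the real agent call; replace with your own client."""
    if context.get("producer_heartbeat_age_s", 0.0) > 60:
        return "producer_stopped"
    return "deadlock"


suite = [
    EvalCase(
        name="incident-queue-stall",  # hypothetical case id
        context={"queue_depth": 0.0, "producer_heartbeat_age_s": 900.0},
        expected_diagnosis="producer_stopped",
    ),
]
results = run_suite(stub_agent, suite)
action_item_closed = all(results.values())  # green run closes the item
```

Because a real agent's output is sampled, a production version would likely run each case several times and track a pass rate rather than a single boolean.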

Guardrail addition. A deterministic check at the tool or output layer that catches the specific shape of the wrong inference. "If the recovery plan would truncate a table, require explicit human approval regardless of agent confidence." Guardrails are how you make a probabilistic system safer without making it smarter.
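The quoted rule can be sketched as a check that never consults the agent's confidence. This is an illustration under assumptions: `RecoveryPlan`, `requires_human_approval`, and the table name are hypothetical, and the keyword match is deliberately crude (a production guardrail would more likely inspect a parsed statement or a typed tool call).

```python
import re
from dataclasses import dataclass
from typing import List

# Illustrative pattern: flags any statement containing the TRUNCATE keyword.
TRUNCATE = re.compile(r"\bTRUNCATE\b", re.IGNORECASE)


@dataclass
class RecoveryPlan:
    statements: List[str]
    agent_confidence: float  # deliberately ignored by the guardrail


def requires_human_approval(plan: RecoveryPlan) -> bool:
    """Deterministic check: a plan that would truncate a table always pauses."""
    return any(TRUNCATE.search(s) for s in plan.statements)


risky = RecoveryPlan(["TRUNCATE TABLE queue_messages"], agent_confidence=0.99)
safe = RecoveryPlan(["SELECT count(*) FROM queue_messages"], agent_confidence=0.40)

print(requires_human_approval(risky))  # True, despite 0.99 confidence
print(requires_human_approval(safe))   # False
```

The design choice is that `agent_confidence` is present on the plan but unused: the guardrail's answer depends only on the shape of the action.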

Escalation threshold tuning. Quantitative changes to the policy that decides when the agent acts autonomously versus when it pauses for a human. If the agent's confidence calibration was wrong in this incident, the action item is to lower the autonomy threshold for the specific class of action, with a measurable target: "agent-only execution for queue-recovery falls from 80% to 30%, re-evaluated after fifty incidents."
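To make the target measurable, the policy has to be something you can replay against history. The sketch below assumes the policy gates on a self-reported confidence score with a minimum bar for autonomous action; that gating mechanism, the function names, and the ten-incident history are all assumptions chosen so the arithmetic matches the 80% to 30% example.

```python
from typing import Sequence


def agent_only_rate(confidences: Sequence[float], min_confidence: float) -> float:
    """Fraction of incidents where the agent would have acted without a human."""
    if not confidences:
        raise ValueError("need at least one incident")
    acted = sum(1 for c in confidences if c >= min_confidence)
    return acted / len(confidences)


def review_due(incidents_since_change: int, window: int = 50) -> bool:
    """True once enough incidents have accumulated to re-evaluate the policy."""
    return incidents_since_change >= window


# Hypothetical confidence scores from past queue-recovery incidents.
history = [0.95, 0.9, 0.9, 0.85, 0.8, 0.8, 0.75, 0.7, 0.6, 0.5]

before = agent_only_rate(history, min_confidence=0.7)  # 8 of 10 act alone
after = agent_only_rate(history, min_confidence=0.9)   # 3 of 10 act alone
print(f"agent-only execution: {before:.0%} -> {after:.0%}")
```

Replaying the stricter bar over past incidents gives an estimate of the new autonomy rate before the policy ships, and `review_due` marks when the fifty-incident re-evaluation should happen.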

None of these read like "fix the bug." They read like "tighten the distribution." That's the shape of the work, and the template has to make room for it.

The quarterly check that keeps the schema honest

Templates rot. The org adds a row, ships it, and a year later half the postmortems still skip the new fields because the incident commander is in a rush and defaults to muscle memory. The countermeasure isn't a stricter template — it's a quarterly review.

The review's job is to read every agent-involved postmortem from the last quarter and grade them on two questions. First, did the document actually use the agent-specific fields, or did the author flatten the failure into the deterministic rows? Second, are the failure rates the action items targeted actually trending down on the named evals, or are the same classes recurring? The first question keeps the schema alive. The second one keeps the action items from becoming rituals.

A team that runs this check once a quarter will see, after a year or so, which incident categories are converging (the eval-driven action items are working) and which are flat or worsening (the team is treating a research problem like an engineering one and the template is letting them do it). That signal is more useful than any individual postmortem because it tells you whether the organization's learning loop is actually closed.
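Both review questions reduce to small computations once postmortems and eval results are stored as data. The sketch below assumes each postmortem is a dictionary keyed by the four agent-specific rows and each incident class has one eval failure rate per quarter; the field names, class names, and numbers are hypothetical.

```python
from typing import Dict, List

AGENT_FIELDS = (
    "agent_inferred",
    "correct_inference",
    "missed_signal",
    "context_or_tool_change",
)


def schema_usage(postmortems: List[Dict[str, str]]) -> float:
    """Share of agent-involved postmortems with all four agent rows filled in."""
    if not postmortems:
        raise ValueError("no agent-involved postmortems this quarter")
    used = sum(
        1 for p in postmortems if all(p.get(f, "").strip() for f in AGENT_FIELDS)
    )
    return used / len(postmortems)


def trend(failure_rates: List[float]) -> str:
    """Compare the latest quarter's eval failure rate with the earliest."""
    if len(failure_rates) < 2:
        return "not enough data"
    return "converging" if failure_rates[-1] < failure_rates[0] else "flat or worsening"


# Hypothetical quarter: one complete postmortem, one flattened into old rows.
quarter = [
    {f: "filled in" for f in AGENT_FIELDS},
    {"agent_inferred": "diagnosed a deadlock", "missed_signal": ""},
]
# Hypothetical eval failure rates per incident class, oldest quarter first.
by_class = {
    "queue-recovery": [0.20, 0.12, 0.08, 0.05],
    "schema-migration": [0.10, 0.11, 0.10, 0.12],
}

print(schema_usage(quarter))
print({name: trend(rates) for name, rates in by_class.items()})
```

Comparing only the first and last quarter is a crude trend test; with more history, a fitted slope or a comparison of rolling averages would be less sensitive to one noisy quarter.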

The schema is the learning

Postmortem culture is, in the end, an organizational claim that the team learns from incidents. The template is the schema the learning has to fit into. An incomplete schema produces incomplete learning, and the gap is most expensive at exactly the moment the team is trying hardest to learn: the few hours after a customer-visible failure, when the on-call writer is still adrenalized and the document they produce will shape the next six months of work.

If your template has no row for what the agent inferred, no category that distinguishes agent failure from tool failure, and no action-item type that addresses a probability rather than a bug, the document you ship after the next AI incident will accurately describe what happened to a system you no longer have. The schema is the bottleneck. Fix the schema, and the postmortems start telling you the truth about the system you're actually running.
