
Your SRE Postmortem Template Is Missing Six Fields That Decide Every LLM Incident

11 min read
Tian Pan
Software Engineer

The first time you run an LLM incident through a classic SRE postmortem template, the template wins and the incident loses. Timeline, contributing factors, mitigation, prevention — every field is filled in, every box ticked, and at the end of the document nobody can answer the only question that matters: which variable actually moved? Not the deploy event. Not the infra fault. Not the code change. The prompt revision, the model slice the router picked, the judge configuration scoring the eval that failed to fire, the retrieval index state that was serving when the quality complaints landed, the tool schema versions the planner was composing against, the traffic mix that hit during the bad window. None of those have a row.

The SRE template wasn't designed for systems where the source of truth is an observed behavior rather than a code path. The variables that move silently in an LLM stack are the ones the template never had to enumerate. Borrowing the template anyway is what produces the "we don't know what changed" postmortem that files itself under "investigating" forever.

This isn't an argument against blameless culture or the SRE workflow. Both transfer cleanly. The argument is that the fields in the document — the explicit prompts the author has to answer — need a second set built for the variables an LLM system actually has. Without those prompts, the postmortem author stares at an empty "Contributing Factors" box and writes "the model degraded" because no field on the page asks them to look at the judge config.

The Six Fields the SRE Template Doesn't Have

A traditional SRE postmortem treats the system under test as deterministic given inputs, code, and config. The postmortem template's contributing-factor section assumes you can identify what changed by reading deploy logs and infra dashboards. For an LLM system, six variables move outside that envelope, and each one needs its own line in the document.

Prompt revision live during the incident. Not the prompt as it exists in main today — the prompt as it was rendered at the moment the bad output happened. Prompts mutate via system-prompt edits, few-shot library rotation, retrieved-context injection, and dynamic instruction blocks composed per-request. "We changed a few words" looks like a one-line diff in the change log; behaviorally it can be a complete recalibration. The field has to capture both the literal rendered prompt and the version IDs of every component that composed it.

Model version and routing slice. A single product surface often runs against multiple model tiers via a router (cheap-model fast path, expensive-model fallback for hard queries, region-pinned variants for compliance). When quality drops in a slice, the postmortem needs to know which slice the affected requests landed in. "We use Model X" is wrong twice — the product uses several, and the affected users may have been routed to a different one yesterday than today.

Judge configuration and eval state. The eval suite that didn't catch the regression is itself a system, with its own prompt, model, scoring rubric, and dataset version. When an incident reveals the eval missed the failure, you need the judge config snapshot from the eval run that signed off the last release — not the current judge config, which has probably already been edited in the panic. Without this snapshot, the question "did the eval pass on the actual broken state, or was the eval scoring the wrong thing?" is unanswerable.
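Whether the snapshot lives in a database or as a JSON blob next to the eval report matters less than that it exists and is immutable. Here is a minimal sketch of the capture step; the field names (judge_prompt_id, rubric_version, and so on) are illustrative, not taken from any particular eval framework.

```python
# Minimal sketch: persist the judge configuration next to every eval run that
# signs off a release. All field names here are illustrative, not a standard.
import hashlib
import json
import time
from dataclasses import dataclass, asdict
from pathlib import Path

@dataclass
class JudgeConfigSnapshot:
    judge_prompt_id: str     # revision of the judge's own prompt
    judge_model: str         # model the judge runs on
    rubric_version: str      # scoring rubric revision
    dataset_version: str     # eval dataset snapshot
    calibrated_at: str       # last calibration against human labels

def snapshot_judge_config(cfg: JudgeConfigSnapshot, eval_run_id: str,
                          out_dir: str = "judge_snapshots") -> str:
    """Content-address the judge config and store it keyed by eval run ID."""
    payload = json.dumps(asdict(cfg), sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()[:12]
    record = {"eval_run_id": eval_run_id, "config_hash": digest,
              "captured_at": time.time(), "config": asdict(cfg)}
    Path(out_dir).mkdir(exist_ok=True)
    # Append-only by convention: editing the live judge config later cannot
    # rewrite the record of what actually scored this release.
    (Path(out_dir) / f"{eval_run_id}.json").write_text(json.dumps(record, indent=2))
    return digest
```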

Retrieval index state and freshness lag. RAG-backed features depend on an index whose contents change continuously: documents added, deleted, re-embedded, reranked. The postmortem needs the index version (or commit/snapshot ID) that was serving during the incident, the lag against the source-of-truth at that moment, and whether any partial reindex was in flight. "The index was stale" is a type of bug the SRE template treats as an infra fault; in an LLM system, stale-by-six-hours and stale-by-four-days produce different behaviors and have different post-mitigation actions.

Tool schema versions and the composition graph. Agents that call external tools depend on the tool catalog's schemas — what arguments each tool accepts, what shape it returns, what permission scopes it carries. Tool providers change response shapes without coordinating with you. A vendor that adds a nullable field, deprecates an enum value, or quietly tightens a rate limit can flip your planner from working to broken between two requests. The postmortem needs the schema versions in effect at the time of the incident plus the specific composition (tool A → tool B → tool C) that produced the failure.

Traffic mix and input distribution. SRE postmortems rarely log the input distribution because deterministic systems don't depend on it the same way. LLM systems do: a long-tail spike of multi-intent queries, an unusual surge of cold-cache requests after a marketing email, a regional shift that moves traffic to a different model variant — these cause behavioral incidents that look like model regressions and aren't. The field needs the request-class histogram for the incident window compared against a baseline.
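Taken together, the six fields fit into one structured record the template can render near the top of the document. A minimal sketch of what that record could look like; every field name is illustrative, not a standard.

```python
# Sketch of a "System State at Incident Start" record with one slot per
# variable discussed above. All field names are illustrative.
from dataclasses import dataclass

@dataclass
class SystemStateAtIncidentStart:
    # 1. Prompt revision live during the incident
    rendered_prompt: str                       # the literal prompt as served
    prompt_component_versions: dict[str, str]  # e.g. {"system": "v14", "few_shot": "v7"}

    # 2. Model version and routing slice
    model_versions: dict[str, str]             # slice name -> model build
    routing_slice_for_affected_requests: str

    # 3. Judge configuration and eval state
    judge_config_snapshot_id: str              # from the eval run that signed off the release
    last_passing_eval_run_id: str

    # 4. Retrieval index state and freshness lag
    index_snapshot_id: str
    index_freshness_lag_seconds: int
    partial_reindex_in_flight: bool

    # 5. Tool schema versions and the composition graph
    tool_schema_versions: dict[str, str]       # tool name -> schema version
    failing_composition: list[str]             # e.g. ["search", "summarize", "draft_reply"]

    # 6. Traffic mix and input distribution
    request_class_histogram: dict[str, float]  # incident window
    baseline_histogram: dict[str, float]       # comparable baseline window
```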

The Incident-Class Taxonomy That Has to Be Added

The SRE template asks "what type of incident was this?" and offers the usual menu: outage, latency degradation, data corruption, security event, capacity exhaustion. None of those name the failure modes that dominate LLM production incidents. The postmortem template needs five additional incident classes, each with its own diagnostic checklist.

Silent quality regression. No error fired, no SLO breached, no alarm went off — and a meaningful fraction of users got worse answers for hours. This is the hardest class because the SRE detection pipeline has nothing to escalate. It surfaces through user complaints, support-ticket spikes, or downstream-product metrics that took days to move. The postmortem prompt: "What user signal first detected this, and what was the lag from onset to detection?"

Judge-induced false negative. The eval suite passed; the model still shipped a regression. The bug is in the judge, not the model: a judge prompt that doesn't penalize the failure mode, a judge model whose biases match the candidate model's biases (so it scores confident wrong answers as correct), a rubric whose criteria are stale. The postmortem prompt: "Did the eval that signed off this release run a current judge config, and when was the judge last calibrated against human labels?"

Retrieval staleness incident. The index was serving a view of the world that had drifted from the source-of-truth, and the model gave answers consistent with the stale view. The model behaved correctly; the index lied to it. The postmortem prompt: "What was the freshness lag during the incident, what source mutations were not yet reflected, and which queries landed on the stale segment?"

Tool-shape drift. A vendor changed a response shape, an internal tool added a field, a tool deprecation rolled out and the planner kept calling it. The incident manifests as malformed planner output, refused tool calls, or quiet downgrade to a fallback path. The postmortem prompt: "What schemas were in effect during the incident, what schema changes shipped (yours or vendors') in the prior 14 days, and which tool calls in the trace exhibited the drift signature?"

Prompt-rollout skew. Two regions, two cohorts, or two canary buckets ran different prompt revisions during the incident, and the failure correlated with one bucket. This class is particularly nasty in multi-region deployments where prompt changes propagate at different speeds and the team thinks of "the prompt" as a singular artifact. The postmortem prompt: "What was the prompt-revision distribution across regions/cohorts during the incident, and did the failure rate correlate with any specific revision?"
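If the template lives in tooling rather than a wiki page, the five classes and their diagnostic prompts can ship as data instead of tribal knowledge. A minimal sketch, with the class names and prompt wording taken from the descriptions above:

```python
# Sketch: the five LLM incident classes as template data, each carrying the
# diagnostic prompt the postmortem author has to answer.
from enum import Enum

class LLMIncidentClass(Enum):
    SILENT_QUALITY_REGRESSION = "silent_quality_regression"
    JUDGE_INDUCED_FALSE_NEGATIVE = "judge_induced_false_negative"
    RETRIEVAL_STALENESS = "retrieval_staleness"
    TOOL_SHAPE_DRIFT = "tool_shape_drift"
    PROMPT_ROLLOUT_SKEW = "prompt_rollout_skew"

DIAGNOSTIC_PROMPTS = {
    LLMIncidentClass.SILENT_QUALITY_REGRESSION:
        "What user signal first detected this, and what was the lag from "
        "onset to detection?",
    LLMIncidentClass.JUDGE_INDUCED_FALSE_NEGATIVE:
        "Did the eval that signed off this release run a current judge config, "
        "and when was the judge last calibrated against human labels?",
    LLMIncidentClass.RETRIEVAL_STALENESS:
        "What was the freshness lag during the incident, what source mutations "
        "were not yet reflected, and which queries landed on the stale segment?",
    LLMIncidentClass.TOOL_SHAPE_DRIFT:
        "What schemas were in effect, what schema changes shipped (yours or "
        "vendors') in the prior 14 days, and which tool calls in the trace "
        "exhibited the drift signature?",
    LLMIncidentClass.PROMPT_ROLLOUT_SKEW:
        "What was the prompt-revision distribution across regions/cohorts, and "
        "did the failure rate correlate with any specific revision?",
}
```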

A team without these classes in their template will reach for the closest SRE class, mislabel the incident, and over the course of a quarter accumulate a "miscellaneous quality" bucket where every interesting AI failure goes to die.

Timeline Reconstruction Needs Replay, Not Just Logs

The SRE timeline reconstruction is built on log lines: events, deploy markers, alert fires, on-call actions. For an LLM incident, log lines aren't enough — you need to be able to re-run the failure. Without the ability to replay the broken state against a candidate fix, the postmortem's "prevention" section degrades into wishful thinking.

Replayable trace storage is the dependency. Each request gets a span tree with the rendered prompt, retrieved chunks, tool inputs and outputs, model output, and every version ID in effect. The trace has to survive long enough to be useful in a postmortem (often weeks, sometimes months for slow-burn quality issues). Most teams discover during their first big incident that their trace store retention is 24 hours and the bug started 11 days ago.
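As a sketch of what "replay" means mechanically: re-run the incident request twice, first pinned to the incident-time versions, then with only the candidate fix swapped in. The bundle keys and the injected run_fn below are assumptions about your own stack, not a real API.

```python
# Sketch: replay a saved trace bundle, first pinned to incident-time versions,
# then with only the candidate fix swapped in. The bundle keys and the injected
# run_fn are assumptions about your own stack, not a real API.
def replay(bundle: dict, run_fn, candidate_overrides: dict) -> dict:
    pinned = {
        "prompt_components": bundle["prompt_component_versions"],
        "model": bundle["model_version"],
        "index_snapshot": bundle["index_snapshot_id"],
        "tool_schemas": bundle["tool_schema_versions"],
    }
    # Reproduce the broken state on the exact inputs that failed.
    baseline = run_fn(bundle["rendered_prompt"], bundle["retrieved_chunks"], **pinned)

    # Change exactly one variable (e.g. {"prompt_components": new_revision}).
    patched = {**pinned, **candidate_overrides}
    candidate = run_fn(bundle["rendered_prompt"], bundle["retrieved_chunks"], **patched)

    return {"baseline": baseline, "candidate": candidate}
```

Without the pinned baseline re-run, "the fix works" is a guess about a state you can no longer reproduce.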

Three observability primitives carry their weight (a code sketch follows the list):

  • Prompt-version pinning visible per span. Every span attribute includes the prompt revision IDs that composed the request, not just a high-level "prompt vX.Y" tag. When the incident is "this paragraph in the system prompt was the regression," you need the diff of that paragraph tied to that span.
  • Judge-config snapshots tied to each eval run. An eval pass-mark from three weeks ago doesn't prove the model was good then unless you also have the judge state that scored it. Snapshots are cheap; reconstructing a deleted judge config from memory during an incident is not.
  • Index-version stamps on retrieval spans. Every retrieval span carries the index commit (or snapshot ID) it queried. When the incident is retrieval staleness, you can pin the bad answer to the exact index state and prove (or disprove) the staleness hypothesis.
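As a concrete shape for the first and third primitives, here is a minimal sketch assuming an OpenTelemetry-style tracer; the attribute names are conventions invented for this example, not part of any semantic-convention standard.

```python
# Sketch: stamping version attributes onto a retrieval span, assuming an
# OpenTelemetry tracer is available. Attribute names are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer("llm.incident.telemetry")

def retrieve_with_provenance(query: str, retriever, index_snapshot_id: str,
                             prompt_component_versions: dict[str, str]):
    with tracer.start_as_current_span("rag.retrieve") as span:
        # Prompt-version pinning: every component revision, not just "prompt vX.Y".
        for component, revision in prompt_component_versions.items():
            span.set_attribute(f"prompt.component.{component}", revision)
        # Index-version stamp: which snapshot this retrieval actually queried.
        span.set_attribute("retrieval.index_snapshot_id", index_snapshot_id)
        chunks = retriever(query)
        span.set_attribute("retrieval.chunk_count", len(chunks))
        return chunks
```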

The cost of this telemetry is real but bounded. The cost of not having it is incidents that close as "investigating" and reopen six weeks later when the same pattern reappears.

What the Template Should Actually Look Like

A practical LLM postmortem template extends the SRE skeleton rather than replacing it. Keep the timeline, contributing factors, mitigation, prevention, and action items. Add a fixed section near the top called "System State at Incident Start" with explicit fields for each of the six variables above. Add the five new incident classes to the dropdown the author picks from. Add a "Replay" subsection under timeline reconstruction that links to the saved trace bundle.

The structural decision that matters most: make the AI-specific fields mandatory and pre-filled by automation wherever possible. The postmortem author shouldn't have to remember to record the prompt revision; the incident response tooling should pull it from the trace store and stamp it into the document at incident-creation time. Mandatory fields the author would never think to ask about are exactly the ones that catch the silent contributing factors.
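A sketch of that pre-fill step is below. The fetch_traces function and the doc object are hypothetical stand-ins for whatever your incident tooling and trace store expose; only the shape of the step matters.

```python
# Sketch: pre-fill the AI-specific fields at incident-creation time.
# fetch_traces and doc are hypothetical stand-ins for your incident tooling
# and trace store.
from collections import Counter

def prefill_system_state(doc, fetch_traces, incident_window) -> None:
    traces = fetch_traces(window=incident_window, sampled=True)
    if not traces:
        return  # leave the fields blank rather than guessing

    doc.set_field("prompt_revisions", sorted({t["prompt_revision_id"] for t in traces}))
    doc.set_field("model_slices", sorted({t["routing_slice"] for t in traces}))
    doc.set_field("index_snapshots", sorted({t["index_snapshot_id"] for t in traces}))
    doc.set_field("judge_config_snapshot", traces[0].get("last_eval_judge_snapshot_id"))
    doc.set_field("tool_schema_versions", traces[0].get("tool_schema_versions"))
    doc.set_field("request_class_histogram",
                  dict(Counter(t["request_class"] for t in traces)))
```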

The org failure mode this prevents: a postmortem author who didn't write the prompt, didn't tune the judge, didn't own the index pipeline, and didn't deploy the tool change, but is on the hook to explain what happened. Without the template prompting them, they default to whatever they personally know best — usually the model — and the postmortem's "root cause" becomes "the model regressed," which is false in roughly the same way "the database was slow" is false in classic SRE without the actual query plan attached.

The Architectural Takeaway

The contributing factors section of an SRE postmortem is the lever that drives prevention work: it points to the variable that moved, and the action items target that variable. When the section can't capture the variables that move in your system, the prevention work targets the wrong thing — usually a generic "improve monitoring" or "add more eval cases" item that nobody can verify closed the gap.

A postmortem template is a forcing function. It tells the author what questions to answer before they declare the incident understood. The SRE template forces the right questions for systems whose state lives in code and configuration. An LLM system's state lives in prompts, models, judges, indexes, schemas, and traffic — and until the template forces those questions explicitly, every postmortem will under-capture the variables that actually decided the incident, and the next one will look exactly like the last.

The work isn't glamorous. It's a one-page template extension, a handful of trace attributes, a snapshot job for judge configs, and a discipline of writing the AI-specific fields before the narrative section so the author can't skip them. But it's the difference between a postmortem you can act on and a postmortem you can only file.
