
Your SRE Postmortem Template Is Missing Six Fields That Decide Every LLM Incident

· 11 min read
Tian Pan
Software Engineer

The first time you run an LLM incident through a classic SRE postmortem template, the template wins and the incident loses. Timeline, contributing factors, mitigation, prevention — every field is filled in, every box ticked, and at the end of the document nobody can answer the only question that matters: which variable actually moved? Not the deploy event. Not the infra fault. Not the code change. The prompt revision, the model slice the router picked, the judge configuration scoring the eval that failed to fire, the retrieval index state that was serving when the quality complaints landed, the tool schema versions the planner was composing, the traffic mix that hit during the bad window. None of those have a row.

The SRE template wasn't designed for systems where the source of truth is an observed behavior rather than a code path. The variables that move silently in an LLM stack are the ones the template never had to enumerate. Borrowing the template anyway is what produces the "we don't know what changed" postmortem that files itself under "investigating" forever.

This isn't an argument against blameless culture or the SRE workflow. Both transfer cleanly. The argument is that the fields in the document — the explicit prompts the author has to answer — need a second set built for the variables an LLM system actually has. Without those prompts, the postmortem author stares at an empty "Contributing Factors" box and writes "the model degraded" because no field on the page asks them to look at the judge config.

The Six Fields the SRE Template Doesn't Have

A traditional SRE postmortem treats the system as deterministic given its inputs, code, and config. The template's contributing-factors section assumes you can identify what changed by reading deploy logs and infra dashboards. For an LLM system, six variables move outside that envelope, and each one needs its own line in the document.

Prompt revision live during the incident. Not the prompt as it exists in main today — the prompt as it was rendered at the moment the bad output happened. Prompts mutate via system-prompt edits, few-shot library rotation, retrieved-context injection, and dynamic instruction blocks composed per-request. "We changed a few words" looks like a one-line diff in the change log; behaviorally it can be a complete recalibration. The field has to capture both the literal rendered prompt and the version IDs of every component that composed it.
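One way to make this field fillable is to log a small snapshot per request. The sketch below is illustrative only: `PromptSnapshot`, `changed_components`, and the component names are hypothetical, and it assumes each prompt component already carries a version ID.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSnapshot:
    """What the model actually saw for one request (field names are illustrative)."""
    rendered_prompt: str                 # the literal text sent to the model
    component_versions: dict[str, str]   # e.g. {"system_prompt": "v14"}

    def fingerprint(self) -> str:
        # Hash the rendered text together with the component versions so two
        # requests can be compared without diffing full prompts.
        payload = json.dumps(
            {"prompt": self.rendered_prompt, "components": self.component_versions},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def changed_components(before: PromptSnapshot, after: PromptSnapshot) -> dict:
    """Return {component: (old_version, new_version)} for every component that differs."""
    names = set(before.component_versions) | set(after.component_versions)
    return {
        name: (before.component_versions.get(name), after.component_versions.get(name))
        for name in sorted(names)
        if before.component_versions.get(name) != after.component_versions.get(name)
    }
```

Comparing a snapshot from a known-good request against one from the incident window tells the author which component moved, even when the change log shows only a one-line diff.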

Model version and routing slice. A single product surface often runs against multiple model tiers via a router (cheap-model fast path, expensive-model fallback for hard queries, region-pinned variants for compliance). When quality drops in a slice, the postmortem needs to know which slice the affected requests landed in. "We use Model X" is wrong twice — the product uses several, and the affected users may have been routed to a different one yesterday than today.
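If each request log line records which model and slice served it, the per-slice breakdown is a short aggregation. This is a minimal sketch; the keys `model_version`, `routing_slice`, and `score` are assumed names, not a real logging schema.

```python
from collections import defaultdict


def quality_by_slice(requests):
    """Group per-request quality scores by (model_version, routing_slice).

    `requests` is an iterable of dicts with the hypothetical keys
    'model_version', 'routing_slice', and 'score' (0.0 to 1.0).
    """
    buckets = defaultdict(list)
    for r in requests:
        buckets[(r["model_version"], r["routing_slice"])].append(r["score"])
    return {
        key: {"count": len(scores), "mean_score": sum(scores) / len(scores)}
        for key, scores in buckets.items()
    }
```

Running this over the incident window and over a baseline window shows whether the drop is confined to one slice or whether traffic simply moved between slices.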

Judge configuration and eval state. The eval suite that didn't catch the regression is itself a system, with its own prompt, model, scoring rubric, and dataset version. When an incident reveals the eval missed the failure, you need the judge config snapshot from the eval run that signed off the last release — not the current judge config, which has probably already been edited in the panic. Without this snapshot, the question "did the eval pass on the actual broken state, or was the eval scoring the wrong thing?" is unanswerable.
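The snapshot only helps if it cannot be edited after the fact. A minimal sketch of a write-once record keyed by release follows; `JudgeSnapshotStore` is a hypothetical in-memory stand-in for whatever durable store a team actually uses.

```python
import copy


class JudgeSnapshotStore:
    """Write-once record of the judge config that signed off each release."""

    def __init__(self):
        self._by_release = {}

    def record(self, release_id: str, judge_config: dict) -> None:
        if release_id in self._by_release:
            raise ValueError(f"judge snapshot for {release_id} already recorded")
        # Deep copy so later edits to the live config cannot alter the record.
        self._by_release[release_id] = copy.deepcopy(judge_config)

    def get(self, release_id: str) -> dict:
        # Return a copy so callers cannot mutate the stored snapshot either.
        return copy.deepcopy(self._by_release[release_id])
```

The deep copy is the point of the design: the postmortem reads the config as it stood at sign-off, regardless of what was changed during the incident.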

Retrieval index state and freshness lag. RAG-backed features depend on an index whose contents change continuously: documents added, deleted, re-embedded, reranked. The postmortem needs the index version (or commit/snapshot ID) that was serving during the incident, the lag against the source-of-truth at that moment, and whether any partial reindex was in flight. "The index was stale" is the kind of bug the SRE template treats as an infra fault; in an LLM system, stale-by-six-hours and stale-by-four-days produce different behaviors and call for different follow-up actions.
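The three items this field asks for fit in one small record. The sketch below assumes the indexing pipeline can report the newest source change it has reflected; `IndexState` and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IndexState:
    snapshot_id: str                  # index version or commit/snapshot ID
    synced_through: datetime          # newest source change reflected in the index
    partial_reindex_in_flight: bool   # True if a reindex was running at the time


def freshness_lag(state: IndexState, source_last_modified: datetime) -> timedelta:
    """How far the serving index trailed the source of truth (never negative)."""
    return max(source_last_modified - state.synced_through, timedelta(0))
```

Recording the lag as a duration rather than a stale/not-stale flag keeps the six-hour case and the four-day case distinguishable in the document.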

Tool schema versions and the composition graph. Agents that call external tools depend on the tool catalog's schemas — what arguments each tool accepts, what shape it returns, what permission scopes it carries. Tool providers change response shapes without coordinating with you. A vendor that adds a nullable field, deprecates an enum value, or quietly tightens a rate limit can flip your planner from working to broken between two requests. The postmortem needs the schema versions in effect at the time of the incident plus the specific composition (tool A → tool B → tool C) that produced the failure.
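Detecting that a schema moved can be as simple as diffing two captured versions. This sketch assumes schemas have been flattened to `{field_name: type_name}` dicts, which is a simplification; real tool schemas are usually nested, and `schema_diff` is a hypothetical helper.

```python
def schema_diff(old: dict, new: dict) -> dict:
    """Compare two flat {field_name: type_name} schemas for one tool."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(name for name in shared if old[name] != new[name]),
    }
```

Running the diff between the schema captured at the last known-good request and the one in effect during the incident gives the postmortem a concrete answer to "did a tool change under us?".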

Traffic mix and input distribution. SRE postmortems rarely log the input distribution because deterministic systems don't depend on it the same way. LLM systems do: a long-tail spike of multi-intent queries, an unusual surge of cold-cache requests after a marketing email, a regional shift that moves traffic to a different model variant — these cause behavioral incidents that look like model regressions and aren't. The field needs the request-class histogram for the incident window compared against a baseline.
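The comparison against a baseline can be sketched as a per-class share difference plus a single summary number. Here the summary is total variation distance, chosen for illustration rather than taken from the article, and the request-class labels are assumed to come from an existing classifier.

```python
from collections import Counter


def class_shares(request_classes):
    """Fraction of requests in each class; empty input yields an empty dict."""
    counts = Counter(request_classes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: n / total for cls, n in counts.items()}


def mix_shift(baseline, incident):
    """Return (per-class share deltas, total variation distance) for two windows."""
    base, inc = class_shares(baseline), class_shares(incident)
    classes = sorted(set(base) | set(inc))
    deltas = {c: inc.get(c, 0.0) - base.get(c, 0.0) for c in classes}
    tvd = 0.5 * sum(abs(d) for d in deltas.values())
    return deltas, tvd
```

A large positive delta on one class during the bad window points at an input shift rather than a model regression, which is exactly the distinction this field exists to make.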

The Incident-Class Taxonomy That Has to Be Added

The SRE template asks "what type of incident was this?" and offers the usual menu: outage, latency degradation, data corruption, security event, capacity exhaustion. None of those name the failure modes that dominate LLM production incidents. The postmortem template needs five additional incident classes, each with its own diagnostic checklist.

Silent quality regression. No error fired, no SLO breached, no alarm went off — and a meaningful fraction of users got worse answers for hours. This is the hardest class because the SRE detection pipeline has nothing to escalate. It surfaces through user complaints, support-ticket spikes, or downstream-product metrics that took days to move. The postmortem prompt: "What user signal first detected this, and what was the lag from onset to detection?"

Judge-induced false negative. The eval suite passed; the model still shipped a regression. The bug is in the judge, not the model: a judge prompt that doesn't penalize the failure mode, a judge model whose biases match the candidate model's biases (so it scores confident wrong answers as correct), a rubric whose criteria are stale. The postmortem prompt: "Did the eval that signed off this release run a current judge config, and when was the judge last calibrated against human labels?"
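Calibration against human labels can be reduced to two numbers for the postmortem. The sketch assumes a sample of outputs where both a judge verdict and a human verdict exist; `judge_calibration` and the metric names are hypothetical.

```python
def judge_calibration(pairs):
    """Summarize judge vs. human verdicts.

    `pairs` is an iterable of (judge_passed, human_passed) booleans.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one labeled pair")
    agreements = sum(judge == human for judge, human in pairs)
    # Judge verdicts on the outputs that humans marked as failures.
    on_human_fail = [judge for judge, human in pairs if not human]
    return {
        "agreement": agreements / len(pairs),
        # Share of human-failed outputs the judge still passed; None if no failures.
        "false_pass_rate": (sum(on_human_fail) / len(on_human_fail)) if on_human_fail else None,
    }
```

The false-pass rate is the number that matters for this incident class: it measures how often the judge waves through exactly the outputs a human would have rejected.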

Retrieval staleness incident. The index was serving a view of the world that had drifted from the source-of-truth, and the model gave answers consistent with the stale view. The model behaved correctly; the index lied to it. The postmortem prompt: "What was the freshness lag during the incident, what source mutations were not yet reflected, and which queries landed on the stale segment?"
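The last part of that prompt, which queries landed on the stale segment, is answerable if retrieval logs keep the document IDs returned per query. A minimal sketch, assuming hypothetical log keys `query_id` and `doc_ids`:

```python
def queries_on_stale_docs(retrieval_log, stale_doc_ids):
    """Map each affected query ID to the stale documents it retrieved.

    `retrieval_log` is an iterable of dicts with hypothetical keys
    'query_id' and 'doc_ids'; `stale_doc_ids` lists documents whose source
    changes had not yet reached the index during the incident.
    """
    stale = set(stale_doc_ids)
    affected = {}
    for entry in retrieval_log:
        overlap = sorted(stale.intersection(entry["doc_ids"]))
        if overlap:
            affected[entry["query_id"]] = overlap
    return affected
```

The result scopes the incident: queries outside this set were answered from fresh documents, so complaints about them need a different explanation.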
