3 posts tagged with "postmortem"

From a Bug to a Behavior Rate: The AI Postmortem Without a Reproducer

· 10 min read
Tian Pan
Software Engineer

A user files a ticket. The agent told a paying customer their refund would be processed in seven hours when the documented SLA is seven days. Screenshot attached. You pull the trace, find the exact prompt, the exact tool calls, the exact model and seed. You replay it. The model says seven days. You replay it again. Seven days. You replay it a hundred times. It says seven days ninety-eight times and "by end of day" twice, and never once says seven hours. The screenshot is unambiguous. The replay disagrees. The postmortem due Friday now has a "Root Cause" section and no root cause to put in it.

This is the shape of most AI incidents that reach a postmortem. Not the obvious outages — those have stack traces and 500-rate graphs and recover the way every SRE has been trained to expect. The hard ones are the single bad output that left a victim, erased its own conditions on the way out, and refuses to come back when you summon it. Every postmortem template you have ever used assumes a reproducer. Agents do not give you one.
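One way to read the replay numbers in the scenario above is as a rate estimate, not a verdict. The sketch below is an illustration only, not a method the post prescribes: it tallies the hundred replays and puts a Wilson score interval around each answer's observed rate. The answer labels and the `wilson_interval` helper are assumptions introduced for this example.

```python
import math
from collections import Counter


def wilson_interval(hits: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = hits / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] so rounding error never yields an impossible rate.
    return max(0.0, center - half), min(1.0, center + half)


# Tally of the hundred replays from the scenario (labels are hypothetical).
replays = Counter({"seven days": 98, "by end of day": 2, "seven hours": 0})
total = sum(replays.values())

for answer, count in replays.items():
    low, high = wilson_interval(count, total)
    print(f"{answer!r}: {count}/{total}, plausible rate {low:.1%} to {high:.1%}")
```

Zero occurrences in a hundred replays still leaves an upper bound of roughly 3.7% on the replayed rate, so the replay does not contradict the screenshot. The bound also only describes replay conditions, which may differ from whatever the production request actually saw.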

Your SRE Postmortem Template Is Missing Six Fields That Decide Every LLM Incident

· 11 min read
Tian Pan
Software Engineer

The first time you run an LLM incident through a classic SRE postmortem template, the template wins and the incident loses. Timeline, contributing factors, mitigation, prevention — every field is filled in, every box ticked, and at the end of the document nobody can answer the only question that matters: which variable actually moved? Not the deploy event. Not the infra fault. Not the code change. The prompt revision, the model slice the router picked, the judge configuration scoring the eval that failed to fire, the retrieval index state that was serving when the quality complaints landed, the tool schema versions the planner was composing, the traffic mix that hit during the bad window. None of those have a row.

The SRE template wasn't designed for systems where the source of truth is an observed behavior rather than a code path. The variables that move silently in an LLM stack are the ones the template never had to enumerate. Borrowing the template anyway is what produces the "we don't know what changed" postmortem that files itself under "investigating" forever.
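The six variables named above could be captured as a snapshot that a postmortem compares across the good and bad windows. This is a minimal sketch under stated assumptions: the class name, field names, and example values are all hypothetical, chosen only to mirror the list in the excerpt.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class LLMIncidentContext:
    """One row per variable the excerpt says can move silently (names are hypothetical)."""
    prompt_revision: str
    model_slice: str                      # which model the router picked
    judge_config: str                     # configuration of the eval judge
    retrieval_index_state: str            # index snapshot serving at the time
    tool_schema_versions: dict[str, str]  # tool name -> schema version
    traffic_mix: dict[str, float]         # traffic segment -> share of requests


def moved_variables(baseline: LLMIncidentContext, incident: LLMIncidentContext) -> list[str]:
    """Return the names of fields that differ between the two snapshots."""
    return [
        f.name
        for f in fields(baseline)
        if getattr(baseline, f.name) != getattr(incident, f.name)
    ]


# Hypothetical snapshots for a good window and a bad window.
good = LLMIncidentContext(
    prompt_revision="rev-41",
    model_slice="slice-a",
    judge_config="judge-v3",
    retrieval_index_state="index-monday",
    tool_schema_versions={"refund_lookup": "1.2"},
    traffic_mix={"support": 0.7, "sales": 0.3},
)
bad = LLMIncidentContext(
    prompt_revision="rev-42",
    model_slice="slice-a",
    judge_config="judge-v3",
    retrieval_index_state="index-wednesday",
    tool_schema_versions={"refund_lookup": "1.2"},
    traffic_mix={"support": 0.7, "sales": 0.3},
)

print(moved_variables(good, bad))
```

A diff like this only answers "which variable moved" if the snapshot was recorded at serving time; reconstructing it after the incident is the gap the excerpt describes.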

Why 'Fix the Prompt' Is a Root Cause Fallacy: Blameless Postmortems for AI Systems

· 9 min read
Tian Pan
Software Engineer

Your LLM-powered feature starts returning nonsense. The on-call engineer pages the ML team. They look at the output, compare it to what the prompt was supposed to produce, and within the hour the ticket is resolved: "bad prompt — tweaked and redeployed." Incident closed. Postmortem written. Action items: "improve prompt engineering process."

Two weeks later, the same class of failure happens again. Different prompt, different feature — but the same invisible root cause.

The "fix the prompt" reflex is the AI engineering equivalent of blaming the last developer to touch a file. It gives postmortems a clean ending without requiring anyone to understand what actually broke. And unlike traditional software, where this reflex is merely lazy, in AI systems it's structurally dangerous — because non-deterministic systems fail in ways that prompt changes cannot fix.