AI Incident Retrospectives: When 'The Model Did It' Is the Root Cause

· 10 min read
Tian Pan
Software Engineer

Your customer support AI told a passenger he could buy a full-fare ticket and claim a retroactive bereavement discount afterward. He trusted it, flew, and filed the claim. The company denied it. A tribunal ruled the company liable for $650 anyway — because there was no distinction in the law between a human employee and a chatbot giving authoritative-sounding advice. The chatbot wasn't crashing. No alerts fired. No p99 latency spiked. The system was "working."

That is the defining characteristic of AI incidents: the application doesn't fail — it succeeds at producing the wrong output, confidently and at scale. And when you sit down to write the post-mortem, the classical toolbox falls apart.

5-why analysis assumes that if you trace the causal chain long enough, you reach a single deterministic root cause you can fix. "The function returned null because the input was empty because the upstream service dropped the field because..." That chain exists. You can follow it.

With AI failures, the chain terminates at "the model predicted this output given this input" — and that prediction is stochastic. Ask the same question again with the same prompt and you may get a different answer. The "bug" doesn't reproduce. You can't point to a line of code. The failure occurred probabilistically, somewhere inside billions of parameters you don't control.

This post is about how to handle that — concretely.

Why 5-Why Fails on Stochastic Systems

The 5-why method is a root-cause extraction technique. It works when causes are deterministic and singular. AI failures violate both properties.

Non-determinism at the model layer: Even at temperature 0, most production LLMs are not bit-for-bit deterministic: batched inference makes an output depend on what else is in the batch, floating-point arithmetic is non-associative across hardware and kernel implementations, and mixture-of-experts routing and quantization add further variance. At temperature > 0, you're explicitly sampling from a probability distribution. The "same" prompt does not guarantee the same output.
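To make the sampling point concrete, here is a minimal sketch of temperature-scaled token sampling over a toy logit table. The `logits` values and token names are invented for illustration; real decoders operate over the full vocabulary.

```python
import math
import random

def sample_token(logits: dict[str, float], temperature: float,
                 rng: random.Random) -> str:
    """Sample a token from logits; temperature rescales the distribution."""
    if temperature == 0:
        # Greedy decoding: pick the argmax. Even this is not guaranteed
        # deterministic in production (batching, floating-point effects).
        return max(logits, key=logits.get)
    scaled = {t: v / temperature for t, v in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    probs = {t: math.exp(v) / z for t, v in scaled.items()}
    r = rng.random()
    cum = 0.0
    for token, p in probs.items():
        cum += p
        if r < cum:
            return token
    return token  # float-rounding fallback: return the last token

# Toy example: at temperature 1.0 repeated calls yield a mix of tokens;
# at temperature 0 every call returns "yes".
logits = {"yes": 2.0, "no": 1.5, "maybe": 0.5}
rng = random.Random()
samples = [sample_token(logits, 1.0, rng) for _ in range(20)]
```

The same prompt (same `logits`) produces different outputs across calls precisely because each call draws a fresh sample — which is why "reproduce the bug" is not a meaningful step in an AI post-mortem.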

Multiple contributing causes: An AI incident typically has at least three interlocked causes — a model behavior, a prompt gap, and a data context issue — none of which alone would have triggered the failure. 5-why looks for one root, not a conjunction.

Silent failure modes: Traditional incidents have a moment of detection: an error rate spikes, a crash alert fires, a timeout threshold is breached. AI incidents often surface through user complaints, downstream data quality checks, or manual review — sometimes days after the failure began. By then, the exact conditions that triggered it may be irrecoverable.

The upshot: you cannot treat an AI post-mortem like a software bug post-mortem. You need a different analytical frame.

A Taxonomy of AI Failure Types

Before you can write a useful post-mortem, you need to correctly classify what kind of failure occurred. The remediation is completely different depending on type.

Model failures are failures embedded in the trained parameters themselves. Amazon's experimental recruiting tool penalized résumés containing words like "women's" because it was trained on a decade of résumés from a male-dominated industry. The bug was in the training data and the learned weights — not in any code path you could inspect. You can't patch it; you retrain or you scrap it.

Prompt failures are failures in how you're instructing the model. The intent of your prompt doesn't match what the model actually optimizes for. A prompt that says "summarize this contract" without specifying audience, length, or which clauses matter most will produce inconsistent output. The model isn't broken; the instruction is underspecified. These are fixable through prompt iteration and evaluation.

Data failures are failures caused by the input distribution shifting away from what the model was built for. Zillow's home-valuation algorithm worked reasonably well until post-pandemic market volatility broke every historical pattern it had been trained on. The model's logic was internally consistent — the real-world distribution had changed. These require monitoring and retraining triggers, not prompt fixes.

Most production AI incidents are a combination: a prompt with insufficient constraints (prompt failure) operating on inputs from a shifted distribution (data failure) that hits an edge case in the model's training (model failure). Your post-mortem needs to identify which layer was primary and what each layer contributed.

What Telemetry You Need to Reconstruct the Failure

The reason most AI post-mortems are inconclusive is that teams didn't capture enough context at inference time. You can't reconstruct an AI failure from latency metrics and error rates alone.

Capture the full request-response pair: Every inference call should store the complete prompt (including system prompt and injected context), the raw model response, model version and configuration (temperature, max tokens, top-p), and a trace ID linking it to the user session and upstream request.
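A minimal sketch of what such a record could look like, assuming a JSON-lines log sink; the field names and the `to_log_line` helper are illustrative, not a prescribed schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class InferenceRecord:
    """One inference call: full request-response pair plus the
    configuration needed to replay or explain it later."""
    system_prompt: str
    user_prompt: str
    injected_context: list[str]   # documents placed in the context window
    response: str
    model_version: str            # exact version, not just the model family
    temperature: float
    max_tokens: int
    top_p: float
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_log_line(self) -> str:
        return json.dumps(asdict(self))
```

The `trace_id` is what lets you join this record to the user session and upstream request during an investigation.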

Trace multi-stage pipelines end-to-end: Most production LLM systems aren't single inference calls. They're retrieval pipelines, chained agent calls, tool executions, and reranking steps. Each stage needs its own span — the retrieval query and results, the reranking scores, each LLM call in the chain, each tool invocation and its output. When an incident occurs, you need to reconstruct the entire causal chain, not just the final generation.
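In practice you would use an existing tracing framework, but the shape of the data is simple enough to sketch: one span per stage, all sharing a trace ID. The stage names and attributes below are hypothetical.

```python
import time
import uuid
from contextlib import contextmanager

class PipelineTracer:
    """Minimal tracer: one span per pipeline stage, sharing a trace ID."""
    def __init__(self):
        self.trace_id = str(uuid.uuid4())
        self.spans = []

    @contextmanager
    def span(self, stage: str, **attributes):
        record = {"trace_id": self.trace_id, "stage": stage,
                  "attributes": dict(attributes)}
        start = time.monotonic()
        try:
            # Stages attach their outputs to the span before it closes.
            yield record
        finally:
            record["duration_ms"] = (time.monotonic() - start) * 1000
            self.spans.append(record)

# Hypothetical two-stage RAG pipeline: retrieval, then generation.
tracer = PipelineTracer()
with tracer.span("retrieval", query="refund policy") as s:
    s["attributes"]["results"] = ["doc-123", "doc-456"]
with tracer.span("generation", model_version="gpt-4-0613") as s:
    s["attributes"]["response"] = "Refunds are available within 30 days."
```

During an incident, filtering `spans` by the trace ID of the affected session reconstructs the causal chain stage by stage.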

Log the context window contents: Retrieval-augmented systems fail when the retrieved context is wrong, stale, or irrelevant. Capturing what documents were in the context window at inference time — not just what query was used to retrieve them — is the difference between being able to explain the failure and writing "the model hallucinated" in your post-mortem.

Preserve model versioning metadata: If you're calling an API provider, log the exact model version, not just the model name. "gpt-4" is not a stable artifact; providers make silent updates. When a behavioral shift happens, you need to know whether the model changed underneath you.

Implement structured evaluation logging alongside inference: Rather than waiting for incidents to analyze quality, run lightweight automated evaluations as part of your inference pipeline. Log a structured quality score (accuracy, relevance, constraint satisfaction) alongside each response. This turns your production traffic into a continuous dataset — and it's how you detect degradation before users file complaints.
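A sketch of what a lightweight inline evaluator could look like. The checks here — length constraint, token-overlap grounding, refusal detection — are crude heuristics chosen for illustration; real systems would use task-specific evals.

```python
def token_overlap(a: str, b: str) -> float:
    """Fraction of a's tokens that also appear in b (crude grounding proxy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta) if ta else 0.0

def evaluate_response(response: str, context_docs: list[str],
                      max_words: int = 200) -> dict:
    """Structured quality score logged alongside each response.
    Heuristic sketch, not a production evaluator."""
    grounding = max((token_overlap(response, d) for d in context_docs),
                    default=0.0)
    return {
        "length_ok": len(response.split()) <= max_words,
        "grounding_score": round(grounding, 3),
        "refused": "i don't know" in response.lower(),
    }
```

Logged per-request, scores like these turn production traffic into a time series you can alert on — a falling `grounding_score` trend surfaces degradation before users file complaints.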

How to Write an Incident Review That Actually Teaches Something

The biggest failure mode in AI post-mortems isn't the analysis — it's the framing. Teams write "the model produced an incorrect output" as both the incident description and the root cause, then conclude "we will monitor more carefully." That post-mortem is worse than useless; it trains your team to accept AI failures as unanalyzable acts of God.

Reconstruct the conditions, not just the output: For each AI incident, the post-mortem should answer: what was in the context window? What retrieval results populated the prompt? What model version and parameters were used? What was the exact user input? If you can't answer these, your observability infrastructure is the real problem — and that should be the primary corrective action.

Classify the failure and state it explicitly: "Model failure: the model's training data did not include updated policy information that contradicts the default it learned." "Prompt failure: the system prompt did not specify that the model should express uncertainty when it lacks policy information." "Data failure: the context injection pipeline retrieved a cached document from six weeks ago." Naming the type forces precision and makes the corrective action obvious.

Measure the blast radius: How many users received the incorrect output? What was the time window between failure onset and detection? What downstream systems or decisions were affected? Blast radius isn't just a severity label — it's a quantitative estimate: affected users × detection window × probability of acting on the output. A failure that affected 12 users over 4 hours has a different remediation priority than one that affected 12,000 users over 4 weeks.
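The estimate above is simple enough to encode directly; the 0.3 act-on probability in the example below is an assumed value for illustration.

```python
def blast_radius(affected_users: int, detection_window_hours: float,
                 p_acted: float) -> float:
    """Blast radius = affected users x detection window x probability
    that a user acted on the incorrect output."""
    return affected_users * detection_window_hours * p_acted

# The two scenarios from the text, assuming p_acted = 0.3:
small = blast_radius(12, 4, 0.3)          # 12 users over 4 hours
large = blast_radius(12_000, 672, 0.3)    # 12,000 users over 4 weeks
```

The point isn't the exact number — it's that a multiplicative estimate forces the post-mortem to state all three inputs, any of which may expose an observability gap (e.g. "we don't actually know the detection window").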

Include what you won't know: Honest post-mortems acknowledge the limits of reconstruction. If you don't have a trace for the incident session, say so. If the model's non-determinism means you can't confirm what the model would have said, say so. Epistemic humility isn't weakness — it's a signal to the team that investing in observability infrastructure would have changed the outcome.

Building Runbooks That Don't Just Say "Monitor More Carefully"

A runbook for an AI system has to do something a traditional runbook doesn't: distinguish between the customer clock and the diagnosis clock.

The customer clock is time-to-remediation: how quickly can you stop the harm? The answer is almost always "roll back to the last known good model version or tighten the output filters." This should be a five-minute operation. The runbook step is: identify the deployed model version that preceded the behavioral shift, flip the deployment, verify that problematic outputs are no longer appearing, and update incident status.

The diagnosis clock is time-to-understanding: how quickly can you identify what changed and why? This takes longer and should not block the customer clock. Diagnosis runs in parallel with remediation.

Structure your AI runbook around failure type:

For suspected prompt failures: examine recent prompt changes in version control, run the affected prompt against a regression test suite (you should have one), compare output distributions before and after the change, isolate the specific constraint or instruction that's missing.

For suspected data failures: check feature distributions and data freshness for the inputs involved, run statistical tests for distribution shift (Population Stability Index is a standard metric), identify whether the failure correlates with a specific input segment or time window.

For suspected model failures: check whether the underlying model version changed (API providers update models silently), compare model behavior on the same inputs across versions, determine whether the failure reproduces in a sandboxed environment with fixed parameters.

Build explicit escalation criteria: When does an AI incident get treated as a P0? A useful threshold: any failure where a user received factually incorrect output that they were likely to act on in a high-stakes domain (medical, financial, legal) is a P0 regardless of volume. A failure that affected 10,000 users with low-stakes consequences might be P2. Define this before an incident, not during one.
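Escalation criteria are easiest to enforce when they're encoded as a decision rule rather than prose. A sketch of the threshold described above — the priority labels and the 1,000-user cutoff are illustrative, not prescriptive:

```python
def classify_severity(high_stakes_domain: bool, factually_incorrect: bool,
                      likely_acted_on: bool, affected_users: int) -> str:
    """Map incident facts to a priority. Encodes the rule: incorrect
    output likely acted on in a high-stakes domain is P0 regardless
    of volume. Other thresholds here are assumed for illustration."""
    if factually_incorrect and likely_acted_on and high_stakes_domain:
        return "P0"
    if affected_users >= 1_000:
        return "P1" if factually_incorrect else "P2"
    return "P2" if factually_incorrect else "P3"
```

Because the rule is code, it can be reviewed, versioned, and applied consistently at 3 a.m. — which is the entire point of deciding it before the incident.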

Closing the Loop: From Post-Mortem to Prevention

The output of an AI post-mortem should be one or more of these concrete artifacts — not generic action items:

A regression test case: The specific input, context, and expected behavior that captured the failure. This goes into your golden dataset and runs against every future model version and prompt change.
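One way to structure such a case — the incident tag, inputs, and phrase constraints below are hypothetical, modeled on the bereavement-fare incident from the opening:

```python
GOLDEN_CASES = [
    {
        "id": "incident-bereavement-fare",  # hypothetical incident tag
        "user_input": "Can I get a bereavement discount after I fly?",
        "context_docs": ["Bereavement fares must be requested before travel."],
        "must_contain": ["before travel"],
        "must_not_contain": ["retroactive", "after your flight"],
    },
]

def check_golden_case(case: dict, model_response: str) -> list[str]:
    """Return the list of violated constraints (empty list = pass)."""
    failures = []
    lower = model_response.lower()
    for phrase in case["must_contain"]:
        if phrase.lower() not in lower:
            failures.append(f"missing required phrase: {phrase!r}")
    for phrase in case["must_not_contain"]:
        if phrase.lower() in lower:
            failures.append(f"contains forbidden phrase: {phrase!r}")
    return failures
```

Running every golden case against each candidate model version or prompt change is what turns a one-off incident into a permanent regression barrier.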

An observability gap identified: The specific piece of telemetry you didn't have that would have made the incident faster to diagnose. A runbook for capturing it. A timeline for when it will be in place.

A classification update: A tagged entry in your internal AI incident database (even a spreadsheet) noting the failure type, affected system, blast radius, and resolution. Over time, this becomes the dataset that lets you see patterns — clusters of prompt failures in a specific domain, recurring data drift in a particular input segment.

A guardrail or constraint added: If the failure was a prompt failure, a new constraint or few-shot example added to the system prompt. If it was a model failure, a new output validator. If it was a data failure, a new freshness check or staleness rejection rule.
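A combined sketch of the model-failure and data-failure guardrails above: an output validator that rejects stale context and banned claims. The 30-day staleness window and the banned-phrase list are assumed values for illustration.

```python
from datetime import datetime, timedelta, timezone

def validate_output(response: str,
                    source_doc_timestamps: list[datetime],
                    max_staleness_days: int = 30,
                    banned_claims: tuple[str, ...] = ("guaranteed refund",),
                    ) -> list[str]:
    """Post-inference guardrail: return a list of problems (empty = pass).
    Staleness window and banned claims are illustrative defaults."""
    problems = []
    now = datetime.now(timezone.utc)
    if any(now - ts > timedelta(days=max_staleness_days)
           for ts in source_doc_timestamps):
        problems.append("stale context document")
    lower = response.lower()
    for claim in banned_claims:
        if claim in lower:
            problems.append(f"banned claim: {claim!r}")
    return problems
```

A non-empty result can trigger a fallback ("I'm not certain — let me connect you with an agent") instead of shipping the response — bounding the blast radius of the next occurrence of the same failure.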

The teams that get better at AI reliability over time are the ones that treat each post-mortem as an investment in the evaluation and observability infrastructure — not just a record that something went wrong. The goal isn't to make the model perfect; it's to make the system's failure modes visible, bounded, and learnable.

AI systems will produce incorrect outputs. That's not the incident — that's the physics of probabilistic systems. The incident is when you don't know it happened until a user tells you, you can't reconstruct why, and you have no principled response. That's a process failure, and it's entirely fixable.
