AI Incident Retrospectives: When 'The Model Did It' Is the Root Cause
Your customer support AI told a passenger he could buy a full-fare ticket and claim a retroactive bereavement discount afterward. He trusted it, flew, and filed the claim. The company denied it. A tribunal ruled the company liable for $650 anyway — because there was no distinction in the law between a human employee and a chatbot giving authoritative-sounding advice. The chatbot wasn't crashing. No alerts fired. No p99 latency spiked. The system was "working."
That is the defining characteristic of AI incidents: the application doesn't fail — it succeeds at producing the wrong output, confidently and at scale. And when you sit down to write the post-mortem, the classical toolbox falls apart.
5-why analysis assumes that if you trace the causal chain long enough, you reach a single deterministic root cause you can fix. "The function returned null because the input was empty because the upstream service dropped the field because..." That chain exists. You can follow it.
With AI failures, the chain terminates at "the model predicted this output given this input" — and that prediction is stochastic. Ask the same question again with the same prompt and you may get a different answer. The "bug" doesn't reproduce. You can't point to a line of code. The failure occurred probabilistically, somewhere inside billions of parameters you don't control.
This post is about how to handle that — concretely.
Why 5-Why Fails on Stochastic Systems
The 5-why method is a root-cause extraction technique. It works when causes are deterministic and singular. AI failures violate both properties.
Non-determinism at the model layer: Even at temperature 0, most production LLM services are not reproducible. Batched inference means your request's output can depend on which other requests happen to share the batch; floating-point arithmetic is non-associative, so results vary across GPU kernels and hardware; quantization adds further variation. At temperature > 0, you're explicitly sampling from a probability distribution. The "same" prompt does not guarantee the same output.
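This is easy to verify empirically. A minimal sketch, assuming the OpenAI Python SDK (the model name is a placeholder; substitute whatever your system actually calls): send the identical prompt repeatedly at temperature 0 and count the distinct completions that come back.

```python
# Minimal non-determinism check: same prompt, "deterministic" settings,
# count how many distinct outputs come back. Assumes the OpenAI Python SDK;
# the model name is a placeholder.
from collections import Counter

from openai import OpenAI

client = OpenAI()

def sample_outputs(prompt: str, n: int = 20) -> Counter:
    outputs: Counter = Counter()
    for _ in range(n):
        response = client.chat.completions.create(
            model="gpt-4o",   # placeholder: use your production model
            temperature=0,    # nominally deterministic
            messages=[{"role": "user", "content": prompt}],
        )
        outputs[response.choices[0].message.content] += 1
    return outputs

# More than one key in the result means the "same" call is not reproducible.
print(sample_outputs("Summarize our bereavement-fare policy in one sentence."))
```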
Multiple contributing causes: An AI incident typically has at least three interlocked causes — a model behavior, a prompt gap, and a data context issue — none of which alone would have triggered the failure. 5-why looks for one root, not a conjunction.
Silent failure modes: Traditional incidents have a moment of detection: an error rate spikes, a crash alert fires, a timeout threshold is breached. AI incidents often surface through user complaints, downstream data quality checks, or manual review — sometimes days after the failure began. By then, the exact conditions that triggered it may be irrecoverable.
The upshot: you cannot treat an AI post-mortem like a software bug post-mortem. You need a different analytical frame.
A Taxonomy of AI Failure Types
Before you can write a useful post-mortem, you need to correctly classify what kind of failure occurred, because each type calls for a completely different remediation.
Model failures are failures embedded in the trained parameters themselves. Amazon's experimental recruiting tool penalized résumés containing words like "women's" because it was trained on a decade of résumés from a male-dominated industry. The bug was in the training data and the learned weights — not in any code path you could inspect. You can't patch it; you retrain or you scrap it.
Prompt failures are failures in how you're instructing the model. The intent of your prompt doesn't match what the model actually optimizes for. A prompt that says "summarize this contract" without specifying audience, length, or which clauses matter most will produce inconsistent output. The model isn't broken; the instruction is underspecified. These are fixable through prompt iteration and evaluation.
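To make the contract-summary example concrete, here is a sketch of the same task written both ways; the constraint set is illustrative, not a template for every contract summary.

```python
# Illustrative only: the same task, underspecified vs. constrained.
UNDERSPECIFIED = "Summarize this contract."

CONSTRAINED = """You are summarizing a commercial contract for a non-lawyer
account manager. Output at most five bullet points. You must cover payment
terms, termination clauses, and liability caps. If a required clause is
absent, write "not present" instead of inferring one. Quote no more than
15 consecutive words verbatim from the contract."""
```

The second version won't be perfect either, but when it fails, the post-mortem can point to a specific violated constraint rather than to an open-ended instruction.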
Data failures are failures caused by the input distribution shifting away from what the model was built for. Zillow's home-valuation algorithm worked reasonably well until post-pandemic market volatility broke every historical pattern it had been trained on. The model's logic was internally consistent — the real-world distribution had changed. These require monitoring and retraining triggers, not prompt fixes.
Most production AI incidents are a combination: a prompt with insufficient constraints (prompt failure) operating on inputs from a shifted distribution (data failure) that hits an edge case in the model's training (model failure). Your post-mortem needs to identify which layer was primary and what each layer contributed.
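One way to force that discipline is to make the post-mortem record itself demand a primary layer plus per-layer contributions. A minimal sketch, where the field names and the example incident are illustrative:

```python
# Sketch of a post-mortem record that forces the classification: one primary
# layer, plus what each layer contributed and how each gets remediated.
from dataclasses import dataclass, field
from enum import Enum

class FailureLayer(Enum):
    MODEL = "model"      # learned weights / training data
    PROMPT = "prompt"    # underspecified or misaligned instructions
    DATA = "data"        # input distribution shift

@dataclass
class AIIncidentClassification:
    primary_layer: FailureLayer
    contributions: dict[FailureLayer, str] = field(default_factory=dict)
    remediation: dict[FailureLayer, str] = field(default_factory=dict)

# Hypothetical incident, for illustration only.
incident = AIIncidentClassification(
    primary_layer=FailureLayer.PROMPT,
    contributions={
        FailureLayer.PROMPT: "no constraint against answering policy questions",
        FailureLayer.DATA: "policy page missing from retrieval index",
        FailureLayer.MODEL: "fills gaps with plausible-sounding policy text",
    },
    remediation={
        FailureLayer.PROMPT: "add refusal instruction + citation requirement",
        FailureLayer.DATA: "index freshness check on policy documents",
        FailureLayer.MODEL: "none available short-term; mitigate at other layers",
    },
)
```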
What Telemetry You Need to Reconstruct the Failure
The reason most AI post-mortems are inconclusive is that teams didn't capture enough context at inference time. You can't reconstruct an AI failure from latency metrics and error rates alone.
Capture the full request-response pair: Every inference call should store the complete prompt (including system prompt and injected context), the raw model response, model version and configuration (temperature, max tokens, top-p), and a trace ID linking it to the user session and upstream request.
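A minimal sketch of what that per-call record could look like; the field names and the storage sink are illustrative, not a schema from any particular tool.

```python
# Sketch: one record per inference call, capturing everything needed to
# reconstruct the call later. `store` is a stand-in for your actual sink.
import json
import time
import uuid

def log_inference(store, *, trace_id: str, session_id: str,
                  system_prompt: str, injected_context: list[str],
                  user_prompt: str, model: str, temperature: float,
                  max_tokens: int, top_p: float, raw_response: str) -> None:
    record = {
        "id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "trace_id": trace_id,            # links to session + upstream request
        "session_id": session_id,
        "prompt": {
            "system": system_prompt,
            "context": injected_context,  # everything injected at runtime
            "user": user_prompt,
        },
        "model_config": {
            "model": model,
            "temperature": temperature,
            "max_tokens": max_tokens,
            "top_p": top_p,
        },
        "response": raw_response,        # raw, before any post-processing
    }
    store.append(json.dumps(record))
```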
Trace multi-stage pipelines end-to-end: Most production LLM systems aren't single inference calls. They're retrieval pipelines, chained agent calls, tool executions, and reranking steps. Each stage needs its own span — the retrieval query and results, the reranking scores, each LLM call in the chain, each tool invocation and its output. When an incident occurs, you need to reconstruct the entire causal chain, not just the final generation.
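A sketch using the OpenTelemetry Python API, with stand-in functions for the pipeline stages; the span and attribute names are illustrative rather than a standardized semantic convention.

```python
# One span per pipeline stage, nested under a parent span so the whole
# chain shares a single trace.
from opentelemetry import trace

tracer = trace.get_tracer("rag-pipeline")

# Stand-ins for your actual pipeline stages.
def retrieve(question): return [{"id": "doc-1", "text": "..."}]
def rerank(question, docs): return [{**d, "score": 0.9} for d in docs]
def generate(question, docs): return "drafted answer"

def answer(question: str) -> str:
    with tracer.start_as_current_span("rag.request") as root:
        root.set_attribute("user.question", question)

        with tracer.start_as_current_span("rag.retrieval") as span:
            docs = retrieve(question)
            span.set_attribute("retrieval.query", question)
            span.set_attribute("retrieval.doc_ids", [d["id"] for d in docs])

        with tracer.start_as_current_span("rag.rerank") as span:
            docs = rerank(question, docs)
            span.set_attribute("rerank.scores", [d["score"] for d in docs])

        with tracer.start_as_current_span("rag.generate") as span:
            text = generate(question, docs)
            span.set_attribute("llm.response.chars", len(text))

        return text
```

With an exporter configured, every stage of a failed request shows up under one trace, so the post-mortem can see whether retrieval, reranking, or generation introduced the problem.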
Log the context window contents: Retrieval-augmented systems fail when the retrieved context is wrong, stale, or irrelevant. Capturing what documents were in the context window at inference time — not just what query was used to retrieve them — is the difference between being able to explain the failure and writing "the model hallucinated" in your post-mortem.
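A small sketch of what to persist per retrieved document (names are illustrative); the content hash lets you later check whether a source document changed after the incident, without diffing full text.

```python
# Sketch: snapshot what was actually in the context window at inference
# time, not just the query that retrieved it.
import hashlib

def snapshot_context(docs: list[dict]) -> list[dict]:
    return [
        {
            "doc_id": d["id"],
            "source": d.get("source"),
            "retrieved_at": d.get("retrieved_at"),
            "content_sha256": hashlib.sha256(d["text"].encode()).hexdigest(),
            "content": d["text"],   # or a pointer, if documents are large
        }
        for d in docs
    ]
```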
Preserve model versioning metadata: If you're calling an API provider, log the exact model version, not just the model name. "gpt-4" is not a stable artifact; providers make silent updates. When a behavioral shift happens, you need to know whether the model changed underneath you.
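With the OpenAI SDK, for example, the response object echoes the model that actually served the request, and, where populated, a system_fingerprint identifying the backend configuration. A hedged sketch:

```python
# Sketch: record what the provider says it actually ran, not what you
# asked for. Assumes the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",      # what you requested...
    messages=[{"role": "user", "content": "ping"}],
)

version_metadata = {
    "requested_model": "gpt-4",
    "resolved_model": response.model,                  # e.g. a dated snapshot
    "system_fingerprint": response.system_fingerprint,  # may be None
}
print(version_metadata)
```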
Sources
- https://www.datadoghq.com/blog/engineering/llms-for-postmortems/
- https://opentelemetry.io/blog/2024/llm-observability/
- https://portkey.ai/blog/the-complete-guide-to-llm-observability/
- https://dev.to/waxell/when-your-ai-agent-has-an-incident-your-runbook-isnt-ready-1ag6
- https://www.cmswire.com/customer-experience/exploring-air-canadas-ai-chatbot-dilemma/
- https://insideainews.com/2021/12/13/the-500mm-debacle-at-zillow-offers-what-went-wrong-with-the-ai-models/
- https://dev.to/kuldeep_paul/how-to-debug-llm-failures-a-practical-guide-for-ai-engineers-5c0b
- https://runplane.ai/ai-runtime-governance/ai-blast-radius
- https://arxiv.org/html/2509.14404v1
- https://engineering.zalando.com/posts/2025/09/dead-ends-or-data-goldmines-ai-powered-postmortem-analysis.html
