The AI Incident Postmortem Nobody Writes: A Four-Layer Diagnosis Framework
When a recommendation engine surfaced offensive content last quarter, the post-incident review produced a familiar outcome: a two-hour call where ML engineers pointed at the retrieval corpus, data engineers pointed at the prompt, product engineers pointed at monitoring, and infrastructure pointed at the model version that nobody remembered upgrading. Three action items were created. None had owners. The incident closed. The same failure mode shipped again six weeks later.
This is not a story about one team. It is the default ending for AI incidents at most organizations. Responsibility for what an AI feature does in production is distributed across enough parties that a standard postmortem cannot pin causation. The five-whys analysis that works well for database timeouts breaks when the failure is "the model gave the wrong answer" — because the correct next question is never obvious.
AI incidents increased 56.4% from 2023 to 2024. Only 20% of organizations have a tested AI incident response plan. The gap between those two numbers is where production systems silently degrade.
Why Standard Postmortems Fail for AI
A traditional postmortem assumes one root cause and one owning team. The runbook says: find the service that failed, find the engineer on-call for it, trace back through the system until you hit the cause, assign remediation.
AI features break this model in at least three ways.
First, causation is distributed. When an LLM gives a wrong answer, the output is the product of the model weights, the prompt, the retrieval context, the serving infrastructure, and the integration logic. Any of these layers can be the proximate cause while the others appear normal. An analysis of 1,200 production LLM deployments found that 40% of failures trace to model drift and 60% trace to tool versioning — meaning in most incidents, the model itself is not the culprit. But most postmortems start by interrogating the model.
Second, failure signals are soft. Hard errors — HTTP 500s, timeouts, null responses — trigger alerts and get caught quickly. But AI degradation usually looks like elevated latency, lower click-through rates, or subtly wrong answers that users tolerate until they stop using the feature entirely. An LLM provider published a postmortem documenting three simultaneous quality regressions that went undetected for weeks because HTTP success rates looked normal throughout. One bug caused wrong-language responses; another caused incorrect arithmetic. Both surfaced as clean API responses.
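Soft regressions like the wrong-language bug are catchable, but only if something samples the content of successful responses. Here is a minimal sketch, assuming you sample production responses and the feature is expected to answer in English; the stopword heuristic and the 5% alert threshold are illustrative stand-ins for a real quality check:

```python
# Sketch of a soft-signal check: sample successful responses and flag quality
# regressions that never show up as HTTP errors. The heuristic and threshold
# below are illustrative assumptions, not a production detector.
from dataclasses import dataclass

ENGLISH_STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "that"}

@dataclass
class SampledResponse:
    request_id: str
    text: str
    http_status: int  # usually 200 even when quality has regressed

def looks_like_expected_language(text: str) -> bool:
    """Crude check: an English answer should contain at least one common stopword."""
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    return bool(tokens & ENGLISH_STOPWORDS)

def soft_regression_rate(samples: list[SampledResponse]) -> float:
    """Fraction of 'successful' responses that fail the quality heuristic."""
    ok = [s for s in samples if s.http_status == 200]
    if not ok:
        return 0.0
    bad = sum(1 for s in ok if not looks_like_expected_language(s.text))
    return bad / len(ok)

if __name__ == "__main__":
    samples = [
        SampledResponse("r1", "The refund is processed within 5 days.", 200),
        SampledResponse("r2", "Le remboursement est traité sous 5 jours.", 200),  # wrong language, clean 200
    ]
    rate = soft_regression_rate(samples)
    if rate > 0.05:  # alert threshold is an assumption
        print(f"soft quality regression: {rate:.0%} of sampled responses look wrong")
```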
Third, ownership is genuinely ambiguous. ML engineers own the model logic and prompts. Data engineers own the retrieval corpus and feature pipeline. Product engineers own the integration and user-facing behavior. Infrastructure owns the serving layer, model versions, and deployment configuration. All four parties have partial accountability and partial visibility. None has the full picture. When an incident bridges two of these layers — a prompt that was fine until the retrieval corpus schema changed — the organization has no instinct for who runs the postmortem.
The Four-Layer Diagnosis
The fix is not a new RACI chart. Most AI teams already have ownership matrices; they're just not structured around the question an incident demands, which is: which layer broke?
A useful AI postmortem starts with a structured diagnostic across four layers before assigning any blame or action items.
Layer 1: Model Layer
Was there a change to what the model produces? This includes model drift (performance degradation as the world changes relative to training data), hardware-level errors (TPU or GPU arithmetic bugs, documented in production postmortems from multiple LLM providers), or a model version change that nobody flagged. The distinguishing signal: outputs are wrong but inputs look normal. Useful checks: did model version or config change near the incident window? Have prediction distributions shifted? Is the failure consistent or stochastic?
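A first pass at these model-layer checks can be scripted. The sketch below assumes you keep a log of model version and config changes and retain samples of output scores in [0, 1] from before and during the incident; the PSI-style drift metric and the 0.2 threshold are illustrative, not a calibrated detector:

```python
# Two model-layer checks: (a) did the model version/config change near the
# incident window, and (b) have output score distributions shifted.
from datetime import datetime, timedelta
from math import log

def changes_near_window(change_log, incident_start, lookback_hours=72):
    """Return model version/config changes that landed shortly before the incident."""
    window_start = incident_start - timedelta(hours=lookback_hours)
    return [c for c in change_log if window_start <= c["deployed_at"] <= incident_start]

def population_stability_index(baseline, current, bins=10):
    """Rough PSI between two samples of model output scores in [0, 1]."""
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int(x * bins), bins - 1)] += 1
        return [max(c / len(xs), 1e-6) for c in counts]  # floor avoids log(0)
    b, c = hist(baseline), hist(current)
    return sum((ci - bi) * log(ci / bi) for bi, ci in zip(b, c))

if __name__ == "__main__":
    change_log = [{"what": "model pin bumped v3 -> v3.1", "deployed_at": datetime(2025, 6, 2, 9, 0)}]
    incident_start = datetime(2025, 6, 3, 14, 0)
    print(changes_near_window(change_log, incident_start))

    baseline_scores = [0.82, 0.79, 0.91, 0.85, 0.88]
    incident_scores = [0.41, 0.38, 0.52, 0.44, 0.47]
    psi = population_stability_index(baseline_scores, incident_scores)
    print(f"PSI={psi:.2f} (values above ~0.2 usually warrant a closer look)")
```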
Layer 2: Data Layer
Did the information flowing into the model change? Schema changes upstream from the retrieval pipeline are responsible for a significant fraction of AI incidents — not because the data team made a mistake, but because AI features have implicit contracts with data structure that are rarely written down. Data drift (input distributions shifting from training distribution), corpus freshness issues, or a silent change in how features are computed all surface here. The distinguishing signal: inputs have changed, outputs reflect that change correctly. The model did exactly what it was trained to do — with different data.
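One low-cost mitigation is to write the implicit contract down and check it on every corpus refresh. A minimal sketch with hypothetical field names:

```python
# Make the implicit data contract explicit: the AI feature declares the fields
# it expects from the retrieval corpus, and a check runs on every refresh.
EXPECTED_DOCUMENT_FIELDS = {
    "doc_id": str,
    "title": str,
    "body": str,
    "updated_at": str,  # ISO-8601 timestamp, also useful for corpus-freshness checks
}

def validate_document(doc: dict) -> list[str]:
    """Return a list of contract violations for one retrieved document."""
    problems = []
    for field, expected_type in EXPECTED_DOCUMENT_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {type(doc[field]).__name__}"
            )
    return problems

if __name__ == "__main__":
    # Upstream schema change: 'body' renamed to 'content' without anyone telling the AI team.
    doc = {"doc_id": "kb-482", "title": "Refund policy", "content": "...",
           "updated_at": "2025-06-01T00:00:00Z"}
    print(validate_document(doc))  # ['missing field: body']
```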
Layer 3: Integration Layer
Did the product-side logic around the model break? Prompt injection, rate-limit mishandling, incorrect context assembly, fallback logic that failed to activate, or changes to how the feature interprets model output all live in this layer. This is typically owned by product engineering but is often the last place anyone looks because it involves "just the plumbing." Zalando's AI-powered postmortem analysis system exhibited a classic integration-layer failure: it attributed incidents to whatever technology was mentioned in the postmortem text, producing systematically wrong root cause analysis. The model itself was fine. The integration logic made surface-level attribution instead of causal attribution.
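Integration-layer failures are easier to diagnose when the plumbing records its own decisions. A sketch of a fallback wrapper that logs whether and why the fallback path activated; the call signatures and sanity check are assumptions to be replaced with your own client and rules:

```python
# Integration-layer plumbing that is observable enough to audit later: the
# wrapper records fallback activation and the reason for it.
import logging
from typing import Callable

logger = logging.getLogger("ai_integration")

def answer_with_fallback(
    query: str,
    call_model: Callable[[str], str],
    fallback: Callable[[str], str],
    min_length: int = 10,
) -> str:
    """Call the model; fall back on errors or on output that fails a sanity check."""
    try:
        answer = call_model(query)
    except Exception:
        logger.warning("integration_layer fallback=used reason=model_error query_len=%d", len(query))
        return fallback(query)

    # Output-interpretation checks live here, in the integration layer.
    if len(answer.strip()) < min_length:
        logger.warning("integration_layer fallback=used reason=short_output query_len=%d", len(query))
        return fallback(query)

    logger.info("integration_layer fallback=unused")
    return answer

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    def flaky_model(q: str) -> str:
        raise TimeoutError("upstream timeout")

    def safe_default(q: str) -> str:
        return "Sorry, I can't answer that right now."

    print(answer_with_fallback("Where is my order?", flaky_model, safe_default))
```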
Layer 4: Infrastructure Layer
Did the deployment environment change? Load balancer configuration, caching behavior, multi-region failover, traffic routing, and scaling logic all live here. Infrastructure changes are frequently invisible to ML and product teams, which is why they produce the most confusing incidents. One documented production incident involved a load balancing change that silently degraded 16% of requests on one model for weeks. Nobody connected the infrastructure change to the quality regression until a manual audit surfaced the timing overlap.
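The corrective for incidents like that load-balancing one is to slice quality metrics by routing target instead of looking only at the aggregate. A sketch, assuming you can label each sampled request with the replica or region that served it; the scores and the 10% gap threshold are illustrative:

```python
# Infrastructure-layer check: break a quality metric down by routing target and
# flag targets that trail the best one, so a partial degradation is visible.
from collections import defaultdict
from statistics import mean

def quality_by_route(samples):
    """samples: (route, quality_score) pairs; returns mean quality per route."""
    grouped = defaultdict(list)
    for route, score in samples:
        grouped[route].append(score)
    return {route: mean(scores) for route, scores in grouped.items()}

def degraded_routes(samples, max_gap=0.10):
    """Routes whose mean quality trails the best route by more than max_gap."""
    per_route = quality_by_route(samples)
    best = max(per_route.values())
    return {route: q for route, q in per_route.items() if best - q > max_gap}

if __name__ == "__main__":
    samples = [
        ("replica-a", 0.88), ("replica-a", 0.90), ("replica-a", 0.86),
        ("replica-b", 0.61), ("replica-b", 0.58), ("replica-b", 0.65),  # silently degraded
    ]
    print(degraded_routes(samples))  # {'replica-b': 0.613...}
```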
The diagnostic question for each layer is not "who is responsible" but "did this layer change or degrade near the incident window." A change in the infrastructure layer does not make infrastructure the guilty party — it makes infrastructure the relevant layer for investigation.
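In practice the diagnostic can start as a script that pulls change events from each layer's source of truth into one timeline. A sketch, assuming a shared event schema that tags each change with its layer; the sources, field names, and one-week lookback are assumptions:

```python
# First pass at the four-layer diagnostic: group change/degradation events by
# layer, restricted to a lookback window before the incident, oldest first.
from collections import defaultdict
from datetime import datetime, timedelta

LAYERS = ("model", "data", "integration", "infrastructure")

def four_layer_diagnostic(events, incident_start, lookback_hours=168):
    """Group recent change events by layer for review, not for blame assignment."""
    window_start = incident_start - timedelta(hours=lookback_hours)
    by_layer = defaultdict(list)
    for event in sorted(events, key=lambda e: e["at"]):
        if window_start <= event["at"] <= incident_start and event["layer"] in LAYERS:
            by_layer[event["layer"]].append(event)
    return by_layer

if __name__ == "__main__":
    incident_start = datetime(2025, 6, 3, 14, 0)
    events = [
        {"layer": "infrastructure", "at": datetime(2025, 5, 30, 8, 0), "what": "load balancer weights changed"},
        {"layer": "data", "at": datetime(2025, 6, 2, 23, 0), "what": "corpus refresh: schema migration"},
        {"layer": "model", "at": datetime(2025, 4, 1, 0, 0), "what": "model pin bumped to v3.1"},  # outside window
    ]
    for layer, changes in four_layer_diagnostic(events, incident_start).items():
        print(layer, "->", [c["what"] for c in changes])
```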
Audit Trails That Let You Answer the Diagnostic
The four-layer diagnostic only works if you can observe each layer at incident time. Most AI monitoring is oriented toward aggregate metrics — error rates, latency percentiles, throughput. These measure the integrated output of all four layers at once. They don't let you isolate which layer failed.
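One way to close that gap is a per-request audit record that snapshots the identifying state of each layer. A minimal sketch; the field names and logging sink are assumptions, but the shape is the point: when quality regresses, affected requests can be grouped by model version, corpus snapshot, integration build, or serving region to see which dimension separates good from bad.

```python
# One structured audit line per request: enough to attribute a later quality
# regression to a specific layer instead of an aggregate error rate.
import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("ai_audit")

def emit_audit_record(request_id, model_version, prompt_template_version,
                      corpus_snapshot_id, integration_build, serving_region):
    """Record each layer's identifying state for one request."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "request_id": request_id,
        "model": {"version": model_version, "prompt_template": prompt_template_version},
        "data": {"corpus_snapshot": corpus_snapshot_id},
        "integration": {"build": integration_build},
        "infrastructure": {"region": serving_region},
    }
    audit_log.info(json.dumps(record))

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    emit_audit_record(
        request_id="req-8841",
        model_version="v3.1",
        prompt_template_version="pt-12",
        corpus_snapshot_id="corpus-2025-06-02",
        integration_build="web-4412",
        serving_region="eu-west-1",
    )
```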
Sources
- https://engineering.zalando.com/posts/2025/09/dead-ends-or-data-goldmines-ai-powered-postmortem-analysis.html
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://arxiv.org/html/2504.08865v2
- https://arxiv.org/html/2601.20727
- https://wetranscloud.com/blog/mlops-governance-audit-trails-compliance-explainability/
- https://elevateconsult.com/insights/designing-the-ai-governance-operating-model-raci/
- https://rootly.com/incident-postmortems/blameless
- https://www.mdpi.com/2624-800X/6/1/20
- https://stackgen.com/blog/managing-complex-incidents-ai-sre-agents/
