Why 'Fix the Prompt' Is a Root Cause Fallacy: Blameless Postmortems for AI Systems
Your LLM-powered feature starts returning nonsense. The on-call engineer pages the ML team. They look at the output, compare it to what the prompt was supposed to produce, and within the hour the ticket is resolved: "bad prompt — tweaked and redeployed." Incident closed. Postmortem written. Action items: "improve prompt engineering process."
Two weeks later, the same class of failure happens again. Different prompt, different feature — but the same invisible root cause.
The "fix the prompt" reflex is the AI engineering equivalent of blaming the last developer to touch a file. It gives postmortems a clean ending without requiring anyone to understand what actually broke. And unlike traditional software, where this reflex is merely lazy, in AI systems it's structurally dangerous — because non-deterministic systems fail in ways that prompt changes cannot fix.
Why Traditional Postmortem Culture Works and Why AI Breaks It
Blameless postmortem culture in SRE exists for a specific reason: when people fear punishment, they hide information. Hiding information prevents learning. Systems that don't learn from failures repeat them. Google's SRE book formalized this into practice, and the industry largely adopted it — not because it feels good, but because systems improve when engineers can say "here's what we broke and why" without career risk.
The blameless model rests on a subtle assumption, though: that there is a discrete, reproducible failure event to analyze. A service crashed at 14:23 UTC. A config push introduced a null pointer. A database migration dropped a column. These failures have forensic trails. You can replay the sequence of events. You can establish causality with high confidence.
LLM failures don't work this way. The same input can produce different outputs across invocations. An issue that appeared in 3% of responses last Tuesday might appear in 8% this week without any deployment. The failure event isn't discrete — it's a distribution shift. And that's where traditional postmortem templates fall apart.
When your postmortem form asks "what changed before the incident?" and the honest answer is "nothing we deployed — the model drifted," engineers who want to close tickets will reach for the nearest available explanation. The prompt is always there. It's always imperfect. And it can always be blamed.
The Five Failure Categories That Prompts Can't Fix
Research into LLM system failures reveals that the majority of production incidents don't originate in prompt quality. An empirical analysis of open-source LLM failures found that environment faults (infrastructure coupling, dependency failures) account for roughly 46% of failures, user configuration faults account for 25%, and model-centric issues — where prompt changes would actually help — are the minority.
Here's a working failure taxonomy for AI systems:
Infrastructure and environment failures. Rigid coupling between your inference layer and external dependencies, context window exhaustion from upstream payload growth, latency spikes in embedding retrieval that cause timeouts mid-generation. None of these are prompt problems.
Data and retrieval failures. Stale documents in your RAG corpus, schema drift in the knowledge base, embedding model updates that invalidate stored vectors. The model is doing its job correctly given what it was handed — what it was handed is wrong.
Model degradation failures. Provider-side model updates (the same model name, different weights), concept drift where your fine-tuned model's target distribution has shifted, silent embedding space changes after model updates. You have no deployment event to point to in your postmortem because the change happened inside a black box.
Agentic coordination failures. In multi-agent systems, failures stratify into three layers: agent-level (the LLM made a poor decision), structure-level (the orchestration logic passed bad state between agents), and platform-level (the runtime failed to handle partial results correctly). Blaming the prompt for a structure-level failure is like blaming the application code for a race condition in the networking layer.
Security surface failures. Prompt injection attacks — where malicious content in retrieved documents or user inputs overrides system instructions — have an 84% attack success rate in agentic systems, compared to 50% in single-agent systems. Every agent-to-agent handoff is an injection surface. These require security-specific response playbooks, not prompt edits.
Knowing which category an incident falls into determines what you actually fix. Without a taxonomy, every incident looks like a prompt problem because the prompt is the one thing engineers feel they can control.
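One way to make the taxonomy operational is to encode it directly, so incident tickets carry a category field from the start. A minimal sketch, assuming a hypothetical keyword-based first-pass triage (the signal names and hints are illustrative, not a substitute for human classification):

```python
from enum import Enum

class FailureCategory(Enum):
    """The five failure categories from the taxonomy above."""
    INFRASTRUCTURE = "infrastructure_environment"
    DATA_RETRIEVAL = "data_retrieval"
    MODEL_DEGRADATION = "model_degradation"
    AGENTIC_COORDINATION = "agentic_coordination"
    SECURITY_SURFACE = "security_surface"

# Hypothetical signal-to-category hints for first-pass triage only.
TRIAGE_HINTS = {
    "timeout": FailureCategory.INFRASTRUCTURE,
    "context_window_exceeded": FailureCategory.INFRASTRUCTURE,
    "stale_document": FailureCategory.DATA_RETRIEVAL,
    "embedding_mismatch": FailureCategory.DATA_RETRIEVAL,
    "provider_model_update": FailureCategory.MODEL_DEGRADATION,
    "bad_handoff_state": FailureCategory.AGENTIC_COORDINATION,
    "instruction_override": FailureCategory.SECURITY_SURFACE,
}

def triage(signals):
    """Return sorted candidate categories for the observed incident signals."""
    return sorted({TRIAGE_HINTS[s].value for s in signals if s in TRIAGE_HINTS})
```

The point is not the lookup table; it is that a forced category choice blocks the "nearest available explanation" shortcut before analysis begins.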
Building a Blameless Framework for Non-Deterministic Systems
Adapting SRE postmortem culture to AI systems requires changes to both the investigation process and the documentation format. The psychological safety principle remains the same — engineers who fear blame hide information — but the technical methodology needs to account for probabilistic behavior.
Replace "what changed?" with "what shifted?"
Traditional five-whys analysis anchors to a change event. For AI incidents, you need a parallel track: what distribution shifted? Instrument your systems to log not just whether requests succeeded, but the distribution of outputs — confidence scores, response length distributions, refusal rates, semantic similarity to expected responses. When you can plot these over time, drift becomes visible without a deployment event.
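A lightweight way to make "what shifted?" answerable is to track one output signal against a frozen baseline. This sketch assumes a windowed refusal-rate monitor; the window size and alert ratio are illustrative placeholders, and a real system would track several signals:

```python
from collections import deque

class DistributionMonitor:
    """Track a binary output signal (e.g. refusals) against a fixed
    baseline so drift is visible without a deployment event."""

    def __init__(self, baseline_rate, window=1000, alert_ratio=2.0):
        self.baseline_rate = baseline_rate  # e.g. 0.03 measured last month
        self.alert_ratio = alert_ratio      # alert when the rate doubles
        self.recent = deque(maxlen=window)  # rolling window of recent requests

    def record(self, is_refusal):
        self.recent.append(1 if is_refusal else 0)

    def current_rate(self):
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def drifted(self):
        """True when the windowed rate exceeds baseline * alert_ratio."""
        return self.current_rate() >= self.baseline_rate * self.alert_ratio
```

With a 3% baseline, the 3%-to-8% drift described earlier would trip this monitor the week it happened, instead of surfacing as a vague "the model got worse" ticket.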
Classify before you analyze.
Before any root cause analysis, classify the incident against your taxonomy. The classification determines your investigation path. If it's an environment fault, you're debugging infrastructure. If it's a retrieval failure, you're auditing your corpus. If it's model degradation, you're checking provider changelogs and setting up shadow testing against the previous checkpoint. Starting with classification prevents the "nearest available explanation" failure mode.
Document the failure distribution, not the failure instance.
Traditional postmortems document "the system returned X when it should have returned Y." AI postmortems need to document "the system returned X in N% of requests to this endpoint over the period P, versus the expected baseline of M%." This reframing matters: it prevents the team from convincing itself the incident was an isolated edge case when it was actually a systematic shift.
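Documenting "N% over period P versus baseline M%" also lets you check whether the shift is statistically real or plausibly sampling noise. A standard two-proportion z-test works; this stdlib-only version is a sketch of one reasonable check, not a prescribed method:

```python
import math

def rate_shift_zscore(baseline_hits, baseline_total,
                      incident_hits, incident_total):
    """Two-proportion z-test: how many standard errors separate the
    incident-period failure rate from the baseline rate?"""
    p1 = baseline_hits / baseline_total
    p2 = incident_hits / incident_total
    # Pooled proportion under the null hypothesis of "no shift".
    pooled = (baseline_hits + incident_hits) / (baseline_total + incident_total)
    se = math.sqrt(pooled * (1 - pooled)
                   * (1 / baseline_total + 1 / incident_total))
    return (p2 - p1) / se

# |z| > 1.96 corresponds to roughly the 5% significance level.
```

A 3%-to-8% shift over a thousand requests each produces a z-score near 5 — far past noise — which is exactly the evidence that stops a team from writing off a systematic shift as an edge case.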
Separate "what happened" from "what to fix."
AI-assisted postmortem tools are increasingly useful for the grunt work: summarizing incident timelines, correlating signals across logs, generating first drafts of the impact section. But the root cause analysis — the "why" chain that leads from symptom to structural cause — cannot be automated. The system that generated the incident cannot reliably diagnose its own failure modes. Human review of the causal chain is non-optional.
The Runbook Gap That Every Team Discovers Too Late
Traditional SRE has mature runbooks for infrastructure incidents: restart the service, scale the pool, roll back the deployment. AI systems require a parallel set of playbooks that most teams don't build until they've been burned.
A minimal AI incident runbook library needs at least four entries:
- Model rollback procedure: How to switch to a previous model checkpoint or fall back to a rule-based baseline. Who approves this? What's the latency impact?
- Retrieval quarantine procedure: How to exclude specific documents or entire corpora from retrieval while you investigate data quality issues. Does your system support hot exclusion or does it require a reindex?
- Output distribution alert thresholds: What refusal rate or semantic drift value triggers an alert? What's the response protocol for a p95 degradation vs. a complete outage?
- Injection surface audit checklist: For agentic systems, which trust boundaries are validated and which are implicit? Where can external content influence agent behavior?
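The four entries above can live as a machine-readable registry so the on-call engineer finds the trigger, approver, and steps in one place. A minimal sketch; every trigger, role name, and step below is a hypothetical placeholder:

```python
from dataclasses import dataclass, field

@dataclass
class Runbook:
    name: str
    trigger: str     # condition that invokes the playbook
    approver: str    # role that signs off before execution
    steps: list = field(default_factory=list)

# Hypothetical entries mirroring the four playbooks above.
RUNBOOKS = {
    "model_rollback": Runbook(
        "Model rollback",
        "output quality regression >= 30% vs baseline",
        "ml-oncall-lead",
        ["pin previous checkpoint or enable rule-based fallback",
         "measure latency impact of the fallback path"],
    ),
    "retrieval_quarantine": Runbook(
        "Retrieval quarantine",
        "bad responses traced to specific documents or corpora",
        "data-oncall",
        ["hot-exclude offending document IDs",
         "schedule a reindex if hot exclusion is unsupported"],
    ),
    "distribution_alert": Runbook(
        "Output distribution alert",
        "refusal rate or semantic drift beyond threshold",
        "ml-oncall",
        ["decide p95-degradation vs full-outage protocol",
         "open a drift investigation against provider changelogs"],
    ),
    "injection_audit": Runbook(
        "Injection surface audit",
        "suspected instruction override via external content",
        "security-oncall",
        ["enumerate agent-to-agent trust boundaries",
         "isolate the compromised handoff"],
    ),
}
```

Writing these down as data rather than wiki prose has a side benefit: an empty or stale registry is itself detectable, which makes the runbook gap visible before an incident exposes it.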
The absence of these runbooks isn't a process failure — it's a signal that the team hasn't yet formalized what "reliability" means for their AI system. That formalization is the real output of a mature postmortem process.
Observability as a Prerequisite to Honest Postmortems
The "fix the prompt" reflex is partly a rationalization problem and partly an instrumentation problem. Engineers reach for the prompt explanation when they don't have better evidence. If you don't have output distribution dashboards, you can't demonstrate drift. If you don't have retrieval logs, you can't trace which documents influenced a bad response. If you don't have per-step traces in your agent workflows, you can't distinguish agent-level failures from orchestration failures.
Observability survey data from 2025 is pointed: 62% of production AI teams named observability as their top improvement priority, and fewer than one in three were satisfied with their current solutions. The gap is significant because document processing and customer-facing AI (the highest-volume production uses) expose errors directly to users — meaning failures that might be hidden in a backend system become visible incidents immediately.
The technical requirements are more demanding than traditional application monitoring. Logging that an LLM request succeeded tells you almost nothing. You need to trace every agent step, monitor output quality distributions online, capture retrieval results alongside generated responses, and correlate business metrics (conversion, resolution rates, user escalations) with model behavior over time.
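What that instrumentation looks like at the logging layer can be sketched as one structured record per LLM request, capturing retrieval inputs, agent steps, and quality signals together. The field names here are illustrative assumptions, not a standard schema:

```python
import json
import time
import uuid

def log_llm_request(prompt_version, agent_steps, retrieved_doc_ids,
                    response_text, refusal, quality_score, sink=print):
    """Emit one structured record per LLM request so a postmortem can
    correlate retrieval contents, per-step traces, and output quality."""
    record = {
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt_version": prompt_version,        # versioned like code
        "agent_steps": agent_steps,              # per-step trace summaries
        "retrieved_doc_ids": retrieved_doc_ids,  # what the model was handed
        "response_len": len(response_text),      # a cheap distribution signal
        "refusal": refusal,
        "quality_score": quality_score,          # online quality estimate
    }
    sink(json.dumps(record))
    return record
```

With records like this, "which documents influenced a bad response?" and "which agent step went wrong?" become queries rather than guesses.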
Without this instrumentation, your postmortems will always bottom out at "model behavior" — not because that's the true root cause, but because it's the last thing you can see.
What Mature AI Incident Culture Looks Like
Teams that have moved past the "fix the prompt" reflex share a few observable characteristics.
They treat model drift like latency degradation: with dashboards, alert thresholds, and response runbooks tied to business impact. They distinguish between a 5% regression in output quality (investigate, don't page) and a 30% regression (page, invoke runbook). They have SLIs defined for model behavior, not just infrastructure uptime.
They run postmortems on failure distributions, not failure instances. An isolated bad output is noise. A 2% shift in the rate of outputs falling outside quality thresholds over three weeks is a signal — and it gets its own postmortem even if no user ever filed a complaint.
They version prompts like code, with full deployment histories, so "what changed?" has a real answer. And they accept that some incidents will have no clean root cause — the model behaved strangely given a distribution of inputs, the distribution was unusual, nothing was deployed — and they document that honestly rather than inventing a narrative.
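The prompt-versioning discipline above needs very little machinery to start: content-address each prompt so every deployed variant has a stable ID that appears in logs and postmortem timelines. A minimal in-memory sketch; production would back the history with git or a database:

```python
import hashlib

def prompt_version(prompt_text):
    """Content-address a prompt: identical text always yields the same
    short ID, and any edit yields a different one."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]

# Tiny illustrative history; real systems would persist this.
PROMPT_HISTORY = []

def deploy_prompt(prompt_text):
    """Record the version alongside the text and return the version ID."""
    version = prompt_version(prompt_text)
    PROMPT_HISTORY.append((version, prompt_text))
    return version
```

Once every logged request carries such an ID, "what changed?" has a checkable answer — and an incident with no prompt change on the timeline can no longer be quietly closed as "bad prompt."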
The goal of blameless postmortem culture was never to make engineers feel good about failures. It was to create the psychological safety needed for honest investigation. In AI systems, honest investigation requires admitting that the failure taxonomy is more complex than "bad prompt," the observability requirements are more demanding than traditional logging, and the runbooks need to be written before the incident, not during it.
The prompt is almost never the root cause. But it will keep being blamed until teams build the systems that let them see what actually is.
- https://sre.google/sre-book/postmortem-culture/
- https://postmortems.pagerduty.com/culture/blameless/
- https://rootly.com/incident-postmortems/blameless
- https://arxiv.org/html/2601.22208v1
- https://arxiv.org/html/2509.23735
- https://arxiv.org/html/2403.04123v1
- https://arxiv.org/html/2509.14404v1
- https://arxiv.org/html/2601.13655v1
- https://cloud.google.com/blog/products/devops-sre/applying-sre-principles-to-your-mlops-pipelines
- https://devops.com/sre-in-the-age-of-ai-what-reliability-looks-like-when-systems-learn/
- https://cleanlab.ai/ai-agents-in-production-2025/
- https://www.dynatrace.com/news/blog/ai-observability-business-impact-2025/
- https://alldaystech.com/guides/artificial-intelligence/model-drift-detection-monitoring-response
