AI Incident Response Playbooks: Why Your On-Call Runbook Doesn't Work for LLMs
Your monitoring dashboard shows elevated latency, a small error rate spike, and then nothing. Users are already complaining in Slack. A quarter of your AI feature's responses are hallucinating in ways that look completely valid to your alerting system. By the time you find the cause — a six-word change to a prompt deployed two hours ago — you've had a slow-burn incident that your runbook never anticipated.
This is the defining challenge of operating AI systems in production. The failure modes are real, damaging, and invisible to conventional tooling. An LLM that silently hallucinates looks exactly like an LLM that's working correctly from the outside.
Your existing on-call playbooks were built for deterministic systems. A service either returns 200 or it doesn't. A database query either succeeds or it fails with an exception you can grep for. The entire machinery of incident response — alerts, runbooks, escalation paths, postmortems — was designed around the assumption that failures leave a trace you can follow.
AI systems break that assumption completely.
The Four Ways Your On-Call Assumptions Break
Non-determinism is the first problem. The same prompt can return a correct answer at 9am and a hallucinated one at 9:15am. You cannot reproduce failures reliably. When your on-call handbook says "reproduce the issue in staging before escalating," that step may simply be impossible for an LLM failure. The system is stochastic by design.
No error codes is the second problem. A 500 error tells you something broke. A hallucination returns 200 with a JSON payload that looks structurally perfect. Your error rate metric sees nothing. Your latency metric sees nothing. The only signal is in the content of the response — which your infrastructure layer has no idea how to inspect.
The triage matrix is wrong. When a service goes down, you know where to look: the deployment, the dependency, the infrastructure. For an AI system degradation, the root cause could be any of four completely different categories — and each one requires a different team, different tools, and a different fix:
- Model: The underlying model has drifted, been silently updated, or exhibits different behavior on your specific input distribution.
- Prompt: A recent prompt change removed context that was implicitly anchoring the model's factual grounding.
- Data: The inputs coming into the system have shifted distribution — you're now getting query patterns the system wasn't designed for.
- Infrastructure: API rate limits, latency spikes, or provider-side degradation that causes retries and cascading failures.
Intermittency makes things worse. A bug that manifests 30% of the time is nearly undetectable by threshold-based alerting unless you're monitoring the right quality metrics. And most teams aren't.
Building the AI Incident Triage Decision Tree
The goal of the first fifteen minutes of an AI incident is not to fix the problem — it's to correctly classify which of the four categories you're in. Each category has a different owner, different tooling, and a different remediation path.
Step one: rule out infrastructure. Check API availability, request queue depth, latency percentiles, and timeout rates. Infrastructure failures tend to be consistent rather than intermittent, to cluster geographically, and to correlate with provider status pages. If your error rate is truly elevated (not just quality-degraded) and the pattern is consistent, start here.
Step two: check for model changes. Look at which model version is deployed. Many teams deploy against a floating alias like gpt-4 rather than a pinned version like gpt-4-turbo-2024-04-09. Floating aliases can receive silent model updates from providers — and model drift accounts for roughly 40% of production agent failures. Cross-reference your incident timeline with recent model version changes.
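One cheap, automatable version of this check is an audit that flags floating aliases in your deployed config. The alias list below is illustrative; maintain your own per provider:

```python
from typing import Optional

# Example floating aliases; maintain your own list per provider.
FLOATING_ALIASES = {"gpt-4", "gpt-4-turbo", "gpt-4o"}

def audit_model_config(configured_model: str) -> Optional[str]:
    """Return a warning when the deployed config points at a floating
    alias that can receive silent provider-side model updates."""
    if configured_model in FLOATING_ALIASES:
        return (f"{configured_model!r} is a floating alias; "
                "pin a dated release identifier instead")
    return None
```

Running this in CI against your deployment config turns "are we pinned?" from an incident-time question into a pre-merge check.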
Step three: check for prompt changes. Pull the deployment history for your prompt templates. If a prompt changed within two hours of the incident start, that's your primary suspect. The comparison isn't just syntactic — a six-word removal can dramatically change model behavior even when the remaining text looks identical.
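Correlating the incident timeline with prompt deploys can be a one-liner if your prompt store records timestamps. A minimal sketch, assuming a hypothetical `(prompt_name, deployed_at)` history:

```python
from datetime import datetime, timedelta

def recent_prompt_suspects(deploys: list, incident_start: datetime,
                           window: timedelta = timedelta(hours=2)) -> list:
    """Return prompt deployments inside the lookback window, newest first.

    `deploys` is a list of (prompt_name, deployed_at) tuples -- a
    hypothetical shape; adapt to however your store records history.
    """
    suspects = [(name, ts) for name, ts in deploys
                if incident_start - window <= ts <= incident_start]
    return sorted(suspects, key=lambda item: item[1], reverse=True)
```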
Step four: check input distribution drift. Compute semantic embeddings of recent inputs and compare them against your historical baseline. If the distribution has shifted — new intent patterns, unusual query types, a traffic source you haven't seen before — you may be in a data-driven failure mode where the model is being asked things it was never calibrated to handle well.
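The four steps above can be sketched as a single triage function. The signal names are hypothetical stand-ins for whatever your observability stack actually exposes; the point is the ordering, which checks the cheapest and most conclusive signals first:

```python
from dataclasses import dataclass

@dataclass
class IncidentSignals:
    # Hypothetical signals; map these to your own observability stack.
    error_rate_elevated: bool       # true 5xx/timeouts, not quality degradation
    failure_pattern_consistent: bool
    model_version_changed: bool     # deployed alias resolved to a new version
    prompt_changed_recently: bool   # prompt deployed within the incident window
    input_drift_detected: bool      # input embeddings diverge from baseline

def triage(signals: IncidentSignals) -> str:
    """Classify an AI incident into one of the four root-cause categories."""
    # Step one: rule out infrastructure -- consistent hard failures.
    if signals.error_rate_elevated and signals.failure_pattern_consistent:
        return "infrastructure"
    # Step two: silent model updates or drift.
    if signals.model_version_changed:
        return "model"
    # Step three: a recent prompt deployment is the primary suspect.
    if signals.prompt_changed_recently:
        return "prompt"
    # Step four: input distribution shift.
    if signals.input_drift_detected:
        return "data"
    return "unclassified"
```

Keeping this as executable code rather than a wiki diagram means the classification logic can be reviewed, tested, and even wired directly into an incident bot.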
Mitigation That Doesn't Require Root Cause
This is the most counterintuitive part: you should be able to stop the bleeding before you know why it's happening.
In deterministic systems, you usually need to understand the failure before you can mitigate it. In AI systems, you have several blunt instruments that reduce harm immediately:
Prompt rollback is your fastest option. If you have version-controlled prompt templates (and you should), rolling back to the prior version is a sub-minute operation that doesn't require a code deploy. This is why decoupling prompt deployment from code deployment matters — your on-call engineer should be able to revert a prompt without touching the application.
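A minimal sketch of what that decoupling looks like, assuming an in-memory store for illustration (a real system would back this with a database or config service so the rollback happens without any deploy):

```python
class PromptStore:
    """Versioned prompt templates with sub-minute rollback.

    Illustrative sketch: production versions of this live in a config
    service or database, not in process memory.
    """

    def __init__(self):
        self._versions = {}   # name -> list of template strings
        self._active = {}     # name -> index into that list

    def deploy(self, name: str, template: str) -> int:
        versions = self._versions.setdefault(name, [])
        versions.append(template)
        self._active[name] = len(versions) - 1
        return self._active[name]

    def rollback(self, name: str) -> int:
        """Revert to the previous version -- no application deploy required."""
        if self._active[name] == 0:
            raise ValueError(f"no earlier version of {name!r} to roll back to")
        self._active[name] -= 1
        return self._active[name]

    def active(self, name: str) -> str:
        return self._versions[name][self._active[name]]
```

The design choice that matters is that `rollback` touches only prompt state, never application code, so it is safe to hand to any on-call engineer.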
Model version pinning is your second-fastest option. If the incident correlates with a model change, pin to the last known-good version. Floating aliases are convenient for development; they're a reliability liability in production.
Feature flags on AI features give you a kill switch. You can drop the AI feature entirely or route traffic to a fallback path within seconds. Notably, 71% of enterprises have no documented degradation plan for production AI — meaning when their AI feature fails, they have no pre-planned fallback. This should be a first-class concern before you ship an AI feature, not after it breaks.
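The pattern is small enough to show in full. `call_llm` and `fallback_search` below are hypothetical stand-ins for your AI path and its pre-planned degradation path:

```python
def answer_query(query: str, flags: dict) -> str:
    """Route around the AI path when the kill switch is flipped.

    The flag check comes first, so flipping it stops all LLM traffic
    within seconds; the except clause degrades gracefully on hard failure.
    """
    if not flags.get("ai_answers_enabled", False):
        return fallback_search(query)       # kill switch: pre-planned non-AI path
    try:
        return call_llm(query)
    except Exception:
        return fallback_search(query)       # degrade gracefully on hard failure

def call_llm(query: str) -> str:
    # Hypothetical AI path; here it simulates a provider outage.
    raise RuntimeError("provider outage")

def fallback_search(query: str) -> str:
    # Hypothetical degradation path, e.g. plain retrieval over your docs.
    return f"Top documents matching: {query}"
```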
Guardrails at the output layer can reduce the surface area of harm even if you don't know the root cause. Tighter schema validation, output filtering, and confidence-based routing can contain a degradation while the investigation runs in parallel.
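A minimal output-layer guardrail, assuming a hypothetical two-field schema and confidence floor (real deployments would use a proper schema validator, but the routing logic is the same):

```python
import json

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "confidence": float}

def guard_output(raw: str, min_confidence: float = 0.7):
    """Validate model output against a schema and a confidence floor.

    Returns (payload, None) when the output passes, or (None, reason)
    when the response should be dropped or routed to a fallback path.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None, "malformed JSON"
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(field), ftype):
            return None, f"missing or mistyped field: {field}"
    if payload["confidence"] < min_confidence:
        return None, "low confidence -- route to fallback"
    return payload, None
```

Tightening `min_confidence` or the required fields during an incident shrinks the blast radius while root-cause analysis continues in parallel.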
Monitoring the Metrics That Actually Matter
If you're only watching latency and error rate for an AI system, you're flying blind. Here are the metrics that detect AI-specific incidents:
Hallucination rate is the most critical. Depending on your domain, baseline hallucination rates for commercial LLMs range from 15% to over 50%. The important thing is not the absolute rate — it's the delta from your established baseline. A spike from 4% to 22% in two hours is an incident; a stable 12% might just be your system's normal operating range.
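Because the delta matters more than the absolute number, the alert condition is a one-liner. The 5-point threshold below is an illustrative default, not a recommendation:

```python
def hallucination_spike(baseline_rate: float, current_rate: float,
                        spike_threshold: float = 0.05) -> bool:
    """Alert on the delta from an established baseline, not the raw rate.

    A stable 12% baseline is not an incident; a jump from 4% to 22% is.
    The 0.05 default is illustrative -- calibrate against your own history.
    """
    return (current_rate - baseline_rate) > spike_threshold
```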
Refusal rate tells you when safety settings have drifted. Too high means your model is refusing legitimate requests. Too low means safety is eroding. Track both the raw refusal rate and the false refusal rate separately.
Structured output compliance rate catches prompt regressions fast. If your model is supposed to return JSON matching a specific schema, track what percentage of responses comply. A sudden drop is almost always a prompt regression or model update.
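This metric is cheap enough to compute on every response. A sketch, assuming compliance means "parses as a JSON object containing every required key":

```python
import json

def compliance_rate(responses: list, required_keys: set) -> float:
    """Fraction of responses that parse as a JSON object and contain
    every required key. A sudden drop is the fast-twitch signal for a
    prompt regression or a silent model update."""
    def complies(raw: str) -> bool:
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            return False
        return isinstance(payload, dict) and required_keys <= payload.keys()

    if not responses:
        return 1.0
    return sum(complies(r) for r in responses) / len(responses)
```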
Semantic similarity to baseline measures whether recent outputs are statistically similar to historical good outputs. This is more expensive to compute but catches quality degradation that simpler metrics miss — particularly useful for free-form generation tasks.
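In practice the embeddings come from whatever embedding model you already run; the comparison itself is just cosine similarity against a centroid computed while the system was known-healthy. A self-contained sketch with toy vectors:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mean_similarity_to_baseline(output_embeddings: list, baseline: list) -> float:
    """Average cosine similarity of recent output embeddings to a baseline
    centroid computed while the system was known-healthy. A downward trend
    here flags quality drift that error-rate metrics never see."""
    return sum(cosine_similarity(v, baseline)
               for v in output_embeddings) / len(output_embeddings)
```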
Input distribution embeddings give you early warning when your traffic is drifting toward out-of-distribution territory. Set up a weekly baseline and alert when the current distribution exceeds a divergence threshold.
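One simple divergence measure that works here is the population stability index (PSI) over bucketed inputs, where the buckets could be intent labels or embedding-cluster IDs. A sketch, using the common rule of thumb that PSI above 0.25 indicates significant drift:

```python
import math

def population_stability_index(baseline: dict, current: dict) -> float:
    """PSI between a weekly baseline and current bucketed input counts.

    Buckets might be intent labels or embedding-cluster IDs. A common
    rule of thumb: PSI > 0.25 indicates significant distribution shift.
    """
    eps = 1e-6  # smoothing so empty buckets don't divide by zero
    buckets = set(baseline) | set(current)
    b_total = sum(baseline.values()) or 1
    c_total = sum(current.values()) or 1
    psi = 0.0
    for bucket in buckets:
        p = baseline.get(bucket, 0) / b_total + eps
        q = current.get(bucket, 0) / c_total + eps
        psi += (q - p) * math.log(q / p)
    return psi
```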
User feedback signals are the most direct quality indicator and the most underused. A sudden drop in thumbs-up ratings or a spike in "this answer was wrong" feedback is usually the first signal of an AI incident — often before your automated metrics detect anything.
Postmortems for Stochastic Systems
A standard postmortem template assumes you can replay the incident. For AI systems, this assumption fails. You need to extend the postmortem format to accommodate non-reproducible failures.
Document intermittency explicitly. Instead of "the system failed at 2:15pm," write "28% of requests returned hallucinated responses between 2:15pm and 3:42pm." This framing is more accurate and is also more useful for future analysis — it captures that the failure was probabilistic, not binary.
Add a root cause category field. Explicitly classify whether the incident was model, prompt, data, or infrastructure. This forces the postmortem author to commit to a classification rather than writing vague language like "the AI behaved unexpectedly."
Document quality metrics during the incident. What was the hallucination rate? What was the structured output compliance rate? These numbers belong in the postmortem the same way p99 latency belongs in a traditional postmortem. If you don't have these numbers, the "why we didn't catch it sooner" section writes itself.
Write action items that account for stochasticity. A traditional action item is "fix the bug." An AI postmortem action item might be: "Add hallucination rate monitoring with a 5% spike alerting threshold" or "Require semantic similarity evaluation in CI for all prompt changes." The point is not to eliminate stochasticity — that's impossible — but to make the degradation visible faster.
Add a "why wasn't this caught" section focused on eval gaps. The most common answer is that there was no production evaluation running that inspects output quality in real time. Naming that gap explicitly creates the pressure to close it.
The Operational Readiness Checklist
Before shipping an AI feature to production, the following should be true:
- All prompt templates are version-controlled and can be rolled back independently of the application code.
- Model versions are pinned to specific release identifiers, not floating aliases.
- Feature flags exist for every AI-powered feature with a defined fallback path.
- Hallucination rate, refusal rate, and structured output compliance are instrumented and have baseline values with alert thresholds.
- An on-call runbook specific to AI features documents the four-category triage path.
- At least one graceful degradation mode is tested — typically returning cached data, a simplified response, or routing to a non-AI path.
Most teams ship the feature and build the runbook after the first incident. That works, but it means the first incident is your learning exercise. The cost of that lesson varies by domain — for a customer service chatbot it's embarrassing, for a medical summarization tool it can be dangerous.
Where This Is Heading
The SRE discipline is in the middle of a significant expansion. Reliability engineering for AI systems requires all the classic skills — on-call rotation design, postmortem culture, runbook maintenance — plus a new set of domain-specific capabilities: eval infrastructure, semantic monitoring, prompt lifecycle management, and model drift detection.
The teams that get ahead of this build evaluation infrastructure before they need it. They establish quality baselines when the system is healthy. They write their AI-specific runbooks when there's no incident pressure. And they treat model versioning with the same rigor they bring to application versioning.
The teams that don't get ahead of it discover all of this in the worst possible context: at 2am, trying to explain to a stakeholder why the AI feature is generating wrong answers and why their monitoring showed nothing was wrong.
