AI Incident Response Playbooks: Why Your On-Call Runbook Doesn't Work for LLMs
Your monitoring dashboard shows elevated latency, a small error rate spike, and then nothing. Users are already complaining in Slack. A quarter of your AI feature's responses are hallucinating in ways that look completely valid to your alerting system. By the time you find the cause — a six-word change to a prompt deployed two hours ago — you've had a slow-burn incident that your runbook never anticipated.
This is the defining challenge of operating AI systems in production. The failure modes are real, damaging, and invisible to conventional tooling. An LLM that silently hallucinates looks exactly like an LLM that's working correctly from the outside.
Your existing on-call playbooks were built for deterministic systems. A service either returns 200 or it doesn't. A database query either succeeds or it fails with an exception you can grep for. The entire machinery of incident response — alerts, runbooks, escalation paths, postmortems — was designed around the assumption that failures leave a trace you can follow.
AI systems break that assumption completely.
The Four Ways Your On-Call Assumptions Break
Non-determinism is the first problem. The same prompt can return a correct answer at 9am and a hallucinated one at 9:15am. You cannot reproduce failures reliably. When your on-call handbook says "reproduce the issue in staging before escalating," that step may simply be impossible for an LLM failure. The system is stochastic by design.
The absence of error codes is the second problem. A 500 error tells you something broke. A hallucination returns 200 with a JSON payload that looks structurally perfect. Your error rate metric sees nothing. Your latency metric sees nothing. The only signal is in the content of the response — which your infrastructure layer has no idea how to inspect.
The triage matrix is wrong. When a service goes down, you know where to look: the deployment, the dependency, the infrastructure. For an AI system degradation, the root cause could be any of four completely different categories — and each one requires a different team, different tools, and a different fix:
- Model: The underlying model has drifted, been silently updated, or exhibits different behavior on your specific input distribution.
- Prompt: A recent prompt change removed context that was implicitly anchoring the model's factual grounding.
- Data: The inputs coming into the system have shifted distribution — you're now getting query patterns the system wasn't designed for.
- Infrastructure: API rate limits, latency spikes, or provider-side degradation that causes retries and cascading failures.
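One way to make the four-category split operational is to encode it as an explicit routing table that on-call can consult (or that a triage bot can query). This is a minimal sketch; the team names and first-check labels are illustrative, not a prescription:

```python
from enum import Enum

class IncidentCategory(Enum):
    MODEL = "model"
    PROMPT = "prompt"
    DATA = "data"
    INFRA = "infrastructure"

# Hypothetical routing table: each category maps to an owning team
# and the first tool to reach for. All names are placeholders.
TRIAGE_ROUTING = {
    IncidentCategory.MODEL:  {"owner": "ml-platform",  "first_check": "model version history"},
    IncidentCategory.PROMPT: {"owner": "feature-team", "first_check": "prompt deploy log"},
    IncidentCategory.DATA:   {"owner": "data-eng",     "first_check": "input distribution dashboard"},
    IncidentCategory.INFRA:  {"owner": "sre",          "first_check": "provider status page"},
}
```

The point is not the table itself but the discipline: every category has exactly one owner, so classification immediately tells you who to page.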
Intermittency makes things worse. A bug that manifests 30% of the time is nearly undetectable by threshold-based alerting unless you're monitoring the right quality metrics. And most teams aren't.
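A tiny simulation makes the blind spot concrete. Assume every hallucinated response still returns HTTP 200 (as described above); the bug rate and sample size below are illustrative:

```python
import random

random.seed(42)

def simulate(n=1000, bug_rate=0.30):
    """Every response returns HTTP 200; bug_rate of them are silently wrong."""
    http_errors = 0          # what infrastructure alerting sees
    quality_failures = 0     # what a content-level evaluator would see
    for _ in range(n):
        if random.random() < bug_rate:
            quality_failures += 1  # hallucinated, but still a 200
    return http_errors / n, quality_failures / n

err_rate, bad_rate = simulate()
# err_rate stays at 0.0 — a threshold alert on error rate never fires.
# bad_rate lands near 0.30 — only a quality metric catches it.
```

The infrastructure-level error rate is exactly zero while nearly a third of responses are wrong. That gap is the entire argument for content-level quality monitoring.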
Building the AI Incident Triage Decision Tree
The goal of the first fifteen minutes of an AI incident is not to fix the problem — it's to correctly classify which of the four categories you're in. Each category has a different owner, different tooling, and a different remediation path.
Step one: rule out infrastructure. Check API availability, request queue depth, latency percentiles, and timeout rates. Infrastructure problems tend to fail consistently rather than intermittently, cluster geographically, and often correlate with provider status pages. If your error rate is truly elevated (not just quality-degraded) and the pattern is consistent, start here.
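A coarse version of this check can run automatically at incident start. The sketch below computes a p95 latency and timeout rate and compares them against budgets; the thresholds are illustrative placeholders, not recommendations:

```python
import statistics

def infra_signals(latencies_ms, timeouts, total_requests,
                  p95_budget_ms=2000, timeout_budget=0.01):
    """Coarse infrastructure rule-out. Budgets are illustrative —
    tune them to your own SLOs."""
    # statistics.quantiles with n=20 returns 19 cut points;
    # index 18 is the 95th percentile.
    p95 = statistics.quantiles(latencies_ms, n=20)[18]
    timeout_rate = timeouts / total_requests
    return {
        "p95_ms": p95,
        "p95_breached": p95 > p95_budget_ms,
        "timeout_rate": timeout_rate,
        "timeout_breached": timeout_rate > timeout_budget,
    }
```

If neither budget is breached, infrastructure is an unlikely culprit and you move to step two.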
Step two: check for model changes. Look at which model version is deployed. Many teams deploy against a floating alias like gpt-4 rather than a pinned version like gpt-4-turbo-2024-04-09. Floating aliases can receive silent model updates from providers — and model drift accounts for roughly 40% of production agent failures. Cross-reference your incident timeline with recent model version changes.
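Cross-referencing the timeline is mechanical if you log model deploys. A minimal sketch, assuming you keep your own deploy log of (timestamp, alias, resolved version) tuples — note that provider-side silent updates behind a floating alias won't appear here unless you also record the resolved version per request:

```python
from datetime import datetime, timedelta

def model_change_suspects(incident_start, model_deploy_log, window_hours=24):
    """Return model version changes inside the lookback window before
    the incident. model_deploy_log entries are
    (timestamp, alias, resolved_version) tuples from your own records."""
    cutoff = incident_start - timedelta(hours=window_hours)
    return [entry for entry in model_deploy_log
            if cutoff <= entry[0] <= incident_start]
```

Any entry this returns is a prime suspect; an empty list shifts suspicion toward prompts or data.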
Step three: check for prompt changes. Pull the deployment history for your prompt templates. If a prompt changed within two hours of the incident start, that's your primary suspect. The comparison isn't just syntactic — a six-word removal can dramatically change model behavior even when the remaining text looks identical.
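Because a tiny removal can be the whole root cause, a word-level diff of the two prompt versions is more useful during triage than eyeballing the templates. A minimal sketch using the standard library:

```python
import difflib

def prompt_diff(old_prompt: str, new_prompt: str) -> list[str]:
    """Word-level unified diff of two prompt versions. A handful of
    '-' lines can be the entire root cause of a quality regression."""
    return list(difflib.unified_diff(
        old_prompt.split(), new_prompt.split(), lineterm=""))
```

Run it between the version deployed before the incident window and the one deployed inside it, and read the removed words first.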
Step four: check input distribution drift. Compute semantic embeddings of recent inputs and compare them against your historical baseline. If the distribution has shifted — new intent patterns, unusual query types, a traffic source you haven't seen before — you may be in a data-driven failure mode where the model is being asked things it was never calibrated to handle well.
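One cheap drift signal is the distance between the centroid of historical input embeddings and the centroid of recent ones. This is a simplified sketch: the embeddings would come from whatever embedding model you already run, and the vectors here are placeholders — a production check would also look at variance, not just centroids:

```python
import math

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def drift_score(baseline_embeddings, recent_embeddings):
    """Coarse drift signal: cosine distance between the centroid of
    historical input embeddings and the centroid of recent ones.
    0 means identical direction; values near 1 mean the traffic has
    moved somewhere the system was never calibrated for."""
    return cosine_distance(centroid(baseline_embeddings),
                           centroid(recent_embeddings))
```

A score that jumps above your historical baseline around the incident start is evidence for the data category.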
Mitigation That Doesn't Require Root Cause
This is the most counterintuitive part: you should be able to stop the bleeding before you know why it's happening.
In deterministic systems, you usually need to understand the failure before you can mitigate it. In AI systems, you have several blunt instruments that reduce harm immediately:
Prompt rollback is your fastest option. If you have version-controlled prompt templates (and you should), rolling back to the prior version is a sub-minute operation that doesn't require a code deploy. This is why decoupling prompt deployment from code deployment matters — your on-call engineer should be able to revert a prompt without touching the application.
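The mechanism can be very small. This is a minimal in-memory sketch of a versioned prompt store with rollback; a real store would persist versions in a config service or database so the revert works without any code deploy:

```python
class PromptStore:
    """Minimal sketch of a versioned prompt store. Versions live in
    memory here for illustration; production would persist them."""

    def __init__(self):
        self._versions: dict[str, list[str]] = {}

    def deploy(self, name: str, template: str) -> int:
        """Publish a new version; returns its version index."""
        self._versions.setdefault(name, []).append(template)
        return len(self._versions[name]) - 1

    def active(self, name: str) -> str:
        return self._versions[name][-1]

    def rollback(self, name: str) -> str:
        """Revert to the prior version — no code deploy required."""
        versions = self._versions[name]
        if len(versions) < 2:
            raise ValueError("no prior version to roll back to")
        versions.pop()
        return versions[-1]
```

The property that matters for on-call is the last method: one call, sub-minute, reversible.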
Model version pinning is your second-fastest option. If the incident correlates with a model change, pin to the last known-good version. Floating aliases are convenient for development; they're a reliability liability in production.
Feature flags on AI features give you a kill switch. You can drop the AI feature entirely or route traffic to a fallback path within seconds. The interesting thing here is that 71% of enterprises have no documented degradation plan for production AI — meaning when their AI feature fails, they have no pre-planned fallback. This should be a first-class concern before you ship an AI feature, not after it breaks.
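The kill switch plus fallback pattern is a few lines once the fallback path exists. A sketch with injected callables (the `llm_call` and `fallback` names are illustrative):

```python
def answer(query, ai_enabled, llm_call, fallback):
    """Kill-switch routing for an AI feature: when the flag is off,
    serve the pre-planned fallback path; when the AI call raises,
    degrade to the fallback rather than failing the request."""
    if not ai_enabled:
        return fallback(query)
    try:
        return llm_call(query)
    except Exception:
        return fallback(query)  # degrade, don't 500
```

The hard part is not the routing — it's having designed `fallback` before the incident, which is exactly what most teams skip.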
Guardrails at the output layer can reduce the surface area of harm even if you don't know the root cause. Tighter schema validation, output filtering, and confidence-based routing can contain a degradation while the investigation runs in parallel.
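Even a crude schema check tightens the blast radius. This sketch validates required keys and types on the model's output payload; the `required` mapping is an illustrative stand-in for a full schema validator:

```python
def validate_output(payload: dict, required: dict) -> bool:
    """Output-layer guardrail: reject responses missing required
    fields or carrying wrong types. `required` maps field name to
    expected type — a deliberately simplified stand-in for a real
    JSON Schema check."""
    return all(
        key in payload and isinstance(payload[key], typ)
        for key, typ in required.items()
    )
```

Responses that fail validation can be routed to the fallback path, which contains the degradation while the four-step triage runs in parallel.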
