The On-Call Runbook for AI Systems That Nobody Writes
Your p99 latency just spiked to 12 seconds. The alert fired at 3:14am. You open the runbook and find instructions for: checking the database connection pool, verifying the load balancer, restarting the service. You do all three. Latency stays elevated. The service is not down — it is up and responding. But something is wrong. It turns out the model started generating responses three times longer than usual because a recent prompt change accidentally unlocked verbose behavior. The runbook had no page for that.
This is the new category of on-call incident that engineering teams are not prepared for: the system is operational but the model is misbehaving. Traditional SRE runbooks assume binary failure states. AI systems fail probabilistically, and the symptoms do not look like an outage — they look like drift.
Why Traditional Runbooks Break Down for AI Systems
A conventional runbook is a procedural artifact. It says: when alert X fires, check Y, then do Z. This works when failures are deterministic. A crashed database either starts or it does not. A memory leak either causes OOM or it does not. You reproduce the issue, apply the fix, verify recovery.
LLM systems break this model in three ways.
The failure is behavioral, not operational. A model can be fully available, accepting requests, and returning responses at normal throughput — while producing outputs that are subtly wrong, excessively cautious, or gradually drifting from their expected character. The service health check passes. The model is misbehaving. Traditional monitoring has nothing to say about this.
Reproduction is unreliable. When you reproduce a software bug, you get the same bug. When you try to reproduce an LLM output failure, temperature, context window effects, and retrieved-data differences mean you often cannot recreate the exact behavior. You are debugging a probability distribution, not a deterministic function.
Causality is non-obvious. A latency spike could be caused by: a model generating longer outputs (upstream content change), KV cache pressure from concurrent long-context requests, a provider-side issue, a retrieval layer returning more documents, or a tool call returning unexpectedly large payloads. Each cause requires a different response. Traditional runbooks assume you can observe the cause directly; LLM runbooks require inference from indirect signals.
The Signals That Actually Matter
Effective AI on-call starts by expanding what you measure. Most teams monitor latency, error rate, and throughput. For LLM systems, those metrics tell you only part of the story.
Output length distribution. Track the mean and variance of response lengths over time. A sudden rise in average output length often precedes latency spikes — the model is generating more tokens, which costs more compute and adds more wall-clock time. A collapse to very short outputs can indicate over-refusal or a broken prompt that truncates responses. Neither shows up as an error.
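A minimal sketch of this kind of check, using a rolling window and a z-score; the window size, warm-up count, and threshold here are illustrative assumptions, not recommendations:

```python
from collections import deque
import statistics

class OutputLengthMonitor:
    """Rolling z-score check on response token counts (illustrative sketch)."""

    def __init__(self, window: int = 500, z_threshold: float = 3.0):
        self.lengths = deque(maxlen=window)  # recent response lengths
        self.z_threshold = z_threshold

    def observe(self, token_count: int) -> bool:
        """Record one response length; return True if it is anomalous
        relative to the recent window."""
        anomalous = False
        if len(self.lengths) >= 30:  # need some history before judging
            mean = statistics.fmean(self.lengths)
            stdev = statistics.pstdev(self.lengths) or 1.0
            anomalous = abs(token_count - mean) / stdev > self.z_threshold
        self.lengths.append(token_count)
        return anomalous
```

The same structure catches both directions of failure: a 3x jump in length and a collapse to near-zero both land far outside the window's distribution.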
Refusal rate. Every refusal is a request that returned nothing useful to the user. Track this as a first-class metric alongside error rate. Refusal spikes usually have one of three causes: a content policy change on the provider side, a prompt that started hitting safety boundaries at higher rates due to user traffic shift, or a guardrail misconfigured too aggressively. Each requires a different response, but all three look identical from a latency and error-rate dashboard.
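Detecting a refusal reliably usually needs a classifier; as a sketch, a phrase-match heuristic plus a sliding-window rate can stand in. The marker phrases and the 5% alert threshold are assumptions, not tuned values:

```python
from collections import deque

# Naive phrase-match detector; a production system would use a classifier.
REFUSAL_MARKERS = ("i can't help with", "i'm unable to", "i cannot assist")

def looks_like_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

class RefusalRateTracker:
    """Sliding-window refusal rate with an assumed alert threshold."""

    def __init__(self, window: int = 1000, alert_rate: float = 0.05):
        self.outcomes = deque(maxlen=window)  # True = refusal
        self.alert_rate = alert_rate

    def observe(self, response: str) -> bool:
        """Record one response; return True if the windowed rate is over threshold."""
        self.outcomes.append(looks_like_refusal(response))
        return sum(self.outcomes) / len(self.outcomes) > self.alert_rate
```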
Tool call error rate and duration. For agentic systems, tool failures are often the real root cause that surfaces as model degradation. If the retrieval tool starts returning empty results, the model has nothing to reason over. If an API tool starts timing out at 20% of calls, the agent retries and burns token budget. Monitor each tool independently — which tools are called, how often they fail, and what their P95 duration looks like.
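One way to sketch per-tool instrumentation, kept in memory for illustration; a real deployment would export these counters to a metrics backend rather than aggregate in process:

```python
from collections import defaultdict
import statistics

class ToolCallMetrics:
    """Per-tool failure rate and P95 duration (in-memory sketch)."""

    def __init__(self):
        self.durations = defaultdict(list)  # tool name -> durations in seconds
        self.calls = defaultdict(int)
        self.failures = defaultdict(int)

    def record(self, tool: str, duration_s: float, ok: bool) -> None:
        self.calls[tool] += 1
        self.durations[tool].append(duration_s)
        if not ok:
            self.failures[tool] += 1

    def error_rate(self, tool: str) -> float:
        return self.failures[tool] / self.calls[tool] if self.calls[tool] else 0.0

    def p95_duration(self, tool: str) -> float:
        samples = self.durations[tool]
        if len(samples) < 20:
            return max(samples, default=0.0)
        # quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile
        return statistics.quantiles(samples, n=20)[18]
```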
Quality score drift. This is the hardest signal to instrument but the most important to have. Real-time quality evaluation — whether through LLM-as-a-judge, embedding-based similarity to known-good responses, or task-specific metrics — gives you a continuous signal on what the model is actually producing, not just whether it is producing. A 10% drop in quality score is often the first indicator of a regression that will not appear in any infrastructure metric.
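However the scores are produced, the drift check itself can be simple. A sketch, assuming scores are on a scale where a fractional drop from baseline is meaningful:

```python
import statistics

def quality_drift(baseline_scores, recent_scores, drop_threshold=0.10):
    """Flag drift when the recent mean quality score falls more than
    drop_threshold (fractional) below the baseline mean. How the scores
    are produced (LLM judge, embedding similarity, task metric) is
    orthogonal to this check."""
    base = statistics.fmean(baseline_scores)
    recent = statistics.fmean(recent_scores)
    return (base - recent) / base > drop_threshold
```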
Token cost per request. Unexpected cost spikes are a leading indicator of model behavior changes. If average tokens-per-response doubles, something changed — in the prompt, in the model, in the retrieved context, or in user traffic patterns. This is much cheaper to instrument than quality evaluation and often fires first.
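A back-of-envelope sketch of this instrumentation; the per-token prices and the 1.5x spike threshold are invented placeholders, not real provider pricing:

```python
# Assumed placeholder prices, not real provider rates.
PROMPT_PRICE_PER_1K = 0.003
COMPLETION_PRICE_PER_1K = 0.015

def request_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Blended dollar cost of one request under the placeholder prices."""
    return (prompt_tokens / 1000 * PROMPT_PRICE_PER_1K
            + completion_tokens / 1000 * COMPLETION_PRICE_PER_1K)

def cost_spike(baseline_cost: float, current_cost: float,
               ratio: float = 1.5) -> bool:
    """Flag when mean cost per request exceeds an assumed baseline multiple."""
    return current_cost > baseline_cost * ratio
```

Because completion tokens are typically priced several times higher than prompt tokens, a doubling of output length moves this metric well before it moves a latency dashboard.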
The Incident Taxonomy for AI Systems
Before you can write a runbook, you need a classification system for what can go wrong. AI systems have failure modes that do not map cleanly to infrastructure incidents.
Model regression. The model's output quality has degraded relative to a baseline. This can happen because: the upstream provider silently updated the model version behind the same API endpoint; a prompt was changed without running regression evals first; or traffic shifted toward a distribution of inputs the model handles poorly. Detection requires comparing current outputs against a stored baseline, not checking service health.
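A sketch of such a baseline-comparison gate; `generate` and `score` are hypothetical callables standing in for your model call and your eval metric, and the thresholds are illustrative:

```python
def regression_check(eval_cases, generate, score, baseline_scores,
                     max_drop: float = 0.05) -> bool:
    """Return True if the current model has regressed against a stored baseline.

    `generate` maps an eval case to a model output; `score` maps an output
    to a quality score. Both are assumed callables, not a real API. A case
    counts as regressed when its score falls more than max_drop below its
    stored baseline score; the gate trips when over 10% of cases regress.
    """
    regressed = 0
    for case, baseline in zip(eval_cases, baseline_scores):
        current = score(generate(case))
        if current < baseline - max_drop:
            regressed += 1
    return regressed / len(eval_cases) > 0.10
```

Running this on every prompt change is what catches the "edited without regression evals" failure before it reaches production; catching a silent provider-side model swap requires running it on a schedule as well.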
Guardrail bypass or misconfiguration. Safety guardrails have started refusing too much (over-refusal, which looks like degraded service to users) or too little (under-refusal, which is a safety incident). These are opposite problems that require opposite responses, but both show up as anomalous refusal rates.
Cost spike. Token consumption is accelerating beyond expected bounds. This is often caused by longer context being passed to the model (a retrieval change, a conversation history bug), a prompt that triggers verbose responses, or a traffic pattern shift. Cost spikes can be financially significant before they become user-visible.
Latency degradation. Response time at P99 is elevated. This has AI-specific causes: KV cache pressure from long concurrent contexts, token generation being slower due to output length increases, or tool calls adding latency to agentic pipelines. The latency-error correlation that works for traditional services often breaks here — the service is responding, just slowly.
Prompt injection or security incident. Adversarial inputs are causing the model to behave outside its intended operating envelope. This shows up as unusual output patterns, unexpected tool calls, or behavior that breaks conversation policy. Most monitoring stacks have no signal for this.
What Investigation Actually Looks Like at 3am
You have an alert. Here is an investigation sequence that works for AI systems, ordered from fastest check to slowest:
1. Read the cheap behavioral dashboards first: token cost per request and output length distribution. These often fire before anything else, and a doubled mean output length points at a prompt, model, or traffic change rather than infrastructure.
2. Check refusal rate. A spike narrows the incident to a provider policy change, a traffic shift toward safety boundaries, or an over-aggressive guardrail.
3. Check per-tool error rates and P95 durations. In agentic systems, a failing or slow tool is frequently the real root cause behind what looks like model degradation.
4. Diff what changed: prompts, retrieval configuration, guardrail settings, and whether the provider's model version shifted behind the same API endpoint.
5. Pull a sample of recent outputs and compare them against your stored baseline or quality eval. This is the slowest step, but it is the only one that confirms a genuine model regression.