The On-Call Runbook for AI Systems That Nobody Writes
Your p99 latency just spiked to 12 seconds. The alert fired at 3:14am. You open the runbook and find instructions for: checking the database connection pool, verifying the load balancer, restarting the service. You do all three. Latency stays elevated. The service is not down — it is up and responding. But something is wrong. It turns out the model started generating responses three times longer than usual because a recent prompt change accidentally unlocked verbose behavior. The runbook had no page for that.
This is the new category of on-call incident that engineering teams are not prepared for: the system is operational but the model is misbehaving. Traditional SRE runbooks assume binary failure states. AI systems fail probabilistically, and the symptoms do not look like an outage — they look like drift.
Why Traditional Runbooks Break Down for AI Systems
A conventional runbook is a procedural artifact. It says: when alert X fires, check Y, then do Z. This works when failures are deterministic. A crashed database either starts or it does not. A memory leak either causes OOM or it does not. You reproduce the issue, apply the fix, verify recovery.
LLM systems break this model in three ways.
The failure is behavioral, not operational. A model can be fully available, accepting requests, and returning responses at normal throughput — while producing outputs that are subtly wrong, excessively cautious, or gradually drifting from their expected character. The service health check passes. The model is misbehaving. Traditional monitoring has nothing to say about this.
Reproduction is unreliable. When you reproduce a software bug, you get the same bug. When you try to reproduce an LLM output failure, temperature, context window effects, and retrieved-data differences mean you often cannot recreate the exact behavior. You are debugging a probability distribution, not a deterministic function.
Causality is non-obvious. A latency spike could be caused by: a model generating longer outputs (upstream content change), KV cache pressure from concurrent long-context requests, a provider-side issue, a retrieval layer returning more documents, or a tool call returning unexpectedly large payloads. Each cause requires a different response. Traditional runbooks assume you can observe the cause directly; LLM runbooks require inference from indirect signals.
The Signals That Actually Matter
Effective AI on-call starts by expanding what you measure. Most teams monitor latency, error rate, and throughput. For LLM systems, those metrics tell you only part of the story.
Output length distribution. Track the mean and variance of response lengths over time. A sudden rise in average output length often precedes latency spikes — the model is generating more tokens, which costs more compute and adds more wall-clock time. A collapse to very short outputs can indicate over-refusal or a broken prompt that truncates responses. Neither shows up as an error.
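A minimal sketch of this tracking, using Welford's online algorithm for running mean and variance; the class name and the z-score threshold are illustrative, and in production these statistics would live in your metrics backend rather than process memory:

```python
import math

class OutputLengthTracker:
    """Running mean/variance of response lengths (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def observe(self, token_count: int) -> None:
        self.n += 1
        delta = token_count - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (token_count - self.mean)

    @property
    def stddev(self) -> float:
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def is_anomalous(self, token_count: int, z_threshold: float = 3.0) -> bool:
        """Flag a response whose length deviates strongly from baseline."""
        if self.n < 30 or self.stddev == 0:
            return False  # not enough history to judge
        return abs(token_count - self.mean) / self.stddev > z_threshold
```

Feed it the token count of each response; a rise in the rate of anomalous responses, rather than any single outlier, is the signal worth alerting on.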
Refusal rate. Every refusal is a request that returned nothing useful to the user. Track this as a first-class metric alongside error rate. Refusal spikes usually have one of three causes: a content policy change on the provider side, a prompt that started hitting safety boundaries at higher rates due to a shift in user traffic, or a guardrail misconfiguration that filters too aggressively. Each requires a different response, but all three look identical from a latency and error-rate dashboard.
Tool call error rate and duration. For agentic systems, tool failures are often the real root cause that surfaces as model degradation. If the retrieval tool starts returning empty results, the model has nothing to reason over. If an API tool starts timing out at 20% of calls, the agent retries and burns token budget. Monitor each tool independently — which tools are called, how often they fail, and what their P95 duration looks like.
Quality score drift. This is the hardest signal to instrument but the most important to have. Real-time quality evaluation — whether through LLM-as-a-judge, embedding-based similarity to known-good responses, or task-specific metrics — gives you a continuous signal on what the model is actually producing, not just whether it is producing. A 10% drop in quality score is often the first indicator of a regression that will not appear in any infrastructure metric.
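One hedged sketch of the embedding-similarity variant: the `embed` function below is a bag-of-words placeholder standing in for a real embedding model, and scoring each response by its maximum similarity to any known-good reference is one reasonable choice among several:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Placeholder: a bag-of-words vector. In practice you would call a
    # real embedding model (a provider API, sentence-transformers, etc.).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def quality_score(response: str, known_good: list[str]) -> float:
    """Similarity of a response to its closest known-good reference.

    Track the rolling mean of this score; a sustained drop (e.g. 10%
    below baseline) is the drift signal described above.
    """
    rv = embed(response)
    return max(cosine(rv, embed(g)) for g in known_good)
```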
Token cost per request. Unexpected cost spikes are a leading indicator of model behavior changes. If average tokens-per-response doubles, something changed — in the prompt, in the model, in the retrieved context, or in user traffic patterns. This is much cheaper to instrument than quality evaluation and often fires first.
The Incident Taxonomy for AI Systems
Before you can write a runbook, you need a classification system for what can go wrong. AI systems have failure modes that do not map cleanly to infrastructure incidents.
Model regression. The model's output quality has degraded relative to a baseline. This can happen because: the upstream provider silently updated the model version behind the same API endpoint; a prompt was changed without running regression evals first; or traffic shifted toward a distribution of inputs the model handles poorly. Detection requires comparing current outputs against a stored baseline, not checking service health.
Guardrail bypass or misconfiguration. Safety guardrails started refusing too much (over-refusal, which looks like degraded service to users) or too little (under-refusal, which is a safety incident). These are opposite problems that require opposite responses, but both show up as anomalous refusal rates.
Cost spike. Token consumption is accelerating beyond expected bounds. This is often caused by longer context being passed to the model (a retrieval change, a conversation history bug), a prompt that triggers verbose responses, or a traffic pattern shift. Cost spikes can be financially significant before they become user-visible.
Latency degradation. Response time at P99 is elevated. This has AI-specific causes: KV cache pressure from long concurrent contexts, token generation being slower due to output length increases, or tool calls adding latency to agentic pipelines. The latency-error correlation that works for traditional services often breaks here — the service is responding, just slowly.
Prompt injection or security incident. Adversarial inputs are causing the model to behave outside its intended operating envelope. This shows up as unusual output patterns, unexpected tool calls, or behavior that breaks conversation policy. Most monitoring stacks have no signal for this.
What Investigation Actually Looks Like at 3am
You have an alert. Here is the investigation sequence that works for AI systems, ordered by speed of execution:
First, check what changed. Before touching anything else: was there a model version bump? A prompt change? A feature flag toggle? A retrieval index update? A content policy change from the provider? Most AI incidents are caused by something that changed recently. Correlate the incident start time with deploy history, flag changes, and provider status pages before diagnosing anything.
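This correlation step can be mechanized. A sketch, assuming deploy history, flag changes, and provider events can be collected into one list of timestamped records (the record shape here is hypothetical):

```python
from datetime import datetime, timedelta

def recent_changes(changes: list[dict], incident_start: datetime,
                   lookback_hours: int = 24) -> list[dict]:
    """List deploys, flag flips, and provider events that landed shortly
    before the incident, newest first.

    Each record is assumed to look like:
    {"kind": "deploy", "what": "prompt v14", "at": datetime(...)}
    """
    cutoff = incident_start - timedelta(hours=lookback_hours)
    hits = [c for c in changes if cutoff <= c["at"] <= incident_start]
    return sorted(hits, key=lambda c: c["at"], reverse=True)
```

Rendering this list automatically in the alert payload saves the on-call engineer the first ten minutes of every investigation.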
Second, characterize the symptom. Is it latency, quality, cost, refusals, or errors? The signal type narrows the cause space significantly:
- Latency spike with quality stable → likely infrastructure (KV cache, token throughput, tool timeout)
- Quality drop with latency stable → likely model or prompt regression
- Refusal spike → likely guardrail or content policy issue
- Cost spike → likely output length increase or traffic composition change
- Error rate spike → likely tool failure or API issue
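The mapping above can be encoded as a first-pass triage function; a sketch whose ordering and category strings are illustrative:

```python
def triage(latency_up: bool, quality_down: bool, refusals_up: bool,
           cost_up: bool, errors_up: bool) -> str:
    """First-pass symptom-to-cause mapping from the list above.

    Real triage would weigh multiple simultaneous signals; this encodes
    the list as a simple decision chain for the on-call engineer.
    """
    if errors_up:
        return "tool failure or API issue"
    if refusals_up:
        return "guardrail or content policy issue"
    if quality_down and not latency_up:
        return "model or prompt regression"
    if latency_up and not quality_down:
        return "infrastructure (KV cache, token throughput, tool timeout)"
    if cost_up:
        return "output length increase or traffic composition change"
    return "unclassified -- sample actual outputs manually"
```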
Third, sample actual outputs. Pull a sample of recent requests and look at what the model is producing. Quality degradation, behavioral drift, and over-refusal are invisible to infrastructure monitoring but immediately obvious when you read outputs. Most teams skip this step during incidents because it is not automated — but it is often the fastest path to root cause.
Fourth, apply mitigations in order of speed. Feature flags that switch prompt versions or model endpoints are faster than redeployments. Conservative fallbacks (simpler prompts, smaller contexts, cached responses) are faster than root cause analysis. Your goal at 3am is to stop the bleeding, not to perfectly understand the pathology.
Writing Runbooks for Probabilistic Systems
A runbook for an AI system cannot say "check if the model is working." It needs to say "if refusal rate exceeds 2x the 7-day rolling baseline for more than 10 minutes, check these three things in this order."
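That alert rule is concrete enough to code directly. A sketch, assuming per-minute refusal rates and a precomputed 7-day baseline are available from your metrics store:

```python
def refusal_alert(recent_rates: list[float], baseline_7d: float,
                  multiplier: float = 2.0, sustain_minutes: int = 10) -> bool:
    """Fire when the refusal rate exceeds `multiplier` x the 7-day
    rolling baseline for `sustain_minutes` consecutive one-minute buckets.

    `recent_rates` is the per-minute refusal rate, oldest first.
    """
    if len(recent_rates) < sustain_minutes:
        return False
    threshold = multiplier * baseline_7d
    return all(r > threshold for r in recent_rates[-sustain_minutes:])
```

The sustain window matters: a single anomalous minute is noise, ten consecutive ones are an incident.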
Structure your AI runbooks around decision trees with explicit probabilistic branches:
Signal → Probable Cause → Investigation Step → Mitigation
For each signal, document the most likely causes in order of probability, the specific query or check that distinguishes between them, and the mitigation for each case. This is different from deterministic runbooks where you follow steps sequentially; you are building a diagnostic tree.
Include explicit escalation criteria. When does a latency incident become a safety incident? When does a quality degradation trigger a rollback? These thresholds should be defined in advance, not decided at 3am under pressure.
Document what the fallback state looks like. For every AI feature, define the conservative configuration: smaller context window, simpler prompt, no tool use, or cached responses. This should be activatable via a single flag change. Knowing what the fallback is before an incident means you can use it without second-guessing.
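A sketch of what a single-flag fallback might look like; the flag name, prompt versions, and limits are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    prompt_version: str
    max_context_tokens: int
    tools_enabled: bool

# The conservative fallback, defined before any incident occurs.
NORMAL = ModelConfig("v14-rich", 32_000, True)
FALLBACK = ModelConfig("v9-minimal", 8_000, False)

def active_config(flags: dict[str, bool]) -> ModelConfig:
    """One flag flips the whole feature to its known-safe configuration."""
    return FALLBACK if flags.get("ai_fallback_mode") else NORMAL
```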
Store known-good output samples alongside the runbook. When you are trying to determine if current outputs represent regression, you need a reference point. A collection of representative good outputs from different traffic segments is as useful as any monitoring metric.
The Monitoring Infrastructure You Actually Need
Traditional APM tools will give you latency and error rate. For LLM systems, you need additional instrumentation that most teams have not built:
Per-request output logging with sampling. Log a sample of full request-response pairs — not just metadata. You cannot evaluate quality from token count alone. Even 1% sampling at moderate traffic volumes gives you enough to detect behavioral drift and to replay for debugging.
Behavioral baselines with drift detection. Compute rolling statistics on output length, refusal rate, tool call patterns, and quality scores. Alert on deviation from baseline rather than absolute thresholds. What counts as an anomalous refusal rate depends on your application; alerting on absolute refusal rate will drown you in false positives.
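A sketch of baseline-relative alerting; the window size, warm-up length, and deviation ratio are illustrative and should be tuned per metric:

```python
from collections import deque

class DriftDetector:
    """Alert on deviation from a rolling baseline rather than an
    absolute threshold."""

    def __init__(self, window: int = 7 * 24, ratio: float = 2.0):
        self.history = deque(maxlen=window)  # e.g. hourly buckets, 7 days
        self.ratio = ratio

    def observe(self, value: float) -> bool:
        """Record a new bucket; return True if it deviates from baseline."""
        drifted = False
        if len(self.history) >= 24:  # require some history before judging
            baseline = sum(self.history) / len(self.history)
            drifted = baseline > 0 and value > self.ratio * baseline
        self.history.append(value)
        return drifted
```

The same detector works for refusal rate, output length, or quality score, because the comparison is always against that metric's own recent history.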
Tool call instrumentation. Each tool in an agentic system should emit spans with timing, error status, input size, and output size. Tool failures are often the silent cause of what appears to be model degradation.
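A sketch of per-tool span emission using a context manager; in production you would emit OpenTelemetry spans to a collector rather than appending to an in-memory list:

```python
import time
from contextlib import contextmanager

SPANS: list[dict] = []  # stand-in for a real span exporter

@contextmanager
def tool_span(tool_name: str, input_size: int):
    """Record timing, error status, and payload sizes for one tool call."""
    span = {"tool": tool_name, "input_bytes": input_size,
            "error": None, "output_bytes": 0}
    start = time.monotonic()
    try:
        yield span  # the caller sets span["output_bytes"] on success
    except Exception as exc:
        span["error"] = type(exc).__name__
        raise
    finally:
        span["duration_ms"] = (time.monotonic() - start) * 1000
        SPANS.append(span)
```

Usage might look like:

```python
with tool_span("retrieval", input_size=128) as s:
    docs = call_retrieval_tool(query)  # hypothetical tool call
    s["output_bytes"] = len(docs)
```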
Cost attribution per feature and per request type. You need to know which features, which user segments, and which input patterns are driving token consumption. Cost spikes are much easier to diagnose when you can narrow the source.
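A sketch of per-feature cost rollup, assuming each request record carries feature and token-count metadata; the price constant is illustrative:

```python
from collections import defaultdict

def attribute_cost(records: list[dict],
                   price_per_1k_tokens: float = 0.002) -> dict[str, float]:
    """Aggregate token spend by feature so a cost spike can be narrowed
    to its source.

    Each record is assumed to look like:
    {"feature": "chat", "input_tokens": 500, "output_tokens": 500}
    """
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        tokens = r["input_tokens"] + r["output_tokens"]
        totals[r["feature"]] += tokens / 1000 * price_per_1k_tokens
    return dict(totals)
```

The same rollup keyed by user segment or input pattern answers the other two attribution questions.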
The Org Change, Not Just the Technical Change
The engineers who are best equipped to diagnose traditional infrastructure incidents are often poorly equipped to diagnose AI incidents — and vice versa. Debugging a misbehaving model requires reading outputs, understanding prompt mechanics, and reasoning about probability distributions. These are not standard SRE skills.
Teams that have successfully operationalized AI systems have done one of two things: trained their SREs on AI-specific failure modes and given them the tooling to inspect model behavior, or created a rotation that pairs an SRE with an ML engineer during AI incidents. Neither approach is free, but "let the SRE handle it" without the corresponding training produces 3am incidents that stretch for four hours because the on-call engineer has no mental model for what they are looking at.
Your runbook is only as useful as the person reading it. Write for the on-call engineer who understands distributed systems but may not have internalized the difference between a model regression and an infrastructure latency issue. Make the signal-to-cause mapping explicit. Make the mitigation steps concrete. And accept that the runbook will be wrong sometimes — because the failure mode is probabilistic, and that means diagnosis is too.
The goal is not a perfect runbook. The goal is an on-call engineer who can contain an AI incident in 30 minutes instead of two hours, because the organization thought about this before the alert fired at 3am.
