
On-Call for Stochastic Systems: Why Your AI Runbook Needs a Rewrite

10 min read
Tian Pan
Software Engineer

You get paged at 2 AM. Latency is up, error rates are spiking. You SSH in, pull logs, and—nothing. No stack trace pointing to a bad deploy. No null pointer exception on line 247. Just a stream of model outputs that are subtly, unpredictably wrong in ways that only become obvious when you read 50 of them in a row.

This is what incidents look like in LLM-powered systems. And the traditional alert-triage-fix loop was not built for it.

The standard on-call playbook assumes three things: failures are deterministic (same input, same bad output), root cause is locatable (some code changed, some resource exhausted), and rollback is straightforward (revert the deploy, done). None of these hold for stochastic AI systems. The same prompt produces different outputs. Root cause is usually a probability distribution, not a line of code. And you cannot "rollback" a model that a third-party provider updated silently overnight.

Fixing this requires rethinking the runbook at every layer: how you alert, how you triage, and how you write post-mortems.

Why the Traditional Alert Schema Breaks

In conventional systems, you alert on things you can measure precisely: HTTP 5xx rate, p99 latency, queue depth. These metrics have sharp edges—when something goes wrong, the number moves.

LLM degradation is usually gradual and multi-dimensional. A model that has drifted toward verbose, hedging outputs will not trigger a latency alert (it might actually be faster). It will not trigger a 5xx alert (the API returns 200 successfully). What degrades is quality—and quality has no built-in metric.

The result is that traditional alerting creates two failure modes:

Alert storms on noise. LLM outputs naturally vary. If you set static thresholds on any quality-adjacent signal (word count, output length, presence of certain phrases), you'll fire alerts constantly on normal variance. Industry studies of cloud observability commonly report that 75–95% of alerts are false positives; for LLM systems with naive quality-based alerting, expect worse.

Silent degradation on real problems. Model providers update models without announcing it. Prompt changes slip in with feature flags. RAG retrieval quality drops as the vector index drifts from the document corpus. None of these show up in infrastructure dashboards.

The fix is a four-layer metric schema that separates concerns clearly:

Infrastructure metrics — the traditional SRE layer: latency, throughput, token consumption, error rates, cost per request. These should already be in your dashboards.

API-level metrics — specific to LLM API behavior: rate limit (HTTP 429) rate, authentication failure rate, model availability per endpoint, retry frequency. These tell you when the provider is struggling.

Behavioral metrics — this is the new layer: refusal rate (what percentage of legitimate requests the model refuses to answer?), format compliance rate (does the output match the expected JSON schema or structured format?), output length variance (is the model suddenly producing 4x longer responses?). These metrics are cheap to compute deterministically and give early signal on prompt or model drift.

Quality metrics — the most expensive layer: hallucination rate measured via LLM-as-judge scoring, semantic similarity to expected outputs on a golden test set, user feedback signals. These require evaluation infrastructure but are the only reliable way to catch real quality degradation.

Alert on infrastructure metrics with standard thresholds. Alert on API-level and behavioral metrics with moderate sensitivity. Alert on quality metrics only when you have enough samples to separate signal from noise—and use trend-based alerting rather than threshold-based alerting, so a sustained 10% drop in quality scores fires, but a one-request fluctuation does not.
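Trend-based quality alerting can be as simple as a rolling window that must be both full and persistently below baseline before firing. A sketch; the window size and drop percentage are tunable assumptions:

```python
from collections import deque

class TrendAlert:
    """Fire only when the rolling mean of a quality score stays below
    baseline * (1 - drop_pct) across a full window of samples."""

    def __init__(self, baseline, drop_pct=0.10, window=50):
        self.threshold = baseline * (1 - drop_pct)
        self.scores = deque(maxlen=window)

    def observe(self, score):
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        rolling_mean = sum(self.scores) / len(self.scores)
        return window_full and rolling_mean < self.threshold
```

A single bad response barely moves the rolling mean, so it never pages; a sustained drop does.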

A Triage Decision Tree for Non-Deterministic Failures

When the alert fires, on-call engineers face a decision problem that didn't exist in traditional systems: the failure is almost certainly in one of three very different places, and the diagnostic evidence for each overlaps.

The three failure domains are:

Model provider issues. The API is returning errors, timing out, or rate-limiting. This is actually the easy case because it usually shows up in infrastructure metrics. Key indicators: HTTP 5xx errors, 429 rate limits, authentication failures, regional availability drops. Response: verify on the provider's status page, check API key validity and quota headroom (both tokens-per-minute and requests-per-minute limits), implement automatic failover to a secondary provider or a smaller/cheaper model, and apply exponential backoff with jitter for retries.

Prompt or model issues. The API returns 200, but the outputs are wrong. This is the hard case. Key indicators: elevated refusal rate without provider errors, format compliance failures (model ignores JSON schema instructions), quality score drift on your evaluation set, behavioral changes that coincide with a prompt change or a provider model update. Response: first, run your evaluation test set against the current model and compare scores to the previous baseline. If the scores diverged after a prompt deploy, roll it back. If they diverged without any internal change, the provider updated the model—check their changelog, file a support ticket, and consider pinning to a specific model version if the API supports it.

Infrastructure and retrieval issues. The model itself is fine, but something upstream or downstream has degraded. Key indicators: elevated latency with stable quality scores, increased token consumption (the model is getting longer or noisier context), cascading failures in multi-step agent pipelines, retrieval returning stale or irrelevant chunks. Response: inspect your RAG pipeline—check vector index freshness, embedding coverage, retrieval recall on known queries. Inspect agent-to-agent communication in orchestration layers. Check whether feature flag changes altered which tools or context an agent receives.
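The exponential-backoff-with-jitter response from the provider-issue domain can be sketched as follows; `TransientProviderError` is a stand-in for whatever 429/5xx exception your client library actually raises:

```python
import random
import time

class TransientProviderError(Exception):
    """Stand-in for a provider 429/5xx; real client libraries raise their own types."""

def call_with_backoff(call, max_retries=5, base=0.5, cap=30.0):
    """Retry a flaky provider call with capped exponential backoff and full jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except TransientProviderError:
            if attempt == max_retries - 1:
                raise  # out of retries: let the failover path take over
            # Full jitter: sleep a random amount up to the capped exponential delay.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The jitter matters under rate limiting: without it, every retrying client hammers the provider at the same instant and re-triggers the 429s.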

The decision tree collapses to three diagnostic questions, in order:

  1. Is the provider API returning errors or rate limits? → Provider issue.
  2. Did quality scores drop? Did anything change in our prompts, context assembly, or retrieval? → Prompt/model issue.
  3. Is latency elevated but quality stable? Are token counts inflated? → Infrastructure/retrieval issue.
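Encoded as code, the three questions might look like this; the snapshot keys and thresholds are illustrative and should be wired to your own metrics store:

```python
def triage(snapshot):
    """Map an alert snapshot to one of the three failure domains,
    checking the domains in the same order as the decision tree."""
    # 1. Provider API returning errors or rate limits?
    if snapshot["provider_5xx_rate"] > 0.01 or snapshot["rate_limit_429_rate"] > 0.05:
        return "provider"
    # 2. Quality scores dropped, or something changed on our side?
    if snapshot["quality_score_delta"] < -0.10 or snapshot["recent_prompt_change"]:
        return "prompt_or_model"
    # 3. Latency up with quality stable, or token counts inflated?
    if snapshot["latency_p99_delta"] > 0.5 or snapshot["token_count_inflation"] > 1.5:
        return "infra_or_retrieval"
    return "inconclusive: escalate and widen the evaluation sample"
```

The point is not the specific thresholds but that the ordering is fixed: rule out the cheap-to-check domains before reaching for the evaluation harness.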

Run a sample of recent inputs through your evaluation harness as part of every triage. Without this, you're guessing. With it, you have objective evidence.

Alerting Without the Noise

Alert fatigue is a compounding problem: when on-call engineers start ignoring pages because most are false positives, the real incidents hide in the noise. For LLM systems, where natural output variance is high, this is an existential risk to your observability program.

Three practices that work:

Use dynamic thresholds, not static ones. LLM usage patterns are strongly seasonal—certain queries are more complex at certain times of day or week, which affects latency and quality baselines. A static threshold calibrated for peak-hour behavior will fire constantly during off-hours. Threshold adaptation based on historical patterns—something that 2025-era observability platforms support out of the box—reduces false positive rates dramatically.
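A per-hour-of-day baseline is one simple form of threshold adaptation; a sketch, with the minimum-history requirement and sigma multiplier as assumptions:

```python
import statistics
from collections import defaultdict

class SeasonalThreshold:
    """Per-hour-of-day baseline: flag a value only when it deviates more than
    k standard deviations from that hour's own historical mean."""

    def __init__(self, k=3.0, min_history=30):
        self.history = defaultdict(list)  # hour -> observed values
        self.k = k
        self.min_history = min_history

    def record(self, hour, value):
        self.history[hour].append(value)

    def is_anomalous(self, hour, value):
        past = self.history[hour]
        if len(past) < self.min_history:
            return False  # not enough history for this hour: don't alert
        mu = statistics.mean(past)
        sigma = statistics.pstdev(past) or 1e-9  # guard against zero variance
        return abs(value - mu) > self.k * sigma
```

The same value can be normal at 2 PM and anomalous at 2 AM, because each hour is judged against its own distribution.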

Alert on sustained trends, not individual data points. A single low-quality response is noise. A rolling average quality score that has been declining for 30 minutes is signal. Configure quality metric alerts to fire only after a trend has persisted across a meaningful sample window.

Correlate signals before paging. A latency spike alone should not wake someone up at 2 AM. A latency spike plus a rising refusal rate plus a quality score drop in the same 10-minute window is an incident. Multi-signal correlation—grouping related alerts into a single incident with context—reduces page volume and gives the on-call engineer a richer starting point than "latency is high."
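A minimal correlation gate might only page when alerts from distinct signals land inside the same window; a sketch, where the window size and signal names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    signal: str       # e.g. "latency", "refusal_rate", "quality_score"
    timestamp: float  # unix seconds

def should_page(alerts, window_s=600, min_signals=2):
    """Page only when alerts from at least min_signals distinct signals
    fall within the same correlation window."""
    alerts = sorted(alerts, key=lambda a: a.timestamp)
    for i, first in enumerate(alerts):
        signals = {a.signal for a in alerts[i:]
                   if a.timestamp - first.timestamp <= window_s}
        if len(signals) >= min_signals:
            return True
    return False
```

Repeated alerts from the same signal never clear the gate on their own, which is exactly the "latency spike alone should not wake someone up" policy.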

Google's SRE practice codifies this: coordinate, communicate, control. The incident commander role works precisely because it filters signal from noise and gives the on-call team a single point of coordination rather than a flood of independent alerts.

The Post-Mortem Template for Stochastic Systems

Standard post-mortem templates were designed for deterministic systems. They ask "what code change caused this?" and "how do we prevent it from happening again?" For LLM incidents, both questions are harder.

The adapted template requires four additions beyond the standard fields:

Detection method. How did you discover the incident? Automated quality evaluation alert? User complaint? Business metric drop (conversion rate, task completion rate)? This field matters because it reveals gaps in your observability—if you found out from a user complaint, you need better automated detection. If an alert fired on a false positive, you need better threshold tuning.

Non-determinism acknowledgment. What is the normal variance range for this metric? Was the degradation clearly outside that range, or is there ambiguity about whether this was a real incident? LLM post-mortems sometimes conclude with "we're not certain this was actually a problem"—and that is fine. Document it explicitly.

Evaluation coverage gap. Did your evaluation test set cover the affected use case? Most LLM incidents reveal blind spots in the golden test set—cases the team didn't think to include. The action item is usually "add these cases to the eval suite before next release."

Rollback capability assessment. Could you have reverted faster? Do you have a previous model version pinned and deployable? Is there a feature flag that could have disabled the LLM feature and fallen back to a rule-based system? For many teams, the answer on first incident is no—and the action item is implementing those capabilities before the next incident.

The blameless post-mortem practice is even more important for AI systems than traditional ones, because "the model hallucinated" is tempting as a root cause and almost never accurate. Models don't hallucinate for no reason. They produce bad outputs because of: insufficient context in the prompt, inadequate retrieval, a training distribution shift, a model version update that changed behavior on edge cases, or missing constraints on the output. One of those is the root cause. The post-mortem should find it.

What the Runbook Should Actually Say

A runbook for LLM incidents needs to be clear enough that someone who didn't build the system can execute it. That means prescribing the specific commands and queries, not general guidance.

Concretely, every LLM feature's runbook should specify:

  • How to pull quality evaluation scores for the last 24 hours, with the exact dashboard link or query
  • How to compare current evaluation scores to the pre-deploy baseline
  • Which model version is currently active, and how to pin or roll back to a previous version
  • Which fallback provider to switch to, and how to redirect traffic
  • How to disable the LLM feature entirely and activate the fallback behavior (rule-based response, degraded mode, or feature flag off)
  • Who to escalate to if the triage decision tree doesn't resolve the issue—specifically, whether to wake the ML engineer or the infrastructure SRE

Add rich metadata to every LLM API call: prompt version identifier, feature flag state, user segment, and any A/B experiment assignment. When an incident hits, this metadata turns a search through undifferentiated logs into a targeted investigation.
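One way to keep that metadata uniform is a single helper that every call site uses to build the log record; the field names here are illustrative:

```python
import time
import uuid

def llm_call_record(model, prompt_version, feature_flags, user_segment,
                    experiment=None):
    """Build a structured log record carrying triage metadata for one LLM API call."""
    return {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "prompt_version": prompt_version,
        "feature_flags": feature_flags,   # e.g. {"new_summarizer": True}
        "user_segment": user_segment,
        "experiment": experiment,         # A/B assignment, if any
    }
```

Emit this record alongside each request/response pair, and an incident query can filter to "prompt v12, enterprise segment, experiment B" in one step instead of grepping undifferentiated logs.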

The Discipline Gap

The tools for LLM observability are now mature—evaluation frameworks, continuous quality monitoring, provider failover, multi-signal alerting correlation. What's missing in most organizations is not the tooling but the discipline: defining what "normal" looks like before the first incident, writing runbooks before the first page, and running chaos tests before real degradation.

The teams that handle AI incidents well have done the unglamorous upfront work: they know their baseline quality scores, they have a test set that represents real production traffic, they have a fallback provider configured and tested, and they have a runbook that was written by the people who built the system and reviewed by someone who didn't.

The on-call rotation for an AI feature is not inherently harder than for any other complex distributed system. It just requires a different mental model—one where "I can't reproduce this failure" is the expected starting point, not a sign that something has gone badly wrong.
