
On-Call for Stochastic Systems: Why Your AI Runbook Needs a Rewrite

10 min read
Tian Pan
Software Engineer

You get paged at 2 AM. Latency is up, error rates are spiking. You SSH in, pull logs, and—nothing. No stack trace pointing to a bad deploy. No null pointer exception on line 247. Just a stream of model outputs that are subtly, unpredictably wrong in ways that only become obvious when you read 50 of them in a row.

This is what incidents look like in LLM-powered systems. And the traditional alert-triage-fix loop was not built for it.

The standard on-call playbook assumes three things: failures are deterministic (same input, same bad output), root cause is locatable (some code changed, some resource exhausted), and rollback is straightforward (revert the deploy, done). None of these hold for stochastic AI systems. The same prompt produces different outputs. Root cause is usually a probability distribution, not a line of code. And you cannot "rollback" a model that a third-party provider updated silently overnight.

Fixing this requires rethinking the runbook at every layer: how you alert, how you triage, and how you write post-mortems.

Why the Traditional Alert Schema Breaks

In conventional systems, you alert on things you can measure precisely: HTTP 5xx rate, p99 latency, queue depth. These metrics have sharp edges—when something goes wrong, the number moves.

LLM degradation is usually gradual and multi-dimensional. A model that has drifted toward verbose, hedging outputs will not trigger a latency alert (it might actually be faster). It will not trigger a 5xx alert (the API returns 200 successfully). What degrades is quality—and quality has no built-in metric.

The result is that traditional alerting creates two failure modes:

Alert storms on noise. LLM outputs naturally vary. If you set static thresholds on any quality-adjacent signal (word count, output length, presence of certain phrases), you'll fire alerts constantly on normal variance. Research shows 75–95% of observability alerts across cloud systems are false positives; for LLM systems with naive quality-based alerting, that number is worse.

Silent degradation on real problems. Model providers update models without announcing it. Prompt changes slip in with feature flags. RAG retrieval quality drops as the vector index drifts from the document corpus. None of these show up in infrastructure dashboards.

The fix is a four-layer metric schema that separates concerns clearly:

Infrastructure metrics — the traditional SRE layer: latency, throughput, token consumption, error rates, cost per request. These should already be in your dashboards.

API-level metrics — specific to LLM API behavior: rate limit (HTTP 429) rate, authentication failure rate, model availability per endpoint, retry frequency. These tell you when the provider is struggling.

Behavioral metrics — this is the new layer: refusal rate (what percentage of legitimate requests does the model refuse to answer?), format compliance rate (does the output match the expected JSON schema or structured format?), output length variance (is the model suddenly producing 4x longer responses?). These metrics are cheap to compute deterministically and give early signal on prompt or model drift; a sketch of these checks follows this schema.

Quality metrics — the most expensive layer: hallucination rate measured via LLM-as-judge scoring, semantic similarity to expected outputs on a golden test set, user feedback signals. These require evaluation infrastructure but are the only reliable way to catch real quality degradation.
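
To make the behavioral layer concrete, here is a minimal Python sketch of those deterministic checks. The refusal markers, the required-keys schema check, and the function names are illustrative assumptions, not a standard library; adapt them to your own output contract.

```python
import json
import statistics

# Illustrative refusal markers (assumption): tune to your model's refusal style.
REFUSAL_MARKERS = ("i can't help with", "i'm unable to", "i cannot assist")

def is_refusal(output: str) -> bool:
    """Cheap deterministic check: does the output read like a refusal?"""
    head = output.lower()[:200]  # refusals usually open the response
    return any(marker in head for marker in REFUSAL_MARKERS)

def is_format_compliant(output: str, required_keys: set[str]) -> bool:
    """Does the output parse as JSON and carry the keys the prompt demanded?"""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys <= parsed.keys()

def behavioral_snapshot(outputs: list[str], required_keys: set[str]) -> dict:
    """Aggregate behavioral metrics over a window of recent outputs."""
    if not outputs:
        raise ValueError("need at least one output to compute a snapshot")
    lengths = [len(o.split()) for o in outputs]
    return {
        "refusal_rate": sum(map(is_refusal, outputs)) / len(outputs),
        "format_compliance_rate":
            sum(is_format_compliant(o, required_keys) for o in outputs) / len(outputs),
        "mean_output_words": statistics.mean(lengths),
        "output_length_stdev": statistics.pstdev(lengths),
    }
```

None of this calls a model or a judge, which is exactly the point: the behavioral layer stays cheap enough to run on every request.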

Alert on infrastructure metrics with standard thresholds. Alert on API-level and behavioral metrics with moderate sensitivity. Alert on quality metrics only when you have enough samples to separate signal from noise—and use trend-based alerting rather than threshold-based alerting, so a sustained 10% drop in quality scores fires an alert while a one-request fluctuation does not.
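
One way to implement that trend rule is sketched below; the class name, window sizes, and 10% threshold are assumptions to tune against your evaluation volume, not a known API.

```python
from collections import deque

class TrendAlert:
    """Fire when the mean of a short recent window drops a sustained
    fraction below a longer rolling baseline, instead of firing on any
    single bad sample."""

    def __init__(self, baseline_size: int = 500, recent_size: int = 50,
                 max_drop: float = 0.10):
        self.baseline = deque(maxlen=baseline_size)  # long-run reference
        self.recent = deque(maxlen=recent_size)      # what just happened
        self.max_drop = max_drop                     # e.g. 0.10 = 10% drop

    def observe(self, score: float) -> bool:
        """Record one quality score; return True if the alert should fire."""
        self.recent.append(score)
        self.baseline.append(score)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough samples to separate signal from noise
        baseline_mean = sum(self.baseline) / len(self.baseline)
        recent_mean = sum(self.recent) / len(self.recent)
        return recent_mean < baseline_mean * (1 - self.max_drop)
```

Because the baseline window also absorbs the recent scores, a genuine regression drags the baseline down over time; that makes the alert slightly conservative, which is usually what you want at this layer.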

A Triage Decision Tree for Non-Deterministic Failures

When the alert fires, on-call engineers face a decision problem that didn't exist in traditional systems: the failure is almost certainly in one of three very different places, and the diagnostic evidence for each overlaps.

The three failure domains are:

Model provider issues. The API is returning errors, timing out, or rate-limiting. This is actually the easy case because it usually shows up in infrastructure metrics. Key indicators: HTTP 5xx errors, 429 rate limits, authentication failures, regional availability drops. Response: verify on the provider's status page, check API key validity and quota headroom (both tokens-per-minute and requests-per-minute limits), implement automatic failover to a secondary provider or a smaller/cheaper model, and apply exponential backoff with jitter for retries.

Prompt or model issues. The API returns 200, but the outputs are wrong. This is the hard case. Key indicators: elevated refusal rate without provider errors, format compliance failures (model ignores JSON schema instructions), quality score drift on your evaluation set, behavioral changes that coincide with a prompt change or a provider model update. Response: first, run your evaluation test set against the current model and compare scores to the previous baseline (a sketch of this comparison follows the list). If the scores diverged after a prompt deploy, roll it back. If they diverged without any internal change, the provider updated the model—check their changelog, file a support ticket, and consider pinning to a specific model version if the API supports it.

Infrastructure and retrieval issues. The model itself is fine, but something upstream or downstream has degraded. Key indicators: elevated latency with stable quality scores, increased token consumption (the model is getting longer or noisier context), cascading failures in multi-step agent pipelines, retrieval returning stale or irrelevant chunks. Response: inspect your RAG pipeline—check vector index freshness, embedding coverage, retrieval recall on known queries. Inspect agent-to-agent communication in orchestration layers. Check whether feature flag changes altered which tools or context an agent receives.
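
For the hard middle case, the baseline comparison can be as small as the sketch below. Producing the two score lists is your evaluation harness's job; the function name and the 5% regression threshold are illustrative assumptions.

```python
import statistics

def compare_to_baseline(current_scores: list[float],
                        baseline_scores: list[float],
                        max_drop: float = 0.05) -> dict:
    """Compare eval-set scores for the live prompt/model against the last
    known-good baseline. Assumes scores are positive (e.g. 0-1 judge scores).
    A drop beyond max_drop points to a prompt deploy gone wrong or a silent
    provider model update."""
    baseline_mean = statistics.mean(baseline_scores)
    current_mean = statistics.mean(current_scores)
    drop = (baseline_mean - current_mean) / baseline_mean
    return {
        "baseline_mean": round(baseline_mean, 3),
        "current_mean": round(current_mean, 3),
        "relative_drop": round(drop, 3),
        "regressed": drop > max_drop,
    }
```

If `regressed` is true and you deployed a prompt change since the baseline run, roll the prompt back first; if nothing changed on your side, suspect the provider.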

The decision tree collapses to three diagnostic questions, in order:

  1. Is the provider API returning errors or rate limits? → Provider issue.
  2. Did quality scores drop? Did anything change in our prompts, context assembly, or retrieval? → Prompt/model issue.
  3. Is latency elevated but quality stable? Are token counts inflated? → Infrastructure/retrieval issue.
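
The same tree can live in code as a first-pass triage helper for whoever is on call. This is a minimal sketch; the signal names and the 5% provider-error threshold are illustrative assumptions drawn from the metric schema above.

```python
from dataclasses import dataclass

@dataclass
class IncidentSignals:
    """Hypothetical snapshot assembled from the four-layer metric schema."""
    provider_error_rate: float   # fraction of requests hitting 5xx / 429
    quality_regressed: bool      # eval scores dropped vs. baseline
    config_changed: bool         # recent prompt / retrieval / flag change
    latency_elevated: bool
    tokens_inflated: bool        # context or output tokens well above norm

def triage(s: IncidentSignals) -> str:
    """Apply the three diagnostic questions, in order."""
    if s.provider_error_rate > 0.05:             # 1. provider errors or rate limits?
        return "Provider issue: check status page, quota headroom, failover."
    if s.quality_regressed or s.config_changed:  # 2. quality drop or internal change?
        return "Prompt/model issue: rerun evals, roll back or pin the model version."
    if s.latency_elevated or s.tokens_inflated:  # 3. latency up, quality stable?
        return "Infrastructure/retrieval issue: inspect the RAG pipeline and flags."
    return "Inconclusive: widen the evaluation window and keep watching."
```

The point is not automation for its own sake; encoding the order of the questions keeps a 2 AM responder from jumping straight to the hard case when the easy one explains the page.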