
On-Call for AI Systems: Incident Response When the Bug Is the Model

· 11 min read
Tian Pan
Software Engineer

Your monitoring is green. Latency is nominal. Error rates are flat. And yet your customer support AI just told 10,000 users that returns are free — permanently — a policy that doesn't exist. No alert fired. No deploy happened. The model just decided to.

This is what on-call looks like for AI systems: a class of production failure that doesn't trigger the alarms you built, can't be traced to a line of code, and can't be fixed by rolling back the last deploy. Standard incident response playbooks — check the logs, identify the commit, revert the change, verify recovery — were designed for deterministic systems. Applied to LLMs, they miss the actual failure mode entirely.

Here's what actually works.

Why Your Runbook Breaks Down

Traditional incident response rests on three assumptions that LLMs violate:

Determinism: Given the same input, you get the same output. In LLMs, the same prompt with temperature > 0 can produce different responses on every call. You can't "reproduce the issue" without capturing exact parameters — model version, temperature, top_p, seed, full conversation history — and even then, non-determinism means you're reconstructing a probability distribution, not replaying a crash.

Explicit failure signals: Applications break by throwing errors. LLMs fail silently. A hallucination, a guardrail bypass, or a quality regression generates a 200 OK with a valid JSON payload. Nothing in your error budget moves. The damage accumulates invisibly until a human notices or a downstream system acts on bad data.

Rollback as containment: When a bad deploy causes incidents, rolling back fixes it. When an upstream model provider pushes an update that changes your LLM's behavior, there's often no rollback path. You didn't deploy anything. The model changed underneath you.

AI incidents require a different mental model: you're not debugging a crash, you're detecting a behavioral drift in a probabilistic system.

The Four Incident Types That Matter

Not all AI incidents are the same. Distinguishing between them changes both the triage approach and the containment action.

Capability regression is when the model's output quality degrades — it stops following formatting instructions, tone shifts, accuracy drops, or structured outputs break. This is often caused by an upstream model update, a prompt change that introduced ambiguity, or fine-tuned weights drifting. Detection requires behavioral baselines: you need to know what "normal" output looks like before you can detect when it changes.

Guardrail bypass is when safety filters fail or are circumvented — either through prompt injection, adversarial input, or a model update that changes the refusal threshold. This category escalated to #2 in OWASP's LLM Top 10 for 2025. It's also the most legally and reputationally damaging class of AI incident, with real examples including chatbots hallucinating policies that became binding in litigation.

Cost explosion is when an agent enters an unintended loop or makes an astronomical number of calls without a natural stopping condition. A July 2025 incident involving a coding agent accumulated $47,000 in API costs over 11 days before anyone noticed — not because alerts didn't exist, but because the alerts were configured on daily spend summaries rather than real-time rate monitoring. By the time the alert fired, the damage was done.

Latency degradation is the most familiar to SREs but has LLM-specific nuances. Track two separate signals: time-to-first-token (TTFT) is the user-visible wait before streaming begins; time-per-output-token (TPOT) determines how fast the response flows after that. P99 latency matters more than averages here — tail latency in LLM responses is often an order of magnitude worse than the median, and it's what users remember.
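The two signals can be measured directly at the streaming boundary. This is a minimal sketch, assuming your client exposes the response as an iterable of tokens; the function name is illustrative:

```python
import time

def measure_stream_latency(token_stream):
    """Consume a token stream and return (ttft, tpot) in seconds.

    ttft: time-to-first-token, the user-visible wait before streaming begins.
    tpot: average time-per-output-token over the rest of the response.
    """
    start = time.monotonic()
    first_token_at = None
    count = 0
    for _ in token_stream:
        now = time.monotonic()
        if first_token_at is None:
            first_token_at = now
        count += 1
    end = time.monotonic()
    ttft = (first_token_at - start) if first_token_at is not None else None
    # average inter-token gap after the first token
    tpot = (end - first_token_at) / (count - 1) if count > 1 else None
    return ttft, tpot
```

Feed the per-request TTFT values into a histogram and alert on the P99, not the mean, for the reasons above.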

Building the Detection Layer

Standard monitoring captures whether your system is up. AI monitoring needs to capture whether your system is behaving correctly — a meaningfully harder problem.

Output quality sampling is the closest analog to functional test coverage in production. Route a percentage of real traffic to a lightweight evaluator — either a separate model or a heuristic — that scores outputs against defined criteria: did the response stay on-topic, did it follow format instructions, did it include required disclosures. The key is doing this continuously, not just at deploy time. Quality signals that look fine at launch can degrade over weeks as upstream models update or prompt drift accumulates.
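A minimal sketch of the sampling hook, assuming a cheap heuristic evaluator; the criteria and sample rate here are placeholders you would tune per feature, and a separate evaluator model can replace `heuristic_score`:

```python
import random

SAMPLE_RATE = 0.05  # evaluate 5% of live responses, continuously

def heuristic_score(response: str, required_disclosure: str) -> dict:
    """Cheap proxy checks for response quality; illustrative only."""
    return {
        "nonempty": len(response.strip()) > 0,
        "has_disclosure": required_disclosure in response,
        "valid_length": 20 <= len(response) <= 4000,
    }

def maybe_evaluate(response: str, required_disclosure: str, emit) -> None:
    """Route a slice of real traffic to the evaluator and emit metrics."""
    if random.random() < SAMPLE_RATE:
        emit(heuristic_score(response, required_disclosure))
```

The emitted scores become a time series; the point is that the series runs for the lifetime of the feature, not just the launch week.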

Behavioral baselines give you something to compare against. Before deploying a prompt change, capture a representative sample of inputs and their expected output characteristics. Embed those outputs, store the distribution, and use cosine similarity checks against that distribution in production to detect drift. Answer drift — where responses gradually deviate from historical outputs — is one of the subtler failure modes and one of the hardest to catch without a baseline.
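A sketch of the drift check, assuming a stand-in embedding (character bigram counts) so the example is self-contained; in production you would embed both baseline and live outputs with your real embedding model so the vectors share a space:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in embedding: character-bigram counts. Swap in your real
    # embedding model; baseline and live vectors must share a space.
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def drift_alert(baseline_outputs, live_output, threshold=0.6) -> bool:
    """Flag a live response whose best similarity to the baseline set is low."""
    live_vec = embed(live_output)
    best = max(cosine(embed(b), live_vec) for b in baseline_outputs)
    return best < threshold
```

The threshold is a tuning knob: too tight and normal variance pages you, too loose and the drift you built this for slips through.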

Real-time cost rate monitoring is different from budget alerts. If you're alerting on daily spend exceeding a threshold, you're discovering the damage after it's done. Instead, compute a rolling rate — API calls per minute or tokens per minute — and alert when that rate exceeds 3x the trailing average. At that point, you can kill the job before it becomes a $47,000 problem.
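The rolling-rate check is small enough to sketch in full. This is a minimal in-memory version, assuming you bucket token usage by minute; a production version would live in your metrics pipeline:

```python
from collections import deque

class CostRateMonitor:
    """Alert when the current minute's token rate exceeds a multiple
    of the trailing per-minute average."""

    def __init__(self, window_minutes: int = 30, multiplier: float = 3.0):
        self.window = deque(maxlen=window_minutes)  # tokens per completed minute
        self.multiplier = multiplier
        self.current_minute = None
        self.current_tokens = 0

    def record(self, tokens: int, minute: int) -> bool:
        """Record usage; return True if the rate breaches the alert line."""
        if self.current_minute is None:
            self.current_minute = minute
        if minute != self.current_minute:
            # close out the previous minute and start a new bucket
            self.window.append(self.current_tokens)
            self.current_minute = minute
            self.current_tokens = 0
        self.current_tokens += tokens
        if not self.window:
            return False  # no trailing history yet
        trailing_avg = sum(self.window) / len(self.window)
        return self.current_tokens > self.multiplier * trailing_avg
```

The contrast with a daily budget alert is the whole point: this fires minutes into a runaway loop, not the morning after.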

Refusal rate tracking is a useful leading indicator for guardrail state. LLMs that refuse requests tend to reuse template phrases ("I can't help with that," "I'm not able to"). Tracking the frequency of these templates in your output stream gives you a proxy signal for whether your guardrail is behaving as expected — both over-refusal (capability regression in a safety direction) and under-refusal (guardrail bypass) show up as anomalies.
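A sketch of the tracker, with an illustrative template list and an expected band you would tune from your own baseline traffic:

```python
from collections import deque

# Illustrative templates; harvest the real ones from your model's output.
REFUSAL_TEMPLATES = ("i can't help with that", "i'm not able to", "i cannot assist")

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(t in text for t in REFUSAL_TEMPLATES)

class RefusalTracker:
    """Rolling refusal rate; anomalies in either direction are a signal."""

    def __init__(self, window: int = 1000, low: float = 0.01, high: float = 0.15):
        self.window = deque(maxlen=window)
        self.low, self.high = low, high  # expected band from baseline traffic

    def observe(self, response: str) -> str:
        self.window.append(is_refusal(response))
        rate = sum(self.window) / len(self.window)
        if rate > self.high:
            return "over-refusal"   # capability regression in a safety direction
        if rate < self.low and len(self.window) == self.window.maxlen:
            return "under-refusal"  # possible guardrail bypass
        return "ok"
```

Note that the under-refusal check only fires on a full window, so a cold start doesn't page anyone.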

Triage: Isolating the Root Cause

When an AI incident occurs, the first thing to establish is which layer caused it. There are three, and they require different expertise and different fixes.

The model layer is the LLM itself — its weights, version, and sampling parameters. Incidents here are caused by upstream provider updates, fine-tuning regressions, or parameter misconfiguration. To triage, you need to reproduce the failure with exact parameters (model version, temperature, top_p, max_tokens, seed if set, conversation history). If you can't reproduce it, you're working with samples — gather enough of them to understand the probability distribution of the failure.

The prompt layer is your system prompt and any prompt engineering code. This is where most teams have the most control and the most history. Incidents here look like behavioral changes that correlate with prompt modifications — even small wording changes can shift model behavior unpredictably. Prompt versioning — treating prompts like code with git history, test coverage, and staged rollouts — is the practice that makes this layer tractable. Without it, you're trying to debug an incident with no change log.

The data and retrieval layer covers RAG pipelines, vector databases, and any dynamic context injected into prompts. Incidents here manifest as hallucination spikes or retrieval relevance drops. The usual culprits are stale source documents, embedding drift (when the embedding model changes and old vectors no longer align with new queries), or vector database corruption. This layer is often undermonitored — teams instrument the LLM call but skip instrumenting the retrieval step.

In practice, tool versioning causes roughly 60% of production agent failures — a statistic that puts the infrastructure and data layers ahead of the model layer as the primary source of incidents.

The AI Incident Severity Framework

The standard SEV1-5 ladder needs AI-specific triggers. A system that's technically "up" but hallucinating at scale is a SEV1 even if no service is down.

| Severity | LLM-Specific Trigger | Response Time |
| --- | --- | --- |
| SEV1 | Guardrail bypass in production, agent executing unapproved commands, cost exceeding 100x baseline in real time, PII leakage | 15 minutes |
| SEV2 | Quality regression affecting >20% of traffic, cost 10-100x baseline, capability unavailable | 1 hour |
| SEV3 | Latency P99 exceeding 2x baseline, quality regression affecting <20% of traffic, cost anomaly 2-10x baseline | 4 hours |
| SEV4 | Isolated quality degradation, single-user issue, minor latency spike | 24 hours |
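The ladder can be encoded as a triage helper so the paging policy lives in code rather than a wiki page. This sketch mirrors the thresholds above; the signal names are assumptions about what your monitoring exposes:

```python
def classify_severity(*, guardrail_bypass: bool = False, pii_leak: bool = False,
                      cost_ratio: float = 1.0, quality_regression_pct: float = 0.0,
                      latency_p99_ratio: float = 1.0) -> str:
    """Map incident signals to a severity level.

    cost_ratio: current spend rate divided by baseline.
    quality_regression_pct: fraction of traffic affected (0.0-1.0).
    latency_p99_ratio: current P99 latency divided by baseline P99.
    """
    if guardrail_bypass or pii_leak or cost_ratio >= 100:
        return "SEV1"
    if quality_regression_pct > 0.20 or cost_ratio >= 10:
        return "SEV2"
    if latency_p99_ratio > 2 or quality_regression_pct > 0 or cost_ratio >= 2:
        return "SEV3"
    return "SEV4"
```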

The key addition is the cost-rate trigger at SEV1. Traditional SRE severity frameworks focus on user impact and availability. AI systems introduce a failure mode where the service is "working" — successfully generating responses — but doing so at catastrophic cost.

Containment Actions by Incident Type

Containment for AI incidents doesn't mean "the incident is over." It means "the failure is no longer spreading." The actions differ by type.

For capability regression, the primary lever is prompt rollback via feature flag. If prompts are versioned and the version history is clean, you can switch back to the last-known-good prompt in seconds without a deploy. This is why prompt versioning isn't an engineering nicety — it's a prerequisite for having any incident response capability at all.
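A minimal sketch of what that looks like, assuming a versioned prompt registry behind a flag store; the registry, feature names, and version strings are all hypothetical:

```python
# Hypothetical prompt registry: versioned prompts behind a flag, so
# rollback is a config flip rather than a deploy.
PROMPT_VERSIONS = {
    "support-agent/v12": "You are a support assistant. ...",
    "support-agent/v13": "You are a support assistant. Be concise. ...",
}

ACTIVE_PROMPT = {"support-agent": "support-agent/v13"}  # a flag store in real life

def get_system_prompt(feature: str) -> str:
    return PROMPT_VERSIONS[ACTIVE_PROMPT[feature]]

def rollback_prompt(feature: str, last_known_good: str) -> None:
    """Containment: point the feature back at the last-known-good version."""
    if last_known_good not in PROMPT_VERSIONS:
        raise ValueError(f"unknown prompt version: {last_known_good}")
    ACTIVE_PROMPT[feature] = last_known_good
```

The mechanism is trivial; the discipline of keeping the registry and flag store current is the actual work.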

For guardrail bypass, the immediate action is to disable the affected feature and route users to a fallback — a human, a static response, or a more conservative model configuration. Reducing temperature (making the model more deterministic) and tightening system prompt constraints are temporary mitigations while you investigate root cause.

For cost explosions, the containment action is a kill switch that stops new LLM calls for the affected feature or user segment. This needs to be implemented before an incident, not during one. Per-user token budgets — enforced pre-call, not post-incident via billing alerts — are the structural fix.
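Both pieces — the kill switch and the pre-call budget — can be sketched together. This is an in-memory illustration; a real implementation would back the state with a shared store so every replica enforces the same limits:

```python
class TokenBudget:
    """Per-user budget enforced before the call, plus a feature kill switch."""

    def __init__(self, per_user_limit: int = 100_000):
        self.per_user_limit = per_user_limit
        self.used: dict[str, int] = {}   # user_id -> tokens consumed
        self.killed: set[str] = set()    # features with the kill switch thrown

    def kill(self, feature: str) -> None:
        self.killed.add(feature)

    def allow(self, user_id: str, feature: str, estimated_tokens: int) -> bool:
        """Gate every LLM call: deny killed features and exhausted budgets."""
        if feature in self.killed:
            return False
        if self.used.get(user_id, 0) + estimated_tokens > self.per_user_limit:
            return False
        self.used[user_id] = self.used.get(user_id, 0) + estimated_tokens
        return True
```

The `allow` check runs before the API call, which is what makes it containment rather than accounting.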

For latency degradation, the response depends on whether the latency is in the model layer (increase timeout tolerance, switch to a faster model variant) or in the retrieval layer (reduce retrieval depth, use a cached response path).

Shadow Testing Before Incidents Happen

The best time to find behavioral regressions is before they reach production users. Shadow testing — running a new model version or prompt against real production traffic without serving results to users — gives you the coverage that benchmarks miss.

The practice is simple in concept: when evaluating a change, route 100% of production traffic to both old and new configurations simultaneously. Capture outputs from both. Compare along your quality metrics. Only graduate to A/B testing — where real users see the new version — after 7 to 14 days of shadow data shows no regression.

The cost is real: shadow testing doubles your LLM API spend during the evaluation window. The benefit is also real: this is the pattern that catches the subtle behavioral changes that pass all your unit tests and evaluation benchmarks but fail on the long tail of production inputs that nobody thought to include in a test set.

What Your On-Call Team Actually Needs

AI incidents require three types of expertise that are rarely combined in a single person. An on-call rotation for an AI system needs coverage across:

  • A software engineer who can trace requests, inspect tool calls, and investigate infrastructure and data layer issues
  • An ML engineer who understands model behavior, can interpret eval results, and knows how to adjust sampling parameters and prompts under fire
  • A data engineer who can inspect retrieval quality, audit the vector database, and verify embedding correctness

This has staffing implications. Teams that put a single SWE on AI on-call are leaving half the incident space unaddressed. The SWE can confirm whether the model is returning a response; they often can't tell whether the response is wrong, and they can't diagnose why if it is.

The other implication is documentation. Runbooks for AI incidents need to cover questions traditional runbooks don't ask: What is the last-known-good prompt version? What's the current model version and when did it last update? What's the baseline quality score for this feature, and what's the current score? Is the retrieval pipeline returning relevant results? These need to be answerable at 2am without a PhD in ML — which means they need to be instrumented, not manually checked.

The Shift That Changes Everything

The mental model underlying traditional incident response is: something broke, find the change that broke it, revert the change. This works because traditional systems are deterministic and observable. You can reason from symptoms to cause.

LLM incidents require a different mental model: something drifted, identify the layer it drifted in, and apply the appropriate lever. The symptoms are behavioral, the cause is probabilistic, and the fix is often about narrowing the distribution of outputs rather than patching a specific bug.

The teams that handle AI incidents well are the ones that accepted this difference early. They built behavioral baselines before they needed them, treated prompts like deployable artifacts with version history, instrumented real-time cost rate rather than daily summaries, and staffed on-call with a range of expertise. None of this is especially exotic engineering. It's just applying the discipline of production operations to a system that doesn't behave like the systems that discipline was designed for.
