AI Oncall: What to Page On When Your System Thinks

· 11 min read
Tian Pan
Software Engineer

A team running a multi-agent market research pipeline spent eleven days watching their system run normally — green dashboards, zero errors, normal latency — while four LangChain agents looped against each other in an infinite cycle. By the time someone glanced at the billing dashboard, the week's projected cost of $127 had become $47,000. The agents had never crashed. The API never returned an error. Every infrastructure alert stayed silent.

This is the defining problem of AI oncall: your system can be operationally green while failing catastrophically at the thing it's supposed to do. Traditional monitoring was built to detect crashes, latency spikes, and error rates. AI systems can hit all their infrastructure SLOs while silently producing wrong outputs, looping on a task indefinitely, or spending thousands of dollars on computation that produces nothing useful. The absence of errors is not evidence of correctness.

Why AI Incidents Are Different

When a traditional service fails, the incident was almost certainly caused by something that recently changed — a deployment, a config push, a dependency update. You can reproduce the failure by replaying the same request. You can bisect the problem by rolling back until the bug disappears.

AI incidents break all three of those assumptions.

Non-determinism makes reproduction unreliable. The same input can produce a different output tomorrow. When OpenAI rolled back a GPT-4o update in April 2025 after users reported the model had become aggressively agreeable — validating conspiracy theories and praising fraudulent investment schemes — the detection didn't come from internal monitoring. It came from a Reddit thread with 10,000 upvotes. The model hadn't crashed. There were no error spikes. The sycophancy pattern wasn't reproducible on demand; it was a statistical property of the output distribution.

Root cause is ambiguous by design. Traditional debugging narrows to a single defect. AI investigation narrows the field of suspects without isolating one cause — a model provider update, a drift in query distribution, a prompt injection campaign, or a change in how users phrase requests can all produce the same degraded outcome. Your runbook needs to accommodate that ambiguity rather than stall until certainty arrives.

Quality regression can happen without any deployment. AI systems can degrade when nothing on your side changed: an upstream model provider silently updates their base model, training data becomes stale, or query distribution shifts as new users discover your product. The "what changed recently?" question that anchors every traditional incident investigation may have no answer.

Incidents accrue silently at scale. A conventional bug like a SQL injection surfaces on the individual requests that trigger it. A behavioral regression in an LLM can affect thousands of responses before any human reviewer sees the pattern. Documented AI safety incidents increased 56% from 2023 to 2024, largely because production deployments scaled faster than detection capabilities.

The Four Alert Categories That Actually Matter

When designing AI oncall alerting, you're covering four distinct failure classes — and you need instruments for all of them.

1. Quality degradation

This is the hardest to detect and the most important. Your system is returning 200s with sub-200ms latency while producing outputs that are wrong, harmful, or useless.

Quality alerting requires running automated evaluations against sampled production traffic. The approach that works in practice: use an LLM-as-judge to score a random sample of live responses on dimensions relevant to your use case — factual accuracy, task completion, adherence to format constraints, absence of harmful content. Alert when those scores shift by more than two standard deviations from baseline, or when more than 5% of sampled responses fall below a defined quality threshold.
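The two alert conditions above can be sketched as a small check over judge scores. This is a minimal illustration, not a production implementation: the function names and the 0.7 quality floor are assumptions, and the judge scores themselves would come from whatever LLM-as-judge pipeline you run against sampled traffic.

```python
import statistics

def quality_alerts(baseline_scores, sample_scores,
                   quality_floor=0.7, low_fraction_limit=0.05):
    """Return the quality alerts triggered by a sample of judge scores.

    baseline_scores: judge scores from a known-good reference period.
    sample_scores: judge scores for the current production sample.
    quality_floor and low_fraction_limit (the 5% rule) are illustrative.
    """
    alerts = []
    baseline_mean = statistics.mean(baseline_scores)
    baseline_stdev = statistics.pstdev(baseline_scores)
    sample_mean = statistics.mean(sample_scores)

    # Alert when the sample mean shifts more than two standard
    # deviations from the baseline distribution.
    if baseline_stdev > 0 and abs(sample_mean - baseline_mean) > 2 * baseline_stdev:
        alerts.append("mean_shift")

    # Alert when more than 5% of sampled responses fall below the floor.
    low = sum(1 for s in sample_scores if s < quality_floor)
    if low / len(sample_scores) > low_fraction_limit:
        alerts.append("low_quality_fraction")
    return alerts
```

In practice both thresholds need tuning per use case; the point is that the alert is computed over a distribution of scores, not over any single response.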

The investment is real: teams report spending six or more months tuning thresholds before quality alerts become reliable enough to page on rather than noise. Multi-metric alerting — combining response quality, user engagement proxies, and session length — reduces false positives by roughly 40% compared to any single signal.

For RAG systems, add faithfulness scoring — is the response grounded in the retrieved context, or is the model confabulating beyond what the sources support? This is the alert that catches corpus drift before it becomes user-visible.

2. Token budget anomalies

Token spend anomalies are one of the most actionable AI-specific signals, because they indicate something structurally wrong with how your system is processing requests — not just a bad output on a specific run.

The GetOnStack incident is the canonical example: context accumulation caused token usage to grow from 5,000 to 80,000+ tokens per session step. That 16x growth ratio was the detectable signal. The absolute cost wasn't unusual yet when the growth started — the ratio of actual to expected tokens was.

Alert on deviation from expected token consumption, not just absolute cost. If your system averages 1,000 tokens per request and a session is consuming 45,000, that's the alert — regardless of whether the bill looks alarming yet.
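A ratio-based check is short enough to show in full. This is a sketch under stated assumptions: the 5x threshold is an illustrative starting point, not a recommendation, and the expected-token figure would come from your own per-endpoint baselines.

```python
def token_anomaly(expected_tokens, actual_tokens, ratio_threshold=5.0):
    """Flag sessions whose token use is a multiple of the expected budget.

    The signal is the ratio of actual to expected tokens, not absolute
    cost: growth from 5,000 to 80,000 tokens per step is a 16x ratio
    that fires long before the bill looks alarming.
    """
    ratio = actual_tokens / expected_tokens
    return ratio >= ratio_threshold, ratio
```

A session averaging 1,000 tokens per request that suddenly consumes 45,000 yields a 45x ratio and fires immediately, even though the absolute spend is still small.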

The critical lesson from the $47K incident: alerts require humans to respond. A hard token cap enforced at the infrastructure layer stops runaway costs automatically. Soft alerts and hard limits serve different purposes. Alert at 70% of session budget; kill at 100%. The kill should require no human intervention at 2 AM.
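The soft-alert/hard-kill split can be sketched as a small budget tracker. The class name and return values are assumptions for illustration; in a real deployment the kill path belongs in a proxy layer that application code cannot override.

```python
class SessionBudget:
    """Soft alert at 70% of a session's token budget, hard kill at 100%."""

    def __init__(self, max_tokens, alert_fraction=0.7):
        self.max_tokens = max_tokens
        self.alert_at = int(max_tokens * alert_fraction)
        self.used = 0
        self.alerted = False

    def record(self, tokens):
        """Record token usage; return 'ok', 'alert', or 'kill'."""
        self.used += tokens
        if self.used >= self.max_tokens:
            return "kill"    # enforced automatically, no human required
        if self.used >= self.alert_at and not self.alerted:
            self.alerted = True
            return "alert"   # page a human, but keep serving
        return "ok"
```

The alert fires once per session so a paged human isn't re-paged on every subsequent request, while the kill check runs unconditionally.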

3. Refusal rate deviations

Refusal rate — the fraction of requests where the model declines to respond or heavily hedges its answer — is a leading indicator of model behavior changes. It should be tracked as a metric with alert thresholds in both directions.

A refusal rate spike can mean:

  • A model provider updated their safety tuning
  • A prompt injection campaign is triggering safety filters
  • A recent prompt change inadvertently made requests look policy-violating

A refusal rate drop is often more dangerous — it can indicate the model is becoming more permissive, trending toward the kind of sycophancy that triggered the GPT-4o rollback. When models rarely refuse, hallucination rates spike on hard or adversarial questions.

Segment refusal rates by request category. A spike in refusals for medical queries with stable rates elsewhere points to category-specific model behavior, not a broad system issue — and each requires a different response.
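A two-sided, category-segmented refusal check might look like the sketch below. The 50% relative tolerance is an assumed placeholder; real thresholds depend on each category's baseline volume and variance.

```python
def refusal_alerts(rates_by_category, baseline_by_category, tolerance=0.5):
    """Two-sided refusal-rate check, segmented by request category.

    Fires when a category's refusal rate moves more than `tolerance`
    (a relative fraction) above OR below its baseline. The downward
    direction matters: a drop can signal creeping permissiveness,
    not an improvement.
    """
    alerts = {}
    for category, rate in rates_by_category.items():
        base = baseline_by_category[category]
        if rate > base * (1 + tolerance):
            alerts[category] = "spike"
        elif rate < base * (1 - tolerance):
            alerts[category] = "drop"
    return alerts
```

Because the output is keyed by category, a medical-only spike is distinguishable at a glance from a system-wide shift.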

4. Infrastructure with AI-aware extensions

Standard infrastructure monitoring (latency, error rate, availability) remains necessary but is now the floor, not the ceiling. Two additions that matter specifically for AI:

Time to first token (TTFT) as a distinct metric from total latency. For interactive applications, users experience TTFT as responsiveness; total generation time is less perceptible. These require different optimizations and can degrade independently. A provider outage that slows the inference backend typically shows in TTFT before it shows in error rates.
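Measuring TTFT separately from total generation time is a one-timer-per-stream affair. The sketch below is provider-agnostic: it assumes only that responses arrive as an iterable of streamed tokens.

```python
import time

def timed_stream(token_iterator):
    """Consume a token stream, recording TTFT and total generation time.

    Returns (tokens, ttft_seconds, total_seconds). The two timings
    degrade independently and deserve separate metrics.
    """
    start = time.monotonic()
    ttft = None
    tokens = []
    for token in token_iterator:
        if ttft is None:
            ttft = time.monotonic() - start  # time to first token
        tokens.append(token)
    total = time.monotonic() - start         # total generation time
    return tokens, ttft, total
```
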

Session step count and conversation depth. Agents that loop, tool calls that recurse unexpectedly, and conversations that grow without bound all show up as unusually high step counts before they show up in cost or error alerts. A session that's on step 47 of a task that should take 8 steps is almost certainly in trouble.
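The step-count heuristic reduces to a single comparison. The 3x overshoot factor is an assumption for illustration; a hard turn limit should still exist downstream of this soft check.

```python
def step_overrun(current_step, expected_steps, factor=3):
    """Flag sessions whose step count far exceeds the task's expected depth.

    A session on step 47 of a task that should take 8 steps exceeds a
    3x tolerance and is almost certainly looping.
    """
    return current_step > expected_steps * factor
```
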

What an AI Incident Actually Looks Like

An AI incident often starts as the absence of a signal rather than the presence of one. Normal dashboards, normal error rates, and then a user complaint, a billing anomaly, or a quality score distribution that's shifted a few points in the wrong direction.

The incident response arc is different too. Traditional incidents prioritize isolation and root cause identification before remediation. For AI incidents, isolation is often harder (you can't always reproduce the triggering input) and root cause ambiguity is normal. The staged approach that works:

First hour: stop the bleed. You probably don't know root cause yet. That's fine. What containment options do you have? Route to a fallback model. Disable the affected feature flag. Increase output filtering. Reduce the temperature setting to constrain output variance. The goal is to reduce blast radius while investigation proceeds in parallel.

First 24 hours: fan out and strengthen. Now you're looking for patterns. What's the scope — is this affecting all queries or a specific category? Did any provider updates come through? Did query distribution shift? Run your golden eval dataset against the current system state. Use automation to analyze logs at scale — manual review of LLM outputs doesn't scale to production incident investigation.

Post-incident: fix at the source and build regression tests. The most durable output of an AI incident is a new test case. Every failure mode you diagnose in production should become a regression test that runs before every prompt change and model update. This is how you build institutional memory in a system where the "code" is natural language.
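A minimal sketch of what the incident-to-regression-test loop looks like, assuming a `respond` callable for your system and a `judge` scorer (both placeholders): each case in the suite began life as a diagnosed production failure.

```python
def run_regression_suite(cases, respond, judge, floor=0.7):
    """Run incident-derived regression cases; return the ids that fail.

    cases: [{"id": ..., "prompt": ...}, ...], one per past failure mode.
    respond: your system under test (prompt -> response).
    judge: an LLM-as-judge scorer (prompt, response -> score in [0, 1]).
    Run this before every prompt change and model update.
    """
    failures = []
    for case in cases:
        score = judge(case["prompt"], respond(case["prompt"]))
        if score < floor:
            failures.append(case["id"])
    return failures
```

Gating deploys on an empty failure list is how a one-off production diagnosis becomes durable institutional memory.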

Building a Runbook That Helps at 2 AM

The failure mode of AI oncall runbooks is that they describe what to observe rather than what to do. An on-call engineer at 2 AM doesn't need a description of the monitoring system — they need a decision tree with unambiguous branches.

A useful AI incident runbook entry has six fields:

Detection signal: What specifically triggered this — the alert name, the threshold that was crossed, and what metric category it belongs to (quality, cost, refusal, infrastructure).

Blast radius assessment: How many users or sessions are currently affected, and how do you determine that quickly? This should be a single query or dashboard link.

Immediate containment options in order of invasiveness: Model version rollback. Feature flag disable. Routing traffic to a fallback. Rate limit the affected endpoint. The runbook should list these with the command or UI action required, not a description of the concept.

Investigation checklist: What changed recently on your side? What changed on the provider side? Check the model provider's status page and changelog. Run the golden eval suite against current state. Check token consumption by session. Look for refusal rate changes in the last 24 hours. Check whether any upstream data sources changed.

Escalation criteria: At what point does this go from on-call engineer to engineering leadership? Frame this in terms of user-visible impact and financial exposure, not technical metrics.

Recovery validation: Before closing the incident, what does "confirmed recovered" look like? Typically: quality eval scores back within baseline range, token consumption back within expected band, refusal rates within normal variance. Run the validation before marking resolved.

One failure mode to design against: AI quality can appear recovered while still degraded. If your recovery signal is "error rate back to zero," you'll close incidents too early. Recovery validation must include quality signal, not just infrastructure signal.
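The recovery-validation checklist can be encoded so an incident cannot be closed on infrastructure signals alone. All field names and thresholds below are illustrative assumptions.

```python
def confirmed_recovered(metrics, baseline):
    """Recovery validation that includes quality, not just infrastructure.

    Returns (recovered, per_check_results). Error rate back to zero is
    necessary but never sufficient on its own.
    """
    checks = {
        "errors": metrics["error_rate"] <= baseline["error_rate_max"],
        "quality": metrics["quality_score"] >= baseline["quality_min"],
        "tokens": metrics["tokens_per_request"] <= baseline["tokens_max"],
        "refusals": abs(metrics["refusal_rate"] - baseline["refusal_rate"])
                    <= baseline["refusal_tolerance"],
    }
    return all(checks.values()), checks
```

Returning the per-check breakdown alongside the verdict tells the responder exactly which signal is still out of band.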

The 2 AM Problem and Hard Enforcement

There's a gap between alerting systems and automated enforcement that costs teams real money. Alerts without enforcement work when:

  • The issue is easy to detect quickly
  • A human is immediately available and empowered to act
  • The action required is well-understood

For AI cost blowups, none of those conditions hold at 2 AM on a Tuesday. The GetOnStack team had alerts. Their system had monitoring. What they lacked was a hard token cap that would have stopped the infinite loop automatically — without requiring anyone to look at a dashboard.

The operational principle: anything you'd want stopped automatically at 2 AM needs to be enforced in code, not just monitored. Token limits enforced at the proxy layer before your application code can override them. Circuit breakers that open when tool call failure rates spike. Hard conversation turn limits that gracefully hand off to humans rather than continuing indefinitely.
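One of those enforcement mechanisms, a circuit breaker on tool-call failures, fits in a few lines. This is a deliberately minimal sketch: the consecutive-failure trip condition and the reset-on-success policy are illustrative choices, not the only reasonable ones.

```python
class ToolCircuitBreaker:
    """Open the circuit when consecutive tool-call failures spike."""

    def __init__(self, max_failures=5):
        self.max_failures = max_failures
        self.failures = 0

    def allow(self):
        """True while the breaker is closed; False refuses the call fast."""
        return self.failures < self.max_failures

    def record(self, success):
        """Record a call outcome; a success closes the breaker."""
        if success:
            self.failures = 0
        else:
            self.failures += 1
```

The point is that the refusal happens in code at call time, with no dashboard, no page, and no human in the loop.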

Monitoring tells you what happened. Enforcement stops it from happening again while you sleep.

Organizational Readiness

The technical infrastructure is the easier half. The harder half is organizational.

Only 28% of organizations currently link AI observability data to business KPIs, according to a 2025 industry survey. Most teams measure AI system health as engineers — uptime, latency, error rates — rather than connecting quality signals to the outcomes that actually matter: user satisfaction, task completion rates, cost per successful outcome.

The cultural failure modes are predictable: underreporting (fear of blame suppresses early signals), alert fatigue (unclear ownership means nobody owns the signal), and the "nothing changed on our side" response to quality regressions that actually originated from a provider model update.

One concrete change that helps: build a fast path from user feedback to oncall. The GPT-4o sycophancy issue and several other production incidents were detected by user reports on social media before internal monitoring caught them. If a Reddit thread with 10,000 upvotes about your model's behavior reaches your oncall faster than your automated systems, your automated systems are not doing their job.

The teams that navigate AI oncall well treat it as a distinct competency — not an extension of traditional SRE practice, but a new capability that draws on SRE fundamentals while requiring new instruments, new runbooks, and new organizational habits for a class of failures that has no clear precedent in traditional software operations.

Your infrastructure being green means your system is running. It does not mean your system is working. That distinction is the foundation of effective AI oncall.
