AI Oncall: What to Page On When Your System Thinks
A team running a multi-agent market research pipeline spent eleven days watching their system run normally — green dashboards, zero errors, normal latency — while four LangChain agents looped against each other in an infinite cycle. By the time someone glanced at the billing dashboard, the week's projected cost of $127 had become $47,000. The agents had never crashed. The API never returned an error. Every infrastructure alert stayed silent.
This is the defining problem of AI oncall: your system can be operationally green while failing catastrophically at the thing it's supposed to do. Traditional monitoring was built to detect crashes, latency spikes, and error rates. AI systems can hit all their infrastructure SLOs while silently producing wrong outputs, looping on a task indefinitely, or spending thousands of dollars on computation that produces nothing useful. The absence of errors is not evidence of correctness.
Why AI Incidents Are Different
When a traditional service fails, the incident was almost certainly caused by something that recently changed — a deployment, a config push, a dependency update. You can reproduce the failure by replaying the same request. You can bisect the problem by rolling back until the bug disappears.
AI incidents break all three of those assumptions.
Non-determinism makes reproduction unreliable. The same input can produce a different output tomorrow. When OpenAI rolled back a GPT-4o update in April 2025 after users reported the model had become aggressively agreeable — validating conspiracy theories and praising fraudulent investment schemes — the detection didn't come from internal monitoring. It came from a Reddit thread with 10,000 upvotes. The model hadn't crashed. There were no error spikes. The sycophancy pattern wasn't reproducible on demand; it was a statistical property of the output distribution.
Root cause is ambiguous by design. Traditional debugging narrows to a single defect. AI investigation narrows the field of suspects without isolating one cause — a model provider update, a drift in query distribution, a prompt injection campaign, or a change in how users phrase requests can all produce the same degraded outcome. Your runbook needs to accommodate that ambiguity rather than stall until certainty arrives.
Quality regression can happen without any deployment. AI systems can degrade when nothing on your side changed: an upstream model provider silently updates their base model, training data becomes stale, or query distribution shifts as new users discover your product. The "what changed recently?" question that anchors every traditional incident investigation may have no answer.
Incidents accrue silently at scale. A broken SQL query fails loudly on the request that hits it. A behavioral regression in an LLM can affect thousands of responses before any human reviewer sees the pattern. Documented AI safety incidents increased 56% from 2023 to 2024, largely because production deployments scaled faster than detection capabilities.
The Four Alert Categories That Actually Matter
When designing AI oncall alerting, you're covering four distinct failure classes — and you need instruments for all of them.
1. Quality degradation
This is the hardest to detect and the most important. Your system is returning 200s with sub-200ms latency while producing outputs that are wrong, harmful, or useless.
Quality alerting requires running automated evaluations against sampled production traffic. The approach that works in practice: use an LLM-as-judge to score a random sample of live responses on dimensions relevant to your use case — factual accuracy, task completion, adherence to format constraints, absence of harmful content. Alert when those scores shift by more than two standard deviations from baseline, or when more than 5% of sampled responses fall below a defined quality threshold.
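In code, that pipeline is small. The sketch below is illustrative rather than any particular vendor's API: the judge call is a stub for whatever model you use as a grader, and the 2% sample rate and 0.6 quality floor are placeholder values to tune against your own baseline.

```python
import random
import statistics

def judge_score(request: str, response: str) -> float:
    """Placeholder for an LLM-as-judge call: ask a strong model to grade the
    response 0.0-1.0 on accuracy, task completion, and format adherence."""
    raise NotImplementedError("call your judge model here")

def sample_and_score(traffic: list[dict], sample_rate: float = 0.02) -> list[float]:
    """Score a random slice of live traffic rather than every request."""
    sampled = [t for t in traffic if random.random() < sample_rate]
    return [judge_score(t["request"], t["response"]) for t in sampled]

def quality_alerts(scores: list[float], baseline_mean: float, baseline_std: float,
                   floor: float = 0.6, max_below_floor: float = 0.05) -> list[str]:
    """Fire when the sampled window drifts more than two standard deviations
    from baseline, or when over 5% of responses fall below the quality floor."""
    alerts = []
    if abs(statistics.mean(scores) - baseline_mean) > 2 * baseline_std:
        alerts.append("mean quality shifted more than 2 sigma from baseline")
    below = sum(s < floor for s in scores) / len(scores)
    if below > max_below_floor:
        alerts.append(f"{below:.0%} of sampled responses below quality floor")
    return alerts
```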
The investment is real: teams report spending six or more months tuning thresholds before quality alerts become reliable enough to page on rather than noise. Multi-metric alerting — combining response quality, user engagement proxies, and session length — reduces false positives by roughly 40% compared to any single signal.
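One simple shape for that multi-metric combination is a k-of-n gate, sketched below with hypothetical signal names standing in for whatever proxies you actually track.

```python
def should_page(quality_drop: bool, engagement_drop: bool,
                session_length_anomaly: bool, required: int = 2) -> bool:
    """Page only when at least `required` independent signals agree,
    trading a little detection latency for fewer false pages."""
    return sum([quality_drop, engagement_drop, session_length_anomaly]) >= required

should_page(quality_drop=True, engagement_drop=False, session_length_anomaly=False)  # False
should_page(quality_drop=True, engagement_drop=True, session_length_anomaly=False)   # True
```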
For RAG systems, add faithfulness scoring — is the response grounded in the retrieved context, or is the model confabulating beyond what the sources support? This is the alert that catches corpus drift before it becomes user-visible.
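A faithfulness check reuses the same judge pattern, graded against the retrieved context instead of general quality. The prompt wording and the 90% grounded threshold below are assumptions to adapt to your corpus.

```python
FAITHFULNESS_PROMPT = """You are grading a RAG response.

Context passages:
{context}

Response:
{response}

Is every claim in the response supported by the context passages?
Answer "grounded" or "unsupported", then list any unsupported claims."""

def faithfulness_alert(grounded_fraction: float, threshold: float = 0.90) -> bool:
    """Alert when the share of sampled responses judged grounded in their
    retrieved context drops below the threshold: an early sign of corpus drift."""
    return grounded_fraction < threshold
```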
2. Token budget anomalies
Token spend anomalies are one of the most actionable AI-specific signals, because they indicate something structurally wrong with how your system is processing requests — not just a bad output on a specific run.
The GetOnStack incident is the canonical example: context accumulation caused token usage to grow from 5,000 to 80,000+ tokens per session step. That 16x growth ratio was the detectable signal. The absolute cost wasn't unusual yet when the growth started — the ratio of actual to expected tokens was.
Alert on deviation from expected token consumption, not just absolute cost. If your system averages 1,000 tokens per request and a session is consuming 45,000, that's the alert — regardless of whether the bill looks alarming yet.
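Here is a sketch of that ratio check, with placeholder warn and page thresholds; with the numbers above, a 45,000-token session against a 1,000-token expectation pages immediately.

```python
def token_ratio_alert(actual_tokens: int, expected_tokens: int,
                      warn_ratio: float = 3.0, page_ratio: float = 10.0) -> str | None:
    """Compare actual token consumption for a request or session step against
    its expected budget. The 3x warn and 10x page ratios are illustrative."""
    ratio = actual_tokens / max(expected_tokens, 1)
    if ratio >= page_ratio:
        return f"page: {actual_tokens} tokens is {ratio:.1f}x the expected {expected_tokens}"
    if ratio >= warn_ratio:
        return f"warn: {actual_tokens} tokens is {ratio:.1f}x the expected {expected_tokens}"
    return None

token_ratio_alert(45_000, 1_000)  # "page: 45000 tokens is 45.0x the expected 1000"
```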
The critical lesson from the $47K incident: alerts require humans to respond. A hard token cap enforced at the infrastructure layer stops runaway costs automatically. Soft alerts and hard limits serve different purposes. Alert at 70% of session budget; kill at 100%. The kill should require no human intervention at 2 AM.
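A minimal sketch of that split, assuming a per-session token budget tracked in your orchestration layer; the notify and kill hooks are placeholders for your own paging and cancellation paths.

```python
class SessionBudget:
    """Soft alert at 70% of the session token budget, automatic kill at 100%."""

    def __init__(self, max_tokens: int, alert_fraction: float = 0.7):
        self.max_tokens = max_tokens
        self.alert_at = int(max_tokens * alert_fraction)
        self.used = 0
        self.alerted = False

    def record(self, tokens: int) -> None:
        """Call after every model response with the tokens it consumed."""
        self.used += tokens
        if not self.alerted and self.used >= self.alert_at:
            self.alerted = True
            self.notify_oncall()    # soft: a human decides whether to intervene
        if self.used >= self.max_tokens:
            self.kill_session()     # hard: no human needed at 2 AM

    def notify_oncall(self) -> None:
        ...                         # page, Slack, ticket: whatever you use

    def kill_session(self) -> None:
        raise RuntimeError(f"token budget exhausted: {self.used}/{self.max_tokens}")
```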
3. Refusal rate deviations
Refusal rate — the fraction of requests where the model declines to respond or heavily hedges its answer — is a leading indicator of model behavior changes. It should be tracked as a metric with alert thresholds in both directions.
A refusal rate spike can mean:
- A model provider updated their safety tuning
- A prompt injection campaign is triggering safety filters
- A recent prompt change inadvertently made requests look policy-violating
A refusal rate drop is often more dangerous — it can indicate the model is becoming more permissive, trending toward the kind of sycophancy that triggered the GPT-4o rollback. When models rarely refuse, hallucination rates spike on hard or adversarial questions.
Segment refusal rates by request category. A spike in refusals for medical queries with stable rates elsewhere points to category-specific model behavior, not a broad system issue — and each requires a different response.
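Here is one way to track that, sketched with assumed data shapes: a window of (category, was_refusal) observations, per-category baselines, a symmetric tolerance, and a minimum sample size so low-traffic categories don't page spuriously.

```python
from collections import defaultdict

def refusal_alerts(window, baselines, tolerance=0.05, min_requests=50):
    """window: iterable of (category, was_refusal) pairs from recent traffic.
    baselines: dict mapping category to its expected refusal rate.
    Alerts fire in both directions: spikes and drops are both page-worthy."""
    counts = defaultdict(lambda: [0, 0])               # category -> [refusals, total]
    for category, was_refusal in window:
        counts[category][0] += int(was_refusal)
        counts[category][1] += 1

    alerts = []
    for category, (refused, total) in counts.items():
        if total < min_requests:
            continue                                   # too little traffic to judge
        rate, expected = refused / total, baselines.get(category, 0.0)
        if rate > expected + tolerance:
            alerts.append(f"{category}: refusals up, {rate:.1%} vs {expected:.1%} baseline")
        elif rate < expected - tolerance:
            alerts.append(f"{category}: refusals down, {rate:.1%} vs {expected:.1%} baseline")
    return alerts
```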
4. Infrastructure with AI-aware extensions
Standard infrastructure monitoring (latency, error rate, availability) remains necessary, but it is now the floor, not the ceiling. Keep those dashboards, and treat them as the substrate the three AI-specific alert categories above sit on: a system that is green on every infrastructure panel can still be failing at the thing it is supposed to do, which is exactly what quality, token budget, and refusal rate alerts are there to catch.
