On-Call at 3am for an AI Feature That Didn't 500
The pager goes off at 3:02 AM. You squint at your phone expecting the usual: a database failover, a CDN edge that wandered off, a 500 spike from a service nobody touched in eight months. Instead the alert reads: summarizer.eval-on-traffic.helpfulness rolling-1h: 4.21 → 4.05 (Δ -0.16). No HTTP error. No latency spike. No service is down. Every request the system served in the last hour returned a 200 with a body that parsed cleanly. And yet something is unmistakably worse than it was at midnight, and the rotation expects you to figure out what.
This is the on-call shift the standard runbook wasn't written for. The thing that broke didn't break — it regressed. The error budget you've been tracking for years is denominated in availability and latency, and the failure mode that paged you isn't visible in either. The page is real, the customer impact is real, and your usual diagnostic loop — check the deploy log, check the dependency graph, find the bad release, roll it back — runs into a wall the moment you realize that "the bad release" might be a 30-line system-prompt diff that landed at 4 PM yesterday and looked completely innocuous in code review.
This is the shape of on-call when half your failure modes don't trip the alerts you already have. The team that figures this out builds a second muscle alongside the SRE one — eval-driven incident response — and the team that doesn't keeps finding out about quality drops from a Twitter thread or a churn cohort report two weeks later. Practitioners now report 14–18 day lags between the onset of silent quality degradation and the first user complaint that gets escalated. That is not an alerting problem. That is the absence of a category of alerting.
Why the page fired without a 500
Traditional alerting watches the wrong half of the failure surface for AI features. An LLM-powered endpoint can return HTTP 200 with a structurally valid response and still be wrong, off-tone, ungrounded, refusing things it shouldn't refuse, agreeing to things it shouldn't agree to, citing sources that don't exist, or hallucinating with high confidence. None of those failure modes show up in p95 latency or error rate. They show up in user behavior — retries, edits, abandoned sessions, thumbs-down — and in evaluation signals run against the actual production stream.
That second category — eval-on-traffic — is what fired your pager. A sampled subset of live requests is scored continuously by a judge model or a deterministic checker, the rolling score is treated as a first-class metric, and a regression in that score escalates the same way a CPU spike does. The mechanics are settled enough now that most LLM observability platforms ship the pattern by default, and the discipline of tying that score to an SLO ("99% of summarizer responses must score ≥ 4.0 on helpfulness over a 24-hour window") is what turns a vibes-based "the model feels worse today" into a paging condition with a burn rate.
The crucial property is that this signal leads the lagging ones. Thumbs-down, retry rates, and session abandonment all eventually move when quality drops, but they move days later, after enough users have hit the regression and bothered to express displeasure. By then your churn cohort is already shaped. Eval-on-traffic moves in minutes because the judge runs against the same traffic the user just saw.
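To make the pattern concrete, here is a minimal sketch of an eval-on-traffic loop, assuming a hypothetical score_with_judge call and an in-process rolling window; real deployments keep the window in the metrics store of whatever observability platform they run, but the shape is the same: sample a slice of live traffic, score it, and compare the rolling aggregate against the SLO floor.

```python
import random
import time
from collections import deque

SAMPLE_RATE = 0.05      # score ~5% of live traffic in steady state
SLO_FLOOR = 4.0         # helpfulness SLO: rolling score must stay >= 4.0
WINDOW_SECONDS = 3600   # rolling 1-hour window

scores = deque()        # (timestamp, score) pairs inside the rolling window

def score_with_judge(request: str, response: str) -> float:
    """Hypothetical judge call: ask a judge model to rate helpfulness 1-5.
    Stands in for whatever judge prompt and model your platform actually runs."""
    raise NotImplementedError

def on_request_served(request: str, response: str) -> None:
    """Hook on the serving path: sample, score, and prune the rolling window."""
    if random.random() > SAMPLE_RATE:
        return
    now = time.time()
    scores.append((now, score_with_judge(request, response)))
    # Drop samples that have aged out of the rolling window.
    while scores and scores[0][0] < now - WINDOW_SECONDS:
        scores.popleft()

def rolling_helpfulness() -> float | None:
    """Rolling-1h mean: the number the pager compared when it read 4.21 -> 4.05."""
    if not scores:
        return None
    return sum(s for _, s in scores) / len(scores)

def slo_breached() -> bool:
    score = rolling_helpfulness()
    return score is not None and score < SLO_FLOOR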
The new diagnostic loop
The runbook for "service is down" is decades old and reflexive. The runbook for "service is silently worse" is neither of those things, and most of the on-call rotations shipping AI features in production are still inventing it. The diagnostic questions are different from the start.
The first question is no longer "what changed in the last 24 hours" against the deploy graph alone. It is the same question, but the surface area has tripled:
- Did a model migration land? Provider version bumps are the most common silent regressor — claude-X.Y → X.Z looks like a version bump and behaves like a behavioral diff. Many teams pin model IDs and gate migrations behind eval suites for exactly this reason.
- Did a prompt change merge? System prompts and few-shot examples are code by every meaningful definition, but they are often shipped through different paths than service code, sometimes by people who don't get paged. The on-call has to know to look at the prompt repo, not just the service repo.
- Did a tool, retriever, or knowledge source change? Index rebuilds, embedding model swaps, and chunking changes all manifest as quality regressions downstream while looking like infrastructure work upstream.
- Did the judge itself drift? This is the trickiest one. The judge prompt is also a prompt; the judge's model can also migrate; the judge's calibration set can grow stale. A "regression" sometimes turns out to be the judge re-baselining onto a slightly different rubric. Teams who run a calibration set against the judge on a monthly cadence catch this (a minimal version of that check is sketched after this list); teams who don't chase ghost regressions twice a quarter.
- Did the input distribution shift? A marketing campaign, a partner integration, or a localized rollout can change the shape of incoming requests without changing any code. The model isn't worse; it's seeing more of the part of the distribution it's already weakest on.
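The judge-drift check is the one most teams under-specify, so here is a minimal sketch. It assumes a frozen JSONL calibration set with human reference scores and the same hypothetical score_with_judge call as in the earlier sketch; the only question it answers is whether the judge still scores known examples the way it used to.

```python
import json
import statistics

DRIFT_TOLERANCE = 0.15   # how far mean judge error may move before it counts as drift

def score_with_judge(request: str, response: str) -> float:
    """Same hypothetical judge call as in the earlier sketch."""
    raise NotImplementedError

def load_calibration_set(path: str) -> list[dict]:
    """Frozen examples with human reference scores, one JSON object per line,
    e.g. {"request": "...", "response": "...", "reference_score": 4.0}."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def judge_calibration_drift(path: str, baseline_mean_error: float) -> dict:
    """Re-score the frozen set and compare mean |judge - human| to the stored baseline.
    If the gap widens, the judge moved, not production quality."""
    examples = load_calibration_set(path)
    errors = [
        abs(score_with_judge(ex["request"], ex["response"]) - ex["reference_score"])
        for ex in examples
    ]
    mean_error = statistics.mean(errors)
    return {
        "mean_error": mean_error,
        "baseline_mean_error": baseline_mean_error,
        "drifted": abs(mean_error - baseline_mean_error) > DRIFT_TOLERANCE,
    }
```

Run this on a cadence and store the result next to the eval-on-traffic metrics, so "the judge is drifting" can be confirmed or ruled out in one query at 3 AM.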
The output of this loop is a hypothesis, not a fix. The fix is usually one of three things: roll back a prompt or model pin, narrow the rollout, or accept the regression and open a ticket against the eval suite that should have caught it before it reached production. The third one feels unsatisfying to engineers who came up on rollback-and-blameless-postmortem, and it is correct anyway — most "fixes" for soft regressions are eval improvements, not code patches.
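When the hypothesis does land on a prompt or model change, the rollback is only fast if the pins are explicit and versioned. A minimal sketch of what a pin can look like; the field names and version scheme are illustrative, not any particular framework's config:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SummarizerConfig:
    # Pin the exact model ID, never a floating alias, so a provider-side
    # migration cannot land without producing a reviewable diff on this line.
    model_id: str = "claude-X.Y"                    # elided as in the prose; pin the full dated ID
    # Prompts are versioned artifacts; rolling one back is reverting this field.
    prompt_version: str = "summarizer-prompt@v41"   # illustrative version scheme
    temperature: float = 0.2

# Rolling back yesterday's 4 PM prompt change is then a one-line revert:
#   prompt_version = "summarizer-prompt@v41"  ->  "summarizer-prompt@v40"
```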
Designing the alerts so you actually get paged
The hardest design problem isn't the alert; it's the threshold. Eval scores are noisier than CPU. A naïve "alert when helpfulness drops by 5%" wakes you up every other night for sample-size noise on a slow Tuesday. A naïve "alert when helpfulness is below 4.0 for an hour" misses gradual drift entirely because the rolling average sits at 4.05 forever. The pattern that survives contact with real on-call shifts has a few properties:
- Scores are bucketed by feature, persona, and request shape. A single "overall quality" number averages too much together; a regression in the "long, multilingual, retrieval-heavy" slice gets washed out by the "short, English, no-tools" majority. Teams that segment their evals by these axes catch regressions hours before the global metric moves.
- Burn-rate alerts beat threshold alerts. SRE-style multi-window, multi-burn-rate alerting (fast-burn over 1h, slow-burn over 6h or 24h) is a much better fit for eval-on-traffic than a fixed cutoff. A half-point drop sustained for an hour is a different incident than a tenth-of-a-point drift sustained for a day, and they want different runbooks; a sketch of the pattern follows this list.
- Critical evals are deterministic; expensive evals are sampled. Anything that is a hard policy floor — refusal of unsafe content, presence of a required citation, schema validity — runs against every request, because the cost of missing one is high and the cost of running a regex is zero. The judge-model evals that cost real money per call are sampled to the rate the budget allows, which is usually 1–10% in steady state and 100% during a suspected incident.
- The judge has its own monitoring. The judge's average score, refusal rate, and calibration drift against a held-out set are all metrics that themselves get tracked and alerted. If the judge gets weirder, the alerts it produces get weirder, and you want to know that before you act on them.
- User-behavior signals corroborate, they don't lead. Thumbs-down rates, retry rates, and edit-distance on user revisions all belong in the dashboard, but they are confirmation that an eval-on-traffic regression is real-world impactful, not the primary trigger. Treating them as the primary trigger reproduces the 14–18 day lag that motivated the whole exercise.
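Here is the multi-window sketch promised in the burn-rate bullet above, applied to eval scores rather than error budgets. The windows, thresholds, and the rolling_score helper are all illustrative assumptions, with rolling_score standing in for whatever the metrics store actually exposes:

```python
BASELINE = 4.2   # long-run healthy score for this slice, e.g. the trailing-7-day mean

def rolling_score(window_seconds: int) -> float | None:
    """Mean judge score over the trailing window, read from your metrics store.
    Generalizes the rolling_helpfulness helper in the earlier sketch."""
    raise NotImplementedError

def burn_rate(window_seconds: int, baseline: float = BASELINE) -> float | None:
    """How far below baseline the window sits: the quality deficit being 'burned'."""
    score = rolling_score(window_seconds)
    if score is None:
        return None
    return max(0.0, baseline - score)

def should_page() -> bool:
    """Fast burn: a sharp drop over the last hour pages immediately."""
    fast = burn_rate(1 * 3600)
    return fast is not None and fast >= 0.15      # e.g. 4.21 -> 4.05 trips this

def should_ticket() -> bool:
    """Slow burn: a smaller deficit sustained across 6h and 24h files a ticket."""
    slow_6h = burn_rate(6 * 3600)
    slow_24h = burn_rate(24 * 3600)
    return (
        slow_6h is not None
        and slow_24h is not None
        and slow_6h >= 0.05
        and slow_24h >= 0.05
    )
```

The detail that matters is the pairing: a fast window that catches the 3 AM cliff, plus slow windows that catch the gradual drift a fixed cutoff never sees.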
What the on-call rotation has to learn
Most rotations were staffed by SREs and backend engineers who learned their craft on systems where the failure modes were uptime, latency, capacity, and config drift. None of those people are wrong to be on the rotation, but the skill set the rotation needs has expanded, and the team that ignores that ends up with a two-tier on-call: SREs handle the 500s, AI engineers handle the soft regressions, and an alert that's actually a hybrid of the two takes twice as long to triage because it bounces between two rotations who don't fully understand each other's mental model.
A rotation that handles AI features end-to-end has a few shared pieces of knowledge that everyone carries, regardless of whether they came in through the SRE door or the ML door. They know where the prompts live and how to diff them. They know which model IDs are pinned and which are floating. They know how to query the eval-on-traffic store for the last 24 hours of scores by slice. They know which evals run continuously and which are gated behind sampling. They know how to roll back a prompt change as a first-class deploy, not as a code-style PR that needs a human reviewer at 3:02 AM. They know the judge's calibration history well enough to recognize "the judge is drifting" as a hypothesis worth investigating.
The handoff documentation has to expand to match. A weekly on-call digest that only enumerates 500s and capacity events isn't telling the next shift what they need to know. The digest that helps the next on-call covers the prompt diffs that landed, the model migrations in flight, the eval slices currently flagged as borderline, the judge calibration runs that drifted, and the open silent-regression incidents whose fixes are eval improvements rather than code patches. Without that, every shift starts cold against a surface area that changes faster than the rotation can absorb.
The postmortem looks different too
Once the alert is acknowledged and the regression is contained, the postmortem on a soft-regression incident asks a different set of questions than the one for a 500. The "why did this fail" is rarely "this code was wrong"; it's much more often "no eval case caught this failure mode, and our judge wasn't sensitive to the drift, and the prompt change reviewer didn't have a way to predict the downstream effect." The action items lean toward eval improvements, judge calibration runs, prompt review tooling, and rollout policy changes. Code changes are the minority outcome.
The healthiest pattern is to treat every soft-regression incident as an automatic source of new eval cases. The exact request shape that paged you, the exact model output that scored low, and the slice of the distribution that drifted all become test fixtures. The next time a prompt change is proposed, the eval suite that runs against it includes the failure mode that woke you up tonight. This is the same flywheel that mature ML teams have run for years against offline data; the addition is hooking it directly to the on-call alert pipeline so that the production signal feeds back into the gating signal without a human in the middle.
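The hook itself can be small. A minimal sketch, assuming the eval suite reads fixtures from a JSONL file and the incident tooling can hand over the low-scoring sampled traffic; every field name here is illustrative:

```python
import json
from datetime import datetime, timezone

FIXTURES_PATH = "evals/fixtures/regressions.jsonl"   # illustrative path

def capture_regression_fixture(
    request: str,
    response: str,
    judge_score: float,
    incident_id: str,
    slice_labels: list[str],
) -> None:
    """Turn the exact traffic that paged you into a permanent eval case;
    the next prompt or model change has to clear it before it ships."""
    fixture = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "incident": incident_id,
        "slice": slice_labels,            # e.g. ["long", "multilingual", "retrieval-heavy"]
        "input": request,
        "bad_output": response,           # the low-scoring production output, for reference
        "judge_score_at_capture": judge_score,
        "expectation": "helpfulness >= 4.0",   # what the gating eval will assert
    }
    with open(FIXTURES_PATH, "a") as f:
        f.write(json.dumps(fixture) + "\n")
```

Appending at capture time rather than curating on the spot matters: the fixture lands while the incident is still open, and pruning or deduplication can happen later, but the failure mode is in the gate from the next deploy onward.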
The team that builds this loop ends up with an on-call rotation that gets quieter over time, because each incident permanently inoculates the next deploy against its own failure mode. The team that doesn't ends up paging the same engineer for the same shape of regression every six weeks, with a slightly different prompt diff each time, and eventually that engineer leaves and the institutional knowledge of which prompts were tuned against which failure modes leaves with them.
A new on-call discipline, not a new tool
The temptation, every time a category of alert is invented, is to buy a tool that handles it and consider the problem solved. The eval-on-traffic platforms are real and useful and you should pick one, but the harder work is the rotation discipline: the runbook updates, the burn-rate thresholds, the judge calibration cadence, the prompt-diff review process, the postmortem format, and the explicit decision that AI quality regressions are first-class incidents on the same severity scale as availability ones.
The 3 AM page that fires because helpfulness slipped from 4.21 to 4.05 is the exact moment that distinction becomes operational rather than philosophical. The team that has done the work picks up the page, opens a runbook that names the right diagnostic loop, and finds the prompt diff or the model pin or the index rebuild within thirty minutes. The team that hasn't picks up the page, stares at it, escalates it to whoever wrote the prompt three weeks ago, and discovers in the morning that the page was right, the regression was real, and the root cause is sitting in a PR description that nobody on the rotation could have parsed at 3 AM. The pager is the same. The shift is what changes.
