Skip to main content

The On-Call Rotation That Muted LLM Pages Because Every One Looked Like The Last One

· 11 min read
Tian Pan
Software Engineer

A real regression burned for two days in production. The page had fired. It had fired correctly, at the right threshold, with the right severity. Three weeks earlier the on-call rotation had added a silence rule for that alert family because every page in that family had so far resolved with the same comment: "nothing to do, investigating." The post-mortem could not honestly call the silence a mistake. It was a rational adaptation to a stream of pages the rotation had no playbook for. The regression that mattered shipped against a muted channel because the team's monitoring stack was producing signals it could not act on, and the team's response to that was the only one available: stop listening.

This is not an alerting bug. It is a structural property of how AI features get instrumented when teams reach for the playbook they already know. Latency, error rate, refusal rate, output schema conformance, judge-eval drift — each one is a defensible metric. Each one fires with the same diffuse "model behavior changed" wording. None of them tells the on-call engineer what to do, because no one has written the runbook that maps each signal to an action, because most of the time the signal does not map to an action. The rotation absorbs the noise until the noise is louder than the signal, then it routes around the channel that produces it.

The Catchpoint SRE Report 2025 puts on-call stress at the source of burnout and attrition for nearly 70% of SREs. The numbers under that headline are worse than they look. Typical enterprise rotations see hundreds of alerts per day across all sources, while Google's SRE Workbook recommends no more than two actionable pages per shift. The gap between "fires" and "actionable" is where rotations break, and AI features sit at the worst end of that gap because every signal they produce is probabilistic by construction. The on-call's mental model — "this alert means X is broken, do Y" — works for a stuck deployment. It does not work for "refusal rate moved 0.4 points over a four-hour window."

Why AI Alerts Are Structurally Noisy

A traditional alert fires because a deterministic predicate flipped. CPU crossed 90%. The deployment pipeline returned non-zero. A health check timed out. The on-call has a small set of hypotheses and a known investigation path. The page maps to a runbook entry that maps to commands. The cognitive load is high during the incident but the structure is stable across incidents.

An AI feature page fires because a distribution moved. The model started declining a class of requests it used to accept. Schema conformance dropped from 99.1% to 97.8%. The judge eval that scores response quality drifted by half a point. None of these statements is wrong. None of them is, by itself, a fault. They could mean a real regression in the upstream model. They could mean a benign cohort shift — your traffic pattern changed because a customer rolled out a feature that asks the model different questions. They could mean the provider rebalanced load across a different fleet. They could mean nothing.

The on-call cannot tell the difference from the page. The diagnostic step required is not "ssh to the box and read the log." It is "pull the last 24 hours of traces, sample 50, read them, decide whether the change is a regression or a drift." That is a 20-minute task at minimum, often longer, and the typical answer is "nothing to do." The third or fourth time the rotation pays that cost for nothing, the rotation learns that the cost is not worth paying.

Layer on top of this that the LLM observability market has converged on dashboards that emit alerts for every metric they track. Faithfulness drops, safety regressions, drift across prompts, latency tails, token-count anomalies, cost spikes. Each one is reasonable in isolation. The sum is a rotation receiving pages for five orthogonal signal classes from a feature that, three months ago, did not page at all.

The Local Rationality of the Silence Rule

When the rotation adds a silence rule, no one in the room is making a bad decision. They are looking at three weeks of pages and noting, accurately, that the page-to-action ratio is approximately zero. They are noting, accurately, that pages have a real cost: sleep disruption, context switching off the work the engineer was doing, on-call attrition. They are noting, accurately, that the on-call engineer cannot resolve the page even when it represents something real, because the diagnostic path is multi-hour and requires deep familiarity with the feature.

The silence rule is rational at the level of the rotation. It becomes catastrophic at the level of the system. The signal that the rotation has muted is the same signal that would have caught the real regression. The team has not improved the signal; they have eliminated the channel through which the signal would have arrived. The rotation has bought immediate quality-of-life back at the cost of the team's ability to detect quality regressions in production.

This is the same dynamic as cancer screening with a high false-positive rate. The individual decision to skip a test after three benign results is reasonable; the population-level outcome is detection latency for the cases that matter. Engineering teams handle this in physical health by improving the test, not by improving the patient's discipline at showing up. In on-call we usually flip that — we try to improve the engineer's discipline at responding to noisy pages, and we are surprised when the discipline does not survive a quarter.

What Signals Earn Their Page

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates