Skip to main content

One post tagged with "alert-fatigue"

View all tags

The On-Call Rotation That Muted LLM Pages Because Every One Looked Like The Last One

· 11 min read
Tian Pan
Software Engineer

A real regression burned for two days in production. The page had fired. It had fired correctly, at the right threshold, with the right severity. Three weeks earlier the on-call rotation had added a silence rule for that alert family because every page in that family had so far resolved with the same comment: "nothing to do, investigating." The post-mortem could not honestly call the silence a mistake. It was a rational adaptation to a stream of pages the rotation had no playbook for. The regression that mattered shipped against a muted channel because the team's monitoring stack was producing signals it could not act on, and the team's response to that was the only one available: stop listening.

This is not an alerting bug. It is a structural property of how AI features get instrumented when teams reach for the playbook they already know. Latency, error rate, refusal rate, output schema conformance, judge-eval drift — each one is a defensible metric. Each one fires with the same diffuse "model behavior changed" wording. None of them tells the on-call engineer what to do, because no one has written the runbook that maps each signal to an action, because most of the time the signal does not map to an action. The rotation absorbs the noise until the noise is louder than the signal, then it routes around the channel that produces it.