The Five-Surface Triage Tree: An AI On-Call Playbook for Pages That Don't Fit Your Runbook

April 27, 2026 · 12 min read

Tian Pan · Software Engineer

The page fires at 2:47 AM. The agent is sending wrong-tone replies to customer support tickets, the latency dashboard is flat, the error rate is normal, and there is nothing to roll back because nothing was deployed in the last twelve hours. The on-call engineer opens the runbook, scrolls past "restart the worker pool" and "scale the queue," reaches the bottom, and finds nothing that maps to the page in front of them. They start reading the system prompt at 3:04 AM. They are still reading it at 3:31 AM.

This is the new failure shape, and the rotation that was designed for "high latency means restart the pod, elevated 5xx means roll back the deploy, queue depth growing means scale the worker pool" is not equipped to handle it. The first instinct, rolling back the deploy, is wrong because nothing was deployed. Instead, the model was upgraded silently behind a versioned alias, a third-party tool's response shape drifted, the prompt version skewed across regions, or the eval set went stale weeks ago and the regression has been compounding the whole time. The page is real. The runbook is silent. AI on-call is its own discipline now, and trying to retrofit it into the existing rotation produces playbooks whose first step is silence on the call while everyone reads the prompt for the first time.
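To make the "nothing was deployed" point concrete, here is a minimal sketch of what checking for those silent changes might look like. It compares a last-known-good snapshot against the current state for the four causes named above. Everything in it is an assumption for illustration: the `Snapshot` fields, the `silent_changes` function, the model and region names, and the 30-day staleness threshold are hypothetical, not part of any real tool or of the playbook itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Snapshot:
    """Hypothetical record of things that can change without a deploy."""
    resolved_model: str              # concrete model version behind the alias
    tool_response_keys: frozenset    # top-level keys seen in a tool's response
    prompt_version_by_region: dict   # region name -> prompt version
    eval_set_refreshed: date         # last time the eval set was updated


def silent_changes(baseline, current, today, max_eval_age_days=30):
    """Return one human-readable finding per surface that moved."""
    findings = []
    if current.resolved_model != baseline.resolved_model:
        findings.append(
            f"model alias now resolves to {current.resolved_model} "
            f"(was {baseline.resolved_model})"
        )
    if current.tool_response_keys != baseline.tool_response_keys:
        # Symmetric difference: keys that were added or removed.
        changed = sorted(current.tool_response_keys ^ baseline.tool_response_keys)
        findings.append(f"tool response shape drifted: {changed}")
    if len(set(current.prompt_version_by_region.values())) > 1:
        findings.append(
            f"prompt version skew across regions: {current.prompt_version_by_region}"
        )
    if (today - current.eval_set_refreshed).days > max_eval_age_days:
        findings.append("eval set is stale")
    return findings


# Illustrative values only; the model and region names are made up.
baseline = Snapshot(
    resolved_model="model-2026-01-15",
    tool_response_keys=frozenset({"status", "body"}),
    prompt_version_by_region={"us-east": "v39", "eu-west": "v39"},
    eval_set_refreshed=date(2026, 1, 10),
)
current = Snapshot(
    resolved_model="model-2026-04-20",
    tool_response_keys=frozenset({"status", "data"}),
    prompt_version_by_region={"us-east": "v41", "eu-west": "v39"},
    eval_set_refreshed=date(2026, 2, 1),
)
findings = silent_changes(baseline, current, today=date(2026, 4, 27))
for finding in findings:
    print(finding)
```

The point of the sketch is the order of questions rather than the specific checks: before anyone opens the system prompt, ask which inputs to the agent could have moved without a deploy, and compare each against a recorded baseline.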

About Tian Pan