The Five-Surface Triage Tree: An AI On-Call Playbook for Pages That Don't Fit Your Runbook
The page fires at 2:47 AM. The agent is sending wrong-tone replies to customer support tickets, the latency dashboard is flat, the error rate is normal, and there is nothing to roll back because nothing was deployed in the last twelve hours. The on-call engineer opens the runbook, scrolls past "restart the worker pool" and "scale the queue," reaches the bottom, and finds nothing that maps to the page in front of them. They start reading the system prompt at 3:04 AM. They are still reading it at 3:31 AM.
This is the new failure shape, and the rotation that was designed for "high latency means restart the pod, elevated 5xx means roll back the deploy, queue depth growing means scale the worker pool" is not equipped to handle it. The first instinct — roll back the deploy — is wrong because nothing was deployed: the model upgraded silently behind a versioned alias, a third-party tool's response shape drifted, the prompt version skewed across regions, or the eval set went stale weeks ago and the regression has been compounding the whole time. The page is real. The runbook is silent. AI on-call is its own discipline now, and trying to retrofit it into the existing rotation produces playbooks whose first step is silence on the call while everyone reads the prompt for the first time.
The argument of this piece is that the right starting move for an AI page is not "what changed in the deploy" but "which of five surfaces drifted" — and that the team that has not pre-built a triage tree, a freeze button, a replay harness, and a severity rubric for the new failure modes will burn an hour of an outage learning each of those lessons on the call. The fix is not heroics. It is the same instinct that produced classical SRE in the first place: codify the triage, instrument the surfaces, rehearse the response.
Start the call by naming which surface drifted
The first branch of an AI triage tree is not "what was the last deploy" — it is "which of five surfaces is the source of the change?" The five surfaces are code, model, prompt, tool, and data. Every page resolves to a drift on one of them. The on-call's job in the first five minutes is to walk down each branch with a concrete check, not a hunch.
A workable triage checklist looks like this:
- Code change. Pull the deploy log for the affected service. If something shipped in the last 24 hours that touched the agent loop, the prompt assembly, the tool dispatcher, or the response parser, that branch goes hot.
- Model change. Resolve the model alias to its current concrete version. Compare to the version recorded in the last green eval run. Check the provider status page for upgrades, deprecations, or capacity events. A silent upgrade behind an alias is the most common "nothing was deployed but everything changed" cause.
- Prompt change. Diff the prompt registry against the last-known-good version. Many prompt regressions are three-word edits that a non-engineer made in a config UI. Some are version skew across regions, where a flag rollout left two regions on different prompt revisions and the page is regional.
- Tool change. Diff the tool schemas against the version the agent was tested with. Check the tool's own status page. Many "the agent is doing the wrong thing" pages turn out to be a downstream API that started returning a new field or stopped returning an old one, and the agent's parser quietly fell back to a worse path.
- Data change. Check the retrieval index version, the freshness timestamp on the indexed corpus, the embedding model version, and the chunking config. A re-index with a slightly different chunking strategy can shift answer quality without changing a single line of code.
The discipline is that this checklist is run first, in order, in under five minutes, before anyone starts reading the prompt. Reading the prompt is a deep-dive. Walking the five surfaces is triage. The two are different jobs, and conflating them is how the call ends with three engineers who all opened the prompt and none who pulled the model alias log.
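Encoded, the walk is small enough for an incident bot to run before a human is fully awake. The sketch below is a toy, assuming a snapshot object your platform would populate from the deploy log, model registry, prompt registry, tool registry, and index metadata; every field and version name in it is illustrative, not a real API.

```python
# Hypothetical sketch of the five-surface triage walk. The point is the
# shape: one concrete check per surface, run in order, hot branches only,
# before anyone opens the prompt.

from dataclasses import dataclass

@dataclass
class Check:
    surface: str    # code | model | prompt | tool | data
    drifted: bool
    evidence: str   # paste this into the incident channel

def triage(svc: dict) -> list[Check]:
    """Walk the five surfaces; return every branch that goes hot."""
    return [c for c in (
        # 1. Code: anything shipped in the last 24h that touched the agent path?
        Check("code",
              drifted=len(svc["deploys_last_24h"]) > 0,
              evidence=f"deploys in window: {svc['deploys_last_24h']}"),
        # 2. Model: does the alias still resolve to the last green-eval version?
        Check("model",
              drifted=svc["alias_resolves_to"] != svc["last_green_eval_model"],
              evidence=f"alias -> {svc['alias_resolves_to']}, "
                       f"last green eval -> {svc['last_green_eval_model']}"),
        # 3. Prompt: registry revision vs last known good.
        Check("prompt",
              drifted=svc["prompt_rev"] != svc["last_known_good_prompt_rev"],
              evidence=f"prompt rev {svc['prompt_rev']} vs known-good "
                       f"{svc['last_known_good_prompt_rev']}"),
        # 4. Tool: schema hash vs the version the agent was tested with.
        Check("tool",
              drifted=svc["tool_schema_hash"] != svc["tested_tool_schema_hash"],
              evidence="tool schema diverged from the tested version"),
        # 5. Data: retrieval index vs the version the eval set ran against.
        Check("data",
              drifted=svc["index_version"] != svc["eval_index_version"],
              evidence=f"retrieval index {svc['index_version']} vs "
                       f"eval-time {svc['eval_index_version']}"),
    ) if c.drifted]

# Example: a silent model upgrade behind the alias, nothing else changed.
snapshot = {
    "deploys_last_24h": [],
    "alias_resolves_to": "model-2026-02-11",
    "last_green_eval_model": "model-2025-12-03",
    "prompt_rev": "v41", "last_known_good_prompt_rev": "v41",
    "tool_schema_hash": "abc123", "tested_tool_schema_hash": "abc123",
    "index_version": 7, "eval_index_version": 7,
}
for check in triage(snapshot):
    print(f"[HOT] {check.surface}: {check.evidence}")
```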
Freeze every variable surface before you investigate
The second move, after the surface has been named, is to freeze the rest of the system. Classical incident response froze code by stopping deploys. AI incident response has to freeze five things at once: the resolved model version, the prompt version, every tool schema the agent uses, the retrieval index version, and the configuration of the agent harness itself.
This matters because investigation is racing against drift. While the on-call is reading the prompt, the prompt registry might be receiving a hotfix from a product manager in another timezone. While the on-call is comparing model behavior, the alias might silently resolve to a newer checkpoint. While the on-call is debugging the tool output, the upstream API might ship a fix to the bug that caused the page. The investigation that started at 3 AM against five moving targets has produced no usable evidence by 5 AM.
The freeze button has to be a real artifact, not a discipline. In practice it looks like:
- A registry-level pin that converts every alias in the affected surface into a concrete version for the duration of the incident, and rejects any write to those records until the freeze is lifted.
- A canary-style traffic split that holds a percentage of traffic on the frozen versions while the rest of the system continues to serve, so the investigation has both a "before" and an "after" sample to compare.
- An audit log that records exactly which versions were pinned at incident-time, so the postmortem has a reproducible snapshot to point at when it asks "was this regression real or a measurement artifact."
The freeze is the AI equivalent of kubectl cordon. It does not fix the problem. It buys the on-call a stable surface to investigate against. Without it, every diagnostic the on-call runs is implicitly comparing two snapshots of the system that may not even share the same version vector.
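What the pin looks like in code is unglamorous. A minimal sketch, assuming registries that map aliases to concrete versions and accept a write guard; every class and method name here is invented for illustration:

```python
# Hypothetical registry-level freeze button: pin every alias to its current
# concrete version, reject writes for the duration, emit an audit snapshot.

import json, time

class FrozenError(RuntimeError):
    pass

class Registry:
    """Toy alias registry: alias -> concrete version."""
    def __init__(self, name: str, entries: dict[str, str]):
        self.name, self.entries, self.frozen = name, entries, False

    def resolve(self, alias: str) -> str:
        return self.entries[alias]

    def write(self, alias: str, version: str) -> None:
        if self.frozen:
            raise FrozenError(f"{self.name} is pinned for an active incident")
        self.entries[alias] = version

def freeze(incident_id: str, registries: list[Registry]) -> dict:
    """Pin every alias at its current resolution and record the snapshot."""
    snapshot = {"incident": incident_id, "at": time.time(), "pins": {}}
    for reg in registries:
        reg.frozen = True                       # reject writes until lifted
        snapshot["pins"][reg.name] = dict(reg.entries)
    print(json.dumps(snapshot))                 # stand-in for the audit log
    return snapshot

def unfreeze(registries: list[Registry]) -> None:
    for reg in registries:
        reg.frozen = False

# Usage: freeze the model and prompt registries at incident-time.
model = Registry("model", {"support-agent": "model-2026-02-11"})
prompt = Registry("prompt", {"support-agent": "v41"})
freeze("INC-4731", [model, prompt])
try:
    prompt.write("support-agent", "v42")        # hotfix from another timezone
except FrozenError as e:
    print("rejected:", e)
```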
Build a replay harness so "did the model behave differently on this input" is a 30-second answer
The classical analog is the core dump. When a server crashes, the operator has the stack trace, the heap, the inputs, the environment. They can reproduce the crash in a sandbox. AI agents did not have this primitive until recently, and its absence is the single largest contributor to long MTTR on AI pages.
A replay harness is a system that captures, for every production trace, the exact inputs to every model call and every tool call along with the version vector that produced the output. The on-call can then point the harness at any past trace, swap one variable — the model version, the prompt revision, the tool schema, the retrieval result — hold every other variable constant, and ask "would this trace have produced a different answer if only this had changed?" The deterministic-replay literature has been arguing for this primitive for two years and the tooling has finally caught up: Message-Action Trace records, replay stubs that intercept nondeterministic dependencies and serve them from the recorded trace, and harnesses that ship as part of the agent framework rather than as a bolt-on observability layer.
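The capture side of that primitive is small enough to sketch. The stub below is not any particular framework's API; `live_model_call`, the trace shape, and the ticket text are assumptions, and a real harness would record tool calls and retrieval results the same way it records model calls:

```python
# Hypothetical record/replay stub: in "record" mode it passes calls through
# to the live model and logs inputs, outputs, and the version vector; in
# "replay" mode it serves outputs from the recorded trace instead.

import json

def live_model_call(prompt: str) -> str:
    # Stand-in for the real provider client; swap in your SDK call here.
    return f"(live answer to: {prompt[:40]})"

class ModelStub:
    """Intercepts model calls; records them live or replays a captured trace."""

    def __init__(self, mode: str, version_vector: dict,
                 recorded: list | None = None):
        assert mode in ("record", "replay")
        self.mode = mode
        self.vector = version_vector    # pinned model/prompt/tool/index versions
        self.recorded = recorded or []  # prior calls, replay mode only
        self.calls: list[dict] = []     # calls seen this run

    def call(self, prompt: str) -> str:
        if self.mode == "replay":
            # Nondeterministic dependency served from the trace, not the provider.
            entry = self.recorded[len(self.calls)]
            assert entry["prompt"] == prompt, "replay diverged from recorded inputs"
            self.calls.append(entry)
            return entry["output"]
        output = live_model_call(prompt)
        self.calls.append({"prompt": prompt, "output": output})
        return output

    def dump(self) -> str:
        """Serialize trace + version vector for the incident channel."""
        return json.dumps({"vector": self.vector, "calls": self.calls})

# Record once in production...
rec = ModelStub("record", {"model": "model-2026-02-11", "prompt": "v41"})
rec.call("Draft a reply to the attached support ticket")
trace = json.loads(rec.dump())

# ...replay later in the sandbox, with no provider round-trip.
rep = ModelStub("replay", trace["vector"], recorded=trace["calls"])
print(rep.call("Draft a reply to the attached support ticket"))
```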
What the replay harness gives the on-call is differential diagnosis at clock speed. The page says "the agent is sending wrong-tone replies." The on-call pulls a captured trace from the affected window. They run it against the previous model version with the current prompt, the previous prompt with the current model, and the previous tool schema with both. In ninety seconds they have isolated which of the five surfaces is responsible for the regression, and they have a concrete trace to attach to the incident channel as evidence. Compare to the alternative: an hour of staring at the prompt, half-formed hypotheses, and a final commit message that says "I think it was the model upgrade, we'll find out next week."
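The differential run itself is a loop: hold everything constant, swap one surface back to its last-known-good version, replay, and ask a judge whether the output flipped. The stand-ins below are toys chosen so the sketch runs end to end; in this example only the model version matters, which is exactly what the loop is built to discover:

```python
# Hypothetical differential diagnosis over a captured trace: rerun once per
# surface, rolling only that surface back, and report which rollback fixes
# the output. `replay_trace` and `judge` are assumed stand-ins.

LAST_GOOD = {"model": "model-2025-12-03", "prompt": "v40", "tool_schema": "abc123"}

def differential(trace: dict, current: dict, replay_trace, judge) -> str | None:
    """Return the surface whose rollback fixes the output, or None."""
    assert not judge(replay_trace(trace, current)), \
        "baseline should reproduce the bad output"
    for surface, good_version in LAST_GOOD.items():
        candidate = {**current, surface: good_version}  # swap exactly one surface
        if judge(replay_trace(trace, candidate)):
            return surface                              # rolling this back fixed it
    return None

# Toy stand-ins: output tone depends only on the model version here.
def replay_trace(trace, vector):
    return "polite reply" if vector["model"] == "model-2025-12-03" else "curt reply"

def judge(output):
    return output == "polite reply"

current = {"model": "model-2026-02-11", "prompt": "v40", "tool_schema": "abc123"}
print(differential({"id": "trace-9912"}, current, replay_trace, judge))  # -> model
```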
The harness is also the thing that converts a page from a forensic exercise into a reproducible test case. Once the regression has been pinned to a single surface, the trace becomes a regression test that the team checks in alongside the fix. The next silent upgrade that breaks this exact behavior pages on a failing eval before it pages on a customer complaint.
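In test form it looks roughly like this, with toy stand-ins where a real suite would import from the harness:

```python
# Hypothetical checked-in regression test: the incident trace replays against
# whatever the alias currently resolves to, and fails the eval suite before
# the next silent upgrade pages a customer.

def load_trace(path):                 # stand-in: read the captured trace
    return {"id": "INC-4731", "input": "angry customer ticket"}

def current_version_vector(service):  # stand-in: resolve live aliases
    return {"model": "model-2026-02-11", "prompt": "v41"}

def replay_trace(trace, vector):      # stand-in: harness replay
    return "polite, correctly-toned reply"

def tone_score(output):               # stand-in: the eval judge
    return 0.92 if "polite" in output else 0.1

def test_inc_4731_tone_regression():
    """The next silent upgrade that breaks this behavior fails here first."""
    trace = load_trace("traces/INC-4731.json")
    vector = current_version_vector("support-agent")
    assert tone_score(replay_trace(trace, vector)) >= 0.8, \
        f"tone regressed under {vector}"
```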
Reclassify severity for the failure modes the old taxonomy didn't have
Classical severity rubrics were built around availability and correctness as binary properties. The system is up or down. The transaction completed or failed. AI failure modes do not fit this taxonomy. The agent is up. The transaction completed. The output was wrong but plausible, and the user trusted it, and the downstream consequence will land in three days when the customer escalates.
A severity rubric that papers over this with "well, the API returned 200, so it's Sev3" is the rubric that calibrates the rotation toward the wrong incidents. The fix is to add severity dimensions that match the failure modes the system actually has:
- Output quality regression. The system is up, but a measurable quality metric — refusal rate on valid requests, format consistency, instruction-following score, tone calibration — has drifted past a threshold. Sev1 if it affects a safety-relevant surface, Sev2 if it affects a revenue-bearing surface, Sev3 if it affects a non-critical surface.
- Confident wrong answers. The system is up, but the agent is producing wrong outputs without expressing uncertainty. Sev1 by default; the failure mode is the absence of a hedging signal that downstream systems and human users were relying on.
- Unauthorized action. The agent invoked a tool it should not have invoked, or invoked a tool with parameters outside its sanctioned envelope. Sev1 if the action is irreversible, Sev2 otherwise.
- Cost spike. Per-request token spend has crossed a budget threshold without a corresponding workload change. Sev2 by default; the failure mode here is "the bill is racing the next page."
What the new rubric does, organizationally, is force the on-call to triage the new failure modes with the same urgency as the old ones. A page that says "the model started lying" arrives at the same priority as a page that says "the database is down" precisely because the trust contract with the user is degraded in both cases. The team that does not make this calibration explicit will discover that "wrong but plausible" pages are getting routed to a junior engineer with a four-hour SLA, and the customer-comms drafting cycle is starting from cold every time.
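The rubric earns its keep when it is routable data rather than a wiki page, so the pager can set priority without a human adjudicating at 2:47 AM. A sketch, with illustrative failure-mode names and the escalation rules from the list above:

```python
# Hypothetical severity router encoding the rubric above. Dimension names
# and thresholds are illustrative, not a standard taxonomy.

def severity(failure_mode: str, surface: str = "non-critical",
             irreversible: bool = False) -> int:
    if failure_mode == "quality_regression":
        # Escalate by the sensitivity of the affected surface.
        return {"safety": 1, "revenue": 2}.get(surface, 3)
    if failure_mode == "confident_wrong_answer":
        return 1                     # Sev1 by default: the hedging signal is gone
    if failure_mode == "unauthorized_action":
        return 1 if irreversible else 2
    if failure_mode == "cost_spike":
        return 2                     # the bill is racing the next page
    raise ValueError(f"unknown failure mode: {failure_mode}")

assert severity("quality_regression", surface="revenue") == 2
assert severity("unauthorized_action", irreversible=True) == 1
```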
Cross-train the rotation, or split it
The last move is staffing. The engineer strongest at distributed-systems triage is rarely the one who can spot a refusal-style drift on a transcript. The engineer strongest at prompt debugging is rarely the one who can read a Kubernetes event stream. The naive answer — "we'll add an AI round to the on-call rotation" — produces pages that land on whoever was loudest in the runbook design meeting. The real fix is one of two patterns.
The first pattern is cross-training. Every engineer on the rotation goes through a 90-day apprenticeship where they shadow an AI page, run through the triage tree on captured incidents, and ship at least one fix that touched a prompt, a tool schema, or a model alias. The org that runs this in earnest pays a quarter's worth of friction and gets a rotation that can hold both shapes of page. The org that does not run it ends up routing AI pages to the same three people every time, and those three people burn out within two quarters.
The second pattern is to split the rotation. AI on-call becomes its own discipline with its own tier-one rotation, its own MTTR math, its own postmortem template, and its own severity rubric. The classical SRE rotation handles availability, capacity, and infra; the AI on-call rotation handles model, prompt, tool, and data drift. The two teams share an escalation path for incidents that span both surfaces. This is more expensive than cross-training, but it is the right answer when the AI surface area is large enough to justify dedicated headcount and when the failure modes are differentiated enough that a generalist would be a worse responder than a specialist.
The pattern that does not work is the one most teams default into: a single rotation, no cross-training, and a quiet expectation that "AI pages will go to the AI team." The page comes in at 2:47 AM, the AI team is asleep in another timezone, the on-call opens the runbook, and the first step is silence on the call.
The runbook ships before the next page
The mistake worth naming is the shape of the work. AI on-call discipline is not built during an incident. It is built between incidents, on a calendar cadence, by a team that has decided that "the model started lying" is a class of page worth pre-engineering for. That work — the triage tree, the freeze button, the replay harness, the severity rubric, the rotation design — looks like infrastructure investment with no immediate payoff, until the morning the page fires and the runbook either holds or it doesn't.
The teams that have done this work in 2026 are not the ones with the strongest models or the most expensive observability stack. They are the ones whose on-call answer to "what surface drifted" is a five-minute checklist instead of a thirty-minute prompt-reading exercise, whose incident channel produces a frozen version vector in the first ten minutes, and whose postmortem closes with a regression test attached to the trace. The rest of the industry will get there eventually. They will get there one outage at a time.
