
The On-Call Runbook That Assumed a Human Would Read the Page

· 11 min read
Tian Pan
Software Engineer

The page fired at 02:14. The runbook said "page the engineer." The engineer's name resolved to an on-call rotation. The rotation pointed at a Slack channel that the team had wired up six months ago as a unified triage surface. The first message in the channel was the alert. The second message, posted nineteen seconds later, was a calm three-sentence summary: the alerting service, the failing dependency, the last deploy. It was well-written. It ended with "Acknowledged."

The incident commander, watching from her phone in bed, read "Acknowledged" and went back to sleep. Nobody had acknowledged. The agent subscribed to that channel as a first-line triage helper had restated the alert back to the room and signed off with the verb the channel's other readers used to mean "I have the context to act on this." The incident ran unowned for forty-one minutes until a customer ticket woke a different engineer through a different surface.

The runbook was correct. The agent was working as designed. Every link in the chain assumed a human reader with on-call context and an urgency model, and the chain had quietly acquired one link that had neither. The failure mode is not that the agent did something wrong. It is that the workflow's correctness depended on a property — "the responder understands what it means to be paged at 2 AM" — that the workflow had never named, and therefore never enforced when the responder population changed.

The escalation chain is a state machine, not a notification fanout

Most teams think about paging as a notification system: an alert fires, a message lands somewhere, a person reads it, a person acts. That model is fine when every recipient is a person. The moment one of the recipients is an agent — or a workflow, or a webhook, or another system that posts back to the channel — the notification system has become a state machine, and you have to start asking which transitions are authoritative.

The "acknowledge" event is the load-bearing transition. It says: this incident now has an owner who will see it through, the page can stop, the escalation timer can reset. In a human-only chain, the social contract around acknowledging is what makes the state machine safe. A senior engineer does not type "ack" unless they actually mean it, because the cost of being wrong is that the room thinks the problem is handled when it isn't. The contract was never written down because everyone already had it.

When an agent enters the channel, the contract evaporates. The agent's job is to be helpful in the channel: to summarize, to fetch context, to suggest. None of those actions imply commitment, but every one of them produces output that reads like commitment if the receiver is pattern-matching on tone. A polite confirmation from a competent-sounding voice is, in chat, indistinguishable from a "yes I've got this." The incident commander's brain is doing the same kind of compression everyone else's does: parse the vibe, infer the state, move on.

"Acknowledge" is two different verbs you have been collapsing

The single word hides two semantics that the channel's other participants kept separate by convention:

  1. I saw this. The page reached me, my eyes registered the alert, my brain has begun parsing. This is what most agent-generated "acknowledgements" actually mean.
  2. I have the context to act on this. I am the owner of this incident now. The escalation timer can stop. Other on-call rotations should not be paged. I will be the one who answers when leadership asks for status.

A human on-call engineer says the same word for both because their presence in the channel and their position in the rotation make the second meaning obvious. An agent has neither of those signals. It is in the channel by configuration, not by rotation. Its "presence" is not a commitment to be present in fifteen minutes when the next status update is needed.

The fix is to separate the verbs in the protocol, not in the prose. The incident management platform should distinguish a "received" event (any participant, including agents and webhooks, can emit it) from an "owned" event (only a specifically authorized human in the rotation can emit it). The escalation timer should reset only on the "owned" event. Free-text messages in the channel should not be parseable as state transitions at all, by either humans or downstream automations. The state of the incident is what the state machine says it is, not what the room sounds like.

This sounds pedantic until you realize that the room sounding like the incident is handled is exactly what made the original failure mode invisible to the commander. The agent did not fool the state machine. It fooled the humans who were using the room's tone as a proxy for the state machine.
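The received/owned split described in this section can be sketched as a small state machine. This is a minimal illustration, not any real incident platform's API: `EventKind`, `Incident`, `apply`, `on_chat_message`, and the actor names are all hypothetical, and rotation membership is assumed to be a plain set of usernames.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set


class EventKind(Enum):
    RECEIVED = "received"  # "I saw this": any participant may emit it
    OWNED = "owned"        # "I am the owner": on-call humans only


@dataclass
class Incident:
    on_call_humans: Set[str]              # hypothetical rotation membership
    owner: Optional[str] = None
    escalation_running: bool = True
    seen_by: List[str] = field(default_factory=list)

    def apply(self, kind: EventKind, actor: str) -> bool:
        """Apply a typed event. Returns True only if ownership was taken."""
        if kind is EventKind.RECEIVED:
            # Recorded for visibility; never touches the escalation timer.
            self.seen_by.append(actor)
            return False
        if kind is EventKind.OWNED and actor in self.on_call_humans:
            self.owner = actor
            self.escalation_running = False
            return True
        # An "owned" attempt from anyone outside the rotation is ignored.
        return False

    def on_chat_message(self, actor: str, text: str) -> None:
        """Free text is deliberately never parsed as a state transition."""
        return None


incident = Incident(on_call_humans={"alice"})
incident.apply(EventKind.RECEIVED, "triage-agent")
incident.on_chat_message("triage-agent", "Acknowledged.")
# The room may sound handled, but the state machine still has no owner.
print(incident.owner, incident.escalation_running)
```

The design choice worth noting is that `on_chat_message` is a no-op on purpose: whatever the agent writes, including the word "Acknowledged," the only way to stop escalation is a typed `OWNED` event from someone in the rotation.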

Signed action tokens make the rotation the source of truth

The "owned" event needs to be unforgeable. A bot account with a Slack token and the right intentions can already emit any message it likes; the protocol can't tell the difference between "/ack from a human's phone" and "/ack from a webhook with a stale config." The standard fix in identity systems is to require a credential that only the intended actor can present, and to verify it server-side before treating the action as authoritative.

In practice this looks like the on-call rotation issuing a short-lived signed token to whoever is currently on shift, delivered through a channel that the agent does not have access to — the actual paging surface on the human's phone, with biometric or push confirmation. The "owned" action in the incident channel requires that token. An agent posting "/ack" in chat without the token gets a polite "received, but ownership requires the on-call action button" reply. The protocol stops conflating who can speak in the room with who can move the state machine forward.
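As a rough sketch of that flow, the token can be an HMAC-signed payload with an expiry, checked server-side before the "owned" transition is accepted. Everything named here is an assumption for illustration: `SECRET`, `issue_token`, `verify_token`, `handle_ack`, and the claim keys are hypothetical, the key would live in a secrets manager rather than in source, and the out-of-band delivery to the engineer's phone is not modeled at all.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-only-secret"  # assumption: a real key comes from a secrets store


def issue_token(engineer: str, incident_id: str, ttl_s: int = 300,
                now: Optional[float] = None) -> str:
    """Rotation side: mint a short-lived token for whoever is on shift."""
    issued = time.time() if now is None else now
    payload = json.dumps(
        {"sub": engineer, "inc": incident_id, "exp": issued + ttl_s},
        sort_keys=True,
    ).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode() + "."
            + base64.urlsafe_b64encode(sig).decode())


def verify_token(token: str, incident_id: str,
                 now: Optional[float] = None) -> Optional[str]:
    """Server side: return the engineer's name if the token is valid, else None."""
    current = time.time() if now is None else now
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:  # wrong shape or bad base64 (binascii.Error is a ValueError)
        return None
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(payload)
    if claims["inc"] != incident_id or claims["exp"] < current:
        return None
    return claims["sub"]


def handle_ack(incident_id: str, token: Optional[str] = None) -> str:
    """Chat-side /ack handler: without a valid token it is only a 'received'."""
    owner = verify_token(token, incident_id) if token else None
    if owner is None:
        return "received, but ownership requires the on-call action button"
    return f"owned by {owner}"
```

An agent calling `handle_ack("INC-1")` with no token gets the "received" reply, while the on-call engineer's action button would call it with a token minted by `issue_token`. A production version would more likely use an established signed-token format and asymmetric keys; the HMAC here only keeps the sketch self-contained.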
