3 posts tagged with "runbooks"

The Agent Runbook Your Incident Commander Could Not Execute

· 10 min read
Tian Pan
Software Engineer

The page fires at 02:17 local time. The on-call SRE pulls up the agent runbook on their phone and reads step one: "check the agent's tool-call traces for anomalous tool usage." They open the link. They hit an SSO prompt for a workspace they do not belong to. Step two says inspect the prompt-construction logs; same wall. Step three says roll back to the previous prompt version, but the deploy permission is scoped to a team they are not on. By the time they figure out which Slack channel to escalate to and wake up the AI team's product manager because she is the only person they can find at 02:17, ninety minutes have passed and the customer-visible regression is still serving wrong answers.

The post-mortem will identify the access gap as the proximate cause. The deeper discomfort is that the runbook reads fine in daylight and runs blocked at night, because the person who wrote it has access the person who executes it does not.
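One way to catch this gap in daylight is to lint the runbook against the access the executor actually holds, not the access the author holds. The sketch below is a minimal illustration under stated assumptions: the `Step` type, the permission strings such as `traces:read`, and the two access sets are hypothetical names invented for this example, not part of any real access-control system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One runbook step and the access it needs (hypothetical schema)."""
    description: str
    required_access: frozenset


def blocked_steps(steps, executor_access):
    """Return (step, missing_access) pairs the executor cannot perform."""
    blocked = []
    for step in steps:
        missing = step.required_access - executor_access
        if missing:
            blocked.append((step, missing))
    return blocked


# Hypothetical permission names, modeled on the three steps in the story.
RUNBOOK = [
    Step("Check the agent's tool-call traces", frozenset({"traces:read"})),
    Step("Inspect the prompt-construction logs", frozenset({"prompt-logs:read"})),
    Step("Roll back to the previous prompt version", frozenset({"prompt:deploy"})),
]

# Assumed access sets: the author can do everything, the on-call SRE cannot.
AUTHOR_ACCESS = frozenset({"traces:read", "prompt-logs:read", "prompt:deploy"})
ONCALL_ACCESS = frozenset({"metrics:read", "pager:ack"})

if __name__ == "__main__":
    for step, missing in blocked_steps(RUNBOOK, ONCALL_ACCESS):
        print(f"BLOCKED: {step.description} (missing: {sorted(missing)})")
```

Run against the author's access, the check reports nothing, which is exactly why the runbook reads fine to the person who wrote it. Run against the on-call role, every step is flagged before anyone is paged.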

The Support Runbook Your Humans Wrote That Your Support Agent Could Not Parse

· 11 min read
Tian Pan
Software Engineer

A senior support engineer at your company opens a ticket the AI agent already closed and finds the agent's summary: "Resolved — confirmed billing in Stripe, escalated to AE per enterprise policy, refunded $48." Every clause is plausible. None of them happened. There is no tool named check_stripe. There is no tool that looks up customer tier. The "AE" the summary mentions does not work the account anymore. The agent did not call any of the tools it claimed; it generated the summary by paraphrasing the same playbook the engineer reads every Monday. The customer is still waiting.

The runbook the agent read was correct. The customer-success team had spent two years tuning it. Senior engineers had used it to onboard juniors. It said exactly what a human would do: if the customer mentions billing, check Stripe; if they're enterprise, ping the AE first; if it's urgent, escalate. The agent's failure was not that it ignored the runbook. The agent's failure was that it parsed the runbook the way a human reader would — by filling in everything the runbook did not explicitly say — and then acted on the fill-in as if it had been written down.
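Both halves of this failure can be checked mechanically. The sketch below assumes a hypothetical setup in which each runbook step names the tool it relies on, the agent's summary is available as structured claims, and tool calls are logged. `check_stripe` comes from the story above; every other tool name, the registry contents, and the dictionary shapes are assumptions made for illustration.

```python
def unknown_tools(runbook_steps, registry):
    """Tools the runbook relies on that the agent does not actually have."""
    return sorted({step["tool"] for step in runbook_steps} - set(registry))


def unbacked_claims(claims, call_log):
    """Claims in the agent's summary with no successful tool call behind them."""
    called = {call["tool"] for call in call_log if call["ok"]}
    return [claim for claim in claims if claim["tool"] not in called]


# Hypothetical registry: the tools the agent can really call.
REGISTRY = {"search_kb", "reply_to_customer"}

# The human runbook, rewritten so each step names a tool (assumed schema).
RUNBOOK_STEPS = [
    {"when": "customer mentions billing", "tool": "check_stripe"},
    {"when": "customer is enterprise", "tool": "lookup_customer_tier"},
    {"when": "refund is approved", "tool": "issue_refund"},
]

# Structured form of the summary the agent wrote (assumed schema).
CLAIMS = [
    {"text": "confirmed billing in Stripe", "tool": "check_stripe"},
    {"text": "refunded $48", "tool": "issue_refund"},
]

CALL_LOG = []  # the agent called none of the tools it claimed

if __name__ == "__main__":
    print("Runbook needs tools that do not exist:",
          unknown_tools(RUNBOOK_STEPS, REGISTRY))
    for claim in unbacked_claims(CLAIMS, CALL_LOG):
        print("Unbacked claim:", claim["text"])
```

The first check fails at authoring time, before the agent ever reads the runbook. The second fails at close time, so a ticket whose summary has no tool calls behind it stays open instead of reaching the customer as "Resolved".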

AI-Assisted Incident Response: Giving Your On-Call Agent a Runbook

· 9 min read
Tian Pan
Software Engineer

Operational toil in engineering organizations rose to 30% in 2025 — the first increase in five years — despite record investment in AI tooling. The reason is not that AI failed. The reason is that teams deployed AI agents without the same rigor they use for human on-call: no runbooks, no escalation paths, no blast-radius constraints. The agent could reason about logs, but nobody told it what it was allowed to do.

The gap between "AI that can diagnose" and "AI that can safely mitigate" is not a model capability problem. It is a systems engineering problem. And solving it requires the same discipline that SRE teams already apply to human operators: structured runbooks, tiered permissions, and mandatory escalation points.
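As a rough sketch of what tiered permissions with a mandatory escalation point might look like, the example below gates each proposed action on its tier and its blast radius. The tier names, the thresholds, and the `Action` fields are assumptions chosen for illustration, not a schema taken from the post.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical permission tiers, ordered from least to most risky."""
    OBSERVE = 0       # read logs, query metrics
    REVERSIBLE = 1    # restart a pod, roll back a deploy
    DESTRUCTIVE = 2   # delete data, change schemas


@dataclass(frozen=True)
class Action:
    name: str
    tier: Tier
    blast_radius: int  # assumed unit: number of instances affected


def decide(action, max_auto_tier=Tier.REVERSIBLE, max_blast_radius=5):
    """Return "allow" or "escalate" for an action the agent proposes."""
    if action.tier == Tier.OBSERVE:
        return "allow"  # read-only actions are always permitted
    if action.tier > max_auto_tier:
        return "escalate"  # mandatory escalation point: a human decides
    if action.blast_radius > max_blast_radius:
        return "escalate"  # blast-radius constraint exceeded
    return "allow"


if __name__ == "__main__":
    proposals = [
        Action("read_error_logs", Tier.OBSERVE, 0),
        Action("restart_one_pod", Tier.REVERSIBLE, 1),
        Action("restart_whole_fleet", Tier.REVERSIBLE, 50),
        Action("drop_cache_table", Tier.DESTRUCTIVE, 1),
    ]
    for action in proposals:
        print(f"{action.name}: {decide(action)}")
```

The point of the design is that the agent never decides its own limits: diagnosis is unrestricted, small reversible mitigations run autonomously, and anything larger or irreversible is handed to a human.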