The Agent Runbook Your Incident Commander Could Not Execute
The page fires at 02:17 local time. The on-call SRE pulls up the agent runbook on their phone and reads step one: "check the agent's tool-call traces for anomalous tool usage." They open the link. They hit an SSO prompt for a workspace they do not belong to. Step two says inspect the prompt-construction logs; same wall. Step three says roll back to the previous prompt version, but the deploy permission is scoped to a team they are not on. By the time they figure out which Slack channel to escalate to and wake up the AI team's product manager because she is the only person they can find at 02:17, ninety minutes have passed and the customer-visible regression is still serving wrong answers.
The post-mortem will identify the access gap as the proximate cause. The deeper discomfort is that the runbook reads fine in daylight and runs blocked at night, because the person who wrote it has access the person who executes it does not.
This is the failure mode that quietly waits inside almost every agent product that survived its first quarter in production. The AI team built the agent, the observability for the agent, and the deploy pipeline for the agent. They wrote the runbook against the workflow they use to debug it. The runbook is technically correct. It is operationally undeliverable to the person who actually executes it.
The Author Persona Is Not the Reader Persona
Every runbook has two personas, and most agent runbooks confuse them. The author persona is the engineer who built the system, knows where the traces live, has credentials for every backing service, and can describe the failure modes in the vocabulary of the codebase. The reader persona is whoever is paged at 02:17. In most organizations these are different people, and in organizations with a dedicated AI platform team they are reliably different people on reliably different on-call rotations with reliably different access.
Conventional service runbooks survived this gap because the service team and the SRE rotation had been negotiating it for years. There was an unspoken contract: anything in the runbook had to be executable from the access profile of the central oncall. Dashboards rendered in Grafana, not in a team-specific tool. Logs went to the central log store, not a private S3 bucket. Deploys went through the shared deploy console. When a service team forgot this contract, the SRE team noticed during the first drill, sent a stern message, and the runbook got rewritten.
Agent runbooks broke the contract because the AI platform team typically did not exist when the contract was negotiated. They were stood up fast, they own their own observability stack for velocity or cost reasons, and they have their own deploy pipeline because prompts are not code and code review does not catch prompt regressions. None of that is wrong. What is wrong is that the runbook they ship to oncall reads like the readme for their own debugging workflow, with no acknowledgement that the person executing it does not have their tools.
Federation Is the Word You Are Avoiding
The cheap fix everyone tries first is to add the SRE rotation to the AI platform's tools. Grant them SSO into the prompt observability dashboard. Add them to the deploy group. Issue them credentials for the trace store. This works for one rotation, fails the next time someone joins or leaves, and creates an access surface the security team is going to ask hard questions about during the next audit. It is not federation. It is access sprawl with extra steps.
The right move is to push the AI platform's telemetry up into the observability surface oncall already uses. Pick a vendor-neutral instrumentation standard, OpenTelemetry being the obvious one, and emit agent traces, prompt construction logs, and tool-call decisions through it. Federate the resulting data into the central observability stack. The IC opens the same Grafana board they would for any service, sees the agent's behavior alongside everything else, and does not need a separate set of credentials to see it.
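A minimal sketch of emitting one tool-call decision as a span. The function relies only on `start_as_current_span` and `set_attribute`, two calls the OpenTelemetry Python API provides on tracers and spans. The span name and the `agent.*` attribute keys are assumptions made for this example, not an established convention. `InMemoryTracer` and `InMemorySpan` are hypothetical stand-ins so the sketch runs without the OpenTelemetry package installed.

```python
from contextlib import contextmanager


class InMemorySpan:
    """Hypothetical stand-in for an OpenTelemetry span; records attributes."""

    def __init__(self, name):
        self.name = name
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


class InMemoryTracer:
    """Hypothetical stand-in for an OpenTelemetry tracer, for local runs."""

    def __init__(self):
        self.finished = []

    @contextmanager
    def start_as_current_span(self, name):
        span = InMemorySpan(name)
        try:
            yield span
        finally:
            # Keep the span so the example can inspect what was emitted.
            self.finished.append(span)


def record_tool_call(tracer, prompt_version, tool_name, decision_reason):
    """Emit one tool-call decision as a span on whatever tracer is passed in."""
    with tracer.start_as_current_span("agent.tool_call") as span:
        # Attribute keys are assumptions chosen for this example.
        span.set_attribute("agent.prompt.version", prompt_version)
        span.set_attribute("agent.tool.name", tool_name)
        span.set_attribute("agent.tool.decision_reason", decision_reason)


tracer = InMemoryTracer()
record_tool_call(tracer, "v41", "refund_lookup", "user asked about a refund")
print(tracer.finished[0].name, tracer.finished[0].attributes)
```

In a real service the tracer would presumably come from `opentelemetry.trace.get_tracer(...)`, with an exporter pointed at the central observability stack, so the agent code never needs to know which backend the IC reads from.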
This is more work than handing out logins, which is exactly why teams default to handing out logins. The work pays back the first time someone joins the SRE rotation and the AI team does not get a JIRA ticket about it. It pays back again during an incident that touches three services, when the IC does not have to switch tools between them. The federation effort is one of the few infrastructure investments where the payoff is invisible until you suddenly need it.
Runbook Authoring as a Permission Contract
Once federation exists, the runbook itself needs a discipline most teams have never imposed: each step must declare what access it requires, and a pre-merge check has to verify that the on-call rotation actually has that access.
This sounds bureaucratic until you have shipped one. A runbook step that reads "roll back the prompt to the previous version" is actually a permission contract: it asserts that the reader holds a deploy-rollback scope on the prompt registry. Make that assertion explicit. Tag the step with the scope it requires. At merge time, validate the tag against the membership of the on-call rotation. If the rotation does not hold the scope, the runbook does not merge until either the scope is granted, the step is rewritten to use a break-glass mechanism, or a different rotation is named as the responsible party.
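One way to picture a step as a permission contract is the sketch below. `RunbookStep`, `step_can_merge`, and the scope string `prompt-registry:deploy-rollback` are all hypothetical names invented for illustration. The three ways a step can pass mirror the three resolutions described above: the scope is held, the step uses break-glass, or another rotation is named as responsible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunbookStep:
    text: str
    required_scope: str                   # scope the step asserts the reader holds
    break_glass: bool = False             # step is routed through a break-glass mechanism
    owner_rotation: Optional[str] = None  # a different rotation named as responsible


def step_can_merge(step, rotation_name, rotation_scopes):
    """Return (ok, reason) for one step checked against one rotation."""
    if step.owner_rotation and step.owner_rotation != rotation_name:
        return True, f"delegated to {step.owner_rotation}"
    if step.break_glass:
        return True, "uses break-glass path"
    if step.required_scope in rotation_scopes:
        return True, "scope held"
    return False, f"rotation lacks {step.required_scope}"


rollback = RunbookStep(
    text="Roll back the prompt to the previous version",
    required_scope="prompt-registry:deploy-rollback",
)
print(step_can_merge(rollback, "sre-primary", {"logs:read"}))
# (False, 'rotation lacks prompt-registry:deploy-rollback')
```

A fuller version would also confirm that the rotation can actually invoke the break-glass mechanism. This sketch takes the flag at face value.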
The discipline is the same one we apply to typed function signatures. The runbook step is a function call against the IC's permission set, and an undeclared scope is the runbook equivalent of an untyped argument. It compiles, it looks fine in review, it blows up at runtime when the inputs do not match.
The check itself is not exotic. Most identity providers expose group membership through an API, most deploy systems publish their scope catalog, and the on-call rotation is a list in your paging tool. Wire those three together, add a CI step that fails the runbook PR when the asserted scopes are not held by the rotation, and the failure mode shifts from a 02:17 wall of authentication prompts to a Tuesday-afternoon code review comment.
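The wiring can be sketched as a join over three lookups. The sketch assumes the identity provider's group membership, the deploy system's scope catalog, and the paging tool's rotation have already been fetched into plain dictionaries and lists. The user names, group names, and scope strings are invented for the example, and real API calls are deliberately left out.

```python
def effective_scopes(user, idp_groups, scope_catalog):
    """Union of the scopes granted by every group the user belongs to."""
    scopes = set()
    for group in idp_groups.get(user, set()):
        scopes |= scope_catalog.get(group, set())
    return scopes


def find_gaps(required_scopes, rotation, idp_groups, scope_catalog):
    """Map each on-call member to the required scopes they do not hold."""
    gaps = {}
    for user in rotation:
        held = effective_scopes(user, idp_groups, scope_catalog)
        missing = set(required_scopes) - held
        if missing:
            gaps[user] = sorted(missing)
    return gaps


def ci_exit_code(gaps):
    """Non-zero when any rotation member is missing a scope, so CI fails the PR."""
    return 1 if gaps else 0


# Hard-coded stand-ins for the three lookups (all values are illustrative).
idp_groups = {"ana": {"sre"}, "bo": {"sre", "prompt-rollback"}}
scope_catalog = {"sre": {"logs:read"}, "prompt-rollback": {"prompt:rollback"}}
rotation = ["ana", "bo"]
required = {"logs:read", "prompt:rollback"}

gaps = find_gaps(required, rotation, idp_groups, scope_catalog)
for user, missing in gaps.items():
    print(f"{user} is missing: {', '.join(missing)}")
print("exit code:", ci_exit_code(gaps))
```

Checking every member rather than the rotation as a whole is a deliberate choice here: the person paged at 02:17 is one individual, so a scope held by only some members still leaves a gap.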
The Break-Glass Path the AI Team Owes Oncall
Some steps cannot be wrapped in a permission contract because the answer to "should oncall have this scope?" is no. Deploying a new prompt version requires review by people who understand the prompt. Rotating a tool's API key may need coordination with a downstream team. Granting oncall those scopes permanently is the wrong answer.
What you owe them is a break-glass mechanism scoped to the actions an IC will actually need to take during an incident, audited heavily after the fact. A rollback-only deploy endpoint is the canonical example. It accepts one input, the previous version's identifier, and emits a single artifact: a reverted prompt. It cannot deploy a new prompt, edit an existing one, or change tool wiring. The IC can invoke it without being on the deploy team, every invocation pages the deploy team after the fact for review, and the access surface stays small because the endpoint can only do one thing.
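A rollback-only surface along these lines might look like the following sketch. `RollbackOnlyEndpoint`, its in-memory version history, and the `notify_deploy_team` callback are hypothetical. A real endpoint would sit behind authentication and talk to the actual prompt registry.

```python
from datetime import datetime, timezone


class RollbackOnlyEndpoint:
    """Break-glass surface that can revert to a prior version and nothing else."""

    def __init__(self, version_history, notify_deploy_team):
        self._history = list(version_history)  # oldest first; last entry is live
        self._notify = notify_deploy_team       # callback, e.g. a pager hook
        self.audit_log = []

    @property
    def live_version(self):
        return self._history[-1]

    def rollback(self, previous_version_id, invoked_by):
        # Only versions that were already deployed are acceptable targets,
        # so this call can never introduce a new or edited prompt.
        if previous_version_id not in self._history:
            raise ValueError(f"{previous_version_id} was never deployed")
        if previous_version_id == self.live_version:
            raise ValueError(f"{previous_version_id} is already live")
        entry = {
            "invoked_by": invoked_by,
            "from_version": self.live_version,
            "to_version": previous_version_id,
            "at": datetime.now(timezone.utc).isoformat(),
        }
        self._history.append(previous_version_id)
        self.audit_log.append(entry)
        self._notify(entry)  # deploy team reviews after the fact
        return entry


pages = []
endpoint = RollbackOnlyEndpoint(["v40", "v41"], pages.append)
endpoint.rollback("v40", invoked_by="oncall-sre")
print(endpoint.live_version)  # v40
```

The class exposes no method for deploying or editing, which is the point: the access surface is defined by what the endpoint cannot do.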
The break-glass pattern is well understood in cloud operations, and it applies directly to agent rollback; the failure mode is that teams treat it as an enterprise-grade feature to build later. It is an incident-survival feature to build before the first incident. The unit of rollback for an agent is not just a model version: it is a prompt package, tool contracts, policy layer, memory plane, and runtime permissions all together. The break-glass endpoint should restore a known-good bundle of those, not just flip a model pointer. Restoring half the bundle leaves the agent in a configuration no one tested.
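To make the "bundle, not pointer" idea concrete, here is a sketch in which the five components travel as one immutable record and a restore swaps the whole record at once. `AgentBundle`, `BundleRegistry`, and the version strings are hypothetical; how each component is actually versioned is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentBundle:
    """Everything that was tested together, pinned together."""
    bundle_id: str
    prompt_package: str
    tool_contracts: str
    policy_layer: str
    memory_plane: str
    runtime_permissions: str


class BundleRegistry:
    def __init__(self, known_good, live):
        self._known_good = {b.bundle_id: b for b in known_good}
        self.live = live

    def restore(self, bundle_id):
        """Swap the whole live bundle in one assignment, never field by field."""
        bundle = self._known_good.get(bundle_id)
        if bundle is None:
            raise KeyError(f"{bundle_id} is not a known-good bundle")
        self.live = bundle
        return bundle


good = AgentBundle("b-12", "prompts@3.1", "tools@7", "policy@2", "memory@1", "perms@4")
bad = AgentBundle("b-13", "prompts@3.2", "tools@8", "policy@2", "memory@1", "perms@4")
registry = BundleRegistry(known_good=[good], live=bad)
registry.restore("b-12")
print(registry.live.prompt_package, registry.live.tool_contracts)  # prompts@3.1 tools@7
```

Because the record is frozen, there is no code path that reverts the prompt package while leaving the newer tool contracts in place.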
Drills Are Not Optional When the Reader Is Not the Author
Even with federation, declared scopes, and a break-glass endpoint, the runbook will rot. Permissions change. Tools get renamed. The prompt registry adds a step. The only way to keep the runbook executable is to actually execute it, with the IC, end to end, against a synthetic incident, on a cadence.
This is where the AI team's instinct fights them again. They will offer to "test" the runbook themselves. That is not a drill of the runbook, that is a drill of the AI team. The drill that matters is the one where the SRE who would actually be paged at 02:17 opens the runbook cold and walks it. Every step that returns an authentication prompt, every link that goes to a dashboard they cannot read, every tool name that has changed since the runbook was written, every assumption the author made about familiarity with the system, surfaces in that drill. Surface it on a Tuesday afternoon, not on a Saturday night when revenue is bleeding.
Mature service teams already do this; the cultural lift for AI platform teams is to accept that their system is now a multi-team operational liability and to staff the drill cadence accordingly. A reasonable starting cadence is quarterly. Quarterly drills catch most rot without becoming a burden. Once an actual incident reveals a runbook gap, that runbook moves to a monthly drill until two consecutive runs are clean.
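The cadence rule is small enough to write down as a state machine. The sketch below assumes a runbook returns to quarterly once it has two consecutive clean monthly drills, which is one reading of the rule above. `DrillSchedule` is a hypothetical name.

```python
class DrillSchedule:
    """Tracks drill cadence for one runbook."""

    def __init__(self):
        self.cadence = "quarterly"  # starting cadence
        self.clean_streak = 0

    def record_incident_gap(self):
        """A real incident exposed a runbook gap: tighten the cadence."""
        self.cadence = "monthly"
        self.clean_streak = 0

    def record_drill(self, clean):
        """Record one drill; a failed drill resets the streak."""
        self.clean_streak = self.clean_streak + 1 if clean else 0
        # Assumption: two consecutive clean monthly drills relax the cadence.
        if self.cadence == "monthly" and self.clean_streak >= 2:
            self.cadence = "quarterly"
            self.clean_streak = 0
        return self.cadence


schedule = DrillSchedule()
schedule.record_incident_gap()
print(schedule.record_drill(clean=True))   # monthly
print(schedule.record_drill(clean=False))  # monthly
print(schedule.record_drill(clean=True))   # monthly
print(schedule.record_drill(clean=True))   # quarterly
```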
The Architectural Reality Underneath All Of This
The realization the org has to internalize is uncomfortable for the AI platform team and obvious to anyone who has run an SRE function. An agent in production is a multi-team operational liability. The team that built it owns its design. The team that operates it owns its runbook. The team that gets paged for it owns its execution. These three teams are different teams, and the agent platform has not existed long enough for them to negotiate the interface.
Until they negotiate it, every runbook the AI team writes is a document that reads correct and runs blocked. Federation closes the observability gap. Declared scopes close the permission gap. Break-glass endpoints close the deploy gap. Drills close the rot gap. None of these is hard infrastructure work. All of them require admitting that the runbook author and the runbook reader are different people, with different access, on different rotations, awake at different times, and that the document only counts as written when the reader can actually run it.
The takeaway for any team running an agent in production: open your runbook tonight, hand it to the next SRE on rotation, and ask them to read it end to end without asking you any questions. Whatever they cannot do, that is your roadmap.
