AI-Assisted Incident Response: Giving Your On-Call Agent a Runbook
Operational toil in engineering organizations rose to 30% of engineering time in 2025, the first increase in five years, despite record investment in AI tooling. The reason is not that AI failed. It is that teams deployed AI agents without the rigor they apply to human on-call: no runbooks, no escalation paths, no blast-radius constraints. The agent could reason about logs, but nobody told it what it was allowed to do.
The gap between "AI that can diagnose" and "AI that can safely mitigate" is not a model capability problem. It is a systems engineering problem. And solving it requires the same discipline that SRE teams already apply to human operators: structured runbooks, tiered permissions, and mandatory escalation points.
The Messy Middle: Why AI Added Work Instead of Removing It
The promise was straightforward: AI agents would triage alerts, correlate logs, and execute remediation — reducing mean time to resolution and freeing engineers from 3 AM pages. In practice, 69% of AI-powered decisions still require human verification. Teams added an AI layer but did not remove the manual layer, creating what practitioners call the "messy middle."
This happened because most teams treated AI incident response as a classification problem: feed the agent alerts, let it figure out what to do. But incident response is not classification. It is a sequence of constrained actions under time pressure, where every action has a blast radius and every minute of downtime costs money.
Customer-impacting incidents are up 43% year-over-year, with each costing roughly $800K. An agent that spends tokens reasoning about what might be wrong while a database melts is worse than useless — it is a distraction.
The fix is not "better models." The fix is giving the agent the same thing you give a new on-call engineer: a runbook.
What a Runbook for an Agent Actually Looks Like
Human runbooks are semi-structured documents: "If you see alert X, check dashboard Y, then run command Z." They assume a human can exercise judgment at each step. Agent runbooks need to be more explicit because the agent has no institutional memory and no intuition about what "looks weird."
An effective agent runbook is a directed graph of actions with three properties:
- Typed inputs and outputs: Each step specifies exactly what data it consumes (a metric query, a log pattern, a service name) and what it produces. No free-text "investigate the situation" steps. Structured outputs turn an LLM call into something closer to a typed function call than a text-generation task, which dramatically reduces hallucination risk.
- Explicit decision points: Branch conditions are concrete thresholds, not vibes. "If error rate exceeds 5% for 3 minutes" is a decision point. "If things look bad" is not.
- Scoped permissions per step: Each action declares what infrastructure it touches and what level of approval it needs. A log query runs automatically. A pod restart needs notification. A database failover needs explicit human approval.
The difference between a useful agent and a dangerous one is not intelligence — it is constraint.
The Three-Tier Autonomy Model
The most effective teams are not choosing between "fully autonomous agent" and "AI that just writes summaries." They are using a graduated autonomy model that maps directly to risk:
Level 1 — Advisory only. The agent ingests alerts, gathers context (recent deploys, log anomalies, metric trends), and posts a diagnosis to the incident channel with recommended actions. Humans execute everything. This is where every team should start. It validates the agent's reasoning quality without risking production.
Level 2 — Execution with approval. The agent proposes specific actions — restart a service, scale up replicas, roll back a deployment — and posts them as approval cards in Slack or your incident tool. An on-call engineer clicks "approve" or "reject." The agent executes approved actions and monitors the result. If the remediation does not improve metrics within a defined window, it automatically rolls back and escalates.
Level 3 — Conditional autonomy. For a narrow set of pre-approved, low-risk runbooks, the agent acts independently. Restarting a stateless service that has crash-looped three times in five minutes does not need a human in the loop at 3 AM. But the scope of Level 3 should be deliberately small and expand only when safety metrics prove consistent, low-risk behavior over time.
The critical insight is that these levels are not properties of the agent. They are properties of the action. The same agent might operate at Level 3 for cache invalidation, Level 2 for service restarts, and Level 1 for anything touching a database. Risk tiering is per-action, not per-agent.
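Because the autonomy level attaches to the action rather than the agent, the whole model collapses to a lookup table with a restrictive default. A sketch, with hypothetical action names:

```python
from enum import IntEnum

class Autonomy(IntEnum):
    ADVISORY = 1    # diagnose and recommend only
    APPROVAL = 2    # execute after a human approves
    AUTONOMOUS = 3  # execute within a pre-approved, narrow scope

# Risk tiering is per-action: the same agent gets different autonomy
# depending on what the action touches. Entries are illustrative.
ACTION_TIERS = {
    "invalidate_cache": Autonomy.AUTONOMOUS,
    "restart_service": Autonomy.APPROVAL,
    "scale_replicas": Autonomy.APPROVAL,
    "database_failover": Autonomy.ADVISORY,
}

def allowed_without_human(action: str) -> bool:
    # Anything unlisted defaults to the most restrictive tier.
    return ACTION_TIERS.get(action, Autonomy.ADVISORY) is Autonomy.AUTONOMOUS
```

The important design choice is the default: an action nobody has classified yet is advisory-only, so expanding the agent's authority is always an explicit edit, never an accident.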
The Guardrail Architecture
Guardrails are not optional safety theater — they are the load-bearing structure that makes agent-driven incident response viable. The pattern that works in production has four layers:
Identity and access control. Every agent gets its own service account with role-based access, scoped to specific tools and infrastructure. The agent never accesses infrastructure directly — it interacts through secure tool APIs. Credentials are short-lived and rotated automatically. This is the same principle behind least-privilege access for human operators, applied to an automated system.
Blast-radius checks. Before executing any remediation, the agent evaluates the scope of impact. Restarting one pod in a 50-pod deployment is different from restarting the only pod. The agent must understand topology — which services depend on this one, what percentage of traffic it handles, whether there is redundancy. Without topology awareness, even a "safe" restart can cascade into an outage.
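A minimal pre-flight check for the restart case might look like the following sketch. The fraction threshold and the redundancy flag are illustrative assumptions; a production check would also consult the dependency graph.

```python
def blast_radius_ok(
    pods_total: int,
    pods_affected: int,
    has_redundancy: bool,
    max_fraction: float = 0.25,
) -> bool:
    """Refuse any restart that touches the only replica, the whole
    deployment, or more than a fixed fraction of its pods."""
    if not has_redundancy or pods_affected >= pods_total:
        return False
    return pods_affected / pods_total <= max_fraction
```

Restarting 1 of 50 pods passes; restarting the only pod, or half of a four-pod deployment, does not.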
Circuit breakers on agent behavior. If the agent is taking the same action repeatedly, if it is generating anomalous patterns of API calls, or if its confidence score drops below a threshold, a circuit breaker trips and escalates to a human. This catches the failure mode where an agent enters a remediation loop — restarting a service that immediately crashes again, over and over, while ignoring the root cause.
Mandatory audit trail. Every decision, every piece of evidence the agent considered, every action it took or proposed — all of it gets logged in a format that a post-incident review can consume. This is not just compliance. It is how you debug the agent itself. When an agent makes a bad call, you need the full execution record for the session, because agent incidents often have diffuse causes distributed across multiple steps rather than a single localized failure.
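One low-friction format for that trail is JSON lines, one record per decision, so a whole session can be replayed in order. The field names below are an assumption, not a standard:

```python
import json
import time

def audit_entry(session_id: str, step: str, evidence: dict,
                decision: str, action: str, approval: str) -> str:
    """One structured record per agent decision, serialized as a JSON
    line for post-incident review and for debugging the agent itself."""
    return json.dumps({
        "ts": time.time(),
        "session_id": session_id,  # ties all steps of one incident together
        "step": step,              # which runbook step was executing
        "evidence": evidence,      # what the agent looked at
        "decision": decision,      # what it concluded, and why
        "action": action,          # what it did or proposed
        "approval": approval,      # auto / notify / human, and who approved
    })
```

The `session_id` is the field that matters most: because agent failures are often diffuse across steps, reviews need to filter the log to one session and read it end to end.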
Why "Investigate This Alert" Is a Dangerous Prompt
The single most common mistake in deploying AI for incident response is giving the agent an unstructured directive: "Here is an alert, figure out what is wrong and fix it." This is the equivalent of handing a new hire root access and saying "the site is down, good luck."
Unstructured prompts create three specific failure modes:
- Hallucinated diagnoses. Without a constrained investigation path, the agent will generate plausible-sounding but fabricated root causes. It might confidently report a memory leak when the actual problem is a misconfigured load balancer. Larger context windows paradoxically make this worse — more data means more opportunities to find spurious correlations.
- Unsafe actions. An agent reasoning freely about "what to try" might decide that dropping a database cache, killing a process, or modifying a security group is a reasonable next step. Without explicit permission boundaries, the agent's action space is unbounded.
- Time waste during a crisis. An agent with no runbook spends tokens on exploration: querying irrelevant metrics, checking unrelated services, reasoning about possibilities instead of following a known diagnostic path. Every minute it spends exploring is a minute of downtime.
The solution is to map your most frequent incident types to structured runbooks before the agent encounters them. Audit your last 50 incidents. Identify the top 10 patterns. For each pattern, define the diagnostic steps, the decision criteria, and the permitted remediation actions. Then give the agent those runbooks, not open-ended authority.
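The audit itself is mechanical if your incident tool can export labeled incidents: count the labels and write runbooks for the head of the distribution first. The labels below are hypothetical examples.

```python
from collections import Counter

# Hypothetical labels from a post-incident review export, one per incident.
past_incidents = [
    "pod_crash_loop", "disk_full", "pod_crash_loop", "cert_expiry",
    "pod_crash_loop", "disk_full", "deploy_regression", "pod_crash_loop",
]

# The patterns worth structured runbooks first are simply the most frequent.
top_patterns = Counter(past_incidents).most_common(3)
```

Over a real 50-incident window, the top ten entries of this counter are your runbook backlog, in priority order.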
Measuring Whether Your Agent Is Helping
The obvious metric is MTTR reduction, and teams that deploy structured agent runbooks are seeing real gains — 30% to 70% reduction in resolution time for incidents that match pre-defined patterns. But MTTR alone does not tell the whole story.
Track these alongside it:
- Diagnostic accuracy: What percentage of the agent's initial diagnoses match the actual root cause identified in post-incident review? If this is below 70%, the agent is adding noise, not signal.
- Suggestion acceptance rate: When the agent proposes an action at Level 2, how often does the on-call engineer approve it? Low acceptance rates mean the agent's runbooks do not match how your team actually operates.
- Autonomous resolution rate: What percentage of incidents does the agent resolve at Level 3 without human intervention? The first-year target for most teams is 10–30%. A higher rate early on likely means your risk classification is too loose.
- On-call satisfaction: If engineers report that the agent creates more work (verifying bad suggestions, cleaning up bad actions) than it saves, the deployment is a net negative regardless of what the MTTR numbers say.
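The three ratio metrics are straightforward to compute from post-incident review records. This sketch assumes each record carries the agent's diagnosis, the confirmed root cause, the autonomy level, and approval/resolution flags; the field names are illustrative.

```python
def ratio(hits: int, total: int) -> float:
    return hits / total if total else 0.0

def agent_metrics(reviews: list[dict]) -> dict[str, float]:
    """Compute diagnostic accuracy, Level-2 suggestion acceptance, and
    autonomous resolution rate from review records."""
    level2 = [r for r in reviews if r["level"] == 2]
    return {
        "diagnostic_accuracy": ratio(
            sum(r["agent_diagnosis"] == r["root_cause"] for r in reviews),
            len(reviews)),
        "suggestion_acceptance": ratio(
            sum(r["approved"] for r in level2), len(level2)),
        "autonomous_resolution": ratio(
            sum(r["level"] == 3 and r["resolved_by_agent"] for r in reviews),
            len(reviews)),
    }
```

Note that acceptance rate is computed only over Level-2 proposals, while the other two are over all incidents; mixing those denominators is an easy way to flatter the agent.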
The feedback loop matters as much as the metrics. Human corrections — cases where an engineer overrides the agent — should feed back into the system with reasoning logged, becoming test cases for future agent versions.
Start Strict, Then Expand
The organizations getting real value from AI in incident response share one pattern: they started with narrow scope and high constraint, then expanded deliberately. They did not deploy an omniscient agent and hope for the best. They picked one service, one incident type, one runbook. They ran the agent in advisory mode for weeks, compared its diagnoses to human diagnoses, and tuned the runbook until the acceptance rate was consistently above 80%.
Only then did they move to Level 2 for that specific runbook. And only after months of clean execution did they consider Level 3 for the lowest-risk actions within that scope.
This is slow. It is also the only approach that works. The alternative — deploying broad AI automation and cleaning up the mess — is how you end up in the messy middle, where 78% of your engineers spend 30% of their time on manual toil despite having an "AI-powered" incident management system.
Give your agent a runbook. Start it in read-only mode. Promote it when it earns trust. The tooling is ready. The discipline is what most teams are missing.
