AI in the SRE Loop: What Works, What Breaks, and Where to Draw the Line

· 12 min read
Tian Pan
Software Engineer

Most production incidents don't fail because of missing tools. They fail because the person holding the pager doesn't have enough context fast enough. An engineer wakes up at 3 AM to a wall of firing alerts, spends the first 20 minutes piecing together what actually broke, another 20 minutes deciding which runbook applies, and by the time they're executing the fix, the incident has been open for nearly an hour. The raw fix might take 5 minutes.

AI can compress that context-gathering window from 40 minutes to under 2. That's the genuine value on the table. But "LLM helps your oncall" is not one product decision — it's a stack of decisions, each with its own failure mode, and some of those failure modes have consequences that a customer service chatbot hallucination doesn't.

What AI Actually Does Well in Incident Response

There are three places where LLM-based tooling has shown measurable improvements, and they're worth treating separately because they have very different risk profiles.

Alert correlation is the clearest win. Modern production systems generate enormous alert noise. A single cascading failure can fire dozens of distinct alerts across services, and the relationship between them isn't obvious from the alert text alone. AI-powered correlation engines analyze topology (service A depends on service B), temporal patterns (service B's CPU spike came 30 seconds before service A's error rate climbed), and historical incident patterns to group related alerts into a single incident case. What arrives as 40 firing Slack notifications becomes "service A is degraded; probable cause: resource exhaustion in service B." The triage time savings are real — industry benchmarks consistently show 40-73% reductions in time-to-detection for correlated failures.
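The topology-plus-timing logic can be sketched in a few lines. This is a minimal illustration, not any vendor's engine: the `Alert` record, the `DEPENDENCIES` map, and the 60-second window are all hypothetical, and real correlators also weight historical incident patterns.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    fired_at: float  # epoch seconds; field names are illustrative

# Hypothetical service topology: service -> set of services it depends on
DEPENDENCIES = {"service_a": {"service_b"}, "service_b": set(), "service_c": set()}

def correlate(alerts, window_s=60):
    """Group alerts whose services are topologically related and which
    fired within a short window of each other."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        placed = False
        for group in groups:
            for other in group:
                related = (
                    alert.service == other.service
                    or alert.service in DEPENDENCIES.get(other.service, set())
                    or other.service in DEPENDENCIES.get(alert.service, set())
                )
                if related and abs(alert.fired_at - other.fired_at) <= window_s:
                    group.append(alert)  # join the existing incident case
                    placed = True
                    break
            if placed:
                break
        if not placed:
            groups.append([alert])  # start a new incident case
    return groups
```

Run against three alerts — service B at t=0, its dependent service A 30 seconds later, and an unrelated service C much later — and the first two collapse into one incident case while C stands alone. That collapsing step is what turns 40 Slack notifications into one case.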

Automated post-mortem generation is where the human effort savings are most concrete. Post-mortems have always suffered from a timing problem: the best moment to write them is immediately after an incident when context is fresh, but that's also when teams are exhausted and context-switching back to normal work. AI tools that pull from Slack thread exports, alert timelines, deployment events, and runbook execution logs can generate a structured first draft in minutes. Engineers have reported cutting post-mortem time from 90 minutes of writing from scratch to 15 minutes of review and refinement. The consistency benefit matters too — every post-mortem follows the same structure, which makes cross-incident pattern analysis tractable rather than an exercise in wading through different engineers' writing styles.

Runbook execution assistance is the most powerful capability, and also where the risk profile changes meaningfully. Moving from "here is the runbook" to "I will execute diagnostic steps and surface findings" reduces the cognitive load on an oncall engineer who may not have intimate familiarity with every service they cover. Agents that can pull relevant logs, run read-only diagnostic commands, check recent deployments, and synthesize findings give a responder a significant head start. In production deployments at scale, 89% of agent-proposed remediations are accepted — which is either a testament to accuracy or a sign that engineers are rubber-stamping AI suggestions under pressure. That distinction matters.

The Failure Mode That Will End Your Adoption Program

There's a failure pattern that doesn't appear in benchmark reports because it's hard to measure: the confident wrong diagnosis that sends your oncall down the wrong path for 45 minutes during an active incident.

LLMs don't have an internal uncertainty signal. When a model generates an explanation for why service A is degraded, it constructs that explanation the same way regardless of whether the evidence is overwhelming or ambiguous — the token generation process doesn't have an "I'm not sure about this" mode that produces differently shaped text. The model sounds authoritative when it's right and equally authoritative when it's pattern-matching on superficially similar historical incidents that don't actually apply.

This is a different problem than the generic "AI hallucinates" concern. In low-stakes contexts, a confidently wrong answer is annoying and correctable. In a P1 incident at 3 AM with five engineers in a war room, a confident wrong diagnosis from your AI tool can anchor the whole team to the wrong root cause. The engineer who was thinking "wait, maybe it's the database connection pool" suppresses that instinct because the AI said it's a memory leak in service B. Forty-five minutes later, when it's clearly not a memory leak, you've lost time you can't get back.

The mechanisms that create this failure are structural, not bugs to be patched. Models are trained on historical incident data, which means they are calibrated to incidents that have already happened. Novel failure modes — new infrastructure patterns, unusual load shapes, unexpected third-party dependency failures — are precisely the scenarios where historical pattern matching is least reliable, and also the scenarios where you most need accurate diagnosis.

Where Autonomous Action Becomes a Problem

The runbook automation story gets more complicated when agents move from read-only diagnostics to executing remediations. The industry is sorting itself into two camps on this, and neither camp is entirely right.

The case for autonomous remediation is compelling on its face. If an agent can diagnose with 89% accuracy and execute the fix in 90 seconds while a human is still context-switching into the incident, the speed advantage is real. And some remediation actions — restarting a known-flaky sidecar, rolling back a deployment within a narrow time window — are low enough risk that autonomous execution is defensible.

The problem is that the 11% of wrong diagnoses aren't uniformly distributed. They cluster around novel situations, ambiguous signals, and complex multi-service failures — which are also the situations where the remediation action is most likely to be irreversible or to make the situation worse. Scaling a service that has nothing to do with the actual bottleneck wastes money and time. Restarting the wrong pod during a database migration can cause data loss. These aren't hypothetical; they're the actual failure modes of systems that execute first and verify later.

Microsoft's Azure SRE Agent, deployed internally at scale (over 1,300 agents, 35,000+ incidents mitigated), addresses this with a bounded autonomy model: agents can execute diagnostic actions autonomously, but any remediation action that changes production state requires human approval before execution. This creates a two-tier system where the agent handles the tedious work of investigation and recommendation, and a human makes the final call on any write operation. The cognitive load reduction is substantial; the risk surface for catastrophic autonomous action is contained.

The key insight is that the value isn't in replacing human judgment — it's in replacing the work that doesn't require human judgment. Pulling logs, searching for similar incidents, checking deployment history, generating a timeline: none of that requires a human expert. Deciding whether to restart a production service at 3 AM does.

Designing the Loop

If you're building or deploying AI-assisted incident response, the architectural decisions that determine whether it works come down to a few concrete patterns.

Separate read-only and write actions explicitly. Your agent architecture should have a hard boundary between actions that observe (query logs, check metrics, search deployment history) and actions that mutate (restart services, scale resources, modify configuration). Autonomous operation is defensible in the read tier. The write tier needs human approval gates.
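One way to make that boundary hard rather than advisory is to register every action with an explicit tier and refuse write-tier execution without a human sign-off. A minimal sketch — the action names and `ApprovalRequired` exception are illustrative, not from any real agent framework:

```python
from enum import Enum
from typing import Optional

class Tier(Enum):
    READ = "read"    # observes production state
    WRITE = "write"  # mutates production state

# Hypothetical action registry: every action the agent can take is tiered
ACTIONS = {
    "query_logs": Tier.READ,
    "check_metrics": Tier.READ,
    "search_deploy_history": Tier.READ,
    "restart_service": Tier.WRITE,
    "scale_resources": Tier.WRITE,
}

class ApprovalRequired(Exception):
    """Raised when a write action is attempted without human approval."""

def execute(action: str, approved_by: Optional[str] = None) -> str:
    """Read-tier actions run autonomously; write-tier actions hard-fail
    unless a named human has approved them."""
    if ACTIONS[action] is Tier.WRITE and approved_by is None:
        raise ApprovalRequired(f"{action} mutates production state; needs sign-off")
    return f"executed {action}"
```

The important property is that the gate lives in the execution path, not in the prompt: no amount of model misbehavior can restart a service without a human identity attached to the call.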

Surface confidence signals, not just conclusions. A well-designed AI diagnostic output shouldn't just say "probable cause: memory leak in service B." It should say "probable cause: memory leak in service B (high confidence: 3 similar incidents in last 90 days with matching pattern) — alternate hypothesis: database connection pool exhaustion (low confidence: no corroborating metrics)." The alternate hypothesis is often what saves you when the primary diagnosis is wrong. Engineers need enough context to push back on the AI's conclusion, not a black-box verdict.
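Structurally, that means the diagnostic output is a ranked list of hypotheses with evidence, not a single string. A minimal sketch of what such an output type might look like — the field names and confidence labels are assumptions, not a standard schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Hypothesis:
    cause: str
    confidence: str        # e.g. "high" | "medium" | "low"
    evidence: List[str]    # the signals supporting this hypothesis

@dataclass
class Diagnosis:
    primary: Hypothesis
    alternates: List[Hypothesis] = field(default_factory=list)

    def render(self) -> str:
        """Render in the 'probable cause + alternates' form an oncall can push back on."""
        lines = [
            f"probable cause: {self.primary.cause} "
            f"({self.primary.confidence} confidence: {'; '.join(self.primary.evidence)})"
        ]
        for alt in self.alternates:
            ev = "; ".join(alt.evidence) or "no corroborating metrics"
            lines.append(f"alternate hypothesis: {alt.cause} ({alt.confidence} confidence: {ev})")
        return "\n".join(lines)
```

Forcing the model's output into this shape does two things: it makes "no alternate hypotheses" a visible and suspicious condition, and it gives the engineer the evidence list they need to disagree with the primary diagnosis.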

Build in the feedback loop from the start. Every time an engineer accepts, modifies, or rejects an AI diagnosis, that signal is training data. Organizations that instrument this from day one can measure their AI tool's actual accuracy on their actual workload — which is the only accuracy that matters. Abstract benchmark performance on synthetic incident data tells you almost nothing about how the model will perform on your specific failure patterns.
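The instrumentation itself is cheap: an append-only log of every accept/modify/reject verdict, back-filled with the confirmed root cause once the post-mortem lands. A sketch under assumed schema — the field names and JSONL layout are illustrative:

```python
import json
import time

def record_verdict(log_path, incident_id, ai_diagnosis, engineer_verdict,
                   confirmed_root_cause=None):
    """Append one feedback event. engineer_verdict is 'accepted',
    'modified', or 'rejected'; confirmed_root_cause is back-filled after
    the post-mortem, which lets you score correctness, not just acceptance."""
    event = {
        "ts": time.time(),
        "incident_id": incident_id,
        "ai_diagnosis": ai_diagnosis,
        "verdict": engineer_verdict,
        "confirmed_root_cause": confirmed_root_cause,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return event

def accuracy(log_path):
    """Fraction of AI diagnoses matching the confirmed root cause,
    over events that have one. This is your accuracy on your workload."""
    hits = total = 0
    with open(log_path) as f:
        for line in f:
            e = json.loads(line)
            if e["confirmed_root_cause"] is not None:
                total += 1
                hits += e["ai_diagnosis"] == e["confirmed_root_cause"]
    return hits / total if total else None
```

Note that `accuracy` deliberately ignores the engineer's verdict: an accepted-but-wrong diagnosis counts as a miss, which is exactly the rubber-stamping signal you want to surface.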

Use historical incidents as ground truth, not just training data. Before deploying AI-assisted correlation and diagnosis in production, run it against your last six months of incidents as a validation exercise. If the model would have misdiagnosed three of your last ten P1 incidents, you need to know that before it's handling real ones.
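The validation harness for that exercise is a plain replay loop. A sketch, assuming you can express each historical incident as (observed signals, confirmed root cause) and your diagnosis pipeline as a callable — both assumptions, since the real interfaces depend on your tooling:

```python
def backtest(diagnose, incidents):
    """Replay historical incidents through a diagnosis function.

    diagnose:  callable(signals) -> predicted root cause
    incidents: list of (signals, actual_root_cause) pairs
    Returns a report including every miss, so you can inspect *which*
    incidents the model would have gotten wrong, not just how many.
    """
    misses = [
        (signals, actual, diagnose(signals))
        for signals, actual in incidents
        if diagnose(signals) != actual
    ]
    return {"total": len(incidents), "misdiagnosed": len(misses), "misses": misses}
```

If the `misses` list contains three of your last ten P1s, that's the number you needed before go-live — and the miss details tell you whether they cluster around the novel, multi-service failures discussed above.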

The Post-Mortem Generation Win You're Probably Leaving on the Table

One place where the risk calculus is almost entirely favorable and adoption remains surprisingly low: AI-generated post-mortems.

The case against AI post-mortems is usually "they'll miss important nuance" or "engineers won't trust them." The first concern is real but bounded — AI post-mortem tools that ingest structured data (alert timelines, runbook execution logs, Slack threads) produce accurate timelines and contributing factors at a level of consistency that humans rarely match on incident N+7 at 2 AM. The second concern tends to dissolve once engineers actually use the output: they're editing a structured draft, not validating a black-box decision.

The operational gain isn't just time. Consistent post-mortem structure makes pattern analysis tractable. When every post-mortem uses the same contributing factors taxonomy, you can actually answer "how many of our P1 incidents in the last quarter were caused by deployment-related issues?" without a manual audit of 40 different writing styles. That's infrastructure for organizational learning, which is the actual point of post-mortems.

The integration path is also relatively simple: export the Slack thread from your incident channel, attach the PagerDuty timeline, and pipe it through a structured generation prompt with your post-mortem template. The failure modes are low-stakes — a wrong contributing factor in the draft gets caught in review. This is the opposite of autonomous remediation, and that's exactly why you should do it.
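The prompt-assembly step of that pipeline can be sketched directly. Inputs are plain strings here; in practice they come from your Slack and PagerDuty exports, and the section markers and instructions are illustrative choices, not a standard:

```python
def build_postmortem_prompt(slack_thread: str, pager_timeline: str,
                            template: str) -> str:
    """Assemble a structured generation prompt from incident artifacts.
    Instructing the model to mark unsupported claims keeps fabricated
    contributing factors easy to spot in review."""
    return "\n".join([
        "You are drafting an incident post-mortem. Fill in the template",
        "using only the evidence below; mark anything unsupported as UNKNOWN.",
        "",
        "## Template",
        template,
        "## Alert and paging timeline",
        pager_timeline,
        "## Incident channel transcript",
        slack_thread,
    ])
```

The "mark anything unsupported as UNKNOWN" instruction is the load-bearing line: it converts the model's tendency to fill gaps confidently into explicit flags for the human reviewer.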

What This Looks Like in Practice

The organizations that are getting the most value from AI-assisted incident response share a few patterns.

They didn't automate their existing process — they redesigned around what AI is good at. If your current incident workflow has oncall start by manually searching Slack for related incidents, AI correlation replaces that step. If oncall starts by pulling 5 different dashboards to build a mental model of system state, AI synthesis replaces that step. Organizations that try to bolt AI onto an existing process and see if it helps get marginal improvements. Organizations that rebuild the first 20 minutes of incident response around AI capabilities see the 40-60% MTTR improvements the benchmarks report.

They measure the right things. MTTR improvement is a real metric, but it's a lagging indicator. The leading indicators that actually let you tune your AI deployment are: false positive rate on AI diagnoses (how often does the AI confidently diagnose the wrong root cause?), diagnostic latency contribution (how much time does the AI investigation step add vs. save?), and remediation acceptance rate with accuracy tracking (not just did the engineer accept the suggestion, but was the suggestion correct?). Most teams measure the first and ignore the latter two.

They treat the AI as a junior responder, not an oracle. The most effective deployment model in high-performing teams isn't "AI decides, human rubber-stamps." It's "AI investigates and surfaces context, human uses that context to make faster decisions." The AI's role is to shrink the search space, not to close the loop on every incident.

The Genuine Hard Problem

There's a version of this where everything works well: AI handles the investigation gruntwork, surfaces a clear diagnosis with supporting evidence, a human confirms and authorizes the fix, and your MTTR drops from 45 minutes to 12. That version exists in production today at organizations that have invested in the instrumentation and architecture to make it work.

The version that doesn't work well looks like this: an agent confidently executes a remediation based on a pattern match from a superficially similar incident two months ago, the pattern doesn't actually apply, and by the time a human is looking at the situation the agent has made it worse. Or an oncall engineer, overwhelmed by a complex multi-service failure, trusts the AI diagnosis because it's confident and they don't have bandwidth to second-guess it, and spends 45 minutes executing the wrong runbook.

Both versions are real. The difference between them isn't which AI model you're using — it's whether you've designed your system to treat the AI's output as evidence to inform a decision or as a verdict to execute. That's an architectural decision, not a model quality decision. And making it correctly requires being honest about the specific ways AI-augmented incident response fails, not just optimizing for the scenarios where it works.

The on-call engineer is still the one who understands what "this should not be happening" means in the context of your specific system. AI can get them to that moment faster. It shouldn't be the one deciding what it means.
