We just finished an internal review of our DevOps automation strategy, and I want to share what we found—because the gap between the promise and the reality is significant.
The Promise
Gartner predicts 40% of enterprise apps will feature task-specific AI agents by 2026, up from under 5% in 2025. Our own vendor pitches are full of slides showing agents handling incident triage, deployment rollbacks, capacity planning, and security scanning—all autonomously. The narrative is compelling: free your SREs from toil, let agents handle the repetitive 80%.
We piloted three agent-based workflows in Q1: automated PR review with security scanning, incident classification and initial response, and deployment health monitoring with auto-rollback triggers.
The Reality We Hit
Here’s what nobody told us during the sales cycle:
Coordination complexity grows combinatorially, not linearly. Two agents have one possible interaction path. Five agents have ten (n(n-1)/2 pairs), and every pair is a handoff you have to test. When our PR review agent flagged a dependency issue and handed off to our security scanning agent, which then needed to coordinate with our deployment agent, the interaction matrix exploded. Each additional agent multiplied our testing scenarios by roughly 3x.
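The pairwise count above is just "n choose 2"; a two-line sketch makes the growth concrete:

```python
from math import comb

def interaction_paths(n_agents: int) -> int:
    """Number of potential pairwise interaction paths between n agents."""
    return comb(n_agents, 2)

# 2 agents -> 1 path, 5 agents -> 10 paths
for n in range(2, 7):
    print(n, "agents:", interaction_paths(n), "paths")
```

And that only counts pairs; ordered handoff sequences grow far faster still.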
Reliability compounds negatively in chains. If each agent in a chain is 95% reliable individually, two agents give you 90.25%. Three agents: 85.7%. Five agents working in sequence: 77%. We’re building systems with worse reliability characteristics than the manual processes they replaced, except now the failure modes are opaque.
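Under the simplifying assumption that agent failures are independent, the end-to-end numbers quoted above fall out of a one-liner:

```python
def chain_reliability(per_agent: float, n_agents: int) -> float:
    """End-to-end success probability of a sequential agent chain,
    assuming each agent fails independently (a simplifying assumption)."""
    return per_agent ** n_agents

# 0.95^2 = 0.9025, 0.95^3 ~ 0.857, 0.95^5 ~ 0.774
for n in (2, 3, 5):
    print(f"{n} agents in sequence: {chain_reliability(0.95, n):.4f}")
```

The takeaway: chaining 95%-reliable components is a reliability subtraction, not an addition.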
Traditional monitoring is blind to semantic failures. Our agents return HTTP 200 with confidently wrong content. The deployment health agent once declared a service “healthy” because CPU and memory were fine—while it was serving 500 errors to 30% of requests on a specific endpoint. Prometheus didn’t catch it because the agent’s definition of “healthy” didn’t match our SRE team’s definition.
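One way to close this gap, sketched here with hypothetical names rather than our actual tooling, is to define "healthy" over per-endpoint error budgets instead of host-level metrics:

```python
from dataclasses import dataclass

@dataclass
class EndpointStats:
    requests: int
    errors_5xx: int

def semantically_healthy(stats_by_endpoint: dict, max_error_rate: float = 0.01) -> bool:
    """A service is only 'healthy' if every endpoint stays inside its
    error-rate budget, regardless of how CPU and memory look."""
    for endpoint, stats in stats_by_endpoint.items():
        if stats.requests and stats.errors_5xx / stats.requests > max_error_rate:
            return False
    return True

# The incident described above: resources fine, one endpoint on fire.
stats = {
    "/api/orders": EndpointStats(requests=1000, errors_5xx=300),  # 30% 500s
    "/api/health": EndpointStats(requests=1000, errors_5xx=0),
}
assert not semantically_healthy(stats)
```

The point is not this particular check but who defines the predicate: the SRE team's definition of healthy has to be the one in the code, not the agent's.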
The Distributed Systems Parallel
What struck me is how much this mirrors the early microservices era. In 2015-2016, everyone decomposed their monoliths without understanding distributed systems fundamentals—eventual consistency, partial failure, network partitions. The result was years of painful incidents and a generation of engineers who had to learn distributed systems the hard way.
We’re doing the same thing with agents. Teams are deploying multi-agent systems without understanding:
- Consensus and coordination protocols - How do agents agree on state? What happens during split-brain scenarios?
- Idempotency and retry semantics - When an agent fails mid-task, can you safely retry? Or does the retried agent create a duplicate deployment?
- Observability beyond metrics - Traces, evals, session coherence. Your Datadog setup doesn’t have primitives for “this agent’s reasoning was correct but its action was wrong.”
- Circuit breakers and backpressure - When the incident response agent starts generating false positives at 3 AM, what stops the cascade?
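For the last point, even a minimal circuit breaker in front of an agent is enough to stop a 3 AM cascade. This is an illustrative sketch, not a production implementation:

```python
import time
from typing import Optional

class AgentCircuitBreaker:
    """After max_failures consecutive failures, reject calls for
    cooldown_s seconds instead of letting the cascade continue."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 300.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: half-open, allow a probe call through.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
```

A real deployment would also want backpressure on the agent's input queue, but the breaker alone turns "paged all night" into "one page, agent quarantined."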
What We Changed
After our review, we adopted three principles:
- Human-in-the-loop by default, autonomy earned. Every agent starts supervised. It earns autonomous operation through a 30-day evaluation period with specific reliability metrics.
- Linear chains only, no meshes. We banned agent-to-agent communication. Every handoff goes through a human-reviewed queue. Yes, this is slower. It’s also debuggable.
- SRE owns agent governance. Not the ML team, not the platform team. Our SREs have the operational muscle memory to understand failure modes, blast radius, and rollback procedures.
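The "autonomy earned" gate can be made explicit as a policy check. The metric names and thresholds below are hypothetical placeholders, not our actual criteria:

```python
from dataclasses import dataclass

@dataclass
class EvalWindow:
    days_observed: int
    action_accuracy: float       # fraction of actions a reviewer agreed with
    false_positive_rate: float
    unsafe_actions: int          # anything that required a human rollback

def eligible_for_autonomy(w: EvalWindow) -> bool:
    """Promote an agent out of supervised mode only if a full evaluation
    window clears every threshold. Fail any one check, stay supervised."""
    return (w.days_observed >= 30
            and w.action_accuracy >= 0.99
            and w.false_positive_rate <= 0.02
            and w.unsafe_actions == 0)
```

Encoding the gate this way makes promotion auditable: the agent's eligibility is a function of logged evaluation data, not a judgment call made under deadline pressure.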
40% of multi-agent pilots fail within six months of production deployment. We don’t want to be in that 40%.
The Question for This Community
For those of you deploying AI agents in your DevOps pipelines:
- How are you handling the coordination complexity? Are you using any orchestration frameworks that actually work at scale (not just in demos)?
- Who owns agent governance in your org? Is it the ML/AI team, platform engineering, SRE, or someone else?
- Have you found the right abstraction layer for agent observability? We’re evaluating Arize and Braintrust but the space feels immature.
I’m particularly interested in hearing from anyone who’s moved past the “two agents in a pipeline” stage to actual multi-agent systems in production. What broke first?