Self-healing clusters are the holy grail of platform engineering. The idea is simple: let AI detect issues and automatically remediate them without human intervention. No more 3am pages. No more runbook execution. Just… automation.
I’ve been piloting AI-driven operations for six months. Here’s what I’ve learned.
The Promise
Modern SRE is drowning in complexity:
- More microservices than humans can track
- More configuration parameters than anyone understands
- More failure modes than runbooks can cover
AI promises to help:
- Anomaly detection at a scale no human team can match
- Pattern matching across historical incidents
- Automated remediation of known issues
The Reality
What works:
- Scaling decisions - AI is actually good at deciding when to scale up or down based on traffic patterns. Better than humans, honestly.
- Known-issue remediation - If you’ve seen an issue before and documented the fix, AI can apply it automatically.
- Noise reduction - AI-powered alert correlation significantly reduces alert fatigue. This is a real win.
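The noise-reduction win above mostly comes down to grouping: alerts that share a fingerprint within a short window are one incident, not many pages. A minimal sketch of that idea (the field names and the 300-second window are my illustrative assumptions, not any particular tool's API):

```python
def correlate_alerts(alerts, window_s=300):
    """Collapse alerts sharing a (service, symptom) fingerprint that
    arrive within window_s seconds into a single page."""
    pages = []
    open_pages = {}  # fingerprint -> index of the page it rolls into
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["service"], alert["symptom"])
        idx = open_pages.get(key)
        if idx is not None and alert["ts"] - pages[idx]["first_ts"] <= window_s:
            # Same incident, still inside the window: absorb, don't re-page
            pages[idx]["count"] += 1
        else:
            # New incident (or the old window expired): open a fresh page
            open_pages[key] = len(pages)
            pages.append({"service": alert["service"],
                          "symptom": alert["symptom"],
                          "first_ts": alert["ts"],
                          "count": 1})
    return pages
```

Real correlators add cross-service topology awareness, but even this fingerprint-plus-window scheme cuts a burst of repeated alerts down to one page.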
What doesn’t work:
- Novel failure modes - AI is trained on past incidents. When something genuinely new happens, AI is as confused as a junior engineer.
- Business context - AI doesn’t know that today is Black Friday and we shouldn’t be auto-scaling down. It doesn’t know that the CEO is doing a demo right now and we need extra headroom.
- Cascading failures - When AI “fixes” one thing and breaks another, which triggers more “fixes,” you get a feedback loop that makes things worse.
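The cheapest defense against that feedback loop is a rate limit on the remediator itself: if automation fires too often in a short window, something is wrong with the automation, so stop it and page a human. A minimal sketch, with illustrative (not tuned) thresholds:

```python
import time
from collections import deque

class RemediationGovernor:
    """Halt all automation when remediations fire too often — a crude
    detector for the 'fix triggers more fixes' loop described above."""

    def __init__(self, max_actions=3, window_s=600):
        self.max_actions = max_actions
        self.window_s = window_s
        self.history = deque()  # timestamps of recent remediations
        self.halted = False

    def allow(self, now=None):
        """Ask before every automated action; False means don't act."""
        if self.halted:
            return False
        now = time.monotonic() if now is None else now
        # Drop actions that have aged out of the window
        while self.history and now - self.history[0] > self.window_s:
            self.history.popleft()
        if len(self.history) >= self.max_actions:
            self.halted = True  # stays halted until a human resets it
            return False
        self.history.append(now)
        return True
```

Note the governor latches: once halted, it never re-enables itself, because a loop that paused and resumed on its own would just be a slower loop.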
The Incident That Changed My Mind
Three months ago, our AI detected elevated latency in a service. It correctly identified that pods were memory-constrained and automatically increased memory limits. Fine so far.
But the memory pressure was actually a symptom of a database connection leak. By giving pods more memory, the AI allowed them to leak connections longer before OOMing. This exhausted our connection pool, which cascaded to other services.
The AI kept “remediating” by scaling up more pods (each leaking connections) until we hit our quota limits across three services.
Total outage: 47 minutes. Time for humans to understand what happened: 3 hours.
The AI did exactly what it was trained to do. And it made things worse.
Where I’ve Landed
AI as assistant, not autonomous agent:
- AI suggests remediations; humans approve them
- AI handles simple, well-understood issues automatically
- Novel patterns always escalate to humans
- There’s always a “stop all automation” button
Clear boundaries:
| Scenario | AI Response |
|---|---|
| Known issue, standard fix | Auto-remediate |
| Known issue, unusual time | Suggest + require approval |
| Unknown pattern | Alert only, no action |
| Cascading failure detected | Stop all automation, page humans |
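The table above is small enough to be a literal lookup in code, which keeps the policy auditable and makes the "stop all automation" button a one-line override. A sketch under my own naming assumptions (the enums and `decide` function are illustrative, not from any specific tool):

```python
from enum import Enum, auto

class Scenario(Enum):
    KNOWN_STANDARD = auto()      # known issue, standard fix
    KNOWN_UNUSUAL_TIME = auto()  # known issue, unusual time
    UNKNOWN_PATTERN = auto()
    CASCADING_FAILURE = auto()

class Action(Enum):
    AUTO_REMEDIATE = auto()
    SUGGEST_WITH_APPROVAL = auto()
    ALERT_ONLY = auto()
    HALT_AND_PAGE = auto()

# The decision table, verbatim
POLICY = {
    Scenario.KNOWN_STANDARD:    Action.AUTO_REMEDIATE,
    Scenario.KNOWN_UNUSUAL_TIME: Action.SUGGEST_WITH_APPROVAL,
    Scenario.UNKNOWN_PATTERN:   Action.ALERT_ONLY,
    Scenario.CASCADING_FAILURE: Action.HALT_AND_PAGE,
}

def decide(scenario, kill_switch_engaged=False):
    # The "stop all automation" button overrides everything
    if kill_switch_engaged:
        return Action.HALT_AND_PAGE
    # Anything not in the table defaults to the safest response
    return POLICY.get(scenario, Action.ALERT_ONLY)
```

Defaulting unknown scenarios to `ALERT_ONLY` encodes the rule from the table directly: novel patterns never get automated action.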
Questions for SRE Practitioners
- How much automation do you trust AI with?
- What guardrails have you implemented?
- Has anyone experienced an “AI-made-it-worse” incident?