Picture this: You wake up Monday morning to a Slack notification.
InfraAgent: Detected Terraform drift in production VPC configuration. Root cause: Manual change applied Friday evening. Drift remediated, state reconciled, changes reapplied. Production nominal. Details: [link]
Your first reaction: “Wow, that’s amazing! The agent fixed it before I even knew there was a problem!”
Your second reaction: “Wait… what exactly did it change? And who approved that?”
Welcome to 2026: Agents That Fix Your Infrastructure
According to the latest Platform Engineering predictions, AI as the primary code reviewer is becoming standard practice, and agents aren’t just reviewing; they’re remediating.
The value proposition is compelling:
- Faster incident response (minutes vs hours)
- Consistent standards enforcement
- 24/7 operations without on-call fatigue
- Automatic drift detection and correction
- Learning from past incidents to prevent future ones
But as someone who’s spent years building reliable infrastructure, I have… questions.
The Scenario That Keeps Me Up
Let’s make this concrete. Your infrastructure agent has permissions to:
- Read Terraform state
- Detect configuration drift
- Analyze changes and assess impact
- Generate remediation plans
- Apply changes to reconcile state
One night, it detects that someone manually modified a security group rule in production (they were debugging an urgent issue and forgot to update Terraform).
The agent:
- Sees the drift
- Determines the manual change violates policy
- Generates a plan to revert to Terraform state
- Applies the change
- Reports success
What the agent doesn’t know: That “violation” was actually a critical hotfix that prevented a security incident. By reverting it, the agent just re-opened a vulnerability.
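The failure mode here is baked into the reconciliation policy itself: any drift is treated as an error to revert. A minimal sketch of that naive loop (all names and values hypothetical, not any real agent's implementation):

```python
from dataclasses import dataclass

@dataclass
class Drift:
    """A detected difference between live infrastructure and Terraform state."""
    resource: str       # e.g. "aws_security_group.app"
    live_value: str     # what is actually deployed
    desired_value: str  # what Terraform state says it should be

def naive_reconcile(drifts: list[Drift]) -> list[str]:
    """Revert every drift to match state -- no context check.

    This is exactly the policy that re-opens the vulnerability above:
    the loop cannot distinguish "mistake" from "intentional hotfix
    that wasn't backported to Terraform yet".
    """
    actions = []
    for d in drifts:
        actions.append(f"revert {d.resource}: {d.live_value} -> {d.desired_value}")
    return actions

# The Friday-evening hotfix looks like policy-violating drift to the agent:
hotfix = Drift("aws_security_group.app", "deny 0.0.0.0/0:22", "allow 0.0.0.0/0:22")
print(naive_reconcile([hotfix]))
```

Nothing in the data the agent sees encodes *why* the live value differs, which is the whole problem.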
Benefits vs Risks
The Dream Scenario:
- Consistent infrastructure that matches your IaC definitions
- No more “well, production is slightly different because…”
- Reduced toil for ops teams
- Faster recovery from misconfigurations
The Nightmare Scenario:
- Cascading failures from agent decisions
- Loss of operational knowledge (“the agent handles it”)
- Agents trained on public repos applying insecure patterns
- Outages caused by “technically correct” but contextually wrong remediations
The Core Question: Write Access to Production
I’m genuinely torn on this.
Argument FOR agent autonomy:
- Speed matters in incident response
- Humans make mistakes too (we’ve all fat-fingered a kubectl command)
- Bounded autonomy with good guardrails can work
- Agents don’t get tired or stressed at 3am
Argument AGAINST agent autonomy:
- Infrastructure changes have blast radius across all services
- Context matters in ways that are hard to encode in agent constraints
- When things go wrong, rollback complexity multiplies
- Accountability becomes unclear
Where I’m landing: Agents should have read-analyze-propose capabilities, but humans must approve infrastructure changes that touch production.
But even that has nuance. What’s “production”? If the agent is managing non-critical dev environments, maybe full autonomy is fine?
The Bounded Autonomy Pattern
Drawing from the earlier discussion about agentic workflows, here’s what I’m thinking:
Tier 1: Full Autonomy (Agent decides and acts)
- Dev environment configuration
- Test infrastructure provisioning
- Non-critical service scaling within defined bounds
- Documentation updates and drift reports
Tier 2: Semi-Autonomy (Agent proposes, human quick-approves)
- Staging environment changes
- Non-critical production changes (monitoring configs, tags)
- Security group rule modifications
- IAM policy updates
Tier 3: Advisory Only (Agent suggests, human implements)
- Core production infrastructure (VPCs, databases, load balancers)
- Security-critical configurations
- Architecture changes
- Disaster recovery systems
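One way to make the tiers above concrete is a small policy function that classifies a proposed change before the agent is allowed to act. A sketch under the simplifying assumption that (environment, resource type) is enough to decide; the resource sets and tier boundaries are illustrative, not a standard:

```python
from enum import Enum

class Tier(Enum):
    FULL_AUTONOMY = 1  # agent decides and acts
    SEMI_AUTONOMY = 2  # agent proposes, human quick-approves
    ADVISORY_ONLY = 3  # agent suggests, human implements

# Core production resources: always advisory-only, regardless of environment.
CORE_RESOURCES = {"vpc", "database", "load_balancer", "dr_system"}
# Changes that need a human gate even on non-core paths.
SENSITIVE_RESOURCES = {"security_group", "iam_policy"}

def tier_for(environment: str, resource_type: str) -> Tier:
    """Classify a proposed change; default to the most restrictive tier."""
    if resource_type in CORE_RESOURCES:
        return Tier.ADVISORY_ONLY
    if environment in ("dev", "test"):
        return Tier.FULL_AUTONOMY
    if environment == "staging" or resource_type in SENSITIVE_RESOURCES:
        return Tier.SEMI_AUTONOMY
    if environment == "production":
        # non-critical production changes (tags, monitoring configs)
        return Tier.SEMI_AUTONOMY
    return Tier.ADVISORY_ONLY  # unknown environments get a human

print(tier_for("dev", "ecs_service"))        # Tier.FULL_AUTONOMY
print(tier_for("production", "database"))    # Tier.ADVISORY_ONLY
print(tier_for("production", "iam_policy"))  # Tier.SEMI_AUTONOMY
```

Defaulting unknown inputs to the most restrictive tier matters: an agent that fails open on a typo'd environment name is the cascading-failure scenario from earlier.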
Does this make sense? Or am I just making it too complex?
The Knowledge Loss Problem
Here’s what worries me beyond the immediate risks:
If agents handle infrastructure operations, do we lose deep operational knowledge?
I learned Kubernetes by breaking it repeatedly. I understand AWS networking because I’ve debugged bizarre routing issues at 2am. I know how our deployment pipeline works because I’ve fixed it when it broke.
If agents handle all that… what do I learn? What does the next generation of infrastructure engineers learn?
Do we become “agent operators” who can define constraints and review decisions, but can’t actually troubleshoot the underlying systems?
Maybe that’s fine. Maybe that’s the future. But it makes me uncomfortable.
What I Want from the Community
I’m not looking for “agents are good” or “agents are bad” takes. I want to hear from people who are actually doing this:
1. Where have you drawn the line on agent autonomy for infrastructure?
- What can agents do automatically?
- What requires human approval?
- Have you regretted giving agents too much access?
2. How do you handle the “technically correct but contextually wrong” problem?
- How do agents learn organizational context beyond rules?
- Have you had incidents where agents made things worse?
3. What’s your rollback strategy when agents make mistakes?
- Can you easily undo agent-applied changes?
- How do you prevent cascading failures?
4. How are you maintaining operational knowledge on your team?
- Are junior engineers learning infrastructure, or just learning to operate agents?
- Do you have “manual mode” training?
I’m genuinely trying to figure out the right path here. The technology is compelling, but the risks are real.
What are you seeing in the wild?