AI Agents That Auto-Remediate Your Infrastructure Code: Dream or Nightmare?

Picture this: You wake up Monday morning to a Slack notification.

InfraAgent: Detected Terraform drift in production VPC configuration. Root cause: Manual change applied Friday evening. Drift remediated, state reconciled, changes reapplied. Production nominal. Details: [link]

Your first reaction: “Wow, that’s amazing! The agent fixed it before I even knew there was a problem!”

Your second reaction: “Wait… what exactly did it change? And who approved that?”

Welcome to 2026: Agents That Fix Your Infrastructure

According to the latest Platform Engineering predictions, AI as the primary code reviewer is becoming standard practice, and agents aren’t just reviewing anymore - they’re remediating.

The value proposition is compelling:

  • :white_check_mark: Faster incident response (minutes vs hours)
  • :white_check_mark: Consistent standards enforcement
  • :white_check_mark: 24/7 operations without on-call fatigue
  • :white_check_mark: Automatic drift detection and correction
  • :white_check_mark: Learning from past incidents to prevent future ones

But as someone who’s spent years building reliable infrastructure, I have… questions.

The Scenario That Keeps Me Up

Let’s make this concrete. Your infrastructure agent has permissions to:

  • Read Terraform state
  • Detect configuration drift
  • Analyze changes and assess impact
  • Generate remediation plans
  • Apply changes to reconcile state

One night, it detects that someone manually modified a security group rule in production (they were debugging an urgent issue and forgot to update Terraform).

The agent:

  1. Sees the drift
  2. Determines the manual change violates policy
  3. Generates a plan to revert to Terraform state
  4. Applies the change
  5. Reports success

What the agent doesn’t know: That “violation” was actually a critical hotfix that prevented a security incident. By reverting it, the agent just re-opened a vulnerability.

Benefits vs Risks

The Dream Scenario:

  • Consistent infrastructure that matches your IaC definitions
  • No more “well, production is slightly different because…”
  • Reduced toil for ops teams
  • Faster recovery from misconfigurations

The Nightmare Scenario:

  • Cascading failures from agent decisions
  • Loss of operational knowledge (“the agent handles it”)
  • Agents trained on public repos applying insecure patterns
  • Outages caused by “technically correct” but contextually wrong remediations

The Core Question: Write Access to Production

I’m genuinely torn on this.

Argument FOR agent autonomy:

  • Speed matters in incident response
  • Humans make mistakes too (we’ve all fat-fingered a kubectl command)
  • Bounded autonomy with good guardrails can work
  • Agents don’t get tired or stressed at 3am

Argument AGAINST agent autonomy:

  • Infrastructure changes have blast radius across all services
  • Context matters in ways that are hard to encode in agent constraints
  • When things go wrong, rollback complexity multiplies
  • Accountability becomes unclear

Where I’m landing: Agents should have read-analyze-propose capabilities, but humans must approve infrastructure changes that touch production.

But even that has nuance. What’s “production”? If the agent is managing non-critical dev environments, maybe full autonomy is fine?

The Bounded Autonomy Pattern

Drawing from the earlier discussion about agentic workflows, here’s what I’m thinking:

Tier 1: Full Autonomy (Agent decides and acts)

  • Dev environment configuration
  • Test infrastructure provisioning
  • Non-critical service scaling within defined bounds
  • Documentation updates and drift reports

Tier 2: Semi-Autonomy (Agent proposes, human quick-approves)

  • Staging environment changes
  • Non-critical production changes (monitoring configs, tags)
  • Security group rule modifications
  • IAM policy updates

Tier 3: Advisory Only (Agent suggests, human implements)

  • Core production infrastructure (VPCs, databases, load balancers)
  • Security-critical configurations
  • Architecture changes
  • Disaster recovery systems
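
A tiering scheme like this is only enforceable if it lives as data rather than tribal knowledge. Here’s a minimal sketch of what I mean - the operation names and tier assignments below are illustrative, not a real policy:

```python
# Hypothetical autonomy-tier policy: map operation categories to the
# approval level an agent must obtain before acting.
FULL_AUTONOMY = 1   # agent decides and acts
SEMI_AUTONOMY = 2   # agent proposes, human quick-approves
ADVISORY_ONLY = 3   # agent suggests, human implements

TIER_POLICY = {
    "dev.config_change": FULL_AUTONOMY,
    "test.provisioning": FULL_AUTONOMY,
    "staging.change": SEMI_AUTONOMY,
    "prod.monitoring_config": SEMI_AUTONOMY,
    "prod.vpc": ADVISORY_ONLY,
    "prod.security_config": ADVISORY_ONLY,
}

def required_tier(operation: str) -> int:
    # Unknown operations default to the most restrictive tier.
    return TIER_POLICY.get(operation, ADVISORY_ONLY)

def agent_may_act_alone(operation: str) -> bool:
    return required_tier(operation) == FULL_AUTONOMY
```

The useful property is the default: anything nobody thought to classify falls to advisory-only, so the agent’s autonomy only ever grows through explicit human decisions.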

Does this make sense? Or am I just making it too complex?

The Knowledge Loss Problem

Here’s what worries me beyond the immediate risks:

If agents handle infrastructure operations, do we lose deep operational knowledge?

I learned Kubernetes by breaking it repeatedly. I understand AWS networking because I’ve debugged bizarre routing issues at 2am. I know how our deployment pipeline works because I’ve fixed it when it broke.

If agents handle all that… what do I learn? What does the next generation of infrastructure engineers learn?

Do we become “agent operators” who can define constraints and review decisions, but can’t actually troubleshoot the underlying systems?

Maybe that’s fine. Maybe that’s the future. But it makes me uncomfortable.

What I Want from the Community

I’m not looking for “agents are good” or “agents are bad” takes. I want to hear from people who are actually doing this:

1. Where have you drawn the line on agent autonomy for infrastructure?

  • What can agents do automatically?
  • What requires human approval?
  • Have you regretted giving agents too much access?

2. How do you handle the “technically correct but contextually wrong” problem?

  • How do agents learn organizational context beyond rules?
  • Have you had incidents where agents made things worse?

3. What’s your rollback strategy when agents make mistakes?

  • Can you easily undo agent-applied changes?
  • How do you prevent cascading failures?

4. How are you maintaining operational knowledge on your team?

  • Are junior engineers learning infrastructure, or just learning to operate agents?
  • Do you have “manual mode” training?

I’m genuinely trying to figure out the right path here. The technology is compelling, but the risks are real.

What are you seeing in the wild?

Strong opinion incoming: NEVER give agents direct production write access. Ever.

I mean it. Your nightmare scenario isn’t hypothetical - it’s inevitable.

The Security Model You Need

Here’s the only pattern I’m comfortable with for infrastructure agents:

Read → Analyze → Propose → Human Approves → Agent Applies (with verification)

Note that there are TWO human gates:

  1. Before the change (approval)
  2. After the change (verification that it did what was intended)
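
One way to make the two gates structural rather than procedural is to model each change as a small state machine whose apply step physically refuses to run without a recorded approval. A sketch - the state and method names are mine:

```python
from enum import Enum, auto

class ChangeState(Enum):
    PROPOSED = auto()
    APPROVED = auto()   # gate 1: a human approved the plan
    APPLIED = auto()
    VERIFIED = auto()   # gate 2: a human confirmed the outcome

class ChangeRequest:
    def __init__(self, plan: str):
        self.plan = plan
        self.state = ChangeState.PROPOSED

    def approve(self, human: str):
        if self.state is not ChangeState.PROPOSED:
            raise ValueError("can only approve a proposed change")
        self.approved_by = human
        self.state = ChangeState.APPROVED

    def apply(self):
        # The apply step refuses to run without prior human approval.
        if self.state is not ChangeState.APPROVED:
            raise PermissionError("human approval required before apply")
        self.state = ChangeState.APPLIED

    def verify(self, human: str):
        if self.state is not ChangeState.APPLIED:
            raise ValueError("can only verify an applied change")
        self.verified_by = human
        self.state = ChangeState.VERIFIED
```

The point isn’t the ten lines of Python - it’s that the agent calls `apply()` through this object, so skipping a gate is an exception, not a policy violation you discover later.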

Why? Because infrastructure agents have a unique combination of dangerous characteristics:

Wide Attack Surface:

  • Access to secrets and credentials
  • Ability to modify security controls
  • Privilege to change access patterns
  • Authority to deploy code

High Blast Radius:

  • One bad infrastructure change affects all services
  • Cascading failures are common
  • Rollback is complex and often incomplete
  • Debugging is harder than application code

Attractive Attack Target:

  • Compromising one agent could give an attacker infrastructure control
  • Agent credentials are valuable targets
  • Supply chain attacks on agent training data
  • Poisoned context or prompt injection

Real Risks (Not Theoretical)

Let me give you scenarios I’ve actually seen or responded to:

Scenario 1: The Helpful Agent
Agent detects unused IAM role, deletes it to “clean up.” That role was used by a monthly batch job. Job fails at month-end, major financial reporting impact.

Scenario 2: The Overzealous Remediator
Agent sees security group with 0.0.0.0/0 access, locks it down to specific IPs per policy. Those IPs were the CDN that served the website. Site goes down.

Scenario 3: The Poisoned Pattern
Agent trained on public Terraform examples applies a “common pattern” for database access. That pattern is insecure by design (comes from a tutorial, not production code). Database exposed.

Scenario 4: The Cascading Failure
Agent remediates Kubernetes node affinity issue by rescheduling pods. Triggers resource contention. Triggers more rescheduling. Cluster enters death spiral. Human intervention needed but agent keeps “fixing” it.

The Framework I Use

Immutable Infrastructure + Agent Suggestions + Human Gate

Practically:

1. Agents can’t apply changes directly

  • All agent-proposed changes go through PRs
  • Infrastructure repos require human approval
  • Agents document reasoning in PR description
  • Humans verify before merge

2. Agents have read-only production access

  • Can analyze state
  • Can detect drift
  • Can propose remediations
  • Cannot apply changes

3. Exception: Pre-approved safe operations

  • Scaling within defined bounds (e.g., 2-20 instances)
  • Tagging and labeling (metadata only)
  • Monitoring configuration updates
  • All with comprehensive audit logging

4. All agent actions generate security events

  • Logged to immutable audit trail
  • Monitored for anomalies
  • Reviewed in security retrospectives
  • Correlated with other security data
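
Point 4 is concrete enough to sketch: each agent action appends a structured event that includes a hash of the previous entry, so after-the-fact tampering with the trail is detectable. An illustration with assumed field names, not a real SIEM schema:

```python
import hashlib
import json
import time

class AuditLog:
    """Append-only audit trail; each entry chains the previous
    entry's hash so tampering breaks the chain."""

    def __init__(self):
        self.entries = []

    def record(self, actor: str, action: str, reasoning: str) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else "genesis"
        entry = {
            "ts": time.time(),
            "actor": actor,
            "action": action,
            "reasoning": reasoning,
            "prev": prev_hash,
        }
        entry["hash"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        self.entries.append(entry)
        return entry
```

In practice you’d ship these to an external, write-once store rather than keep them in process memory, but the chaining idea is the same.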

Addressing Your Tiers

Your tier system is close, but I’d adjust:

Tier 1 (Full Autonomy):

  • :white_check_mark: Dev/test environment changes (agree)
  • :warning: But still with rollback capability and audit logging
  • :cross_mark: Not even non-critical production - context matters

Tier 2 (Semi-Autonomy):

  • :cross_mark: No “quick approve” - humans must actually review
  • :cross_mark: IAM policy updates should be Tier 3 (security critical)
  • :white_check_mark: Agent proposes, human reviews carefully, human approves

Tier 3 (Advisory Only):

  • :white_check_mark: Everything in production (including “non-critical”)
  • :white_check_mark: All security configurations
  • :white_check_mark: All access controls
  • :white_check_mark: All changes with cross-service impact

The Knowledge Loss Problem Is Real

You asked about losing operational knowledge. This is a security concern, not just an operational one.

When teams lose deep infrastructure knowledge:

  • They can’t identify subtle security issues
  • They don’t understand attack vectors
  • They can’t respond effectively to incidents
  • They trust agent decisions they shouldn’t

I’ve seen this in practice: Teams that over-rely on automation tools (even pre-AI) lose the ability to diagnose complex security issues.

Recommended Security Controls

If you’re moving toward infrastructure agents:

1. Principle of Least Privilege (Always)

  • Agents get minimum permissions for read/analyze
  • Apply permissions granted through human-approved process
  • Regular permission audits
  • Automatic expiration of unused permissions

2. Defense in Depth

  • Agent recommendations reviewed by security automation (separate from agent)
  • Infrastructure change gates at multiple levels
  • Anomaly detection on agent behavior
  • Kill switches for emergencies

3. Audit Everything

  • What the agent proposed
  • Why (its reasoning)
  • What constraints were active
  • What human approved
  • What actually changed
  • What the impact was

4. Regular Security Reviews

  • Review agent decisions for security implications
  • Red team exercises that include agent compromise scenarios
  • Incident response drills that assume agent is adversarial
  • Supply chain security for agent dependencies

Bottom Line

Infrastructure agents are useful for:

  • :white_check_mark: Detection and analysis
  • :white_check_mark: Proposing solutions
  • :white_check_mark: Drafting changes
  • :white_check_mark: Generating documentation

Infrastructure agents are NOT appropriate for:

  • :cross_mark: Autonomous production changes
  • :cross_mark: Security-critical decisions
  • :cross_mark: Anything you can’t easily verify and rollback

Your instinct to be cautious is correct. The benefits of autonomy don’t outweigh the security and reliability risks.

Start conservative. Prove safety in limited contexts. Expand slowly with comprehensive controls.

And never, ever let an agent apply production infrastructure changes without human review.

Sam’s security perspective is critical, and I largely agree. But let me add the operational reality from managing infrastructure at scale in a regulated environment.

The Compliance Lens

In financial services, “the agent did it” is not an acceptable answer when regulators ask “who authorized this change?”

We’ve had to build frameworks that satisfy both operational efficiency AND regulatory requirements:

Every infrastructure change must have:

  1. Identified human owner - Someone accountable
  2. Documented reasoning - Why was this change needed?
  3. Risk assessment - What could go wrong?
  4. Approval chain - Who signed off?
  5. Rollback plan - How do we undo this?
  6. Audit trail - Immutable record of what happened

Agents can help with all of these EXCEPT #1 - there must be a human owner.
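
That “human owner or nothing” rule is easy to make structural: the change record simply refuses to exist without one. A minimal sketch in Python - all field and class names here are mine, not from any real compliance tooling:

```python
from dataclasses import dataclass

@dataclass
class ChangeRecord:
    owner: str            # 1. identified human owner (mandatory)
    reasoning: str        # 2. why was this change needed?
    risk_assessment: str  # 3. what could go wrong?
    approvers: list       # 4. approval chain
    rollback_plan: str    # 5. how do we undo this?
    audit_ref: str        # 6. pointer into the immutable audit trail

    def __post_init__(self):
        # "The agent did it" is not an acceptable owner.
        if not self.owner or self.owner.startswith("agent:"):
            raise ValueError("every change needs an accountable human owner")
```

Requirements 2 through 6 can be agent-drafted and human-edited; requirement 1 is validated, not drafted.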

Where We Use Agents in Infrastructure

After 18 months of experimentation, here’s what’s working:

Agent-Assisted Workflows (Not Autonomous):

1. Drift Detection and Analysis

  • Agent continuously monitors for configuration drift
  • Generates reports with impact analysis
  • Proposes remediation options
  • Human reviews and approves

2. Change Planning

  • Agent drafts Terraform changes based on requirements
  • Generates blast radius analysis
  • Suggests rollback procedures
  • Human reviews, modifies, approves

3. Incident Response Assistance

  • Agent suggests diagnostic commands during incidents
  • Proposes potential fixes based on past incidents
  • Generates runbooks for common issues
  • Human drives the incident, agent assists

4. Compliance Checking

  • Agent scans infrastructure for policy violations
  • Flags non-compliant configurations
  • Suggests remediation approaches
  • Human prioritizes and addresses

What We’re NOT Using Agents For:

  • :cross_mark: Automatic application of infrastructure changes
  • :cross_mark: Production deployments without human trigger
  • :cross_mark: Security or IAM modifications
  • :cross_mark: Database schema changes
  • :cross_mark: Network topology modifications

The Middle Ground: Pre-Approved Safe Operations

There IS a middle ground between “no autonomy” and “full autonomy”:

Pre-Defined, Pre-Approved, Bounded Operations

Example: Auto-scaling within defined parameters

  • Agent can scale ASG from 5 → 15 instances (within bounds)
  • Agent cannot modify the instance type or AMI
  • Agent cannot change security groups or IAM roles
  • Agent logs all scaling decisions with reasoning
  • Humans receive notifications of scaling events
  • Weekly human review of scaling patterns

This works because:

  • The operation is well-understood and low-risk
  • The bounds are clearly defined and enforced
  • The blast radius is limited
  • Rollback is straightforward
  • Audit trail is comprehensive
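
The bounds themselves can live in code the agent must call through, so an out-of-range request is rejected (and escalated) rather than silently clamped. A sketch using the 5-15 instance bounds from the example above - the function name is mine:

```python
MIN_INSTANCES, MAX_INSTANCES = 5, 15  # pre-approved bounds for this ASG

def request_scale(desired: int) -> int:
    """Return the instance count the agent is allowed to set.

    Out-of-bounds requests raise rather than clamp, so a human sees
    an escalation instead of a silently altered request."""
    if not (MIN_INSTANCES <= desired <= MAX_INSTANCES):
        raise PermissionError(
            f"scale to {desired} outside approved bounds "
            f"[{MIN_INSTANCES}, {MAX_INSTANCES}]; escalate to a human"
        )
    return desired
```

The reject-don’t-clamp choice is deliberate: if the agent keeps asking for 30 instances, that’s a signal the bounds (or the agent) need human attention, and clamping would hide it.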

Addressing the Knowledge Loss Concern

You’re absolutely right to worry about this. I’m seeing it already.

What we’re doing:

1. “Manual Mode” Training

  • New infrastructure engineers spend first 3 months working WITHOUT agent assistance
  • They learn Kubernetes, Terraform, AWS by hand
  • They debug real issues manually
  • Only after demonstrating competency do they get agent access

2. Incident Response Drills

  • Regular exercises where agents are “disabled”
  • Team must diagnose and resolve infrastructure issues manually
  • Builds muscle memory for manual operation
  • Identifies knowledge gaps

3. Agent Explanation Requirements

  • When agents propose changes, they must explain their reasoning
  • Engineers must be able to critique that reasoning
  • “I don’t understand why the agent suggested this” = not ready to approve

4. Architecture Reviews

  • Regular sessions where team discusses infrastructure decisions
  • Agents can propose, but humans must debate alternatives
  • Preserves architectural thinking skills

The Rollback Challenge

Sam mentioned this briefly, but it’s worth expanding: Infrastructure rollback is complex.

If an agent applies a change that causes an issue, rolling back isn’t always straightforward:

Application rollback:

  • Deploy previous version
  • Usually works cleanly

Infrastructure rollback:

  • Might affect multiple services
  • State changes may not be reversible
  • Data migrations might have occurred
  • Other changes might depend on this one

This is why we require:

  • Every agent-proposed change includes rollback plan
  • Humans verify rollback plan is viable
  • Changes are tested in staging first (including rollback)
  • Critical infrastructure has blue/green or canary patterns

Practical Recommendation

Start here:

Phase 1: Observation Only (Current)

  • Deploy agents in read-only mode
  • Let them analyze and report
  • Humans do all the work
  • Build confidence in agent recommendations

Phase 2: Proposal Mode (Next 3-6 months)

  • Agents propose changes via PR
  • Humans review, modify, approve
  • Agents help with analysis and documentation
  • Measure: are agent proposals actually good?

Phase 3: Limited Autonomy (6-12 months)

  • Agents can apply pre-approved safe operations
  • Humans define bounds very carefully
  • Comprehensive monitoring and alerting
  • Quick rollback mechanisms

Phase 4: Evaluate (12+ months)

  • Based on data from Phase 3, decide if broader autonomy makes sense
  • Don’t expand autonomy without proven safety record

The Accountability Answer

To your question about “who owns it when an agent causes an outage?”

The human who granted the agent permission to make that class of changes.

This is why we’re so conservative about agent autonomy. If I grant an agent permission to modify production infrastructure, I’m accountable for any issues it causes.

That accountability drives conservative decision-making, which honestly feels appropriate for infrastructure.

Final Thought

The question isn’t “should agents manage infrastructure?” It’s “what’s the right division of responsibilities between agents and humans?”

Agents excel at:

  • Tireless monitoring
  • Pattern recognition across huge datasets
  • Generating consistent documentation
  • Suggesting solutions based on past precedent

Humans excel at:

  • Understanding context and organizational priorities
  • Making judgment calls in ambiguous situations
  • Taking accountability for critical decisions
  • Learning from incidents to improve processes

Design your agent workflows to leverage both.

I’m going to share something that might be unpopular: We gave an agent limited production write access, and it worked out better than I expected.

Let me explain the context and constraints before Sam has a heart attack.

What We Did (And Why)

We deployed an infrastructure agent with these specific, tightly-bounded permissions:

Agent Can Do Automatically:

  1. Auto-scaling within defined ranges

    • Scale application servers from 10-50 instances
    • Scale database read replicas from 2-10
    • Cannot change instance types or configurations
    • All scaling is logged, and humans are alerted immediately
  2. Self-healing on known failure patterns

    • Restart crashed services (after human-defined pattern matching)
    • Replace unhealthy instances in ASG
    • Clear disk space from known temp directories
    • Each action follows pre-approved runbook
  3. Configuration remediation for specific drift types

    • Reapply required tags to resources that are missing them
    • Fix monitoring agent configurations
    • Update DNS TTLs that violate policy
    • All from pre-approved list, not general-purpose

Agent Cannot Do:

  • :cross_mark: Anything touching IAM, security groups, or network ACLs
  • :cross_mark: Database schema or data changes
  • :cross_mark: Deploy code or change application configurations
  • :cross_mark: Modify infrastructure outside the pre-approved operations list
  • :cross_mark: Make any change that affects more than one service
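
The “pre-approved list, not general-purpose” constraint above can be sketched as a runbook allowlist: a human binds each known incident signature to exactly one approved action, and anything that doesn’t match escalates. The signatures and action names below are invented for illustration:

```python
import re

# Human-curated mapping: incident signature pattern -> approved action.
# The agent never improvises; no match means a human gets paged.
APPROVED_RUNBOOKS = [
    (re.compile(r"service \S+ crashed"), "restart_service"),
    (re.compile(r"instance \S+ failed health check"), "replace_instance"),
    (re.compile(r"disk usage \d+% on /tmp"), "clear_temp_dirs"),
]

def select_action(incident: str) -> str:
    for pattern, action in APPROVED_RUNBOOKS:
        if pattern.search(incident):
            return action
    return "escalate_to_human"  # default for anything novel
```

Real pattern matching would be richer than regexes over alert text, but the shape is the point: the agent selects from a closed, human-authored set, and the default branch is escalation.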

Why We Did This:

Our platform team was drowning in repetitive operational toil. 40% of our on-call pages were for issues that had known, safe solutions but required human intervention due to process constraints.

We had three choices:

  1. Hire more ops people (expensive, doesn’t scale)
  2. Continue with high toil (burnout risk, slow incident response)
  3. Carefully delegate specific, well-understood operations to agents

We chose #3.

Results After 6 Months

Operational Impact:

  • On-call pages reduced by 35%
  • Mean time to recovery improved by 40% for agent-handled incidents
  • Ops team satisfaction significantly improved (less 3am toil)
  • Zero infrastructure outages caused by agent decisions

What Surprised Us (Positively):

  • Agents are consistent - they follow the runbook EVERY time
  • Humans get tired and make mistakes at 3am; agents don’t
  • Having comprehensive audit logs of agent reasoning helped us improve our runbooks
  • Team confidence in agent decisions grew over time

What Surprised Us (Negatively):

  • We still needed significant human oversight and monitoring
  • Edge cases the runbooks didn’t cover required human escalation (as designed, but more common than expected)
  • Some team members became overly reliant on agents for routine operations
  • Cultural resistance was stronger than anticipated

The Critical Success Factors

This only worked because we:

1. Started Incredibly Conservative

  • First 3 months: Agent only proposed changes, humans applied
  • Next 3 months: Agent could apply only to dev/staging
  • Final phase: Limited production autonomy with extensive monitoring

2. Had Comprehensive Guardrails

  • Every autonomous action had multiple safety checks
  • Blast radius analysis before agent was given permission
  • Kill switch to instantly disable agent autonomy
  • Weekly reviews of agent decisions

3. Maintained Human Oversight

  • Real-time alerting on all agent actions
  • Dedicated dashboard showing agent decision-making
  • Weekly team review of agent-handled incidents
  • Quarterly external audit of agent permissions

4. Invested in Observability

  • Full agent decision logs with reasoning
  • Metrics on agent effectiveness
  • Tracking of edge cases and escalations
  • Integration with our incident management system

Where I Agree with Sam

Sam’s security concerns are valid. We’re not disagreeing - we’re operating in different risk environments.

His “NEVER” stance makes sense for:

  • Security-critical operations
  • Changes with wide blast radius
  • Environments with strict compliance requirements
  • Operations where rollback is complex

Our limited autonomy works because:

  • Operations have narrow blast radius
  • Rollback is straightforward
  • We have comprehensive monitoring
  • Risk tolerance is different for our specific use cases

The Organizational Change Management Challenge

Luis’s point about accountability is crucial. Here’s what we learned:

The agent is a tool, not an owner.

When the agent scales up instances automatically:

  • The platform team owns that decision
  • We granted the agent those permissions
  • We defined the constraints
  • We’re accountable for the outcomes

This is documented in our incident response process, our org chart, and our job descriptions.

Cultural Framing Matters:

We talk about the agent as a “team member with very specific, limited responsibilities” rather than “automation that manages infrastructure.”

This helps the team maintain appropriate ownership mindset while leveraging agent capabilities.

Metrics That Matter

We track:

  • MTTD (Mean Time to Detection) - Agents excel here
  • MTTR (Mean Time to Recovery) - Agents good for known issues, humans better for novel problems
  • Agent Decision Quality - % of agent actions that humans would have approved
  • False Positive Escalations - Times agent escalated when it didn’t need to
  • Missed Escalations - Times agent should have escalated but didn’t (most concerning metric)

We review these weekly and adjust agent boundaries based on data.
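
A weekly review like this can be driven by simple counts over the agent’s decision log. A sketch of the last two metrics, assuming a hypothetical event format in which a human reviewer attaches a verdict to each logged action:

```python
def decision_quality(events: list) -> float:
    """Fraction of agent actions a human reviewer marked as
    'would have approved'."""
    actions = [e for e in events if e["kind"] == "action"]
    if not actions:
        return 1.0
    approved = sum(1 for e in actions if e["human_verdict"] == "approved")
    return approved / len(actions)

def missed_escalations(events: list) -> int:
    """Count of actions a reviewer flagged as 'should have escalated'
    -- the most concerning metric in the list above."""
    return sum(
        1 for e in events
        if e["kind"] == "action"
        and e["human_verdict"] == "should_have_escalated"
    )
```

The hard part isn’t the arithmetic, it’s the labeling: someone has to actually review each action and record a verdict, which is exactly the human oversight cost people underestimate.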

What We’re Learning

Agents are good at:

  • Consistent execution of well-defined procedures
  • 24/7 monitoring without fatigue
  • Fast response to known patterns
  • Comprehensive documentation of actions

Agents are not good at:

  • Novel problem-solving
  • Understanding business context
  • Making judgment calls in ambiguous situations
  • Adapting to unexpected circumstances

This reinforces Sam and Luis’s points about human oversight for complex decisions.

Recommendation: Start Smaller Than You Think

If you’re considering giving agents infrastructure autonomy:

  1. Start with non-production - Prove safety over months, not weeks
  2. Define boundaries tighter than necessary - expanding later is easy, contracting is hard
  3. Invest heavily in observability - You can’t govern what you can’t see
  4. Get security review early - Sam’s concerns should shape your design
  5. Plan for rollback - Both technical and organizational
  6. Measure everything - Data should drive expansion of autonomy

The Answer to Your Question

You asked: “Should agents have write access to production infrastructure?”

My answer: It depends on your risk tolerance, your guardrails, and the specific operations.

For broadly applicable, general-purpose infrastructure changes? No. Sam is right.

For narrowly-defined, well-understood, low-blast-radius operations with comprehensive safety controls? Maybe yes, if your organization is prepared for the responsibility.

The key is: Don’t think of it as “agents vs humans.” Think of it as “what’s the right delegation of responsibilities with appropriate accountability?”

Reading this thread, I’m struck by how different the infrastructure conversation is from the application development conversation.

When we talked about AI agents writing application code, the worst case scenario was “the code doesn’t work” or “there’s a bug in production.” Annoying, fixable.

When we’re talking about AI agents modifying infrastructure, the worst case is “our entire platform goes down and we can’t get it back up quickly.” Significantly more serious.

Learning From Debugging Production Incidents

The “knowledge loss” concern Alex raised hits differently in the infrastructure context.

I learned most of what I know about our infrastructure by debugging production incidents:

  • Why does this service randomly time out? (Learned about connection pooling limits)
  • Why did this deployment fail? (Learned about Kubernetes resource limits)
  • Why is this API slow? (Learned about DNS caching and service mesh latency)

Each incident taught me something about how the pieces connect.

If agents prevent most of those incidents (which is good!), where do I learn that knowledge?

And more importantly: When something truly novel breaks - something the agent hasn’t seen before - who’s equipped to debug it?

The “Works Until It Doesn’t” Problem

Michelle’s experience is encouraging, but I wonder about the long-tail risks.

Your agent successfully handled operations for 6 months with zero outages. That’s great! But that means:

  • Your team got comfortable with agent autonomy
  • Your monitoring and safety processes proved adequate for the scenarios you encountered
  • You built organizational confidence

But what happens in month 7 when the agent encounters a scenario that’s ALMOST like a known pattern, but subtly different?

The agent might:

  1. Correctly recognize it doesn’t match the pattern → escalate (good)
  2. Incorrectly think it matches the pattern → apply wrong fix (bad)
  3. Correctly match pattern, but context makes it wrong → apply right fix at wrong time (bad)

That third case worries me. The agent following its programming perfectly, but context makes that the wrong decision.

Runbooks Generated by Agents

Luis mentioned agents generating runbooks. I love this idea, but with a caveat:

If agents both generate the runbooks AND execute them, who verifies the runbooks are correct?

Scenario:

  • Agent encounters new incident pattern
  • Agent diagnoses and resolves issue
  • Agent generates runbook for future incidents
  • Future incident occurs
  • Agent applies runbook automatically

At what point did a human verify the runbook was correct? What if the agent’s original diagnosis was wrong, but happened to work? Now we’ve codified a wrong solution.

Proposal: Agents Must Teach, Not Just Do

What if we required agents to explain their reasoning in a way that helps humans learn?

When an agent auto-scales your infrastructure, instead of just:

“Scaled from 10 → 15 instances due to high CPU utilization.”

What if it said:

"Scaled from 10 → 15 instances. Here’s what I observed:

  • CPU utilization crossed 80% threshold for 5 consecutive minutes
  • Request latency increased from 100ms → 300ms
  • Error rate remained stable at 0.1%
  • Historical pattern suggests traffic spike will last 30-45 minutes
  • Scaling analysis: 15 instances should bring CPU to ~65%, latency back to normal
  • Alternative considered: horizontal pod autoscaling already at max
  • Risk assessment: Low (scaling within defined bounds, health checks passing)"

The second version teaches me:

  • What signals the agent looked at
  • How it made the decision
  • What else it considered
  • Why this was the right move

If I read enough of these explanations, I start to build the mental model the agent is using.
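
Getting explanations of that quality consistently is mostly a matter of forcing structure: the agent fills a template, and the formatter refuses to render an explanation that omits its observed signals or considered alternatives. A sketch - the field names are invented:

```python
from dataclasses import dataclass

@dataclass
class ScalingExplanation:
    action: str
    signals: list       # what the agent observed
    prediction: str     # expected effect of the action
    alternatives: list  # other options it considered
    risk: str

    def render(self) -> str:
        # A partial explanation is worse than none: refuse to render it.
        if not (self.signals and self.alternatives):
            raise ValueError("explanation must list signals and alternatives")
        lines = [self.action, "Here's what I observed:"]
        lines += [f"  - {s}" for s in self.signals]
        lines.append(f"  - Scaling analysis: {self.prediction}")
        lines += [f"  - Alternative considered: {a}" for a in self.alternatives]
        lines.append(f"  - Risk assessment: {self.risk}")
        return "\n".join(lines)
```

An explanation the agent literally cannot emit without its reasoning is also an explanation a human can audit line by line, which feeds the knowledge-building loop above.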

Question for Michelle

Your 6-month success story is compelling. A few questions:

  1. How do you onboard new infrastructure engineers who join after agents are already handling these operations? Do they learn by reading agent decision logs?

  2. What’s your plan for when the agent encounters something outside its defined bounds? Does your team still have the skills to handle it manually?

  3. Have you done disaster recovery drills where the agent is “offline”? Can your team still operate effectively?

Not trying to poke holes - genuinely curious how you’re addressing the long-term knowledge retention while benefiting from agent autonomy.

Where I’m Landing

I think the tiered approach makes sense:

  • Dev/Test: Agents can have significant autonomy (learning environment)
  • Staging: Agents propose, humans approve quickly (verification environment)
  • Production: Agents assist, humans decide (accountability environment)

With exceptions for Michelle’s “pre-approved, narrow, low-risk operations” in production - IF:

  • :white_check_mark: Comprehensive monitoring and audit trails
  • :white_check_mark: Easy rollback mechanisms
  • :white_check_mark: Regular human review of agent decisions
  • :white_check_mark: Training that maintains manual operation skills
  • :white_check_mark: Disaster recovery plans that assume agent is unavailable

But I think Sam’s conservative stance is the right default, and any deviation requires strong justification and safety controls.

Because application bugs are fixable. Infrastructure outages are career-limiting events.