Agentic Coding Runs for Hours Without Human Input. Are We Ready to Trust It, or Just Debugging It Differently?

Last week I left an AI coding agent running while I went to lunch. :steaming_bowl: Came back to 47 commits spread across 8 files. The code worked. Tests passed. But I spent the next hour trying to understand why it made certain architectural decisions.

This is the new reality we’re navigating: agentic coding agents that run for hours without human input. Not minutes. Hours.

The Rakuten Reality Check

There’s a case study making the rounds: Rakuten engineers gave Claude Code a task involving a 12.5-million-line codebase spanning multiple programming languages. The agent ran autonomously for 7 hours and completed the implementation with 99.9% numerical accuracy. Seven. Hours. :exploding_head:

No human wrote a single line of code during that time.

When I first read this, my immediate thought was: “Okay, but would I trust that? Would I hit merge?”

The Trust Paradox We’re All Living

Here’s what the data tells us: developers now use AI in roughly 60% of their daily work. That’s massive adoption. But here’s the kicker—we can only fully delegate 0-20% of tasks.

Think about that gap. We’re using AI constantly, but we’re not trusting it to work unsupervised for most things.

In my design systems work, I see this play out constantly:

  • Component generation? Agent crushes it. :white_check_mark: Give it a design spec, it creates React components with proper TypeScript types, follows our naming conventions, even adds basic tests.

  • Accessibility decisions? Nope. :cross_mark: Still need human judgment for ARIA labels, keyboard navigation patterns, focus management. The agent makes reasonable guesses, but “reasonable” isn’t good enough when you’re building for inclusive access.

The agents are incredible at execution. But the judgment calls? That’s still us.
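To make that split concrete, here's a minimal TypeScript sketch: the typed component shell is the part an agent generates reliably, while the accessible-name rule encoded below is the kind of judgment call that still needs a human. All names (`IconButtonProps`, `accessibleNameFor`) are illustrative, not from any real design system.

```typescript
// Illustrative sketch only: IconButtonProps and accessibleNameFor are
// hypothetical names, not from a real design system.

interface IconButtonProps {
  icon: string;              // icon identifier, e.g. "close"
  visibleText?: string;      // optional visible label
  ariaLabel?: string;        // required when there is no visible text
}

// The human judgment call: an icon-only button has no visible text, so it
// must carry an explicit aria-label or screen readers announce nothing useful.
function accessibleNameFor(props: IconButtonProps): string {
  if (props.visibleText?.trim()) {
    return props.visibleText; // visible text doubles as the accessible name
  }
  if (props.ariaLabel?.trim()) {
    return props.ariaLabel;
  }
  throw new Error(`Icon-only button "${props.icon}" needs an aria-label`);
}
```

An agent will happily emit a "reasonable" default label here; encoding the rule as a hard failure is the human decision that "reasonable" isn't good enough.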

Are We Building Trust or Just Better Debugging?

Here’s my honest question: After six months of using agentic tools daily, I’m not sure if I trust the AI more… or if I’ve just gotten better at reviewing and debugging AI-generated code.

Those are not the same thing.

Trust means I can look away. Debugging means I’m still in the loop, just in a different phase. Maybe that’s fine! Maybe that’s exactly where we should be. But let’s be clear about what we’re actually doing.

When I let an agent run for an hour on a component library refactor, I’m not trusting it the way I’d trust a senior engineer. I’m trusting that:

  1. It won’t break anything critical (tests + CI will catch that)
  2. The blast radius is contained (it’s working in a feature branch)
  3. I can review and course-correct after

That’s not the same as “I trust this will be production-ready when I get back from lunch.”

The Changing Nature of Our Work

What I am seeing change is where I spend my time:

  • Less time writing boilerplate and repetitive code :down_arrow:
  • More time reviewing architectural decisions :up_arrow:
  • Less time debugging syntax errors :down_arrow:
  • More time validating accessibility, performance, edge cases :up_arrow:

The craft is shifting. I’m becoming more of an architect and critic than a builder. And honestly? For design systems work, that’s probably the right direction. My value isn’t typing React code—it’s knowing which components should exist and how they should behave across contexts.

But I’m curious: Are we ready for this shift? Or are we pretending to trust AI while secretly just becoming really good debuggers of autonomous systems?

What’s your experience been? Where do you draw the line between delegation and supervision?


Related: We’re also seeing agents complete 20+ actions autonomously before requiring human input—double what was possible six months ago. The capabilities are accelerating faster than our frameworks for using them.

Maya, this resonates deeply with our experience in financial services. That trust paradox you described? It’s magnified 10x when you’re dealing with banking systems where a mistake can mean regulatory fines or customer financial harm.

Trust Requires Audit Trails

We’ve been experimenting with agentic tools for about 4 months now, specifically for refactoring our legacy banking code. Here’s what we learned fast: trust in financial services isn’t about confidence—it’s about verifiability.

Last month, we let an AI agent loose on refactoring a 15-year-old transaction processing module. Overnight run. The agent made 200+ changes across 30 files. The code compiled. Tests passed. Performance actually improved by 8%.

But we couldn’t ship it.

Why? Because we couldn’t clearly document why it made certain architectural decisions. Our compliance team needs to trace every significant logic change back to a human decision-maker. “The AI thought it was better” doesn’t fly with auditors.

Where It Actually Works for Us

That said, we’re not throwing these tools away. We’ve found a sweet spot:

Agent does the heavy lifting overnight:

  • Refactoring for code style consistency
  • Updating deprecated API calls
  • Generating test coverage for untested paths
  • Data migration script generation

Human architects review and own every decision:

  • Changes to business logic flow
  • Security-critical authentication/authorization
  • Data model modifications
  • Integration points with external systems

The agent becomes a force multiplier for the grunt work, freeing our senior engineers to focus on the judgment-heavy architectural calls.

Velocity vs Control: The Real Question

You asked where we draw the line between delegation and supervision. Here’s our current framework:

:green_circle: Full delegation (review after completion):

  • Code formatting and linting
  • Test case generation for existing logic
  • Documentation updates
  • Dependency updates (within version constraints)

:yellow_circle: Supervised delegation (checkpoint reviews):

  • Refactoring within a bounded module
  • Performance optimization
  • Database query optimization
  • API client code generation

:red_circle: Human-led with AI assistance (AI suggests, human decides):

  • Architectural changes
  • Security-sensitive code
  • Business logic modifications
  • Anything customer-facing
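A framework like this is most useful when it's machine-checkable rather than tribal knowledge. Here's a sketch of the three tiers as a lookup table; the category names and the `tierFor` helper are hypothetical, though the tier assignments mirror the lists above.

```typescript
// Sketch of the three-tier oversight framework as data plus a lookup.
// Category names and the helper are illustrative, not real tooling.

type OversightTier = "full-delegation" | "supervised" | "human-led";

const policy: Record<string, OversightTier> = {
  "formatting": "full-delegation",
  "test-generation": "full-delegation",
  "docs-update": "full-delegation",
  "dependency-update": "full-delegation",
  "bounded-refactor": "supervised",
  "performance-optimization": "supervised",
  "query-optimization": "supervised",
  "api-client-generation": "supervised",
  "architecture-change": "human-led",
  "security-sensitive": "human-led",
  "business-logic": "human-led",
  "customer-facing": "human-led",
};

// Unclassified work defaults to the most restrictive tier: in a regulated
// environment, "we haven't classified this yet" should not mean "go ahead."
function tierFor(taskCategory: string): OversightTier {
  return policy[taskCategory] ?? "human-led";
}
```

The default-to-restrictive choice is the whole point: the framework fails safe when someone invents a task category it hasn't seen.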

The challenge? This slows us down. We’re not getting the full velocity benefits that startups in less-regulated spaces might see. But the trade-off is acceptable because the cost of errors is asymmetric in our industry.

The Rollback Strategy

One thing we implemented early: granular rollback capabilities.

Every autonomous agent session gets tagged in git. If we discover an issue 3 days later, we can isolate and revert just that agent’s changes without touching human commits. Think of it like feature flags, but for AI-generated code.

It’s extra overhead, but it’s bought us confidence to experiment more aggressively.
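One way to make that per-session revert mechanical is to bracket each agent run with start/end tags and revert the commit range between them. A hypothetical sketch of the convention (the tag scheme and helpers are illustrative, not Luis's actual tooling):

```typescript
// Hypothetical convention: each autonomous session is bracketed by
// agent/<date>/<task>-start and agent/<date>/<task>-end git tags, so the
// session's commits can be reverted as a unit days later.

function sessionTag(date: string, task: string): string {
  return `agent/${date}/${task}`;
}

// Builds the git command that reverts only this session's commits:
// `start..end` selects commits reachable from end but not from start,
// i.e. exactly what the agent introduced during the run.
function revertSessionCommand(date: string, task: string): string {
  const tag = sessionTag(date, task);
  return `git revert --no-edit ${tag}-start..${tag}-end`;
}
```

Because `git revert` on a range creates new inverse commits rather than rewriting history, human commits made after the session survive untouched.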


Question for the group: How do we balance velocity with control in industries where the cost of failure is high? Are we overthinking this, or are faster-moving companies underestimating risk?

This conversation mirrors debates we had 10 years ago about CI/CD and infrastructure-as-code. Remember when “let the pipeline deploy to production automatically” felt dangerous?

History Doesn’t Repeat, But It Rhymes

I was at Microsoft during the transition from manual deployments to fully automated pipelines. The resistance was fierce. Senior engineers said: “You can’t trust automation with production systems.” But we learned something crucial:

Trust doesn’t come from blind faith. It comes from observability, limits, and gradual rollout.

Maya, your point about trust vs. better debugging really struck me. I think we’re experiencing the same shift we saw with automated deployments: we’re not trusting the tool to be perfect—we’re trusting the system we’ve built around it to catch failures.

Real Example: Our Cloud Migration

Last quarter, we used agentic coding tools for a major cloud infrastructure migration. Here’s how we structured it:

Agents handled:

  • Infrastructure-as-code generation (Terraform configs for 200+ services)
  • Configuration file migrations
  • Environment variable mapping
  • Initial test case generation

Humans made the calls on:

  • Network topology decisions
  • Security group rules
  • Data residency choices
  • Cost optimization trade-offs

The agent ran for ~6 hours over a weekend, generating probably 15,000 lines of Terraform and related configs. But here’s the key: we had observability at every step.

  • Real-time logs of what it was changing
  • Automated validation gates (syntax checks, policy compliance)
  • Rollback checkpoints every 30 minutes
  • Cost estimation before any apply
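The checkpoint discipline above can be sketched as a simple gate loop: each stage must pass its validation gates before it becomes a rollback checkpoint, and the first failure stops acceptance so later, possibly dependent work is discarded. Everything here (`Stage`, the gate fields, the budget number) is an illustrative assumption, not Michelle's actual pipeline.

```typescript
// Illustrative sketch of per-stage validation gates. All names and the
// budget threshold are assumptions, not a real migration pipeline.

interface Stage {
  name: string;
  passedSyntaxCheck: boolean;   // e.g. terraform validate
  passedPolicyCheck: boolean;   // e.g. policy-as-code compliance
  estimatedCostDelta: number;   // pre-apply cost estimate, USD/month
}

const MAX_COST_INCREASE = 500;  // hypothetical budget gate

// Returns the stages safe to keep as rollback checkpoints, stopping at the
// first gate failure; review resumes from the last accepted checkpoint.
function acceptedCheckpoints(stages: Stage[]): string[] {
  const accepted: string[] = [];
  for (const s of stages) {
    const gatesPass =
      s.passedSyntaxCheck &&
      s.passedPolicyCheck &&
      s.estimatedCostDelta <= MAX_COST_INCREASE;
    if (!gatesPass) break; // discard this stage and everything after it
    accepted.push(s.name);
  }
  return accepted;
}
```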

So when I came in Monday morning, I wasn’t reviewing 6 hours of mysterious work. I was reviewing a staged progression with clear decision points.

Not Replacing Judgment, Scaling It

Luis mentioned the velocity-control trade-off. I’d frame it differently: we’re not trading velocity for control—we’re learning which controls actually matter.

In the old model:

  • Human writes code line by line
  • Human tests locally
  • Human pushes to CI
  • Automated tests run
  • Human reviews
  • Automated deployment

In the new model:

  • Human defines requirements and constraints
  • Agent writes code
  • Agent runs initial tests
  • Human reviews architectural decisions
  • Automated tests run
  • Human reviews
  • Automated deployment

We moved two steps to automation. But the critical review gates remain. We just review at a higher level of abstraction.

The Governance Question

Here’s what keeps me up at night: What’s our governance framework for agentic development?

We have mature processes for code review, deployment approvals, incident response. But do we have:

  • Standards for when agents can work unsupervised?
  • Required checkpoints for long-running autonomous tasks?
  • Audit trails for AI-generated architectural decisions?
  • Rollback procedures specific to agent-generated code?

Most companies don’t. We’re winging it based on gut feel and borrowed practices from adjacent problems.

I’d love to see the industry develop shared frameworks here. Not rigid standards that slow innovation, but battle-tested patterns that let us move fast safely.


Question for this group: Who’s working on governance frameworks for agentic development? What does “responsible AI-assisted development” look like at your orgs?

This is fascinating from a product perspective, but I’m going to ask the uncomfortable business question:

If agents generate code faster than humans can review, what’s our actual throughput gain?

The Review Bottleneck Nobody Talks About

We’ve been tracking this at my company for the past 2 months. Initial excitement: “AI is making our devs 30% faster!” But when we actually measured end-to-end cycle time (story picked up → code in production), the gains were closer to 10%.

Why? Because code review became the constraint.

Our senior engineers are now spending 40-50% of their time reviewing AI-generated code vs. ~25% before. The junior engineers are indeed faster at generating code. But someone still needs to verify it makes sense architecturally.
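That gap is basically Amdahl's-law arithmetic: a speedup applied only to the coding fraction of cycle time dilutes across the whole, and it shrinks further if review time grows. A sketch with illustrative numbers (the fractions below are assumptions, not David's actual measurements):

```typescript
// Amdahl-style sketch of why a 30% coding speedup yields ~10% end-to-end.
// Input fractions are illustrative assumptions.

function endToEndSpeedup(
  codingFraction: number,  // share of cycle time spent writing code
  codingSpeedup: number,   // e.g. 1.3 for "30% faster coding"
  reviewGrowth: number     // multiplier on non-coding time, e.g. 1.5
): number {
  const reviewFraction = 1 - codingFraction;
  const newTime =
    codingFraction / codingSpeedup + reviewFraction * reviewGrowth;
  return 1 / newTime; // overall speedup relative to the old cycle time
}

// With coding at half of cycle time and review unchanged:
// endToEndSpeedup(0.5, 1.3, 1.0) ≈ 1.13 — only ~13% faster overall.
```

And if review time grows 50% (the 25% → 40-50% shift described above), the same coding speedup makes the team net slower: `endToEndSpeedup(0.5, 1.3, 1.5)` comes out below 1.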

Are We Measuring the Right Things?

Maya’s point about trust vs debugging really resonates from a product lens. If we’re honest, we’re not measuring:

  • Time to understand AI-generated code (new cost we didn’t have before)
  • Confidence level in merge decisions (are we shipping with more uncertainty?)
  • Hidden technical debt from “good enough” AI code that passes tests but isn’t idiomatic
  • Context-switching cost of reviewing machine-generated code vs. human-written code

Michelle mentioned observability and checkpoints. From a product ROI perspective, those checkpoints are costly. If I have to review an agent’s work every 30 minutes, how is that different from pair programming with a junior developer?

Maybe it’s not about being faster. Maybe it’s about handling more complexity or taking on work we’d otherwise defer.

The Executive Stakeholder Question

Here’s what I’m struggling to communicate upward to our C-suite:

CFO: “You said AI would make engineering 30% faster. When do we see that in velocity?”

Me: “Well, we’re using AI for 60% of development work, but we can only fully delegate 20% of tasks, and code review has become the bottleneck, so… it’s complicated.”

That’s not a satisfying answer. But it’s the truth.

What I wish I’d said 6 months ago: “We expect throughput gains of 10-15% on average, with 30%+ gains on specific task types like testing and refactoring, but architectural work won’t see meaningful speedups.”

Setting realistic expectations would have saved a lot of awkward quarterly review conversations.

Where I See Real Value

That said, I’m not pessimistic. Where we’re seeing genuine ROI:

  1. Reducing drudgework - Devs happier, less burnout from repetitive tasks
  2. Handling tech debt - We can finally tackle the refactoring backlog
  3. Test coverage - AI-generated tests are improving our quality metrics
  4. Prototyping speed - Getting to first working version 2-3x faster

But “happier engineers” and “cleared tech debt backlog” are harder to put in a CFO deck than “30% faster delivery.”

Real Talk for Product Leaders

If you’re considering AI coding tools and need to justify them to finance:

  • Frame it as quality and sustainability investment, not pure velocity
  • Set expectations: 10-20% cycle time improvement, not 50%
  • Track engineer satisfaction and retention (those have dollar values)
  • Measure tech debt reduction (also has a dollar value in avoided incidents)

The “AI makes us 3x faster” narrative is setting everyone up for disappointment.


Question: How are other product/engineering leaders framing AI tooling ROI to non-technical stakeholders? What metrics are actually moving?

David’s point about review bottlenecks hits hard, but I’m more concerned about what this means for how we develop engineering talent.

The Skill Ceiling Problem

Maya asked: “Are we building trust or just better debugging?” From a team development perspective, I’d add: Are we building engineers or AI supervisors?

Here’s what I’m seeing with our junior engineers who’ve only ever worked in an AI-augmented environment:

They’re incredibly productive at:

  • Implementing features from well-defined specs
  • Fixing bugs when given clear reproduction steps
  • Writing tests for existing code
  • Following established patterns

But they struggle with:

  • Defining the problem when requirements are vague
  • Architecting solutions from scratch
  • Debugging novel issues the AI hasn’t seen
  • Making trade-off decisions without clear frameworks

The agents are amazing at execution. But execution isn’t where senior engineering judgment comes from. It comes from years of making mistakes, seeing systems fail, learning what matters.

If juniors lean on AI for all the execution work, where do they develop that judgment?

We Changed Our Hiring Screens

Three months ago, we completely overhauled our interview process. Used to focus heavily on coding ability—can you implement this algorithm, can you debug this function, etc.

Now we’re screening for:

  1. Problem definition skills - Can you take an ambiguous requirement and structure it?
  2. Architectural thinking - How do you decide between approaches?
  3. Trade-off analysis - What questions do you ask when evaluating options?
  4. System understanding - Can you reason about how components interact?

These are the skills that matter when AI can write the implementation code.

But here’s the uncomfortable truth: I don’t know how to teach these skills without years of hands-on implementation experience.

The traditional path was: junior → write lots of code → make mistakes → learn patterns → become senior. If AI short-circuits “write lots of code,” what’s the new path?

Problem Definition > Code Writing

Luis mentioned agents being force multipliers for grunt work. Michelle talked about reviewing at higher levels of abstraction. Both true. But both assume you’ve already developed the judgment to operate at that level.

For seasoned engineers like all of us in this thread, AI is incredible. We know what to ask for, what to review, what matters. We have the context.

For someone 2 years into their career? I worry we’re creating a dependency, not capability.

What We’re Trying: Pairing Junior + AI + Senior

Here’s our experiment: junior engineers work with AI agents, but with explicit senior mentorship focused on the why not the what.

Structure:

  • Junior + agent implement the feature
  • Senior reviews with junior, discussing: “Why did the agent choose this approach? What are alternatives? What would break? What scales?”
  • Basically, using AI-generated code as a teaching tool for architectural thinking

Early results… mixed. It’s more time-intensive for seniors. But the juniors are asking better questions than they used to.

The Long Game

Maya, you said you’re becoming more architect and critic than builder. I think that’s the future for all of us. But we need to figure out how people become architects without going through the builder phase.

Or maybe they still need the builder phase, just compressed? More iterations in less time because AI handles execution?

I don’t have answers. But this is the conversation we need to be having alongside the “look how fast AI codes” excitement.


Question for engineering leaders: How are you developing junior talent in an AI-augmented environment? What skills are you prioritizing in hiring and mentorship?