AI Code Has 1.7× More Issues Per PR, But Bain Says ROI Is 'Unremarkable'—Who's Right?

I’ve been thinking about the AI coding assistant conversation, and I keep coming back to two data points that seem impossible to reconcile.

The Quality Problem

Recent research tracking 8.1 million pull requests found that AI-generated code contains 1.7 times more issues than human-written code (10.83 issues per PR vs 6.45). Logic errors appear 1.75x more often. Security vulnerabilities rise 1.57x. And across teams using AI assistants heavily, there’s a 9% increase in bugs per developer.

That’s… not great. These aren’t edge cases—this is systematic quality degradation at scale.

The Perception Gap

But here’s where it gets stranger. Before adopting AI tools, developers expected to be 24% faster. After using them for months, they still believed they were 20% faster.

Measured reality? Tasks took 19% longer to complete.

That’s a 39-percentage-point gap between perceived and measured speed (43 points against initial expectations)—one of the largest perception-reality disconnects in software engineering research. We feel faster while the clock says we’re slower.

The Bain Reality Check

Then Bain drops their Technology Report 2025 with this line: “Software coding was one of the first areas to deploy generative AI, but the savings have been unremarkable.”

Two out of three software firms rolled out AI coding tools. Among those, adoption is high. Teams report 10-15% productivity boosts. But the actual business impact? Underwhelming.

Why? Developers spend only 20-40% of their time actually writing code. Even significant speedups in code generation translate to modest overall gains when most of our day is meetings, alignment, debugging, and coordination.

The Bottleneck We Created

Here’s the kicker: developers complete 21% more tasks and merge 98% more pull requests with AI assistance. Sounds great, right?

Except AI-generated PRs wait 4.6 times longer before code review begins. Overall review time increases 91%. Average PR size jumps 154%.

We’ve created a review bottleneck. Human approval can’t scale with AI velocity. And when reviewers get fatigued by several times their former volume, they start skimming—precisely when AI code carries 1.7x more issues that need catching.

So Who’s Right?

Are we measuring the wrong things? Quality vs velocity—is this a necessary tradeoff, or are we simply deploying AI wrong?

The research shows individual developers produce more code. But organizations see “unremarkable” savings. Quality degrades. Review becomes the limiting factor. Time saved coding doesn’t redirect to higher-value work.

Maybe the answer isn’t “who’s right” but “what are we optimizing for?”

I’m curious: For those using AI assistants on your teams—what patterns are you seeing? Is the quality-velocity tradeoff real, or is this a deployment and process problem?

This resonates deeply with what we’re experiencing in financial services. The review bottleneck you describe isn’t theoretical—it’s our daily reality.

The Numbers Match Our Experience

We rolled out GitHub Copilot to 40 engineers last quarter. Within six weeks:

  • PRs per developer: +47%
  • Average PR size: +118% (not quite 154%, but close)
  • Time to first review: 3.8x longer (again, mirrors your 4.6x finding)
  • Bugs reported in production: +12%

The perception gap is real too. In our quarterly survey, 73% of engineers said AI tools made them more productive. Our DORA metrics? Deployment frequency flat, lead time increased 8%, change failure rate up 11%.

The Reviewer Fatigue Problem

What we didn’t anticipate was the cognitive load on senior engineers doing reviews. When a PR is 300 lines instead of 150, and you’re reviewing 8 PRs a day instead of 3, something has to give.

One of my tech leads told me: “I used to deeply understand every line I approved. Now I’m pattern-matching for obvious issues and trusting tests. I know I’m missing subtle problems.”

That’s the quiet part we don’t say out loud. Review quality degrades when volume increases fivefold, but we haven’t adjusted our processes.

Training Gap

Your question about “are we deploying AI wrong” hits home. We realized reviewers need different skills for AI-generated code:

  • Spotting over-engineering (AI loves abstraction layers)
  • Catching security anti-patterns (AI doesn’t understand your threat model)
  • Verifying business logic correctness (AI gets syntax right, semantics wrong)
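To make that last point concrete, here’s a hypothetical sketch (not from our codebase) of the “syntax right, semantics wrong” failure mode reviewers now have to screen for. Python’s built-in `round()` applies banker’s rounding to binary floats, which a clean-looking AI-generated billing helper can silently substitute where finance expects half-up rounding:

```python
from decimal import Decimal, ROUND_HALF_UP

# AI-style helper: reads cleanly and passes a casual review, but
# round() on a float uses round-half-to-even on an inexact binary value.
def ai_round_fee(amount: float) -> float:
    return round(amount, 2)

# What the business rule actually requires: half-up rounding on
# exact decimal values, the usual invoicing convention.
def correct_round_fee(amount: str) -> Decimal:
    return Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# The one-cent divergence a fatigued reviewer is likely to miss:
print(ai_round_fee(2.675))        # 2.67 (the float nearest 2.675 is slightly below it)
print(correct_round_fee("2.675")) # 2.68
```

Both versions compile, both pass a test suite that never probes the half-cent boundary, and only one matches the business rule. That’s exactly the class of defect our training now targets.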

We’re now piloting “AI code review training” for senior engineers. Feels strange—training humans to review AI—but necessary.

Question for the community: Has anyone successfully scaled code review processes to match AI velocity? What changed?

The alternative is accepting that individual productivity goes up while organizational quality goes down. That math doesn’t work long-term.

Okay, contrarian take: maybe we’re using AI for the wrong tasks entirely.

The Production Bug That Taught Me

Last month our design system team used Copilot to refactor a bunch of component utilities. “It’s just helper functions,” we thought. “Low risk.”

AI cheerfully generated what looked like clean, well-structured code. Passed tests. Got approved. Shipped to production.

Three days later: accessibility regression. The AI had “optimized” our focus management logic in a way that broke keyboard navigation for screen readers. Syntax perfect. Semantics catastrophically wrong.

The kicker? Our accessibility specialist would have caught it in 30 seconds. But she wasn’t looped in because “it’s just a refactor.”

Wrong Tool, Wrong Job

Here’s what I learned: AI is brilliant at boilerplate but dangerous with nuance.

Where AI actually helps our team:

  • Generating repetitive test fixtures
  • Scaffolding component boilerplate
  • Writing first-draft documentation
  • Converting design tokens between formats

Where AI gets us in trouble:

  • Business logic with edge cases
  • Accessibility considerations
  • Security-sensitive code
  • Anything requiring domain knowledge

The 1.7x issue rate makes total sense once you realize AI doesn’t understand why code needs to be a certain way. It pattern-matches on syntax, not on the constraints that matter.

The “Looks Right, Works Wrong” Problem

As a designer, I see this constantly. AI can generate UI components that look like they follow our design system but violate the intent behind our patterns.

It’s the software equivalent of those AI-generated images where people have 6 fingers. Superficially correct, fundamentally broken.

Maybe Bain Is Right for the Wrong Reasons

The “unremarkable savings” might not be about the 20-40% coding time limitation. It might be that we’re increasing work downstream:

  • More time reviewing larger PRs
  • More bugs to debug in production
  • More technical debt to untangle
  • More time explaining to stakeholders why velocity ≠ progress

We save 30 minutes writing code, then spend 2 hours fixing the subtle bugs it introduced. Net productivity: negative.

Question: What if the right approach is to be radically selective about where we use AI? Like, “boilerplate only, humans for everything else” selective?

That feels like heresy when everyone’s trying to maximize AI usage. But maybe maximization isn’t the goal. Maybe precision is.

The “unremarkable savings” finding makes perfect sense when you think about it as an organizational design problem, not a tools problem.

The Time Capture Question

Bain is right that developers spend 20-40% of their time coding. But here’s the question nobody’s asking: When AI saves coding time, where does that time actually go?

At my previous company (mid-stage SaaS, 80 engineers), we tracked this after rolling out Copilot:

  • 34% of saved time → more coding (not higher-value work, just more features)
  • 28% → meetings that expanded to fill available time
  • 19% → context switching and other coordination overhead
  • 11% → longer code reviews (ironic, given the bottleneck)
  • 8% → actually directed to strategic work like architecture and planning

Only 8% of productivity gains translated to work that moves the business forward.

The rest? Absorbed by organizational friction, Parkinson’s Law, and lack of intentional capacity planning.

This Is a Leadership Problem

The quality-velocity tradeoff you’re describing isn’t inevitable. It’s a symptom of deploying technology without redesigning the workflow.

When individual devs get 15% faster at coding but organizations see “unremarkable” impact, the gap is how we’re capturing and directing that capacity.

Questions I’m asking my leadership team now:

  1. What higher-value work should engineers do with saved time? (We made a list. Architecture reviews, technical debt reduction, mentoring, customer research.)

  2. How do we actively redirect capacity? (We track “AI time saved” as a resource pool and allocate it quarterly to strategic initiatives.)

  3. What processes need to change? (Review processes, testing strategy, deployment frequency, planning cycles.)

  4. What skills do we need to build? (Critical evaluation of AI output, architectural thinking, code review for AI-generated code.)

The Org Design Lens

@eng_director_luis, your question about training reviewers is spot-on, but I’d go further: the entire SDLC needs redesign for AI velocity.

If PRs are 154% larger and wait 4.6x longer, maybe smaller PRs should be the constraint. Maybe automated review gates need to get smarter. Maybe we need specialized “AI code review” roles.

@maya_builds, your point about selective AI usage is tactical brilliance. But it requires organizational discipline to enforce “boilerplate only” when everyone’s incentivized to move fast.

The Real Question

Are you treating AI adoption as a tools rollout or an organizational transformation?

Because if it’s the former, you get individual productivity gains and organizational chaos.

If it’s the latter, you redesign processes, adjust incentives, build new capabilities, and then capture the benefits.

Bain’s “unremarkable” findings are what happens when you do the former and hope for the latter.

From a product lens, this entire discussion highlights something we don’t talk about enough: velocity without quality is negative business value.

The Customer-Facing Reality

Let me reframe the metrics through outcomes that matter to customers and the business:

The Original Data:

  • 98% more PRs merged
  • 154% larger PR size
  • 1.7x more issues per PR
  • 9% more bugs per developer

What This Actually Means:

  • 98% more features shipped… with 1.7x more defects embedded
  • 154% more code surface area for bugs to hide in
  • 9% more customer-facing issues per sprint
  • Compounding technical debt at accelerated rates

The Technical Debt Compound Interest Problem

When you ship faster but break more, you’re not gaining velocity—you’re taking on debt at compound interest.

Here’s the math nobody wants to do:

If AI lets you ship Feature A in 3 days instead of 5 days (40% faster!), but it has 1.7x more bugs that take 4 days to debug and fix in production… you’re net slower. Plus you’ve damaged customer trust. Plus you’ve created technical debt that slows down Feature B.
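A back-of-envelope version of that math, using the illustrative numbers above (not measured data):

```python
def net_delivery_days(baseline_days: float, ai_days: float, debug_days: float) -> float:
    """End-to-end cost delta of a feature: AI build time plus production
    debugging, compared against the hand-written baseline. Positive = slower."""
    return (ai_days + debug_days) - baseline_days

# Feature A: 5 days by hand vs 3 days with AI, plus 4 days of bug-fixing.
coding_speedup = 1 - 3 / 5               # 0.4 -> "40% faster" at the keyboard
net_slower = net_delivery_days(5, 3, 4)  # (3 + 4) - 5 = +2 days end to end
print(f"{coding_speedup:.0%} faster coding, {net_slower:+.0f} days net")
```

The keyboard metric says 40% faster; the calendar says two days slower, before counting trust damage or the drag on Feature B.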

@vp_eng_keisha nailed it: only 8% of saved time goes to strategic work. The rest goes to servicing the quality debt AI helped create.

The Product-Engineering Alignment Gap

In product reviews, engineering teams show velocity metrics: “We deployed 47% more features last quarter!”

In customer success reviews, we show bug reports up 12%, churn signals from quality issues, and NPS declining 3 points.

These should be the same conversation. They’re not.

Engineering optimizes for throughput. Product optimizes for outcomes. AI tools are revealing how badly misaligned these incentives are.

The Business Case Question

Here’s the framework I’m using with our leadership team:

Don’t ask: “Does AI make developers faster?”
Ask: “Does AI make the business more valuable?”

Measured by:

  • Customer satisfaction (not deployment frequency)
  • Revenue impact (not features shipped)
  • Engineering leverage (not individual productivity)
  • Technical foundation strength (not code volume)

When we reframe it this way, @maya_builds’ “radically selective” approach makes perfect business sense. Use AI where it reduces toil without creating downstream risk. Avoid it where quality compounds.

The Trap We’re In

The 1.7x more issues and Bain’s “unremarkable” findings are two sides of the same coin:

Individual developers feel productive (more code written). Organizations see no business gain (quality degrades faster than we ship features). Customers experience more bugs.

Everyone’s working harder. Nobody’s winning.

The question isn’t whether AI tools work. It’s whether we’re measuring success correctly.

If your north star is “code produced,” you’ll get lots of code. If your north star is “customer value delivered,” you’ll redesign how you use AI—and maybe conclude you’re using it wrong.