AI adoption correlates with 9% more bugs per developer and 154% larger PRs—are we optimizing for the wrong metric?

Three months into our company-wide AI coding assistant rollout, I pulled the metrics. Our engineers are merging 98% more pull requests. The board loved that slide.

Then I looked deeper.

The data that didn’t make the board deck:

  • Bug rates increased 9% per developer
  • Average PR size ballooned 154%
  • Code review time increased 91%
  • Zero measurable improvement in delivery velocity at the organizational level

We celebrated individual productivity gains while organizational outcomes stayed flat. How did we miss this?

The Productivity Paradox Is Real

Recent research across 10,000+ developers confirms what I’m seeing: AI-authored code produces 1.7x as many problems as human code—10.83 issues per PR versus 6.45. The quality gap isn’t subtle.

The breakdown is alarming:

  • 75% more logic and correctness errors
  • 322% more privilege escalation paths
  • 3x more readability issues
  • Technical debt increased 30-41% after AI adoption

But here’s what really keeps me up at night: We’re creating a review bottleneck we can’t scale out of.

The Senior Engineer Trap

Our senior engineers are drowning. They’re reviewing nearly twice as many PRs that are 154% larger than before. Review time has almost doubled, yet we expect the same thoroughness for security, architecture, and maintainability.

The math doesn’t work. You can’t double review volume, increase PR size by 154%, and expect to catch the 75% increase in logic errors. Something has to give.

And it is giving—we’re seeing seniors rush reviews or batch them, which means more bugs reach production. The cognitive load of reviewing AI-generated code is different. You can’t trust the patterns. Every line needs active verification.

We’re Measuring the Wrong Things

I think we’ve been optimizing for the wrong metrics. We track:

  • PRs merged per developer ✓ (up 98%!)
  • Lines of code written ✓ (up significantly!)
  • Time from commit to merge ✓ (faster!)

But we’re not tracking:

  • Time from feature conception to customer value
  • Cost of increased bug density
  • Senior engineer bandwidth erosion
  • Technical debt accumulation rate
  • Long-term codebase maintainability

Our velocity metrics look phenomenal while our ability to deliver stable, maintainable software degrades.

The Uncomfortable Question

Are we shipping faster or just shipping more bugs faster?

At my previous company (Microsoft), we learned that velocity without quality is just waste. Fast feedback loops matter, but only if the code we’re iterating on is fundamentally sound.

AI coding assistants give us faster code generation. But if that code requires 91% more review time, introduces 9% more bugs, and creates 3x more readability issues—are we actually more productive?

Or are we creating a debt bomb that will explode when we try to maintain this code in 12 months?

What We’re Trying

We’re experimenting with a few guardrails:

  1. “AI-assisted” PR labels - Forcing visibility into which PRs used AI heavily
  2. Size limits on AI-generated PRs - Forcing smaller, more reviewable chunks
  3. Mandatory senior review for AI-heavy PRs - Accepting that these need deeper scrutiny
  4. Quality metrics dashboards - Tracking bug density, review time, time-to-fix alongside velocity

Early signals suggest forcing smaller AI-generated PRs helps significantly: reviewability improves, and the bug catch rate goes up.
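For what it’s worth, the size-limit guardrail is straightforward to automate as a CI check. A minimal sketch, assuming PRs carry an “ai-assisted” label and expose a diff size—the label name and the 400-line threshold are my assumptions, not a standard:

```python
# Hypothetical CI gate: block oversized AI-assisted PRs.
# Label name and threshold are assumptions; tune per team.
MAX_AI_PR_LINES = 400

def check_pr(labels: set[str], lines_changed: int) -> tuple[bool, str]:
    """Return (passes, reason) for a PR under the size guardrail."""
    if "ai-assisted" in labels and lines_changed > MAX_AI_PR_LINES:
        return False, (
            f"AI-assisted PR touches {lines_changed} lines "
            f"(limit {MAX_AI_PR_LINES}); please split it"
        )
    return True, "ok"

ok, reason = check_pr({"ai-assisted"}, 900)
print(ok, reason)  # False, with a "please split it" message
```

In practice this would read the label and diff stats from your Git host’s API and fail the build, but the policy itself is just this one predicate.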

The Real Metric Question

What should we actually be measuring?

I suspect the answer is something like “value delivered to customers per unit of engineering effort” but that’s notoriously hard to quantify.

Maybe it’s “features shipped with <X% bug rate in first 30 days”?

Or “engineering throughput adjusted for technical debt creation”?

I don’t have the answer yet. But I’m increasingly convinced that our current metrics are leading us to optimize for activity rather than outcomes.

How are you thinking about AI productivity measurement? What metrics matter in your organization?


Sources: CodeRabbit State of AI vs Human Code, Faros AI Productivity Paradox, Stack Overflow on AI Bugs

This hits close to home, Michelle. I’m seeing the exact same pattern with my 40-person team.

The Review Bottleneck Is Real

I tracked our senior engineers’ time allocation over the last 6 months:

  • Before AI tools: ~35% of their time on code review
  • After AI tools: ~60% of their time on code review

That’s not a typo. Our most experienced engineers—the ones who should be architecting systems, mentoring juniors, and solving complex technical problems—are now spending more than half their time reviewing code.

The Hidden Cost Nobody Talks About

Here’s what bothers me: Our juniors are shipping faster with AI assistance. That part works. They generate code quickly, tests pass, they submit PRs confidently.

But someone has to review all of it. And that burden falls disproportionately on senior engineers who can spot the subtle issues you mentioned—the logic errors, the security vulnerabilities, the patterns that “look right” but aren’t.

We’ve essentially shifted the bottleneck from writing code to reviewing it. And review doesn’t scale the same way code generation does.

The Question I’m Wrestling With

Is AI’s benefit to junior developers creating an unsustainable burden on senior engineers?

We’re accelerating one part of the pipeline while creating a massive slowdown in another. It’s like installing a bigger engine in a car but keeping the same brakes—something’s going to fail under load.

What I Think We’re Missing

Your point about measuring the wrong metrics really resonates. We need to be tracking:

  • Senior engineer capacity utilization - Are we burning out our most valuable people?
  • Review queue depth and age - How long are PRs waiting for quality review?
  • Review thoroughness metrics - Are we catching issues or rubber-stamping to clear the queue?
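Queue depth and age, at least, are cheap to compute from PR metadata. A minimal sketch, assuming each open PR records when it was opened (the field names are illustrative, not from any particular tool):

```python
# Sketch: review queue depth and age from open-PR metadata.
# Field names are assumptions; real data would come from your Git host's API.
from datetime import datetime, timedelta

def queue_metrics(open_prs, now):
    """open_prs: dicts with 'opened' (datetime), all awaiting quality review."""
    depth = len(open_prs)
    ages_h = [(now - pr["opened"]).total_seconds() / 3600 for pr in open_prs]
    return {
        "depth": depth,
        "avg_age_h": sum(ages_h) / depth if depth else 0.0,
        "oldest_h": max(ages_h, default=0.0),
    }

now = datetime(2024, 6, 1, 12, 0)
prs = [{"opened": now - timedelta(hours=30)},
       {"opened": now - timedelta(hours=6)}]
print(queue_metrics(prs, now))  # {'depth': 2, 'avg_age_h': 18.0, 'oldest_h': 30.0}
```

Trending `oldest_h` over time is the signal: a rising tail means PRs are waiting longer and longer for a real review.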

Right now, our dashboards show “throughput up!” but they hide the fact that our senior engineers are working nights and weekends just to keep up with review volume.

Should We Be Building Better Review Processes?

Maybe the answer isn’t just “review AI code more carefully” but fundamentally rethinking our review process for an AI-assisted world:

  • Do we need specialized AI code reviewers?
  • Should we be investing in better automated review tools that catch AI-specific issues?
  • Do we need different review SLAs for AI-generated vs human-written code?

I don’t have answers, but I know our current approach isn’t sustainable. We’re asking senior engineers to spend 60% of their time on review work while expecting the same quality bar. The math doesn’t add up.

Anyone else seeing this pattern? How are you handling the review burden?

Coming at this from the product side, and this conversation is making me uncomfortable—in a good way.

Our Engineering Velocity Metrics Look Amazing for Board Meetings

I literally presented this slide two weeks ago:

  • Sprint velocity up 40%
  • Story points completed up 35%
  • Deployment frequency increased

The board loved it. Our investors loved it. Everyone celebrated our “AI-driven productivity gains.”

But Here’s What I Didn’t Show Them

Customer-facing bugs increased 22% quarter-over-quarter.

Time-to-resolution for those bugs is 30% longer because, as Michelle and Luis pointed out, our senior engineers are drowning in review work and don’t have capacity for rapid firefighting.

Our NPS score dropped 8 points, and when I dug into the qualitative feedback, customers are explicitly calling out “more bugs” and “quality issues.”

Are We Shipping Features Faster or Just Shipping More Bugs Faster?

That’s the question that keeps me up at night now.

From a product perspective, I don’t actually care how many PRs get merged. I care about:

  • Value delivered to customers (not features deployed)
  • Reliability of those features (not volume of code shipped)
  • Time from customer need identification to working solution (not commit-to-deploy time)

And on those metrics? We’re not winning. We might even be losing.

The Metrics Mismatch Is Creating Misalignment

Here’s what scares me: Engineering is getting rewarded for velocity. Our engineering dashboard is green. Managers are hitting their OKRs.

But product is dealing with the downstream consequences:

  • Support tickets increasing
  • Customer trust eroding
  • Technical debt making future features harder to build
  • Firefighting eating into roadmap capacity

We’re optimizing different parts of the system, and they’re working against each other.

What Product Should Actually Be Measuring

Michelle’s question about metrics hits differently from my angle. I think we need to track:

  1. Customer Impact per Engineering Effort - Not “features shipped” but “customer problems solved” weighted by impact
  2. Feature Stability Rate - % of features that work correctly in production for 30 days without significant bugs
  3. Total Cycle Time - From customer need identified to value delivered (not just code commit to deploy)
  4. Technical Debt Service Costs - How much engineering capacity goes to fixing/maintaining vs building new value
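Of these, the Feature Stability Rate is the most mechanical to compute. A minimal sketch under the 30-day definition above—the data shape is an assumption; real inputs would come from your issue tracker:

```python
# Sketch: Feature Stability Rate = share of features with no significant bug
# reported in their first 30 days in production. Data shape is assumed.
from datetime import date, timedelta

def feature_stability_rate(features, today):
    """features: dicts with 'shipped' (date) and 'bug_dates' (list of dates)."""
    window = timedelta(days=30)
    # Only count features old enough to have completed their 30-day window.
    eligible = [f for f in features if today - f["shipped"] >= window]
    if not eligible:
        return None  # nothing has aged 30 days yet
    stable = sum(
        1 for f in eligible
        if not any(f["shipped"] <= d <= f["shipped"] + window
                   for d in f["bug_dates"])
    )
    return stable / len(eligible)

features = [
    {"shipped": date(2024, 1, 1), "bug_dates": [date(2024, 1, 10)]},  # unstable
    {"shipped": date(2024, 1, 5), "bug_dates": []},                   # stable
]
print(feature_stability_rate(features, date(2024, 3, 1)))  # 0.5
```

The hard part isn’t the arithmetic, it’s agreeing on what counts as a “significant” bug—that definition has to come from product and engineering together.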

Right now, I’m celebrating velocity wins while customers experience quality losses. That’s a losing strategy.

The Uncomfortable Business Reality

If AI tools let us ship 40% faster but introduce 22% more bugs that take 30% longer to fix… the math doesn’t work in our favor.

Fast feedback loops are only valuable if we’re iterating on fundamentally sound code. Otherwise, we’re just accumulating technical debt at an accelerated pace.
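A back-of-the-envelope model makes that arithmetic concrete. The 40/22/30 figures are from above; the 35% baseline firefighting share is an assumed, illustrative number, not measured data:

```python
# Back-of-the-envelope model. The first three numbers are from the thread;
# baseline_fix_share is an assumption for illustration only.
feature_speedup = 1.40  # shipping 40% faster
bug_increase = 1.22     # 22% more customer-facing bugs
fix_slowdown = 1.30     # fixes take 30% longer

# Total bug-fix workload scales with (bug count) x (time per fix).
fix_load_multiplier = bug_increase * fix_slowdown
print(f"Bug-fix workload: {fix_load_multiplier:.2f}x")  # 1.59x

# If firefighting consumed 35% of engineering capacity before (assumption),
# how much capacity is left for new feature work now?
baseline_fix_share = 0.35
new_fix_share = baseline_fix_share * fix_load_multiplier
effective_capacity = (1 - new_fix_share) / (1 - baseline_fix_share)
net_throughput = feature_speedup * effective_capacity
print(f"Net feature throughput: {net_throughput:.2f}x")  # 0.96x
```

Under those assumptions the bug-fix workload grows ~59% while output grows only 40%, and once firefighting eats the extra capacity, net feature throughput actually dips below baseline—before counting eroded customer trust at all.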

How are other product teams handling this? What metrics are you tracking to ensure you’re measuring business impact, not just engineering activity?

okay this thread is giving me flashbacks and I need to share because I think I’ve been part of the problem 😬

The AI Code Review Cognitive Load Is Different

Working on our design system, I review component PRs regularly. When reviewing human-written code, I can usually spot patterns—“oh, this developer always handles edge cases this way” or “their naming conventions are consistent.”

With AI-generated code, I can’t trust ANY of those shortcuts.

Here’s What Actually Happened Last Month

One of our engineers (using AI heavily) submitted a new accessible dropdown component. The PR looked PERFECT:

  • Clean code structure ✓
  • Tests passing ✓
  • Proper ARIA attributes ✓
  • Documentation included ✓

I approved it after a quick review because it “looked right.”

Two weeks later, our accessibility audit caught it.

The component had subtle keyboard navigation issues that only appeared in specific screen reader + browser combinations. The AI had generated code that passed automated tests but failed real-world accessibility requirements.

I missed it because reviewing AI code is EXHAUSTING. You can’t pattern-match. You can’t trust that “this looks like good code” means “this IS good code.”

The Readability Problem Is Real

Michelle mentioned 3x more readability issues—I’m seeing this constantly. AI-generated code often:

  • Uses verbose variable names that sound descriptive but aren’t
  • Implements patterns that are “technically correct” but inconsistent with our codebase style
  • Adds comments that restate what the code does rather than explaining WHY

Six months from now, when someone needs to modify this code, they’ll struggle because the intent isn’t clear. It looks like it was written by someone who doesn’t understand our design system philosophy.

The Uncomfortable Confession

I’ve started rubber-stamping AI PRs when I’m tired or behind on reviews.

The cognitive effort to thoroughly review AI code when you have 10 PRs in your queue? I just… sometimes I look at the tests, see they pass, and approve it.

I know that’s bad. I know I’m letting quality slip. But Luis’s point about senior engineers spending 60% of their time on reviews? That’s real, and we don’t have infinite capacity.

Am I The Only One Who Feels Less Confident Reviewing AI Code?

Like, with human code, if I miss something in review, it’s usually because the logic was genuinely subtle. I learn from it.

With AI code, I miss things because I’m exhausted from the cognitive load of not being able to trust patterns. And I don’t learn anything useful because the mistakes are random, not systematic.

What Would Help

David’s point about feature stability metrics resonates. Maybe we need:

  • AI code audit cycles - Systematic review of AI-generated code 30 days post-merge
  • Different review checklists - Specific things to check in AI vs human code
  • AI code readability standards - Force AI-generated code through readability linters before review
  • Smaller PR size limits - Michelle mentioned this, but I think it’s critical for AI code
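The 30-day audit cycle, at least, is easy to mechanize. A minimal sketch, assuming merged PRs carry labels and a merge date—the field names and “ai-assisted” label are illustrative assumptions:

```python
# Sketch: pick AI-assisted PRs merged at least 30 days ago for a post-merge
# audit. Field names and label are assumptions; real data would come from
# your Git host's API.
from datetime import date, timedelta

def audit_candidates(merged_prs, today, window_days=30):
    """Return AI-assisted PRs merged >= window_days ago and not yet audited."""
    cutoff = today - timedelta(days=window_days)
    return [
        pr for pr in merged_prs
        if "ai-assisted" in pr["labels"]
        and pr["merged"] <= cutoff
        and not pr.get("audited", False)
    ]

prs = [
    {"labels": {"ai-assisted"}, "merged": date(2024, 1, 1)},   # due for audit
    {"labels": {"ai-assisted"}, "merged": date(2024, 2, 25)},  # too recent
    {"labels": set(), "merged": date(2024, 1, 1)},             # human-written
]
print(len(audit_candidates(prs, date(2024, 3, 1))))  # 1
```

Running this weekly and pulling two or three candidates per cycle keeps the audit load bounded while still catching the issues that slipped past a tired reviewer.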

I don’t want to slow down the team. But I also can’t keep approving code I don’t fully trust just to hit velocity metrics.

How are other folks handling the review cognitive load? Any strategies that actually work?