46% of Code is AI-Generated, But PRs Have 1.7× More Issues. Are We Measuring Output or Outcome?

Last quarter, my team at our financial services company merged 60% more pull requests than the previous quarter. Our VP of Product sent a congratulatory Slack message. Engineering leadership celebrated the productivity gains. We were crushing it.

Then our incident rate went up 35%. Production bugs that should’ve been caught in review started slipping through. Our most senior engineers were spending entire afternoons just reviewing AI-generated code. Something wasn’t adding up.

The Output Paradox

Here’s what the data is telling us in early 2026:

  • AI-generated code now represents roughly 46% of all code being written, or nearly half
  • Developers using AI assistants can merge approximately 60% more PRs
  • Companies achieving full AI adoption see median cycle time drop by 24% (from 16.7 to 12.7 hours)

Sounds incredible, right? We’re shipping faster than ever.

But here’s the part that should make every engineering leader pause:

  • AI-generated code produces 1.7× more issues overall - 10.83 issues per PR versus 6.45 for human code
  • Logic and correctness errors occur 1.75× more frequently
  • Security vulnerabilities rise 1.57×
  • PR volume increased 20% year-over-year, but incidents per pull request jumped 23.5%

Source: CodeRabbit’s State of AI vs Human Code Generation Report
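
A quick arithmetic check on the figures above, as a sketch rather than anything from the report itself. It assumes the 20% growth in PR volume and the 23.5% rise in incidents per PR describe the same population and time window, which the summary here does not confirm. The variable names are mine.

```python
# Figures quoted above from the report summary.
ai_issues_per_pr = 10.83
human_issues_per_pr = 6.45

# Ratio behind the "1.7x more issues" headline.
issue_ratio = ai_issues_per_pr / human_issues_per_pr

# ASSUMPTION: both growth figures cover the same PRs and the same period.
pr_volume_growth = 0.20          # PR volume up 20% year over year
incidents_per_pr_growth = 0.235  # incidents per PR up 23.5%

# Total incidents scale with (number of PRs) x (incidents per PR),
# so the two growth rates multiply rather than add.
total_incident_growth = (1 + pr_volume_growth) * (1 + incidents_per_pr_growth) - 1

print(f"issue ratio: {issue_ratio:.2f}x")
print(f"implied growth in total incidents: {total_incident_growth:.1%}")
```

If that assumption holds, the two effects compound to roughly 48% more incidents in absolute terms, which is a bigger number than either headline figure suggests on its own.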

The Trust Problem

Even more telling than the metrics is the human reaction:

  • Only 3% of developers highly trust AI-generated code
  • 75% of developers won’t merge AI code without manual review
  • 46% of developers say they don’t fully trust AI outputs

Source: Panto AI Coding Statistics

This creates a paradox: AI is generating code faster than we can write it ourselves, but we don’t trust it enough to merge without careful human review. So who’s really more productive?

What Are We Actually Measuring?

This is where I think we’re making a fundamental mistake as an industry. We’re optimizing for the wrong metrics:

What we’re measuring:

  • PRs merged per week
  • Lines of code changed
  • Cycle time from commit to merge
  • AI adoption percentage

What we should be measuring:

  • Production incidents per 1000 lines of code
  • Mean time to detection (MTTD) and mean time to resolution (MTTR)
  • Code review effectiveness (issues caught in review vs production)
  • Customer-impacting bugs per release
  • Time senior engineers spend fixing AI-generated issues

McKinsey found that AI coding tools reduce time spent on routine coding tasks by an average of 46%. But that same study found that time spent on code review increased by 12% when developers didn’t adequately verify AI-generated code before submitting it.

Source: Second Talent AI Code Quality Metrics

We’re not measuring the downstream costs.

The Real Productivity Question

When my team shipped 60% more PRs last quarter, here’s what happened that didn’t show up in our velocity charts:

  1. Our senior staff engineers started spending 15+ hours per week just on code review
  2. We had to add a “senior review gate” for any AI-generated code touching payment systems
  3. Three production incidents traced back to AI-generated edge case handling failures
  4. Our junior engineers stopped learning debugging skills because “the AI will just rewrite it”

So did we actually become 60% more productive? Or did we shift the work from writing code to reviewing, debugging, and fixing code?

A Challenge to Engineering Leaders

I’m not anti-AI. We’re still using AI coding assistants every day. But I think we need to fundamentally rethink how we measure productivity in this new era.

Instead of celebrating “60% more PRs merged,” we should be asking:

  • Are we shipping features customers actually want?
  • Is our code more reliable or less reliable than last quarter?
  • Are we building sustainable systems or creating technical debt at scale?
  • Are we developing our junior engineers’ skills or atrophying them?
  • Is our engineering team more effective at delivering business outcomes?

What Should We Track Instead?

I’m working on defining a new set of “AI-era engineering effectiveness metrics” for my team. Here’s my draft:

  1. Quality-adjusted velocity: PRs merged × (1 - production incident rate)
  2. Review effectiveness ratio: Issues caught in review / total issues
  3. AI code quality score: Track issue density specifically for AI-generated vs human code
  4. Senior engineer time distribution: How much time spent on review vs building?
  5. Business outcome metrics: Time to resolve customer issues, feature adoption rates, system uptime
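
To make the first two draft formulas concrete, here is a minimal Python sketch. The draft does not define "production incident rate", so this assumes it means the fraction of merged PRs later linked to a production incident. The `QuarterStats` type, its field names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QuarterStats:
    """Hypothetical per-quarter inputs; field names are illustrative."""
    prs_merged: int
    prs_causing_incidents: int      # merged PRs later linked to an incident
    issues_caught_in_review: int
    issues_found_in_production: int


def quality_adjusted_velocity(s: QuarterStats) -> float:
    """PRs merged x (1 - production incident rate), as in the draft."""
    if s.prs_merged == 0:
        return 0.0
    # ASSUMPTION: incident rate = share of merged PRs that caused an incident.
    incident_rate = s.prs_causing_incidents / s.prs_merged
    return s.prs_merged * (1 - incident_rate)


def review_effectiveness(s: QuarterStats) -> float:
    """Issues caught in review / total issues (review + production)."""
    total = s.issues_caught_in_review + s.issues_found_in_production
    return s.issues_caught_in_review / total if total else 1.0


# Made-up numbers, for illustration only.
q = QuarterStats(prs_merged=400, prs_causing_incidents=20,
                 issues_caught_in_review=90, issues_found_in_production=30)
print(f"quality-adjusted velocity: {quality_adjusted_velocity(q):.1f}")
print(f"review effectiveness: {review_effectiveness(q):.2f}")
```

Under this definition the first metric reduces to "PRs that did not cause an incident", so a team that wants severity to count would need to weight incidents rather than just count them.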

But honestly, I’m still figuring this out. That’s why I’m posting here.

How are you measuring engineering productivity in the AI era? Are you still tracking merge velocity, or have you shifted to outcome-based metrics? What’s working—and what’s not?

For those of you who’ve been through this transition: What metrics helped you make the case to leadership that “60% more PRs” doesn’t necessarily mean “60% more value”?

I’d love to hear your perspectives, especially from other engineering leaders navigating this same challenge.


Luis Rodriguez
Director of Engineering
Austin, TX

Luis, this resonates so much with my experience in design systems—and honestly, with my failed startup.

When we were building our SaaS product, we had this amazing design tool that could generate pixel-perfect components from Figma in seconds. Leadership loved it. Our engineering team was shipping UI faster than ever.

But here’s what we learned the hard way: “working” doesn’t mean “good.”

Those auto-generated components technically functioned. They passed our visual regression tests. They rendered correctly on different screen sizes. But the UX was terrible. The accessibility was an afterthought. The user flows made sense to the AI but confused actual humans.

The Quality Debt Compounds

The parallel to your AI code quality issue is striking:

  • You’re seeing 1.7× more issues in AI-generated code
  • We saw 2× more user complaints on features built with AI-generated UI components
  • Both look fine at first glance
  • Both fail under real-world stress

The problem with quality debt is it compounds over time. Every AI-generated component we shipped became a maintenance burden. Every “fast” sprint became a slow quarter of cleanup.

We eventually had to do a massive refactor of our entire component library. It took 3 months. It’s one of the reasons we ran out of runway.

The Visual Analogy

Here’s how I think about the AI productivity paradox:

It’s like using AI to generate pixel-perfect mockups that have terrible UX. Sure, you shipped 60% more screens. But did you actually solve the user’s problem? Did you make the experience better?

Your engineering team is shipping 60% more PRs. But are you delivering 60% more customer value? Or are you just moving the problems downstream—from writing code to reviewing it, from dev to production, from engineering cost to customer frustration?

Are We Creating Technical Debt at Scale?

The question that haunts me from my startup failure: Are we optimizing for the wrong thing?

In design, we learned that shipping fast means nothing if users can’t accomplish their goals. A beautiful, broken interface is still broken.

In your case: Does it matter that you merged 60% more PRs if your incident rate went up 35%?

What scares me is that with AI, we can now create bad code (or bad design) at scale. We’re industrializing technical debt.

And just like my startup, by the time you realize you’re building on a shaky foundation, you’ve already committed months of work and significant resources.


Maya Rodriguez
Design Systems Lead
Austin, TX

Luis, I’m living this exact challenge right now at my EdTech startup.

Last year, we went “AI-first” with development. Our board loved the productivity narrative. Engineering velocity became our North Star metric. For about 90 days, it felt like magic.

Then Reality Hit

Three months into our AI-first approach:

  • Production incidents up 40%
  • Customer support tickets up 28% (specifically about bugs and unexpected behavior)
  • Senior engineer retention became a concern because they were spending 60%+ of their time on code review instead of building
  • Junior engineers stopped learning - why debug when the AI will just rewrite it?

The metrics we celebrated in board meetings (PRs merged, velocity, cycle time) told a completely different story than the metrics our customers experienced (bugs, downtime, time to fix issues).

The Hidden Tax

Here’s what didn’t show up in our velocity dashboards:

We had to hire two additional staff engineers just to handle the code review load. The cost of those hires basically wiped out any efficiency gains from AI coding assistants.

Think about that: We celebrated being “60% more productive” while simultaneously needing 20% more senior engineering capacity just to review all that AI-generated code.

The Junior Engineer Problem

But the part that keeps me up at night is the impact on junior developers.

In the past 6 months, I’ve noticed our junior engineers are struggling with debugging and problem-solving. When they hit an error, their first instinct is to paste it into Claude or ChatGPT and accept whatever code it suggests.

They’re not learning:

  • How to read stack traces
  • How to use a debugger
  • How to reason about code behavior
  • How to think through edge cases

We’re creating a generation of engineers who can prompt AI but can’t debug production issues.

What happens when they need to become senior engineers? When they need to review AI-generated code themselves? How will they know what’s correct if they never learned to evaluate code quality?

What We’re Measuring Now

After that painful Q3, I completely revamped our engineering metrics. We still track velocity, but it’s no longer the primary metric. Instead, we focus on:

  1. Mean Time to Recovery (MTTR) - How quickly can we fix production issues?
  2. Production incident rate per 1000 LOC - Quality-adjusted velocity
  3. Code review cycle time - Are we catching issues before production?
  4. Customer-impacting bugs per release - The ultimate outcome metric
  5. Senior engineer time allocation - Are we using our most expensive resources effectively?
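
For anyone who wants to compute the first two of these from raw data, a rough sketch follows. It assumes each incident is recorded as a (detected, resolved) timestamp pair and that "LOC" means lines shipped in the period. Both are my assumptions, and the function names and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - detected) across incidents."""
    if not incidents:
        return timedelta(0)
    total = sum((resolved - detected for detected, resolved in incidents),
                timedelta(0))
    return total / len(incidents)


def incidents_per_kloc(incident_count: int, lines_shipped: int) -> float:
    """Production incidents per 1000 lines of code shipped."""
    if lines_shipped == 0:
        return 0.0
    return incident_count / (lines_shipped / 1000)


# Made-up incidents, for illustration only.
sample = [
    (datetime(2026, 1, 5, 9, 0), datetime(2026, 1, 5, 11, 0)),    # 2 hours
    (datetime(2026, 1, 12, 14, 0), datetime(2026, 1, 12, 18, 0)),  # 4 hours
]
print(mean_time_to_recovery(sample))   # 3:00:00
print(incidents_per_kloc(6, 48_000))   # 0.125
```

Normalizing by lines shipped keeps the incident rate comparable across quarters with very different output, which is the point of calling it quality-adjusted.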

The Honest Conversation with Leadership

The hardest part was going to our board and saying: “The velocity story we’ve been celebrating? It’s incomplete.”

Here’s what convinced them:

  • Customer NPS correlation: Our NPS dropped 8 points during our “most productive quarter”
  • Support cost increase: Customer support costs up 22% due to increased bug reports
  • Engineering retention risk: Two senior engineers in exit interviews cited “endless code review” as a reason for leaving

When you translate “60% more PRs” into “40% more incidents, 28% more support tickets, and at-risk senior engineer retention,” the productivity narrative falls apart.

Moving Forward

We still use AI coding assistants. But we now have:

  • Mandatory human review gates for any AI-generated code touching critical paths
  • “Learning time” for junior engineers - dedicated time for debugging and problem-solving without AI assistance
  • Quality metrics balanced with velocity in our engineering dashboards
  • Clear ownership rules - AI can suggest, but humans own the outcome

Luis, your proposed metrics framework is spot-on. The key is tying engineering effectiveness to business outcomes, not just code output.

Are we shipping features customers want? Are systems more reliable? Are we developing sustainable engineering capabilities?

Those are the questions that actually matter.


Keisha Johnson
VP of Engineering
Atlanta, GA

Coming at this from the Product side, and honestly, this is the conversation I’ve been trying to have with our Engineering team for the past 6 months.

Customers Don’t Care About PR Velocity

Here’s what our customers care about:

  • Does the feature work reliably?
  • How quickly do we fix bugs when they report them?
  • Is the product experience improving or degrading?
  • Can they trust our platform for their critical workflows?

None of those map to “PRs merged per week.”

The Disconnect I’m Seeing

At my fintech startup, we had a fascinating (and frustrating) disconnect last year:

Engineering’s narrative:

  • “We shipped 3× more features in Q3 than Q2!”
  • “AI coding assistants are transforming our productivity!”
  • “Look at our velocity charts!”

Product’s reality:

  • Customer NPS dropped from 42 to 34
  • Support ticket volume up 45%
  • Three of our largest enterprise customers flagged “reliability concerns”
  • Feature adoption rates actually decreased despite more launches

We were celebrating in engineering stand-ups while I was fielding frustrated customer calls.

The Feature Factory Trap

Here’s what I think happens when we optimize for output over outcome:

We become a feature factory instead of a value creation engine.

We shipped 3× more features with AI assistance. But:

  • 40% of those features had less than 5% adoption (customers didn’t want or need them)
  • 25% of those features had critical bugs in the first 2 weeks post-launch
  • Our engineering team spent so much time shipping that we stopped doing discovery work—no time to validate what customers actually needed

We were building faster, but we weren’t building the right things.

And when we did build the right things, the quality issues meant customers didn’t trust them.

The Metric That Changed the Conversation

What finally got engineering and product aligned was tying engineering metrics directly to business outcomes.

We started tracking:

  1. Feature Adoption Rate - What % of target users actually use what we ship?
  2. Time to First Value - How long until a customer gets value from a feature? (Bugs delay this)
  3. Support Tickets per Feature Launch - Quality indicator tied to customer pain
  4. Customer Trust Score - Surveying enterprise customers specifically on reliability
  5. NPS Correlation - Does our “most productive sprint” correlate with better customer sentiment?
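
Two of these are simple enough to sketch in a few lines of Python. NPS below uses the standard convention (promoters score 9 or 10, detractors 0 through 6). The adoption function follows the definition in item 1. The function names, user IDs, and survey scores are hypothetical.

```python
def net_promoter_score(scores: list[int]) -> float:
    """Standard NPS: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("need at least one survey response")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)


def feature_adoption_rate(active_users: set[str], target_users: set[str]) -> float:
    """Share of target users who actually used the feature."""
    if not target_users:
        return 0.0
    # Only count usage by people the feature was built for.
    return len(active_users & target_users) / len(target_users)


# Made-up survey responses and user IDs, for illustration only.
survey = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]
print(net_promoter_score(survey))                                   # 20.0
print(feature_adoption_rate({"a", "c", "z"}, {"a", "b", "c", "d"}))  # 0.5
```

Intersecting with the target set matters: usage from accounts outside the intended audience would otherwise inflate the adoption number.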

When we put these on the same dashboard as “PRs merged,” the story changed completely.

The Business Case Against Pure Velocity

For any engineering leader trying to make this case to their CEO/board, here’s the business framing that worked for us:

Old framing:
“We’re 60% more productive with AI coding assistants!”

New framing:
“We’re shipping faster with AI, but here’s the total cost:

  • Support costs up 25% due to increased bug reports
  • NPS down 8 points, risking renewal rates
  • 2 senior engineers spent 80% of time on code review vs building
  • Churn risk increased for our enterprise segment due to reliability concerns”

Net result: The velocity gains cost us more in downstream impacts than they saved in development time.

The Product-Engineering Partnership We Need

Luis, I love your proposed metrics framework. Can I suggest one addition?

Customer Outcome Metrics: For every major feature, define the customer outcome we’re trying to drive and measure whether we achieved it.

Because ultimately, customers don’t buy “PRs merged.” They buy outcomes:

  • Faster time to complete their workflows
  • More reliable systems
  • Better insights from their data
  • Lower costs for their operations

If we’re shipping 60% more code but customers aren’t experiencing 60% more value, we’re optimizing the wrong thing.


One last thought: The best Product-Engineering partnerships I’ve seen measure success the same way—by customer and business outcomes, not by activity metrics.

When Product and Engineering celebrate the same wins (customer value delivered) instead of different wins (features shipped vs velocity), that’s when magic happens.


David Okafor
VP of Product
New York, NY

Maya, Keisha, David—thank you all for these perspectives. This is exactly the conversation we need to be having.

The Pattern is Clear

Reading through your experiences, I’m seeing the same pattern emerge:

Initial phase: AI coding assistants → velocity surge → leadership celebration

Reality phase (3-6 months later):

  • Quality issues surface (Maya’s UX problems, my 35% incident increase, Keisha’s 40% incident spike)
  • Customer impact becomes visible (David’s NPS drop, Keisha’s support ticket increase)
  • Senior engineer burden explodes (Keisha hiring 2 more staff engineers, my team’s 15+ hours/week of review time)
  • Learning and skill development suffers (Keisha’s junior engineers not learning debugging)

The insight: Velocity ≠ Value

What We’re Testing Now

Based on this discussion and my own experiments, here’s what my team is piloting:

1. “AI-Assisted” Metric Tracking

We’re tracking not just AI code generation time, but the full cycle:

  • Time saved in initial coding
  • Time added in code review
  • Time added in bug fixing
  • Net productivity impact

So far, the data shows we’re about 20% more efficient, not 60%. That’s still good, but it’s honest.
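
The accounting itself is simple; a minimal sketch is below. The function names and the hour figures are hypothetical, picked only so the arithmetic is easy to follow. They are not our actual data.

```python
def net_hours_saved(coding_hours_saved: float,
                    review_hours_added: float,
                    bugfix_hours_added: float) -> float:
    """Time saved writing code minus the time it adds downstream."""
    return coding_hours_saved - review_hours_added - bugfix_hours_added


def net_efficiency_gain(baseline_hours: float,
                        coding_hours_saved: float,
                        review_hours_added: float,
                        bugfix_hours_added: float) -> float:
    """Net saving as a fraction of the pre-AI baseline effort."""
    if baseline_hours <= 0:
        raise ValueError("baseline_hours must be positive")
    net = net_hours_saved(coding_hours_saved, review_hours_added, bugfix_hours_added)
    return net / baseline_hours


# Made-up quarter: 1000 baseline engineering hours.
headline = 350 / 1000                                   # coding time saved only
net = net_efficiency_gain(1000, coding_hours_saved=350,
                          review_hours_added=100, bugfix_hours_added=50)
print(f"headline gain: {headline:.0%}, net gain: {net:.0%}")
```

The gap between the headline and net figures is exactly the review and bug-fix time that velocity charts leave out.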

2. Quality Gates Before Merge

We implemented what Keisha mentioned—mandatory human review gates for AI-generated code in critical paths:

  • Payment processing → Staff engineer review required
  • Security/auth → Security review required
  • Database migrations → DBA + senior engineer review

This slowed our PR merge rate by about 15%, but our production incident rate dropped by 40%. Worth it.
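
A gate like this can be expressed as a small routing table. The sketch below is one possible shape, not our actual tooling: the path patterns, reviewer group names, and the `ai_generated` flag (however a PR gets tagged that way) are all hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns mapped to hypothetical reviewer groups.
REVIEW_GATES = [
    ("services/payments/*", {"staff-engineer"}),
    ("services/auth/*", {"security"}),
    ("db/migrations/*", {"dba", "senior-engineer"}),
]


def required_reviewers(changed_files: list[str], ai_generated: bool) -> set[str]:
    """Return the reviewer groups that must approve before merge."""
    # ASSUMPTION: gates apply only to PRs tagged as AI-generated.
    if not ai_generated:
        return set()
    required: set[str] = set()
    for path in changed_files:
        for pattern, reviewers in REVIEW_GATES:
            # fnmatchcase's "*" also matches "/", so nested paths are covered.
            if fnmatchcase(path, pattern):
                required |= reviewers
    return required


pr_files = ["services/payments/refund.py", "db/migrations/0042_add_index.sql"]
print(sorted(required_reviewers(pr_files, ai_generated=True)))
# ['dba', 'senior-engineer', 'staff-engineer']
```

Keeping the table as data rather than logic makes it easy to extend as new critical paths are identified.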

3. Weekly “AI Debt Review” Sessions

Every Friday, the team reviews:

  • What AI-generated code caused issues this week?
  • What patterns are we seeing in AI mistakes?
  • What should we add to our “human review always” list?

It’s become our most valuable learning session.
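
Even a very small tally helps answer those three questions. This is a sketch assuming each issue is logged as an (area, kind of mistake) pair. The log format, the sample entries, and the threshold of 2 are hypothetical.

```python
from collections import Counter

# Hypothetical log for one week: (area of the codebase, kind of mistake).
week_log = [
    ("payments", "missing edge case"),
    ("payments", "missing edge case"),
    ("auth", "insecure default"),
    ("reporting", "missing edge case"),
    ("payments", "wrong error handling"),
]


def mistake_patterns(log: list[tuple[str, str]]) -> list[tuple[str, int]]:
    """Kinds of mistake, most frequent first."""
    return Counter(kind for _, kind in log).most_common()


def review_always_candidates(log: list[tuple[str, str]], threshold: int = 2) -> list[str]:
    """Areas with at least `threshold` issues this week."""
    by_area = Counter(area for area, _ in log)
    return sorted(area for area, count in by_area.items() if count >= threshold)


print(mistake_patterns(week_log)[0])       # ('missing edge case', 3)
print(review_always_candidates(week_log))  # ['payments']
```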

4. Junior Engineer “No AI Fridays”

Keisha, your point about junior engineers not learning debugging hit hard. We started “No AI Fridays” where junior engineers must solve problems without AI assistance.

There was some initial resistance, but after 6 weeks their problem-solving and debugging skills have noticeably improved.

The Metrics Framework We’re Adopting

Based on everyone’s input, here’s the framework we’re moving toward:

Input Metrics (what we do):

  • AI-assisted code generation time
  • Human review time
  • Total development cycle time

Process Metrics (how we work):

  • Code review effectiveness ratio (from my original draft)
  • Senior engineer time distribution
  • Junior engineer skill development indicators

Output Metrics (what we ship):

  • Features delivered per sprint
  • Quality-adjusted velocity: PRs × (1 - incident rate)

Outcome Metrics (what customers experience):

  • Production incidents per 1000 LOC
  • MTTR for customer-impacting issues
  • Customer NPS and trust scores (David’s framework)
  • Feature adoption rates
  • Support ticket volume per feature

The Leadership Conversation

David, your reframing is perfect for board conversations. I’m presenting to our CTO next week, and I’m going to use exactly this approach:

Not: “We’re 60% more productive!”

But: “AI helped us merge 60% more PRs, but here’s the total system impact when we account for review time, quality issues, and customer outcomes. Net improvement: 20%, with quality controls in place.”

Collaboration Invitation

Would any of you be interested in collaborating on an “AI-Era Engineering Effectiveness Framework”?

I’m thinking:

  • Shared metrics definitions
  • Industry benchmarks as we gather data
  • Best practices for AI-assisted development
  • Case studies (what worked, what didn’t)

If there’s interest, maybe we start a working group here on tianpan? I’d love to turn this conversation into something actionable for the broader engineering leadership community.


Luis Rodriguez
Director of Engineering
Austin, TX