Last quarter, my team at our financial services company merged 60% more pull requests than the previous quarter. Our VP of Product sent a congratulatory Slack message. Engineering leadership celebrated the productivity gains. We were crushing it.
Then our incident rate went up 35%. Production bugs that should’ve been caught in review started slipping through. Our most senior engineers were spending entire afternoons just reviewing AI-generated code. Something wasn’t adding up.
The Output Paradox
Here’s what the data is telling us in early 2026:
- AI-generated code now represents nearly 50% of all code being written
- Developers using AI assistants can merge approximately 60% more PRs
- Companies achieving full AI adoption see median cycle time drop by 24% (from 16.7 to 12.7 hours)
Sounds incredible, right? We’re shipping faster than ever.
But here’s the part that should make every engineering leader pause:
- AI-generated code produces about 1.7× as many issues overall: 10.83 issues per PR versus 6.45 for human-written code
- Logic and correctness errors occur 1.75× more frequently
- Security vulnerabilities rise 1.57×
- PR volume increased 20% year-over-year, but incidents per pull request jumped 23.5%
Source: CodeRabbit’s State of AI vs Human Code Generation Report
The Trust Problem
Even more telling than the metrics is the human reaction:
- Only 3% of developers highly trust AI-generated code
- 75% of developers won’t merge AI code without manual review
- 46% of developers say they don’t fully trust AI outputs
Source: Panto AI Coding Statistics
This creates a paradox: AI is generating code faster than we can write it ourselves, but we don’t trust it enough to merge without careful human review. So who’s really more productive?
What Are We Actually Measuring?
This is where I think we’re making a fundamental mistake as an industry. We’re optimizing for the wrong metrics:
What we’re measuring:
- PRs merged per week
- Lines of code changed
- Cycle time from commit to merge
- AI adoption percentage
What we should be measuring:
- Production incidents per 1000 lines of code
- Mean time to detection (MTTD) and mean time to resolution (MTTR)
- Code review effectiveness (issues caught in review vs production)
- Customer-impacting bugs per release
- Time senior engineers spend fixing AI-generated issues
McKinsey found that AI coding tools reduce time spent on routine coding tasks by an average of 46%. But that same study found that time spent on code review increased by 12% when developers didn’t adequately verify AI-generated code before submitting it.
Source: Second Talent AI Code Quality Metrics
We’re not measuring the downstream costs.
The Real Productivity Question
When my team shipped 60% more PRs last quarter, here’s what happened that didn’t show up in our velocity charts:
- Our senior staff engineers started spending 15+ hours per week just on code review
- We had to add a “senior review gate” for any AI-generated code touching payment systems
- Three production incidents traced back to AI-generated edge case handling failures
- Our junior engineers stopped learning debugging skills because “the AI will just rewrite it”
So did we actually become 60% more productive? Or did we shift the work from writing code to reviewing, debugging, and fixing code?
A Challenge to Engineering Leaders
I’m not anti-AI. We’re still using AI coding assistants every day. But I think we need to fundamentally rethink how we measure productivity in this new era.
Instead of celebrating “60% more PRs merged,” we should be asking:
- Are we shipping features customers actually want?
- Is our code more reliable or less reliable than last quarter?
- Are we building sustainable systems or creating technical debt at scale?
- Are we developing our junior engineers’ skills or atrophying them?
- Is our engineering team more effective at delivering business outcomes?
What Should We Track Instead?
I’m working on defining a new set of “AI-era engineering effectiveness metrics” for my team. Here’s my draft:
- Quality-adjusted velocity: PRs merged × (1 - production incident rate)
- Review effectiveness ratio: Issues caught in review / total issues
- AI code quality score: Track issue density specifically for AI-generated vs human code
- Senior engineer time distribution: How much time is spent on review vs. building?
- Business outcome metrics: Time to resolve customer issues, feature adoption rates, system uptime
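To make the first two draft metrics concrete, here is a minimal sketch. It assumes "production incident rate" means the fraction of merged PRs later linked to a production incident (a value between 0 and 1), and that "total issues" means issues caught in review plus issues found in production. The class, field, and function names are hypothetical, and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class QuarterStats:
    """Hypothetical per-quarter counts; field names are illustrative only."""
    prs_merged: int
    prs_linked_to_incidents: int      # merged PRs later tied to a production incident
    issues_caught_in_review: int
    issues_found_in_production: int


def quality_adjusted_velocity(stats: QuarterStats) -> float:
    """PRs merged x (1 - production incident rate).

    Assumes the incident rate is the share of merged PRs linked to an incident.
    """
    if stats.prs_merged == 0:
        return 0.0
    incident_rate = stats.prs_linked_to_incidents / stats.prs_merged
    return stats.prs_merged * (1 - incident_rate)


def review_effectiveness_ratio(stats: QuarterStats) -> float:
    """Issues caught in review / total issues (review + production)."""
    total = stats.issues_caught_in_review + stats.issues_found_in_production
    if total == 0:
        return 0.0
    return stats.issues_caught_in_review / total


# Made-up numbers, not real team data.
before = QuarterStats(prs_merged=100, prs_linked_to_incidents=4,
                      issues_caught_in_review=90, issues_found_in_production=10)
after = QuarterStats(prs_merged=160, prs_linked_to_incidents=12,
                     issues_caught_in_review=120, issues_found_in_production=40)

for label, quarter in (("before", before), ("after", after)):
    print(label,
          round(quality_adjusted_velocity(quarter), 1),
          round(review_effectiveness_ratio(quarter), 2))
```

One caveat with this reading of the formula: it reduces to a count of merged PRs that were not linked to an incident, so it ignores incident severity. A severity-weighted rate would be a different design choice.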
But honestly, I’m still figuring this out. That’s why I’m posting here.
How are you measuring engineering productivity in the AI era? Are you still tracking merge velocity, or have you shifted to outcome-based metrics? What’s working, and what’s not?
For those of you who’ve been through this transition: What metrics helped you make the case to leadership that “60% more PRs” doesn’t necessarily mean “60% more value”?
I’d love to hear your perspectives, especially from other engineering leaders navigating this same challenge.
Luis Rodriguez
Director of Engineering
Austin, TX