Three months into our company-wide AI coding assistant rollout, I pulled the metrics. Our engineers are merging 98% more pull requests. The board loved that slide.
Then I looked deeper.
The data that didn’t make the board deck:
- Bug rates increased 9% per developer
- Average PR size ballooned 154%
- Code review time increased 91%
- Zero measurable improvement in delivery velocity at the organizational level
We celebrated individual productivity gains while organizational outcomes stayed flat. How did we miss this?
The Productivity Paradox Is Real
Recent research across 10,000+ developers confirms what I'm seeing: AI-authored code produces roughly 1.7x as many problems as human code (10.83 issues per PR versus 6.45). The quality gap isn't subtle.
The breakdown is alarming:
- 75% more logic and correctness errors
- 322% more privilege escalation paths
- 3x more readability issues
- Technical debt increased 30-41% after AI adoption
But here’s what really keeps me up at night: We’re creating a review bottleneck we can’t scale out of.
The Senior Engineer Trap
Our senior engineers are drowning. They’re reviewing nearly twice as many PRs that are 154% larger than before. Review time has almost doubled, yet we expect the same thoroughness for security, architecture, and maintainability.
The math doesn’t work. You can’t double review volume, increase PR size by 154%, and expect to catch the 75% increase in logic errors. Something has to give.
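A rough sanity check makes the squeeze concrete. This sketch uses only the percentages quoted in this post; the "scrutiny per line" ratio is my own derived quantity, not a figure from the research.

```python
# Back-of-envelope check using the figures from this post.
pr_volume_multiplier = 1.98    # 98% more PRs merged per developer
pr_size_multiplier = 2.54      # average PR is 154% larger
review_time_multiplier = 1.91  # review time is up 91%

# Total volume of code flowing through review, relative to before:
content_multiplier = pr_volume_multiplier * pr_size_multiplier

# Review attention available per line of code, relative to before:
scrutiny_per_line = review_time_multiplier / content_multiplier

print(f"Code under review: {content_multiplier:.2f}x")
print(f"Scrutiny per line: {scrutiny_per_line:.0%} of what it was")
```

Roughly 5x the code is passing through review while review hours didn't even double, so each line gets well under half the attention it used to.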
And it is giving—we’re seeing seniors rush reviews or batch them, which means more bugs reach production. The cognitive load of reviewing AI-generated code is different. You can’t trust the patterns. Every line needs active verification.
We’re Measuring the Wrong Things
I think we’ve been optimizing for the wrong metrics. We track:
- PRs merged per developer (up 98%!)
- Lines of code written (up significantly!)
- Time from commit to merge (faster!)
But we’re not tracking:
- Time from feature conception to customer value
- Cost of increased bug density
- Senior engineer bandwidth erosion
- Technical debt accumulation rate
- Long-term codebase maintainability
Our velocity metrics look phenomenal while our ability to deliver stable, maintainable software degrades.
The Uncomfortable Question
Are we shipping faster or just shipping more bugs faster?
At my previous company (Microsoft), we learned that velocity without quality is just waste. Fast feedback loops matter, but only if the code we’re iterating on is fundamentally sound.
AI coding assistants give us faster code generation. But if that code requires 91% more review time, introduces 9% more bugs, and creates 3x more readability issues—are we actually more productive?
Or are we creating a debt bomb that will explode when we try to maintain this code in 12 months?
What We’re Trying
We’re experimenting with a few guardrails:
- “AI-assisted” PR labels - Forces visibility into which PRs used AI heavily
- Size limits on AI-generated PRs - Forcing smaller, more reviewable chunks
- Mandatory senior review for AI-heavy PRs - Accepting that these need deeper scrutiny
- Quality metrics dashboards - Tracking bug density, review time, time-to-fix alongside velocity
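The quality dashboard doesn't need to be fancy to be useful. Here's a minimal sketch of the aggregation behind ours; the record shape and field names are illustrative, not the schema of any real tool, and the numbers are made up.

```python
from statistics import mean

# Hypothetical PR records -- field names and values are illustrative only.
prs = [
    {"ai_assisted": True,  "lines": 900, "bugs_30d": 4, "review_hours": 3.5},
    {"ai_assisted": True,  "lines": 650, "bugs_30d": 3, "review_hours": 2.8},
    {"ai_assisted": False, "lines": 300, "bugs_30d": 1, "review_hours": 1.2},
    {"ai_assisted": False, "lines": 250, "bugs_30d": 1, "review_hours": 1.0},
]

def summarize(group):
    """Bug density and review cost for a cohort of PRs."""
    return {
        "bugs_per_pr": mean(p["bugs_30d"] for p in group),
        "bugs_per_kloc": 1000 * sum(p["bugs_30d"] for p in group)
                              / sum(p["lines"] for p in group),
        "avg_review_hours": mean(p["review_hours"] for p in group),
    }

ai = summarize([p for p in prs if p["ai_assisted"]])
human = summarize([p for p in prs if not p["ai_assisted"]])
```

Putting the AI-assisted and human cohorts side by side is the whole point: velocity numbers alone hide exactly the gap this comparison exposes.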
Early signals suggest that forcing smaller AI-generated PRs helps significantly: reviewability improves, and the bug catch rate goes up.
The Real Metric Question
What should we actually be measuring?
I suspect the answer is something like “value delivered to customers per unit of engineering effort” but that’s notoriously hard to quantify.
Maybe it’s “features shipped with <X% bug rate in first 30 days”?
Or “engineering throughput adjusted for technical debt creation”?
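As a thought experiment, the first of those candidate metrics could be operationalized like this. The 5% threshold and the linear discount are invented for illustration, as is the feature data; the point is only that "features shipped" can be quality-weighted rather than raw-counted.

```python
BUG_RATE_THRESHOLD = 0.05  # illustrative stand-in for the "X%" above

def quality_adjusted_count(features, threshold=BUG_RATE_THRESHOLD):
    """Count a feature fully if its 30-day bug rate is under the threshold;
    otherwise discount it linearly toward zero as the bug rate grows."""
    total = 0.0
    for f in features:
        if f["bug_rate_30d"] <= threshold:
            total += 1.0
        else:
            total += max(0.0, 1.0 - (f["bug_rate_30d"] - threshold))
    return total

# Hypothetical quarter: three features shipped, one of them buggy.
features = [
    {"name": "search",  "bug_rate_30d": 0.02},
    {"name": "export",  "bug_rate_30d": 0.05},
    {"name": "billing", "bug_rate_30d": 0.25},
]
```

On this made-up data the team "shipped 3 features" by the old metric but closer to 2.8 by the adjusted one; a genuinely broken feature would drag the score much further down.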
I don’t have the answer yet. But I’m increasingly convinced that our current metrics are leading us to optimize for activity rather than outcomes.
How are you thinking about AI productivity measurement? What metrics matter in your organization?
Sources: CodeRabbit State of AI vs Human Code, Faros AI Productivity Paradox, Stack Overflow on AI Bugs