I just walked out of an executive meeting where our CFO challenged our entire AI tooling investment. The data he presented was stark: we’ve rolled out AI coding assistants to 85% of engineering, developers self-report 40% efficiency gains and 25% fewer bugs, but our DORA metrics are flat—or worse. Deployment frequency hasn’t budged. Lead time is up 8%. And our change failure rate actually increased 7.2% since AI adoption began.
He asked point-blank: “If developers are so much more productive, why isn’t it showing up in our delivery metrics?”
I didn’t have a good answer.
The Individual vs Team Paradox
Here’s what we’re seeing at the individual level:
- Developers complete 21% more tasks per sprint
- Pull requests merged per developer up 98%
- Self-reported satisfaction with coding tools at all-time high
- Time saved on boilerplate, tests, and refactoring is real and measurable
But at the team and organizational level:
- Delivery stability down 7.2% according to our DORA tracking
- PR review times up 91%, creating a massive bottleneck
- Average PR size increased 154%, making reviews harder
- Bugs per developer up 9%, despite developers self-reporting 25% fewer bugs
The data doesn’t lie, but it also doesn’t make sense.
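One way to start reconciling the two lists is to multiply the numbers together. This is a back-of-envelope sketch, not an analysis of our raw data: it assumes the PR count and PR size figures cover the same population, and that review effort scales roughly with total changed lines. I haven't verified either assumption, and the variable names are only for illustration.

```python
# Figures from the lists above, expressed as multipliers over the pre-AI baseline.
prs_per_dev = 1 + 0.98   # pull requests merged per developer up 98%
avg_pr_size = 1 + 1.54   # average PR size up 154%

# Assumption: the volume of change that needs review scales with
# (number of PRs) x (size of each PR).
review_volume = prs_per_dev * avg_pr_size

print(f"Change volume needing review vs baseline: {review_volume:.2f}x")
```

If that assumption is even roughly right, each developer is now sending about five times the baseline volume of changes into review, while nothing on the list suggests review capacity grew at all.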
Where Is the Disconnect?
I’ve been digging into the research, and the pattern is consistent across the industry. Multiple studies show that while individual developers feel more productive, companies aren’t seeing measurable improvement in delivery velocity or business outcomes when you aggregate the data.
Some hypotheses I’m exploring:
1. We’re measuring the wrong things. DORA metrics reflect team capabilities, not individual code generation speed. Maybe we need new metrics for the AI era that capture creative problem-solving vs code volume.
2. The bottleneck shifted. We optimized one part of the assembly line (code writing) while leaving others untouched (review, testing, deployment). Now we have a massive pile-up at the review stage—senior engineers drowning in PRs they can’t approve fast enough.
3. Quality is degrading invisibly. Research shows AI-assisted code has 23.7% more security vulnerabilities. We’re moving faster, but are we building sustainable systems or creating tomorrow’s technical debt?
4. AI amplifies existing dysfunction. Teams with strong processes see gains. Teams with weak organizational capabilities see more chaos. The research suggests AI acts as an “amplifier”—magnifying whatever you already have, good or bad.
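Hypothesis 2 is the easiest one to reason about with a toy model. The sketch below treats review as a single stage with fixed weekly capacity and tracks the backlog when the inflow of PRs rises. Every number in it is made up for illustration (40 and 60 PRs per week, a capacity of 50), and `simulate_review_queue` is a hypothetical helper, not something from our tooling.

```python
def simulate_review_queue(weeks, prs_opened_per_week, review_capacity_per_week):
    """Return (merged_per_week, backlog_per_week) for one fixed-capacity review stage."""
    backlog = 0
    merged, backlog_history = [], []
    for _ in range(weeks):
        backlog += prs_opened_per_week                   # new PRs join the queue
        done = min(backlog, review_capacity_per_week)    # reviewers can only clear so many
        backlog -= done
        merged.append(done)
        backlog_history.append(backlog)
    return merged, backlog_history


# Hypothetical "before AI" team: inflow sits below review capacity.
merged_before, backlog_before = simulate_review_queue(4, 40, 50)

# Hypothetical "after AI" team: 50% more PRs opened, same reviewers.
merged_after, backlog_after = simulate_review_queue(4, 60, 50)

print("before:", merged_before, backlog_before)
print("after: ", merged_after, backlog_after)
```

In this toy setup, inflow rises 50% but merged PRs rise only from 40 to 50 per week, and the backlog grows by 10 PRs every week. A growing queue means longer waits per PR, which would show up as rising lead time even though everyone upstream is producing more.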
The CFO’s Real Question
What he’s really asking is: “Are we confusing activity with impact?”
Between 29% and 41% of our code is now AI-generated, yet developer productivity is up only 3.6%. That’s a massive efficiency gap. Where did the productivity go?
Churning out more code doesn’t automatically mean delivering more customer value. It might just mean we’re busy writing more code, reviewing more code, and debugging more code, without actually solving more problems or shipping more features.
How Are You Explaining This?
I need to go back to the exec team with a better answer. For those of you facing similar scrutiny:
- How are you measuring AI impact beyond self-reported developer satisfaction?
- Have you seen DORA improvements, or are you experiencing the same paradox?
- What organizational changes did you make to capture individual productivity gains at the team level?
- How do you communicate this complexity to finance and business leaders who just want to see ROI?
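To make comparing notes easier, here is roughly how I think about deriving the delivery metrics from raw deployment records, so that before and after cohorts can be compared on the same basis. This is a minimal sketch under my own assumptions: the `Deployment` record, its field names, and the sample data are all hypothetical, and the definitions are simplified versions of the DORA ones (for example, lead time is measured here from first commit to production deploy).

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Deployment:
    committed_at: datetime   # first commit of the change (assumed definition)
    deployed_at: datetime    # when the change reached production
    caused_failure: bool     # needed a rollback, hotfix, or incident response


def dora_summary(deployments, period_days):
    """Compute three DORA-style metrics; assumes a non-empty list of deployments."""
    lead_times_s = [
        (d.deployed_at - d.committed_at).total_seconds() for d in deployments
    ]
    failures = sum(d.caused_failure for d in deployments)
    return {
        "deploys_per_week": len(deployments) / (period_days / 7),
        "median_lead_time_hours": median(lead_times_s) / 3600,
        "change_failure_rate": failures / len(deployments),
    }


# Made-up sample: four deployments over a 14-day window.
sample = [
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 2, 9), False),
    Deployment(datetime(2024, 1, 3, 9), datetime(2024, 1, 5, 9), False),
    Deployment(datetime(2024, 1, 6, 9), datetime(2024, 1, 6, 21), True),
    Deployment(datetime(2024, 1, 8, 9), datetime(2024, 1, 11, 9), False),
]

summary = dora_summary(sample, period_days=14)
print(summary)
```

Running the same function over a pre-adoption window and a post-adoption window, and ideally over teams with high and low AI usage, gives a like-for-like comparison that doesn't depend on self-reporting.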
I’m open to the possibility that AI tools aren’t delivering on the productivity promise—or that we’re implementing them wrong. But I also suspect we’re dealing with a measurement and organizational design problem, not just a tooling problem.
What’s your experience?