We’re sitting in this bizarre moment where AI is generating 41% of all new code in 2026, and yet when you look at actual organizational productivity gains, we’re seeing a plateau around 10%. That’s it. Not the 20-55% that the early GitHub/Google/Microsoft studies promised.
But here’s what really keeps me up at night as a product leader: We might be measuring the wrong things entirely.
The Perception Gap Is Real
The METR study stopped me cold. They ran a proper RCT with 16 experienced open-source developers—people with an average of 5 years on their own repositories. The results?
- Developers using AI took 19% longer to complete tasks
- But they believed they had been sped up by about 20% (after forecasting a 24% boost going in)
- That’s a 39-point perception gap between feeling and reality: a 19% slowdown experienced as a 20% speedup
Think about what this means for how we’re evaluating AI tools today. If developers feel faster but are slower, and we’re making tool adoption decisions based on subjective feedback… we’re flying blind.
Three Levels of Measurement, Three Different Stories
I’ve started thinking about this in layers:
Individual Developer Level:
- Time saved writing code: Real (3.6 hours/week average)
- Feeling of productivity: Real (developers report a 10-30% boost)
- Actual task completion speed: Mixed evidence (METR says slower, other studies say faster)
Team Level:
- Sprint velocity: Mostly unchanged despite AI adoption
- Code review burden: Increasing (66% say AI code is “almost right, but not quite”)
- Technical debt accumulation: AI-assisted code has 1.7× more issues
- DORA metrics: No meaningful improvement at most orgs
Business Level:
- Time to market: Unknown
- Cost per feature: Unknown
- Revenue impact: Only 33% of decision-makers link AI to financial growth
The CFO Problem
Here’s the business reality: CFOs are deferring 25% of AI investments to 2027 because they can’t see the ROI. And honestly? I don’t blame them.
If I’m being asked to justify the annual spend on AI coding tool licenses, what’s my business case? “Developers feel more productive” isn’t going to cut it. “We’re generating more code” isn’t a win if that code needs more debugging time.
What I need is:
- Time to market improvement for new features
- Cost reduction per delivered capability (a rough sketch of that math follows this list)
- Customer impact metrics (faster bug fixes, more features shipped)
- Risk metrics (security vulnerabilities, production incidents)
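To make “cost reduction per delivered capability” concrete, here’s a minimal sketch of the arithmetic I’d want to run, assuming nothing more than a count of shipped capabilities and a fully loaded engineering cost. Every number and name in it (the function, the tool spend, the feature counts) is a hypothetical placeholder, not a figure from any study cited here.

```python
# Hypothetical back-of-the-envelope cost-per-feature comparison.
# All inputs are placeholders; substitute your own org's numbers.

def cost_per_feature(annual_tool_cost: float,
                     annual_eng_cost: float,
                     features_shipped: int) -> float:
    """Fully loaded cost of one delivered capability over a year."""
    return (annual_tool_cost + annual_eng_cost) / features_shipped

# Baseline year vs. a year with AI coding tools (all numbers hypothetical).
baseline = cost_per_feature(0, 3_000_000, 40)
with_ai = cost_per_feature(150_000, 3_000_000, 44)
print(f"baseline: ${baseline:,.0f}/feature, with AI: ${with_ai:,.0f}/feature")
```

Even in this toy form, the point stands: tool spend only means something relative to delivered capabilities and the engineering cost around them, not relative to how much code got generated.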
So What Should We Be Measuring?
I think we need a framework that acknowledges the complexity:
- Code generation speed (individual metric, short-term)
- Code durability (team metric, medium-term): how long does it survive in production without modification? (A rough sketch of one way to probe this follows the list.)
- Review efficiency (team metric, short-term): are we spending more time fixing AI suggestions than we saved writing them?
- Production quality (business metric, long-term): incident rates, security vulnerabilities, customer-facing bugs
- Developer capability (organizational metric, long-term): are we building or eroding engineering skills?
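For code durability specifically, here’s a rough sketch of one way to probe it from git history alone: for a given commit, what fraction of the lines it added are still attributed to that commit by `git blame` at HEAD? This assumes you have your own way of flagging which commits were AI-assisted (a commit trailer, a PR label); that convention, and the helper names below, are illustrative assumptions rather than an existing tool.

```python
# Rough sketch: how many of a commit's added lines survive, unmodified,
# in the current tree? Pass the full 40-character commit hash and run it
# from inside a git checkout.
import subprocess


def lines_added_by(commit: str) -> int:
    """Lines added by the commit, summed from `git show --numstat`."""
    out = subprocess.run(
        ["git", "show", "--numstat", "--format=", commit],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for row in out.splitlines():
        parts = row.split("\t")
        if len(parts) == 3 and parts[0].isdigit():  # skips binary files ("-")
            total += int(parts[0])
    return total


def lines_surviving_from(commit: str) -> int:
    """Lines in the current tree that `git blame` still attributes to the commit."""
    files = subprocess.run(
        ["git", "ls-files"], capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    surviving = 0
    for path in files:
        blame = subprocess.run(
            ["git", "blame", "--line-porcelain", "HEAD", "--", path],
            capture_output=True, text=True,
        ).stdout
        for line in blame.splitlines():
            if line.startswith("\t"):
                continue  # the source text itself, not a blame header
            # Each blamed line's header starts with the full commit hash.
            if line.split(" ", 1)[0] == commit:
                surviving += 1
    return surviving


def durability(commit: str) -> float:
    """Fraction of the commit's added lines that are still intact at HEAD."""
    added = lines_added_by(commit)
    return lines_surviving_from(commit) / added if added else 0.0
```

Averaged over the commits you flag as AI-assisted versus the rest, a number like this gets much closer to “does the code last?” than acceptance rates or lines generated ever will.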
But here’s the uncomfortable truth: Even with better metrics, we might find that AI tools are net negative in some contexts and net positive in others. And that’s okay! The goal isn’t to prove AI is universally good—it’s to understand when it helps and when it hurts.
The Question for This Group
What are you actually measuring to evaluate AI coding tools?
Are you tracking anything beyond “developer sentiment” and “adoption rates”? Have you found metrics that meaningfully correlate with business outcomes?
Because right now, it feels like we’re in a measurement crisis. We’re adopting tools at 93% rates while only 46% of developers fully trust them, and we can’t definitively say whether they’re making us faster or slower.
That’s… not great.