Google reports that 25% of its code is now AI-assisted, yet it sees only ~10% gains in engineering velocity. At first glance, the math doesn’t add up.
Either:
1. AI-generated code requires disproportionate review and debugging time (the hidden cost), or
2. We’re measuring the wrong outputs (velocity isn’t the right metric)
I suspect it’s #2. We’re optimizing for activity (code written) when we should be optimizing for outcomes (value delivered).
The Metrics That Lie
Most teams measure AI impact using:
- Lines of code written: Meaningless (more code ≠ better code)
- PRs merged per week: Activity not outcomes
- Cycle time: Can decrease while quality suffers
- AI tool usage rate: Correlation not causation
These are output metrics that don’t tell you if you’re building the right things or building things right.
What We Should Measure Instead
I’ve been thinking about this through a product management lens. Here’s my proposed framework:
Level 1: Input Metrics (Are People Using AI?)
- AI tool adoption rate (what % of engineers use it?)
- Training completion rate (did they learn how?)
- Engagement (how often do they use it?)
Purpose: Tells you if adoption is happening, not if it’s working.
Level 2: Output Metrics (Is More Code Being Produced?)
- Code written per engineer
- PRs created per week
- Features shipped per sprint
Purpose: Tells you if activity increased, not if value increased.
Level 3: Outcome Metrics (Is Better Software Being Delivered?)
- Time to value: Idea → production → customer validation
- Code quality: Defect rate, post-deployment incidents
- Customer satisfaction: NPS, support tickets, adoption
- Team health: Developer satisfaction, retention, onboarding time
Purpose: Tells you if you’re delivering more value to customers.
The critical insight: Level 3 metrics matter most, but most teams measure only Levels 1 and 2.
The Prevented Disasters Problem
Here’s what’s hard to measure: value created by problems that didn’t happen.
AI tools provide value through:
- Bugs caught before production
- Security issues identified in review
- Performance problems spotted early
- Architecture mistakes prevented
How do you measure a bug that never made it to customers? A security breach that never happened?
This is real value but nearly impossible to quantify.
The Option Value Challenge
Another measurement gap: AI creates option value—the ability to pivot, experiment, and respond to opportunities.
Example:
- An engineer uses AI to prototype a new approach in 2 hours instead of 2 days
- Learns it won’t work and abandons it
- Traditional metrics: “wasted 2 hours on a dead end”
- Option value perspective: “validated a hypothesis 14 hours sooner” (two 8-hour days, minus the 2 hours spent)
We’re not measuring this learning velocity benefit.
What Google’s 10% Gain Actually Means
Let’s unpack that Google data point:
- 25% of code is AI-assisted
- Only 10% velocity improvement
This suggests:
- AI speeds up some parts of the process (writing code)
- But doesn’t speed up others (requirements, review, testing, deployment)
- So end-to-end cycle time improves modestly
If coding is 30% of the end-to-end software delivery cycle, and AI makes that stage 50% faster:
- 30% × 50% = a 15% reduction in total cycle time (coding drops from 30% to 15% of the original cycle)
- Factor in review overhead, testing, and the stages AI doesn’t touch
- Net result: ~10% velocity gain
The math actually checks out. AI optimizes one part of a multi-stage process.
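The arithmetic above is just Amdahl’s law applied to the delivery pipeline. A minimal sketch — the stage share, speedup, and review-overhead figures here are illustrative assumptions, not Google’s actual numbers:

```python
def cycle_time_reduction(stage_share: float, stage_speedup: float) -> float:
    """Amdahl's law: fraction of total cycle time saved when one stage,
    taking `stage_share` of the total, becomes `stage_speedup`x faster."""
    return stage_share - stage_share / stage_speedup

# Assumption: coding is 30% of the cycle; AI doubles coding speed (50% less time).
saved = cycle_time_reduction(0.30, 2.0)
print(f"{saved:.0%} shorter cycle")  # prints "15% shorter cycle"

# Assumption: extra review of AI-generated code eats ~5% of the cycle back.
net = saved - 0.05
print(f"~{net:.0%} net velocity gain")  # prints "~10% net velocity gain"
```

Under these assumed numbers, a 2× speedup on 30% of the pipeline yields exactly the modest end-to-end gain in the Google data point.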
The Quality-Velocity Tradeoff
What if the real story isn’t “AI only improved productivity 10%” but “AI improved sustainable productivity 10%”?
Velocity without quality is just accumulated technical debt.
Better question: How much does AI improve velocity while maintaining or improving quality?
Metrics I’d track:
- Sustainable velocity: Features shipped that don’t require follow-up bug fixes
- Rework rate: % of PRs that need subsequent fixes
- Customer-facing quality: Incidents, performance regressions, user complaints
If AI increases velocity 30% but defect rate also increases 30%, you haven’t actually improved productivity—you’ve just shifted work from coding to debugging.
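This trade can be made concrete by discounting raw output by the rework it generates. A toy model — the shipped volumes, defect rates, and rework-cost multipliers are all assumed for illustration:

```python
def sustainable_velocity(shipped: float, defect_rate: float,
                         rework_cost: float) -> float:
    """Output left after paying back rework: each defective unit consumes
    `rework_cost` units of future capacity (1.0 = fixing a defect costs
    as much as building the feature did)."""
    return shipped * (1 - defect_rate * rework_cost)

# Hypothetical team: 100 units/quarter at a 10% defect rate; AI lifts
# both volume and defect rate by 30% (130 units, 13% defects).
for cost in (1.0, 3.0, 4.4):
    base = sustainable_velocity(100, 0.10, cost)
    ai = sustainable_velocity(130, 0.13, cost)
    print(f"rework cost {cost}x: {base:.1f} -> {ai:.1f} "
          f"({(ai - base) / base:+.0%})")
```

Under these assumptions, the headline +30% shrinks as production defects get more expensive to fix, and flips negative once a defect costs roughly 4.3× the original build effort.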
My Proposed Measurement Framework
Here’s what I’d recommend tracking:
Developer Experience:
- Satisfaction with AI tools (7/10 or higher?)
- Perceived productivity (do engineers feel more effective?)
- Learning velocity (faster ramping on new domains?)
Engineering Efficiency:
- Lead time for changes (commit → production)
- Deployment frequency (shipping more often?)
- Change failure rate (fewer incidents?)
- Time to restore service (faster recovery?)
Business Outcomes:
- Feature adoption (are customers using what we built?)
- Customer satisfaction (NPS, retention)
- Engineering retention (are engineers staying?)
- Recruiting (can we attract better talent?)
Connect the dots: AI training → faster cycle time → more experiments → better product-market fit → revenue growth.
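The Engineering Efficiency bullets above are the four DORA metrics, and three of them fall out of a simple deployment log (time to restore service would additionally need incident open/close timestamps). A sketch over a hypothetical log:

```python
from datetime import datetime, timedelta

# Hypothetical deploy log: (commit_time, deploy_time, caused_incident)
deploys = [
    (datetime(2025, 1, 6, 9),  datetime(2025, 1, 6, 15), False),
    (datetime(2025, 1, 7, 10), datetime(2025, 1, 8, 11), True),
    (datetime(2025, 1, 9, 8),  datetime(2025, 1, 9, 12), False),
]

# Lead time for changes: commit -> production, averaged.
avg_lead = sum((d - c for c, d, _ in deploys), timedelta()) / len(deploys)

# Deployment frequency: deploys per week over the observed window.
window = deploys[-1][1] - deploys[0][1]
per_week = len(deploys) * timedelta(weeks=1) / window

# Change failure rate: share of deploys that caused an incident.
cfr = sum(incident for _, _, incident in deploys) / len(deploys)

print(avg_lead, round(per_week, 1), f"{cfr:.0%}")
```

Tracked per quarter, before and after AI rollout, these give a velocity picture that already has quality (change failure rate) built in.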
Questions for the Community
- What are you measuring? And is it actually telling you if AI is working?
- How do you measure prevented disasters? Bugs that never shipped, issues caught in review?
- What’s the right timeframe for ROI? Should we expect results in 3 months? 12 months? 24 months?
- Are qualitative metrics undervalued? Developer satisfaction, team morale, etc.?
I’m convinced most teams are measuring the wrong things. We need to shift from output metrics (code written) to outcome metrics (value delivered).
What’s everyone else seeing in their organizations?