Building on the AI productivity paradox discussions—I want to challenge a fundamental assumption we’re all making: that “developer productivity” is the right metric to optimize for in the first place.
The Metrics Mismatch I’m Seeing
From the product/business side, here’s what’s confusing about the current narrative:
Engineering celebrates:
- Story points up 35%
- Commits up 60%
- Pull requests up 47%
- Individual developer satisfaction with productivity tools up significantly
Business sees:
- Features shipped per quarter: flat
- Time-to-market for new capabilities: unchanged
- Customer-facing release velocity: same as last year
- Customer satisfaction and retention: unchanged
We have a complete disconnect between input metrics (code produced) and output metrics (customer value delivered).
The Framework Problem: What Are We Actually Measuring?
Let me break down the productivity measurement stack:
Level 1: Individual Coding Metrics (What AI Improves)
- Lines of code written per hour
- Time to implement a specified feature
- Commits per day
- PRs created per week
AI impact: +40% improvement
Business relevance: Low—these don’t correlate with business outcomes
Level 2: Engineering Team Metrics (What We Track)
- Story points completed per sprint
- Velocity trends
- DORA metrics (deployment frequency, lead time, change failure rate, MTTR)
AI impact: Minimal to zero improvement
Business relevance: Medium—proxies for engineering effectiveness
Level 3: Product Delivery Metrics (What Product Tracks)
- Features shipped per quarter
- Time from concept to customer availability
- Percentage of roadmap delivered
- Feature adoption rates
AI impact: No measurable improvement so far
Business relevance: High—directly impacts go-to-market
Level 4: Business Outcome Metrics (What Actually Matters)
- Customer value delivered
- Revenue impact of new features
- User engagement and satisfaction
- Competitive differentiation
AI impact: Unclear—hard to attribute causally
Business relevance: Critical—this is what the business is optimizing for
The Uncomfortable Question
If coding speed improves 40% (Level 1) but business outcomes (Level 4) don’t improve, what does “productivity” even mean?
Are we measuring productivity wrong? Or is the productivity we’re measuring just not the productivity that matters?
Why Level 1 Gains Don’t Translate Up
Here’s my hypothesis for why individual coding productivity doesn’t translate to business productivity:
1. Coding Is a Smaller Fraction of Total Cycle Time Than We Think
From idea to customer value, typical timeline:
- Discovery and requirements: 1-2 weeks
- Design and technical planning: 1 week
- Implementation (coding): 1-2 weeks ← AI improves this
- Code review and iteration: 1 week
- Testing and QA: 1 week
- Deployment and rollout: 1 week
- Documentation and enablement: 1 week
- User adoption and feedback: 2-4 weeks
Total: 9-13 weeks
Coding portion: ~15-20% of total cycle
If AI cuts coding time by 40%, the total cycle shortens by only ~6-8%. But we’re not even seeing that much improvement.
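The arithmetic above is essentially Amdahl’s law applied to the delivery pipeline: speeding up one phase only improves the whole by that phase’s share of total time. A minimal sketch, using the illustrative 15-20% coding share and 40% time reduction from above (both are this post’s ballpark figures, not measured data):

```python
def cycle_improvement(phase_fraction: float, phase_time_reduction: float) -> float:
    """Fractional reduction in total cycle time when one phase's time
    shrinks by `phase_time_reduction` (0.40 = 40% less time in that phase)."""
    return phase_fraction * phase_time_reduction

# Coding is roughly 15-20% of the idea-to-value cycle (timeline above).
for coding_share in (0.15, 0.20):
    gain = cycle_improvement(coding_share, 0.40)
    print(f"coding share {coding_share:.0%} -> total cycle improves {gain:.0%}")
# coding share 15% -> total cycle improves 6%
# coding share 20% -> total cycle improves 8%
```

The same function makes it easy to test the sensitivity: even a 100% coding speedup (instant implementation) caps the end-to-end gain at the coding share itself.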
2. Faster Coding May Increase Downstream Work
Based on what Maya and Keisha have shared:
- More code → more review time (+91% in Luis’s data)
- More complexity → more testing needed
- Less design discipline → more bugs and rework
- Non-standard implementations → more maintenance
Paradox: Making one phase 40% faster might make other phases 20-50% slower if it increases volume/complexity.
3. Coordination Overhead Scales with Output Volume
More PRs created means:
- More context switching for reviewers
- More deployment coordination
- More release notes and documentation
- More cross-team communication about changes
These coordination costs might be consuming the productivity gains.
4. The Wrong Things Get Built Faster
This is the most concerning issue: AI doesn’t help us build the right things; it just helps us implement things faster.
Discovery, user research, requirements validation, hypothesis testing—none of these are accelerated by coding tools.
So if we’re still spending the same time figuring out what to build, and we’re no better at picking the right features, then faster implementation just means we deliver wrong solutions faster.
The Alternative Measurement Framework
Instead of measuring “how much code can developers write,” what if we measured:
End-to-End Value Delivery
Metric: Time from idea to validated customer value
- Start: Feature concept approved
- End: Feature in production with adoption data showing value
Current average: 10-12 weeks
Target with AI: 7-8 weeks (if we actually unlock the productivity)
This captures the full value chain, not just coding.
Quality-Adjusted Throughput
Metric: Features delivered × (1 - defect rate) × adoption rate
Features that ship with bugs or don’t get adopted shouldn’t count as “productive output.”
Current calculation:
- 12 features shipped per quarter
- 15% have significant bugs requiring rework
- 60% see meaningful adoption
Quality-adjusted output: 12 × 0.85 × 0.60 = 6.12 effective features
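The adjustment above is a straightforward product of three factors; a one-function sketch, using the illustrative numbers from this post (12 shipped, 15% buggy, 60% adopted — not real telemetry):

```python
def quality_adjusted_output(features_shipped: int,
                            defect_rate: float,
                            adoption_rate: float) -> float:
    """Effective features = shipped x (1 - defect rate) x adoption rate."""
    return features_shipped * (1 - defect_rate) * adoption_rate

# Illustrative quarter: 12 features, 15% need rework, 60% see adoption.
print(round(quality_adjusted_output(12, 0.15, 0.60), 2))  # prints 6.12
```

One design note: counting a buggy-but-adopted feature at partial credit (rather than zero) is a deliberate choice here; a stricter version could multiply by an adoption *quality* score instead of a binary adoption rate.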
Customer Impact Per Engineering Hour
Metric: Business value delivered / total engineering time invested
This requires defining “business value”—could be revenue, engagement, retention improvement, etc.
This forces the question: did the feature that was easy to build actually matter to customers?
Time to Validated Learning
Metric: How quickly we can test a hypothesis with real users
This measures how fast we learn, not just how fast we ship.
- Can we get a prototype in front of users in 1 week instead of 4?
- Can we A/B test two approaches in 2 weeks instead of 8?
Faster learning → better product decisions → higher chance of building the right things.
What This Means for AI Productivity Evaluation
If we adopt these alternative metrics, the AI productivity evaluation changes:
Traditional view: “AI makes developers 40% more productive”
Alternative view: “AI makes implementation 40% faster, but total value delivery improves <5% because implementation is only 15-20% of the cycle, and faster implementation may increase downstream costs”
That’s still valuable—5% improvement in value delivery is meaningful. But it’s not the 40% transformation the individual metrics suggest.
The Strategic Question for Leadership
Should we optimize for developer velocity or for end-to-end value delivery velocity?
These might require different investments:
Optimizing for developer velocity:
- Better AI coding tools
- Better code review processes (current discussion)
- Better testing and deployment automation
Optimizing for value delivery velocity:
- Better requirements discovery processes
- Faster customer feedback loops
- More disciplined scope management
- Better hypothesis testing and validation
- Cross-functional collaboration improvements
Most AI productivity discussions focus on the first. I’m arguing we should focus on the second.
What I’m Proposing
1. Stop celebrating story points and commits as success metrics
- These are inputs, not outputs
- They don’t correlate with business value
2. Start measuring idea-to-customer-value cycle time
- This captures the full process, not just coding
- It forces visibility into all the bottlenecks, not just implementation
3. Track feature effectiveness, not just feature delivery
- Quality-adjusted throughput
- Adoption rates and customer value
- ROI per engineering hour
4. Measure learning velocity, not just shipping velocity
- How fast can we validate hypotheses?
- How quickly do we discover we built the wrong thing?
5. Include product discipline in productivity initiatives
- Requirements rigor
- Scope management
- Post-launch validation
The Uncomfortable Conclusion
Maybe the productivity paradox isn’t a paradox at all. Maybe we’re measuring the wrong productivity.
Developers are more productive at writing code. That’s real and valuable.
But team productivity (delivering customer value) hasn’t improved because coding was never the primary constraint.
The constraint is:
- Figuring out what to build
- Coordinating across teams
- Validating we built the right thing
- Maintaining what we built
Until AI tools help with those constraints, individual coding productivity won’t translate to business productivity—no matter how good our code review processes get.
Does this resonate with others? Or am I missing how faster coding should translate to business value?