Following up on Michelle’s thread about the AI productivity paradox, I want to propose a concrete measurement framework.
Keisha mentioned we need multi-dimensional measurement—I agree completely. But most organizations don’t know where to start or what to actually track.
Here’s what I’m experimenting with in our fintech engineering org.
The Problem with Current Metrics
Most teams track what’s easy, not what’s meaningful:
Vanity Metrics (easy to measure, misleading):
- AI acceptance rate (“developers accepted 73% of AI suggestions!”)
- Token consumption (“we used 2M tokens this month!”)
- Lines of AI-generated code
- Time saved in coding (based on developer self-reporting)
These tell us people are using the tools. They don’t tell us if it matters.
A Three-Layer Framework
I’m proposing we measure AI productivity across three layers, from input to outcome:
Layer 1: Input Metrics (Are we using AI?)
Track adoption and usage:
- % of developers actively using AI tools weekly
- % of code commits with AI assistance
- Developer satisfaction scores with AI tools
Purpose: Understand adoption patterns and identify blockers
Warning: High usage doesn’t mean high value
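The Layer 1 numbers can be computed straight from commit metadata. Here's a minimal sketch, assuming a hypothetical list of commit records where an `ai_assisted` flag was set from something like an "AI-assisted:" commit trailer (that trailer convention is an assumption, not a standard):

```python
# Sketch: weekly adoption metrics from commit metadata.
# "commits" is a list of dicts with hypothetical "author" and
# "ai_assisted" fields; "active_devs" is the set of devs who
# committed anything that week.
def adoption_metrics(commits, active_devs):
    """Return (% of devs with AI-assisted commits, % of commits AI-assisted)."""
    ai_commits = [c for c in commits if c.get("ai_assisted")]
    ai_devs = {c["author"] for c in ai_commits}
    pct_devs = 100 * len(ai_devs) / len(active_devs) if active_devs else 0.0
    pct_commits = 100 * len(ai_commits) / len(commits) if commits else 0.0
    return pct_devs, pct_commits
```

The point of keeping this per-week is trend, not the absolute number: rising usage with flat Layer 2/3 metrics is exactly the warning sign above.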
Layer 2: Process Metrics (Is the work moving differently?)
Track how work flows through the system:
- End-to-end cycle time: Time from commit to production across different work types
- DORA metrics: Deployment frequency, lead time for changes, change failure rate, MTTR
- Queue times: How long work waits in review, testing, deployment stages
- Rework rate: How often AI-generated code needs significant revision
Purpose: Identify bottlenecks and understand where AI helps vs hinders
Key insight: If cycle time doesn’t improve despite faster coding, the bottleneck is elsewhere
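For the process layer, the key computation is lead time from commit to production per work item. A minimal sketch, assuming a hypothetical event log of `(work_id, stage, timestamp)` rows (the stage names `committed` and `deployed` are illustrative):

```python
# Sketch: median lead time (commit -> production) from a stage-event log.
from datetime import datetime
from statistics import median

def lead_times(events):
    """Median hours from commit to production deploy across work items."""
    by_id = {}
    for work_id, stage, ts in events:
        by_id.setdefault(work_id, {})[stage] = ts
    hours = [
        (stages["deployed"] - stages["committed"]).total_seconds() / 3600
        for stages in by_id.values()
        if "committed" in stages and "deployed" in stages
    ]
    return median(hours) if hours else None
```

The same event log also gives you queue times (difference between consecutive stage timestamps), which is how you find the bottleneck when cycle time doesn't move.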
Layer 3: Outcome Metrics (Are we delivering more value?)
Track business impact:
- Feature delivery rate: Customer-facing capabilities shipped per sprint (not story points)
- Time-to-value: Days from customer request to production solution
- Quality outcomes: Defect rates, security findings, performance metrics
- Business KPI movement: Revenue, activation, retention, engagement tied to engineering work
Purpose: Connect engineering activity to business results
Key insight: This is what actually matters, but it’s hardest to measure
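Time-to-value is the most mechanical of the Layer 3 metrics to compute, provided you can link a customer request to the release that shipped it. A sketch under that assumption (the `requested_on`/`shipped_on` field names are hypothetical):

```python
# Sketch: average time-to-value for shipped tickets, skipping unshipped ones.
from datetime import date
from statistics import mean

def time_to_value_days(tickets):
    """Average days from customer request to production, over shipped tickets."""
    deltas = [
        (t["shipped_on"] - t["requested_on"]).days
        for t in tickets
        if t.get("shipped_on")
    ]
    return mean(deltas) if deltas else None
```

The hard part isn't this calculation; it's maintaining the request-to-release linkage, which is exactly why Layer 3 is the hardest layer.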
Specific Implementation
Here’s what we’re tracking weekly:
AI-assisted work:
- Cycle time: 8.2 days average (commit → production)
- Change failure rate: 12%
- Rework rate: 18% (code needing significant post-AI revision)
Non-AI work (control group):
- Cycle time: 9.1 days average
- Change failure rate: 8%
- Rework rate: 7%
Interpretation: AI speeds up initial coding, and AI-assisted work is ~10% faster end-to-end (8.2 vs. 9.1 days), but it introduces more failures and requires more rework. The net benefit is marginal, not the 45% individual gains we expected.
What This Reveals
The measurement itself is revealing problems:
- Review bottleneck: AI code spends 35% longer in review than human code
- Testing challenges: AI-generated code has a higher defect escape rate (fewer defects are caught before production)
- Type-dependent value: AI helps significantly with isolated bug fixes, marginally with new features
We’re learning that AI productivity is highly context-dependent.
The Measurement Warning
Here’s the trap: research shows that “without changes in measurement and governance, productivity gains rarely translate into business outcomes.”
You can’t just measure. You also need to:
- Act on what you learn (fix bottlenecks AI exposes)
- Adjust processes (review, testing, deployment for AI era)
- Train teams (how to use AI well, how to review AI code)
Measurement without action is just surveillance.
Questions for the Community
- What are you measuring to track AI productivity?
- Are you seeing different results for different types of work (bugs vs features vs refactoring)?
- How are you connecting engineering metrics to business outcomes?
- What measurement mistakes have you made that others should avoid?
I’m particularly interested in Layer 3 metrics—connecting engineering work to business KPIs. That’s where we’re struggling most.