The discussions Maya and Luis started have surfaced a critical question that I think every engineering org needs to grapple with: If traditional metrics don’t capture AI’s real impact, what should we actually be measuring?
The Measurement Problem
Here’s what I’m seeing across my teams: We’re drowning in data but starving for insight.
Metrics We’re Tracking (And They’re Not Helping):
- Lines of code written
- PR count and merge rate
- Story points completed per sprint
- Individual commit frequency
- Time to “code complete”
The Problem:
All of these metrics have gone up since we adopted AI tools. So by these measures, we're "more productive."
But as Luis showed, bug volume is up. As Michelle showed, full-cycle time is worse. As we’re all seeing, actual value delivered to customers is flat or down.
We’re measuring activity, not outcomes. And that’s dangerous.
The Three-Level Measurement Framework
I’ve been working with our data analytics team to rethink how we measure AI productivity impact. We’re using a three-level framework:
Level 1: Developer Experience (Individual)
What we measure:
- Perceived productivity (survey: “Do you feel AI makes you more productive?”)
- Tool satisfaction and engagement
- Time saved on specific tasks (self-reported)
- Learning and skill development
Current data:
- 87% say AI makes them feel more productive
- 4.2/5 satisfaction with AI tools
- Self-reported 3.8 hours saved per week
- But… 34% report feeling less confident in their code understanding
Interpretation:
Developer experience is positive on balance. AI makes work feel better. But there are early warning signs about skill development and code understanding.
Level 2: Team Effectiveness (Delivery System)
What we measure:
- Full cycle time (idea → production stable)
- DORA metrics:
  - Deployment frequency
  - Lead time for changes
  - Change failure rate
  - Mean time to recovery
- Code review time and quality
- Defect escape rate (bugs reaching production)
- Rework percentage (time spent fixing vs building)
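To make these Level 2 metrics concrete, here is a minimal sketch of how change failure rate, defect escape rate, and rework percentage could be computed from deployment and ticket records. The record shapes and field names are illustrative assumptions, not our actual schema:

```python
from dataclasses import dataclass

@dataclass
class Deployment:
    caused_incident: bool  # did this deploy trigger a rollback or hotfix?

@dataclass
class Ticket:
    is_defect: bool      # bug fix vs. feature work
    found_in_prod: bool  # did the defect escape to production?
    hours: float         # time logged against the ticket

def change_failure_rate(deploys):
    """Share of deployments that led to an incident (a DORA key metric)."""
    return sum(d.caused_incident for d in deploys) / len(deploys)

def defect_escape_rate(tickets):
    """Share of defects first found in production rather than pre-release."""
    defects = [t for t in tickets if t.is_defect]
    return sum(t.found_in_prod for t in defects) / len(defects)

def rework_percentage(tickets):
    """Share of logged time spent fixing rather than building."""
    total = sum(t.hours for t in tickets)
    fixing = sum(t.hours for t in tickets if t.is_defect)
    return fixing / total

# Toy data, not real numbers:
deploys = [Deployment(False), Deployment(True), Deployment(False), Deployment(False)]
tickets = [
    Ticket(is_defect=False, found_in_prod=False, hours=30.0),
    Ticket(is_defect=True,  found_in_prod=True,  hours=12.0),
    Ticket(is_defect=True,  found_in_prod=False, hours=8.0),
]
print(f"{change_failure_rate(deploys):.0%}")  # 25%
print(f"{defect_escape_rate(tickets):.0%}")   # 50%
print(f"{rework_percentage(tickets):.0%}")    # 40%
```

The point isn't the arithmetic; it's that none of these can be computed from PR counts or lines of code. They require linking deploys to incidents and tickets to outcomes.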
Current data:
- Cycle time: +11% (worse)
- Deployment frequency: +3% (flat)
- Lead time: +8% (worse)
- Change failure rate: +14% (worse)
- MTTR: About flat
- Review time: +47% (worse)
- Defect escape: +22% (worse)
- Rework: 38% of time (up from 24%)
Interpretation:
Team delivery effectiveness has declined. We’re not shipping faster or more reliably—we’re shipping more bugs and spending more time on rework.
Level 3: Business Outcomes (Value)
What we measure:
- Features delivered to production
- Feature adoption rate (% of users using new features)
- Customer satisfaction (NPS for new features)
- Time-to-value (from idea to customer impact)
- Revenue impact of shipped features
- Customer-reported bug/issue rate
Current data:
- Features shipped: +12%
- Feature adoption: -3% (slightly worse)
- NPS for new features: +1 (flat)
- Time-to-value: +15% (worse)
- Revenue impact: Roughly flat per feature
- Customer issues: +18% (worse)
Interpretation:
We’re shipping more features, but they’re delivering less value per feature, taking longer to reach customers, and generating more support burden.
The Disconnection
Here’s what’s fascinating (and concerning): The three levels tell completely different stories.
Level 1 (Developer):
AI is great! I feel faster and more satisfied.
Level 2 (Team):
We’re slower, buggier, spending more time on rework.
Level 3 (Business):
Value delivery hasn’t improved and might be getting worse.
This is the perception-reality gap Maya identified, but at an organizational scale.
Why Traditional Metrics Fail for AI
I think traditional engineering metrics fail for AI productivity because they were designed for a different bottleneck.
Traditional assumption: Coding speed is the constraint.
Modern reality: Requirements clarity, architectural decisions, coordination, and quality are the constraints.
AI makes the non-constraint faster. But as Theory of Constraints teaches us, optimizing a non-constraint doesn’t improve system throughput—it just creates inventory (in our case, code awaiting review, bugs awaiting fixes, features awaiting adoption).
What We Should Measure Instead
Based on what I’m learning, here are metrics that actually matter:
For Individual Developers:
- Time spent on high-value activities (architecture, design, problem-solving vs debugging, rework)
- Code understanding (can you explain and defend your code?)
- Learning velocity (are you developing new skills or relying on AI as a crutch?)
- Code quality (defects per KLOC, security issues, maintainability)
For Teams:
- Flow efficiency (value-add time / total cycle time)
- Quality metrics (defect density, change failure rate, security findings)
- Waste metrics (rework time, code churn, abandoned work)
- Collaboration effectiveness (code review quality, knowledge sharing)
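Of these, flow efficiency is the one I'd start with, because it's cheap to compute once you track active versus waiting time per work item. A toy sketch, with made-up numbers:

```python
def flow_efficiency(value_add_hours, total_cycle_hours):
    """Fraction of cycle time spent actively adding value (coding,
    designing) versus waiting (review queues, handoffs, blocked time)."""
    return value_add_hours / total_cycle_hours

# A ticket that took 10 working days (80h) end to end,
# of which only 20h was active work; the rest was queue time.
print(f"{flow_efficiency(20, 80):.0%}")  # 25%
```

A flow efficiency this low is exactly the Theory of Constraints story: making the 20 active hours faster with AI barely moves the 80-hour cycle, because the queues dominate.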
For Organizations:
- Time-to-value (idea → customer impact)
- Customer outcomes (adoption, satisfaction, problem resolution)
- Business impact (revenue, retention, growth attributable to features)
- Strategic progress (are we advancing our platform, or just churning?)
The Uncomfortable Truth
When I look at this data with our exec team, the conclusion is hard to avoid:
By the metrics that actually matter—team effectiveness and business outcomes—AI hasn’t made us more productive. It might have made us less productive.
But by the metrics that are easy to measure and feel good—individual activity and velocity—AI looks great.
We’re at risk of optimizing for the wrong things.
Questions for the Community
- What metrics are you tracking? Are they measuring activity or outcomes?
- How do you balance qualitative and quantitative data? Surveys say one thing, delivery metrics say another—which do you trust?
- What's a realistic measurement framework that doesn't require a data science PhD but actually captures AI's impact?
- How do you communicate this to non-technical leadership? Executives see "PR count up 60%!" and want to know why we're not shipping more value.
- Are there leading indicators we should track? Metrics that predict future productivity impact before it shows up in delivery?
Where I’m Headed
I’m moving toward a dashboard that shows all three levels side by side:
- Developer experience (important for retention)
- Team effectiveness (important for delivery)
- Business outcomes (important for value)
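As a sketch, the side-by-side view could start as simply as a nested snapshot plus an automated misalignment check. The field names here are hypothetical, and the sample values echo the numbers reported above:

```python
# Hypothetical snapshot; field names are illustrative, not a real schema.
dashboard = {
    "developer_experience": {
        "perceived_productivity": 0.87,  # survey: "AI makes me more productive"
        "tool_satisfaction": 4.2,        # out of 5
        "confidence_concern": 0.34,      # feel less confident in code understanding
    },
    "team_effectiveness": {
        "cycle_time_delta": +0.11,       # +11% = worse
        "change_failure_delta": +0.14,
        "rework_share": 0.38,
    },
    "business_outcomes": {
        "features_shipped_delta": +0.12,
        "adoption_delta": -0.03,
        "customer_issue_delta": +0.18,
    },
}

def misaligned(d):
    """Flag when developer sentiment is positive but delivery is degrading,
    so the gap gets investigated rather than ignored."""
    happy = d["developer_experience"]["perceived_productivity"] > 0.5
    degrading = (d["team_effectiveness"]["cycle_time_delta"] > 0
                 or d["business_outcomes"]["adoption_delta"] < 0)
    return happy and degrading

print(misaligned(dashboard))  # True
```

Even a crude check like this turns "the levels tell different stories" from an anecdote into a standing alert.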
And being honest when they don’t align. If developers feel great but outcomes are poor, that’s a signal we need to investigate, not ignore.
The goal isn’t to prove AI is good or bad—it’s to understand where and how it creates value, and where it doesn’t, so we can make informed decisions.
What measurement approaches are working for y’all? I’d love to learn from teams that have figured this out better than we have.
Sources: DORA Report and Developer Metrics, 2025 Stack Overflow Developer Survey - AI