Following up on our productivity measurement discussion and junior developer concerns, I want to get tactical about measurement.
The problem: We’re using the same metrics we always have—DORA, velocity, commit counts—to evaluate a fundamentally different way of working.
And those metrics are showing no improvement despite massive AI adoption.
What DORA Isn’t Capturing
Our DORA metrics over the last 12 months (since we went all-in on AI tools):
- Deployment Frequency: Basically flat (2.3 deployments/day → 2.4)
- Lead Time for Changes: Slightly worse (6.2 hours → 6.8 hours)
- Change Failure Rate: Worse (4.1% → 5.7%)
- Time to Restore Service: Flat (52 minutes → 54 minutes)
So by traditional engineering metrics, AI tools have been neutral at best and a net negative at worst.
But I don’t think these metrics are capturing what’s actually happening. They’re too coarse-grained.
What We Should Be Measuring Instead
I’m proposing three new categories of AI-specific metrics:
1. Code Quality and Durability Metrics
Code Durability Score:
- % of code that survives 30/60/90 days without modification
- Hypothesis: AI-generated code gets rewritten more often
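One way to operationalize the durability score: snapshot per-line authorship (e.g. via `git blame`) at commit time and again after the window elapses, then compute the fraction of old-enough lines still attributed to their original commit. A minimal sketch, assuming that extraction is done upstream — `durability_score` and its input shapes are hypothetical, not an existing tool:

```python
from datetime import date

def durability_score(authored: dict[str, date],
                     still_attributed: set[str],
                     as_of: date,
                     window_days: int) -> float:
    """Fraction of lines at least `window_days` old that survive unmodified.

    `authored` maps a line id (e.g. "file.py:123@commit") to its original
    commit date; `still_attributed` is the subset of those ids that blame
    still attributes to the original commit as of `as_of`.
    """
    # Only lines old enough to have "had a chance" to be rewritten count.
    eligible = {lid for lid, d in authored.items()
                if (as_of - d).days >= window_days}
    if not eligible:
        return 0.0
    return len(eligible & still_attributed) / len(eligible)
```

Run it per 30/60/90-day window and segment by code source to test the hypothesis directly.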
Bug Attribution:
- Production bugs per 1000 lines, segmented by code source (human vs AI-heavy)
- Security vulnerabilities by code source
- Hypothesis: AI code has higher defect density
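The segmentation itself is simple once bugs and lines are tagged by source; the hard part is the tagging. A sketch of the per-1000-lines calculation, where "human" and "ai_heavy" are placeholder labels for whatever attribution scheme you adopt:

```python
def defect_density(bugs_by_source: dict[str, int],
                   loc_by_source: dict[str, int]) -> dict[str, float]:
    """Production bugs per 1000 lines of code, segmented by code source.

    Sources with zero recorded lines are skipped rather than divided by.
    """
    return {src: bugs_by_source.get(src, 0) / (loc_by_source[src] / 1000)
            for src in loc_by_source if loc_by_source[src] > 0}
```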
Review Efficiency:
- Review comments per PR (normalized by PR size)
- Review time per line of code
- Acceptance without changes rate
- Hypothesis: AI code requires more review scrutiny
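A sketch of the normalization, assuming PR metadata can be exported from your review tool — the `PullRequest` shape here is illustrative, not any tool's actual API:

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    lines_changed: int
    review_comments: int
    merged_without_changes: bool

def review_burden(prs: list[PullRequest]) -> dict[str, float]:
    """Review comments per 100 changed lines, plus clean-acceptance rate."""
    total_lines = sum(p.lines_changed for p in prs)
    total_comments = sum(p.review_comments for p in prs)
    clean = sum(p.merged_without_changes for p in prs)
    return {
        "comments_per_100_lines": 100 * total_comments / total_lines,
        "accepted_without_changes": clean / len(prs),
    }
```

Compute it separately for AI-heavy and human-authored PRs; the hypothesis predicts a higher comments-per-100-lines figure for the former.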
Technical Debt Accumulation:
- Code complexity metrics (cyclomatic, cognitive) over time
- Test coverage requirements (do we need more tests for AI code?)
- Refactoring frequency
- Hypothesis: AI code accumulates debt faster
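For trend tracking you don't need a full static-analysis platform; a crude stdlib proxy is enough to plot complexity over time, as long as you use the same proxy consistently. This is a rough stand-in for a real cyclomatic-complexity analyzer, not a replacement for one:

```python
import ast

def cyclomatic_proxy(source: str) -> int:
    """Rough cyclomatic complexity: 1 + number of branch points.

    Counts if/for/while/except/boolean-op decision nodes. Crude, but
    stable enough to compare the same codebase against itself over time.
    """
    tree = ast.parse(source)
    branches = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)
    return 1 + sum(isinstance(n, branches) for n in ast.walk(tree))
```

Record the score per file at each release; the hypothesis predicts a steeper upward slope for AI-heavy modules.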
2. Developer Capability Metrics
Debugging Proficiency:
- Time to resolve production incidents (by developer experience level)
- Ability to solve issues without AI assistance
- Root cause analysis quality
- Hypothesis: AI-dependent developers struggle with novel debugging
Knowledge Transfer:
- Code explanation quality in PRs
- Documentation completeness
- Mentorship effectiveness (can seniors teach juniors who used AI?)
- Hypothesis: AI reduces deep understanding, harming knowledge transfer
Skill Development Trajectory:
- Time-to-promotion (Junior → Mid → Senior)
- Technical interview performance over time
- Architecture decision quality
- Hypothesis: Heavy AI usage slows skill acquisition
3. Business Outcome Metrics
Value Delivery:
- Time from idea to customer value (not just deployment)
- Feature adoption and usage metrics
- Customer-reported bugs per release
- Hypothesis: More code ≠ more value
Cost of Quality:
- Security incident costs (time + $ + reputation)
- Support burden from production bugs
- Engineering time spent on rework vs new features
- Hypothesis: AI code quality issues have real business costs
Engineering Efficiency:
- Revenue per engineer (ultimate productivity measure)
- Cost per delivered feature
- Engineering headcount as % of company (do we actually need fewer engineers?)
- Hypothesis: AI should improve these if it’s really a productivity boost
Proposed Measurement Framework
Tier 1 - Business Metrics (Monthly Review):
- Revenue per engineer
- Customer-reported critical bugs
- Time-to-market for revenue features
- Security incidents and compliance violations
If Tier 1 is improving: AI is working, keep going.
If Tier 1 is flat/declining: Investigate with Tier 2.
Tier 2 - Team Metrics (Weekly Review):
- Code durability scores
- Review burden and PR cycle time
- Change failure rate by code source
- Technical debt indicators
If Tier 2 shows AI code quality problems: Adjust policies (restrict AI use for critical paths, require senior review for AI-heavy PRs).
Tier 3 - Individual Metrics (Daily/Real-time):
- AI tool usage patterns
- Code authorship attribution
- Developer sentiment
- Learning and skill development
Use Tier 3 for coaching: Identify developers who are too AI-dependent or not leveraging AI effectively.
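The escalation logic across the three tiers can be encoded as a simple routing function. A sketch only — the trend labels are assumed summary judgments of the Tier 1 review, not precisely defined metrics:

```python
def next_review_action(tier1_trend: str, tier2_flags_ai_quality: bool) -> str:
    """Route the monthly review per the tiered framework.

    `tier1_trend` is "improving", "flat", or "declining"; the Tier 2 flag
    comes from code-durability and change-failure-by-source checks.
    """
    if tier1_trend == "improving":
        return "continue current AI policy"
    if tier2_flags_ai_quality:
        return "adjust policy: restrict AI on critical paths, senior review for AI-heavy PRs"
    return "investigate further: drill into Tier 3 individual metrics"
```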
The Baseline Problem
Here’s my challenge: We don’t have pre-AI baselines for most of these metrics.
We adopted AI tools in a rush (“everyone else is doing it”) without establishing measurement frameworks first. Now we’re trying to retrofit baselines from historical data that may not be comparable.
If I could do it over:
1. Establish baseline metrics (DORA + new metrics above)
2. Run controlled pilot (one team with AI, one without, same projects)
3. Measure for 2 quarters before org-wide rollout
4. Make data-driven adoption decision
But we skipped steps 1-3 and went straight to org-wide adoption. Now we’re flying blind.
What I’m Implementing Next Quarter
Starting Q2, we’re instrumenting:
Weeks 1-2: Baseline measurement
- Current code durability scores
- Review metrics across all teams
- Developer capability assessments
Weeks 3-12: A/B testing within teams
- Some features: “AI-heavy” approach (full tool access)
- Some features: “AI-light” approach (restricted to boilerplate/tests only)
- Track everything in the framework above
End of quarter: Business case review
- Did AI-heavy features ship faster? With what quality?
- Did AI-light features cost more engineer time but have better outcomes?
- Which approach delivered more customer value?
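With only one quarter of features per arm, the samples will be small, so the end-of-quarter comparison shouldn't assume normality. A stdlib permutation test on, say, per-feature lead times is one defensible option — the function and inputs below are illustrative:

```python
import random
from statistics import mean

def permutation_pvalue(a: list[float], b: list[float],
                       n_iter: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of means.

    Repeatedly reshuffles the pooled observations into two groups of the
    original sizes and counts how often the shuffled mean gap matches or
    exceeds the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            hits += 1
    return hits / n_iter
```

A large p-value here would itself be informative: it would mean a quarter of data can't distinguish the two approaches, and the policy question stays open.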
Then we make evidence-based policy decisions about where AI helps and where it hurts.
The Question for This Group
What metrics are you using to evaluate AI tools beyond “developer sentiment”?
Has anyone found leading indicators that actually correlate with business outcomes? Or proven that DORA metrics are sufficient and I’m overthinking this?
Because right now, I'm preparing to defend (or cancel) a substantial annual tool budget, and "developers like it" isn't going to cut it with our CFO.