Measuring AI Coding Tools: A Multi-Level Framework for Engineering Leaders
Throughout this discussion, we’ve identified the problems with AI measurement:
- Velocity is a vanity metric
- Verification overhead is real and costly
- Outcome metrics matter more than output metrics
Now, what’s the practical framework to actually measure this?
Context: Scaling Engineering with AI
We’re scaling from 25 to 80+ engineers. AI tools are already embedded in our workflow. We need measurement that works at scale and guides decision-making.
The Three-Level Measurement Framework
I propose measuring AI impact at three distinct levels, each with its own metrics, cadence, and interventions.
Level 1: Individual Developer Experience
What we’re measuring: Flow state, cognitive load, satisfaction, learning
Metrics:
- Context switches per day (tool instrumentation; see the sketch after this list)
- Flow time percentage (self-reported + calendar analysis)
- Developer satisfaction scores (quarterly survey)
- Skill development trajectory (career conversations)
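As a concrete illustration, here's how the first two metrics might be computed from tool instrumentation. The (timestamp, tool) event format and the 25-minute flow threshold are assumptions for the sketch, not part of any standard tooling:

```python
from datetime import datetime, timedelta

def context_switches(events):
    """Count transitions between distinct tools in one day's event stream."""
    return sum(1 for (_, a), (_, b) in zip(events, events[1:]) if a != b)

def flow_time_pct(events, min_block=timedelta(minutes=25)):
    """Percent of the tracked span spent in uninterrupted single-tool
    blocks of at least min_block (the 25-minute threshold is assumed)."""
    if len(events) < 2:
        return 0.0
    flow = timedelta()
    block_start, block_tool = events[0]
    for ts, tool in events[1:]:
        if tool != block_tool:
            if ts - block_start >= min_block:
                flow += ts - block_start
            block_start, block_tool = ts, tool
    last_ts = events[-1][0]
    if last_ts - block_start >= min_block:  # close the final block
        flow += last_ts - block_start
    total = last_ts - events[0][0]
    return 100 * flow / total if total else 0.0

# Hypothetical day of editor/tool events
events = [
    (datetime(2025, 1, 6, 9, 0), "editor"),
    (datetime(2025, 1, 6, 9, 40), "ai_chat"),
    (datetime(2025, 1, 6, 9, 45), "editor"),
    (datetime(2025, 1, 6, 11, 0), "browser"),
]
print(context_switches(events))         # 3
print(round(flow_time_pct(events), 1))  # 95.8
```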
AI impact questions:
- Does AI help or hurt flow state?
- Is cognitive load from verification worth generation speed?
- Are developers happier and growing?
Cadence: Weekly team check-ins, monthly trends
Level 2: Team Delivery Effectiveness
What we’re measuring: Quality, collaboration, delivery reliability
Metrics:
- Code review cycles and duration
- Change failure rate (post-deploy issues; see the sketch after this list)
- Mean time to recovery (MTTR)
- Deployment frequency
- Test coverage and quality
- Technical debt accumulation
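For concreteness, a minimal sketch of the failure-rate and MTTR calculations. The record shapes are hypothetical stand-ins for whatever your deploy pipeline and incident tracker actually export:

```python
from datetime import datetime

deploys = [
    {"id": "d1", "caused_incident": False},
    {"id": "d2", "caused_incident": True},
    {"id": "d3", "caused_incident": False},
    {"id": "d4", "caused_incident": False},
]
incidents = [
    {"opened": datetime(2025, 1, 6, 10, 0), "resolved": datetime(2025, 1, 6, 11, 30)},
    {"opened": datetime(2025, 1, 7, 14, 0), "resolved": datetime(2025, 1, 7, 14, 45)},
]

# Change failure rate: share of deployments that led to a post-deploy issue.
cfr = sum(d["caused_incident"] for d in deploys) / len(deploys)

# MTTR: mean wall-clock time from incident open to resolution, in seconds.
mttr = sum((i["resolved"] - i["opened"]).total_seconds() for i in incidents) / len(incidents)

print(f"Change failure rate: {cfr:.0%}")  # 25%
print(f"MTTR: {mttr / 3600:.2f} hours")   # 1.12 hours
```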
AI impact questions:
- Is code quality improving or degrading?
- Is review burden sustainable?
- Are we delivering reliably?
Cadence: Sprint retrospectives, monthly metrics reviews
Level 3: Organization Business Impact
What we’re measuring: Customer value, innovation capacity, sustainability
Metrics:
- Time to customer value (idea → adoption)
- Customer-reported bugs and satisfaction
- Technical debt trajectory (long-term)
- Talent retention and growth
- Innovation capacity (new capabilities delivered)
AI impact questions:
- Are we delivering more customer value?
- Is the codebase healthier?
- Are we more capable as an organization?
Cadence: Quarterly business reviews, annual strategic planning
Key Insight: Multi-Level Optimization
AI may help at one level but hurt at another:
Example from our data:
- Individual: Velocity up 25% (positive)
- Team: Quality down 15% (negative)
- Organization: Customer impact neutral (concerning)
Intervention needed: Focus AI on debugging (clear win); restrict code generation (mixed results).
The Dashboard Approach
Leading indicators (individual level) predict lagging indicators (org level).
If individual flow state drops, team quality will eventually degrade; if team quality degrades, customer outcomes will eventually suffer.
This creates early warning signals for intervention.
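One way to make those signals actionable is a small rule set that flags leading-indicator drops before they cascade downstream. The metric names and thresholds below are illustrative assumptions, not recommendations:

```python
def early_warnings(snapshot):
    """snapshot: current-vs-baseline deltas, e.g. {"flow_time_pct": -8}."""
    alerts = []
    if snapshot.get("flow_time_pct", 0) < -5:
        alerts.append("Flow time down >5 pts: expect review/quality strain next.")
    if snapshot.get("review_cycle_days", 0) > 1:
        alerts.append("Review cycles lengthening: quality metrics at risk.")
    if snapshot.get("change_failure_rate_pct", 0) > 3:
        alerts.append("Failure rate rising: customer impact likely next quarter.")
    return alerts

# Individual-level signal trips before any team or org metric moves
print(early_warnings({"flow_time_pct": -8, "review_cycle_days": 0.5}))
```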
Practical Implementation Steps
Step 1: Establish Baselines
Measure current state before changing AI usage:
- Individual: Developer survey, flow state assessment
- Team: DORA metrics, code quality benchmarks
- Org: Customer satisfaction, technical debt index
Step 2: Instrument at All Three Levels
Don’t just track velocity:
- Individual: Weekly surveys, tool analytics
- Team: GitHub metrics, incident tracking (see the sketch after this list)
- Org: Customer metrics, business outcomes
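As one example of team-level instrumentation, here's a sketch that pulls recently closed pull requests from the GitHub REST API and computes median open-to-merge time. OWNER/REPO and the token are placeholders, and pagination and error handling are omitted:

```python
from datetime import datetime
from statistics import median
import requests

resp = requests.get(
    "https://api.github.com/repos/OWNER/REPO/pulls",
    params={"state": "closed", "per_page": 50},
    headers={"Authorization": "Bearer YOUR_TOKEN"},
)
resp.raise_for_status()

def parse(ts):  # GitHub timestamps look like "2025-01-06T10:00:00Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

merge_times = [
    (parse(pr["merged_at"]) - parse(pr["created_at"])).total_seconds() / 3600
    for pr in resp.json()
    if pr.get("merged_at")  # skip closed-but-unmerged PRs
]
print(f"Median open-to-merge: {median(merge_times):.1f} hours")
```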
Step 3: Track Trends, Not Absolutes
Absolute numbers vary by context. What matters is:
- Are things getting better or worse?
- Is AI helping or hurting the trend? (see the sketch after this list)
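A minimal sketch of trend-over-absolute measurement: compare the latest window of weekly data points against the prior window rather than reading single values. The window size and sample data are illustrative:

```python
from statistics import mean

def trend(series, window=4):
    """Percent change of the latest window vs. the one before it."""
    recent, prior = series[-window:], series[-2 * window:-window]
    return 100 * (mean(recent) - mean(prior)) / mean(prior)

# Illustrative weekly change failure rates (%)
weekly_cfr = [10, 11, 10, 12, 12, 13, 14, 15]
print(f"Change failure rate trend: {trend(weekly_cfr):+.1f}%")  # +25.6%, worsening
```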
Step 4: Adjust AI Usage Based on Data
If data shows problems, intervene (a policy sketch follows this list):
- Restrict AI in areas where it’s causing harm
- Expand AI in areas where it’s delivering value
- Continuously adjust based on measurement
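One lightweight way to encode this is a usage policy keyed by task type that gets revisited whenever the metrics move. The task names, rules, and thresholds below are illustrative assumptions, not our actual policy:

```python
AI_POLICY = {
    "debugging":          "encouraged",  # clear multi-level win
    "boilerplate":        "allowed",
    "complex_generation": "restricted",  # quality regressions at team level
}

def review_policy(metrics):
    """Loosen or tighten a task type as the data moves; thresholds assumed."""
    if metrics["change_failure_rate_trend_pct"] > 10:
        AI_POLICY["complex_generation"] = "prohibited_without_pairing"
    elif metrics["change_failure_rate_trend_pct"] < -10:
        AI_POLICY["complex_generation"] = "allowed"
    return AI_POLICY

print(review_policy({"change_failure_rate_trend_pct": 25.6}))
```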
Real Example: Our Multi-Level Analysis
Last quarter findings:
Individual level:
- Flow state: Improved for debugging, degraded for complex work
- Satisfaction: Mixed (love debugging help, frustrated with verification burden)
Team level:
- MTTR: Down 40% (excellent)
- Change failure rate: Up 18% (concerning)
- Review cycles: Up 35% (problematic)
Organization level:
- Customer impact: Neutral (faster fixes, more bugs initially)
- Technical debt: Increasing (AI code requires refactoring)
Decision: Focus AI on debugging and error analysis (tier 1); be very cautious with complex code generation (tier 3).
The Multi-Level Question Framework
For any AI impact assessment, ask:
Individual: Are developers happier, growing, and effective?
Team: Are we delivering quality work reliably?
Organization: Are we creating customer value sustainably?
All three must be “yes” for AI to be truly successful.
Call to Action
What frameworks are you using?
I’d love to hear:
- How others are measuring across levels
- What metrics are proving most useful
- What interventions are working
- What early warning signals have proven reliable
Let’s learn together. AI measurement is too important to get wrong, and we’re all figuring this out in real time.