We’ve crossed a threshold in engineering: 91% of organizations now use AI coding tools. But the conversation has fundamentally shifted. We’re no longer debating whether to adopt AI; we’re struggling to measure what it’s actually doing.
I’m leading a 120-person engineering organization through AI tool rollout, and I’m hitting a measurement wall that I suspect many of you are facing too.
The Velocity Trap We’re All Walking Into
Here’s what I’m seeing: Teams feel faster. Developers report increased productivity in surveys. Managers point to higher PR counts. But when I look at our delivery metrics—actual features shipped to customers, time from idea to production—the needle hasn’t moved. In some cases, it’s moved backward.
We’re measuring the wrong thing.
The One Clear Win: Stack Trace Analysis
There’s one use case where the ROI is undeniable: debugging. When a developer hits a cryptic error message—especially in our distributed systems—AI tools excel at answering “What does this error actually mean?”
We’ve measured this: our mean time to recovery (MTTR) for infrastructure-related bugs dropped 40% after deploying AI-assisted debugging. The workflow is clean: copy stack trace, get explanation, verify fix, ship. No ambiguity, minimal cognitive load, easy to measure.
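For anyone who wants to reproduce this, the calculation itself is simple. Here’s a minimal sketch assuming a hypothetical list of incident records with detected/resolved timestamps and a tag field; the field names are illustrative, not our actual schema:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; field names are illustrative, not our schema.
incidents = [
    {"tag": "infrastructure",
     "detected_at": datetime(2024, 5, 1, 9, 0),
     "resolved_at": datetime(2024, 5, 1, 11, 30)},
    {"tag": "infrastructure",
     "detected_at": datetime(2024, 5, 3, 14, 0),
     "resolved_at": datetime(2024, 5, 3, 15, 0)},
    {"tag": "application",
     "detected_at": datetime(2024, 5, 4, 8, 0),
     "resolved_at": datetime(2024, 5, 4, 10, 0)},
]

def mttr_hours(records, tag):
    """Mean time to recovery (in hours) for incidents with a given tag."""
    durations = [
        (r["resolved_at"] - r["detected_at"]).total_seconds() / 3600
        for r in records
        if r["tag"] == tag
    ]
    return mean(durations) if durations else None

print(f"Infra MTTR: {mttr_hours(incidents, 'infrastructure'):.2f}h")  # 1.75h
```

The design choice that matters is segmenting by incident class, so the debugging win isn’t diluted by categories where the tools don’t help.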
This is the AI success story we can actually articulate to the board.
But Here’s What Keeps Me Up at Night
While our debugging metrics improved, other signals are flashing yellow:
- Pull requests increased 20% (sounds great!)
- Incidents per pull request jumped 23.5% (not great)
- Change failure rate increased 30% (definitely not great; rough math sketched below)
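If you want to sanity-check your own numbers the same way, the ratios are straightforward. This is a rough sketch with made-up counts chosen only to roughly mirror the trends above, not our real data:

```python
def incidents_per_pr(incidents: int, merged_prs: int) -> float:
    """Incidents attributed to a period, normalized by merged PRs."""
    return incidents / merged_prs

def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Share of changes that led to a rollback, hotfix, or incident."""
    return failed_changes / total_changes

def pct_change(before: float, after: float) -> float:
    return (after - before) / before * 100

# Made-up counts, chosen only to roughly mirror the trends described above.
before = {"prs": 500, "incidents": 40, "failed": 30, "deploys": 400}
after  = {"prs": 600, "incidents": 59, "failed": 47, "deploys": 480}

ipp_delta = pct_change(incidents_per_pr(before["incidents"], before["prs"]),
                       incidents_per_pr(after["incidents"], after["prs"]))
cfr_delta = pct_change(change_failure_rate(before["failed"], before["deploys"]),
                       change_failure_rate(after["failed"], after["deploys"]))

print(f"PR volume:        {pct_change(before['prs'], after['prs']):+.0f}%")
print(f"Incidents per PR: {ipp_delta:+.0f}%")
print(f"Change failure:   {cfr_delta:+.0f}%")
```

The point of the normalization is that raw PR counts can rise while quality per change falls; you only see that if you divide.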
METR’s study on AI-assisted development found that on complex, novel tasks, senior developers were actually 19% slower when using AI. The culprit? Cognitive load from verification overhead. AI generates code that “looks right” but requires deep inspection to confirm it’s actually right.
The time cost isn’t in the generation—it’s in the review, the debugging of subtle issues, the refactoring of code that technically works but doesn’t fit our patterns.
The Metric We’re Not Tracking: Cognitive Load
What if velocity is the wrong metric entirely?
When developers switch from “coding mode” to “prompting mode” to “verification mode,” they’re paying a context-switching tax that flow state research shows is measurable and significant. The dopamine hit from instant AI suggestions creates a halo effect—developers believe they’re 20% faster even when the data shows they’re slower.
I’m starting to think the real value isn’t speed—it’s cognitive load reduction. And we’re barely measuring that at all.
Stack trace analysis works precisely because it reduces cognitive load (no more hunting through documentation and Stack Overflow for obscure errors). Code generation often increases cognitive load (now you’re verifying code you didn’t write but still have to own).
So Here’s My Question to This Community
What metrics are you actually using to measure AI coding tools’ impact?
Are you tracking:
- Flow time (sustained focus periods without context switching; one way to derive it is sketched below)?
- Friction points (where developers get stuck and for how long)?
- Developer satisfaction (subjective but important)?
- Quality metrics tied to AI usage (defect rates, review cycles, technical debt)?
Or are you, like many organizations, defaulting to velocity metrics (PRs, lines of code, commits) because they’re easy to measure, even if they don’t tell the full story?
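For flow time specifically, here’s one way of deriving it that we’ve been exploring. This is a minimal sketch assuming hypothetical editor activity timestamps; the 5-minute gap threshold and 25-minute minimum session are arbitrary choices you’d tune for your own team:

```python
from datetime import datetime, timedelta

def flow_sessions(activity, max_gap=timedelta(minutes=5)):
    """Group activity timestamps into focus sessions.

    Any gap longer than max_gap (switching to a browser, a prompt
    window, a meeting) closes the current session.
    """
    if not activity:
        return []
    activity = sorted(activity)
    sessions = []
    start = prev = activity[0]
    for t in activity[1:]:
        if t - prev > max_gap:
            sessions.append((start, prev))
            start = t
        prev = t
    sessions.append((start, prev))
    return sessions

def flow_time(activity, min_session=timedelta(minutes=25)):
    """Total time in sessions long enough to count as sustained focus."""
    return sum(
        (end - start for start, end in flow_sessions(activity)
         if end - start >= min_session),
        timedelta(),
    )

# Hypothetical editor activity: lots of short bursts, no sustained focus.
ts = [datetime(2024, 5, 6, 9, 0) + timedelta(minutes=m)
      for m in (0, 2, 4, 30, 32, 33, 60, 61, 90)]
print(flow_time(ts))  # 0:00:00 -- busy all morning, zero flow time
```

The fragmented example at the end is deliberate: a developer can look busy all morning and still register zero flow time, which is exactly the signal velocity metrics miss.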
We’re at an inflection point. The teams that figure out how to measure AI’s actual impact, not just its perceived speed, are going to make much better decisions about where to invest in these tools and where to pull back.
I’d love to hear what’s working for you. What are you measuring? What have you tried and abandoned? Where are you seeing clear ROI vs. measurement confusion?
Related reading:
- How to measure AI’s impact on your engineering team
- Beyond the Hype: Measuring AI’s Impact on Engineering Teams