Over the past four years, we’ve watched engineering productivity measurement evolve from DORA alone to DORA + SPACE, and now, in 2026, to the unified DX Core 4 framework. Each iteration promises deeper insight into how our teams actually work, and each grows more sophisticated: more dimensions, more metrics, more surveys.
And yet, I’m starting to wonder: are we measuring the right things?
The Framework Evolution
At my company—a Fortune 500 financial services firm—we’ve implemented all three frameworks:
- DORA gave us deployment frequency, lead time, change failure rate, and time to restore. Clean, actionable, focused on delivery and stability.
- SPACE added satisfaction, performance, activity, communication, and efficiency. More holistic, but also more complex to instrument.
- DX Core 4 now unifies these into four dimensions: Speed, Effectiveness, Quality, and Impact. The “Impact” dimension is particularly compelling—measuring the percentage of time spent on new capabilities vs. maintenance. That’s a business outcome, not just an engineering activity metric.
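To make the four DORA keys concrete, here’s a minimal sketch of how they fall out of raw deployment records. The record shape, field names, and observation window are invented for illustration—this isn’t any particular tool’s schema:

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment records for a single service over one week.
# Fields are assumptions: commit timestamp, deploy timestamp, whether the
# deploy caused a production failure, and when service was restored.
deployments = [
    {"committed": datetime(2026, 1, 2, 9),  "deployed": datetime(2026, 1, 3, 17),
     "failed": False, "restored": None},
    {"committed": datetime(2026, 1, 5, 11), "deployed": datetime(2026, 1, 6, 10),
     "failed": True,  "restored": datetime(2026, 1, 6, 14)},
    {"committed": datetime(2026, 1, 8, 15), "deployed": datetime(2026, 1, 9, 9),
     "failed": False, "restored": None},
]

window_days = 7  # the period the records above cover

# Deployment frequency: deploys per day over the window
deploy_frequency = len(deployments) / window_days

# Lead time for changes: median commit-to-deploy interval
lead_time = median(d["deployed"] - d["committed"] for d in deployments)

# Change failure rate: share of deployments that caused a failure
failure_rate = sum(d["failed"] for d in deployments) / len(deployments)

# Time to restore: median failure-to-recovery interval, if any failures
restore_times = [d["restored"] - d["deployed"] for d in deployments if d["failed"]]
time_to_restore = median(restore_times) if restore_times else None
```

The point of the sketch is the property discussed below: deployment frequency and change failure rate come from the same records, so optimizing one while degrading the other shows up immediately in the same dataset.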
But here’s the paradox: the more sophisticated our frameworks become, the less clarity we seem to have about actual productivity.
The AI Productivity Mirage
This year, the data is impossible to ignore:
- 41% of all code written in 2025 was AI-generated (source)
- 76% of developers now use AI coding assistants
- Individual coding speed is up—dramatically, for simple tasks
But at the organizational level:
- Code churn increased by 41%
- Delivery stability dropped 7.2%
- Security vulnerabilities in AI-assisted code are up 23.7%
- Only 33% of developers trust AI results (source)
So developers are writing code faster, but we’re shipping less reliably, with more technical debt and more vulnerabilities. The productivity gains at the individual level aren’t translating to organizational throughput.
Outputs vs. Outcomes
I think the core issue is this: we’re still measuring outputs (commits, story points, lines of code) when we should be measuring outcomes (customer value delivered, system health, sustainable pace).
The moment you tell developers they’ll be evaluated on cycle time, you’ll get optimized cycle time—and everything else (code quality, documentation, mentoring, thoughtful design) will suffer. Goodhart’s Law in action.
Even the DX Core 4’s “Impact” dimension, which I genuinely like, is hard to implement in practice. How do you categorize work as “new capability” vs. “maintenance” when most real-world changes involve both? How do you avoid creating perverse incentives where teams game the classification to look more innovative?
What I’m Struggling With
DORA metrics work because they tie delivery speed directly to system stability. You can’t game both simultaneously—if you optimize for deployment frequency while ignoring quality, your change failure rate will expose you.
But with AI changing how we write code, are even DORA metrics still sufficient? When 41% of code is AI-generated and code churn is up 41%, traditional velocity metrics feel disconnected from actual value delivery.
I don’t have answers here—just questions:
- What metrics actually predict success at your organizations?
- How are you adapting measurement to account for AI-assisted development?
- Have you found ways to tie engineering metrics to business outcomes without creating perverse incentives?
- Is the DX Core 4 framework overkill, or does the additional complexity actually yield better insight?
I’d love to hear how other engineering leaders are thinking about this. Because right now, I’m worried we’re optimizing for measurement theater instead of actual productivity.