DORA metrics have been the gold standard for measuring engineering performance for nearly a decade. Deployment frequency, lead time for changes, change failure rate, and mean time to recovery - these four metrics have shaped how we think about DevOps success.
But after spending the past year deeply analyzing engineering metrics at my company, I’m convinced that DORA in 2026 needs significant caveats - especially with AI in the mix.
The AI Productivity Paradox
Here’s what caught my attention from recent DORA research:
Individual metrics look great:
- 21% more tasks completed with AI coding assistants
- 98% more pull requests merged
Organizational metrics stay flat:
- AI adoption associated with 1.5% decrease in delivery throughput
- 7.2% reduction in delivery stability
The most striking finding: developers believed AI made them 20% more efficient, while in reality they were slowed down by 19%.
That’s a 39-percentage-point gap between perception and reality.
The “Mirror and Multiplier” Effect
The DORA researchers describe something they call the “mirror and multiplier” effect:
AI doesn’t inherently make engineering better - it magnifies whatever system it operates within.
In teams with well-defined processes and clean architectures, AI enhances quality and flow. In teams with tangled pipelines or unclear governance, AI accelerates chaos.
This means DORA metrics can improve for the wrong reasons - more deployments might just mean more partially-tested AI-generated code hitting production.
What DORA Metrics Miss in 2026
The traditional DORA framework wasn’t built for a world where:
- AI works on multiple tasks simultaneously - lead time calculations become murky when AI is generating code faster than humans can review it.
- The bottleneck shifts from writing to validating - deployment frequency can increase while actual quality work gets backlogged in review queues.
- Batch sizes are getting larger - AI tempts teams to abandon small-batch principles. When developers can produce more code faster, they create larger, riskier changes.
- A flat metric can hide growing problems - if deployments double but the change failure rate stays at 5%, you're now dealing with twice the absolute number of failures.
The New Caveats I Apply
Based on all this, here’s how I now interpret DORA metrics:
Caveat 1: Normalize by Complexity, Not Just Time
Traditional lead time measures the time from commit to deploy. But if AI is generating the boilerplate and developers are spending more of their time on review, the same lead time can hide very different work patterns.
I now track lead time per complexity point, not just lead time alone.
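Here's a minimal sketch of that calculation. The field names and complexity scores are hypothetical; in practice the complexity points could come from story points, diff size, or a cyclomatic-complexity delta.

```python
from datetime import datetime

# Hypothetical change records: commit time, deploy time, complexity points.
changes = [
    {"committed": datetime(2026, 1, 5, 9),
     "deployed": datetime(2026, 1, 5, 17), "complexity": 2},
    {"committed": datetime(2026, 1, 6, 10),
     "deployed": datetime(2026, 1, 7, 10), "complexity": 8},
]

def lead_time_hours(change):
    """Commit-to-deploy duration in hours."""
    return (change["deployed"] - change["committed"]).total_seconds() / 3600

# Plain average lead time vs. lead time normalized by complexity.
avg_lead = sum(lead_time_hours(c) for c in changes) / len(changes)
per_point = (sum(lead_time_hours(c) for c in changes)
             / sum(c["complexity"] for c in changes))

print(f"avg lead time: {avg_lead:.1f}h, per complexity point: {per_point:.1f}h")
```

Two teams with identical average lead times can look very different once you divide by how much complexity each change actually carried.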
Caveat 2: Watch the Review Queue
If deployment frequency is up but the review queue is growing, you’re building debt. The metric looks good, but the system is degrading.
I add pending review age as a companion metric to deployment frequency.
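Pending review age is cheap to compute from whatever your code host's API returns for open pull requests. A sketch with hypothetical timestamps:

```python
from datetime import datetime, timedelta

now = datetime(2026, 3, 1, 12, 0)

# Hypothetical opened_at timestamps for pull requests awaiting review.
open_reviews = [
    now - timedelta(hours=4),
    now - timedelta(days=2),
    now - timedelta(days=5),
]

# Age of each pending review, in hours.
ages_hours = sorted((now - opened) / timedelta(hours=1)
                    for opened in open_reviews)

median = ages_hours[len(ages_hours) // 2]
oldest = ages_hours[-1]

print(f"pending: {len(ages_hours)}, median age: {median:.0f}h, oldest: {oldest:.0f}h")
```

If deployment frequency rises while the median and oldest ages climb, throughput is being borrowed from review quality.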
Caveat 3: Track Business Outcomes, Not Just Delivery
DORA tells you about the delivery system. It doesn’t tell you whether you’re delivering the right things.
We now pair DORA metrics with customer impact metrics: feature adoption, support ticket reduction, revenue per release.
Caveat 4: Segment by AI-Assisted vs. Human-Only
The stats look different depending on whether AI was involved. We now track separate DORA metrics for AI-assisted changes vs. human-authored changes.
Early findings: our AI-assisted code has a 15% higher change failure rate than human-authored code. That’s a problem we wouldn’t have seen without segmentation.
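The segmentation itself is a one-pass aggregation once each change is tagged with its authorship. A minimal sketch with made-up data (the `source` tag and outcomes are hypothetical):

```python
from collections import defaultdict

# Hypothetical deploy log: each change tagged by authorship and outcome.
changes = [
    {"source": "ai-assisted", "failed": True},
    {"source": "ai-assisted", "failed": False},
    {"source": "ai-assisted", "failed": False},
    {"source": "human", "failed": False},
    {"source": "human", "failed": False},
    {"source": "human", "failed": False},
    {"source": "human", "failed": True},
]

totals, failures = defaultdict(int), defaultdict(int)
for change in changes:
    totals[change["source"]] += 1
    failures[change["source"]] += change["failed"]

# Change failure rate per segment.
for source in sorted(totals):
    rate = failures[source] / totals[source]
    print(f"{source}: change failure rate {rate:.0%}")
```

The hard part isn't the arithmetic - it's reliably tagging which changes were AI-assisted in the first place, whether via commit trailers, IDE telemetry, or self-reporting.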
Questions for the Community
- Are you seeing the AI productivity paradox - individual metrics up, organizational metrics flat or down?
- How do you account for AI in your DORA calculations? Any tricks for maintaining meaningful comparisons?
- What companion metrics do you use alongside DORA to get the full picture?
DORA metrics are still valuable, but in 2026, they’re necessary but not sufficient. Context matters more than ever.