DORA Metrics in 2026: Still the Gold Standard, But With New Caveats

DORA metrics have been the gold standard for measuring engineering performance for nearly a decade. Deployment frequency, lead time for changes, change failure rate, and mean time to recovery - these four metrics have shaped how we think about DevOps success.

But after spending the past year analyzing engineering metrics in depth at my company, I’m convinced that in 2026, DORA needs significant caveats - especially with AI in the mix.

The AI Productivity Paradox

Here’s what caught my attention from recent DORA research:

Individual metrics look great:

  • 21% more tasks completed with AI coding assistants
  • 98% more pull requests merged

Organizational metrics stay flat:

  • AI adoption associated with 1.5% decrease in delivery throughput
  • 7.2% reduction in delivery stability

The most striking finding: developers believed AI made them 20% more efficient, while in reality they were slowed down by 19%.

That’s a 39-percentage-point gap between perception and reality.

The “Mirror and Multiplier” Effect

The DORA researchers describe something they call the “mirror and multiplier” effect:

AI doesn’t inherently make engineering better - it magnifies whatever system it operates within.

In teams with well-defined processes and clean architectures, AI enhances quality and flow. In teams with tangled pipelines or unclear governance, AI accelerates chaos.

This means DORA metrics can improve for the wrong reasons - more deployments might just mean more partially tested, AI-generated code hitting production.

What DORA Metrics Miss in 2026

The traditional DORA framework wasn’t built for a world where:

  1. AI works on multiple tasks simultaneously - Lead time calculations become murky when AI is generating code faster than humans can review it.

  2. The bottleneck shifts from writing to validating - Deployment frequency can increase while actual quality work gets backlogged in review queues.

  3. Batch sizes are getting larger - AI tempts teams to abandon small batch principles. When developers can produce more code faster, they create larger, riskier changes.

  4. A flat metric can hide growing problems - If deployments double but change failure rate stays at 5%, you’re now dealing with twice the absolute number of failures.
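
Point 4 is simple arithmetic, but worth making concrete. The deployment volumes below are hypothetical:

```python
# Hypothetical volumes: a flat 5% change failure rate doubles the
# absolute failure count when deployment volume doubles.
def failures_per_month(deploys: int, change_failure_rate: float) -> float:
    """Expected number of failed deployments per month."""
    return deploys * change_failure_rate

before = failures_per_month(100, 0.05)  # 5.0 failed deploys/month
after = failures_per_month(200, 0.05)   # 10.0 failed deploys/month
```

Same rate, twice the incidents your on-call rotation has to absorb.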

The New Caveats I Apply

Based on all this, here’s how I now interpret DORA metrics:

Caveat 1: Normalize by Complexity, Not Just Time

Traditional lead time measures commit-to-deploy time. But if AI is generating boilerplate and developers are spending more time on reviews, the same lead time might hide very different work patterns.

I now track lead time per complexity point, not just lead time alone.
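
A minimal sketch of that normalization, assuming each change carries a team-assigned complexity score (story points, a reviewer’s 1-5 rating, whatever sizing unit you already use - the scoring scheme here is an assumption, not part of DORA):

```python
from datetime import datetime, timedelta
from statistics import median

def lead_time_per_point(changes):
    """Median lead time per complexity point, in hours.

    `changes` is a list of (commit_time, deploy_time, points) tuples;
    `points` is a hypothetical team-assigned complexity score.
    """
    return median(
        (deploy - commit).total_seconds() / 3600 / points
        for commit, deploy, points in changes
    )

t0 = datetime(2026, 1, 5)
changes = [
    (t0, t0 + timedelta(hours=10), 5),  # 10h for a 5-point change -> 2.0 h/pt
    (t0, t0 + timedelta(hours=6), 2),   # 6h for a 2-point change  -> 3.0 h/pt
]
ratio = lead_time_per_point(changes)    # 2.5 h/pt
```

If raw lead time holds steady while hours-per-point climbs, review overhead is eating the AI speedup.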

Caveat 2: Watch the Review Queue

If deployment frequency is up but the review queue is growing, you’re building debt. The metric looks good, but the system is degrading.

I add pending review age as a companion metric to deployment frequency.
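
A sketch of that companion metric, assuming you can pull open-PR creation timestamps from your forge’s API (the data source is up to your tooling):

```python
from datetime import datetime, timedelta

def pending_review_ages(pr_opened_times, now):
    """Hours each still-open PR has waited for review, oldest first."""
    return sorted(
        ((now - opened).total_seconds() / 3600 for opened in pr_opened_times),
        reverse=True,
    )

now = datetime(2026, 2, 1, 12)
open_prs = [now - timedelta(hours=30), now - timedelta(hours=4)]
ages = pending_review_ages(open_prs, now)  # [30.0, 4.0]
oldest = ages[0]                           # worth alerting on if it keeps climbing
```

Watching the oldest pending age alongside deployment frequency catches the queue growing before it shows up in lead time.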

Caveat 3: Track Business Outcomes, Not Just Delivery

DORA tells you about the delivery system. It doesn’t tell you whether you’re delivering the right things.

We now pair DORA metrics with customer impact metrics: feature adoption, support ticket reduction, revenue per release.

Caveat 4: Segment by AI-Assisted vs. Human-Only

The stats look different depending on whether AI was involved. We now track separate DORA metrics for AI-assisted changes vs. human-authored changes.

Early findings: our AI-assisted code has a 15% higher change failure rate than human-authored code. That’s a problem we wouldn’t have seen without segmentation.
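
A sketch of the segmentation, assuming each deploy record carries an `ai_assisted` flag - how you set that flag (commit trailer, PR label, assistant telemetry) depends on your tooling:

```python
def segmented_cfr(deploys):
    """Change failure rate overall and per segment.

    `deploys` is a list of (ai_assisted: bool, failed: bool) tuples.
    """
    def rate(subset):
        return sum(failed for _, failed in subset) / len(subset)

    ai = [d for d in deploys if d[0]]
    human = [d for d in deploys if not d[0]]
    return {"overall": rate(deploys), "ai": rate(ai), "human": rate(human)}

# Hypothetical month: 10 AI-assisted deploys (2 failed), 10 human (1 failed).
deploys = [(True, i < 2) for i in range(10)] + [(False, i < 1) for i in range(10)]
rates = segmented_cfr(deploys)  # overall 0.15, ai 0.2, human 0.1
```

The overall 15% looks like one number; the segments show one population failing at twice the rate of the other.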

Questions for the Community

  1. Are you seeing the AI productivity paradox - individual metrics up, organizational metrics flat or down?

  2. How do you account for AI in your DORA calculations? Any tricks for maintaining meaningful comparisons?

  3. What companion metrics do you use alongside DORA to get the full picture?

DORA metrics are still valuable, but in 2026, they’re necessary but not sufficient. Context matters more than ever.

Rachel, this is the most nuanced take on DORA metrics I’ve seen this year. The “mirror and multiplier” framing is exactly right.

On the AI segmentation approach:

We started doing this 6 months ago after noticing our change failure rate was creeping up despite no obvious process changes. When we segmented by AI-assisted vs. human-authored, the pattern was clear:

  • Human-authored code: 3.2% change failure rate (stable)
  • AI-assisted code: 8.7% change failure rate (and rising)

The aggregate number was 5.1% - looked “acceptable” but hid a significant quality problem in one category.
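
For what it’s worth, those three numbers are mutually consistent: the aggregate is just a volume-weighted average of the two segments, so you can back out the approximate mix. The 65/35 split below is inferred from the rates, not something measured directly:

```python
human_cfr, ai_cfr, aggregate = 0.032, 0.087, 0.051

# aggregate = w * human_cfr + (1 - w) * ai_cfr  ->  solve for w
human_share = (ai_cfr - aggregate) / (ai_cfr - human_cfr)
# human_share ~= 0.6545: roughly 65% of deploys were human-authored
```

The same algebra lets you predict where the aggregate goes as the AI share grows - which is exactly why a blended number is dangerous.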

Why this happens:

In my observation, AI-assisted code fails more often for specific reasons:

  1. Edge cases get missed - AI generates the happy path beautifully, but subtle edge cases get overlooked
  2. Integration points are fragile - AI doesn’t have the contextual knowledge of our system boundaries
  3. Testing is shallower - Developers trust the AI-generated code more than they should and don’t test as rigorously

What we changed:

We now have different review requirements for AI-assisted PRs:

  • Mandatory edge case checklist
  • Explicit integration testing requirements
  • Flag for “AI-heavy” PRs that triggers additional review

This has brought the AI-assisted failure rate down to 5.5% - still higher than human-only, but converging.

On the “necessary but not sufficient” conclusion:

I’d go further. In an AI-heavy workflow, DORA metrics can actually be misleading if interpreted naively.

A team could have great DORA metrics (high deployment frequency, low lead time) while actually degrading quality - because AI enables rapid low-quality output. You need the companion metrics Rachel mentions to catch this.

The scary scenario: leadership celebrates DORA improvements while technical debt and failure rates quietly compound.

I want to push back slightly on how we’re framing this, because I think there’s a risk of over-correcting.

DORA metrics aren’t broken - our interpretation is.

The four DORA metrics have always been about system performance, not developer performance. They measure how well your delivery pipeline works, not whether individual engineers are productive.

The AI paradox Rachel describes makes perfect sense through this lens:

  • Individual productivity can increase (more PRs merged)
  • System throughput can stay flat (review becomes the bottleneck)
  • System stability can decrease (more untested code hits production)

That’s not DORA failing to measure something. That’s DORA accurately telling you your system isn’t keeping pace with your input.

The real problem: We use DORA as a proxy for things it doesn’t measure.

DORA never claimed to measure:

  • Developer experience
  • Code quality
  • Business value
  • Team health

We’ve been treating these four metrics as a complete picture because they were easy to measure. The answer isn’t to abandon DORA - it’s to stop asking DORA to do things it wasn’t designed for.

What I’ve added to our measurement stack:

  1. Developer Experience Score (quarterly survey): How do engineers feel about their tools, processes, and ability to do good work?

  2. Code Review Cycle Time: DORA measures commit-to-deploy, but the hidden bottleneck is often review. This surfaces it.

  3. Rework Rate: What percentage of commits are fixing things we shipped in the last 30 days? High rework = low quality, regardless of DORA metrics.

  4. Feature Adoption Rate: Of features shipped, how many actually get used? This connects delivery to value.
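
Rework rate is the easiest of these to script. A minimal sketch, assuming you can tag commits as fixes (reverts, hotfix labels, linked incidents - the detection heuristic is up to your tooling):

```python
from datetime import datetime, timedelta

def rework_rate(commits, window_days=30):
    """Fraction of commits in the window that fix already-shipped work.

    `commits` is a list of (timestamp, is_fix) tuples; `is_fix` is True
    when the commit reverts or repairs something shipped recently.
    """
    if not commits:
        return 0.0
    cutoff = max(ts for ts, _ in commits) - timedelta(days=window_days)
    recent = [is_fix for ts, is_fix in commits if ts >= cutoff]
    return sum(recent) / len(recent)

t = datetime(2026, 3, 1)
# Hypothetical history: 20 commits over 20 days, every 5th one a fix.
commits = [(t - timedelta(days=d), d % 5 == 0) for d in range(20)]
rate = rework_rate(commits)  # 4 of 20 commits are fixes -> 0.2
```

A rising rework rate with flat DORA numbers is the "shipping fast, fixing faster" pattern this thread is describing.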

On AI specifically:

I’m less worried about AI-assisted code quality than most. The problem isn’t AI - it’s process. If your process relies on developers catching all their own bugs, AI will expose that fragility.

The teams I see struggling with AI-assisted quality are the same ones who had quiet quality problems before AI - it’s just more visible now.

Adding an infrastructure perspective on why AI is particularly tricky for DORA metrics.

The batch size problem is real and underappreciated:

Rachel mentioned that AI tempts teams to abandon small batch principles. I’ve seen this first-hand.

With AI assistance, a developer can generate 500 lines of code in the time it used to take to write 50. The temptation is to ship it all as one PR instead of breaking it into smaller pieces.

What happens next:

  • Review takes 3x longer (big PRs are harder to review)
  • More bugs slip through (cognitive overload on reviewers)
  • Changes are harder to roll back (bigger blast radius)

All while DORA metrics look stable or improving (more code deployed faster).

The 39% trust number matters:

The research says 39% of developers report little to no trust in AI-generated code. That trust gap creates a weird dynamic:

  • Developers who trust AI too much: ship faster, more bugs
  • Developers who trust AI too little: ship slower, review everything manually

The aggregate DORA metrics hide these two very different populations with very different outcomes.

What we track in AI infrastructure:

Beyond standard DORA, we monitor:

  1. Inference-to-deploy ratio: How many AI suggestions end up in production? A dropping ratio can indicate declining AI quality or increasing developer skepticism.

  2. AI-suggested test coverage: Are AI-generated tests actually testing the right things? We’ve found AI is great at generating tests that pass but poor at generating tests that would catch real bugs.

  3. Prompt-to-ship time: For AI-assisted work, what’s the time from initial prompt to production? This surfaces the review bottleneck specifically.
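
Prompt-to-ship is worth a sketch because it decomposes the review bottleneck directly. The timestamps below are hypothetical, and capturing the initial prompt time depends on what telemetry your assistant exposes:

```python
from datetime import datetime

def prompt_to_ship_breakdown(prompt_time, pr_opened, deployed):
    """Split prompt-to-ship time into a generation phase and a
    validation phase, in hours. A growing validation share means the
    bottleneck has shifted from writing code to reviewing it.
    """
    generate = (pr_opened - prompt_time).total_seconds() / 3600
    validate = (deployed - pr_opened).total_seconds() / 3600
    return {"generate_h": generate, "validate_h": validate}

phases = prompt_to_ship_breakdown(
    datetime(2026, 4, 1, 9),   # initial prompt
    datetime(2026, 4, 1, 11),  # PR opened two hours later
    datetime(2026, 4, 3, 11),  # deployed two days after that
)
# generate_h: 2.0, validate_h: 48.0
```

Two hours of generation against two days of validation is the pattern to watch for: DORA lead time alone would not tell you which phase is the problem.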

The uncomfortable truth:

Many organizations are using AI to game DORA metrics without realizing it. More PRs, faster lead times, but the same amount of actual work getting done - just with more overhead from reviewing AI output.

@eng_director_luis is right that the problem isn’t DORA. It’s that we haven’t updated our mental models for what the numbers mean in an AI-assisted world.