
DORA in the Age of AI: When Deployment Frequency Lies

9 min read
Tian Pan
Software Engineer

Here are two numbers that should unsettle you: according to the 2025 DORA State of AI-Assisted Software Development report, PRs merged per developer rose 98% while incidents per PR rose 242.7%. Deployment frequency looks elite. The system is breaking more often per unit of change than at any point DORA has measured.

Your dashboard is green. Your on-call engineers are exhausted. Something is wrong with the measuring tape.

DORA metrics — deployment frequency, lead time for changes, change failure rate, mean time to restore — were built to distinguish high-performing engineering organizations from low-performing ones. For a decade, that framework worked. Shipping often correlated with discipline. Fast lead times reflected tight feedback loops. But that correlation was never causal; it was a proxy that held because human capacity constrained throughput. AI tooling has decoupled the proxy from the underlying reality it was measuring.

Deployment Frequency: The Volume Trap

DORA's foundational insight was that elite teams deploy frequently because they've made deployments small, safe, and routine. The causality runs from capability to frequency. AI inverts this: it generates volume independent of capability, and volume feeds directly into the numerator.

One documented case: a team went from 5 to 20 weekly deployments after adopting AI tooling. Of the 15 new deployments, one was a new feature. The rest were AI-generated tests, configuration scaffolding, and boilerplate — valid commits, real merges, zero capability signal. At ecosystem scale, Claude Code alone accounts for 4.5% of all public GitHub commits. AI agent-generated PRs jumped from 4 million in September 2024 to 17 million in March 2025. This volume is now baked into every dashboard in engineering.

The structural problem isn't that AI generates bad code — it's that it generates code faster than humans can understand it. LinearB data puts the rejection rate for AI-generated PRs at 67.3%, versus 15.6% for manually written code. Two-thirds of the AI volume hitting review queues fails outright, and a nontrivial fraction of what does survive to inflate the deployment frequency numerator was approved under review conditions that deteriorated under that same volume pressure.

Lead Time: The Bottleneck Just Moved

Lead time for changes measures the elapsed time from first commit to production. AI compresses the coding phase dramatically — a complete PR with tests and documentation that once took days now takes minutes. The metric improves. What actually happens is that the bottleneck relocates.

Faros AI, tracking 1,255 engineering teams, found that median PR review time grew 441% as AI adoption rose. Review now consumes over 50% of total lead time in high-AI-adoption teams, compared to roughly 20% before. Aggregate lead time still looks better on the scoreboard because the average includes a flood of AI-generated merges that never received substantive review: the median organization now merges 31% more PRs with no human review at all than it did before AI.
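
To see the shift rather than the average, you can decompose each PR's lead time into a coding phase and a review phase. A minimal sketch, assuming your PR records expose first-commit, review-requested, and merge timestamps (those field choices are mine, not from any tool cited above):

```python
from datetime import datetime

def phase_shares(first_commit: datetime,
                 review_requested: datetime,
                 merged: datetime) -> tuple[float, float]:
    """Return (coding_share, review_share) of a PR's total lead time."""
    total = (merged - first_commit).total_seconds()
    coding = (review_requested - first_commit).total_seconds()
    review = (merged - review_requested).total_seconds()
    return coding / total, review / total
```

Tracked per PR and aggregated over time, a review share climbing from roughly 20% toward 50% reproduces the Faros AI finding on your own data.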

This matters for MTTR directly: when review stops being a genuine comprehension checkpoint and becomes a rubber stamp on AI output, the organizational knowledge that on-call engineers rely on during incidents simply isn't built. Faster lead time and shallower understanding of the resulting system are two sides of the same coin.

Change Failure Rate: Flat Percentage, Rising Absolute Pain

Change failure rate looks at what fraction of deployments cause a production failure. Here the AI effect is particularly deceptive because the metric is a ratio: if AI doubles your deployments and doubles your failures, CFR stays flat while your incident workload doubles.
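
The arithmetic is trivial but worth staring at. A sketch with invented numbers:

```python
def change_failure_rate(failed: int, total: int) -> float:
    """Fraction of deployments that caused a production failure."""
    return failed / total

before = change_failure_rate(failed=2, total=20)  # 10% CFR, 2 incidents for on-call
after = change_failure_rate(failed=4, total=40)   # 10% CFR, 4 incidents for on-call

assert before == after == 0.10  # the dashboard shows no change
```

Track the absolute failure count alongside CFR, or the doubling disappears into the ratio.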

The underlying data from GitClear's analysis of 211 million changed lines tells the story more clearly. Code churn — lines rewritten or reverted within two weeks of being written — more than doubled from 3.1% in 2020 to 5.7–7.1% in 2024. Refactored ("moved") code collapsed from 24.1% to 9.5% of total changes. Copy-pasted code rose from 8.3% to 12.3%, with duplicated blocks rising eightfold. Most of this churn doesn't immediately trigger production failures, so CFR looks stable while the codebase accumulates silent debt.

Where it does produce failures, the pattern is clear. A CodeRabbit analysis of 470 open-source PRs found AI-co-authored PRs generated 1.7x more issues overall, with 75% more logic errors and 3x more readability problems. A controlled study by Uplevel following 800 developers over three months found a 41% increase in bugs within PRs after Copilot adoption, with no improvement in cycle time and no improvement in throughput. The change failure rate denominator grows fast enough to absorb this, until it doesn't.

Comprehension Debt and the MTTR Time Bomb

Mean time to restore is where the accumulated cost of AI-generated code comes due. MTTR depends on three capabilities during an incident: forming a mental model of the failing system, reading unfamiliar code under pressure, and running hypothesis tests quickly. AI-generated code degrades all three.

Addy Osmani has written about "comprehension debt" — the gap between code volume and team understanding of that code. Unlike technical debt, it's invisible. The codebase compiles. Tests pass. CI is green. And nobody can explain why the payment service behaves the way it does when it degrades at 3am.

The mechanism is well-documented. AI inverts the historical speed asymmetry between writing and reading code. Historically, producing a changeset was slower than auditing it, which meant review could serve as a genuine comprehension checkpoint. When a junior engineer can generate code faster than a senior engineer can critically evaluate it, that checkpoint collapses. An Anthropic internal study found that developers using AI for code delegation scored 17 percentage points lower on comprehension tests — 50% vs. 67% — despite completing tasks in similar timeframes. Debugging comprehension showed the steepest decline: exactly the skill that drives MTTR.

The METR randomized controlled trial published in 2025 adds another dimension. Experienced open-source contributors using AI tools (Cursor Pro with Claude Sonnet) took 19% longer on real repository tasks than those working without AI. The perception gap was stark: participants self-reported being 20% faster. They were slower, they didn't know it, and they had less practice with unassisted comprehension by the end of the study.

When an incident happens, the on-call engineer finds themselves staring at a 47-file diff containing a 3-line AI-generated change that introduced the failure. Nobody owns it. The approving engineer reviewed it cursorily because thirty similar PRs came through that week. MTTR balloons not because the fix is hard but because the diagnosis is.

What to Measure Instead

DORA isn't wrong — it's incomplete under the new conditions. The 2024 and 2025 DORA reports themselves acknowledge this: the 2025 report formally added Rework Rate as a fifth metric and elevated developer experience signals — cognitive load, review burden, trust in tooling — as leading indicators.

The supplemental metrics that matter most when AI writes significant portions of your codebase:

AI code attribution and churn rate. Tag PRs by AI involvement level. Track what percentage of recently written AI code is deleted or rewritten within two weeks. GitClear currently measures this at 5.7–7.1% industry-wide and rising. A team's churn rate relative to that baseline is a leading indicator of review quality.
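
A minimal sketch of the churn calculation, assuming you can export per-line change records with an AI-attribution flag (the record shape here is hypothetical; GitClear and similar tools derive this from git history):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class LineChange:
    written_at: datetime
    removed_at: datetime | None  # when the line was rewritten or deleted, if ever
    ai_assisted: bool

def churn_rate(changes: list[LineChange],
               window: timedelta = timedelta(days=14)) -> float:
    """Share of written lines rewritten or deleted within the window."""
    if not changes:
        return 0.0
    churned = sum(
        1 for c in changes
        if c.removed_at is not None and c.removed_at - c.written_at <= window
    )
    return churned / len(changes)

# Compare AI-attributed churn against the rest of the codebase:
# churn_rate([c for c in all_changes if c.ai_assisted])
```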

Human review coverage rate. What fraction of merged PRs received substantive human review? This is not binary — a 90-second approval on a 500-line AI PR is not equivalent to genuine review. The 11% of PRs currently merging with zero review at the median organization is a floor on incident risk; it's rising.
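
One way to make "substantive" operational is a floor on review time relative to diff size. A sketch with a deliberately illustrative threshold (the 30-seconds-per-100-lines floor is an assumption, not a validated constant):

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    lines_changed: int
    review_seconds: float  # reviewer time spent before approval
    reviewed: bool         # received any human review at all

def substantive(pr: PullRequest, secs_per_100_lines: float = 30.0) -> bool:
    """Did this PR get more than a rubber-stamp, by the time-per-size floor?"""
    if not pr.reviewed:
        return False
    return pr.review_seconds >= (pr.lines_changed / 100) * secs_per_100_lines

def coverage(prs: list[PullRequest]) -> float:
    """Fraction of merged PRs that cleared the substantive-review floor."""
    return sum(substantive(pr) for pr in prs) / len(prs) if prs else 0.0

# The 90-second approval on a 500-line AI PR fails the floor (needs >= 150s):
assert not substantive(PullRequest(lines_changed=500, review_seconds=90, reviewed=True))
```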

Review load per senior engineer. Senior engineers are the chokepoint for meaningful review in an AI-heavy codebase. Review pressure on those engineers predicts review quality degradation the way ignored alerts predict production incidents. Oobeya describes this as one of the strongest leading indicators of future escaped defects.
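
A sketch of how you might track it, assuming review records tagged with reviewer and ISO week (the schema is hypothetical):

```python
from collections import defaultdict

def weekly_review_lines(reviews: list[dict], seniors: set[str]) -> dict[str, float]:
    """Average lines reviewed per active week, per senior engineer.
    Expects records like {"reviewer": "alice", "week": "2025-W14", "lines": 480}."""
    per_week: dict[tuple[str, str], int] = defaultdict(int)
    for r in reviews:
        if r["reviewer"] in seniors:
            per_week[(r["reviewer"], r["week"])] += r["lines"]
    by_reviewer: dict[str, list[int]] = defaultdict(list)
    for (name, _week), lines in per_week.items():
        by_reviewer[name].append(lines)
    return {name: sum(w) / len(w) for name, w in by_reviewer.items()}
```

The trend matters more than the absolute number: a senior whose weekly review volume triples is the early warning.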

Time-to-first-hypothesis during incidents. How long before the on-call engineer has a working theory about what's failing? This is harder to instrument than MTTR but is a genuine measure of comprehension quality. A team that restores in 10 minutes by rolling back, yet consistently needs 40 minutes to form a working hypothesis, has fast recovery mechanics masking degraded comprehension; MTTR alone won't surface the gap.
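
Instrumenting it requires responders to annotate the incident timeline. A sketch assuming events tagged with a kind, where "hypothesis" is a convention your incident tooling would need to support:

```python
from datetime import datetime

def time_to_first_hypothesis(events: list[tuple[datetime, str]]) -> float | None:
    """Minutes from the earliest event to the first one tagged 'hypothesis'."""
    start = min(ts for ts, _ in events)
    hypotheses = [ts for ts, kind in events if kind == "hypothesis"]
    if not hypotheses:
        return None
    return (min(hypotheses) - start).total_seconds() / 60
```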

Rework rate. Now official DORA canon: the proportion of unplanned deployments made to address user-visible production failures. This catches what CFR misses when AI-generated churn doesn't immediately trigger incidents — the reactive firefighting load that emerges weeks later.
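
As a calculation it's straightforward; the hard part is honestly labeling deployments. A sketch over an assumed deployment record shape:

```python
def rework_rate(deployments: list[dict]) -> float:
    """Proportion of deployments that were unplanned fixes for
    user-visible production failures."""
    if not deployments:
        return 0.0
    rework = sum(
        1 for d in deployments
        if d["unplanned"] and d["fixes_user_visible_failure"]
    )
    return rework / len(deployments)
```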

The DX Core 4 framework, developed by the original authors of DORA, SPACE, and DevEx, wraps these dimensions coherently: speed with review-time context, effectiveness measuring feature work versus AI output correction, quality segmenting defect density by AI versus human code, and impact connecting delivery to business outcomes. Organizations using it report 14% more R&D time on new features versus maintenance — the real metric hiding underneath deployment frequency.

The Dashboard Problem Is a Trust Problem

There's a reason this situation persists even at organizations where the data is available: DORA metrics are easy to present to executives and they currently look good. Deployment frequency is up. Lead time is down. Few engineering leaders want to walk into a quarterly review and say "actually the metrics are misleading and our real delivery health is unclear."

But the 2024 DORA report found something that should be alarming for anyone managing an AI-assisted team: the High Performance cluster shrank from 31% to 22% of organizations while the Low Performance cluster grew from 17% to 25%. Aggregate metrics improved. Distribution worsened. Teams clustered at the extremes rather than improving across the board. That's not what genuine capability improvement looks like — it's what metric inflation looks like.
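
A toy illustration, with invented numbers, of how an aggregate can improve while the distribution deteriorates:

```python
from statistics import mean, pstdev

before = [5, 5, 5, 5]   # weekly deployments across four hypothetical teams
after = [1, 1, 9, 19]   # same org post-AI: polarized toward the extremes

assert mean(after) > mean(before)      # aggregate "improves": 7.5 vs 5.0
assert pstdev(after) > pstdev(before)  # spread explodes: ~7.4 vs 0.0
```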

The underlying question for any engineering organization that has adopted AI tooling is not "how often are we deploying?" but "how many engineers on this team could diagnose a 3am incident in this codebase if the AI tools were unavailable?" If the honest answer is fewer than it was two years ago, deployment frequency is lying to you.

DORA metrics were never meant to be optimized; they were meant to be evidence of something deeper. In an era where AI can inflate every numerator simultaneously, the discipline is recovering that something — and being honest when the evidence no longer supports the story the dashboard tells.
