
Measuring Real AI Coding Productivity: The Metrics That Survive the 90-Day Lag

Tian Pan
Software Engineer

Most teams adopting AI coding tools hit the same wall. Month one looks like a success story: PR throughput is up, sprint velocity is climbing, and the engineering manager is putting together a slide deck to share with leadership. By month three, something has quietly gone wrong. Incidents creep up. Senior engineers are spending more time in review. A simple bug fix now requires understanding code nobody on the team actually wrote. The productivity gains have evaporated — but the measurement system never caught it.

The problem is that the metrics most teams reach for first — lines generated, PRs merged, story points burned — are the wrong unit of measurement for AI-assisted development. They measure the cost of producing code, not the cost of owning it. And AI has made production nearly free while leaving ownership costs untouched.

The 90-Day Divergence

The research on this is consistent enough to call it a pattern. Analysis across hundreds of engineering teams shows that code churn (the percentage of code revised within two weeks of being written) has roughly doubled since AI adoption became widespread, rising from about 3.3% in 2021 to 5.7–7.1% in recent years. Change failure rates have jumped 30% in high-adoption environments. Incidents per pull request have climbed 23.5% even as PR volume has surged.

Why the lag? The first 30 days look good because AI removes friction from the typing phase. Writing code gets faster. Developers feel productive. But the delivery process continues long after the code is written: review, integration, debugging in production, and eventually maintenance — all activities where AI-generated code creates friction rather than removing it.

Code written by an AI optimizes for appearance, not correctness. It's coherent, it compiles, it passes the linter. But it often lacks the judgment that comes from understanding the broader system: why a particular edge case matters, how a data model evolved, what will happen when this module is touched six months from now. That judgment shows up in review comments, in incidents, in the time it takes a new engineer to understand what they're looking at.

The metrics that capture this failure mode don't accumulate until 60–90 days after the code ships.

The Velocity Proxies That Mislead

Lines of code stopped being meaningful years ago. When AI can generate ten thousand lines in the time a developer writes fifty, counting lines measures almost nothing. It says nothing about correctness, maintainability, or whether the problem was worth solving.

PR throughput is more seductive because it feels like a real output. One study of 1,255 engineering teams found that high-AI-adoption teams merge 98% more pull requests per developer. But PR review time increased 91% at the same time — the bottleneck shifted from writing to reviewing, and nobody was measuring the reviewer's time cost.

Story point velocity suffers from a related problem. Teams shipping more features aren't necessarily delivering more value. When AI-generated code is 1.75× more likely to contain logic errors and carries 322% more privilege escalation paths than human-written code, high velocity is actively risky if review quality doesn't scale proportionally.

Experienced developers who understand the codebase deeply tend to catch these problems. Developers who are busy or working outside their area of expertise tend to accept AI code faster. The research on automation bias is clear: people accept erroneous automated advice at a 26% higher rate than equivalent advice from humans. AI-generated code arrives with an implicit authority that makes reviewers less likely to push back.

The Lagging Indicators That Tell the Truth

The metrics that actually reveal AI coding ROI take time to accumulate, but they're measuring the right thing.

Change failure rate is the most important single number. It's the percentage of deployments that require remediation — rollback, hotfix, or emergency patch. CFR captures code quality failures that velocity metrics systematically hide. In AI-heavy environments, it's the earliest aggregated signal of a quality problem. Monitor it with a 90-day rolling window and segment by commit origin when possible.
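As a rough illustration of the 90-day rolling window segmented by commit origin, here is a minimal Python sketch. The `Deployment` record, its field names, and the origin labels (`"ai_assisted"`, `"human"`) are assumptions made for this example; how a deployment gets tagged with an origin depends entirely on your own tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Deployment:
    deployed_on: date
    needed_remediation: bool  # rollback, hotfix, or emergency patch
    origin: str               # hypothetical label, e.g. "ai_assisted" or "human"


def change_failure_rate(deployments, as_of, window_days=90):
    """Return {origin: CFR} for deployments in the window ending on as_of."""
    window_start = as_of - timedelta(days=window_days)
    totals = defaultdict(int)
    failures = defaultdict(int)
    for d in deployments:
        # Only count deployments inside the rolling window.
        if window_start < d.deployed_on <= as_of:
            totals[d.origin] += 1
            if d.needed_remediation:
                failures[d.origin] += 1
    return {origin: failures[origin] / totals[origin] for origin in totals}
```

Recomputing this on a fixed schedule (weekly, say) gives a trend line per origin rather than a single blended number.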

Code churn (the percentage of code revised within two weeks of being written) is the fastest-moving metric in this group, and it works as an early warning for the others. Rising churn at day 30 reliably predicts incident rate jumps by day 90. Teams with high AI adoption see commit patterns that include rewriting the same files repeatedly, a sign the AI code was wrong in ways that weren't obvious during review. Tracking churn by author or by tool adoption cohort surfaces this signal early enough to act.
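The churn definition above can be sketched as a small Python function. This assumes you have already extracted per-line history (for example from version-control blame data) into `LineRecord` objects; that record type, its fields, and the cohort labels are hypothetical and only serve the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class LineRecord:
    written_on: date
    revised_on: Optional[date]  # None if the line has not been changed or deleted
    cohort: str                 # hypothetical label, e.g. "ai_heavy" or "ai_light"


def churn_by_cohort(records, churn_days=14):
    """Return {cohort: fraction of lines revised within churn_days of being written}."""
    total = defaultdict(int)
    churned = defaultdict(int)
    for r in records:
        total[r.cohort] += 1
        if r.revised_on is not None and (r.revised_on - r.written_on).days <= churn_days:
            churned[r.cohort] += 1
    return {cohort: churned[cohort] / total[cohort] for cohort in total}
```

One caveat with this sketch: lines written less than two weeks ago have not had a full window in which to churn, so it is probably safer to exclude them before comparing cohorts.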

Incident rate per PR connects shipping decisions to production outcomes. As PR volume increases, this rate needs to stay flat or decline. When it rises, the math on AI adoption becomes unfavorable regardless of what velocity numbers say.
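To make the "flat or declining" test concrete, here is a small Python sketch that compares the rate across two periods. The function names and the `(incidents, merged_prs)` tuple shape are assumptions for illustration, not part of any standard tool.

```python
def incident_rate_per_pr(incidents: int, merged_prs: int) -> float:
    """Incidents attributed to a period divided by PRs merged in that period."""
    if merged_prs <= 0:
        raise ValueError("merged_prs must be positive")
    return incidents / merged_prs


def rate_change(before, after) -> float:
    """Relative change in incidents per PR; each argument is (incidents, merged_prs)."""
    baseline = incident_rate_per_pr(*before)
    current = incident_rate_per_pr(*after)
    if baseline == 0:
        raise ValueError("baseline period has no incidents; relative change is undefined")
    return (current - baseline) / baseline
```

With made-up numbers, a team that goes from 10 incidents on 200 PRs to 25 incidents on 400 PRs has doubled its throughput, yet its incident rate per PR has risen from 0.05 to 0.0625. A positive result from `rate_change` is the condition the paragraph above warns about.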

Review burden is underinstrumented at most companies. Senior engineers reviewing AI-generated PRs spend more time per review than on human-written code — the PRs are larger (roughly 18% larger on average), they contain more subtle errors, and they require more verification effort. If the total time senior engineers spend reviewing AI code exceeds the time the team saves by using AI tools, the ROI is negative, full stop.
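The break-even comparison described above is simple enough to write down directly. The sketch below assumes both quantities are measured in hours over the same period; `ReviewLedger` and its field names are hypothetical, and how you estimate "hours saved" (surveys, time tracking, cycle-time deltas) is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewLedger:
    hours_saved_by_ai: float          # estimated authoring time saved across the team
    senior_review_hours_on_ai: float  # senior engineer time spent reviewing AI-generated PRs


def net_hours(ledger: ReviewLedger) -> float:
    """Positive means the team came out ahead; negative means review cost exceeded savings."""
    return ledger.hours_saved_by_ai - ledger.senior_review_hours_on_ai


def roi_is_negative(ledger: ReviewLedger) -> bool:
    return net_hours(ledger) < 0
```

This treats an hour of senior review as equal to an hour of saved authoring time. If senior time is more expensive on your team, weighting the two sides by cost would make the comparison stricter.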

Leading Signals Before the Maintenance Cliff

The lagging indicators are definitive, but they arrive late. Several signals emerge at 4–8 weeks and give enough runway to correct course.

Review queue depth rising faster than PR volume is one of the earliest signals. It means reviewers are becoming a bottleneck, which usually indicates that code quality is forcing slower review: either reviewers are deliberately reading more carefully, or they are catching problems that require rework.
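One way to check whether the queue is outpacing volume is to compare relative growth over the same weeks. The Python sketch below assumes two weekly series, open reviews at the end of each week and PRs opened during each week; both function names are invented for this example.

```python
def relative_growth(series) -> float:
    """Relative change from the first to the last value of a series."""
    first, last = series[0], series[-1]
    if first <= 0:
        raise ValueError("first value must be positive")
    return (last - first) / first


def queue_outpacing_volume(queue_depth, prs_opened) -> bool:
    """True when review queue depth grew faster than PR volume over the same weeks."""
    return relative_growth(queue_depth) > relative_growth(prs_opened)
```

Comparing only the endpoints is noisy, so in practice a smoothed series (a multi-week average, for instance) is likely a better input than raw weekly counts.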

Review comments per line of code going up means code is becoming harder to understand. This is distinct from reviewers being more thorough: if the comment rate is rising without any change in reviewer behavior, the code itself has become more opaque.
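To separate "the code got more opaque" from "the reviewer mix changed", one option is to compute comment density per reviewer and compare each reviewer only against their own earlier baseline. The sketch below is a Python illustration under that assumption; the `(reviewer, comments, lines_changed)` tuple shape and the function names are hypothetical.

```python
from collections import defaultdict


def comment_density_by_reviewer(reviews):
    """reviews: iterable of (reviewer, comments, lines_changed).

    Returns review comments per 100 changed lines for each reviewer.
    """
    comments = defaultdict(int)
    lines = defaultdict(int)
    for reviewer, comment_count, lines_changed in reviews:
        comments[reviewer] += comment_count
        lines[reviewer] += lines_changed
    return {r: 100 * comments[r] / lines[r] for r in lines if lines[r] > 0}


def reviewers_with_rising_density(earlier, later):
    """Reviewers present in both periods whose comment density went up."""
    before = comment_density_by_reviewer(earlier)
    after = comment_density_by_reviewer(later)
    return sorted(r for r in before.keys() & after.keys() if after[r] > before[r])
```

If most reviewers who appear in both periods show rising density, the change is harder to attribute to any one person becoming more thorough.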
