GitHub’s 2025 Octoverse report dropped some numbers that should make every engineering leader pause: 43 million pull requests merged per month, 82 million code pushes, and 41% of new code now AI-assisted. Meanwhile, Faros AI’s research shows that teams with high AI adoption merge 98% more PRs – but PR review time has increased by 91%.
I have been staring at these numbers for weeks, and I think we are witnessing a fundamental shift in where the actual bottleneck lives in software delivery. For years, we optimized for writing code faster. Agile, CI/CD, microservices, developer experience tooling – all aimed at reducing time-to-code. AI coding assistants were supposed to be the final accelerator.
And they are. Code generation is essentially solved. But we accidentally created a new crisis: the verification bottleneck.
The Numbers That Keep Me Up at Night
Here is what the data actually says:
- 96% of developers do not fully trust the functional accuracy of AI-generated code (SonarSource 2026 State of Code Survey)
- 75% of developers will not merge AI code without manual review, even when AI gets it “right”
- 59% of developers rate the effort of reviewing, testing, and correcting AI output as “moderate” or “substantial”
- PR sizes are up ~18% as AI adoption increases, which compounds the review burden
- Incidents per PR are up ~24% and change failure rates are up ~30%
So we have a situation where AI generates code 10x faster, but real-world productivity improves only 10-20% because human review cannot scale at the same rate. Teams are producing dramatically more output, but the humans who need to verify that output are the same people, with the same cognitive bandwidth, working the same hours.
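The arithmetic behind that gap can be sketched as a two-stage pipeline. The numbers below are illustrative, not taken from any of the surveys above:

```python
# Illustrative sketch with made-up numbers: a delivery pipeline moves only
# as fast as its slowest stage, so 10x faster code generation barely moves
# end-to-end throughput if review capacity stays roughly flat.

def delivery_throughput(generated_per_week: float, reviewed_per_week: float) -> float:
    """Merged-PR throughput is capped by the slower of the two stages."""
    return min(generated_per_week, reviewed_per_week)

before = delivery_throughput(generated_per_week=30, reviewed_per_week=30)
after = delivery_throughput(generated_per_week=300, reviewed_per_week=35)

print(before, after)                 # 30 35
print(round(after / before - 1, 2))  # 0.17 -- a ~17% gain despite 10x generation
```

The `min()` is the whole point: once generation outpaces review, adding more generation capacity changes nothing downstream.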
The Uncomfortable Question
This leads to a question I think we need to confront honestly: should we accept that AI-generated code receives less rigorous review than human-written code?
I am not asking whether we should lower standards in an ideal world. I am asking whether it is already happening de facto, and whether we should be intentional about it rather than pretending the old model still works.
Think about it from a reviewer’s perspective. You used to review 3-5 PRs per day from colleagues you know, whose coding patterns you understand, whose reasoning you can infer from the changes. Now you are reviewing 8-12 PRs per day, many containing AI-generated code that follows different patterns, uses unfamiliar abstractions, and comes with explanations that may or may not reflect what the code actually does.
The realistic options seem to be:
- Maintain the same review depth – which means accepting that review becomes the permanent bottleneck and AI coding gains are largely illusory
- Create tiered review processes – where AI-generated code gets different (not necessarily less) review treatment
- Invest heavily in automated verification – static analysis, AI-assisted review, property-based testing to augment human reviewers
- Accept higher defect rates – as the cost of faster delivery, and invest more in production monitoring and rollback capabilities
What I Am Seeing in Practice
At my company, we have started tracking AI-generated vs human-written PRs separately. The data is revealing:
- AI-generated PRs have a 32.7% acceptance rate on first review vs 84.4% for human PRs (consistent with Faros AI’s findings)
- Average review time for AI PRs is 2.3x longer than human PRs
- But AI PRs that pass review have comparable production incident rates to human PRs
That last point is the interesting one. It suggests the review process itself is effective at catching issues – the problem is not quality control, it is the sheer volume and the cognitive load it puts on reviewers.
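A minimal sketch of that kind of cohort tracking, with invented field names and made-up records (no real tooling or API is implied):

```python
from statistics import mean

# Hypothetical PR records; "origin", "first_pass_approved", and
# "review_hours" are invented field names for illustration only.
prs = [
    {"origin": "ai",    "first_pass_approved": False, "review_hours": 5.1},
    {"origin": "ai",    "first_pass_approved": True,  "review_hours": 4.3},
    {"origin": "ai",    "first_pass_approved": False, "review_hours": 6.0},
    {"origin": "human", "first_pass_approved": True,  "review_hours": 1.9},
    {"origin": "human", "first_pass_approved": True,  "review_hours": 2.3},
]

def cohort_stats(prs, origin):
    """First-pass acceptance rate and mean review time for one cohort."""
    cohort = [p for p in prs if p["origin"] == origin]
    return {
        "first_pass_acceptance": sum(p["first_pass_approved"] for p in cohort) / len(cohort),
        "mean_review_hours": round(mean(p["review_hours"] for p in cohort), 2),
    }

print(cohort_stats(prs, "ai"))
print(cohort_stats(prs, "human"))
```

Splitting the dashboard this way is what surfaces the divergence: blended numbers would show a team that looks merely "a bit slower," not one whose AI cohort fails first review two-thirds of the time.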
The DORA Metrics Problem
Here is another angle that concerns me: our existing measurement frameworks are not designed for this reality. DORA metrics – deployment frequency, lead time, change failure rate, MTTR – assume a relatively stable relationship between code volume and review capacity. When AI multiplies code output by 5-10x without proportionally increasing review capacity, these metrics can paint a misleading picture.
A team might show “improving” deployment frequency while actually degrading review quality. GitClear predicts code churn will double in 2026 because of AI, which means a significant portion of those “deployments” are rework.
What I Think We Should Do
I do not think the answer is lowering standards. But I also do not think the answer is pretending that a 5-person team can maintain the same review quality on 98% more PRs.
My current thinking:
- Separate the review workflow for AI-assisted PRs with mandatory automated checks before human review
- Invest in AI-assisted review tools (the irony is not lost on me) to handle the first pass
- Redesign PR sizes – if AI generates large PRs, we need tooling that breaks them into reviewable chunks
- Track review fatigue metrics – not just throughput, but reviewer cognitive load over time
- Be honest about capacity – if your team’s review bandwidth supports 30 PRs/week, that is the real throughput ceiling, regardless of how fast AI can generate code
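That last capacity point can be made concrete with a toy backlog model (assumed numbers): when PRs arrive faster than the team can review them, the review queue grows linearly and never drains.

```python
def review_backlog(weeks: int, opened_per_week: int = 50,
                   review_capacity: int = 30, backlog: int = 0) -> int:
    """Toy model: unreviewed-PR backlog after `weeks` of steady inflow."""
    for _ in range(weeks):
        backlog = max(0, backlog + opened_per_week - review_capacity)
    return backlog

print(review_backlog(4))   # 80 unreviewed PRs after a month
print(review_backlog(12))  # 240 -- the queue grows 20 PRs/week, indefinitely
```

The steady-state conclusion is the same as the honesty point above: if review bandwidth is 30 PRs/week, throughput is 30 PRs/week, and everything generated beyond that is queue, not output.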
What are others seeing? Is code review the bottleneck on your team? And are you changing how you review AI-generated code, or maintaining the same process?