43 Million PRs Merged Monthly on GitHub but Code Review Is the Bottleneck — Should We Accept That AI-Generated Code Gets Less Review?

GitHub’s 2025 Octoverse report dropped some numbers that should make every engineering leader pause: 43 million pull requests merged per month, 82 million code pushes, and 41% of new code is now AI-assisted. Meanwhile, Faros AI’s research shows that teams with high AI adoption merge 98% more PRs – but PR review time has increased by 91%.

I have been staring at these numbers for weeks, and I think we are witnessing a fundamental shift in where the actual bottleneck lives in software delivery. For years, we optimized for writing code faster. Agile, CI/CD, microservices, developer experience tooling – all aimed at reducing time-to-code. AI coding assistants were supposed to be the final accelerator.

And they are. Code generation is essentially solved. But we accidentally created a new crisis: the verification bottleneck.

The Numbers That Keep Me Up at Night

Here is what the data actually says:

  • 96% of developers do not fully trust the functional accuracy of AI-generated code (SonarSource 2026 State of Code Survey)
  • 75% of developers will not merge AI code without manual review, even when AI gets it “right”
  • 59% of developers rate the effort of reviewing, testing, and correcting AI output as “moderate” or “substantial”
  • PR sizes are up ~18% as AI adoption increases, which compounds the review burden
  • Incidents per PR are up ~24% and change failure rates are up ~30%

So we have a situation where AI generates code 10x faster, but real-world productivity improves only 10-20% because human review cannot scale at the same rate. Teams are producing dramatically more output, but the humans who need to verify that output are the same people, with the same cognitive bandwidth, working the same hours.
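To make that arithmetic concrete: if only the coding share of delivery time accelerates, Amdahl's law caps the end-to-end gain. The time fractions below are illustrative assumptions for the sake of the calculation, not measured figures from any of the reports cited above.

```python
def end_to_end_speedup(coding_fraction: float, coding_speedup: float) -> float:
    """Amdahl's-law view of delivery: only the coding share of total time
    accelerates; review, testing, and deployment stay at human speed."""
    return 1.0 / ((1.0 - coding_fraction) + coding_fraction / coding_speedup)

# Assumed split: coding is ~20% of delivery time, review/test/deploy the rest.
# Even a 10x coding speedup then lifts overall throughput only ~22%.
print(round(end_to_end_speedup(0.20, 10.0), 2))  # 1.22
```

Tune `coding_fraction` to your own team's time allocation; the smaller the coding share, the less a coding-only speedup can move the end-to-end number.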

The Uncomfortable Question

This leads to a question I think we need to confront honestly: should we accept that AI-generated code receives less rigorous review than human-written code?

I am not asking whether we should lower standards in an ideal world. I am asking whether it is already happening de facto, and whether we should be intentional about it rather than pretending the old model still works.

Think about it from a reviewer’s perspective. You used to review 3-5 PRs per day from colleagues you know, whose coding patterns you understand, whose reasoning you can infer from the changes. Now you are reviewing 8-12 PRs per day, many containing AI-generated code that follows different patterns, uses unfamiliar abstractions, and comes with explanations that may or may not reflect what the code actually does.

The realistic options seem to be:

  1. Maintain the same review depth – which means accepting that review becomes the permanent bottleneck and AI coding gains are largely illusory
  2. Create tiered review processes – where AI-generated code gets different (not necessarily less) review treatment
  3. Invest heavily in automated verification – static analysis, AI-assisted review, property-based testing to augment human reviewers
  4. Accept higher defect rates – as the cost of faster delivery, and invest more in production monitoring and rollback capabilities

What I Am Seeing in Practice

At my company, we have started tracking AI-generated vs human-written PRs separately. The data is revealing:

  • AI-generated PRs have a 32.7% acceptance rate on first review vs 84.4% for human PRs (consistent with Faros AI’s findings)
  • Average review time for AI PRs is 2.3x longer than human PRs
  • But AI PRs that pass review have comparable production incident rates to human PRs

That last point is interesting. It suggests that the review process itself is effective at catching issues – the problem is just the volume and the cognitive load on reviewers.
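For anyone who wants to replicate this kind of tracking, the aggregation is simple once PRs are tagged by origin. The `PullRequest` shape and the `ai-assisted` label below are illustrative placeholders, not our actual pipeline; in practice the data would come from your Git host's API.

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    labels: set                     # e.g. {"ai-assisted"} or empty
    approved_on_first_review: bool  # True if no changes were requested

def first_pass_acceptance(prs):
    """Split PRs by an assumed 'ai-assisted' label and return the
    first-review acceptance rate for each cohort."""
    by_origin = {"ai": [], "human": []}
    for pr in prs:
        key = "ai" if "ai-assisted" in pr.labels else "human"
        by_origin[key].append(pr.approved_on_first_review)
    return {k: sum(v) / len(v) for k, v in by_origin.items() if v}
```

The hard part is not this arithmetic but labeling reliably; self-reported "AI-assisted" flags undercount, so some teams infer the label from IDE telemetry instead.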

The DORA Metrics Problem

Here is another angle that concerns me: our existing measurement frameworks are not designed for this reality. DORA metrics – deployment frequency, lead time, change failure rate, MTTR – assume a relatively stable relationship between code volume and review capacity. When AI multiplies code output by 5-10x without proportionally increasing review capacity, these metrics can paint a misleading picture.

A team might show “improving” deployment frequency while actually degrading review quality. GitClear predicts code churn will double in 2026 because of AI, which means a significant portion of those “deployments” are rework.

What I Think We Should Do

I do not think the answer is lowering standards. But I also do not think the answer is pretending that a 5-person team can maintain the same review quality on 98% more PRs.

My current thinking:

  1. Separate the review workflow for AI-assisted PRs with mandatory automated checks before human review
  2. Invest in AI-assisted review tools (the irony is not lost on me) to handle the first pass
  3. Redesign PR sizes – if AI generates large PRs, we need tooling that breaks them into reviewable chunks
  4. Track review fatigue metrics – not just throughput, but reviewer cognitive load over time
  5. Be honest about capacity – if your team’s review bandwidth supports 30 PRs/week, that is the real throughput ceiling, regardless of how fast AI can generate code
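As a concrete sketch of item 1: a pre-review gate that only routes a PR to a human once the automated checks for its origin have passed. The check names, the `ai_assisted` flag, and the PR dict shape are all assumptions for illustration, not a real CI API.

```python
# Hypothetical pre-review gate: AI-assisted PRs must clear a stricter
# set of automated checks before a human reviewer is assigned.
REQUIRED_CHECKS_AI = {"lint", "type-check", "unit-tests", "sast-scan"}
REQUIRED_CHECKS_HUMAN = {"lint", "unit-tests"}

def ready_for_human_review(pr: dict) -> bool:
    """Return True when every required automated check has passed."""
    required = REQUIRED_CHECKS_AI if pr.get("ai_assisted") else REQUIRED_CHECKS_HUMAN
    passed = set(pr.get("checks_passed", []))
    return required <= passed  # required is a subset of passed
```

The point of the stricter AI set is to spend cheap machine cycles before spending scarce reviewer attention, not to replace the human pass.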

What are others seeing? Is code review the bottleneck on your team? And are you changing how you review AI-generated code, or maintaining the same process?

Rachel, this data is exactly what I have been trying to articulate to my leadership team. The 91% increase in PR review time is staggering but matches what I see on the ground.

I want to push back slightly on your framing though. You frame this as “should we accept less rigorous review” – but I think the real question is “should we redefine what rigorous review means in an AI-assisted context?”

In my organization, we have started separating review objectives:

  • Correctness review: Does this code do what it claims? (Humans are still best at this)
  • Style/pattern review: Does this follow our conventions? (AI tools handle this better and faster)
  • Security review: Are there vulnerabilities? (Automated SAST catches 80% of known patterns)
  • Architecture review: Does this fit our system design? (Humans only, no substitute)

By splitting the review into layers and assigning each to the most effective reviewer (human or automated), we maintained quality while reducing the burden on human reviewers by roughly 40%.
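A minimal sketch of that layer-to-reviewer mapping, with assignments mirroring the list above. The structure is illustrative; the real routing lives in review tooling and team policy, not a dict.

```python
# Layered review: each category of potential issue gets the most
# effective detection mechanism, human or automated.
REVIEW_LAYERS = {
    "correctness": "human",
    "style": "automated",      # linters and formatters
    "security": "automated",   # SAST first pass; humans triage findings
    "architecture": "human",   # no substitute
}

def assign_reviewers(layers=REVIEW_LAYERS):
    """Split review layers into human-owned and automation-owned lists."""
    human = [layer for layer, owner in layers.items() if owner == "human"]
    automated = [layer for layer, owner in layers.items() if owner == "automated"]
    return human, automated
```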

The key insight for me was this: “rigorous” does not mean “one human reads every line.” It means “every category of potential issue has an appropriate detection mechanism.” If static analysis catches type errors faster and more reliably than a human, having a human check types is not “more rigorous” – it is less efficient with no quality benefit.

Your point about DORA metrics breaking is spot-on. We stopped using deployment frequency as a standalone metric precisely because it was being artificially inflated by AI-generated volume. We now track “verified deployments” – deployments where the review process completed all quality gates – as the real metric.
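In code, the "verified deployments" idea reduces to filtering on quality-gate status before counting. The `all_gates_passed` field name is hypothetical; substitute whatever your deployment records actually carry.

```python
def verified_deployment_rate(deployments):
    """Fraction of deployments whose PRs cleared every quality gate.
    Each deployment is a dict with an assumed 'all_gates_passed' flag."""
    if not deployments:
        return 0.0
    verified = sum(1 for d in deployments if d.get("all_gates_passed"))
    return verified / len(deployments)
```

Tracking this alongside raw deployment frequency makes the inflation visible: frequency can rise while the verified rate falls.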

I appreciate the data-driven approach here, but I have to push back on one thing: the premise that we should even be considering reduced review for AI code.

From a security perspective, AI-generated code needs MORE review, not less. The 45% vulnerability rate in AI-generated code is not a statistic you optimize around – it is a flashing red warning.

The problem with the “tiered review” approach that keeps coming up is that it creates an explicit fast path through your security controls. Every attacker’s dream is a known shortcut through the target’s review process. If I know that AI-flagged PRs get lighter human review, I know exactly which PRs to target.

And here is the thing about that 32.7% first-pass acceptance rate for AI PRs – that means 67.3% of AI PRs had issues serious enough to reject on first review. We are talking about a category of code where the majority fails review, and the proposed solution is to review it less carefully?

I understand the capacity problem is real. But the solution should not be “review AI code less.” It should be “generate less AI code until your review capacity can handle it” or “invest in automated security tooling that compensates.” The answer to “we cannot review everything” should never be “then review less” when the code in question has a demonstrated higher defect rate.

Rachel, you mentioned that AI PRs that pass review have comparable production incident rates. That is because the review process is catching the problems. If you weaken the review, you lose that filter, and incident rates will spike. The data you are citing actually argues against relaxing review standards, not for it.

Both Rachel and Sam raise excellent points, but I want to add a leadership lens to this conversation.

The code review bottleneck is fundamentally a resource allocation and organizational design problem. And like most org design problems, the people living it every day understand it better than the executives making decisions about tooling budgets.

Here is what concerns me about the current discourse: the AI tool vendors frame this as “adopt AI review tools to solve the review bottleneck created by AI coding tools.” That is a treadmill. You adopt AI coding, which creates review pressure. You adopt AI review, which enables more AI coding, which creates more review pressure. At no point does anyone ask: is the current volume of code production actually delivering proportional business value?

I have started asking my teams to track a metric I call “value-per-PR.” It is imperfect, but it measures whether the incremental PRs from AI adoption are actually moving product metrics or just inflating activity. Early data suggests that roughly 30-40% of the PR volume increase from AI adoption is rework, refactoring, or changes that could have been avoided with better initial design.
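The aggregation behind a metric like that is trivial; the judgment-heavy part is classifying each merged PR in the first place. A sketch, assuming PRs have already been labeled (by humans or heuristics) as value-moving, rework, or neutral:

```python
from collections import Counter

def pr_value_breakdown(pr_labels):
    """Given per-PR labels like 'value', 'rework', 'neutral', return the
    share of total PR volume in each category."""
    counts = Counter(pr_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

If the rework share of the post-AI volume increase really is 30-40%, this breakdown is what surfaces it.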

If that holds, the real solution to the review bottleneck is not “review faster” – it is “generate less noise.” AI tools that produce higher-quality output with fewer revision cycles would do more for the bottleneck than any review optimization.

Rachel, your point about tracking review fatigue metrics is something I have not seen anyone else suggest, and I think it is crucial. We measure developer burnout in many ways, but review-specific fatigue is a blind spot. I am going to propose adding review load metrics to our next developer experience survey.

The industry needs to get serious about treating review capacity as a first-class resource constraint, not an afterthought to AI adoption decisions.

I want to bring a slightly different perspective as someone in the trenches writing and reviewing code every day.

The conversation so far has been about organizational solutions – tiered review, tooling, metrics. All valid. But there is a human dynamics problem here that nobody is naming directly.

AI coding tools have created social pressure to produce more. When half your team doubles their PR output with AI, the other half feels pressure to match. When your manager sees velocity metrics going up, the implicit expectation is that everyone should be moving at that pace. Nobody explicitly says “review quality matters less now” – but the organizational signals all point that way.

I have started noticing something in my team: the engineers who use AI most heavily also review least carefully. Not because they are lazy, but because their mental model of code has shifted. When you spend all day generating code through prompts and accepting suggestions, your relationship with code changes. You start seeing code as disposable and easily regenerated rather than carefully crafted. That mindset carries into reviews: “if this has a bug, we will just regenerate it.”

That works for some types of code. It absolutely does not work for code that handles money, personal data, or critical infrastructure.

Rachel, your option 4 – “accept higher defect rates as the cost of faster delivery” – scares me because I think it is already the de facto choice most teams have made; they just have not admitted it yet. The 9% increase in bugs per developer that DORA reports is that quiet acceptance in action.

I do think the PR size point is the most actionable recommendation. In my experience, the single best thing we can do for review quality is enforce smaller PRs. Not just because they are easier to review, but because they force the developer (and the AI) to think in smaller, more testable increments. AI tools love generating 500-line monster PRs. Making engineers break those down before review is a quality forcing function.
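The forcing function can be as blunt as a CI check that fails oversized diffs. The threshold below is an assumption to tune per team, not a recommended standard.

```python
# Hypothetical PR-size gate: fail CI when a diff exceeds a reviewable
# threshold, nudging authors (and their AI tools) to split the work.
MAX_CHANGED_LINES = 400  # assumed team threshold

def pr_size_ok(lines_added: int, lines_deleted: int,
               limit: int = MAX_CHANGED_LINES) -> bool:
    """Return True when the total diff size is within the review budget."""
    return lines_added + lines_deleted <= limit
```

Pair this with an escape hatch (an override label for generated files or lockfiles) so the gate blocks 500-line feature monsters without blocking mechanical changes.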