Following up on the AI productivity paradox discussion—I want to dig deeper into what I think is the core bottleneck: our code review processes were designed for a different volume and quality profile of code, and they’re breaking under AI-generated workload.
The Traditional Review Model’s Assumptions
Traditional code review was designed around these assumptions:
- Each reviewer handles 10-15 PRs per week
- Code is well-understood by the author
- Reviews take 15-30 minutes for most PRs
- 2-3 review cycles per PR on average
- Review is primarily about catching bugs and logic errors
All of these assumptions are failing in AI-heavy teams.
What We’re Actually Seeing
Our data from 40+ engineers in a heavily regulated fintech environment:
Before widespread AI adoption (Q1 2025):
- 120 PRs/week across team
- Average review time: 25 minutes per PR
- Review turnaround (median time from PR creation to approval): 4 hours
- Review comments per PR: 4.2 average
- Approval rate on first review: 65%
After widespread AI adoption (Q1 2026):
- 230 PRs/week across team (+92%)
- Average review time: 35 minutes per PR (+40%)
- Review turnaround (median time from PR creation to approval): 18 hours (+350%)
- Review comments per PR: 3.1 average (-26%)
- Approval rate on first review: 52% (-13 points)
The paradox in those numbers: we're spending more time per review but leaving fewer comments, and despite that extra time, fewer PRs pass on the first attempt.
Why Reviews Take Longer for AI Code
This surprised me, but after watching our senior engineers review AI-generated PRs, the pattern is clear:
1. The Author Doesn’t Deeply Understand Their Own Code
When a senior engineer writes code, I can ask “why did you choose this approach?” and get a thoughtful answer about trade-offs considered.
When a junior engineer uses AI to generate code, they often can’t explain the architectural choices because they didn’t make them—AI did. The review becomes a teaching session (“do you understand what this does?”) before it can be a quality gate.
2. Unfamiliar Patterns and Libraries
AI suggests whatever patterns are common in its training data, not what’s common in our codebase. This means reviewers encounter code that’s syntactically correct, idiomatically reasonable, but architecturally inconsistent with the rest of the system.
Last week: an engineer used Copilot to implement caching. Copilot suggested a Redis pattern we don’t use anywhere else. The code worked, tests passed, but now we had a new pattern to maintain. Three senior engineers spent 90 minutes discussing whether to accept it or ask for refactoring to match our existing approach.
3. Subtle Business Logic Errors That Tests Miss
Here’s the scary one: AI is excellent at generating code that looks correct and passes tests but has subtle business logic errors.
Example from last month: Payment processing code that correctly handled the happy path and obvious error cases (which tests covered) but had wrong behavior for a specific edge case (international transactions over $10k with partial refunds) that only a domain expert would spot.
The tests passed because we didn't have a test for that edge case. The code looked reasonable. We only caught it because a reviewer who'd worked on our payment system for three years noticed something that "felt off."
The Degrading Review Quality Problem
Here’s what worries me most: Review comments per PR are down 26% despite code complexity being unchanged.
I think reviewers are overwhelmed. When you’re looking at 40 PRs per week instead of 15, you start doing shallower reviews:
- Focus on obvious bugs, skip architectural discussion
- Trust that tests are comprehensive, skip edge case reasoning
- Approve if it “looks fine,” skip questioning design choices
This is how technical debt accumulates silently.
The Solutions We’re Experimenting With
Michelle mentioned automated quality gates in the other thread—we’re going all-in on this:
Tier 1: Pre-PR Automated Checks
- Linting, formatting, security scanning (standard stuff)
- Design pattern validation: Custom tooling that checks if new code follows our architectural patterns
- Compliance scanning: Flags any code touching financial data for enhanced review
- Test coverage requirements: AI-generated code must have >90% coverage (vs. 80% for human code)
Early results: Catching ~60% of issues that would’ve required review comments. Review time per PR down from 35 min to 28 min.
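For illustration, here's roughly what the differential coverage gate can look like as a CI step. This is a minimal sketch, assuming coverage.py's Cobertura-style coverage.xml output and a PR_LABELS environment variable populated by the CI job; the variable name and the "ai-assisted" label are illustrative conventions, not standards.

```python
#!/usr/bin/env python3
"""Pre-merge coverage gate with a stricter bar for AI-assisted PRs (sketch only)."""
import os
import sys
import xml.etree.ElementTree as ET

# Thresholds from our policy: >90% for AI-generated code, 80% otherwise.
labels = os.environ.get("PR_LABELS", "").split(",")
threshold = 0.90 if "ai-assisted" in labels else 0.80

# coverage.py's Cobertura report carries an overall line-rate attribute on the root element.
line_rate = float(ET.parse("coverage.xml").getroot().get("line-rate"))

if line_rate < threshold:
    print(f"FAIL: coverage {line_rate:.1%} is below the {threshold:.0%} gate")
    sys.exit(1)
print(f"OK: coverage {line_rate:.1%} meets the {threshold:.0%} gate")
```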
Tier 2: AI-Assisted Review
Testing CodeRabbit and similar tools to provide automated review comments before humans look:
- “This looks similar to pattern X elsewhere in the codebase—consider using that instead”
- “This function has high cyclomatic complexity—consider refactoring”
- “This logic differs from similar implementations in files Y and Z—intentional?”
Mixed results so far. False positives are high (~40%), so reviewers are learning to ignore some automated comments, which might be training them to ignore warnings in general. Concerning.
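For teams that want to prototype the complexity-style comment in-house before buying a tool, the mechanics are simple. A rough sketch using Python's ast module; the branch-counting heuristic and the threshold of 10 are assumptions for illustration, not what CodeRabbit or any specific tool actually does:

```python
"""Flag functions with high cyclomatic complexity (rough heuristic, sketch only)."""
import ast
import sys

# Node types treated as decision points; each one adds a branch to the score.
BRANCH_NODES = (ast.If, ast.IfExp, ast.For, ast.AsyncFor, ast.While, ast.ExceptHandler)

def complexity(func: ast.AST) -> int:
    """Approximate cyclomatic complexity: 1 + decision points inside the function."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            score += len(node.values) - 1  # each chained and/or adds a branch
    return score

def flag(path: str, threshold: int = 10):
    """Yield a warning for every function in the file above the threshold."""
    tree = ast.parse(open(path, encoding="utf-8").read(), filename=path)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = complexity(node)
            if score > threshold:
                yield (f"{path}:{node.lineno} {node.name}() has complexity ~{score}"
                       " -- consider refactoring")

if __name__ == "__main__":
    for source_file in sys.argv[1:]:
        for warning in flag(source_file):
            print(warning)
```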
Tier 3: Risk-Based Review Depth
Not all code needs the same review rigor:
High risk (payments, auth, customer data):
- 2+ senior engineer reviewers
- Compliance review
- Architecture review
- Required edge case discussion
Medium risk (business logic, UI):
- 1 reviewer + automated checks
- Optional architecture discussion
Low risk (tests, docs, internal tools, configs):
- Automated checks only
- Async review (can merge immediately; reviewer comments after the fact)
We tag PRs automatically based on files changed. Only ~20% of PRs are high-risk, but those get 80% of review attention.
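The tagging itself needs very little machinery. A minimal sketch of the kind of classifier a CI job can run against the diff; the path patterns below are illustrative, not our actual repository layout:

```python
"""Assign a PR risk tier from its changed file paths (sketch; patterns are illustrative)."""
import fnmatch
import subprocess

RISK_PATTERNS = {
    "high": ["payments/*", "auth/*", "customer_data/*"],
    "medium": ["services/*", "webapp/*"],
    # everything else (tests, docs, configs, internal tools) falls through to "low"
}

def changed_files(base: str = "origin/main") -> list[str]:
    """Files touched by the PR relative to the base branch."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in result.stdout.splitlines() if line]

def risk_tier(files: list[str]) -> str:
    """Return the highest risk tier matched by any changed file."""
    for tier in ("high", "medium"):
        patterns = RISK_PATTERNS[tier]
        if any(fnmatch.fnmatch(f, p) for f in files for p in patterns):
            return tier
    return "low"

if __name__ == "__main__":
    print(risk_tier(changed_files()))
```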
Tier 4: Differential Review Standards for AI Code
This is controversial, but we’re testing it: PRs labeled as “AI-assisted” get different review criteria:
- Required explanation of why AI-suggested approach was chosen
- Mandatory comparison to existing patterns in codebase
- Higher test coverage requirements
- Edge case discussion required
Some engineers hate this (“it stigmatizes AI usage”), but I think it’s honest about the different risk profile.
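To keep the first two requirements from becoming a checkbox nobody reads, the write-up can be enforced mechanically. A sketch, assuming the CI job exposes the PR's labels and description as environment variables, and that the PR template uses these (entirely hypothetical) section headings:

```python
"""Block AI-assisted PRs that skip the required write-up (sketch; names are hypothetical)."""
import os
import sys

REQUIRED_SECTIONS = [
    "## Why this approach",                # why the AI-suggested approach was chosen
    "## Comparison to existing patterns",  # how it maps onto patterns already in the codebase
    "## Edge cases considered",            # the required edge case discussion
]

labels = os.environ.get("PR_LABELS", "").split(",")
body = os.environ.get("PR_BODY", "")

if "ai-assisted" in labels:
    missing = [section for section in REQUIRED_SECTIONS if section not in body]
    if missing:
        print("AI-assisted PR is missing required sections:")
        for section in missing:
            print(f"  - {section}")
        sys.exit(1)

print("PR description check passed")
```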
The Questions I’m Wrestling With
- Can code review scale to 2× volume without degrading quality? Or is there a fundamental human attention limit we're hitting?
- Should we invest more in AI-assisted review to match AI-assisted coding? Or is automated review fundamentally limited in ways that human review isn't?
- Is tiered review (different standards for different risk levels) pragmatic or dangerous? Are we creating a two-tier system where low-risk code gets poor review and accumulates debt?
- Should AI-generated code have different standards than human code? Or does that create perverse incentives (engineers avoiding AI to avoid scrutiny)?
What I Need From This Community
How are others adapting review processes for AI-generated code volume?
- Are you accepting longer review times?
- Reducing review depth?
- Investing in automated tooling?
- Limiting AI usage?
- Redesigning review entirely?
Has anyone successfully scaled review capacity to match AI coding capacity? What worked? What failed?
Because right now, our review process is the constraint preventing AI productivity gains from translating to team velocity—and I don’t think “review faster” or “hire more reviewers” are viable solutions at scale.
We need structural changes to how review works. I’d love to hear what others are trying.