Code review time is up 12% since we adopted AI coding tools—are we doing this wrong?

Six months ago, we rolled out AI coding assistants across our EdTech engineering team. The promise was compelling: developers would ship code faster, spend less time on boilerplate, and focus on high-value problem-solving. Our engineering metrics seemed to validate this—developers were saving an average of 3.6 hours per week on coding tasks.

But here’s what nobody warned us about: our code review cycle time increased by 12%.

The Paradox We’re Living

I was celebrating our productivity gains in a leadership meeting when our VP of Engineering Operations pulled me aside with concerning data. Yes, individual developers were shipping PRs faster. But the time from PR submission to approval had grown significantly. And when we dug deeper, we found something more troubling: in the PRs that skipped thorough review, bug density was 23% higher.

A recent McKinsey study surveying over 4,500 developers confirms we’re not alone. Teams that didn’t adapt their review processes saw review time increase by similar margins. Meanwhile, research from CodeRabbit shows that AI-generated code creates 1.7x more issues than human-written code:

  • Logic and correctness errors: 1.75x higher
  • Code quality and maintainability issues: 1.64x higher
  • Security findings: 1.57x higher
  • Performance problems: 1.42x higher

Why AI Code Takes Longer to Review

The issue isn’t obvious at first glance. AI-generated code often looks clean—proper formatting, follows conventions, passes linting. But the problems are subtle:

It’s “almost right but not quite.” The logic works for the happy path but misses edge cases. The error handling exists but doesn’t match our patterns. The code is syntactically correct but semantically fragile.

This creates a cognitive load problem for reviewers. Instead of quickly spotting obvious issues, they have to think deeply about whether the approach is sound. Senior engineers tell me reviewing AI-generated PRs requires more focus and mental energy than reviewing junior developer code—because juniors make obvious mistakes, while AI makes plausible mistakes.

The Team Impact

We’re seeing two concerning patterns:

  1. Senior engineer burnout. Our most experienced developers are spending 30% more time on code review. We can’t scale review capacity linearly with PR volume, and asking seniors to work longer hours isn’t sustainable.

  2. Review coverage gaps. We’ve had 984 PRs merge without thorough review this quarter. Some teams are rubber-stamping approvals just to keep velocity up. The short-term optics look good; the long-term quality risk is real.

One of the Anthropic research links we’ve been passing around points out an especially concerning trend: when AI generates code and AI tools review it, similar training biases can create a feedback loop in which the review model misses the errors the generation model introduced. AI reviewing AI’s work isn’t the safety net we assumed.

The Uncomfortable Questions

I’m struggling with a few questions that don’t have easy answers:

Is this sustainable? Can we maintain quality with current review practices, or are we accumulating “review debt” that will bite us later?

What’s the right tradeoff? Individual developer productivity is up, but team throughput is questionable. How do we balance speed and quality?

Are we measuring the right things? We’re tracking PR volume and individual velocity. Should we be tracking review effectiveness, bugs caught in review, or production defect rates instead?

How do other teams handle this? I’m genuinely curious—for those of you who’ve adopted AI coding tools at scale, what’s working? What changed in your review process?

What We’re Trying

We’re experimenting with a few approaches:

  • Tiered review based on risk: Fast-track for low-risk changes, deeper review for business logic and security-sensitive code
  • AI-assisted review: Using AI review tools for the mechanical first pass (style, common patterns), reserving human attention for architecture and logic
  • Review SLAs by priority: Different review turnaround times based on feature priority
  • Explicit “AI-generated” tagging: PRs with significant AI contributions get flagged for extra attention

Early results are mixed. We’ve reduced some bottlenecks, but we haven’t solved the fundamental tension between volume and quality.

I’d love to hear how other engineering leaders are thinking about this. Are you seeing similar patterns? What does your review process look like in the AI era? And honestly, is the 12% slower review time simply the real cost of quality, a cost we were avoiding by cutting corners before?


This resonates deeply. We’re experiencing almost identical patterns at our financial services company, and the compliance dimension makes it even more challenging.

The Numbers Don’t Lie

Our team went from averaging 12 PRs per week to 19 PRs per week over the past six months, a 58% increase. Leadership celebrated this as validation of our AI investment. But here’s what they didn’t see: time from PR submission to production deployment increased by 40%.

The bottleneck isn’t the pipeline. It’s review capacity.

The “Almost Right” Problem in Regulated Environments

Your point about AI code being “almost right but not quite” hits especially hard in financial services. We can’t skip reviews—compliance requires complete audit trails. But the nature of AI-generated issues makes reviews more cognitively demanding:

  • Logic that works until it doesn’t. A payment processing function that handles USD correctly but fails on fractional cent rounding in other currencies.
  • Security patterns that look correct. Input validation that catches SQL injection but misses a subtle LDAP injection vector.
  • Edge cases nobody thought to test. Date handling that works fine until you hit a timezone boundary during daylight saving transitions.

Senior engineers tell me they can review a junior developer’s 200-line PR in 10 minutes because the issues are obvious. Those same 200 lines from an AI assistant might take 25 minutes because they have to mentally simulate execution paths to find the subtle flaws.

The Review Dilemma

We’re caught between competing constraints:

  1. Can’t skip reviews: Regulatory requirements demand human oversight
  2. Can’t scale linearly: We have 8 senior engineers capable of reviewing security-sensitive code. We can’t just hire 3 more.
  3. Can’t slow down: Product commitments assumed the AI velocity gains would translate to faster delivery

The math doesn’t work. 58% more PRs × 12-15% longer review time per PR = senior engineers who are underwater.
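As a back-of-the-envelope check, here is that multiplication spelled out. It assumes review load scales as PR count times review time per PR, which is a simplification, and the function name is only illustrative.

```python
def review_load_multiplier(pr_growth: float, review_time_growth: float) -> float:
    """Relative review workload, assuming load = PR count x time per PR."""
    return (1 + pr_growth) * (1 + review_time_growth)


# 12 -> 19 PRs per week is roughly a 58% increase.
pr_growth = 19 / 12 - 1

# 12-15% longer review time per PR.
low = review_load_multiplier(pr_growth, 0.12)
high = review_load_multiplier(pr_growth, 0.15)

print(f"PR growth: {pr_growth:.0%}")
print(f"Review load: {low:.2f}x to {high:.2f}x the pre-AI baseline")
```

Under those assumptions the same reviewers are absorbing roughly 1.77x to 1.82x their previous review workload.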

Our Tactical Response

We implemented a tiered review process three weeks ago:

Tier 1: AI Pre-Review

  • Automated checks for code style, common anti-patterns, dependency vulnerabilities
  • AI review tools (we’re testing CodeRabbit and Qodo) flag potential issues
  • Catches ~40% of what would otherwise be human review findings

Tier 2: Human Review - Categorized by Risk

  • Low risk (internal tools, non-customer-facing): Single reviewer, junior-level acceptable
  • Medium risk (customer-facing, no PII/payments): Two reviewers, at least one senior
  • High risk (payments, PII, security): Two senior reviewers, security team approval
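For anyone who wants to encode something similar, here is a minimal sketch of the Tier 2 rules as data plus a merge check. The names (`Risk`, `ReviewPolicy`, `classify`, `can_merge`) and the two boolean inputs are hypothetical and not taken from any tool we use; how a PR gets flagged as customer-facing or sensitive is left out entirely.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class ReviewPolicy:
    reviewers: int            # total approvals required
    senior_reviewers: int     # how many of those must be senior
    security_approval: bool   # whether the security team must sign off


# Mirrors the three risk levels described above.
POLICIES = {
    Risk.LOW: ReviewPolicy(reviewers=1, senior_reviewers=0, security_approval=False),
    Risk.MEDIUM: ReviewPolicy(reviewers=2, senior_reviewers=1, security_approval=False),
    Risk.HIGH: ReviewPolicy(reviewers=2, senior_reviewers=2, security_approval=True),
}


def classify(customer_facing: bool, touches_sensitive: bool) -> Risk:
    """touches_sensitive covers payments, PII, and security code."""
    if touches_sensitive:
        return Risk.HIGH
    if customer_facing:
        return Risk.MEDIUM
    return Risk.LOW


def can_merge(risk: Risk, approver_is_senior: list[bool],
              security_approved: bool = False) -> bool:
    """approver_is_senior holds one entry per approval received."""
    policy = POLICIES[risk]
    return (
        len(approver_is_senior) >= policy.reviewers
        and sum(approver_is_senior) >= policy.senior_reviewers
        and (security_approved or not policy.security_approval)
    )
```

For example, `can_merge(Risk.HIGH, [True, True])` is false until the security approval is recorded, while a single junior approval is enough for `Risk.LOW`.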

Results so far:

  • Reduced review bottleneck by approximately 20%
  • Maintained (arguably improved) quality—we’re catching issues AI tools miss
  • Senior engineer review load more sustainable, though still elevated from pre-AI baseline

The Question I Can’t Answer

Is anyone successfully using AI review tools to catch the subtle bugs that AI coding assistants introduce? Or is this a fundamental limitation—that models trained on similar data will share similar blind spots?

We’re seeing AI review tools excel at finding obvious issues (unused variables, missing null checks, style violations) but struggle with the same edge cases and logic flaws that AI coding assistants create. Which means we still need deep human review for anything that matters.

I’m curious: for teams that have made this work—what’s your review effectiveness metric? How do you measure whether reviews are actually catching issues vs. just rubber-stamping faster?

Coming from the design side, I’ve watched this exact pattern play out with our design systems work. And honestly? I learned some painful lessons about trusting AI-generated code without sufficient review.

My Expensive Education

Three months ago, I was excited about using AI to generate component code for our design system. The pitch was compelling: designers could describe a component, AI would generate accessible React code, and we’d ship faster.

What actually happened: We shipped three components with accessibility issues that made it into production. Screen reader users couldn’t navigate our new form components. Keyboard navigation was broken in subtle ways. The code looked accessible—it had ARIA attributes, semantic HTML, proper roles. But the implementation was wrong in ways that only became obvious when tested with actual assistive technology.

The AI knew the syntax of accessibility but not the substance. It could add ARIA markup but didn’t understand when a different element or pattern was more appropriate. It created keyboard event handlers that technically worked but violated expected patterns.

My failure: I trusted that “looks correct” meant “is correct.” I should have known better.

The Nuance Problem

This maps directly to what you’re both describing with code review. AI is phenomenal at:

  • Boilerplate and patterns: Standard CRUD operations, data transformations, common utility functions
  • Syntax and structure: Code that compiles, passes linting, follows style guides
  • Surface-level correctness: Happy path logic that works for obvious cases

AI struggles with:

  • Nuanced requirements: Accessibility, i18n edge cases, cultural considerations in UX
  • Context-dependent decisions: When to optimize for performance vs. readability, which abstraction fits the evolving architecture
  • Implicit knowledge: The unwritten rules of your codebase, the lessons learned from past incidents

A Framework That’s Helping

After my accessibility incident, I created explicit boundaries for our team:

Green Zone - AI Can Handle, Light Review:

  • Standard form inputs following existing patterns
  • CSS layout code matching design tokens
  • Data transformation utilities with clear specs
  • Repetitive component variants (button sizes, color variants)

Yellow Zone - AI Assists, Human Decides:

  • Complex interactive components
  • State management logic
  • Integration with third-party libraries
  • Anything customer-facing or brand-critical

Red Zone - Human-First, AI Reference Only:

  • Accessibility-critical components
  • New architectural patterns
  • Security-sensitive code (auth, data handling)
  • Anything that sets precedent for other developers

The key insight: We’re not asking “Can AI do this?” We’re asking “What’s the cost if AI gets this wrong?” Low cost of failure = more AI autonomy. High cost = more human judgment.
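If it helps, the zones can be sketched as a simple lookup where the most restrictive zone wins. The label strings below are hypothetical names for the categories listed above, and defaulting unknown or unlabeled work to Yellow is my own assumption, not a rule we have formally adopted.

```python
from enum import IntEnum


class Zone(IntEnum):
    GREEN = 0   # AI can handle, light review
    YELLOW = 1  # AI assists, human decides
    RED = 2     # human-first, AI as reference only


# Hypothetical labels a team might attach to a change.
ZONE_BY_LABEL = {
    "form-input": Zone.GREEN,
    "css-layout": Zone.GREEN,
    "data-transform": Zone.GREEN,
    "component-variant": Zone.GREEN,
    "interactive-component": Zone.YELLOW,
    "state-management": Zone.YELLOW,
    "third-party-integration": Zone.YELLOW,
    "customer-facing": Zone.YELLOW,
    "accessibility-critical": Zone.RED,
    "new-architecture": Zone.RED,
    "security-sensitive": Zone.RED,
    "sets-precedent": Zone.RED,
}


def zone_for(labels: list[str]) -> Zone:
    """Most restrictive zone wins; unknown or missing labels fall back to YELLOW."""
    if not labels:
        return Zone.YELLOW
    return max(ZONE_BY_LABEL.get(label, Zone.YELLOW) for label in labels)
```

So a change labeled both `css-layout` and `accessibility-critical` lands in Red, because the cost of getting the accessibility part wrong dominates.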

The Meta-Question

Here’s what keeps me up at night: Are we using AI as a tool to think with or a crutch that replaces thinking?

When I review AI-generated component code now, I notice myself engaging less critically at first. The code looks reasonable, so my brain wants to approve it. I have to consciously force myself to ask: “But does this actually solve the problem correctly?”

Maybe the 12% slower review time isn’t a bug—it’s a feature. Maybe we should be thinking more carefully about code that gets generated quickly. The old pace of typing might have given our brains natural time to process and question our assumptions.

Uncomfortable truth: Perhaps we were rushing through reviews before, and AI just exposed that we weren’t being as thorough as we should have been. The “slowness” might be the actual cost of quality that we were cutting corners on.

That said, @eng_director_luis, I’m deeply curious about your tiered review process. How are you deciding what goes in which tier? And more importantly—how are you preventing “tier creep” where everything ends up in the high-risk bucket because teams are risk-averse?

Reading through this thread, I keep coming back to one framing: This is an organizational design problem, not a tooling problem.

We’ve added AI coding assistants to workflows designed for pre-AI volume. Of course things break.

Introducing: Review Debt

I want to propose a concept that’s been useful in our strategic planning: Review Debt.

Just like technical debt, review debt accumulates when we take shortcuts:

  • Rubber-stamping approvals to maintain velocity
  • Skipping deep review for “small” changes that compound
  • Relying on post-merge testing to catch what review should have caught
  • Trusting AI-generated code because it “looks right”

And just like technical debt, review debt eventually comes due—usually through production incidents, security vulnerabilities, or accumulated complexity that nobody understands.

The Data Validates Thoughtful Process Design

That McKinsey study @vp_eng_keisha referenced is worth examining more closely. Teams with strong review culture saw review cycles shorten by 35%—not lengthen.

What distinguished these teams? They redesigned their review process for the AI era rather than grafting AI tools onto existing workflows:

  1. Clear risk categorization (similar to what @eng_director_luis described)
  2. AI review as first-pass automation, not replacement for human judgment
  3. Explicit review SLAs that account for AI-generated code volume
  4. Metrics focused on review effectiveness, not just review speed

The volume changed. Our processes didn’t. That’s on us.

What Needs to Change: Organizational Level

From a CTO perspective, here’s what I’m working on with our leadership team:

1. Review SLAs That Acknowledge Reality

  • P0 customer commitments: 4-hour review target, senior reviewer required, explicit approval from tech lead
  • P1 roadmap features: 24-hour review target, standard review process
  • P2 improvements: 48-72 hour review target, can batch review similar changes
  • P3 tech debt / refactoring: 1-week review target, allows for deeper architectural discussion
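As a rough sketch, those targets can be expressed as a lookup plus an overdue check. This assumes wall-clock hours rather than business hours and uses the upper end of the P2 range; the names `REVIEW_SLA`, `review_due`, and `is_overdue` are illustrative. The reviewer requirements for P0 (senior reviewer, tech lead approval) are not modeled here.

```python
from datetime import datetime, timedelta

# Review targets per priority, in wall-clock time (an assumption).
REVIEW_SLA = {
    "P0": timedelta(hours=4),
    "P1": timedelta(hours=24),
    "P2": timedelta(hours=72),  # upper end of the 48-72 hour target
    "P3": timedelta(weeks=1),
}


def review_due(priority: str, opened_at: datetime) -> datetime:
    """Latest time by which the review should be complete."""
    return opened_at + REVIEW_SLA[priority]


def is_overdue(priority: str, opened_at: datetime, now: datetime) -> bool:
    """True once the review target for this priority has passed."""
    return now > review_due(priority, opened_at)
```

A P0 PR opened at 09:00 would be due at 13:00 the same day under this sketch.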

2. Review Effectiveness Metrics (Not Just Speed)

We’re tracking:

  • Bugs caught in review (what percentage of eventual issues were identified pre-merge?)
  • Review cycle time by risk category (are we spending time proportional to risk?)
  • Post-merge defect rate by reviewer (which reviewers are most effective?)
  • AI-generated vs human-authored defect rates (are we treating them appropriately differently?)
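Here is a minimal sketch of how two of these could be computed from per-PR records. The `PullRequest` fields are hypothetical, the sample numbers are made up purely to show the calculation, and attributing a post-merge defect back to a specific PR is assumed to be solved elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PullRequest:
    ai_tagged: bool               # PR carries the AI-generated tag
    issues_caught_in_review: int  # issues reviewers flagged before merge
    defects_after_merge: int      # defects later traced back to this PR


def catch_rate(prs: list[PullRequest]) -> Optional[float]:
    """Share of all known issues that review caught before merge."""
    caught = sum(pr.issues_caught_in_review for pr in prs)
    escaped = sum(pr.defects_after_merge for pr in prs)
    total = caught + escaped
    return caught / total if total else None


def post_merge_defects_per_pr(prs: list[PullRequest], ai_tagged: bool) -> Optional[float]:
    """Average post-merge defects per PR for AI-tagged or human-authored PRs."""
    group = [pr for pr in prs if pr.ai_tagged == ai_tagged]
    if not group:
        return None
    return sum(pr.defects_after_merge for pr in group) / len(group)


# Made-up sample data, for illustration only.
sample = [
    PullRequest(ai_tagged=True, issues_caught_in_review=3, defects_after_merge=1),
    PullRequest(ai_tagged=True, issues_caught_in_review=1, defects_after_merge=1),
    PullRequest(ai_tagged=False, issues_caught_in_review=2, defects_after_merge=0),
    PullRequest(ai_tagged=False, issues_caught_in_review=2, defects_after_merge=0),
]
print(catch_rate(sample))
print(post_merge_defects_per_pr(sample, ai_tagged=True))
print(post_merge_defects_per_pr(sample, ai_tagged=False))
```

Both functions return `None` when there is nothing to measure, so an empty period is not mistaken for a perfect one.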

3. Explicit Guidelines for AI-Generated Code

We now require:

  • Tagging PRs with significant AI contribution (>30% of lines)
  • Additional review checklist for AI-generated code (edge cases, error handling, security implications)
  • Mandatory human verification before merge for certain domains (payments, auth, PII handling)
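A minimal sketch of how these three rules might combine into a pre-merge check. How AI-written lines are counted (self-reported here) is an assumption, as are the function names and the domain strings; tying the extra checklist to the tag is also my reading of the rule rather than a stated policy.

```python
AI_TAG_THRESHOLD = 0.30  # tag when more than 30% of changed lines are AI-generated

HUMAN_VERIFICATION_DOMAINS = {"payments", "auth", "pii"}


def needs_ai_tag(ai_lines: int, total_lines: int) -> bool:
    """True when the AI-generated share of changed lines exceeds the threshold."""
    if total_lines <= 0:
        return False
    return ai_lines / total_lines > AI_TAG_THRESHOLD


def merge_requirements(ai_lines: int, total_lines: int, domains: set[str]) -> list[str]:
    """List the extra steps a PR must complete before it can merge."""
    requirements = []
    if needs_ai_tag(ai_lines, total_lines):
        requirements.append("ai-generated tag")
        requirements.append("ai review checklist")
    if domains & HUMAN_VERIFICATION_DOMAINS:
        requirements.append("human verification")
    return requirements
```

For example, `merge_requirements(50, 100, {"payments"})` returns all three requirements, while a 30-of-100-line change outside the listed domains returns none because the threshold is strictly greater than 30%.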

4. Investment Reallocation

This is the hardest conversation: We can’t reduce headcount just because AI increased individual productivity.

The board saw “59% productivity increase” and wanted to slow hiring. I had to explain: we’re not 59% more effective at delivering value to customers. We’re generating 59% more code that requires review, testing, and deployment.

If anything, we need more capacity in review, QA, and infrastructure to absorb the increased volume.

The Leadership Communication Challenge

@vp_eng_keisha, you asked about communicating this to stakeholders. Here’s the framing that worked for me:

Leading indicators (what boards tend to focus on):

  • PR volume ✅ UP
  • Lines of code generated ✅ UP
  • Individual developer velocity ✅ UP

Lagging indicators (what actually matters):

  • Deployment frequency ⚠️ FLAT
  • Mean time to recovery ⚠️ UP
  • Production defect rate ⚠️ UP
  • Customer-impacting incidents ❌ UP

I tell the board: “We’re measuring activity, not outcomes. PR volume is a vanity metric. Deployment success rate and customer value delivered are sanity metrics.”

The Question We Should All Be Asking

What’s your review effectiveness metric?

Not “How fast do we review?” but “How well do our reviews catch issues before they hit production?”

If you don’t have an answer to that question, you’re optimizing for the wrong thing. And when AI increases code volume by 50-60%, optimizing for speed over effectiveness is a recipe for disaster.

I’d love to hear: what metrics are other engineering leaders using to measure review quality vs. just review velocity?