AI Code Review Bottleneck: Our PR Review Time Grew 91% and Nobody Knows How to Fix It

Okay, I need to share something that’s been driving me absolutely bonkers :exploding_head:

Three months ago, my design systems team started using AI coding assistants. The promise? Ship components faster, iterate quicker, spend less time on boilerplate. The reality? Our PR queue has basically exploded, and I’m watching talented engineers drown in review work.

The Numbers Don’t Lie (But They’re Confusing)

Here’s what actually happened after we adopted AI tools:

  • Our team is merging 98% more pull requests than before
  • Individual developers report feeling way more productive
  • BUT… PR review time increased by 91% :chart_increasing:

Wait, what? How does that even make sense?

It gets weirder. The average PR size grew by 154%. Turns out, when AI can generate a whole component implementation in minutes instead of hours, developers don’t break their work into smaller, reviewable chunks anymore. They just… ship the whole thing at once.

The Design Systems Nightmare

From a design systems perspective, this created a weird dynamic I didn’t expect.

Engineers can now generate UI code really fast based on our component specs. That should be great, right? More design system adoption, faster implementation, happy designers! :artist_palette:

Except the review bottleneck broke our entire feedback loop. We’d iterate on a component design, engineering would update the implementation with AI in like 20 minutes, and then… crickets for 3-4 days while it sat in the review queue. By the time we got feedback that something didn’t work, we were already two sprints ahead.

The collaboration that made our design system actually good — that tight back-and-forth between design and engineering — basically collapsed under the weight of the queue.

Why AI Code Is Harder to Review

I’m not an engineer (well, not primarily :sweat_smile:), but I’ve been watching this closely, and here’s what I’ve noticed about reviewing AI-generated code:

1. Less familiar patterns: AI doesn’t write code like your teammates do. It’s technically correct, but the patterns are just… different. Reviewers spend extra mental energy parsing unfamiliar approaches.

2. More verbose: AI really likes to be thorough. That PR that would’ve been 50 lines from a human? AI writes 200 lines. All technically fine, but way more surface area to review.

3. Lacks context: When a human writes code, they leave breadcrumbs — variable names that reflect the domain, comments about why they chose an approach. AI code often looks good but doesn’t tell you the story of the implementation.

4. Edge cases everywhere: I’ve noticed AI-generated code tends to handle edge cases I didn’t even ask for. Again, seems good! But reviewers have to validate all of it, and sometimes those edge cases introduce subtle bugs.

What We Tried (Spoiler: Nothing Really Worked)

Our team has been scrambling to address this:

  • Rotating review responsibilities: Just spread the pain around. Everyone’s equally underwater now! :ocean:
  • Trying AI code review tools: Anthropic just launched their Code Review tool in March, and we’re experimenting with it. Early days, but it’s finding stuff humans miss (and also flagging stuff that’s totally fine, so… mixed results)
  • Review time limits: “Keep reviews under 30 minutes!” Great in theory, terrible in practice when PRs are 400 lines long
  • Async review cycles: Tried doing reviews in batches. Just meant bigger context-switching overhead.

None of it really solved the fundamental problem. We’re generating code faster than we can responsibly review it.

The Productivity Paradox

The really frustrating part? According to self-reported data, developers using AI believe they’re 20% more productive. They feel faster because they’re writing code in less time.

But a recent study found that developers are actually 19% slower at completing tasks when you measure end-to-end delivery. The review bottleneck completely wipes out the generation speed gains.

And then there’s the bug rate. Our incidents went up about 9% since adopting AI tools. Nothing catastrophic, but definitely the wrong direction.

The Real Question

So here’s what I’m struggling with: Is this just the new normal?

Like, is this a temporary transition period while we figure out new workflows? Or did we fundamentally break something about how code review is supposed to work?

Some days I think we need to completely redesign our review process from scratch — maybe treat AI-generated PRs differently, have different approval gates, I don’t know.

Other days I think we should just slow down the AI code generation to match our review capacity, but that feels like leaving productivity on the table (even if the productivity gains are somewhat illusory).

How Are You All Handling This?

I can’t be the only one seeing this pattern, right?

  • Have you found workflows that actually work with AI-assisted development?
  • Are you treating AI-generated code differently in your review process?
  • Did you just… accept the slower velocity as the price of AI code?
  • Or did you find some magic solution I’m missing?

At our current pace, I’m genuinely worried about long-term quality. Our design system is too critical to let technical debt sneak in because we couldn’t keep up with reviews. But I also don’t want to be the person who says “stop using the productivity tools” when everyone feels like they’re shipping faster.

Help? :folded_hands:


@maya_builds Oh wow, I felt this in my bones. We’re seeing the exact same pattern across our entire engineering org (80+ engineers), and honestly, it’s been one of my top concerns for the past quarter.

The Organizational Compound Effect

What you’re experiencing at the team level gets exponentially worse at scale. When the bottleneck hits a distributed organization, those small delays cascade into massive queues. We’ve got PRs sitting in review for 5-7 days now, which is wild considering they used to merge in 1-2 days before AI adoption.
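
To see why, here’s a back-of-envelope queue sketch (every number is invented for illustration, not our real data): once PRs arrive even slightly faster than the team can clear them, the backlog, and with it the wait for each new PR, grows without bound.

```python
# Toy queue model: illustrative numbers only.
# If PRs arrive faster than reviewers can clear them, the backlog
# grows every day, and so does the wait for each newly opened PR.

ARRIVALS_PER_DAY = 12  # assumed: PRs opened per day after AI adoption
REVIEWS_PER_DAY = 10   # assumed: PRs the org can review thoroughly per day

backlog = 0
for day in range(1, 11):
    backlog += ARRIVALS_PER_DAY - REVIEWS_PER_DAY
    wait = backlog / REVIEWS_PER_DAY  # rough wait (in days) for a new PR
    print(f"day {day:2d}: backlog = {backlog:3d} PRs, ~{wait:.1f} day wait")
```

Even a modest mismatch between arrival rate and review capacity compounds into a multi-day queue within a couple of weeks.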

What We’ve Tried (Narrator: It Didn’t Work)

I’m going to be real with you — we tried a lot of things that sounded good in theory:

Review SLAs: We set a 24-hour SLA for reviews. What actually happened? It just created pressure and stress without touching the root cause. Reviewers rushed through to hit the SLA, which defeated the whole purpose of careful review.

Pair programming for AI code: This one showed promise! Having two people review AI-generated code together caught more issues and built shared context. But it’s resource-intensive, and we couldn’t scale it across all PRs.

Review load as a team health metric: Started tracking how many PRs each engineer had in their review queue. This at least made the problem visible to leadership, which helped with prioritization conversations.
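
For anyone who wants the same visibility: here’s a minimal sketch of how the queue sizes can be pulled, assuming GitHub and its issue-search API (the token and handles below are placeholders, not real accounts).

```python
# Minimal sketch: count open PRs awaiting each engineer's review,
# via GitHub's issue-search API. Token and usernames are placeholders.
import requests

TOKEN = "ghp_..."  # hypothetical personal access token
ENGINEERS = ["alice", "bob", "carol"]  # placeholder handles

def review_queue_size(username: str) -> int:
    """Open PRs where `username` is a requested reviewer."""
    resp = requests.get(
        "https://api.github.com/search/issues",
        params={"q": f"type:pr state:open review-requested:{username}"},
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()
    return resp.json()["total_count"]

for eng in ENGINEERS:
    print(f"{eng}: {review_queue_size(eng)} PRs waiting on their review")
```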

The Real Issue: Process Mismatch

Here’s what I think we’re all learning the hard way: This is a process problem, not a people problem.

Our existing code review workflows were designed with certain assumptions:

  • Reviewers can pattern-match against familiar coding styles
  • PR size reflects complexity of the change
  • The author understands every line they wrote

AI-assisted development breaks all three assumptions. The code looks different, PRs are huge but not necessarily complex, and sometimes developers don’t fully understand the AI-generated implementation they’re submitting.

We need to rethink the entire review process, not just optimize the old one. Adding more reviewers or setting stricter SLAs is like putting a faster engine in a car that’s structurally unsound.

The Question I Keep Coming Back To

What if we need fundamentally different approval workflows for different types of code?

Like, should a component library PR (which has long-term architectural impact) go through different gates than AI-generated UI implementation (which is more like filling in templates)?

I don’t have the answer yet, but I think the one-size-fits-all review process might be part of what’s breaking.

Real Talk on Velocity

The thing that keeps me up at night: We’re slower overall, but leadership sees “98% more PRs merged” and thinks we’re killing it :bar_chart:

The disconnect between activity metrics and actual delivery velocity is creating misalignment. Product is frustrated because features aren’t shipping faster. Engineering feels productive because they’re writing more code. I’m stuck in the middle trying to explain that more output ≠ more outcomes.

How are you navigating that conversation with your leadership and product partners? Because that might be the harder problem to solve than the technical workflow itself.

@maya_builds @vp_eng_keisha This hits close to home. Running a 40+ person engineering team at a financial services company, and we’re dealing with this exact bottleneck — except our compliance constraints make it even trickier.

The Financial Services Reality Check

Here’s our version of the problem: We legally cannot skip code reviews. Banking regulations require audit trails for every change that touches financial data. So when Maya asks “should we just accept slower velocity?” — for us, that’s not even optional. We have to do thorough reviews, period.

We Tracked the Data (It’s Worse Than You Think)

I’m a metrics person, so we actually measured what was happening:

  • 91% increase in review time (matches the industry data exactly)
  • AI-generated code has 2.3x more edge cases on average than human-written code
  • Bug rate up 9% — and in financial services, that’s a regulatory nightmare waiting to happen

That 9% bug increase might sound small, but in our world, a bug in transaction processing can trigger audits, compliance violations, and seriously expensive problems. We can’t afford to let quality slip.

What’s Actually Working (Sort Of)

After a lot of experimentation, here’s what’s showing promise:

1. Tiered Review Process:
We started flagging PRs based on AI usage intensity:

  • Heavy AI (>70% AI-generated): Two reviewers required, extra focus on edge cases
  • Moderate AI (30-70%): Standard review + automated checks
  • Light AI (<30%): Normal process

It’s not perfect, but it helps reviewers calibrate their scrutiny level.
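
The mapping itself is trivial to encode; the genuinely hard part is estimating the AI share of a PR in the first place, and I won’t pretend we’ve fully solved that. A sketch of just the tiering rules:

```python
# Sketch of the tiering rules above. How you estimate `ai_fraction`
# (the share of AI-generated lines in a PR) is the hard, unsolved part;
# this only maps the estimate to review requirements.

def review_tier(ai_fraction: float) -> dict:
    """Map estimated share of AI-generated code to review requirements."""
    if ai_fraction > 0.70:   # heavy AI
        return {"tier": "heavy", "reviewers": 2, "edge_case_focus": True}
    if ai_fraction >= 0.30:  # moderate AI
        return {"tier": "moderate", "reviewers": 1, "automated_checks": True}
    return {"tier": "light", "reviewers": 1}  # normal process

print(review_tier(0.85))
# {'tier': 'heavy', 'reviewers': 2, 'edge_case_focus': True}
```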

2. AI Code Patterns Documentation:
We created a wiki of common patterns AI tools generate, what to watch for, and what’s typically safe. Helps reviewers spot problems faster instead of questioning every line.

3. Junior Engineer Shadowing:
This one surprised me — we have junior engineers shadow reviews of AI-heavy PRs as a learning opportunity. They’re learning to be better reviewers AND understanding code patterns they might not write themselves.

The Uncomfortable Truth

Our team velocity is down overall. Features take longer to ship. Product is not happy.

BUT — and this is important — our quality hasn’t dropped yet. Incident rate is stable. Production bugs are under control. We’re slower, but we’re sustainable.

I think of it like this: We’re paying the real cost of AI-assisted development, not just enjoying the initial speed boost. The question is whether the long-term value (more code written, more features possible) justifies the near-term velocity hit.

The Mentorship Angle

Here’s something nobody’s talking about: This is actually creating an opportunity to teach engineers better code review skills.

When you review human code, you can rely on “does this look like how we usually do things?” But AI code forces reviewers to ask deeper questions:

  • Does this actually solve the problem correctly?
  • What are the performance implications?
  • Are the edge cases handled properly?

These are skills we should have been teaching all along. AI is just forcing us to be more intentional about it.

My Question for the Group

Does anyone have data on the long-term quality impact of AI-generated code?

We’re 6 months into this, and quality seems stable. But I’m worried we’re accumulating technical debt invisibly — code that works now but will be hard to maintain later. I’d love to hear from teams that are 12-18 months in: What happened to your codebases?

@maya_builds Coming at this from the product side, and honestly, this bottleneck is killing us. Not in a dramatic way — but in the slow, grinding way that makes you question every roadmap commitment.

The Product Velocity Paradox

Here’s what we’re seeing from the product org:

Engineering tells us: “We’re shipping way more code! Look at all these PRs!”
Sales tells customers: “With our new AI tools, we can iterate super fast!”
Product sees: Features still taking just as long (or longer) to actually deliver to customers.

The gap between “code is done” and “feature is shipped” has become this black hole where everything sits in review for days. We’ve had demos scheduled with enterprise prospects where the feature was ‘done’ on Monday but still stuck in review on Friday when we had to present.

The Business Case That Fell Apart

Let me tell you about the AI ROI projection we made 6 months ago (spoiler: it was wrong):

What we assumed:

  • AI tools would speed up coding by ~30%
  • Faster coding = faster feature delivery
  • Faster delivery = more customer value = revenue growth

What actually happened:

  • Coding DID speed up by ~30%
  • Review time increased by 91%, offsetting the gains
  • Feature delivery velocity: basically flat
  • Customer value: no change
  • Revenue growth: completely unrelated to AI tools

The ROI we sold to leadership was based on a fundamentally flawed assumption: that code generation was the bottleneck. Turns out the bottleneck was somewhere else entirely.
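
The back-of-envelope math makes the flaw obvious. With an assumed split of three days coding and two days in review per feature (made-up numbers; the 30% and 91% figures are from this thread):

```python
# Why the gains cancelled out. The 3-day / 2-day split is an
# assumption for illustration; the percentages are from this thread.

CODING_SPEEDUP = 0.30    # coding ~30% faster with AI
REVIEW_SLOWDOWN = 0.91   # review time up ~91%

coding_before, review_before = 3.0, 2.0  # assumed days per feature
coding_after = coding_before * (1 - CODING_SPEEDUP)   # 2.1 days
review_after = review_before * (1 + REVIEW_SLOWDOWN)  # ~3.8 days

print(f"before: {coding_before + review_before:.1f} days end-to-end")
print(f"after:  {coding_after + review_after:.1f} days end-to-end")
# before: 5.0 days end-to-end
# after:  5.9 days end-to-end  -> flat-to-worse, matching what we saw
```

Unless coding dominates the end-to-end timeline, speeding it up barely moves delivery.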

Cross-Functional Tension

This is creating real alignment problems:

Engineering perspective: “We’re being more productive! Look at our commit numbers!”
Product perspective: “Why aren’t features shipping faster if you’re so productive?”

Neither side is wrong, exactly. Engineering IS writing more code. Product ISN’T seeing faster delivery. The metrics just measure different things.

@vp_eng_keisha mentioned this disconnect, and it’s exactly what I’m wrestling with. How do we align around the right metrics when the old ones stopped making sense?

What Product Did to Adapt

Since we couldn’t fix the review bottleneck directly, we adjusted our planning:

1. Estimate inflation: Added ~40% to all feature estimates to account for review time. Feels terrible to plan for inefficiency, but at least commitments are realistic now.

2. Priority-based review: Working with engineering to identify “high-value” features that get review priority. Not ideal (creates queue-jumping) but at least strategic work moves.

3. Smaller scope per iteration: Trying to break features into smaller chunks that generate smaller PRs. Mixed success — sometimes AI just generates a big implementation regardless.

4. Pushed back on leadership: Had uncomfortable conversations about why AI tools didn’t deliver promised productivity gains. Leadership doesn’t love hearing that their investment isn’t paying off the way they expected.

The Framework Shift We Need

Here’s what I think went wrong: We optimized for the wrong metric.

AI tools optimize for “code written.” But what we should measure is “value delivered to customers.”

Those are not the same thing. Not even close.

If generating more code creates a review bottleneck that slows down actual delivery, then the “productivity” from AI is illusory. We’re just moving the bottleneck, not removing it.

My Questions for Product Leaders

How are you adjusting your planning processes to account for this reality?

Are you still using velocity-based estimates, or have you moved to something that captures the full delivery cycle?

And how are you having the conversation with engineering about what “productivity” actually means in the AI era?

Because right now, it feels like we’re all measuring activity (code written, PRs merged) instead of outcomes (customer value delivered, problems solved). And I don’t think we fix this until we align on what success actually looks like.

@maya_builds @vp_eng_keisha @eng_director_luis @product_david

This entire thread is giving me flashbacks to a CTO roundtable I attended last month. Eight CTOs in the room, and every single one of us brought up this exact issue. You’re not alone — this is an industry-wide pattern.

The Strategic Mistake We All Made

Looking back, I think we made a fundamental error in how we thought about AI adoption:

What we optimized for: Code generation speed
What we should have optimized for: Full development lifecycle effectiveness

We assumed the bottleneck was “writing code” and threw AI at it. But the actual bottleneck in most mature engineering organizations isn’t code generation — it’s review, testing, integration, deployment, and maintenance.

AI made the non-bottleneck faster, which just exposed (and worsened) the real bottlenecks downstream.

The Three Root Causes

After talking with other CTOs and analyzing our own data, I think there are three core issues:

1. Process Mismatch
Our code review processes were designed for human-authored code, which has a certain quality baseline. Humans make typos but understand context. AI gets syntax perfect but sometimes misses the bigger picture.

We’re trying to review AI code with human code assumptions, and it doesn’t work.

2. Individual vs Team Optimization
AI tools are optimized for individual developer productivity: “How fast can one person write code?”

But engineering effectiveness is a team sport. If I write code 3x faster but create 2x more review burden for my teammates, did I make the team more productive? No.

The tools optimized for the wrong level of the system.

3. Vanity Metrics
@product_david nailed this. Leadership celebrated “98% more PRs merged” without asking whether that translated to better outcomes.

Code volume became a vanity metric. More PRs merged sounds impressive in a board deck, but if customer value didn’t increase, what did we actually accomplish?

What We’re Doing at Enterprise Scale

Here’s our multi-pronged approach (still early, but showing promise):

AI Reviewing AI:
We rolled out Anthropic’s Code Review tool, which launched on March 9th, so AI reviews AI-generated code before human eyes ever touch it. The AI catches a lot of the verbose/redundant code issues, which reduces the human review burden.

Early results: ~30% reduction in human review time for AI-heavy PRs. Not a silver bullet, but meaningful.

AI Code Standards:
Created explicit guidelines for when to use AI vs when to hand-craft code:

  • Boilerplate, repetitive patterns → AI encouraged
  • Core business logic, complex algorithms → AI as assistant only
  • Security-critical code → human-written, AI review only
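
One way to keep guidelines like these from becoming a wiki page nobody reads is to encode them as a CI check keyed off file paths. A sketch, with hypothetical path conventions (not our actual repo layout):

```python
# Hypothetical sketch: map changed file paths to an AI-usage policy
# in CI. The path patterns are invented for illustration.
import fnmatch

POLICY = [  # checked in order; first match wins
    ("src/security/*", "human-written; AI review only"),
    ("src/core/*", "AI as assistant only"),
    ("*", "AI encouraged (boilerplate, repetitive patterns)"),
]

def policy_for(path: str) -> str:
    """Return the AI-usage rule that applies to a changed file."""
    for pattern, rule in POLICY:
        if fnmatch.fnmatch(path, pattern):
            return rule
    return POLICY[-1][1]

print(policy_for("src/security/token_rotation.py"))
# human-written; AI review only
```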

Revised Engineering Metrics:
Stopped tracking “PRs merged per sprint” and started tracking “value delivered per sprint” (measured by feature completion, customer adoption, and reliability).

This shifted the conversation from activity to outcomes.

Automated Testing Investment:
Massively increased investment in automated testing infrastructure. If we can catch issues before human review, the review process can focus on architecture, design, and business logic instead of correctness.

The Cultural Shift Required

Here’s the uncomfortable part: Teams need explicit permission to say AI isn’t the right tool for a task.

Right now, there’s implicit pressure to use AI everywhere because leadership invested in it and expects ROI. That creates situations where developers use AI even when hand-writing code would be faster and cleaner.

We had to tell teams: “It’s okay to not use AI for everything. We’re measuring outcomes, not AI adoption rates.”

The Long-Term Warning

@eng_director_luis asked about long-term quality impact, and I have thoughts.

We’re about 12 months into heavy AI usage, and I’m seeing patterns that worry me:

Invisible Technical Debt:
AI-generated code often looks good on the surface but accumulates subtle issues:

  • Inconsistent patterns across the codebase
  • Over-engineered solutions (AI likes to be thorough)
  • Poor naming that technically works but doesn’t reflect domain knowledge
  • Edge case handling that’s technically correct but not actually needed

None of this breaks the code. But it makes the codebase harder to maintain, harder for new engineers to understand, and more brittle over time.

Teams that aren’t adapting their quality gates are going to wake up in 18-24 months with a codebase that “works” but is incredibly hard to change.

The Call to Action

I genuinely think this is a collective industry challenge that requires shared learning.

We need to be talking about:

  • What workflows actually work with AI-assisted development?
  • What metrics measure real productivity vs activity?
  • How to maintain quality when code generation outpaces review capacity?
  • What skills engineers need in an AI-augmented workflow?

This isn’t a problem any one company can solve in isolation. The AI tool vendors are optimizing for individual productivity, but we need to solve for team effectiveness.

My Question to All of You

What metrics are you using to measure actual productivity (not just activity)?

Because I think until we align on better success metrics, we’re going to keep celebrating increased code output while missing the fact that delivery velocity stayed flat (or got worse).

The old metrics don’t work anymore. What should we measure instead?


P.S. Maya, to your original question: No, I don’t think this is the “new normal.” I think this is a transition period where workflows haven’t caught up to tools. But it requires intentional change management, not just hoping teams figure it out on their own.