We're Writing 59% More Code with AI, But Shipping Less to Production. Where's the Bottleneck?

I need to share something that’s been bothering me for the past few months. My team at our financial services company has fully embraced AI coding assistants—GitHub Copilot, Cursor, you name it. Our developers genuinely feel more productive. They’re writing code faster, trying more experiments, and feeling less stuck on boilerplate.

But here’s the thing: Our deployment frequency hasn’t improved. In fact, it’s gotten slightly worse.

I started digging into our metrics, and then CircleCI’s 2026 State of Software Delivery report landed on my desk. The numbers stopped me cold:

The 70.8% Paradox

Across 28 million CI workflows, CircleCI found that average engineering throughput increased 59% year over year—the biggest jump they’ve ever measured. That’s enormous. AI-assisted development is clearly accelerating code generation.

But here’s the paradox: Main branch success rates dropped to 70.8%, the lowest in over five years. The industry benchmark is 90%. That means nearly 3 out of 10 workflow runs on the main branch are failing.

Even more telling: Feature branch throughput is up 15.2%, but main branch throughput is down 6.8%. We’re creating more on branches but shipping less to production.

The Review Bottleneck Theory

I think code review has become the binding constraint. Here’s what I’m seeing with my team of 40+ engineers:

  1. Same review capacity: We still have roughly the same number of senior engineers doing reviews as we did two years ago
  2. 59% more code to review: But now there’s dramatically more code flowing through the pipeline
  3. Longer wait times: Developers are context-switching while waiting for review, which kills productivity
  4. Recovery time up 13%: When things do fail, it takes longer to fix (average 72 minutes)
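
To make the capacity argument concrete, here is a rough sketch using a textbook M/M/1 queue. It treats the whole reviewer pool as a single server with random arrivals, which is a big simplification, and the 100 and 60 PRs-per-week figures are made-up numbers for illustration, not data from my team or from CircleCI.

```python
import math


def mean_review_wait(arrivals_per_week: float, capacity_per_week: float) -> float:
    """Mean time a PR waits in the review queue, in weeks, under an M/M/1 model.

    Assumes the review pool behaves like one server with Poisson arrivals and
    exponentially distributed review times; real teams will deviate from this.
    """
    if capacity_per_week <= 0:
        raise ValueError("capacity_per_week must be positive")
    if arrivals_per_week >= capacity_per_week:
        return math.inf  # more PRs arrive than can be reviewed: queue never drains
    utilization = arrivals_per_week / capacity_per_week
    return utilization / (capacity_per_week - arrivals_per_week)


# Hypothetical numbers: reviewers can clear 100 PRs/week and used to receive 60.
before = mean_review_wait(60, 100)
after = mean_review_wait(60 * 1.59, 100)  # same reviewers, 59% more PRs
print(f"reviewer utilization: 60.0% -> {60 * 1.59:.1f}%")
print(f"mean wait grows roughly {after / before:.0f}x")
```

The point of the sketch is the nonlinearity: under these assumptions a 59% rise in volume pushes reviewers from comfortable to nearly saturated, and the wait grows far more than 59%.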

Three Hypotheses

I see three possible explanations:

  1. We need AI-assisted code review - Maybe the solution is AI reviewing AI-generated code?
  2. Quality gates are working - Maybe 70.8% is actually good—the review process is correctly filtering out low-quality AI output
  3. Process mismatch - Our review process was designed for human-speed code generation, not AI-speed

The Financial Services Context

In our world, we can’t just “move fast and break things.” Compliance, security, and regulatory requirements are non-negotiable. But we’re also under pressure to deliver features faster to compete with fintechs.

The painful irony: AI promised to accelerate development, but it’s created a different bottleneck downstream.

Questions for the Community

  • Are you seeing similar patterns in your organizations?
  • Is code review the bottleneck, or have you identified something else?
  • How are you scaling review capacity to match increased code generation?
  • Should we even be aiming for 90% merge success in the AI era, or is 70% the new normal?

I’m genuinely curious whether this is a process problem, a tooling problem, or maybe a sign that our quality gates are actually working as intended by catching problematic AI-generated code.

What’s your take?

This post hit home, but from a totally different angle. When I was running my design system at our startup, we had a similar problem—except it was engineers shipping UI components without design review.

They’d move fast, create technically perfect components, but the UX would be a mess. Users couldn’t find things. Forms were confusing. Everything technically worked but the experience was broken.

Maybe 70.8% is better than 90% with hidden problems?

Here’s my question: Are we measuring the right thing?

Luis, you’re focused on merge success rate. But what if that 29.2% failure rate is actually catching code that shouldn’t ship? What if we were back at 90% success but half of that was shipping technical debt bombs, security vulnerabilities, or features that look good in isolation but break user experience?

I learned this the hard way at my failed startup—we optimized for velocity and shipped features nobody wanted. We had high “success” rates but low actual value.

The metrics we’re not tracking

What about:

  • Bug reports filed in the first week after deployment
  • Post-deployment hotfixes required
  • Customer-facing incidents
  • Code that has to be refactored within 3 months

Your CircleCI data shows recovery time is up 13%, to an average of 72 minutes. That suggests we’re already shipping too much broken code. Maybe the review process isn’t filtering enough?

A different frame: Rejection as quality signal

In my design system work, we reject about 30% of component PRs. Not because they’re bad code, but because they don’t meet our standards—they’re not accessible enough, they don’t follow patterns, or they solve problems we’ve already solved better elsewhere.

High rejection rate isn’t a bug, it’s a feature. It means our quality gates are working.

What if instead of asking “how do we get back to 90% merge success,” we asked “what’s the right rejection rate for code quality in the AI era?”

Maybe a higher rejection rate during review is a sign that quality control is working, not a bottleneck problem.

That said…

I totally get your pain point about wait times. When designers have to wait 3 days for engineering review, context switching kills us. So there’s definitely something to optimize. But I don’t think the answer is “approve more code faster.”

The answer might be “get better at quickly identifying what deserves deep review vs what’s low risk.”

Curious what others think—are we solving the wrong problem by optimizing for merge success?

Luis, this resonates deeply. We saw the exact same pattern at our EdTech startup when we scaled from 25 to 80 engineers over the past year.

This is a systems problem, not a people problem.

Your team isn’t working wrong. Your developers aren’t lazy. Your reviewers aren’t slow. The system itself wasn’t designed for this velocity.

Let me share what we experienced and tried:

The Junior-Senior Dynamic

The most interesting pattern we noticed: Our junior engineers got dramatically faster with AI assistance—they could tackle features that would have required senior help before. Onboarding time dropped by weeks.

But here’s the catch: They still need senior review before shipping. So AI effectively concentrated the bottleneck at the review stage rather than distributing it across development.

We accelerated one part of the pipeline without scaling the downstream capacity. Classic systems thinking problem.

Three Experiments We Ran

  1. Junior-junior peer review as first pass: Two junior engineers review each other’s AI-assisted code before it goes to a senior. Catches obvious issues, reduces senior review time by about 20%. But seniors still need final sign-off.

  2. Automated quality gates with AI scanners: We implemented CodeRabbit and similar tools. They catch about 40% of issues—syntax, simple security patterns, style violations. But they miss architectural concerns, business logic errors, and subtle security issues. Human oversight still essential.

  3. Risk-based review tracks: We categorize changes as low/medium/high risk. Low-risk changes (documentation, UI copy, config) get automated gates + junior review. High-risk (auth, payments, data model) get senior review + security review. This helped the most.
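
For anyone who wants to try the third experiment, here is a minimal sketch of the routing idea. The path patterns and the medium track are assumptions made up for illustration (only the low and high tracks are described above), and this is not our actual tooling.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns; adjust to your own repository layout.
HIGH_RISK_PATTERNS = ("auth/*", "payments/*", "models/*")
LOW_RISK_PATTERNS = ("docs/*", "*.md", "config/*", "copy/*")

REVIEW_TRACKS = {
    "low": ["automated gates", "junior review"],
    "medium": ["automated gates", "senior review"],  # assumed, not from the post
    "high": ["senior review", "security review"],
}


def matches_any(path: str, patterns) -> bool:
    """True if the path matches at least one glob pattern (case-sensitive)."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def classify_change(changed_paths) -> str:
    """Classify a change as low, medium, or high risk from its file paths."""
    if any(matches_any(p, HIGH_RISK_PATTERNS) for p in changed_paths):
        return "high"  # one sensitive file is enough to escalate
    if changed_paths and all(matches_any(p, LOW_RISK_PATTERNS) for p in changed_paths):
        return "low"  # every touched file is in a low-risk area
    return "medium"


def required_reviews(changed_paths):
    """Look up the review steps for a change."""
    return REVIEW_TRACKS[classify_change(changed_paths)]
```

The design choice worth noting is that escalation wins: a PR touching one auth file and ten docs files goes to the high-risk track.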

The Uncomfortable Truth

Even with all these optimizations, our merge success rate is around 73%, not that different from the 70.8% CircleCI average you cited.

And honestly? I think that’s good.

Here’s why: The review bottleneck is preventing a technical debt explosion.

Think about it—if AI helps developers generate 59% more code, but we were shipping 90% of it with the same review depth as before, we’d be accumulating technical debt at an accelerating rate.

The review process slowing things down is a feature, not a bug. It’s the system’s way of saying “we’re generating code faster than we can thoughtfully integrate it.”

The Real Question

Maybe instead of asking “how do we get back to 90% merge success,” we should ask “what’s the right balance between velocity and quality in the AI era?”

I don’t think 90% was ever a magic number—it was a benchmark from a different technological context. When humans wrote all the code at human speed, 90% merge success indicated a healthy, efficient process.

But in an AI-augmented world where code generation dramatically outpaces our ability to thoughtfully review and integrate, maybe 70-75% is the new healthy baseline.

What would concern me more:

  • If our post-deployment incident rate was climbing
  • If technical debt was accumulating faster than we could pay it down
  • If developer satisfaction was dropping due to frustration

Those are better indicators of system health than merge success rate alone.

What are you seeing on those dimensions at your financial services company?

This discussion mirrors something we experienced during our cloud migration last year—we dramatically accelerated infrastructure provisioning with automation, which immediately revealed bottlenecks elsewhere in the deployment pipeline.

Luis, I think you’re onto something important, but I want to add a layer of complexity based on what GitClear has been warning about.

The Quality Problem We’re Not Talking About

GitClear’s analysis shows that AI-assisted code has 1.7× more issues and security findings when there’s no governance framework in place. This isn’t just about volume—it’s about the nature of AI-generated code.

Best of 2025: AI in Software Development: Productivity at the Cost of Code Quality?

The research also shows that 46% of developers actively distrust AI output accuracy. Think about that—nearly half of developers don’t fully trust the code they’re shipping with AI assistance.

That’s your 70.8% signal right there.

AI-Specific Review Challenges

Traditional code review was designed for human-written code with human patterns. But AI introduces new failure modes:

  1. Hallucinated APIs: AI confidently generates calls to methods that don’t exist or are deprecated
  2. Outdated patterns: Training data includes old practices—security vulnerabilities, performance anti-patterns
  3. Subtle logic errors: Code that passes tests but fails edge cases humans would catch
  4. “Looks right but isn’t”: Syntactically perfect but architecturally wrong

These aren’t things automated tests easily catch. They require human judgment from someone who understands both the codebase and common AI failure patterns.

Reframing the Benchmark

I’d challenge the 90% benchmark altogether. That number came from an era when:

  • Humans wrote all code at human speed
  • Developers personally understood every line they committed
  • Code review was catching human errors, not AI hallucinations

The statistics I’ve seen put AI-assisted commits at 41% in 2026, and some estimates suggest the share will cross 50% by the end of the year.

At that point, “developer” means something fundamentally different. The role shifts from “writes code” to “architects solutions and validates AI output.”

If half your code is AI-generated and requires different review patterns, maybe 70% is the right success rate. Maybe attempting to push that higher risks shipping more problematic code.

The Real Question

Should we be aiming for 90% merge success, or 70% with higher quality?

In our SaaS business, a production incident costs us—customer trust, revenue, engineering time. I’d rather have a 70% merge rate with strong quality than 90% with hidden time bombs.

The uncomfortable truth: Maybe AI hasn’t actually made us more productive yet. Maybe it’s made us generate more code that requires more scrutiny, which nets out to the same or slower overall delivery.

The productivity gains might come later—when we develop better review processes, better AI governance, and better quality gates specifically designed for AI-assisted development.

What I’d measure instead:

  • Post-deployment defect rates comparing AI-assisted vs human-written code
  • Time from commit to production for successfully merged changes
  • Developer confidence in the code they’re shipping
  • Accumulation rate of technical debt
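
The first of those metrics could be computed along these lines. This is only a sketch: it assumes you can tag each deployed change as AI-assisted or not (for example with a PR label) and trace defects back to the change that introduced them, and both field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DeployedChange:
    ai_assisted: bool         # hypothetical flag, e.g. taken from a PR label
    post_deploy_defects: int  # defects traced back to this change


def defect_rate_by_origin(changes):
    """Return defects per deployed change, split by AI-assisted vs human-written.

    A group with no changes maps to None rather than a misleading 0.0.
    """
    counts = {"ai_assisted": [0, 0], "human_written": [0, 0]}  # [changes, defects]
    for change in changes:
        key = "ai_assisted" if change.ai_assisted else "human_written"
        counts[key][0] += 1
        counts[key][1] += change.post_deploy_defects
    return {
        key: (defects / n if n else None)
        for key, (n, defects) in counts.items()
    }
```

Comparing the two rates over the same period is what matters; either number alone says little.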

Thoughts? Are we measuring velocity when we should be measuring sustainable delivery?

Coming from the product side, this whole discussion feels like a capacity planning failure dressed up as a code review problem.

Don’t get me wrong—I hear the quality concerns Michelle and Maya are raising. But Luis, let me ask a blunt question:

If your code generation throughput increased 59%, why didn’t you increase review capacity?

This is like scaling customer support. If your user base grows 10×, you can’t expect the same 3 support agents to handle the volume and maintain quality. You either:

  1. Hire more support agents
  2. Implement better tooling (chatbots, knowledge bases, routing)
  3. Accept degraded response times
  4. Accept degraded quality

Engineering isn’t different.

The Business Case Math

Let’s think about this from a resource allocation perspective:

Cost of the bottleneck:

  • Features delayed reaching customers = delayed revenue
  • Engineers context-switching while waiting for review = productivity loss
  • Developer frustration and potential turnover = hiring/retention costs

Cost of fixing the bottleneck:

  • Hiring senior engineers who can review = ~$200K per head
  • AI review tools and automation = ~$50K-100K per year
  • Process improvements and training = time investment but minimal cash

Cost of shipping broken code:

  • Production incidents = customer trust loss, potentially millions in revenue
  • Technical debt = compounding drag on future velocity
  • Security vulnerabilities = potentially catastrophic

Here’s what I’m not seeing: Quantification of these trade-offs.
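
As a starting point, here is a back-of-the-envelope break-even sketch. The ~$200K intervention cost comes from the list above; the weekly cost of delay is a made-up placeholder that you would have to estimate for your own business.

```python
def break_even_weeks(intervention_cost: float, delay_cost_per_week: float) -> float:
    """Weeks of feature delay an intervention must eliminate to pay for itself."""
    if delay_cost_per_week <= 0:
        raise ValueError("delay_cost_per_week must be positive")
    return intervention_cost / delay_cost_per_week


# Hypothetical: delayed features cost $25K per week in deferred revenue.
senior_hire = break_even_weeks(200_000, 25_000)
print(f"A $200K reviewer hire breaks even after removing {senior_hire:.0f} weeks of delay")
```

It ignores incident risk and technical debt entirely, so treat it as the first line of a spreadsheet, not a decision.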

Luis, do you know:

  • What’s the average wait time for code review?
  • How many revenue-generating features are in the review queue right now?
  • What’s the dollar impact of delayed features?
  • What percentage of that 29.2% failure rate is legitimate quality issues vs flaky tests or process problems?

Because here’s my hypothesis:

If throughput really increased 59%, that’s a massive business opportunity—IF you can unlock it. The question is whether the constraint is:

  • Capacity: Not enough reviewers
  • Process: Inefficient review workflow
  • Quality: AI is generating mostly bad code that rightfully fails

These require different solutions.

If it’s capacity: Hire or train more reviewers. Make the business case based on opportunity cost.

If it’s process: Implement risk-based review routing (like Keisha mentioned), better automation, async review tools.

If it’s quality: Maybe the AI tools need better prompt engineering, or developers need training on using AI effectively.

The Meta Question

Should review capacity be planned like sprint capacity?

If a team plans for 40 story points per sprint, and suddenly AI lets them attempt 60, do you:

  1. Keep sprint at 40 and use AI to reduce team size?
  2. Scale review capacity to handle 60?
  3. Accept that effective capacity is still 40, so throughput doesn’t actually increase?
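
Option 3 is easy to see in a toy simulation. The sketch below assumes shipped work is capped by review capacity and that unreviewed work carries over to the next sprint; the 60 and 40 point figures are the ones from the example above.

```python
def simulate_backlog(attempted_per_sprint: int, review_capacity_per_sprint: int, sprints: int):
    """Return (shipped, backlog) per sprint when review capacity caps throughput."""
    backlog = 0
    history = []
    for _ in range(sprints):
        available = backlog + attempted_per_sprint  # carried-over work plus new work
        shipped = min(available, review_capacity_per_sprint)
        backlog = available - shipped
        history.append((shipped, backlog))
    return history


for sprint, (shipped, backlog) in enumerate(simulate_backlog(60, 40, 3), start=1):
    print(f"sprint {sprint}: shipped {shipped}, waiting for review {backlog}")
```

Shipped points stay at 40 every sprint while the review backlog grows by 20 each time, which is the feature-branch-up, main-branch-flat pattern in miniature.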

This is a classic scaling decision, and it has a right answer based on your business model and competitive position.

In your financial services context, maybe the answer is “accept 40, bank the quality.” But at a fast-moving fintech competing with you, maybe it’s “scale to 60, beat you to market.”

Different contexts, different trade-offs.

What I’d want to see:

  • Time-to-review distribution (are there outliers?)
  • Review failure reasons categorized (quality vs process)
  • Business impact analysis of delayed features
  • Cost-benefit of interventions
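
For the first item, something like the following would do. It is a sketch that assumes you can export per-PR wait times in hours from your review tool, and it flags outliers with the common 1.5 × IQR rule, which is one convention among several.

```python
import statistics


def wait_time_summary(wait_hours):
    """Summarize time-to-review: median, 90th percentile, and long-wait outliers.

    Needs at least two data points. Outliers are waits above Q3 + 1.5 * IQR.
    """
    q1, _, q3 = statistics.quantiles(wait_hours, n=4, method="inclusive")
    upper_fence = q3 + 1.5 * (q3 - q1)
    deciles = statistics.quantiles(wait_hours, n=10, method="inclusive")
    return {
        "median": statistics.median(wait_hours),
        "p90": deciles[8],  # ninth cut point is the 90th percentile
        "outliers": [w for w in wait_hours if w > upper_fence],
    }
```

If the median is a few hours but a handful of PRs wait days, that points to a routing or ownership problem rather than a raw capacity problem.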

Then you can make an informed decision rather than just accepting that “this is how it is now.”

Am I oversimplifying? Genuinely curious how engineering leaders think about this trade-off.