59% More Engineering Throughput, Zero Velocity Gains — CircleCI's 2026 Data Proves the Bottleneck Was Never Code Generation

CircleCI just dropped their 2026 State of Software Delivery report, and the headline number is staggering: 59% more daily workflow runs year-over-year — the biggest throughput increase they have ever measured.

But here is the part nobody is putting on their slides: main branch throughput increased only 1%.

Feature branch activity grew almost 50% for top-performing teams. Code is being written faster than ever. AI agents are opening PRs at 2 AM. And yet the stuff that actually ships to customers is barely moving.

The Numbers That Should Worry Every Product Leader

  • Main branch success rate dropped to 70.8% — the lowest in five years, well below the 90% benchmark
  • Recovery time climbed to 72 minutes to get back to green, up 13% from last year
  • Feature branch throughput up 15.2% for the median team, but main branch throughput down 6.8%
  • Top 5% of teams saw 97% throughput growth, which suggests the gains are concentrated at the top while most of the other 95% are treading water
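
For anyone who wants to check their own numbers against these, here is a minimal sketch of how main branch success rate and recovery time could be computed from CI run records. The `Run` shape and both definitions are assumptions for illustration (recovery is measured from the first failing run of a red streak to the next passing run); CircleCI's own methodology may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Run:
    """Hypothetical record of one main-branch workflow run."""
    finished_at: datetime
    passed: bool


def success_rate(runs):
    """Fraction of runs that passed (0.0 for an empty list)."""
    return sum(r.passed for r in runs) / len(runs) if runs else 0.0


def recovery_times(runs):
    """Durations from the first failing run of each red streak to the next pass."""
    durations = []
    red_since = None
    for run in sorted(runs, key=lambda r: r.finished_at):
        if not run.passed and red_since is None:
            red_since = run.finished_at          # streak starts here
        elif run.passed and red_since is not None:
            durations.append(run.finished_at - red_since)
            red_since = None                     # back to green
    return durations


# Made-up sample data: one red streak lasting 72 minutes.
start = datetime(2026, 1, 5, 9, 0)
runs = [
    Run(start, True),
    Run(start + timedelta(minutes=30), False),
    Run(start + timedelta(minutes=50), False),
    Run(start + timedelta(minutes=102), True),
    Run(start + timedelta(minutes=150), True),
]
rate = success_rate(runs)
recoveries = recovery_times(runs)
print(f"success rate: {rate:.1%}")
print(f"median recovery: {median(recoveries)}")
```

Counting the whole red streak rather than each failed run keeps a flurry of failing retries from looking like several fast recoveries.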

Waydev frames it well: developer tools companies market time-to-first-commit because it reflects well on AI productivity. Nobody markets MTTR because that is where productivity gains disappear.

The Bottleneck Shifted, and Most Orgs Have Not Noticed

We have been pouring resources into the code generation phase of the SDLC. AI coding assistants, agentic workflows, faster iteration loops. And it worked — developers are producing more changes than ever.

But the review pipelines, deployment gates, test infrastructure, and organizational approval processes were built for human-speed output. As Harness documented, teams using high-adoption multi-agent workflows report 98% more PRs merged but also 91% longer code review times and 154% larger PR sizes.

The code generation phase is no longer the constraint. The constraint is everything downstream of it.

What the Top 5% Do Differently

CircleCI’s data shows the top 5% are not just “better at AI.” They have invested in:

  1. Validation infrastructure that keeps pace with generation speed
  2. Automated quality gates that do not require human review for every change
  3. Main branch health as a first-class metric rather than an afterthought

The gap is not tooling — it is organizational readiness.

The Product Question

As a product leader, I keep coming back to this: we measure engineering output in PRs merged and features shipped. But if 59% more throughput translates to zero velocity gains, what are we actually measuring?

Are we optimizing for the appearance of productivity rather than the reality of delivery?

I would love to hear from folks managing engineering orgs: are you seeing this bottleneck shift play out in your teams? And if so, what are you doing about it?

This is hitting close to home. I manage 40+ engineers at a financial services company where we adopted AI coding tools aggressively in Q4 last year, and I have been staring at exactly this paradox for months.

Here is what I see on the ground: my teams are generating roughly 40% more PRs per sprint than they were a year ago. Our Jira velocity charts look incredible. Leadership loves the trend line. But our deployment frequency has not changed, and our mean time to recovery actually got worse, from about 45 minutes to just over an hour.

The reason is exactly what David is describing. Our review infrastructure was built for the old volume. We still require two human approvals for any PR touching financial transaction code (regulatory requirement). We still run a full integration test suite that takes 38 minutes. We still have a change advisory board that meets twice a week to approve production deployments.

None of those processes scaled with the AI-generated throughput increase.

What We Are Trying

We started a “delivery pipeline modernization” initiative in January. Three things we are doing:

  1. Risk-tiered review: Not every PR needs the same scrutiny. We built a classifier that routes low-risk changes (test updates, config, documentation) through an automated approval path. High-risk changes (financial logic, auth, data models) still get two human reviewers. This alone cut average review wait time by about 35%.

  2. Parallel test execution: We invested in splitting our integration suite into independent test shards that run concurrently. Dropped the 38-minute suite to 12 minutes. Cost us about $4K/month more in CI compute, but the throughput improvement made it worth it immediately.

  3. Main branch health dashboard: We now have a real-time dashboard showing main branch success rate, and every standup starts with it. This was the cheapest change and probably the most impactful. Engineers started self-policing because nobody wants to be the one who broke the dashboard.
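
To make item 2 concrete, here is a rough sketch of one common way to split a suite into shards: sort tests by historical duration and hand each one to whichever shard is currently lightest. This is an illustration, not the exact mechanism described above; the test names and durations are invented, and it assumes the tests really are independent of each other.

```python
import heapq


def assign_shards(durations, shard_count):
    """Greedy longest-first assignment: each test goes to the lightest shard so far."""
    heap = [(0.0, index) for index in range(shard_count)]   # (load_seconds, shard)
    heapq.heapify(heap)
    shards = [[] for _ in range(shard_count)]
    by_longest = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    for name, seconds in by_longest:
        load, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (load + seconds, index))
    loads = [0.0] * shard_count
    for load, index in heap:
        loads[index] = load
    return shards, loads


# Hypothetical per-test durations in seconds (2280 s in total if run serially).
durations = {
    "test_ledger": 600,
    "test_payments": 540,
    "test_auth": 420,
    "test_reports": 360,
    "test_export": 240,
    "test_search": 120,
}
shards, loads = assign_shards(durations, 3)
print("serial seconds:", sum(durations.values()))
print("longest shard seconds:", max(loads))
```

Wall-clock time is set by the slowest shard, so the thing to watch is `max(loads)`, not the average. A single very long test caps how much adding shards can help.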

The CircleCI top-5% finding resonates. The differentiator is not the AI tooling — we all have roughly the same tools. It is whether your downstream systems can absorb the volume. In regulated industries like ours, that is an especially hard problem because you cannot just automate away compliance review.

David, your question about “what are we actually measuring” is the right one. I have been pushing my leadership to shift our primary metric from PRs merged to deployment frequency with quality — basically, how often are we shipping to production with a main branch success rate above 85%? It is harder to game than raw throughput numbers.

I want to add a perspective from the design systems side of the house, because I think this bottleneck is even wider than the CircleCI data captures.

We have been shipping design system component updates through the same PR pipeline as feature work. AI tools have made it trivially easy for engineers to scaffold new component variants: I am seeing PRs for new button styles, card layouts, and form patterns that would once have taken a couple of days now landing in hours. Output is way up.

But here is the thing: those PRs still need design review, and that review process has not changed at all. Every component update needs visual regression checks, accessibility audit, cross-browser verification, and design system consistency review. That is still a human doing it with a Figma file open in one tab and a staging environment in the other.

The result? Our design review queue went from an average of 3 days to 8 days this quarter. Engineers are generating component PRs faster than we can validate them. Some of those PRs sit so long that by the time we review them, the design spec has changed and the work needs to be redone.

The “Review Fatigue” Problem Nobody Talks About

The 91% longer review times stat from Harness lands differently when you think about the humans doing the reviewing. My team is not getting 91% more capacity. They are the same four people they were last year, now looking at nearly twice the volume.

I have seen the quality of reviews degrade noticeably. Reviewers are approving things faster because the queue pressure is constant. We caught two accessibility regressions in production last month that should have been caught in review. Both were in PRs that got rushed through because the backlog was growing.

This is the part that worries me most: AI keeps scaling the generation side while review capacity stays fixed. When arrivals into a fixed-capacity queue keep rising, either the backlog grows or each item gets less attention, and quality drops. That is just how queues behave.
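
A toy model makes the arithmetic visible. The numbers below are invented (20 reviews per week of capacity, arrivals going from 20 to 38 per week, roughly "nearly twice the volume"), and the model is deliberately deterministic; real queues with uneven arrivals tend to back up even sooner.

```python
def simulate_backlog(arrivals_per_week, reviews_per_week, weeks, starting_backlog=0):
    """Return the review backlog at the end of each week for constant rates."""
    backlog = starting_backlog
    history = []
    for _ in range(weeks):
        # Anything not reviewed this week carries over; backlog cannot go negative.
        backlog = max(0, backlog + arrivals_per_week - reviews_per_week)
        history.append(backlog)
    return history


CAPACITY = 20  # hypothetical reviews per week for a fixed-size team

before = simulate_backlog(arrivals_per_week=20, reviews_per_week=CAPACITY, weeks=6)
after = simulate_backlog(arrivals_per_week=38, reviews_per_week=CAPACITY, weeks=6)

# Rough wait for a PR joining the back of the queue, in weeks.
wait_weeks = after[-1] / CAPACITY

print("backlog before:", before)
print("backlog after: ", after)
print("approx. wait for newest PR (weeks):", wait_weeks)
```

With arrivals matching capacity the backlog stays flat. Once arrivals exceed capacity it grows by the difference every week, with no steady state, unless reviewers speed up by looking less closely.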

What I Think Is Missing From This Conversation

Everyone is talking about automating review — and yes, automated visual regression testing and AI-assisted code review can help. But there is a category of review that is fundamentally about judgment: Does this component fit the design language? Is the interaction pattern intuitive? Does the error state actually help the user?

Those are not automatable yet, and I am skeptical they will be anytime soon.

Luis, your risk-tiered review approach is interesting — I wonder if we could apply something similar on the design side. Not every component change needs a full design review. Maybe config changes and minor style tweaks go through automated checks while new patterns get the full human review.
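
A minimal sketch of how that split might work on the design side, using a path-based rule. The path patterns are hypothetical, and this is not a description of how Luis's classifier works; the one deliberate choice is that anything unrecognized falls back to full human review.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; a real design system repo would define its own.
FULL_REVIEW_PATTERNS = [
    "components/*/index.tsx",   # component behavior or structure
    "patterns/*",               # new interaction patterns
]
AUTOMATED_PATTERNS = [
    "tokens/*.json",            # config-style design tokens
    "components/*/*.css",       # minor style tweaks
    "docs/*",
]


def review_tier(changed_files):
    """Return 'full' if any changed file needs human design review, else 'automated'."""
    if not changed_files:
        return "full"  # nothing to classify, so do not auto-approve
    for path in changed_files:
        if any(fnmatchcase(path, p) for p in FULL_REVIEW_PATTERNS):
            return "full"
        if not any(fnmatchcase(path, p) for p in AUTOMATED_PATTERNS):
            return "full"  # unknown kind of change: be conservative
    return "automated"


print(review_tier(["tokens/colors.json", "docs/changelog.md"]))
print(review_tier(["components/card/index.tsx"]))
```

The automated tier would still run visual regression and accessibility checks. The routing only decides whether a human also has to look before merge.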

The broader question for me is whether the industry is going to invest in scaling the review infrastructure, or whether we are just going to let review quality degrade until we start shipping worse products faster.

I have been sitting with this CircleCI data for a few days, and I think the framing of “bottleneck shift” understates what is actually happening. This is not a pipeline problem. This is an organizational design problem.

Let me explain what I mean.

The Top 5% Are Not Doing One Thing Differently — They Are Organized Differently

When I look at the teams in my org that match CircleCI’s top 5% profile — high throughput AND high main branch success rate — they share a structural characteristic that has nothing to do with their CI/CD tooling: they own their entire delivery path end to end.

These are teams where the same group that writes the code also owns the tests, the deployment pipeline, the monitoring, and the on-call rotation. They do not hand off PRs to a separate review board. They do not wait for a change advisory committee. They have the authority and the infrastructure to validate and ship autonomously.

The teams that are stuck in the “59% more throughput, zero velocity gains” trap? They are almost always teams where generation and validation are organizationally separated. One group writes, another reviews, a third approves, a fourth deploys. AI accelerated step one and left steps two through four unchanged.

Why This Is a Leadership Problem, Not a Tools Problem

Maya’s point about review fatigue is crucial. When you have a fixed-capacity review function absorbing AI-scaled generation volume, you get exactly what she described: quality degradation, rushed approvals, and regressions in production.

But the answer is not “hire more reviewers” or even “automate more reviews.” The answer is to ask why generation and validation are separate functions in the first place.

In most organizations, that separation exists because of historical trust boundaries. We separated code writing from code review because we did not trust individual contributors to ship safely on their own. We created change advisory boards because we did not trust teams to assess deployment risk autonomously.

Those trust boundaries made sense when deployment was expensive and rollback was hard. In 2026, with feature flags, canary deployments, and instant rollback capability, the cost of shipping has dropped dramatically, but the organizational permissions have not been updated.

What I Am Doing About It

Three changes I have been driving at the org level:

  1. Collapsing the generation-validation gap: Embedding automated quality gates directly into the development workflow rather than having them as a separate phase. If your CI catches 80% of issues before a human reviewer ever sees the PR, the review becomes a judgment call rather than a line-by-line inspection. Luis’s risk-tiered approach is heading in this direction.

  2. Team-level deployment authority: Teams that maintain a main branch success rate above 85% and MTTR under 30 minutes earn the right to deploy without external approval. Teams below those thresholds get additional review gates. This creates a natural incentive to invest in validation infrastructure.

  3. Measuring delivery, not output: We stopped reporting PRs merged or lines of code at the executive level. The metrics that matter are deployment frequency, change failure rate, and time to restore service. If AI generates 59% more throughput and those numbers do not move, the throughput number is noise.
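
The gate in item 2 is simple enough to express as a policy check. This is a sketch only: the two thresholds come from the description above, while the `TeamHealth` shape, the function names, and the sample teams are made up for illustration.

```python
from dataclasses import dataclass

SUCCESS_RATE_FLOOR = 0.85     # main branch success rate must be above this
MTTR_CEILING_MINUTES = 30     # time to restore must be under this


@dataclass
class TeamHealth:
    """Hypothetical rolling health snapshot for one team."""
    name: str
    main_success_rate: float  # 0.0 to 1.0
    mttr_minutes: float


def can_self_deploy(team):
    """True only if the team clears both thresholds."""
    return (
        team.main_success_rate > SUCCESS_RATE_FLOOR
        and team.mttr_minutes < MTTR_CEILING_MINUTES
    )


def deployment_path(team):
    """Map a team's health to the approval path it gets."""
    return "self-service" if can_self_deploy(team) else "extra review gates"


teams = [
    TeamHealth("payments", 0.92, 18),
    TeamHealth("reporting", 0.708, 72),  # the industry-wide figures quoted earlier
]
for team in teams:
    print(team.name, "->", deployment_path(team))
```

Evaluating this over a rolling window rather than a single day would keep one bad afternoon from flipping a team's status back and forth.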

The uncomfortable truth is that most organizations bought AI coding tools hoping for a productivity free lunch. The CircleCI data is showing that the lunch is not free — the cost is organizational transformation that most companies have not started yet.