We're Shipping More Code Than Ever—So Why Are Our Releases Slowing Down?

Last quarter, our engineering team merged 98% more pull requests than the previous year. When our CEO asked during our board meeting when customers would actually see the impact of all this “productivity,” I had to explain something that didn’t make sense on the surface: despite shipping more code than ever, our release velocity to production had actually decreased.

This is the paradox of 2026 that nobody warned us about.

The Numbers Don’t Lie (But They Do Mislead)

According to CircleCI’s 2026 State of Software Delivery report, AI-assisted development drove a 59% increase in average engineering throughput. Sounds incredible, right? But here’s what the aggregate numbers hide: feature branch throughput increased 15%, while main branch throughput—where code actually gets promoted to production—fell by 7%.

We’re generating more code, merging more PRs, and closing more tickets. But we’re releasing less frequently to customers.

The Supervision Paradox

Here’s what I’ve learned the hard way: the faster AI generates code, the more human attention is required to ensure that code actually works in the context of a real system with real users and real business constraints.

The production bottleneck didn’t disappear—it moved from writing to understanding. And understanding is much harder to speed up.

At my company, PR review time increased 91% after widespread AI adoption. Our senior engineers now spend 60-70% of their time reviewing code instead of writing it. We can’t just rubber-stamp AI-generated code—we’ve found subtle bugs that would have been production incidents, edge cases that weren’t considered, architectural decisions that didn’t align with our patterns.

We’re Measuring the Wrong Things

This is a classic case of Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”

We’re tracking:

  • Commit counts (vanity metric)
  • PRs merged (activity, not outcome)
  • Lines of code (actively harmful)
  • Velocity points (disconnected from customer value)

We’re not tracking:

  • Time from feature kickoff to customer hands
  • Deployment frequency paired with failure rates
  • Customer-impacting releases
  • Actual business outcomes

Commit counts don’t tell you actual value delivered. They’re easy to game and completely disconnected from what customers care about.

The Validation Bottleneck Is Real

The data from Harness’s 2026 report is sobering: 47% of frequent AI tool users report that manual work—QA, remediation, validation—has become more problematic, not less. 69% experience deployment problems when AI-generated code is involved.

Our deployment systems were designed for a world where writing code was the bottleneck. They weren’t designed for a world where we have 3x the code volume flowing through our pipelines.

Our CI/CD infrastructure can handle maybe 10 deployments per day. AI wants to push 50. Our test suites take 45 minutes to run. Our deployment approval processes are still manual. Our observability tools weren’t scaled for this volume of changes.
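The arithmetic behind that queue is worth making explicit. A toy sketch of the backlog dynamics (the 50-in / 10-out numbers are from this post; everything else is illustrative):

```python
# Toy model of the deployment backlog described above: changes arrive faster
# than the pipeline can ship them, so "ready to deploy" inventory grows daily.

def queue_growth(arrivals_per_day: int, capacity_per_day: int, days: int) -> list[int]:
    """Backlog size at the end of each day."""
    backlog, history = 0, []
    for _ in range(days):
        backlog += arrivals_per_day                  # newly merged changes
        backlog -= min(backlog, capacity_per_day)    # what actually deploys
        history.append(backlog)
    return history

print(queue_growth(50, 10, 5))  # [40, 80, 120, 160, 200] -- growing 40/day
```

The punchline: the backlog never stabilizes. Either capacity rises to meet arrivals, or the queue grows without bound.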

The Human Cost

Here’s the statistic that keeps me up at night: 96% of very frequent AI coding users report being required to work evenings or weekends multiple times per month due to release-related work. Compare that to 66% of occasional users.

We’ve traded writing code during business hours for reviewing code, fixing broken deployments, and firefighting production issues on weekends. Developers report spending 36% of their time on repetitive manual tasks—chasing tickets, rerunning failed jobs, copy-pasting configuration, and waiting on human approvals.

This isn’t sustainable. We’re burning people out.

The Path Forward

The teams that are succeeding aren’t just measuring commits—they’re investing in the entire delivery lifecycle with the same urgency they invested in AI tools:

  • Modernizing CI/CD infrastructure to match the new velocity
  • Creating dedicated release engineering teams to own the deployment pipeline
  • Implementing comprehensive automated testing at every layer
  • Building observability into everything to catch issues earlier
  • Establishing clear SLAs on review time so PRs don’t pile up indefinitely
  • Shifting metrics from activity to outcomes—cycle time, deployment frequency, customer value delivered

Top performers treat validation as a first-class engineering investment, not an afterthought.

Questions for the Community

I’m curious about your experiences:

  1. Has anyone successfully modernized their delivery pipeline to match AI coding velocity? What did you invest in? What worked and what didn’t?

  2. What metrics are you actually tracking to measure customer value delivery instead of just engineering activity?

  3. How are you balancing the speed benefits of AI with the increased supervision and validation it requires? Have you changed your team structure or processes?

  4. What are you doing to prevent developer burnout when the new bottleneck is human review and validation?

The conversation has to shift from “whether to adopt AI” to “how to build systems that can actually deliver the value AI makes possible.” We’re generating more code than ever, but if it’s not reaching customers, we’re just creating expensive inventory.

What are you seeing in your organizations?

Michelle, this resonates deeply with what I’m seeing from the product side.

Our engineering leadership keeps showing green dashboards in planning meetings—more commits, more PRs merged, higher velocity scores. But when I look at our product roadmap, we’re actually shipping features to customers slower than we were a year ago.

The Product-Engineering Metrics Disconnect

Here’s the uncomfortable truth: customers don’t care about commit counts. They care about features shipping. They care about bugs getting fixed. They care about the product getting better in ways they can actually see and use.

But somehow we’ve created this bizarre parallel universe where engineering metrics are green and product metrics are red:

  • Engineering dashboard: 59% increase in throughput, 98% more PRs merged
  • Product dashboard: 23% longer cycle time from feature kickoff to GA, 15% decline in features shipped per quarter
  • Business dashboard: No improvement in revenue velocity, flat customer retention, NPS actually declined

The disconnect is real and it’s getting harder to explain to the board.

The “Done” Redefinition Problem

We’ve had this recurring pattern in sprint reviews where engineering says a feature is “done” (code merged, tests passing), but it’s not actually deployed to production. Or it’s deployed behind a feature flag but not enabled. Or it’s enabled but not announced. Or it’s announced but customers can’t actually use it because documentation isn’t ready.

So when is it actually “done”? When code merges? Or when customers get value?

AI has made the first definition of “done” much faster. But it hasn’t touched the second definition at all. In fact, it’s made it worse because we now have this massive inventory of “done” code sitting in a deployment queue.

What I’m Proposing We Track Instead

I’ve been pushing for a shift in our shared metrics between product and engineering:

Stop measuring:

  • Commit velocity
  • PR merge rate
  • Story points completed
  • Lines of code

Start measuring:

  • Cycle time from feature kickoff to customer hands (not just to merge)
  • Time-to-value (how long until customers can actually use the feature)
  • Feature adoption rate (are customers even using what we shipped?)
  • Customer satisfaction with release quality (bugs, polish, completeness)
  • Revenue or retention impact of shipped features

The goal is to align engineering and product around customer outcomes, not internal activity.

The Real Question

Here’s what I keep coming back to: if AI is making us so much more “productive,” why aren’t our customers seeing the benefit? Why isn’t our revenue growing faster? Why aren’t we capturing more market share?

Something is fundamentally broken in how we’re thinking about this.

Michelle, you asked: “How do we get engineering and product to agree on shared outcome metrics instead of activity metrics?”

I think the answer starts with acknowledging that most of our current metrics were designed for a different era—when writing code was the bottleneck. Now that AI has eliminated that bottleneck, we need metrics that reflect the new reality: validation, deployment, and adoption are the constraints.

Would love to hear how other product leaders are navigating this. Are we the only ones experiencing this productivity paradox where engineering is “faster” but products are slower?

This hits home, Michelle. My teams are exhausted—not from writing code, but from reviewing it.

I’m going to be blunt about what’s happening on the ground, because I think engineering leadership needs to stop celebrating the productivity numbers and start looking at the human and operational reality.

The Review Bottleneck Is Crushing My Senior Engineers

My senior engineers are spending 60-70% of their time on code review now. Not 30%. Not 50%. Seventy percent.

And here’s the thing—we can’t just rubber-stamp it. We’ve found:

  • Subtle race conditions that would have been production incidents
  • Security vulnerabilities that our automated scanners missed
  • API design choices that violated our architectural principles
  • Edge cases that weren’t considered (null handling, timezone issues, Unicode problems)
  • Performance issues that only appear at scale
  • Code that technically “works” but doesn’t match our team’s patterns and will be unmaintainable

AI is very good at writing code that compiles and passes basic tests. It’s not good at understanding your specific system’s constraints, your team’s architectural decisions, or your organization’s non-functional requirements.

Infrastructure Built for a Different World

David mentioned the deployment queue problem—let me tell you what that looks like operationally.

Our CI/CD pipelines were designed when we deployed 8-10 times per day. Now AI wants to push 40-50 changes per day and our infrastructure is melting:

  • Test suites take 45 minutes to run (and that’s after optimization)
  • We’re hitting rate limits on our cloud provider
  • Our deployment approval workflow still requires manual sign-off from a tech lead
  • Our observability tools can’t keep up with the volume of changes
  • Our rollback procedures take 20+ minutes because we never optimized them

Every time a deployment fails (and 69% of teams report frequent deployment problems with AI-generated code), it blocks everything behind it. We end up with this massive queue of “ready to deploy” code that can’t actually get to production.

What We’re Doing About It (And What’s Working)

I’ll share what we’ve implemented in the last three months, because sitting around complaining wasn’t going to fix it:

1. Invested in Test Infrastructure

  • Parallelized our test suite (cut time from 45min to 12min)
  • Added smoke tests that run in 2 minutes for fast feedback
  • Implemented visual regression testing for UI changes
  • Created “AI code quality” tests for common patterns we’ve seen fail

2. Created “AI Code Review Checklist”

  • Documented the top 20 issues we find in AI-generated code
  • Automated checks for about half of them
  • Training for junior engineers on what to watch for
  • Significantly reduced review time for senior engineers

3. Feature Flags Everywhere

  • Decoupled deployment from release
  • Can deploy code without exposing it to users
  • Gradual rollouts (1% → 10% → 50% → 100%)
  • Easy rollback without redeployment
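Mechanically, the gradual rollout is just stable bucketing. A vendor-neutral sketch (the flag name and hashing scheme are illustrative, not any specific feature-flag product's API):

```python
import hashlib

def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Place each user in a stable 0-99 bucket per flag. Ramping percent
    from 1 -> 10 -> 50 -> 100 only ever adds users; nobody flips back out."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent
```

Because the bucket depends only on the flag and the user, a user who saw the feature at 10% still sees it at 50%, which keeps gradual rollouts from feeling flaky to customers.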

4. Dedicated Release Engineering Team

  • Took deployment off individual teams’ plates
  • Owns CI/CD pipeline, observability, deployment process
  • Can optimize the delivery system independent of feature development

5. SLAs on Review Time

  • PRs can’t sit for more than 24 hours without feedback
  • Prevents the pile-up problem
  • Forces us to balance code generation velocity with review capacity
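Enforcing the SLA is easy to automate once you have PR timestamps. A sketch of the check we run (the record fields here are illustrative, not any particular git host's API schema):

```python
from datetime import datetime, timedelta

REVIEW_SLA = timedelta(hours=24)

def overdue_prs(prs: list[dict], now: datetime) -> list[str]:
    """Titles of PRs that have waited past the SLA with no first review."""
    return [
        pr["title"]
        for pr in prs
        if pr["first_review_at"] is None and now - pr["opened_at"] > REVIEW_SLA
    ]

# Run hourly from CI and post the list to the team channel,
# instead of letting PRs pile up silently.
```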

The Hard Truth Michelle Touched On

We need to invest in delivery infrastructure with the same urgency we invested in AI tools.

Most companies spent $50-100 per developer per month on AI coding assistants. Great. Did they also invest in:

  • Test infrastructure that can handle 3x the volume?
  • Observability that can track 3x the changes?
  • CI/CD systems that can deploy 3x as often?
  • Training for engineers on how to review AI-generated code?
  • Release engineering capacity to manage the increased throughput?

If the answer is no, you’re going to hit exactly the bottleneck Michelle described.

To Product Leaders Like David

You asked how to align product and engineering on outcome metrics. From the engineering side, here’s what I need:

  • Stop measuring story points and commit velocity
  • Start measuring cycle time from commit to production
  • Track deployment frequency AND deployment failure rate (you need both)
  • Measure MTTR (mean time to recovery) when things break
  • Look at change failure rate

If my deployment frequency is high but my change failure rate is also high, that’s a problem. If my cycle time is fast but my MTTR is slow, that’s a problem. The metrics need to be in tension with each other to show the trade-offs.
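Those paired metrics fall out of the same deployment log, which is exactly why they should be reported together. A sketch of the computation (the record shape is an assumption; the point is that frequency, failure rate, and MTTR come from one source and check each other):

```python
def delivery_metrics(deploys: list[dict], days: int) -> dict:
    """Deployment frequency, change failure rate, and MTTR from one log,
    so none of them can be cherry-picked in isolation."""
    failures = [d for d in deploys if d["failed"]]
    return {
        "deploys_per_day": len(deploys) / days,
        "change_failure_rate": len(failures) / len(deploys) if deploys else 0.0,
        "mttr_minutes": (sum(d["recovery_minutes"] for d in failures) / len(failures))
                        if failures else 0.0,
    }
```

A dashboard that shows only `deploys_per_day` is the activity-metric trap all over again; the other two numbers are what keep it honest.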

And honestly? I’d rather deploy less frequently with higher quality than deploy constantly and spend my weekends firefighting.

The Question I’m Wrestling With

Michelle asked: “How are you balancing the speed benefits of AI with the increased supervision and validation it requires?”

I’m not sure we are balancing it. I think we’re still in the phase where we’re chasing velocity without paying the supervision cost, and it’s showing up as weekend incidents, engineer burnout, and delayed releases.

The teams that are succeeding are the ones that accepted that AI didn’t eliminate the bottleneck—it moved it. And they restructured around the new bottleneck instead of pretending it doesn’t exist.

Anyone else dealing with the review capacity problem? How are you scaling code review to match AI code generation?

Michelle, that 96% statistic—that very frequent AI users are working evenings or weekends multiple times per month—is keeping me up at night too.

We’re not just talking about a productivity paradox. We’re talking about an impending retention crisis.

This Isn’t Sustainable

Let me paint the picture of what I’m seeing across my engineering organization:

Three months ago, we rolled out AI coding assistants to all 80+ engineers. Leadership celebrated. Productivity metrics went up. Everyone was excited.

Today? I’ve had five senior engineers come to me in the last month about burnout. Two of them are actively interviewing. One already put in notice.

The pattern is consistent:

  • They’re generating more code during work hours
  • They’re spending evenings and weekends reviewing that code
  • They’re getting paged more often because more code = more surface area for bugs
  • They’re firefighting production issues from AI-generated code that looked fine in review but broke in production
  • They feel responsible for catching issues but don’t have enough hours in the day

This is the hidden cost nobody talks about when they show the “59% productivity increase” slides.

Organizational Design for a World That No Longer Exists

Luis is right about the infrastructure lag, but there’s an organizational design problem underneath it.

Our entire engineering org structure was designed around the assumption that writing code is the bottleneck.

We hired for that world:

  • Lots of IC engineers to write code
  • Fewer senior engineers to review and architect
  • Even fewer SREs and release engineers to deploy
  • Minimal investment in test automation engineers

But in the AI era, the bottleneck has flipped:

  • Writing code is easy (AI does it)
  • Reviewing code is hard (requires senior judgment)
  • Deploying code is overwhelmed (infrastructure can’t keep up)
  • Validating code is manual (we never invested in comprehensive automated testing)

We have an inverted pyramid problem. We’re heavy on junior/mid engineers who are great at writing code (which AI now does), and light on senior engineers, SREs, and platform engineers who are critical for the new bottlenecks.

Culture and Expectations Mismatch

Here’s the cultural problem I’m wrestling with:

Leadership sees the productivity numbers and expects faster delivery. They don’t understand why “98% more PRs merged” isn’t translating to “98% more features shipped.”

So they push harder. More urgent deadlines. More pressure to ship. More “why is this taking so long?”

Meanwhile, engineers know they can’t just ship AI-generated code without thorough review. They know the deployment pipeline is fragile. They know the observability isn’t good enough to catch issues early.

So they do what responsible engineers do: they stay late. They work weekends. They review code thoroughly. They manually test before deploying. They monitor deployments closely.

And they burn out.

What I’m Implementing (And What’s Hard)

I’m trying to restructure around the new reality, but it’s harder than it sounds:

1. Created a Dedicated Release Engineering Team

  • Took 4 senior engineers off feature work (leadership was NOT happy)
  • They own CI/CD, observability, deployment automation, incident response
  • Goal: make deployment a service that feature teams consume, not a problem they each solve

2. Implemented SLAs on Review Time

  • No PR sits for more than 24 hours without initial review
  • No PR sits for more than 48 hours without merge or explicit “needs work”
  • Forces us to staff for review capacity, not just code generation capacity

3. “No Deploy Fridays” Policy

  • Protects weekends from deployment incidents
  • You can merge code on Fridays, but no production deploys after 2pm
  • Controversial with product, but retention matters more than velocity

4. Shifted Hiring Strategy

  • Actively hiring for: SREs, test automation engineers, release engineers
  • Less emphasis on junior IC engineers (AI is filling that gap)
  • Prioritizing senior engineers who can review AI-generated code effectively

5. Investment in Observability

  • If we’re going to deploy more frequently, we need to catch issues faster
  • Comprehensive monitoring, tracing, alerting
  • Automated canary deployments with automatic rollback
  • Cost: $15K/month. Worth every penny.
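The automatic-rollback piece is conceptually simple: compare the canary slice's error rate to the baseline and make the promote/rollback call without a human in the loop. A sketch of the decision (the 50% threshold and the zero-baseline floor are illustrative choices, not our exact tuning):

```python
def canary_verdict(baseline_error_rate: float,
                   canary_error_rate: float,
                   max_relative_increase: float = 0.5) -> str:
    """'rollback' if the canary errors more than (1 + increase) x baseline.
    The small absolute floor keeps a zero-error baseline from auto-failing
    the canary on a single transient error."""
    threshold = max(baseline_error_rate * (1 + max_relative_increase), 0.001)
    return "rollback" if canary_error_rate > threshold else "promote"
```

The real system watches latency and saturation too, but error-rate regression alone catches a surprising share of bad deploys before customers do.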

The Retention Risk Nobody’s Talking About

Here’s what I told our CEO last week:

“We can optimize for velocity or we can optimize for retention. With the current approach, we’re going to lose our best engineers within six months. They’ll go to companies that haven’t created this AI-induced burnout machine.”

The engineers who are most valuable—the ones with the judgment to catch subtle issues in AI-generated code, the architectural knowledge to know when AI’s solution doesn’t fit our system, the operational experience to prevent production incidents—those are exactly the engineers who are burning out fastest.

Because they feel responsible. Because they’re the ones getting paged at 2am when the AI-generated code breaks in production. Because they’re the ones who see the issues everyone else misses.

What I’m Asking Leadership For

I need three things from exec leadership, and I’m not getting enough support:

1. Permission to slow down

  • Stop measuring commit velocity
  • Start measuring sustainable delivery velocity
  • Accept that “98% more PRs” doesn’t mean “98% faster shipping”

2. Budget to invest in the delivery system

  • Test infrastructure
  • CI/CD modernization
  • Observability and monitoring
  • Release engineering headcount

3. Culture change around “productivity”

  • Productivity is not code volume
  • Productivity is customer value delivered sustainably
  • If engineers are working weekends regularly, we’re not more productive—we’re just burning people out faster

The Question for Other VPs

Michelle asked: “What are you doing to prevent developer burnout when the new bottleneck is human review and validation?”

Honestly? I’m fighting an uphill battle. The AI productivity narrative is so strong that it’s hard to get leadership to see the human cost.

David mentioned the business metrics haven’t improved despite engineering “productivity” gains. That’s my leverage point. I’m showing the board:

  • Engineering metrics: up
  • Product delivery: flat or down
  • Employee satisfaction: down
  • Voluntary attrition: up (especially among senior engineers)

Something has to change. We can’t sustain this.

How are other engineering leaders handling the retention risk? What’s working to protect your teams from AI-induced burnout?

As someone who sits between engineering and users (and has lived through a failed startup), this paradox is painfully visible from the design side.

The View from Design: More Code ≠ Better Product

Here’s what I’m seeing in my day-to-day:

Engineering: “Feature X is done! Shipped to main branch.”
Design: “Great! When can we tell users about it?”
Engineering: “Well… it’s behind a feature flag. And we need to monitor it for a week. And documentation isn’t ready. And we found a bug in edge case Y. So maybe two weeks?”

This conversation happens constantly now. Engineering keeps marking things as “done” that aren’t actually usable by customers.

More Isn’t Better—It’s Just More

I’m going to say something that might be controversial: maybe the slowdown in releases is actually necessary.

Here’s what I witnessed during our AI coding assistant rollout:

Before AI:

  • 3-4 features per quarter
  • Each feature was thoroughly designed, tested, polished
  • User satisfaction: high
  • Bug reports: manageable

After AI:

  • Engineering “ships” 8-10 features per quarter
  • But only 2-3 actually reach customers in usable form
  • The rest are half-baked, buggy, or don’t match designs
  • User satisfaction: declining
  • Bug reports: overwhelming support team

More code doesn’t mean better product. It often means worse product because quality got sacrificed for velocity.

The Design-Engineering Disconnect

Luis mentioned finding issues in AI-generated code during review. I’m finding issues after deployment that nobody caught:

  • Accessibility violations: AI-generated UI code that “works” but is unusable for screen readers
  • Inconsistent patterns: New components that don’t match our design system because AI didn’t know about it
  • Missing states: Error states, loading states, empty states—AI focuses on the happy path
  • Poor mobile experience: Code works on desktop but breaks on mobile
  • Internationalization gaps: Hardcoded strings, date formats, etc.
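Some of these gaps are mechanically catchable before deploy. A rough sketch of one such check, flagging hardcoded user-facing strings in JSX-ish markup (the regex is deliberately naive and the snippet is invented; real i18n linting is far more involved):

```python
import re

# Literal text directly between tags, e.g. <button>Save</button>, that should
# be going through a translation function instead of shipping hardcoded.
HARDCODED = re.compile(r">([A-Za-z][^<>{}]*)<")

def hardcoded_strings(source: str) -> list[str]:
    return [m.strip() for m in HARDCODED.findall(source)]

snippet = '<button>Save changes</button><span>{t("cancel")}</span>'
print(hardcoded_strings(snippet))  # ['Save changes'] -- the t() call passes
```

Even a crude check like this, run in CI, moves one class of "works but broken for users" issues from post-deploy discovery to pre-merge feedback.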

These aren’t bugs in the traditional sense—the code “works.” But the user experience is broken.

And you know what? Catching these issues requires human review from someone who understands users. AI can write code. It can’t understand whether that code creates a good user experience.

Quality vs Quantity

Michelle’s question about balancing speed with supervision resonates from a design perspective too.

In design systems work, I’ve learned that constraints breed quality. When you have to design within a limited set of components and patterns, you think more carefully about each decision.

AI removed the constraint of “writing code is hard.” And maybe that was a mistake? Or at least, maybe we needed to replace it with different constraints:

  • “Deploying code requires comprehensive testing”
  • “Features ship only when they meet quality bars”
  • “User experience validation is non-negotiable”

Right now, we’re in this Wild West phase where code generation is easy but we haven’t established new quality gates to replace the old constraint.

What I’m Pushing For

From the design side, here’s what I’m advocating:

1. Measure User Satisfaction, Not Code Volume

  • NPS or CSAT for each major feature release
  • Support ticket volume and sentiment
  • Feature adoption rates (are users even using what we shipped?)
  • User testing results before marking features as “done”

2. Slow Down to Speed Up

  • Better to ship 3 polished features than 10 half-baked ones
  • Quality compounds—polished features reduce support burden, improve retention
  • Technical debt from rushed AI-generated code will slow us down later

3. Include Design in “Done” Definition

  • Code merged ≠ done
  • Passes automated tests ≠ done
  • Matches designs and meets UX quality bars = done
  • AI makes coding faster, but design validation still takes time

4. User-Centric Metrics for AI Code

  • Does it work on mobile? Tablet? Different browsers?
  • Is it accessible (WCAG AA minimum)?
  • Does it match our design system?
  • Does it handle error states gracefully?
  • Is it localized for international users?

The Startup Lesson I Learned the Hard Way

My startup failed partly because we optimized for shipping fast instead of shipping right. We celebrated velocity. We tracked features shipped per sprint. We moved fast and broke things.

And you know what happened? Users churned. Because fast shipping of mediocre features doesn’t create value.

The companies that win aren’t the ones that ship the most code. They’re the ones that ship the features users actually want, in a form users can actually use, with quality that builds trust.

To Keisha’s Point About Retention

You mentioned losing senior engineers to burnout. I’m seeing the same thing from the design side—our best designers are exhausted from constantly firefighting quality issues in rushed releases.

The designers who care deeply about craft, about user experience, about doing things right—they’re burning out trying to maintain quality standards in a system optimized for volume.

And the ones who stay? They’re learning to let things slide. To accept “good enough” instead of “good.” To stop fighting for quality because the pressure to ship is too intense.

That’s a culture problem, and it’s going to hurt us long-term.

My Answer to Michelle’s Question

“What metrics are you actually tracking to measure customer value delivery?”

From the design side, I want to see:

  • Customer satisfaction for each feature (post-launch surveys)
  • Feature adoption rates (usage data 30/60/90 days after launch)
  • Support ticket volume per feature (quality indicator)
  • Time from design complete to customer hands (not just code complete to merge)
  • Accessibility audit results (are we building for all users?)
  • Design system compliance (consistency matters)

And honestly? I’d rather see us ship fewer features with higher satisfaction than more features with lower satisfaction.

The “more code, fewer releases” paradox might actually be a blessing in disguise—it’s forcing us to confront the fact that code volume was never the right metric. Maybe the slowdown is the system trying to protect quality?

Would love to hear from other designers or product folks—are you seeing similar quality issues with AI-accelerated development?