We're Shipping 60% More PRs But Velocity Stayed Flat: Did We Optimize the Wrong Metrics?

Three months ago, I walked into our Monday leadership meeting with what I thought was great news. “Engineering productivity is up 40%,” I announced. “We’re merging 60% more pull requests than last quarter.”

Our CEO leaned forward. “That’s fantastic. So when do customers get the new features we promised?”

I looked at my VP Product. He looked at me. Neither of us had a good answer.

The timeline hadn’t changed at all.

The Productivity Paradox We’re All Living

Here’s what the data actually shows across the industry right now:

  • Developers save an average of 3.6 hours per week using AI coding tools (DX’s analysis of 135,000+ developers)
  • Daily AI users merge approximately 60% more pull requests than developers not using AI
  • Yet companies consistently report flat delivery velocity despite these individual gains

At my EdTech startup, we’re living this paradox. Our engineers feel more productive—they’re writing code faster, churning through tickets, celebrating green checkmarks in Jira. But our features still take the same amount of time to reach production.

Where did the gains disappear to?

The Bottleneck Moved, We Just Didn’t Notice

Recent research from Agoda nailed it: “Coding was never the bottleneck.”

The constraint moved to the steps on either side of coding:

  • Specification: What should we actually build? (Still requires human judgment)
  • Verification: Is this correct and safe? (Still requires senior review)
  • Review time ballooned by ~91% in high-AI-adoption teams (Faros AI data)

In practical terms: My junior engineers write 5 PRs per day with AI assistance. My senior engineers now spend 80% of their time reviewing AI-generated code—hunting for subtle semantic bugs, architectural anti-patterns, and security vulnerabilities that AI introduces.

The seniors can’t write code anymore because they’re drowning in review queues.

We optimized one bottleneck and created another.

The Quality Tax We’re Not Talking About

Here’s the part that keeps me up at night:

  • AI-coauthored PRs show 1.7× more issues than human-only PRs
  • The DORA 2025 report found AI correlates with increased instability—teams merge faster but break production more often
  • We’re shipping more code, but are we shipping better products?

One of my leads put it this way: “AI makes it cheaper to build the wrong thing really fast.”

Gartner Says We’re Measuring the Wrong Things Entirely

The 2026 Gartner research is challenging everything I thought I knew about engineering effectiveness.

Platform teams struggle to communicate ROI because they’re speaking in the wrong language:

  • Traditional metrics: Deployments per day, PR count, velocity points
  • Business metrics that matter: Revenue enabled, costs avoided, profit contribution

And here’s the prediction that hit me hardest: Creativity and innovation—not velocity or deployment frequency—will define effectiveness in 2026.

Not “how fast did you ship?”

But “did you build something customers actually value?”

The Questions I’m Wrestling With

I’m re-evaluating everything about how we measure engineering effectiveness:

  1. Are deployment frequency and PR velocity now misleading vanity metrics? If we’re shipping more but delivering the same value, what are we really measuring?

  2. How do you measure creativity and innovation? These are the outcomes that matter, but they’re fuzzy and hard to quantify. What’s the operational metric for “we built the right thing”?

  3. Should we shift from “velocity” to “value delivered”? And if so, how do we define and measure value in a way that engineers can optimize for?

  4. Is the solution more AI (to review AI code) or better human processes? Are we about to get trapped in an AI-reviewing-AI infinite loop?

What I’m Trying Next

At my company, we’re experimenting with new metrics:

  • Customer problem resolution rate (not features shipped)
  • Time from insight to customer value (not time to deploy)
  • Innovation experiments validated (not story points completed)
  • Technical debt reduction (not just feature velocity)

It’s messy and imperfect, but at least we’re asking the right questions.

I’d love to hear from this community:

For engineering leaders: What metrics are you using to measure effectiveness beyond throughput?

For product folks: How are you helping engineering teams understand “value” vs. “output”?

For CTOs: How are you communicating technical productivity to the board when velocity metrics no longer tell the story?

Did we spend the last five years optimizing for the wrong things? And if so, what should we be optimizing for instead?


Keisha Johnson | VP Engineering @ EdTech Startup
Formerly Google, Slack | Building inclusive, high-performing teams

Keisha, this hits close to home. We’re experiencing the exact same pattern at our SaaS company, and it’s forcing me to have uncomfortable conversations with the board about what “engineering productivity” actually means.

We Measured the Wrong Things and Called It Success

Last quarter, our engineering team celebrated:

  • 40% more features shipped
  • Deployment frequency up 35%
  • Sprint velocity increased from 80 to 112 points

Then I sat in a customer advisory board meeting and listened to them describe our product as “feature-bloated but lacking the core improvements we’ve been asking for.”

Our NPS hadn’t moved. Expansion revenue was flat. Customer satisfaction surveys were unchanged.

We’d optimized for output while ignoring outcomes.

The Metrics Disconnect Is a Translation Problem

I think the challenge you’re describing goes deeper than just AI productivity tools—though those certainly accelerate the problem. The real issue is that engineering metrics have been disconnected from business outcomes for years, and we didn’t notice because correlation used to be good enough.

When I was at Twilio, we tracked “deployments per day” religiously. More deployments meant we were agile, responsive, shipping value. Except we never actually validated that deployment frequency correlated with customer satisfaction or revenue growth. We just assumed it did.

Now AI tools have broken that assumption completely. We can deploy constantly while delivering zero incremental business value.

What Gartner Got Right: Speak Business Language

You referenced the Gartner research about platform teams struggling to communicate ROI, and it crystallized something for me. When I present to the board now, I’ve stopped talking about:

  • Deployments per day
  • PR velocity
  • Sprint points completed
  • Code coverage percentages

Instead, I talk about:

  • Revenue-impacting releases: Features that moved a specific business metric (expansion ARR, activation rate, retention)
  • Costs avoided: Technical debt prevented, security incidents avoided, infrastructure optimization
  • Profit contribution: Engineering efficiency gains that dropped to the bottom line

The shift in conversation quality has been dramatic. Suddenly the CFO understands why we’re investing in platform engineering. The board gets why “slowing down” on feature velocity to focus on infrastructure actually makes business sense.

But How Do You Measure Creativity and Innovation?

This is where I’m still struggling, and I’d love the community’s input.

Gartner predicts that creativity and innovation—not velocity—will define effectiveness in 2026. I believe that’s right, but operationalizing it is hard.

Here’s what we’re experimenting with:

Innovation Experiments Validated
We give teams explicit time for exploration. We track: How many experiments did we run? How many validated a meaningful customer insight? What was the learning velocity?

Not all experiments need to ship. But all experiments should teach us something.

Customer Impact Metrics
For each engineering initiative, we define the customer outcome we’re trying to move. Feature adoption depth (not just breadth). Time to value for new users. Reduction in support tickets for a specific pain point.

Technical Debt Reduction as a First-Class Metric
We’re treating tech debt reduction with the same rigor as feature delivery. How much legacy system load did we eliminate? How much did we improve system reliability? These directly correlate with engineering morale and velocity over time.

Team Autonomy and Problem-Solving Capability
This one’s fuzzy, but we’re trying to measure: How often do teams identify and solve problems without escalation? How many architectural improvements were proposed from the team level vs. top-down?

Creativity is hard to measure directly, but you can measure the conditions that enable it.

The Real Question You’re Asking

Your post asks: “Did we optimize for the wrong things?”

I think the answer is yes, but it’s not entirely our fault. We optimized for what was easy to measure and what looked like productivity in a world where coding was the perceived bottleneck.

Now that AI has removed coding as the perceived bottleneck, the upstream problems are exposed: unclear requirements, weak strategy, poor prioritization, disconnection from customer value.

AI didn’t create these problems. It just made them impossible to hide behind “we shipped a lot of code.”

The uncomfortable truth: Many organizations don’t have a velocity problem. They have a strategy problem disguised as a velocity problem.


Michelle Washington | CTO @ Mid-stage SaaS
Formerly Twilio, Microsoft | Scaling technical teams and business impact

Keisha and Michelle, you’re both describing the strategic view from the executive level, but I want to add perspective from the engineering trenches where this bottleneck shift is actually happening day-to-day.

The Review Bottleneck Is Real and It’s Breaking My Senior Engineers

At our fintech company, the data matches exactly what you’re seeing:

  • PR volume increased 98% over the past year
  • PR review time increased 91% over the same period
  • Senior engineers now spend 70-80% of their time on code review

This isn’t a small problem. It’s fundamentally changing what it means to be a senior engineer.

A Typical Day for My Tech Leads Now

One of my senior architects walked me through her day last week:

  • 9:00 AM: Start reviewing PRs from overnight (team is distributed across 3 time zones)
  • 10:30 AM: Still reviewing PRs—found subtle semantic bug in AI-generated code where logic was syntactically perfect but business rules were wrong
  • 12:00 PM: Join architecture discussion, but mind is distracted by growing review queue
  • 1:00 PM: Back to reviews—another junior dev’s AI-assisted PR that implemented wrong endpoint contract
  • 3:00 PM: Attempted to write code for critical infrastructure work, got interrupted by Slack asking for urgent PR review
  • 4:30 PM: Gave up on own coding, back to review queue
  • 5:30 PM: Review queue still has 8 PRs pending

She’s supposed to be our platform architect. She hasn’t written architecture code in three weeks.

The Quality Problem Is More Subtle Than We Realized

You mentioned the 1.7× issue rate in AI-coauthored PRs, Keisha, and that number undersells the challenge because the types of issues are different.

Traditional bugs: Syntax errors, null pointer exceptions, off-by-one errors
→ Easy to catch, compilers/linters help, obvious in testing

AI-generated bugs: Plausible but incorrect business logic, architectural anti-patterns, security vulnerabilities that “look right” at first glance
→ Require deep domain knowledge to catch, pass automated tests, surface in production

Example from last month:
Junior engineer used AI to build an order processing service. Code was clean, tests passed, deployed to staging. Senior engineer caught it three days later: The AI had implemented synchronous database writes in a critical path that should have been async. Performance would have been fine in dev/staging but would have choked under production load.

That’s a 45-minute code review that used to be a 2-minute “LGTM.”

The Math Doesn’t Work: Growing PR Volume Meets Fixed Senior Capacity

Here’s the equation that’s breaking:

  • 10 junior engineers with AI each create 5 PRs/day = 50 PRs/day
  • 3 senior engineers each spend 30 minutes per thorough review and have about 3 hours a day for it = 6 PRs/day capacity per senior = 18 PRs/day total

50 PRs created per day. 18 PRs reviewed per day.

We’re accumulating a backlog of 32 PRs per day. Every single day.
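
For anyone who wants to plug in their own team's numbers, here is a minimal sketch of that arithmetic in Python. The class and field names are made up for illustration, and the 3 hours of review time per senior per day is an assumption inferred from the "6 PRs/day at 30 minutes each" figure, not something measured. It treats creation and review as constant daily rates, which is a simplification.

```python
from dataclasses import dataclass


@dataclass
class ReviewQueueModel:
    """Toy constant-rate model of PR creation vs. review capacity (hypothetical names)."""
    juniors: int = 10
    prs_per_junior_per_day: int = 5
    seniors: int = 3
    review_minutes_per_pr: int = 30
    # Assumption: implied by 6 reviews/day at 30 minutes each.
    review_hours_per_senior_per_day: float = 3.0

    @property
    def created_per_day(self) -> int:
        return self.juniors * self.prs_per_junior_per_day

    @property
    def reviewed_per_day(self) -> int:
        # Whole reviews one senior can finish in the available review time.
        minutes_available = self.review_hours_per_senior_per_day * 60
        per_senior = int(minutes_available // self.review_minutes_per_pr)
        return self.seniors * per_senior

    @property
    def backlog_growth_per_day(self) -> int:
        # Unreviewed PRs added to the queue each day (never negative).
        return max(0, self.created_per_day - self.reviewed_per_day)

    def backlog_after(self, working_days: int) -> int:
        # Backlog grows linearly because both rates are held constant.
        return self.backlog_growth_per_day * working_days

    @property
    def break_even_hours_per_senior(self) -> float:
        # Daily review hours each senior would need to clear every new PR.
        total_hours = self.created_per_day * self.review_minutes_per_pr / 60
        return total_hours / self.seniors


model = ReviewQueueModel()
print(model.created_per_day)                          # 50
print(model.reviewed_per_day)                         # 18
print(model.backlog_growth_per_day)                   # 32
print(model.backlog_after(5))                         # 160
print(round(model.break_even_hours_per_senior, 1))    # 8.3
```

Under these assumptions the queue grows by 160 PRs in a five-day week, and breaking even would take roughly 8.3 hours of thorough review per senior per day, which leaves no time for anything else.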

The “solution” we’re limping along with:

  • Shallow reviews (5 minutes instead of 30) that miss the subtle issues
  • Seniors skip their own coding to keep up with review queue
  • We slow down junior PR creation (defeating the whole point of AI productivity tools)

None of these are good options.

Potential Paths Forward (I’m Not Confident About Any of Them)

Option 1: AI-Assisted Code Review
Use AI to review AI-generated code? Seems like we’re one step away from an infinite loop. But maybe AI can flag suspicious patterns for human follow-up? We’re piloting this now.

Option 2: Pair Programming Renaissance
If review is the bottleneck, move verification earlier. Pair junior + senior during code generation, not after. But this has its own scalability problems.

Option 3: Better Guardrails and Architectural Documentation
AI generates code from context. If we give it better context (architecture docs, design patterns, security requirements), maybe it generates better code that needs less review? We’re investing heavily here.

Option 4: Rethink Team Structure
What if we staff teams differently? Fewer juniors per senior? Dedicated review roles? I hate both of these: the first limits growth opportunities, and the second creates career dead-ends.

Option 5: Accept Lower Code Quality for Higher Velocity
This is the option no one says out loud but some companies are implicitly choosing. Ship fast, fix production issues as they arise. I refuse to go here in fintech where bugs mean regulatory exposure and customer financial harm.

Michelle’s Point About Strategy Problems Disguised as Velocity Problems

Michelle, your comment really resonated: “Many organizations don’t have a velocity problem. They have a strategy problem disguised as a velocity problem.”

In my team, the real bottleneck isn’t even the code review time—it’s that we’re reviewing code for features that shouldn’t be built in the first place.

When a junior engineer can generate a fully functional microservice in 6 hours, suddenly there’s no natural throttle that forces us to ask: “Should we build this? Does this solve the right problem? Is this the right architecture?”

We built three services last quarter that we’re now deprecating. They worked perfectly. They solved the wrong problems.

That’s not a review bandwidth problem. That’s a strategic clarity problem.


Looking forward to hearing how other engineering leaders are navigating this. I’m trying everything and still not sure what the right answer is.

Luis Rodriguez | Director of Engineering @ Fortune 500 Fintech
Leading distributed teams | Mentoring through SHPE

Luis, your comment about building three perfectly working services that solved the wrong problems just brought back some painful memories from my failed startup. 💀

We had velocity. We had shipping cadence. We had green dashboards.

We just didn’t have a business that anyone wanted.

Velocity to Nowhere: A Startup Cautionary Tale

My B2B SaaS startup had incredible “productivity metrics”:

  • Shipped 47 features in our first 12 months
  • Averaged 2-week sprint cycles with consistent velocity
  • Maintained 85% code coverage
  • Deployed to production 3-4 times per week

Our investors loved the updates. Our velocity charts looked fantastic.

Then we hit 18 months and realized: Customers were only actually using 6 of those 47 features. And 4 of the 6 they used were from our original MVP.

We’d spent a year “shipping fast” while building the wrong things. By the time we figured it out, we’d burned through our runway building elaborate solutions to problems customers didn’t actually have.

Speed didn’t save us. Strategy would have.

The Design Parallel: AI Makes Bad Ideas Happen Faster

I’m seeing the exact same pattern now in design systems work.

With tools like Midjourney, Figma AI, and design-to-code generators, I can create:

  • Complete UI mockups in 30 minutes instead of 3 hours
  • Entire design systems in a week instead of a month
  • Production-ready component code without writing a single line manually

Here’s what’s terrifying: Good design still takes the same amount of time. Understanding user needs, iterating on concepts, validating with real users, making strategic aesthetic choices—none of that got faster.

What got faster was executing bad ideas with high polish.

I can now generate a beautiful, pixel-perfect interface for a feature nobody needs in record time. ✨

“Coding Was Never the Bottleneck” Hits Different When You’ve Built the Wrong Thing

That Agoda research quote—“coding was never the bottleneck”—resonates so deeply it hurts.

At my startup, we never struggled to build things. Our engineers were great. We had solid velocity.

What we struggled with:

  • Understanding what to build: We assumed we knew our customers’ problems. We were wrong.
  • Validating before building: We shipped features and waited for usage data. The feedback loop was too slow.
  • Making strategic trade-offs: We said yes to everything because we could build it all. We should have said no to 80% of it.

If we’d had AI coding tools back then? We would have failed 3 months faster, with more features nobody wanted.

What My Current Team Does Differently: Slow Down Discovery, Speed Up Execution

After that failure, I’m militant about one thing: Speed in the wrong direction is just expensive failure.

At my current company, we’ve decoupled velocity from value:

Discovery Phase (Slow and Deliberate):

  • User research and validation before design
  • Prototype and test with real users before implementation
  • Strategic architecture decisions with the whole team
  • Explicit “are we building the right thing?” checkpoint

Execution Phase (Fast with AI):

  • Once we validate direction, ship aggressively
  • Use AI tools to accelerate the build
  • Iterate quickly on validated concepts
  • Deploy continuously with confidence

Result: We ship 40% fewer features than we did two years ago. Customer satisfaction is up 60%. Retention improved by 25%.

Turns out shipping less of the right things beats shipping more of the wrong things.

The Uncomfortable Question: What If AI Productivity Gains Expose Our Lack of Strategy?

Michelle said it perfectly: “Many organizations don’t have a velocity problem. They have a strategy problem disguised as a velocity problem.”

I’ll add: AI tools are X-rays for organizational dysfunction.

If your strategy is clear, AI helps you execute faster. You build the right things with less effort.

If your strategy is unclear, AI helps you build the wrong things faster. You accumulate features, technical debt, and complexity without corresponding customer value.

The velocity gain reveals which category you’re in.

For This Community: What If the “Right Metric” Is Just Better Judgment?

Keisha asked how we measure creativity and innovation. Michelle shared experiments with innovation metrics and customer impact.

Here’s my cynical take from someone who’s optimized the wrong metrics before:

What if the search for better metrics is itself the wrong path?

What if the answer isn’t a different KPI, but better strategic judgment about what to build? What if we need product leaders who say “no” more often, engineering leaders who push back on feature requests, and organizations that are comfortable with not building things?

I don’t have a dashboard for that. I don’t have a velocity measurement for “we chose not to build this because it wasn’t strategic.”

But maybe that’s the metric that actually matters: strategic restraint.

Shipping less. Building fewer things. Saying no with confidence.

That’s hard to sell to a board that wants to see “productivity improvements” from AI investments. But it might be the thing that actually creates value.


Loved this discussion. This community asks the right uncomfortable questions. 🎯

Maya Rodriguez | Design Systems Lead @ Confluence Design Co.
Former startup founder | Learning more from failure than success