Google's AI Paradox: 25% AI-Assisted Code, 10% Velocity Gain—Are We Measuring the Wrong Thing?

At our last exec meeting, our VP Engineering presented a slide that stopped the conversation cold: “25% of our codebase now AI-assisted, velocity up 10%.” The CFO asked the obvious question I was thinking: “If a quarter of our code writes itself, why isn’t productivity up 25%?”

Google just reported the same pattern. Sundar Pichai announced 25% of their new code is AI-generated, with engineering velocity gains around 10%. The math doesn’t work—and I think we’re all missing something fundamental about what we’re actually measuring.

The Two Hypotheses

I see two possible explanations, and I’m not sure which worries me more:

Hypothesis 1: AI code carries a quality tax. Maybe AI-generated code requires disproportionate review time. GitHub has reported that Copilot writes roughly 46% of the code in files where it’s enabled, yet only about 30% of its suggestions get accepted. That’s a lot of cognitive overhead: reading, evaluating, and rejecting AI proposals. When I watch our senior engineers work with AI assistants, they spend more time reviewing and refactoring AI suggestions than they save by not typing.

Hypothesis 2: We’re measuring the wrong outputs entirely. Velocity traditionally means “code shipped” or “PRs merged” or “story points completed.” But none of those correlate with business value. If AI helps us ship features 10% faster, but those features take 3x longer to maintain, did we actually gain anything? The research backs this up—AI-authored code is now 26.9% of production code globally, but organizational productivity gains are stuck at 8-12%.

What I’m Seeing at Scale

I’ve led engineering teams through two major technology shifts—cloud migration at Microsoft and microservices at Twilio. This AI adoption wave feels different. In previous shifts, productivity dipped before it improved (learning curve, migration costs). With AI, individual developers report 25-39% productivity gains, but our DORA metrics barely moved.

The bottleneck shifted. Our PR queue is 40% larger than last quarter. Code review cycle time increased from 1.2 days to 2.1 days. QA is drowning. Security keeps flagging issues in AI-generated auth code. We optimized coding, but coding was never the constraint.

The Measurement Question

Here’s what keeps me up at night: What if velocity was always the wrong metric, and AI is just making that obvious?

Maybe we should measure:

  • Time from “idea proposed” to “customer value delivered” (not code committed)
  • Incident reduction and system reliability improvements
  • Technical debt accumulation rate
  • Developer cognitive load and context-switching overhead
  • Customer outcome metrics tied to engineering work

But most engineering dashboards still show commits, PRs, and lines changed. We’re measuring industrial-era outputs in a knowledge-work world.
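
To make the first metric on that list concrete, here’s a minimal sketch of how idea-to-value lead time could be computed. Everything in it is hypothetical: the record layout, field names like idea_proposed and value_delivered, and the dates are placeholders for whatever your tracker actually exports.

```python
from datetime import datetime
from statistics import median

# Hypothetical tracker export: one record per initiative. In practice these
# timestamps would come from your issue tracker and product analytics.
initiatives = [
    {"id": "INIT-101", "idea_proposed": "2024-01-08", "value_delivered": "2024-03-22"},
    {"id": "INIT-102", "idea_proposed": "2024-01-15", "value_delivered": "2024-02-28"},
    {"id": "INIT-103", "idea_proposed": "2024-02-01", "value_delivered": None},  # still open
]

def lead_time_days(item):
    """Days from 'idea proposed' to 'customer value delivered', or None if not delivered yet."""
    if item["value_delivered"] is None:
        return None
    start = datetime.fromisoformat(item["idea_proposed"])
    end = datetime.fromisoformat(item["value_delivered"])
    return (end - start).days

delivered = [d for d in (lead_time_days(i) for i in initiatives) if d is not None]
print(f"Median idea-to-value lead time: {median(delivered)} days "
      f"({len(delivered)} of {len(initiatives)} initiatives delivered)")
```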

The Real Question

When we say “AI increased productivity 10%,” what did we actually measure? Code written? Tasks completed? Customer problems solved? Revenue enabled?

I’d love to hear from other engineering leaders: How are you measuring AI’s impact? Are you seeing the same adoption-velocity gap? And more importantly—what should we be measuring instead?

Because right now, we’re celebrating a 25% input increase while scratching our heads about a 10% output increase. That gap isn’t a mystery—it’s a message. We’re just not sure what it’s telling us yet.

This hits close to home. We rolled out GitHub Copilot across our engineering team (40+ developers) six months ago. Adoption was instant—92% using it within the first month. Everyone’s favorite feature? “I write code faster now.”

But here’s what the dashboards told me: PR throughput up 12%. Cycle time up 18%. Bug escape rate up 9%.

I’m seeing the pattern you’re describing in real time. And I think both hypotheses are true, but there’s a third factor we’re not talking about: junior and mid-level engineers are producing code that demands disproportionate senior engineer review time.

The Review Tax Is Real

In our last retrospective, a senior engineer said something that stuck with me: “I spend less time writing code and more time reading AI-generated code.” She wasn’t complaining—she genuinely liked using Copilot for boilerplate. But when reviewing PRs from junior devs using AI, she was spending 2x the time.

Why? Because AI-generated code works, but it doesn’t necessarily make sense. Variable names that are technically correct but contextually wrong. Design patterns that solve the immediate problem but ignore the broader system architecture. Security patterns copied from outdated Stack Overflow posts.

Junior developers don’t have the experience to evaluate these suggestions critically. They accept them because “the AI suggested it.” Senior developers then become bottlenecks—not because they’re slow, but because they’re doing quality control that wasn’t needed before.

Are We Measuring Typing Speed or Engineering Judgment?

Your question about what we should measure really resonates. I’ve been tracking:

What went up:

  • Lines of code committed (+31%)
  • PRs opened (+23%)
  • Individual developer “feeling productive” (subjective, but consistent)

What barely moved or got worse:

  • Features shipped to production (+8%)
  • Customer-reported bugs (flat)
  • Technical debt ratio (up 14%)
  • Senior engineer satisfaction (down—they feel like code reviewers, not engineers)

The math is telling us something uncomfortable: AI is making us faster at typing code, but not necessarily better at engineering.

I love your proposed metrics. Time from idea to customer value is the one I’m pushing hardest. But our exec team still looks at velocity charts and story point burn-down. Changing the scorecard is harder than changing the tools.

The Deeper Question

Here’s what keeps me up: Are we making junior engineers better, or just more productive at writing code that senior engineers have to fix?

In traditional apprenticeship models, junior engineers learned by doing the hard work—struggling with design decisions, debugging their own mistakes, understanding why their first solution didn’t work. AI shortcuts that learning path.

I’m not anti-AI. I use it myself. But I’m worried we’re optimizing for speed without considering the long-term impact on engineering judgment and team capability.

What should we measure instead of velocity? Maybe:

  • Time to customer value (as you mentioned)
  • Code that survives 6+ months without major refactoring
  • Reduction in production incidents
  • Senior engineer utilization (are they building or just reviewing?)
  • Junior engineer skill progression over time

The productivity paradox isn’t just about AI. It’s about whether we’re measuring what actually matters in engineering work. AI is just making the gap impossible to ignore.

Coming from the product side, this conversation is fascinating—and a bit frustrating. Because we’re having the same measurement problem on the product side, and I think both engineering and product are stuck in the same trap.

The Output vs Outcome Gap

Michelle, your question “What did we actually measure?” is the right one. But let me flip it: Why are we measuring engineering productivity at all, instead of measuring business outcomes?

When our eng team reports “velocity up 10%,” I have to ask: velocity toward what? We shipped 15% more features last quarter. Customer engagement went up 3%. Revenue attributed to those features: 1.8%.

The gap between output (features shipped) and outcome (customer value, revenue) is wider than ever. And I suspect AI is making it worse, not better.

Code Velocity ≠ Product Velocity ≠ Business Velocity

Here’s the framework I’ve been using with our executive team:

Code velocity: How fast engineers write and commit code
Product velocity: How fast we ship features customers can use
Business velocity: How fast we create measurable customer and business value

Most engineering dashboards measure code velocity. Some mature teams measure product velocity (features shipped, deployment frequency). Almost nobody measures business velocity—time from idea to revenue, customer problem resolution rate, NPS improvement tied to releases.

When engineering says “productivity up 10%,” they mean code velocity. But if product velocity only increased 3%, and business velocity stayed flat, we didn’t actually get more productive—we just got better at creating work for ourselves.
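
To make the distinction concrete, here’s a toy sketch that scores the same quarter three ways. All of the numbers and field names are hypothetical; the point is simply that the three rates can diverge.

```python
# Toy example: one quarter vs the previous, scored three ways. All numbers hypothetical.
this_q = {"prs_merged": 460, "features_released": 31, "features_adopted": 12, "revenue_usd": 180_000}
last_q = {"prs_merged": 410, "features_released": 30, "features_adopted": 12, "revenue_usd": 176_000}

def pct_change(key):
    """Quarter-over-quarter change for one metric, in percent."""
    return 100.0 * (this_q[key] - last_q[key]) / last_q[key]

print(f"Code velocity     (PRs merged):         {pct_change('prs_merged'):+5.1f}%")
print(f"Product velocity  (features released):  {pct_change('features_released'):+5.1f}%")
print(f"Business velocity (features adopted):   {pct_change('features_adopted'):+5.1f}%")
print(f"Business velocity (attributed revenue): {pct_change('revenue_usd'):+5.1f}%")
```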

The AI Amplification Problem

AI coding assistants amplify whatever process you already have. If your process connects engineering to customer value, AI accelerates that. If your process measures “code committed” as success, AI will help you commit more code—whether or not it matters.

I’ve seen this pattern:

  • Engineering ships features 12% faster (code velocity)
  • Product ships features to customers 6% faster (product velocity)
  • Customer adoption of new features is flat (business velocity)

The bottleneck isn’t coding. It’s product-market fit, customer adoption, go-to-market execution. AI doesn’t help with any of that. In fact, it might hurt—because now we’re shipping marginal features faster, diluting our product focus.

What We Should Measure

I agree with Michelle’s proposed metrics, but I’d add a product lens:

  • Time from customer problem identification to problem solved in production (not “feature deployed,” but “customer success metric improved”)
  • Feature usage and adoption rates (how many features we ship that customers actually use)
  • Revenue per engineering hour (controversial, but ties eng work to business outcomes)
  • Customer-reported value (NPS, satisfaction scores tied to releases)
  • Failed experiment velocity (how fast we test, learn, and kill bad ideas)

The last one is critical. If AI helps us ship bad features faster, that’s actually negative productivity. But if AI helps us test and validate faster, that’s transformative.

The Uncomfortable Truth

Here’s what I think the 25% → 10% gap really means: Most engineering work doesn’t directly create customer value.

That’s not a criticism—it’s reality. Infrastructure, refactoring, tech debt, platform work, tooling—all necessary, but not directly felt by customers. If AI accelerates coding across the board, but only some of that coding translates to customer value, then of course the productivity gain is smaller than the code generation gain.

The question isn’t “How do we measure AI productivity?” It’s “How do we ensure AI accelerates work that matters?”

Until we tie engineering metrics to business outcomes, we’ll keep celebrating 25% input gains while confused about 10% output gains. The gap isn’t a measurement error—it’s a strategy problem.

This thread is giving me life. Michelle, Luis, David—you’re all describing pieces of the same system-level problem. Let me add the organizational design angle, because I think the 25% → 10% gap is actually a constraint theory problem, not a measurement problem.

The Bottleneck Just Moved

Michelle said it perfectly: “We optimized coding, but coding was never the constraint.”

This is textbook Theory of Constraints. When you optimize a non-bottleneck, system throughput doesn’t improve. You just create a bigger pile of work-in-progress at the actual bottleneck.

Before AI:

  • Bottleneck: Senior engineer availability for architecture + code review + mentoring
  • Coding speed: Limited by thinking time, design decisions, implementation

After AI:

  • Bottleneck: STILL senior engineer availability (now overwhelmed with reviews)
  • Coding speed: Faster, but that just piles more work into the queue in front of review

We didn’t increase system capacity. We just shifted where the queue forms.

What Happens When You Speed Up Non-Constraints

In our engineering org, I’m seeing this play out in real-time:

Code review queue:

  • Before AI: ~50 open PRs, median review time 18 hours
  • After AI: ~85 open PRs, median review time 38 hours
  • Senior engineers reporting burnout from “endless review mode”

QA and testing:

  • Before: QA kept up with feature velocity
  • After: QA backlog growing 20% month-over-month
  • Testing becoming the critical path for every release

Production incidents:

  • Before: ~2-3 incidents per week
  • After: ~3-4 incidents per week
  • Root cause analysis points to “features shipped before fully understood”

The system is screaming at us: coding speed is not our constraint.

The Real Constraints

In my experience leading engineering orgs, here are the actual bottlenecks:

  1. Senior engineer cognitive capacity - architecture, design review, mentoring, incident response
  2. Cross-functional alignment - eng, product, design, customer success coordination
  3. Customer feedback loops - understanding what actually creates value
  4. Organizational learning - knowledge transfer, documentation, onboarding
  5. Decision quality - making the right bets, killing bad projects early

AI coding assistants don’t help with any of these. In fact, they might hurt—because now we’re creating more code that needs all of the above.

The Adoption-Velocity Gap Explained

David’s framework of code → product → business velocity is spot on. Let me add a systems lens:

  • Code velocity increased 25% (we optimized a non-constraint)
  • System throughput increased 10% (limited by actual constraints)
  • The gap (15 percentage points) is waste: WIP sitting in review queues, features deployed but not adopted, bugs that require rework

That 15-point gap isn’t missing productivity. It’s waste that our systems weren’t designed to handle.

What High-Performing Organizations Do Differently

I’ve been studying companies that ARE getting meaningful AI productivity gains (20%+ improvement in system throughput). Here’s what they have in common:

  1. They optimized the whole system, not just coding

    • Invested in review tooling and automation
    • Expanded QA capacity and shifted left on testing
    • Built AI-powered documentation and knowledge management
  2. They changed how they measure success

    • Stopped measuring PRs, started measuring customer outcomes
    • Introduced “feature success rate” (% of shipped features that achieve goals)
    • Track cycle time from idea → customer value, not code → commit
  3. They reorganized around constraints

    • Created “AI-native” workflows where AI handles end-to-end tasks
    • Shifted senior engineers away from review toward architecture
    • Built better feedback loops so failures surface faster

The Hard Truth

Luis asked: “Are we making junior engineers better, or just more productive at writing code that senior engineers have to fix?”

The answer is: We’re making the constraint worse.

If AI helps 10 junior engineers each submit 30% more code, but all that code flows through the same 2 senior engineers for review, we didn’t increase capacity—we increased load on the bottleneck.

This is why organizational productivity only improves when you redesign the system, not just add new tools.

What Should We Actually Measure?

Michelle’s proposed metrics are great. I’d add a constraints-focused lens:

  • Constraint utilization - How much time do senior engineers spend on constraint-breaking work (architecture, hard problems) vs non-constraint work (routine reviews)?
  • Queue time at each stage - Where does work sit waiting? That’s your constraint.
  • System cycle time - Idea → production → customer value → measured outcome
  • Constraint productivity - Are your senior engineers and critical resources getting more effective?
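
To act on the queue-time idea above, here’s a rough sketch of measuring how long work sits at each stage. The stage names, timestamps, and record layout are invented for illustration; in practice you’d pull them from your PR and ticket history.

```python
from datetime import datetime
from statistics import median

# Hypothetical workflow export: when each work item entered each stage.
# The stage with the longest median wait in front of it is your likely constraint.
items = [
    {"opened": "2024-05-01T09:00", "review_started": "2024-05-03T14:00",
     "review_done": "2024-05-04T10:00", "qa_started": "2024-05-07T09:00",
     "deployed": "2024-05-08T16:00"},
    {"opened": "2024-05-02T11:00", "review_started": "2024-05-02T15:00",
     "review_done": "2024-05-03T09:00", "qa_started": "2024-05-06T10:00",
     "deployed": "2024-05-09T12:00"},
]

STAGES = [
    ("waiting for review", "opened", "review_started"),
    ("in review",          "review_started", "review_done"),
    ("waiting for QA",     "review_done", "qa_started"),
    ("QA + deploy",        "qa_started", "deployed"),
]

def hours_between(item, start_key, end_key):
    """Elapsed hours between two stage timestamps on one work item."""
    start = datetime.fromisoformat(item[start_key])
    end = datetime.fromisoformat(item[end_key])
    return (end - start).total_seconds() / 3600

for label, start_key, end_key in STAGES:
    waits = [hours_between(i, start_key, end_key) for i in items]
    print(f"{label:<20} median {median(waits):6.1f} h")
```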

The 25% → 10% gap isn’t a mystery. It’s the system telling us: You optimized the wrong thing.

AI made coding faster. Great. Now fix the actual bottlenecks, or that 15-point gap will just turn into frustrated engineers and growing backlogs.

Reading this thread as someone who lives between design and engineering—this hits different. Y’all are talking about code velocity and business outcomes, but let me bring a design systems perspective that might add another angle to this mystery. :artist_palette:

The “Works But Doesn’t Make Sense” Problem

Luis said something that gave me chills: “AI-generated code works, but it doesn’t necessarily make sense.”

This is EXACTLY what I see with design systems. Engineers use our components, but they don’t understand the why behind them. They copy-paste patterns without understanding the design principles. The result? Technically functional interfaces that feel incoherent to users.

AI is doing the same thing to code that copy-paste did to design.

Technical Debt Has a Half-Life

Michelle asked about measurement. Here’s one I’d add: What’s the half-life of AI-generated code quality?

In design systems, I track “time until component needs refactoring.” Components built with deep understanding last years. Components built by copying patterns last months before they need rework.

I suspect AI-generated code has a shorter half-life than human-architected code. It might work today, but:

  • Does it handle edge cases we haven’t encountered yet?
  • Does it follow patterns that make sense for future features?
  • Will the next engineer understand WHY it works this way?

If AI code needs refactoring 2x sooner, your 25% productivity gain turns into negative productivity within 6 months. But we’re not measuring that.

Speed Now, Maintenance Cost Later

When I was running my startup (before it failed :sweat_smile:), we moved fast by cutting corners. Shipped features in days instead of weeks. Felt productive! Until maintenance costs caught up with us.

Six months in, we were spending 60% of eng time maintaining features we’d shipped “quickly.” New feature velocity dropped to nearly zero. We’d optimized for shipping, not for maintainability.

I see the same pattern with AI code. Shipping fast feels like winning. But if review time goes up, bug rates go up, and refactor frequency goes up… are we actually moving faster, or just deferring costs?

The Coherence Tax

Here’s where design thinking meets this discussion: Systems have to make sense as a whole.

A design system where every component was built in isolation might technically work, but it feels incoherent. Inconsistent patterns. Duplicated logic. Different solutions to the same problem.

AI generates code in isolation. It doesn’t understand your system architecture. It doesn’t know you solved this problem differently in another module. It doesn’t maintain conceptual integrity.

Keisha’s point about senior engineer time—I think part of what they’re doing is restoring coherence to AI-generated code. That takes time. You can’t measure it with velocity metrics, but you can feel it when it’s missing.

What I Learned from My Startup Failure

We measured velocity (features shipped per week) and felt productive right up until we died. What we should have measured:

  • Maintenance burden - % of time spent fixing vs building new
  • Coherence score - How much duplicated logic exists across features?
  • Rework rate - How often do we revisit “completed” work?
  • Conceptual integrity - Does the system make sense as a whole?

Those metrics are fuzzy and hard to quantify. But they predicted our failure better than velocity ever did.
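
For anyone who wants to put rough numbers on maintenance burden and rework rate, here’s a toy sketch. The work-item tags and effort figures are invented; you’d need your own labeling convention to do this for real.

```python
# Toy example: one quarter of work items, each tagged by type and whether it
# reopens something we considered "done" in the last six months. Tags are invented.
work_items = [
    {"type": "feature",  "rework": False, "eng_days": 8},
    {"type": "feature",  "rework": True,  "eng_days": 5},   # reopened a shipped feature
    {"type": "bugfix",   "rework": True,  "eng_days": 3},
    {"type": "refactor", "rework": True,  "eng_days": 6},
    {"type": "feature",  "rework": False, "eng_days": 10},
]

total_days = sum(w["eng_days"] for w in work_items)
# "Maintenance" here means anything that is not net-new feature work.
maintenance_days = sum(w["eng_days"] for w in work_items if w["type"] != "feature" or w["rework"])
rework_count = sum(1 for w in work_items if w["rework"])

print(f"Maintenance burden: {100 * maintenance_days / total_days:.0f}% of eng time")
print(f"Rework rate: {rework_count}/{len(work_items)} items revisited 'completed' work")
```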

AI and the Illusion of Productivity

The 25% → 10% gap might be hiding something scarier: What if the real productivity gain is negative, and we just won’t see it for 6-12 months?

  • Code shipped faster :white_check_mark: (measured now)
  • Review time increased :cross_mark: (measured now)
  • Maintenance costs compounding :cross_mark: (measured later)
  • System coherence degrading :cross_mark: (measured never?)
  • Junior engineer skill development stalled :cross_mark: (measured years later)

We’re measuring the benefits immediately and the costs on a delay. Classic trap.

A Design Lens on Measurement

Michelle’s proposed metrics are great. From a design/systems perspective, I’d add:

  • Conceptual integrity score - How consistent are architectural patterns across the codebase?
  • Rework frequency - How often do “completed” features need significant changes?
  • Maintenance ratio - What % of eng time goes to maintaining vs building?
  • System comprehension time - How long does it take new engineers to understand how things work?

These are squishy metrics. Hard to quantify. But they capture the long-term health of your systems in ways velocity never will.

The Uncomfortable Parallel

In design, we learned this lesson the hard way: Fast design is easy. Good design is hard. Fast AND good design requires deep expertise.

I think engineering is learning the same lesson with AI:

  • Fast code is easy (AI does this well)
  • Good code is hard (requires judgment, experience, system thinking)
  • Fast AND good code requires deep expertise (AI can’t replace this yet)

The 25% → 10% gap? That’s the difference between fast and good.

Maybe the question isn’t “What should we measure?” but “What are we optimizing for?” If it’s speed, AI delivers. If it’s long-term system health, coherence, and maintainability… we might be making things worse while calling it productivity.

Just my 2¢ from someone who optimized for speed and paid the price later. :money_with_wings: