AI Coding Productivity Plateaued at 10% Despite 41% AI-Generated Code—Are We Measuring the Wrong Things?

The data doesn’t add up, and I can’t stop thinking about it.

93% of developers now use AI coding tools. 41% of all code is AI-generated. But organizational productivity gains? They plateaued at around 10% after an initial spike.

For context: I’m a VP of Product at a Series B fintech startup. If this were a B2B SaaS product—93% adoption, 41% usage intensity, but only 10% measurable value—I’d be calling an emergency product strategy meeting. Something is fundamentally broken in the value chain.

Three Hypotheses for the Plateau

I’ve been wrestling with three possible explanations:

Hypothesis 1: We Hit a Real Ceiling

Maybe AI coding tools are genuinely good at autocomplete and boilerplate but struggle with complex reasoning. The 10% represents the actual value boundary—we’ve extracted all the low-hanging fruit (syntax help, common patterns), and the hard problems (architecture, novel algorithms, business logic) remain human work.

Some evidence: In the METR study, experienced developers expected AI to make them 24% faster, and afterward still believed it had sped them up by about 20%, but in controlled tests they were actually 19% slower. That perception gap is wild.

Hypothesis 2: We’re Measuring Wrong

Traditional productivity metrics—lines of code, PRs merged, velocity points—were designed for a pre-AI world. They don’t capture:

  • Quality improvements (fewer bugs, better test coverage)
  • Cognitive load reduction (less context switching, fewer Stack Overflow tabs)
  • Maintainability gains (clearer code because you spent less time fighting syntax)
  • Learning acceleration (junior engineers accessing patterns they wouldn’t find otherwise)

CircleCI reports 59% throughput increase for teams using AI, but most organizations are “leaving gains on the table” because their systems haven’t caught up. Maybe the 10% plateau is a measurement artifact?

Hypothesis 3: Organizational Systems Lag

AI tools optimized the individual developer. But the surrounding ecosystem (CI/CD pipelines, code review processes, deployment workflows, team collaboration patterns) is still designed for pre-AI productivity levels.

If developers generate 40% more PRs but review capacity stays flat, you’ve just moved the bottleneck. The system is only as fast as its slowest component.

The Product Lens

Here’s what bothers me from a product strategy perspective:

When we launch a new feature and see 90%+ adoption but <15% value capture, we dig into the “activation gap”—what’s preventing users from extracting the full value?

For AI coding tools, the activation gap is enormous. Developers are using the tools (93%!), generating tons of output (41% of code!), but not seeing proportional productivity gains.

That pattern usually indicates one of three things:

  1. The product solves the wrong problem
  2. The product creates downstream friction
  3. Users lack the skills/context to leverage it fully

I suspect it’s a combination of #2 and #3 for AI coding tools.

What Should We Actually Measure?

If “time to first draft” is the wrong metric (spoiler: it probably is), what should we track instead?

Some ideas:

  • Time to confident deployment: Not “how fast can you write code” but “how fast can you ship validated, tested, production-ready code”
  • Cognitive load reduction: Are developers spending less mental energy on syntax and more on architecture?
  • Error recovery time: When AI gives you wrong code, how long does it take to fix vs. writing from scratch?
  • Iteration velocity: How fast can you go from idea → prototype → refined version → shipped feature?

The challenge: These are harder to instrument than “commits per week.”

But if we’re serious about understanding AI’s impact, we need to evolve our measurement frameworks.
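As a starting point, "time to confident deployment" can be instrumented from data most teams already have. Here is a minimal Python sketch, assuming you can export a first-commit timestamp and a production-deploy timestamp per change. The `Change` record and its field names are hypothetical, not from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class Change:
    """One shipped change. Field names are illustrative assumptions."""
    first_commit_at: datetime
    deployed_at: datetime  # when it reached production with tests passing


def hours_to_deploy(change: Change) -> float:
    """Elapsed hours from first commit to validated production deploy."""
    return (change.deployed_at - change.first_commit_at).total_seconds() / 3600


def median_hours_to_deploy(changes: list[Change]) -> float:
    """Median is less sensitive than the mean to one very slow change."""
    return median(hours_to_deploy(c) for c in changes)


# Made-up sample data: 24 h, 48 h, and 120 h from commit to deploy.
changes = [
    Change(datetime(2025, 3, 3, 9), datetime(2025, 3, 4, 9)),
    Change(datetime(2025, 3, 5, 9), datetime(2025, 3, 7, 9)),
    Change(datetime(2025, 3, 10, 9), datetime(2025, 3, 15, 9)),
]
print(median_hours_to_deploy(changes))  # 48.0
```

Tracking this number before and after an AI rollout says more about shipped value than counting commits, though it still misses cognitive load and error recovery.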

Questions for This Community

I’m particularly curious to hear from engineering leaders who’ve moved beyond the 10% plateau:

  1. What changed? Was it measurement, systems, culture, or something else?
  2. What are you actually tracking? Beyond DORA metrics, what KPIs reveal AI tool effectiveness?
  3. How do you handle the perception gap? If engineers feel faster but data shows they’re not, do you correct the perception or let it be?

And the meta-question: Are we optimizing for the wrong outcome?

What if 10% productivity + 50% job satisfaction improvement is better than 30% productivity + 20% burnout increase?

What if “time saved” is the wrong goal, and “quality of thinking” is the right one?

I don’t have answers, but I’m convinced we’re asking the wrong questions.

Looking forward to your thoughts.



David, this resonates so deeply. We saw the exact same pattern with design systems, and I think the parallel is instructive.

The Design Systems Analogy

When we launched our design token system 18 months ago, product teams were THRILLED. “We’ll ship UI 2x faster!” they said. Figma components, auto-generated code, the whole nine yards.

Six months later? Shipping velocity was… basically the same. Maybe 8% faster.

Everyone was confused and a bit disappointed. Did the design system fail?

No—we were measuring the wrong thing.

The real value showed up in:

  • Consistency: Zero brand compliance issues (previously 15-20 per quarter)
  • Maintainability: Design debt way down because changes propagated automatically
  • Iteration cycles: Going from v1 → v2 → v3 of a feature became trivial
  • Cross-team collaboration: Designers and engineers finally speaking the same language

But none of that showed up in “time to ship first version.”

Why 41% AI-Generated Code Might Be Like Component Libraries

Here’s my hypothesis: The 41% of code that’s AI-generated might be like component libraries—fast to produce in the moment, but the actual value compounds over time.

You’re not measuring:

  • How many bugs you didn’t create because AI wrote better tests
  • How much faster iteration #3 is because the foundation was solid
  • How much context switching you avoided because you didn’t need 15 Stack Overflow tabs open

Those benefits are real, but they’re invisible to traditional velocity metrics.

The Measurement Challenge Is Real

You asked what we should track instead of “time to first draft.” From a design perspective, here’s what actually matters:

  • Time to understanding: How long before you can confidently modify this code/design?
  • Cognitive load: Are you spending mental energy on the right problems?
  • Iteration velocity: Not “how fast was v1” but “how much faster is v3 than v1”?
  • Flow state retention: Did the tool help you stay in flow, or did debugging AI output break your concentration?

The problem? These are nearly impossible to quantify at scale.

I tried tracking my own “flow state” for my side project. I logged when I felt like I was making progress vs. when I was frustrated/stuck. Over 6 months:

  • Commits per week: Basically flat (~15 avg)
  • Time in flow state: Up ~35%
  • Bugs shipped to production: Down 40%
  • Features completed: Same quantity, way higher polish

Would traditional metrics show that? Nope. Would I trade back to pre-AI workflow? Also nope.

The Valley of Disappointment

You know the Atomic Habits framework? There’s this concept of the “valley of disappointment”—where you’re building better habits, but results haven’t compounded yet.

The first month at the gym, you don’t see visible change. Months 2-4, still not much. Month 6? Suddenly everyone notices.

What if the 10% plateau is because we’re in that valley?

Individual developers see modest gains. But the team-level compounding effects—shared patterns, better code review, accumulated test coverage—haven’t materialized yet because it’s only been 18-24 months of widespread adoption.

Maybe the real productivity spike shows up in year 3-5, not year 1-2?

Your Satisfaction Point Is Crucial

“What if 10% productivity + 50% job satisfaction improvement is better than 30% productivity + 20% burnout increase?”

THIS. A thousand times this.

My failed startup died partly because we optimized for shipping fast at the cost of sustainable pace. We burned out the team.

If AI tools make the work more enjoyable—less grunt work, more creative problem-solving—that has retention value, morale value, career development value.

None of that shows up in quarterly productivity reports, but it absolutely affects long-term company success.

Questions Back at You

  1. Has your company measured developer satisfaction alongside productivity? Curious if there’s a correlation.

  2. Are you seeing different plateau points for different experience levels? (My hunch: seniors extract more value from AI than juniors, which might explain some of the overall plateau.)

  3. What if we stopped trying to measure productivity and instead measured “quality of engineering experience”? Would that change our tool choices?

I don’t think we hit a ceiling. I think we’re measuring the altitude wrong.

David, Maya—both of your frameworks resonate. Let me add the engineering management perspective with some hard data from our 40-person team.

The 10% Plateau Matches Our Experience Exactly

We rolled out AI coding tools (GitHub Copilot + Cursor) to our entire engineering org in Q1 2025. Here’s what the data showed:

Q1 2025 (first 3 months):

  • Velocity: +25% (measured by story points completed)
  • PR throughput: +32%
  • Team morale: Through the roof (“this is amazing!”)

Q2 2025 (months 4-6):

  • Velocity: +18% (still good, but declining)
  • PR throughput: +28%
  • Engineering satisfaction: Starting to hear grumbles

Q3 2025 (months 7-9):

  • Velocity: +11% (the plateau hits)
  • PR throughput: +15% (but more PRs failing CI)
  • Engineer feedback: “AI helps but review process is killing us”

Q4 2025 / Q1 2026 (current state):

  • Velocity: +11% (stuck at plateau)
  • PR throughput: +12%
  • Morale: Mixed—some love it, some frustrated

So yes, we hit the same ~10% plateau David described. But the why is what matters.

The Gains Aren’t Evenly Distributed

Here’s what we discovered when we segmented by experience level:

Senior engineers (5+ YOE):

  • Productivity gains: 15-20%
  • PR quality: Same or better
  • Adoption: Enthusiastic

Mid-level engineers (2-5 YOE):

  • Productivity gains: ~8%
  • PR quality: Slightly worse (more review iterations)
  • Adoption: Mixed feelings

Junior engineers (<2 YOE):

  • Productivity gains: ~5%
  • PR quality: Noticeably worse (1.5x more bugs in first 30 days)
  • Adoption: High usage but dependency concerns

This distribution explains the 10% aggregate plateau. Seniors are crushing it. Juniors are struggling. Mid-level is… mid.
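To show how segment-level gains roll up into an aggregate near 10%, here is a minimal Python sketch of a headcount-weighted average. The headcounts per segment are hypothetical (only the team size of 40 and the per-segment gains are stated above), the senior figure uses the midpoint of the 15-20% range, and the calculation assumes every engineer contributes equally to baseline output.

```python
# Hypothetical headcounts; gains are the per-segment figures quoted above.
segments = {
    "senior": {"headcount": 10, "gain_pct": 17.5},  # midpoint of 15-20%
    "mid": {"headcount": 18, "gain_pct": 8.0},
    "junior": {"headcount": 12, "gain_pct": 5.0},
}


def aggregate_gain(segments: dict) -> float:
    """Headcount-weighted average of per-segment productivity gains."""
    total_headcount = sum(s["headcount"] for s in segments.values())
    weighted_sum = sum(s["headcount"] * s["gain_pct"] for s in segments.values())
    return weighted_sum / total_headcount


print(f"Aggregate gain: {aggregate_gain(segments):.1f}%")
```

With this assumed mix the aggregate lands at roughly 9.5%, which is in line with the plateau. A different seniority mix would move the number, so it is worth running with real headcounts.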

Maya’s question about experience levels is spot-on.

Why the difference?

Seniors have mental models. They know what good code looks like, so they can spot AI mistakes quickly. They use AI to accelerate what they already know how to do.

Juniors don’t have that reference frame. When AI gives them “almost right” code, they spend hours debugging subtle issues they wouldn’t have created if they’d written it manually.

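The METR study (developers were 19% slower but believed they were faster) points to the same verification cost. Its participants were experienced developers, so even a strong reference frame doesn’t make the cost of checking AI output go away.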

Organizational Systems Became the Bottleneck

David’s Hypothesis 3 (organizational lag) is the real answer, at least for us.

When engineers started generating 50% more PRs (see the math below), our review process wasn’t designed for that volume.

The math:

  • 40 engineers × 3 PRs/week = 120 PRs/week (old baseline)
  • With AI: 40 engineers × 4.5 PRs/week = 180 PRs/week (+50%)
  • Same review capacity: ~10 senior engineers doing reviews
  • Result: Review queue went from 1 day → 3-4 days

Suddenly the bottleneck wasn’t “how fast can you write code”—it was “how fast can reviewers approve it.”
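The queue math above can be sketched in a few lines of Python. The engineer, reviewer, and PR-rate numbers come from the list above. The backlog line rests on an added assumption that reviewers were already at capacity at the old baseline of 120 PRs/week.

```python
ENGINEERS = 40
REVIEWERS = 10  # "~10 senior engineers doing reviews"


def weekly_prs(prs_per_engineer: float) -> float:
    """Total PRs opened per week across the team."""
    return ENGINEERS * prs_per_engineer


def prs_per_reviewer(prs_per_engineer: float) -> float:
    """Weekly review load carried by each reviewer."""
    return weekly_prs(prs_per_engineer) / REVIEWERS


before = prs_per_reviewer(3.0)   # 12.0 PRs per reviewer per week
after = prs_per_reviewer(4.5)    # 18.0 PRs per reviewer per week
load_increase = (after - before) / before  # 0.5, i.e. +50%

# Assumption: review capacity was saturated at the old baseline, so every
# extra PR joins the queue.
backlog_growth_per_week = weekly_prs(4.5) - weekly_prs(3.0)  # 60.0

print(before, after, load_increase, backlog_growth_per_week)
```

Under that assumption the queue grows by about 60 PRs a week until something gives, which fits the jump from 1 day to 3-4 days of review wait.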

Our CI/CD pipeline also buckled:

  • AI tools write tests too (good!)
  • Test suite grew 60% in 6 months (also good!)
  • Pipeline execution time: 8 minutes → 22 minutes (very bad!)
  • Engineers context-switching while waiting for CI (productivity killer)

We accidentally optimized individual developer speed while creating system-level bottlenecks.

What We Changed (And What Worked)

After our Q3 2025 plateau analysis, we spent 3 months fixing the ecosystem:

1. PR Review Process Overhaul

  • Implemented async review SLAs (24-hour target)
  • Created “AI-assisted code” review checklist (different from manual code review)
  • Trained reviewers on common AI patterns and failure modes
  • Added automated quality gates (linting, security scanning, complexity checks)

2. CI/CD Infrastructure Investment

  • Parallelized test execution (8 runners → 24 runners)
  • Implemented smarter test selection (only run affected tests for draft PRs)
  • Upgraded build infrastructure (faster machines)
  • Pipeline time: 22 minutes → 9 minutes
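For anyone curious what "only run affected tests for draft PRs" can look like, here is a minimal Python sketch. It is not our actual pipeline: the file paths and the `TEST_MAP` lookup are hypothetical, and in practice the mapping would come from coverage data or a build graph.

```python
# Hypothetical mapping from source files to the test files that cover them.
TEST_MAP = {
    "billing/invoice.py": {"tests/test_invoice.py", "tests/test_billing_api.py"},
    "billing/tax.py": {"tests/test_tax.py"},
    "auth/session.py": {"tests/test_session.py"},
}
ALL_TESTS = set().union(*TEST_MAP.values())


def select_tests(changed_files: list[str], is_draft: bool) -> set[str]:
    """Pick the tests to run for a PR.

    Draft PRs run only the tests mapped to the changed files. Ready-for-review
    PRs, and drafts touching a file with no known mapping, run the full suite
    so nothing slips through.
    """
    if not is_draft:
        return set(ALL_TESTS)
    selected: set[str] = set()
    for path in changed_files:
        if path not in TEST_MAP:
            return set(ALL_TESTS)  # unknown file: fall back to everything
        selected |= TEST_MAP[path]
    return selected


print(select_tests(["billing/tax.py"], is_draft=True))
```

The full-suite fallback is the important design choice: selection speeds up drafts, but the complete run still gates merging.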

3. Documentation Requirements

  • PRs must tag “AI-assisted” if >30% AI-generated
  • Required explanation: “Why this approach?” not just “What does it do?”
  • This helped with code review AND onboarding new team members

The results after 3 months:

  • Velocity gains: 8% → 19% (more than doubled!)
  • PR review time: Back to 1-day average
  • Developer satisfaction: Up significantly

So the plateau wasn’t AI’s ceiling—it was our system’s ceiling.

David’s Measurement Question

“What are you actually tracking? Beyond DORA metrics, what KPIs reveal AI tool effectiveness?”

We’re using DORA metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, MTTR) as the foundation. But we added:

AI-Specific Metrics:

  • % of PRs tagged “AI-assisted”
  • Review time: AI-assisted vs manual PRs
  • Bug rate (first 30 days): AI-assisted vs manual code
  • Developer satisfaction survey (quarterly): “How much do AI tools help you?”

System Health Metrics:

  • PR queue depth (how many waiting for review)
  • CI/CD pipeline wait time
  • Code review capacity utilization

The combination tells a fuller story than velocity alone.
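The AI-specific comparisons are simple to compute once PRs carry the "AI-assisted" tag. Here is a minimal Python sketch, assuming each PR can be exported with its tag, review time, and 30-day bug count. The `PullRequest` record, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PullRequest:
    """Illustrative record; fields are assumptions about what you can export."""
    ai_assisted: bool        # the "AI-assisted" tag
    review_hours: float      # time from opened to approved
    bugs_first_30_days: int  # bugs traced back to this PR


def summarize(prs: list[PullRequest]) -> dict:
    """Compare AI-assisted and manual PRs on share, review time, and bug rate."""
    summary = {}
    for label, flag in (("ai_assisted", True), ("manual", False)):
        group = [p for p in prs if p.ai_assisted == flag]
        if not group:
            continue  # skip a segment with no PRs instead of dividing by zero
        summary[label] = {
            "share_of_prs": len(group) / len(prs),
            "mean_review_hours": mean(p.review_hours for p in group),
            "bugs_per_pr": sum(p.bugs_first_30_days for p in group) / len(group),
        }
    return summary


# Made-up sample data for illustration only.
prs = [
    PullRequest(True, 6.0, 1),
    PullRequest(True, 10.0, 0),
    PullRequest(False, 4.0, 0),
    PullRequest(False, 2.0, 0),
]
print(summarize(prs))
```

Small samples make these ratios noisy, so they are better read as quarterly trends than as week-to-week scores.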

The Perception Gap Is Concerning

David asked: “If engineers feel faster but data shows they’re not, do you correct the perception or let it be?”

This is a tough leadership question.

On one hand, if engineers feel productive and satisfied, that has retention value. Happy engineers don’t leave.

On the other hand, false confidence is dangerous. If they think they’re 20% faster but they’re actually slower, they might:

  • Take on too much work (leading to burnout)
  • Make poor time estimates (hurting planning)
  • Miss skill development opportunities (thinking they’re improving when they’re not)

I lean toward transparency with context.

We share the data (“aggregate productivity is +11%, but distribution varies by experience level”) and explain why (system bottlenecks, not individual performance).

This maintains trust while setting realistic expectations.

Maya’s “Valley of Disappointment” Hypothesis

I’m more optimistic after reading Maya’s framework.

Maybe the 10% plateau is temporary. Team-level benefits (shared patterns, better test coverage, accumulated knowledge) might compound over 3-5 years.

But also: We need to actively invest in unlocking those benefits.

If we just wait passively for compounding effects, they won’t materialize. We need to:

  • Upgrade our systems (CI/CD, review processes)
  • Develop our people (AI literacy, validation skills)
  • Evolve our culture (what does “quality” mean in the AI era?)

The plateau is a choice. You can accept it, or you can invest your way through it.

We’re choosing to invest.

Luis’s data is gold. David’s product lens is exactly right. Maya’s “valley of disappointment” framework is brilliant.

Let me add the executive/strategic layer, because this plateau has board-level implications.

This Reminds Me of Cloud Migration ROI Debates

Back in 2015, everyone said moving to cloud would cut infrastructure costs 50%.

Reality? Most companies saw 15% savings after 3 years.

Did cloud fail? Absolutely not.

The real value wasn’t cost savings—it was:

  • Deployment flexibility (ship new regions in hours, not months)
  • Experimentation velocity (spin up test environments, destroy them, repeat)
  • Scaling elasticity (handle traffic spikes without overprovisioning)
  • Innovation acceleration (access to managed services that would take years to build)

But if you only measured “infrastructure cost reduction,” you’d think cloud was a massive disappointment.

AI coding tools might be the same story.

Our Company’s Experience: 8% Gain, But Unmeasured Value

We have 120 engineers. Universal AI tool adoption (Cursor + GitHub Copilot). Aggregate productivity gain: 8%.

That’s… underwhelming on paper.

But here’s what doesn’t show up in velocity metrics:

1. Faster Onboarding

  • New hires productive in 2 weeks vs 6 weeks
  • AI acts as “institutional knowledge on demand”
  • Junior engineers can navigate unfamiliar codebases faster

2. Knowledge Democratization

  • Mid-level engineers accessing senior-level patterns
  • Reduced dependency on “the one person who knows this system”
  • More uniform code quality across the team

3. Context Switching Reduction

  • Inline answers instead of Stack Overflow → Slack → docs → back to code
  • Estimated 15-20 minutes saved per context switch
  • Flow state preservation (hard to measure, huge for deep work)

4. Lower Cognitive Load on Routine Tasks

  • Less mental energy spent on boilerplate
  • More capacity for architectural thinking
  • Better work-life balance (not taking syntax problems home)

If I factor these into the ROI calculation, the value is significantly higher than 8%.

But how do I quantify “junior engineers feel less overwhelmed”? How do I measure “context switching reduction” at scale?

That’s the measurement challenge David identified.

The Perception Gap Is a Ticking Time Bomb

Luis mentioned the perception gap from the METR study (developers expected to be 24% faster, actually 19% slower).

This is deeply concerning from an organizational change management perspective.

Scenario 1: Engineers believe they’re faster, data shows they’re not.

If we correct the perception (“actually, you’re slower”), we risk:

  • Morale hit (“so AI tools are useless?”)
  • Resistance to using tools (“why bother if they don’t help?”)
  • Anxiety about AI replacing jobs intensifying (“if I’m slower with AI, what’s my value?”)

Scenario 2: We let the perception stand uncorrected.

But then we have:

  • Misaligned expectations (engineers overcommit, miss deadlines)
  • False confidence in estimates (planning becomes unreliable)
  • Long-term skill degradation (if they think they’re learning but aren’t)

There’s no clean answer. Leadership requires navigating this tension carefully.

My current approach: Correct the perception, but reframe the value.

“You’re not necessarily faster at writing code. But you’re shipping higher quality code with better test coverage and spending less mental energy on syntax. That’s valuable even if velocity metrics don’t capture it.”

This shifts the conversation from “are we faster?” to “are we better?”

The Hard Question: Should Executives Even Care About This?

Here’s my uncomfortable truth: I need to justify our $400k annual AI tooling spend to the board.

CFO’s math:

  • Engineering team cost: ~$12M annually
  • 10% productivity gain = $1.2M value
  • AI tools cost: $400k
  • ROI: 3x (looks good!)

CFO’s counter-argument:

  • “Why not just hire 10% fewer engineers and save $1.2M instead of spending $400k on tools?”

This is where David’s product lens and Maya’s multi-dimensional measurement save me.

My rebuttal:

1. Retention Value
If AI tools improve job satisfaction and reduce attrition by 5%, that’s ~$600k saved (cost of replacing 2 engineers @ ~$300k each including recruiting, onboarding, lost productivity).

2. Quality Value
If bug rate drops 20%, customer support costs drop, NPS improves, churn reduces. Hard to quantify precisely, but conservatively $200k-$500k annually.

3. Speed-to-Market Value
If time-to-deploy for new features drops 15%, we ship faster than competitors, capture market share earlier. Strategic advantage that doesn’t appear in engineering productivity metrics.

4. Talent Acquisition Value
Engineers want to work at companies using modern tools. If we removed AI tools, recruiting would be harder (candidates expect them now).

When I frame it this way, with Maya’s multi-dimensional measurement framework, the business case is clear.

The $400k isn’t just buying 10% productivity. It’s buying retention, quality, speed, and competitive positioning.
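For the board deck, the arithmetic can be laid out as a small Python sketch. It uses only the figures quoted above and treats them as estimates. Speed-to-market and talent acquisition are left out because I haven't put a dollar number on them. The variable names are just for illustration.

```python
TEAM_COST = 12_000_000      # annual engineering team cost
TOOL_COST = 400_000         # annual AI tooling spend
PRODUCTIVITY_GAIN_PCT = 10  # the CFO's assumed productivity gain

productivity_value = TEAM_COST * PRODUCTIVITY_GAIN_PCT / 100  # 1,200,000
retention_value = 2 * 300_000   # two avoided replacements, as estimated above
quality_value_low = 200_000     # conservative quality estimate, low end
quality_value_high = 500_000    # conservative quality estimate, high end


def roi_multiple(total_value: float) -> float:
    """Value returned per dollar of tooling spend."""
    return total_value / TOOL_COST


productivity_only = roi_multiple(productivity_value)
with_extras_low = roi_multiple(
    productivity_value + retention_value + quality_value_low
)
with_extras_high = roi_multiple(
    productivity_value + retention_value + quality_value_high
)

print(productivity_only)  # 3.0
print(with_extras_low)    # 5.0
print(with_extras_high)   # 5.75
```

So the same spend reads as 3x on productivity alone and roughly 5x to 5.75x once the retention and quality estimates are included, with every input still open to challenge.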

What I Want: Open-Source Measurement Toolkit

Luis’s metrics are great. Maya’s satisfaction framework is perfect. David’s product lens is exactly right.

But every company is reinventing these frameworks independently.

What if we created an open-source measurement toolkit?

Include:

  • Survey templates (developer satisfaction, perception vs reality)
  • Metrics definitions (beyond DORA, what else to track)
  • Anonymized benchmark data (how do we compare to peers?)
  • Analysis scripts (calculate multi-dimensional ROI)

My company would contribute:

  • Our quarterly survey templates
  • Anonymized productivity data (segmented by experience level, like Luis did)
  • ROI calculation framework for board presentations

If 10-20 companies contributed, we’d have industry-standard benchmarks within a year.

Imagine a “State of AI-Assisted Development” annual report (like Stack Overflow Developer Survey).

That would move the industry forward faster than individual companies figuring this out in isolation.

The Learning Question Is Critical for Long-Term Strategy

Maya and Luis both touched on this: Are we building sustainable capability or creating dependency?

From a CTO perspective, I worry about:

18 months from now: Do we have a cohort of engineers who can’t function without AI?

3 years from now: What does the skill ceiling look like for “AI-native” engineers vs “traditionally trained” engineers?

5 years from now: Are we paying a hidden debt (skill gaps) that doesn’t show up in quarterly metrics?

This isn’t about being anti-AI. It’s about being strategic.

If AI tools are scaffolding (Maya’s metaphor), we need to ensure engineers are building strong foundations underneath.

If AI tools are a crutch, we’re creating long-term weakness.

How do we tell the difference? Longitudinal studies.

We need to track: Do engineers who learned with AI reach Staff/Principal level at the same rate as those who learned pre-AI?

We won’t have answers for 3-5 years. But we should start tracking now.

My Proposal: Multi-Dimensional Measurement Framework

Adapt the SPACE framework (from Nicole Forsgren’s DevEx research) for AI-assisted development:

S - Satisfaction and Wellbeing

  • Do engineers enjoy their work more?
  • Has burnout decreased?
  • Would they want to work without AI tools?

P - Performance

  • Not just velocity, but quality, maintainability, deployment success
  • Segmented by experience level (Luis’s insight)
  • Measured over multiple time horizons (immediate, 30 days, 90 days)

A - Activity

  • PRs created, reviews completed, incidents handled
  • But also: How much time in flow state? How much context switching?

C - Communication and Collaboration

  • Has cross-team knowledge sharing improved?
  • Are code reviews more or less effective?
  • Is onboarding faster?

E - Efficiency and Flow

  • Are engineers shipping features end-to-end faster?
  • Are deployment pipelines keeping up?
  • Where are the bottlenecks? (Luis’s system-level analysis)

This framework captures David’s product lens, Maya’s satisfaction dimension, and Luis’s operational rigor.

It gives executives language to communicate value beyond “10% faster.”

And it gives engineering leaders data to justify investment in the ecosystem upgrades Luis described.


Who’s in for building this together?

This thread is exactly what I needed. Michelle’s SPACE framework adaptation, Luis’s segmented data, Maya’s satisfaction lens, David’s product thinking—all of it is clicking.

Let me add the people and organizational culture dimension, because I think the plateau might actually be a feature of where we are in the adoption curve, not a bug.

The Plateau Might Be a Feature, Not a Bug

My EdTech company (80 engineers) saw the same pattern everyone’s describing:

  • Initial spike: +22% productivity
  • Plateau by month 6: +9%
  • Current state (18 months later): +17%

Wait, it went back up?

Yes. And I think it’s because we hit what I call the “First Wave” vs “Second Wave” pattern.

First Wave (Months 1-6): Individual Tool Adoption

  • Engineers use AI for autocomplete, boilerplate, tests
  • Personal productivity spikes
  • Organizational systems aren’t adapted
  • Result: Plateau at ~10%

Second Wave (Months 7-18): Workflow Redesign

  • Teams restructure how they work around AI capabilities
  • Different sprint planning, different task decomposition, different pairing dynamics
  • Organizational systems catch up (Luis’s CI/CD upgrades, review process changes)
  • Result: Second productivity gain spike

We’re currently in Second Wave, and it’s working.

What Changed: Task Structure, Not Just Tools

Here’s the key insight: We stopped slotting AI into the same workflow we had before.

We started categorizing work into:

“AI-Friendly Tasks” (well-defined, pattern-based, lots of examples):

  • CRUD operations
  • API endpoint creation
  • Test writing
  • Documentation generation
  • Common UI components

“AI-Resistant Tasks” (novel, ambiguous, requiring deep context):

  • Architecture decisions
  • Complex business logic
  • Performance optimization
  • Incident response
  • Cross-system integration design

This changes everything:

  • Sprint planning now considers “is this AI-friendly or AI-resistant?”
  • Story sizing reflects this (AI-friendly tasks get smaller point values)
  • Pairing strategies differ (juniors solo on AI-friendly, pair on AI-resistant)
  • Skill development focuses on AI-resistant capabilities

The result? Engineers apply AI tools to what the tools are good at, and human judgment to what they’re not.

Productivity went from +9% plateau → +17% sustained.

But this required intentional workflow redesign, not passive tool adoption.

The 10% Plateau Is “Using AI the Same Way We Worked Before”

I think most companies are stuck at 10% because they’re treating AI tools like faster keyboards.

They haven’t rethought:

  • How we decompose features
  • How we estimate work
  • How we pair and collaborate
  • How we review code
  • How we onboard new engineers

If you use AI tools within a pre-AI workflow, you get modest gains.

If you redesign the workflow for AI-augmented development, you get bigger gains.

But workflow redesign is hard. It requires:

  • Executive sponsorship (Michelle’s point about investment)
  • Team buy-in (cultural change)
  • Experimentation and learning (not every change works)
  • Patience (Second Wave takes 12-18 months)

Most companies aren’t willing to do that work. So they stay at the plateau.

The Diversity and Learning Lens

Now, the uncomfortable part.

Luis’s data showing juniors struggle more than seniors? That worries me deeply.

The METR study (19% slower but believed they were faster)? That’s a massive red flag for skill development.

Here’s my concern: AI tools might widen gaps instead of narrowing them.

Engineers with strong mentorship networks:

  • Have seniors who can validate AI output
  • Get real-time feedback on what’s good vs “almost right”
  • Build judgment alongside speed

Engineers without strong mentorship networks:

  • Use AI as “substitute mentor”
  • But AI can’t teach why something is good or bad
  • They plateau at tool proficiency without developing deeper skills

This has diversity implications.

First-generation college students, career-changers, underrepresented minorities—they often lack informal mentorship networks.

If AI tools create a two-tier system (those with human mentors thrive, those without plateau), we’re making tech less accessible, not more.

What We’re Doing: Explicit Skill Development Programs

We can’t rely on AI tools to passively improve junior engineers.

We implemented:

1. “AI Co-Pilot, Not Auto-Pilot” Philosophy

  • Juniors must explain AI-generated code in PR descriptions
  • Code review checklist includes “can you debug this without AI?”
  • Quarterly “AI-free” rotation (build a feature without assistants, like a skill maintenance exercise)

2. Structured Pairing Program

  • Juniors pair with seniors for first 6 months
  • Seniors use AI, juniors observe how and why
  • Then juniors use AI, seniors coach on validation

3. AI Literacy Curriculum

  • When to trust AI output (well-defined patterns)
  • When to be skeptical (novel problems, security-sensitive code)
  • How to validate AI suggestions (what to check, what could go wrong)
  • How to debug AI-generated code (common failure patterns)

This treats AI collaboration as a skill to develop, not just a tool to use.

Early results (12 months in):

  • Junior engineers feel more confident validating AI output
  • Code quality metrics improving (fewer bugs in production)
  • But ramp time is slower (takes ~8 weeks instead of 2 weeks to first feature)

We’re choosing long-term capability over short-term speed.

The Perception Gap and Trust

Michelle raised the perception gap (engineers think they’re faster but aren’t).

From a people leadership perspective, this is about trust.

If I tell my team “you’re not actually faster,” without context, they’ll feel:

  • Defensive (“my lived experience disagrees”)
  • Demoralized (“so AI tools are useless?”)
  • Uncertain (“what should I believe—my experience or the data?”)

Instead, I frame it as: “You’re faster at different things.”

  • Faster at getting to first draft ✅
  • Faster at generating test cases ✅
  • Slower at debugging subtle issues ⚠️
  • Slower at understanding unfamiliar code ⚠️

Net result: Depends on task mix, experience level, and workflow.

This reframes the conversation from “faster vs slower” to “what are AI tools good at, and how do we use them strategically?”

It maintains trust while setting realistic expectations.

David’s Satisfaction Question Matters More Than We Think

“What if 10% productivity + 50% job satisfaction improvement is better than 30% productivity + 20% burnout increase?”

From a VP Eng perspective, I’d take that trade every single time.

Here’s why:

Attrition costs:

  • Recruiting: ~$50k per senior engineer hire
  • Onboarding productivity loss: 3-6 months at reduced output
  • Knowledge loss: Institutional knowledge walks out the door
  • Team morale impact: Remaining engineers feel the loss

If AI tools reduce attrition by 5% (2-3 engineers per year for my team), that’s $300k-$500k saved annually.

That alone justifies the tooling cost.

But it’s not just retention. It’s:

  • Attraction (engineers want to work at companies with modern tools)
  • Engagement (engineers who enjoy their work do better work)
  • Innovation (happy engineers experiment more, try new things)

Michelle’s multi-dimensional ROI framework captures this, but I’d add one more metric: “Would you want to work without these tools?”

In our quarterly surveys:

  • 78% say “definitely not” (would quit if tools removed)
  • 18% say “probably not” (would be unhappy)
  • 4% say “wouldn’t matter”

That’s a massive signal that the value is real, even if velocity metrics don’t capture it.

Where Do We Go From Here?

I love Michelle’s proposal for an open-source measurement toolkit.

I’d contribute:

  • Our task categorization framework (AI-friendly vs AI-resistant)
  • Junior engineer pairing program structure
  • AI literacy curriculum materials
  • Quarterly survey templates (satisfaction, perception vs reality)

Luis’s system-level bottleneck analysis and segmented data would be invaluable.

Maya’s design systems parallel and “valley of disappointment” framing helps explain the pattern.

David’s product lens gives us language to communicate with executives.

If we pull this together into a shared playbook, we could move the industry forward.

Proposal: “AI-Augmented Development Maturity Model”

  • Level 1: Tool adoption (engineers have access, use ad-hoc)
  • Level 2: Measurement (tracking impact, identifying bottlenecks)
  • Level 3: System optimization (CI/CD upgrades, review process changes)
  • Level 4: Workflow redesign (task structure, pairing, planning adapted for AI)
  • Level 5: Strategic integration (hiring, career development, architecture designed for AI-augmented teams)

Most companies are Level 1-2. The 10% plateau is the Level 2 ceiling.

Getting to Level 3-5 unlocks the Second Wave gains.

But it requires intentional investment, not passive waiting.


Let’s build this together. Who’s in?