90% productivity gains on tests and refactoring, but slower on features. What does this mean for team composition?

I’ve been thinking a lot about team composition lately as we scale from 25 to 80+ engineers at my EdTech startup. The recent research on AI coding assistants has me questioning some fundamental assumptions about how we build engineering teams.

The productivity paradox we’re seeing:

Recent studies show AI tools deliver massive productivity gains on specific task types—up to 90% faster on test generation and refactoring workflows. But here’s the kicker: these same developers are actually slower on feature development work. METR research found experienced developers take 19% longer when using AI tools on complex tasks.

Even more striking: while individual coding speed jumps ~30%, organizational delivery only improves by about 8%. The gap between individual velocity and team throughput is eye-opening.

What I’m seeing on my own team:

We rolled out AI coding assistants 6 months ago, and our data confirms the asymmetry:

  • Our test coverage increased 40% with the same headcount
  • Refactoring tickets close faster
  • But feature delivery timelines haven’t meaningfully improved
  • Code review has become our bottleneck—PRs are 18% larger and incidents per PR are up 24%

The team composition question:

This has me rethinking our hiring strategy. Traditional wisdom says maintain roughly 1:1 junior to senior ratios, maybe skewing slightly senior as you mature. But if AI is effectively handling “junior-level” coding tasks (boilerplate, test scaffolding, basic implementations), does that ratio still make sense?

Some companies are shifting to 3-5 seniors for every 1-2 juniors, using AI to fill the traditional junior developer coding role while keeping human juniors specifically for succession planning and fresh perspectives.

But I’m conflicted. Junior developer employment is already down 20% since 2022. Are we accidentally killing the talent pipeline by over-relying on AI for entry-level work?

The bigger strategic questions:

  1. Role definitions: If 65% of senior developers expect their roles to be redefined (moving from hands-on coding to design/architecture/strategy), what does a “senior” engineer actually do in 2026?

  2. Skill development: When AI writes most of the code, how do mid-level engineers develop the pattern recognition and architectural intuition that comes from repetitive implementation work?

  3. Review capacity: If AI generates code faster but increases PR size and error rates, do we need to flip our ratio to favor more experienced reviewers?

  4. Long-term sustainability: Are we optimizing for short-term productivity at the expense of building the next generation of senior engineers?

What I’m curious about:

  • Has anyone else adjusted their team composition ratios in response to AI tools?
  • How are you thinking about junior developer career paths when AI handles traditional junior tasks?
  • Are you seeing the same code review bottleneck we’re experiencing?
  • What metrics are you using to make these team structure decisions?

I don’t have answers yet, but I think we’re in the middle of a fundamental shift in how engineering teams are structured. Would love to hear how others are thinking about this.


Keisha, this resonates deeply with what we’re experiencing at our financial services company. I lead 40+ engineers and we started tracking these metrics 4 months ago when we rolled out AI coding assistants.

Our data confirms your observations:

We’re seeing the exact same pattern:

  • Unit test creation time: down 55%
  • Refactoring velocity: up 38%
  • Feature delivery cycle time: down only 9%
  • PR review time: up 31% (this is our killer)

The most striking thing is the PR size explosion. Our average PR went from ~180 lines to ~290 lines. More code means more surface area for bugs, and we’re seeing incidents per PR climb just like you mentioned.

The junior developer pipeline problem is real:

Here’s what keeps me up at night: We had plans to hire 8 junior engineers this year. After seeing AI’s impact on code generation, leadership is now questioning why we need juniors at all.

My argument: We’re confusing short-term productivity with long-term capability building.

In 3-5 years, who becomes our next senior engineers if we stop developing juniors today? The skills you build doing “boring” CRUD endpoints and writing thousands of tests—that’s not just grunt work. It’s pattern recognition training. It’s learning to think about edge cases. It’s building the intuition that makes someone senior.

AI can generate code, but it can’t replace the learning journey.

What we’re trying (experiment in progress):

We’re testing a hybrid approach:

  1. Maintain 60% of our planned junior hiring (5 instead of 8)
  2. Pair each junior with a senior mentor explicitly focused on skill development
  3. Create deliberate “AI-free zones” for learning—certain tickets juniors must complete without AI assistance
  4. Dedicate senior capacity specifically to PR review (we promoted one senior to a “code quality lead” role)

Early results (2 months in):

  • Junior satisfaction is high—they appreciate the mentorship focus
  • Senior engineers report higher code review burden but better quality outcomes
  • We’re seeing fewer incidents per PR after the initial spike

But honestly, I don’t know if this scales.

The economic pressure is real. Our CFO keeps asking why we pay 3 seniors to review what AI generates instead of just paying for 1 senior and 5 AI subscriptions.

I’d love to hear from others: How are you making the ROI case for junior developer hiring when AI can do much of the coding?

Both of you are asking great tactical questions, but I think we might be optimizing for the wrong thing.

The real question isn’t “how do we adjust team ratios for AI?”

It’s: “Why does 30% faster coding only translate to 8% faster delivery?”

That 22-percentage-point gap is where the real problem lives. And I don’t think hiring more or fewer juniors will fix it.

The bottleneck isn’t in code generation—it’s in everything else:

We did a value stream mapping exercise at my company (mid-stage SaaS, ~100 engineers), and here’s what we found:

  • Actual coding: ~25% of cycle time
  • Code review: 18%
  • Waiting for QA environments: 12%
  • Coordination across teams: 15%
  • Requirements clarification: 11%
  • Deployment approvals and processes: 10%
  • Everything else: 9%

So AI makes developers 30% faster at the 25% of work that’s actual coding. Do the math: 0.30 × 0.25 = 7.5% theoretical maximum improvement.

Sound familiar? That’s almost exactly the 8% organizational gain you’re seeing.
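
If you want to sanity-check that arithmetic, here’s the back-of-the-envelope version in a few lines of Python. The stricter Amdahl-style estimate comes out a bit below the simple product, but both land in the same single-digit range as the 8% organizational figure:

```python
# Back-of-the-envelope: how much faster the whole cycle gets when only the
# coding portion speeds up. The 25% and 30% inputs are the figures quoted above.
coding_fraction = 0.25   # share of cycle time that is actual coding
coding_speedup = 1.30    # developers are ~30% faster at that portion

# Simple ceiling quoted above: share of time x relative gain
naive_gain = coding_fraction * (coding_speedup - 1)

# Amdahl-style estimate: new cycle time as a fraction of the old one
new_cycle = (1 - coding_fraction) + coding_fraction / coding_speedup
amdahl_gain = 1 - new_cycle

print(f"naive ceiling: {naive_gain:.1%}, Amdahl estimate: {amdahl_gain:.1%}")
# naive ceiling: 7.5%, Amdahl estimate: 5.8%
```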

My contrarian take:

The team composition question is a distraction. The real transformation opportunity is in how we structure work, not who does it.

Instead of “should we hire 3 seniors per junior,” we should be asking:

  1. How do we reduce the code review bottleneck? Not by hiring more reviewers, but by improving code quality standards and potentially using AI for first-pass reviews.

  2. How do we eliminate the 12% waiting for environments? Platform engineering and better infrastructure-as-code can solve this.

  3. Why does requirements clarification take 11% of cycle time? That’s a product-engineering alignment problem, not a team composition problem.

  4. Can we reduce coordination overhead? That’s a team topology question; maybe we need smaller, more autonomous teams.

What I’m doing differently:

We’re not changing our hiring ratios at all. Instead, we’re:

  • Investing in platform engineering to reduce environment friction
  • Using AI for automated code quality checks before human review (rough sketch after this list)
  • Restructuring teams around value streams instead of technical layers
  • Creating clearer product requirement templates to reduce clarification cycles
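
On the automated-checks bullet, the mechanics matter less than the ordering: nothing should land in a human reviewer’s queue until the machine has already said yes. Here’s a minimal sketch of that gate in Python; the specific tools (ruff, pytest) are placeholders for whatever your stack actually uses, not a description of our exact setup:

```python
#!/usr/bin/env python3
"""Pre-review gate: run in CI on every PR before reviewers are assigned.

Illustrative sketch only; the checks below are placeholder tool choices.
"""
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],           # static analysis / lint
    ["pytest", "-q", "--maxfail=5"],  # unit tests, bail early on cascading failures
]

def main() -> int:
    for cmd in CHECKS:
        print(f"running: {' '.join(cmd)}")
        if subprocess.run(cmd).returncode != 0:
            print("automated checks failed; fix before requesting human review")
            return 1
    print("automated checks passed; OK to request human review")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```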

Three months in, we’re seeing 19% delivery improvement—more than double the “AI makes coding faster” gains.

Bottom line:

Junior vs senior ratio is a team-level optimization. We need system-level thinking. Fix the process bottlenecks, and the team composition question becomes less critical.

That said, Luis, I completely agree with you on the talent pipeline concern. We should develop juniors not because AI can’t code, but because we need future leaders who understand systems deeply. That’s a different ROI calculation than coding productivity.

As someone on the ground actually using these tools every day, I want to share the IC perspective on this.

Michelle’s point about process bottlenecks is spot on, but there’s another angle:

The quality of AI-generated code varies wildly depending on the task. And I’m noticing my PRs are getting rejected more often even though I feel more productive.

What I’m experiencing personally:

I’ve been using Copilot + Cursor for about 8 months now. Here’s my honest assessment:

Where AI makes me genuinely faster:

  • Boilerplate API endpoints (70% faster, easily)
  • Test scaffolding (once I set up the first example, AI nails the rest)
  • Database migrations and schema changes
  • Updating similar code across multiple files
  • Documentation and code comments

Where AI slows me down or creates more work:

  • Architectural decisions (AI suggests patterns that don’t fit our system)
  • Complex state management (I spend more time debugging AI suggestions than writing from scratch)
  • Performance optimization (AI doesn’t understand our specific bottlenecks)
  • Security-sensitive code (I trust AI less here, so review takes longer)

The PR rejection problem is real:

Before AI: ~15% of my PRs got sent back for revisions
With AI: ~28% of my PRs get sent back

Why? I’m moving faster, but I’m also:

  • Not thinking as deeply about edge cases (AI doesn’t catch them)
  • Making more architectural mistakes (AI optimizes for code that works, not code that fits)
  • Introducing subtle bugs that only senior reviewers catch

My biggest concern: Am I getting worse at my job?

I’m 7 years into my career. I should be developing architectural intuition and deep system knowledge. But when AI writes most of my code, am I actually building those skills?

I notice this especially with our junior engineers. They’re productive from day one because AI handles the syntax and basic patterns. But ask them to explain why we chose a particular approach, and they can’t articulate it. They never struggled through the design decision.

The “seniority” question is starting to feel existential:

If AI writes code and I orchestrate it, what does it mean to be a “senior” engineer?

  • Is it just knowing what to ask AI to do?
  • Is it reviewing AI output more critically?
  • Is it the architectural decisions AI can’t make?

I don’t have answers, but I do think Luis’s idea about “AI-free zones” for learning is important. Maybe we need that for mid-level engineers too, not just juniors.

One thing I’m trying:

I’ve started a personal rule: For any complex feature, I write the core logic myself first without AI, then use AI for the surrounding boilerplate.

It’s slower in the short term, but I feel like I’m still exercising the “engineering muscle” instead of letting it atrophy.

Curious if other ICs are feeling this same tension between productivity and skill development?

Data person chiming in here—I think we need to be really careful about the metrics we’re using to make these team composition decisions.

The measurement challenge nobody’s talking about:

Everyone’s quoting productivity percentages (30% faster coding, 90% faster tests, etc.), but when I dig into how these numbers are generated, I get concerned.

Questions I’d ask before making structural changes:

  1. What’s the baseline? Are we comparing AI-assisted developers to their own previous performance, or to other developers? The METR study compared the same developers on tasks with and without AI assistance, but much of the industry data is self-reported perception.

  2. What are we measuring? Lines of code? Time to PR? Time to merge? Time to production? Each metric tells a different story.

  3. What’s the observation period? Microsoft research shows it takes 11 weeks for developers to realize AI productivity gains. Are we measuring during the learning curve or after plateau?

  4. Are we accounting for rework? If PRs are 18% larger and have 24% more incidents, what’s the net productivity when we include fix time? (Rough model after this list.)
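
To make question 4 concrete, here’s a toy model of net productivity with rework folded back in. Every number in it is a made-up placeholder, not data from this thread; the only point is that a 30% gross coding gain shrinks once you charge the extra incidents back against it:

```python
# Toy model: time per unit of feature work once rework is included.
# All inputs are illustrative placeholders; plug in your own measurements.
baseline_coding_hours = 10.0   # coding hours per unit of feature work, pre-AI (assumed)
gross_speedup = 1.30           # coding ~30% faster with AI
prs_per_unit = 2.0             # PRs shipped per unit of work (assumed)
hours_per_incident = 6.0       # triage + fix + re-review per incident (assumed)

baseline_incident_rate = 0.10                           # incidents per PR before AI (assumed)
incident_rate_with_ai = baseline_incident_rate * 1.24   # +24%, as reported above

def hours_per_unit(coding_hours: float, incident_rate: float) -> float:
    rework = prs_per_unit * incident_rate * hours_per_incident
    return coding_hours + rework

before = hours_per_unit(baseline_coding_hours, baseline_incident_rate)
after = hours_per_unit(baseline_coding_hours / gross_speedup, incident_rate_with_ai)

print(f"pre-AI: {before:.1f} h/unit, with AI: {after:.1f} h/unit, "
      f"net time saved: {1 - after / before:.1%}")
# With these placeholders the 30% gross gain nets out to roughly 18%.
```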

The specific metrics I’d want before changing team structure:

Instead of individual productivity, measure:

  • Lead time for changes (commit to production)
  • Deployment frequency
  • Change failure rate (broken deploys requiring hotfixes)
  • Mean time to recovery
  • Developer satisfaction (are people burning out on code review?)

The first four are DORA metrics; together with the satisfaction check, they tell you about system health, not just individual velocity.
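
If you don’t already have these wired up, the first four are cheap to compute from events you almost certainly log somewhere. A minimal sketch with made-up records follows; the tuple format is just for illustration, and in practice the events come from your CI/CD pipeline and incident tracker:

```python
from datetime import datetime
from statistics import median

# (first_commit_time, deployed_time, deploy_caused_failure, hours_to_recover)
deployments = [
    (datetime(2025, 6, 2, 9, 0),  datetime(2025, 6, 3, 15, 0), False, None),
    (datetime(2025, 6, 4, 11, 0), datetime(2025, 6, 6, 10, 0), True,  4.5),
    (datetime(2025, 6, 9, 14, 0), datetime(2025, 6, 10, 9, 0), False, None),
]
observation_days = 14  # length of the window these records cover (assumed)

lead_time_hours = median(
    (deployed - committed).total_seconds() / 3600
    for committed, deployed, _, _ in deployments
)
deploy_frequency = len(deployments) / observation_days
recoveries = [hours for _, _, failed, hours in deployments if failed]
change_failure_rate = len(recoveries) / len(deployments)
mean_time_to_recovery = sum(recoveries) / len(recoveries) if recoveries else 0.0

print(f"lead time for changes (median): {lead_time_hours:.1f} h")
print(f"deployment frequency: {deploy_frequency:.2f} per day")
print(f"change failure rate: {change_failure_rate:.0%}")
print(f"mean time to recovery: {mean_time_to_recovery:.1f} h")
```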

A suggestion: Run structured experiments

Rather than making company-wide team composition changes based on industry averages, what if we A/B tested different structures?

  • Team A: Traditional 1:1 junior:senior ratio with AI tools
  • Team B: 3:1 senior:junior ratio with AI tools
  • Team C: Traditional ratio without AI tools (control group)

Measure over 6 months across standardized work. Track both velocity metrics AND learning/development metrics (promotion readiness, technical decision quality, etc.).

My hypothesis:

I suspect we’ll find that optimal team composition depends heavily on:

  • Product domain complexity
  • Technical stack maturity
  • Quality/speed tradeoff tolerance
  • Organizational code review capacity

There might not be a universal answer. A fintech team with high compliance requirements might need different ratios than a consumer app optimizing for speed.

One concrete suggestion:

Before anyone restructures their teams, at minimum track:

  1. Time from commit to prod (full cycle)
  2. Defect rate in production
  3. PR rework cycles (how often do PRs get sent back)
  4. Junior developer promotion readiness (can they handle senior work in 2-3 years?)

If metrics 1-2 improve AND metric 4 stays stable, maybe the new ratio works. But if you’re sacrificing long-term capability building for short-term velocity, the data should show that too.

Alex’s observation about PR rejection rates nearly doubling is a perfect example of a lagging indicator we should be measuring systematically.