Three months ago, we gave our junior engineers (IC1-IC3) access to Claude Code and GitHub Copilot. The results were impressive on paper: they were pushing code 45% faster than before. PRs per week jumped from an average of 3 to 5.5 per engineer.
Then we looked at the other side of the equation: senior engineer code review time increased 91%.
Our Staff+ engineers, who were previously spending about 8 hours/week on code review, are now spending 15+ hours/week. They’re drowning. Sprint planning is suffering because our tech leads don’t have time for architectural discussions. Mentorship has taken a back seat.
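A quick back-of-envelope check on these figures, as a sketch only. It treats "15+" hours as exactly 15 (so the review growth is a lower bound) and assumes review hours can be compared directly against the per-engineer PR counts. The variable names are illustrative.

```python
# Figures quoted in the post; "15+" hours is treated as 15 (an assumption).
prs_before, prs_after = 3.0, 5.5          # PRs per engineer per week
review_before, review_after = 8.0, 15.0   # senior review hours per week


def pct_increase(before: float, after: float) -> float:
    """Percentage increase going from `before` to `after`."""
    return (after - before) / before * 100


pr_growth = pct_increase(prs_before, prs_after)
review_growth = pct_increase(review_before, review_after)

# Growth in review hours divided by growth in PR volume. A value near 1.0
# means review time per PR stayed roughly flat, so volume alone would
# explain most of the extra hours.
per_pr_ratio = (review_after / review_before) / (prs_after / prs_before)

print(f"PR volume: +{pr_growth:.0f}%")
print(f"Review hours: +{review_growth:.1f}%")
print(f"Review time per PR, relative to before: {per_pr_ratio:.2f}x")
```

Under these assumptions, review hours grew at nearly the same rate as PR volume, which suggests the extra load comes mostly from the number of PRs rather than from each PR taking longer.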
The Productivity Paradox
This aligns with recent research showing AI speeds up coding by 30% but can slow overall delivery. The bottleneck didn’t disappear; it shifted from writing code to reviewing it.
Here’s what we’re seeing in reviews:
1. Volume Without Context
Juniors are generating more code faster, but the code often lacks broader architectural context. It “works” but doesn’t fit the existing patterns. Reviewers have to explain not just “what’s wrong” but “why this approach doesn’t align with our system design.”
2. Subtle Architecture Violations
AI-generated code passes linters and tests but introduces subtle issues:
- Unnecessary coupling between modules
- Inconsistent error handling patterns
- Performance patterns that work in dev but fail at scale
- Security concerns that automated tools miss
Senior engineers catch these, but it takes cognitive effort to spot and explain.
3. Review Fatigue
When you’re reviewing 2-3 PRs per day instead of 1-2 per week per junior, fatigue sets in. Review quality drops, and rubber-stamp approvals become more common.
Our Adaptations
We’ve made several changes:
Ultra-Granular Commits
We now require engineers to commit after each AI-generated change before accepting the next suggestion. This creates checkpoints. If an AI suggestion introduces a bug, we can revert to the last known good state without losing an entire session of work.
Inspired by best practices for AI-assisted development, this approach treats AI suggestions as experimental branches, not production-ready code.
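A minimal sketch of what a checkpoint helper could look like, assuming engineers run it from inside the repository after accepting each suggestion. The `checkpoint:` message prefix and both function names are hypothetical, not our actual tooling.

```python
import subprocess


def checkpoint_commands(summary: str) -> list[list[str]]:
    """Build the git commands for one checkpoint commit."""
    return [
        ["git", "add", "--all"],
        # The "checkpoint:" prefix is a made-up convention for this sketch.
        ["git", "commit", "-m", f"checkpoint: {summary}"],
    ]


def checkpoint(summary: str) -> None:
    """Stage everything and commit.

    Raises subprocess.CalledProcessError if a command fails, which includes
    the case where there is nothing to commit.
    """
    for cmd in checkpoint_commands(summary):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands instead of touching a real repository.
    for cmd in checkpoint_commands("accept AI suggestion for retry helper"):
        print(" ".join(cmd))
```

With one commit per accepted suggestion, getting back to the last known good state is a `git revert` or `git reset` to the previous checkpoint.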
Design Review Before Code
No PR without a design review first. Even for small features, engineers must:
- Write a brief spec (problem, proposed solution, alternatives considered)
- Get async feedback from a senior engineer or tech lead
- Only then start coding with AI
This frontloads the architectural thinking and reduces review churn.
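The three steps above could be enforced with a small gate. This is a sketch under assumptions: the `DesignSpec` class, its field names, and the `missing_items` check are all hypothetical, and a real version would likely read from whatever doc or ticket system holds the spec.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DesignSpec:
    """Hypothetical record of the brief spec described above."""
    problem: str
    proposed_solution: str
    alternatives_considered: list[str] = field(default_factory=list)
    reviewed_by: Optional[str] = None  # senior engineer or tech lead

    def missing_items(self) -> list[str]:
        """Return the names of required parts that are still empty."""
        missing = []
        if not self.problem.strip():
            missing.append("problem")
        if not self.proposed_solution.strip():
            missing.append("proposed_solution")
        if not self.alternatives_considered:
            missing.append("alternatives_considered")
        if not self.reviewed_by:
            missing.append("reviewed_by")
        return missing

    def ready_for_coding(self) -> bool:
        """True only when the spec is complete and has been reviewed."""
        return not self.missing_items()


spec = DesignSpec(
    problem="Retries are inconsistent across services",
    proposed_solution="Shared retry helper with backoff",
    alternatives_considered=["Per-service retry logic"],
)
print(spec.missing_items())  # still waiting on async feedback
```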
Small PR Mandate
We’ve formalized what used to be a guideline: PRs must be reviewable in 15-30 minutes maximum.
If you can’t review it in one sitting, it’s too big. This forces decomposition and makes AI-generated code easier to review in digestible chunks.
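One way to make the mandate checkable is to estimate review time from diff size. This is a rough sketch: the 10 lines-per-minute review speed is a placeholder assumption, not a measured number, and each team would need to calibrate it. The function names are hypothetical.

```python
MAX_REVIEW_MINUTES = 30        # upper bound from the mandate above
ASSUMED_LINES_PER_MINUTE = 10  # placeholder review speed; calibrate per team


def estimated_review_minutes(
    lines_changed: int,
    lines_per_minute: float = ASSUMED_LINES_PER_MINUTE,
) -> float:
    """Estimate review time from the number of changed lines."""
    if lines_changed < 0:
        raise ValueError("lines_changed must be non-negative")
    if lines_per_minute <= 0:
        raise ValueError("lines_per_minute must be positive")
    return lines_changed / lines_per_minute


def is_reviewable(
    lines_changed: int,
    lines_per_minute: float = ASSUMED_LINES_PER_MINUTE,
) -> bool:
    """True if the estimate fits within the 30-minute ceiling."""
    return estimated_review_minutes(lines_changed, lines_per_minute) <= MAX_REVIEW_MINUTES


for size in (150, 300, 800):
    verdict = "ok" if is_reviewable(size) else "split this PR"
    print(f"{size} lines: ~{estimated_review_minutes(size):.0f} min, {verdict}")
```

Line count is a crude proxy. A 50-line change to a core abstraction can take longer to review than 500 lines of generated boilerplate, so a check like this works better as a warning than as a hard block.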
The Central Question
Are senior engineers the bottleneck or the quality gate?
Bottleneck framing suggests we need to scale senior review capacity:
- Hire more senior engineers (expensive, slow)
- Automate parts of code review with AI (might miss what humans catch)
- Relax review standards (introduces tech debt)
Quality gate framing suggests the system is working correctly:
- Junior code volume increased, but senior scrutiny prevents tech debt from accumulating
- The “slowdown” is actually preventing future maintenance costs
- We shouldn’t optimize for velocity at the expense of quality
I’m leaning toward the quality gate perspective, but I’m curious what others think.
Data That Concerns Me
From our retrospectives:
- 38% of AI-generated PRs required substantial revision in code review (vs 15% for manually written code)
- Average review cycles increased from 1.2 to 2.4 (more back-and-forth)
- Senior engineer satisfaction dropped 12 points (out of 100) in our last engagement survey
We’re shipping faster but burning out our most experienced engineers in the process.
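To put these numbers side by side, here is a rough calculation. It makes a strong simplifying assumption that every PR before the rollout was manually written and every PR after it is AI-generated, and it reuses the 3 and 5.5 PRs per week from the top of the post. Treat the outputs as order-of-magnitude estimates.

```python
# Figures from the retrospective data above.
revision_rate_ai = 0.38
revision_rate_manual = 0.15
cycles_before, cycles_after = 1.2, 2.4

# PRs per engineer per week, from the start of the post.
prs_before, prs_after = 3.0, 5.5

# How much more likely an AI-generated PR is to need substantial revision.
relative_revision_rate = revision_rate_ai / revision_rate_manual

# Assumption: all "before" PRs were manual, all "after" PRs are AI-generated.
revisions_before = prs_before * revision_rate_manual
revisions_after = prs_after * revision_rate_ai

# Review passes per junior per week = PRs x average review cycles.
passes_before = prs_before * cycles_before
passes_after = prs_after * cycles_after

print(f"Relative revision rate: {relative_revision_rate:.2f}x")
print(f"Substantial revisions per junior per week: "
      f"{revisions_before:.2f} -> {revisions_after:.2f}")
print(f"Review passes per junior per week: "
      f"{passes_before:.1f} -> {passes_after:.1f}")
```

Under that assumption, each junior generates several times as many review passes per week as before, which is consistent with the fatigue we’re seeing.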
Questions for the Community
- Are others seeing this pattern? Or is this unique to how we’re using AI tools?
- Should we constrain AI output more (tighter guardrails) to reduce review burden?
- Is async code review the wrong model for AI-generated code? Should we do more pairing?
- How do you balance velocity and quality when AI shifts where bottlenecks appear?
I suspect we’re optimizing for the wrong metric (code written) when we should be optimizing for value delivered. But I’d love to hear how other teams are thinking about this.