So here’s something that’s been eating at me lately…
Our design systems team has been shipping at record pace. Velocity charts look amazing. Leadership is happy. But our incident rate? Up 30% over the last 6 months. And I couldn’t figure out why until I stumbled across this CodeRabbit study from December.
AI-generated pull requests contain 1.7× more issues than human-written code.
Not 10% more. Not 20% more. 70% more issues. That’s… significant.
The Data That Made Me Pause
The CodeRabbit report analyzed 470 real PRs from open-source projects:
- Human PRs: ~6.45 issues on average
- AI-assisted PRs: ~10.83 issues on average
But it gets more interesting when you break it down:
- Logic/correctness errors: 1.75× higher in AI code
- Security vulnerabilities: 1.57× higher
- Readability problems: 3× higher
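The headline 1.7× figure checks out against those per-PR averages; a quick sanity check:

```python
# Sanity-check the 1.7x headline against the per-PR averages above.
human_avg = 6.45   # avg issues per human-written PR (CodeRabbit, 470 PRs)
ai_avg = 10.83     # avg issues per AI-assisted PR

ratio = ai_avg / human_avg
print(f"{ratio:.2f}x")  # ~1.68x, which rounds to the reported 1.7x
```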

That last one hits home. As someone who literally builds component libraries for a living, readability isn’t just “nice to have” - it’s the entire point. If your code has 3× the readability problems, you’re not shipping value faster. You’re just deferring the cost.
The Productivity Paradox
Here’s where it gets weird. METR’s developer productivity study found that developers using AI tools took 19% longer to complete issues. Not faster. Slower.
But wait - developers expected AI to speed them up by 24%. And even after experiencing the slowdown, they still believed it made them faster by 20%.
We’re literally experiencing the opposite of what we think is happening.
My Real Problem
I’ll be honest: I use AI coding tools. Copilot, Cursor, the whole stack. They’re genuinely helpful for boilerplate, for exploring new patterns, for that annoying CSS I can never remember.
But here’s what’s keeping me up at night - I have no idea which parts of our codebase were AI-generated.
When I review PRs, I can’t tell if I’m looking at:
- Code the author deeply understands
- Code copied from Stack Overflow (we’ve all been there)
- Code an AI hallucinated that just happens to work
And if I can’t tell the difference, how do I know what level of scrutiny to apply?
The Maintenance Time Bomb
The most alarming stat I found: 75% of AI coding agents break working code during long-term maintenance tasks. Even code that passes all tests initially.
Design systems are long-lived. Our component library has been around for 3 years. Some of these components will live for 5+ years. If we’re introducing AI-generated code that works today but breaks during normal maintenance in 6 months… that’s terrifying.
Should We Be Tracking This?
So here’s my question: Should we start tracking which PRs used AI assistance?
I know, I know - it sounds like surveillance. It sounds like we’re creating stigma around using helpful tools. That’s not what I want at all.
But there’s this new Git AI project that tracks AI-generated code at the line level. It’s transparent, it’s automatic, and it links every AI-written line to the agent and transcript that generated it.
The idea isn’t to punish anyone. It’s to:
- Know what we’re working with - like ingredients on food labels
- Learn from patterns - which AI-generated code causes issues vs works great
- Apply appropriate review - like we do for junior engineer code
- Measure the actual impact - not what we feel is happening
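On that last point: once PRs carry an AI-assistance flag, the comparison the CodeRabbit study did becomes a trivial aggregation over our own data. A sketch with made-up field names and numbers:

```python
# Hypothetical PR records -- the "ai_assisted" flag and issue counts
# are illustrative, not from any real tracker.
from statistics import mean

prs = [
    {"id": 101, "ai_assisted": True,  "review_issues": 12},
    {"id": 102, "ai_assisted": False, "review_issues": 5},
    {"id": 103, "ai_assisted": True,  "review_issues": 9},
    {"id": 104, "ai_assisted": False, "review_issues": 7},
]

def avg_issues(records, assisted):
    """Average review issues per PR for one group."""
    return mean(r["review_issues"] for r in records if r["ai_assisted"] is assisted)

print(f"AI-assisted: {avg_issues(prs, True):.1f} issues/PR")   # 10.5
print(f"Human-only:  {avg_issues(prs, False):.1f} issues/PR")  # 6.0
```

The point isn’t this toy code - it’s that without the flag, this analysis is impossible no matter how good your dashboards are.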
Questions I’m Wrestling With
- Is tracking AI code creating fear, or creating transparency?
- Should AI-generated code require a different review process?
- Who’s responsible when AI code breaks production 6 months later?
- Can we measure this without making people afraid to use AI tools?
I don’t have answers yet. But I think we need to start having this conversation before our incident rate climbs another 30%.
What’s everyone seeing on their teams? Are you tracking AI usage? How are you handling code review for AI-assisted PRs?
Would love to hear perspectives - especially from the eng leadership folks who’ve been thinking about this longer than my 3 weeks of anxiety-reading papers.