Pull Requests With AI Code Have 1.7× More Issues: Should We Be Tracking This Separately?

So here’s something that’s been eating at me lately…

Our design systems team has been shipping at record pace. Velocity charts look amazing. Leadership is happy. But our incident rate? Up 30% over the last 6 months. And I couldn’t figure out why until I stumbled across this CodeRabbit study from December.

AI-generated pull requests contain 1.7× more issues than human-written code.

Not 10% more. Not 20% more. 70% more issues. That’s… significant.

The Data That Made Me Pause

The CodeRabbit report analyzed 470 real PRs from open-source projects:

  • Human PRs: ~6.45 issues on average
  • AI-assisted PRs: ~10.83 issues on average
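As a quick sanity check, the headline multiplier follows directly from those two averages:

```python
# Quick check: the headline multiplier from the two reported averages.
human_avg = 6.45   # issues per human-written PR (CodeRabbit report)
ai_avg = 10.83     # issues per AI-assisted PR

print(round(ai_avg / human_avg, 2))  # → 1.68, which the report rounds to "1.7×"
```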

But it gets more interesting when you break it down:

  • Logic/correctness errors: 1.75× higher in AI code
  • Security vulnerabilities: 1.57× higher
  • Readability problems: 3× higher :melting_face:

That last one hits home. As someone who literally builds component libraries for a living, readability isn’t just “nice to have” - it’s the entire point. If your code is 3× harder to understand, you’re not shipping value faster. You’re just deferring the cost.

The Productivity Paradox

Here’s where it gets weird. METR’s developer productivity study found that developers using AI tools took 19% longer to complete issues. Not faster. Slower.

But wait - developers expected AI to speed them up by 24%. And even after experiencing the slowdown, they still believed it made them faster by 20%.

We’re literally experiencing the opposite of what we think is happening.

My Real Problem

I’ll be honest: I use AI coding tools. Copilot, Cursor, the whole stack. They’re genuinely helpful for boilerplate, for exploring new patterns, for that annoying CSS I can never remember.

But here’s what’s keeping me up at night - I have no idea which parts of our codebase were AI-generated.

When I review PRs, I can’t tell if I’m looking at:

  • Code the author deeply understands
  • Code copied from Stack Overflow (we’ve all been there)
  • Code an AI hallucinated that just happens to work

And if I can’t tell the difference, how do I know what level of scrutiny to apply?

The Maintenance Time Bomb

The most alarming stat I found: AI coding agents break working code in 75% of long-term maintenance tasks. Even code that passes all tests initially.

Design systems are long-lived. Our component library has been around for 3 years. Some of these components will live for 5+ years. If we’re introducing AI-generated code that works today but breaks during normal maintenance in 6 months… that’s terrifying.

Should We Be Tracking This?

So here’s my question: Should we start tracking which PRs used AI assistance?

I know, I know - it sounds like surveillance. It sounds like we’re creating stigma around using helpful tools. That’s not what I want at all.

But there’s this new Git AI project that tracks AI-generated code at the line level. It’s transparent, it’s automatic, and it links every AI-written line to the agent and transcript that generated it.

The idea isn’t to punish anyone. It’s to:

  1. Know what we’re working with - like ingredients on food labels
  2. Learn from patterns - which AI-generated code causes issues vs works great
  3. Apply appropriate review - like we do for junior engineer code
  4. Measure the actual impact - not what we feel is happening

Questions I’m Wrestling With

  • Is tracking AI code creating fear, or creating transparency?
  • Should AI-generated code require a different review process?
  • Who’s responsible when AI code breaks production 6 months later?
  • Can we measure this without making people afraid to use AI tools?

I don’t have answers yet. But I think we need to start having this conversation before our incident rate climbs another 30%.

What’s everyone seeing on their teams? Are you tracking AI usage? How are you handling code review for AI-assisted PRs?

Would love to hear perspectives - especially from the eng leadership folks who’ve been thinking about this longer than my 3 weeks of anxiety reading papers. :sweat_smile:

Maya, this hits close to home. We’ve been seeing exactly these patterns on our engineering teams.

The Data We’re Already Tracking

Over the past 8 months, I started noticing something in our sprint retrospectives - our junior engineers were completing tickets faster than ever, but PR reviews were taking twice as long. And the back-and-forth in code review comments? Way up.

So I did what any data-driven engineering leader would do - I started tracking incident sources. Nothing formal, just tagging tickets: “root cause: logic error”, “root cause: edge case”, “root cause: misunderstood requirements”.

The correlation you’re describing? It’s real. Our incidents with root causes in “logic errors” and “edge cases not handled” have increased 40% year-over-year. And when I dig into those, they disproportionately come from PRs that shipped quickly.

Now, I can’t definitively say “this was AI-generated code” because we’re not tracking that. But the pattern matches what you and these studies are describing.

The Cultural Landmine

Here’s my concern with tracking AI usage: We absolutely cannot create an environment where engineers are afraid to use tools that help them.

I mentor a lot of Latino engineers, many first-generation college grads who are already dealing with imposter syndrome. The last thing I want is for someone to feel like using Copilot puts a scarlet letter on their PR.

But at the same time… that 1.7× issue rate is real. The 75% maintenance failure rate is terrifying for our financial systems that need to run reliably for years.

What I Think We Should Actually Track

Instead of tracking “did you use AI?”, what if we tracked review quality metrics?

Things like:

  • Number of review rounds before merge
  • Post-merge bugs within 30 days
  • Time spent in code review
  • Reviewer confidence ratings

This way, we’re measuring outcomes, not tools. If a PR has 10 rounds of review and 3 bugs in the first month, it doesn’t matter if AI wrote it or a human did - something in our process failed.
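To make the outcome-metrics idea concrete, here's a minimal sketch of what the aggregation could look like. The field names (`review_rounds`, `bugs_30d`, etc.) are illustrative assumptions, not from any real tool, and the point is that everything is computed at the batch level, never per person:

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical PR record; field names are illustrative, not from any real tool.
@dataclass
class PRRecord:
    review_rounds: int        # rounds of review before merge
    bugs_30d: int             # post-merge bugs filed within 30 days
    review_hours: float       # total reviewer time spent
    reviewer_confidence: int  # 1-5 self-reported reviewer rating

def summarize(prs):
    """Aggregate outcome metrics across a batch of PRs (team-level, not individual)."""
    return {
        "avg_review_rounds": mean(p.review_rounds for p in prs),
        "avg_bugs_30d": mean(p.bugs_30d for p in prs),
        "avg_review_hours": mean(p.review_hours for p in prs),
        "avg_confidence": mean(p.reviewer_confidence for p in prs),
    }

batch = [PRRecord(2, 0, 1.5, 4), PRRecord(10, 3, 6.0, 2)]
print(summarize(batch))
```

The second record in that toy batch is exactly the "10 rounds of review and 3 bugs in the first month" case - it drags every average up regardless of who, or what, wrote the code.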

And here’s the thing - these metrics would help us identify patterns. Maybe we’d notice that PRs touching authentication always need extra scrutiny. Maybe we’d see that certain types of changes consistently have issues regardless of who wrote them.

The Question I Keep Coming Back To

How do we get the benefits of AI coding assistance without the quality cost?

Right now it feels like we’re in this weird place where:

  1. Developers use AI to go faster
  2. It feels faster but is actually slower (per METR)
  3. The code has more issues
  4. Review takes longer
  5. We fix more bugs post-merge

So where’s the actual productivity gain?

I’m not anti-AI. I use these tools myself. But I think we need to be a lot more intentional about how we use them.

Maybe that means:

  • Combine AI-generated code with pair programming
  • Require extra review for AI-heavy PRs (if we can identify them)
  • Build better internal guidelines on when AI is helpful vs risky
  • Invest in training on reviewing AI-generated code

What I don’t want is to measure individual engineers on whether they used AI. That’s the wrong lens entirely.

Question for the group: Has anyone found a way to improve code quality while still letting teams use AI tools? What’s working?

This discussion couldn’t be more timely. Let me add the strategic perspective from someone who’s been managing these risks at the CTO level.

This Is About Risk Management, Not AI Backlash

First, let’s be clear: I’m not anti-AI. Our company uses AI tools extensively. But that 75% maintenance failure rate Maya referenced? That’s not a productivity tool - that’s a ticking time bomb in our codebase.

Think about what that means:

  • You ship code today that works perfectly
  • Passes all tests
  • Gets approved in code review
  • Goes to production
  • Then 6 months later during routine maintenance, it breaks

That’s not technical debt you can see and plan for. That’s hidden risk you don’t even know you’re carrying.

We’ve Been Here Before

This reminds me of other major tech transitions we’ve navigated:

Automated Testing (early 2000s): Teams started writing tests, felt productive, but many tests were brittle or tested the wrong things. We had to learn what “good” tests looked like.

CI/CD Adoption (2010s): Teams could deploy fast, but without proper observability and rollback strategies, they just broke production faster.

Microservices Migration: Moved fast, created distributed systems we didn’t fully understand, spent years fixing the mess.

The pattern? New tools enable new speed, but we initially lack the practices to use them safely.

My Position: Yes, We Should Track AI Code

But not for the reasons you might think.

We’re not tracking it to punish people. We’re tracking it to learn how to use these tools effectively.

When we adopted automated testing, we tracked test coverage. Not to shame teams with low coverage, but to understand what “enough” testing looks like.

When we moved to microservices, we tracked service dependencies. Not to prevent services, but to understand system complexity.

AI code should be no different.

The Git AI Project Is Promising

I’ve been following the Git AI project with interest. What I like:

  1. Transparent attribution - It’s not guessing, it knows exactly what AI wrote
  2. Line-level granularity - You can see which specific lines came from AI
  3. Linked to transcripts - You can see WHY the AI generated that code
  4. Built into workflow - It’s automatic, not reliant on self-reporting

This gives us data to answer questions like:

  • Which AI models produce more maintainable code?
  • Which types of code (tests vs features vs infrastructure) are safe to AI-generate?
  • Which prompting strategies lead to better outcomes?
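I haven't dug into Git AI's actual storage format, so treat this as a toy model: assuming you had per-line attribution records of any shape, even a simple `(file, line, author_kind)` tuple, answering "how AI-heavy is this file?" becomes a one-pass aggregation:

```python
from collections import Counter

# Hypothetical line-level attribution records: (file, line_no, author_kind).
# The real Git AI project stores richer data (agent, transcript link);
# this toy model only shows what line-level data makes possible.
attributions = [
    ("auth.py", 1, "human"),
    ("auth.py", 2, "ai"),
    ("auth.py", 3, "ai"),
    ("utils.py", 1, "human"),
]

def ai_share_by_file(records):
    """Fraction of lines attributed to AI, per file."""
    totals, ai_counts = Counter(), Counter()
    for path, _line, kind in records:
        totals[path] += 1
        if kind == "ai":
            ai_counts[path] += 1
    return {path: ai_counts[path] / totals[path] for path in totals}

print(ai_share_by_file(attributions))  # auth.py ≈ 0.67, utils.py 0.0
```

Same data, different group-by key (model, file type, prompt strategy), and you can start answering each of the three questions above empirically.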

Treat It Like Junior Engineer Code

Here’s my framework: AI-generated code should get the same scrutiny we give to code from junior engineers.

That’s not an insult to AI or to junior engineers. It’s recognition that:

  • The code might work but miss edge cases
  • The author might not fully understand all implications
  • Extra review eyes catch issues before production
  • It’s a learning opportunity, not a quality gate

Luis raised concerns about creating fear - I share that. The solution is to make the purpose of tracking very clear:

We’re tracking AI code to improve our development process, not to evaluate individual performance.

Make that a firm commitment. Engineering owns the data. It’s aggregated, not individual. It’s used for process improvement, not performance reviews.

Call for Industry Standards

Longer term, I think we need industry-wide standards on AI code disclosure.

Just like we have:

  • Security vulnerability disclosure policies
  • Open source license attribution requirements
  • Accessibility compliance standards

We should have clear practices around AI-generated code in production systems.

Some enterprise customers are already asking us: “How much of your codebase is AI-generated?” We should have a good answer.

The Real Question

Here’s what keeps me up at night: Are we optimizing for feeling productive, or for actually delivering value?

The METR study showing 19% slower completion while developers feel 20% faster? That’s a massive perception-reality gap.

We need data. We need measurement. We need to understand what we’re actually building.

Track the AI code. Learn from it. Build better practices. That’s how we turn a risky tool into a valuable one.

Adding the organizational effectiveness lens here - because this isn’t just about code quality, it’s about how our entire engineering organization functions.

The Velocity vs Quality Trade-off We’re All Experiencing

Michelle’s data about perception vs reality really resonates. At our EdTech startup, we’re scaling fast (25 to 80+ engineers in 18 months), and I’ve been watching our metrics closely.

What we’re seeing:

  • Change failure rate: Up 30% year-over-year
  • Mean time to recovery: Up 18%
  • PR merge time: Up 35% (despite faster initial coding)
  • Developer satisfaction with code review: Down 22%

That last one is what worries me most. When I talk to our engineers, they’re frustrated. Reviewers are spending more time and feeling less confident. Authors are getting more rounds of feedback than ever.

Something is systemically changing, and the timeline correlates exactly with widespread AI tool adoption.

The Hidden Cost: Reviewer Burden

Luis mentioned this, but I want to emphasize it - the cost isn’t just in the code, it’s in the review process.

Our senior engineers are telling me:

  • “I can’t tell if the author understands this code”
  • “The code looks fine but something feels off”
  • “I spent 45 minutes reviewing a 50-line PR because I had to mentally test every edge case”

That cognitive load is real. And it’s burning out our best reviewers.

One of my principal engineers put it perfectly: “It’s like reviewing code written by someone who’s really confident but might be completely wrong about fundamental assumptions.”

The Equity Question Nobody’s Asking

Here’s where I get concerned about fairness and inclusion:

If we don’t track AI usage, who bears the cost?

Right now, it’s our senior engineers and tech leads who are:

  1. Spending more time in code review
  2. Catching issues that make it to production
  3. Maintaining code they didn’t write and the author might not fully understand

That’s not equitable. We’re essentially asking our most experienced people to silently absorb the quality cost of AI-generated code.

And here’s the thing - those senior engineers are already underrepresented groups in tech. I’m acutely aware that when we add invisible burden to the most experienced people, we’re often adding it to the people who’ve already fought hardest to get there.

Who’s Responsible When AI Code Breaks Production?

This is the question that keeps coming up in our leadership team:

Scenario: An engineer uses Copilot to generate a complex authentication function. They review it, it looks good, tests pass. They submit it for PR review. Senior engineer approves (after spending 30 minutes reviewing a 40-line function).

It ships.

Three months later, there’s a security incident. Turns out the AI-generated code had a subtle vulnerability that we all missed.

Who’s responsible?

  • The engineer who submitted it?
  • The reviewer who approved it?
  • The company for allowing AI tools?
  • The AI tool vendor?

I don’t have a good answer. But I know we need better practices before we face this scenario for real.

My Proposal: Track + Train

I agree with Michelle - we should track AI code usage. But I want to pair it with something equally important: Training our teams to work effectively with AI tools.

This includes:

  1. For authors: How to prompt AI effectively, what to review carefully, when NOT to use AI
  2. For reviewers: What patterns to watch for in AI code, how to assess understanding
  3. For everyone: Shared responsibility for code quality regardless of authorship

Make it clear: Using AI isn’t the problem. Using AI without understanding what you’re shipping is.

The Transparency Framework

Instead of a tracking regime that feels like surveillance, I'm proposing transparency:

Voluntary disclosure in PR descriptions:
“AI assistance used for: [specific parts]”

Start with making it optional. Make it culturally safe. Share insights monthly about what we’re learning.

If we find that AI-assisted PRs genuinely have more issues, we adjust our review process - more eyes, more time, specific checklists.

But we need the data first.

The Question I Need Answered

How do we maintain our quality bar while empowering engineers with AI tools?

Because right now, I’m watching:

  • Quality go down
  • Review time go up
  • Developer satisfaction drop
  • Senior engineer burnout increase

That’s not sustainable. And it’s definitely not the future of work I want to build.

What practices are other orgs putting in place? Who’s figured this out?

Coming at this from the product side - and honestly, this whole discussion is making me rethink some assumptions I had about engineering velocity.

The Customer-Facing Implications

Here’s what I’m seeing from my seat:

Our support team has been flagging what they call “weird bugs” - issues that don’t make intuitive sense, edge cases that seem like someone didn’t think through the user journey.

Example from last month: A form validation that worked perfectly in testing but broke when users tried to paste content from Google Docs (because of special characters). The fix was a 2-line change, but it took us 3 days to even figure out what was happening.

When I asked the engineer about it, they were genuinely surprised. They’d tested the happy path, tests passed, looked good. But they hadn’t considered that specific edge case.

Now I’m wondering: Was that AI-generated validation code that looked correct but missed context a human would have caught?
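For anyone curious what a bug like that looks like, here's a hedged reconstruction (I don't have our actual code in front of me, and the character list is my assumption about what Google Docs pastes): a strict allow-list validator rejects typographic quotes, em dashes, and non-breaking spaces, and the roughly two-line fix is to normalize pasted text before validating:

```python
import re
import unicodedata

# Strict allow-list validator: plain ASCII punctuation only.
ALLOWED = re.compile(r"[A-Za-z0-9 .,;:'\"!?()\-\n]+")

# Typographic characters Google Docs commonly pastes in (assumed list).
TRANSLATIONS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en/em dashes
    "\u00a0": " ",                  # non-breaking space
})

def normalize_pasted_text(text: str) -> str:
    # NFKC folds compatibility characters (ligatures, fullwidth forms,
    # non-breaking spaces) to plain equivalents; translate handles the rest.
    return unicodedata.normalize("NFKC", text).translate(TRANSLATIONS)

def is_valid(text: str) -> bool:
    return bool(ALLOWED.fullmatch(text))

pasted = "\u201cHello\u201d \u2014 world\u00a0test"  # a Google Docs-style paste
print(is_valid(pasted))                          # False: fails raw
print(is_valid(normalize_pasted_text(pasted)))   # True: passes after normalizing
```

The happy-path tests all typed their input by hand, so the raw validator looked fine - which is exactly why this class of bug survives review.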

The Velocity Illusion

This thread is making me question our sprint velocity celebrations.

We’ve been hitting sprint commitments 3 sprints in a row. Engineering capacity looks great on paper. Product is shipping features faster than ever.

But:

  • Our bug backlog is 40% larger than 6 months ago
  • Customer support tickets about defects are up 28%
  • We’re spending 2 sprint cycles after every release doing “stabilization”

Are we shipping faster, or just deferring the work?

That METR study showing 19% slower actual completion despite feeling faster? That perfectly describes what I think we’re experiencing at the product level.

The Quality Cost I Didn’t Understand

When engineering tells me “this feature will take 2 sprints,” I don’t usually ask about code quality trade-offs. I assume professional engineers are building it right.

But if AI code has:

  • 3× more readability issues
  • 1.7× more bugs
  • 75% failure rate in future maintenance

Then “2 sprints to ship” doesn’t include the real cost:

  • Extra review time (that Keisha mentioned - 35% longer)
  • Post-release bug fixes
  • Future maintenance failures
  • Support burden from weird edge cases

The total cost of ownership is way higher than the initial velocity suggests.

The Question I’m Asking Engineering

When should we explicitly NOT use AI coding tools?

Because from a product perspective, there are features where speed matters more than perfection (experimental features, internal tools, time-sensitive competitive launches).

And there are features where quality is non-negotiable (payment processing, user data handling, core product functionality).

If AI code carries 1.7× more issues, shouldn’t we have a conversation about when that trade-off makes sense vs when it doesn’t?

What I Need From Engineering Leadership

Honestly, I need guidance on:

  1. Realistic timelines: If using AI makes initial coding faster but review and fixing slower, what are real completion estimates?

  2. Quality indicators: How do I know if a feature is “AI-heavy” so I can plan for potential stabilization work?

  3. Risk assessment: Which types of features should avoid AI assistance entirely?

  4. Customer communication: If a customer asks “is this AI-generated?”, what’s our answer?

The Business Impact Perspective

Here’s what worries me from a go-to-market angle:

Our enterprise customers are starting to ask about AI usage in development. Not because they’re anti-AI, but because they want to understand risk.

If we can’t answer:

  • “How much of your code is AI-generated?”
  • “What’s your review process for AI code?”
  • “How do you ensure AI code quality?”

We’re going to lose deals. Especially in regulated industries where our prospects need to satisfy compliance requirements.

Tracking AI code isn’t just an engineering process question - it’s a business capability we need.

Support For Transparent Tracking

I’m 100% on board with what Michelle and Keisha proposed.

Track AI usage. Not to punish anyone, but to:

  • Understand true development costs
  • Make informed product decisions
  • Answer customer questions honestly
  • Improve our overall process

From product perspective, I’d even be happy to help prioritize features where we pilot different approaches - some AI-heavy, some AI-light - to learn what works best.

The Real Question

Is the speed boost worth the quality cost?

Because right now, I don’t actually know if we’re shipping faster. We might just be shifting work from development to debugging.

And if that’s the case, I’d rather have realistic timelines and higher quality than fake velocity and bug backlogs.

What data would help product and engineering make these trade-offs more explicitly?