Pull Requests with AI Code Have 1.7× More Issues—Should We Be Flagging AI-Generated PRs Differently?

I’ve been noticing something at our EdTech startup over the past few months—our PR review process has gotten… messier. More back-and-forth, more bugs caught in review, more “wait, this logic doesn’t handle edge cases” comments. At first, I thought we’d hired too fast or needed better onboarding. Then I saw the data.

The Numbers Don’t Lie

CodeRabbit just released their State of AI vs Human Code Generation report, analyzing 470 open-source GitHub pull requests. The finding that jumped out: AI-generated code creates 1.7× more issues compared to human-written code.

But it’s not just volume—it’s severity and type:

  • Logic and correctness issues: up 75% (business logic errors, misconfigurations, unsafe control flow)
  • Code quality and maintainability: 1.64× more frequent
  • Security vulnerabilities: 1.57× higher
  • Performance issues: 1.42× more common

And the kicker? AI-authored PRs contain 1.4× more critical issues and 1.7× more major issues on average.

The Productivity Paradox

Here’s what makes this complicated: while pull requests per author increased by 20% year-over-year (thank you, AI coding assistants), incidents per pull request increased by 23.5%. We’re shipping more code faster, but we’re also shipping more problems.

At my previous role at Slack, we obsessed over defect escape rates. If we’d seen this trend, alarms would be going off. But in 2026, with 85% of developers using AI coding tools, this is just… normal now?

So What Do We Actually Do About It?

This is where I need the community’s perspective. Should we be treating AI-generated PRs differently in our review process? Here are the options I’m considering:

Option 1: Add a Flag/Label for AI-Generated Code

Simple PR label: “:robot: AI-assisted” or similar. Makes reviewers aware, but doesn’t change the process.

Pros: Transparency, easy to implement, opt-in
Cons: Could create stigma, relies on honor system, might be ignored

Option 2: Require Extra Scrutiny on High-Risk Areas

Focus reviews on where AI struggles most: authentication, authorization, state management, security boundaries. Bright Security’s best practices recommend treating any AI code that touches identity, access, or state as high-risk by default.

Pros: Targeted, evidence-based, addresses actual risk
Cons: Requires defining “high-risk,” needs reviewer training

Option 3: Use AI Review Tools to Catch AI Mistakes

Anthropic just launched Claude Code Review specifically to address this. Fight fire with fire?

Pros: Scalable, catches patterns humans miss, reduces reviewer burden
Cons: False positives, cost, still need human oversight

Option 4: Keep Process the Same, Trust Human Reviewers

Maybe this is a temporary problem. Maybe reviewers will adapt and get better at catching AI-generated issues.

Pros: No process change, avoids complexity
Cons: Ignores the data, increases reviewer cognitive load

Where I’m Landing

I’m leaning toward a combination of Option 2 (focus on high-risk areas) with light implementation of Option 1 (optional flagging). Based on what I’ve read, AI consistently struggles with security boundaries—it optimizes for the happy path while attackers exploit the edge cases.

In our Q2 planning, I’m proposing:

  • Enhanced review checklist for authentication, authorization, and state transitions (regardless of AI usage)
  • Optional “AI-assisted” label for transparency
  • Team education on common AI code patterns and their failure modes
  • Experiment with AI review tools on a subset of repos

But I could be totally wrong!

What I’m Curious About

  • Have you noticed similar quality issues with AI-generated code on your teams?
  • What review practices have you changed (if any) to adapt to AI-assisted development?
  • Are you measuring this? What metrics tell you if AI is helping or hurting overall code quality?
  • Is this a permanent tradeoff (speed vs. quality) or a temporary growing pain as AI tools improve?

The data says we have 1.7× more issues to deal with. The question is: do we adapt our processes, or do we accept the tradeoff as the cost of 20% more throughput?

I’d love to hear how other engineering leaders are thinking about this.

This hits close to home—we just went through this exact debate at our financial services company.

I was seeing the same patterns you describe: more PRs, faster feature delivery, but our post-deployment bug rate was creeping up. The engineering team wanted to try flagging AI-generated PRs. I approved it, thinking transparency would help. It created friction instead.

What Didn’t Work: Mandatory Flagging

We implemented a required “AI-assisted” label on all PRs. Within two weeks, I started hearing feedback through my skip-levels: developers felt like they were being marked for extra scrutiny. Some of the more senior engineers stopped using AI tools publicly to avoid the “taint.” Others gamed the system—technically true that they “edited” the AI code, so was it really AI-generated?

The honor system was inconsistent. The label became meaningless because we couldn’t agree on what percentage of AI contribution required flagging. 30%? 50%? 80%?

We killed the experiment after a month.

What Actually Worked: Automated High-Risk Checks

Instead of flagging based on origin (AI vs human), we refocused on risk areas. This aligned with our existing security and compliance frameworks (we’re in fintech—regulation doesn’t care if AI wrote the code).

Here’s what we implemented:

1. Automated Detection for Critical Paths
We use automated static analysis to flag PRs that touch:

  • Authentication and session management
  • Authorization and permission checks
  • State transitions (especially money movement)
  • Input validation and sanitization
  • Database access patterns

These PRs get mandatory security architecture review, regardless of who or what wrote them.
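For anyone wanting to try this, the triage itself is mechanical. Here's a hedged sketch in Python; the directory patterns are invented placeholders, not our actual repo layout:

```python
# Hypothetical sketch of path-based risk triage. The patterns below are
# assumptions for illustration; adapt them to your own repo structure.
import fnmatch

# Paths whose changes trigger mandatory security architecture review.
HIGH_RISK_PATTERNS = [
    "*/auth/*",
    "*/sessions/*",
    "*/permissions/*",
    "*/payments/*",
    "*/migrations/*",
]

def high_risk_files(changed_paths):
    """Return the subset of changed files that fall in a high-risk area."""
    return [
        path for path in changed_paths
        if any(fnmatch.fnmatch(path, pat) for pat in HIGH_RISK_PATTERNS)
    ]

def needs_security_review(changed_paths):
    """True if any changed file touches a high-risk area."""
    return bool(high_risk_files(changed_paths))
```

A CI job can run this over the PR's changed-file list and apply a required-reviewer rule when it returns True, regardless of who (or what) wrote the code.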

2. Advisory AI Review, Not Blocking
We integrated an AI code review tool (similar to what Keisha mentioned with Claude Code Review), but it’s advisory only. It posts comments but never sets a required check status.

Why advisory? Because developers are really sensitive to false positives. If the tool blocks merges on style issues or subjective “code smells,” people will just override it or route around it. We deliberately do less to maintain trust.

3. Focus on Business Logic, Ignore Generated Files
We filter out lock files, generated assets, and anything in /dist or /build. We also skip files with 500+ lines of changes—those need human review anyway.
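The filtering is equally simple to automate. A sketch with made-up ignore patterns and the 500-line threshold from above:

```python
# Hypothetical filter deciding which changed files the advisory AI
# reviewer sees. The ignore patterns are assumptions for illustration.
import fnmatch

IGNORE_PATTERNS = [
    "*.lock",
    "package-lock.json",
    "dist/*",
    "build/*",
    "*.min.js",
]

MAX_CHANGED_LINES = 500  # beyond this, route straight to human review

def reviewable_files(changes):
    """changes: list of (path, lines_changed) tuples from the diff."""
    keep = []
    for path, lines_changed in changes:
        if any(fnmatch.fnmatch(path, pat) for pat in IGNORE_PATTERNS):
            continue  # generated or vendored file: skip entirely
        if lines_changed >= MAX_CHANGED_LINES:
            continue  # too large for automated review; humans handle it
        keep.append(path)
    return keep
```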

The Example That Validated This Approach

Three weeks after we rolled out the high-risk checks, the automated scanner caught a session handling bug in a “minor refactor” PR. The code looked clean, passed tests, and was authored by one of our senior engineers (who used an AI coding assistant).

The issue: the refactor changed how we validated session tokens. The new logic worked for the happy path but failed to invalidate tokens on logout in certain edge cases. In production, this would’ve meant logged-out users could still access their accounts under specific conditions.

Our previous flagging system wouldn’t have caught this—it would’ve been buried in “AI code needs extra review” alert fatigue. The high-risk automation caught it because it touched authentication state.

My Recommendation

Keisha, I think your Option 2 is right, but I’d challenge the “optional flagging” part of Option 1. In my experience, optional flags create ambiguity and inconsistency.

Instead:

  • Skip the flag entirely. Focus review intensity on risk, not authorship.
  • Automate the high-risk detection. Don’t rely on reviewers to remember the checklist.
  • Make AI review tools advisory. Blocking creates backlash.
  • Educate on failure patterns. Host a “common AI code mistakes” lunch-and-learn.

The goal isn’t to stigmatize AI—it’s to adapt review rigor to match actual risk. In fintech, we learned that compliance frameworks don’t distinguish between AI and human code. The regulatory expectation is the same: did you validate it works correctly and securely?

That’s the standard we should hold ourselves to.

This resonates so much with what I’m seeing in design systems work. The volume problem is real, and it’s not just code—it’s components, patterns, everything.

The AI Component Generation Problem

Last month, I asked Claude to help me generate a new form input component for our design system. It gave me beautiful code: proper TypeScript, clean props interface, even included basic validation logic. I was thrilled—until I started testing edge cases.

The component worked perfectly for the happy path: single input field, standard validation, one error state. But when I tested:

  • Multiple simultaneous error states
  • Dynamic field validation rules
  • Keyboard navigation flow
  • Screen reader announcements

…it fell apart. The state management was optimized for “user types valid input” but completely broke on “user types, deletes, retypes, switches fields mid-validation.”

AI optimizes for linear flows. Real users are chaotic.

The Accessibility Blind Spot

Here’s what really worried me: the AI-generated code looked accessible. It had aria-label attributes, it used semantic HTML, it even had focus styles. But when I ran it through actual WCAG validation:

  • :cross_mark: Error announcements weren’t properly associated with inputs
  • :cross_mark: Focus order didn’t match visual order in certain states
  • :cross_mark: Loading states had no announcement for screen readers
  • :cross_mark: Error recovery patterns were missing

The code passed our automated linters. It would’ve passed PR review if I hadn’t specifically tested with assistive technology. And I only caught it because I was being paranoid about the AI-generated code.

How many teams are shipping “accessible-looking” code that fails real users?
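For what it's worth, the broken error association is the one failure in that list that's cheap to check mechanically. A minimal stdlib sketch of that single check (a real audit still needs axe-core or manual assistive-technology testing):

```python
# Minimal sketch: verify that every aria-describedby reference points at
# an element that actually exists in the markup. This catches only the
# broken-association failure described above, nothing more.
from html.parser import HTMLParser

class AriaRefChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.ids = set()        # every id= declared in the document
        self.referenced = []    # (tag, id referenced by aria-describedby)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "id" in attrs:
            self.ids.add(attrs["id"])
        if "aria-describedby" in attrs:
            # aria-describedby may hold a space-separated list of ids
            for ref in attrs["aria-describedby"].split():
                self.referenced.append((tag, ref))

def dangling_aria_refs(html):
    """Return referenced ids that no element in the document declares."""
    checker = AriaRefChecker()
    checker.feed(html)
    return [ref for _, ref in checker.referenced if ref not in checker.ids]
```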

Are We Measuring the Right Things?

This connects to something Keisha mentioned about the productivity paradox. We’re celebrating 20% more PRs merged, but are we tracking:

  • Time spent fixing bugs in AI-generated code vs. time saved writing it?
  • Technical debt accumulation from “it works but isn’t maintainable” code?
  • The compound cost of rushing code that needs to be rewritten in 6 months?

Luis’s point about fintech regulation is spot-on. In design, we have WCAG and ADA compliance. The law doesn’t care if AI wrote the code—if it fails accessibility standards, we’re liable.

My Take on Review Process

I’m planning to pitch this to my team for our Q2 design system work:

1. Focus reviews on logic, not style
AI-generated code will look clean. The problems are in edge case handling and state management. Train reviewers to specifically test non-linear user flows.

2. Require real-device testing for AI-generated components
Not just “does it render,” but “does it work with keyboard navigation, screen readers, and high contrast modes.”

3. Track maintenance burden, not just velocity
Measure: how many follow-up PRs to fix issues in AI-generated components vs. human-written ones?

The Question I’m Wrestling With

Keisha asked if this is a permanent tradeoff (speed vs. quality) or temporary growing pains.

I think it depends on whether we adapt our definition of “done.”

If we keep measuring success as “code merged,” then yes—permanent tradeoff.
If we start measuring “code that doesn’t need immediate follow-up fixes,” then maybe we’re just in a learning phase.

But honestly? I’m skeptical that AI will magically get better at edge cases. Edge cases are edge cases because they’re contextual, user-specific, and hard to predict. AI models are trained on common patterns. By definition, they’ll miss the uncommon ones.

Maybe the answer isn’t “better AI” but “better humans using AI.”

What if we used AI to generate multiple component variations, then human designers/engineers pick the best approach instead of accepting the first output? That matches how I actually work with AI now—treat it like a brainstorming partner, not an oracle.

Reading this thread from the product side, I’m trying to understand the actual business cost of this tradeoff. The 1.7× issue rate sounds concerning, but is it worth slowing down development?

The Numbers I Care About As VP Product

Keisha’s data shows:

  • :white_check_mark: 20% more PRs per engineer (good for product velocity)
  • :cross_mark: 23.5% more incidents per PR (bad for customer experience)

But here’s what I need to know to make informed decisions:

1. What’s the net impact on time-to-market?
If AI lets us ship features 20% faster but we spend 25% more time fixing bugs, we may actually be slower end to end, depending on how much of the cycle bug-fixing occupies. Has anyone measured the full cycle time from “feature spec” to “stable in production”?
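To make that concrete, here's a toy model with invented numbers showing the tradeoff can go either way depending on how bug-heavy the cycle already is:

```python
# Back-of-envelope model with made-up numbers: a 20% coding speedup plus
# 25% more bug-fix time nets out faster or slower depending on the mix.
def total_cycle(coding_days, bugfix_days, speedup=0.20, bug_penalty=0.25):
    """Cycle time after applying the speedup and the bug-fix penalty."""
    return coding_days * (1 - speedup) + bugfix_days * (1 + bug_penalty)

# Coding-heavy cycle: 10 days coding, 4 days bug-fixing (baseline 14).
with_ai = total_cycle(10, 4)    # 8.0 + 5.0 = 13.0 -> still faster overall

# Bug-heavy cycle: 5 days coding, 10 days bug-fixing (baseline 15).
bug_heavy = total_cycle(5, 10)  # 4.0 + 12.5 = 16.5 -> now slower
```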

2. What’s the customer-facing cost?
Luis mentioned catching a session handling bug before production. That’s a win for the review process. But how many AI-generated bugs are escaping to customers? Are we trading development speed for customer trust?

3. Is technical debt compounding faster?
Maya’s point about “looks clean but isn’t maintainable” worries me. In 6 months, does that 20% velocity gain turn into a 40% slowdown because we’re drowning in refactoring?

The Framework I’m Using

At my previous role at Airbnb, we had a saying: “Speed is a feature, but so is reliability.”

I think about AI-assisted development in the same way we think about any build vs. buy decision. There’s a tradeoff curve:

  • Early-stage / MVP features: Optimize for speed. Ship AI-generated code fast, validate with users, iterate.
  • Core product flows: Balanced approach. Use AI but invest in thorough review and testing.
  • Critical infrastructure: Optimize for quality. AI might be more risk than it’s worth.

Where I’m struggling: how do I help my PMs make this call for their features?

The Measurement Problem

Luis’s automated-check approach suggests one obvious metric: review cycles per PR. I’d also want to track:

  • Defect escape rate by feature type (does AI impact critical flows more?)
  • Time-to-resolution for AI vs. human bugs (are AI bugs harder to fix?)
  • Customer-reported incident rate trends (are users noticing more issues?)
  • NPS correlation with release quality (does rushing code hurt satisfaction?)

Right now, I don’t have this data. We’re celebrating faster shipping without measuring whether we’re creating long-term problems.

What I’m Taking Back to My Team

This discussion is making me realize: we need to redefine “done” before we can assess if AI is helping.

If “done” = “PR merged,” then yes, AI is a productivity boost.
If “done” = “feature is stable and maintainable,” then… we don’t actually know yet.

I’m going to propose we start tracking:

  1. Feature stability metric: Days until first customer-reported bug
  2. Maintenance cost metric: Follow-up PRs within 30 days of initial merge
  3. Review efficiency metric: Ratio of review time to initial coding time
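Metric #2 is the easiest to compute from merge data alone. An illustrative sketch (the data shapes here are invented for the example):

```python
# Illustrative computation of metric #2: follow-up PRs merged within 30
# days of the initial merge. Input shapes are assumptions for the sketch.
from datetime import date, timedelta

FOLLOW_UP_WINDOW = timedelta(days=30)

def followup_count(initial_merge, later_merges):
    """later_merges: merge dates of subsequent PRs touching the feature."""
    cutoff = initial_merge + FOLLOW_UP_WINDOW
    return sum(1 for d in later_merges if initial_merge < d <= cutoff)
```

Comparing this count for AI-assisted vs. human-written features over a quarter would give a first read on the maintenance burden question.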

My hypothesis: AI will look fine on metric #1 at first (customer-facing bugs take time to surface), neutral on #2 (more bugs but smaller fixes?), and concerning on #3 (reviews take longer because reviewers have to verify AI logic).

The Big Question

Is this a temporary adjustment period (AI tools improve, teams learn best practices), or a permanent tradeoff (AI fundamentally optimizes for speed at the expense of edge-case quality)?

Maya’s skepticism about edge cases feels right to me. Edge cases are called “edge” cases because they’re rare, contextual, and domain-specific. If AI is trained on common patterns, it will by definition miss the uncommon ones.

But maybe that’s okay? Maybe the answer is:

  • Use AI aggressively for common cases (90% of code)
  • Use human expertise for edge cases and critical paths (10% of code)
  • Build processes to identify which is which

Luis’s “high-risk area” approach feels like this philosophy in action.

I’d love to hear: what metrics are other product/engineering teams using to measure whether AI is actually improving overall delivery, not just initial coding speed?

This is an excellent thread, and I want to zoom out to the strategic level. The 1.7× issue rate isn’t just a code review problem—it’s a systems architecture problem.

The Real Issue: AI Doesn’t Understand Non-Functional Requirements

As CTO, I think about code quality across multiple dimensions:

  • Functional: Does it do what the spec says? (AI is decent here)
  • Security: Can it be exploited? (AI struggles—1.57× more vulnerabilities)
  • Performance: Will it scale? (AI often misses this—1.42× more issues)
  • Maintainability: Can the next engineer understand and modify it? (AI frequently fails)
  • Operational: Can we debug it in production? (AI-generated code often lacks observability)

AI coding assistants are trained to satisfy functional requirements. They optimize for “make the test pass.” But software quality depends on satisfying all these dimensions simultaneously.

This is why review processes need to fundamentally change.

A Real Example from Our Cloud Migration

Six months ago, my team was migrating our monolith to microservices. One of our senior engineers used an AI coding assistant to generate service communication code. The code worked perfectly in dev. Tests passed. Review looked fine.

We deployed to staging, and immediately:

  • :cross_mark: No structured logging—impossible to trace requests across services
  • :cross_mark: No circuit breakers—cascading failures took down multiple services
  • :cross_mark: No retry logic with backoff—thundering herd on transient failures
  • :cross_mark: No timeout configuration—hung connections exhausted connection pools

The code was functionally correct. It just created an operational nightmare.

The AI had optimized for “service A calls service B and gets a response.” It completely missed the distributed systems considerations that any experienced engineer would’ve included: observability, fault tolerance, graceful degradation.

This wasn’t a code review failure. This was an architectural gap.

My Approach: Guardrails Before Generation

Luis’s focus on high-risk areas is right, but I think we need to go further. Don’t just review AI-generated code differently—constrain what AI can generate in the first place.

Here’s what we implemented:

1. Define “AI-Safe” vs. “Human-Required” Boundaries

  • :white_check_mark: AI can generate: utility functions, data transformations, standard CRUD operations, test scaffolding
  • :no_entry: AI should not generate (without senior review): authentication flows, payment processing, service communication, data migration scripts, infrastructure-as-code

2. Architectural Templates with Built-In Best Practices
If engineers are going to use AI to generate service communication code, give them a template that includes:

  • Structured logging hooks
  • Circuit breaker patterns
  • Retry logic with exponential backoff
  • Timeout configuration
  • Health check endpoints

Let AI fill in the business logic within the template, not generate the whole architecture.
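To illustrate the template idea, here's a hedged sketch of just the retry piece; function and parameter names are assumptions, and a production version would also wire in per-request timeouts, structured logging, and a circuit breaker:

```python
# Sketch of one template building block: retries with exponential backoff
# are fixed by the wrapper, and callers only supply the business logic.
# Names and defaults here are assumptions for illustration.
import time

def call_with_retries(operation, *, max_attempts=3, base_delay=0.1,
                      retryable=(TimeoutError, ConnectionError)):
    """Run `operation()` with exponential backoff on transient failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure to the caller
            # Exponential backoff between attempts: 0.1s, 0.2s, 0.4s, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```

The point is that the engineer (or the AI) never writes this loop; they write only the `operation` that goes inside it.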

3. Pre-Merge Automated Checks for Non-Functional Requirements
Our CI/CD now fails PRs that lack:

  • Structured logging in API routes
  • Error handling in database operations
  • Timeout configuration in external service calls
  • Security headers in HTTP responses

These aren’t optional. If AI generates code without them, the build fails.

The AI Review Tool Approach

Keisha’s Option 3 proposed using AI review tools (like Claude Code Review). We’re experimenting with this, and my take is: use them, but with human oversight on critical paths.

Where AI review tools excel:

  • Catching common security patterns (SQL injection, XSS, insecure deserialization)
  • Identifying performance anti-patterns (N+1 queries, unnecessary loops)
  • Enforcing code style and consistency

Where they fail:

  • Understanding business context and domain-specific requirements
  • Evaluating architectural tradeoffs (when is complexity justified?)
  • Assessing operational implications (can we debug this in production?)

The key insight from Luis: make them advisory, not blocking. False positives destroy trust.

What We’re Learning

Maya’s question about “temporary vs. permanent tradeoff” is the right one. After six months of experimentation, here’s what I believe:

It’s a permanent tradeoff, but the tradeoff curve will shift.

AI will get better at generating functionally correct code. But it will always struggle with:

  • Edge cases (by definition, rare in training data)
  • Context-specific requirements (your company’s specific operational needs)
  • Architectural judgment (when to optimize for simplicity vs. flexibility)

The winners will be teams that:

  1. Use AI strategically (right tool for right task, based on risk and complexity)
  2. Build strong architectural guardrails (templates, automated checks, clear boundaries)
  3. Invest in senior engineering capacity (junior engineers can’t effectively review AI code)

My Recommendation to Keisha

Your Option 2 is directionally correct, but I’d frame it differently:

Don’t think of this as “reviewing AI code differently.” Think of it as “ensuring all code—human or AI—meets our quality bar across all dimensions.”

Practically:

  • Define your non-negotiable requirements (security, observability, fault tolerance)
  • Encode them in automated checks where possible
  • Train reviewers to focus on what automation can’t catch (business logic, edge cases, maintainability)
  • Use AI review tools as a safety net, not a replacement for human judgment

The goal is not to slow down development. It’s to ensure that speed doesn’t come at the cost of system reliability, security, and maintainability.

Because in my experience, the technical debt from “fast but flawed” code compounds quickly. You get 20% more features in Q1, then spend 40% more time in Q2-Q4 fixing and refactoring.

We’re not replacing human judgment. We’re augmenting it. And that requires different processes, not just different tooling.