AI Writes Fast, Humans Fix Slow: The 1.7× Issue Rate No One's Tracking

Two weeks ago, I shipped a seemingly simple PR: a modal component for our design system. Built it in an afternoon using Cursor. Felt like magic. :sparkles:

Then QA tested it.

Twelve accessibility violations. Twelve.

It took three days to fix what took three hours to generate.

And here’s the thing - the AI didn’t write bad code. It wrote code that worked perfectly… if you only use a mouse. If you only have perfect vision. If you’ve never heard of WCAG compliance.

The code compiled. The tests passed. The demo looked beautiful.

It just wasn’t… right.

The Quality Gap in AI-Generated Code

There’s a stat that’s been haunting me since I saw it last week:

Pull requests with AI-generated code have 1.7× more issues than human-written code (source).

Not “slightly more” issues. Not “a few edge cases.” 70% more problems.

And we’re not tracking it separately. We’re lumping AI code and human code into the same metrics, wondering why our overall quality is declining.

What AI Doesn’t Understand: The “Why” Behind the Code

From a design systems perspective, I’m seeing a pattern:

AI optimizes for immediate functionality, not long-term maintainability.

Examples from our codebase:

Design Tokens

  • Human approach: Uses spacing-md, color-primary, font-size-body
  • AI approach: Hardcodes 16px, #2563eb, 14px everywhere
  • Impact: Works perfectly until we rebrand or need dark mode
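As a sketch of the difference (token names mirror the bullets above; the values and the shape of the token map are illustrative, not our actual design system):

```typescript
// Hypothetical token map using the names from the bullets above.
// A rebrand or dark-mode pass is a one-place change here; hardcoded
// values mean a codebase-wide find-and-replace.
const tokens = {
  spacing: { md: "16px" },
  color: { primary: "#2563eb" },
  fontSize: { body: "14px" },
} as const;

// What AI tends to emit: correct today, brittle tomorrow.
const hardcodedStyle = { padding: "16px", color: "#2563eb", fontSize: "14px" };

// Token-based version: identical rendered output, survives a rebrand.
const tokenStyle = {
  padding: tokens.spacing.md,
  color: tokens.color.primary,
  fontSize: tokens.fontSize.body,
};
```

Both render the same pixels today, which is exactly why review is the only place this gets caught.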

Component Composition

  • Human approach: Builds modular components that compose predictably
  • AI approach: Creates monolithic components that “do everything”
  • Impact: Works for the immediate use case, nightmare to maintain

Accessibility

  • Human approach: Keyboard navigation, ARIA labels, focus management from the start
  • AI approach: Mouse-first, visual-first, “we’ll add accessibility later”
  • Impact: Later never comes, or costs 3× to retrofit
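For concreteness, here's the focus-wrap logic a modal focus trap needs, which was one of the things our AI-generated modal omitted. It's written as a pure function so the keyboard behavior is unit-testable without a DOM; a real trap would also query the focusable elements, intercept Tab keydown, and restore focus on close:

```typescript
// Given the currently focused element's index among the modal's
// focusable elements, compute where Tab / Shift+Tab should land.
// Wrapping at the ends is what keeps focus trapped inside the dialog.
function nextFocusIndex(current: number, count: number, shiftKey: boolean): number {
  if (count === 0) return -1; // nothing focusable: caller should focus the dialog itself
  return shiftKey
    ? (current - 1 + count) % count // Shift+Tab: first wraps back to last
    : (current + 1) % count;        // Tab: last wraps forward to first
}
```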

The 41% Bug Increase Nobody Talks About

@eng_director_luis mentioned this stat in the other thread: projects with high AI code usage saw a 41% increase in bugs (source).

In our design system, I can trace where these bugs come from:

1. Context Blindness

  • AI doesn’t know our component hierarchy
  • AI doesn’t understand our design token system
  • AI doesn’t see the dependencies between components
  • AI generates code that works in isolation but breaks in integration

2. Pattern Inconsistency

  • AI learns from Stack Overflow, not our style guide
  • AI mixes different state management patterns in the same component
  • AI uses whatever works, not what’s maintainable

3. Edge Case Ignorance

  • AI handles the happy path beautifully
  • AI forgets error states, loading states, empty states
  • AI doesn’t think about mobile, tablets, screen readers, slow networks

The Review Burden is Real

Here’s the uncomfortable truth: AI-generated code creates more work for reviewers, not less.

When I review human-written code:

  • 20 minutes to understand the approach
  • 10 minutes to check for issues
  • 5 minutes to suggest improvements
  • ~35 minutes total

When I review AI-generated code:

  • 5 minutes to understand (it’s usually straightforward)
  • 40 minutes to check for ALL the things AI might have missed:
    • Accessibility
    • Design token usage
    • Component composition
    • Edge cases
    • Mobile responsiveness
    • Performance implications
    • Maintainability
  • 15 minutes to document what needs to be fixed
  • ~60 minutes total

That example isn't an outlier: across our whole backlog, AI-assisted PRs average 2.3× longer to review than human-written PRs.

Where’s the productivity gain?

My Controversial Proposal: Separate Review Queues

What if we treated AI-heavy PRs as a different category?

Not to shame them. To acknowledge they need different review patterns.

Standard PR review checklist:

  • :white_check_mark: Functional correctness
  • :white_check_mark: Test coverage
  • :white_check_mark: Follows architectural patterns
  • :white_check_mark: Code quality

AI-assisted PR review checklist (additional checks):

  • :white_check_mark: Uses design tokens (not hardcoded values)
  • :white_check_mark: Accessibility compliance (keyboard nav, ARIA, focus)
  • :white_check_mark: Edge cases (error states, loading, empty)
  • :white_check_mark: Mobile/responsive (not just desktop)
  • :white_check_mark: Integrates with existing patterns (doesn’t reinvent)
  • :white_check_mark: Performance (AI loves nested loops and unnecessary re-renders)
  • :white_check_mark: Maintainability (will someone understand this in 6 months?)

Different tools. Different SLAs. Different expectations.

The Question: Should We Track AI Code Quality Separately?

@cto_michelle proposed tracking AI code as a distinct quality category. I think we need to go further:

Separate metrics for:

  • AI code review time vs. human code review time
  • AI code bug density vs. human code bug density
  • AI code rework rate (how often is it substantially refactored within 30 days?)
  • AI code accessibility score vs. human code accessibility score
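A minimal sketch of how the split could be computed. The `aiAssisted` flag is an assumption; in practice it would come from a PR label or commit trailer, which is itself part of what we'd need to start tracking:

```typescript
// Hypothetical per-PR record; "aiAssisted" is an assumed label, not
// something our tooling emits today.
interface PrRecord {
  aiAssisted: boolean;
  bugsFound: number;
  linesChanged: number;
}

// Bug density (bugs per 1,000 changed lines) for one cohort, so AI
// and human code stop being lumped into a single quality number.
function bugDensity(prs: PrRecord[], aiAssisted: boolean): number {
  const cohort = prs.filter((p) => p.aiAssisted === aiAssisted);
  const bugs = cohort.reduce((sum, p) => sum + p.bugsFound, 0);
  const lines = cohort.reduce((sum, p) => sum + p.linesChanged, 0);
  return lines === 0 ? 0 : (bugs / lines) * 1000;
}
```

The same cohort split works for review time, rework rate, and accessibility scores.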

Not to stigmatize AI code. To understand its actual cost and value.

Right now we’re celebrating “41% of code is AI-generated!” without asking:

  • How much of that 41% shipped to production?
  • How much of it had to be significantly reworked?
  • How much reviewer time did it consume?
  • How much production debt did it create?

What’s Actually Working

After our modal accessibility disaster, we changed our process:

Before AI generates code:

  1. I create the component spec in Figma
  2. I document accessibility requirements explicitly
  3. I list which design tokens must be used
  4. I map out component relationships
  5. Then I let AI generate the implementation

Result: AI-generated code that actually integrates with our system.

Still faster than writing by hand. Way slower than letting AI run wild.

But the code that ships is actually maintainable.

The Uncomfortable Question for Everyone

Are we accepting “good enough” more often because AI wrote it?

When humans write code, we scrutinize every decision. When AI writes code, we just check if it works.

That’s a mistake.

AI makes it easy to generate lots of code fast. That doesn’t mean the code is good. And it definitely doesn’t mean we should review it less carefully.

If anything, we should review it more carefully - because the AI doesn’t understand our system, our constraints, our users, or our future maintenance burden.

What are others seeing? Is the 1.7× issue rate real in your codebases? How are you handling the review burden?

Maya, your accessibility example hits hard because we’re seeing the exact same pattern in a totally different domain: financial compliance code.

The Compliance Blind Spot

Last month, one of our engineers used AI to generate code for processing wire transfers. The AI produced beautiful, clean, well-tested code that handled:

  • Input validation :white_check_mark:
  • Error handling :white_check_mark:
  • Database transactions :white_check_mark:
  • API response formatting :white_check_mark:

What it didn’t handle:

  • OFAC sanctions screening
  • Anti-money laundering (AML) reporting thresholds
  • Reg E compliance for error resolution
  • Audit trail requirements

The AI doesn’t know that wire transfers over $10,000 trigger specific reporting requirements. It doesn’t know that certain countries require additional screening. It doesn’t know that we need immutable audit logs for regulatory examiners.

The code worked perfectly. It was also a compliance violation waiting to happen.
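To make the gap concrete, here's a sketch of the kind of pre-flight gate the generated code never had. The threshold is the reporting figure mentioned above; the country codes are placeholders, and real screening logic would come from the compliance team, not from a code example:

```typescript
// Illustrative compliance pre-flight for a wire transfer -- a sketch
// of the checks the AI-generated code skipped, NOT our real rules.
interface Wire {
  amountUsd: number;
  counterpartyCountry: string; // ISO country code
}

const REPORTING_THRESHOLD_USD = 10_000;        // reporting trigger from the post
const EXTRA_SCREENING = new Set(["XX", "YY"]); // placeholder country codes

// Returns the compliance actions this transfer triggers. Even an empty
// result would still be written to the immutable audit trail.
function complianceFlags(w: Wire): string[] {
  const flags: string[] = [];
  if (w.amountUsd > REPORTING_THRESHOLD_USD) flags.push("file-report");
  if (EXTRA_SCREENING.has(w.counterpartyCountry)) flags.push("sanctions-screening");
  return flags;
}
```

The point isn't that this logic is hard to write; it's that nothing in the prompt or the training data tells the model it has to exist.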

We caught it in review - but only because our senior engineer has 12 years of fintech experience and knows to look for this stuff.

The Risk with Specialized Domains

AI is trained on general programming knowledge. It’s not trained on:

  • Your industry’s regulatory requirements
  • Your company’s compliance obligations
  • Your internal security standards
  • Your architectural guard rails

In financial services, that gap can cost millions in fines.

In healthcare, it could violate HIPAA.

In government contracting, it could break security clearances.

The more specialized your domain, the more dangerous AI-generated code becomes without expert review.

Our Review Time Stats

You mentioned 2.3× longer review time for AI code. Our numbers are similar:

Human-written code (financial domain):

  • ~45 minutes for complex feature review
  • Senior engineers know what to look for
  • Clear patterns established over years

AI-generated code (financial domain):

  • ~90-120 minutes for same complexity
  • Must verify every compliance requirement manually
  • Must check for security patterns AI doesn’t understand
  • Must validate against regulatory rules AI can’t know

We’re not reviewing faster. We’re reviewing way more carefully because the stakes are higher.

What We’re Testing: Required Human Design Phase

Your “spec before AI” approach resonates. We implemented something similar:

For any financial transaction code:

  1. Senior engineer writes a technical design doc that includes:
    • Regulatory requirements (with citations)
    • Security requirements (with threat model)
    • Audit trail requirements
    • Rollback and error handling procedures
  2. Design doc gets reviewed by compliance team
  3. Only then can AI generate implementation
  4. Implementation gets second review against the design doc

It’s slower. Way slower than letting AI run wild.

But we’ve had zero compliance issues in AI-generated code since we started this process.

Before? We caught three potential violations in two months.

The Question of Trust

“Are we accepting ‘good enough’ more often because AI wrote it?”

Yes. And it scares me.

When a junior engineer writes financial code, I ask hard questions:

  • “Why did you choose this approach?”
  • “What happens if this external API is down?”
  • “How are you handling the reconciliation edge case?”

When AI writes the same code, I find myself asking softer questions:

  • “Does it compile?”
  • “Did the tests pass?”
  • “Is the logic correct?”

I’m not probing for understanding the same way. And that’s dangerous.

Because in six months, when we need to debug a production issue at 2am, the engineer who “wrote” the code with AI can’t explain why it was built that way.

The Mentoring Problem

This connects to what @vp_eng_keisha mentioned about junior engineers not learning.

I have a junior engineer who can ship features incredibly fast with AI. But in code review, when I ask “Why did you structure it this way?”, the answer is often:

“I don’t know, that’s what Copilot suggested.”

That’s not learning. That’s copying.

In financial services, we can’t have engineers who can’t explain their own code. The regulators won’t accept “the AI did it.”

Maya, I’d love to hear more about your pre-AI specification process. Are you seeing better integration results? And how do you balance the “slower but better” trade-off when PMs are pushing for speed?

The 1.7× issue rate and the 41% bug increase - these numbers are exactly why our AI investment business case is falling apart.

The Math That Doesn’t Math

Our board approved $2M in AI tooling budget last year based on this logic:

  • AI makes developers 20-30% more productive
  • Same output with 20-30% fewer developers
  • ROI: Save $5M in avoided hiring

Six months later, here’s the reality:

  • Code output is up 25% :white_check_mark:
  • Developers needed: Same number :cross_mark:
  • Bug rates: Up 23.5% :cross_mark:
  • Review time: Up 91% :cross_mark:
  • Deployment incidents: Up 30% :cross_mark:

The productivity gains evaporated in quality issues.

The Hidden Cost of “Fast” Code

Maya, your accessibility example is the perfect illustration. Let me translate it to CFO language:

Time to generate code with AI: 3 hours
Cost: $150 (loaded cost for mid-level engineer)

Time to fix the code: 3 days
Cost: $3,600 (loaded cost for senior engineer + QA + design review)

Net result: roughly 2.75 days lost and $3,450 in extra cost (fix cost minus generation cost)

That’s a 24× cost multiplier for the “fast” AI code.

Now multiply that across every AI-generated PR that has to be substantially reworked. The “productivity gains” become “productivity debt.”
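The cost comparison above as a quick sanity check; the dollar figures are the post's loaded-cost estimates:

```typescript
// Loaded-cost figures from the example above.
const generateCost = 150;  // 3 hours, mid-level engineer, AI-assisted
const fixCost = 3600;      // 3 days, senior engineer + QA + design review

// How many times more expensive fixing was than generating.
const multiplier = fixCost / generateCost;

// Extra spend on fixing beyond what generation cost.
const netLoss = fixCost - generateCost;
```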

Strategic Concern: This Undermines AI Adoption

Here’s what worries me as a CTO:

If developers experience AI as “fast code that creates slow problems,” they’ll resist adoption of AI where it could actually help - testing, documentation, ops automation.

We’re poisoning the well by over-indexing on code generation speed without accounting for integration cost.

The Proposal: AI Quality Score

I’m proposing to our engineering leadership that we track an “AI Quality Score” as part of our code review metrics.

For each AI-assisted PR, track:

  1. Initial review time - How long to review?
  2. Rework cycles - How many rounds of fixes?
  3. Bug escape rate - Did issues reach production?
  4. Architectural fit - Does it follow our patterns?
  5. Maintainability score - Will someone understand it in 6 months?

Aggregate these into a score: 0-100 points

If score < 70: Flag for deeper architectural review
If score < 50: Consider human rewrite
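One possible aggregation, sketched with placeholder weights; the inputs are each normalized to 0-1 (1 meaning "as good as our human baseline"), and the weighting would need calibration against real outcomes before anyone trusts it:

```typescript
// Normalized per-PR signals, each in [0, 1].
interface PrSignals {
  reviewTime: number;      // baseline review time / actual, capped at 1
  reworkCycles: number;    // 1 / (1 + rounds of fixes)
  bugEscape: number;       // 1 if nothing reached production, else 0
  archFit: number;         // reviewer judgment, 0-1
  maintainability: number; // reviewer judgment, 0-1
}

// Placeholder weights summing to 1.0; bug escapes weighted heaviest.
const WEIGHTS: PrSignals = {
  reviewTime: 0.15,
  reworkCycles: 0.2,
  bugEscape: 0.3,
  archFit: 0.2,
  maintainability: 0.15,
};

// Aggregate into the proposed 0-100 score.
// < 70 flags deeper architectural review; < 50 suggests a human rewrite.
function aiQualityScore(s: PrSignals): number {
  let score = 0;
  for (const key of Object.keys(WEIGHTS) as (keyof PrSignals)[]) {
    score += s[key] * WEIGHTS[key];
  }
  return Math.round(score * 100);
}
```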

This gives us data to answer:

  • Which types of code does AI handle well? (Maybe it’s great at CRUD endpoints but terrible at complex state management)
  • Which engineers use AI effectively? (Maybe they’re better at prompting or better at reviewing AI output)
  • Where should we invest in AI education?

The Organizational Investment Required

If PR review time is up 91% and AI code has 1.7× more issues, we need to invest in review infrastructure:

Option 1: Dedicated AI Code Review Specialists

  • Senior engineers who specialize in reviewing AI-generated code
  • They know what to look for, they’re fast at it
  • Expensive, but maybe cheaper than having all senior engineers slow down

Option 2: Automated AI Code Quality Gates

  • Static analysis that checks for AI-common mistakes
  • Accessibility scanners that catch what AI misses
  • Architectural linters that enforce our patterns
  • Catch issues before human review

Option 3: Both

  • Automation catches the obvious stuff
  • Specialists handle the nuanced review
  • Regular engineers get trained over time
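Option 2's gate could be as simple as a script in CI that aggregates scanner findings and blocks human review until the automated categories are clean. Tool names here are illustrative stand-ins, not real scanner integrations:

```typescript
// One finding from any automated scanner in the pipeline.
interface Finding {
  tool: string; // e.g. an a11y scanner or a design-token linter (illustrative)
  rule: string;
  file: string;
}

// Which tools block the PR from reaching a human reviewer at all.
const BLOCKING_TOOLS = new Set(["a11y-scan", "token-lint"]);

// Summarize findings per tool and decide whether human review can start.
function gate(findings: Finding[]): { passes: boolean; counts: Record<string, number> } {
  const counts: Record<string, number> = {};
  for (const f of findings) counts[f.tool] = (counts[f.tool] ?? 0) + 1;
  const passes = findings.every((f) => !BLOCKING_TOOLS.has(f.tool));
  return { passes, counts };
}
```

Automation catches the mechanical stuff so the specialist's hour goes to the nuanced review.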

None of these are free. But they might be necessary if we want to actually realize AI productivity gains.

The Question for the Room

Maya asked: “Should we track AI code quality separately?”

My answer: Yes, and we should tie AI tool budget to AI code quality metrics.

If AI code has 1.7× more issues and takes 2× longer to review, then the “productivity” claim needs to be re-evaluated.

Either:

  1. We figure out how to use AI in ways that don’t create quality debt, OR
  2. We acknowledge that AI’s value is in specific use cases (prototyping, scaffolding, documentation), not wholesale code generation

Right now we’re in a messy middle where we’re adopting AI everywhere without understanding where it actually helps vs. where it hurts.

The 41% bug increase is a symptom. The disease is adopting tools without adapting processes.

@maya_builds - your separate review queue proposal is smart. What if we piloted it for one team and measured the difference? I’d be curious if dedicated “AI code reviewers” could develop patterns that make review faster without sacrificing quality.

This thread is making me relive a very painful product lesson from three months ago.

The Customer Impact Story Nobody Wants to Tell

We shipped a new feature in January that our biggest enterprise customers had been asking for: bulk user import.

The engineering team knocked it out in record time. Two weeks from spec to production. Everyone was celebrating the velocity.

The code? About 60% AI-generated. The engineer used Copilot heavily and it showed - clean, well-structured, tested.

Week 1 in production: Three customers reported data inconsistencies after bulk imports.

Week 2: Data loss incident. A customer imported 10,000 users. 847 of them lost associated metadata due to a race condition.

Week 3: We had to write a data recovery tool, contact every customer who used the feature, and issue credits.

Cost: $180K in credits, two weeks of engineering time to fix, immeasurable reputation damage.

Root cause: AI-generated code didn’t handle concurrent database writes correctly. The tests passed because we tested with small datasets. The code review didn’t catch it because the logic looked sound.

The code worked. Until it didn’t.
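For anyone curious what the fix looked like in spirit: the missing safeguard was write coordination on the metadata rows. A sketch of one standard approach, optimistic concurrency with a version check, so two import workers can't silently overwrite each other (field names are illustrative, not our actual schema):

```typescript
// A user's metadata row with a version counter for optimistic locking.
interface Row {
  version: number;
  metadata: Record<string, string>;
}

// Apply a metadata patch only if nobody wrote since we read the row.
// Returns the updated row, or null if we lost the race -- the caller
// re-reads and retries instead of clobbering the other writer's data.
function applyUpdate(
  current: Row,
  expectedVersion: number,
  patch: Record<string, string>
): Row | null {
  if (current.version !== expectedVersion) return null; // concurrent write detected
  return {
    version: current.version + 1,
    metadata: { ...current.metadata, ...patch },
  };
}
```

In a real database this is a conditional `UPDATE ... WHERE version = ?`, but the principle is the same: the happy-path code worked; only concurrency broke it.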

The PM Question: Speed vs. Safety

Here’s my dilemma:

My job is to ship features that customers want, as fast as we can, without breaking things.

AI promises speed. But if that speed comes with a 41% increase in bugs and a 30% increase in failure rates (source), then we’re not actually shipping faster - we’re just shipping broken things faster.

That’s worse than shipping slowly.

The Framework I’m Now Using: AI for MVPs, Humans for Production

After the bulk import disaster, I proposed a new guideline to engineering:

Use AI for:

  • Prototypes and proof-of-concepts
  • Internal tools (where failure is low-impact)
  • Scaffolding and boilerplate
  • Test data generation
  • Documentation

Require human-led design for:

  • Production features with customer data
  • Payment processing
  • Security-critical paths
  • Performance-sensitive operations
  • Complex state management

It’s slower. But it acknowledges the reality: AI is great at generating code, not great at understanding consequences.

The Customer Perspective Maya’s Missing

Maya, your accessibility example is perfect, but there’s another layer:

Our customers don’t care that AI wrote the code.

They care that:

  • The feature doesn’t work for keyboard users
  • They have to explain to their legal team why our app isn’t WCAG compliant
  • They might get sued because our software enabled discrimination against disabled users

When we say “AI-generated code has 1.7× more issues,” we’re talking about engineering metrics.

From a customer perspective, it’s:

  • 1.7× more reasons to churn
  • 1.7× more support tickets
  • 1.7× more trust erosion

That’s the business impact we’re not measuring.

The Question for Engineering Leaders

When @cto_michelle proposes tracking “AI Quality Score,” I’d add: Track customer impact separately.

For AI-assisted features:

  • Support ticket volume (vs. human-written features)
  • Bug reports per user
  • Customer satisfaction scores
  • Feature adoption rates
  • Churn correlation

If AI code leads to worse customer outcomes, then the “productivity gains” are a lie.

We’re not productive if we’re shipping things customers don’t trust.

What I’m Taking Back to My Team

I’m going to start flagging “AI-heavy” features in our product roadmap and requiring:

  1. Extended QA cycles - If AI wrote 60%+ of the code, double the testing time
  2. Customer-facing beta - Let friendly customers test before general release
  3. Rollback plans - Always have a way to turn it off quickly
  4. Monitoring dashboards - Track feature health metrics in real-time

It’s acknowledging that AI code is higher risk until we prove otherwise.

Not anti-AI. Just pro-customer.

Maya, how do you balance the PM pressure to “ship fast” with the design systems reality of “AI code needs more review”? Do you push back on timelines, or do you let some quality slide?