Two weeks ago, I shipped a seemingly simple PR: a modal component for our design system. Built it in an afternoon using Cursor. Felt like magic.
Then QA tested it.
Twelve accessibility violations. Twelve.
It took three days to fix what took three hours to generate.
And here’s the thing - the AI didn’t write bad code. It wrote code that worked perfectly… if you only use a mouse. If you only have perfect vision. If you’ve never heard of WCAG compliance.
The code compiled. The tests passed. The demo looked beautiful.
It just wasn’t… right.
The Quality Gap in AI-Generated Code
There’s a stat that’s been haunting me since I saw it last week:
Pull requests with AI-generated code have 1.7× more issues than human-written code (source).
Not “slightly more” issues. Not “a few edge cases.” 70% more problems.
And we’re not tracking it separately. We’re lumping AI code and human code into the same metrics, wondering why our overall quality is declining.
What AI Doesn’t Understand: The “Why” Behind the Code
From a design systems perspective, I’m seeing a pattern:
AI optimizes for immediate functionality, not long-term maintainability.
Examples from our codebase:
Design Tokens
- Human approach: Uses `spacing-md`, `color-primary`, `font-size-body`
- AI approach: Hardcodes `16px`, `#2563eb`, `14px` everywhere
- Impact: Works perfectly until we rebrand or need dark mode
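The difference is easy to show in code. A minimal sketch, with token names and values that are illustrative, not our actual system:

```typescript
// Hypothetical token module -- in a real system these live in one shared file.
const tokens = {
  spacingMd: "16px",
  colorPrimary: "#2563eb",
  fontSizeBody: "14px",
} as const;

// Hardcoded (AI-style): every value must be hunted down at rebrand time.
const hardcodedStyle = { padding: "16px", color: "#2563eb", fontSize: "14px" };

// Tokenized (human-style): a rebrand or dark-mode swap touches one file.
const tokenizedStyle = {
  padding: tokens.spacingMd,
  color: tokens.colorPrimary,
  fontSize: tokens.fontSizeBody,
};
```

Both objects render identically today; only one of them survives a theme change.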
Component Composition
- Human approach: Builds modular components that compose predictably
- AI approach: Creates monolithic components that “do everything”
- Impact: Works for the immediate use case, nightmare to maintain
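Here's a toy illustration of that difference (HTML strings instead of a real component framework, and hypothetical names, to keep the contrast visible):

```typescript
// Monolithic (AI-style): one component grows a prop for every variation.
function monolithicModal(opts: {
  title: string;
  showFooter: boolean;
  confirmLabel: string;
}): string {
  let body = `<header>${opts.title}</header>`;
  if (opts.showFooter) {
    body += `<footer><button>${opts.confirmLabel}</button></footer>`;
  }
  return `<dialog>${body}</dialog>`;
}

// Composable (human-style): small parts that callers assemble predictably.
const header = (title: string) => `<header>${title}</header>`;
const footer = (...buttons: string[]) => `<footer>${buttons.join("")}</footer>`;
const modal = (...children: string[]) => `<dialog>${children.join("")}</dialog>`;

// A new variation needs no new props, just a different assembly:
const confirmModal = modal(
  header("Delete file?"),
  footer("<button>Cancel</button>", "<button>Delete</button>")
);
```

The monolith works for today's use case; every future use case adds another boolean prop.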
Accessibility
- Human approach: Keyboard navigation, ARIA labels, focus management from the start
- AI approach: Mouse-first, visual-first, “we’ll add accessibility later”
- Impact: Later never comes, or costs 3× to retrofit
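This is exactly what bit our modal. A sketch of the keyboard logic the generated version lacked, written as a pure function so it's testable without a DOM (real code would read `event.key` and `event.shiftKey`; `"ShiftTab"` here is an illustrative simplification):

```typescript
// Given the focusable elements inside a modal, decide where focus goes next.
// Returns the next index (wrapping, so focus stays trapped inside the dialog)
// or null, meaning: close the modal and restore focus to the trigger element.
function nextFocusIndex(
  key: "Tab" | "ShiftTab" | "Escape" | string,
  current: number,
  count: number
): number | null {
  if (key === "Escape") return null;                       // close, restore focus
  if (key === "Tab") return (current + 1) % count;         // wrap forward
  if (key === "ShiftTab") return (current - 1 + count) % count; // wrap backward
  return current;                                          // other keys: no change
}
```

None of this is visible in a demo, which is precisely why the demo looked fine.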
The 41% Bug Increase Nobody Talks About
@eng_director_luis mentioned this stat in the other thread: projects with high AI code usage saw a 41% increase in bugs (source).
In our design system, I can trace where these bugs come from:
1. Context Blindness
- AI doesn’t know our component hierarchy
- AI doesn’t understand our design token system
- AI doesn’t see the dependencies between components
- AI generates code that works in isolation but breaks in integration
2. Pattern Inconsistency
- AI learns from Stack Overflow, not our style guide
- AI mixes different state management patterns in the same component
- AI uses whatever works, not what’s maintainable
3. Edge Case Ignorance
- AI handles the happy path beautifully
- AI forgets error states, loading states, empty states
- AI doesn’t think about mobile, tablets, screen readers, slow networks
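One structural fix we've leaned on: model those states in the type system so they can't be skipped. A sketch, with illustrative names (a discriminated union forces every renderer to handle error, loading, and empty, not just the happy path):

```typescript
// Every UI state is an explicit variant, not an afterthought.
type FetchState<T> =
  | { kind: "loading" }
  | { kind: "error"; message: string }
  | { kind: "empty" }
  | { kind: "loaded"; data: T };

function render(state: FetchState<string[]>): string {
  // The compiler flags this switch if a variant is ever left unhandled.
  switch (state.kind) {
    case "loading": return "Loading...";
    case "error": return `Error: ${state.message}`;
    case "empty": return "No items yet";
    case "loaded": return state.data.join(", ");
  }
}
```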
The Review Burden is Real
Here’s the uncomfortable truth: AI-generated code creates more work for reviewers, not less.
When I review human-written code:
- 20 minutes to understand the approach
- 10 minutes to check for issues
- 5 minutes to suggest improvements
- ~35 minutes total
When I review AI-generated code:
- 5 minutes to understand (it’s usually straightforward)
- 40 minutes to check for ALL the things AI might have missed:
  - Accessibility
  - Design token usage
  - Component composition
  - Edge cases
  - Mobile responsiveness
  - Performance implications
  - Maintainability
- 15 minutes to document what needs to be fixed
- ~60 minutes total
Our AI-assisted PRs take 2.3× longer to review than human-written PRs.
Where’s the productivity gain?
My Controversial Proposal: Separate Review Queues
What if we treated AI-heavy PRs as a different category?
Not to shame them. To acknowledge they need different review patterns.
Standard PR review checklist:
- Functional correctness
- Test coverage
- Follows architectural patterns
- Code quality
AI-assisted PR review checklist (additional checks):
- Uses design tokens (not hardcoded values)
- Accessibility compliance (keyboard nav, ARIA, focus)
- Edge cases (error states, loading, empty)
- Mobile/responsive (not just desktop)
- Integrates with existing patterns (doesn’t reinvent)
- Performance (AI loves nested loops and unnecessary re-renders)
- Maintainability (will someone understand this in 6 months?)
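Some of these checks don't even need a human. A hypothetical sketch of how the "uses design tokens" item could be automated, flagging hardcoded hex colors and pixel values before review starts (the regex and function name are mine, not an existing tool):

```typescript
// Match hex colors (#fff through #rrggbbaa) and raw pixel values (16px).
const HARDCODED = /#[0-9a-fA-F]{3,8}\b|\b\d+px\b/g;

// Return every hardcoded value found in a style source string,
// so the review comment can list them instead of a human hunting for them.
function findHardcodedValues(source: string): string[] {
  return source.match(HARDCODED) ?? [];
}
```

A check like this won't catch everything, but it turns one 40-minute review item into a CI failure.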
Different tools. Different SLAs. Different expectations.
The Question: Should We Track AI Code Quality Separately?
@cto_michelle proposed tracking AI code as a distinct quality category. I think we need to go further:
Separate metrics for:
- AI code review time vs. human code review time
- AI code bug density vs. human code bug density
- AI code rework rate (how often is it substantially refactored within 30 days?)
- AI code accessibility score vs. human code accessibility score
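Mechanically, this just means tagging each PR and splitting the aggregates. A sketch of the idea, with illustrative field names (not our actual PR schema):

```typescript
// One record per merged PR; aiAssisted is the tag we'd start capturing.
interface PrRecord {
  aiAssisted: boolean;
  bugs: number;          // bugs traced back to this PR
  linesChanged: number;
  reviewMinutes: number;
}

// Bugs per 1,000 changed lines, computed per category instead of blended.
function bugDensity(prs: PrRecord[], aiAssisted: boolean): number {
  const subset = prs.filter((p) => p.aiAssisted === aiAssisted);
  const bugs = subset.reduce((n, p) => n + p.bugs, 0);
  const lines = subset.reduce((n, p) => n + p.linesChanged, 0);
  return lines === 0 ? 0 : (bugs / lines) * 1000;
}

// Average review time per category, for the 2.3x comparison.
function avgReviewMinutes(prs: PrRecord[], aiAssisted: boolean): number {
  const subset = prs.filter((p) => p.aiAssisted === aiAssisted);
  if (subset.length === 0) return 0;
  return subset.reduce((n, p) => n + p.reviewMinutes, 0) / subset.length;
}
```

Two small queries like these would replace "overall quality is declining" with "here is where it's declining."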
Not to stigmatize AI code. To understand its actual cost and value.
Right now we’re celebrating “41% of code is AI-generated!” without asking:
- How much of that 41% shipped to production?
- How much of it had to be significantly reworked?
- How much reviewer time did it consume?
- How much production debt did it create?
What’s Actually Working
After our modal accessibility disaster, we changed our process:
Before AI generates code:
- I create the component spec in Figma
- I document accessibility requirements explicitly
- I list which design tokens must be used
- I map out component relationships
- Then I let AI generate the implementation
Result: AI-generated code that actually integrates with our system.
Still faster than writing by hand. Way slower than letting AI run wild.
But the code that ships is actually maintainable.
The Uncomfortable Question for Everyone
Are we accepting “good enough” more often because AI wrote it?
When humans write code, we scrutinize every decision. When AI writes code, we just check if it works.
That’s a mistake.
AI makes it easy to generate lots of code fast. That doesn’t mean the code is good. And it definitely doesn’t mean we should review it less carefully.
If anything, we should review it more carefully - because the AI doesn’t understand our system, our constraints, our users, or our future maintenance burden.
What are others seeing? Is the 1.7× issue rate real in your codebases? How are you handling the review burden?