Last week, our AI code reviewer analyzed my pull request in under 10 minutes. It caught 47 potential bugs, flagged 12 style violations, and gave me a detailed report on performance optimizations. I was thrilled—until my tech lead pulled me aside.
“Maya, this breaks our entire design token hierarchy,” he said, pointing at code the AI had enthusiastically approved. “Every component you created uses hardcoded colors instead of our token system. This is going to cascade into an accessibility nightmare.”
The AI reviewer was right about everything it flagged. But it was completely blind to the one thing that mattered most: architectural consistency within our design system.
The speed vs. context tradeoff is real
Here’s what I’ve noticed working with AI code reviews over the past few months:
What AI reviewers excel at:
- Catching syntax errors and type mismatches
- Identifying common anti-patterns
- Spotting security vulnerabilities from known patterns
- Enforcing style guidelines consistently
- Suggesting performance optimizations
What AI reviewers consistently miss:
- Design system architectural patterns
- Business context for why certain patterns exist
- Accessibility implications of component hierarchies
- How this code fits into the broader system evolution
- Team conventions that aren’t codified in linters
In my case, the AI saw perfectly valid React components with proper prop types and good performance characteristics. What it didn’t see was that I’d bypassed our design token system—a decision that would break our theme switching, accessibility contrast ratios, and design-engineering handoff process.
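To make the gap concrete, here’s a minimal sketch of the difference. The token names and CSS custom properties are illustrative, not our real system; the point is that hardcoded values are opaque to theming and contrast tooling, while token values resolve through a layer the system controls.

```typescript
// Hypothetical design tokens -- names are illustrative, not our real system.
// Values resolve through CSS custom properties, so theme switching and
// accessibility contrast settings keep working downstream.
const tokens = {
  color: {
    surface: "var(--ds-color-surface)",
    textPrimary: "var(--ds-color-text-primary)",
  },
  space: { md: "var(--ds-space-md)" },
} as const;

// What I wrote: locally "correct" React style props, but invisible to the
// token system -- theme switching and contrast checks can't touch these.
const hardcodedStyle = {
  background: "#1f2937",
  color: "#f9fafb",
  padding: "14px",
};

// What the design system expects: every value routes through a token.
const tokenStyle = {
  background: tokens.color.surface,
  color: tokens.color.textPrimary,
  padding: tokens.space.md,
};
```

An AI reviewer sees two equally valid style objects; only someone who knows the token layer exists sees that the first one breaks it.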
A real example: when “correct” code breaks the system
The component I built was for a feature flag dashboard. The AI review praised it:
- Type-safe implementation
- No prop-drilling
- Proper error boundaries
- Optimized re-renders
- Test coverage >80%
But here’s what it missed:
- Used hex colors instead of design tokens
- Created custom spacing instead of using our scale
- Implemented a one-off focus state that broke keyboard navigation patterns
- Typography didn’t respond to our accessibility font size settings
Every individual line of code was “correct.” But the architectural decision to work outside our design system was catastrophic for maintainability and accessibility.
Should we review AI-generated code differently?
This experience made me wonder: Do we need different review standards for AI-generated code versus human-written code?
Some research I found suggests we might:
- Studies show AI-assisted code has 23.7% more security vulnerabilities when not properly governed (source)
- 46% of developers don’t fully trust AI results (source)
- Teams using two-pass review workflows (AI first, human second) reduce cycle time by 30-50% while maintaining quality (source)
What I’m thinking about
In design systems, we care deeply about architectural consistency. A beautiful component that breaks the system’s patterns is worse than an ugly component that follows them—because the ugly one can be improved while maintaining coherence.
Maybe code review needs to evolve the same way:
Pass 1 (AI): Mechanical correctness
- Syntax, types, common patterns
- Security vulnerability scanning
- Style guide enforcement
- Test coverage verification
Pass 2 (Human): Architectural coherence
- System-wide consistency
- Business context alignment
- Accessibility implications
- Long-term maintainability
The AI handles the tedious stuff humans are inconsistent at. Humans focus on the contextual stuff AI can’t understand.
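One practical consequence: some of what pass 2 catches today can migrate into pass 1 once it’s codified. As a sketch, here’s a standalone checker for the hex-color convention from my PR; a real setup would likely be a custom ESLint rule, but the idea is the same.

```typescript
// Sketch: codifying one previously uncodified team convention ("no raw hex
// colors -- use design tokens") so the mechanical pass can enforce it.
// Matches 3- or 6-digit hex color literals.
const HEX_COLOR = /#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})\b/g;

function findHardcodedColors(source: string): string[] {
  return source.match(HEX_COLOR) ?? [];
}

// Example input: the kind of line the AI reviewer approved.
const snippet = `const style = { background: "#1f2937", color: "#f9fafb" };`;
const violations = findHardcodedColors(snippet);
// -> ["#1f2937", "#f9fafb"]
```

Every convention you move into a rule like this frees the human pass to spend its attention on the things that genuinely can’t be codified.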
Questions for the group
How are your teams handling this?
- Are you reviewing AI-generated code the same way as human code, or differently?
- Have you encountered situations where AI reviewers approved code that was locally correct but systemically wrong?
- What does “architectural review” mean in your context (backend systems, design systems, data pipelines, etc.)?
- Who should be doing the architectural review—do we need a new role, or is this just senior engineers’ job evolving?
I’m especially curious if teams outside design systems are seeing similar patterns. My sense is this isn’t just about design tokens—it’s about any domain where architectural patterns matter more than individual code quality.
What’s your experience been?