What Does a Good AI Code Review Checklist Actually Look Like?
Building on Maya’s thread about AI code review challenges, I want to share some concrete practices we’ve developed at my financial services company—and get your input on what we might be missing.
Context: Why We Need AI-Specific Review Standards
In my last response, I mentioned treating AI-generated code like “junior developer output.” But saying that and actually doing it are different things. We needed specific, actionable guidance for our reviewers.
The challenge: Our engineers are used to reviewing human code. They know what to look for, what smells bad, where bugs hide. But AI code has different failure modes:
Clean, well-formatted syntax (looks professional)
Follows common patterns (feels familiar)
Missing edge case handling (not obvious at first glance)
Security vulnerabilities that “look correct” (the OAuth issue Maya mentioned)
Over-engineered solutions (AI sometimes does the “clever” thing, not the simple thing)
Traditional code review checklists don’t catch these AI-specific issues. So we built our own.
Our AI Code Review Checklist
This is what our team uses for any PR that’s significantly AI-generated (we use an `ai-assisted` label to flag these):
1. Security Extra Scrutiny
We can’t afford security issues in banking systems, so these are mandatory checks:
- **Authentication/authorization logic manually verified**
  - Does this code correctly enforce access control?
  - Are there bypass scenarios (concurrent access, edge cases)?
  - Has a security-focused engineer reviewed it?
- **Input validation reviewed for injection attacks**
  - SQL injection, NoSQL injection, command injection
  - XSS in any user-facing output
  - Path traversal in file operations
- **Proper error handling (no sensitive data leaks)**
  - Error messages don’t expose stack traces, credentials, or PII
  - Failed auth doesn’t reveal whether the username or password was wrong
  - Database errors are generic to external callers
- **Secrets/credentials handling reviewed**
  - No hardcoded secrets (AI sometimes suggests example keys)
  - Proper use of secret management systems
  - Credentials not logged or exposed in responses
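To make the “database errors are generic to external callers” check concrete, here’s a minimal Python sketch (the `fetch_account` function, `DatabaseError` class, and `payments` logger are hypothetical stand-ins, not our actual code): the full failure detail goes to internal logs, while external callers see only a generic message.

```python
import logging

logger = logging.getLogger("payments")

class DatabaseError(Exception):
    """Stand-in for a driver-specific error carrying connection details."""

def fetch_account(account_id: str) -> dict:
    # Hypothetical data-access call; here it always fails, to show the path.
    raise DatabaseError(f"connection refused: db-host-3:5432 (account={account_id})")

def get_account(account_id: str) -> dict:
    """Return account data, or a generic error payload to external callers."""
    try:
        return fetch_account(account_id)
    except DatabaseError as exc:
        # Full detail (host names, account IDs) stays in internal logs only.
        logger.error("account lookup failed: %s", exc)
        # Callers get no stack trace, no host names, no PII.
        return {"error": "Unable to process request. Please try again later."}
```

The review question is simply: could any string reaching the caller contain infrastructure details or PII?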
2. Logic & Edge Cases
This is where AI most often fails us:
- **Null/undefined handling**
  - What happens with empty arrays, null objects, undefined values?
  - Does the code fail gracefully or throw obscure errors?
- **Concurrent access scenarios**
  - Race conditions in async operations
  - Lock contention on shared resources
  - State consistency across requests
- **Boundary conditions**
  - Empty inputs, max values, overflow scenarios
  - First/last element handling in loops
  - Off-by-one errors
- **Rollback/cleanup logic**
  - Partial failure handling (what if step 3 of 5 fails?)
  - Resource cleanup in error paths
  - Idempotency for retries
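Rollback on partial failure is the item AI-generated code most often omits, so here’s a minimal sketch of the pattern we look for (the step names are hypothetical): each step is paired with a compensating action, and a failure part-way through undoes the completed steps in reverse order before re-raising.

```python
completed = []

def reserve():   completed.append("reserve_funds")
def unreserve(): completed.remove("reserve_funds")
def charge():    completed.append("charge_card")
def uncharge():  completed.remove("charge_card")
def notify():    raise RuntimeError("notification service down")

def run_with_rollback(steps):
    """Run (action, compensate) pairs in order; on failure, undo completed
    steps in reverse order, then re-raise so the caller sees the error."""
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(done):
                undo()  # best-effort cleanup of earlier steps
            raise
        done.append(compensate)
```

If `notify` (step 3) fails, `charge` and `reserve` are both undone; the reviewer’s job is to verify every step actually has a workable compensating action.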
3. AI-Specific Checks
These are unique to AI-generated code:
- **Prompt quality review**
  - Was the AI prompt clear and complete? (we require it in PR descriptions)
  - Did the prompt include security/performance requirements?
  - Is there context the AI might have missed?
- **Team convention alignment**
  - AI often uses “common” patterns, not our patterns
  - Does this match our architecture decisions?
  - Does it follow our naming conventions and style guide?
- **Unexplained complexity**
  - Is there a simpler solution than what the AI generated?
  - Are there “clever” solutions that should be more straightforward?
  - Can a junior engineer understand this code?
- **Test coverage adequate for complexity**
  - Did the AI generate the tests too? (often basic happy-path only)
  - Are edge cases tested?
  - Do tests actually validate business logic, not just syntax?
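To make the “happy-path only” point concrete, here’s a toy example (the `average_balance` function is hypothetical): the first assertion is the kind of test AI tools typically generate on their own; the rest are the edge cases reviewers should insist on.

```python
def average_balance(balances: list[float]) -> float:
    """Mean balance; an empty portfolio is defined as 0.0 rather than a crash."""
    if not balances:
        return 0.0
    return sum(balances) / len(balances)

# Happy-path test (what AI tools typically generate):
assert average_balance([100.0, 200.0]) == 150.0

# Edge-case tests a reviewer should require:
assert average_balance([]) == 0.0             # empty input: no ZeroDivisionError
assert average_balance([50.0]) == 50.0        # single element
assert average_balance([-25.0, 25.0]) == 0.0  # negative values net to zero
```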
4. Architecture & Maintainability
- **Fits existing architecture**
  - Does this create new patterns or follow existing ones?
  - Is the abstraction level appropriate?
  - Does it introduce unnecessary dependencies?
- **Performance considerations**
  - N+1 queries, inefficient algorithms, memory leaks
  - Scalability under load
  - Caching strategy, if needed
- **Maintainability**
  - Is it documented? (AI-generated comments don’t always explain why)
  - Can the team modify this in six months?
  - Are there magic numbers or unexplained constants?
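The N+1 query check is easiest to explain with a sketch. This toy example simulates the two access patterns with an in-memory dict and a query counter (all names hypothetical); in real code the same shape appears as one `SELECT` per loop iteration versus a single `WHERE id IN (...)` query.

```python
query_count = 0
ACCOUNTS = {1: "alice", 2: "bob", 3: "carol"}

def fetch_owner(account_id: int) -> str:
    """Stand-in for a single-row DB query."""
    global query_count
    query_count += 1
    return ACCOUNTS[account_id]

def fetch_owners(account_ids: list[int]) -> dict:
    """Stand-in for one batched query (e.g. WHERE id IN (...))."""
    global query_count
    query_count += 1
    return {i: ACCOUNTS[i] for i in account_ids}

# N+1 pattern: one query per account (3 queries for 3 accounts).
query_count = 0
owners = [fetch_owner(i) for i in [1, 2, 3]]
n_plus_one_queries = query_count

# Batched pattern: one query total, regardless of list size.
query_count = 0
owners = fetch_owners([1, 2, 3])
batched_queries = query_count
```

AI-generated code frequently produces the first shape because it looks clean; the counter makes the cost difference obvious in review.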
The Two-Reviewer Experiment
For PRs that are >50% AI-generated (based on author self-reporting), we’re experimenting with requiring two reviewers:
- First reviewer: Focuses on logic, edge cases, security
- Second reviewer: Focuses on architecture, maintainability, team conventions
Is it slower? Yes—adds about 20% to review time.
Is it worth it? We think so—we’ve seen a 60% reduction in AI-related production bugs.
Process Integration
Having a checklist is one thing—actually using it is another. We’ve integrated this into our workflow:
- PR template includes an AI disclosure: “Was this PR significantly AI-assisted? (Yes/No)”
- A GitHub Action adds the `ai-assisted` label based on the PR template answer
- Reviewers see the label and know to use the AI checklist
- CODEOWNERS auto-assigns a security reviewer for any auth/payment code
- AI-heavy PRs can’t merge without both reviewers approving
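The core of the label automation is just parsing the disclosure answer out of the PR body. A sketch of the decision logic such an action could run (the template wording and function name here are assumptions, and the actual GitHub API calls to apply the label are omitted):

```python
import re

def needs_ai_label(pr_body: str) -> bool:
    """Return True if the PR template's AI-disclosure answer is Yes.

    Assumes the body contains a line shaped like:
        Was this PR significantly AI-assisted? (Yes/No): Yes
    """
    match = re.search(
        r"significantly AI-assisted\?.*?:\s*(yes|no)", pr_body, re.IGNORECASE
    )
    return bool(match) and match.group(1).lower() == "yes"
```

A missing or malformed disclosure yields `False` here; a stricter variant could instead fail the check and block merge until the author answers.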
Questions for the Community
Okay, this is what we’re doing. But I’m sure we’re missing things. So:
- **What would you add to this checklist?** What AI-specific failure modes are we not catching?
- **Is requiring two reviewers overkill?** Or is it the right level of rigor for AI-heavy code?
- **How do you handle the “AI generated the tests too” problem?** Our AI tools write tests, but they’re often superficial.
- **What’s your false positive rate?** How often does this catch real issues vs. slow things down unnecessarily?
- **How do you balance this with velocity?** Are we being too cautious?
I’m particularly interested in hearing from teams in other regulated industries (healthcare, fintech, government)—what does your AI code review process look like?
TL;DR: We treat AI code like junior dev output and have a specific checklist for security, edge cases, and AI-specific issues. Two reviewers for AI-heavy PRs. It’s slower, but it has cut our AI-related production bugs by 60%. What are we missing?