Following up on the discussion about AI code review mandates, I want to share what we’ve learned implementing verification workflows in financial services. The short version: “Review everything manually” doesn’t scale, but you can’t just trust blindly either.
The Problem We Faced
Like many teams, we started with the simplest policy: AI can write code, humans must review all of it.
Within two months, we hit a wall:
- Review queue grew faster than we could clear it
- Senior engineers burned out on routine review work
- AI-generated code volume increased 3x (more code = more review)
- Review became the bottleneck, not code generation
We had made code generation faster but hadn’t adapted our verification systems: the classic mistake of speeding up one stage of a pipeline and just moving the bottleneck downstream.
What We Built Instead
We designed a multi-layered verification approach that catches different types of issues at different stages, reducing the burden on human reviewers.
Layer 1: Pre-Commit Automated Checks
Before code even reaches a reviewer:
Static Analysis:
- Linting for code style and common errors
- Security scanning (OWASP dependency check, secrets detection)
- Complexity metrics (cyclomatic complexity, maintainability index)
AI-Specific Checks:
- Pattern matching for known AI failure modes (we’ve cataloged 23 common mistakes our AI tools make)
- Diff analysis - flag unusually large changes or rewrites
- Test coverage verification - AI code must include tests
Result: Catches ~40% of issues automatically, before human review
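To make the diff-analysis and test-coverage checks concrete, here’s a minimal pre-commit sketch in Python. The 400-line threshold and the src/ and tests/ layout are illustrative assumptions, not our actual values:

```python
#!/usr/bin/env python3
"""Pre-commit sketch: flag oversized diffs and source changes without tests.

Assumptions for illustration: a Python repo with src/ and tests/ directories,
and a 400-changed-line threshold for "unusually large" diffs.
"""
import subprocess
import sys

MAX_CHANGED_LINES = 400  # assumed threshold, tune per repo

def staged_diff_stats():
    """Return (changed_file_paths, total_changed_lines) for the staged diff."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--numstat"],
        capture_output=True, text=True, check=True,
    ).stdout
    files, lines = [], 0
    for row in out.splitlines():
        added, deleted, path = row.split("\t", 2)
        files.append(path)
        if added != "-":  # binary files report "-" for line counts
            lines += int(added) + int(deleted)
    return files, lines

def main():
    files, lines = staged_diff_stats()
    problems = []

    # Diff analysis: unusually large changes get kicked back before review.
    if lines > MAX_CHANGED_LINES:
        problems.append(f"diff touches {lines} lines (> {MAX_CHANGED_LINES}); split it up")

    # Test-coverage heuristic: changed source files should come with test changes.
    src_changed = [f for f in files if f.startswith("src/") and f.endswith(".py")]
    tests_changed = any(f.startswith("tests/") for f in files)
    if src_changed and not tests_changed:
        problems.append("source files changed but no tests were added or updated")

    for p in problems:
        print(f"pre-commit: {p}", file=sys.stderr)
    sys.exit(1 if problems else 0)

if __name__ == "__main__":
    main()
```

Something in this shape can run as a git pre-commit hook or as the first CI stage, so the cheapest checks fire before a reviewer ever sees the PR.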
Layer 2: Automated Testing Gate
Code cannot be reviewed until:
- All unit tests pass (every existing test, plus new tests covering the new code)
- Integration tests pass for affected systems
- Performance benchmarks meet thresholds
Result: Catches ~30% of the remaining issues and confirms basic functionality
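As one way to picture the gate, here’s a rough sketch that assumes pytest for the unit and integration suites and a hypothetical scripts/bench.py that prints a p95 latency in milliseconds; the 250 ms budget is made up for the example:

```python
#!/usr/bin/env python3
"""CI gate sketch: block human review until tests and a benchmark budget pass.

Assumed names: pytest test directories and a scripts/bench.py that prints a
single p95 latency value; both are illustrative, not a real internal API.
"""
import subprocess
import sys

P95_LATENCY_BUDGET_MS = 250.0  # assumed performance threshold

def run(cmd):
    print(f"$ {' '.join(cmd)}")
    return subprocess.run(cmd, capture_output=True, text=True)

def main():
    # 1. All unit and integration tests must pass.
    tests = run(["pytest", "-q", "tests/unit", "tests/integration"])
    if tests.returncode != 0:
        print(tests.stdout + tests.stderr, file=sys.stderr)
        sys.exit("gate: tests failed, PR is not ready for human review")

    # 2. Performance benchmark must stay within budget.
    bench = run(["python", "scripts/bench.py"])
    if bench.returncode != 0:
        sys.exit("gate: benchmark script failed")
    p95_ms = float(bench.stdout.strip())
    if p95_ms > P95_LATENCY_BUDGET_MS:
        sys.exit(f"gate: p95 latency {p95_ms:.1f} ms exceeds the {P95_LATENCY_BUDGET_MS} ms budget")

    print("gate: all automated checks passed, PR may enter human review")

if __name__ == "__main__":
    main()
```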
Layer 3: Risk-Based Human Review
Only after passing the automated checks does code go to human review, and review depth varies by risk:
High-Risk (Tier 1): Money movement, auth, customer data
- Two senior engineers review architecture and business logic
- Security team reviews for vulnerabilities
- Compliance spot-checks audit trail
Medium-Risk (Tier 2): Business features, integrations
- One peer review focused on design and maintainability
- Automated checks must pass; the reviewer focuses on “does this solve the right problem?”
Low-Risk (Tier 3): Internal tools, scripts, documentation
- Self-review with automated verification
- Spot-checks by team lead (10% sampling)
Result: Human reviewers focus on problems machines can’t catch
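The routing itself is mechanical. Here’s a sketch of tier selection driven by the paths a PR touches; the path prefixes are hypothetical stand-ins for whatever ownership metadata you already have:

```python
"""Risk-tier sketch: route a PR to a review tier based on the paths it changes.

The prefixes below are illustrative; in practice the mapping would come from
service ownership metadata, not hard-coded strings.
"""
from typing import Iterable

# Assumed mapping of path prefixes to risk tiers (1 = highest risk).
TIER_1_PREFIXES = ("payments/", "auth/", "customer_data/")
TIER_3_PREFIXES = ("tools/", "scripts/", "docs/")

def review_tier(changed_paths: Iterable[str]) -> int:
    """Return the strictest (lowest-numbered) tier any changed path falls into."""
    tier = 3
    for path in changed_paths:
        if path.startswith(TIER_1_PREFIXES):
            return 1          # any Tier 1 file makes the whole PR Tier 1
        if not path.startswith(TIER_3_PREFIXES):
            tier = min(tier, 2)  # anything outside low-risk areas is at least Tier 2
    return tier

# A PR touching both docs and the payments service is reviewed as Tier 1.
assert review_tier(["docs/runbook.md", "payments/ledger.py"]) == 1
assert review_tier(["scripts/cleanup.py"]) == 3
```

The design choice that matters is the direction of the bias: ambiguity always rounds up to the stricter tier.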
Layer 4: Post-Deployment Monitoring
Even after review and deployment:
- Automated monitoring for errors and performance degradation
- Feature flags for gradual rollout
- Incident tracking tagged by “AI-generated” vs “human-written”
Result: We learn AI failure patterns and feed them back to Layer 1
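One way to do the incident tagging, assuming commits carry an “AI-Assisted” trailer (the trailer name and the incident record shape are illustrative, not a real internal API):

```python
"""Incident-tagging sketch: label incidents by whether the offending commit was
AI-generated, so failure patterns can feed back into the Layer 1 checks.

Assumption: commits carry an "AI-Assisted: yes/no" trailer in the message.
"""
import subprocess

def commit_is_ai_assisted(sha: str) -> bool:
    """Read the commit message and look for the assumed AI-Assisted trailer."""
    msg = subprocess.run(
        ["git", "show", "-s", "--format=%B", sha],
        capture_output=True, text=True, check=True,
    ).stdout
    return any(line.strip().lower() == "ai-assisted: yes" for line in msg.splitlines())

def tag_incident(incident: dict, offending_sha: str) -> dict:
    """Attach the origin tag that dashboards can group incident counts by."""
    incident["code_origin"] = (
        "ai-generated" if commit_is_ai_assisted(offending_sha) else "human-written"
    )
    return incident
```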
The Data After 6 Months
We’ve been running this system for six months. Here’s what changed:
Before (manual review only):
- Average review time: 4.2 hours per PR
- Review queue: 22 PRs average
- Senior engineer time on review: 35% of week
- Defects caught in review: 73%
- Defects reaching production: 8 per month
After (layered verification):
- Average review time: 1.8 hours per PR (human review only)
- Review queue: 6 PRs average
- Senior engineer time on review: 12% of week
- Defects caught in automation: 68%
- Defects caught in review: 28%
- Defects reaching production: 6 per month (25% reduction)
Key insight: We’re catching MORE defects overall with LESS human review time. Automation catches the routine stuff, humans catch the subtle stuff.
Implementation Lessons
Start Small, Prove Value:
We piloted with one team (internal tools, low risk) for one month and proved the system worked before rolling it out to critical systems.
Catalog AI Failure Patterns:
We tracked every defect in AI-generated code for three months. Common patterns emerged:
- Incorrect error handling (AI often forgets edge cases)
- Over-optimization (AI writes complex code when simple would work)
- Context misunderstanding (AI doesn’t know business rules it wasn’t told)
These patterns became automated checks.
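For example, the “swallowed exception” variant of the error-handling pattern translates into a small AST check like the one below; this is a single illustrative rule, not our full catalog:

```python
"""Pattern-check sketch: turn one cataloged failure mode (silently swallowed
exceptions) into an automated check that runs alongside normal linting.
"""
import ast
import sys

def find_swallowed_exceptions(source: str, filename: str):
    """Flag bare `except:` blocks whose body is just `pass` (errors silently dropped)."""
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                findings.append(f"{filename}:{node.lineno}: bare except that swallows the error")
    return findings

if __name__ == "__main__":
    problems = []
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            problems += find_swallowed_exceptions(f.read(), path)
    if problems:
        print("\n".join(problems), file=sys.stderr)
    sys.exit(1 if problems else 0)
```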
Invest in Infrastructure:
This isn’t free. We spent 6 weeks building:
- CI/CD pipeline enhancements
- Custom linting rules for AI patterns
- Dashboards for tracking defects by source (AI vs human)
But the ROI is clear: senior engineering time freed up for higher-value work.
The Remaining Challenges
This system isn’t perfect:
- Junior engineers still need mentorship: Automated checks don’t teach judgment
- Novel problems slip through: AI sometimes fails in creative ways automation doesn’t catch
- Maintenance overhead: Verification rules need updating as AI tools evolve
- Cultural resistance: Some engineers believe “manual review is the only real review”
We’re still iterating.
My Question for This Community
What verification automation are you running? Are you just using standard linting, or have you built AI-specific checks?
And for those still doing mostly manual review: What would it take to trust automation more? Is it lack of tooling, or lack of confidence in what automation can catch?
I’m convinced that as AI code generation accelerates, verification must accelerate too - and humans can’t review fast enough. Automation isn’t optional anymore.