I’ve been following the discussions about AI code review, and I think we’re framing the problem wrong. It’s not quality vs. speed. It’s risk calibration across multiple dimensions.
Let me explain what I mean.
The Binary Framing Is Misleading
When we say “mandate manual review” vs. “trust AI code,” we’re acting like there are only two states: safe or fast. But that’s not how engineering teams actually work.
In reality, we’re constantly making risk decisions:
- Which features get released to 100% of users vs. 5% beta users?
- Which bugs get hotfixed immediately vs. bundled into the next release?
- Which technical debt gets addressed now vs. deferred?
AI code review should follow the same risk-based decision framework we already use for everything else.
The Risk Calibration Framework
Here’s how I think about it:
Dimension 1: Code Risk Level
(Luis covered this well in his thread about verification workflows)
Critical: Money, auth, customer data → High scrutiny
Important: Core features, integrations → Standard review
Routine: Internal tools, documentation → Light review
Dimension 2: Engineer Experience Level
(I touched on this in the earlier thread)
Senior engineers with proven track records → More autonomy
Mid-level engineers building competence → Standard oversight
Junior engineers still learning → Close mentorship
Dimension 3: Blast Radius
Customer-Facing: Directly impacts end users → Extra caution
Internal-Facing: Only affects team workflows → Move faster
Experimental: Behind feature flags → Take risks
Dimension 4: Reversibility
Easily Reverted: Can roll back quickly → Ship with confidence
Migration/Schema Changes: Hard to undo → Go slow
One-Way Doors: Permanent decisions → Maximum scrutiny
How This Changes Review Practices
Instead of “review everything” or “trust AI blindly,” you calibrate review rigor based on where code falls on these dimensions.
Example 1: Senior engineer, routine code, internal tool, easily reverted
- Risk calibration: LOW
- Review approach: Automated checks only, self-review, ship
- Reasoning: All dimensions suggest low risk, so trust the process
Example 2: Junior engineer, critical code, customer-facing, hard to revert
- Risk calibration: HIGH
- Review approach: Senior engineer review, security review, staged rollout
- Reasoning: All dimensions suggest high risk, so invest in thorough review
Example 3: Mid-level engineer, important code, internal-facing, behind feature flag
- Risk calibration: MEDIUM
- Review approach: Peer review for design, automated testing, gradual rollout
- Reasoning: Mixed signals - more than self-review, less than full scrutiny
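For anyone who wants to make this concrete, here is a minimal Python sketch of how the three examples above could be scored. Treat all of it as illustration rather than a recommended formula: the 1-to-3 scores, the thresholds, the names (Change, calibrate), and the rule that critical code and one-way doors always land in HIGH are assumptions of mine. So is treating the feature flag in Example 3 as "easily reverted."

```python
from dataclasses import dataclass
from enum import IntEnum


# Each dimension is scored 1 (lowest risk) to 3 (highest risk).
# The numeric values are illustrative assumptions.
class CodeRisk(IntEnum):
    ROUTINE = 1
    IMPORTANT = 2
    CRITICAL = 3


class Experience(IntEnum):
    SENIOR = 1
    MID_LEVEL = 2
    JUNIOR = 3


class BlastRadius(IntEnum):
    EXPERIMENTAL = 1
    INTERNAL_FACING = 2
    CUSTOMER_FACING = 3


class Reversibility(IntEnum):
    EASILY_REVERTED = 1
    HARD_TO_UNDO = 2
    ONE_WAY_DOOR = 3


@dataclass(frozen=True)
class Change:
    code_risk: CodeRisk
    experience: Experience
    blast_radius: BlastRadius
    reversibility: Reversibility


def calibrate(change: Change) -> str:
    """Map a change's four dimensions to LOW, MEDIUM, or HIGH review rigor."""
    # Assumed floor: critical code and one-way doors always get HIGH,
    # no matter how the other dimensions score.
    if (change.code_risk is CodeRisk.CRITICAL
            or change.reversibility is Reversibility.ONE_WAY_DOOR):
        return "HIGH"
    score = (change.code_risk + change.experience
             + change.blast_radius + change.reversibility)
    if score <= 5:   # assumed threshold
        return "LOW"
    if score <= 8:   # assumed threshold
        return "MEDIUM"
    return "HIGH"


# The three examples from this post.
example_1 = Change(CodeRisk.ROUTINE, Experience.SENIOR,
                   BlastRadius.INTERNAL_FACING, Reversibility.EASILY_REVERTED)
example_2 = Change(CodeRisk.CRITICAL, Experience.JUNIOR,
                   BlastRadius.CUSTOMER_FACING, Reversibility.HARD_TO_UNDO)
example_3 = Change(CodeRisk.IMPORTANT, Experience.MID_LEVEL,
                   BlastRadius.INTERNAL_FACING, Reversibility.EASILY_REVERTED)

for name, change in [("Example 1", example_1),
                     ("Example 2", example_2),
                     ("Example 3", example_3)]:
    print(name, calibrate(change))
```

A score like this is a conversation starter, not a replacement for judgment. The point of writing it down is that the team can argue about the weights instead of arguing about individual PRs.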
The Team Structure Implication
This framework implies you need to structure teams around risk tolerance, not just around products or features.
At my EdTech company, we’ve organized teams into:
Core Platform Team (Low risk tolerance)
- Owns auth, payments, data infrastructure
- Higher review standards, slower velocity
- Mostly senior engineers
Feature Teams (Medium risk tolerance)
- Own customer-facing products
- Standard review processes
- Mix of experience levels
Innovation Team (Higher risk tolerance)
- Experimental features behind flags
- Lighter review, faster iteration
- Self-organizing, high autonomy
Each team has DIFFERENT review practices because they operate with DIFFERENT risk profiles. There’s no universal policy.
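One way to make "no universal policy" explicit is to write the per-team policies down as data. The Python sketch below is hypothetical: the TeamPolicy type and the policy_for_area helper are names I made up, and the field values are just the team descriptions above restated.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TeamPolicy:
    team: str
    risk_tolerance: str
    review: str
    owns: Tuple[str, ...]


# Values restate the three teams described in this post.
POLICIES = (
    TeamPolicy("Core Platform", "low", "higher review standards",
               ("auth", "payments", "data infrastructure")),
    TeamPolicy("Feature Teams", "medium", "standard review processes",
               ("customer-facing products",)),
    TeamPolicy("Innovation Team", "higher", "lighter review",
               ("experimental features",)),
)


def policy_for_area(area: str) -> TeamPolicy:
    """Return the policy of the team that owns the given area."""
    for policy in POLICIES:
        if area in policy.owns:
            return policy
    # Unowned areas are a signal to decide ownership first.
    raise KeyError(f"no team owns {area!r}")


print(policy_for_area("payments").review)
```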
Building Judgment, Not Enforcing Compliance
Here’s the subtle but critical shift: Risk calibration requires judgment, not just rule-following.
“Review all AI code” is a rule. It doesn’t require judgment - just compliance.
“Calibrate review rigor based on risk dimensions” is a framework. It requires engineers to think:
- What kind of code is this?
- Who’s writing it?
- What happens if it breaks?
- How easily can we fix it?
This builds better engineers, not just safer code.
The Cultural Shift
Moving from rules to judgment requires trust:
- Trust that engineers will make good risk decisions
- Trust that they’ll ask for help when risk is high
- Trust that they’ll learn from mistakes
It also requires transparency:
- Make risk decisions explicit (document “this is low-risk because…” or “this needs extra review because…”)
- Track outcomes (were our risk calibrations accurate?)
- Adjust (what should we scrutinize more/less based on actual defects?)
What This Means for AI Code Review
Don’t ask: “Should we review AI code?”
Ask: “What level of review does THIS piece of AI code need based on its risk profile?”
Don’t mandate: “All AI code requires manual review”
Enable: “Engineers can calibrate review rigor based on risk dimensions”
Don’t measure: “Percentage of code reviewed”
Measure: “Defect rates by risk tier” and “accuracy of risk calibration”
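Here is a rough Python sketch of what those two measurements could look like. I'm assuming a definition of "accuracy of risk calibration" here: the share of changes where the tier assigned up front matches the tier the team would assign in a retrospective. The ShippedChange record, its field names, and the sample rows are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ShippedChange:
    assigned_tier: str    # tier chosen before review
    hindsight_tier: str   # tier the team would pick in a retrospective (assumed input)
    caused_defect: bool


def defect_rate_by_tier(changes: List[ShippedChange]) -> Dict[str, float]:
    """Fraction of changes in each assigned tier that caused a defect."""
    totals = defaultdict(int)
    defects = defaultdict(int)
    for change in changes:
        totals[change.assigned_tier] += 1
        defects[change.assigned_tier] += int(change.caused_defect)
    return {tier: defects[tier] / totals[tier] for tier in totals}


def calibration_accuracy(changes: List[ShippedChange]) -> float:
    """Share of changes whose assigned tier matches the hindsight tier."""
    if not changes:
        return 0.0
    matches = sum(c.assigned_tier == c.hindsight_tier for c in changes)
    return matches / len(changes)


# Made-up sample rows, purely to show the shape of the data.
SAMPLE = [
    ShippedChange("LOW", "LOW", False),
    ShippedChange("LOW", "HIGH", True),   # under-calibrated: should have been HIGH
    ShippedChange("HIGH", "HIGH", False),
    ShippedChange("HIGH", "HIGH", False),
]

print(defect_rate_by_tier(SAMPLE))
print(calibration_accuracy(SAMPLE))
```

If the LOW tier shows a defect rate close to the HIGH tier, that suggests the tiers are not separating risk well and the calibration needs adjusting.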
The Uncomfortable Question
This approach requires more from engineering leaders: we have to teach judgment, not enforce rules. That’s harder.
Rules scale easily - write a policy, mandate compliance, measure adherence.
Judgment scales slowly - teach frameworks, mentor decision-making, learn from mistakes.
But which creates better long-term outcomes?
My Challenge to This Community
If you’re currently using blanket review policies (“review everything” or “review nothing”), try this:
- Pick 10 recent PRs from your team
- Plot them on the 4 risk dimensions I outlined
- Ask: Did we apply the right level of review to each based on its actual risk profile?
My hypothesis: You’ll find you over-reviewed some low-risk code and possibly under-reviewed some high-risk code, because blanket policies don’t account for nuance.
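If a spreadsheet feels like too much, the tally can be a few lines of Python. This is only a sketch: the audit function, the three-level scale, and the placeholder PR labels are assumptions for illustration, and the risk tier for each PR still has to come from your own judgment.

```python
from collections import Counter
from typing import Iterable, Tuple

# Assumed ordering of review rigor, lowest to highest.
LEVELS = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}


def audit(prs: Iterable[Tuple[str, str, str]]) -> Counter:
    """Tally PRs given as (label, risk_tier, review_applied) tuples."""
    counts = Counter()
    for _label, risk_tier, review_applied in prs:
        gap = LEVELS[review_applied] - LEVELS[risk_tier]
        if gap > 0:
            counts["over-reviewed"] += 1
        elif gap < 0:
            counts["under-reviewed"] += 1
        else:
            counts["matched"] += 1
    return counts


# Placeholder PRs under a blanket "review everything" policy,
# modeled here as HIGH review applied to every change.
blanket = [
    ("PR-A", "LOW", "HIGH"),
    ("PR-B", "LOW", "HIGH"),
    ("PR-C", "MEDIUM", "HIGH"),
    ("PR-D", "HIGH", "HIGH"),
]

print(audit(blanket))
```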
What do you find when you try this exercise?