Our Code Review Process Can't Scale with AI Output Volume—Time to Rethink What 'Thorough Review' Means?

Following up on the AI productivity paradox discussion—I want to dig deeper into what I think is the core bottleneck: our code review processes were designed for a different volume and quality profile of code, and they’re breaking under the AI-generated workload.

The Traditional Review Model’s Assumptions

Traditional code review was designed around these assumptions:

  • Each reviewer handles 10-15 PRs per week
  • Code is well-understood by the author
  • Reviews take 15-30 minutes for most PRs
  • 2-3 review cycles per PR on average
  • Review is primarily about catching bugs and logic errors

All of these assumptions are failing in AI-heavy teams.

What We’re Actually Seeing

Our data from 40+ engineers in a heavily regulated fintech environment:

Before widespread AI adoption (Q1 2025):

  • 120 PRs/week across team
  • Average review time: 25 minutes per PR
  • Review turnaround: 4 hours (median time from PR creation to approval)
  • Review comments per PR: 4.2 average
  • Approval rate on first review: 65%

After widespread AI adoption (Q1 2026):

  • 230 PRs/week across team (+92%)
  • Average review time: 35 minutes per PR (+40%)
  • Review turnaround: 18 hours (median time from PR creation to approval) (+350%)
  • Review comments per PR: 3.1 average (-26%)
  • Approval rate on first review: 52% (-13 points)

The paradox in those numbers: We’re spending more time per review yet leaving fewer comments, and fewer PRs clear review on the first pass.

Why Reviews Take Longer for AI Code

This surprised me, but after watching our senior engineers review AI-generated PRs, the pattern is clear:

1. The Author Doesn’t Deeply Understand Their Own Code

When a senior engineer writes code, I can ask “why did you choose this approach?” and get a thoughtful answer about trade-offs considered.

When a junior engineer uses AI to generate code, they often can’t explain the architectural choices because they didn’t make them—AI did. The review becomes a teaching session (“do you understand what this does?”) before it can be a quality gate.

2. Unfamiliar Patterns and Libraries

AI suggests whatever patterns are common in its training data, not what’s common in our codebase. This means reviewers encounter code that’s syntactically correct, idiomatically reasonable, but architecturally inconsistent with the rest of the system.

Last week: an engineer used Copilot to implement caching. Copilot suggested a Redis pattern we don’t use anywhere else. The code worked, tests passed, but now we had a new pattern to maintain. Three senior engineers spent 90 minutes discussing whether to accept it or ask for refactoring to match our existing approach.

3. Subtle Business Logic Errors That Tests Miss

Here’s the scary one: AI is excellent at generating code that looks correct and passes tests but has subtle business logic errors.

Example from last month: Payment processing code that correctly handled the happy path and obvious error cases (which tests covered) but had wrong behavior for a specific edge case (international transactions over $10k with partial refunds) that only a domain expert would spot.

The tests passed because we didn’t have a test for that edge case. The code looked reasonable. We only caught it because a reviewer who’d worked on our payment system for 3 years noticed something that “felt off.”

The Degrading Review Quality Problem

Here’s what worries me most: Review comments per PR are down 26% despite code complexity being unchanged.

I think reviewers are overwhelmed. When you’re looking at 40 PRs per week instead of 15, you start doing shallower reviews:

  • Focus on obvious bugs, skip architectural discussion
  • Trust that tests are comprehensive, skip edge case reasoning
  • Approve if it “looks fine,” skip questioning design choices

This is how technical debt accumulates silently.

The Solutions We’re Experimenting With

Michelle mentioned automated quality gates in the other thread—we’re going all-in on this:

Tier 1: Pre-PR Automated Checks

  • Linting, formatting, security scanning (standard stuff)
  • Design pattern validation: Custom tooling that checks if new code follows our architectural patterns
  • Compliance scanning: Flags any code touching financial data for enhanced review
  • Test coverage requirements: AI-generated code must have >90% coverage (vs. 80% for human code)
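
To make that last item concrete, here is a minimal sketch of what a differential coverage gate could look like in CI. It assumes a Jest-style coverage-summary.json and a PR_LABELS variable supplied by the CI job; both names are illustrative, not a specific tool's API.

```typescript
// ci/check-coverage.ts - sketch only; file path, label name, and env var are assumptions.
import { readFileSync } from "fs";

// Jest-style summary shape: { total: { lines: { pct: number }, ... } }
const summary = JSON.parse(readFileSync("coverage/coverage-summary.json", "utf8"));
const linePct: number = summary.total.lines.pct;

// Labels are passed in by the CI job, e.g. parsed from the pull request event payload.
const labels = (process.env.PR_LABELS ?? "").split(",").map((l) => l.trim());

// Stricter threshold for PRs flagged as AI-assisted.
const threshold = labels.includes("ai-assisted") ? 90 : 80;

if (linePct < threshold) {
  console.error(`Line coverage ${linePct}% is below the required ${threshold}%`);
  process.exit(1);
}
console.log(`Coverage OK: ${linePct}% >= ${threshold}%`);
```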

Early results: Catching ~60% of issues that would’ve required review comments. Review time per PR down from 35 min to 28 min.

Tier 2: AI-Assisted Review

Testing CodeRabbit and similar tools to provide automated review comments before humans look:

  • “This looks similar to pattern X elsewhere in the codebase—consider using that instead”
  • “This function has high cyclomatic complexity—consider refactoring”
  • “This logic differs from similar implementations in files Y and Z—intentional?”

Mixed results so far. False positives are high (~40%), so reviewers are learning to ignore some automated comments, which might be training them to ignore warnings in general. Concerning.

Tier 3: Risk-Based Review Depth

Not all code needs the same review rigor:

High risk (payments, auth, customer data):

  • 2+ senior engineer reviewers
  • Compliance review
  • Architecture review
  • Required edge case discussion

Medium risk (business logic, UI):

  • 1 reviewer + automated checks
  • Optional architecture discussion

Low risk (tests, docs, internal tools, configs):

  • Automated checks only
  • Async review (can merge; reviewer comments after the fact)

We tag PRs automatically based on files changed. Only ~20% of PRs are high-risk, but those get 80% of review attention.

Tier 4: Differential Review Standards for AI Code

This is controversial, but we’re testing it: PRs labeled as “AI-assisted” get different review criteria:

  • Required explanation of why AI-suggested approach was chosen
  • Mandatory comparison to existing patterns in codebase
  • Higher test coverage requirements
  • Edge case discussion required

Some engineers hate this (“it stigmatizes AI usage”), but I think it’s honest about the different risk profile.

The Questions I’m Wrestling With

  1. Can code review scale to 2× volume without degrading quality? Or is there a fundamental human attention limit we’re hitting?

  2. Should we invest more in AI-assisted review to match AI-assisted coding? Or is automated review fundamentally limited in ways that human review isn’t?

  3. Is tiered review (different standards for different risk levels) pragmatic or dangerous? Are we creating a two-tier system where low-risk code gets poor review and accumulates debt?

  4. Should AI-generated code have different standards than human code? Or does that create perverse incentives (engineers avoiding AI to avoid scrutiny)?

What I Need From This Community

How are others adapting review processes for AI-generated code volume?

  • Are you accepting longer review times?
  • Reducing review depth?
  • Investing in automated tooling?
  • Limiting AI usage?
  • Redesigning review entirely?

Has anyone successfully scaled review capacity to match AI coding capacity? What worked? What failed?

Because right now, our review process is the constraint preventing AI productivity gains from translating to team velocity—and I don’t think “review faster” or “hire more reviewers” are viable solutions at scale.

We need structural changes to how review works. I’d love to hear what others are trying.

Luis, your tiered review approach resonates with what we’ve learned building design systems—because we’ve essentially been running a parallel experiment in “how do you review at scale when automation generates most of the output.”

Design Review Automation as a Model

In design systems work, we figured out years ago that human review doesn’t scale:

  • Designers produce hundreds of component instances per week
  • Can’t manually review every button, icon, spacing choice
  • Need automated validation to catch standards violations

So we built tooling:

  • Figma plugins that check design token usage - Flags non-standard colors, spacing, typography
  • Automated accessibility checks - Contrast ratios, touch target sizes, semantic markup
  • Pattern matching - “This looks like component X but isn’t using component X—intentional?”

The result: 90% of design review comments are now automated. Human reviewers focus on the 10% that requires judgment—aesthetic choices, user experience, edge cases.

Why This Doesn’t Fully Translate to Code Review

But here’s the challenge: Design review is easier to automate than code review because design has clearer standards and less business logic.

Design review questions:

  • :white_check_mark: Is this color from our palette? (automatable)
  • :white_check_mark: Is this spacing a multiple of 8px? (automatable)
  • :white_check_mark: Does this meet WCAG AA contrast? (automatable)
  • :cross_mark: Is this layout intuitive for users? (requires human judgment)

Code review questions:

  • :white_check_mark: Does this follow linting rules? (automatable)
  • :white_check_mark: Is test coverage >80%? (automatable)
  • :warning: Does this follow our architectural patterns? (partially automatable with custom tooling)
  • :cross_mark: Is this business logic correct for edge cases? (requires domain expertise)

That last category—“does this business logic correctly handle all the edge cases we care about”—is most of what makes code review valuable and most of what AI-generated code gets wrong.

The Automated Design System Adherence Model

That said, I think your approach to automated pattern validation is exactly right. Here’s what works for us that might apply to code review:

1. Make Standards Machine-Readable

We don’t just document design standards in Notion—we encode them as rules that tooling can check:

  • Design tokens in JSON
  • Component specs in structured format
  • Validation rules as code

For code review, this means:

  • Architecture Decision Records (ADRs) as machine-readable constraints
  • Pattern libraries with automated compliance checking
  • “If you touch file X, you must also update file Y” rules
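
As a sketch of what “rules as code” can mean for that last item, here is a minimal companion-file checker. The paths and rules are illustrative placeholders, not actual ADRs.

```typescript
// companion-file-rules.ts - encode "if you touch X, also update Y" constraints as data.
interface CompanionRule {
  ifChanged: RegExp;      // files that trigger the rule
  mustAlsoChange: RegExp; // files that must also appear in the same PR
  reason: string;
}

const rules: CompanionRule[] = [
  { ifChanged: /^db\/schema\//, mustAlsoChange: /^db\/migrations\//, reason: "Schema changes require a migration." },
  { ifChanged: /^src\/api\//, mustAlsoChange: /^docs\/api\//, reason: "Public API changes must update the API docs." },
];

export function checkCompanionFiles(changedFiles: string[]): string[] {
  const violations: string[] = [];
  for (const rule of rules) {
    const triggered = changedFiles.some((f) => rule.ifChanged.test(f));
    const satisfied = changedFiles.some((f) => rule.mustAlsoChange.test(f));
    if (triggered && !satisfied) violations.push(rule.reason);
  }
  return violations;
}

// The changed-file list typically comes from `git diff --name-only origin/main...HEAD`:
// checkCompanionFiles(["db/schema/users.sql"]) -> ["Schema changes require a migration."]
```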

2. Fail Fast with Helpful Guidance

When our Figma plugin detects a violation, it doesn’t just flag it—it suggests the correct approach:

  • “You used #1a73e8 (not a design token). Did you mean primary-blue (#1967d2)?”
  • “This spacing is 12px. Use spacing-sm (8px) or spacing-md (16px)”

For code review:

  • “This payment logic differs from the pattern in payment-processor.ts—see Line 145 for the standard approach”
  • “You’re implementing caching. We use the CacheManager utility for consistency—see docs/architecture/caching.md”

Context-aware suggestions, not just “this is wrong.”
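
For what it's worth, the “did you mean primary-blue?” style of suggestion doesn't require anything sophisticated; a nearest-token lookup over the palette is enough. A minimal sketch, with illustrative token names and hex values:

```typescript
// nearest-token.ts - token names/values are placeholders, not a real palette.
const tokens: Record<string, string> = {
  "primary-blue": "#1967d2",
  "success-green": "#1e8e3e",
  "error-red": "#d93025",
};

function hexToRgb(hex: string): [number, number, number] {
  const n = parseInt(hex.slice(1), 16);
  return [(n >> 16) & 0xff, (n >> 8) & 0xff, n & 0xff];
}

// Euclidean distance in RGB space is crude, but fine for "did you mean...?" hints.
function distance(a: string, b: string): number {
  const [r1, g1, b1] = hexToRgb(a);
  const [r2, g2, b2] = hexToRgb(b);
  return Math.hypot(r1 - r2, g1 - g2, b1 - b2);
}

export function suggestToken(hex: string): string {
  const [name, value] = Object.entries(tokens)
    .sort(([, a], [, b]) => distance(hex, a) - distance(hex, b))[0];
  return `You used ${hex} (not a design token). Did you mean ${name} (${value})?`;
}

// suggestToken("#1a73e8")
// -> 'You used #1a73e8 (not a design token). Did you mean primary-blue (#1967d2)?'
```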

3. Progressive Enhancement of Automation

Start with easy rules, gradually add more sophisticated checks:

Phase 1: Basic pattern matching

  • “You created a new React component but didn’t add it to the component registry”
  • “You modified database schema but didn’t create a migration”

Phase 2: Similarity detection

  • “This function is 85% similar to processRefund() in billing.ts—consider extracting shared logic”

Phase 3: Context-aware validation

  • “This touches customer PII—have you reviewed our data handling guidelines?”
  • “This changes authentication flow—security review required”

The Question of Automated Review Fatigue

Your observation about 40% false positives in AI-assisted review is concerning and matches what we saw early in design automation.

The problem: If automated tools cry wolf too often, humans learn to ignore all automated feedback—including the real issues.

What worked for us:

  1. Tune for high precision over high recall - Better to miss some issues than flag too many false positives
  2. Severity levels - Critical (blocks merge) vs Warning (human decides) vs Info (FYI only)
  3. Learn from dismissals - When humans mark automated comments as “not applicable,” feed that back to improve the model

We aimed for <10% false positive rate on “Critical” automated checks. Anything higher and engineers start ignoring them.

The Risk-Tiering Insight

Your high/medium/low risk categorization is exactly right, and here’s why from a design perspective:

Not all design artifacts need the same review depth:

  • Marketing landing pages: High creativity, low consistency requirement → Light review
  • Design system components: Low creativity, high consistency requirement → Rigorous review
  • Internal admin tools: Medium everything → Standard review

Similarly for code:

  • Customer-facing payments: High correctness requirement → Rigorous review
  • Internal tooling: Lower stakes → Lighter review
  • Test code: Different failure modes → Different review criteria

The mistake is treating all code the same. AI-generated code just makes this more obvious.

What I’d Recommend for Code Review Automation

Based on what worked for design system review:

  1. Invest heavily in pattern validation tooling - Custom linters that understand your architecture
  2. Provide suggestions, not just errors - Help developers fix violations quickly
  3. Tune for low false positives - Better to under-automate than train engineers to ignore warnings
  4. Track what human reviewers catch - Continuously improve automation by learning from human reviews
  5. Different standards for different code - Risk-based review depth is pragmatic, not dangerous

The goal isn’t to eliminate human review—it’s to make human review focus on what humans are uniquely good at: judgment, creativity, edge case reasoning, business logic validation.

Automate the mechanical checks so humans can focus on the hard problems.

That’s how design review scales. I think code review needs the same approach—just with code-specific tooling.

Luis, your risk-tiered review approach is strategically sound, but I want to push on the implementation details because the gap between “good idea in theory” and “works in practice” is where most process redesigns fail.

The Automated Quality Gates Investment

At our company (100+ engineers, cloud infrastructure SaaS), we’ve allocated 20% of our platform engineering capacity to building the review automation infrastructure you’re describing.

Here’s what that actually looks like in practice:

Custom AST-Based Linters

We built custom linters that understand our architecture, not just generic code quality:

  • Pattern enforcers: “If you create a new API endpoint, it must use the AuthMiddleware pattern”
  • Consistency checkers: “This database query uses raw SQL; our standard is to use the ORM for type safety”
  • Coupling detectors: “This frontend component is importing from backend code—architectural violation” (a sketch of this rule appears below)

Development cost: 2 engineers, 3 months to build initial set of rules
Maintenance cost: ~5-10 hours/week adding new rules as architecture evolves

ROI: These catch ~40% of issues that previously required human review. For a team doing 200+ PRs/week, that’s 80 fewer human review comments needed.
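
Of the three rule types, the coupling detector is the simplest to show. Here is a minimal ESLint-style sketch; the @types/eslint dependency and the src/frontend vs. backend path convention are assumptions standing in for whatever your repo layout actually is.

```typescript
// eslint-rules/no-backend-imports-in-frontend.ts - sketch of a "coupling detector" rule.
import type { Rule } from "eslint"; // types provided by @types/eslint

const rule: Rule.RuleModule = {
  meta: {
    type: "problem",
    docs: { description: "Frontend code must not import from backend modules." },
    schema: [],
  },
  create(context) {
    const isFrontendFile = context.getFilename().includes("/src/frontend/");
    return {
      ImportDeclaration(node) {
        const source = String(node.source.value);
        if (isFrontendFile && source.includes("backend")) {
          context.report({
            node,
            message: `Architectural violation: frontend file imports backend module "${source}".`,
          });
        }
      },
    };
  },
};

export default rule;
```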

Architecture Decision Record (ADR) Enforcement

Maya mentioned making standards machine-readable—this is critical. We encode architectural decisions as rules:

ADR example: “All external API calls must go through the ExternalServiceClient wrapper for observability and retry logic”

Automated check: Parse AST, find all HTTP client usages, flag any that aren’t using the wrapper.
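
One way that check could look, again as an ESLint-style sketch. It assumes direct calls show up as fetch() calls or axios imports; the wrapper's file name is illustrative.

```typescript
// eslint-rules/require-external-service-client.ts - sketch of the ADR check described above.
import type { Rule } from "eslint";

const rule: Rule.RuleModule = {
  meta: {
    type: "problem",
    docs: { description: "External API calls must go through ExternalServiceClient." },
    schema: [],
  },
  create(context) {
    // The wrapper itself is allowed to use the raw HTTP clients.
    if (context.getFilename().endsWith("external-service-client.ts")) return {};
    return {
      // Flag direct fetch() calls.
      CallExpression(node) {
        if (node.callee.type === "Identifier" && node.callee.name === "fetch") {
          context.report({ node, message: "Use ExternalServiceClient instead of calling fetch() directly." });
        }
      },
      // Flag imports of raw HTTP clients such as axios.
      ImportDeclaration(node) {
        if (node.source.value === "axios") {
          context.report({ node, message: "Use ExternalServiceClient instead of importing axios directly." });
        }
      },
    };
  },
};

export default rule;
```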

The challenge: ADRs are usually written in prose for human understanding. We had to create a parallel “machine-readable ADR” format—essentially architectural constraints as code.

Development cost: Template creation, tooling integration, team training
Ongoing cost: Updating checks when architecture changes

Intelligent Pre-Review Feedback

This is where AI-assisted review actually works (compared to the 40% false positive problem you mentioned):

Instead of “AI reviews the entire PR,” we use AI for specific, bounded tasks:

  1. Similarity detection: “This function is 87% similar to processPayment() in billing-service—did you mean to reuse that?”
  2. Documentation suggestions: “This public API function lacks a docstring—here’s a generated draft based on the implementation”
  3. Test gap identification: “You modified calculateDiscount() but didn’t update its tests—existing tests may be insufficient”
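
The test-gap check is the easiest of these to approximate without any AI at all. A simplified, file-level sketch (real tooling would work at the function level; the src/ layout and .test.ts naming are assumptions):

```typescript
// test-gap-check.ts - flags source files changed without a matching *.test.ts change in the same PR.
export function findTestGaps(changedFiles: string[]): string[] {
  const changed = new Set(changedFiles);
  return changedFiles
    .filter((f) => f.startsWith("src/") && f.endsWith(".ts") && !f.endsWith(".test.ts"))
    .filter((f) => !changed.has(f.replace(/\.ts$/, ".test.ts")))
    .map((f) => `You modified ${f} but didn't touch its tests; existing coverage may be insufficient.`);
}

// findTestGaps(["src/billing/discount.ts"])
// -> ["You modified src/billing/discount.ts but didn't touch its tests; existing coverage may be insufficient."]
```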

Why this works: Narrow, specific tasks with clear success criteria → Lower false positive rates.

Why general AI review fails: “Review this PR for quality” is too broad → High false positives.

The Risk-Based Categorization Implementation

Your high/medium/low risk tiers make sense conceptually, but how do you actually categorize PRs automatically?

Our approach:

Heuristic-Based Auto-Tagging

High Risk:
- Touches files in `/payment-processing/` or `/auth/` or `/customer-data/`
- Modifies database schema
- Changes API contracts (breaking changes detected via OpenAPI diff)
- Over 500 lines changed in a single file

Medium Risk:
- Business logic outside high-risk domains
- UI components with user input
- 100-500 lines changed

Low Risk:
- Test files only
- Documentation only
- Configuration changes (non-security)
- Under 100 lines changed
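
One way those heuristics could be wired up, as a sketch that omits the OpenAPI-diff piece; the path prefixes are illustrative.

```typescript
// risk-tagger.ts - heuristic PR risk classification from changed files and diff size.
type Risk = "high" | "medium" | "low";

const HIGH_RISK_PATHS = ["payment-processing/", "auth/", "customer-data/"];
const LOW_RISK_PATTERNS = [/\.test\.tsx?$/, /\.md$/, /^config\//];

export function classifyPr(changedFiles: string[], linesChanged: number): Risk {
  if (changedFiles.some((f) => HIGH_RISK_PATHS.some((p) => f.includes(p)))) return "high";
  if (changedFiles.some((f) => f.includes("migrations/"))) return "high"; // schema changes
  if (linesChanged > 500) return "high"; // simplification: total diff size, not per-file

  // Low risk only if every file is a test, doc, or config change and the diff is small.
  const allLowRisk = changedFiles.every((f) => LOW_RISK_PATTERNS.some((re) => re.test(f)));
  if (allLowRisk && linesChanged < 100) return "low";

  return "medium";
}
```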

Accuracy: ~85% correct auto-categorization. Engineers can override if the heuristic is wrong.

False negatives: Occasionally a “low risk” PR touches something subtle and should’ve been high risk. We track these and refine heuristics.

The Differential Standards Question

You asked if AI-generated code should have different review standards. My take: Not different standards, but different review strategies.

Same question we ask: “Is this code correct, maintainable, and aligned with our architecture?”

Different approach:

  • Human code review: “Does the author understand the problem and solution?”
  • AI code review: “Does the author understand the AI-generated solution? Have they verified edge cases beyond what tests cover?”

We don’t label PRs as “AI-assisted” because it creates stigma. Instead, we train reviewers to ask “can you explain why this approach was chosen?” for all PRs. If the author can’t explain it (whether they used AI or copied from Stack Overflow), that’s a review failure.

What Actually Worked vs. What Failed

Worked:

  • Automated pattern compliance checking (high ROI, low false positives)
  • Risk-based categorization with heuristics (pragmatic, good enough)
  • Pre-commit hooks that run fast checks locally (catches issues before PR)
  • Dedicated platform eng capacity for review tooling (not ad-hoc side projects)

Failed:

  • AI review of entire PRs (too many false positives, engineers learned to ignore)
  • Requiring detailed categorization from authors (they always mark as “low risk”)
  • Complex ML-based risk scoring (over-engineered, hard to debug, not much better than heuristics)
  • Trying to automate business logic review (fundamentally requires domain expertise)

The Scaling Question

You asked: “Can code review scale to 2× volume without degrading quality?”

My answer: Not without structural changes, but yes with the right infrastructure investment.

The math:

  • 2× code volume without process change = degraded review quality (what you’re seeing)
  • 2× code volume + automated gates catching 50% of issues = sustainable
  • 2× code volume + automated gates + risk tiering + focused human review on high-risk = actually improves overall quality

But this requires upfront investment. You can’t bolt on automation after the fact while also trying to keep up with 2× PR volume.

My Recommendation: The Platform Engineering Play

Treat code review scalability as a platform engineering problem, not a process problem:

  1. Allocate dedicated eng capacity (2-3 engineers for a 50+ person team)
  2. Build custom tooling for your specific architecture (generic tools won’t cut it)
  3. Make review automation a first-class platform service (not a side project)
  4. Iterate based on what human reviewers are catching (automate the repetitive comments)

This isn’t cheap. But if AI is genuinely making individual engineers 40% more productive, the ROI of unlocking that productivity at the team level is massive.

The alternative is accepting that AI productivity gains stay individual-level only—which might be fine for small teams, but at scale, that’s leaving significant value on the table.

Review automation is infrastructure investment. Treat it like you’d treat building a CI/CD pipeline or observability platform—essential for operating at scale, not optional.

This conversation about review process redesign is timely—we’ve been running an experiment for the past 2 months that directly addresses the “AI-generated code requires different review” question, and the results challenge some assumptions.

The “AI-Assisted Code” Label Experiment

We added an optional label to PRs: “AI-Assisted” (engineer self-reports if >50% of code was AI-generated).

Hypothesis: AI-generated code has different failure modes, so we should review it differently—more emphasis on edge cases and business logic, less on syntax and style.

We trained reviewers on these differential focus areas:

For AI-assisted PRs:

  • :white_check_mark: Ask author to explain the approach (not just what, but why)
  • :white_check_mark: Verify edge case handling beyond what tests cover
  • :white_check_mark: Check if AI solution aligns with existing patterns
  • :white_check_mark: Question whether simpler approach exists

For human-written PRs:

  • Standard review process

What We Learned (Some Surprises)

1. Low Adoption of the Label

Only ~30% of engineers actually label their AI-assisted PRs, even though we know from surveys that 75%+ are using AI tools.

Reasons (from retrospective):

  • Forgot to add the label
  • Felt stigmatized (“my code will get extra scrutiny”)
  • Unclear threshold (“I used Copilot for autocomplete but wrote the logic myself—does that count?”)

Insight: Self-reporting doesn’t work. If AI-generated code needs different review, the detection needs to be automatic or universal.

2. Labeled PRs Had Longer Review Times

AI-assisted PRs: Average 42 minutes review time
Human PRs: Average 28 minutes review time

But not for the reason we expected. Reviewers spent more time because they were teaching the author to understand their own code, not because the code itself was harder to review.

Common review thread:

Reviewer: “Why did you choose this caching strategy?”
Author: “Copilot suggested it, seemed reasonable?”
Reviewer: “What happens if the cache is stale during a concurrent update?”
Author: “Hmm, I’m not sure. Let me investigate.”

This wasn’t review—it was mentorship and debugging disguised as review.

3. The Code Quality Difference Was Smaller Than Expected

Here’s the controversial finding: When we measured bugs found in production in the first 30 days post-deployment:

  • AI-assisted PRs: 1.4 bugs per 100 PRs
  • Human PRs: 1.1 bugs per 100 PRs

Yes, AI code had more bugs, but only 27% more, not the 70% difference we expected from research saying AI code has 1.7× more issues.

Hypothesis: Our review process is catching many of the AI-specific issues, so by the time code ships, the quality gap is smaller.

Alternative hypothesis: Our team’s AI usage is heavily biased toward experienced engineers who know how to validate AI output, so we’re not seeing the worst-case scenarios.

4. The Real Difference: Maintenance Burden

Where AI code really diverged: Technical debt and maintenance cost.

We tracked “time spent modifying code in the 90 days after initial deployment” as a proxy for maintenance burden:

  • AI-assisted code: 3.2 hours average maintenance time per feature
  • Human code: 2.1 hours average maintenance time per feature

53% higher maintenance burden for AI-generated code.

Why? Common patterns:

  • Non-standard approaches that new team members don’t recognize
  • Missing edge case handling discovered later
  • Tighter coupling (AI tends toward quick solutions, not architected ones)
  • Harder to modify because original author didn’t deeply understand design choices

Revised Perspective: Review Isn’t the Only Gate

Michelle and Luis are right that automated quality gates help. But I think we’re all over-indexing on review as the quality gate when the real issue is engineering discipline upstream of review.

The problem isn’t “how do we review AI code better”—it’s “how do we ensure engineers using AI maintain the same design discipline as engineers writing code manually?”

What Changed Our Approach

We shifted from “review AI code differently” to “ensure AI-assisted development includes human design thinking”:

New requirement for all PRs (AI or human):

  1. Design doc for features >100 LOC - Forces engineer to think through approach before coding
  2. “Why this approach?” section in PR description - Must explain choices, not just describe changes
  3. Edge case checklist - Standard list of edge cases to verify (concurrency, error handling, boundary conditions, etc.)

This shifts the quality gate earlier—before coding, during design. Review then validates that the design was followed and edge cases were considered.
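
Items 2 and 3 are straightforward to enforce mechanically once the PR template exists. A minimal sketch of what the CI check could look like; the section headings are illustrative and should match whatever template you actually use.

```typescript
// pr-description-check.ts - fails the build if required design sections are missing.
const REQUIRED_SECTIONS = ["## Why this approach?", "## Edge cases considered"];

export function checkPrDescription(body: string): string[] {
  return REQUIRED_SECTIONS
    .filter((section) => !body.includes(section))
    .map((section) => `PR description is missing a "${section}" section.`);
}

// In CI, `body` comes from the pull request payload; failing the job on any returned
// message turns the design discipline into a hard gate rather than a convention.
```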

Result so far (6 weeks in):

  • Review time actually increased slightly (32 min avg) because design docs take time to review
  • But post-deployment bugs down 40% and maintenance burden down 25%
  • Engineers report feeling more confident in their AI-generated code because they designed it first, then used AI to implement

The Core Insight: AI Makes Implementation Cheap, So Design Becomes Critical

When writing code manually is slow, engineers naturally think carefully about design—the cost of getting it wrong is high.

When AI makes implementation fast, the cost of iteration feels low, so engineers skip design and jump straight to coding. “I’ll just try this approach and see if it works.”

But the downstream costs (bugs, maintenance, technical debt) don’t change just because implementation is faster.

The solution isn’t different review—it’s enforcing design discipline when implementation feels cheap.

What I’m Recommending

Based on our experiment:

  1. Don’t label AI vs. human code - Creates stigma and low compliance
  2. Universal review standards - All code should be understandable, maintainable, well-designed
  3. Upstream quality gates - Design docs, edge case checklists, architectural discussion before coding
  4. Review for understanding - Key question: “Does the author deeply understand this code?” (regardless of who wrote it)
  5. Track maintenance burden - It’s a more important quality metric than initial bug count

The goal: Make AI a tool that implements good designs faster, not a shortcut around design thinking.

Because the productivity paradox isn’t just about review capacity—it’s about maintaining engineering discipline when the feedback loop from “bad design” to “working code” has been artificially shortened by AI.

Fast implementation + poor design = technical debt.
Fast implementation + good design = actual productivity.

Review can’t fix bad design. It can only catch the consequences.

The design-first approach Keisha described is exactly right—and it connects to broader questions about how AI tools change product development workflow, not just code review.

The Product Perspective on “Cheap Implementation”

From the product side, I’m seeing the same pattern Keisha identified in engineering: when implementation feels cheap, discipline around scoping and requirements degrades.

Before AI Assistants

Product-Engineering conversation:

PM: “For this feature, I’d like option A, B, and C.”
Eng: “That’s three weeks of work. Do we really need all three options, or can we ship with just A for the first version?”
PM: “Good point. Let’s start with A, validate with users, then decide if B and C are worth building.”

Result: Disciplined scope management, incremental delivery, validated learning.

After AI Assistants

Same conversation:

PM: “For this feature, I’d like options A, B, and C.”
Eng: “With Copilot, I can probably knock all three out in a week. Sure, let’s do it.”
PM: “Great!”

Result: Feature ships with 3× the complexity, 3× the surface area for bugs, 3× the maintenance burden, and we still don’t know which options users actually want.

The False Economy of “Cheap” Features

Here’s what I’m learning: AI doesn’t make features cheaper—it makes implementation cheaper while increasing costs everywhere else:

  • Implementation: :white_check_mark: 40% faster
  • Testing: :cross_mark: 3× more test cases needed due to added complexity
  • Documentation: :cross_mark: 3× more to document
  • User education: :cross_mark: more complex UI, harder to explain
  • Support burden: :cross_mark: more features = more support questions
  • Maintenance: :cross_mark: 53% higher maintenance cost (Keisha’s data)

We’re optimizing the smallest cost (implementation) while ballooning every other cost.

The Requirements Discipline Problem

Keisha’s solution—design docs for features >100 LOC—is spot on. We’re implementing a parallel discipline on the product side:

New requirements process:

  1. User story with clear value hypothesis - Why are we building this? What user problem does it solve?
  2. Success metrics defined upfront - How will we know if this feature worked?
  3. Minimum scope definition - What’s the smallest version that tests the hypothesis?
  4. Complexity budget - Maximum acceptable implementation scope (e.g., “no more than 200 LOC”)

The complexity budget is new and controversial. Engineers push back: “But with AI, I can implement the fuller version in the same time!”

My response: “Time to implement isn’t the constraint—time to validate, support, and maintain is.”

Risk-Based Requirements Depth

Building on Luis’s risk-based review tiers and Keisha’s design discipline:

High-value, high-risk features:

  • Full design doc, user research, prototype testing
  • Incremental rollout with metrics monitoring
  • Post-launch review to validate value hypothesis
  • AI used for implementation speed, not scope expansion

Experimental features:

  • Lightweight requirements, fast implementation
  • Time-boxed (if not validated in 30 days, remove)
  • AI enables rapid experimentation, ok to have technical debt temporarily

Infrastructure/tooling:

  • Heavy emphasis on design and architecture
  • AI-assisted implementation after thorough design
  • Long-term maintenance expected, so quality critical

What Product Can Do to Support Engineering Discipline

Michelle, Luis, and Keisha are all describing engineering-side solutions. Here’s what we’re doing on the product side to complement:

1. Scope Discipline

When eng says “AI makes this easy to build,” I now ask:

  • “Does this complexity align with user needs?”
  • “Are we solving a validated problem or speculating?”
  • “What’s the maintenance burden of this approach?”

Saying no to “easy to build” features that don’t have clear value is my new discipline.

2. Incremental Delivery Requirements

We mandate smallest possible first version, even if AI makes the fuller version fast to implement.

Why? Because:

  • We can’t validate whether the feature works until users try it
  • Users might want something different than we built
  • Technical complexity is permanent; feature scope can grow later if validated

3. Post-Launch Review Rigor

We’re more aggressive about sunsetting features that don’t show usage or value.

If AI makes features cheap to build, we can afford to experiment more—but that means we must also remove failed experiments faster. Otherwise, we accumulate a graveyard of unused features with ongoing maintenance burden.

The Uncomfortable Truth

Here’s what I’m realizing: AI coding assistants make it easier to build the wrong thing fast.

The bottleneck in product development was never “how fast can we type code”—it was:

  • Figuring out what to build
  • Validating we built the right thing
  • Maintaining what we built

AI accelerates the middle step but doesn’t touch the first or last. So we can now build wrong things faster and create more maintenance burden.

The productivity paradox from a product lens: Engineering velocity feels higher, but product velocity (validated value delivered to users) hasn’t changed because we’re still bottlenecked on knowing what to build and validating it worked.

My Ask to Engineering Leadership

When you’re redesigning code review and development processes for AI-scale output, please include product discipline checkpoints:

  • Require value hypothesis and success metrics before starting implementation
  • Enforce minimum viable scope even when fuller implementation is “easy”
  • Track feature complexity and maintenance burden, not just implementation speed
  • Partner with product on “build vs. learn” decisions

Because the risk isn’t just technical debt—it’s product debt: features that seemed easy to build but don’t deliver value and now must be maintained forever.

AI makes implementation cheap. That makes saying “no” to unnecessary complexity more important, not less.

Review process redesign should include requirements review, not just code review.

Otherwise, we’re just building the wrong things faster.