
The Rubber-Stamp Collapse: Why AI-Authored PRs Are Hollowing Out Code Review

10 min read
Tian Pan
Software Engineer

A senior engineer approves a 400-line PR in four minutes. The diff is clean. Names are sensible. Tests pass. Two weeks later the on-call engineer is paging through a query that returns the right shape of rows but from the wrong column — user.updated_at where user.created_at was meant — and the cohort analysis dashboard has been quietly lying to the CFO for nine days. The reviewer was competent. The code was well-structured. The bug was invisible in the diff because it wasn't a syntactic smell. It was a semantic one, and the reviewer had nothing to anchor against because no one had written down what the change was supposed to do.

This is the failure mode that shows up once the majority of diffs in your repo start life as model output. Reviewers stop asking "is this correct?" and start asking "does this look like code?" The answer is almost always yes. AI-authored code is grammatically fluent in a way that bypasses the review heuristics engineers spent a decade sharpening on human-written slop.

The numbers are not subtle. One large-scale study of pull requests across enterprise repositories found AI-assisted PRs generate about 1.7× more issues overall, with critical-severity issues up roughly 40% and logic-and-correctness findings up 75%. At the 90th percentile, AI-authored PRs contain 26 issues per change — more than double the human baseline. Another 2026 telemetry dataset, covering 22,000 developers, reported incidents per PR up 242.7% while individual output metrics climbed. Throughput went up. Escaped defects went up faster.

Why the rubber stamp forms

Rubber stamping is not laziness. It is a rational response to a shifting signal-to-noise ratio. When most PRs were human-authored, a reviewer's brain ran a background classifier: sloppy naming, inconsistent indentation, a function that does two things, a boolean with a negative name — these were correlated with deeper problems. The syntactic smell was a cheap proxy for semantic risk, and it worked, because humans who cut corners on form usually cut corners on logic too.

AI output breaks that correlation. The model emits code that looks like the work of a careful engineer: docstrings in the house style, error messages that match the project's tone, variable names chosen from the right thesaurus. The form is detached from the reasoning that produced it. A reviewer scanning for syntactic smell is now scanning for a signal that is no longer there.

Worse, the failure modes are ones the human eye is poorly tuned to catch. Hallucinated APIs that call a method that doesn't exist in your version of the library. A query that joins on the right table but projects the wrong column. An except clause that swallows the specific error the caller was relying on to retry. A loop that runs N+1 queries because the model defaulted to a per-row lookup instead of the batched one the codebase already has. These don't render as diff-level problems. They render as ordinary, plausible code — which is exactly what AI is optimized to produce.
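
To make a few of these concrete, here is a minimal sketch of the shape such bugs take. Every name in it — the tables, the client methods, the error classes — is invented for illustration, not taken from any real codebase:

```python
# Hypothetical sketches of "plausible but wrong" output; all names are invented.

# 1. Wrong column, right shape: the query runs, the dashboard renders,
#    and every cohort date is silently wrong.
COHORT_QUERY = """
    SELECT u.id, u.updated_at AS cohort_date   -- meant: u.created_at
    FROM users u
    JOIN orders o ON o.user_id = u.id
"""

# 2. Swallowed exception: the caller retries when TransientError propagates,
#    but the blanket except converts it into a silent None, so the retry
#    never fires.
def fetch_order(order_id, client):
    try:
        return client.get_order(order_id)
    except Exception:   # meant to catch only NotFoundError; TransientError is swallowed too
        return None

# 3. N+1 lookup: one call per user instead of the batched helper the
#    codebase already has.
def orders_for(users, client):
    return {u.id: client.get_orders(user_id=u.id) for u in users}   # N round trips
```

None of these would trip a linter, and all of them read as ordinary code in a diff.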

Pair this with the size shift. AI-authored PRs are larger on average; one study pegged median PR size up 51.3% and median review time up 441%. Past about 400 lines, reviewers are not reviewing; they are sampling. The larger the diff, the more decisive the rubber stamp, because at some point the reviewer's choice is between a superficial approval and an honest "I can't review this in reasonable time" — and the first option has no political cost while the second reads as obstructionism.

The signal degradation nobody graphs

The organizational tell of rubber-stamp collapse is a pair of metrics moving in opposite directions on AI-authored changes specifically. Review time per PR drops (the reviewer doesn't engage deeply because there is no purchase). Bug escape rate on those same PRs rises. Neither movement is alarming on its own. Together they form the signature.

Most teams don't cut their review metrics by authorship mode, so the trend is invisible. Leadership sees aggregate throughput improving and aggregate cycle time shortening and concludes the AI rollout is working. The incident curve catches up three to six months later, and because incidents are noisy and multi-causal, the link back to the review process is easy to miss. Six months is also long enough that the engineers who could have noticed the drift have already normalized the new equilibrium.
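
Producing the slice is not a tooling project. A minimal sketch, assuming you can export per-PR records tagged with an authorship flag, review minutes, and whether a defect later escaped — all of the field names here are hypothetical:

```python
from statistics import median

def review_signature(prs):
    """Split review time and escape rate by authorship mode.

    `prs` is an iterable of dicts with hypothetical fields:
    ai_authored (bool), review_minutes (float), escaped_defect (bool).
    """
    buckets = {True: [], False: []}
    for pr in prs:
        buckets[bool(pr["ai_authored"])].append(pr)

    for mode, rows in buckets.items():
        if not rows:
            continue
        med_review = median(r["review_minutes"] for r in rows)
        escape_rate = sum(r["escaped_defect"] for r in rows) / len(rows)
        label = "AI-authored" if mode else "human-authored"
        print(f"{label}: median review {med_review:.1f} min, "
              f"escape rate {escape_rate:.1%} over {len(rows)} PRs")
```

If the AI-authored bucket shows review time falling while escape rate rises, you have the signature described above, regardless of what the aggregate dashboards say.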

A few leading indicators that your review quality has crossed the line and nobody has said so out loud:

  • Reviewers increasingly approve PRs that touch parts of the codebase they have never worked in, without asking a question.
  • Comments skew toward nits (naming, import order, one-line refactors) and away from "what does this actually do?"
  • Rollback reasons shift from "misunderstood requirement" (a review-catchable class) to "edge case in production data" (often not, but now a convenient attribution).
  • The phrase "the AI wrote most of it, I just cleaned it up" appears in standups and is never challenged.
  • Post-incident reviews stop surfacing review gaps, because nobody wants to write "the reviewer didn't read it carefully" about a colleague, and "the AI hallucinated a method" is easier to blame.

None of these are actionable on their own. The pattern is.

Countermeasures that survive contact with a real team

The instinct is to mandate heavier reviews across the board. This fails, because the ceiling on reviewer attention is fixed and AI-authored volume isn't, so you end up with the same rubber stamp applied to even more code. The countermeasures that actually hold are ones that redirect attention rather than demand more of it.

A human-written intent section in the PR body. Not a description of the change — a description of the intent behind the change. What is the reviewer supposed to verify happened? Which invariants must still hold after this merge? Which were previously enforced only by the reviewer having the codebase in their head, and are now at risk because the author did not? This is the single most useful lever. It forces the author to separate what the model generated from what the change is for, and it gives the reviewer a non-AI anchor to compare against. If the author can't write the intent section clearly, they haven't understood their own PR.
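
The intent section only helps if it is actually filled in, so it is worth failing the build when it is missing or still boilerplate. A rough sketch of that check — the heading name, the placeholder marker, the 80-character floor, and the assumption that CI hands the PR body over as a file are all illustrative choices, not a prescribed setup:

```python
import re
import sys

REQUIRED_HEADING = "## Intent"              # hypothetical heading in the PR template
PLACEHOLDER = "<!-- describe intent -->"    # hypothetical placeholder text

def check_intent_section(pr_body: str) -> str | None:
    """Return an error message if the intent section is missing or unfilled."""
    if REQUIRED_HEADING not in pr_body:
        return "PR body has no intent section."
    # Take the text between the intent heading and the next heading (or EOF).
    section = re.split(r"^## ", pr_body.split(REQUIRED_HEADING, 1)[1],
                       flags=re.MULTILINE)[0]
    if PLACEHOLDER in section or len(section.strip()) < 80:
        return "Intent section is still a placeholder or too thin to review against."
    return None

if __name__ == "__main__":
    # Assumes the CI job writes the PR description to a file passed as argv[1].
    error = check_intent_section(open(sys.argv[1]).read())
    if error:
        sys.exit(error)
```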

Rotate an "AI-adversary" reviewer role. One reviewer on every significant PR is assigned the explicit job of finding the semantic bugs the syntactic review misses. They read the diff against the intent section and look for the class of errors humans skip: column-level mistakes, hallucinated APIs, wrong-but-plausible library usage, silently swallowed exceptions, invariants preserved in the wrong scope. The role is explicit because without it, everyone assumes someone else is doing this work, and nobody is.

AI-specific checklists targeting the failure modes humans miss, not the ones they catch. Style, formatting, and naming are already handled by linters and the model itself. The checklist should cover: is every external API actually a real one? Does every schema reference match the current schema, not the one three migrations ago? Is error handling specific or a bare except? Does the new code reuse the existing helpers for cross-cutting concerns (auth, retries, logging) or did the model reinvent them inline? Were tests added that would fail if the bug class the change is trying to prevent actually occurred? Checklist items that overlap with linter coverage are worse than useless — they train reviewers to skim the checklist.
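
A few of the items are cheap to pre-screen mechanically, so the reviewer's attention goes to the ones that aren't. A deliberately naive sketch of the schema-reference check, assuming you can dump the current schema into a table-to-columns mapping; the regex and the diff handling are illustrative, and the output is a pointer for a human, not a verdict:

```python
import re

def unknown_column_refs(diff_text: str, schema: dict[str, set[str]]) -> list[str]:
    """Flag table.column references in added lines that aren't in the current schema.

    `schema` maps table name -> column names, dumped from information_schema,
    ORM metadata, or a schema file. Deliberately naive and regex-based:
    the goal is to route reviewer attention, not to replace a linter.
    """
    findings = []
    for line in diff_text.splitlines():
        if not line.startswith("+"):                    # only inspect added lines
            continue
        for table, column in re.findall(r"\b(\w+)\.(\w+)\b", line):
            if table in schema and column not in schema[table]:
                findings.append(f"{table}.{column}: {line.strip()}")
    return findings

# This catches references to columns that no longer exist (the schema from three
# migrations ago). The opening anecdote's wrong-but-real column still needs a
# human reading the diff against the intent section.
```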

A PR-size ceiling for AI-authored changes. If the PR is large, split it. This is not a style preference; it is an acknowledgment that the rubber stamp is mostly a function of size. A 60-line AI-authored PR gets a real review. A 600-line one gets a signature.
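
The ceiling is enforceable in a few lines of CI. A sketch that assumes the job runs only on PRs labeled as AI-authored and reuses the 400-line figure from above as a starting point rather than a recommendation:

```python
import subprocess
import sys

MAX_AI_PR_LINES = 400   # the sampling threshold discussed above; tune per repo

def changed_lines(base_ref: str = "origin/main") -> int:
    """Count added + deleted lines against the base branch via git diff --numstat."""
    out = subprocess.run(
        ["git", "diff", "--numstat", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0
    for row in out.splitlines():
        added, deleted, _path = row.split("\t")
        if added != "-":                      # binary files report "-"
            total += int(added) + int(deleted)
    return total

if __name__ == "__main__":
    # Assumes the CI workflow only invokes this script on AI-authored PRs.
    lines = changed_lines()
    if lines > MAX_AI_PR_LINES:
        sys.exit(f"{lines} changed lines exceeds the {MAX_AI_PR_LINES}-line "
                 "ceiling for AI-authored PRs; split the change.")
```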

Spend-to-review ratio tracking. Track human review minutes against both the size of the change and the time it took to produce. A 30-second model call that yields a 400-line change and collects a 2-minute approval is a rubber stamp by arithmetic alone; no judgment about the reviewer's diligence is required. Pick a floor — review minutes per hundred changed lines on AI-authored PRs is one workable form — and escalate when a PR class falls below it.
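
A sketch of that floor check, with the field names and the threshold value hypothetical:

```python
REVIEW_FLOOR_MIN_PER_100_LINES = 2.0   # hypothetical floor; tune per repo and PR class

def rubber_stamp_candidates(prs, floor=REVIEW_FLOOR_MIN_PER_100_LINES):
    """Yield AI-authored PRs whose review time is implausibly low for their size.

    Each PR is a dict with hypothetical fields: number, ai_authored (bool),
    changed_lines (int), review_minutes (float).
    """
    for pr in prs:
        if not pr["ai_authored"] or pr["changed_lines"] == 0:
            continue
        minutes_per_100 = pr["review_minutes"] / (pr["changed_lines"] / 100)
        if minutes_per_100 < floor:
            # e.g. a 400-line change reviewed in 2 minutes -> 0.5 min per 100 lines
            yield pr["number"], round(minutes_per_100, 2)
```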

None of these require changing how engineers use AI to write code. They change how the team treats the output as a reviewable artifact.

The disclosure question leadership has to answer out loud

Every engineering organization with more than a handful of people using AI to write code will have to answer a question that most would rather duck: is "Claude wrote this" (or Copilot, or Cursor's background agent) a required field on the PR, optional metadata, or a cultural taboo?

There is no neutral default. Not asking is a choice — the one where authors quietly ship AI output without disclosing it, reviewers cannot calibrate their attention, and the organization loses the ability to slice review quality by authorship mode when incidents show up. Asking imperfectly (an optional checkbox) produces selection bias: people disclose when they think the AI's work is good and hide it when they think it's risky, which is the opposite of what's useful.

The coherent positions are:

  • Required disclosure, with authorship and provenance. PRs must indicate which sections were model-generated, which were hand-edited, and which prompt or agent produced them. This is what the Linux kernel community has been converging toward with Co-developed-by: trailers, and what the Apache and Fedora projects have codified as Generated-by: and Assisted-by: tags. It adds friction. It also makes the review-quality-by-mode question answerable. (An illustrative commit footer follows this list.)
  • Structural disclosure via workflow. AI-authored PRs flow through a separate review path — a different template, a mandatory intent section, a dedicated reviewer role. Authors don't have to label individual lines; the workflow itself is the disclosure. This scales better than per-line annotation and is harder to cheat.
  • Treated as author-accountable with no disclosure. The author is fully responsible regardless of who typed the tokens, and the review bar is raised on every PR to the level you'd apply to AI output. Honest, but expensive, and most organizations will say this is their policy and then not actually pay for it.
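
For the first of these positions, the disclosure can ride in ordinary Git trailers rather than bespoke tooling. An illustrative commit footer — the trailer names echo the projects mentioned above, but which ones an organization adopts, and how specific the values must be, is a local convention rather than a standard:

```text
Add batched order lookup for the cohort export

Generated-by: <tool and version that produced the bulk of the diff>
Assisted-by: <tool used for smaller edits or suggestions, if any>
Co-developed-by: <tool or person credited as co-author, per your convention>
Signed-off-by: <the human author who takes responsibility for the change>
```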

The dishonest position — the one many teams are drifting into by default — is to say disclosure is optional, see nobody using the optional field, and conclude that AI authorship isn't a meaningful factor. It is a meaningful factor. The absence of the signal is a choice about measurement, not about reality.

The curve that catches up

The team that celebrated 10× PR throughput in their all-hands six months ago is the same team whose quarterly incident count quietly doubled. The two numbers are the same story told at different lag times. Throughput is what you see in the moment; escaped defects are what you see once the code has been in production long enough for the edge cases to arrive. Most organizations don't retroactively graph the first against the second, because by the time they would, the throughput win has already been priced into next quarter's commitments and the incident count is already being attributed to "we're just moving faster now, some regression is expected."

It isn't. The regression is a review-process failure, and the review process failed because the signals it ran on stopped working when the authorship distribution shifted. Fix the review process — intent sections, adversary reviewers, checklists aimed at semantic failure, size limits, and honest disclosure — and the throughput gain survives. Don't, and the rubber stamp compounds until the next painful incident forces the conversation anyway. The choice is whether you have the conversation on your schedule or on the incident's.
