The Rubber-Stamp Collapse: Why AI-Authored PRs Are Hollowing Out Code Review
A senior engineer approves a 400-line PR in four minutes. The diff is clean. Names are sensible. Tests pass. Two weeks later the on-call engineer is paging through a query that returns the right shape of rows but from the wrong column — user.updated_at where user.created_at was meant — and the cohort analysis dashboard has been quietly lying to the CFO for nine days. The reviewer was competent. The code was well-structured. The bug was invisible in the diff because it wasn't a syntactic smell. It was a semantic one, and the reviewer had nothing to anchor against because no one had written down what the change was supposed to do.
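A minimal sketch of the kind of change in question. The table, column names, and query below are hypothetical, but the shape of the bug is the point: both columns are timestamps, the query runs, the dashboard renders, and nothing in the diff marks the swap.

```python
# Hypothetical cohort query, illustrative names only. The bug is the column
# choice: cohorts should be keyed on signup date (created_at), not last edit
# (updated_at). Both are timestamps, so the result has the right shape and the
# diff reads as perfectly ordinary SQL.
COHORT_QUERY = """
    SELECT date_trunc('month', u.updated_at) AS cohort_month,  -- meant: u.created_at
           count(*)                          AS users
    FROM   users u
    GROUP  BY 1
    ORDER  BY 1
"""
```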
This is the failure mode that shows up once the majority of diffs in your repo start life as model output. Reviewers stop asking "is this correct?" and start asking "does this look like code?" The answer is almost always yes. AI-authored code is grammatically fluent in a way that bypasses the review heuristics engineers spent a decade sharpening on human-written slop.
The numbers are not subtle. One large-scale study of pull requests across enterprise repositories found AI-assisted PRs generate about 1.7× more issues overall, with critical-severity issues up roughly 40% and logic-and-correctness findings up 75%. At the 90th percentile, AI-authored PRs contain 26 issues per change — more than double the human baseline. Another 2026 telemetry dataset, covering 22,000 developers, reported incidents per PR up 242.7% while individual output metrics climbed. Throughput went up. Escaped defects went up faster.
Why the rubber stamp forms
Rubber stamping is not laziness. It is a rational response to a shifting signal-to-noise ratio. When most PRs were human-authored, a reviewer's brain ran a background classifier: sloppy naming, inconsistent indentation, a function that does two things, a boolean with a negative name — these were correlated with deeper problems. The syntactic smell was a cheap proxy for semantic risk, and it worked, because humans who cut corners on form usually cut corners on logic too.
AI output breaks that correlation. The model emits code that looks like the work of a careful engineer: docstrings in the house style, error messages that match the project's tone, variable names chosen from the right thesaurus. The form is detached from the reasoning that produced it. A reviewer scanning for syntactic smell is now scanning for a signal that is no longer there.
Worse, the failure modes are ones the human eye is poorly tuned to catch. A call to a hallucinated API, a method that looks plausible but doesn't exist in your version of the library. A query that joins on the right table but projects the wrong column. An except clause that swallows the specific error the caller was relying on to retry. A loop that runs N+1 queries because the model defaulted to a per-row lookup instead of the batched one the codebase already has. These don't render as diff-level problems. They render as ordinary, plausible code — which is exactly what AI is optimized to produce.
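Two of these, sketched with hypothetical names; the repository helpers and the URL-fetching function are invented for illustration, not drawn from any real codebase or library beyond requests itself.

```python
# Illustrative sketches of two failure modes; every name here is hypothetical.
import requests  # real HTTP library, used only for the retry example


def enrich_orders_per_row(orders, repo):
    # Plausible, idiomatic, and quietly issues one lookup per order (N+1).
    return [(order, repo.get_user(order.user_id)) for order in orders]


def enrich_orders_batched(orders, repo):
    # The version the codebase likely already supports: one lookup for all users.
    users_by_id = repo.get_users_by_ids({order.user_id for order in orders})
    return [(order, users_by_id[order.user_id]) for order in orders]


def fetch_profile(url):
    # The caller retries on requests.Timeout; this broad except converts the
    # timeout into a silent None, so the retry never fires.
    try:
        return requests.get(url, timeout=2).json()
    except Exception:
        return None
```

In a diff, the per-row version and the broad except both read as careful, defensive code. The problem only exists relative to knowledge the reviewer has to bring from outside the diff.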
Pair this with the size shift. AI-authored PRs are larger on average; one study found median PR size up 51.3% and median review time up 441%. Past about 400 lines, reviewers are not reviewing; they are sampling. The larger the diff, the more decisive the rubber stamp, because at some point the reviewer's choice is between a superficial approval and an honest "I can't review this in reasonable time" — and the first option has no political cost while the second reads as obstructionism.
The signal degradation nobody graphs
The organizational tell of rubber-stamp collapse is a pair of metrics moving in opposite directions on AI-authored changes specifically. Review time per PR drops (the reviewer doesn't engage deeply because there is no purchase). Bug escape rate on those same PRs rises. Neither movement is alarming on its own. Together they form the signature.
Most teams don't cut their review metrics by authorship mode, so the trend is invisible. Leadership sees aggregate throughput improving and aggregate cycle time shortening and concludes the AI rollout is working. The incident curve catches up three to six months later, and because incidents are noisy and multi-causal, the link back to the review process is easy to miss. Six months is also long enough that the engineers who could have noticed the drift have already normalized the new equilibrium.
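Making the signature visible is mostly a matter of segmentation. Here is a minimal sketch, assuming you can tag each PR as AI-assisted (via a label, a commit trailer, or assistant telemetry) and can join PRs to review time and escaped defects; the field and function names are hypothetical.

```python
# Minimal sketch: split review metrics by authorship mode so the two opposing
# trends become visible. Assumes a per-PR record with an ai_assisted flag and
# links to later incidents; field names are hypothetical.
from dataclasses import dataclass
from statistics import median


@dataclass
class PullRequest:
    ai_assisted: bool
    review_minutes: float
    escaped_defects: int   # incidents or bugs later traced back to this PR


def review_signature(prs: list[PullRequest]) -> dict:
    out = {}
    slices = (("ai", [p for p in prs if p.ai_assisted]),
              ("human", [p for p in prs if not p.ai_assisted]))
    for label, group in slices:
        if not group:
            continue
        out[label] = {
            "median_review_minutes": median(p.review_minutes for p in group),
            "defects_per_pr": sum(p.escaped_defects for p in group) / len(group),
        }
    return out

# The rubber-stamp signature: median_review_minutes falls on the "ai" slice
# while defects_per_pr rises, even when the aggregate numbers look healthy.
```

The specific tagging mechanism matters less than consistency; even a self-reported label is enough to separate the two trend lines that leadership is currently reading as one.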
A few leading indicators that your review quality has crossed the line and nobody has said so out loud:
- Reviewers increasingly approve PRs that touch parts of the codebase they have never worked in, without asking a question.
- Comments skew toward nits (naming, import order, one-line refactors) and away from "what does this actually do?"
- Rollback reasons shift from "misunderstood requirement" (a review-catchable class) to "edge case in production data" (often not review-catchable, but now the convenient attribution).
- The phrase "the AI wrote most of it, I just cleaned it up" appears in standups and is never challenged.
- Post-incident reviews stop surfacing review gaps, because nobody wants to write "the reviewer didn't read it carefully" about a colleague, and "the AI hallucinated a method" is easier to blame.
None of these are actionable on their own. The pattern is.
Sources
- https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report
- https://www.helpnetsecurity.com/2025/12/23/coderabbit-ai-assisted-pull-requests-report/
- https://byteiota.com/ai-code-review-crisis-prs-up-20-quality-down-23/
- https://www.faros.ai/blog/key-takeaways-from-the-dora-report-2025
- https://stackoverflow.blog/2026/01/28/are-bugs-and-incidents-inevitable-with-ai-coding-agents/
- https://flamehaven.substack.com/p/the-pull-request-illusion-how-ai
- https://bryanfinster.substack.com/p/ai-broke-your-code-review-heres-how
- https://arxiv.org/html/2512.05239v1
- https://earezki.com/ai-news/2026-04-04-ai-code-review-checklist/
- https://medium.com/@marcusavangard/code-review-is-broken-heres-why-your-team-keeps-shipping-bugs-anyway-38c32c3a961b
