
Reviewing Agent PRs Is a Different Job, Not a Faster One

10 min read
Tian Pan
Software Engineer

A senior engineer pulls up an agent-authored PR. The diff is clean. The tests pass. The naming is consistent. They skim it, leave a thumbs-up, and merge. Two months later, a different senior engineer is rewriting that module because the abstraction it introduced quietly leaks state across three call sites and the test suite never noticed because it asserted what the code does, not what the spec required.

This pattern is the dominant failure mode of code review in 2026. The reviewer instincts that worked on human-authored PRs — probe the author's intent, look for the bug they didn't think of, check whether the test reflects the design — break down on agent PRs because the bugs cluster in different places and the artifacts the reviewer sees are no longer the artifacts that matter.

The data backs the intuition. CodeRabbit's December 2025 analysis of 470 GitHub PRs found that AI-co-authored code produces about 1.7× more issues than human-authored code, with logic and correctness errors at 1.75×, security findings at 1.57×, and algorithmic and business-logic errors at 2.25× the human rate. Critical issues climb 1.4× and major issues 1.7×. The diffs read as fluent, and that fluency is precisely the problem.

The Bug Profile Flipped

Human engineers introduce bugs from incomplete mental models. The fix is usually a missing branch, an unconsidered edge case, a unit not converted, a race condition the author didn't picture. Reviewers calibrated on years of human PRs have a sixth sense for these — "what happens if this list is empty?" "did you consider the timezone?" "is this thread-safe?" — and the questions land because they target the gap in the author's understanding.

Agents introduce bugs from a different source: misread prompts, hallucinated APIs, and confidently wrong abstraction choices. The bug isn't a missing branch — it's a coherent, plausible-looking implementation of the wrong thing. The agent called response.json().getField("name") because that pattern shows up in its training data, even though the library being used returns a typed object. The agent factored out a helper that "looks reusable" but couples three concerns that should stay separate. The agent wrote a test that asserts what the code does rather than what the spec requires, because the same model produced both, and they share blind spots.

CodeRabbit's data on this is striking. Concurrency control issues appear 2.29× more often in agent code. XSS vulnerabilities show up 2.74× more often. Excessive I/O operations are nearly 8× more common. These aren't the bugs human reviewers are trained to find by reading a diff — they're systemic patterns that require a different reading strategy.

What the Reviewer Sees Versus What Matters

A human-PR reviewer sees the diff and infers the intent from the commit message, the PR description, and the surrounding code. The intent is in the author's head, but the diff is a high-fidelity signal of that intent because the author had to translate it consciously, line by line.

An agent-PR reviewer sees the diff and infers the intent from… the same diff. The actual intent lives in the prompt — the spec the engineer typed into Claude Code, Cursor, or Devin — and that prompt is almost never attached to the PR. The reviewer is reading the implementation and trying to reconstruct the spec it implements, which is exactly backwards.

This is why fluent agent code is so dangerous. With a human author, fluent code is evidence the author understood the problem. With an agent, fluent code is evidence the model produced consistent prose; the model writes the same way whether it understood the problem or not. The visual signal a reviewer trusts — "this looks like someone who knows what they're doing wrote it" — has been disconnected from the underlying property it was tracking.

Three Artifacts, Not One

The discipline that has to land is treating an agent PR as three separate artifacts:

  1. The prompt — the spec the diff is supposed to implement.
  2. The diff — the implementation.
  3. The test plan — the agent's claim about what behavior it covered.

Reviewing each in isolation is half the job. The other half is finding the gap between any pair:

  • Prompt vs. diff: did the agent solve the problem you actually asked for, or a nearby one? Subtle scope drift is common. You asked for "validate the email format" and got back a full email-deliverability check that hits an external API on every request (sketched in code after this list).
  • Diff vs. test plan: does the test plan actually exercise the behavior the diff claims? Tautological tests — assertEqual(format(x), format(x)) — pass without proving anything. Tests written by the same agent that wrote the implementation share blind spots; both encode the same misunderstanding.
  • Prompt vs. test plan: do the tests cover what the spec asked for, or only what the implementation happened to do? The spec said "handle empty input." The implementation crashes on empty input. The test only runs the happy path. All three artifacts agree with each other and disagree with reality.
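
To make the first gap concrete, here is a minimal sketch of the email example above, assuming a Python codebase; the function names and the verification endpoint are hypothetical, not taken from any real PR.

```python
import re

# What the prompt asked for: a local, side-effect-free format check.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_email_format(email: str) -> bool:
    return bool(EMAIL_RE.match(email))

# What the agent shipped instead: a deliverability check that makes a network
# call on every request. The endpoint below is hypothetical.
def validate_email_deliverability(email: str) -> bool:
    import requests  # new dependency, new latency, new failure mode
    resp = requests.get("https://verifier.example.com/check",
                        params={"email": email}, timeout=2)
    return resp.json().get("deliverable", False)
```

Both functions "validate an email," and both would satisfy a test plan written against the implementation; only the prompt reveals that the second one is scope drift.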

If your team's review tooling shows only the diff — and most do — the reviewer is structurally unable to do this job. Posting the prompt as the PR description, attaching the agent's test plan as a separate file, and reviewing all three together is a workflow change that has to land before the discipline can.

"Why Didn't the Agent Ask?"

The single most useful heuristic for reviewing agent PRs: a confident, fluent diff with no clarifying questions in the conversation history is a red flag. Real specifications have ambiguities. A senior engineer asked to "add caching to the user lookup" would ask: which cache? What invalidation policy? What TTL? What about per-tenant isolation? Is the existing cache layer the right tool, or do we want a new one?

When the agent doesn't ask any of those, it picked an answer. That answer is in the diff, but the alternatives the engineer would have surfaced are not. The reviewer's job is to reconstruct the questions the agent should have asked and verify the diff matches the answer the team would have given.

This generalizes into a checklist for failure modes humans rarely produce:

  • Hallucinated APIs: for every new library import or method call, verify the method exists in the version pinned in package.json or requirements.txt. The signature must match exactly. Agents synthesize plausible APIs from training data; the calls look fine in your IDE if it is lenient about types, then explode at runtime.
  • Over-broad refactors: agents enjoy "improving" code adjacent to the change. A two-line bug fix arrives as a thirty-file PR with consistent style edits, renamed variables, and an unrelated abstraction extracted. The diff is technically correct but the blast radius is wrong; the reviewer's job is to push back on scope, not just correctness.
  • Tautological tests: tests where the assertion mirrors the implementation rather than the specification. Common form: assert sort(input) == [item for item in sorted_using_same_algorithm(input)]. The test will pass forever and prove nothing (see the sketch after this list).
  • Confidently wrong abstractions: the agent factored out a base class, a context manager, a generic helper. It looks clean, but the three call sites it abstracts have meaningfully different invariants that the abstraction collapses. Months later, a fourth call site needs slightly different behavior and the whole abstraction has to come out.
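
Here is a minimal sketch of the tautological-test pattern, with a hypothetical dedupe function standing in for the code under review:

```python
def dedupe(items):
    """Hypothetical function under review: remove duplicates, keep first-seen order."""
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def test_dedupe_tautological():
    data = [3, 1, 3, 2, 1]
    # Asserts what the code does: the expected value comes from the code itself,
    # so this passes no matter what dedupe actually returns.
    assert dedupe(data) == dedupe(data)

def test_dedupe_against_spec():
    data = [3, 1, 3, 2, 1]
    # Asserts what the spec requires: duplicates removed, first-seen order kept,
    # plus the empty-input case the spec called out.
    assert dedupe(data) == [3, 1, 2]
    assert dedupe([]) == []
```

The first test survives any regression; the second encodes the spec independently of the implementation, which is the property the reviewer is auditing for.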

The Cost Frame Nobody Surfaces

Stripe's internal "minions" agents now ship roughly 1,300 PRs per week. Intercom reports 93% of their PRs are agent-driven and 19% merge with no human reviewer. GitHub's Octoverse pegs ~41% of new code as AI-assisted. Industry data suggests heavy AI users open 98% more PRs per developer, while review time has ballooned 91% per PR — and delivery velocity hasn't moved.

That last number is the punchline. The productivity claim "agents write 70% of our code" is meaningless without a paired line item: "and reviewing that code costs X% more engineer-hours per PR than it used to." Most orgs aren't measuring the second number. They measure PR throughput, ship-rate, and lines-of-code-per-engineer; they don't measure the depth-per-PR a reviewer applied, and they especially don't measure the senior-engineer rewrite cost six months out when a confidently wrong abstraction has to come out.

The honest accounting is that reviewing agent PRs at the depth they require is slower per PR than reviewing human PRs, because the reviewer has to:

  1. Reconstruct the prompt from the diff.
  2. Cross-check every API call against actual library docs (a minimal automated existence check is sketched after this list).
  3. Audit the test plan independently rather than trusting that "tests pass" means "behavior is correct."
  4. Push back on scope creep that humans wouldn't have generated.
  5. Probe abstraction choices the agent made silently.
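
The second step is the most mechanical and the easiest to script. A minimal sketch of an automated existence check, assuming a Python codebase; the module and attribute names in the example calls are placeholders:

```python
import importlib

def call_exists(module_name: str, attr_path: str) -> bool:
    """Return True if attr_path (e.g. "Response.json") resolves on the installed module."""
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    for part in attr_path.split("."):
        obj = getattr(obj, part, None)
        if obj is None:
            return False
    return True

# A real method on the installed requests package resolves...
print(call_exists("requests", "Response.json"))      # True (if requests is installed)
# ...while a hallucinated one does not.
print(call_exists("requests", "Response.getField"))  # False
```

This only proves the attribute exists in the installed version; signatures, semantics, and behavior still need the docs.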

If a team treats this work as the same as a human PR review and applies the same time budget, the abstractions decay. If the team sizes the review work honestly, the throughput claim gets a haircut. There's no third option — reviewing well costs what it costs.

Tiered Review and the Limits of Auto-Approval

Some teams are responding with tiered review systems: automated checks for hallucinated APIs and security anti-patterns, a separate AI reviewer that runs a structured rubric on every PR, and human judgment reserved for architecture, scope, and abstractions. Cloudflare and Microsoft have published variants of this — multi-agent review systems where specialized reviewers cover security, performance, and code quality, coordinated by an agent that deduplicates findings into a single comment.
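
A minimal sketch of what the routing layer of such a tiered system might look like; the tier names and thresholds are hypothetical, not taken from Cloudflare's or Microsoft's published designs:

```python
from dataclasses import dataclass

@dataclass
class AgentPR:
    files_changed: int
    only_deps_or_formatting: bool
    touches_auth_or_payments: bool
    introduces_new_abstraction: bool

def review_tier(pr: AgentPR) -> str:
    # Low-stakes, well-bounded changes: automated checks are enough.
    if pr.only_deps_or_formatting and pr.files_changed <= 3:
        return "automated checks only"
    # Architecture, scope, and abstraction decisions: a human has to look.
    if pr.touches_auth_or_payments or pr.introduces_new_abstraction:
        return "human architecture review"
    # Everything else: the AI reviewer runs the rubric, a human spot-checks.
    return "ai rubric + human spot check"
```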

This helps. It does not eliminate the bottleneck. The work an AI reviewer can do well is exactly the work where the failure mode is local and pattern-matchable. The work that still needs a human — does this abstraction belong here? Is this scope creep? Does this test reflect the spec? — is the work that scales worst. And it's the work where the cost of getting it wrong shows up six months later, after the agent that wrote it is on a different model version and the prompt is gone.

Auto-approval at 19% of PRs (Intercom's number) only works if the underlying class of changes is genuinely low-stakes — formatting fixes, dependency bumps, well-bounded refactors with strong contract tests. Pushing auto-approval into ambiguous territory is borrowing against the future review budget. The interest rate is "senior engineer rewrite," and it compounds.

The Bottleneck Isn't Going Away

Coding-agent vendors have spent two years promising to remove the code-review bottleneck. The bottleneck is still there. It moved, but it didn't go away, and the underlying reason is structural: the work of code review is not a function of how the code was written. It's a function of what the code is — what it changes, what depends on it, how it interacts with the system around it. Agents changed who generates the code and how fast the volume scales. They did not change the work of deciding whether the code should be merged.

The teams that are quietly succeeding at this aren't the ones with the fanciest AI reviewer. They're the ones who treated the prompt as a first-class artifact, sized the review budget honestly, and trained their reviewers to look for the failure modes that don't appear in human-authored code. The teams that are quietly failing are the ones approving fluent diffs at human-PR speed and accumulating the kind of subtly wrong abstractions that don't show up in any dashboard until somebody senior has to spend a week ripping one out.

The reviewer instinct is the asset. The work is to recalibrate it for a new bug profile, not to retire it.
