The Mixed PR Queue: Reviewer Throughput Is Now the Binding Constraint
For the last twenty years, the Theory of Constraints answer in software delivery was the same: the bottleneck is producing code. We tooled around that assumption — pair programming, IDE autocompletion, faster CI, smaller services, all designed to push more code through a fixed-width review pipe. Then coding agents arrived, the production side of the pipe got 5–10x wider, and the review pipe stayed exactly the same width. A senior engineer who used to open three PRs a week now supervises a fleet that opens thirty in an afternoon. The team's velocity is no longer set by how fast anyone writes code. It's set by how fast a human can read it.
This is not a future problem. Median PR review time is up 441% year over year in some samples, and 31% more PRs are merging with zero review — not by policy, but because reviewers gave up trying to keep pace. Stripe is shipping over a thousand agent-produced PRs per week. Feature-branch throughput grew 59% YoY in one benchmark while main-branch throughput fell 7% — code is being written, but it's not getting promoted, because it's stuck in review.
The temptation when this happens is to treat it as a tooling problem and reach for an AI reviewer. That helps at the margins, but it misses the structural shift: a coding-agent program is a review-process program first and a code-generation program second. If you ship the agents without redesigning the queue, you've just built a backlog generator.
The Two Failure Modes Are the Same Failure
The mixed PR queue fails in two opposite directions, and teams tend to oscillate between them.
The first failure is the rubber-stamp at 11pm. The reviewer comes back to a queue of forty diffs, knows they have to clear it before standup, and starts approving things they haven't actually read. "Approved" is no longer a quality signal — it's a throughput signal. The reviewer is honest about it in private and dishonest about it in the audit trail, and the team's nominal review SLA looks healthy right up until the post-mortem reveals that the bug shipped through a 900-line PR that got approved in 48 seconds.
The second failure is the graveyard queue. Different team, same input. The senior engineers refuse to rubber-stamp, so PRs sit for three to seven days waiting for a real review. Authors context-switch onto something else, the diff goes stale, the merge conflicts pile up, and the agent has to be re-prompted with a new branch. Effective throughput drops below where it was before agents existed, because the review queue has become a self-reinforcing depression that no one wants to enter.
These look like opposite problems, but they're the same problem viewed through different team cultures. Both are caused by feeding agent-authored and human-authored PRs into a queue that was designed for human-paced output, with a single approval bar that makes no distinction between a one-line dependency bump and a payment-flow refactor.
The Labeling Discipline That Has To Land
Before any process change works, the queue has to be legible. A reviewer staring at a list of PRs needs to know — without opening the diff — what kind of object they're looking at. Three pieces of metadata earn their keep:
- Author class. `agent-authored: <agent-id>` tells the reviewer this code was generated by Claude Code or Codex or your in-house agent, and which one. Different agents have different failure modes, and reviewers calibrate their attention to the agent the same way they calibrate to a junior developer they've worked with before.
- Supervision state. `supervised-by: <human>` tells the reviewer whether a human has already eyeballed the diff before it landed in the queue. An agent PR that the author has already read is a different review object than an agent PR that nobody has read.
- Risk tier. This doesn't have to be perfect. Even a coarse tag — `tier: trivial | feature | critical-path` — changes the reviewer's behavior. The trivial tier gets a 30-second skim. The critical-path tier gets the same scrutiny it would have gotten from a human author.
The naming matters less than the existence of the labels. Teams that try to keep all of this in their heads end up with the rubber-stamp failure within a quarter. Teams that put it in a structured trailer or a PR template end up with a queue a human can scan in under a minute.
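As a concrete shape, here is a minimal sketch of the three labels as git-style trailers, with a parser that turns a PR description into a queue entry a dashboard can sort. The trailer keys and the QueueEntry shape are illustrative, not a standard; any consistent, machine-readable naming does the job.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative trailer keys -- any consistent naming works, as long as it's machine-readable.
TRAILER_KEYS = {"agent-authored", "supervised-by", "tier"}
VALID_TIERS = {"trivial", "feature", "critical-path"}

@dataclass
class QueueEntry:
    agent: Optional[str]          # e.g. "claude-code"; None for human-authored PRs
    supervised_by: Optional[str]  # human who read the diff before it hit the queue
    tier: str                     # "trivial" | "feature" | "critical-path"

def parse_trailers(pr_description: str) -> QueueEntry:
    """Pull the three labels out of git-style 'key: value' trailer lines."""
    found = {}
    for line in pr_description.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key in TRAILER_KEYS:
            found[key] = value.strip()

    tier = found.get("tier", "feature")  # unlabeled PRs default to full human review
    if tier not in VALID_TIERS:
        tier = "feature"
    return QueueEntry(
        agent=found.get("agent-authored"),
        supervised_by=found.get("supervised-by"),
        tier=tier,
    )
```

The defaulting choice matters: an unlabeled or mislabeled PR falls into the feature tier, so a missing label costs reviewer time rather than skipping review.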
Per-Tier SLOs, Not Per-PR SLOs
Once the queue is tiered, the review SLO has to be tiered with it. A blanket "PRs reviewed within 24 hours" target was tolerable when output was human-paced. With agents, it becomes a mechanism that pushes everything toward the rubber-stamp end of the spectrum, because the only way to hit the SLO is to spend less time on each PR.
A workable shape:
- Trivial tier: minutes-to-merge, automated approval where CI provides sufficient signal, batched into a single fast-lane PR where possible (one grouped Dependabot PR for all patch bumps, not six). The point of this tier is to keep these PRs out of human attention entirely, not to make humans review them faster.
- Feature tier: hours-to-first-review, full human review, agent-assisted triage allowed. This is where the reviewer's actual judgment lives, and the queue should be small enough that they can afford to spend real time here.
- Critical-path tier: explicit named reviewer, no rubber-stamp path, no agent merge regardless of CI status. Auth boundaries, payment flows, data deletion, anything that touches a contract with a customer. This tier is small by construction and gets the attention it used to get when there were fewer total PRs.
The key insight is that the trivial tier is not a faster version of the feature tier — it's a different review object with a different approval bar. The mistake teams make is to treat all PRs as the same kind of thing on a faster clock, which compresses the time spent on the PRs that actually matter.
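One way to make the tiers operational rather than aspirational is a small per-tier policy table that drives queue alerting and the approval bar. A sketch, with placeholder numbers that each team would tune for itself:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierPolicy:
    first_review_slo_minutes: int   # alert when a PR has waited longer than this
    human_review_required: bool
    named_reviewer_required: bool   # critical path: a specific owner, not "anyone"
    batchable: bool                 # can be folded into a grouped fast-lane PR

# Placeholder numbers -- the point is that the targets differ per tier,
# not that these specific values are right for any team.
TIER_POLICIES = {
    "trivial":       TierPolicy(first_review_slo_minutes=15,   human_review_required=False,
                                named_reviewer_required=False, batchable=True),
    "feature":       TierPolicy(first_review_slo_minutes=240,  human_review_required=True,
                                named_reviewer_required=False, batchable=False),
    "critical-path": TierPolicy(first_review_slo_minutes=1440, human_review_required=True,
                                named_reviewer_required=True,  batchable=False),
}

def slo_breached(tier: str, minutes_waiting: int) -> bool:
    """Per-tier SLO check: alert on the queue, not on individual reviewers."""
    return minutes_waiting > TIER_POLICIES[tier].first_review_slo_minutes
```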
Reasoning Traces Make the Reviewer's Job Tractable
When the agent ships a PR, the reviewer faces a question they didn't have to face for human-authored code: did the model actually understand the requirement, or did it produce something that pattern-matches to the requirement? These are not the same thing, and a 200-line diff doesn't tell you which one happened.
The structural fix is to require the agent's PR description to carry a reasoning trace and a verification record. The reviewer is then auditing the process — "did the agent solve the right problem with a defensible approach, and did it verify the result?" — rather than re-deriving the correctness from scratch. The first job takes 30 seconds. The second takes 30 minutes. That's the whole difference between a queue that scales and a queue that doesn't.
The trace doesn't have to be elaborate. Three things are usually enough: what the agent understood the goal to be (so the reviewer can spot a misframed task immediately), what alternatives it considered and rejected (so the reviewer knows whether the easy path was looked at), and what it ran to verify the result (so the reviewer knows whether "tests pass" means the test for this behavior passes, or just that the suite is green). When any of those three is missing, the PR goes back to the agent before it goes to a human reviewer.
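A lightweight way to enforce that contract is a bot check that bounces agent PRs whose descriptions are missing any of the three sections, before the PR costs a human any attention. A minimal sketch, assuming the agent template emits headed sections with these illustrative names:

```python
# Illustrative section headings; whatever the agent template uses, the check is the same.
REQUIRED_SECTIONS = ("## Goal as understood", "## Alternatives considered", "## Verification run")

def missing_trace_sections(pr_description: str) -> list[str]:
    """Return the reasoning-trace sections absent from an agent PR description."""
    return [s for s in REQUIRED_SECTIONS if s.lower() not in pr_description.lower()]

def route_pr(pr_description: str, agent_authored: bool) -> str:
    """Decide whether a PR is ready for a human reviewer or goes back to the agent."""
    if not agent_authored:
        return "human-review"
    missing = missing_trace_sections(pr_description)
    if missing:
        # Returned to the agent with the gap named, before it reaches a human reviewer.
        return f"bounce-to-agent: missing {', '.join(missing)}"
    return "human-review"
```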
There is a real risk to this approach: reasoning traces can be confidently wrong. A model can produce a plausible-sounding chain of thought that arrives at the right answer through the wrong logic, or fabricate a verification step that didn't actually happen. The trace is not proof — it's a structured object the reviewer can spot-check against the diff. Used well, it cuts review time by an order of magnitude. Trusted blindly, it's a more sophisticated rubber stamp.
"CI Passes" Is Not the Gate You Think It Is
The pressure of the mixed queue eventually pushes teams toward the most dangerous compromise: let the agent merge its own PRs once CI is green. It looks like the obvious fix — tests pass, lints pass, types pass, why does this need a human? — and it's the policy you'll regret in six months.
The problem is that CI was never designed to be the merge gate for agent-authored code. CI was designed to verify that human-authored code didn't break what humans had already thought to test for. When the author and the test author are the same model, CI green tells you the change is internally consistent; it does not tell you the change does what the requirement said. The teams that discover this discover it the same way: a feature ships, a customer reports it doesn't work, the test that should have caught it doesn't exist, and the agent's PR description is silent on the missing case because the agent didn't think to write a test for a behavior it didn't think to implement.
The fix is not to abandon auto-merge — it's to be honest about what auto-merge means. For the trivial tier (dependency bumps inside a known compatibility range, lint auto-fixes, generated-code refreshes where the generator is deterministic), CI is a reasonable gate. For anything that touches business logic, CI is a necessary gate but never a sufficient one, and the merge button stays in human hands.
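The merge-gate version of that honesty fits in a few lines. A sketch, assuming the tier label from earlier and some team-chosen signal (path ownership, CODEOWNERS, a directory allowlist) for whether the diff touches business logic:

```python
def auto_merge_allowed(tier: str, ci_green: bool, touches_business_logic: bool) -> bool:
    """CI is a sufficient gate only for the trivial tier; elsewhere it is necessary but not sufficient."""
    if not ci_green:
        return False                  # CI is always necessary
    if tier != "trivial":
        return False                  # feature and critical-path PRs wait for a human
    if touches_business_logic:
        return False                  # a mislabeled "trivial" PR falls back to human review
    return True
```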
The Org Realization
The teams that handle this transition well are the ones that recognize it as an org-design problem early. The senior engineers' job is shifting from writing code to running a review queue, and that's a different skill profile, a different load profile, and a different career conversation. Pretending it's the same job with more PRs is how you burn out your seniors in two quarters.
The teams that handle it badly are the ones that treat the agents as a productivity feature for individual engineers and leave the queue, the SLOs, and the reviewer compensation alone. Six months in, they've shipped more code, broken more things, and watched their best reviewers quit because the only signal anyone could see was "PRs merged" and the only thing they were paid to do was approve faster.
A coding-agent program is a review-process program first. The code generation is the easy part — it's already working. The hard part is building a queue that a human can govern, with tiers that match risk, traces that make audit tractable, and a clear honest answer to the question of when a machine is allowed to merge.
If your team is shipping agents this quarter, the question to walk into the planning meeting with is not "which model are we using?" It's "what does our review queue look like in twelve weeks, and who is going to want to work in it?"
- https://dev.to/code-board/pr-review-time-is-up-441-the-real-cost-of-ai-accelerated-development-1ho6
- https://bryanfinster.substack.com/p/ai-broke-your-code-review-heres-how
- https://www.freecodecamp.org/news/how-to-unblock-ai-pr-review-bottleneck-handbook/
- https://ona.com/stories/auto-approving-low-risk-prs
- https://linearb.io/blog/2025-engineering-benchmarks-insights
- https://blog.cloudflare.com/ai-code-review/
- https://dev.to/yeahiasarker/pr-reviews-are-the-biggest-engineering-bottleneck-lets-fix-that-22ec
- https://dev.to/epilot/our-entire-company-ships-code-now-40-prs-from-non-engineers-in-60-days-jo5
- https://2026.msrconf.org/details/msr-2026-mining-challenge/49/Early-Stage-Prediction-of-Review-Effort-in-AI-Generated-Pull-Requests
- https://nearform.com/insights/automatic-dependency-bump/
