
Rater Throughput Is the Hidden Bottleneck in Your Eval Pipeline

10 min read
Tian Pan
Software Engineer

The team plans an eval suite the way they plan a service: failure modes inventoried, rubric drafted, sample size argued over, judge calibration scheduled. Then rater capacity gets filed as a footnote — "we'll get the annotation team to grade a few hundred per week" — and the rest ships. Six weeks later the rater queue is at 4,300 items, eval velocity has collapsed to one judge-calibration cycle per month, and someone in a planning review says the quiet part out loud: nobody capacity-planned the humans.

Rater throughput is the binding constraint on eval velocity in any AI system that takes human grading seriously, and the discipline that treats annotation as an SRE problem rather than a recruiting one is the one that ships. A human reviewer processes 50–100 examples per hour at expert difficulty, and an expert annotator caps out around 500–1,000 examples per week — those numbers are not a recruiting problem to be brute-forced with headcount. They are an operational property of the eval system that has to be modeled and budgeted the way you model database IOPS.
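To make the budgeting concrete, here is a back-of-the-envelope capacity model in the spirit of an IOPS budget. The per-hour rate comes from the figures above; the grading hours per week and the utilization factor are illustrative assumptions, not measurements from any real team.

```python
# Back-of-the-envelope rater capacity model, same spirit as an IOPS budget.
# Defaults other than the per-hour rate are illustrative assumptions.

def weekly_grading_capacity(
    raters: int,
    items_per_hour: float = 50.0,          # low end of expert-difficulty throughput
    grading_hours_per_week: float = 15.0,  # assumption: grading is not a full-time activity
    utilization: float = 0.8,              # assumption: calibration, re-grades, queue churn
) -> float:
    """Items the rater pool can drain per week."""
    return raters * items_per_hour * grading_hours_per_week * utilization


if __name__ == "__main__":
    print(f"weekly drain capacity: {weekly_grading_capacity(raters=4):.0f} items")
```

Under these assumptions four raters drain roughly 2,400 items a week. Whether the real number is 2,000 or 4,000 matters less than having a number at all before the queue finds it for you.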

The Queue Grows Faster Than the Eval Suite

The shape of the problem is unintuitive until you draw it. Every eval suite has two growth rates: the rate at which engineers add new cases (new failure modes from incidents, new product surfaces, new judge calibrations), and the rate at which raters drain the queue. The first compounds, because every shipped model migration, every new tool the agent gets to call, every new prompt variant under A/B test, generates fresh items to grade. The second is linear at best and degrades at worst — raters churn, calibration drift forces re-grading of old items, and ambiguous rubric cases get sent back to engineering for clarification.

When you plot queue depth against the rate of new evals filed, the line bends upward within the first quarter of operation. Teams that don't notice it ship anyway and discover, in the next eval-driven release decision, that the data they're relying on was graded six weeks ago against a rubric that has since been revised. The release decision is now made on stale evidence, and the eval suite, technically green, has stopped being a release gate and become a release ritual.

The mental model that works: rater throughput is bandwidth. Eval cases in flight are packets. The queue is a buffer that's either draining or filling. There is no "infinite labor" assumption that survives contact with production usage, and pretending otherwise is the operational analogue of designing a service that assumes the database has unlimited write throughput.
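A minimal simulation makes the shape visible: arrivals compound while drain capacity stays flat. Every parameter below is an illustrative assumption chosen to show the curve, not data from a real queue.

```python
# Minimal queue-depth simulation: new eval cases compound week over week,
# the rater pool drains a roughly constant number. Parameters are illustrative.

def simulate_queue_depth(
    weeks: int = 12,
    initial_arrivals: float = 300.0,  # new items filed in week 1
    arrival_growth: float = 0.10,     # compounding weekly growth in filed cases
    drain_per_week: float = 400.0,    # rater pool capacity, roughly flat
) -> list[float]:
    depth, arrivals, history = 0.0, initial_arrivals, []
    for _ in range(weeks):
        depth = max(0.0, depth + arrivals - drain_per_week)
        history.append(depth)
        arrivals *= 1 + arrival_growth
    return history


if __name__ == "__main__":
    for week, depth in enumerate(simulate_queue_depth(), start=1):
        print(f"week {week:2d}: queue depth {depth:6.0f}")
```

With these toy numbers the queue sits at zero for a month, then crosses a thousand items before the quarter is out — the same bend that shows up in the planning review after it is too late to staff around.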

Calibration Is Onboarding, Not a One-Time Event

The first thing teams underprice is calibration. A new rater is not productive on day one — they're productive after they've graded a calibration set, had their disagreements with consensus surfaced, and worked through the rubric edge cases with someone who has internalized the policy. Treating calibration as a one-time onboarding event misses the more expensive reality: every rubric change is a recalibration event for every rater, every model migration that shifts the distribution of outputs creates new edge cases that the original calibration didn't cover, and every new task category needs its own calibration pass.

The discipline that works treats calibration the way SRE treats runbook drills. There is a quarterly recalibration cycle where every active rater grades the same hold-out set, inter-rater agreement is measured (Cohen's kappa, Krippendorff's alpha, or whatever metric your domain actually uses), and the items where agreement drops below threshold trigger one of three actions: rubric edit, anchor-example addition, or rater-specific feedback. The hold-out set itself is treated as a living artifact, versioned alongside the rubric, with items added whenever an ambiguous case appears in production grading.
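A minimal sketch of that recalibration check, assuming scikit-learn for the kappa computation; the rater names, toy grades, and agreement threshold are placeholders. It prints pairwise agreement across the hold-out set and flags the split items that should trigger a rubric edit or a new anchor example.

```python
# Sketch of the quarterly recalibration check, assuming scikit-learn is
# available for Cohen's kappa. Grades, threshold, and rater names are
# illustrative placeholders.
from collections import Counter

from sklearn.metrics import cohen_kappa_score


def recalibration_report(grades: dict[str, list[int]], kappa_threshold: float = 0.6) -> None:
    """grades maps rater id -> labels for the same ordered hold-out items."""
    raters = list(grades)

    # Pairwise inter-rater agreement; a consistently low pair points at rater-specific feedback.
    for i, a in enumerate(raters):
        for b in raters[i + 1:]:
            kappa = cohen_kappa_score(grades[a], grades[b])
            status = "ok" if kappa >= kappa_threshold else "recalibrate"
            print(f"{a} vs {b}: kappa={kappa:.2f} ({status})")

    # Items where raters split are candidates for a rubric edit or a new anchor example.
    n_items = len(next(iter(grades.values())))
    for idx in range(n_items):
        labels = Counter(g[idx] for g in grades.values())
        if labels.most_common(1)[0][1] < len(raters):
            print(f"item {idx}: split grades {dict(labels)} -> rubric edit or anchor example")


if __name__ == "__main__":
    recalibration_report({
        "rater_a": [1, 0, 1, 1, 0],
        "rater_b": [1, 0, 0, 1, 0],
        "rater_c": [1, 1, 0, 1, 0],
    })
```

Rater-specific feedback falls out of the pairwise numbers: the rater whose kappa is low against everyone else is the one who needs the conversation.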

The pattern that breaks teams: hiring more raters without calibrating them. Adding annotators to a poorly-calibrated team is the same anti-pattern as adding cores to a system with a contended lock. Throughput goes up sublinearly, disagreement rates climb, and the downstream eval signal degrades to the point where engineers stop trusting it. At that point you've spent the headcount budget and made the eval problem worse.

Queue-Aware Eval Prioritization

If rater throughput is bandwidth, then queue-aware prioritization is admission control. Not every eval case is equally valuable, and the eval suite as a whole has a budget — say, 4,000 items per week if you have four expert raters running at the high end of expert annotator throughput. The discipline is treating that budget the way an oncall team treats error budget: it's finite, the work that consumes it competes for the same pool, and someone has to make the prioritization decision before the queue depth makes the decision for you.
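A sketch of what that admission control can look like, with the priority tiers, field names, and backlog contents as assumptions rather than a prescribed scheme: spend the weekly budget on the highest-priority cases and defer the rest explicitly, so deferral is a decision instead of a backlog.

```python
# Sketch of queue-aware admission control: a weekly grading budget is spent
# on the highest-priority eval cases first; everything else is deferred
# explicitly. Priority tiers and case names are illustrative.
from dataclasses import dataclass, field
from heapq import heappop, heappush


@dataclass(order=True)
class EvalCase:
    priority: int                        # lower = graded first (e.g. release blockers = 0)
    case_id: str = field(compare=False)  # excluded from ordering


def admit(queue: list[EvalCase], weekly_budget: int) -> tuple[list[EvalCase], list[EvalCase]]:
    """Split the backlog into items admitted this week and items deferred."""
    heap: list[EvalCase] = []
    for case in queue:
        heappush(heap, case)
    admitted = [heappop(heap) for _ in range(min(weekly_budget, len(heap)))]
    deferred = sorted(heap)
    return admitted, deferred


if __name__ == "__main__":
    backlog = [
        EvalCase(0, "release-gate-regression"),
        EvalCase(2, "prompt-variant-ab-test"),
        EvalCase(1, "new-tool-call-failure-mode"),
        EvalCase(3, "exploratory-long-tail"),
    ]
    admitted, deferred = admit(backlog, weekly_budget=2)
    print("graded this week:", [c.case_id for c in admitted])
    print("deferred:", [c.case_id for c in deferred])
```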
