The Annotation Queue Your Humans Quietly Stopped Reading
Your eval pipeline emits 800 traces per week for human review. Your annotators have about ninety minutes a week budgeted for it. They open the queue, grade the first three, mark a few more as "skip," and close the tab. The leaderboard you stare at on Monday morning is now a survey of which traces happened to land near the top of the list, not a measurement of system quality.
This is not a labeling problem. It is a throughput problem dressed up as a quality problem, and it is one of the quietest ways an evaluation program degrades. The traces still flow. The dashboards still render. The number still moves. What you do not see is that the denominator of your "human-graded eval score" silently shrank to a handful of items chosen by an ordering function nobody designed on purpose.
The pattern is familiar to anyone who has run an on-call rotation past the point of sustainability. The pager keeps firing. Engineers keep clicking acknowledge. The incidents that get a real postmortem are the ones that happened to land during business hours on a quiet day. Everything else gets a one-line note and a green checkbox. The system looks healthy by every metric the team owns. The metric is the failure.
Annotator throughput is the ceiling, and it does not scale with traffic
The first thing to internalize is that human grading capacity is a hard, slow-moving ceiling. A trained reviewer can carefully grade somewhere between thirty and a hundred traces per hour, depending on task complexity and how much context they need to reload between cases. A part-time pool of three reviewers, given four hours a week each, has twelve reviewer-hours to spend, which works out to somewhere between 360 and 1,200 careful grades a week. A serious eval pipeline can produce that many traces in an afternoon.
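The capacity arithmetic above is simple enough to keep as a small helper. This is a minimal sketch; the function name `weekly_capacity` is made up for illustration, and the inputs are the figures quoted in the paragraph.

```python
def weekly_capacity(reviewers: int, hours_each: float, traces_per_hour: float) -> float:
    """Careful grades per week for a reviewer pool (illustrative helper)."""
    return reviewers * hours_each * traces_per_hour

# Three part-time reviewers at four hours a week each, at the slow
# and fast ends of the thirty-to-a-hundred traces-per-hour range.
slow_end = weekly_capacity(3, 4, 30)
fast_end = weekly_capacity(3, 4, 100)
print(slow_end, fast_end)  # 360 1200
```

Plugging in your own pool size and a measured grading rate gives the number that every sampling decision downstream has to fit inside.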
This asymmetry is not a recruiting problem you can solve with another job posting. Domain-expert annotators are precisely the people whose calendars are already saturated, because they are the same engineers, lawyers, clinicians, and support leads whose judgment the product depends on in the first place. Hiring generic annotators does not help when the question is whether a specific tool call was the right move in a specific customer's account.
So the budget is fixed. Production traffic, by contrast, grows with usage. Once your application crosses a few thousand requests a day, the gap between traces emitted and traces a human will actually read reaches one to two orders of magnitude, even if reviewers use their full budget. From that point on, every additional unit of traffic widens the gap. Adding more reviewers buys you a constant factor. Trace volume keeps growing with traffic. The ratio gets worse, not better, as your product succeeds.
The queue order is the silent sampler nobody chose
When you produce far more traces than humans can grade, the traces you grade become a sample of the traces you produce. Every sample has a sampling function. If you did not design one, the default is whatever ordering your queue happens to use: timestamp, insertion order, trace ID hash, or whichever join key the dashboard query happened to land on.
That default ordering is almost never the right answer, and it is rarely random. A newest-first queue means the most recent traces dominate. Reviewers fatigue partway down, so the tail of the queue is systematically under-graded. If your queue is sorted by anything correlated with the input (user ID, request size, latency), your "human-graded score" is a measurement of the slice that ordering surfaced, and the slice that ordering hid is invisible.
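A toy simulation makes the effect concrete. This is a sketch on synthetic data, not a model of any real queue: the field names `latency_ms` and `passed` are hypothetical, and the rule that slow requests are the failing ones is an assumption chosen so the sort key correlates with quality.

```python
import random

# Synthetic queue: 1,000 traces, one per latency value from 0 to 999 ms.
# Assumption for illustration: requests at 700 ms or slower are the failures.
traces = [{"latency_ms": ms, "passed": ms < 700} for ms in range(1000)]

def pass_rate(items):
    """Fraction of traces a human would mark as passing."""
    return sum(t["passed"] for t in items) / len(items)

budget = 50                      # traces reviewers actually get through
true_rate = pass_rate(traces)    # 0.7 by construction

# Default queue: sorted by latency, reviewers grade the head and stop.
head = sorted(traces, key=lambda t: t["latency_ms"])[:budget]
head_rate = pass_rate(head)      # every graded trace is a fast one

# Same budget, but drawn uniformly at random with a fixed seed.
rng = random.Random(7)
random_rate = pass_rate(rng.sample(traces, budget))

print(f"true={true_rate:.2f} head={head_rate:.2f} random={random_rate:.2f}")
```

The head-of-queue estimate reports a perfect score because every failure sits below the point where reviewers stopped. It would report the same perfect score next week too, which is why the number looks trustworthy.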
The dangerous part is that this looks like signal. The number is stable week over week because the sampling bias is stable. It feels like a real measurement because it is consistent. It moves when the system changes because some changes do show up in the graded slice. None of that means it generalizes. You can spend a quarter chasing regressions in a slice of traffic the team did not know it was confined to, and a quarter shipping fixes whose impact is invisible because they help the slice no one is reading.
The four sampling strategies, and what each one is for
Annotation-queue tooling has converged on a handful of sampling strategies. They are not interchangeable. Each one answers a different question and produces a different bias.
Random sampling gives you an unbiased view of overall quality. If you want to track whether the system is getting better or worse in aggregate, this is the only sampling regime that gives you that answer without correction. The cost is that random sampling spends almost all its budget on the median case and almost none on the tails, which is exactly where regressions live.
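A minimal sketch of a random review sample, using only the standard library. The function names are hypothetical, and the margin uses the normal approximation to a binomial proportion, which is a common rough guide rather than an exact interval.

```python
import math
import random

def sample_for_review(trace_ids, budget, seed):
    """Uniform sample without replacement; a fixed seed keeps it reproducible."""
    rng = random.Random(seed)
    return rng.sample(list(trace_ids), min(budget, len(trace_ids)))

def pass_rate_with_margin(grades):
    """Pass rate plus an approximate 95% margin (normal approximation)."""
    n = len(grades)
    p = sum(grades) / n
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, margin

queue = sample_for_review(range(10_000), budget=100, seed=2024)

# Suppose reviewers pass 70 of the 100 sampled traces.
p, margin = pass_rate_with_margin([1] * 70 + [0] * 30)
print(f"{p:.2f} +/- {margin:.2f}")  # 0.70 +/- 0.09
```

The margin is the useful part: at a hundred grades, a week-over-week move of a few points is inside the noise, which tells you how much of the budget an aggregate number really needs.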
Stratified sampling divides traffic into segments — user tier, feature, request type, conversation length, language — and samples within each. This is what you want when the system behaves differently for different populations and you need to detect a regression in one segment without it being washed out by ten others. Stratification turns a single aggregate number into a panel of segment-level numbers, each of which is honest about its own population.
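One way to sketch this is a fixed quota per segment, so that a small segment gets as many grades as a large one. The `tier` field and the segment names are invented for the example. An equal quota is only one possible allocation; proportional quotas are another.

```python
import random
from collections import defaultdict

def stratified_sample(traces, segment_of, per_segment, seed=0):
    """Group traces by segment, then sample up to per_segment from each."""
    rng = random.Random(seed)
    by_segment = defaultdict(list)
    for trace in traces:
        by_segment[segment_of(trace)].append(trace)
    picked = {}
    for segment in sorted(by_segment):      # sorted for a stable order
        items = by_segment[segment]
        picked[segment] = rng.sample(items, min(per_segment, len(items)))
    return picked

# Hypothetical traffic: one enterprise trace for every nine free-tier traces.
traces = [{"id": i, "tier": "enterprise" if i % 10 == 0 else "free"}
          for i in range(200)]

picked = stratified_sample(traces, lambda t: t["tier"], per_segment=15)
for segment, items in picked.items():
    print(segment, len(items))
```

With a purely random draw of 30 traces, the enterprise segment would get about three grades on average. Here it gets fifteen. The trade-off is that the pooled grades no longer estimate overall quality unless each segment is weighted back by its share of traffic.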
Priority sampling pushes the rare or interesting traces to the front: low automated scores, high latency, user-reported issues, traces where two judges disagreed. This is where your budget pays off for finding new failure modes. The cost is that priority-sampled grades are not representative of overall quality — they are representative of "things the priority function flagged," which is a different quantity.
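A priority queue of this kind can be sketched with `heapq.nlargest` from the standard library. Everything about the scoring is an assumption made for illustration: the field names, the weights, and the latency threshold are placeholders, not recommended values.

```python
import heapq

def priority(trace):
    """Higher means more worth a human look. Weights are illustrative only."""
    score = 1.0 - trace["auto_score"]       # low automated score
    if trace["latency_ms"] > 5000:          # unusually slow request
        score += 0.5
    if trace["user_reported"]:              # a user flagged the response
        score += 2.0
    if trace["judges_disagree"]:            # two automated judges split
        score += 1.0
    return score

def priority_queue(traces, budget):
    """Return the budget highest-priority traces, highest first."""
    return heapq.nlargest(budget, traces, key=priority)

traces = [
    {"id": "a", "auto_score": 0.90, "latency_ms": 200,  "user_reported": False, "judges_disagree": False},
    {"id": "b", "auto_score": 0.20, "latency_ms": 8000, "user_reported": False, "judges_disagree": False},
    {"id": "c", "auto_score": 0.95, "latency_ms": 300,  "user_reported": True,  "judges_disagree": False},
    {"id": "d", "auto_score": 0.50, "latency_ms": 100,  "user_reported": False, "judges_disagree": True},
]

print([t["id"] for t in priority_queue(traces, budget=2)])  # ['c', 'd']
```

Because these grades describe what the priority function flagged, it helps to store them tagged by sampling strategy and keep them out of any aggregate quality number.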
