The Annotation Queue Your Humans Quietly Stopped Reading
Your eval pipeline emits 800 traces per week for human review. Your annotators have about ninety minutes a week budgeted for it. They open the queue, grade the first three, mark a few more as "skip," and close the tab. The leaderboard you stare at on Monday morning is now a survey of which traces happened to land near the top of the list, not a measurement of system quality.
This is not a labeling problem. It is a throughput problem dressed up as a quality problem, and it is one of the quietest ways an evaluation program degrades. The traces still flow. The dashboards still render. The number still moves. What you do not see is that the denominator of your "human-graded eval score" silently shrank to a handful of items chosen by an ordering function nobody designed on purpose.
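One way to make the shrinking denominator visible is to report coverage and queue position next to the score itself. The sketch below is a minimal illustration, not a description of any particular tool: the `Trace` record, its field names, and the `queue_coverage` helper are all hypothetical, and it assumes you can export each trace's position in the queue and whether a human actually graded it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Trace:
    """Hypothetical record for one trace in the review queue."""
    trace_id: str
    queue_position: int                  # 0 = top of the queue
    human_score: Optional[float] = None  # None = skipped or never opened


def queue_coverage(traces: Sequence[Trace]) -> dict:
    """Summarize how much of the queue the human-graded score rests on."""
    total = len(traces)
    graded = [t for t in traces if t.human_score is not None]

    coverage = len(graded) / total if total else 0.0
    mean_score = (
        sum(t.human_score for t in graded) / len(graded) if graded else None
    )
    # Normalized mean position of graded traces: 0.0 = top, 1.0 = bottom.
    # If grading were spread uniformly over the queue this would sit near 0.5.
    if graded and total > 1:
        mean_position = (
            sum(t.queue_position for t in graded) / len(graded) / (total - 1)
        )
    else:
        mean_position = None

    return {
        "total": total,
        "graded": len(graded),
        "coverage": coverage,
        "mean_score": mean_score,
        "mean_graded_position": mean_position,
    }


# The scenario from the text: 800 traces, only the first three graded.
queue = [Trace(trace_id=f"t{i}", queue_position=i) for i in range(800)]
for trace, score in zip(queue[:3], [1.0, 0.0, 1.0]):
    trace.human_score = score

report = queue_coverage(queue)
# Coverage is 3/800, under half a percent, and the graded traces
# all sit at the very top of the queue.
print(report)
```

A score reported without `coverage` and `mean_graded_position` beside it hides exactly the failure described above: the number looks the same whether it rests on 800 graded traces or on three from the top of the list.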
