
Eval Triage Queues: Why FIFO Misses the Failures That Matter

11 min read
Tian Pan
Software Engineer

A healthy eval set is supposed to be a sign of maturity. It is also, on any given Monday, a thousand failed cases sitting in a queue with a human reviewer who has eight hours and a throughput of about fifty cases a day. The arithmetic is brutal: roughly one in twenty failures gets read. The other nineteen wait. Which nineteen wait, and which one gets the seat, is decided by whichever order the file happens to load in.

Most teams call this "reviewing failures." It is closer to a lottery weighted by alphabetical order. A failure case that affects two percent of production traffic and lives at the top of the file gets attention. A failure case that affects forty percent of production traffic and lives near the bottom gets a glance on Friday afternoon, if at all. The team ships a fix for the small problem on Tuesday and writes a retro on Thursday wondering why the dashboard hasn't moved.

The reason the dashboard hasn't moved is not that the team picked the wrong fix. The team picked the right fix for the case in front of them. The system that put the case in front of them is what's broken. Eval failures are a stream of bug reports with a priority queue's worth of metadata that the reviewer never sees by default, and a queue that processes them as FIFO is doing the engineering equivalent of triaging an emergency room by the order patients walked through the door.

The default queue is FIFO and the default is wrong

The default behavior of every eval framework worth using is to dump failures into a list. The list is ordered by whatever produced it: trace ID, timestamp, batch index, output of the test runner. Reviewers open the list, scroll from the top, and start reading.

This is fine when the failure count is small enough that the reviewer can read all of it. The eval set was built to grow beyond that point. The whole reason to invest in evals is to surface more failures than gut feel and ad-hoc bug reports were catching. The moment the eval set works, the queue overflows, and the question of which failures get reviewed becomes the most consequential decision in the loop — more consequential than which prompt change you make next, because the prompt change is downstream of which case taught you what to change.

FIFO answers that question with: whichever ones loaded first. That is not a triage policy. That is the absence of one.

Failures aren't worth equal review time

The cases waiting in the queue differ along axes the reviewer can't see from the list view:

  • Production-traffic representativeness. Some failure cases reflect a query pattern that ten percent of users hit every day. Others reflect a query pattern that fired twice last quarter. A fix for the first one moves the needle; a fix for the second one is craft.
  • Severity to the user. A wrong refund amount and a slightly verbose answer both register as "failed" in the eval. They are not the same kind of failure. One is a regulatory incident. One is a style nit.
  • Cluster size after deduplication. Three hundred failures in the queue might really be twelve distinct failure modes, each repeated. A reviewer who reads them one-by-one spends most of the budget re-discovering the same twelve patterns.
  • Adversarial slice coverage. Some failures are bulk-traffic misses. Some are jailbreak attempts, edge-case prompts, or compliance probes. The bulk-traffic queue under-samples the adversarial slice for the exact reason bulk traffic is bulk — there's more of it. Letting the queue auto-prioritize by frequency drowns the adversarial cases.
  • Time since the cluster was last reviewed. A failure mode you reviewed two days ago and patched does not need to be reviewed again today. A failure mode you have never opened, no matter how rare, is information the team doesn't have yet.

A FIFO queue treats all of these as the same row in the same list. The reviewer's expertise is the most expensive thing in the loop and it is being spent on whichever case happened to sort first.

A triage scoring layer is the obvious fix and nobody ships it

The right shape is a priority queue with an explicit scoring function. The score doesn't have to be fancy. A first-draft version that beats FIFO by a wide margin looks like:

priority = (traffic_share × severity × recency_decay × adversarial_slice_quota_multiplier)
           / cluster_size_already_reviewed

Each term is one column the team can compute from data they already have. Traffic share is a join between the failure case and production logs — what fraction of last week's traffic looks like this input? Severity is a one-to-five field the reviewer fills in once per cluster and the score inherits. Recency decay is one for a cluster that has never been opened, drops right after a review, and climbs back toward one over the following days, so a patched cluster that keeps failing eventually resurfaces. The adversarial multiplier is the policy lever — bump rare-but-important slices so they don't get out-voted by bulk traffic. The denominator deflates clusters the team has already worked on.
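A minimal sketch of that first draft in code, assuming the failure records already carry these fields; the dataclass, the adversarial multiplier, and the recovery window are illustrative choices, not any framework's schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FailureCluster:
    cluster_id: str
    traffic_share: float        # fraction of last week's production traffic matching this input shape
    severity: int               # 1-5, filled in once per cluster by a reviewer
    days_since_last_review: Optional[float]  # None if the cluster has never been opened
    is_adversarial: bool        # jailbreaks, compliance probes, regulated-domain edge cases
    cases_already_reviewed: int # the cluster_size_already_reviewed term in the formula above

def triage_priority(cluster: FailureCluster,
                    adversarial_multiplier: float = 5.0,
                    recovery_days: float = 14.0) -> float:
    # Never-opened clusters keep full weight; recently reviewed ones are
    # deprioritized and climb back toward full weight over `recovery_days`.
    if cluster.days_since_last_review is None:
        recency_decay = 1.0
    else:
        recency_decay = min(cluster.days_since_last_review / recovery_days, 1.0)

    slice_multiplier = adversarial_multiplier if cluster.is_adversarial else 1.0

    # +1 keeps the denominator defined for clusters nobody has reviewed yet.
    return (cluster.traffic_share * cluster.severity * recency_decay * slice_multiplier
            / (cluster.cases_already_reviewed + 1))
```

Sorting the open clusters by this score, descending, is the whole priority queue; the real work is choosing the constants and keeping the input fields fresh.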

This is a triage scoring layer, not a research problem. It is the same shape as the priority queue every on-call rotation runs for incident severity, the same shape every product team runs for bug triage, the same shape every customer-support tool ships out of the box. The reason it's missing in eval workflows is that the eval framework assumes the human at the end of the pipeline will sort it out. The human at the end of the pipeline has fifty cases of throughput and no time to sort.

The shape that emerges when the score is in place is not subtle. The top of the queue is dominated by clusters representing real production weight, freshly broken or never opened, with a sprinkling of adversarial probes the policy lever forced up. The reviewer's day produces fixes for the failures whose fixes actually move production. The dashboard moves.

Batching: one judgment, generalized across a cluster

The triage score solves which cluster to look at next. Batching solves how much value the reviewer extracts from one read.

The naive review loop opens one case, decides what to do, opens the next, decides again. If those two cases are structurally the same failure — same root cause, different surface text — the reviewer just paid twice for one decision. At fifty cases per day per reviewer, paying twice means losing half the budget.

The batching discipline is to cluster cases by shape before they reach the reviewer, then present one cluster at a time with a representative case and a count of similar cases behind it. The reviewer makes one judgment ("this cluster is a retrieval miss on long-tail entities, route to the retrieval team") and that judgment applies to every case in the cluster. The throughput goes from fifty cases per day to fifty clusters per day, which at typical cluster sizes is two to ten times the effective coverage.

The clustering is not magic. The standard approach is to embed the inputs and outputs of failing traces, run a clustering algorithm, and let the reviewer name the clusters during review — labels get attached, the cluster becomes a named bucket, and the next time a similar case shows up it joins the existing bucket. This is open coding followed by axial coding, borrowed from qualitative research. Theoretical saturation — the point at which twenty new cases don't reveal a new failure mode — is the signal that the cluster set has stabilized for now.
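A sketch of that clustering pass, with TF-IDF vectors standing in for whatever embedding model the team actually uses and scikit-learn's KMeans as the clustering step; the trace fields and the label-application helper are assumptions about how the review tool stores its data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_failures(traces: list[dict], n_clusters: int = 30) -> dict[int, list[dict]]:
    """Group failing traces by the shape of their input and output text."""
    texts = [t["input"] + "\n" + t["output"] for t in traces]
    vectors = TfidfVectorizer().fit_transform(texts)   # swap in real embeddings here
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(vectors)
    clusters: dict[int, list[dict]] = {}
    for trace, label in zip(traces, labels):
        clusters.setdefault(int(label), []).append(trace)
    return clusters

def apply_judgment(cluster: list[dict], reviewer_label: str) -> None:
    # One decision from the reviewer, attached to every case in the cluster.
    for trace in cluster:
        trace["reviewer_label"] = reviewer_label
```

The reviewer sees one representative case per cluster plus a count; n_clusters is a starting guess that gets revised as clusters are named, split, or merged during review.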

The teams that do this routinely will tell you the same thing: the first time they ran the eval set through a clustering pass, the apparent five hundred failures collapsed into about thirty real failure modes, and the work suddenly became sized to the team rather than the queue.

The adversarial-slice quota nobody defends

A scoring function that uses production traffic share as a primary input has a known failure mode: it under-samples the cases that don't look like bulk traffic. The jailbreak attempts, the regulated-domain edge cases, the prompts that came from a security researcher testing the system — these are rare by definition. Multiplying their frequency by their severity still leaves them swamped by the cases everyone hits.

The fix is not to let the math sort it out. The fix is a quota. Reserve some explicit fraction of the review budget — fifteen percent, twenty percent, whatever the policy says — for adversarial and high-severity slices. The quota is a top-of-queue policy: those clusters bypass the normal priority calculation up to the quota budget, and only fall back into the general queue once the quota is met for the period.
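In code, the quota is a few lines layered on top of the same priority sort; the budget and quota numbers below are illustrative, and this reuses the FailureCluster and triage_priority sketch from earlier:

```python
def plan_review_day(clusters: list[FailureCluster],
                    budget: int = 50,
                    adversarial_quota: float = 0.15) -> list[FailureCluster]:
    """Reserve part of the review budget for adversarial slices, fill the rest by priority."""
    by_priority = sorted(clusters, key=triage_priority, reverse=True)

    reserved = int(budget * adversarial_quota)
    adversarial_picks = [c for c in by_priority if c.is_adversarial][:reserved]

    taken = {c.cluster_id for c in adversarial_picks}
    general_picks = [c for c in by_priority if c.cluster_id not in taken]
    general_picks = general_picks[: budget - len(adversarial_picks)]

    # Quota picks go to the top of the day's queue; everything else follows the score.
    return adversarial_picks + general_picks
```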

This is the part of the design that gets pushed back on. The argument from the metrics team is that quota-prioritizing rare cases is mathematically inefficient — we're spending review time on cases that affect 0.1% of traffic when there are 30% cases waiting. That argument is correct on average and dangerous in the tail. The cases the quota protects are the ones that go to the regulator, the ones that surface in a Twitter thread, the ones that take down the system the day after the metrics team's dashboard was green. The quota is the team's insurance against being right on average and broken in the tail.

The teams that learn this learn it the same way: by shipping a model that scored well on the volumetric queue and getting a compliance escalation from a slice they had three failure cases for and never opened.

What changes when the queue gets a scoring layer

The shift looks small from the outside. The reviewer still opens cases. The eval framework still emits failures. The dashboards still show a fail rate.

Underneath, the unit of work changes from "a failure case" to "a failure cluster," and the order of work changes from "whatever sorted first" to "whatever moves the most traffic with the least review time, plus the slices we promised to protect." The number of fixes per quarter that actually shift a production metric goes up because the cases the team chose to fix were the ones with weight behind them. The number of compliance surprises goes down because the adversarial quota forced a review of cases that the volumetric queue would have buried.

The harder shift is cultural. The team has to accept that "we reviewed everything that failed this week" stops being the goal — it was never the goal in the first place; the goal was to learn the most about where the system is breaking, and to ship the fixes that matter. Reviewing everything is the FIFO-era artifact, the leftover from when the eval set was small enough that exhaustive review was possible. The eval set worked. It produced too much. The right response is a queue policy, not a guilt complex about the cases that didn't get read.

The architectural realization

The deeper realization is that eval failures are a stream of structured bug reports, and bug reports have always needed prioritization. Every mature engineering org runs a bug triage process — someone with context decides which bugs get worked on this week, which get rolled to next quarter, which get closed as won't-fix. The reason eval failures look different is that they arrive in batches of a thousand at a time, with no human attaching priority at the moment of filing, so the triage step has to be automated or it doesn't happen.

A team that runs its eval queue as FIFO is making a structural mistake of the same shape as a team that runs its bug tracker as the order issues were filed. The mistake is invisible while the volume is small. It becomes the dominant source of wasted engineering capacity the moment the volume crosses the reviewer's throughput.

The fix is not new infrastructure. The scoring function is twenty lines of code. The clustering is an embedding pass and a k-means call. The adversarial quota is a config field. The hard part is deciding that the queue policy is worth owning — that the question "which failures do we look at first" is a product question with a written answer, not a default that the file-system sort order gets to make.
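One way that config field ends up looking in practice: a small, checked-in policy object rather than a hard-coded default. The names and values here are illustrative.

```python
TRIAGE_POLICY = {
    "daily_review_budget": 50,       # clusters per reviewer per day
    "adversarial_quota": 0.15,       # fraction of the budget reserved for adversarial slices
    "adversarial_multiplier": 5.0,   # boost inside the priority score
    "recency_recovery_days": 14,     # how quickly a reviewed cluster climbs back into the queue
}
```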

The teams that ship the most reliable AI products are not the teams with the most eval cases. They are the teams whose reviewers spend their hours on the cases that pay back the hour. Everyone else is shipping fixes for the easy cases while the consequential failures wait in line.
