
Your Model Router Was Trained on Your Eval Set, Not Your Traffic

10 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped a model router that scored 96% routing accuracy on their offline benchmark and cut average inference cost by 58%. Three weeks in, support tickets started clustering around a specific user segment — enterprise admins running scripted bulk queries through their API. The cheap path was sending those users garbage answers. The router was working exactly as designed. The design was wrong.

That story is the rule, not the exception. The "send small-model what you can, save big-model for what you must" architecture is one of the most reliable cost levers in production LLM systems, with documented savings between 45% and 85% on standard benchmarks. But the savings number that gets quoted on every routing demo assumes a benchmark distribution. Production traffic doesn't have that shape, and the gap between the two is where quality regressions live — concentrated in segments your offline eval was never designed to surface.

The deeper failure here isn't a bug in any particular router. It's a category error. We keep building these things as classifiers — train, validate, ship, monitor accuracy. They're not classifiers. They're control systems. And a control system without feedback from the thing it's controlling is open-loop, which in control theory is a polite synonym for broken.

The Benchmark Sets the Distribution, and the Distribution Is Clean

Look at how a typical production router gets built. The team picks a benchmark — MT-Bench, MMLU, GSM8K, maybe a synthetic query set generated to span their use cases. They run both the cheap model and the expensive one against it, label which queries the cheap model handled correctly, and train a classifier that predicts "cheap can handle this" from query features.
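A minimal sketch of that recipe, under loud assumptions: a single surface feature (word count) stands in for real query features, a fitted cutoff stands in for a trained classifier, and the names `BenchmarkItem`, `fit_cutoff`, and `route` as well as the four benchmark items are hypothetical, made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    query: str
    cheap_correct: bool  # label: did the cheap model answer this item correctly?


def word_count(query: str) -> float:
    # Assumption: one surface feature stands in for a real feature set.
    return float(len(query.split()))


def fit_cutoff(items: list[BenchmarkItem]) -> float:
    """Pick the word-count cutoff that best predicts 'cheap can handle this'."""
    def accuracy(cutoff: float) -> float:
        hits = sum((word_count(i.query) <= cutoff) == i.cheap_correct for i in items)
        return hits / len(items)

    candidates = sorted({word_count(i.query) for i in items})
    return max(candidates, key=accuracy)


def route(query: str, cutoff: float) -> str:
    return "cheap" if word_count(query) <= cutoff else "expensive"


# Made-up labels, as if produced by running both models offline.
benchmark = [
    BenchmarkItem("capital of France", True),
    BenchmarkItem("define entropy briefly", True),
    BenchmarkItem("prove that the sum of two even numbers is even", False),
    BenchmarkItem("compare three sorting algorithms and justify the best one", False),
]
cutoff = fit_cutoff(benchmark)
print(cutoff, route("capital of Peru", cutoff))
```

On its own benchmark this toy router separates the two classes perfectly. That score says nothing about inputs the benchmark never contained: a short malformed string such as `}}}}` is one "word" and goes straight to the cheap path.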

The router scores well. Of course it does. The benchmark is built by people who understand the task. Queries are well-formed sentences. Each query expresses one intent. Inputs are clean and within the model's context window. Edge cases are represented in proportions that someone thought were reasonable.

Production is none of that. Production is a long tail of:

  • Malformed inputs. Queries with copy-paste artifacts, half-stripped HTML, three trailing question marks, accidental file dumps where someone meant to paste a sentence.
  • Multi-intent queries. A user asks one question, then another, then a follow-up clarification all in the same turn. The benchmark labeled queries by primary intent. Production has no primary intent.
  • Off-distribution domains. A new product launch sends queries about a feature that didn't exist when the benchmark was written. The router treats them as a familiar shape because the surface features look ordinary.
  • Adversarial weirdness. Users prompting in ways nobody anticipated, scripted clients submitting malformed JSON as queries, language mixing within a single message.

The router can't tell the difference. Its features were calibrated against a world where these inputs barely existed, so it confidently routes them to the cheap model. The bad answers come back, but they never show up as router errors on any dashboard, because the router did exactly what its training data told it to do.
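One way to make that gap visible is to compare production queries against the range of surface features the benchmark actually covered. The sketch below is an illustration only: the three features and the names `surface_features`, `feature_envelope`, and `outside_envelope` are assumptions, not part of any particular router.

```python
def surface_features(query: str) -> dict[str, float]:
    # Hypothetical features; a real router would use its own feature set.
    n = max(len(query), 1)
    symbols = sum(not (c.isalnum() or c.isspace()) for c in query)
    return {
        "length": float(len(query)),
        "non_alnum_ratio": symbols / n,
        "question_marks": float(query.count("?")),
    }


def feature_envelope(benchmark: list[str]) -> dict[str, tuple[float, float]]:
    """Min and max of each feature over the benchmark queries."""
    rows = [surface_features(q) for q in benchmark]
    return {k: (min(r[k] for r in rows), max(r[k] for r in rows)) for k in rows[0]}


def outside_envelope(query: str, envelope: dict[str, tuple[float, float]]) -> list[str]:
    """Names of the features on which this query falls outside the benchmark range."""
    feats = surface_features(query)
    return [k for k, (lo, hi) in envelope.items() if not lo <= feats[k] <= hi]


benchmark_queries = [
    "What is the capital of France?",
    "Summarize this paragraph in one sentence.",
]
envelope = feature_envelope(benchmark_queries)
# A scripted client's malformed payload, with made-up content.
print(outside_envelope('{"q": "<div>refund???', envelope))
```

A query flagged on several features is one the router has no calibrated basis for routing, whatever confidence it reports.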

The Errors Cluster, and the Clustering Is the Story

Here's the part that gets missed when teams report aggregate routing accuracy. Production failures from a benchmark-trained router are not uniformly distributed across users. They cluster. They cluster hard.

The clustering follows the same pattern almost every time:

  • A small subset of users with unusual usage patterns — power users, API customers, automation, enterprise scripts.
  • A specific feature surface where the input shape differs from organic chat — bulk processing, programmatic clients, integrations.
  • A locale or language where the benchmark coverage was thin.
  • A workflow stage where context has accumulated and the prompt is now structurally different from a fresh conversation.

Your aggregate routing quality looks fine because these segments are 3% of total traffic. But for that 3%, the router is silently sending queries that needed the expensive model to the cheap one, and the affected users see the worst version of your product. They are also disproportionately the users who churn loudly, file support tickets, write scathing reviews, and tell their colleagues.
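The effect is easy to reproduce with a per-segment breakdown. The sketch below assumes each logged request carries a segment label, the route taken, and a quality judgment; the record shape, the `failure_rates` helper, and all counts are hypothetical numbers chosen only to mirror the 3% scenario.

```python
from collections import defaultdict


def failure_rates(records):
    """records: iterable of (segment, routed_to, answer_ok) tuples.

    Returns (overall_rate, {segment: rate}), counting a failure whenever
    a request went to the cheap model and the answer was not acceptable.
    """
    totals = defaultdict(int)
    fails = defaultdict(int)
    for segment, routed_to, answer_ok in records:
        totals[segment] += 1
        if routed_to == "cheap" and not answer_ok:
            fails[segment] += 1
    per_segment = {s: fails[s] / totals[s] for s in totals}
    overall = sum(fails.values()) / sum(totals.values())
    return overall, per_segment


# Made-up traffic: 970 organic chat requests, 30 scripted bulk API requests.
records = (
    [("organic_chat", "cheap", True)] * 960
    + [("organic_chat", "cheap", False)] * 10
    + [("bulk_api", "cheap", False)] * 24
    + [("bulk_api", "expensive", True)] * 6
)
overall, per_segment = failure_rates(records)
print(f"overall: {overall:.1%}")
for segment, rate in per_segment.items():
    print(f"{segment}: {rate:.1%}")
```

With these illustrative counts the overall failure rate stays in the low single digits while the bulk API segment fails on most of its requests, which is the pattern an aggregate dashboard hides.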

This is why "the quality loss in the long tail often exceeds the cost savings on the hot path" is the frame I've started using when teams ask whether to ship a router. The savings are real, but they are paid for by a small number of users who are now getting answers their use case can't tolerate. The accounting is rarely run that way.
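Running the accounting that way can be as simple as netting the savings against a per-segment cost of bad answers. This is a sketch under stated assumptions: `net_router_value` is a hypothetical helper, and the dollar figures and per-answer costs are invented placeholders that a team would have to estimate from its own churn and support data.

```python
def net_router_value(monthly_savings: float, segments: list[dict]) -> float:
    """Savings minus the estimated cost of bad answers, summed per segment.

    Each segment dict needs 'bad_answers' (count per month) and
    'cost_per_bad_answer' (estimated cost of one bad answer, same currency).
    """
    quality_cost = sum(s["bad_answers"] * s["cost_per_bad_answer"] for s in segments)
    return monthly_savings - quality_cost


# Placeholder numbers only: a bad answer to an enterprise script is assumed
# to cost far more than a bad answer in casual chat.
segments = [
    {"name": "organic_chat", "bad_answers": 100, "cost_per_bad_answer": 1.0},
    {"name": "bulk_api", "bad_answers": 240, "cost_per_bad_answer": 50.0},
]
print(net_router_value(10_000.0, segments))
```

The design choice that matters is weighting bad answers by segment rather than counting them equally; a flat count would let the hot path's volume drown out the long tail again.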

A Router Is a Control System, Not a Classifier
