
The Router Is the Product: Why Your Cheap Classifier Decides More Behavior Than Your Flagship Model

10 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped what they called "the routing project": a tiny BERT classifier in front of their flagship model that decided whether a query was simple enough for a cheaper, faster fallback. It paid for itself in three weeks. The cost dashboards lit up green. The flagship's eval suite — three hundred adversarial cases, weekly grading runs, the works — still passed every Friday.

Six weeks in, retention on a particular product surface dropped four points and nobody could find the cause. The flagship was fine. Latency was fine. The router, it turned out, was sending 71% of queries to the cheap model. It had been since week two. The cheap model was the product for most users, and the cheap model had no eval suite at all.

This is the most common failure mode I see in 2026 among teams that adopted LLM routing for cost control: the eval discipline gets attached to the expensive tail of the system, and the cheap head — the part that defines the product for most of the request volume — runs blind.

The router is not the optimization. It is the product surface.

Routing is sold as a cost lever. The pitch is familiar: premium frontier models run $30 to $60 per million tokens, mid-tier $10 to $15, small open-weights as low as $0.10 to $0.50. Put a small classifier in front, send simple queries to the cheap tier, and your bill drops 30% to 85% with "minimal quality impact."

The framework that popularized this approach, RouteLLM, demonstrated 95% of GPT-4 quality on MT Bench while making only 26% of its calls to GPT-4. That number is real. It is also dangerous to internalize without staring at its inverse: 74% of queries go to the cheap model. If the cheap model handles 74% of requests, then the cheap model is responsible for 74% of your product's behavior — and roughly 74% of the behavior new users see when they form an opinion about whether your product is good.

Teams treat the router like load balancing. They evaluate the flagship as if it were the product. But your users do not encounter the flagship directly; they encounter the output of the router's decision. If the router is calibrated even slightly aggressively toward cheap, the modal user experience is not your flagship — it is a smaller model interpreting an ambiguous prompt under a tight latency budget. That is the product. The flagship is a fallback that fires on a minority of traffic.

The architectural realization is simple and uncomfortable. Your eval investment should be proportional to the share of traffic each path handles. If 70%+ of queries hit the cheap path, 70%+ of your eval engineering should be aimed there. Most teams have it inverted: 90% of evals point at the model that handles 30% of the load.

The failure modes that compound when the router is unevaluated

When you don't evaluate the routing layer separately, four kinds of failures stack up quietly.

False-negative routing. The router sees a query that genuinely needs the flagship, decides it's "simple," and dispatches it to the cheap model. The cheap model does not know it is in over its head, so it does not escalate — it answers. Self-consistency-based escalation only fires when the cheap model produces visibly inconsistent outputs across samples; queries that need the flagship for nuance, not for raw correctness, slip through silently. The user gets a confident, plausible, wrong answer. No exception is raised. No alert fires. The thumb-down rate ticks up by half a percent.
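
To make that gap concrete, here is a minimal sketch of what self-consistency-based escalation usually looks like; the function name, the agreement threshold, and the exact-match comparison of samples are illustrative assumptions, not any particular framework's implementation.

```python
from collections import Counter

def self_consistency_escalate(query, cheap_model, k=5, agreement_floor=0.6):
    """Sample the cheap model k times and escalate only when the samples
    visibly disagree with each other. (Assumes cheap_model returns a
    comparable string; real systems normalize or semantically compare.)"""
    samples = [cheap_model(query) for _ in range(k)]
    answer, count = Counter(samples).most_common(1)[0]
    agreement = count / k
    if agreement < agreement_floor:
        return "escalate"      # inconsistent samples: send to the flagship
    return answer              # consistent, and possibly confidently wrong
```

A query that needs the flagship for nuance will often produce five consistent, confidently wrong samples, so the check never fires.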

Bias by language or topic. Your router was trained on yesterday's traffic, which skewed English, Western product names, and a particular vocabulary of edge cases. New queries in Spanish or Mandarin look "simpler" to the classifier than they really are because the classifier was rarely correct on them in training and learned to default. So you ship a quality disparity along language lines that nobody designed and that nobody is monitoring — because the eval set is also primarily English. Worse: your business KPIs cut the same way, and "non-English users have lower engagement" gets attributed to localization gaps instead of routing miscalibration.

Drift from new query types. Routers trained on historical traffic systematically misroute queries that didn't exist when the training set was frozen. A new feature ships, users start asking a new kind of question, the router has never seen that distribution, and it falls back to its prior — which usually means cheap-path, because the training optimization rewarded cheap-path on the dominant historical distribution. New product surfaces get the worst routing precisely because they are new. Recent research on what happens "when routing collapses" describes routers that converge to degenerate behavior, defaulting to one model regardless of input — a particularly nasty form of drift where the classifier silently stops classifying.

Adversarial reroute. The "Rerouting LLM Routers" line of work showed that small perturbations to inputs can flip routing decisions, and that attackers can deliberately phrase queries to coerce expensive-tier escalation (cost amplification) or cheap-tier dispatch (quality attack). If your router is a small fine-tuned classifier with no adversarial training, you have a side channel for both economic and quality denial-of-service.

None of these are theoretical. Each is something teams find only when they finally look. The reason they don't find them sooner is that the eval discipline they brought from the pre-routing era runs on the model, not on the routing decision.

Hardness calibration: evaluating the routing decision separately from the answer

The fix is not "add the cheap model to your existing eval suite," though you should do that too. The fix is recognizing that there are now two products to evaluate:

  1. The model's answer given a routing decision.
  2. The routing decision itself.

These are different products with different metrics. A model eval asks "given this query landed at this model, was the response correct?" A routing eval asks "given this query, was sending it to this tier the right call?"

To evaluate routing decisions, you need a hardness label per query. Hardness can be obtained the same way RouteLLM gets it: ask the flagship and the cheap model the same query, score both, and use the score gap as the ground-truth signal of whether escalation was needed. If the cheap model scores within ε of the flagship, the cheap path was the right answer. If the gap is large, the flagship was needed and the router should have escalated.
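
A minimal sketch of that labeling step, assuming you already have callables for the two tiers and a grader that returns a numeric score per response; the names and the epsilon value are illustrative, not RouteLLM's API.

```python
from dataclasses import dataclass

@dataclass
class HardnessLabel:
    query: str
    cheap_score: float
    flagship_score: float
    needed_flagship: bool    # ground truth for the routing decision

def label_hardness(query, cheap_model, flagship_model, grader,
                   epsilon: float = 0.05) -> HardnessLabel:
    """Ask both tiers the same query and use the score gap as the hardness
    signal: if the cheap model lands within epsilon of the flagship, the
    cheap path was the right call."""
    cheap_answer = cheap_model(query)
    flagship_answer = flagship_model(query)
    cheap_score = grader(query, cheap_answer)
    flagship_score = grader(query, flagship_answer)
    return HardnessLabel(
        query=query,
        cheap_score=cheap_score,
        flagship_score=flagship_score,
        needed_flagship=(flagship_score - cheap_score) > epsilon,
    )
```

Any grader works here: an LLM-as-judge rubric, exact match, or unit tests, as long as it produces a comparable score for both tiers.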

With hardness labels in hand, the routing layer gets its own confusion matrix — false-positive escalation (sent to flagship, didn't need it: pure cost) and false-negative escalation (kept on cheap path, needed flagship: pure quality loss). These are the actual SLIs of a router. Latency and cost are downstream of them.

The most actionable derived metric is escalation rate, sliced by query segment. Escalation rate that drifts up in aggregate while cost rises is classifier drift — the router is losing confidence and overshooting to flagship. Escalation rate that drifts down while quality drops is classifier collapse — the router is increasingly answering everything cheap-path. Either pattern is a routing incident, separate from any model incident.
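
With those labels joined to the router's actual decisions, the confusion matrix and the escalation rate reduce to a few lines of counting. This sketch assumes each record pairs the routing decision with the hardness label; the field names are illustrative.

```python
def routing_confusion(decisions):
    """decisions: iterable of (routed_to_flagship, needed_flagship) booleans,
    i.e. the router's actual call next to the hardness label for that query."""
    counts = {
        "false_positive_escalation": 0,  # sent to flagship, didn't need it: cost
        "false_negative_escalation": 0,  # kept cheap, needed flagship: quality loss
        "true_escalation": 0,
        "true_cheap": 0,
    }
    for routed_up, needed_up in decisions:
        if routed_up and not needed_up:
            counts["false_positive_escalation"] += 1
        elif not routed_up and needed_up:
            counts["false_negative_escalation"] += 1
        elif routed_up:
            counts["true_escalation"] += 1
        else:
            counts["true_cheap"] += 1
    total = sum(counts.values()) or 1
    escalation_rate = (counts["true_escalation"]
                       + counts["false_positive_escalation"]) / total
    return counts, escalation_rate
```

In practice you would compute this per segment (language, topic, user cohort), because that is where the drift and collapse patterns first show up.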

Shadow routing: the cheapest way to keep the router honest

Hardness labels at scale are expensive if you compute them after the fact. Shadow routing is the operational practice that makes them cheap: take every Nth query in production, run it through both paths simultaneously, log the cheap response, log the flagship response, and have a grader (LLM-as-judge or a thin human review queue) score the disagreement.

Shadow routing has three properties that matter:

  • It runs on real production traffic, so it is automatically representative of whatever the current distribution is — including the new query types that your offline eval set hasn't caught up with.
  • It is observational, not experimental. The user only sees the production routing decision; the shadow path's output is logged and discarded. So shadow routing has no user impact and does not need product-side approval the way an A/B test does.
  • The disagreement signal is denser than any offline eval. Even at N=100, a moderate-traffic product gets thousands of hardness samples per day, segmented by language, topic, and user cohort.

Shadow routing pairs with a disagreement triage queue. When the cheap and flagship answers diverge by more than a calibrated threshold, the query, both responses, and the routing decision get filed into a queue that an on-call AI engineer reviews weekly. That queue is also the source of the next router retraining batch — the router learns most from the queries it got wrong, which it cannot identify on its own.
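
Here is a sketch of the sampling hook and the triage hand-off, written as a synchronous handler for brevity (a real deployment would run the shadow calls off the hot path). SHADOW_RATE, the judge callable, the disagreement threshold, and the queue interface are all illustrative assumptions.

```python
import random

SHADOW_RATE = 0.01              # roughly every Nth query, N = 100
DISAGREEMENT_THRESHOLD = 0.3    # calibrated offline; illustrative value

def handle_query(query, router, cheap_model, flagship_model, judge, triage_queue):
    # Production path: the user only ever sees this decision's output.
    tier = router.route(query)                  # "cheap" or "flagship"
    model = flagship_model if tier == "flagship" else cheap_model
    user_response = model(query)

    # Shadow path: observational only. Run both tiers, score the gap,
    # and file large disagreements for weekly review and router retraining.
    if random.random() < SHADOW_RATE:
        cheap_answer = cheap_model(query)
        flagship_answer = flagship_model(query)
        gap = judge(query, flagship_answer) - judge(query, cheap_answer)
        if abs(gap) > DISAGREEMENT_THRESHOLD:
            triage_queue.put({
                "query": query,
                "router_decision": tier,
                "cheap_answer": cheap_answer,
                "flagship_answer": flagship_answer,
                "score_gap": gap,
            })
    return user_response
```

Because the shadow output is logged and discarded, the only production cost is the extra inference on the sampled slice.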

What to instrument before you ship a router (or fix one you already shipped)

If you take one thing from this: every routing decision needs to be a first-class observability dimension, not a hidden internal state.

  • Per-trace routing label. Every request log includes the routing decision (which tier was chosen), the routing reason (the rule or score that drove it), the routing model version, and a routing-confidence score. Your cost dashboard, your quality dashboard, your latency dashboard, your retention slice — all of them slice on this dimension. If a regression appears, you can answer "is this a routing problem or a model problem?" in seconds. A minimal log-record sketch follows this list.
  • Escalation rate as an SLI. Set a target band for escalation rate (e.g., 18-24%). Page on excursions in either direction. The temptation will be to suppress the upper-bound alert because "more escalation just means we're being safe" — resist this. Escalation-rate excursions in either direction are signals of distribution shift the router can no longer keep up with.
  • Per-segment quality slices. Quality metrics segmented by routing decision and by user segment (language, geography, customer tier, query topic). The disparities you don't measure are the disparities that ship.
  • Shadow routing queue. A standing background sample, even a small one, with a disagreement-triage process. Without this, you have no source of truth for hardness, and every retrain is opinion-driven.
  • Adversarial probes in CI. A small set of crafted queries — paraphrases that should not change tier, edge cases that must escalate — that runs against every router release the way a typecheck runs against every build. Routers regress silently in ways flagship models do not, because they are smaller and more brittle to input distribution.
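
As a sketch of the first bullet, here is what a per-trace routing record can look like as a flat, sliceable log line; the field names and the sink are assumptions to adapt to your own tracing schema, not a prescribed format.

```python
import json
import time
import uuid

def log_routing_decision(trace_id, tier, reason, router_version, confidence,
                         sink=print):
    """Emit the routing decision as a first-class, sliceable log field."""
    record = {
        "trace_id": trace_id,
        "ts": time.time(),
        "routing.tier": tier,                  # which model tier was chosen
        "routing.reason": reason,              # rule or score that drove it
        "routing.model_version": router_version,
        "routing.confidence": confidence,
    }
    sink(json.dumps(record))

# Example: every dashboard (cost, quality, latency, retention) can now
# group by routing.tier and routing.model_version.
log_routing_decision(str(uuid.uuid4()), "cheap", "score=0.12<0.35",
                     "router-2026-02-01", 0.88)
```

Emitting the router's version and confidence on every trace is what lets you distinguish "the router changed" from "the traffic changed" when an escalation-rate alert fires.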

The product framing your VP of engineering needs to hear

The reason teams under-invest in router eval is organizational, not technical. The router is owned by the platform/infra team because it looks like an infrastructure component. Evals are owned by the AI/applied team because they look like a model-quality concern. Nobody is funded to evaluate the routing layer's product impact, because nobody owns "the cheap path" as a product surface.

The reframe that gets buy-in: the cheap path is the modal product. If 70% of users see it on 70% of their queries, it deserves the same product-management discipline you give to your flagship surface. That means a PM who owns the user-facing quality of the routed product, an eval owner whose dashboards are weighted by traffic share rather than model importance, and a routing-incident category in your incident-management taxonomy that is distinct from a model incident.

Cost-aware routing is one of the highest-leverage patterns in 2026 LLM systems. It is also the place where the "small thing nobody is watching" decides what your users actually experience. The flagship model is what your engineers brag about. The router is what ships.
