Your Model Router Is a Load Balancer That Cannot See the Load
A load balancer in front of a web fleet works because every machine reports back: CPU, queue depth, error rate, latency. The balancer reads the load and routes accordingly. A model router does not get that telemetry. It decides which model handles a query by looking only at the query, before the model has done anything. The router predicts difficulty from the prompt. Real difficulty only shows up in the answer. By the time the signal exists, the routing decision is already three seconds old and the cheap model has already shipped a confident, wrong reply to your user.
This is the structural defect at the center of model routing, and most teams ship a router without ever framing it this way. They frame it as a classifier — train a model to label queries as "easy" or "hard," validate it on a held-out set, ship when accuracy clears 90%. The classifier metaphor is wrong in a way that matters. A classifier predicts a label that already exists. The router is predicting a label that does not exist yet, will not exist until the routed model has answered, and may never exist in a form clean enough to learn from.
The result is a system whose confidence is uncorrelated with the thing it is supposed to control. The router is 96% sure it picked the right model. The downstream model is silently wrong 4% of the time on the cheap path, which sounds small, until you realize you have no idea which 4% and no automatic way to find out. Misroutes don't surface as 500s. They surface as a slightly worse product for a slowly-changing slice of users, which is exactly the kind of failure modern observability is worst at catching.
The Label You Need Doesn't Arrive Until After the Decision
Most prediction problems give you the label eventually. A spam classifier predicts spam, you mark messages that slip through, those become training data. A churn model predicts churn, the user either renews or doesn't, you train on the truth next quarter. The loop closes.
Routing doesn't close. The label you want — "would the cheap model have been good enough for this query?" — is a counterfactual. To get it cleanly you would have to run both the cheap and the expensive model on every query, judge both outputs, and compare. That is exactly the cost regime routing was supposed to escape. So nobody does it for live traffic. Teams approximate, using either an offline benchmark (where you do run both and judge both, on a query distribution that does not match production) or a tiny shadow sample (where you sample, say, 1% of traffic and double-route it, which leaves you with confidence intervals too wide to detect anything but gross regressions).
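To put rough numbers on "too wide," here is a back-of-the-envelope sketch using the normal approximation for a proportion. The traffic volume and the 4% bad-answer rate are made-up inputs, and the helper names are mine; only the 1% shadow fraction comes from the paragraph above.

```python
import math


def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)


def samples_needed(p: float, half_width: float, z: float = 1.96) -> int:
    """Samples required to shrink the interval to the given half-width."""
    return math.ceil(z * z * p * (1.0 - p) / (half_width ** 2))


daily_queries = 100_000   # hypothetical traffic volume
shadow_fraction = 0.01    # the 1% double-routed sample
bad_answer_rate = 0.04    # hypothetical true rate of bad cheap-path answers

n = round(daily_queries * shadow_fraction)
print("shadow samples per day:", n)
print("95% half-width:", round(ci_half_width(bad_answer_rate, n), 4))
print("samples for +/-0.25 points:", samples_needed(bad_answer_rate, 0.0025))
```

Under these assumed inputs, a day of shadow traffic pins a 4% rate down to only about plus or minus 1.2 percentage points, so a drift from 4% to 5% is invisible. Tightening that to a quarter of a point takes more than twenty times as many double-routed queries.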
This is not a bug in any one router. It is the bandit problem in a thin disguise. The router is a contextual bandit choosing between arms, observing the reward of the chosen arm only, never the reward of the arm it didn't pick. Treating that as supervised classification, with a confusion matrix and an F1 score, sneaks an assumption past the team — that the label was knowable from the prompt all along. It wasn't. The router is guessing at a thing that only exists downstream of the guess.
The Feedback Loop Eats Itself
Now compound the problem: assume the team does collect data after launch, and trains the next router on logged production traffic. Every entry in that training set was conditioned on the router's previous decision. The cheap model only got the queries the router thought were easy. The expensive model only got the queries the router thought were hard. The "ground truth" you train on is a sample drawn by the policy you are trying to improve.
This is the classic policy-bias trap in offline RL and counterfactual learning — when training data is shaped by an action policy, the patterns the model learns don't generalize once the policy changes. Researchers have built half a literature on this for personalization and pricing. Routing teams mostly ship without reading any of it. The result is a router that converges to its own priors. The queries it routes cheap look easy in the logs (because the answers were short, the user didn't escalate, no retry fired). The queries it routes expensive look hard (because the answers were long, the user engaged with them, the conversation continued). The router's confidence climbs. Its decisions get more biased. The feedback "loop" is a tube with one end plugged.
The first sign this is happening is usually that retraining doesn't move the metrics. The team adds three months of fresh data, re-runs the pipeline, and watches the new router behave almost identically to the old one. They conclude it is "stable," which is the kind word for what is actually a self-confirming policy that has stopped learning.
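One standard tool from that counterfactual-learning literature is inverse propensity scoring: occasionally route against the router, log the probability of the route you took, and reweight. The post itself does not prescribe this technique, so treat the following as an illustrative sketch. Every name, the toy world, and the success rates are assumptions chosen to make the bias visible.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LoggedDecision:
    route: str         # route actually taken: "cheap" or "expensive"
    propensity: float  # probability the logging policy assigned to that route
    reward: float      # downstream outcome, e.g. 1.0 if the user did not regenerate


def route_with_exploration(router_choice: str, epsilon: float,
                           rng: random.Random) -> Tuple[str, float]:
    """Follow the router, but flip the route with probability epsilon.

    Returns the route taken and the probability of taking it. That
    probability has to be logged, or later reweighting is impossible.
    """
    other = "expensive" if router_choice == "cheap" else "cheap"
    if rng.random() < epsilon:
        return other, epsilon
    return router_choice, 1.0 - epsilon


def naive_value(logs: List[LoggedDecision], route: str) -> float:
    """Mean reward over the queries that happened to take this route."""
    rewards = [d.reward for d in logs if d.route == route]
    return sum(rewards) / len(rewards)


def ips_value(logs: List[LoggedDecision], route: str) -> float:
    """Inverse-propensity estimate of mean reward if every query took this route."""
    weighted = sum(d.reward / d.propensity for d in logs if d.route == route)
    return weighted / len(logs)


def simulate(n: int, epsilon: float, seed: int) -> List[LoggedDecision]:
    """Toy world: 30% of queries are hard and the router spots them perfectly.

    Assumed success rates: the cheap model succeeds on 95% of easy queries
    and 40% of hard ones; the expensive model succeeds on 95% of everything.
    """
    rng = random.Random(seed)
    logs = []
    for _ in range(n):
        hard = rng.random() < 0.3
        router_choice = "expensive" if hard else "cheap"
        route, propensity = route_with_exploration(router_choice, epsilon, rng)
        p_success = 0.40 if (hard and route == "cheap") else 0.95
        reward = 1.0 if rng.random() < p_success else 0.0
        logs.append(LoggedDecision(route, propensity, reward))
    return logs


logs = simulate(n=20_000, epsilon=0.1, seed=0)
print("naive cheap-path reward:", round(naive_value(logs, "cheap"), 3))
print("IPS all-cheap estimate: ", round(ips_value(logs, "cheap"), 3))
```

In this toy world the true value of sending everything cheap is 0.7 × 0.95 + 0.3 × 0.40 = 0.785, while the logged cheap-path average sits near 0.925 in expectation, because the logs contain almost none of the hard queries. The reweighted estimate recovers something close to the true figure, at the price of higher variance and a deliberate 10% of flipped routes.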
Misroutes Look Like Mediocre Quality, Not Like Errors
If routing failures triggered exceptions, this whole post would be unnecessary. They don't. The cheap model gets a hard query, does its best, and returns a fluent, plausible answer that is somewhere between unhelpful and subtly wrong. The response field is non-null. The status code is 200. The latency is fine. The user reads it and either accepts it (because they don't know what the better answer was), corrects it (an extra turn of conversation that is hard to attribute), or gives up (a session ending early, which looks indistinguishable from a satisfied short session).
None of those outcomes throws an alert. None registers on a router-accuracy dashboard. The router is happily reporting that it routed N% of traffic to the cheap path and saved $X. But saving $X by getting Y% of the cheap-path answers wrong is only a win if the number of wrong answers times the cost of a bad answer is less than $X, and most teams cannot quote what the cost of a bad answer is.
The harder-to-swallow corollary is that this asymmetry biases the org. Cost savings are easy to chart. Silent quality loss isn't. Routing decisions get rewarded in budget reviews and punished only when a complaint loud enough to escalate finally surfaces, by which time the regression is months old and probably attributed to "the model changed" or "users got pickier."
The Honest Signals Live Downstream
The router cannot see the load directly, but the load leaves fingerprints in the user's behavior afterward. The trick is to instrument those fingerprints and feed them back as the only labels you actually trust. The candidates are all imperfect, and they all share one virtue: they are caused by routing decisions instead of being predicted by the router.
- Escalation rates by route. If your cascade allows the cheap path to defer to the expensive one when its confidence is low, the escalation rate is a direct measurement of misroute pressure. A cheap path that escalates 30% of the time is telling you the router got roughly 30% of its "easy" calls wrong, which is a number no offline benchmark will produce.
- Regenerations and edits. Users who re-ask, rephrase, or hit "try again" are voting on quality with their behavior. Aggregate the rate by routing decision. If cheap-path queries regenerate 2x as often as expensive-path queries of similar shape, the cheap path is losing answers.
- Follow-up corrections. Multi-turn sessions where turn N+1 starts with "no, I meant..." or "that's wrong, ..." are labeled rejections of turn N. A cheap rule-of-thumb matcher is enough to extract them and bucket them by route.
- Abandonment partway through a stream. A user who closes the tab before the answer finishes streaming was rejecting the answer in progress. Track session duration and stream completion ratios per route.
- Human-escalation triggers. For agents with a handoff path, the rate of "escalate to human" by route is the loudest possible label. It is also rare, so it only catches the worst routes, but those are the ones that matter.
Each of these signals is noisy. None of them is ground truth. But unlike "router accuracy on the eval set," they all change in the direction of the thing you actually care about. A router whose cheap path produces a rising regeneration rate is failing, even if its classifier confidence is rock steady.
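As a rough illustration of bucketing these signals by route, here is a minimal sketch. The event schema (a dict with a `route` key and boolean signal flags), the signal names, and the correction regex are all assumptions for the example, not a real logging format.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Hypothetical boolean flags attached to each logged response.
SIGNALS = ("escalated", "regenerated", "corrected", "abandoned")

# Crude rule of thumb for "the user is rejecting the previous turn".
CORRECTION_RE = re.compile(
    r"^\s*(no,? i meant|that'?s (wrong|incorrect|not right))", re.IGNORECASE
)


def looks_like_correction(next_user_turn: str) -> bool:
    """True if the following user message opens with a rejection phrase."""
    return bool(CORRECTION_RE.match(next_user_turn))


def signal_rates_by_route(events: Iterable[dict]) -> Dict[str, Dict[str, float]]:
    """Rate of each downstream signal, grouped by the route that produced the answer."""
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for event in events:
        route = event["route"]
        totals[route] += 1
        for signal in SIGNALS:
            if event.get(signal, False):
                counts[route][signal] += 1
    return {
        route: {signal: counts[route][signal] / totals[route] for signal in SIGNALS}
        for route in totals
    }


events = [
    {"route": "cheap", "regenerated": True},
    {"route": "cheap", "regenerated": True, "corrected": looks_like_correction("No, I meant the 2023 figures")},
    {"route": "cheap"},
    {"route": "cheap", "abandoned": True},
    {"route": "expensive", "regenerated": True},
    {"route": "expensive"},
    {"route": "expensive"},
    {"route": "expensive"},
]

rates = signal_rates_by_route(events)
print(rates["cheap"]["regenerated"], rates["expensive"]["regenerated"])
```

The useful output is the comparison across routes, not any single rate. A real version would also need to compare queries of similar shape, which this sketch ignores.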
Validate the Router Against the Money It Was Supposed to Save
Before you trust a router with production traffic, run one more experiment that almost no team runs: deliberately misroute and measure the damage. Pick a slice of traffic, force every query to the cheap model regardless of router output, and compare downstream signals (escalation, regeneration, abandonment, retention by cohort) against a matched control routed expensive. The delta is the worst case of routing wrong every time. Then pick another slice, force every query to the expensive model, measure the same signals. The delta in the other direction is the upper bound of what perfect routing could buy you.
Now you have the actual range. Your router is somewhere in between. If a deliberately bad router only costs you 3% on regeneration rate, and your live router is sitting at 2.5%, you have learned something concrete: your router is barely better than sending everything to the cheap path, and the apparent classifier accuracy was measuring the easy half of the distribution. If a deliberately bad router costs you 25% on regeneration rate and yours sits at 4%, you have a real router doing real work. Without that calibration, you have no idea which world you're in.
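One way to collapse that range into a single number is to ask what fraction of the worst-case damage the live router avoids. This is a sketch under the assumption that both figures are regeneration-rate increases measured against the same all-expensive baseline; the function name and the normalization are mine.

```python
def damage_avoided(forced_cheap_delta: float, live_delta: float) -> float:
    """Fraction of the worst-case quality loss that the live router avoids.

    0.0 means the router is no better than sending everything cheap;
    1.0 means it matches the all-expensive baseline.
    """
    if forced_cheap_delta <= 0:
        raise ValueError("forced-cheap routing showed no measurable damage")
    return 1.0 - live_delta / forced_cheap_delta


# The two scenarios described above.
print(round(damage_avoided(0.03, 0.025), 2))  # weak router
print(round(damage_avoided(0.25, 0.04), 2))   # router doing real work
```

The first case avoids about a sixth of the available damage; the second avoids 84% of it. The same two routers could easily report similar classifier accuracy.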
The same logic applies to the dollars. Routing-savings numbers are usually reported in isolation: "we cut inference cost 58%." A more honest number is the ratio of dollars saved to the downstream cost of misroutes. If the misroute cost is unmeasured, the savings figure is a half-statement. The router is paying for itself in some currency the team didn't account for, and the only way to find out which currency is to measure it on purpose.
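A minimal ledger makes the half-statement whole. All the dollar amounts and rates below are invented for illustration, and the class is a hypothetical helper; the point is only that the savings figure needs a misroute-cost figure next to it.

```python
from dataclasses import dataclass


@dataclass
class RoutingLedger:
    inference_savings: float    # dollars saved by using the cheap path
    cheap_queries: int          # queries sent down the cheap path
    misroute_rate: float        # fraction of cheap-path answers judged bad
    cost_per_bad_answer: float  # dollars; the number most teams cannot quote

    @property
    def misroute_cost(self) -> float:
        return self.cheap_queries * self.misroute_rate * self.cost_per_bad_answer

    @property
    def net_value(self) -> float:
        return self.inference_savings - self.misroute_cost

    @property
    def savings_to_misroute_ratio(self) -> float:
        return self.inference_savings / self.misroute_cost

    @property
    def break_even_cost_per_bad_answer(self) -> float:
        """Cost per bad answer at which the router stops paying for itself."""
        return self.inference_savings / (self.cheap_queries * self.misroute_rate)


ledger = RoutingLedger(
    inference_savings=58_000.0,  # hypothetical monthly savings
    cheap_queries=2_000_000,
    misroute_rate=0.04,
    cost_per_bad_answer=0.50,
)
print(ledger.misroute_cost, ledger.net_value)
print(ledger.savings_to_misroute_ratio, ledger.break_even_cost_per_bad_answer)
```

With these made-up inputs the router nets about $18,000 instead of $58,000, and it goes negative once a bad answer costs more than roughly 72 cents. Even when the true cost per bad answer is unknown, the break-even figure gives the team a concrete number to argue about.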
A Better Mental Model: Routing Is a Control Problem on Slow, Indirect Signals
Once you stop treating the router as a classifier, the design opens up. The router is not deciding once based on the prompt; it is a control loop that observes downstream behavior, attributes that behavior back to routing decisions, and adjusts. Like every control loop, it lives or dies on signal latency and observability.
A few practical implications:
- Don't commit to the cheap path. Cascade with a fallback. The cheap model produces a candidate; a cheap verifier (token-level uncertainty, schema validation, confidence threshold from the model itself, a lightweight judge) gates whether to ship that candidate or escalate. This converts the routing decision from a prediction into a post-hoc check, which is the only frame in which the router has access to the right information.
- Treat the offline eval as calibration, not as the source of truth. Use it to sanity-check that the router can tell trivially easy queries from trivially hard ones. Don't use it to set thresholds. Set thresholds against live downstream signals, per cohort if possible.
- Sample, judge, and replay deliberately. Allocate budget — explicitly, as a line item — to double-routing a small percentage of traffic and grading the difference offline. This is your only counterfactual signal, and it doesn't exist unless you pay for it.
- Watch for the policy-bias trap on every retrain. When you retrain on logged traffic, the cheap path's training data is whatever the previous router decided was easy. Mix in deliberately misrouted samples, or accept that your retrains are fitting your prior policy harder rather than learning anything new.
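The cascade in the first item above can be sketched in a few lines. The models here are stub functions, the verifier is a toy schema check, and the 0.7 threshold is arbitrary; a real gate might use token-level uncertainty or a lightweight judge instead.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    answer: str
    route: str       # which model produced the shipped answer
    escalated: bool  # True if the cheap candidate was rejected


def cascade(
    query: str,
    cheap_model: Callable[[str], str],
    expensive_model: Callable[[str], str],
    verifier: Callable[[str, str], float],
    threshold: float = 0.7,
) -> CascadeResult:
    """Try the cheap model first; ship its answer only if the verifier accepts it."""
    candidate = cheap_model(query)
    if verifier(query, candidate) >= threshold:
        return CascadeResult(candidate, "cheap", escalated=False)
    return CascadeResult(expensive_model(query), "expensive", escalated=True)


def json_schema_verifier(query: str, candidate: str) -> float:
    """Toy verifier: 1.0 if the candidate is a JSON object with an "answer" key."""
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(parsed, dict) and "answer" in parsed else 0.0


def cheap_stub(query: str) -> str:
    # Stand-in for a small model that breaks its output format on long inputs.
    return '{"answer": "ok"}' if len(query) < 50 else '{"answer": "partial'


def expensive_stub(query: str) -> str:
    return '{"answer": "detailed"}'


short = cascade("What is 2 + 2?", cheap_stub, expensive_stub, json_schema_verifier)
long_ = cascade("x" * 200, cheap_stub, expensive_stub, json_schema_verifier)
print(short.route, short.escalated)
print(long_.route, long_.escalated)
```

The `escalated` flag is worth logging on every response, since it is exactly the per-route escalation signal the control loop needs. The trade-off is that every escalated query pays for both models, so the verifier's false-reject rate becomes a cost line of its own.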
The deeper takeaway is humility about what the router is actually doing. It is making a prediction about an outcome it cannot observe, using features that don't fully determine that outcome, in a system where the wrong answer doesn't fail loudly. Every part of that sentence is uncomfortable, and every part is true. The teams that ship routers well are the ones that internalize this and design downstream — escalations, judges, replays, deliberate misrouting — to compensate. The teams that ship routers badly are the ones who report classifier accuracy and call it a day.
A router is a load balancer that cannot see the load. The fix is not a better classifier. It is plumbing — instrumentation that lets you see the load after the fact, and a control loop that adjusts based on what you saw, instead of what you guessed.
