Model Routing Is a System Design Problem, Not a Config Option
Most teams choose their LLM the way they choose a database engine: once, during architecture review, and never again. You pick GPT-4o or Claude 3.5 Sonnet, bake it into your config, and ship. The choice feels irreversible because changing it requires a redeployment, coordination across services, and regression testing against whatever your evals look like this week.
That framing is a mistake. Your traffic is not homogeneous. A "summarize this document" request and a "debug this cryptic stack trace" request hitting the same endpoint at the same time have radically different capability requirements — but with static model selection, they're indistinguishable from your infrastructure's perspective. You're either over-provisioning one or under-serving the other, and you're doing it on every single request.
Model routing treats LLM selection as a runtime dispatch decision. Every incoming query gets evaluated on signals that predict the right model for that specific request, and the call is dispatched accordingly. The routing layer doesn't exist in your config file — it runs in your request path.
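As a sketch, that runtime dispatch decision can be as small as one function in the request path. Model names, signal fields, and thresholds below are illustrative placeholders, not recommendations:

```python
# Minimal sketch of routing as runtime dispatch rather than static config.
# "small-model"/"large-model" and every threshold here are invented for illustration.

def route(request: dict) -> str:
    """Pick a model per request from cheap-to-compute signals."""
    if request.get("tier") == "background":
        return "small-model"        # batch work tolerates weaker output
    if request.get("task") in {"code", "math"}:
        return "large-model"        # capability-sensitive tasks
    if request.get("token_count", 0) > 2000:
        return "large-model"        # long context as a rough complexity proxy
    return "small-model"
```

Even this crude version makes the key shift: the model choice is recomputed per request, not read from a config file.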
The Signals That Actually Drive Good Routing
The appeal of simplicity pushes teams toward single-signal routing: "route by token count," or "route by whether the prompt contains the word 'code.'" These work poorly in practice because individual signals are weak proxies for what you actually care about: does this request need a capable model, or will a cheaper one do?
Effective routers combine several signal types:
Input complexity, not just input length. Token count is a fast signal and a cost proxy, but two 200-token prompts can have entirely different reasoning demands. A question about common geography is easy at 50 tokens. A nuanced legal interpretation question is hard at 500 tokens. Complexity classifiers — lightweight models trained to predict query difficulty — consistently outperform raw length as a routing signal.
Task classification. Different task types have different model sensitivity curves. Code generation, math reasoning, and structured extraction are highly sensitive to model capability: small models fail in ways that are immediately visible and hard to recover from. Document summarization, translation, and classification are far less sensitive — the quality delta between a 7B and 70B model is often imperceptible to end users. Knowing which category a request falls into is worth more than any other single signal.
Latency SLA by user tier. An interactive chat session has a very different acceptable time-to-first-token than a batch processing job running overnight. User tier (free vs. paid, interactive vs. background) is a first-class routing input, not an afterthought. Building tier awareness into your router lets you give large-model capacity to paying users who will notice the difference and tolerate cost accordingly.
Confidence escalation. Some routers run the cheap model first and check a self-reported confidence score before deciding whether to escalate. When confidence is high, the cheap answer ships. When it's low, the request routes to a stronger model. This requires your cheap model to be calibrated — overconfident small models make this approach unreliable — but when it works, it achieves the best of both worlds.
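A minimal sketch of the confidence-escalation pattern, with stub functions standing in for real model calls and an invented 0.8 threshold:

```python
# Confidence-escalation sketch: try the cheap model, escalate on low confidence.
# call_cheap and call_strong are stubs; a real cheap model would need calibration
# for its self-reported confidence to be trustworthy.

def call_cheap(prompt: str):
    """Stub: returns (answer, self-reported confidence)."""
    return ("cheap answer", 0.9 if len(prompt) < 100 else 0.4)

def call_strong(prompt: str) -> str:
    return "strong answer"

def answer(prompt: str, threshold: float = 0.8):
    text, confidence = call_cheap(prompt)
    if confidence >= threshold:
        return text, "cheap"
    # Low confidence: pay for the stronger model. Note this costs both calls,
    # which only pays off if escalation stays rare.
    return call_strong(prompt), "strong"
```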
The research on multi-signal routing is consistent: combining task classification, complexity estimation, and cost signals outperforms any single-signal approach by a wide margin. RouteLLM, a widely benchmarked open-source framework, reports cost reductions of up to 85% on conversational benchmarks by learning routing preferences from human comparison data over these signals jointly.
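A hand-tuned caricature of multi-signal routing makes the idea concrete: combine a complexity estimate, task class, and user tier into one score. The weights and thresholds here are invented; frameworks like RouteLLM learn the equivalent from preference data instead.

```python
# Multi-signal routing sketch. All weights and thresholds are illustrative;
# a real router would learn them from labeled or preference data.

CAPABILITY_SENSITIVE = {"code", "math", "extraction"}

def complexity_score(prompt: str) -> float:
    """Stand-in for a lightweight learned classifier (0 = trivial, 1 = hard)."""
    return min(len(prompt.split()) / 500, 1.0)

def route(prompt: str, task: str, tier: str) -> str:
    score = complexity_score(prompt)
    if task in CAPABILITY_SENSITIVE:
        score += 0.4      # small models fail visibly on these
    if tier == "paid_interactive":
        score += 0.2      # users who will notice the quality difference
    return "large-model" if score >= 0.5 else "small-model"
```

Notice that no single signal decides the route: a short code prompt from a free user stays cheap, but the same prompt from a paid interactive session crosses the threshold.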
Why Naive Fallback Logic Breaks Under Traffic
The simplest mental model for routing is: "try the cheap model, fall back to the expensive one if it fails." Teams implement this first because it's obvious and requires no training data. It breaks in production for reasons that aren't obvious until you're in an incident.
Retriable and non-retriable errors are not the same. A 400 Bad Request means your prompt is malformed — retrying it on a different model will fail identically. A 503 means the provider is overloaded — retrying on a different provider is exactly right. A 429 rate-limit needs exponential backoff, not an immediate fallback that will hit the same limit on the backup provider in seconds. Naive fallback logic collapses these into a single code path, which amplifies problems under load precisely when you can least afford it.
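A sketch of error-aware fallback logic that keeps those paths separate. The status codes follow common HTTP conventions; the backoff parameters are illustrative:

```python
# Classify the failure before deciding what to do with it. Collapsing these
# into one retry path is the naive-fallback failure mode described above.

def failure_action(status: int) -> str:
    if status in (400, 401, 403, 404):
        return "fail"        # client-side error: retrying elsewhere fails identically
    if status == 429:
        return "backoff"     # rate-limited: exponential backoff, not instant fallback
    if status in (500, 502, 503, 504):
        return "fallback"    # provider-side: a different provider is exactly right
    return "fail"            # unknown: fail loudly rather than retry blindly

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff with a ceiling; add jitter in a real system."""
    return min(base * (2 ** attempt), cap)
```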
Streaming failures don't compose. If a provider starts streaming a response and fails mid-stream, your client has already received partial output. You cannot transparently switch providers mid-stream without client-side buffering that defeats the latency benefits of streaming in the first place. Static fallback chains don't account for this; they assume requests are atomic.
Cost distributions are heavy-tailed. Most requests consume modest tokens. A small fraction — long document ingestion, multi-turn conversations with accumulated context, complex reasoning chains — consumes the majority of your token budget. Routing every request through "small model first" optimizes for the median case while barely touching the tail, which is where most of your money goes. Effective routers concentrate their decision effort on high-cost requests and fast-path the cheap ones.
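A toy illustration of why the tail dominates, with made-up token counts:

```python
# Heavy-tail illustration: 95 modest requests plus 5 long-context ones.
# The numbers are invented, but the shape matches typical LLM traffic.

token_counts = [300] * 95 + [40_000] * 5

total = sum(token_counts)
tail = sum(sorted(token_counts)[-5:])   # the top 5% of requests
tail_share = tail / total               # ~88% of spend in 5% of requests
```

In this toy distribution, getting the routing decision right on 5 requests matters more than getting it right on the other 95 combined.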
Quality-cost tradeoffs create feedback loops. Routing too aggressively toward cheap models degrades output quality. Degraded quality increases retry rates, user confusion, and support volume. Support volume and churn are costs too, just in a different budget. The teams that report routing "saved 60%" often aren't accounting for what shifted to their CX and growth numbers.
Shadow Routing: Validating Your Router Without User Impact
Deploying a new routing policy directly to production is high-risk: if your router is wrong, you've degraded quality for real users before you've measured the degradation. Shadow routing is the standard technique for validating a router before committing.
The basic setup: your new routing policy runs in parallel with your production policy on the same traffic, but its decisions are logged, not executed. Every request is dispatched by the production router while the shadow router records what it would have done differently. After collecting enough traffic — typically 5-7 days to reach statistical confidence — you compare the shadow router's decisions against observed quality outcomes.
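The core of the setup fits in a few lines; `production_route` and `shadow_route` below are stand-ins for your two policies:

```python
# Shadow routing sketch: the production policy executes, the candidate
# policy only logs what it would have done.

shadow_log = []

def production_route(request: dict) -> str:
    return "large-model"    # current conservative policy (stub)

def shadow_route(request: dict) -> str:
    return "small-model" if request["tokens"] < 1000 else "large-model"

def handle(request: dict) -> str:
    chosen = production_route(request)
    would_have = shadow_route(request)   # computed, never dispatched
    shadow_log.append({
        "request_id": request["id"],
        "production": chosen,
        "shadow": would_have,
        "diverged": chosen != would_have,
    })
    return chosen                        # only the production decision executes
```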
This gives you several measurable quantities before deployment:
- Routing divergence rate: how often the new policy disagrees with production
- Predicted cost delta: if the shadow router routes more aggressively to cheap models, what's the projected savings given actual token consumption
- Quality signal distribution: for requests where the shadow router would have used a cheaper model, what was the quality signal on the actual (expensive) response? If quality was consistently high, cheaper is probably fine.
Shadow routing doesn't tell you what the cheaper model would have actually produced — that requires running it and evaluating the output. But it narrows the experiment to high-disagreement cases and gives you a prioritized list of where to run offline evaluations before touching live traffic.
The same principle applies to routing policy updates, not just new policies. Treating a routing rule change with the same rigor as a code deploy is the right instinct: route changes are risky in proportion to how much traffic they affect.
The Cost Accounting That Most Teams Skip
Organizations that implement routing frequently report large cost savings: "40% reduction," "85% cheaper." These numbers are real but often incomplete, and the gap between reported savings and actual P&L impact is instructive.
Router overhead has real cost. A routing decision requires compute — either a lightweight classifier, a similarity lookup, or in some implementations, a separate LLM call. At high request volumes, router overhead is a non-negligible line item. Fast similarity-based routers (milliseconds, cheap) are appropriate for the common path. LLM-based routers that make the routing decision itself via an API call can cost more than the routing savings they generate for short, cheap requests.
Escalation rate is a multiplier. Every request the router misclassifies — cheap model selected, quality insufficient, escalation to expensive model — costs twice: the cheap model call plus the expensive model call. High escalation rates turn routing from a cost-reduction strategy into a cost-amplification strategy. The escalation rate, not the cost-per-request, is the number to monitor obsessively.
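The arithmetic is worth writing down. With illustrative per-request costs, a misroute pays for both calls:

```python
# Expected-cost arithmetic for escalation. Per-request costs are invented.

CHEAP, EXPENSIVE = 0.002, 0.020

def expected_cost(escalation_rate: float) -> float:
    # Every request hits the cheap model; escalated ones also hit the expensive one.
    return CHEAP + escalation_rate * EXPENSIVE
```

In this example, routing breaks even with "always use the expensive model" at an escalation rate of (EXPENSIVE − CHEAP) / EXPENSIVE = 90%; beyond that, the router is actively amplifying cost. Real break-even points are far lower once router overhead is included.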
Measuring at 70-80% of theoretical savings is the right target. There's a consistent pattern in production routing deployments: the first 70% of theoretical cost savings are achievable with modest routing complexity. The next 15% require increasingly aggressive routing thresholds that push quality toward an unacceptable edge. The last 15% require solving problems that cost more in engineering time than they save in inference costs. Chasing 95% of theoretical savings is a trap. Target 70-80% and move on.
Total cost of ownership includes downstream. Low-quality outputs at scale increase retry rates (users try again), support volume (they escalate), and churn (they leave). A routing policy that trims $25K/month off your inference bill while generating more than that in support and churn costs is a net loss. You need cross-functional visibility to measure this accurately.
Designing a Router That Survives Production
Given the above, a production-ready routing architecture has several properties:
Multi-level fallback chains, not binary fallback. Rather than cheap → expensive, build primary → secondary → tertiary with different cost/capability profiles at each level. This distributes load across providers, reduces single points of failure, and lets you degrade gracefully when one level is unavailable.
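A sketch of walking such a chain, with stub provider callables standing in for real clients:

```python
# Multi-level fallback sketch: walk an ordered chain of providers instead of
# a binary cheap/expensive pair. Provider callables are stubs.

class ProviderDown(Exception):
    pass

def try_chain(chain, prompt):
    last_error = None
    for name, call in chain:
        try:
            return name, call(prompt)
        except ProviderDown as exc:
            last_error = exc      # degrade to the next level and keep going
    raise RuntimeError("all providers exhausted") from last_error

def down(prompt):
    raise ProviderDown("503")

chain = [("primary", down), ("secondary", down), ("tertiary", lambda p: "ok")]
```

A real implementation would also apply the error classification from the fallback section above, so a 400 fails fast instead of walking the whole chain.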
Per-tier budgets with kill-switches. Interactive tiers get hard latency budgets — if the model hasn't responded in N seconds, escalate rather than waiting. Background tiers get hard cost budgets — if token consumption exceeds threshold per request, truncate or reject. These budgets prevent the failure mode where a batch queue saturates and interactive requests time out waiting for capacity.
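A minimal sketch of those per-tier kill-switches; the tier names and budget numbers are invented:

```python
# Per-tier budget sketch: interactive requests carry a latency budget,
# background requests a token budget. All numbers are illustrative.

BUDGETS = {
    "interactive": {"max_latency_s": 3.0},
    "background": {"max_tokens": 50_000},
}

def admit(tier: str, est_tokens: int) -> bool:
    """Cost-side kill-switch: reject work that would blow the token budget."""
    return est_tokens <= BUDGETS[tier].get("max_tokens", float("inf"))

def should_escalate(tier: str, elapsed_s: float) -> bool:
    """Latency-side kill-switch: escalate instead of waiting indefinitely."""
    return elapsed_s > BUDGETS[tier].get("max_latency_s", float("inf"))
```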
Feedback loops to retrain the router. Routers trained on static benchmark data (like Chatbot Arena preference pairs) generalize to production, but they drift as your traffic distribution changes. Implement lightweight collection of user acceptance signals — thumbs up/down, retry rates, session abandonment — and retrain your routing classifiers periodically. The router that's well-calibrated at launch will degrade over months without it.
Cost attribution per request, not per billing cycle. You need to know the routing decision, the model used, the token count, and the cost for every request. Aggregate billing dashboards tell you how much you spent; per-request attribution tells you why and which routes to adjust. Build this instrumentation before you tune thresholds, not after.
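A sketch of the per-request record, emitted at dispatch time; field names and prices are illustrative:

```python
# Per-request cost attribution sketch. Model names and per-1K-token prices
# are invented for illustration.

from dataclasses import dataclass, asdict

PRICE_PER_1K = {"small-model": 0.0002, "large-model": 0.005}

@dataclass
class RouteRecord:
    request_id: str
    route_reason: str     # which signal drove the routing decision
    model: str
    tokens: int

    def cost_usd(self) -> float:
        return self.tokens / 1000 * PRICE_PER_1K[self.model]

rec = RouteRecord("req-42", "task=code", "large-model", 1200)
```

The `route_reason` field is the one teams most often omit and most often need: it's what lets you group spend by routing signal, not just by model.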
Avoid routing latency in the hot path. The fastest routers add 5-20ms per request; slower classifiers or LLM-based routers add 100ms+. For interactive use cases where time-to-first-token is the primary UX metric, a 100ms routing tax is often larger than the latency difference between the cheap and expensive model. Use fast classifiers (BERT-scale, not GPT-scale) for the common case and reserve heavier routing logic for requests where the stakes justify it.
Where Routing Earns Its Complexity Cost
Model routing is real infrastructure with real operational burden. The configuration surface is large, the failure modes are non-obvious, and the cost accounting requires cross-functional alignment that most teams don't have natively. It earns its complexity when:
- Your traffic is diverse: a mix of simple and complex queries across a wide task distribution
- Your scale is meaningful: at low volumes, routing overhead often exceeds savings
- Your user tier structure maps onto quality requirements: not every request needs the same model
- You have eval infrastructure: routing decisions need to be validated against quality signals, and you need the measurement apparatus to do that before deploying
If all your queries are similar in complexity, or your volume is low, or you don't have the eval infrastructure to validate routing quality — static model selection is probably fine. Routing is an optimization, not a default. The teams that build it prematurely end up maintaining complex infrastructure that doesn't move the numbers they care about.
But if those conditions are met — and at meaningful scale, they usually are — treating model selection as a static config is leaving real money on the table, every request, continuously. The routing layer pays for itself.
- https://arxiv.org/pdf/2406.18665
- https://arxiv.org/abs/2410.10347
- https://arxiv.org/abs/2403.12031
- https://www.lmsys.org/blog/2024-07-01-routellm/
- https://www.merge.dev/blog/llm-routing
- https://blog.logrocket.com/llm-routing-right-model-for-requests/
- https://dev.to/debmckinney/routing-load-balancing-and-failover-in-llm-systems-pn3
- https://portkey.ai/blog/failover-routing-strategies-for-llms-in-production/
- https://mlechner.substack.com/p/the-economics-of-llm-inference-batch
- https://arxiv.org/html/2410.13284v2
