Model Routing Is a System Design Problem, Not a Config Option

11 min read
Tian Pan
Software Engineer

Most teams choose their LLM the way they choose a database engine: once, during architecture review, and never again. You pick GPT-4o or Claude 3.5 Sonnet, bake it into your config, and ship. The choice feels irreversible because changing it requires a redeployment, coordination across services, and regression testing against whatever your evals look like this week.

That framing is a mistake. Your traffic is not homogeneous. A "summarize this document" request and a "debug this cryptic stack trace" request hitting the same endpoint at the same time have radically different capability requirements — but with static model selection, they're indistinguishable from your infrastructure's perspective. You're either over-provisioning one or under-serving the other, and you're doing it on every single request.

Model routing treats LLM selection as a runtime dispatch decision. Every incoming query gets evaluated on signals that predict the right model for that specific request, and the call is dispatched accordingly. The routing layer doesn't exist in your config file — it runs in your request path.

The Signals That Actually Drive Good Routing

The appeal of simplicity pushes teams toward single-signal routing: "route by token count," or "route by whether the prompt contains the word 'code.'" These work poorly in practice because individual signals are weak proxies for what you actually care about: does this request need a capable model, or will a cheaper one do?

Effective routers combine several signal types:

Input complexity, not just input length. Token count is a fast signal and a cost proxy, but two 200-token prompts can have entirely different reasoning demands. A question about common geography is easy at 50 tokens. A nuanced legal interpretation question is hard at 500 tokens. Complexity classifiers — lightweight models trained to predict query difficulty — consistently outperform raw length as a routing signal.
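
To make that concrete, here is a minimal sketch of complexity-gated dispatch. The lexical heuristic below is a stand-in for a real trained difficulty classifier, and the model names, markers, and threshold are all illustrative placeholders, not recommendations.

```python
CHEAP_MODEL = "small-7b"          # placeholder model names
STRONG_MODEL = "frontier-large"

def score_complexity(prompt: str) -> float:
    """Stand-in for a trained difficulty classifier (e.g., a distilled encoder)."""
    hard_markers = ("prove", "debug", "interpret", "step by step", "why does")
    length_signal = min(len(prompt) / 2000, 0.5)   # length contributes, but is capped
    marker_hits = sum(m in prompt.lower() for m in hard_markers)
    return min(length_signal + 0.25 * marker_hits, 1.0)

def route_by_complexity(prompt: str, threshold: float = 0.5) -> str:
    return STRONG_MODEL if score_complexity(prompt) >= threshold else CHEAP_MODEL
```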

Task classification. Different task types have different model sensitivity curves. Code generation, math reasoning, and structured extraction are highly sensitive to model capability: small models fail in ways that are immediately visible and hard to recover from. Document summarization, translation, and classification are far less sensitive — the quality delta between a 7B and 70B model is often imperceptible to end users. Knowing which category a request falls into is worth more than any other single signal.
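
As a sketch, the task signal can be as simple as a lookup against sensitivity buckets. The groupings below mirror the categories above, but where you draw the line should come from your own evals, not from this table.

```python
# Capability-sensitive vs. capability-tolerant task buckets (illustrative).
CAPABILITY_SENSITIVE = {"code_generation", "math_reasoning", "structured_extraction"}
CAPABILITY_TOLERANT = {"summarization", "translation", "classification"}

def route_by_task(task: str) -> str:
    # `task` comes from an upstream task classifier; model names are placeholders.
    return "frontier-large" if task in CAPABILITY_SENSITIVE else "small-7b"
```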

Latency SLA by user tier. An interactive chat session has a very different acceptable time-to-first-token than a batch processing job running overnight. User tier (free vs. paid, interactive vs. background) is a first-class routing input, not an afterthought. Building tier awareness into your router lets you reserve large-model capacity for the paying users who will notice the difference, and whose revenue justifies the cost.

Confidence escalation. Some routers run the cheap model first and check a self-reported confidence score before deciding whether to escalate. When confidence is high, the cheap answer ships. When it's low, the request routes to a stronger model. This requires your cheap model to be calibrated — overconfident small models make this approach unreliable — but when it works, it achieves the best of both worlds.
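
Here is a sketch of that cascade pattern, assuming a hypothetical model interface that returns an answer together with a self-reported confidence score:

```python
from typing import Callable, Tuple

# Hypothetical interface: a model call returns (answer, self-reported confidence).
Model = Callable[[str], Tuple[str, float]]

def cascade(prompt: str, cheap: Model, strong: Model, min_confidence: float = 0.8) -> str:
    answer, confidence = cheap(prompt)
    if confidence >= min_confidence:
        return answer                      # confident cheap answer ships as-is
    escalated_answer, _ = strong(prompt)   # low confidence: pay for the strong model
    return escalated_answer
```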

The research on multi-signal routing is consistent: combining task classification, complexity estimation, and cost signals outperforms any single-signal approach by a wide margin. RouteLLM, the most widely benchmarked open-source routing framework, achieves around 85% cost reduction on conversational benchmarks by learning routing preferences from human comparison data rather than relying on any single hand-picked signal.
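
To make the combination concrete, here is a hand-tuned sketch that merges the signals above into one dispatch decision. RouteLLM learns the equivalent decision boundary from preference data; the thresholds, tiers, and model names here are illustrative stand-ins.

```python
from dataclasses import dataclass

@dataclass
class RoutingInput:
    prompt: str
    task: str           # from a task classifier
    complexity: float   # from a difficulty classifier, in [0, 1]
    tier: str           # "free" or "paid"

CAPABILITY_SENSITIVE = {"code_generation", "math_reasoning", "structured_extraction"}

def route(req: RoutingInput) -> str:
    # Hand-tuned rules as a stand-in for a learned policy.
    if req.task in CAPABILITY_SENSITIVE or req.complexity > 0.7:
        return "frontier-large"
    if req.tier == "paid" and req.complexity > 0.4:
        return "frontier-large"   # paying users get the benefit of the doubt
    return "small-7b"
```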

Why Naive Fallback Logic Breaks Under Traffic

The simplest mental model for routing is: "try the cheap model, fall back to the expensive one if it fails." Teams implement this first because it's obvious and requires no training data. It breaks in production for reasons that aren't obvious until you're in an incident.

Retriable and non-retriable errors are not the same. A 400 Bad Request means your prompt is malformed — retrying it on a different model will fail identically. A 503 means the provider is overloaded — retrying on a different provider is exactly right. A 429 rate-limit needs exponential backoff, not an immediate fallback that will hit the same limit on the backup provider in seconds. Naive fallback logic collapses these into a single code path, which amplifies problems under load precisely when you can least afford it.
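
A sketch of that distinction in code, mapping HTTP status codes to distinct recovery actions instead of one catch-all fallback path. The mapping is illustrative; real providers also signal errors inside response bodies.

```python
from enum import Enum

class Recovery(Enum):
    FAIL_FAST = "fail_fast"   # malformed request: a retry fails identically
    FALLBACK = "fallback"     # provider-side failure: another provider may succeed
    BACKOFF = "backoff"       # rate limit: wait; don't immediately hit the backup

def classify_error(status: int) -> Recovery:
    if status == 429:
        return Recovery.BACKOFF
    if status in (500, 502, 503, 504):
        return Recovery.FALLBACK
    return Recovery.FAIL_FAST  # 400s and anything unrecognized: surface, don't retry
```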

Streaming failures don't compose. If a provider starts streaming a response and fails mid-stream, your client has already received partial output. You cannot transparently switch providers mid-stream without client-side buffering that defeats the latency benefits of streaming in the first place. Static fallback chains don't account for this; they assume requests are atomic.
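
One way to make this concrete: fallback is only safe before the first token has been emitted. The provider interface below is hypothetical; the point is the `emitted` flag.

```python
from typing import AsyncIterator, Protocol

class ProviderError(Exception):
    """Stand-in for any transport or provider failure."""

class Provider(Protocol):
    def stream(self, prompt: str) -> AsyncIterator[str]: ...

async def stream_with_fallback(prompt: str, primary: Provider,
                               backup: Provider) -> AsyncIterator[str]:
    emitted = False
    try:
        async for chunk in primary.stream(prompt):
            emitted = True   # once any output reaches the client, we're committed
            yield chunk
    except ProviderError:
        if emitted:
            raise            # partial output already sent; no transparent switch
        async for chunk in backup.stream(prompt):
            yield chunk
```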

Cost distributions are heavy-tailed. Most requests consume a modest number of tokens. A small fraction — long document ingestion, multi-turn conversations with accumulated context, complex reasoning chains — consumes the majority of your token budget. Routing every request through "small model first" optimizes for the median case while barely touching the tail, which is where most of your money goes. Effective routers spend their decision-making effort where the money is: a careful routing decision for the expensive tail, fast static rules for the cheap median.
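
A sketch of that effort-proportional idea, with an illustrative cutoff:

```python
def choose_routing_strategy(estimated_tokens: int) -> str:
    # A careful routing decision has its own latency and cost; it only pays
    # for itself on the heavy tail. The 2,000-token cutoff is a placeholder.
    if estimated_tokens < 2_000:
        return "static_rule"      # median traffic: cheap decision, cheap model
    return "learned_router"      # heavy tail: run the full multi-signal router
```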

Quality-cost tradeoffs create feedback loops. Routing too aggressively toward cheap models degrades output quality. Degraded quality increases retry rates, user confusion, and support volume. Support volume and churn are costs too, just in a different budget. The teams that report routing "saved 60%" often aren't accounting for what shifted to their CX and growth numbers.

Shadow Routing: Validating Your Router Without User Impact

Deploying a new routing policy directly to production is high-risk: if your router is wrong, you've degraded quality for real users before you've measured the degradation. Shadow routing is the standard technique for validating a router before committing.

The basic setup: your new routing policy runs in parallel with your production policy on the same traffic, but its decisions are logged, not executed. Every request is dispatched by the production router while the shadow router records what it would have done differently. After collecting enough traffic — typically 5-7 days to reach statistical confidence — you compare the shadow router's decisions against observed quality outcomes.
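
A minimal sketch of the logging side, assuming router objects with a `choose()` method and a `dispatch()` helper; none of these names come from a specific framework.

```python
import json
import time

def handle_request(request: dict, prod_router, shadow_router, dispatch,
                   log_path: str = "shadow_decisions.jsonl"):
    prod_model = prod_router.choose(request)      # executed
    shadow_model = shadow_router.choose(request)  # logged only, never executed
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "ts": time.time(),
            "request_id": request.get("id"),
            "prod": prod_model,
            "shadow": shadow_model,
            "agree": prod_model == shadow_model,
        }) + "\n")
    return dispatch(prod_model, request)
```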

This gives you several measurable quantities before deployment:

Disagreement rate. The fraction of traffic where the shadow policy would have chosen a different model than production. Near-zero disagreement means the new policy isn't worth deploying; very high disagreement means you should audit the diverging decisions before trusting it.

Projected cost delta. With logged token counts and per-model pricing, you can estimate what the shadow decisions would have cost without serving a single request from the new policy.

Quality on disagreements. The requests where the two policies diverge are the only ones that carry information; sampling them for offline evaluation tells you whether the shadow router's different choices would have held up.