Model Routing in Production: When the Router Costs More Than It Saves
A team at a mid-size SaaS company deployed a model router six months ago with a clear goal: stop paying frontier-model prices for the 70% of queries that are simple lookups and reformatting tasks. They ran it for three months before someone did the math. Total inference cost had gone up by 12%.
The router itself was cheap — a lightweight classifier adding about 2ms of overhead per request. But the classifier's decision boundary was miscalibrated. It escalated 60% of queries to the expensive model, not 30%. The 40% it handled locally had worse quality, which increased user retry rates, which increased total request volume. The router's telemetry showed "routing working correctly" because it was routing — it just wasn't routing well.
This failure pattern is more common than the success stories suggest. Here's how to build routing that actually saves money.
The Routing Tax You're Already Paying
Every model routing system has three cost layers, and most teams only think about the first one.
Layer 1: The router itself. A lightweight classifier adds milliseconds and a small compute cost per request. At scale, this is genuinely negligible — well-implemented rule-based routers add under 5ms of latency, and even ML-based routers typically add under 20ms. This layer almost never breaks your economics.
Layer 2: Classifier errors. Every time the router sends a complex query to the cheap model, you either get a bad output (quality cost) or trigger a fallback to the expensive model (double cost). Every time it sends a simple query to the expensive model, you overpay by the price delta. The ratio of these errors determines whether routing helps or hurts. Most teams measure neither.
Layer 3: System effects. A router that degrades output quality increases user retry rates, support tickets, and downstream failures — costs that don't show up in your inference bill but do show up in your business metrics. A router optimizing for raw token cost while ignoring quality can be net-negative even when its per-token math looks good.
The only routing strategy that works is one where you instrument all three layers simultaneously.
The Decision Framework: Four Routes, Not Two
The "cheap model vs. expensive model" framing is the first thing to discard. A real production routing policy has at least four paths:
Direct to small model. Tasks with narrow input domains where quality can be validated structurally — format conversion, classification with a fixed label set, extraction from templated inputs. You don't need confidence scores here; you need schema validation on the output. If the output passes validation, the task was correctly handled. If it fails, you already have a fallback trigger.
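A minimal sketch of this path, assuming a JSON output contract; `call_small` and `call_large` are hypothetical stand-ins for your model clients, and the `label`/`source_span` schema is an assumed example:

```python
import json

# Assumed output schema for an extraction task; yours will differ.
REQUIRED_KEYS = {"label", "source_span"}

def validate_output(raw):
    """Parse and schema-check a model response; return None on any failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS <= data.keys():
        return None
    return data

def route_structural(query, call_small, call_large):
    """Try the small model first; a schema failure is the fallback trigger."""
    result = validate_output(call_small(query))
    if result is not None:
        return result
    # Escalate only on validation failure -- no confidence score needed.
    return validate_output(call_large(query))
```

Schema validation doubles as both the quality gate and the escalation signal, which is why this path needs no calibrated confidence threshold.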
Confidence-threshold cascade. Tasks where the small model might be adequate but you can't know ahead of time. The small model runs first and produces a confidence signal. If confidence exceeds your threshold, return the result. If not, escalate to the large model. This is the pattern most teams implement — but the confidence threshold calibration is where they fail.
Setting the threshold too low (escalating only when the small model is deeply uncertain) leaves a lot of low-confidence responses in production that users experience as degraded quality. Setting it too high (escalating at the first sign of uncertainty) means you're running both models on most requests while only returning the expensive result — worse than skipping routing entirely.
The correct calibration is specific to your task domain and your quality tolerance. It requires labeled evaluation data and iterative threshold tuning. If you haven't done this work, the threshold is probably wrong.
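One way to do that tuning, sketched under assumed per-request prices and a labeled eval set of `(confidence, was_correct)` pairs — all three cost constants are hypothetical:

```python
# Hypothetical per-request costs: small model, large model, and the
# business cost of shipping a wrong small-model answer.
SMALL_COST, LARGE_COST, ERROR_COST = 0.001, 0.01, 0.05

def expected_cost(records, threshold):
    """Average cost per request if we escalate whenever confidence < threshold."""
    total = 0.0
    for conf, correct in records:
        if conf >= threshold:            # keep the small model's answer
            total += SMALL_COST + (0.0 if correct else ERROR_COST)
        else:                            # escalate: pay for both models
            total += SMALL_COST + LARGE_COST
    return total / len(records)

def tune_threshold(records, candidates):
    """Return the candidate threshold with the lowest expected cost."""
    return min(candidates, key=lambda t: expected_cost(records, t))
```

The point of the sketch is that the threshold falls out of your labeled data and your error costs — it is not a number you can guess.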
Latency-budget dispatch. For interactive user-facing applications with tight latency SLOs, routing decisions need to incorporate the current queue depth and expected wait time on each model endpoint — not just the per-token cost. A cheap model with a 3-second queue is often worse than an expensive model with a 200ms queue. Production routing systems that ignore current infrastructure state will make systematically wrong decisions during peak load, which is precisely when the decisions matter most.
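A sketch of queue-aware dispatch; the `Endpoint` fields stand in for live metrics your infrastructure would report, and the wait model (service time times queue depth) is a deliberate simplification:

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    cost_per_request: float
    service_time_ms: float   # expected per-request processing time
    queue_depth: int         # requests currently waiting

    def expected_wait_ms(self):
        # Naive M/D/1-style estimate: everything ahead of us, plus us.
        return self.service_time_ms * (self.queue_depth + 1)

def dispatch(endpoints, slo_ms):
    """Cheapest endpoint that meets the SLO; least-loaded one if none do."""
    in_budget = [e for e in endpoints if e.expected_wait_ms() <= slo_ms]
    if in_budget:
        return min(in_budget, key=lambda e: e.cost_per_request)
    return min(endpoints, key=lambda e: e.expected_wait_ms())
```

Note that the decision flips under load: the cheap endpoint wins when idle and loses as its queue grows, which is exactly the behavior a cost-only router cannot produce.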
Skip routing entirely. For tasks where the input domain is genuinely uniform and you've already profiled the cheapest model that handles the domain reliably, routing adds complexity without benefit. Pick the right model once at design time and don't pay the per-request classification overhead.
Complexity Proxies: What Actually Works
The "classify by complexity" framing sounds straightforward. In practice, measuring complexity before running the model is a subtle problem.
Token count is a weak proxy. Long inputs aren't necessarily hard — a 2,000-token document retrieval task is far simpler than a 50-token multi-step reasoning question. Teams that route purely on input length end up sending easy-but-verbose queries to expensive models while confidently misrouting short-but-complex ones.
Intent signals are stronger. The task type — summarization, extraction, code generation, open-ended reasoning — is a better complexity predictor than raw size. A router built on a task-type classifier tends to make better decisions than one using heuristics about prompt length. The tradeoff: building a good task-type classifier requires labeled examples from your actual traffic, which take time to collect.
Structural features beat semantic ones for cold starts. When you don't have labeled data, structural signals — number of distinct instructions in the prompt, presence of nested conditions, count of tool calls required — give you a working baseline. They're not optimal, but they're better than token count alone and require no labeled data.
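A cold-start baseline along these lines might look like the following; the regex patterns and the two-feature decision rule are illustrative assumptions, not a tuned feature set:

```python
import re

def structural_features(prompt):
    """Cheap structural signals computed without any labeled data."""
    sentences = [s for s in re.split(r"[.!?]\s+", prompt) if s.strip()]
    instructions = [
        s for s in sentences
        if re.match(r"\s*(please\s+)?(list|write|explain|summarize|extract|convert|compare)\b",
                    s, re.IGNORECASE)
    ]
    return {
        "token_estimate": len(prompt.split()),
        "instruction_count": len(instructions),
        "has_conditionals": bool(re.search(r"\b(if|unless|when|otherwise)\b",
                                           prompt, re.IGNORECASE)),
        "question_count": prompt.count("?"),
    }

def looks_complex(prompt):
    # Hypothetical baseline rule: multiple instructions, or a condition
    # combined with a question, suggests multi-step work.
    f = structural_features(prompt)
    return f["instruction_count"] >= 2 or (f["has_conditionals"] and f["question_count"] >= 1)
```

This is exactly the kind of rule you replace once outcome-labeled data starts accumulating, per the next point.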
Calibrate against outcomes, not inputs. The only ground truth for a complexity proxy is whether the small model actually succeeded. If you have any quality signal at all (user ratings, downstream validation, eval suite scores), feed it back into your classifier training. Routers trained on input features alone diverge from ground truth over time as your traffic distribution shifts.
The Cases Where Routing Hurts
Beyond the calibration failures, there are structural situations where routing makes economics worse:
Low-volume applications below the routing break-even point. At small scale, the engineering cost of building and maintaining a routing system — the classifier training, threshold tuning, fallback logic, monitoring — exceeds the lifetime inference cost savings. The break-even for most teams is somewhere in the range of 5–10 million tokens per month. Below that, just pick the right model statically.
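The break-even arithmetic is worth doing explicitly before any engineering starts. A sketch with deliberately hypothetical per-token prices and an assumed one-time engineering cost:

```python
def monthly_savings(tokens_per_month, frac_routable, price_large, price_small):
    """Gross monthly savings if `frac_routable` of tokens move to the small model."""
    return tokens_per_month * frac_routable * (price_large - price_small)

def months_to_break_even(engineering_cost, tokens_per_month,
                         frac_routable, price_large, price_small):
    """How long until savings repay the build cost (inf if savings <= 0)."""
    s = monthly_savings(tokens_per_month, frac_routable, price_large, price_small)
    return float("inf") if s <= 0 else engineering_cost / s
```

At 5M tokens/month with 70% routable traffic and a $9-per-million price delta, gross savings are about $31.50/month — a multi-year payback on even a modest engineering investment, which is the point of the break-even threshold above.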
Tasks with high output variance. If the same input can produce outputs of wildly different quality depending on the model's "mood" on a given day, confidence scores become unreliable. This is common with creative tasks, ambiguous instructions, and multi-hop reasoning chains. Routing on confidence in these domains adds noise without adding value.
When classifier latency is disproportionate. If your routing classifier takes 50–100ms to run and your fast model takes 80ms, you've added up to 125% overhead before you've routed anything. This math kills interactive applications. If your classifier can't run in under 10ms, consider rule-based routing instead of ML-based routing, even if it's less accurate.
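The overhead arithmetic from that example, made explicit:

```python
def routing_overhead_pct(classifier_ms, fast_model_ms):
    """Classifier latency as a percentage of the fast model's latency."""
    return 100.0 * classifier_ms / fast_model_ms
```

A 100ms classifier in front of an 80ms model is the worst case cited above.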
When you have a high fraction of "borderline" queries. Some task distributions have a small fraction of clearly easy queries, a small fraction of clearly hard queries, and a large middle band where neither model is clearly correct. Routing only saves money when the easy/hard distinction is clean enough to classify reliably. If your distribution is mostly borderline, routing accuracy will be low, error costs will be high, and you'll end up with neither cost savings nor quality guarantees.
Telemetry: The Signal Your Router Actually Needs
Most teams deploy routers with no way to answer the question "is this router actually saving money?" The standard monitoring setup tracks requests routed to each model and total token cost. That's necessary but nowhere near sufficient.
The metrics that actually tell you if your router is working:
Escalation rate vs. baseline. What fraction of requests escalate to the expensive model? This should be stable over time. A rising escalation rate means your traffic distribution has shifted and your classifier hasn't tracked it — a common degradation mode. You need an alert on this.
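A minimal escalation-rate monitor over a sliding window; the window size and tolerance are illustrative choices, not tuned values:

```python
from collections import deque

class EscalationMonitor:
    def __init__(self, baseline_rate, window=1000, tolerance=0.05):
        self.baseline = baseline_rate
        self.tolerance = tolerance
        self.events = deque(maxlen=window)  # True = escalated to large model

    def record(self, escalated):
        self.events.append(escalated)

    def alert(self):
        """Fire when the windowed escalation rate drifts past tolerance."""
        if not self.events:
            return False
        rate = sum(self.events) / len(self.events)
        return abs(rate - self.baseline) > self.tolerance
```

The alert is symmetric on purpose: a falling escalation rate can be just as diagnostic as a rising one (it may mean the classifier has started waving complex queries through).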
Quality delta between routes. For any task where you have a quality signal, track quality by routing path. If the small-model path has meaningfully lower quality than the large-model path on the same task types, your threshold is too low and you're shipping degraded outputs. If the two paths are equivalent, your threshold is too high and you're escalating unnecessarily.
Retry amplification. Track whether requests routed to the small model have a higher user retry or error rate than requests routed to the large model. If they do, your router is cutting token costs while increasing total request volume — a common way routing ends up net-negative.
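A sketch of the metric, assuming retry and request counts per path come from your request logs:

```python
def retry_amplification(retries_small, requests_small, retries_large, requests_large):
    """Ratio of the small-path retry rate to the large-path retry rate.
    Values well above 1.0 suggest the router trades token cost for volume."""
    small_rate = retries_small / requests_small
    large_rate = retries_large / requests_large
    return float("inf") if large_rate == 0 else small_rate / large_rate
```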
Classifier cost as a fraction of savings. Log the cost of each routing decision (classifier inference compute, latency overhead) and subtract it from the per-request cost delta between the small and large model. If you're routing 1,000 requests/day, saving $0.02 per request in model costs but paying $0.01 per route in classifier overhead, your net savings are 50% of what you think they are.
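The same accounting as a tiny helper, with hypothetical figures:

```python
def net_savings_fraction(gross_saving_per_request, overhead_per_request):
    """Fraction of the gross per-request saving that survives classifier overhead."""
    return (gross_saving_per_request - overhead_per_request) / gross_saving_per_request
```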
Distribution drift detection. The inputs that your router was calibrated on are not the inputs you'll get in six months. Track input feature distributions over time and trigger recalibration when they shift significantly. A router that was accurate on your January traffic may be systematically miscalibrated by July.
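One common way to implement this is a population stability index (PSI) over binned routing features (token counts, task-type frequencies); the 0.2 alert threshold below is a conventional rule of thumb, not a tuned value:

```python
import math

def psi(baseline_counts, current_counts):
    """PSI between two histograms with identical bins (zero bins smoothed)."""
    b_total, c_total = sum(baseline_counts), sum(current_counts)
    score = 0.0
    for b, c in zip(baseline_counts, current_counts):
        bp = max(b / b_total, 1e-6)
        cp = max(c / c_total, 1e-6)
        score += (cp - bp) * math.log(cp / bp)
    return score

def needs_recalibration(baseline_counts, current_counts, threshold=0.2):
    """0.2 is a commonly used 'significant shift' cutoff for PSI."""
    return psi(baseline_counts, current_counts) > threshold
```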
A Practical Routing Decision Tree
Before building a router, answer these questions in order:
1. Is your monthly token spend above the break-even threshold for routing? If not, pick a model statically and move on.
2. Do you have labeled quality data for your task domain? If not, start collecting it before building an ML-based router. Use rule-based routing as a placeholder.
3. Is your task distribution actually bimodal? Profile a sample of your traffic with both models. If most queries get equivalent quality from the cheap model, route everything to the cheap model — you don't need a router. If most queries need the expensive model, route everything to the expensive model.
4. Can your classification signal be validated structurally? For tasks with structured output requirements, use output validation as your quality signal rather than confidence scores. It's more reliable and cheaper to compute.
5. Have you measured latency under load, not just average latency? Test your routing system at 2x peak traffic. If your classifier becomes a bottleneck under load, it will hurt you when you most need it to help.
If all five answers are favorable, build the router. Instrument it from day one with the telemetry outlined above. Run it in shadow mode — routing but not acting — for at least a week before switching to live routing. The shadow period will surface calibration problems before they affect production.
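A sketch of the shadow-mode pattern; `incumbent_call` and `router_decide` are hypothetical stand-ins for your production call path and the candidate router:

```python
def shadow_route(query, incumbent_call, router_decide, log):
    """Serve with the incumbent; record what the router *would* have done."""
    decision = router_decide(query)          # e.g. "small" or "large"
    log.append({"query": query, "decision": decision})
    return incumbent_call(query)             # production path unchanged

def shadow_report(log, expected_small_fraction):
    """Compare the shadow routing mix against the calibration expectation."""
    small = sum(1 for e in log if e["decision"] == "small")
    actual = small / len(log) if log else 0.0
    return {"small_fraction": actual,
            "calibration_gap": actual - expected_small_fraction}
```

A large `calibration_gap` after a week of shadow traffic is exactly the miscalibration signal the opening anecdote's team never saw — and it costs nothing in production quality to collect.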
The teams getting 70–85% cost reductions from routing are using task distributions where most traffic is genuinely trivial and the quality bar for those tasks is low enough that the cheap model handles it reliably. That's a real opportunity when it applies. The failure mode is assuming your traffic has those properties without verifying it — and shipping a router that makes your economics quietly worse while the dashboard shows green.
