Model Routing in Production: When the Router Costs More Than It Saves

· 10 min read
Tian Pan
Software Engineer

A team at a mid-size SaaS company deployed a model router six months ago with a clear goal: stop paying frontier-model prices for the 70% of queries that are simple lookups and reformatting tasks. They ran it for three months before someone did the math. Total inference cost had gone up by 12%.

The router itself was cheap — a lightweight classifier adding about 2ms of overhead per request. But the classifier's decision boundary was miscalibrated. It escalated 60% of queries to the expensive model, not 30%. The 40% it handled locally had worse quality, which increased user retry rates, which increased total request volume. The router's telemetry showed "routing working correctly" because it was routing — it just wasn't routing well.

This failure pattern is more common than the success stories suggest. Here's how to build routing that actually saves money.

The Routing Tax You're Already Paying

Every model routing system has three cost layers, and most teams only think about the first one.

Layer 1: The router itself. A lightweight classifier adds milliseconds and a small compute cost per request. At scale, this is genuinely negligible — well-implemented rule-based routers add under 5ms of latency, and even ML-based routers typically add under 20ms. This layer almost never breaks your economics.

Layer 2: Classifier errors. Every time the router sends a complex query to the cheap model, you either get a bad output (quality cost) or trigger a fallback to the expensive model (double cost). Every time it sends a simple query to the expensive model, you overpay by the price delta. The ratio of these errors determines whether routing helps or hurts. Most teams measure neither.

Layer 3: System effects. A router that degrades output quality increases user retry rates, support tickets, and downstream failures — costs that don't show up in your inference bill but do show up in your business metrics. A router optimizing for raw token cost while ignoring quality can be net-negative even when its per-token math looks good.

The only routing strategy that works is one where you instrument all three layers simultaneously.
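What instrumenting all three layers might look like, as a minimal sketch: the prices, the per-request router cost, and the `RoutingLedger` shape are illustrative assumptions, not numbers from the team above. The key move is tracking a counterfactual baseline (everything on the expensive model) so "net savings" is a single number you can watch go negative.

```python
from dataclasses import dataclass

# Assumed $/1K-token prices; substitute your provider's actual rates.
CHEAP_PRICE, EXPENSIVE_PRICE = 0.0005, 0.015

@dataclass
class RoutingLedger:
    router_cost: float = 0.0    # Layer 1: classifier compute
    model_cost: float = 0.0     # Layer 2: inference spend, including fallbacks
    retry_cost: float = 0.0     # Layer 3: spend attributable to user retries
    baseline_cost: float = 0.0  # counterfactual: every request on the expensive model

    def record(self, tokens: int, used_expensive: bool, fell_back: bool, is_retry: bool):
        self.router_cost += 0.000001  # assumed per-request classifier cost
        cost = tokens / 1000 * (EXPENSIVE_PRICE if used_expensive else CHEAP_PRICE)
        if fell_back:  # ran both models: pay for the failed cheap attempt too
            cost += tokens / 1000 * CHEAP_PRICE
        if is_retry:   # Layer 3 shows up here, not in the per-token math
            self.retry_cost += cost
        self.model_cost += cost
        self.baseline_cost += tokens / 1000 * EXPENSIVE_PRICE

    def net_savings(self) -> float:
        return self.baseline_cost - (self.router_cost + self.model_cost)
```

A single fallback-heavy request already drives `net_savings` negative, which is exactly the failure mode the opening story describes at fleet scale.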

The Decision Framework: Four Routes, Not Two

The "cheap model vs. expensive model" framing is the first thing to discard. A real production routing policy has at least four paths:

Direct to small model. Tasks with narrow input domains where quality can be validated structurally — format conversion, classification with a fixed label set, extraction from templated inputs. You don't need confidence scores here; you need schema validation on the output. If the output passes validation, the task was correctly handled. If it fails, you already have a fallback trigger.
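A sketch of the validate-or-fall-back pattern for a hypothetical classification task with a fixed label set. The `ALLOWED_LABELS` set, the JSON output contract, and `handle_with_validation` are invented for illustration; the point is that the quality check is structural, with no confidence score involved.

```python
import json

ALLOWED_LABELS = {"billing", "bug", "feature_request"}  # hypothetical fixed label set

def handle_with_validation(query: str, small_model, large_model) -> str:
    """Try the small model; treat schema validation as the quality check."""
    raw = small_model(query)
    try:
        parsed = json.loads(raw)
        # Structural checks: output is a JSON object whose label is in the fixed set.
        if isinstance(parsed, dict) and parsed.get("label") in ALLOWED_LABELS:
            return raw  # validated: the small model handled it
    except json.JSONDecodeError:
        pass
    return large_model(query)  # validation failure IS the fallback trigger
```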

Confidence-threshold cascade. Tasks where the small model might be adequate but you can't know ahead of time. The small model runs first and produces a confidence signal. If confidence exceeds your threshold, return the result. If not, escalate to the large model. This is the pattern most teams implement — but the confidence threshold calibration is where they fail.

Setting the threshold too low (conservative: only escalate when the model is very uncertain) leaves a lot of low-confidence responses in production that users experience as degraded quality. Setting it too high (aggressive: escalate at the first sign of uncertainty) means you're running both models on most requests while only returning the expensive result — worse than skipping routing entirely.

The correct calibration is specific to your task domain and your quality tolerance. It requires labeled evaluation data and iterative threshold tuning. If you haven't done this work, the threshold is probably wrong.
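One way the threshold-tuning work might look, sketched under strong simplifying assumptions: the large model is treated as always correct on escalated queries, and `price_ratio` stands in for your actual per-token price delta. The function name and signature are illustrative.

```python
def calibrate_threshold(evals, quality_floor=0.95, price_ratio=20.0):
    """
    evals: list of (confidence, small_model_correct) pairs from labeled data.
    Sweep candidate thresholds and return the cheapest one whose overall
    answer quality stays at or above the floor.
    Assumes escalated queries are answered correctly by the large model.
    """
    best = None
    for t in [i / 100 for i in range(101)]:
        accepted = [ok for conf, ok in evals if conf >= t]  # small model's answers kept
        escalated = len(evals) - len(accepted)
        quality = (sum(accepted) + escalated) / len(evals)
        # Relative cost: escalations pay the price ratio, accepted answers pay 1.
        cost = escalated * price_ratio + len(accepted)
        if quality >= quality_floor and (best is None or cost < best[1]):
            best = (t, cost, quality)
    return best  # (threshold, relative_cost, quality), or None if floor is unreachable
```

Running this over even a few hundred labeled examples tells you whether a threshold satisfying your quality floor exists at all; if it doesn't, the cascade pattern is the wrong route for that task.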

Latency-budget dispatch. For interactive user-facing applications with tight latency SLOs, routing decisions need to incorporate the current queue depth and expected wait time on each model endpoint — not just the per-token cost. A cheap model with a 3-second queue is often worse than an expensive model with a 200ms queue. Production routing systems that ignore current infrastructure state will make systematically wrong decisions during peak load, which is precisely when the decisions matter most.
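A minimal dispatch sketch for the latency-budget path, assuming each endpoint exposes its current expected queue wait; the endpoint dict shape here is hypothetical, not a real API. The policy: among endpoints expected to finish inside the SLO, pick the cheapest; if none can, pick the fastest regardless of cost.

```python
def dispatch(endpoints, latency_slo_ms: float):
    """
    endpoints: dicts with 'name', 'cost' (relative), 'queue_ms' (current
    expected wait), and 'service_ms' (typical processing time).
    """
    within = [e for e in endpoints
              if e["queue_ms"] + e["service_ms"] <= latency_slo_ms]
    if within:
        return min(within, key=lambda e: e["cost"])   # cheapest that meets the SLO
    return min(endpoints, key=lambda e: e["queue_ms"] + e["service_ms"])  # degrade gracefully
```

Note that the same two endpoints produce opposite decisions depending on load, which is exactly the behavior a cost-only router cannot express.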

Skip routing entirely. For tasks where the input domain is genuinely uniform and you've already profiled the cheapest model that handles the domain reliably, routing adds complexity without benefit. Pick the right model once at design time and don't pay the per-request classification overhead.

Complexity Proxies: What Actually Works

The "classify by complexity" framing sounds straightforward. In practice, measuring complexity before running the model is a subtle problem.

Token count is a weak proxy. Long inputs aren't necessarily hard — a 2,000-token document retrieval task is far simpler than a 50-token multi-step reasoning question. Teams that route purely on input length end up sending easy-but-verbose queries to expensive models while confidently misrouting short-but-complex ones.

Intent signals are stronger. The task type — summarization, extraction, code generation, open-ended reasoning — is a better complexity predictor than raw size. A router built on a task-type classifier tends to make better decisions than one using prompt-length heuristics. The tradeoff: building a good task-type classifier requires labeled examples from your actual traffic, which takes time to collect.
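Once a task-type classifier exists, the routing policy itself can be a lookup table keyed on its output; the task types and model assignments below are illustrative, not a recommendation for any particular workload.

```python
# Hypothetical routing table keyed on the task-type classifier's output.
TASK_ROUTES = {
    "extraction": "small",
    "summarization": "small",
    "code_generation": "large",
    "open_ended_reasoning": "large",
}

def route_by_intent(task_type: str, default: str = "large") -> str:
    # Unknown task types default to the expensive model: misrouting a hard
    # query downward costs quality; misrouting an easy one upward only costs money.
    return TASK_ROUTES.get(task_type, default)
```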

Structural features beat semantic ones for cold starts. When you don't have labeled data, structural signals — number of distinct instructions in the prompt, presence of nested conditions, count of tool calls required — give you a working baseline. They're not optimal, but they're better than token count alone and require no labeled data.
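A cold-start scorer built only on structural signals might look like the sketch below; the regexes, weights, and threshold are illustrative placeholders that would need tuning against real traffic before deployment.

```python
import re

def structural_complexity(prompt: str) -> int:
    """Cold-start complexity score from structural signals alone."""
    # Distinct instructions, approximated by bulleted or numbered list items.
    instructions = len(re.findall(r"(?m)^\s*(?:[-*]|\d+\.)\s", prompt))
    # Nested or conditional logic in the request.
    conditionals = len(re.findall(r"\b(?:if|unless|otherwise|except when)\b", prompt, re.I))
    # Verbs suggesting tool calls will be required.
    tool_calls = len(re.findall(r"\b(?:search|fetch|look up|run|execute)\b", prompt, re.I))
    return 2 * instructions + 3 * conditionals + 2 * tool_calls

def route_cold_start(prompt: str, threshold: int = 6) -> str:
    return "large" if structural_complexity(prompt) >= threshold else "small"
```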
