LLM Routing: How to Stop Paying Frontier Model Prices for Simple Queries
Most teams reach the same inflection point: LLM API costs are scaling faster than usage, and every query — whether "summarize this sentence" or "audit this 2,000-line codebase for security vulnerabilities" — hits the same expensive model. The fix isn't squeezing prompts. It's routing.
LLM routing means directing each request to the most appropriate model for that specific task. Not the most capable model. The right model — balancing cost, latency, and quality for what the query actually demands. Done well, routing cuts LLM costs by 50–85% with minimal quality degradation. Done poorly, it creates silent quality regressions you won't detect until users churn.
This post covers the mechanics, the tradeoffs, and what actually breaks in production.
Why the Cost Gap Is Exploitable
The price spread between frontier and smaller models is enormous — and growing. GPT-4o costs roughly $0.02 per 1,000 tokens. GPT-4o-mini costs $0.00075. That's a 27x price difference. Claude 3.5 Sonnet runs about $0.018/1K tokens; Claude 3 Haiku is $0.0015 — a 12x gap.
The key insight: model capability doesn't scale linearly with price, and most queries don't need frontier capability. A factual lookup, a JSON extraction, a short summary — these tasks are handled competently by models two or three tiers cheaper. Research on the "Hybrid LLM" approach (ICLR 2024) showed formally what practitioners already suspected: allocating maximum inference resources doesn't always yield better outputs. For a large class of queries, a 7B parameter model is sufficient.
The math gets interesting at scale. If you're processing 100,000 queries per day and shift even 50% of them from GPT-4o to GPT-4o-mini, you cut costs ~27x on that segment — roughly halving your total bill. At 1M queries/day, the savings compound into something that justifies serious engineering investment.
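The blended arithmetic is worth making explicit. A small sketch (prices are the illustrative per-1K-token figures above, not current list prices):

```python
# Blended-cost sketch: what happens to total spend when a fraction of
# traffic moves from an expensive model to a cheap one.
def blended_savings(price_big: float, price_small: float, shifted_fraction: float) -> float:
    """Return the fractional reduction in total spend after routing."""
    blended = (1 - shifted_fraction) * price_big + shifted_fraction * price_small
    return 1 - blended / price_big

# Shifting 50% of traffic from a $0.02/1K model to a $0.00075/1K model
saving = blended_savings(0.02, 0.00075, 0.5)
print(f"{saving:.0%}")  # → 48%
```

Note that the overall savings cap out at the shifted fraction: routing half your traffic can never save more than half your bill, no matter how cheap the small model is.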
Three Routing Strategies (and When to Use Each)
Rule-Based Routing
The simplest approach: conditional logic on explicit signals.
```python
def route_query(query: str, context: dict) -> str:
    # High-priority users always get the best model
    if context.get("plan") == "enterprise":
        return "gpt-4o"
    # Short queries are likely simple
    if len(query.split()) < 20:
        return "gpt-4o-mini"
    # Code-related queries need a capable model
    code_keywords = ["function", "debug", "implement", "refactor", "sql", "api"]
    if any(kw in query.lower() for kw in code_keywords):
        return "gpt-4o"
    # Default: cheaper model
    return "gpt-4o-mini"
```
Rule-based routing is deterministic, fast (sub-millisecond), and auditable. It covers 80% of use cases and should always be your starting point. The limitation: it's brittle. A query like "what's the time complexity of this algorithm?" contains no code keywords but needs a strong model. Edge cases accumulate faster than you can write rules.
Classifier-Based Routing
Train a lightweight model — typically BERT-scale (~110M parameters) — to predict which LLM will produce the best response for a given query. RouteLLM (UC Berkeley / LMSYS) is the open-source reference implementation. It trains four router architectures on Chatbot Arena preference data: similarity-weighted ranking, matrix factorization, a BERT classifier, and a causal LLM classifier.
The benchmark numbers are striking: the best RouteLLM routers cut costs by 85% on MT Bench while maintaining 95% of GPT-4 quality. On more structured tasks like MMLU and GSM8K, the savings are 35–46% — still meaningful. The router adds 10–30ms latency, which is acceptable for most applications.
One underappreciated finding: RouteLLM routers transfer across model pairs. Routers trained on GPT-4 / Mixtral preference data generalize to Claude 3 Opus / Llama 3 without retraining. The classifier learns something about query difficulty that isn't tied to specific models.
Training requires less data than expected. Effective routers were trained on fewer than 1,500 labeled samples — less than 2% of the full Chatbot Arena dataset. If you have production logs with quality annotations or user feedback signals, you likely have enough data to train a custom router already.
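To make the routing interface concrete, here is a toy stand-in for a learned router: a nearest-centroid classifier over bag-of-words features, with hand-written examples standing in for preference-labeled training data. A production router would use a BERT-scale encoder trained on real labels (as RouteLLM does), but it exposes the same `query → model` contract.

```python
# Toy learned-router sketch: nearest-centroid over bag-of-words.
# The training examples and model names are illustrative.
from collections import Counter
import math

def bow(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Labeled examples stand in for preference data; the two classes are
# "needs a strong model" vs "a cheap model suffices".
TRAINING = {
    "strong": [
        "prove this theorem step by step",
        "debug this race condition in my code",
        "analyze the tradeoffs of this architecture",
    ],
    "cheap": [
        "translate hello to french",
        "what is the capital of italy",
        "summarize this paragraph in one sentence",
    ],
}
CENTROIDS = {label: bow(" ".join(ex)) for label, ex in TRAINING.items()}

def route(query: str) -> str:
    scores = {label: cosine(bow(query), c) for label, c in CENTROIDS.items()}
    label = max(scores, key=scores.get)
    return "gpt-4o" if label == "strong" else "gpt-4o-mini"

print(route("what is the capital of spain"))  # → gpt-4o-mini
```

Swapping the bag-of-words featurizer for a sentence encoder and the centroids for a trained classification head gives you the real thing without changing the calling code.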
Cascade / Waterfall Routing
Instead of a one-shot routing decision, cascade routing tries the cheap model first and escalates if quality is insufficient:
- Send query to cheap model (e.g., Claude Haiku)
- Evaluate response confidence
- If above threshold → return the answer
- If below threshold → escalate to a stronger model (e.g., Claude Sonnet)
This is the most complex strategy but offers a real advantage: you don't need to predict query difficulty upfront. The cheap model's own uncertainty signal drives escalation. Research combining quality, cost, and uncertainty scores achieved 97% of GPT-4 accuracy at 24% of GPT-4 cost — a 4x reduction with near-identical output quality.
The catch: cascading adds latency when escalation triggers. If you call the cheap model, score the response, and then call the expensive model, you've roughly doubled latency for the hard cases. For real-time chat under 300ms, this is often incompatible with the UX requirements.
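The control flow is simple enough to sketch directly. Here `call_cheap`, `call_strong`, and the confidence score are placeholders — real systems derive confidence from token logprobs, a verifier model, or self-reported uncertainty:

```python
# Cascade routing sketch. The backends and confidence values are stubs.
from dataclasses import dataclass

@dataclass
class ModelResponse:
    text: str
    confidence: float  # 0.0-1.0, however your stack estimates it

def cascade(query: str, call_cheap, call_strong, threshold: float = 0.8) -> ModelResponse:
    cheap = call_cheap(query)
    if cheap.confidence >= threshold:
        return cheap           # cheap answer is good enough: return it
    return call_strong(query)  # escalate: pay the latency + cost of a second call

# Stub backends for illustration
def haiku(q):  return ModelResponse(f"haiku: {q}", confidence=0.6)
def sonnet(q): return ModelResponse(f"sonnet: {q}", confidence=0.95)

print(cascade("audit this codebase for vulnerabilities", haiku, sonnet).text)
```

Note that the escalated path pays for both calls, which is why the threshold calibration discussed later matters so much.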
Building a Production Routing Stack
A minimal production setup layers three concerns:
Infrastructure: Health monitoring, provider failover, circuit breakers. Before any intelligent routing, you need the ability to redirect traffic away from degraded providers. LiteLLM handles this well as a proxy server — it supports latency-based routing, cost-based routing, and automatic fallback chains with a unified API surface across 100+ providers.
Pre-filtering: Domain tagging, content safety checks, PII detection. These are typically cheap rule-based checks that run before the routing decision and can force certain queries to specific models (e.g., all PII-containing queries go to your on-premises model).
Routing logic: The actual model selection — rules, classifier, or cascade depending on your requirements and traffic volume.
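The three layers compose naturally: pre-filters run first and can force a destination; otherwise the routing logic decides. A minimal sketch of that composition — the function names, model identifiers, and the toy PII regex are illustrative, not any specific library's API:

```python
# Layered selection sketch: pre-filter can short-circuit the router.
import re
from typing import Optional

def prefilter(query: str) -> Optional[str]:
    # PII check forces the on-prem model. A toy US-SSN regex stands in
    # for a real PII detector here.
    if re.search(r"\b\d{3}-\d{2}-\d{4}\b", query):
        return "on-prem-llama"
    return None  # no forced destination

def routing_logic(query: str) -> str:
    # Placeholder for rules / classifier / cascade
    return "gpt-4o" if len(query.split()) > 50 else "gpt-4o-mini"

def select_model(query: str) -> str:
    return prefilter(query) or routing_logic(query)

print(select_model("my ssn is 123-45-6789"))  # → on-prem-llama
```

Keeping the pre-filter as a separate layer means compliance rules can never be overridden by a misbehaving classifier.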
A practical model tiering that maps cleanly to costs:
| Tier | Models | Use Cases |
|---|---|---|
| Simple | GPT-4o-mini, Claude Haiku | Formatting, translation, factual Q&A, summarization |
| Medium | GPT-4o, Claude Sonnet | Multi-step reasoning, code generation, analysis |
| Complex | o1, Claude Sonnet (extended thinking) | Planning, security review, research synthesis |
What Actually Breaks in Production
Silent Quality Regressions
The most dangerous failure mode. You route aggressively to cheap models, costs drop, everything looks fine — until retention metrics move. Users who experience degraded responses don't always complain explicitly; they just submit the same query again, leave a thumbs down, or stop using the feature.
Monitor quality alongside cost from day one. Track retry rates, explicit feedback signals, and task completion rates segmented by model tier. If you can't measure quality, you can't safely route.
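A minimal version of that instrumentation is just counters keyed by tier. The metric names below are illustrative; in practice these would feed your existing observability stack:

```python
# Per-tier quality tracking sketch: count retries and negative feedback
# per model tier so cost savings can be read against a quality signal.
from collections import defaultdict

class TierMetrics:
    def __init__(self):
        self.counts = defaultdict(
            lambda: {"requests": 0, "retries": 0, "negative_feedback": 0}
        )

    def record(self, tier: str, retried: bool = False, thumbs_down: bool = False):
        c = self.counts[tier]
        c["requests"] += 1
        c["retries"] += retried            # bool coerces to 0/1
        c["negative_feedback"] += thumbs_down

    def retry_rate(self, tier: str) -> float:
        c = self.counts[tier]
        return c["retries"] / c["requests"] if c["requests"] else 0.0

m = TierMetrics()
m.record("simple")
m.record("simple", retried=True)
print(m.retry_rate("simple"))  # → 0.5
```

A rising retry rate on the cheap tier after a routing change is exactly the silent-regression signal described above.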
Consistency Across a Conversation
Mixing models within a single conversation thread creates subtle inconsistencies: different formatting defaults, different response lengths, different handling of edge cases. Users form a mental model of how the assistant behaves; violating that model is jarring even when users can't articulate why.
The fix: use consistent system prompt engineering per model tier to normalize output format, and avoid switching models mid-conversation without re-evaluation. A response normalization layer that enforces formatting standards helps but doesn't fully solve it.
Threshold Miscalibration in Cascades
Setting the confidence threshold for cascade escalation is harder than it looks. Too high: you escalate constantly and spend more than if you'd used the expensive model from the start. Too low: the cheap model handles queries it can't answer, and quality degrades.
Grid search over threshold values is slow and computationally expensive. A better approach: model the joint performance distribution of the model sequence probabilistically, then optimize thresholds via continuous gradient descent. This converges significantly faster and handles edge cases at the distribution boundary more gracefully.
Calibrate against a representative sample of production queries — 1,000 to 5,000 is usually enough. Re-calibrate monthly, because traffic distributions shift as your product evolves.
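The grid-sweep baseline — the simple approach the gradient-based method improves on — looks like this. The sample format and cost numbers are illustrative; each sample records the cheap model's confidence and whether each model's answer was actually correct on a labeled calibration set:

```python
# Threshold calibration sketch: sweep candidate thresholds over a labeled
# sample and pick the cheapest one that keeps quality above a floor.
def calibrate(samples, quality_floor=0.95, cheap_cost=1.0, strong_cost=15.0):
    # samples: list of (cheap_confidence, cheap_was_correct, strong_was_correct)
    best = None
    for t in [i / 100 for i in range(101)]:
        correct = cost = 0
        for conf, cheap_ok, strong_ok in samples:
            if conf >= t:
                correct += cheap_ok
                cost += cheap_cost
            else:
                correct += strong_ok
                cost += cheap_cost + strong_cost  # escalation pays for both calls
        quality = correct / len(samples)
        if quality >= quality_floor and (best is None or cost < best[1]):
            best = (t, cost, quality)
    return best  # (threshold, total_cost, quality), or None if floor unreachable

samples = [(0.9, True, True), (0.8, True, True), (0.3, False, True), (0.2, False, True)]
print(calibrate(samples, quality_floor=1.0))  # → (0.31, 34.0, 1.0)
```

Even this toy version makes the tradeoff visible: lowering the floor or raising escalation cost shifts the optimal threshold, which is why recalibration matters when traffic drifts.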
Router Distribution Drift
Classifiers trained on historical data degrade when:
- New product features generate query types not present in training data
- Model providers release new versions with changed capabilities
- Traffic distribution shifts seasonally or due to growth
Treat your router as a model in production: monitor its performance, version it, and retrain it on a regular cadence. A routing classifier that made sense six months ago may be misaligned with today's traffic.
Latency Budget for the Router Itself
Routing isn't free. Rule-based routing adds ~1ms (negligible). Embedding-based semantic routing adds 20–50ms for the embedding generation call. BERT-scale classifiers add 10–30ms for inference. These numbers compound if you're running multiple checks.
For applications with strict latency requirements, pre-decision routing (classifier decides before any LLM call) is mandatory. Cascade routing — which adds a full model inference plus evaluation before potentially calling a second model — is only viable when the application can tolerate the additional latency for escalated queries.
When Routing Isn't Worth It
Routing makes sense when:
- LLM costs are scaling faster than revenue
- You can identify distinct use case clusters with genuinely different quality requirements
- You're processing more than ~10,000 requests per day (below that, engineering costs typically exceed savings)
- You have quality signals to monitor and act on
It's not worth it when:
- You're early stage with low traffic volume
- Every use case requires frontier-level quality
- Adding routing latency breaks your UX SLA
- You don't have infrastructure to monitor quality by model tier
The meta-lesson: routing is an optimization, not a foundation. Get the product right first. Optimize when the cost pressure is real and you can measure what you're optimizing.
The Emerging Frontier: Semantic and Capability-Based Routing
Two newer approaches are worth tracking for teams already running basic routing.
Semantic routing encodes queries as dense vector embeddings and routes based on cosine similarity to reference prompts associated with specific models. This handles paraphrase variation naturally — "what's the weather?" and "tell me today's forecast" route to the same place — and degrades gracefully when a query doesn't match any known category. The aurelio-labs/semantic-router library is the lightweight Python reference; vLLM's Semantic Router (v0.1 "Iris", January 2026) scales this to production inference clusters.
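The mechanics can be sketched in the style of route/utterance libraries like semantic-router: each route carries reference utterances, a query goes to the best-matching route only if similarity clears a threshold, and otherwise falls through to a default. The character-trigram "embedding" below is a cheap stand-in for a real embedding model; the route names are invented:

```python
# Semantic-routing sketch with graceful fallback. The trigram featurizer
# stands in for a dense embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # character trigrams as a toy substitute for dense embeddings
    t = f"  {text.lower()}  "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

ROUTES = {
    "weather-model": ["what's the weather", "tell me today's forecast"],
    "code-model": ["debug this function", "why does this code crash"],
}

def semantic_route(query: str, threshold: float = 0.3, default: str = "general-model") -> str:
    q = embed(query)
    best_route, best_sim = default, 0.0
    for route, utterances in ROUTES.items():
        sim = max(cosine(q, embed(u)) for u in utterances)
        if sim > best_sim:
            best_route, best_sim = route, sim
    # degrade gracefully: below threshold, fall through to the default
    return best_route if best_sim >= threshold else default

print(semantic_route("tell me the forecast for today"))  # → weather-model
print(semantic_route("bake a chocolate cake"))           # → general-model
```

The threshold is what gives semantic routing its graceful degradation: unmatched queries land on a safe default instead of a confidently wrong specialist.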
Capability-based routing focuses on query difficulty rather than topic domain. IRT-Router (ACL 2025) applies Item Response Theory — borrowed from psychometrics — to model LLM capabilities as latent ability scores and queries as latent difficulty scores. The router predicts which model will answer correctly based on these scores, providing interpretable outputs. A notable finding: capable routers make pools of weaker models collectively outperform the best single model in that pool, by selecting optimally across the query distribution.
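The core idea behind the IRT formulation is compact: a logistic link between a model's latent ability and a query's latent difficulty predicts the chance of a correct answer, and the router picks the cheapest model that clears a target. A sketch with made-up ability, difficulty, and cost values:

```python
# IRT-style routing sketch. Abilities, difficulties, and costs are
# illustrative, not fitted values.
import math

def p_correct(ability: float, difficulty: float) -> float:
    # 1PL (Rasch) model: probability of a correct answer
    return 1 / (1 + math.exp(-(ability - difficulty)))

MODELS = [  # (name, latent ability, relative cost)
    ("small-7b", 0.0, 1.0),
    ("mid-70b", 1.5, 4.0),
    ("frontier", 3.0, 15.0),
]

def cheapest_adequate(difficulty: float, target: float = 0.9) -> str:
    # scan models cheapest-first; take the first that clears the target
    for name, ability, _cost in sorted(MODELS, key=lambda m: m[2]):
        if p_correct(ability, difficulty) >= target:
            return name
    return MODELS[-1][0]  # nothing clears the bar: use the strongest model

print(cheapest_adequate(difficulty=-3.0))  # easy query  → small-7b
print(cheapest_adequate(difficulty=0.5))   # harder query → frontier
```

Because both sides of the inequality are interpretable numbers, this style of router can explain *why* it escalated — something a black-box classifier cannot.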
For multi-step agent workflows, routing extends beyond the entry point — each tool call, sub-task, and reasoning step can be dispatched to the appropriate specialist. A lightweight orchestrator model acts as a dispatcher, routing to specialist agents (coding, search, analysis) each backed by the cost-optimal model for that capability. This is where routing evolves from a cost optimization into an architectural pattern.
The Practical Starting Point
If you're running LLM workloads in production and haven't implemented routing:
- Audit your query mix. Categorize 500 recent queries by complexity and see what fraction are genuinely simple. You'll likely find 40–60% of requests don't need a frontier model.
- Implement rule-based routing first. Define 3–4 complexity tiers with explicit rules. Ship it, measure cost and quality, establish a baseline.
- Train a classifier if rule-based routing shows gaps. Sample labeled queries from production logs, train a small BERT classifier, evaluate on a holdout set, and deploy alongside the rule layer.
- Add cascade routing selectively. Reserve it for use cases where latency tolerance is high and quality requirements justify the complexity.
The 85% cost reduction benchmarks are real — but they assume you've instrumented quality monitoring and can tune thresholds against your actual traffic. The engineering investment is worth it. The risk is optimizing cost without tracking quality, which produces a number that looks good in a dashboard while users quietly leave.
Routing isn't a silver bullet, but it's one of the most tractable levers for LLM cost management in production. The tooling has matured significantly: RouteLLM, LiteLLM, and vLLM's Semantic Router cover the core use cases with production-ready implementations. The pattern is well-understood. What separates teams that capture the savings from teams that introduce quality regressions is the monitoring infrastructure, not the routing logic itself.
