LLM Routing and Model Cascades: How to Cut AI Costs Without Sacrificing Quality

· 9 min read
Tian Pan
Software Engineer

Most production AI systems fail at cost management the same way: they ship with a single frontier model handling every request, watch their API bill grow linearly with traffic, and then scramble to add caching or reduce context windows as a band-aid. The actual fix — routing different queries to different models based on what each query actually needs — sounds obvious in retrospect but is rarely implemented well.

The numbers make the case plainly. Current frontier models like Claude Opus cost roughly $5 per million input tokens and $25 per million output tokens. Efficient models in the same family cost $1 and $5 respectively — a 5x ratio. Research using RouteLLM shows that with proper routing, you can maintain 95% of frontier model quality while routing 85% of queries to cheaper models, achieving cost reductions of 45–85% depending on your workload. That's not a marginal improvement; it changes the unit economics of deploying AI at scale.

This post covers how LLM routing and model cascades actually work, the specific strategies that hold up in production, and the pitfalls that cause most implementations to fail.

The Difference Between Routing and Cascading

These terms get conflated, but they describe meaningfully different patterns.

Routing is a one-shot decision: before executing a query, a router classifies it and sends it to exactly one model. The router might use intent classification ("this is a coding question, send to the code-specialized model"), complexity estimation ("short factual query, send to the small model"), or semantic similarity to past queries. One decision, one execution.

Cascading is sequential escalation: the query first goes to the cheapest model. If that model's output confidence is below a threshold, the query escalates to the next tier, and so on until a sufficiently confident answer is produced or the most capable model handles it. You always start cheap and only pay for expensive compute when the cheap model can't handle it.

The distinction matters for how you implement and tune each approach. Routing requires an accurate upfront classifier that adds latency before any generation. Cascading requires good confidence calibration and accepts the latency of potentially running multiple models sequentially. Hybrid approaches — route first to avoid obviously mismatched tiers, then cascade within a tier — often work best in practice.
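The route-then-cascade hybrid can be sketched in a few lines. Everything here is a stand-in: the tier names, the length-based router, and the fixed per-tier confidences are placeholders for a trained classifier and real provider calls that derive confidence from token log-probabilities.

```python
TIERS = ["small", "mid", "frontier"]

def call_model(tier, query):
    # Placeholder for a provider call; in a real system, confidence
    # would come from token log-probabilities or a verifier model.
    confidence = {"small": 0.5, "mid": 0.8, "frontier": 0.95}[tier]
    return f"[{tier}] answer to: {query}", confidence

def route(query):
    # One-shot routing decision. A trivial length heuristic stands in
    # for a fine-tuned intent/complexity classifier.
    return "small" if len(query.split()) < 10 else "mid"

def answer(query, threshold=0.7):
    # Route first to pick a starting tier, then cascade upward until
    # a response clears the confidence threshold.
    start = TIERS.index(route(query))
    for tier in TIERS[start:]:
        response, confidence = call_model(tier, query)
        if confidence >= threshold or tier == TIERS[-1]:
            return tier, response
```

Note that the top tier always returns, confident or not: there is nowhere left to escalate, so its answer is accepted by construction.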

How Confidence-Based Cascades Actually Work

The core mechanism in cascade systems is using the small model's uncertainty as a proxy for whether escalation is needed. When a model generates a response, its token probability distribution carries information: high-confidence generations cluster probability mass on a few tokens, while uncertain generations spread it across many candidates. High entropy in this distribution signals that the model is unsure.
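The entropy signal is easy to compute if your provider exposes per-token probabilities (many APIs return log-probabilities for the top candidates). A minimal sketch:

```python
import math

def token_entropy(probs):
    """Shannon entropy of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Confident step: probability mass concentrated on one token -> low entropy.
confident = [0.97, 0.01, 0.01, 0.01]
# Uncertain step: mass spread evenly across candidates -> high entropy.
uncertain = [0.25, 0.25, 0.25, 0.25]

def mean_sequence_entropy(per_token_distributions):
    # Average entropy across the generated tokens. This is one possible
    # aggregation for an escalation signal; max or percentile entropy
    # are common alternatives.
    entropies = [token_entropy(p) for p in per_token_distributions]
    return sum(entropies) / len(entropies)
```

The uniform distribution over four candidates has entropy ln 4 ≈ 1.39, the maximum possible, while the concentrated one sits near zero.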

The challenge is that LLM self-reported confidence is poorly calibrated. A model can produce a fluent, authoritative-sounding response with high token probabilities while being factually wrong. Conversely, correct answers on genuinely ambiguous questions may trigger false escalation. This is the biggest practical problem with naive cascade implementations.

Better approaches handle this in a few ways:

Early abstention trains models to explicitly signal when a query exceeds their capability rather than attempting an uncertain answer. One research approach achieved a 13% reduction in cost and 5% reduction in error rate by accepting a 4.1% increase in abstention rate — the model says "I don't know" instead of producing a low-confidence wrong answer, triggering escalation.

Retrieval-coupled confidence combines the model's uncertainty with retrieval quality signals in RAG systems. If the retrieved documents are relevant and the model confidence is high, skip escalation. If retrieval quality is poor, escalate regardless of model confidence.
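The combined rule is a small decision function. The thresholds below are illustrative, not recommendations; both need calibration on your own workload.

```python
def should_escalate(model_confidence, retrieval_score,
                    conf_threshold=0.7, retrieval_threshold=0.6):
    """Escalation rule coupling model confidence with retrieval quality.

    Both inputs are assumed to be scores in [0, 1]; the thresholds are
    hypothetical defaults that must be calibrated per workload.
    """
    if retrieval_score < retrieval_threshold:
        # Poor retrieval: escalate regardless of how confident the model is.
        return True
    # Good retrieval: fall back to the model's own uncertainty signal.
    return model_confidence < conf_threshold
```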

Empirically calibrated thresholds are non-negotiable. Any system that sets confidence thresholds by intuition will be miscalibrated. The correct approach is to evaluate your specific workload — what fraction of queries the small model gets right at each confidence level — and set thresholds based on your acceptable error rate. This requires labeled data from your domain, not benchmark results from academic papers.
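The calibration procedure itself is simple once you have the labeled data. One way to implement it, sweeping candidate thresholds and picking the lowest one that meets a target accuracy on accepted queries:

```python
def calibrate_threshold(records, target_accuracy=0.95):
    """Pick the lowest confidence threshold at which the small model's
    accuracy on accepted (non-escalated) queries meets the target.

    `records` is labeled evaluation data from your own workload:
    (confidence, was_correct) pairs for small-model responses.
    """
    for threshold in [round(t * 0.05, 2) for t in range(0, 21)]:
        accepted = [ok for conf, ok in records if conf >= threshold]
        if accepted and sum(accepted) / len(accepted) >= target_accuracy:
            return threshold
    return 1.0  # no threshold qualifies: escalate everything
```

The tradeoff is explicit: a higher target accuracy pushes the threshold up, which escalates more queries and erodes the cost savings.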

Intent-Based Routing: The Practical Starting Point

Before tackling confidence-based cascading, most teams are better served by simpler intent routing. The idea is straightforward: classify incoming queries into categories, route each category to the model best suited for it.

A customer support system might route:

  • Simple factual lookups (order status, shipping times) → small, fast model
  • Policy interpretation questions → mid-tier model
  • Complex complaints or edge cases requiring judgment → frontier model

The classifier itself can be tiny — a fine-tuned 0.5B parameter model can achieve ~90% routing accuracy for well-defined intent categories while adding only milliseconds of latency. The key is defining categories that genuinely correlate with the required model capability, not just topic categories.
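Structurally, intent routing is just a lookup table behind a classifier. In this sketch the intent labels, model names, and keyword matching are all hypothetical stand-ins; a production system would replace `classify_intent` with the fine-tuned classifier described above.

```python
# Hypothetical intent-to-model routing table for the support example.
ROUTES = {
    "factual_lookup": "small-model",
    "policy_question": "mid-model",
    "complex_complaint": "frontier-model",
}

# Conservative default tier for queries the classifier can't place.
DEFAULT_MODEL = "mid-model"

def classify_intent(query):
    # Toy keyword classifier standing in for a fine-tuned model.
    q = query.lower()
    if "order status" in q or "shipping" in q:
        return "factual_lookup"
    if "policy" in q or "refund" in q:
        return "policy_question"
    if "complaint" in q:
        return "complex_complaint"
    return None  # unknown intent

def select_model(query):
    return ROUTES.get(classify_intent(query), DEFAULT_MODEL)
```

The `DEFAULT_MODEL` choice encodes the safeguard discussed below: ambiguous queries land on a mid-tier model rather than the cheapest one.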

Where this breaks down: intent categories that sound clean in a design document often overlap messily in production. "Simple factual lookup" queries sometimes contain implicit complexity. "Edge cases requiring judgment" sometimes resolve trivially. You need monitoring on routing decisions and a mechanism to catch systematic misrouting.

A practical safeguard is routing to conservative tiers by default for ambiguous queries. The cost of unnecessarily routing to a mid-tier model is much lower than the cost of a bad answer from a small model that should have been escalated.

Semantic Caching: The Compounding Multiplier

Routing and cascading reduce costs on novel queries. Semantic caching eliminates costs for queries you've already answered.

The mechanism: when a query is answered, embed it and store the embedding alongside the response. For each new query, compute its embedding and check cosine similarity against the cache. Above a similarity threshold, return the cached response without calling any model. Below the threshold, execute the query and cache the result.
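The mechanism fits in a small class. The character-frequency `toy_embed` below is purely illustrative; a real cache would use an actual embedding model and an approximate-nearest-neighbor index instead of a linear scan.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Minimal semantic cache: linear scan over stored embeddings."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query):
        qv = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best is not None and cosine(qv, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: caller executes the query, then put()s it

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

def toy_embed(text):
    # Stand-in embedding: letter-frequency vector. Real systems use a
    # sentence-embedding model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec
```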

The latency numbers are striking in practice. Real-world RAG pipelines with semantic caching show 3.4x latency reduction for near-duplicate queries and 123x for exact matches. Combined with routing, semantic caching can reduce total LLM costs by 60% or more in workloads with high query repetition.

The hard part is calibrating the similarity threshold. Too high and you miss valid cache hits, leaving cost savings on the table. Too low and you return cached responses for queries that are superficially similar but meaningfully different, causing incorrect answers. The right threshold depends on your embedding model, the semantic density of your query space, and your tolerance for staleness.

For systems with time-sensitive data, you also need cache invalidation logic. A cached response about current product inventory may be wrong within hours.

Production Pitfalls That Kill Routing Systems

Several failure patterns appear repeatedly in production deployments:

Routing without observability. If you can't see which queries route to which models and what the outcomes are, you can't detect miscalibration. The routing layer needs structured logging of every routing decision: the model selected, the classifier's confidence, the query features used, and ideally the downstream quality signal if you have one. Without this, you're flying blind.

Single provider dependency. Routing across cost tiers only works if each tier is available. When a model provider has an outage or rate limits you, your routing system needs automatic fallback — not just to a different tier, but potentially to a different provider. Multi-provider fallback is harder to implement but necessary for production reliability.

Routing cold starts. Intent classifiers trained on historical data perform worse on new query types that weren't in training. A new product launch, a new market, or a new feature can generate query patterns that the router misclassifies. The router should have a mechanism to flag low-confidence routing decisions and route them more conservatively.

Cascade latency accumulation. Sequential cascading adds latency every time a query escalates. In a three-tier cascade, a query that needs the top tier has paid for three model calls. For latency-sensitive applications, this can be worse than just routing directly. Profile your latency budget before deciding cascade vs. direct routing.

Confidence miscalibration at the tail. Your routing system's accuracy on typical queries doesn't predict its behavior on tail queries — the rare, unusual inputs that are often exactly the ones where errors are most costly. Evaluate routing quality separately on your tail distribution.

Putting It Together: A Layered Architecture

For most teams, the right implementation order is:

  1. Add semantic caching first. It's the simplest win, requires no classifier development, and immediately reduces costs on repeated queries. Use an off-the-shelf solution (there are several mature open source options) rather than building from scratch.

  2. Implement intent routing for well-defined categories. Identify the clearest tier-appropriate categories in your workload — the queries that obviously need a frontier model and the queries that obviously don't. Route these explicitly before touching anything else.

  3. Add confidence-based cascading for the ambiguous middle. Once you have intent routing working and monitored, layer in confidence cascading for queries that fall into categories where the right tier genuinely depends on difficulty.

  4. Build fallback chains for reliability. Every routing path needs a fallback. The small model fails to produce a confident answer; the fallback is the mid-tier. The primary provider is unavailable; the fallback is an equivalent model from a different provider.
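A fallback chain can be expressed as an ordered list of (provider, model) pairs tried in sequence. The provider and model names here are hypothetical, and `call` stands in for the actual SDK invocation (frameworks like LiteLLM provide this chain logic out of the box).

```python
class ProviderError(Exception):
    """Raised when a provider call fails (outage, rate limit, timeout)."""

def call_with_fallback(query, chain, call):
    """Try each (provider, model) pair in order; return the first success."""
    last_error = None
    for provider, model in chain:
        try:
            return call(provider, model, query)
        except ProviderError as err:
            last_error = err  # log the failure, then try the next link
    if last_error is None:
        raise RuntimeError("empty fallback chain")
    raise last_error

# Hypothetical chain: escalate tiers within provider-a, then cross
# providers to an equivalent mid-tier model.
CHAIN = [
    ("provider-a", "small-model"),
    ("provider-a", "mid-model"),
    ("provider-b", "mid-model-equivalent"),
]
```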

The tooling ecosystem has matured enough that you don't need to build the routing infrastructure yourself. LiteLLM provides a production-ready router with load balancing and fallback support for 100+ models. RouteLLM provides research-backed routing strategies specifically optimized for cost-performance tradeoffs. For intent classification at the edge, vLLM's semantic router integrates with production serving infrastructure.

The Economics Are Now Compelling

Three years ago, routing strategies were theoretically sound but operationally fragile. The model landscape was sparse, pricing differentials were smaller, and the tooling required significant custom work.

That's changed. Current model families offer 5–25x cost ratios between efficient and frontier tiers with well-documented capability boundaries. Routing frameworks are battle-tested in production. The research on confidence calibration and cascade design has caught up with the practical requirements.

The teams that aren't implementing routing are essentially subsidizing their AI costs unnecessarily. Not every query needs a frontier model. The engineering work to identify which queries need which tier, and to route them there reliably, pays for itself quickly at any meaningful scale.

The remaining open problem is measurement: routing quality is only as good as your ability to detect when it's wrong. The teams that will get this right are the ones that treat routing decisions as first-class telemetry, not infrastructure plumbing.
