
5 posts tagged with "model-routing"


Model Routing in Production: When the Router Costs More Than It Saves

Tian Pan, Software Engineer · 10 min read

A team at a mid-size SaaS company deployed a model router six months ago with a clear goal: stop paying frontier-model prices for the 70% of queries that are simple lookups and reformatting tasks. They ran it for three months before someone did the math. Total inference cost had gone up by 12%.

The router itself was cheap — a lightweight classifier adding about 2ms of overhead per request. But the classifier's decision boundary was miscalibrated. It escalated 60% of queries to the expensive model, not 30%. The 40% it handled locally had worse quality, which increased user retry rates, which increased total request volume. The router's telemetry showed "routing working correctly" because it was routing — it just wasn't routing well.
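
The arithmetic behind that 12% increase is worth reproducing. Below is a back-of-the-envelope sketch in Python; the per-request costs, escalation rate, and retry rate are illustrative assumptions chosen to show the mechanism, not the team's actual figures.

```python
def total_cost(requests: int, escalation_rate: float, retry_rate: float,
               cost_frontier: float = 1.0, cost_cheap: float = 0.2) -> float:
    """Effective cost when weak local answers inflate volume via retries."""
    volume = requests * (1 + retry_rate)   # retries are real extra requests
    escalated = volume * escalation_rate   # sent to the expensive model
    local = volume - escalated             # handled by the cheap model
    return escalated * cost_frontier + local * cost_cheap

baseline = 1_000_000 * 1.0                    # frontier-only, no router
planned = total_cost(1_000_000, 0.30, 0.00)   # the design target
actual = total_cost(1_000_000, 0.60, 0.50)    # miscalibrated boundary plus retries

print(f"planned: {planned / baseline:.0%} of baseline")  # 44%
print(f"actual:  {actual / baseline:.0%} of baseline")   # 102% -- costlier than no router
```

The escalation rate and the retry-driven volume growth compound multiplicatively, which is why a router can look "cheap" per request and still lose money in aggregate.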

This failure pattern is more common than the success stories suggest. Here's how to build routing that actually saves money.

Quality-Aware Model Routing: Why Optimizing for Cost Alone Wrecks Your AI Product

Tian Pan, Software Engineer · 9 min read

Every team that ships LLM routing starts the same way: sort models by price, send easy queries to the cheap one, hard queries to the expensive one, celebrate the 60% cost reduction. Six weeks later, someone notices that contract analysis accuracy dropped from 94% to 79%, the coding assistant started hallucinating API endpoints that don't exist, and customer satisfaction on complex support tickets fell off a cliff — all while the routing dashboard showed "95% quality maintained."

The problem isn't routing itself. Cost-optimized routing treats all quality degradation as equal, when in practice the queries you're downgrading are disproportionately the ones where quality matters most.
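
One way to operationalize that asymmetry: weight each query category's measured quality loss by how much a wrong answer costs, and downgrade only when the weighted loss fits a budget. A minimal sketch, where the categories, quality scores, and stakes weights are all hypothetical placeholders:

```python
# Hypothetical per-category stakes (how costly a wrong answer is) and quality
# estimates (e.g. from an offline eval set). All numbers are assumptions.
STAKES = {"contract_analysis": 5.0, "coding": 3.0, "lookup": 1.0, "reformat": 1.0}
QUALITY = {
    "cheap":    {"contract_analysis": 0.79, "coding": 0.84, "lookup": 0.97, "reformat": 0.98},
    "frontier": {"contract_analysis": 0.94, "coding": 0.93, "lookup": 0.98, "reformat": 0.99},
}

def route(category: str, loss_budget: float = 0.05) -> str:
    """Downgrade only when the stakes-weighted quality loss fits the budget."""
    loss = QUALITY["frontier"][category] - QUALITY["cheap"][category]
    return "cheap" if loss * STAKES[category] <= loss_budget else "frontier"

for cat in STAKES:
    print(f"{cat}: {route(cat)}")
# lookup and reformat go cheap; contract analysis and coding stay on frontier.
```

A cost-only router would happily send contract analysis to the cheap model because its classifier scores it as "routable"; the stakes weight is what keeps high-consequence queries on the strong model.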

Compound AI Systems: Why Your Best Architecture Uses Three Models, Not One

Tian Pan, Software Engineer · 10 min read

The instinct is always to reach for the biggest model. Pick the frontier model, point it at the problem, and hope that raw capability compensates for architectural laziness. It works in demos. It fails in production.

The teams shipping the most reliable AI systems in 2025 and 2026 aren't using one model. They're composing three, four, sometimes five specialized models into pipelines where each component does exactly one thing well. A classifier routes. A generator produces. A verifier checks. The result is a system that outperforms any single model while costing a fraction of what a frontier-model-for-everything approach would.
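
A skeletal version of that classifier/generator/verifier composition might look like the following, with `call_model` standing in for whatever inference client you use and the model names as placeholders:

```python
def call_model(model: str, prompt: str) -> str:
    """Placeholder: wire this to your inference client of choice."""
    raise NotImplementedError

def classify(query: str) -> str:
    # A small, fast model does exactly one job: pick a lane.
    label = call_model("small-classifier", f"Label as simple or complex: {query}")
    return "complex" if "complex" in label.lower() else "simple"

def generate(query: str, lane: str) -> str:
    model = "frontier-model" if lane == "complex" else "efficient-model"
    return call_model(model, query)

def verify(query: str, answer: str) -> bool:
    verdict = call_model("small-verifier",
                         f"Answer yes or no: does this answer the question?\nQ: {query}\nA: {answer}")
    return verdict.strip().lower().startswith("yes")

def pipeline(query: str) -> str:
    lane = classify(query)
    answer = generate(query, lane)
    # If a cheap answer fails verification, escalate once and accept the result.
    if lane == "simple" and not verify(query, answer):
        answer = generate(query, "complex")
    return answer
```

Each stage stays small and independently testable, which is the point: you can swap the verifier or recalibrate the classifier without touching the generators.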

Hybrid Cloud-Edge LLM Inference: The Routing Layer That Determines Your Cost, Latency, and Privacy Profile

Tian Pan, Software Engineer · 10 min read

Most teams pick a side: run everything in the cloud, or push everything to the edge. Both are wrong for the majority of production workloads. The interesting engineering happens in the routing layer between them — the component that decides, per request, whether a query deserves a 70B frontier model on an H100 or a 3B quantized model running on local silicon.

This routing decision isn't just about latency. It's a three-variable optimization across cost, privacy, and capability — and the optimal split changes based on your traffic patterns, regulatory environment, and what "good enough" means for each query type. Teams that get the routing right cut inference costs 60–80% while improving p95 latency. Teams that get it wrong either overspend on cloud GPUs for trivial queries or ship degraded answers from edge models that can't handle the complexity.
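
In code, one reasonable shape for that decision treats privacy and latency as hard constraints and capability as the tiebreaker. The thresholds below are assumptions to tune against your own traffic, not recommended values:

```python
from dataclasses import dataclass

@dataclass
class Query:
    complexity: float        # 0..1, e.g. from a lightweight difficulty classifier
    contains_pii: bool       # privacy: PII should not leave local silicon
    latency_budget_ms: int   # product-level deadline for this request

def route(q: Query, edge_capability: float = 0.4, cloud_rtt_ms: int = 300) -> str:
    """Return 'edge' or 'cloud' for a single request."""
    if q.contains_pii:
        return "edge"                    # privacy is a hard constraint, not a score
    if q.latency_budget_ms < cloud_rtt_ms:
        return "edge"                    # the cloud round trip misses the deadline
    if q.complexity > edge_capability:
        return "cloud"                   # beyond what the small local model handles
    return "edge"                        # cheap, fast, and good enough

print(route(Query(0.2, False, 1000)))   # edge: trivial query, no reason to pay for cloud
print(route(Query(0.8, True, 1000)))    # edge: PII pins it locally despite complexity
print(route(Query(0.8, False, 1000)))   # cloud: hard query, no privacy constraint
```

The ordering of the checks encodes your priorities: here privacy dominates latency, which dominates capability, and changing that order is itself a product decision.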

LLM Routing and Model Cascades: How to Cut AI Costs Without Sacrificing Quality

Tian Pan, Software Engineer · 9 min read

Most production AI systems fail at cost management the same way: they ship with a single frontier model handling every request, watch their API bill grow linearly with traffic, and then scramble to add caching or reduce context windows as a band-aid. The actual fix — routing different queries to different models based on what each query actually needs — sounds obvious in retrospect but is rarely implemented well.

The numbers make the case plainly. Current frontier models like Claude Opus cost roughly $5 per million input tokens and $25 per million output tokens. Efficient models in the same family cost $1 and $5 respectively — a 5x ratio. Research using RouteLLM shows that with proper routing, you can maintain 95% of frontier model quality while routing 85% of queries to cheaper models, achieving cost reductions of 45–85% depending on your workload. That's not a marginal improvement; it changes the unit economics of deploying AI at scale.
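
To see what that 5x ratio buys, here is the blended-cost arithmetic with the prices quoted above. The 1,000-input/500-output token request shape is an assumed workload, and the 85% routing rate is the RouteLLM-style figure:

```python
PRICES = {"frontier": (5.0, 25.0), "efficient": (1.0, 5.0)}  # $ per 1M in/out tokens

def cost_per_request(model: str, in_tokens: int = 1_000, out_tokens: int = 500) -> float:
    p_in, p_out = PRICES[model]
    return in_tokens / 1e6 * p_in + out_tokens / 1e6 * p_out

frontier_only = cost_per_request("frontier")
routed = 0.15 * cost_per_request("frontier") + 0.85 * cost_per_request("efficient")

print(f"frontier-only: ${frontier_only:.4f}/request")      # $0.0175
print(f"with routing:  ${routed:.4f}/request")             # ~$0.0056
print(f"savings:       {1 - routed / frontier_only:.0%}")  # ~68%, inside the 45-85% band
```

The exact savings move with your token mix and routing rate, which is why the reported range is 45–85% rather than a single number.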