4 posts tagged with "llm-routing"

Your Model Router Was Trained on Your Eval Set, Not Your Traffic

· 10 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped a model router that scored 96% routing accuracy on their offline benchmark and cut average inference cost by 58%. Three weeks in, support tickets started clustering around a specific user segment — enterprise admins running scripted bulk queries through their API. The cheap path was sending those users garbage answers. The router was working exactly as designed. The design was wrong.

That story is the rule, not the exception. The "send small-model what you can, save big-model for what you must" architecture is one of the most reliable cost levers in production LLM systems, with documented savings between 45% and 85% on standard benchmarks. But the savings number that gets quoted on every routing demo assumes a benchmark distribution. Production traffic doesn't have that shape, and the gap between the two is where quality regressions live — concentrated in segments your offline eval was never designed to surface.
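A cheap way to catch this before launch is to break the router's decisions down by traffic segment instead of reporting one blended accuracy number. Here is a minimal sketch of that idea; the `difficulty_score` heuristic, segment labels, and threshold are all illustrative stand-ins, not the team's actual router:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Query:
    text: str
    segment: str  # e.g. "enterprise_admin", "free_tier" (hypothetical labels)

def difficulty_score(q: Query) -> float:
    """Stand-in for a learned router; here, a crude length heuristic."""
    return min(len(q.text) / 500, 1.0)

def route(q: Query, threshold: float = 0.5) -> str:
    """Cheap path below the threshold, expensive path above it."""
    return "small-model" if difficulty_score(q) < threshold else "big-model"

def cheap_share_by_segment(queries: list[Query]) -> dict[str, float]:
    """Report the cheap-path share per segment, not one aggregate number.
    A segment pinned near 100% cheap-path is where regressions hide."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [cheap, total]
    for q in queries:
        counts[q.segment][1] += 1
        if route(q) == "small-model":
            counts[q.segment][0] += 1
    return {seg: cheap / total for seg, (cheap, total) in counts.items()}
```

Run something like this over a sample of production traffic, not the eval set: if one segment's cheap-path share looks nothing like the benchmark's, that segment is where the quoted savings number stops applying.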

The Router Is the Product: Why Your Cheap Classifier Decides More Behavior Than Your Flagship Model

· 10 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped what they called "the routing project": a tiny BERT classifier in front of their flagship model that decided whether a query was simple enough for a cheaper, faster fallback. It paid for itself in three weeks. The cost dashboards lit up green. The flagship's eval suite — three hundred adversarial cases, weekly grading runs, the works — still passed every Friday.

Six weeks in, retention on a particular product surface dropped four points and nobody could find the cause. The flagship was fine. Latency was fine. The router, it turned out, was sending 71% of queries to the cheap model. It had been since week two. The cheap model was the product for most users, and the cheap model had no eval suite at all.

This is the most common failure mode I see in 2026 among teams that adopted LLM routing for cost control: the eval discipline gets attached to the expensive tail of the system, and the cheap head — the part that defines the product for most of the request volume — runs blind.
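One way to close that gap is to make eval sampling follow the routing decision, so the cheap path gets graded in proportion to the traffic it actually serves. A minimal sketch, with `route` and `call_model` as hypothetical placeholders rather than a real API:

```python
import random

def route(query: str) -> str:
    """Stand-in for the classifier in front of the flagship:
    short queries go to the cheap fallback."""
    return "cheap" if len(query.split()) < 20 else "flagship"

def call_model(path: str, query: str) -> str:
    """Placeholder for the actual model call."""
    return f"[{path}] answer to: {query}"

def handle(query: str, eval_queue: list, sample_rate: float = 0.02) -> str:
    path = route(query)
    answer = call_model(path, query)
    # Sample for offline grading on BOTH paths. Because sampling happens
    # after routing, the cheap path's eval coverage automatically tracks
    # its traffic share: if it serves 71% of queries, it gets 71% of grades.
    if random.random() < sample_rate:
        eval_queue.append({"path": path, "query": query, "answer": answer})
    return answer
```

The design point is the ordering: sample after the route decision, and log which path produced each graded answer, so the weekly grading run measures the product users actually got.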

The Cascade Router Reliability Trap: When Cost Optimization Quietly Wrecks Your p95

· 10 min read
Tian Pan
Software Engineer

The cost dashboard is a beautiful green. Spend per request is down 62% since the cascade router shipped. The CFO is happy. The platform team is celebrating. Meanwhile, your p95 latency has crept up 40%, your hardest customer just churned because "the bot got dumber on the queries that matter," and the experimentation team has spent two weeks chasing a phantom regression.

This is the cascade router reliability trap. It is the quiet failure mode of every "try the cheap model first, escalate if it doesn't work" architecture, and it is one of the most under-discussed second-order effects in production LLM systems. The cost wins are real, measurable, and easy to attribute. The reliability losses are diffuse, statistical, and almost impossible to attribute back to the router that caused them. So the cost wins get celebrated, the reliability losses get blamed on "the model getting worse," and the team optimizes itself into a hole.
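The latency mechanism is easy to see in a sketch: every escalated request pays the cheap model's full latency plus the big model's, so the tail moves even while the median improves. The numbers below are made up for illustration:

```python
import random

def cascade_latency(cheap_ms: float, big_ms: float,
                    escalate_rate: float, n: int = 100_000):
    """Simulate per-request latency under a 'cheap first, escalate on
    rejection' cascade. Escalations pay BOTH models' latency."""
    samples = []
    for _ in range(n):
        latency = random.gauss(cheap_ms, cheap_ms * 0.2)
        if random.random() < escalate_rate:
            latency += random.gauss(big_ms, big_ms * 0.2)
        samples.append(latency)
    samples.sort()
    return samples[n // 2], samples[int(n * 0.95)]  # (p50, p95)

p50, p95 = cascade_latency(cheap_ms=300, big_ms=1500, escalate_rate=0.15)
print(f"p50 {p50:.0f} ms, p95 {p95:.0f} ms")
# With a 15% escalation rate, p50 sits near the cheap model's 300 ms,
# while p95 lands near 300 + 1500 ms: the cost win and the p95 hit
# come from the same routing decision.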

Hybrid Cloud-Edge LLM Architectures: When to Run Inference On-Device vs. in the Cloud

· 11 min read
Tian Pan
Software Engineer

Most teams treat the cloud-vs-edge decision as binary: either you pay per token to a cloud provider or you run everything locally. In practice, the interesting architecture is the one in between — a routing layer that sends each query to the cheapest compute tier that can handle it correctly. The teams getting this right are cutting inference costs 60–80% while improving both latency and privacy compliance. The teams getting it wrong are running frontier models on every autocomplete suggestion.

The hybrid cloud-edge pattern has matured significantly over the past two years, driven by two converging trends: small language models (SLMs) that fit on consumer hardware without embarrassing themselves, and routing systems sophisticated enough to split traffic intelligently. This article covers the architecture, the decision framework, and the failure modes that make hybrid harder than it looks.