2 posts tagged with "model-cascades"

Your Model Router Is a Load Balancer That Cannot See the Load

· 11 min read
Tian Pan
Software Engineer

A load balancer in front of a web fleet works because every machine reports back: CPU, queue depth, error rate, latency. The balancer reads the load and routes accordingly. A model router does not get that telemetry. It decides which model handles a query by looking only at the query, before the model has done anything. The router predicts difficulty from the prompt. Real difficulty only shows up in the answer. By the time the signal exists, the routing decision is already three seconds old and the cheap model has already shipped a confident, wrong reply to your user.

This is the structural defect at the center of model routing, and most teams ship a router without ever framing it this way. They frame it as a classifier — train a model to label queries as "easy" or "hard," validate it on a held-out set, ship when accuracy clears 90%. The classifier metaphor is wrong in a way that matters. A classifier predicts a label that already exists. The router is predicting a label that does not exist yet, will not exist until the routed model has answered, and may never exist in a form clean enough to learn from.
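One way to make the missing-label problem concrete is a small sketch of what a router can actually log. All names here (`RoutingRecord`, `route`, `difficulty_score`, the `0.5` threshold) are hypothetical and chosen for illustration; the post does not prescribe an implementation. The point is that the outcome field is empty at decision time and is only ever filled in for the model that actually answered.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RoutingRecord:
    query: str
    predicted_hard: bool
    routed_to: str
    # Unknown when the routing decision is made; filled in later, if ever.
    outcome_ok: Optional[bool] = None


def route(query: str, difficulty_score: float, threshold: float = 0.5) -> RoutingRecord:
    """Decide from the prompt alone. difficulty_score is an assumed
    upstream prediction, not something observed from the answer."""
    predicted_hard = difficulty_score >= threshold
    routed_to = "large" if predicted_hard else "small"
    return RoutingRecord(query, predicted_hard, routed_to)


def record_outcome(record: RoutingRecord, ok: bool) -> None:
    """Attach the real signal once the routed model has answered
    and someone (or something) has judged the reply."""
    record.outcome_ok = ok


def labeled_for_training(records: List[RoutingRecord]) -> List[RoutingRecord]:
    """Only records with an observed outcome can serve as training labels,
    and each covers just the model that was chosen, never the alternative."""
    return [r for r in records if r.outcome_ok is not None]
```

In this sketch a record routed to `"small"` never tells you how `"large"` would have done on the same query, which is why the label is not one a plain classifier can be validated against.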

Your Model Router Was Trained on Your Eval Set, Not Your Traffic

· 10 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped a model router that scored 96% routing accuracy on their offline benchmark and cut average inference cost by 58%. Three weeks in, support tickets started clustering around a specific user segment — enterprise admins running scripted bulk queries through their API. The cheap path was sending those users garbage answers. The router was working exactly as designed. The design was wrong.

That story is the rule, not the exception. The "send the small model what you can, save the big model for what you must" architecture is one of the most reliable cost levers in production LLM systems, with documented savings between 45% and 85% on standard benchmarks. But the savings number quoted in every routing demo assumes a benchmark distribution. Production traffic doesn't have that shape, and the gap between the two is where quality regressions live, concentrated in segments your offline eval was never designed to surface.
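A minimal sketch of how an aggregate number can hide a segment-level regression, assuming you can tag each cheap-path request with a segment and a pass/fail judgment. The segment names and counts below are invented for illustration and are not data from the team described above.

```python
from collections import defaultdict


def aggregate_quality(records):
    """records: list of (segment, answer_ok) pairs for cheap-path requests."""
    return sum(1 for _, ok in records if ok) / len(records)


def quality_by_segment(records):
    """Same records, broken out per segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [ok_count, total]
    for segment, ok in records:
        counts[segment][1] += 1
        if ok:
            counts[segment][0] += 1
    return {seg: ok_count / total for seg, (ok_count, total) in counts.items()}


# Hypothetical traffic: 95 interactive requests, 5 scripted bulk API requests.
records = (
    [("interactive", True)] * 94
    + [("interactive", False)] * 1
    + [("bulk_api", True)] * 1
    + [("bulk_api", False)] * 4
)

overall = aggregate_quality(records)      # 0.95
per_segment = quality_by_segment(records)  # bulk_api: 0.2
```

With these made-up counts the overall pass rate is 95%, while the small `bulk_api` segment passes only 20% of the time. A headline metric computed over the whole set would not flag it.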