LLM Model Routing Is Market Segmentation Disguised As A Cost Optimization
The cost dashboard makes the case for itself. Sixty percent of traffic is "easy," a quick eval shows the smaller model lands within a couple of points on the global accuracy metric, and the routing layer ships behind a feature flag the same week. The graph bends. Finance is happy. The team moves on.
What nobody tracks is that the customer who hit the cheap path on Tuesday afternoon and the expensive path on Wednesday morning is now using two different products. The two models fail differently. They format differently. They refuse different things. They handle ambiguity, follow-up questions, and partial inputs with different defaults. From the customer's seat, the assistant developed amnesia overnight and nobody can tell them why — because internally, the change was filed as a finops win, not a product release.
This is the most common pattern I see in teams that ship LLM routing without owning the consequence. The layer is presented as an infrastructure optimization — same product, cheaper backend — but it is actually a market-segmentation decision being made by whichever cluster of routing rules happens to fire on the customer's input that minute. The team that ships routing without owning the segmentation is shipping a product whose identity is whichever model the cost graph picked on Tuesday.
The aggregate metric is averaging over the shape that matters
The pitch for a router is almost always a single number: "the small model is 94% as accurate as the big one on our eval." That number is the average of a distribution. Distributions have shapes. Two distributions can have the same mean and produce wildly different user experiences.
The shape that breaks teams looks like this. The easy queries (short, well-formed, common) pass on both models and contribute almost nothing to the gap. The hard queries fail on both models. Both of those segments are inert in your eval. The gap is concentrated in a middle band where the bigger model gets it right and the smaller one is plausibly wrong: the kind of wrong where the output looks reasonable, the format is right, and a non-expert reader cannot tell. That middle band is often 10–20% of traffic, and the customers who live in it are not randomly distributed. They are concentrated in a segment: the long-tail tenant, the non-English speaker, the user with the unusual schema, the workflow your design partners did not anticipate.
When the global accuracy delta is two points, those two points are the segment-weighted average of "no change" plus "no change" plus "this 15% of users degraded by about 13 points" (0.15 × 13 ≈ 2). The aggregate metric was specifically engineered, by averaging, to wash out the only signal you needed. This is why teams keep being surprised when the routing change ships clean in eval and a specific cohort starts churning a month later. The cohort was never visible in the metric.
The discipline that catches this is not "more eval data." It is per-segment eval slicing — at minimum by tenant size, locale, and a coarse task-type partition — with a contract that no segment regresses by more than a defined budget regardless of what the global average does. If you cannot name the segments your eval is sliced on before you ship, you are not ready to ship the router.
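A per-segment gate can be very small. The sketch below is one way to write it, not a reference implementation: the record fields (`segment`, `big_correct`, `small_correct`), the segment names, the 5-point budget, and the synthetic eval set are all assumptions made for illustration. The data is constructed so that the global delta is two points while one segment loses roughly thirteen.

```python
from collections import defaultdict

def accuracy_by_segment(records, model):
    """Return {segment: accuracy} for one model ('big' or 'small')."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["segment"]] += 1
        hits[r["segment"]] += int(r[f"{model}_correct"])
    return {seg: hits[seg] / totals[seg] for seg in totals}

def global_delta(records):
    """Accuracy gap (big minus small) over the whole eval set."""
    n = len(records)
    big = sum(r["big_correct"] for r in records) / n
    small = sum(r["small_correct"] for r in records) / n
    return big - small

def segments_over_budget(records, budget=0.05):
    """Segments where the small model trails the big one by more than `budget`."""
    big = accuracy_by_segment(records, "big")
    small = accuracy_by_segment(records, "small")
    gaps = {seg: big[seg] - small[seg] for seg in big}
    return {seg: gap for seg, gap in gaps.items() if gap > budget}

def make(segment, n, big_hits, small_hits):
    """Build n synthetic eval records with the given number of correct answers."""
    return [
        {"segment": segment, "big_correct": i < big_hits, "small_correct": i < small_hits}
        for i in range(n)
    ]

# Hypothetical eval set: 85 mainstream queries, 15 long-tail queries.
records = make("mainstream", 85, 80, 80) + make("long_tail", 15, 12, 10)

print(f"global delta: {global_delta(records):.3f}")
for seg, gap in segments_over_budget(records).items():
    print(f"over budget: {seg} regresses by {gap:.3f}")
```

The global number would pass most ship reviews. The gate fails anyway, because it checks each segment against the budget instead of trusting the average.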
Two failure modes are not the same as one failure mode at half the rate
The second silent cost of routing is that you are not just adding error to a system; you are adding a second kind of error, with its own profile, that the team's debugging muscle is not trained on.
A frontier model and a small model are not the same model with different accuracy knobs. They have different priors, different refusal surfaces, different tokenizer quirks, different defaults for ambiguous inputs, different propensities to hallucinate names and numbers, different formatting habits, different tool-calling reliability, and different sensitivity to prompt ordering. When you route across them, every customer interaction is now drawn from a two-component mixture. The support team sees one customer report "the assistant suddenly started inventing API endpoints" and a different customer report "the assistant suddenly started refusing to answer," and both reports trace back to the same release.
The team's mental model — built up in the months before the router shipped — is that the assistant has a personality and a failure profile. After the router, the assistant has two personalities and the customer's session is sampling between them. The on-call rotation now needs runbooks for two failure modes. The eval suite now needs to run twice. The prompt that worked under the big model regresses under the small one in a way that the team's "fix the prompt" reflex makes worse, because tightening the instruction for the small model often tips the big model into over-refusal.
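How often does a single session actually see both personalities? A rough estimate, under a loudly simplifying assumption: each turn is routed independently, with a fixed probability of landing on the small model. Real routers key on the input, so turns within a session are correlated and the true figure will differ. The function name and the 60% share (borrowed from the "easy traffic" figure above) are illustrative.

```python
def prob_session_sees_both(p_small: float, turns: int) -> float:
    """Probability that a session of `turns` requests hits both models.

    Assumes independent routing per turn with P(small model) = p_small.
    The session sees both unless every turn went small or every turn went big.
    """
    if not 0.0 <= p_small <= 1.0:
        raise ValueError("p_small must be between 0 and 1")
    if turns < 1:
        raise ValueError("turns must be at least 1")
    return 1.0 - p_small**turns - (1.0 - p_small) ** turns

for turns in (1, 2, 5, 10):
    print(turns, f"{prob_session_sees_both(0.6, turns):.3f}")
```

Under these assumptions, a five-turn session with 60% of turns on the cheap path mixes both models about nine times out of ten. Mixed sessions are the normal case, not the edge case.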
The architectural consequence is that the routing boundary is not a free abstraction — it imports a real complexity tax onto every downstream system that touches the assistant's output. The team that prices the router only by inference cost is missing the line items for: doubled eval cost, doubled debug cost, doubled prompt-maintenance cost, and a permanent obligation to keep two model behaviors in approximate alignment as both vendors release updates on their own schedules.
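Pricing the router honestly means putting those line items in the same ledger as the savings. The sketch below is only a way to structure that arithmetic. The class name, the field names, and every number are placeholders, not measurements.

```python
from dataclasses import dataclass

@dataclass
class RouterLedger:
    """Monthly cost ledger for a routing layer (all fields in the same currency)."""
    inference_savings: float         # spend avoided by sending traffic to the small model
    extra_eval: float                # running the eval suite against a second model
    extra_debug: float               # on-call and support time for a second failure profile
    extra_prompt_maintenance: float  # keeping prompts working on both models
    alignment_upkeep: float          # re-checking behavior after each vendor update

    def net(self) -> float:
        """Savings left after the complexity tax is paid."""
        tax = (
            self.extra_eval
            + self.extra_debug
            + self.extra_prompt_maintenance
            + self.alignment_upkeep
        )
        return self.inference_savings - tax

# Placeholder figures purely to show the shape of the calculation.
ledger = RouterLedger(
    inference_savings=10_000.0,
    extra_eval=1_500.0,
    extra_debug=2_500.0,
    extra_prompt_maintenance=1_000.0,
    alignment_upkeep=2_000.0,
)
print(f"net monthly benefit: {ledger.net():.2f}")
```

The point of writing it down is that the first field is the only one the cost dashboard shows.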
