
LLM Model Routing Is Market Segmentation Disguised As A Cost Optimization

10 min read
Tian Pan
Software Engineer

The cost dashboard makes the case for itself. Sixty percent of traffic is "easy," a quick eval shows the smaller model lands within a couple of points on the global accuracy metric, and the routing layer ships behind a feature flag the same week. The graph bends. Finance is happy. The team moves on.

What nobody tracks is that the customer who hit the cheap path on Tuesday afternoon and the expensive path on Wednesday morning is now using two different products. The two models fail differently. They format differently. They refuse different things. They handle ambiguity, follow-up questions, and partial inputs with different defaults. From the customer's seat, the assistant developed amnesia overnight and nobody can tell them why — because internally, the change was filed as a finops win, not a product release.

This is the most common pattern I see in teams that ship LLM routing without owning the consequence. The layer is presented as an infrastructure optimization — same product, cheaper backend — but it is actually a market-segmentation decision being made by whichever cluster of routing rules happens to fire on the customer's input that minute. The team that ships routing without owning the segmentation is shipping a product whose identity is whichever model the cost graph picked on Tuesday.

The aggregate metric is averaging over the shape that matters

The pitch for a router is almost always a single number: "the small model is 94% as accurate as the big one on our eval." That number is the average of a distribution. Distributions have shapes. Two distributions can have the same mean and produce wildly different user experiences.

The shape that breaks teams looks like this. The easy queries (short, well-formed, common) pass on both models and contribute almost nothing to the gap. The hard queries fail on both models. Both of those segments are inert in your eval. The gap is concentrated in a middle band where the bigger model gets it right and the smaller one is plausibly wrong: the kind of wrong where the output looks reasonable, the format is right, and a non-expert reader cannot tell. That middle band is often 10–20% of traffic, and the customers who live in it are not randomly distributed. They are concentrated in a segment: the long-tail tenant, the non-English speaker, the user with the unusual schema, the workflow your design partners did not anticipate.

When the global accuracy delta is two points, that two points is the segment-weighted average of "no change" plus "no change" plus "this 10% of users degraded by 20 points." The aggregate metric was specifically engineered, by averaging, to wash out the only signal you needed. This is why teams keep being surprised when the routing change ships clean in eval and a specific cohort starts churning a month later. The cohort was never visible in the metric.
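
To make the arithmetic concrete, here it is in miniature (the segment shares and deltas below are illustrative, not from any real eval):

```python
# Illustrative only: three traffic segments, weighted by share.
# Two are inert in the eval; one carries the whole regression.
segments = [
    {"name": "easy",        "share": 0.60, "delta": 0.0},    # passes on both models
    {"name": "hard",        "share": 0.30, "delta": 0.0},    # fails on both models
    {"name": "middle-band", "share": 0.10, "delta": -20.0},  # big model wins here
]

global_delta = sum(s["share"] * s["delta"] for s in segments)
print(global_delta)  # -2.0: a "two point" global gap hiding a 20-point regression
```

The global number really is two points. The segment carrying the entire regression really is invisible in it.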

The discipline that catches this is not "more eval data." It is per-segment eval slicing — at minimum by tenant size, locale, and a coarse task-type partition — with a contract that no segment regresses by more than a defined budget regardless of what the global average does. If you cannot name the segments your eval is sliced on before you ship, you are not ready to ship the router.
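
A minimal sketch of that contract as a release gate, assuming per-example eval results already tagged with segment labels; the field names and the 3-point budget are placeholders, not a prescription:

```python
from collections import defaultdict

# Hypothetical eval records, one per example, scored 0-100 under each model.
results = [
    {"tenant_size": "smb", "locale": "en", "task": "lookup",
     "score_big": 92.0, "score_small": 90.5},
    # ... one record per eval example ...
]

MAX_SEGMENT_REGRESSION = 3.0  # budget: no segment may drop more than 3 points

def segment_deltas(results, keys=("tenant_size", "locale", "task")):
    """Mean score delta (small minus big) for every value of every slicing key."""
    buckets = defaultdict(list)
    for r in results:
        for key in keys:
            buckets[(key, r[key])].append(r["score_small"] - r["score_big"])
    return {seg: sum(ds) / len(ds) for seg, ds in buckets.items()}

def gate(results):
    """Block the rollout if any segment exceeds its budget, regardless of
    what the global average does."""
    violations = {seg: round(d, 1) for seg, d in segment_deltas(results).items()
                  if d < -MAX_SEGMENT_REGRESSION}
    if violations:
        raise SystemExit(f"routing rollout blocked, budget exceeded: {violations}")
```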

Two failure modes are not the same as one failure mode at half the rate

The second silent cost of routing is that you are not just adding error to a system; you are adding a second kind of error, with its own profile, that the team's debugging muscle is not trained on.

A frontier model and a small model are not the same model with different accuracy knobs. They have different priors, different refusal surfaces, different tokenizer quirks, different defaults for ambiguous inputs, different propensities to hallucinate names and numbers, different formatting habits, different tool-calling reliability, and different sensitivity to prompt ordering. When you route across them, every customer interaction is now drawn from a two-component mixture. The support team sees one customer report "the assistant suddenly started inventing API endpoints" and a different customer report "the assistant suddenly started refusing to answer," and those two reports are the same release.

The team's mental model — built up in the months before the router shipped — is that the assistant has a personality and a failure profile. After the router, the assistant has two personalities and the customer's session is sampling between them. The on-call rotation now needs runbooks for two failure modes. The eval suite now needs to run twice. The prompt that worked under the big model regresses under the small one in a way that the team's "fix the prompt" reflex makes worse, because tightening the instruction for the small model often tips the big model into over-refusal.

The architectural consequence is that the routing boundary is not a free abstraction — it imports a real complexity tax onto every downstream system that touches the assistant's output. The team that prices the router only by inference cost is missing the line items for: doubled eval cost, doubled debug cost, doubled prompt-maintenance cost, and a permanent obligation to keep two model behaviors in approximate alignment as both vendors release updates on their own schedules.

The user does not care about your cost graph; they care about predictability

Reliability research has a result that the routing literature tends to elide: at moderate accuracy levels, the user's ability to predict when to trust the system matters more than the system's average accuracy. A 95% model whose 5% failures cluster predictably (always on inputs the user can recognize as out-of-scope) is a better product than a 96% model whose 4% failures are scattered uniformly across input types.

A router that flips a customer between two models without their knowledge is the worst possible outcome on this axis. The customer's mental model is built from a history of interactions. When that history is drawn from two distributions, the customer cannot form a stable expectation. They start asking the same question twice to see if the answer changes. They lose trust in answers that are correct, because they have no way to distinguish "this is right" from "this is the model that gets things right." They begin to suspect the product is degrading even when measured quality is improving — because their experiential variance is up.
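
One way to make "experiential variance is up" concrete: a routed customer's per-interaction quality is drawn from a mixture, and by the law of total variance the mixture picks up a between-model term on top of each model's own noise. A toy computation with invented numbers:

```python
# Toy numbers: quality score per interaction, as the customer experiences it.
# Model A (big): mean 0.90, std 0.05. Model B (small): mean 0.80, std 0.05.
# The router sends this customer to A half the time.
w_a, mu_a, var_a = 0.5, 0.90, 0.05 ** 2
w_b, mu_b, var_b = 0.5, 0.80, 0.05 ** 2

mu_mix = w_a * mu_a + w_b * mu_b
# Law of total variance: within-model noise plus between-model spread.
var_mix = (w_a * var_a + w_b * var_b) \
        + (w_a * (mu_a - mu_mix) ** 2 + w_b * (mu_b - mu_mix) ** 2)

print(mu_mix)          # 0.85 -- the blended average looks fine
print(var_mix ** 0.5)  # ~0.071 -- 41% more spread than either model alone
```

The customer gets the better model half the time, the blended mean beats the small model on its own, and the experience is still noisier than either model by itself.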

This is the failure mode that finance never sees, because nobody bills variance. It shows up six weeks later as quiet churn from the cohort that lived in the routing seam, and the team attributes it to "the AI just isn't reliable enough yet" rather than "we shipped a market-segmentation decision behind a config flag and the customer noticed."

The right product-level move is to pin a customer's traffic to a single model class within a session, and ideally within a billing window. The cost graph takes a small hit; the customer's experiential variance does not balloon. If routing has to vary across the cohort, the variation should track something the customer can recognize — task type, complexity tier they explicitly opted into — not an internal feature of the input that's invisible to them.
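
A sketch of the pinning move, assuming a routing layer you control; the function, the hash-based assignment, and the monthly window are illustrative choices, not a known API:

```python
import hashlib
from datetime import date

def pinned_model(customer_id: str, candidates: list[str],
                 window: str | None = None) -> str:
    """Pin a customer to one model class for a whole billing window, so
    their experience is drawn from a single distribution."""
    window = window or date.today().strftime("%Y-%m")  # default: calendar month
    key = f"{customer_id}:{window}".encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return candidates[bucket % len(candidates)]

# Every request this tenant makes within the month lands on the same model.
model = pinned_model("tenant-4821", ["big-model", "small-model"])
```

Stateless hashing keeps the pin consistent across replicas without a lookup table; swapping the window string for a session ID gives session-level pinning instead.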

Tier promotion is a release event, not a routing decision

The third miss is that customers do cross tiers. A growing tenant's traffic profile shifts. A product team ships a new feature that produces a query shape the router classifies into a different bucket. A model vendor releases an update and your routing thresholds shift. Each of these events promotes or demotes individual customers across the routing boundary, and each crossing is, from the customer's perspective, a release of a different product into their account — without release notes, without a changelog, without a way to roll back.

Because the crossings are continuous and per-tenant, no single event triggers any of the team's existing release-management muscle. There is no PR review, no canary, no rollback plan, no support enablement. The first signal the team gets is a confused ticket — "did something change last Thursday?" — and the support engineer has no way to answer because the routing log isn't joined to the support tooling, and the change wasn't a deploy.

The discipline that catches this is treating tier crossings as first-class telemetry. Every promotion or demotion event — from cheap tier to expensive, or vice versa — gets logged with the customer ID, the trigger (traffic-shape change, threshold update, vendor release), and a timestamp the support team can correlate against tickets. When a customer asks "did something change," the support engineer can answer "yes, on the 14th your traffic crossed into the higher tier because your typical input length doubled," and that is a wildly better support experience than "I'll have to check with engineering."
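
A sketch of what that event could look like, assuming a structured log the support tooling can query by customer ID; the schema is invented for illustration:

```python
import json
import time

def log_tier_crossing(customer_id: str, from_tier: str, to_tier: str,
                      trigger: str, detail: str) -> None:
    """Emit a tier-crossing event that support can join against tickets.
    Triggers: traffic-shape change, threshold update, vendor release."""
    event = {
        "event": "model_tier_crossing",
        "customer_id": customer_id,
        "from_tier": from_tier,
        "to_tier": to_tier,
        "trigger": trigger,  # machine-readable cause
        "detail": detail,    # human-readable, support-facing explanation
        "ts": time.time(),
    }
    print(json.dumps(event))  # stand-in for your real event pipeline

log_tier_crossing("tenant-4821", "small", "big", "traffic-shape-change",
                  "median input length doubled over the trailing 7 days")
```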

The deeper move is to write a cost-quality contract per tier. The router is allowed to optimize cost only within a bounded quality envelope, defined per segment and per task type. When a customer's traffic shape would push them across the tier boundary, the system either pins them to the previous tier (preserving experience) or treats the crossing as an explicit migration with a notification. Routing then stops being "whichever model is cheapest right now" and starts being "the cheapest model that meets the contract this customer is on" — which is the actual product semantics every other tiered SaaS in the world has converged on.
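
A sketch of that inversion, with every model name, price, and quality floor hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    cost_per_1k: float           # hypothetical pricing unit
    quality: dict[str, float]    # measured eval score per segment

@dataclass
class TierContract:
    min_quality: dict[str, float]  # quality floor per segment the tier promises

def route(models: list[Model], contract: TierContract, segment: str) -> Model:
    """Cheapest model that meets this customer's contract for this segment.
    Cost is optimized only inside the quality envelope."""
    eligible = [m for m in models
                if m.quality.get(segment, 0.0) >= contract.min_quality[segment]]
    if not eligible:
        raise LookupError(f"no model meets the {segment} contract; "
                          "escalate explicitly, do not silently downgrade")
    return min(eligible, key=lambda m: m.cost_per_1k)
```

The LookupError branch is the point: when no model meets the contract, the right behavior is an explicit escalation, not a silent demotion to whatever is cheapest.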

The real question is who owns the segmentation

If you take only one thing from this: shipping a router is not a finops decision and pretending it is one is the source of every downstream surprise. It is a product decision about who gets which version of your assistant, made under cost pressure, and disguised by an aggregate metric that was specifically constructed to hide the segmentation from the people approving the change.

The fix is not to stop routing. Routing is real and the cost case for it is real: frontier-to-cheap price spreads of 30x to 300x are not going away, and a single-model strategy at scale is its own kind of malpractice. The fix is to put the routing boundary inside the same governance as any other release: per-segment quality budgets, per-tenant pinning, tier-crossing telemetry, support-visible reasons, and a product owner whose job description includes "the customer experiences one product, not several."

The teams that get this right end up with a routing layer that looks, from the outside, like a tiered product offering — because that is what it is, and naming it as such is the only way to keep the product coherent. The teams that get this wrong end up with a cost graph that bends down, an experience graph that bends down with it, and no internal vocabulary to connect the two until the cohort that lived in the seam has already left.
