
The 20% Problem in Model Routing: When Cost Optimization Creates Second-Class Users

9 min read
Tian Pan
Software Engineer

Your routing system works exactly as designed. Eighty percent of queries go to the cheap model; twenty percent escalate to the capable one. Latency is down, costs dropped by 60%, and leadership is happy. Then someone pulls the data by user segment, and you see it: users writing in non-native English are escalated at half the rate of native speakers, and their satisfaction scores are 18 points lower. The routing system treated the query complexity signal as neutral, but it wasn't — it was a proxy for language proficiency, and you've been giving a systematically worse product to a specific group of users for months.

This is the 20% problem. It's not a bug in the router. It's an emergent property of any cost-optimized routing system that nobody measures until it's too late.

How Model Routing Works (and Why It's Genuinely Useful)

The core idea is simple: most queries don't need your most capable model. A chatbot answering "what are your business hours?" doesn't need GPT-4-class reasoning. Route it to a smaller, cheaper model and reserve the expensive one for queries that actually benefit from it.
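
To make the mechanism concrete, here's a minimal sketch of a threshold router. Everything in it is an illustrative placeholder: the word-count scorer, the model names, the 0.2 threshold. Production routers learn the scoring function from data rather than hard-coding it.

```python
# Minimal sketch of a two-tier router. The word-count scorer, model
# names, and threshold are illustrative placeholders; real routers
# learn the scoring function rather than hard-coding it.

def complexity_score(query: str) -> float:
    # Toy proxy: longer queries count as "harder".
    return min(len(query.split()) / 100, 1.0)

def route(query: str, threshold: float = 0.2) -> str:
    if complexity_score(query) < threshold:
        return "cheap-model"      # handles the ~80%
    return "capable-model"        # the escalated ~20%

print(route("What are your business hours?"))  # -> cheap-model
```

The whole appeal is that the scorer runs before any expensive model call, which is also the root of everything that follows.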

Research has validated this extensively. FrugalGPT, one of the earliest systematic studies, showed you can match GPT-4 performance with up to 98% cost reduction by combining cascading, prompt adaptation, and caching. RouteLLM, which learns routing decisions from human preference data, demonstrated 85% cost reduction on chat benchmarks while retaining 95% of the quality. A production customer support system that switched from a single premium model to a tiered routing setup cut its monthly LLM spend from $42,000 to $18,000.

These numbers are real, and routing genuinely deserves to be part of your inference stack. The problem isn't routing itself — it's what happens when you measure only aggregate quality and aggregate cost.

Why the Escalated 20% Isn't Randomly Distributed

When a routing system escalates a query, it's making a judgment: this query is too hard for the cheap model. That judgment is almost always based on some proxy for complexity — query length, token entropy, embedding distance from training examples, or a fine-tuned router that learned from human preference data.

Each of these signals correlates with things that aren't just complexity. Query length correlates with how much context the user had to provide to get a coherent question. Embedding distance from familiar examples correlates with how unusual the framing is. Even router models trained on human preference data inherit the demographic distribution of whoever rated those preferences — and Chatbot Arena, the dataset RouteLLM used, skews heavily toward English-speaking, technically sophisticated users.
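
Here's roughly what the first two of those signals look like in code. The whitespace tokenization and the sample queries are mine, not taken from any of the papers above; the point is that phrasing alone, not difficulty, moves the numbers.

```python
import math
from collections import Counter

# Sketch of two common proxy signals. Both are computable without any
# model call, which is exactly why routers lean on them, and why they
# pick up language proficiency as a side effect.

def length_signal(query: str) -> int:
    # Correlates with how much context the user had to supply,
    # not just with intrinsic difficulty.
    return len(query.split())

def token_entropy(query: str) -> float:
    # Shannon entropy of the token distribution; hedged, second-language
    # phrasing tends to produce a flatter, higher-entropy distribution.
    tokens = query.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

native = "How do I reset my password?"
second_language = ("I am wondering if maybe you could please explain to me "
                   "the way that I might be able to reset the password for "
                   "my account")

for q in (native, second_language):
    print(length_signal(q), round(token_entropy(q), 2))
# 6 2.58  vs  25 4.4: the same intent scores very differently
```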

The Hybrid LLM paper (ICLR 2024) identified that roughly 20% of queries benefit meaningfully from expensive models — but the specific 20% depends on which pairs of models you're comparing and which workloads you're serving. That percentage clusters by query type in ways that often correlate with who is asking.

Consider what this looks like concretely:

  • A user who formulates questions concisely in clear English may consistently trigger lower complexity scores, landing on the cheap model even for genuinely hard questions.
  • A user writing in their second language, with longer and more hedged phrasing, may score as more complex under a length-based signal and escalate more often; under a fluency-based signal, the same user may score as less complex and be under-escalated, as in the opening scenario. Either way, the score is tracking proficiency, not difficulty.
  • A user asking about topics underrepresented in the cheap model's training data (non-Western legal systems, minority languages, domain-specific technical jargon outside of software) may get confidently wrong answers from the cheap model because it has no mechanism to signal its own ignorance.

None of this is intentional. But intent doesn't determine impact.

The Disparate Impact You're Not Measuring

Research on LLM underperformance makes the problem precise. A 2024 study on targeted LLM underperformance found that model quality degradation is not uniformly distributed across users — it's most severe for users with lower English proficiency, less formal education, and non-US origins. At intersections (non-native English speaker with less formal education, for example), the gaps are larger than either factor alone.

When you layer routing on top of a model that already has these quality gaps, the effects compound in two ways.

First, the cheap model is almost always worse for underrepresented groups than the expensive model is. It saw less training data covering their use cases, was likely less carefully fine-tuned for fairness, and its calibration was validated on benchmarks that skew toward standard English and Western contexts.

Second, the routing decision that should catch "this query is too hard for the cheap model" fails more often for these users. The cheap model's responses to difficult non-Western or non-English queries may sound fluent and confident — because that's what language models do. The router, checking confidence signals, sees no red flag. The user gets a wrong answer they have no reason to doubt.
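
Here's a sketch of that failure mode. The cheap model and its logprob values are invented stand-ins; the point is that a mean-token-logprob gate measures fluency, not correctness, and fluency is exactly what the cheap model doesn't lack.

```python
# Sketch of a confidence-gated cascade. cheap_model is a hypothetical
# stand-in and the logprob values are invented; a confidently wrong
# answer clears the gate just as easily as a correct one.

def cheap_model(query: str) -> tuple[str, float]:
    """Returns (answer, mean token logprob) from the cheap model."""
    if "non-Western" in query:
        # Fluent prose about a topic the model barely saw in training:
        # wrong, but generated with high confidence.
        return "A confident, fluent, wrong answer.", -0.35
    return "A correct answer about a familiar topic.", -0.30

def route_with_confidence_gate(query: str, logprob_floor: float = -0.8) -> str:
    answer, mean_logprob = cheap_model(query)
    if mean_logprob >= logprob_floor:
        return answer  # the router sees no red flag in either case
    return "ESCALATE to the capable model"

print(route_with_confidence_gate("Explain this non-Western inheritance rule"))
# -> the wrong answer, never escalated
```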

The aggregate quality number looks fine. You're meeting your SLA. But a specific group of users is getting a systematically worse product, and your monitoring isn't designed to see it.
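
The fix starts with measurement. Here's a minimal sketch of the per-segment breakdown most dashboards skip; the log schema and segment labels are hypothetical, and in practice the hard part is inferring segments defensibly at all.

```python
from collections import defaultdict

# Sketch of the per-segment breakdown an aggregate dashboard hides.
# The log schema and segment labels are hypothetical.

def per_segment_report(logs: list[dict]) -> dict:
    totals = defaultdict(lambda: {"n": 0, "escalated": 0, "csat": 0.0})
    for row in logs:
        seg = totals[row["segment"]]
        seg["n"] += 1
        seg["escalated"] += row["escalated"]
        seg["csat"] += row["csat"]
    return {
        name: {
            "escalation_rate": round(seg["escalated"] / seg["n"], 2),
            "avg_csat": round(seg["csat"] / seg["n"], 1),
        }
        for name, seg in totals.items()
    }

# Toy records: in practice these rows would come from joining routing
# logs with satisfaction surveys.
logs = [
    {"segment": "native_english", "escalated": 1, "csat": 82},
    {"segment": "native_english", "escalated": 0, "csat": 80},
    {"segment": "non_native_english", "escalated": 0, "csat": 61},
    {"segment": "non_native_english", "escalated": 0, "csat": 63},
]
print(per_segment_report(logs))
# {'native_english': {'escalation_rate': 0.5, 'avg_csat': 81.0},
#  'non_native_english': {'escalation_rate': 0.0, 'avg_csat': 62.0}}
```

If the rows diverge the way these toy numbers do, you have the 20% problem, and it was always one group-by away from being visible.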
