When the Cheap Model Is the Expensive One
A finance team flags that the LLM bill is up 18% this quarter. An engineer pulls the usage dashboard, sees that 70% of traffic now hits the budget model instead of the frontier one, and is briefly confused: the routing change was supposed to cut spend. The per-token price went down exactly as the spreadsheet promised. The bill went up anyway.
This is not a billing error. It is the most common way a cost optimization quietly inverts itself. The spreadsheet that justified the downgrade priced one thing — tokens — and the production system pays for something else entirely: finished tasks. A weaker model does not just produce cheaper tokens. It changes the behavior of every component around it, and those second-order effects land on the same invoice.
The trap is seductive because the first-order math is genuinely correct. A budget model can be 10x to 30x cheaper per token than a frontier model, and for a large fraction of traffic it returns an answer that is indistinguishable in quality. The mistake is not the routing decision. The mistake is measuring the routing decision at the wrong boundary.
Cost Per Token Is the Wrong Unit
Every model comparison page quotes dollars per million tokens. It is the wrong denominator. Your users do not buy tokens; they buy resolved requests. The number that belongs on the dashboard is cost per successful task — total spend across every attempt, retry, fallback, and tool call, divided by the count of requests that actually ended in a usable answer.
The two numbers diverge fast. Consider a classification step. The frontier model costs three cents per call and succeeds 95% of the time. The budget model costs one cent per call and succeeds 75% of the time. On sticker price, the budget model is a clear 3x win. Per success, the frontier model costs about 3.2 cents per resolved task and the budget model about 1.3 cents. That is still cheaper, but the gap has shrunk from 3x to roughly 2.4x, and that is before counting what happens to the 25% that failed.
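The classification example can be checked in a few lines. This is a minimal sketch that assumes a failed call costs the same as a successful one and ignores whatever the failures trigger afterwards; the function name is illustrative, not from any library.

```python
def cost_per_success(cost_per_call_cents: float, success_rate: float) -> float:
    """Expected spend per successful call, ignoring what failures trigger later."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    # On average 1 / success_rate calls are paid for each call that succeeds.
    return cost_per_call_cents / success_rate


frontier = cost_per_success(3.0, 0.95)
budget = cost_per_success(1.0, 0.75)
print(f"frontier: {frontier:.2f} cents per success")  # frontier: 3.16 cents per success
print(f"budget:   {budget:.2f} cents per success")    # budget:   1.33 cents per success
print(f"gap:      {frontier / budget:.2f}x")          # gap:      2.37x
```

The sticker-price gap was 3x. Dividing by success rate alone already takes a fifth of it away.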
Those failures do not vanish. They retry, they fall back, they escalate to a human, or they ship a wrong answer that someone downstream has to catch. Each of those outcomes has a price, and none of it appears in the per-token comparison. On call price alone, a model with a 60% success rate still looks cheaper per success than a model with a 95% success rate at triple the price. Once the retries, frontier fallbacks, and human reviews triggered by that 40% are billed to it, it can easily cost more. The headline number told you the opposite.
If your observability stack reports tokens-by-model and cost-by-model but cannot report cost-by-resolved-task, you are flying on the instrument that is most likely to be wrong.
Retries and Fallbacks Are Hidden Line Items
The single largest amplifier is the retry. A weaker model fails structured-output validation more often — a malformed JSON object, a missing required field, an enum value it invented. Most production harnesses respond to that automatically: re-run the call. Each retry is a full inference pass, the entire prompt sent again, and in practice silent retries can multiply token usage by two to five times with no change in user traffic at all.
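To get a feel for how failure rate turns into extra inference passes, here is a rough sketch. It assumes every attempt fails independently with the same probability and that the harness stops at the first success or at a retry cap. Both are simplifications, and the function name is hypothetical.

```python
def expected_attempts(fail_rate: float, max_attempts: int) -> float:
    """Average number of inference passes per request.

    Assumes each attempt fails independently with probability `fail_rate`
    and the harness stops at the first success or after `max_attempts`.
    """
    if not 0 <= fail_rate <= 1:
        raise ValueError("fail_rate must be in [0, 1]")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Attempt k+1 only happens if the first k attempts all failed.
    return sum(fail_rate ** k for k in range(max_attempts))


for rate in (0.05, 0.25, 0.5):
    print(f"fail_rate={rate:.2f}: {expected_attempts(rate, 5):.2f} passes per request")
```

Every extra pass resends the whole prompt, so under these assumptions the multiplier on passes is also, roughly, the multiplier on input tokens.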
This cost is invisible by construction. The retry happens inside the harness, the dashboard aggregates tokens by model, and the extra spend gets attributed to "the budget model" as if it were normal load. Nobody sees a line item labeled "retries caused by downgrade." They see the budget model's token count creep up and assume the feature is just getting more use.
Fallback chains hide an even sharper version. A well-built system routes the cheap model first, validates the output, and on failure escalates to the frontier model. That is correct engineering — but it means every failed budget-model call costs you twice: once for the cheap attempt that didn't work, plus the full price of the frontier call that did. Route 70% of traffic to a budget model with a 20% failure rate, and 14% of all traffic now pays both prices. The budget tier stopped being a cost center you can reason about and became a tax on the frontier tier.
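The 14% figure, and what it does to average spend, can be reproduced with a short calculation. The one-cent and three-cent call prices are borrowed from the classification example earlier and are assumptions here, as is the simplification that every frontier fallback succeeds.

```python
def blended_cost_per_request(budget_share, budget_fail_rate, budget_cost, frontier_cost):
    """Average spend per request when failed budget calls escalate to the frontier model.

    Returns (share of traffic paying both prices, average cost per request).
    """
    double_paid = budget_share * budget_fail_rate   # cheap attempt plus frontier rescue
    budget_only = budget_share - double_paid        # cheap attempt was enough
    frontier_direct = 1.0 - budget_share            # never touched the budget model
    cost = (
        budget_only * budget_cost
        + double_paid * (budget_cost + frontier_cost)
        + frontier_direct * frontier_cost
    )
    return double_paid, cost


double_paid, actual = blended_cost_per_request(0.70, 0.20, 1.0, 3.0)
spreadsheet = 0.70 * 1.0 + 0.30 * 3.0  # what the per-token math predicted

print(f"traffic paying both prices: {double_paid:.0%}")   # 14%
print(f"spreadsheet estimate: {spreadsheet:.2f} cents")   # 1.60 cents
print(f"with fallbacks:       {actual:.2f} cents")        # 2.02 cents
```

Under these assumed prices the routed system is still cheaper than sending everything to the frontier model at 3 cents, but it costs about a quarter more than the spreadsheet said it would.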
The fix is not to remove retries or fallbacks. It is to attribute their cost to the routing decision that caused them. Tag every retry and every fallback with the model that triggered it, and the budget model's true cost per task stops hiding inside aggregate token counts.
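One way to make that attribution concrete is to record, on every call, which model's failure caused it. The sketch below is illustrative only: the `CallRecord` shape, its field names, and the sample costs are assumptions, not the schema of any particular observability tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallRecord:
    task_id: str
    model: str                       # model that served this call
    cost_cents: float
    kind: str                        # "first_attempt", "retry", or "fallback"
    caused_by: Optional[str] = None  # model whose failure triggered this call


def cost_by_serving_model(records):
    """What a typical dashboard shows: spend grouped by the model that ran."""
    totals = defaultdict(float)
    for r in records:
        totals[r.model] += r.cost_cents
    return dict(totals)


def cost_by_responsible_model(records):
    """Spend grouped by the model whose routing decision caused the call."""
    totals = defaultdict(float)
    for r in records:
        totals[r.caused_by or r.model] += r.cost_cents
    return dict(totals)


records = [
    CallRecord("t1", "budget", 1.0, "first_attempt"),
    CallRecord("t1", "budget", 1.0, "retry", caused_by="budget"),
    CallRecord("t1", "frontier", 3.0, "fallback", caused_by="budget"),
    CallRecord("t2", "frontier", 3.0, "first_attempt"),
]

print(cost_by_serving_model(records))      # {'budget': 2.0, 'frontier': 6.0}
print(cost_by_responsible_model(records))  # {'budget': 5.0, 'frontier': 3.0}
```

The same four calls tell opposite stories. Grouped by serving model, the frontier tier looks like the expensive one. Grouped by cause, most of the spend belongs to the budget route.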
The Prompt-Length Tax
Not every failure is loud enough to trip a validator. Often a weaker model degrades subtly, with vaguer answers, missed edge cases, or a tone that is slightly off, and the team's instinct is to compensate in the prompt. Add two more few-shot examples. Spell out three edge cases the model keeps missing. Append a longer system instruction about format.
Every one of those additions is permanent input cost on every single request, forever. This is the prompt-length tax: the budget model needs a heavier prompt to reach the same quality bar the frontier model cleared with a lean one. Research on over-prompting shows the pattern clearly — token cost rises linearly with each example you add, while accuracy gains flatten out after the first few. You pay more and more to claw back less and less.
Do the arithmetic at the request level. Suppose the downgrade saves 1.5 cents per call on raw generation, but propping the budget model up takes 800 extra tokens of examples and instructions on every request. At budget-model input rates that may only be a fraction of a cent, so the trade still looks fine — until you remember that the inflated prompt is also what gets resent on every retry and carried into every fallback. The prompt-length tax and the retry amplifier multiply each other. The heavier prompt makes each retry more expensive, and the weaker model makes retries more frequent.
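A small sketch shows how the two effects compound. Only the 800 extra tokens come from the example above. The input price, the base prompt size, and the average attempt count are hypothetical placeholders to be replaced with your own numbers.

```python
def input_cost_per_task_cents(prompt_tokens, price_cents_per_mtok, avg_attempts):
    """Input-token spend per task when the full prompt is resent on every attempt."""
    return prompt_tokens * price_cents_per_mtok / 1_000_000 * avg_attempts


PRICE_CENTS_PER_MTOK = 50.0  # hypothetical budget-model input price
BASE_PROMPT_TOKENS = 1200    # hypothetical lean prompt
EXTRA_TOKENS = 800           # examples and instructions added to prop the model up
AVG_ATTEMPTS = 2.5           # hypothetical average passes per task after retries

lean_once = input_cost_per_task_cents(BASE_PROMPT_TOKENS, PRICE_CENTS_PER_MTOK, 1.0)
heavy_once = input_cost_per_task_cents(
    BASE_PROMPT_TOKENS + EXTRA_TOKENS, PRICE_CENTS_PER_MTOK, 1.0
)
heavy_retried = input_cost_per_task_cents(
    BASE_PROMPT_TOKENS + EXTRA_TOKENS, PRICE_CENTS_PER_MTOK, AVG_ATTEMPTS
)

print(f"lean prompt, one pass:   {lean_once:.3f} cents")
print(f"heavy prompt, one pass:  {heavy_once:.3f} cents")
print(f"heavy prompt, retried:   {heavy_retried:.3f} cents")
print(f"combined multiplier:     {heavy_retried / lean_once:.2f}x")
```

Neither factor looks alarming alone. The longer prompt and the retry count are multiplied together, so the combined multiplier is larger than either.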
There is a context-window cost too. Examples consume room that the actual task input needs. On a model with a smaller window, a long document plus enough few-shot examples to establish the pattern may simply not fit, forcing truncation or chunking — which is its own new source of failures and its own new round of calls.
The Work Doesn't Disappear — It Moves
When a model gets weaker, the work it used to do does not evaporate. It relocates to wherever the system is softest, and those places are usually more expensive per unit than the model ever was.
It moves to downstream tools. A frontier model might resolve an ambiguous request in one shot; a budget model issues three tool calls to gather context it could have inferred, and each of those calls has its own latency and, often, its own metered cost.
It moves to human review. This is the most expensive destination by a wide margin. A few minutes of loaded engineer time dwarfs the token cost of the request under review. If a downgrade pushes even a small slice of traffic from "auto-resolved" to "needs a human glance," the labor cost can erase the entire token saving and then some. Total cost of ownership, the real number, includes incident response, the engineer-hours spent diagnosing why quality dropped, and the review queue that grew. Teams that track only API spend routinely find their true cost is two to three times what they estimated.
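How small is "a small slice"? A back-of-the-envelope break-even makes it concrete. The hourly rate and review time below are hypothetical, and the 1.5-cent saving is borrowed from the prompt-length example earlier.

```python
def breakeven_review_share(saving_per_task_cents, hourly_rate_dollars, minutes_per_review):
    """Share of tasks that can move to human review before the token saving is gone."""
    review_cost_cents = hourly_rate_dollars * 100 * minutes_per_review / 60
    return saving_per_task_cents / review_cost_cents


share = breakeven_review_share(
    saving_per_task_cents=1.5,   # token saving per task from the downgrade
    hourly_rate_dollars=100.0,   # hypothetical loaded engineer rate
    minutes_per_review=2.0,      # hypothetical time for "a human glance"
)
print(f"break-even review share: {share:.2%}")  # 0.45%
```

With these assumed inputs, the saving is wiped out once fewer than one task in two hundred needs a human to look at it.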
It moves to the user. The cheapest place to dump degraded quality is onto the person asking the question — a vaguer answer, a re-ask, an abandoned session. That cost never touches your invoice. It touches retention, and you find out about it a quarter later.
The general principle: a downgrade does not reduce the work, it reprices it. And it almost always reprices it upward, because the model was the cheapest worker in the system. Everything the work flows toward — retries, tools, humans — bills at a higher rate.
Measure at the Task Boundary, Not the Call Boundary
The discipline that prevents all of this is one rule: a model-routing change is a system change, and it needs an end-to-end cost eval before it ships. Not a per-token comparison. An eval that runs a representative sample of real traffic through both the old path and the new path, all the way to a finished outcome, and reports four numbers for each path:
- Cost per successful task — total spend including retries and fallbacks, divided by tasks that ended usable.
- Success rate — the share of tasks resolved without escalation, measured by the same eval gate production uses.
- Retry and fallback rate — how often the cheap path triggered a second attempt or a frontier escalation.
- Effective prompt size — the input tokens actually sent, including any examples added to prop the model up.
Run that, and the inverted optimization shows up before the finance team finds it. You will sometimes see the budget model win cleanly — that is the whole point of routing, and on simple, high-volume, well-bounded tasks it genuinely does. You will sometimes see it lose once retries and fallbacks are counted. Either way you shipped a measurement, not a guess.
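The four-number report can be sketched as a small summary over per-task outcomes. Everything here is an assumption made for illustration: the `TaskOutcome` fields, the reading of "escalation" as a hand-off to a human, and the sample data. Run it once per path and compare the two results.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TaskOutcome:
    usable: bool        # task ended with a usable answer
    escalated: bool     # handed to a human (assumed meaning of "escalation")
    cost_cents: float   # every attempt, retry, fallback, and tool call
    retried: bool
    fell_back: bool     # budget attempt failed and the frontier model was called
    input_tokens: int   # summed over all attempts, including added examples


def summarize(outcomes: Iterable[TaskOutcome]) -> dict:
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one outcome")
    n = len(outcomes)
    usable = sum(o.usable for o in outcomes)
    total_cost = sum(o.cost_cents for o in outcomes)
    return {
        "cost_per_successful_task": total_cost / usable if usable else float("inf"),
        "success_rate": sum(o.usable and not o.escalated for o in outcomes) / n,
        "retry_or_fallback_rate": sum(o.retried or o.fell_back for o in outcomes) / n,
        "avg_input_tokens": sum(o.input_tokens for o in outcomes) / n,
    }


sample = [
    TaskOutcome(True, False, 1.0, False, False, 2000),
    TaskOutcome(True, False, 2.0, True, False, 4000),
    TaskOutcome(True, False, 4.0, False, True, 4000),
    TaskOutcome(False, True, 2.0, True, False, 4000),
]
report = summarize(sample)
for name, value in report.items():
    print(f"{name}: {value}")
```

Note that the cost of the failed task stays in the numerator. Spend on tasks that never produced a usable answer is still spend.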
This also reframes the routing layer itself. A router is a classifier deciding cheap-vs-expensive per request, and a misroute is a distinct failure mode — not a model error, a routing error — that aggregate cost dashboards are structurally blind to. The router deserves its own regression suite and its own slice in the eval, because a router that routes too aggressively produces exactly the symptom that looks like success: a high budget-model hit rate sitting right next to a rising bill.
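A router's own slice in the eval can start as simply as this. The sketch assumes each request has been labeled after the fact with the tier it actually needed, for example by replaying it through both paths. The labels and the function name are hypothetical.

```python
def aggressive_misroute_rate(decisions):
    """Share of budget-routed requests that actually needed the frontier model.

    `decisions` is an iterable of (routed_to, needed) pairs, each value being
    "budget" or "frontier".
    """
    sent_cheap = [(routed, needed) for routed, needed in decisions if routed == "budget"]
    if not sent_cheap:
        return 0.0
    misrouted = sum(1 for _, needed in sent_cheap if needed == "frontier")
    return misrouted / len(sent_cheap)


decisions = (
    [("budget", "budget")] * 5       # cheap route, cheap was enough
    + [("budget", "frontier")] * 2   # cheap route, frontier was needed
    + [("frontier", "frontier")] * 3
)
budget_hit_rate = sum(1 for routed, _ in decisions if routed == "budget") / len(decisions)
misroute_rate = aggressive_misroute_rate(decisions)

print(f"budget hit rate: {budget_hit_rate:.0%}")
print(f"misroute rate among budget-routed requests: {misroute_rate:.1%}")
```

A high hit rate alone says nothing. Tracked next to the misroute rate, it shows whether the router is saving money or just deferring the frontier call.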
The Takeaway
The cheap model is not a lie. It is a real lever, and on the right traffic it delivers the savings the spreadsheet promised. The lie is the spreadsheet's unit. Cost per token measures a thing nobody buys; cost per successful task measures the thing you actually ship.
Before the next downgrade, instrument the task boundary. Attribute every retry and every fallback to the routing decision that caused them. Count the prompt tokens you added to compensate. Watch where the displaced work lands. Then run both paths end to end and compare the only number that matters.
Do that, and routing becomes the cost win it is supposed to be. Skip it, and you will keep shipping downgrades that look free per token and arrive, one quarter later, as a bill nobody can explain.
