Reasoning-Model Arbitrage: The Slow Expensive Model Is Cheaper on the Hard Prompts
The cheapest line on the pricing page is rarely the cheapest line on the invoice. A team picks the workhorse model — Sonnet, Haiku, Flash, GPT-mini — because the per-token math is friendly, ships a feature, and watches the cost dashboard report a happy unit-economics story for a quarter. Then the long tail catches up: a slice of requests the workhorse can't quite handle starts retrying, then partially answering, then escalating to a human reviewer, and the per-feature P&L stops resembling the per-call dashboard.
The arbitrage is that, on those hard requests, a reasoning model the team would never default to — Opus, o3, the slow expensive one — frequently lands the answer on the first attempt. The all-in cost of one $0.50 reasoning call beats five $0.05 workhorse calls plus the escalation queue and the engineer who debugs the failure on Monday. The procurement question (which model is cheapest per token?) and the architecture question (which model is cheapest per resolved request?) are different questions, and the team that conflates them is paying the difference.
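The back-of-envelope version of that comparison can be made explicit. Prices ($0.50 and $0.05 per call) come from the paragraph above; the per-attempt success rates, the retry cap, and the loaded escalation cost are illustrative assumptions, not measurements:

```python
# Cost per *resolved* request on a hard prompt, under assumed success rates.
def truncated_retry_cost(per_call, p_success, max_attempts, escalation_cost):
    """Expected all-in cost of a route that retries up to max_attempts
    times and escalates to a human if every attempt fails."""
    expected_calls = sum((1 - p_success) ** k for k in range(max_attempts))
    p_unresolved = (1 - p_success) ** max_attempts
    return per_call * expected_calls + escalation_cost * p_unresolved

# Workhorse: $0.05/call, 20% per-attempt success on this prompt, up to 5 tries.
# Reasoning: $0.50/call, 95% first-attempt success, no retries.
workhorse = truncated_retry_cost(0.05, 0.20, 5, escalation_cost=15.00)
reasoning = truncated_retry_cost(0.50, 0.95, 1, escalation_cost=15.00)
print(f"workhorse all-in: ${workhorse:.2f}")  # $5.08
print(f"reasoning all-in: ${reasoning:.2f}")  # $1.25
```

Under these assumptions the expensive route is roughly 4× cheaper per resolved request; the crossover point moves with the success rates and the loaded escalation cost, which is exactly why those numbers belong in the routing decision.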
The list price is the marginal cost of the median request
Per-token pricing reports a number for the typical request: median input length, median output length, median complexity. It does not report the all-in cost of the request distribution your product actually serves. Two costs are usually missing from the dashboard.
The first is the retry-and-degrade cost. When the workhorse fails — wrong schema, refused output, hallucinated tool call, partial answer — something downstream retries. Sometimes that's a synchronous retry in the inference layer (one more API call, one more token bill). Sometimes it's the user clicking "try again," which is the same call rebilled but now charged against a different metric (engagement instead of cost). Sometimes the retry includes a fallback to a stronger model anyway, so you paid the workhorse price and then the reasoning price on the same logical request. The dashboard sums tokens by call, not by resolved request, so the retry stack is invisible.
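One way to make the retry stack visible is to re-aggregate the dashboard's per-call rows by logical request. The call log below is hypothetical and its field names (`request_id`, `resolved`) are assumptions about what your trace store records:

```python
from collections import defaultdict

# Hypothetical call log: four cheap-looking per-call rows.
calls = [
    {"request_id": "r1", "model": "workhorse", "cost": 0.05, "resolved": False},
    {"request_id": "r1", "model": "workhorse", "cost": 0.05, "resolved": False},
    {"request_id": "r1", "model": "reasoning", "cost": 0.50, "resolved": True},
    {"request_id": "r2", "model": "workhorse", "cost": 0.05, "resolved": True},
]

# Sum cost per logical request instead of per call.
per_request = defaultdict(float)
for c in calls:
    per_request[c["request_id"]] += c["cost"]

# r1 cost $0.60 to resolve: the workhorse price paid twice, then the
# reasoning price anyway, which the per-call view never surfaces.
```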
The second is the escalation cost. The request that the workhorse fails on becomes a support ticket, a human reviewer's queue entry, a customer-success engineer's afternoon. None of those line items show up in inference spend. They show up in headcount, in customer-satisfaction surveys, in churn. A 5% long-tail failure rate at a million requests per day is 50,000 escalations, and a team that hasn't priced the escalation queue is making a model-selection decision against incomplete data.
The price-reversal phenomenon makes this concrete. Within reasoning models, the cheaper-listed option can end up costing more end-to-end roughly one in five comparisons, because cheaper reasoning models think harder to compensate for weaker base capability and the thinking-token bill swamps the per-token-price advantage. Thinking-token variance on the same query against the same model can reach 9.7×. The list price is signal; the all-in cost is the metric.
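A minimal illustration of the reversal, with made-up per-million-token rates and thinking-token counts (thinking tokens are commonly billed at the output rate, but check your provider's pricing):

```python
# Illustrative price-reversal arithmetic; all rates and counts are assumptions.
def call_cost(in_tokens, out_tokens, thinking_tokens, in_rate, out_rate):
    """Cost of one call. Rates are dollars per million tokens; thinking
    tokens are assumed to bill at the output rate."""
    return (in_tokens * in_rate + (out_tokens + thinking_tokens) * out_rate) / 1e6

# Cheaper-listed model thinks 12k tokens to compensate; pricier one thinks 1.5k.
cheap  = call_cost(2_000, 500, 12_000, in_rate=1.0, out_rate=5.0)   # $0.0645
pricey = call_cost(2_000, 500,  1_500, in_rate=3.0, out_rate=15.0)  # $0.0360
```

The model with 3× the list price ends up cheaper on this call because the thinking-token bill swamps the rate difference.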
What a request classifier is actually classifying
The phrase "route easy prompts to the fast model and hard prompts to the slow model" makes routing sound like a binary decision on a feature the model can see. It is not. A request classifier is predicting a counterfactual: will the workhorse fail on this input badly enough that paying for the reasoning model upfront is cheaper than discovering the failure later?
The features that predict workhorse failure cluster into a few categories. Input shape matters but is a weak signal on its own — a 50-token instruction to rewrite a contract clause is harder than a 500-token narrative summary. The strong signals are structural: schema complexity (number of required fields, depth of nested types, count of mutually-exclusive cases), the presence of reasoning markers in the prompt (multi-step, calculate, compare, derive), the prior failure rate for similar request shapes, the customer tier (whose escalations cost the most), and whether the request needs a tool call whose argument structure the workhorse has historically gotten wrong.
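Those features can be sketched as a plain extraction function. Everything here, the marker list, the `history` priors, the JSON-Schema-ish shape, is an assumption about what your traces and schemas contain:

```python
import re

# Assumed marker vocabulary; tune against your own failure traces.
REASONING_MARKERS = re.compile(
    r"\b(multi-step|calculate|compare|derive|prove|step by step)\b", re.I)

def schema_depth(node, depth=1):
    """Max nesting depth of a JSON-Schema-like dict."""
    props = node.get("properties", {})
    if not props:
        return depth
    return max(schema_depth(v, depth + 1) for v in props.values())

def route_features(prompt: str, schema: dict, history: dict) -> dict:
    """Feature vector for the failure classifier. `history` holds per-shape
    priors such as failure_rate and customer tier (names are assumptions)."""
    return {
        "prompt_tokens": len(prompt.split()),               # weak signal alone
        "required_fields": len(schema.get("required", [])),
        "schema_depth": schema_depth(schema),
        "reasoning_markers": len(REASONING_MARKERS.findall(prompt)),
        "prior_failure_rate": history.get("failure_rate", 0.0),
        "customer_tier": history.get("tier", 0),
    }
```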
The classifier doesn't have to be a deep model. A logistic regression over those features, trained on a few weeks of production traces labeled "workhorse succeeded" vs. "workhorse failed and was retried/escalated/regenerated," reaches usable accuracy. It doesn't have to be very accurate, either: an 80%-accurate classifier processing millions of requests still saves money, because the asymmetry in failure cost between the two routes does most of the work. The classifier just has to be right more often than not.
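The accuracy claim is easy to check with expected-value arithmetic. All numbers below are illustrative assumptions: all-in route costs on easy vs. hard requests, and a 10% hard-request rate:

```python
# Expected cost per request for a router with accuracy `acc`, vs. the
# always-workhorse baseline. All-in costs are illustrative assumptions.
C_WORK_EASY, C_WORK_HARD = 0.05, 5.08   # workhorse all-in: easy vs. hard
C_REAS_EASY, C_REAS_HARD = 0.50, 1.25   # reasoning all-in: easy vs. hard
P_HARD = 0.10                           # assumed fraction of hard requests

def expected_cost(acc):
    # Correct routing is easy -> workhorse, hard -> reasoning.
    easy = acc * C_WORK_EASY + (1 - acc) * C_REAS_EASY
    hard = acc * C_REAS_HARD + (1 - acc) * C_WORK_HARD
    return (1 - P_HARD) * easy + P_HARD * hard

all_workhorse = (1 - P_HARD) * C_WORK_EASY + P_HARD * C_WORK_HARD
# expected_cost(0.8) ~= $0.33 vs. all_workhorse ~= $0.55: the mediocre
# classifier already beats the baseline by a wide margin.
```

A mis-route to the reasoning model overpays by $0.45; a mis-route to the workhorse underpays by $3.83 in expectation. That asymmetry is what lets a coin-flip-plus-epsilon classifier clear the bar.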
The all-in cost model
Routing decisions made on per-token price alone end up under-routing to the reasoning model, because the per-token price doesn't reflect the cost of the route. An all-in cost model needs four terms per request, per route:
- Direct inference cost: tokens in × input rate + tokens out × output rate, including thinking tokens for reasoning models.
- Expected retry cost: probability of failure × cost of the retry (which often includes a fallback to a stronger model).
- Expected escalation cost: probability of human handoff × loaded cost of human review.
- Trust-damage cost: a discount factor applied to lifetime revenue when the request fails in a user-visible way; this term is fuzzy but non-zero and dominates the math for high-trust workflows.
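Put together, the four terms make route selection a one-line argmin. The sketch below assumes you can produce per-request estimates of the failure and escalation probabilities (e.g. from the classifier); field names are illustrative, not a standard API:

```python
from dataclasses import dataclass

@dataclass
class RouteEstimate:
    """The four all-in terms for one route on one request.
    Field names and the estimates that fill them are assumptions."""
    inference: float        # tokens in/out at the route's rates, incl. thinking
    p_fail: float           # predicted failure probability on this route
    retry_cost: float       # cost of the retry path, incl. any fallback model
    p_escalate: float       # P(failure reaches a human | failure)
    escalation_cost: float  # loaded cost of one human review
    trust_damage: float     # expected lifetime-revenue hit per visible failure

    def all_in(self) -> float:
        return (self.inference
                + self.p_fail * self.retry_cost
                + self.p_fail * self.p_escalate * self.escalation_cost
                + self.p_fail * self.trust_damage)

def choose_route(routes: dict) -> str:
    """Pick the route with the lowest expected all-in cost."""
    return min(routes, key=lambda name: routes[name].all_in())
```

With, say, a 40% predicted failure rate on the workhorse and 5% on the reasoning model, the $0.05 route loses the argmin decisively before the trust term even enters.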
