5 posts tagged with "llm-routing"

Reasoning-Model Arbitrage: The Slow Expensive Model Is Cheaper on the Hard Prompts

May 13, 2026 · 10 min read

Tian Pan

Software Engineer

The cheapest line on the pricing page is rarely the cheapest line on the invoice. A team picks the workhorse model — Sonnet, Haiku, Flash, GPT-mini — because the per-token math is friendly, ships a feature, and watches the cost dashboard report a happy unit-economics story for a quarter. Then the long tail catches up: a slice of requests the workhorse can't quite handle starts retrying, then partially answering, then escalating to a human reviewer, and the per-feature P&L stops resembling the per-call dashboard.

The arbitrage is that, on those hard requests, a reasoning model the team would never default to — Opus, o3, the slow expensive one — frequently lands the answer on the first attempt. The all-in cost of one $0.50 reasoning call beats five $0.05 workhorse calls plus the escalation queue and the engineer who debugs the failure on Monday. The procurement question (which model is cheapest per token?) and the architecture question (which model is cheapest per resolved request?) are different questions, and the team that conflates them is paying the difference.
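The per-resolved-request arithmetic can be sketched in a few lines. This is a minimal illustration, not a pricing model: the $0.50 and $0.05 call prices and the five workhorse attempts come from the paragraph above, while the escalation rates, the $4.00 cost of a human review, and the names `Path` and `cost_per_resolved_request` are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One way of serving a hard request. Escalation figures are assumed."""
    cost_per_call: float      # dollars per model call
    calls_per_request: float  # average attempts, including retries
    escalation_rate: float    # fraction of requests handed to a human
    escalation_cost: float    # dollars per human escalation (assumed)


def cost_per_resolved_request(path: Path) -> float:
    """All-in cost: model spend plus the expected cost of human escalation."""
    model_spend = path.cost_per_call * path.calls_per_request
    return model_spend + path.escalation_rate * path.escalation_cost


# Hypothetical hard-prompt slice: the workhorse retries and often escalates.
workhorse = Path(cost_per_call=0.05, calls_per_request=5,
                 escalation_rate=0.30, escalation_cost=4.00)
reasoning = Path(cost_per_call=0.50, calls_per_request=1,
                 escalation_rate=0.02, escalation_cost=4.00)

for name, path in [("workhorse", workhorse), ("reasoning", reasoning)]:
    print(f"{name}: ${cost_per_resolved_request(path):.2f} per resolved request")
```

On model spend alone the workhorse still wins ($0.25 against $0.50). Under these assumed numbers it is the escalation term that flips the comparison, which is why a per-call dashboard never shows it.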

The 20% Problem in Model Routing: When Cost Optimization Creates Second-Class Users

May 4, 2026 · 9 min read

Tian Pan

Software Engineer

Your routing system works exactly as designed. Eighty percent of queries go to the cheap model; twenty percent escalate to the capable one. Latency is down, costs are down 60%, and leadership is happy. Then someone pulls the data by user segment, and you see it: users writing in non-native English are escalated at half the rate of native speakers, and their satisfaction scores are 18 points lower. The routing system treated the query complexity signal as neutral, but it wasn't. It was a proxy for language proficiency, and you've been giving a systematically worse product to a specific group of users for months.

This is the 20% problem. It's not a bug in the router. It's an emergent property of any cost-optimized routing system that nobody measures until it's too late.
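Measuring it does not take much. The sketch below computes escalation rate per user segment from a routing log and compares one segment against a reference. The log schema (`segment`, `escalated`), the segment labels, and the toy records are all hypothetical; a real audit would read from whatever the router actually logs.

```python
from collections import defaultdict


def escalation_rate_by_segment(log):
    """Return {segment: fraction of requests routed to the capable model}."""
    totals = defaultdict(int)
    escalated = defaultdict(int)
    for record in log:
        totals[record["segment"]] += 1
        escalated[record["segment"]] += int(record["escalated"])
    return {segment: escalated[segment] / totals[segment] for segment in totals}


def disparity_ratio(rates, group, reference):
    """Escalation rate of `group` relative to `reference` (1.0 means parity)."""
    return rates[group] / rates[reference]


# Toy log: 8 requests per segment, made up to mirror the "half the rate" case.
log = (
    [{"segment": "native", "escalated": i < 2} for i in range(8)]
    + [{"segment": "non_native", "escalated": i < 1} for i in range(8)]
)

rates = escalation_rate_by_segment(log)
ratio = disparity_ratio(rates, "non_native", "native")
print(rates, ratio)
```

A ratio well below 1.0 for a segment is a prompt to look closer, not proof of harm on its own: it needs to be read alongside that segment's quality or satisfaction metrics.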

Your Model Router Was Trained on Your Eval Set, Not Your Traffic

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A team I talked to last quarter shipped a model router that scored 96% routing accuracy on their offline benchmark and cut average inference cost by 58%. Three weeks in, support tickets started clustering around a specific user segment — enterprise admins running scripted bulk queries through their API. The cheap path was sending those users garbage answers. The router was working exactly as designed. The design was wrong.

That story is the rule, not the exception. The "send small-model what you can, save big-model for what you must" architecture is one of the most reliable cost levers in production LLM systems, with documented savings between 45% and 85% on standard benchmarks. But the savings number that gets quoted on every routing demo assumes a benchmark distribution. Production traffic doesn't have that shape, and the gap between the two is where quality regressions live — concentrated in segments your offline eval was never designed to surface.
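A rough way to see the gap is to reweight per-segment routing accuracy by the traffic mix it will actually face. In the sketch below, the segment names, per-segment accuracies, and both mixes are invented for illustration; they are chosen only so that the benchmark mix reproduces a 96% headline figure like the one in the story.

```python
def weighted_accuracy(per_segment_accuracy, mix):
    """Overall routing accuracy under a given traffic mix (fractions sum to 1)."""
    return sum(per_segment_accuracy[segment] * share for segment, share in mix.items())


# Hypothetical per-segment routing accuracy, measured offline.
accuracy = {"interactive_chat": 0.99, "scripted_bulk_api": 0.69}

# Hypothetical mixes: the benchmark barely contains the bulk-API segment.
benchmark_mix = {"interactive_chat": 0.90, "scripted_bulk_api": 0.10}
production_mix = {"interactive_chat": 0.60, "scripted_bulk_api": 0.40}

offline = weighted_accuracy(accuracy, benchmark_mix)
live = weighted_accuracy(accuracy, production_mix)
print(f"benchmark mix: {offline:.2%}, production mix: {live:.2%}")
```

The router is identical in both rows; only the weights change. The aggregate also hides that the entire shortfall sits in one segment, which is the segment that ends up filing the support tickets.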

The Router Is the Product: Why Your Cheap Classifier Decides More Behavior Than Your Flagship Model

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A team I talked to last quarter shipped what they called "the routing project": a tiny BERT classifier in front of their flagship model that decided whether a query was simple enough for a cheaper, faster fallback. It paid for itself in three weeks. The cost dashboards lit up green. The flagship's eval suite — three hundred adversarial cases, weekly grading runs, the works — still passed every Friday.

Six weeks in, retention on a particular product surface dropped four points and nobody could find the cause. The flagship was fine. Latency was fine. The router, it turned out, was sending 71% of queries to the cheap model. It had been since week two. The cheap model was the product for most users, and the cheap model had no eval suite at all.

This is the most common failure mode I see in 2026 among teams that adopted LLM routing for cost control: the eval discipline gets attached to the expensive tail of the system, and the cheap head — the part that defines the product for most of the request volume — runs blind.

The Cascade Router Reliability Trap: When Cost Optimization Quietly Wrecks Your p95

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

The cost dashboard is a beautiful green. Spend per request is down 62% since the cascade router shipped. The CFO is happy. The platform team is celebrating. Meanwhile, your p95 latency has crept up 40%, your hardest customer just churned because "the bot got dumber on the queries that matter," and the experimentation team has spent two weeks chasing a regression that does not exist.

This is the cascade router reliability trap. It is the quiet failure mode of every "try the cheap model first, escalate if it doesn't work" architecture, and it is one of the most under-discussed second-order effects in production LLM systems. The cost wins are real, measurable, and easy to attribute. The reliability losses are diffuse, statistical, and almost impossible to attribute back to the router that caused them. So the cost wins get celebrated, the reliability losses get blamed on "the model getting worse," and the team optimizes itself into a hole.
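The tail-latency side of the trap follows from the fact that an escalated request pays for both models in sequence. The toy model below makes that concrete with fixed latencies and a deterministic 20% escalation rate. Every number and function name is an assumption for illustration, and real latencies are distributions rather than constants.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def cascade_latencies(n, escalate_every, cheap_s, strong_s):
    """Toy traffic: every `escalate_every`-th request fails the cheap model,
    so it pays the cheap latency and then the strong model's latency."""
    return [
        cheap_s + strong_s if i % escalate_every == 0 else cheap_s
        for i in range(n)
    ]


CHEAP_S, STRONG_S = 0.8, 2.0                   # assumed seconds per call
direct = [STRONG_S] * 100                      # baseline: everything on the strong model
cascade = cascade_latencies(100, 5, CHEAP_S, STRONG_S)  # 20% of requests escalate

mean_direct = sum(direct) / len(direct)
mean_cascade = sum(cascade) / len(cascade)
p95_direct = percentile(direct, 95)
p95_cascade = percentile(cascade, 95)
print(f"mean: {mean_direct:.2f}s -> {mean_cascade:.2f}s")
print(f"p95:  {p95_direct:.2f}s -> {p95_cascade:.2f}s")
```

In this toy setup the mean improves while p95 gets worse, because any escalation rate above 5% puts the double-hop requests inside the 95th percentile. Averages and cost dashboards move in the right direction at the same time the tail moves in the wrong one.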
