The Latency-Budget Router That Was a Quality-Loss Router by Another Name

10 min read
Tian Pan
Software Engineer

A model router that optimizes a single loss function will deliver exactly what that loss function asks for, and nothing else. When the function is "stay under the p95 latency target," every query that would have benefited from extended reasoning gets snapped to the cheapest path the router can defend, because the fast model returns under the SLO and the slow-but-correct model would not. The latency dashboard turns green. The aggregate eval moves a fraction of a point and the team rounds it to noise. The per-slice view nobody graphs is where the actual regression lives: concentrated in the multi-step, ambiguous, and out-of-distribution queries that should have been routed to reasoning and instead got the model that finishes fast and is wrong with confidence.

This is not a routing bug. The router is doing exactly what it was built to do. The bug is in the framing — a system whose optimizer is denominated entirely in latency will produce quality regressions invisible to the metric the team is paid to keep green. It will then ship those regressions silently, because the people watching the dashboard are not the people watching the answers.

The Single-Objective Trap

The reason latency-budget routing feels reasonable when you build it is that latency is a clean, immediate, server-side signal. A router classifier predicts the cheapest model that returns under the deadline, and the dashboard updates in real time. Quality is none of those things. Quality lags by hours or days, requires human or LLM-judge scoring, and is denominated in units that aggregate across query classes whose distributions move independently. Putting latency in the objective and quality in a downstream eval is not a design decision — it is the path of least instrumentation resistance.

The trap is that the routing decision is causally upstream of every quality measurement the team will ever take. Once the cheap path has answered, the costly path will not. There is no counterfactual to compare against unless the team built one in, and most teams do not — the shadow infrastructure to run both models on every query and grade the difference is exactly the cost the router was supposed to avoid. So the regression accumulates inside a cohort the dashboard has no slice for, and the team's understanding of "quality" reduces to "the aggregate score I see at the end of the week, which has moved by less than the noise floor."

Recent work on multi-objective serving (Bayesian-optimization frameworks like BOute, Lagrangian-RL controllers like PROTEUS) frames this explicitly: routing is a constrained optimization problem with at least three axes (cost, latency, quality), and treating any one of them as the loss function while leaving the others as "we'll watch the dashboard" is mathematically equivalent to assigning those others zero weight. The model the router serves will reflect those weights whether the team intended them or not.
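To make the zero-weight point concrete, here is a minimal sketch using a plain weighted-sum loss. This is a deliberate simplification, not the Bayesian-optimization or Lagrangian-RL methods cited above, and every name and number in it (Candidate, pick, the cost, latency, and quality figures) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float      # hypothetical dollars per request
    latency: float   # hypothetical predicted seconds
    quality: float   # hypothetical predicted quality in [0, 1]


def pick(candidates, w_cost, w_latency, w_quality):
    """Return the candidate that minimizes a weighted-sum loss.

    Quality enters with a negative sign because higher is better.
    """
    def loss(c):
        return w_cost * c.cost + w_latency * c.latency - w_quality * c.quality

    return min(candidates, key=loss)


# Illustrative numbers only; not measurements of any real model.
CANDIDATES = [
    Candidate("fast", cost=0.001, latency=0.4, quality=0.70),
    Candidate("reasoning", cost=0.020, latency=6.0, quality=0.92),
]

# A latency-only objective is the same loss with quality weighted at zero.
latency_only = pick(CANDIDATES, w_cost=0.0, w_latency=1.0, w_quality=0.0)

# Giving quality a nonzero weight can flip the decision.
quality_weighted = pick(CANDIDATES, w_cost=0.0, w_latency=1.0, w_quality=50.0)

print(latency_only.name, quality_weighted.name)
```

With the quality weight at zero, no value of the quality column can change the choice, which is the sense in which "we'll watch the dashboard" and "zero weight" are the same statement.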

Why the Aggregate Eval Will Lie to You

Suppose the candidate routing change moves 60% of traffic from a reasoning model to a fast model. Suppose the fast model is genuinely at parity on the 80% of queries the router classifier identifies as "easy." Suppose on the remaining 20% — the queries the classifier got wrong, plus the queries on the boundary, plus the out-of-distribution tail — the fast model is meaningfully worse.

Aggregate eval score: 80% × parity + 20% × meaningful regression. If the eval suite is balanced like production traffic, the headline number moves by only one-fifth of the regression on the affected slice. The dilution gets worse as the slice shrinks: if the affected slice is 5% of production traffic and the regression on it is 15 points, the aggregate moves by 0.05 × 15 = 0.75 points, well within the noise band the team has been calibrating against. The PR ships, the dashboard stays green, and the slice that took the hit is small enough that the support team sees the symptom (longer ticket resolution, more escalations, more "the AI doesn't get my question" feedback) as a separate phenomenon from the routing change three weeks earlier.
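The dilution arithmetic can be checked in a few lines. This is a sketch that assumes the eval suite is weighted exactly like production traffic; the helper name aggregate_delta and the two-slice breakdown are illustrative, using the 5% and 15-point figures from the example.

```python
def aggregate_delta(slices):
    """Traffic-weighted change in the headline score.

    `slices` is a list of (traffic_share, score_delta_points) pairs
    whose shares must sum to 1.
    """
    total_share = sum(share for share, _ in slices)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * delta for share, delta in slices)


slices = [
    (0.95, 0.0),    # unaffected traffic: fast model at parity
    (0.05, -15.0),  # affected slice: 15-point regression
]

headline = aggregate_delta(slices)
print(f"headline moves by {headline:+.2f} points")  # headline moves by -0.75 points
```

The per-slice delta of -15 is visible only if the slice is graphed separately; the headline number carries 5% of it.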

This is the same failure mode that shows up in every place teams compress a multi-objective system into a single headline metric. It is not specific to AI. What is specific to AI is that the affected slice — the multi-step, ambiguous, out-of-distribution queries — is also the cohort the team's product strategy is implicitly betting on, because every easy query is one a non-AI product could also handle. The thing the router quietly downgraded is the thing that justifies the router existing in the first place.

The Reasoning-Eligibility Lane

The pattern that closes the gap is to stop treating "should this query get reasoning?" as a budget decision and start treating it as a classification decision that runs ahead of the budget logic. A reasoning-eligibility classifier — recent hybrid-LLM routers describe this as a "to think or not to think" head — takes the incoming query, predicts whether the reasoning model is likely to produce a meaningfully better answer, and pins eligible queries to the reasoning path regardless of how much latency budget is left in the window.

The mechanical change is small. The semantic change is large: it inverts the priority order from "stay under latency unless you can prove the reasoning is worth it" to "use reasoning when the classifier says it matters, and absorb the latency hit on that cohort." The latency dashboard now has a known floor — the cohort the classifier sends to reasoning will have a worse p95 than the cohort it sends to the fast path, and that is the intended state, not a regression. The aggregate latency target is now a property of the eligibility classifier's calibration rather than the router's willingness to compromise.
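A minimal sketch of that ordering follows, under stated assumptions: the two-model setup, the cost and latency figures, the 0.5 threshold, and the keyword-based toy_eligibility function are all hypothetical stand-ins. A real eligibility head would be a trained classifier, not a keyword match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    cost: float           # hypothetical relative cost per request
    p95_latency_s: float  # hypothetical predicted p95 latency in seconds


FAST = Model("fast", cost=1.0, p95_latency_s=0.5)
REASONING = Model("reasoning", cost=20.0, p95_latency_s=8.0)


def budget_route(models, budget_s):
    """Latency-budget logic: cheapest model predicted to fit the deadline."""
    fitting = [m for m in models if m.p95_latency_s <= budget_s]
    if not fitting:
        # Nothing fits the deadline; fall back to the lowest-latency model.
        return min(models, key=lambda m: m.p95_latency_s)
    return min(fitting, key=lambda m: m.cost)


def route(query, budget_s, eligibility, threshold=0.5):
    """Eligibility lane first, budget logic second."""
    if eligibility(query) >= threshold:
        # Pinned to reasoning regardless of the remaining latency budget.
        return REASONING
    return budget_route([FAST, REASONING], budget_s)


def toy_eligibility(query):
    """Stand-in for a trained classifier: flags a few multi-step phrasings."""
    markers = ("step by step", "compare", "why")
    return 1.0 if any(m in query.lower() for m in markers) else 0.0


easy = route("What is the capital of France?", 2.0, toy_eligibility)
hard = route("Compare these two contracts and explain why they differ.",
             2.0, toy_eligibility)
print(easy.name, hard.name)
```

Note that budget_route on its own returns the fast model even when the budget is generous enough for reasoning, because the fast model is always the cheapest one that fits. The eligibility check is the only place in this sketch where the reasoning path can be chosen.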
