
The Provider Auto-Router That Quietly Routed Your Premium Traffic To Haiku

· 10 min read
Tian Pan
Software Engineer

Your platform team adopted the provider's "auto" model identifier for cost reasons. The first dashboard after rollout was hard to argue with: a 34% spend reduction with no measurable quality drop on the weekly eval. Two quarters later, customer satisfaction on your shortest, highest-volume surface had been sliding the whole time, and a product-led investigation eventually traced the regression to a model identifier nobody on the engineering team had touched. The code said "auto." The provider had been redefining what "auto" meant the whole time.

The lesson is not that auto-routing is bad. The lesson is that "auto" is a moving target whose distribution drifts with provider economics, and your eval's representativeness is the only check standing between vendor optimization and your product quality. If the eval does not match the traffic, the discount you celebrated is being paid out of a quality slope nobody is reviewing.

This post is about what changes when a routing decision you used to make in your own code moves across the network and becomes a provider knob. The mechanics look familiar — classifier in front of a model pool, cheap tier for easy prompts, expensive tier for hard ones — but the operational implications are not familiar. You no longer own the classifier. You no longer see the per-request decision. You no longer get to revalidate the threshold when the cost curves move. And you almost never get notified when any of those change.

The Auto Alias Is A Routing Policy You Do Not Author

When you call a concrete model — claude-sonnet-4-6, claude-haiku-4-5, the explicit identifier — your code expresses a choice and the provider executes it. When you call an auto alias, your code expresses an intent and the provider decides on your behalf. The distinction sounds philosophical until you internalize that the policy behind the alias is something the provider tunes for their own objectives, on their own cadence, in response to their own load and margin curves.

A reasonable auto-router classifies prompt complexity in a few hundred milliseconds, sends easy queries to a small model, and reserves the large model for hard ones. The literature on this is encouraging — well-tuned routers report 50% to 80% cost reductions against a single large-model baseline, with under 2% quality loss on the hard slice. Those numbers are real for the system as measured. They are not a contract about your system after the router's parameters move.

Two things commonly drift. First, the classifier's complexity threshold. A provider running auto across many tenants will retune the boundary as model economics change — Haiku gets cheaper to serve, the team relaxes the threshold so more traffic qualifies. Your traffic mix did not change. The classifier's mix changed. Second, the model pool itself. A provider can keep the alias stable and swap one of the pooled models for a faster or cheaper successor. The alias still points to "the right tier for this prompt." The behavior under the alias is no longer the behavior you benchmarked.
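A toy illustration of the first kind of drift. It assumes a hypothetical router that assigns each prompt a complexity score between 0 and 1 and sends anything below a threshold to the small tier. The scores and both thresholds are made up for the sketch; real routers do not expose either.

```python
def small_tier_share(complexity_scores, threshold):
    """Fraction of prompts a threshold router would send to the small model."""
    routed_small = sum(1 for score in complexity_scores if score < threshold)
    return routed_small / len(complexity_scores)


# Hypothetical complexity scores for a fixed sample of traffic.
traffic = [0.05, 0.10, 0.15, 0.22, 0.28, 0.35, 0.41, 0.55, 0.70, 0.90]

# Same traffic, two boundaries: the original one and a relaxed one.
before = small_tier_share(traffic, threshold=0.25)
after = small_tier_share(traffic, threshold=0.45)

print(f"small-tier share before: {before:.0%}, after: {after:.0%}")
```

The traffic list never changes between the two calls. Only the boundary moves, and the share of requests on the small tier moves with it.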

Both are silent changes. The release notes, if they exist at all, file them under "improved efficiency" or "routing improvements." Your client code did not change. Your spend dropped. The dashboard you trusted said everything was fine.

How The Eval Misses It

The eval suite missed the regression for a specific, structural reason. Most eval suites are written by engineers who care about hard problems — the model's reasoning ceiling, its instruction following on long contexts, its ability to handle a tricky tool call. The suite ends up heavy on the cases that look hard, because those are the cases that fail interestingly. Easy cases get included for coverage but they pass everywhere, so nobody invests in expanding them.

When the eval is dominated by longer, harder prompts, an auto-router that is being tuned more aggressively on short prompts will keep sending the eval's prompts to the large model. Your eval sees Sonnet for almost every example, so the eval score holds. Meanwhile, your product traffic is something like 70% short, frequent prompts (autocomplete, suggestion ribbons, the tooltip that explains a chart), and the router is sending the majority of that volume to Haiku. The product surface most users touch most often is now running on a different model than the one the eval is measuring.

This is the failure mode where "we have good evals" is true and useless at the same time. The eval set is high quality. It just is not representative. The router's distribution shift hides inside the gap between the prompts your eval covers and the prompts your customers send.

The fix is not "more eval cases." It is per-surface eval slices weighted to match production traffic. If 60% of your prompts are autocomplete-style, then 60% of your eval mass should be on autocomplete-style examples, scored on the dimensions that matter for that surface — latency, succinctness, refusal rate, factual accuracy on short queries. A routing distribution shift will then move the aggregate score in the direction of the shift, instead of being absorbed by the long-context reasoning slice that the router never touched.
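A minimal sketch of the weighting argument, under invented numbers. The surface names, per-slice scores, and both mixes below are illustrative assumptions. The point is only that the same regression on one surface barely moves an aggregate weighted by eval-case mix and moves a traffic-weighted aggregate visibly.

```python
def weighted_score(slice_scores, weights):
    """Aggregate per-surface eval scores using the given per-surface weights."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    missing = set(weights) - set(slice_scores)
    if missing:
        raise ValueError(f"no eval slice for surfaces: {sorted(missing)}")
    return sum(weights[s] * slice_scores[s] for s in weights)


# How the eval cases happen to be distributed (heavy on hard prompts).
eval_case_mix = {"autocomplete": 0.05, "chart_tooltip": 0.05, "long_form": 0.90}
# How production traffic is actually distributed (hypothetical shares).
traffic_mix = {"autocomplete": 0.60, "chart_tooltip": 0.25, "long_form": 0.15}

baseline = {"autocomplete": 0.90, "chart_tooltip": 0.90, "long_form": 0.92}
# Same slices after a routing shift that only hurts the autocomplete surface.
shifted = {"autocomplete": 0.80, "chart_tooltip": 0.90, "long_form": 0.92}

drop_eval_mix = weighted_score(baseline, eval_case_mix) - weighted_score(shifted, eval_case_mix)
drop_traffic_mix = weighted_score(baseline, traffic_mix) - weighted_score(shifted, traffic_mix)

print(f"drop under eval-case mix: {drop_eval_mix:.3f}")
print(f"drop under traffic mix:   {drop_traffic_mix:.3f}")
```

With these numbers the traffic-weighted drop is twelve times the eval-mix drop, because the weight on the affected surface is twelve times larger.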

The Visibility You Lost By Saying "Auto"

When you owned the routing decision, you knew which model served each request because your code chose it. When the provider owns the routing decision, the response usually tells you which model actually ran — but only if you log it, and only if you treat that log as a first-class signal rather than a debugging field.
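One way to make the served model a first-class signal is to emit a structured record per request. The sketch below assumes the response body is a dict with a `model` field naming the concrete model that ran. That field name, the `surface` labels, and the logger name are assumptions to check against your provider's response schema and your own logging setup.

```python
import json
import logging
import time

logger = logging.getLogger("model_routing")  # hypothetical logger name


def routing_record(surface, requested_model, response_body):
    """Build one structured record of which model served a request."""
    served = response_body.get("model")  # assumed field; verify for your provider
    return {
        "ts": time.time(),
        "surface": surface,
        "requested_model": requested_model,
        "served_model": served if served else "unknown",
    }


def log_routing(surface, requested_model, response_body):
    """Log the record as one JSON line so it stays queryable per surface."""
    record = routing_record(surface, requested_model, response_body)
    logger.info(json.dumps(record, sort_keys=True))
    return record


# Usage with a stand-in response body.
example = log_routing("autocomplete", "auto", {"model": "claude-haiku-4-5"})
```

Recording "unknown" rather than dropping the row keeps missing data visible in the same report instead of silently shrinking the denominator.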

A practical baseline is a daily report of which concrete model served each surface, as a percentage of that surface's total traffic. Not as a count. As a share. The autocomplete surface should have a stable distribution: maybe 95% Haiku and 5% Sonnet, give or take a few points week to week. When that distribution moves — Sonnet's share drops below 1%, or Haiku's share on the long-form surface jumps from 12% to 30% — you have a population-level signal that the router's behavior on your traffic changed, before any quality metric has had time to slide.
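The daily report described above can be sketched in a few lines, assuming the request log has already been reduced to `(surface, served_model)` pairs. The input shape and the sample data are assumptions for illustration.

```python
from collections import Counter, defaultdict


def model_share_by_surface(requests):
    """Return {surface: {model: share}} where shares sum to 1 per surface."""
    counts = defaultdict(Counter)
    for surface, served_model in requests:
        counts[surface][served_model] += 1

    report = {}
    for surface, per_model in counts.items():
        total = sum(per_model.values())
        report[surface] = {model: n / total for model, n in per_model.items()}
    return report


# Stand-in for one day of logged requests.
day = (
    [("autocomplete", "claude-haiku-4-5")] * 19
    + [("autocomplete", "claude-sonnet-4-6")] * 1
    + [("long_form", "claude-sonnet-4-6")] * 22
    + [("long_form", "claude-haiku-4-5")] * 3
)

report = model_share_by_surface(day)
for surface, shares in sorted(report.items()):
    line = ", ".join(f"{m}: {s:.0%}" for m, s in sorted(shares.items()))
    print(f"{surface}: {line}")
```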

This is the same pattern as alerting on a 4xx rate as a fraction of total traffic instead of as an absolute count. Absolute counts are noisy because volume is noisy. Shares divide the volume out, so what is left is mostly sampling noise, and a real shift moves them in a way randomness does not.
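One possible way to separate a real shift from sampling noise is a two-proportion z-test on a model's share, combined with a minimum absolute change so that tiny but statistically significant wiggles at high volume do not page anyone. This is a sketch, not a prescribed alerting rule: the thresholds and the request counts below are assumptions.

```python
import math


def share_shift(baseline_hits, baseline_total, today_hits, today_total):
    """Return (change in share, z-score) for one model on one surface."""
    p0 = baseline_hits / baseline_total
    p1 = today_hits / today_total
    pooled = (baseline_hits + today_hits) / (baseline_total + today_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_total + 1 / today_total))
    z = (p1 - p0) / se if se > 0 else 0.0
    return p1 - p0, z


def should_alert(delta, z, min_delta=0.05, z_threshold=3.0):
    """Alert only when the shift is both large and unlikely to be noise."""
    return abs(delta) >= min_delta and abs(z) >= z_threshold


# Small-model share on a long-form surface: 12% over a trailing week
# versus 30% today (hypothetical request counts).
delta, z = share_shift(8_400, 70_000, 3_000, 10_000)
shift_alert = should_alert(delta, z)

# An ordinary day: 12% baseline versus 12.3% today.
quiet_delta, quiet_z = share_shift(8_400, 70_000, 1_230, 10_000)
quiet_alert = should_alert(quiet_delta, quiet_z)

print(f"shift: {delta:+.1%}, alert={shift_alert}")
print(f"quiet: {quiet_delta:+.1%}, alert={quiet_alert}")
```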

The reason this matters: by the time a quality score has moved, the regression is already in front of customers. By the time CSAT has moved, it is already in front of churn. The model-share distribution moves first because it is the cause, not the symptom. Watching the cause buys you weeks of lead time.

What Pinning Actually Buys You

The cleanest defense against silent routing drift is to not opt into it: name the concrete model, not the alias. Production AI systems treat the model the same way they treat any other versioned dependency. You pin a database engine version. You pin a Kubernetes minor. You pin a security library by SHA. The model is the same kind of dependency, and the alias is the same kind of trap that "latest" tags are in container registries — convenient until they aren't.

Pinning gives you three things that the alias does not.

It gives you a stable contract: the model behind your call today is the model behind your call next quarter, unless you change it. Behavior changes still happen, because providers can revise a specific model, but a revision has to ship under a new identifier, and the change is something you can see and decide about.

It gives you a clear deprecation surface: when the provider sunsets your pinned model, they have to tell you. The deprecation has a date, a successor recommendation, and a window to migrate. The alias has none of that because the alias is the migration path. Your migration is the provider's edit to a routing table you don't see.

It gives you an auditable economic decision: the cost-quality tradeoff is in your code, not in the provider's policy. You decided to route this surface to Haiku because the eval says it is fine. If the eval changes, you change the routing. If the surface is more important than you thought, you upgrade it. The decision sits in version control next to the prompt it serves, and any future engineer can see why the choice was made.
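A sketch of what "the decision sits in version control" could look like: one record per surface that names the pinned model and why it was chosen. The class, its fields, the eval path, the rationale text, and the date are all hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ModelDecision:
    surface: str
    model: str        # concrete identifier, never an alias
    rationale: str    # why this tier is acceptable for this surface
    eval_slice: str   # where the supporting eval lives
    decided_on: date


DECISIONS = [
    ModelDecision(
        surface="autocomplete",
        model="claude-haiku-4-5",
        rationale="Placeholder: short prompts, latency-sensitive surface.",
        eval_slice="evals/autocomplete",  # hypothetical path
        decided_on=date(2025, 1, 15),     # placeholder date
    ),
]


def model_for(surface):
    """Look up the pinned model; fail loudly if nobody made a decision."""
    for decision in DECISIONS:
        if decision.surface == surface:
            return decision.model
    raise KeyError(f"no recorded model decision for surface {surface!r}")
```

Raising on an unknown surface, rather than falling back to a default, forces a new surface to go through the same review before it ships.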

Sub-agent systems already model this well — the model field accepts an alias, a full identifier, or an "inherit from parent" sentinel. The right discipline is to use the full identifier for anything in front of customers, reserve the alias for development and exploration, and inherit only when the parent is itself pinned.
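The discipline in the previous paragraph can be enforced at resolution time. The sketch below is not any particular framework's API: the `"inherit"` sentinel string, the function name, and the `customer_facing` flag are assumptions, and the pinned set is whatever identifiers your team has reviewed.

```python
INHERIT = "inherit"  # hypothetical sentinel value


def resolve_model(model_field, parent_model, pinned_models, customer_facing):
    """Resolve a sub-agent's model field and enforce the pinning rule."""
    resolved = parent_model if model_field == INHERIT else model_field
    if resolved is None:
        raise ValueError("'inherit' used without a parent model")
    # Customer-facing agents must end up on a reviewed, concrete identifier.
    # This also rejects inheriting from a parent that is itself on an alias.
    if customer_facing and resolved not in pinned_models:
        raise ValueError(
            f"customer-facing agent resolves to unpinned model {resolved!r}"
        )
    return resolved


pinned = {"claude-sonnet-4-6", "claude-haiku-4-5"}

# Pinned parent, inheriting child: allowed.
child = resolve_model(INHERIT, "claude-sonnet-4-6", pinned, customer_facing=True)
# Alias on an internal exploration agent: allowed.
scratch = resolve_model("auto", None, pinned, customer_facing=False)
```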

When You Do Want Auto

There are cases where opting into provider routing is reasonable. Internal tooling where the cost variance matters more than the quality consistency. Exploratory features where you do not yet know which model is the right fit and the router's classification is a useful prior. Workloads with traffic that is genuinely random in complexity and where a poorly-served outlier is annoying but not expensive.

The discipline that makes these cases safe is the same discipline that makes pinning unnecessary in those cases: you have to instrument the system so the router's behavior is visible. The model that served each call is logged. The share-of-traffic per model is plotted. The eval is sliced by which model actually ran, not by which model you expected to run. When the router moves under your feet, you see the move on the dashboard you already watch.
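Slicing the eval by the model that actually ran is a small grouping step once each eval result carries the served model. The record shape and the scores below are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def scores_by_served_model(results):
    """Group eval results by the model that actually answered each case."""
    buckets = defaultdict(list)
    for result in results:
        buckets[result["served_model"]].append(result["score"])
    return {
        model: {"n": len(scores), "mean": mean(scores)}
        for model, scores in buckets.items()
    }


# Stand-in eval results, each tagged with the served model from the response.
results = [
    {"case": "tooltip-01", "served_model": "claude-haiku-4-5", "score": 0.7},
    {"case": "tooltip-02", "served_model": "claude-haiku-4-5", "score": 0.9},
    {"case": "longform-01", "served_model": "claude-sonnet-4-6", "score": 1.0},
]

sliced = scores_by_served_model(results)
for model, stats in sorted(sliced.items()):
    print(f"{model}: n={stats['n']}, mean={stats['mean']:.2f}")
```

The per-model count is worth watching on its own: if the eval shows almost no cases on the small model while production traffic is mostly on it, the eval is not measuring what customers see.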

The architectural realization is that "let the provider decide" is not a way to opt out of the routing decision — it is a way to opt into a routing decision someone else is making continuously, in response to incentives that are not yours. That can be a reasonable choice. It is not a free one. And the bill comes due not in cost, where it is visible, but in quality on the surfaces your eval was never weighted to cover.

A Short Checklist Before You Ship "Auto"

A few questions worth asking before any production surface goes through a provider's auto alias:

  • Do you log the concrete model that served each request, and is that log queryable per surface?
  • Does your eval suite have a slice that matches the traffic mix on the surfaces using the alias, scored on the dimensions those surfaces care about?
  • Is there an alert on the model-share distribution per surface, so a routing shift surfaces before a quality regression?
  • Does your contract or your provider documentation commit to a routing policy, or only to "intelligent routing" as a marketing description?
  • Have you decided, in writing, the conditions under which you would migrate from the alias to a pinned model — and the conditions under which you would migrate back?

If the answer to any of these is no, the alias is doing something for the provider you may not be doing for yourself. That is a fine trade to make deliberately. It is a bad trade to make by default, in the same line of code that names a model.

The team that opts into provider routing without owning the audit is letting the vendor set its margin against a quality metric only the customers feel. The team that pins the model and slices the eval by surface keeps the decision local, the cost honest, and the regression detectable while there is still time to act on it.
