
The Model Identifier Your Provider Re-Pointed to a Finetune for One Tenant and to Base for Everyone Else

· 11 min read
Tian Pan
Software Engineer

A customer support team escalates: "Your assistant used to handle refund-eligibility questions correctly. Last week it started getting them wrong." The on-call engineer pulls a transcript, replays the exact prompt against the same model identifier in a dev account, gets the correct answer, and closes the ticket as "cannot reproduce." Two weeks later the same complaint shows up from a different customer. The engineer replays again, in the same dev account, and gets the correct answer again. The team starts blaming a prompt change nobody made.

The model identifier in the request never changed. The string in the response field matched the string in the request field. The eval suite stayed green for six weeks. Yet for nearly the entire life of the account, production traffic and the eval suite had been served by different sets of weights. Only in the last six weeks did they become the same set of weights, and the team found out only because a customer noticed first.

This is what happens when the model identifier is treated as a name for the weights, when in fact it is a label on a routing decision the provider is free to revise.

The identifier is a routing label, not a model

A high-volume customer negotiates a "preferred model" arrangement with their provider: the customer's traffic gets a tenant-scoped fine-tuned variant trained on a corpus the customer cares about, and the variant is served behind the same public model identifier the customer's code already calls. The customer doesn't need to update SDKs, gateway configs, or prompt templates. The string claude-sonnet-4-6 (or gpt-5.1, or any other identifier) keeps appearing in request and response payloads. The fine-tune is invisible at the API boundary, which is exactly what the customer asked for when they signed the contract.

The provider's routing layer makes a per-tenant decision on every request. For your tenant, the identifier resolves to a specific set of weights stored on a dedicated pool. For everyone else's tenant, the same identifier resolves to the base model. The response payload echoes the request identifier in both cases because that is what the API contract guarantees: the model field tells you which identifier you asked for, not which weights answered.
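The per-tenant resolution described above can be sketched as a lookup keyed on tenant and identifier. This is a toy model, not any provider's actual routing code: the table names, tenant IDs, weight names, and the `served_weights` field are all hypothetical, and the identifier `public-model-v1` is a placeholder.

```python
from dataclasses import dataclass

# Hypothetical routing tables. A real provider's internals are not public;
# every name here is an illustrative assumption.
TENANT_OVERRIDES = {
    ("tenant-acme", "public-model-v1"): "acme-finetune-weights",
}
DEFAULT_WEIGHTS = {
    "public-model-v1": "base-weights",
}


@dataclass(frozen=True)
class Response:
    model: str           # echoed back to the caller, always the requested identifier
    served_weights: str  # internal detail; NOT visible in a real API payload


def route(tenant_id: str, model: str) -> Response:
    """Resolve the identifier per tenant, then echo the identifier unchanged."""
    weights = TENANT_OVERRIDES.get((tenant_id, model), DEFAULT_WEIGHTS[model])
    return Response(model=model, served_weights=weights)


# Production tenant vs. a dev workspace without the tenant-scoped routing.
prod = route("tenant-acme", "public-model-v1")
dev = route("tenant-acme-dev", "public-model-v1")

# The provider later removes the override (re-points the tenant to base).
TENANT_OVERRIDES.pop(("tenant-acme", "public-model-v1"))
prod_after = route("tenant-acme", "public-model-v1")
```

In this sketch `prod.model`, `dev.model`, and `prod_after.model` are identical strings in all three cases, while `served_weights` differs between `prod` and `dev` and then converges after the override is removed. The caller only ever sees the first field.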

Most teams never notice this gap because the average difference between "fine-tuned for our domain" and "base with a system prompt" is small: an A/B test at signup justifies the program, and then nobody re-validates. The gap becomes a problem when one of three things happens. The provider re-points the identifier back to base, because the dedicated capacity cost no longer pencils out at the provider's mid-quarter operations review. The team's eval account, sitting on a dev workspace without the tenant-scoped routing, was always on base and never saw the fine-tune. Or production traffic shifts toward the specialty domain the fine-tune was trained on, and the gap that used to be averaged away becomes visible to a customer.

Why the eval suite stays green

The team built their eval suite the way everyone builds eval suites: a development account with an API key, a dataset of representative prompts, a scoring rubric, and a CI job that runs nightly. The eval account belongs to a separate workspace from production for the same reasons every other secret lives in a separate workspace. The eval account is not paying for dedicated capacity. The eval account is not enrolled in the preferred-model program. The eval account has been on the base model since the dev workspace was created, because the routing decision is bound to the tenant, not to the identifier in the request.

The eval suite's job is to certify "this identifier produces these answers." It has been doing that job correctly for the model served to the eval account, which is the base model. When the production tenant was on the fine-tune, the eval suite's certification did not describe production. When the production tenant gets re-pointed to base, the certification suddenly does describe production, but not because the eval became more accurate: production degraded to match what the eval had been measuring all along. The numbers don't move because nothing the eval is measuring has changed.
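One way to make this mismatch visible is to score the same prompts under both sets of credentials and track the difference. The sketch below is a minimal illustration under stated assumptions: `tenant_gap` is a hypothetical helper, the two "clients" are stand-in functions returning made-up rubric scores rather than real API calls, and the prompts are invented.

```python
from statistics import mean
from typing import Callable, Sequence

# A "client" here is any callable mapping a prompt to a rubric score (0-100).
# In practice it would call the provider with eval or production credentials
# and grade the answer; those details are left out of this sketch.
ScoreFn = Callable[[str], float]


def tenant_gap(prompts: Sequence[str], eval_client: ScoreFn, prod_client: ScoreFn) -> float:
    """Mean score under production credentials minus mean under eval credentials."""
    eval_mean = mean(eval_client(p) for p in prompts)
    prod_mean = mean(prod_client(p) for p in prompts)
    return prod_mean - eval_mean


# Stand-ins with invented scores: base is weaker on refund questions only.
def base_model(prompt: str) -> float:
    return 78.0 if "refund" in prompt else 90.0


def fine_tune(prompt: str) -> float:
    return 90.0


prompts = ["refund eligibility for a late return", "reset my password"]

# While production is on the fine-tune, the two tenants disagree.
gap_before = tenant_gap(prompts, eval_client=base_model, prod_client=fine_tune)
# After the re-point, both tenants hit base and the gap closes.
gap_after = tenant_gap(prompts, eval_client=base_model, prod_client=base_model)
# The eval-side mean is the same in both cases: the green line never moves.
eval_mean = mean(base_model(p) for p in prompts)
```

With these invented numbers the eval-side mean is 84.0 before and after, while the gap goes from 6.0 to 0.0. The signal is the gap closing, not the eval score moving.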

The team monitoring the eval suite sees a flat green line and concludes the model is stable. The customer-facing quality on the specialty workload regresses by whatever lift the fine-tune was providing, which on a narrow domain can easily be 10–15 points on a task-specific rubric, and which is invisible in any aggregate metric averaged across the long tail of normal traffic.
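The arithmetic behind that invisibility is worth spelling out. In the sketch below, the 4% traffic share and the individual scores are assumptions chosen for illustration; only the size of the specialty drop (12 points, inside the 10–15 range above) comes from the text.

```python
def aggregate(slices):
    """Traffic-weighted mean score; `slices` is a list of (traffic_share, score)."""
    total_share = sum(share for share, _ in slices)
    return sum(share * score for share, score in slices) / total_share


# Assumed numbers: the specialty workload is 4% of traffic.
SPECIALTY_SHARE = 0.04
REST_SCORE = 85.0          # unchanged long-tail traffic
SPECIALTY_BEFORE = 90.0    # with the fine-tune
SPECIALTY_AFTER = 78.0     # after the re-point to base

before = aggregate([(SPECIALTY_SHARE, SPECIALTY_BEFORE), (1 - SPECIALTY_SHARE, REST_SCORE)])
after = aggregate([(SPECIALTY_SHARE, SPECIALTY_AFTER), (1 - SPECIALTY_SHARE, REST_SCORE)])

specialty_drop = SPECIALTY_BEFORE - SPECIALTY_AFTER
aggregate_drop = before - after

print(f"specialty drop: {specialty_drop:.1f} points")
print(f"aggregate drop: {aggregate_drop:.2f} points")
```

Under these assumptions a 12-point regression on the specialty slice moves the traffic-weighted aggregate by about half a point, which is well inside ordinary week-to-week noise. Slicing the metric by workload is what exposes it.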

How the gap stayed hidden for six weeks

Six weeks is the timeline because the regression has to compete with all the other reasons a model output might be slightly worse this week than last week. Prompt drift from people editing system messages without telling each other. Tool descriptions added to the registry that change the model's planning. Upstream context-window changes that drop early conversation turns. Customer-mix shifts that surface new question patterns. Provider-side temperature or sampler changes that show up as system_fingerprint differences but get dismissed because everyone has stopped tracking that field.
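Tracking that last field again costs very little. The sketch below scans logged responses and reports each change in `system_fingerprint` per model identifier. The record shape (`ts`, `model`, `system_fingerprint` keys) and the sample values are assumptions about how a team might log responses, and whether a tenant-scoped re-point would change the fingerprint at all depends on the provider.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

Change = Tuple[str, str, Optional[str], Optional[str]]


def fingerprint_changes(records: Iterable[dict]) -> Iterator[Change]:
    """Yield (ts, model, old_fp, new_fp) whenever a model's fingerprint changes.

    `records` must be in chronological order. The first record seen for a
    model only sets the baseline and is not reported as a change.
    """
    last_seen: Dict[str, Optional[str]] = {}
    for rec in records:
        model = rec["model"]
        fp = rec.get("system_fingerprint")  # may be absent on some responses
        if model in last_seen and last_seen[model] != fp:
            yield (rec["ts"], model, last_seen[model], fp)
        last_seen[model] = fp


# Invented log entries for illustration only.
log = [
    {"ts": "2025-01-01T00:00:00Z", "model": "public-model-v1", "system_fingerprint": "fp_aaa"},
    {"ts": "2025-01-02T00:00:00Z", "model": "public-model-v1", "system_fingerprint": "fp_aaa"},
    {"ts": "2025-01-03T00:00:00Z", "model": "public-model-v1", "system_fingerprint": "fp_bbb"},
]

changes = list(fingerprint_changes(log))
```

Here `changes` holds a single entry for the third record. Writing such entries to the same timeline as prompt edits and tool-registry changes gives the on-call engineer one more candidate cause to rule in or out.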
