
The Bug Report Against a Model Version You No Longer Serve

· 11 min read
Tian Pan
Software Engineer

A customer support ticket arrives on a Tuesday. The customer attached a screenshot of an output your product generated six weeks ago. They say it is wrong, or unsafe, or simply not what they expected, and they want it fixed. Your support engineer pastes the prompt back into the same API endpoint and gets a clean, reasonable answer. The bug, as far as the system can tell, does not exist.

The bug exists. The model that produced the screenshot does not. Since that output was generated, the weights behind your v1-chat endpoint have been swapped twice (once for a quality bump, once for a cost optimization), and the original checkpoint is no longer reachable. The customer's "this is broken" is now an unfalsifiable claim against a moving target, and the support team has no path to either confirm it or close it out.

This is not a quirky edge case. It is the predictable consequence of treating model versioning as an internal MLOps concern when it is actually a customer-visible product contract. The endpoint URL is stable. The artifact behind it is not. Until your support workflow, your retention policy, and your customer contract acknowledge that gap, every bug report against a rotated checkpoint will land in the same triage void.

The Endpoint Is Not the Model

The mental model your team built around the endpoint is wrong, and it is wrong in a way that has been working for you up until now. You named the route v1-chat and you wrote in the changelog that "the v1 contract is the schema, not the model." That sentence is technically defensible — the request shape, the response shape, the auth headers, the rate limits, none of those have changed. The model, on the other hand, has been continuously upgraded behind the same URL because that is how the team chose to interpret "v1."

The customer interpreted "v1" differently. To them, the endpoint is a black box that produced a specific output on a specific day, and the name on the door implied that whatever was inside the box would keep behaving the same way. The frontier providers have learned this lesson the hard way and now expose both pinned snapshot IDs and floating aliases: Anthropic's claude-sonnet-4-5-20250929 versus claude-sonnet-4-5, OpenAI's dated suffixes versus the bare family names. The pinned snapshot is a promise that the weights and the configuration behind that ID will not change for the lifetime of the ID. The floating alias is a convenience that explicitly forfeits that promise.

If you only offer the floating alias, you have shipped a contract where the substance is mutable. The customer cannot pin even if they want to. The fact that you have engineering reasons for rotating the checkpoint — newer weights are cheaper, safer, better — does not change what the customer signed for, which was the behavior they saw during the trial. The endpoint name and the served artifact need to be two distinct things, and the customer needs the ability to name the latter when something goes wrong with it.
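One way to picture the split between the endpoint name and the served artifact is a small resolution table. The sketch below is illustrative only: the snapshot IDs, the rollout dates, and the `resolve` helper are all hypothetical, not a description of any real serving stack. It assumes a floating alias maps to whichever pinned snapshot was live on a given date, while a pinned ID always resolves to itself.

```python
from datetime import date

# Hypothetical rollout history for the floating alias "v1-chat".
# Each entry: (date the snapshot started serving, pinned snapshot ID).
ALIAS_HISTORY = {
    "v1-chat": [
        (date(2025, 1, 6), "v1-chat-20250106"),   # checkpoint A
        (date(2025, 2, 3), "v1-chat-20250203"),   # checkpoint B
        (date(2025, 2, 24), "v1-chat-20250224"),  # checkpoint C
    ],
}

# Every snapshot ID that has ever served behind an alias.
PINNED_IDS = {snap for history in ALIAS_HISTORY.values() for _, snap in history}


def resolve(model: str, on: date) -> str:
    """Return the pinned snapshot ID that `model` referred to on date `on`."""
    if model in PINNED_IDS:
        # A pinned ID is a promise: it names the same artifact on every date.
        return model
    history = ALIAS_HISTORY.get(model)
    if history is None:
        raise KeyError(f"unknown model name: {model}")
    # A floating alias names whichever snapshot most recently went live.
    live = [snap for start, snap in sorted(history) if start <= on]
    if not live:
        raise LookupError(f"{model} was not serving on {on}")
    return live[-1]
```

With a table like this, a request to `v1-chat` six weeks ago and a request to `v1-chat` today resolve to different snapshot IDs, and that difference is exactly what a bug report needs to be able to name.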

The Triage Path That Doesn't Exist

Once the artifact is mutable, the entire support workflow needs to absorb that. Most workflows don't. The support engineer's default playbook assumes a deterministic, reproducible system: take the inputs from the bug report, re-run them against the current system, observe whether the bug is still present, escalate if yes, close if no. That playbook quietly breaks the moment "the current system" is not the same system the bug was filed against.

The failure mode goes like this. The customer's screenshot was generated against checkpoint A. The support engineer re-runs it against checkpoint C, which is what's behind the endpoint today. The output is different — sometimes better, sometimes just different in ways that do not exhibit the original problem. The engineer closes the ticket as "cannot reproduce." The customer either reopens it with more screenshots of the same checkpoint-A behavior they cannot regenerate, or quietly loses faith in the product. Neither outcome is a fix.

What the workflow is missing is a triage path for bugs filed against retired artifacts. It needs three things the current ticket pipeline almost certainly does not have: the model version that produced the original output captured in the ticket itself, a way to route a non-reproducible bug to the team that retired the checkpoint rather than dead-ending it at support, and a policy decision — made once, written down — about what the company will tell a customer when the bug is real but the artifact is gone. "We changed the model and we cannot reproduce your issue" is a perfectly honest answer. The problem is that no one has ever said it out loud, so the support engineer makes it up on the fly and the customer hears improvisation.
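The first two of those three pieces can be sketched as a ticket record plus a routing function. This is a minimal sketch under stated assumptions: the `Ticket` fields, the `Route` names, and the idea that support can query a set of still-reachable snapshot IDs are all hypothetical, and a real ticketing system would carry far more context.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Set


class Route(Enum):
    NEEDS_VERSION_INFO = "needs_version_info"
    REPRODUCE_ON_CURRENT = "reproduce_on_current"
    REPRODUCE_ON_PINNED = "reproduce_on_pinned"
    ESCALATE_TO_MODEL_OWNERS = "escalate_to_model_owners"


@dataclass(frozen=True)
class Ticket:
    prompt: str
    reported_output: str
    # Snapshot ID recorded when the output was generated (hypothetical field).
    served_model_id: Optional[str]


def triage(ticket: Ticket, current_id: str, reachable_ids: Set[str]) -> Route:
    """Pick a triage path based on which artifact produced the reported output."""
    if ticket.served_model_id is None:
        # Without the version, a "cannot reproduce" result proves nothing.
        return Route.NEEDS_VERSION_INFO
    if ticket.served_model_id == current_id:
        # Same artifact as today: the classic re-run playbook is valid.
        return Route.REPRODUCE_ON_CURRENT
    if ticket.served_model_id in reachable_ids:
        # Rotated but retained: re-run against the pinned snapshot.
        return Route.REPRODUCE_ON_PINNED
    # Retired and unreachable: hand off to the team that retired it,
    # rather than closing the ticket as "cannot reproduce".
    return Route.ESCALATE_TO_MODEL_OWNERS
```

The point of the last branch is that a bug against a retired artifact gets an owner and a written policy answer instead of a dead end at support.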

The Customer's Eval Was a Procurement Decision

This part is the one that turns a support inconvenience into a contractual problem. During the trial, the customer ran your endpoint against their own eval set — the one their compliance team approved, the one the procurement committee referenced when they signed off on the contract. The numbers from that eval are now sitting in a deck, in a memo, in a security questionnaire response, in an internal justification document. Those numbers were generated against checkpoint A.

When you silently rotated to checkpoint C, you invalidated the empirical basis for the procurement decision. The customer probably did not notice, because the rotation was silent. But they should have noticed, because the numbers no longer hold. Re-run the same eval today and the scores will differ — perhaps better, perhaps worse, perhaps better on average but worse on the specific subset of cases that mattered to whichever stakeholder pushed the deal across the line. Either way, the document that justified buying your product no longer describes the product they have.
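The "better on average but worse on the subset that mattered" case is easy to miss if only the headline score is compared. The sketch below uses made-up pass/fail results purely for illustration; the subset names, the counts, and both helper functions are hypothetical. It assumes each eval result is a `(subset, passed)` pair and that no subset is literally named "overall".

```python
from collections import defaultdict


def pass_rates(results):
    """Map each subset (plus "overall") to its pass rate.

    `results` is a non-empty list of (subset_name, passed) pairs.
    """
    if not results:
        raise ValueError("no eval results to score")
    totals = defaultdict(lambda: [0, 0])  # subset -> [passed, total]
    for subset, passed in results:
        totals[subset][0] += int(passed)
        totals[subset][1] += 1
    rates = {subset: p / n for subset, (p, n) in totals.items()}
    rates["overall"] = len([1 for _, ok in results if ok]) / len(results)
    return rates


def regressions(baseline, current, tolerance=0.0):
    """Return {name: delta} for every score that dropped by more than `tolerance`."""
    base, cur = pass_rates(baseline), pass_rates(current)
    deltas = {name: cur[name] - base[name] for name in base if name in cur}
    return {name: d for name, d in deltas.items() if d < -tolerance}


# Illustrative data only: checkpoint A is what the customer evaluated,
# checkpoint C is what serves today.
checkpoint_a = ([("general", True)] * 6 + [("general", False)] * 4
                + [("pii_redaction", True)] * 4)
checkpoint_c = ([("general", True)] * 9 + [("general", False)] * 1
                + [("pii_redaction", True)] * 2 + [("pii_redaction", False)] * 2)

flagged = regressions(checkpoint_a, checkpoint_c)
```

In this toy data the overall pass rate rises from 10 of 14 to 11 of 14, so the headline looks like an improvement, while `flagged` contains only the `pii_redaction` subset, which fell from 4 of 4 to 2 of 4.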

The legally interesting failure happens at renewal time, or at the first audit, when the customer's compliance team pulls the original eval report and asks the AI team to refresh it. The refresh shows different numbers. Now someone has to explain, in writing, why the product the company is paying for behaves differently from the product the company evaluated. The answer, "the vendor upgraded the model," is fine if you told them and they consented. It is not fine if you didn't, and "we treat the model as an implementation detail behind the endpoint" reassures your platform team far more than it reassures a regulated buyer.

Model Versioning as a First-Class Contract
