
The Semver Lie: Why a Minor LLM Update Breaks Production More Reliably Than a Major Refactor

11 min read
Tian Pan
Software Engineer

There is a quiet myth in AI engineering that goes like this: a "minor" model bump — claude-x.6 to claude-x.7, gpt-y.0 to gpt-y.1, the patch-level snapshot rolling forward by a date — should be a drop-in upgrade. The provider's release notes talk about improved reasoning, lower latency, better tool use. The version number ticks gently. Nothing about the change reads as breaking.

Then it ships. And the on-call channel lights up with reports that the summarizer is now adding a paragraph that wasn't there before, that the JSON extractor is escaping unicode it used to leave alone, that the agent loop is now hitting the max-step ceiling on tasks that used to terminate in three calls. The eval scores look fine in aggregate; the user-visible feature is subtly wrong.

The mistake is treating LLM versions like software versions. They aren't. Software semver is a contract — a 1.2.3 → 1.2.4 patch promises behavioral compatibility, and breaking that promise is a bug the maintainer has to fix. LLM versions carry no such contract. The output distribution is the API surface, and that distribution shifts with every weight change, every tokenizer tweak, every serving-stack optimization. The version number is marketing. The behavior is the thing you actually depend on, and nobody is committing to keep it stable.

The compatibility contract that doesn't exist

Pull the deprecation pages of any major provider and you will find policies about availability — when a model gets retired, how long the snapshot stays callable, what the migration window looks like. You will find almost nothing about behavioral compatibility. The contract is "you can keep calling this endpoint for N months." It is not "the next version will respond the same way to your prompts."

That gap is structural, not negligent. Even the providers can't make the guarantee. A model is a 200-billion-parameter function whose behavior emerges from training; small changes in data mix, RLHF reward shaping, or post-training safety filtering produce non-local effects on outputs that no one — not the lab, not the eval team — fully maps before release. Apple's research on model-update compatibility coined the term "negative flips" for the phenomenon: instances that the previous model got right, the new model gets wrong, even when aggregate accuracy improves. The aggregate hides the regressions, and the regressions are exactly what your production system depends on.

So the right mental model is: every model bump is a breaking change in disguise. The version number doesn't matter. The patch level doesn't matter. The marketing copy that says "improved instruction following" doesn't matter. What matters is whether the joint distribution P(output | your specific prompts, your specific inputs, your specific tool schemas) shifted. And the answer is almost always yes.

Why minor bumps break worse than major ones

The counterintuitive part is that minor updates often break production harder than major ones. This is a behavioral pattern, not a quirk of any single provider, and there are three reasons it keeps showing up.

The first is psychological. A major version bump — Claude 3 to Claude 4, GPT-4 to GPT-5 — sets off institutional alarms. Engineers re-run evals. Product managers schedule rollout reviews. There is a budget, a launch plan, a rollback story. A minor bump or a snapshot date roll triggers none of that. It looks routine. It gets pushed through change management as a one-line config edit. The operational machinery that catches problems is calibrated to the version number, not to the actual scope of the change.

The second is structural. Major version bumps usually come with explicit migration guides because the provider knows the change is large. Minor bumps don't, because the provider's internal evals showed an aggregate improvement and they had no reason to flag anything. The lab's eval suite is not your eval suite. The aggregate gain on MMLU is invisible to a customer-support classifier whose distribution looks nothing like a benchmark. The smaller the version delta, the more likely the lab considers the change a free win, and the less they document about what shifted underneath.

The third is the alias trap. Most teams are not literally pinned to a snapshot — they are calling claude-sonnet or gpt-4o-latest, an alias that auto-upgrades when the provider rolls a new snapshot. Pinning to a snapshot is a deliberate engineering decision that someone has to make and re-make every time a new snapshot lands. Aliases default to "always current," which means the alias subscriber experiences every minor change as an instant unannounced production deployment, with zero rollout staging. This is the worst of both worlds: you get the breakage of a model swap and the surprise of a configuration push you didn't authorize.
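
A minimal sketch of the difference in code, with placeholder identifiers (the alias and snapshot names below are illustrative, not tied to any real release):

```python
# Illustrative config only: these identifiers are placeholders.

# Auto-upgrading alias: the provider decides when your behavior changes.
MODEL_ALIAS = "gpt-4o-latest"

# Pinned snapshot: behavior changes only when this constant is edited,
# reviewed, and deployed like any other code change.
MODEL_PINNED = "gpt-4o-2024-08-06"

def model_for_request() -> str:
    # Production resolves to the pinned constant; the alias is only used
    # deliberately, e.g. in an offline job that previews the next snapshot.
    return MODEL_PINNED
```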

What actually changes between snapshots

The visible release notes describe what the lab thinks is interesting. The actual delta is usually larger.

A minor bump can change the tokenizer, even subtly — and a different tokenization changes how prompts get cached, which changes cost, which changes whether your prompt-caching strategy is still sound. It can change the default temperature behavior, the way the model handles system vs user role precedence, the verbosity of refusals, the strictness of JSON mode, the likelihood of emitting a code fence around structured output. It can change the model's calibration on uncertainty — what used to come back as "I'm not sure, but…" might now come back as a confident hallucination, or vice versa.

It can also change the serving layer without changing the model. A new snapshot date might keep the weights the same and roll out a new attention kernel, a new quantization scheme, a different speculative-decoding path, a different fallback policy when the cluster is under load. From the customer side this looks like a model update, but it's actually an infrastructure update with model-shaped consequences. You cannot distinguish "weights changed" from "serving path changed" from the outside, and the lab usually does not tell you which one happened.

The implication is that "minor update" is not a category that means anything operationally. The right unit is "any change to the version string we send in the API call." Every such change should trigger the same compatibility dance, regardless of whether the change is a snapshot date roll, a patch number, or a major version. The provider's labeling should not determine your rollout discipline; your rollout discipline should treat all of them as breaking until evaluated.

The four-layer rollout discipline

If every model change is a breaking change, the operational answer is to run every change through the same rollout pipeline. Four layers cover the realistic failure modes.

Layer one: a pinned eval baseline. Before the version change is even attempted, there is a frozen test set with known-good outputs from the current production version. The set is not a vendor benchmark; it is drawn from real production traffic, deduplicated and filtered, with edge cases the team has been bitten by historically. The eval suite produces a per-prompt diff against the new version, not just a summary score. The summary score will lie. The per-prompt diff is what surfaces the regressions that average out to a wash in the summary.
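
A minimal sketch of that per-prompt diff, assuming the team already has thin call_model and judge wrappers around its provider client and its pass/fail checker (both names are assumptions here, not a real library API):

```python
import json
from typing import Callable

def diff_against_baseline(
    baseline_path: str,
    candidate_version: str,
    call_model: Callable[[str, str], str],   # (version, prompt) -> output
    judge: Callable[[str, str, str], bool],  # (prompt, baseline_out, candidate_out) -> pass?
) -> list[dict]:
    """Return one record per regressed prompt, never just an aggregate score."""
    with open(baseline_path) as f:
        # Each JSONL line: {"prompt": ..., "baseline_output": ...}, captured
        # from the current production version before the bump is attempted.
        cases = [json.loads(line) for line in f]

    regressions = []
    for case in cases:
        candidate_out = call_model(candidate_version, case["prompt"])
        if not judge(case["prompt"], case["baseline_output"], candidate_out):
            regressions.append({
                "prompt": case["prompt"],
                "baseline": case["baseline_output"],
                "candidate": candidate_out,
            })
    return regressions
```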

Layer two: shadow traffic. The candidate version runs alongside production, receiving the same inputs, but its outputs are never shown to users. Every shadow output is logged with the corresponding production output, and an LLM-as-judge or rule-based checker scores them on the dimensions that matter for the use case — schema compliance, factual agreement with retrieved context, tone match, completion of the task without trailing artifacts. Shadow runs need to last long enough to see the long-tail inputs that don't appear in the curated eval set; a week of shadow traffic typically surfaces classes of regression that the eval suite missed.
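
A sketch of what each logged shadow record might look like, assuming a team-supplied score_pair checker (a hypothetical name) that scores the dimensions named above:

```python
import json
import time
from typing import Callable

def log_shadow_pair(
    request_id: str,
    prompt: str,
    prod_output: str,
    shadow_output: str,
    score_pair: Callable[[str, str, str], dict],  # (prompt, prod, shadow) -> per-dimension scores
    sink_path: str = "shadow_diffs.jsonl",
) -> None:
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "prod": prod_output,
        "shadow": shadow_output,
        # e.g. {"schema_ok": True, "factual_agreement": 0.9, "tone_match": 0.8}
        "scores": score_pair(prompt, prod_output, shadow_output),
    }
    with open(sink_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```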

Layer three: canary rollout. Once shadow looks clean, route a small slice of real traffic — five percent is a common starting point — to the new version. Watch the operational metrics that proxy for behavior change: token count per response, latency P50 and P99, error rate on downstream parsing, retry rate on tool calls, user-side signals like thumbs-down rates or session abandonment. Do not collapse the canary window because the eval looks fine. The whole point of canary is to surface what the eval missed. Forty-eight hours of canary at low percentage is cheap insurance against a multi-day rollback.
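
One common way to implement the split is a deterministic hash on the request or session id, so the same caller stays in one bucket for the whole canary window; the version strings and percentage below are illustrative:

```python
import hashlib

CANARY_PERCENT = 5
PINNED_VERSION = "model-2024-06-01"      # illustrative snapshot names
CANDIDATE_VERSION = "model-2024-09-15"

def version_for_request(request_id: str) -> str:
    # Hash the id into a stable bucket in [0, 100).
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return CANDIDATE_VERSION if bucket < CANARY_PERCENT else PINNED_VERSION
```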

Layer four: a real rollback path. "Rollback" cannot mean "edit the config and redeploy." It has to mean "flip a flag and traffic moves back to the pinned version in under five minutes." This requires keeping the previous version pinned and warm — not deprecated, not retired internally, kept in the routing table with positive intent — until the new version has been at one hundred percent for long enough that you trust it. The cost of keeping the old snapshot warm is small. The cost of not being able to roll back when an outage starts is the entire incident.
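
A sketch of a flag-driven router, with a plain JSON file standing in for whatever feature-flag store the team already runs (file name and schema are assumptions for illustration):

```python
import json

def resolve_model_version(flag_path: str = "model_routing.json") -> str:
    # Example flag file contents:
    #   {"active": "candidate",
    #    "pinned": "model-2024-06-01",
    #    "candidate": "model-2024-09-15"}
    with open(flag_path) as f:
        routing = json.load(f)
    active = routing.get("active", "pinned")
    # The previous version stays in the table, warm and callable, so a
    # rollback is an edit to "active", not a redeploy.
    return routing[active]
```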

The discipline only works if every layer is automated. A manual eval is a shipped regression. A shadow run that requires a human to read the logs is a shadow run that doesn't happen. Build the diff tooling and the judge harness and the canary metrics dashboard before the next minor bump lands, not during.

The org-chart consequence

Treating model bumps as breaking changes has an organizational cost that engineering leaders frequently underestimate. It means model upgrades cannot be a one-engineer task. It means the eval suite has to be staffed and maintained by someone who is on the hook when it goes stale. It means there is a person whose job is to watch the provider's release feed and trigger the rollout pipeline. It means the rollback path is exercised on a schedule — a fire drill — so that on the day it matters, the muscle memory exists.

Teams that don't do this aren't avoiding the cost. They're paying it in incidents. Every "the model just got dumber overnight" thread on engineering forums is a team that absorbed an unannounced behavioral change because their version-tracking discipline assumed the provider would warn them. The provider didn't warn them, because there was nothing to warn about under the contract that exists. The gap between the contract that exists and the contract teams assume exists is the gap where production breaks.

The healthier framing: the LLM provider is not a library vendor. It is closer to a managed service whose behavior shifts on its own cadence, like a third-party API that doesn't version its responses. You wouldn't deploy a critical workflow against a third-party API without contract tests, replay logs, and a fallback path. The same instinct applies to model versions, and the version string the provider hands you is the part you should trust the least.

Treat the version string as a constant, not a variable

The cleanest mental shift is to stop thinking of the model version as a value that updates. Treat it as a constant in your codebase. When that constant changes, it is a code change. It goes through the same review process as a database migration: an explicit decision, an evaluation, a staged rollout, a documented owner, a rollback plan. The version string is no less load-bearing than a SQL schema, and pretending otherwise is how teams discover their summarizer started hallucinating yesterday and nobody knows why.

The semver lie is convenient. It lets teams move fast in the name of staying current. The cost of the lie is that production becomes a function of the provider's release cadence rather than the team's release cadence, and the team has handed over the only lever that matters for reliability. Take it back. Pin the snapshot. Run the eval. Ship the canary. The model that runs in production tonight should be the model the team approved this morning, not whichever one the provider rolled out while everyone was asleep.
