The Semver Lie: Why a Minor LLM Update Breaks Production More Reliably Than a Major Refactor
There is a quiet myth in AI engineering that goes like this: a "minor" model bump (claude-x.6 to claude-x.7, gpt-y.0 to gpt-y.1, a dated snapshot rolling forward) should be a drop-in upgrade. The provider's release notes talk about improved reasoning, lower latency, better tool use. The version number ticks gently. Nothing about the change reads as breaking.
Then it ships. And the on-call channel lights up with reports that the summarizer is now adding a paragraph that wasn't there before, that the JSON extractor is escaping unicode it used to leave alone, that the agent loop is now hitting the max-step ceiling on tasks that used to terminate in three calls. The eval scores look fine in aggregate; the user-visible feature is subtly wrong.
The mistake is treating LLM versions like software versions. They aren't. Software semver is a contract — a 1.2.3 → 1.2.4 patch promises behavioral compatibility, and breaking that promise is a bug the maintainer has to fix. LLM versions carry no such contract. The output distribution is the API surface, and that distribution shifts with every weight change, every tokenizer tweak, every serving-stack optimization. The version number is marketing. The behavior is the thing you actually depend on, and nobody is committing to keep it stable.
The compatibility contract that doesn't exist
Pull the deprecation pages of any major provider and you will find policies about availability — when a model gets retired, how long the snapshot stays callable, what the migration window looks like. You will find almost nothing about behavioral compatibility. The contract is "you can keep calling this endpoint for N months." It is not "the next version will respond the same way to your prompts."
That gap is structural, not negligent. Even the providers can't make the guarantee. A model is a function with billions of parameters whose behavior emerges from training; small changes in data mix, RLHF reward shaping, or post-training safety filtering produce non-local effects on outputs that no one, neither the lab nor the eval team, fully maps before release. Apple's research on model-update compatibility uses the term "negative flips" for the phenomenon: instances that the previous model got right and the new model gets wrong, even when aggregate accuracy improves. The aggregate hides the regressions, and the regressed cases may be exactly the ones your production system depends on.
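To make the negative-flip idea concrete, here is a minimal sketch that reports regressions next to aggregate accuracy. It assumes you already have per-case pass/fail grades for both snapshots; the `CaseResult` type, the `flip_report` helper, and the ten graded cases are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    old_correct: bool  # graded against the currently deployed snapshot
    new_correct: bool  # graded against the candidate snapshot


def flip_report(results):
    """Summarize aggregate accuracy alongside per-case regressions."""
    if not results:
        raise ValueError("need at least one graded case")
    n = len(results)
    # Negative flip: the old snapshot passed, the new one fails.
    negative = [r.case_id for r in results if r.old_correct and not r.new_correct]
    # Positive flip: the old snapshot failed, the new one passes.
    positive = [r.case_id for r in results if not r.old_correct and r.new_correct]
    return {
        "old_accuracy": sum(r.old_correct for r in results) / n,
        "new_accuracy": sum(r.new_correct for r in results) / n,
        "negative_flips": negative,
        "positive_flips": positive,
        "negative_flip_rate": len(negative) / n,
    }


# Hypothetical grading of ten cases: accuracy rises from 0.7 to 0.8,
# yet two cases that used to pass now fail.
old_pass = {0, 1, 2, 3, 4, 5, 6}
new_pass = {0, 1, 2, 3, 4, 7, 8, 9}
results = [CaseResult(f"case-{i}", i in old_pass, i in new_pass) for i in range(10)]
report = flip_report(results)
print(report["old_accuracy"], report["new_accuracy"], report["negative_flips"])
```

A release gate built on the accuracy numbers alone would wave this update through. One built on `negative_flips` would at least force someone to look at the two cases that regressed.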
So the right mental model is: every model bump is a breaking change in disguise. The version number doesn't matter. The patch level doesn't matter. The marketing copy that says "improved instruction following" doesn't matter. What matters is whether the joint distribution P(output | your specific prompts, your specific inputs, your specific tool schemas) shifted. And the answer is almost always yes.
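For a classifier-style prompt, one coarse way to ask whether that distribution moved is to run the same inputs through both snapshots and compare label frequencies. The sketch below computes the total variation distance between the two empirical label distributions. It assumes the outputs have already been collected as lists of labels; the `total_variation` helper and the label names and counts are made up for illustration.

```python
from collections import Counter


def total_variation(old_outputs, new_outputs):
    """Total variation distance between two empirical label distributions.

    Returns 0.0 when the label frequencies match and 1.0 when they
    share no labels at all.
    """
    if not old_outputs or not new_outputs:
        raise ValueError("both output lists must be non-empty")
    old_counts = Counter(old_outputs)
    new_counts = Counter(new_outputs)
    labels = set(old_counts) | set(new_counts)
    return 0.5 * sum(
        abs(old_counts[label] / len(old_outputs) - new_counts[label] / len(new_outputs))
        for label in labels
    )


# Hypothetical labels from a support-ticket classifier on the same 100 inputs.
old_labels = ["refund"] * 50 + ["billing"] * 30 + ["other"] * 20
new_labels = ["refund"] * 40 + ["billing"] * 30 + ["other"] * 30
tv = total_variation(old_labels, new_labels)
print(round(tv, 3))  # roughly 0.1: a tenth of the label mass moved
```

This only compares marginals, so it is a smoke test rather than a proof of stability: the frequencies can match exactly while individual inputs swap labels, which is why per-input comparison still matters.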
Why minor bumps break worse than major ones
The counterintuitive part is that minor updates often break production harder than major ones. This is a behavioral pattern, not a quirk of any single provider, and there are three reasons it keeps showing up.
The first is psychological. A major version bump — Claude 3 to Claude 4, GPT-4 to GPT-5 — sets off institutional alarms. Engineers re-run evals. Product managers schedule rollout reviews. There is a budget, a launch plan, a rollback story. A minor bump or a snapshot date roll triggers none of that. It looks routine. It gets pushed through change management as a one-line config edit. The operational machinery that catches problems is calibrated to the version number, not to the actual scope of the change.
The second is structural. Major version bumps usually come with explicit migration guides because the provider knows the change is large. Minor bumps don't, because the provider's internal evals showed an aggregate improvement and they had no reason to flag anything. The lab's eval suite is not your eval suite. The aggregate gain on MMLU is invisible to a customer-support classifier whose distribution looks nothing like a benchmark. The smaller the version delta, the more likely the lab considers the change a free win, and the less they document about what shifted underneath.
The third is the alias trap. Many teams are not literally pinned to a snapshot. They are calling a floating alias such as claude-3-5-sonnet-latest or chatgpt-4o-latest, which moves when the provider rolls a new snapshot. Pinning to a snapshot is a deliberate engineering decision that someone has to make and re-make every time a new snapshot lands. Aliases default to "always current," which means the alias subscriber experiences every minor change as an instant unannounced production deployment, with zero rollout staging. This is the worst of both worlds: you get the breakage of a model swap and the surprise of a configuration push you didn't authorize.
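A cheap guard against the alias trap is a config check that fails CI when a model identifier is not pinned. The sketch below assumes the common convention that pinned snapshot identifiers end in a date stamp (YYYY-MM-DD or YYYYMMDD); that convention is not universal, so verify it against your provider's naming. The `is_pinned` and `unpinned_models` helpers and the config keys are hypothetical.

```python
import re

# Assumption: a pinned snapshot ID ends in a date stamp such as
# "-2024-08-06" or "-20241022". Anything else is treated as a floating alias.
_DATED_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model_id: str) -> bool:
    """Return True if the identifier looks like a dated snapshot."""
    return bool(_DATED_SUFFIX.search(model_id))


def unpinned_models(config: dict[str, str]) -> list[str]:
    """Return the config keys whose model ID is not a dated snapshot."""
    return sorted(name for name, model_id in config.items() if not is_pinned(model_id))


# Hypothetical service config mapping features to model identifiers.
config = {
    "summarizer": "gpt-4o-2024-08-06",
    "extractor": "claude-3-5-sonnet-20241022",
    "agent": "gpt-4o",
    "classifier": "claude-3-5-sonnet-latest",
}
print(unpinned_models(config))
```

Failing the build on a non-empty result turns "always current" from a silent default into a choice someone has to sign off on.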
What actually changes between snapshots
The visible release notes describe what the lab thinks is interesting. The actual delta is usually larger.
A minor bump can change the tokenizer, even subtly. A different tokenization changes the token count of the same prompt, which changes cost and context-window headroom, and can change whether your prompt-caching strategy is still sound. It can change the default sampling behavior, the way the model resolves conflicts between system and user instructions, the verbosity of refusals, the strictness of JSON mode, the likelihood of emitting a code fence around structured output. It can change the model's calibration on uncertainty: what used to come back as "I'm not sure, but…" might now come back as a confident hallucination, or vice versa.
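Several of these shifts are mechanical enough to detect without a grader. The sketch below fingerprints the format of a raw response (code fence, JSON validity, unicode escapes, length) and reports which properties differ between two snapshots' outputs for the same input. The `format_fingerprint` and `format_drift` helpers, the chosen properties, and the sample responses are illustrative assumptions, not a complete list of what can drift.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to keep this example readable


def format_fingerprint(raw: str) -> dict:
    """Capture format-level properties of a response, ignoring its meaning."""
    text = raw.strip()
    try:
        json.loads(text)
        parses = True
    except json.JSONDecodeError:
        parses = False
    return {
        "code_fenced": text.startswith(FENCE),
        "parses_as_json": parses,
        # Literal \uXXXX sequences in the raw text, e.g. "Z\u00fcrich".
        "has_unicode_escapes": bool(re.search(r"\\u[0-9a-fA-F]{4}", text)),
        "word_count": len(text.split()),
    }


def format_drift(old_raw: str, new_raw: str) -> dict:
    """Return {property: (old_value, new_value)} for every property that changed."""
    old_fp = format_fingerprint(old_raw)
    new_fp = format_fingerprint(new_raw)
    return {key: (old_fp[key], new_fp[key]) for key in old_fp if old_fp[key] != new_fp[key]}


# Hypothetical responses to the same extraction prompt from two snapshots.
old_response = '{"city": "Zürich"}'
new_response = FENCE + 'json\n{"city": "Z\\u00fcrich"}\n' + FENCE
drift = format_drift(old_response, new_response)
for key, (before, after) in drift.items():
    print(f"{key}: {before} -> {after}")
```

Both responses carry the same information, which is why a semantic grader can score them as equivalent while the downstream parser that calls `json.loads` on the raw text starts throwing.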
- https://machinelearning.apple.com/research/model-compatibility
- https://www.anthropic.com/research/deprecation-commitments
- https://platform.claude.com/docs/en/about-claude/models/overview
- https://developers.openai.com/api/docs/deprecations
- https://arxiv.org/html/2411.13768v3
- https://arxiv.org/html/2601.22025v1
- https://www.qwak.com/post/shadow-deployment-vs-canary-release-of-machine-learning-models
- https://dev.to/simon_paxton/llm-performance-drop-hosted-models-feel-worse-for-3-reasons-37fa
- https://medium.com/@komalbaparmar007/llm-canary-prompting-in-production-shadow-tests-drift-alarms-and-safe-rollouts-7bdbd0e5f9d0
