The Semver Lie: Why a Minor LLM Update Breaks Production More Reliably Than a Major Refactor
There is a quiet myth in AI engineering that goes like this: a "minor" model bump — claude-x.6 to claude-x.7, gpt-y.0 to gpt-y.1, the patch-level snapshot rolling forward by a date — should be a drop-in upgrade. The provider releases notes that talk about improved reasoning, lower latency, better tool use. The version number ticks gently. Nothing about the change reads as breaking.
Then it ships. And the on-call channel lights up with reports that the summarizer is now adding a paragraph that wasn't there before, that the JSON extractor is escaping unicode it used to leave alone, that the agent loop is now hitting the max-step ceiling on tasks that used to terminate in three calls. The eval scores look fine in aggregate; the user-visible feature is subtly wrong.
