Canary Deploys for LLM Upgrades: Why Model Rollouts Break Differently Than Code Deployments
Your CI passed. Your evals looked fine. You flipped the traffic switch and moved on. Three days later, a customer files a ticket saying every generated report has stopped including the summary field. You dig through logs and find the new model started reliably producing exec_summary instead — a silent key rename that your JSON schema validation never caught because you forgot to add it to the rollout gates. The root cause was a model upgrade. The detection lag was 72 hours.
This is not a hypothetical. It happens in production at companies that have sophisticated deployment pipelines for their application code but treat LLM version upgrades as essentially free — a config swap, not a deployment. That mental model is wrong, and the failure modes that result from it are distinctly hard to catch.
