Canary Deploys for LLM Upgrades: Why Model Rollouts Break Differently Than Code Deployments
Your CI passed. Your evals looked fine. You flipped the traffic switch and moved on. Three days later, a customer files a ticket saying every generated report has stopped including the summary field. You dig through the logs and find that the new model started reliably producing exec_summary instead: a silent key rename your JSON schema validation never caught, because that check was never wired into your rollout gates. The root cause was a model upgrade. The detection lag was 72 hours.
This is not a hypothetical. It happens in production at companies that have sophisticated deployment pipelines for their application code but treat LLM version upgrades as essentially free — a config swap, not a deployment. That mental model is wrong, and the failure modes that result from it are distinctly hard to catch.
Why LLM Upgrades Are Not Config Swaps
When you deploy a new version of your service code, the difference between old and new is fully enumerable: you can diff it, test it against known inputs, and assert on exact outputs. A model upgrade doesn't give you that. The new model is a function with billions of parameters you cannot inspect, and its behavior across the full distribution of production inputs is unknowable until you run it on production inputs.
Three failure modes show up specifically during model upgrades and don't exist in code deploys:
Output schema drift. Model providers improve their models without treating output format as a breaking-change contract. A field that was reliably named category in the previous version might now appear as ticket_category. Constrained decoding guarantees syntactically valid JSON, but it does not guarantee that your specific field names and structure are preserved. Downstream parsers that access dict["category"] will raise KeyError on every request, and unless you count and alert on those exceptions, the failure stays silent.
Semantic drift without surface regression. A new model can produce outputs that score well on automated evals while behaving differently in ways your eval suite didn't measure. The model might start attributing errors to the wrong component in your support ticket classifier, or begin generating legal disclaimers in different positions, breaking the text splitter downstream. The output looks coherent; the evaluation score holds; the product behavior breaks.
Latency and cost profile changes. A model that produces higher-quality output often does so with more tokens, longer generation time, or different throughput characteristics. A rollout that improves quality scores can simultaneously trigger SLA breaches or 3x your inference bill. Neither shows up in a quality eval.
Shadow Testing: Zero-Risk Validation on Real Traffic
The safest first step for any model upgrade is shadow testing: route production traffic to both the incumbent model and the candidate model simultaneously, serve users only the incumbent's response, and collect the candidate's output for comparison. Users see no change. You accumulate real production queries and real model behavior.
The value of shadow testing over benchmark evaluation is precisely that production traffic is not your benchmark distribution. Your benchmark is the queries you thought to write. Production traffic is the queries your actual users send — which includes edge cases, malformed inputs, queries in unexpected languages, and novel entity types that your evaluation set will never fully represent.
Shadow testing surfaces a class of problems that can only be found on real traffic: the specific JSON schema the candidate emits for your longest inputs, the latency profile under concurrent load, the hallucination rate on domain-specific terminology your users actually use. Running a shadow test for 24–72 hours before any traffic split gives you evidence you can act on.
Implementation is straightforward at the infrastructure layer. At the API gateway or load balancer, duplicate each request to both models. The incumbent's response returns synchronously to the user. The candidate's response is captured asynchronously and logged with the same metadata: input tokens, output tokens, latency, timestamp, user segment. You then run your scoring pipeline over both response sets and compare.
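A minimal sketch of that duplication step, assuming an async Python gateway; call_model stands in for your real model clients, and the logged fields are illustrative:
```python
import asyncio
import json
import time
import uuid


async def call_model(model_name: str, prompt: str) -> str:
    # Placeholder for your real model client (whatever SDK or gateway you use).
    await asyncio.sleep(0.05)
    return f"[{model_name}] response to: {prompt[:40]}"


def log_record(variant: str, request_id: str, latency_s: float,
               user_segment: str, output: str) -> None:
    # Identical metadata for both variants so the offline comparison lines up.
    print(json.dumps({
        "variant": variant,
        "request_id": request_id,
        "latency_s": round(latency_s, 3),
        "user_segment": user_segment,
        "output_chars": len(output),
    }))


async def handle_request(prompt: str, user_segment: str) -> str:
    """Serve the incumbent's response; capture the candidate's in the shadows."""
    request_id = str(uuid.uuid4())
    start = time.monotonic()

    # Start both calls, but only the incumbent blocks the user-facing response.
    candidate_task = asyncio.create_task(call_model("candidate", prompt))
    incumbent_output = await call_model("incumbent", prompt)
    log_record("incumbent", request_id, time.monotonic() - start,
               user_segment, incumbent_output)

    async def capture_candidate() -> None:
        try:
            output = await candidate_task
            log_record("candidate", request_id, time.monotonic() - start,
                       user_segment, output)
        except Exception as exc:
            log_record("candidate", request_id, time.monotonic() - start,
                       user_segment, f"ERROR: {exc}")

    # Fire-and-forget: candidate latency or failure never reaches the user.
    asyncio.create_task(capture_candidate())
    return incumbent_output


async def demo() -> None:
    print(await handle_request("Summarize ticket #4521", "enterprise"))
    await asyncio.sleep(0.2)  # give the shadow capture time to finish in this demo


if __name__ == "__main__":
    asyncio.run(demo())
```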
Traffic Splitting: Deterministic Routing, Progressive Allocation
Once shadow testing passes your quality gates, move to live traffic splitting. The most common mistake here is random per-request traffic assignment. Random assignment means the same user can see model A on one request and model B on the next, which makes it impossible to tell whether quality differences come from the model or from session-level context variation.
Use hash-based routing instead. Hash the user ID (or session ID, or customer ID, whatever identifier is stable) and assign the bucket deterministically. Users hashed into buckets 0–9 always see the candidate; everyone else sees the incumbent. This keeps each individual user's experience consistent and makes your A/B comparison statistically clean.
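A minimal sketch of that bucketing, assuming stable string IDs; the salt and the 10% allocation are illustrative:
```python
import hashlib


def route_request(user_id: str, candidate_percent: int,
                  salt: str = "llm-canary-v1") -> str:
    """Deterministically assign a stable ID to the candidate or the incumbent."""
    # Salting the hash decouples this rollout's buckets from any other
    # experiment that hashes the same user IDs.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return "candidate" if bucket < candidate_percent else "incumbent"


# The same user lands in the same bucket on every request.
assert route_request("user-42", 10) == route_request("user-42", 10)
```
A useful side effect of this scheme: because buckets 0–9 are a subset of buckets 0–24, raising the allocation from 10% to 25% keeps everyone already on the candidate in the candidate cohort, so individual experiences stay consistent through the progression described next.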
The progression schedule matters as much as the routing logic. Start at 5–10% of traffic. Monitor for 24 hours before moving to 25%. Move to 50% only after confirming the candidate holds at 25%. This is not a superstition about percentages — it's about having enough time to see tail failures. Schema violations and semantic regressions often appear in specific input categories that might only represent 2–3% of your traffic. A 10% canary over 24 hours exposes you to a representative sample of that long tail.
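One way to encode that schedule is as explicit stage gates tied to the metrics discussed below; the stages mirror the percentages above, and the gate thresholds here are illustrative placeholders, not recommendations:
```python
# Hypothetical stage and gate definitions; thresholds are illustrative.
ROLLOUT_STAGES = [
    {"candidate_percent": 10, "min_soak_hours": 24},
    {"candidate_percent": 25, "min_soak_hours": 24},
    {"candidate_percent": 50, "min_soak_hours": 24},
    {"candidate_percent": 100, "min_soak_hours": 0},
]

GATES = {
    "schema_conformance_rate_min": 0.995,
    "parse_success_rate_min": 0.99,
    "p95_latency_ms_max": 2500,
    "cost_per_request_ratio_max": 1.5,  # candidate cost relative to incumbent
}


def may_advance(metrics: dict, soaked_hours: float, stage: dict) -> bool:
    """Advance to the next stage only after the soak period, with every gate green."""
    if soaked_hours < stage["min_soak_hours"]:
        return False
    return (
        metrics["schema_conformance_rate"] >= GATES["schema_conformance_rate_min"]
        and metrics["parse_success_rate"] >= GATES["parse_success_rate_min"]
        and metrics["p95_latency_ms"] <= GATES["p95_latency_ms_max"]
        and metrics["cost_per_request_ratio"] <= GATES["cost_per_request_ratio_max"]
    )
```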
At each stage, track metrics at the cohort level, not just in aggregate. Aggregate metrics mask segment-specific failures. The candidate model might perform identically on English queries while silently degrading on French ones. A 5% quality drop in an underrepresented segment is invisible in aggregate but affects real users.
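A small sketch of cohort-level comparison, assuming each logged record carries a variant, a cohort label (for example, query language), and a per-request pass/fail flag; the field names are hypothetical:
```python
from collections import defaultdict


def rates_by_cohort(records: list[dict]) -> dict:
    """Compute a per-request pass rate for each (variant, cohort) pair."""
    counts = defaultdict(lambda: [0, 0])  # (variant, cohort) -> [passed, total]
    for r in records:
        key = (r["variant"], r["cohort"])
        counts[key][0] += int(r["passed"])
        counts[key][1] += 1
    return {key: passed / total for key, (passed, total) in counts.items()}


def degraded_cohorts(records: list[dict], max_drop: float = 0.02) -> list[str]:
    """Flag cohorts where the candidate trails the incumbent by more than max_drop."""
    rates = rates_by_cohort(records)
    cohorts = {cohort for (_, cohort) in rates}
    return [
        c for c in cohorts
        if rates.get(("incumbent", c), 0.0) - rates.get(("candidate", c), 0.0) > max_drop
    ]
```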
The Metrics That Actually Catch Model Regressions
Generic "quality score" monitoring will not catch the failures described above. The metrics you need to instrument are specific to what LLMs actually fail at:
Output schema conformance rate. Track the percentage of responses where every required field is present, correctly named, and of the expected type. Instrument at the parsing layer, not the model layer — you want to know whether your actual downstream consumers succeed. If your parser emits KeyError exceptions, count them and alert on them. A drop in schema conformance rate is your earliest warning of a field-name drift problem.
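A minimal sketch of that instrumentation, assuming a hypothetical three-field contract for one endpoint; the required fields are illustrative:
```python
import json

# Hypothetical contract for one endpoint: required field name -> expected type.
REQUIRED_FIELDS = {"category": str, "priority": str, "summary": str}


def check_schema_conformance(raw_response: str) -> tuple[bool, list[str]]:
    """Check one response at the parsing layer, where downstream consumers live."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return False, ["invalid_json"]
    violations = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            violations.append(f"missing:{field}")
        elif not isinstance(data[field], expected_type):
            violations.append(f"wrong_type:{field}")
    return not violations, violations


# A silent rename (category -> ticket_category) surfaces as missing:category.
ok, violations = check_schema_conformance(
    '{"ticket_category": "billing", "priority": "p2", "summary": "..."}'
)
assert not ok and "missing:category" in violations
```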
Downstream parser success rate. Distinct from schema conformance: this is whether the full parsing pipeline succeeds end-to-end. A response can pass schema validation and still break a parser that makes assumptions about value formats or nested structure depth. Track parse_success as a binary metric per request and alert if it drops below your baseline.
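A sketch of that distinction, using a hypothetical downstream parser that assumes more about value formats than the schema check does:
```python
import json


def downstream_parse(data: dict) -> dict:
    # Stand-in for the real pipeline, which assumes more than the schema check:
    # here, that priority looks like "p1".."p4" and that category can be lowercased.
    return {
        "category": data["category"].lower(),
        "priority": int(data["priority"].removeprefix("p")),
    }


def record_parse_success(raw_response: str, counters: dict) -> dict | None:
    """Track parse_success as a per-request binary across the full pipeline."""
    try:
        parsed = downstream_parse(json.loads(raw_response))
        counters["parse_success"] = counters.get("parse_success", 0) + 1
        return parsed
    except Exception:
        # Any failure anywhere downstream counts, even if the response passed
        # schema validation upstream.
        counters["parse_failure"] = counters.get("parse_failure", 0) + 1
        return None


counters: dict = {}
# All required fields are present with the right types, but the priority format
# shifted from "p2" to "P-2", so the parser still breaks.
record_parse_success(
    '{"category": "billing", "priority": "P-2", "summary": "..."}', counters
)
assert counters == {"parse_failure": 1}
```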
