Canary Deploys for LLM Upgrades: Why Model Rollouts Break Differently Than Code Deployments
Your CI passed. Your evals looked fine. You flipped the traffic switch and moved on. Three days later, a customer files a ticket saying every generated report has stopped including the summary field. You dig through logs and find the new model started reliably producing exec_summary instead — a silent key rename that your JSON schema validation never caught because you forgot to add it to the rollout gates. The root cause was a model upgrade. The detection lag was 72 hours.
This is not a hypothetical. It happens in production at companies that have sophisticated deployment pipelines for their application code but treat LLM version upgrades as essentially free — a config swap, not a deployment. That mental model is wrong, and the failure modes it produces are unusually hard to catch.
Why LLM Upgrades Are Not Config Swaps
When you deploy a new version of your service code, the difference between old and new is fully enumerable: you can diff it, test it against known inputs, and assert on exact outputs. A model upgrade doesn't give you that. The new model is a function with billions of parameters you cannot inspect, and its behavior across the full distribution of production inputs is unknowable until you run it on production inputs.
Three failure modes show up specifically during model upgrades and don't exist in code deploys:
Output schema drift. Model providers improve their models without treating output format as a breaking-change contract. A field that was reliably named category in the previous version might now appear as ticket_category. Constrained decoding guarantees syntactically valid JSON, but it does not guarantee your specific field names and structure are preserved. Downstream code that indexes dict["category"] will raise KeyError on every request — and unless you alert on parse failures, the breakage is invisible to you.
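A minimal sketch of the strict field validation that catches this kind of drift before it reaches a downstream parser — the schema and payloads here are hypothetical, stand-ins for whatever contract your consumers actually depend on:

```python
import json

# Hypothetical contract: these fields must exist with these types.
REQUIRED_FIELDS = {"category": str, "priority": str}

def validate_response(raw: str) -> list[str]:
    """Return a list of schema violations; an empty list means the response conforms."""
    data = json.loads(raw)
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for {field}: {type(data[field]).__name__}")
    return errors

# Incumbent model: conforms. Candidate model: silently renamed the key.
old = '{"category": "billing", "priority": "high"}'
new = '{"ticket_category": "billing", "priority": "high"}'
```

Both payloads are syntactically valid JSON, which is exactly why a validator that checks only well-formedness would pass both.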
Semantic drift without surface regression. A new model can produce outputs that score well on automated evals while behaving differently in ways your eval suite didn't measure. The model might start attributing errors to the wrong component in your support ticket classifier, or begin generating legal disclaimers in different positions, breaking the text splitter downstream. The output looks coherent; the evaluation score holds; the product behavior breaks.
Latency and cost profile changes. A model that produces higher-quality output often does so with more tokens, longer generation time, or different throughput characteristics. A rollout that improves quality scores can simultaneously trigger SLA breaches or 3x your inference bill. Neither shows up in a quality eval.
Shadow Testing: Zero-Risk Validation on Real Traffic
The safest first step for any model upgrade is shadow testing: route production traffic to both the incumbent model and the candidate model simultaneously, serve users only the incumbent's response, and collect the candidate's output for comparison. Users see no change. You accumulate real production queries and real model behavior.
The value of shadow testing over benchmark evaluation is precisely that production traffic is not your benchmark distribution. Your benchmark is the queries you thought to write. Production traffic is the queries your actual users send — which includes edge cases, malformed inputs, queries in unexpected languages, and novel entity types that your evaluation set will never fully represent.
Shadow testing surfaces a class of problems that can only be found on real traffic: the specific JSON schema the candidate emits for your longest inputs, the latency profile under concurrent load, the hallucination rate on domain-specific terminology your users actually use. Running a shadow test for 24–72 hours before any traffic split gives you evidence you can act on.
Implementation is straightforward at the infrastructure layer. At the API gateway or load balancer, duplicate each request to both models. The incumbent's response returns synchronously to the user. The candidate's response is captured asynchronously and logged with the same metadata: input tokens, output tokens, latency, timestamp, user segment. You then run your scoring pipeline over both response sets and compare.
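The request-duplication pattern above can be sketched as follows. This is a simplified single-process version — `call_model` is a placeholder for your real inference call, and in production the shadow log would feed a queue rather than an in-memory list:

```python
import time
from concurrent.futures import ThreadPoolExecutor

shadow_log = []  # in production: a queue feeding the scoring pipeline
executor = ThreadPoolExecutor(max_workers=4)

def call_model(version: str, prompt: str) -> str:
    """Placeholder for the real inference call (hypothetical)."""
    return f"{version} response to: {prompt}"

def log_candidate(prompt: str) -> None:
    # Runs off the request path; captures the same metadata as the incumbent.
    start = time.monotonic()
    output = call_model("candidate", prompt)
    shadow_log.append({
        "prompt": prompt,
        "output": output,
        "latency_s": time.monotonic() - start,
    })

def handle_request(prompt: str) -> str:
    # Candidate is fired asynchronously; a failure here never reaches the user.
    executor.submit(log_candidate, prompt)
    # Incumbent serves the user synchronously, exactly as before.
    return call_model("incumbent", prompt)
```

The key property is that the candidate call is fully decoupled from the response path: a slow or crashing candidate degrades nothing the user sees.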
Traffic Splitting: Deterministic Routing, Progressive Allocation
Once shadow testing passes your quality gates, move to live traffic splitting. The standard error here is using random traffic assignment. Random assignment means the same user can see model A on one request and model B on the next, which makes it impossible to reason about whether quality differences are caused by the model or by session-level context variation.
Use hash-based routing instead. Hash the user ID (or session ID, or customer ID — whatever is stable) and assign the bucket deterministically. A user in bucket 0–9 always sees the candidate; everyone else sees the incumbent. This keeps individual user experiences consistent and makes your A/B comparison statistically clean.
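A minimal sketch of deterministic hash-based routing, assuming a stable string user ID:

```python
import hashlib

def canary_bucket(user_id: str, num_buckets: int = 100) -> int:
    """Deterministic bucket: the same user always lands in the same bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def route(user_id: str, canary_percent: int) -> str:
    """Buckets [0, canary_percent) see the candidate; everyone else the incumbent."""
    return "candidate" if canary_bucket(user_id) < canary_percent else "incumbent"
```

Ramping from 5% to 10% to 25% only widens the candidate range, so every user already on the candidate stays on it — no one is flipped back and forth between models as the rollout progresses.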
The progression schedule matters as much as the routing logic. Start at 5–10% of traffic. Monitor for 24 hours before moving to 25%. Move to 50% only after confirming the candidate holds at 25%. This is not a superstition about percentages — it's about having enough time to see tail failures. Schema violations and semantic regressions often appear in specific input categories that might only represent 2–3% of your traffic. A 10% canary over 24 hours exposes you to a representative sample of that long tail.
At each stage, track metrics at the cohort level, not just in aggregate. Aggregate metrics mask segment-specific failures. The candidate model might perform identically on English queries while silently degrading on French ones. A 5% quality drop in an underrepresented segment is invisible in aggregate but affects real users.
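A minimal sketch of cohort-level aggregation — the `segment` and `score` fields are hypothetical event attributes, standing in for whatever your logging pipeline records:

```python
from collections import defaultdict

def segment_quality(events: list[dict]) -> dict[str, float]:
    """Mean quality score per segment; the aggregate mean hides segment-level drops."""
    totals = defaultdict(lambda: [0.0, 0])
    for e in events:
        totals[e["segment"]][0] += e["score"]
        totals[e["segment"]][1] += 1
    return {seg: s / n for seg, (s, n) in totals.items()}

events = [
    {"segment": "en", "score": 0.90},
    {"segment": "en", "score": 0.92},
    {"segment": "fr", "score": 0.60},  # degradation invisible in the aggregate
]
```

With these illustrative numbers the aggregate mean is still above 0.8, while the French cohort has dropped to 0.60 — exactly the failure a single aggregate dashboard would miss.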
The Metrics That Actually Catch Model Regressions
Generic "quality score" monitoring will not catch the failures described above. The metrics you need to instrument are specific to what LLMs actually fail at:
Output schema conformance rate. Track the percentage of responses where every required field is present, correctly named, and of the expected type. Instrument at the parsing layer, not the model layer — you want to know whether your actual downstream consumers succeed. If your parser raises KeyError, count those exceptions and alert on them. A drop in schema conformance rate is your earliest warning of a field-name drift problem.
Downstream parser success rate. Distinct from schema conformance: this is whether the full parsing pipeline succeeds end-to-end. A response can pass schema validation and still break a parser that makes assumptions about value formats or nested structure depth. Track parse_success as a binary metric per request and alert if it drops below your baseline.
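A minimal sketch of tracking parse_success as a rolling-window rate with a baseline-relative alert — the window size, baseline, and slack values are illustrative:

```python
from collections import deque

class ParseSuccessMonitor:
    """Rolling parse_success rate that alerts when it falls below baseline."""

    def __init__(self, window: int = 1000, baseline: float = 0.995, slack: float = 0.005):
        self.results = deque(maxlen=window)  # True/False per request
        self.threshold = baseline - slack

    def record(self, success: bool) -> None:
        self.results.append(success)

    @property
    def rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 1.0

    def should_alert(self) -> bool:
        # Require a minimum sample before alerting, to avoid noise on startup.
        return len(self.results) >= 100 and self.rate < self.threshold
```

The rolling window matters: a candidate that breaks parsing on one input category will push the recent rate down sharply even if the lifetime rate still looks healthy.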
Semantic similarity to incumbent. Use sentence embeddings to compute the cosine similarity between the candidate's and incumbent's responses on matched inputs. Aggregate these scores over a rolling window. A sudden drop in average similarity during a canary rollout signals that the candidate has shifted behavior meaningfully — even if individual scores look acceptable. This is not a quality signal; it's a change-detection signal.
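A minimal sketch of the comparison, assuming an `embed` function you supply (in production this would be a sentence-embedding model; here it is an injected parameter so the mechanics stay visible):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def mean_similarity(pairs: list[tuple[str, str]], embed) -> float:
    """Average incumbent-vs-candidate similarity over matched (incumbent, candidate) outputs."""
    return sum(cosine(embed(i), embed(c)) for i, c in pairs) / len(pairs)
```

You would compute `mean_similarity` over a rolling window of matched responses and alert on a sudden drop relative to the shadow-test baseline, not on any absolute value.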
Per-segment latency and cost. Track p50, p95, and p99 latency per user segment and per input length bucket. New models often have different generation profiles on long inputs. A latency spike at p99 on inputs longer than 2,000 tokens might be invisible in the aggregate p95, but it will cause timeout failures for a real subset of users.
Refusal and format-error rate. Count the fraction of responses that are refusals, format failures (model ignored the output schema instruction), or degenerate completions. Models differ in how aggressively they refuse ambiguous inputs and how reliably they follow structured output instructions. A candidate model that has a 0.5% higher refusal rate might appear similar to the incumbent overall, but 0.5% at scale is tens of thousands of failed requests per day.
Automatic Rollback: Making Rollback Cheap Enough to Actually Use
Rollback is only effective if the threshold for triggering it is low. The failure pattern in most LLM incidents is that teams set their rollback thresholds conservatively — they don't want to roll back a good upgrade because of noise — and by the time the metric crosses the threshold, the window for a low-impact rollback has closed.
The fix is to separate rollback from reverting: design your pipeline so rolling back to the previous model version takes under five minutes and requires no code change. Model version should be a runtime parameter, not a deployment artifact. If swapping back is a config write, your threshold can be aggressive because the cost of a false positive is low.
Gate your canary progression explicitly. Before moving from 10% to 25%, the pipeline should check: schema conformance rate ≥ baseline, downstream parse success rate ≥ baseline, p95 latency within 10% of baseline, per-segment quality scores within acceptable range. These gates run automatically. If any gate fails, the progression halts and alerts fire. If multiple gates fail simultaneously, an automatic rollback to 0% triggers without waiting for human review.
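A sketch of such a gate check — the metric names and thresholds are illustrative, not a fixed schema:

```python
def check_gates(metrics: dict, baseline: dict) -> list[str]:
    """Return the names of failed gates; empty means the stage may advance."""
    failures = []
    if metrics["schema_conformance"] < baseline["schema_conformance"]:
        failures.append("schema_conformance")
    if metrics["parse_success"] < baseline["parse_success"]:
        failures.append("parse_success")
    if metrics["p95_latency_ms"] > baseline["p95_latency_ms"] * 1.10:
        failures.append("p95_latency")
    return failures

def next_action(failures: list[str]) -> str:
    if not failures:
        return "advance"          # move to the next traffic stage
    if len(failures) == 1:
        return "halt_and_alert"   # pause progression, page a human
    return "rollback"             # multiple gates failing: revert to 0% automatically
```

The single-failure / multi-failure distinction encodes the policy from the text: one degraded metric halts progression for review, while correlated failures trigger rollback without waiting for a human.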
Record model_version, schema_version, and prompt_version on every logged event. When an incident surfaces days after a rollout, you need to be able to correlate the failure timestamp with the exact model and prompt combination that produced it. Without this, post-incident analysis is archaeology.
The Prompt Coupling Problem
A detail that most progressive delivery write-ups skip: model version and prompt version are coupled, and deploying them independently breaks things.
Prompts are tuned against a specific model. When you upgrade the model, the prompt that worked well against the old model may perform poorly against the new one — different instruction-following behavior, different default verbosity, different tendency to follow structural constraints. Conversely, a prompt improvement that was tested against the new model may regress on the incumbent.
The practical implication is that your canary pipeline should treat (model_version, prompt_version) as a joint deployment unit, not two independent dimensions. Shadow test and canary the combination. If you need to roll back the model, roll back to the matching prompt version as well. Storing these as a joint artifact in your model registry, rather than tracking them separately, prevents the version-skew failures that come from treating them independently.
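A minimal sketch of the joint-artifact idea — the registry here is an in-memory list standing in for a real model registry:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Release:
    """Model and prompt versions deploy and roll back as one unit."""
    model_version: str
    prompt_version: str

registry: list[Release] = []  # ordered history of deployed joint artifacts

def deploy(release: Release) -> None:
    registry.append(release)

def rollback() -> Release:
    """Revert to the previous joint artifact; model and prompt are never mixed."""
    registry.pop()
    return registry[-1]
```

Because `rollback` operates on the pair, there is no code path that can revert the model while leaving the new prompt live — the version-skew failure mode is structurally impossible.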
What a Production-Grade Rollout Looks Like
The full pipeline for a model upgrade in a production system that's learned these lessons:
1. Shadow test for 48 hours. Run the candidate on 100% of production traffic in shadow mode. Score outputs on schema conformance, semantic similarity, and quality metrics. Pass all gates before continuing.
2. Canary at 5% with a 24-hour hold. Route 5% of traffic (deterministically by user hash) to the candidate. Monitor per-segment metrics against gates. Do not progress until gates pass.
3. Advance in stages: 10% → 25% → 50% → 100%. Each stage requires a manual or automatic gate check and holds for 12–24 hours minimum.
4. Automatic rollback if gates fail at any stage. Rollback is a parameter write, not a deploy. It takes minutes.
5. Full audit trail. Every event carries model_version and prompt_version. Post-incident analysis can reconstruct exactly which requests were served by the candidate and when.
This pipeline takes longer than a traditional code deploy. A full rollout from shadow test to 100% traffic might take a week. That is the right tradeoff. The cost of a week of cautious rollout is far lower than the cost of a 72-hour incident where thousands of users absorb silent quality degradation while your monitoring is still looking for a signal.
The Broader Discipline
The same thinking that applies to model version upgrades applies to prompt changes. A prompt modification is a behavioral change to a non-deterministic function. It deserves the same shadow testing, the same canary progression, and the same automatic rollback infrastructure. Teams that invest in this pipeline for model upgrades often discover that the bigger source of production incidents is prompt drift — the accumulated micro-adjustments that each seem harmless but compose into a regression.
The mental model shift required here is from "LLM as a stateless API call" to "LLM as a deployed service with behavioral contracts." Code services have deployment pipelines. LLM services need them too, built for the specific ways they fail: silently, gradually, and in ways that aggregate metrics routinely miss.
