Model Migration Bills You Twice: The Eval Re-Anchoring Tax Nobody Prices
Every model upgrade gets sold to the team as a swap: a one-line config change, a measurable win on latency or cost or quality, and a few days of prompt re-tuning to absorb the new model's quirks. The procurement deck shows per-token deltas, the engineering ticket lists the rollout phases, and the FP&A team books the quarterly savings. Then the eval scores come in and nobody recognizes them. Quality is flat where it should have moved. Two judges that used to agree are now diverging by ten points. The snapshot suite is red, but the diffs look like rewordings. Somebody in standup asks the question that should have been on the migration plan from day one: what is the model actually being scored against?
This is the second bill — the eval re-anchoring tax — and it is reliably larger than the first. The human-annotated reference scores were anchored to the previous model's output distribution. The LLM-as-judge graders were calibrated against the old model's failure modes. The snapshot fixtures captured the old model's wording. The team's intuition for "good output" was trained on the old model's stylistic tells. None of that survives the swap intact.
The teams that price model migrations as a per-token delta are the ones who get surprised by labor costs in the next quarter. The teams that have done this twice know the eval system is configuration coupled to a specific model, and that switching models is a config change with a reviewer bill attached.
The Four Anchors That Break When You Swap Models
A production eval system is not one artifact. It is at least four, and each one is silently anchored to the model it was built against.
Human-annotated reference scores. When a labeler graded an output as a 4-out-of-5 on coherence three months ago, they were grading it against an implicit distribution of what the old model produced. The new model's outputs occupy a different region of that distribution — usually because the model is more verbose, or less verbose, or formats lists differently, or hedges with different qualifiers. The reference score doesn't transfer. The annotator who scored against "this is what acceptable looks like for model A" needs to reset against "this is what acceptable looks like for model B," and that reset is hours of work per annotator before any new labeling lands.
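How large that reset will be can be estimated before any labeling starts, by profiling where the two models' outputs actually sit. A minimal sketch, assuming eval outputs for each model live in JSONL files with an "output" field; the file names, field names, and hedging-word list are placeholders, not anyone's real schema:

```python
# Sketch: quantify the output-distribution shift before re-annotating.
# File names and the "output" JSONL field are assumptions about your eval store.
import json
import statistics

def load_outputs(path: str) -> list[str]:
    with open(path) as f:
        return [json.loads(line)["output"] for line in f]

def profile(outputs: list[str]) -> dict:
    return {
        "median_words": statistics.median(len(o.split()) for o in outputs),
        "pct_bulleted": sum(("\n- " in o or "\n* " in o) for o in outputs) / len(outputs),
        "pct_hedged": sum(
            any(h in o.lower() for h in ("might", "may", "typically", "generally"))
            for o in outputs
        ) / len(outputs),
    }

old = profile(load_outputs("eval_outputs_model_a.jsonl"))
new = profile(load_outputs("eval_outputs_model_b.jsonl"))
for key in old:
    print(f"{key}: {old[key]:.2f} -> {new[key]:.2f}")
```

Large deltas here are the signal that the old reference scores will not transfer and that the re-annotation pass needs to be sized accordingly.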
LLM-as-judge calibrations. Research over the last year keeps making the same point: judges lack natural numeric calibration, and the meaning of a 7-out-of-10 depends on the model being judged, the judge's own version, the prompt phrasing, and the day. Calibration drift across model versions is now documented enough that "judges are stable" is no longer a defensible assumption. Worse: when the judge and the generator share a family (a Claude judging another Claude, an OpenAI model judging another OpenAI model), self-preference bias creeps in, and a migration that changes either side of that pair can systematically shift scores in a direction that has nothing to do with quality.
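One way to make "the judge drifted" concrete is to hold the judge prompt fixed, score the same calibration slice of human-rated items for each generator, and compare the judge to the human reference both times. A minimal sketch; the paired score lists and the toy numbers are illustrative, and the rank correlation could come from any stats routine (scipy's spearmanr is used here only for brevity):

```python
# Sketch: detect judge calibration drift across a model migration.
# Assumes you already have paired (human, judge) scores for the same items,
# collected once against model A's outputs and once against model B's.
from statistics import mean
from scipy.stats import spearmanr

def calibration_report(human: list[float], judge: list[float]) -> dict:
    bias = mean(j - h for h, j in zip(human, judge))      # systematic offset
    mae = mean(abs(j - h) for h, j in zip(human, judge))  # average miss
    rho, _ = spearmanr(human, judge)                      # ranking agreement
    return {"n": len(human), "bias": bias, "mae": mae, "spearman_rho": rho}

# Toy numbers just to show the shape; replace with your calibration slice.
human_a, judge_a = [4, 3, 5, 2, 4], [4, 3, 4, 2, 5]   # judge vs. model A outputs
human_b, judge_b = [4, 3, 5, 2, 4], [5, 4, 5, 3, 5]   # same judge, model B outputs
print(calibration_report(human_a, judge_a))
print(calibration_report(human_b, judge_b))
```

A jump in bias or a drop in rank correlation between the two reports is the re-anchoring signal; running the same check with a judge from a different model family than the generator is the cheapest probe for the self-preference effect.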
Snapshot test fixtures. Snapshot suites assume that "equivalent output" means "literally the same string, modulo whitespace." Model migrations almost always break this assumption. The new model's structurally equivalent answer uses different wording, a different sentence order, a different bullet style. Every snapshot becomes a diff. The team can either rewrite every fixture (expensive, and it re-anchors the suite to the new model with no diff against the old one) or weaken the snapshot match to a semantic check (which removes the regression guard the suite was meant to provide).
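In miniature, the "weaken the match" option looks like the fallback below. The token-overlap check is a stand-in for whatever semantic comparison (embeddings, a rubric judge) a real suite would use, and the 0.8 threshold is arbitrary:

```python
# Sketch: a snapshot check with a semantic-ish fallback. Jaccard token overlap
# stands in for a real semantic comparison; the threshold is an assumption.
import re

def normalize(text: str) -> str:
    text = re.sub(r"[^\w\s]", " ", text.lower())   # drop punctuation
    return re.sub(r"\s+", " ", text).strip()

def token_overlap(a: str, b: str) -> float:
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def check_snapshot(output: str, snapshot: str, threshold: float = 0.8) -> str:
    if normalize(output) == normalize(snapshot):
        return "exact"      # the original regression guard still holds
    if token_overlap(output, snapshot) >= threshold:
        return "semantic"   # passes, but the guard is now weaker
    return "fail"           # a real divergence someone has to look at

print(check_snapshot("Use pip install -e . for local dev.",
                     "For local dev, use pip install -e ."))   # -> "semantic"
```

Every result that lands in the "semantic" bucket is a fixture the team has quietly re-anchored to the new model, which is exactly the tradeoff described above.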
Team intuition. This one is the hardest to surface and the most expensive to rebuild. After six months of staring at a model's output, the reviewers know its tells: the way it hedges, the way it caves to a particular kind of pushback, the specific way it hallucinates a function signature. That intuition lives in the team's heads, and on the new model it is wrong. For the first three weeks after the swap, the team is judging against the old mental model, and their PR reviews on prompt edits are subtly miscalibrated.
The Anatomy of a Surprise Quarter
The procurement conversation that approved the migration usually went like this: per-token price drops by some percentage, latency improves by some percentage, the vendor publishes benchmark numbers that look favorable, the team commits to a date. The bill that nobody priced shows up across the next two quarters as labor:
- Re-annotating a representative sample of the eval set against the new model's distribution
- Calibrating each LLM-judge against a fresh human reference
- Rewriting or weakening snapshot fixtures
- Running parallel evaluation periods where both models score the same inputs (a minimal harness is sketched after this list)
- Investigating divergence cases and deciding which model's answer is actually correct
- Recalibrating the team's mental model — the soft skill that doesn't show up on a Jira ticket
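The parallel evaluation item above reduces to a small harness: same prompts, both models, same grader, and a queue of divergences for a human to adjudicate. A sketch under the assumption that `call_model_a`, `call_model_b`, and `grade` already exist somewhere in your stack; none of these names refer to a real API:

```python
# Sketch: a parallel evaluation period. Both models answer the same prompts,
# the same grader scores both, and large score gaps go to a human review queue.
# call_model_a / call_model_b / grade are placeholders for your own clients.
from typing import Callable

def parallel_eval(
    prompts: list[str],
    call_model_a: Callable[[str], str],
    call_model_b: Callable[[str], str],
    grade: Callable[[str, str], float],   # (prompt, output) -> score
    divergence_threshold: float = 1.0,
) -> list[dict]:
    review_queue = []
    for prompt in prompts:
        out_a, out_b = call_model_a(prompt), call_model_b(prompt)
        score_a, score_b = grade(prompt, out_a), grade(prompt, out_b)
        if abs(score_a - score_b) >= divergence_threshold:
            # The cases where someone has to decide which model is actually
            # right: the labor item that never makes the procurement deck.
            review_queue.append({
                "prompt": prompt,
                "output_a": out_a, "score_a": score_a,
                "output_b": out_b, "score_b": score_b,
            })
    return review_queue
```

The length of that review queue, times the minutes it takes a reviewer to adjudicate one case, is a first-order estimate of the tax this piece is about.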
