The Model Migration That Broke Your Prompt Cache Without Warning

· 10 min read
Tian Pan
Software Engineer

The migration looked clean. Evals were re-anchored against the new model version. Judge prompts were re-calibrated. Two weeks of shadow traffic showed behavior parity within tolerance. p50 and p99 latency were inside the budget. The rollout call signed off on Thursday afternoon and the team went home.

By Friday morning, the inference bill was 3x normal. Eval scores were still fine. Latency was still fine. No one on the rollout call had thought to instrument the cache hit rate, because the prefix had not changed — the system prompt was byte-identical, the tool definitions were byte-identical, the conversation framing was byte-identical. What had changed was the model version in the request body, and the provider keys its prefix cache on (prefix bytes + model version). Every request after the cutover landed on a cold cache. The warm-up curve took six weeks of organic traffic to recover, and the team paid full input-token rates for every token on every request for the duration.

This is the cache-invalidation blind spot that lives between the model team's rollout dashboard and the FinOps dashboard. The model team owns behavior parity, latency, and eval scores. FinOps owns the monthly bill. The cache hit rate sits in between, governed by no one, and a model migration is the most expensive operation you can perform against it.

The provider does not consider your cache part of the contract

Anthropic, OpenAI, AWS Bedrock, and the rest treat the prefix cache as a best-effort acceleration layer, not a guaranteed service. Their documentation says, plainly, that the cache key includes the model identifier. Switching from one Sonnet version to the next, or from one Opus version to the next, invalidates every cached prefix you have built up — not because your prompt changed, but because the cache is keyed on the model's weights, and the weights changed.

This is not a bug. The KV cache stores the key and value vectors that the model's attention computed for the static portion of the prompt. Those vectors are a function of the weights. New weights mean new vectors mean a new cache entry. There is no shared representation that survives a version change, and there is no way for the provider to "rehydrate" your old cache against the new model — the math does not work.

What this means operationally: the day you cut over, your effective input-token cost per request resets to the un-cached rate. If you were running a 50,000-token system prompt at 90% cache hit rate, you were paying full price for roughly 5,000 tokens of input per request and the discounted cached rate for the other 45,000. The day after the cutover, you are paying full price for all 50,000. With cached tokens priced at a tenth of un-cached ones, that is roughly a 5x jump in input cost, approaching 10x as the hit rate nears 100%, hiding behind a "behavior parity" rollout that on every other axis looked successful.
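
The arithmetic is easy to check with a rough sketch. The function name and parameters here are illustrative, the default prices are the $3.00 and $0.30 per million tokens quoted later in this post, and the sketch ignores any premium a provider may charge for cache writes. Cached tokens are discounted rather than free, so the exact multiplier depends on the price gap.

```python
def input_cost_per_request(prompt_tokens, hit_rate,
                           uncached_per_mtok=3.00, cached_per_mtok=0.30):
    """Approximate input cost in dollars for one request.

    hit_rate is the fraction of prompt tokens served from the cache.
    Prices are dollars per million tokens (assumed values, see lead-in).
    """
    cached = prompt_tokens * hit_rate
    uncached = prompt_tokens - cached
    return (uncached * uncached_per_mtok + cached * cached_per_mtok) / 1_000_000


warm = input_cost_per_request(50_000, 0.90)  # steady state before the cutover
cold = input_cost_per_request(50_000, 0.0)   # first day on the new version
print(f"warm ${warm:.4f}  cold ${cold:.4f}  multiplier {cold / warm:.1f}x")
```

Plugging in your own prompt size, hit rate, and price sheet gives the per-request delta you should expect to see on cutover day.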

And the recovery is not instant. Prefix caches have a TTL — typically five minutes on most providers, with longer-tier options for committed-throughput customers. A cache entry only survives if a subsequent request hits the same prefix before the TTL expires. The warm-up curve is a function of your organic traffic distribution: high-traffic prefixes warm in minutes, long-tail prefixes warm in days or weeks, and the integrated cost differential over that recovery window is the actual size of the incident.
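
One way to put a number on "the size of the incident" is to sum the hit-rate shortfall over the recovery window. This is a minimal sketch under simplifying assumptions: hourly buckets, a constant token volume per hour, and the illustrative $3.00 and $0.30 per million token prices. The recovery curve in the example is made up.

```python
def incident_cost(hourly_hit_rates, baseline_hit_rate, tokens_per_hour,
                  uncached_per_mtok=3.00, cached_per_mtok=0.30):
    """Extra input spend in dollars relative to staying at the baseline hit rate.

    Every token that would have been a cache read at baseline but was not
    costs the difference between the un-cached and cached price.
    """
    gap_per_mtok = uncached_per_mtok - cached_per_mtok
    extra = 0.0
    for hit_rate in hourly_hit_rates:
        lost_fraction = max(baseline_hit_rate - hit_rate, 0.0)
        extra += lost_fraction * tokens_per_hour * gap_per_mtok / 1_000_000
    return extra


# Hypothetical curve: cold at cutover, back to a 90% baseline within six hours.
curve = [0.0, 0.3, 0.6, 0.8, 0.9, 0.9]
print(round(incident_cost(curve, 0.9, tokens_per_hour=2_000_000_000), 2))
```

Feeding it the measured hourly hit rates after a real cutover turns a vague "the bill spiked" into a dollar figure that can be compared against the cost of warming the cache in advance.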

The model-team dashboard and the FinOps dashboard are blind to each other

The rollout call had everyone who mattered for behavior in the room: model owner, eval owner, latency owner, on-call. It had no one whose job description included the input-token bill. That bill is reviewed monthly by a different team, on a different cadence, with a different dashboard. By the time anyone reads the spike, the cache is half-warm again and the incident looks like a one-time anomaly rather than a foreseeable consequence of the migration.

The fix is not to invite finance to every rollout call. The fix is to put cache hit rate on the same dashboard as latency and eval scores, and to make it a release-gate metric. Both Anthropic and OpenAI return cache metadata in the response: cache_read_input_tokens and cache_creation_input_tokens on Anthropic, cached_tokens on OpenAI. You can compute a per-token hit rate as cache_read / (cache_read + cache_creation + uncached), which is the same thing as cache_read / total_input, or a per-request hit rate as the share of requests that read anything from the cache. Either is fine; pick one and graph it next to the things you already watch.
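
A minimal sketch of that computation over plain usage dictionaries follows. The helper names are hypothetical, and the payload shapes are assumptions to verify against the current API references: Anthropic is assumed to report input_tokens excluding cached tokens, and OpenAI is assumed to report cached_tokens nested under prompt_tokens_details with prompt_tokens as the total. The sketch reports a token-level rate and the share of requests that read anything from the cache.

```python
def anthropic_token_counts(usage):
    """Return (cache_read, total_input) from an Anthropic-style usage dict."""
    read = usage.get("cache_read_input_tokens") or 0
    created = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0  # assumed to exclude cached tokens
    return read, read + created + uncached


def openai_token_counts(usage):
    """Return (cache_read, total_input) from an OpenAI-style usage dict."""
    details = usage.get("prompt_tokens_details") or {}
    read = details.get("cached_tokens") or 0
    return read, usage.get("prompt_tokens") or 0  # assumed to include cached tokens


def hit_rates(counts):
    """counts: iterable of (cache_read, total_input) pairs, one per request.

    Returns (token_hit_rate, request_hit_rate).
    """
    counts = list(counts)
    total_read = sum(read for read, _ in counts)
    total_input = sum(total for _, total in counts)
    token_rate = total_read / total_input if total_input else 0.0
    with_hit = sum(1 for read, _ in counts if read > 0)
    request_rate = with_hit / len(counts) if counts else 0.0
    return token_rate, request_rate
```

Emitting both numbers per model version, rather than one blended figure, is what makes a cutover visible: the old version's series stays flat while the new version's series starts at zero.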

If your dashboard had been showing cache hit rate, the Friday morning spike would have shown up on the rollout call itself, the moment shadow traffic started rerouting. Instead, it showed up six days later as a column in a billing spreadsheet, and someone had to reconstruct the causal chain from logs.

The pre-rollout discipline you skipped

A model migration with non-trivial cache spend should not roll out cold. The disciplines are not exotic:

Synthetic prefill traffic before the cutover. Two or three hours before you flip user traffic, send the top N most-common prefixes against the new model version at a rate that keeps them inside the cache TTL. This is exactly the cache-warming pattern that latency-sensitive teams use for cold starts after a deployment; the only difference is that you are warming for cost rather than for time-to-first-token. The cost of synthetic prefill is the cost of one cache-write per distinct prefix. The cost of not doing it is the cost of one cache-write per request for the entire warm-up window.
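
A prefill plan can be as simple as a send schedule. The sketch below only computes when to send; the function name, the 0.8 safety factor, and the two-hour lead are illustrative assumptions, and actually issuing the requests against the new model version is left to whatever client you already use.

```python
def prefill_schedule(prefix_ids, ttl_seconds=300, lead_seconds=7200, safety=0.8):
    """Return (seconds_before_cutover, prefix_id) pairs, earliest first.

    Each prefix is re-sent every ttl_seconds * safety, so the gap between
    consecutive sends of the same prefix never exceeds the TTL.
    """
    interval = ttl_seconds * safety
    schedule = []
    offset = float(lead_seconds)
    while offset > 0:
        schedule.extend((offset, prefix_id) for prefix_id in prefix_ids)
        offset -= interval
    return schedule


plan = prefill_schedule(["support-agent", "code-review"])
print(len(plan), plan[0], plan[-1])
```

Only the first send of each prefix should be a cache write; the later sends are reads that keep the entry alive, which is why the cost scales with the number of distinct prefixes rather than with the length of the warm-up window.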

Staggered rollout windows matched to the cache TTL. A 1% → 10% → 50% → 100% ramp at 5-minute intervals is fine for behavior validation but counterproductive for cache economics, because at a small traffic share too few requests arrive per prefix to keep it warm against the TTL. Either ramp fast enough that the new-version traffic alone sustains the cache, or ramp slowly and run synthetic prefill in parallel. The wrong combination, a slow ramp without prefill, gives you the worst of both: cold cache for the whole window, and prolonged exposure to that cost.
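
Whether a ramp step can sustain the cache on its own can be estimated with a back-of-the-envelope model. This sketch assumes Poisson arrivals per prefix and a TTL that is refreshed on every hit; both are modeling assumptions, and the function names and the 90% threshold are illustrative.

```python
import math


def warm_probability(requests_per_minute, traffic_share, ttl_seconds=300):
    """Chance that a request finds its prefix still cached.

    Under the stated assumptions, a request hits if the previous request for
    the same prefix arrived within ttl_seconds.
    """
    rate_per_second = requests_per_minute * traffic_share / 60.0
    return 1.0 - math.exp(-rate_per_second * ttl_seconds)


def steps_needing_prefill(requests_per_minute, ramp=(0.01, 0.10, 0.50, 1.00),
                          min_warm=0.9, ttl_seconds=300):
    """Ramp steps at which organic traffic alone is unlikely to keep the prefix warm."""
    return [share for share in ramp
            if warm_probability(requests_per_minute, share, ttl_seconds) < min_warm]


# A prefix that sees 2 requests per minute at full traffic.
print(steps_needing_prefill(2))
```

For that example prefix the 1% and 10% steps fall below the threshold, which is the signal to either skip past them quickly or back them with synthetic prefill.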

A parallel-run period where both versions stay warm. During the migration window, send a small percentage of traffic to the old version specifically to keep its cache warm, so that a rollback does not also incur a cold-start cost. Without this, rollback is a second cost incident, which makes the team less willing to roll back even when they should.

None of these are novel. They are the same patterns engineering teams use for database migrations, CDN cutovers, and search index swaps. The novelty is treating the prompt cache as an asset with the same operational properties as those systems, rather than as an opaque vendor optimization that runs in the background.

Aliases are convenient and load-bearing in the worst way

Many teams point at claude-sonnet-latest or an equivalent alias instead of pinning a specific dated version like claude-sonnet-4-6-20260315. Aliases are convenient: you get the new model for free, no code change, no deploy. They are also the mechanism by which a provider can invalidate your entire cache without anyone on your team taking an action that would have triggered a review.

Use immutable, dated version pins for any model spend large enough that a 5–10x temporary multiplier would matter. The model identifier becomes part of your deployable contract. Upgrades require a code change, a code change requires a PR, a PR requires a review, a review requires someone to ask "what is our cache warm-up plan." The friction is the feature.
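
The review friction can be backed by a check in CI. This is a hedged sketch: the rule that a pinned identifier ends in an 8-digit date matches the example identifiers in this post, not any provider's official naming scheme, so adjust the pattern to the identifiers you actually deploy. The config shape and function names are hypothetical.

```python
import re

# Assumption: a pinned identifier ends in an 8-digit date such as -20260315.
DATED_SUFFIX = re.compile(r"-\d{8}$")


def is_pinned(model_id):
    """True if the identifier carries a dated suffix rather than a floating alias."""
    return bool(DATED_SUFFIX.search(model_id))


def unpinned_models(config):
    """config: mapping of service name -> model identifier.

    Returns the service names that still point at an alias, sorted.
    """
    return sorted(name for name, model_id in config.items()
                  if not is_pinned(model_id))


config = {
    "support-agent": "claude-sonnet-4-6-20260315",
    "summarizer": "claude-sonnet-latest",
}
print(unpinned_models(config))
```

Failing the build when the list is non-empty means an alias cannot reach production without someone deciding that it should.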

There is a second-order benefit: rollback becomes trivial and cache-safe. If you were on claude-sonnet-4-5-20260101 for six months, your cache is fully warm on that version. If the migration to 4-6-20260315 goes badly — behavior regression, judge gaming, anything — you change the string back and your cache is already there. With an alias, the old version may have been silently retired and the rollback path no longer exists, or rolling back means cold-starting the cache on a version that has been off-traffic long enough to lose its warm state.

For very large customers, the contractual layer matters too. Committed-throughput contracts often come with cache-warming options or dedicated capacity that survives version transitions. If your cache spend is six figures a month, that line item exists to be negotiated; if you have not asked the vendor what their migration story is for your cache, you are buying the default behavior whether you like it or not.

Cache hit rate as a release-gate metric

The post-rollout discipline is the easiest one to add and the one that pays compounding interest. Make cache hit rate a metric the rollout call signs off on, with explicit thresholds:

  • Pre-rollout baseline. Measure the hit rate on the incumbent model for the week leading up to the migration. This is your target.
  • Rollout-window threshold. Define what's acceptable during the warm-up window — perhaps "no worse than half of baseline within four hours of cutover" — and page on it the same way you would page on a latency regression.
  • Recovery SLO. Define when the new version should be back to baseline — perhaps "within 72 hours of full cutover." If it is not, that is a real signal: either the prefill plan was inadequate, or the prefix distribution has long-tail components you under-counted, or the new version is genuinely behaving differently in ways that affect cache reuse.
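
The three thresholds above can be sketched as a single gate function. The defaults mirror the example numbers in the list (half of baseline by four hours, back to baseline by 72 hours); the 5-point recovery tolerance, the function name, and the sample format are assumptions added for illustration.

```python
def cache_gate(baseline, samples, window_hours=4, window_floor=0.5,
               recovery_hours=72, recovery_tolerance=0.05):
    """Check post-cutover hit rates against the rollout thresholds.

    samples: list of (hours_since_cutover, hit_rate) pairs, in any order.
    Returns a list of violation messages; an empty list means the gate passes.
    """
    violations = []
    floor = baseline * window_floor
    for hours, rate in sorted(samples):
        if hours >= recovery_hours:
            # Recovery SLO: back to baseline, within the tolerance.
            if rate < baseline - recovery_tolerance:
                violations.append(f"{hours}h: {rate:.0%} below baseline {baseline:.0%}")
        elif hours >= window_hours:
            # Rollout-window threshold: at least a fraction of baseline.
            if rate < floor:
                violations.append(f"{hours}h: {rate:.0%} below rollout floor {floor:.0%}")
    return violations


print(cache_gate(0.9, [(1, 0.10), (5, 0.30), (80, 0.70)]))
```

Samples from before the four-hour mark are deliberately not judged, since a cold cache immediately after cutover is expected; the page should fire on failing to recover, not on the dip itself.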

This is not surveillance theater. Teams that run cache hit rate as a first-class metric have reported single tasks going from a 3% hit rate to over 90% with a focused intervention, a roughly 30x improvement in hit rate and a several-fold cost reduction on the same traffic. The metric is sensitive enough that a regression shows up clearly, and the actions that move it are concrete enough that a regression has obvious follow-ups.

The architectural shift you did not realize had happened

The deeper realization behind all of this: the line "the model upgrade is invisible to your code" was true in 2023 and is not true now. Provider-side caching has quietly become a load-bearing cost lever, sometimes the difference between a profitable feature and a losing one. The moment a system depends on a cache hit rate to clear its margin, the cache is part of the system's interface, and the interface contract includes "what happens when the model version changes."

The old mental model was: the model is an opaque function over text, the only things that matter at the API boundary are tokens-in and tokens-out, and the provider's internal optimizations are their business. That model worked when caching was a marginal optimization. It does not work when caching is the difference between paying $0.30 per 1M cached tokens and $3.00 per 1M un-cached tokens, with the same prefix, on the same provider, on consecutive days.

Treat the prompt cache the way you treat any other production cache: as an asset with a warm-up cost, a hit-rate metric, a TTL, an invalidation surface, and a rollout plan. The vendor will keep optimizing the layer underneath you, sometimes in ways that quietly invalidate everything you have built up. The version pin and the cache-hit-rate dashboard are how you stay in control of the cost story rather than discovering it on a Monday morning in a finance review.

The team that ran the careful migration did everything right against the contract they thought they had with the provider. The contract had a clause they had not read.
