The Model Migration That Broke Your Prompt Cache Without Warning
The migration looked clean. Evals were re-anchored against the new model version. Judge prompts were re-calibrated. Two weeks of shadow traffic showed behavior parity within tolerance. p50 and p99 latency were inside the budget. The rollout call signed off on Thursday afternoon and the team went home.
By Friday morning, the inference bill was 3x normal. Eval scores were still fine. Latency was still fine. No one on the rollout call had thought to instrument the cache hit rate, because the prefix had not changed — the system prompt was byte-identical, the tool definitions were byte-identical, the conversation framing was byte-identical. What had changed was the model version in the request body, and the provider keys its prefix cache on (prefix bytes + model version). Every request after the cutover landed on a cold cache. The warm-up curve took six weeks of organic traffic to recover, and the team paid full input-token rates for every token on every request for the duration.
