
The KV Cache Warm-Up Cron That Ran in Blue and Never in Green Because the Host Pinning Never Moved

11 min read
Tian Pan
Software Engineer

The incident review reconstructed a deployment from twelve days earlier as the cause of a 3.6× spend increase, and nobody on the call had been in the room when the change shipped. The deployment was routine: blue/green swap, traffic moved to green on schedule, blue decommissioned, the pipeline turned green, the release engineer closed the ticket. None of the production SLOs tripped. None of the application-layer alerts fired. The system ran exactly as designed.

What had been designed was a cron job that pre-warmed the provider's prompt cache against the stable system-prompt prefix every five minutes. The warm-up gave the team a 91% cache hit rate on cold starts and roughly a 4× cost advantage on the first request per session. The cron had been authored a year earlier, when the blue/green pattern was first introduced, and its host selector was pinned to the blue pool to avoid running the warm-up twice during overlap windows. When green became the live color and blue went away, the cron lost its host and silently transitioned from "running every five minutes" to "running never." The cache hit rate decayed over the next 36 hours as the provider's cache TTL aged out the pre-warmed prefixes. The cost dashboard, averaging per-request cost across a daily window, smoothed the slope until the next billing cycle made it loud.

This is a failure mode that sits in the seam between two systems that don't know about each other: the deployment pipeline that owns rollout, and the scheduler that owns recurring jobs. Each side did exactly what it was supposed to do. The cron scheduler honored its host selector. The deployment system rotated the pool. The result was a silent regression that nobody owned, because the regression lived in the relationship between the two and not inside either.

Cache Warmth as a Deployment Dependency the Deployment System Cannot See

Prompt caches are a first-class cost lever now. On a stable system-prompt prefix of a few thousand tokens, cache hits cost roughly a tenth of what cache writes cost and arrive in a fraction of the latency. Teams running production LLM workloads routinely report 60–90% spend reductions when caching is working, and corresponding latency improvements on long-prefix workloads. The economics push every serious application toward maintaining cache warmth across user sessions.

The Anthropic prompt cache has a default TTL of five minutes, with an optional one-hour extended TTL for high-throughput workloads. Either way, the warmth is perishable. If no request touches the cached prefix within the TTL window, the prefix is evicted. The next request pays the cache-creation premium — roughly 25% more than the base input rate — and the latency penalty of a cold read. A team that wants stable cache warmth across periods of low organic traffic ends up running a synthetic ping against the prefix on an interval shorter than the TTL.
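As a rough illustration of that arithmetic, the sketch below computes the expected cost of the cached prefix at a given hit rate and checks whether a warm-up interval leaves headroom under the TTL. The multipliers are assumptions: 1.25 comes from the cache-creation premium described above, and 0.10 is an approximate cache-read multiplier relative to the base input rate. The function names and the 30-second margin are hypothetical.

```python
# Multipliers relative to the base input-token rate (assumed values, see lead-in).
CACHE_WRITE_MULTIPLIER = 1.25  # cache creation: roughly 25% above the base rate
CACHE_READ_MULTIPLIER = 0.10   # cache hit: assumed to be about a tenth of the base rate


def prefix_cost(prefix_tokens: int, base_rate_per_token: float, hit_rate: float) -> float:
    """Expected cost of the cached prefix for one request at a given hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    hit_cost = prefix_tokens * base_rate_per_token * CACHE_READ_MULTIPLIER
    miss_cost = prefix_tokens * base_rate_per_token * CACHE_WRITE_MULTIPLIER
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost


def warmup_interval_ok(interval_seconds: float, ttl_seconds: float,
                       margin_seconds: float = 30.0) -> bool:
    """True if the warm-up fires early enough to refresh the prefix before the TTL expires."""
    return interval_seconds + margin_seconds <= ttl_seconds
```

Under these assumptions a warm-up interval equal to the TTL leaves no margin at all, so `warmup_interval_ok(300, 300)` is false while `warmup_interval_ok(240, 300)` is true.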

That ping is the warm-up job. It is not part of the request path. It is not part of the application. It is operationally invisible to everything except the cost dashboard. And the cost dashboard, which averages over a window, is the slowest possible detection mechanism for a regression in cache warmth. By the time a daily average moves enough to cross an alert threshold, the cache has been cold for a day. By the time a weekly average moves, as it did here, the bill already carries twelve days of inflated spend.
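The smoothing effect can be sketched with a toy series. Everything here is hypothetical: the hourly costs, the threshold, and the `first_alert_index` helper are illustrative only, and the abrupt step from warm to cold is a simplification of the gradual decay described above.

```python
from typing import List, Optional


def first_alert_index(hourly_costs: List[float], window_hours: int,
                      threshold: float) -> Optional[int]:
    """Return the first hour at which the trailing mean exceeds the threshold."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    for i in range(len(hourly_costs)):
        # Trailing window ending at hour i (shorter at the start of the series).
        window = hourly_costs[max(0, i - window_hours + 1): i + 1]
        if sum(window) / len(window) > threshold:
            return i
    return None


# Hypothetical series: 48 warm hours at a relative cost of 1.0, then 48 cold hours at 4.0.
costs = [1.0] * 48 + [4.0] * 48
hourly_alert = first_alert_index(costs, window_hours=1, threshold=2.0)
daily_alert = first_alert_index(costs, window_hours=24, threshold=2.0)
```

With these made-up numbers the one-hour window crosses the threshold in the first cold hour, while the 24-hour window needs nine cold hours before its mean exceeds 2.0. A gradual decay or a higher threshold would widen that gap further.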

The deployment system has no native concept of cache warmth. Its model of correctness is "did the new code start successfully and is it serving traffic." Its model of the cron is, at best, "is the cron defined in the manifest." Neither of those is a check on whether the cron is actually executing against the right target. A cron pinned to a deployment color that no longer exists satisfies both checks and silently does nothing.
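One way to close that gap is to check execution rather than definition: have the warm-up job record a timestamp on each successful run, and alert when that timestamp goes stale. The sketch below assumes such a timestamp exists and is readable by a monitor. The `warmup_is_stale` name and the two-missed-runs tolerance are hypothetical choices, not part of any particular scheduler.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def warmup_is_stale(last_success: Optional[datetime], now: datetime,
                    interval: timedelta, missed_runs_allowed: int = 2) -> bool:
    """True if the warm-up has not succeeded within the allowed number of intervals.

    A job that has never reported a success is treated as stale, so a cron whose
    selector matches nothing shows up as a fault instead of as silence.
    """
    if last_success is None:
        return True
    return now - last_success > interval * missed_runs_allowed


# Example: a five-minute job that last succeeded eleven minutes ago is stale.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = warmup_is_stale(now - timedelta(minutes=11), now, timedelta(minutes=5))
```

The important design choice is that the absence of a signal counts as a failure, which is the opposite of how the scheduler in the incident treated an empty host match.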

The Failure Pattern: Infrastructure Pinned to a Transient Identifier

The deeper pattern here is broader than cron jobs and prompt caches. It applies anywhere a piece of infrastructure is pinned to an identifier that the surrounding system treats as transient: a deployment color, a host name, a pod template hash, a node label, a service account that gets rotated. The pinned infrastructure survives the change of its anchor as a no-op rather than as a fault, because the scheduling layer's contract is "find a target matching this selector and run there." A selector that matches nothing is not an error in that contract; it's an empty result set.
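A minimal sketch of the alternative contract, in which an empty match is a fault rather than an empty result. The `Host` type, the label names, and `EmptySelectorError` are hypothetical and stand in for whatever the real scheduler exposes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


class EmptySelectorError(RuntimeError):
    """Raised when a selector that is expected to match something matches nothing."""


@dataclass
class Host:
    name: str
    labels: Dict[str, str] = field(default_factory=dict)


def match_hosts(hosts: List[Host], selector: Dict[str, str]) -> List[Host]:
    """The usual contract: return every host carrying all selector labels, possibly none."""
    return [h for h in hosts
            if all(h.labels.get(k) == v for k, v in selector.items())]


def require_targets(hosts: List[Host], selector: Dict[str, str]) -> List[Host]:
    """A stricter contract: a selector that matches nothing is surfaced as an error."""
    matched = match_hosts(hosts, selector)
    if not matched:
        raise EmptySelectorError(f"selector {selector!r} matched no hosts")
    return matched
```

After a swap that leaves only green hosts, `match_hosts(hosts, {"color": "blue"})` quietly returns an empty list, while `require_targets` raises and gives someone a failure to own.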

The same shape recurs in several places. A Prometheus scrape job pinned to a label that a workload no longer carries silently stops collecting that workload's metrics. A backup job pinned to a hostname that got replaced during a node refresh silently stops backing up. A log forwarder pinned to a pod selector that drifted after a rename silently drops the logs. In each case, the visible signal is "things look normal," and the invisible signal is the slow decay of whatever the job was producing.

The blue/green pinning case is especially insidious because blue and green are conceptual aliases for "the live pool" and "the standby pool," but the names are physical. A cron pinned to "blue" is pinned to a name, not to a role. When a release flips the role, the name becomes a graveyard, and the cron is now operating against a target that no longer corresponds to anything meaningful. The release engineer has no reason to think about it because the cron is not part of the deployment manifest, and the cron's owner has no reason to think about it because the deployment never asked.
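Pinning to a role instead of a name could look something like the sketch below. It assumes the deployment system publishes a role-to-color mapping that the scheduler can read at run time. That mapping, the `live` and `standby` role names, and the `selector_for_role` helper are all hypothetical.

```python
from typing import Dict


def selector_for_role(role: str, role_to_color: Dict[str, str]) -> Dict[str, str]:
    """Resolve a role such as "live" to the color that currently holds it.

    The job stores the role; the color is looked up on every run, so a swap
    changes the target without anyone editing the job.
    """
    if role not in role_to_color:
        raise LookupError(f"no color currently holds role {role!r}")
    return {"color": role_to_color[role]}


# Before the swap, blue is live; after it, the same job definition targets green.
before_swap = selector_for_role("live", {"live": "blue", "standby": "green"})
after_swap = selector_for_role("live", {"live": "green"})
```

This also removes the original reason for the pin: a job bound to the live role runs once during overlap windows, because only one color holds that role at a time.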

The Detection Gap: Why Cache Hit Rate Lags as an Alert Signal
