Skip to main content

The Tool Result Your Prompt Cache Kept Serving After the Source Already Changed

· 10 min read
Tian Pan
Software Engineer

A support agent looks up a customer's subscription status at 14:02, finds it active, and the answer goes into the prompt prefix that the caching layer just blessed as the reusable portion of the context. At 14:14, billing cancels the subscription. At 14:19, the same customer asks a follow-up question, the cached prefix is reused because the conversation prefix still matches, and the agent cheerfully tells the customer their plan is active and offers to walk them through a feature they no longer have access to. The downstream system is correct. The model is consistent with the context. The user has been lied to by a cache hit.

This is the failure mode that prompt caching introduces into systems that were previously honest about staleness. Before caching, a tool call was a request against the source of truth, with whatever freshness contract that source advertised. With caching, that tool result becomes a tenant of the prompt prefix, and the prefix has its own TTL, controlled by the model provider, that nobody on the team explicitly opted into.

The seductive part is that prompt caching does exactly what the docs promise. Anthropic's cache holds a prefix for five minutes (or an hour at extra cost), OpenAI's automatic cache reuses the longest matching prefix above the token threshold, the cost goes down, the time-to-first-token goes down, and the hit-rate dashboard turns green. None of that is wrong. The wrongness lives in the gap between what the cache promises about prefix matching and what your application implicitly promises about data freshness, and the gap is filled by whatever your users assume.

Prompt cache hits are a freshness contract you did not sign

A cache hit is not just a performance optimization. It is a statement to the model: "the tokens in this prefix are the current context for this request." If a tool result lives inside that prefix, the model treats it as currently true. The model has no way to ask "was this fetched five seconds ago or five minutes ago?" — it only sees the tokens.

That makes the prompt prefix a de facto cache of whatever data flowed into it, with a TTL controlled by the inference provider's cache layer. A customer record fetched at 14:02 and reused at 14:19 is being served as fresh-at-14:19, not fresh-at-14:02. Nobody designed it that way; it is a consequence of putting volatile data above the cache cut.

The standard production advice — system prompt first, then tools, then static documents, then user message last — was written for hit-rate optimization. It happens to also be correct for freshness, but for a reason most teams don't internalize: the layers above the cache cut are implicitly being asserted as time-invariant. When you put dynamic content there, you're not just costing yourself cache hits when it changes — you're shipping stale answers when it doesn't change but the underlying source does.

This is the same lesson distributed-cache designers learned decades ago about TTL: a TTL bounds how stale your data can be from the cache's perspective, but it tells you nothing about whether the source changed in the meantime. The difference is that an HTTP cache is honest about what it is. A prompt prefix that has absorbed a tool result looks like part of the conversation, not part of a cache. The freshness contract has been laundered through the LLM context.

The two cache layers that drift out of consistency

In any production agent there are typically two caches in the call path, and they are owned by different teams with different SLOs.

The first is the tool result cache: a Redis lookup in front of the CRM, a memoized HTTP response, a database query plan. Its TTL is chosen by whoever owns the data layer, and the choice usually reflects how often the underlying record actually changes — sixty seconds for inventory, fifteen minutes for product catalog, twenty-four hours for a static FAQ. The contract is explicit. There is usually an invalidation hook somewhere that fires when the source mutates.

The second is the prompt prefix cache: the model provider's KV cache or its equivalent, holding the tokenized prefix that produced the model's first hidden states. Its TTL is whatever the provider gives you — five minutes is the new default at Anthropic since the silent regression earlier this year — and it has no notion of where the tokens originally came from. It cannot be invalidated by your data layer because your data layer does not know it exists.

The drift between these layers is where the bug lives. A tool result is fetched, stored in the tool result cache with a sixty-second TTL, and at the same moment flows into the prompt that gets cached for five minutes by the provider. Sixty seconds later the data layer invalidates its copy. The next caller will refetch and get fresh data — but the prompt prefix still holds the stale value, and any request that reuses that prefix in the remaining four minutes gets the old answer. The tool cache did its job. The prompt cache did its job. The composition is broken.

What makes this hard to catch is that the two TTLs are negotiated by different people. The data engineer who set the Redis TTL has never heard of cache_control blocks. The application engineer who picked the cache breakpoint placement does not know what the upstream invalidation contract is. There is no single owner of "how stale can this fact be, end to end."

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates