
Your Tool-Result Cache Is a Stale-Data Contract You Never Wrote

11 min read
Tian Pan
Software Engineer

The trace looks clean. The agent called get_inventory_status, the tool returned {"available": 142, "warehouse": "SEA-3"}, and the model wove that into a confident answer. The customer placed an order. The warehouse said the item had been out of stock since 9 a.m. The cached row was four hours old. Nobody on the team had decided four hours was acceptable — that was just whatever the cache framework defaulted to when the platform team wired up the wrapper.

This is the failure mode that gets misfiled as a hallucination. The model isn't confabulating; it is faithfully reasoning over a stale tool result that nobody bothered to label as stale. The trace logs a clean call and a clean response, the eval set never saw a stale-cache case, and the regression compounds quietly across every customer who hits the same TTL window.

A cache that sits in front of a tool call is not an optimization the way a CDN in front of a JSON endpoint is an optimization. The consumer of a CDN payload is a browser that will render it; the consumer of a tool result is a model that will plan over it, weave it into a multi-step chain, and produce a confident answer downstream. The freshness contract that was implicit in your API caching becomes a correctness surface the moment a model is the reader.

The default TTL is the contract you didn't read

Every cache framework ships with a default. Redis-style proxies default to whatever the team configured at the platform level. LiteLLM and similar middleware expose a ttl parameter that, if unset, falls back to the integration's default, often one hour for response caching. Recent tool-caching work such as ToolCacheAgent explicitly models per-tool cacheability and expiration rules, but most production teams haven't adopted it yet; they are still running on a global TTL that someone picked the day the cache layer shipped.

The result is a uniform freshness policy laid over wildly non-uniform data volatilities. Inventory levels change every few minutes. Customer profile data changes when the customer logs in to update it. Pricing changes when finance pushes a new SKU table. Knowledge-base articles change on the editorial team's schedule. A one-hour TTL is a reasonable compromise for one of these and a correctness bug for the others, and the team chose the compromise without ever naming which data it was compromising for.

The Microsoft cache-aside pattern documentation has a sentence buried in it that more teams should tattoo on their on-call laptops: cache policy should be treated as a product contract, with explicit definitions of where stale reads are acceptable and where strict freshness is required. Most agent platforms have not written this contract. The contract exists — it just lives implicitly in whatever the framework defaulted to.

Why staleness becomes a hallucination in the trace

The reason stale-cache failures get pattern-matched as hallucinations is that the trace looks identical. A clean tool call, a clean response, a confident downstream answer. The eval set, if it exists, was built with fresh data; it has no entry for "what does the agent do when the inventory it retrieves is four hours old?" So the eval pass rate stays green while the user-perceived accuracy decays.

This pattern shows up across every kind of cached tool. RAG retrieval over an embedding index that was built before the source document was edited returns the wrong chunk faster, not slower; the agent confidently cites a paragraph that no longer exists in the live doc. A cached pricing table tells the agent the discount is 15% when the discount expired at midnight. A cached availability check says the calendar is open when it was just booked.

Streamkap's writeup on this — that stale data is worse than no data, because no data prompts the model to admit ignorance while stale data prompts it to confidently deliver the wrong answer — is the right framing. The model doesn't have access to a freshness signal it isn't given. If your tool returns the same shape of payload whether the underlying data is two seconds old or two days old, the model will treat both with equal confidence, because that's what the schema told it to do.

Per-tool TTLs derived from data volatility

The first discipline that has to land is naming the volatility class for every tool the agent can call, and deriving the TTL from there rather than from the cache framework's default.

The volatility classes don't have to be precise. A working set of four buckets covers most production agent stacks:

  • Real-time critical. Inventory at order time, payment authorization, calendar booking. Cache TTL is zero or near-zero, or the cache is bypassed entirely on the high-stakes path.
  • Minutes-fresh. Search result counts, recent activity feeds, in-flight order status. TTL of 30 to 300 seconds with jitter to prevent thundering herds at expiry.
  • Hours-fresh. Product catalog metadata, account settings, knowledge-base articles. TTL of one to twelve hours depending on the editorial cadence of the source.
  • Days-fresh or static. Reference taxonomies, currency lists, country codes, infrequently changing documentation. Long TTLs are appropriate; the failure mode here is forgetting to invalidate when the rare update lands.

The categorization isn't the hard part. The hard part is that nobody on the team currently owns the question, so the categorization never gets done. The platform team owns the cache framework. The feature team owns the tool wrapper. The product team owns the user experience. The freshness contract sits in the gap between those three owners, and the result is that a generic default ships and stays.

A simple intervention: require every tool registration to declare a volatility field, and have the cache layer read its TTL from that field rather than from a framework default. The first time a feature team has to fill in volatility: "real-time" for the inventory tool and volatility: "hours-fresh" for the catalog tool, the conversation about what the right number is becomes concrete instead of theoretical.
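As a sketch of what that gate could look like, assuming an in-process tool registry rather than any particular framework (the class names, TTL numbers, and helper names below are illustrative):

```python
import random

# Illustrative base TTLs per volatility class, in seconds. The numbers are
# placeholders; each team derives its own from the buckets above.
VOLATILITY_TTLS = {
    "real-time": 0,               # bypass or near-zero cache
    "minutes-fresh": 120,
    "hours-fresh": 4 * 3600,
    "days-fresh": 7 * 24 * 3600,
}

TOOL_REGISTRY = {}

def register_tool(name, fn, volatility):
    """Refuse to register a tool that doesn't declare its volatility class."""
    if volatility not in VOLATILITY_TTLS:
        raise ValueError(f"{name}: unknown volatility {volatility!r}; "
                         f"expected one of {sorted(VOLATILITY_TTLS)}")
    TOOL_REGISTRY[name] = {"fn": fn, "volatility": volatility}

def ttl_for(name, jitter=0.1):
    """Derive the cache TTL from the declared volatility, with jitter so
    entries warmed together don't all expire in the same instant."""
    base = VOLATILITY_TTLS[TOOL_REGISTRY[name]["volatility"]]
    return 0 if base == 0 else base * (1 + random.uniform(-jitter, jitter))

# The feature team is forced to name the class at registration time:
register_tool("get_inventory_status", lambda sku: {"available": 0}, volatility="real-time")
register_tool("get_catalog_metadata", lambda sku: {"title": ""}, volatility="hours-fresh")
```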

Thread freshness metadata into the result

The second discipline is that the model should be able to reason about staleness when it matters, and right now it can't, because the staleness information is thrown away at the cache boundary.

The fix is to wrap cached tool results with explicit freshness metadata that the model sees. Rather than:

{"available": 142, "warehouse": "SEA-3"}

return:

{"available": 142, "warehouse": "SEA-3", "as_of": "2026-04-28T09:14:00Z", "data_age_seconds": 14400, "source": "cached"}

Or, more pragmatically for prompt-engineered agents that don't reason well over arbitrary metadata fields, prepend a one-line preamble: "This inventory snapshot is from 4 hours ago and may not reflect current stock." The cost is a few tokens. The benefit is that the model now has the option to flag the staleness, ask the user whether to re-check, or escalate to a fresh call — none of which it can do if the staleness is invisible to it.
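A minimal sketch of the wrapping step, assuming the cache layer records a write timestamp alongside each entry (the helper names are hypothetical; the field names match the payload above):

```python
import json
import time
from datetime import datetime, timezone

def wrap_cached_result(payload: dict, cached_at: float) -> dict:
    """Attach explicit freshness metadata to a cached tool result so the
    model sees how old the value is instead of treating it as live."""
    age = int(time.time() - cached_at)
    return {
        **payload,
        "as_of": datetime.fromtimestamp(cached_at, tz=timezone.utc).isoformat(timespec="seconds"),
        "data_age_seconds": age,
        "source": "cached",
    }

def with_staleness_preamble(wrapped: dict) -> str:
    """Render the result for prompt-engineered agents: a one-line staleness
    warning in plain language, followed by the raw payload."""
    hours = wrapped["data_age_seconds"] / 3600
    preamble = (f"This snapshot is from {hours:.1f} hours ago "
                "and may not reflect current data.")
    return preamble + "\n" + json.dumps(wrapped)
```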

This is not a theoretical pattern. Recent observability writeups on tool-call tracing emphasize that source freshness and last-updated timestamps belong in the metadata schema for every retrieval and every tool invocation. The model can use them; the trace can use them; the eval can use them. The team that strips them at the cache boundary has decided the model doesn't need to know — and that decision is the contract nobody wrote.

A second-order benefit: when the freshness metadata is in the trace, you can run an eval that asks the agent specifically how it handles a stale result. You couldn't run that eval before because the trace didn't carry the staleness signal. Now the eval set can include scenarios where the cached value is X minutes old and the model has to decide whether to use it, refresh it, or disclose it.

Cache-bypass tiers for high-stakes calls

Some tool calls are not cacheable at all, and the freshness contract for them is "always bypass." Order placement, payment authorization, account changes, and any call whose downstream effect is a write that depends on the read should ride the bypass tier — accepting the latency cost in exchange for the correctness guarantee.

The pattern is a per-call freshness mode in the tool invocation: freshness: "strict" forces a cache miss; freshness: "best-effort" accepts a cached value within the tool's TTL; freshness: "stale-ok" accepts an expired cached value if the source is unreachable. The agent's planner — or the tool wrapper itself — picks the mode based on the call's role in the chain.
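A sketch of that mode in code, reusing the registry and ttl_for helper from the earlier sketch; the cache interface (get returns a value-and-age pair, or None) is an assumption, not any specific client library:

```python
from enum import Enum

class Freshness(Enum):
    STRICT = "strict"            # always call the source; never read the cache
    BEST_EFFORT = "best-effort"  # accept a cached value within the tool's TTL
    STALE_OK = "stale-ok"        # accept an expired value if the source is unreachable

def call_tool(name, args, cache, freshness=Freshness.BEST_EFFORT):
    """Resolve a tool call through the cache according to the requested mode.
    `cache.get(key)` is assumed to return (value, age_seconds) or None."""
    key = f"{name}:{sorted(args.items())}"
    ttl = ttl_for(name)
    hit = None
    if freshness is not Freshness.STRICT:
        hit = cache.get(key)
        if hit is not None and hit[1] <= ttl:
            return hit[0]                       # fresh enough for best-effort and stale-ok
    try:
        value = TOOL_REGISTRY[name]["fn"](**args)
    except ConnectionError:
        if freshness is Freshness.STALE_OK and hit is not None:
            return hit[0]                       # expired, but the source is unreachable
        raise
    cache.set(key, value, ttl=ttl)
    return value

# The same tool, two roles in the same session:
# call_tool("get_inventory_status", {"sku": "A-17"}, cache)                    # grounding a reply
# call_tool("get_inventory_status", {"sku": "A-17"}, cache, Freshness.STRICT)  # before placing the order
```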

This is the same shape as the standard cache-aside, read-through, and write-through patterns from classical systems design, just with the model in the loop as a planner rather than as a cache reader. The classical patterns assume a programmer chose the strategy at code time. Agent systems need the strategy to be choosable per call, because the same tool can be called by the same agent in two different roles in the same session: once to ground a conversational answer (best-effort is fine) and once to authorize a transaction (strict is required).

Eval the stale-cache scenario explicitly

The last discipline is the easiest to skip and the highest-leverage. If your eval set was built from the live system on a quiet afternoon, it has zero stale-cache scenarios in it. The cache hit rate during eval is artificially high; the cache age is artificially low; the eval is a measurement instrument that systematically underweights the failure mode you most need to catch.

The fix is a stale-cache eval slice. Synthesize scenarios where the cached tool result is N minutes, hours, or days old, and the source-of-truth has shifted since the cache was warmed. Grade the agent on three dimensions:

  • Did it produce the right answer (using the live data)?
  • If it used the stale cache, did it disclose the staleness to the user?
  • Did it choose to refresh when the staleness exceeded a threshold the prompt named?

A team running this eval slice for the first time will usually discover their agents fail it spectacularly — because the agent has no signal that the data is stale, has no instruction about what to do when it is, and has no incentive (in the loss the eval is measuring) to disclose. All three are fixable, but only after the eval surfaces the problem. RAG-evaluation work in 2026 is starting to add stale-retrieval-rate metrics for exactly this reason: averaging stale and fresh cases together lets the average hide a failure mode that should be reported as its own number.
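A sketch of how that slice could be scored, assuming the harness records whether the agent re-called the tool and whether it mentioned data age to the user (the scenario and transcript fields are illustrative):

```python
from dataclasses import dataclass

@dataclass
class StaleCacheScenario:
    cached_value: dict          # what the cache hands the agent
    live_value: dict            # what the source of truth says now
    cache_age_seconds: int      # how stale the cached value is
    disclosure_threshold: int   # prompt-named age beyond which disclosure/refresh is required

def grade(scenario: StaleCacheScenario, transcript: dict) -> dict:
    """Score one scenario on the three dimensions above. `transcript` is
    assumed to record what the agent grounded its answer on, whether it
    re-called the tool, and whether it mentioned the data's age."""
    stale = scenario.cache_age_seconds > scenario.disclosure_threshold
    return {
        "correct_answer": transcript["grounded_on"] == scenario.live_value,
        "disclosed_staleness": (not stale) or transcript["mentioned_data_age"],
        "refreshed_when_stale": (not stale) or transcript["re_called_tool"],
    }

def stale_retrieval_rate(grades: list[dict]) -> float:
    """Report stale-case failures as their own number, not folded into the average."""
    failures = sum(1 for g in grades if not all(g.values()))
    return failures / max(len(grades), 1)
```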

What to do before the next incident

The shape of this work is unglamorous. It is not a model upgrade, it is not a new tool, it is not a slick demo. It is a freshness contract document that names every tool the agent can call, the volatility class of its underlying data, the TTL that derives from that, the freshness metadata threaded into the result, the bypass tier for high-stakes calls, and the eval slice that grades stale-cache behavior.
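One entry in that document might look like the following, kept in version control next to the tool definitions it governs; every field name and value here is illustrative:

```python
# A single tool's freshness contract, reviewed the same way schema changes are.
FRESHNESS_CONTRACT = {
    "get_inventory_status": {
        "volatility": "real-time",
        "ttl_seconds": 0,                     # bypass the cache on order paths
        "freshness_metadata": ["as_of", "data_age_seconds", "source"],
        "modes": {"order_placement": "strict", "conversation": "best-effort"},
        "eval_slice": "stale_cache/inventory",
        "owner": "checkout-platform",
    },
}
```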

Most of the agent teams that have hit a stale-cache incident in production wrote a version of this document the week after — under the duress of a postmortem, with a customer-trust regression already on the books. The teams that haven't hit the incident yet are running on a contract their cache framework wrote for them on the day they enabled it, and the contract is "one TTL fits everything, no freshness metadata, no bypass tier, no stale-case eval." That contract is doing the same thing on every team that adopts it: it works fine until the day it doesn't, and the failure looks like a hallucination, and the next quarter is spent debugging the model when the bug is in the cache layer the model can't see.

The architectural realization that has to land is that caching tool results is not the same shape of optimization as caching API responses. The consumer is a planner. The plan is only as fresh as the value. The value is only as fresh as the cache the team didn't think of as a correctness surface. Naming the freshness contract — explicitly, per tool, in code, in metadata, in evals — is what turns the cache from a quietly-accumulating bug into a deliberately-managed trade-off.

Write the contract before the incident writes it for you.
