The Conversation Summary Your Agent Regenerated Each Turn Because the Cache Key Included a Timestamp
A cache that is written to but whose reads never hit is not a cache. It is a logging system with extra latency, billed by the kilobyte. The cruelest version of this failure mode is the one where the cache looks healthy from every angle except the one that matters: the set calls succeed, the get calls return quickly, the keys are well-formed, the values are valid, the TTLs are sensible. The only thing wrong is that no get call ever finds the key a previous set call wrote, because a single field in the key changes every time it is computed.
This is the story of a debugging session that added a timestamp to a cache key "so I can tell which cache entry I'm looking at," and of the system that quietly paid for as many as fourteen summary generations per conversation, where one would have done, for two weeks before anyone noticed.
The Setup the Cache Was Built For
The agent in question handled long-running customer support conversations. Average conversation length crept up through the year as users learned the product could carry context across many turns, and the team eventually hit the ceiling of the model's context window often enough to need a strategy. They picked the obvious one: a running summary, regenerated as the conversation grew, prepended to each new turn in place of the older messages.
The summary itself was generated by a separate model call. Cheaper model, terse prompt, structured output. It cost a few cents per generation, which sounded fine in isolation, but multiplied across the conversation volume it would have been brutal if every turn paid for it. So the team did the right thing and cached it.
The cache key was straightforward:
hash(conversation_id, last_message_id)
The semantics were exactly what you'd want. Two turns that produce the same summary input produce the same key. The summary for "conversation 47, after message 12" is computed once and reused on every subsequent read until message 13 arrives, at which point the key changes and a new summary is computed. The hit rate sat at 94% for months, which is roughly the share of summary reads that found an existing entry rather than minting a new one, and it was approximately what the math predicted.
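A minimal sketch of that key scheme, to make the semantics concrete. The names `summary_cache_key`, `SummaryCache`, and `fake_generate` are hypothetical, the in-memory dict stands in for whatever cache backend the team actually used, and SHA-256 over a JSON-encoded pair is just one reasonable way to realize the hash:

```python
import hashlib
import json


def summary_cache_key(conversation_id: str, last_message_id: str) -> str:
    """Deterministic key: the same two inputs always produce the same key."""
    # JSON-encoding the pair avoids ambiguity between e.g. ("a:b", "c") and ("a", "b:c").
    raw = json.dumps([conversation_id, last_message_id]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class SummaryCache:
    """In-memory stand-in for the real cache backend (an assumption)."""

    def __init__(self, generate):
        self._store = {}
        self._generate = generate  # callable that performs the expensive summary call
        self.generations = 0       # how many times the expensive call actually ran

    def get_summary(self, conversation_id: str, last_message_id: str) -> str:
        key = summary_cache_key(conversation_id, last_message_id)
        if key in self._store:
            return self._store[key]  # hit: no model call
        self.generations += 1        # miss: pay for one generation
        value = self._generate(conversation_id, last_message_id)
        self._store[key] = value
        return value


def fake_generate(conversation_id: str, last_message_id: str) -> str:
    # Placeholder for the summary model call.
    return f"summary of {conversation_id} up to {last_message_id}"


cache = SummaryCache(fake_generate)
cache.get_summary("47", "12")  # miss, generates
cache.get_summary("47", "12")  # hit, reused
cache.get_summary("47", "13")  # new message, new key, generates
print(cache.generations)       # 2
```

Three reads cost two generations here: the repeated read for message 12 is free, and only the arrival of message 13 forces a new summary.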
The Debugging Session That Added a Field
A junior engineer was investigating an unrelated bug where the summary occasionally seemed stale. The actual cause turned out to be a race condition in the upstream message store, but during the investigation, the engineer wanted a quick way to disambiguate cache entries in a debugger. They added a cached_at timestamp to the cache key.
Their reasoning made sense in context: "I keep looking at two entries and I can't tell which one I just wrote." The timestamp meant every write produced a visibly distinct key, and they could correlate cache contents to logs by the timestamp suffix. The PR did exactly what it said it did. The reviewer saw a small change in a cache layer, a one-line addition with no test changes required because the cache had no tests beyond "the round-trip works," and approved it.
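Why a round-trip test could not catch this is easier to see in code. The sketch below is a hypothetical reconstruction, not the team's actual code: `timestamped_key` imitates the debugging change, and a fake clock that advances on every call stands in for wall-clock time so the result is deterministic. A round-trip test writes and reads with the same key object, so it passes either way; a key-stability test recomputes the key, as a later request would.

```python
import functools
import hashlib
import itertools
import json


def stable_key(conversation_id, last_message_id):
    raw = json.dumps([conversation_id, last_message_id]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def timestamped_key(conversation_id, last_message_id, now):
    # Hypothetical version of the debugging change: cached_at becomes part of the key.
    cached_at = now()
    raw = json.dumps([conversation_id, last_message_id, cached_at]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def round_trip_ok(key_fn):
    """The existing style of test: write a value, read it back with the SAME key."""
    store = {}
    key = key_fn("47", "12")
    store[key] = "summary"
    return store[key] == "summary"


def key_is_stable(key_fn):
    """The missing test: compute the key twice, as two separate requests would."""
    return key_fn("47", "12") == key_fn("47", "12")


# Fake clock that returns a new value on every call (stand-in for real time).
ticks = itertools.count(1_700_000_000)


def fake_now():
    return next(ticks)


broken_key = functools.partial(timestamped_key, now=fake_now)

print(round_trip_ok(stable_key), key_is_stable(stable_key))  # True True
print(round_trip_ok(broken_key), key_is_stable(broken_key))  # True False
```

If the engineer still wants the timestamp for debugging, storing `cached_at` inside the cached value rather than the key would keep entries distinguishable without changing what a lookup matches.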
The race-condition bug was eventually fixed elsewhere. The timestamp field was forgotten. The cache layer continued to function as a write-only store: every call missed, wrote a new entry, returned the freshly computed value, and moved on. From the outside, nothing looked wrong. The summary endpoint returned correct results. Latency was a little higher, but well within the noise band. No errors were thrown.
The Two Weeks Nobody Noticed
The cache hit rate dropped from 94% to 0% the day the change shipped. The LLM bill for summary generation tripled across the next two weeks. The team noticed the cost spike during the monthly finance review, but their first reading was a product-level story rather than a system-level one: "Conversations are getting longer, more of them are crossing the compression threshold, so we're doing more summaries." It was a coherent narrative. It fit the trend line of the prior six months. It just happened to be wrong.
The actual diagnosis came when an engineer profiled a single fifteen-turn conversation end to end and counted fourteen summary generations against a baseline of one. The summary for turns 3 through 14 had been cached and reused under the old key scheme; under the new scheme, each turn produced a key that no previous turn had written, so the cache was effectively cold on every read.
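The one-versus-fourteen count can be reproduced with a toy simulation. This is a sketch under a stated assumption: all fourteen summary reads in the profiled conversation cover the same summarized prefix (here conversation "47", message "12"), which the profile implies but the numbers above do not spell out. The function names and the counter-based clock are illustrative only.

```python
import hashlib
import itertools
import json


def make_key(parts):
    return hashlib.sha256(json.dumps(parts).encode("utf-8")).hexdigest()


def count_generations(reads, include_timestamp):
    """Count how many summary generations `reads` lookups cost under each key scheme."""
    clock = itertools.count(1)  # fake clock: a new value on every read
    store = {}
    generations = 0
    for _ in range(reads):
        parts = ["47", "12"]           # same conversation, same summarized prefix
        if include_timestamp:
            parts.append(next(clock))  # the extra field that changes on every call
        key = make_key(parts)
        if key not in store:
            generations += 1           # miss: one more model call
            store[key] = "summary"
    return generations


print(count_generations(14, include_timestamp=False))  # 1
print(count_generations(14, include_timestamp=True))   # 14
```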
A few things made the misdiagnosis last as long as it did:
- The cache layer's own metrics page showed "writes per second" and "reads per second" but not "hit rate," because nobody had wired hit rate into the dashboard when the cache was built. The team had been operating on the assumption that the cache was working because the application was working.
- The LLM provider's billing dashboard aggregates by model, not by call site. The increase showed up as "more calls to the summary model," which was true but uninformative.
- The conversation length distribution actually had drifted longer. There was a real, smaller secondary signal that confirmed the team's first hypothesis, which meant the cache regression hid inside a real trend.
- The summary endpoint had no SLO that would have caught the latency drift. Each individual call was within budget. The aggregate cost was the symptom, and the cost dashboard was only reviewed monthly.
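The missing hit-rate metric from the first point above is cheap to add at the cache wrapper. A minimal sketch, assuming a simple get/set interface; `InstrumentedCache` and its counters are hypothetical names, and a real deployment would export these counters to its metrics system rather than keep them on the object:

```python
class InstrumentedCache:
    """Dict-backed cache that counts hits and misses on every get."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def set(self, key, value):
        self._store[key] = value

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = InstrumentedCache()
for key in ["a", "a", "a", "b"]:
    if cache.get(key) is None:
        cache.set(key, "value")

print(cache.hits, cache.misses, cache.hit_rate)  # 2 2 0.5
```

With this number on a dashboard, a drop from 94% to 0% on the day of a deploy would have been visible immediately, independent of any cost review.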
