The Caching Hierarchy for Agentic Workloads: Five Layers Most Teams Stop at Two

11 min read
Tian Pan
Software Engineer

Most teams deploying AI agents implement prompt caching, maybe add a semantic cache, and call it done. They're leaving 40-60% of their potential savings on the table. The reason isn't laziness — it's that agentic workloads create caching problems that don't exist in simple request-response LLM calls, and the solutions require thinking in layers that traditional web caching never needed.

A single agent task might involve a 4,000-token system prompt, three tool calls that each return different-shaped data, a multi-step plan that's structurally identical to yesterday's plan, and session context that needs to persist across a conversation but never across users. Each of these represents a different caching opportunity with different TTL requirements, different invalidation triggers, and different failure modes when the cache goes stale.

Here are five caching layers that production agentic systems need, why each solves a distinct problem, and why getting the cache key wrong at the higher layers causes the most expensive failures.

Layer 1: Prompt Cache — The Foundation Everyone Has

Prompt caching is the most straightforward layer. Every major LLM provider now offers it: Anthropic caches prompt prefixes for 5-60 minutes, OpenAI enables automatic caching by default, and Google's Gemini supports explicit cached contexts. The mechanism is simple — the provider stores the computed internal state (KV cache) for a prompt prefix so it doesn't reprocess those tokens on subsequent requests.

The numbers are compelling. Cost savings range from 50% to 90% on input tokens, and latency drops by 13-31% on time-to-first-token depending on the provider. For an agent with a 4,000-token system prompt making 20 calls per session, that's the difference between processing 80,000 input tokens and processing 4,000 once plus 20 smaller incremental calls.

But here's the trap that catches most agent builders: naively enabling full-context caching can paradoxically increase latency. Research on long-horizon agentic tasks found that caching the entire conversation — including tool calls and results — triggers cache writes for content that will never be reused. Tool results often contain user-specific data with zero cross-session value. The cache fills up with junk, and the write overhead exceeds the read savings.

The fix is selective caching. Cache only the stable prefix: system prompt, tool definitions, and static context. Treat tool calls, results, and dynamic conversation as ephemeral. Position dynamic content at the end of the prompt to maximize the cacheable prefix length. This seemingly obvious advice is violated by most agent frameworks that concatenate messages without considering cache boundaries.
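As a concrete illustration, here is a minimal sketch of selective caching using Anthropic's `cache_control` block format, where a marker on the last stable block caches everything up to and including it. The system prompt, tool definition, and model name are illustrative placeholders, not a definitive implementation.

```python
# Stable across sessions: these belong in the cacheable prefix.
SYSTEM_PROMPT = "You are a support agent for Acme Corp. ..."
TOOL_DEFS = [
    {"name": "search_documents", "description": "Full-text search over docs.",
     "input_schema": {"type": "object",
                      "properties": {"query": {"type": "string"}}}},
]

def build_request(history: list, user_message: str) -> dict:
    """Stable prefix first (tools, then system with cache_control);
    dynamic conversation and tool results last, outside the cached span."""
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model name
        "max_tokens": 1024,
        "tools": TOOL_DEFS,
        "system": [{
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Marks everything up to and including this block as cacheable.
            "cache_control": {"type": "ephemeral"},
        }],
        # Tool results and turns stay ephemeral, after the cached prefix.
        "messages": history + [{"role": "user", "content": user_message}],
    }
```

The point is the ordering: nothing dynamic appears before the `cache_control` marker, so every request in a session reuses the same prefix.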

Layer 2: Semantic Cache — Recognizing the Same Question in Different Words

A semantic cache sits in front of your LLM calls and intercepts queries that are meaningfully similar to previous ones. Instead of exact string matching, it embeds the query into a vector space and performs similarity search against cached query-response pairs. "How do I reset my password?" and "I can't log in — how do I change my password?" hit the same cache entry.

The performance uplift is substantial: cached responses return in 5-20 milliseconds versus 1-5 seconds for a full LLM call — a speedup of roughly two orders of magnitude when the cache hits. In production customer support systems, 31% of LLM queries exhibit sufficient semantic similarity to benefit from this layer.

The challenge for agentic workloads is cache eligibility. Not every agent response should be cached semantically. A factual lookup ("What's the refund policy?") is highly cacheable. A personalized recommendation ("Based on this user's purchase history, suggest...") is not — the response depends on context that the query embedding doesn't capture. Effective semantic caches need validation strategy tags: metadata attached to each entry that specifies conditions under which the cached response remains valid.
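A minimal in-process sketch of the two ideas above — similarity lookup plus an eligibility gate — might look like the following. The bag-of-words "embedding" is a toy stand-in for a real embedding model, and the threshold value is illustrative.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a production system would call an
    # embedding model here.
    return Counter(w.strip("?.,!") for w in text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold
        self.entries = []  # (embedding, response)

    def put(self, query: str, response: str, cacheable: bool = True):
        # Eligibility gate: personalized or context-dependent responses
        # are never stored, no matter how often they recur.
        if cacheable:
            self.entries.append((embed(query), response))

    def get(self, query: str):
        q = embed(query)
        scored = [(cosine(q, e), r) for e, r in self.entries]
        if not scored:
            return None
        score, response = max(scored, key=lambda s: s[0])
        return response if score >= self.threshold else None
```

A real deployment would add the validation-strategy tags and dependency-aware eviction discussed above; this sketch only shows the lookup path.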

Invalidation is where semantic caches get dangerous for agents. A stale response from a chatbot is annoying. A stale response from an agent that then takes action based on that response — placing an order, modifying a database, sending an email — is a production incident. Semantic caches for agentic workloads need tighter TTLs and dependency-aware invalidation: if the underlying data source changes, all cached responses derived from it must be evicted, not just the ones that happen to expire.

Layer 3: Tool Result Cache — The Layer Most Teams Miss Entirely

Nearly every agent task involves tool use, and tool calls are often the slowest and most expensive part of the pipeline. An agent querying a database, calling an API, or searching a document store might wait 200-2000 milliseconds per tool call. Over a multi-step task with 5-10 tool invocations, tool latency dominates total response time.

Tool result caching stores the outputs of tool calls keyed by the tool name and input parameters. When the agent calls search_documents(query="Q3 revenue projections") and the same query was made 30 seconds ago, the cached result returns instantly.

The critical design decision is TTL per tool category. A weather API result is stale after 15 minutes. A database query against a table that updates hourly can be cached for 30 minutes. A document search against a corpus that changes daily can be cached for hours. Most implementations use a single TTL for all tools, which means either over-caching volatile tools (serving stale data) or under-caching stable tools (wasting compute).

The right approach is category-specific TTLs with dependency tracking. Each tool registers its data volatility profile, and the cache layer enforces appropriate lifetimes. When a tool call modifies state (a write operation), all cached results from related read operations must be invalidated immediately. This is the same problem that database caching solved decades ago, but agent frameworks rarely implement it because they treat tools as stateless functions rather than data access points with consistency requirements.
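The category-specific TTLs and write-triggered invalidation described above can be sketched roughly as follows. The tool names, TTL values, and dependency groups are illustrative assumptions, and a production version would live in Redis rather than a dict.

```python
import time

# Each tool registers its volatility profile: a TTL plus the data group
# it reads or writes. Values here are illustrative.
TOOL_PROFILES = {
    "get_weather":  {"ttl": 15 * 60, "reads": "weather"},
    "query_orders": {"ttl": 30 * 60, "reads": "orders"},
    "create_order": {"ttl": 0,       "writes": "orders"},
}

class ToolResultCache:
    def __init__(self, clock=time.time):
        self.clock = clock
        self.store = {}  # (tool, frozen_args) -> (expires_at, group, result)

    def call(self, tool: str, args: dict, fn):
        profile = TOOL_PROFILES[tool]
        if "writes" in profile:
            # A write invalidates every cached read over the same data
            # group, immediately -- not on TTL expiry.
            group = profile["writes"]
            self.store = {k: v for k, v in self.store.items()
                          if v[1] != group}
            return fn(**args)
        key = (tool, tuple(sorted(args.items())))
        hit = self.store.get(key)
        if hit and hit[0] > self.clock():
            return hit[2]  # fresh cached result
        result = fn(**args)
        self.store[key] = (self.clock() + profile["ttl"],
                           profile["reads"], result)
        return result
```

The key idea is that tools are registered as data access points with a declared group, so the cache can reason about read/write consistency instead of treating every tool as a stateless function.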

There's also a privacy dimension that many teams overlook. Tool results often contain user-specific data. A workflow-level tool cache that's shared across users leaks information. The solution is two-tier tool caching: a shared cache for universal tool results (API documentation, public data lookups) and a session-scoped cache for user-specific results, with strict isolation between sessions.

Layer 4: Plan Cache — Reusing Strategy, Not Just Data

This is the layer that separates sophisticated agent deployments from the rest. When an agent receives a task, it typically generates a plan — a sequence of steps and tool calls needed to complete the task. Plan caching recognizes that many tasks are structurally identical even when the specifics differ.

"Book a flight from SFO to JFK on June 15" and "Book a flight from LAX to ORD on July 3" require the same plan: search flights, filter by criteria, select option, confirm booking. The entities change but the strategy doesn't. A plan cache extracts these structural templates from completed tasks and reuses them for new requests.

Recent research on Agentic Plan Caching demonstrates the impact: 50% cost reduction and 27% latency reduction while maintaining 96.6% of optimal task performance. The overhead of keyword extraction and cache management is just 1% of total costs. The mechanism works in three steps:

  1. Extract: After a successful task, a rule-based filter strips the execution log down to its structural skeleton, then a lightweight LLM removes context-specific values to create a generalized template.
  2. Match: New tasks are matched against cached plans using keyword extraction rather than embedding similarity. This sounds counterintuitive, but keyword matching produces fewer false positives and false negatives than semantic similarity for plan retrieval — because plans need structural similarity, not semantic similarity.
  3. Adapt: A lightweight model (not the expensive frontier model) takes the matched template and fills in task-specific details. The frontier model is only needed when no cached plan matches.
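The match step above can be sketched with plain keyword overlap. The templates, stopword list, and extractor below are illustrative stand-ins for the paper's LLM-based extraction pipeline.

```python
STOPWORDS = {"a", "an", "the", "to", "on", "from", "for", "my"}

def keywords(task: str) -> set:
    # Drop stopwords and entity-like tokens (anything containing a digit),
    # keeping only the structural vocabulary of the task.
    return {w.strip(",.") for w in task.lower().split()
            if w not in STOPWORDS and not any(c.isdigit() for c in w)}

# Cached templates: (required keywords, generalized plan steps).
PLAN_CACHE = [
    ({"book", "flight"},
     ["search_flights", "filter_by_criteria", "select_option",
      "confirm_booking"]),
    ({"refund", "order"},
     ["lookup_order", "check_refund_policy", "issue_refund"]),
]

def match_plan(task: str):
    task_kw = keywords(task)
    for template_kw, steps in PLAN_CACHE:
        # Require full coverage of the template's keywords: a wrong plan
        # hit executes wrong actions, so the bar for a match stays strict.
        if template_kw <= task_kw:
            return steps
    return None  # no match: fall back to fresh planning
```

In the full pipeline, a matched template then goes to the lightweight adaptation model; a `None` result routes the task to the frontier planner.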

The failure mode here is the most expensive in the entire hierarchy. A wrong plan cache hit — applying a template that's structurally inappropriate for the current task — causes the agent to execute an entire sequence of incorrect actions before failing. Unlike a stale data cache that returns a wrong answer, a stale plan cache causes wrong actions. Plan caches therefore need higher match thresholds than data caches, and should fall back to fresh planning whenever match confidence falls below a conservative threshold.

Layer 5: Session State Cache — Continuity Without Recomputation

The final layer handles the conversational context that persists within a user session but must be strictly isolated between users. When a user is mid-way through a multi-turn agent interaction — say, debugging a deployment issue across five messages — the session state cache preserves the agent's understanding of the problem, the tools it's already called, the hypotheses it's eliminated, and the current plan of action.

Without session state caching, every new message in a conversation requires the agent to re-derive context from the full message history. For a 20-message conversation, that means processing an increasingly large context window, hitting prompt token limits, and paying for the same context repeatedly.

Session state caching stores a structured representation of the agent's working memory: current goals, completed steps, gathered facts, and pending actions. This is distinct from simply growing the conversation history — it's a compressed, semantically meaningful summary that gives the agent continuity without the linear growth in token consumption.

The TTL for session state is tied to user engagement patterns. A customer support session might stay warm for 30 minutes after the last message. A developer debugging session might need a 2-hour window. After expiration, the state is evicted — not persisted to long-term storage, because session state that's hours old is almost always stale enough to be misleading rather than helpful.

The key design constraint is isolation. Session state must never leak between users, and it must never be shared across sessions for the same user unless explicitly designed to do so. This seems obvious, but implementations that use a shared Redis instance without proper key namespacing have caused real data leakage incidents in production agent systems.
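Pulling the three constraints together — structured working memory, engagement-based TTLs, and per-user isolation — a minimal sketch might look like this. The field names and TTL default are illustrative assumptions.

```python
import time
from dataclasses import dataclass, field

@dataclass
class SessionState:
    # Compressed working memory, not the raw message history.
    goals: list = field(default_factory=list)
    completed_steps: list = field(default_factory=list)
    facts: dict = field(default_factory=dict)
    pending_actions: list = field(default_factory=list)

class SessionStateCache:
    def __init__(self, ttl_seconds: int = 30 * 60, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # (user_id, session_id) -> (expires_at, state)

    def put(self, user_id: str, session_id: str, state: SessionState):
        # Every write refreshes the TTL: the session stays warm while the
        # user is active.
        self._store[(user_id, session_id)] = (self.clock() + self.ttl, state)

    def get(self, user_id: str, session_id: str):
        key = (user_id, session_id)  # user_id in the key enforces isolation
        entry = self._store.get(key)
        if entry and entry[0] > self.clock():
            return entry[1]
        # Expired state is evicted, not archived: hours-old working memory
        # is more misleading than helpful.
        self._store.pop(key, None)
        return None
```

Namespacing the key on `(user_id, session_id)` makes cross-user leakage a type of bug the data model forbids, rather than a convention callers must remember.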

The Compounding Effect: Why Order Matters

These five layers aren't independent — they form a hierarchy where each layer reduces the load on the layers below it. A plan cache hit means fewer tool calls, which means fewer tool cache lookups, which means fewer LLM calls, which means less pressure on the prompt and semantic caches. The compounding effect means the marginal value of each layer increases when the layers above it are working well.

The cost math makes the case clearly. Consider an agent handling 10,000 tasks per day:

  • Prompt cache alone: 50% input token savings → significant but incomplete.
  • Add semantic cache (31% hit rate): Another 30% reduction in LLM calls that reach the model.
  • Add tool result cache: 40-60% reduction in external API calls and database queries.
  • Add plan cache: 50% reduction in planning tokens, plus fewer tool calls per task.
  • Add session state: 30-40% reduction in context tokens for multi-turn sessions.

Teams that stop at two layers are optimizing the cheapest part of the pipeline (token costs) while ignoring the expensive parts (tool calls, planning, and context recomputation).
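As a back-of-envelope check, treating each layer's savings as an independent multiplicative factor — a simplification, since the layers interact and the percentages above apply to different cost components — already shows why stopping early is costly:

```python
# Fraction of baseline LLM input-token cost remaining after the first
# two layers, using the figures quoted above: 50% prompt-cache savings
# on calls that reach the model, 31% of calls absorbed by the semantic
# cache. Treated as independent factors for illustration only.
savings = [0.50, 0.31]
remaining = 1.0
for s in savings:
    remaining *= 1.0 - s
print(f"{remaining:.3f} of baseline")  # 0.5 * 0.69 = 0.345
```

Roughly 65% combined savings from two layers — and that is before the tool, plan, and session layers touch the costs this calculation ignores entirely.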

Getting Started: Pragmatic Ordering

If you're building this hierarchy incrementally, here's the order that maximizes value per engineering effort:

  1. Prompt cache: Enable it. It's free from most providers and requires zero infrastructure.
  2. Tool result cache: Add category-aware TTLs to your most-called tools. This is usually a Redis layer with tool-specific key schemas.
  3. Plan cache: Start with keyword-based matching on your most common task types. You don't need the full extraction-adaptation pipeline on day one — even a hand-curated set of plan templates provides substantial savings.
  4. Semantic cache: Add it once you have enough query volume to make hit rates worthwhile. Below a few hundred daily queries of similar types, the embedding overhead exceeds the savings.
  5. Session state cache: Implement when multi-turn sessions are a significant portion of your traffic and context window costs are growing.

The order surprises most teams because semantic caching — the layer that gets the most attention in blog posts and vendor pitches — is actually fourth in pragmatic value for agentic workloads. That's because agents have fundamentally different access patterns than chatbots: they execute plans and call tools more than they answer repeated questions. Optimize for what your agents actually do, not for what LLM caching tutorials assume they do.
