Prompt Caching: The Optimization That Cuts LLM Costs by 90%
Most teams building on LLMs are overpaying by 60–90%. Not because they're using the wrong model or prompting inefficiently — but because they're reprocessing the same tokens on every single request. Prompt caching fixes this, and it takes about ten minutes to implement. Yet it remains one of the most underutilized optimizations in production LLM systems.
Here's what's happening: every time you send a request to an LLM API, the model runs attention over every token in your prompt. If your system prompt is 10,000 tokens and you're handling 1,000 requests per day, you're paying to process 10 million tokens daily just for the static part of your prompt — context that never changes. Prompt caching stores the intermediate computation (the key-value attention states) so subsequent requests can skip that work entirely.
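To put numbers on that, here is a back-of-the-envelope sketch using the Claude 3.5 Sonnet prices quoted later in this article, ignoring TTL expiries for simplicity:

```python
# Back-of-the-envelope: daily cost of a static 10,000-token prefix at 1,000 requests/day.
# Prices are per million input tokens (Claude 3.5 Sonnet, per the figures below).
STANDARD_PER_M = 3.00  # uncached input
WRITE_PER_M = 3.75     # cache write (25% premium)
READ_PER_M = 0.30      # cache read (90% discount)

prefix_tokens = 10_000
requests_per_day = 1_000

uncached_cost = requests_per_day * prefix_tokens / 1e6 * STANDARD_PER_M
# With caching: one write, then reads for the rest of the day
cached_cost = (prefix_tokens / 1e6 * WRITE_PER_M
               + (requests_per_day - 1) * prefix_tokens / 1e6 * READ_PER_M)

print(f"uncached: ${uncached_cost:.2f}/day, cached: ${cached_cost:.2f}/day")
```

That works out to roughly $30/day uncached versus about $3/day cached for the static prefix alone, which is where the ~90% headline figure comes from.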
How Prefix Caching Actually Works
Under the hood, transformer models build a "KV cache" — key-value tensors computed during the attention mechanism. Prefix caching persists these tensors across API calls. When a new request shares a prefix with a cached one, the model skips recomputing those attention states and jumps straight to processing only the new tokens.
The economic model differs by provider:
Anthropic Claude requires explicit opt-in via cache_control markers. Cache writes cost 25% more than standard input tokens, but cache reads cost only 10% of the base rate — a 90% discount. For Claude 3.5 Sonnet, that's $3.75/M tokens to write, $0.30/M to read (vs. $3.00/M standard).
The break-even is remarkably low: the one-time write premium is 25% of the base rate, while each cache read saves 90%, so the very first cache hit (0.90 saved against a 0.25 surcharge) already puts you ahead. After that, every additional read is pure savings.
```python
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # required by the API
    system=[{
        "type": "text",
        "text": large_static_context,  # Your docs, codebase summary, etc.
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": user_query}]
)
```
OpenAI caches automatically — no code changes required. You get a 50% discount on cached tokens with no write premium. Monitor cache efficiency via response.usage.prompt_tokens_details.cached_tokens.
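A quick way to track that metric, sketched here against a plain dict shaped like the API's usage payload rather than a live response:

```python
def cached_fraction(usage: dict) -> float:
    """Fraction of prompt tokens OpenAI served from its automatic prefix cache."""
    cached = usage["prompt_tokens_details"]["cached_tokens"]
    total = usage["prompt_tokens"]
    return cached / total if total else 0.0

# Illustrative payload shaped like response.usage
usage = {"prompt_tokens": 12_000, "prompt_tokens_details": {"cached_tokens": 10_240}}
fraction = cached_fraction(usage)
print(f"{fraction:.0%} of prompt tokens were cached")
```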
Self-hosted vLLM enables automatic prefix caching (APC) via a hash-table-based KV block storage with O(1) lookups. Production deployments report 14–24x throughput improvement over naive implementations for workloads with shared prefixes.
Key technical constraints across providers:
- Minimum 1,024 tokens to be eligible for caching
- 5-minute default TTL (extends with activity; Anthropic can reach 1 hour)
- Up to 4 cache checkpoints per request (Anthropic)
- Cache is private to your organization — no cross-tenant leakage
Prompt Structure Is Cache Architecture
Prefix caching only works when requests share an identical prefix. This sounds obvious, but it has non-trivial consequences for how you structure prompts.
The golden rule: static content first, dynamic content last.
[System prompt — stable across all requests]
[Retrieved documents — stable for a given session]
[Conversation history — grows per turn]
[Current user message — always new]
If you inject the user's name, timestamp, or request ID into the system prompt, you're invalidating the cache on every request. The cache key is the exact byte sequence; even a single character difference creates a new cache entry.
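As a concrete sketch of the rule (the prompt text and IDs here are made up; the byte-identical-prefix principle is the point):

```python
from datetime import datetime, timezone

# Static prefix: byte-identical on every request, so it stays cacheable
static_system = "You are a support assistant for Acme Corp. Follow the policies below..."

# Bad: per-request values baked into the system prompt create a new cache entry each call
bad_system = f"{static_system}\nCurrent time: {datetime.now(timezone.utc)}\nUser ID: u_12345"

# Good: the static prefix is untouched; dynamic values travel in the (uncached) user turn
good_user_message = (
    f"Current time: {datetime.now(timezone.utc)}\n"
    f"User ID: u_12345\n\n"
    "Question: How do I reset my password?"
)
```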
For RAG systems, this changes retrieval architecture. Rather than building one long prompt from scratch each time, cache the retrieved documents explicitly:
```python
# Cache the static document context
system_with_docs = [{
    "type": "text",
    "text": f"{base_system_prompt}\n\n# Knowledge Base\n{retrieved_docs}",
    "cache_control": {"type": "ephemeral"}
}]

# Only the question varies
messages = [{"role": "user", "content": question}]
```
For conversational agents, Anthropic lets you place cache breakpoints inside the message history (up to 4 per request), so the conversation so far becomes part of the cached prefix. Multi-turn interactions where the system prompt and early exchanges remain static therefore get cheaper as the conversation progresses, the opposite of what you'd expect.
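A minimal sketch of that incremental pattern, assuming a simple list-of-dicts history; `with_history_breakpoint` is a hypothetical helper, not an SDK function:

```python
def with_history_breakpoint(history: list[dict]) -> list[dict]:
    """Copy the history and mark its final content block as a cache breakpoint,
    so the entire conversation up to that point becomes a cacheable prefix."""
    messages = [dict(m) for m in history]
    last = messages[-1]
    # Normalize plain-string content into the block form that accepts cache_control
    blocks = ([{"type": "text", "text": last["content"]}]
              if isinstance(last["content"], str)
              else [dict(b) for b in last["content"]])
    blocks[-1]["cache_control"] = {"type": "ephemeral"}
    messages[-1] = {**last, "content": blocks}
    return messages

history = [
    {"role": "user", "content": "Summarize the attached report."},
    {"role": "assistant", "content": "The report covers Q3 revenue and churn..."},
    {"role": "user", "content": "Now focus on the risks section."},
]
messages = with_history_breakpoint(history)
```

On each new turn you would append to `history` and call the helper again, moving the breakpoint forward so the previous turns are read from cache.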
The Parallel Execution Trap
Here's a failure mode that catches teams by surprise: parallelization kills caching.
Cache creation takes 2–4 seconds for large documents. If you fire off 10 parallel requests before the first cache has been written, each request processes the full prompt independently. You get 10 cache writes, 0 cache reads, and a bill 5–10x what you expected.
Real-world data: teams that naively parallelize without cache warming see hit rates as low as 4%. The fix is a dedicated warmup call before parallel processing:
```python
async def process_document_parallel(doc: str, questions: list[str]):
    # Warm the cache with a minimal call first
    await client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1,
        system=[{
            "type": "text",
            "text": doc,
            "cache_control": {"type": "ephemeral"}
        }],
        messages=[{"role": "user", "content": "Ready."}]
    )
    # Now all parallel requests hit the cache
    tasks = [ask_question(doc, q) for q in questions]
    return await asyncio.gather(*tasks)
```
Cost comparison for a 30,000-token document with 3 parallel questions: $0.34 without warming vs. $0.14 with warming — a 59% reduction from this single change.
Beyond Prefix Caching: The Full Stack
Prefix caching handles repeated prefixes, but misses semantically equivalent queries with different phrasing. Semantic caching adds another layer that intercepts queries before they hit the API at all.
The architecture: incoming queries are embedded and searched against a vector store of previous queries. If a sufficiently similar query was answered before, return the cached response directly — 100% savings, no API call at all.
Production metrics from teams running semantic caching on FAQ-style workloads:
- 61–69% cache hit rates
- 97%+ accuracy on hits (measuring whether the cached answer is appropriate)
- 40–50% latency reduction on hits
The full multi-tier stack looks like:
Request
→ Semantic cache (exact/near-duplicate queries) → 100% savings
→ Prefix cache (shared static context) → 50–90% savings
→ Full inference → 0% savings
A well-tuned system with stable system prompts, consistent document retrieval, and repetitive user queries can route 70–80% of tokens through one of the caching layers.
Semantic caching has clear failure cases: creative generation, personalized responses, time-sensitive information, or any workload where queries rarely repeat. The vector similarity threshold also requires calibration — a static 0.8 cosine similarity performs poorly across diverse query types. Adaptive thresholds that adjust based on query complexity outperform static ones.
When Caching Hurts
Caching isn't always beneficial. Watch for:
One-shot workflows: If each user session is completely unique with no shared context, you're paying a 25% write premium (Anthropic) and getting zero reads. Run the numbers before enabling cache_control everywhere.
Dynamic system prompts: Teams that personalize system prompts heavily (injecting user preferences, current date, dynamic instructions) undermine prefix caching entirely. Consider whether the personalization can be moved to a separate, later section of the prompt.
Short prompts: Below the 1,024-token threshold, caching doesn't engage at all. If your system prompt is 500 tokens, neither expanding it artificially nor restructuring helps — you need to focus elsewhere.
Cache misses during cold start: Freshly deployed services, after deployment restarts, or at low-traffic hours where cache TTLs expire frequently will see higher effective costs until the cache warms. Factor this into your cost modeling.
Monitoring What Matters
Both Anthropic and OpenAI return cache utilization in API responses. Build a dashboard tracking:
- Cache hit rate: `cache_read_input_tokens / total_input_tokens`; target 70%+ for stable-prompt workloads
- Effective cost per request: blended across cache reads, writes, and misses
- Write/read ratio: a high ratio means prompts are changing too frequently or TTLs are expiring
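The hit-rate metric can be computed directly from Anthropic-style usage fields (`cache_read_input_tokens`, `cache_creation_input_tokens`, and `input_tokens`, which is reported net of cached tokens); the payload below is illustrative:

```python
def cache_hit_rate(usage: dict) -> float:
    """Share of all input tokens that were served from the prefix cache."""
    reads = usage.get("cache_read_input_tokens", 0)
    writes = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)  # reported net of cached tokens
    total = reads + writes + uncached
    return reads / total if total else 0.0

usage = {"input_tokens": 50, "cache_creation_input_tokens": 0,
         "cache_read_input_tokens": 9_950}
rate = cache_hit_rate(usage)
print(f"hit rate: {rate:.1%}")
```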
If your hit rate is low despite similar requests, audit your prompt structure. Common culprits: timestamps in system prompts, user IDs in static sections, or document content that varies slightly across requests.
The Practical Checklist
If you're building on LLMs today and haven't implemented caching:
- Identify system prompts over 1,024 tokens — these are immediate wins
- Add `cache_control: {"type": "ephemeral"}` to static content (Anthropic) or verify auto-caching is on (OpenAI)
- Audit prompt structure: all static content must come before dynamic content
- If running parallel workloads over shared documents, add a warmup call
- Monitor hit rates in production and iterate on prompt structure until you're above 70%
The economics make this one of the highest-ROI optimizations available for LLM-based products. At significant API spend, the savings from a well-tuned caching strategy can exceed the engineering cost of the entire product. The ten minutes it takes to add cache_control markers is probably the best-spent time in your infrastructure backlog.
