
Prompt Cache Break-Even: The Exact Math on When Provider-Side Prefix Caching Actually Pays Off

· 9 min read
Tian Pan
Software Engineer

Prompt caching sounds like a clear win: Anthropic and OpenAI both advertise a 90% discount on cache hits, and the documentation shows impressive cost reduction charts. Teams implement it, monitor the cache hit rate counter going up, and assume they're saving money. Some of them are paying more than if they hadn't cached anything.

The issue is the write premium. Every time you cache a prefix on Anthropic's API, the write is billed at a surcharge: 1.25× the base input price for the 5-minute cache window, 2× for the 1-hour window. If your hit rate is too low, those write premiums accumulate faster than the read discounts recover them. Caching is not free insurance; it's a bet you place against your own traffic patterns.

Here's the exact math on when that bet pays off, and the architecture decisions that determine whether you're capturing the discount or subsidizing wasted cache writes.

The Break-Even Formula

For Anthropic Claude Sonnet 4.6, the numbers are:

  • Base input cost: $3.00/M tokens
  • 5-minute cache write: $3.75/M tokens (1.25× premium)
  • Cache read: $0.30/M tokens (90% discount)
  • Net savings per cache hit: $3.00 − $0.30 = $2.70/M

The break-even point for a 5-minute cache compares one write plus N reads against the same N + 1 requests sent uncached:

$3.75 + N × $0.30 = (N + 1) × $3.00
$0.75 = N × $2.70
N = 0.28 reads

Note that the write premium is only $0.75/M; the other $3.00 of the write price replaces the base input charge you would have paid anyway. A single cache read within the 5-minute window already puts you ahead. The loss case is zero reads: you pay the 1.25× write price and recover nothing.

For the 1-hour cache window, the write cost doubles to 2× base ($6.00/M), so the premium to recover is $3.00/M:

$6.00 + N × $0.30 = (N + 1) × $3.00
$3.00 = N × $2.70
N = 1.11 reads

You need 2 reads within the hour. OpenAI's math differs: its automatic prompt caching charges no write premium at all, so cache reads there are pure savings. The break-even question applies specifically to explicit caching with write surcharges, like Anthropic's.
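The break-even for either window can be checked by comparing the total cost of one cache write plus N reads against N + 1 uncached requests. A minimal sketch using Sonnet 4.6 list prices (`breakeven_reads` is an illustrative helper):

```python
def breakeven_reads(base: float, write_mult: float, read_mult: float) -> float:
    """Reads N at which one cache write plus N reads costs the same as
    N + 1 uncached requests: write + N*read = (N + 1)*base."""
    write = base * write_mult              # $/M tokens for the cache write
    read = base * read_mult                # $/M tokens for a cache read
    return (write - base) / (base - read)  # write premium / per-read savings

BASE = 3.00  # Sonnet 4.6 input price, $/M tokens

five_min = breakeven_reads(BASE, write_mult=1.25, read_mult=0.10)
one_hour = breakeven_reads(BASE, write_mult=2.00, read_mult=0.10)
print(f"5-minute cache: {five_min:.2f} reads to break even")  # 0.28
print(f"1-hour cache:   {one_hour:.2f} reads to break even")  # 1.11
```

Swapping in another model's prices only changes the constants; the structure of the bet is the same.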

These numbers mean that caching is easy to justify for workloads with even moderate reuse — but it's a liability for anything with sparse, intermittent traffic. A system prompt on a 1-hour cache, triggered by a slow trickle of requests that never accumulates 2 hits before expiry, will cost more than no caching at all.

How Prefix Length Changes the Absolute Stakes

The break-even read count stays constant regardless of prefix length, but the dollar magnitude scales linearly. This changes how aggressively you should pursue caching.

For Sonnet 4.6, consider 10 cache reads on a cached prefix of varying lengths:

| Prefix length | Write cost | 10 reads | Total w/ cache | Total w/o cache (11 requests) | Savings |
|---|---|---|---|---|---|
| 2,000 tokens | $0.0075 | $0.0060 | $0.0135 | $0.066 | 80% |
| 50,000 tokens | $0.1875 | $0.15 | $0.3375 | $1.65 | 80% |
| 200,000 tokens | $0.75 | $0.60 | $1.35 | $6.60 | 80% |

Percentage savings are roughly constant at about 80%. Absolute savings scale with prefix length: a correctly cached 200K-token system prompt saves $5.25 across one write and ten reads. At 10,000 daily requests with a 90% hit rate, each hit saves $0.54 and each write adds only a $0.15 premium, netting roughly $4,700/day; at 10 requests a day with the same hit rate, it's about $4.70/day. The ROI logic is the same; only the business case changes.
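The rows above can be regenerated mechanically. A minimal sketch, comparing one cache write plus ten reads against the same eleven requests sent uncached (Sonnet 4.6 list prices; `cache_costs` is an illustrative helper):

```python
def cache_costs(prefix_tokens: int, reads: int,
                base=3.00, write_mult=1.25, read_mult=0.10):
    """Dollar cost of 1 cache write + `reads` cache reads, versus the same
    reads + 1 requests sent uncached. Rates are $/M input tokens."""
    m = prefix_tokens / 1e6
    with_cache = m * base * write_mult + reads * m * base * read_mult
    without = (reads + 1) * m * base
    return with_cache, without

for prefix in (2_000, 50_000, 200_000):
    w, wo = cache_costs(prefix, reads=10)
    print(f"{prefix:>7,} tokens: ${w:.4f} cached vs ${wo:.4f} uncached "
          f"({1 - w / wo:.0%} savings)")
```

Because every term is linear in prefix length, the savings percentage is identical in each row; only the dollar stakes change.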

The minimum token threshold matters here too. Both Anthropic and OpenAI require at least 1,024 tokens in the cacheable prefix before caching activates. Prompts shorter than that are simply not cached: the breakpoints are ignored and every request pays the base input rate.

What Real Hit Rates Look Like in Production

The gap between theoretical and actual cache hit rates is where most implementations fail. Two concrete data points from production deployments:

The 7% → 84% improvement: A security tooling company found their initial implementation had a 7.4% cache hit rate. The cause: dynamic content embedded in the system prompt. Timestamps, request IDs, and session tokens placed before the stable content caused every request to hash differently. Moving that dynamic content to appear after the stable prefix, outside the cached block, pushed hit rates to 84% without touching the underlying logic. A single structural change.

Multi-step task vs. single-step task: The same company measured 91.8% hit rates on multi-step agent tasks (where the same large context recurs across steps) versus 35.5% on single-step tasks (where each request is essentially unique). The architecture of the task, not just the prompt design, determines achievable hit rates.

These cases illustrate a general rule: cache hit rate is primarily an architectural property, not a prompt property. It depends on how you structure request flow, where you place volatile vs. stable content, and whether your traffic pattern produces repeated prefix matches within the cache TTL.

Session Architecture Patterns That Maximize Hit Rate

For multi-turn chat, the most effective pattern puts the stable system prompt first, followed by growing conversation history as a secondary cache block:

[Cached block 1: System prompt — never changes]
[Cached block 2: Conversation history — grows each turn]
[Uncached: Current user message]

Each turn writes a new cache entry for the extended history, but reads the system prompt from cache. After the second turn, the full conversation prefix is cached. Hit rates for this pattern in customer support or assistant applications typically land in the 40–60% range, with most of the savings coming from the large, stable system prompt.
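A sketch of that ordering as an Anthropic Messages API payload. The `cache_control` breakpoint syntax follows Anthropic's documented format; the model id, system prompt, and history handling here are illustrative assumptions:

```python
SYSTEM_PROMPT = "You are a support assistant for Acme Corp..."  # stable, never changes

def build_request(history: list[dict], user_message: str) -> dict:
    """Stable content first, cache breakpoints marked with cache_control,
    and the volatile current message left uncached at the end."""
    messages = list(history)
    if messages:
        # Breakpoint 2: cache everything up through the last history turn.
        messages[-1] = dict(messages[-1])
        messages[-1]["content"] = [{
            "type": "text",
            "text": messages[-1]["content"],
            "cache_control": {"type": "ephemeral"},
        }]
    messages.append({"role": "user", "content": user_message})  # uncached
    return {
        "model": "claude-sonnet-4-6",  # illustrative model id
        "max_tokens": 1024,
        # Breakpoint 1: the system prompt, cached on its own.
        "system": [{
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": messages,
    }
```

Moving the breakpoint forward each turn is what lets turn N + 1 read the prefix that turn N wrote.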

For document Q&A (RAG), the win is larger and simpler. If you load a 50K-token document as a cached prefix, every query against that document reads from cache. With 1,000 daily queries against the same document and a 90% hit rate, your input costs drop from ~$150/day to ~$33/day (cache reads plus write premiums and the small per-query cost). The math is compelling because the document token count dominates the input.

The worst pattern is the reverse: placing any volatile element — a timestamp, a user-specific ID, a dynamic instruction — in the middle of the prompt, before the stable content. This fragments the cache prefix at the volatile element and eliminates all subsequent cache reuse. Every instruction like Today is {date} embedded early in a system prompt breaks caching for everything that follows it.

The Parallel Request Race Condition

There's a timing failure mode specific to high-throughput batch workloads. When you fire 100 parallel requests simultaneously, the first request triggers a cache write that takes 2–4 seconds to materialize. The other 99 requests arrive during that write window, before the cache entry is available. All 100 pay the write overhead; none can read from cache.

The cost: with 1,000-token prefixes on Sonnet and 100 concurrent requests, you pay $0.375 in write costs where plain input would have cost $0.30, and get zero cache reads. You've increased your cost by 25% for no benefit, and the penalty scales linearly with prefix length.

The fix is to issue a single warm-up request before firing the parallel batch, or to introduce a short queue that ensures the first request's write completes before siblings arrive. For truly batch workloads, sequential warm-up is cheap and eliminates the race condition.
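The warm-up pattern can be sketched with asyncio. `call_model` here is a stand-in for your real API client; the point is only that the first request completes before the siblings fire:

```python
import asyncio

async def call_model(prompt: str) -> str:
    """Placeholder for a real API call; replace with your client."""
    await asyncio.sleep(0.01)  # simulate network + cache-write latency
    return f"response to: {prompt[:20]}"

async def run_batch(shared_prefix: str, tasks: list[str]) -> list[str]:
    # Warm-up: one sequential request writes the cache entry first.
    await call_model(shared_prefix + tasks[0])
    # The remaining requests fire in parallel and can read the cached prefix.
    return await asyncio.gather(
        *(call_model(shared_prefix + t) for t in tasks)
    )

results = asyncio.run(
    run_batch("LONG SHARED PREFIX... ", [f"task {i}" for i in range(100)])
)
print(len(results))  # 100
```

The warm-up duplicates one request, which costs a single extra read; against 100 avoided write premiums that is noise.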

Calculating Your Actual Break-Even Hit Rate

Rather than working from the required number of reads, you can compute the break-even hit rate directly for a given workload.

For a request where the cached prefix is P tokens, the fresh non-cached input is F tokens, and you expect H total requests per cache TTL window:

Cost with caching = P × write_rate + (H-1) × P × read_rate + H × F × base_rate
Cost without caching = H × (P + F) × base_rate

Setting them equal and solving for H gives the break-even request count per window (the F terms appear on both sides and cancel, so fresh tokens don't affect it). For a 1,500-token system prompt on Sonnet:

  • write_rate = $3.75/M, read_rate = $0.30/M, base_rate = $3.00/M
  • Break-even at H = ($3.75 − $0.30) / ($3.00 − $0.30) ≈ 1.3 total requests — one write plus a single read already clears it
  • As H grows, the per-request amortized write cost drops to near zero

The practical implication: the break-even hit rate decreases as your traffic volume increases. The reads needed to cover the write premium are fixed — about 0.3 per window for the 5-minute cache, 1.1 for the 1-hour cache — while the opportunities to hit grow with volume. At 10 requests per window, a single hit puts you ahead; at 100 requests per window, a hit rate above roughly 1% does. For high-volume workloads, caching is almost always correct. For low-volume workloads, it requires deliberate verification.
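The same comparison can be wrapped into a helper that finds the smallest hit fraction at which caching stops losing money. A sketch under the post's assumptions (one cache write per TTL window, misses billed at the base rate; `window_costs` and `breakeven_hit_rate` are illustrative names):

```python
def window_costs(prefix_tokens, fresh_tokens, requests, hits,
                 base=3.00, write_mult=1.25, read_mult=0.10):
    """Dollar cost of one cache-TTL window: 1 write, `hits` reads, and
    requests - 1 - hits misses billed at the base rate."""
    p, f = prefix_tokens / 1e6, fresh_tokens / 1e6
    fresh = requests * f * base                # fresh tokens are never cached
    cached = p * base * write_mult             # one write per window
    cached += hits * p * base * read_mult      # cache reads
    cached += (requests - 1 - hits) * p * base # misses pay the base rate
    return cached + fresh, requests * (p + f) * base

def breakeven_hit_rate(prefix_tokens, fresh_tokens, requests, **kw):
    """Smallest hit fraction at which caching is no worse than not caching."""
    for hits in range(requests):
        with_cache, without = window_costs(prefix_tokens, fresh_tokens,
                                           requests, hits, **kw)
        if with_cache <= without:
            return hits / requests
    return 1.0

# 1,500-token system prompt, 500 fresh tokens, 100 requests per window:
print(f"{breakeven_hit_rate(1_500, 500, 100):.1%}")  # 1.0%
```

Running the same call with `requests=10` returns 10%: one hit out of ten, which is the fixed hit count spread over fewer opportunities.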

A Decision Framework

Before implementing prompt caching, verify each of these conditions:

Implement if:

  • Cacheable prefix is at least 1,024 tokens
  • Expected requests per cache window exceed 3 (conservative threshold for 1-hour cache)
  • The stable portion of your prompt represents more than 30% of total input tokens
  • You've audited your prompt for embedded dynamic content and moved it after the stable block

Don't implement if:

  • Requests arrive at intervals longer than your target cache TTL
  • The prompt is mostly unique per request (tool outputs, full document injections that change per call)
  • Your total daily request volume is low enough that the write premium doesn't amortize

Always do before shipping:

  • Log prompt structure for at least 100 requests to measure actual prefix stability
  • Monitor cache hit rate in production, not just cache write count
  • Calculate expected monthly cost with and without caching given your observed hit rate
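The prefix-stability audit can be approximated offline from logged prompts. A rough character-level sketch (real caching matches token prefixes, so treat this as a proxy only):

```python
def common_prefix_len(a: str, b: str) -> int:
    """Length of the shared leading substring of a and b."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

def prefix_stability(prompts: list[str]) -> float:
    """Mean fraction of each prompt that matches its predecessor's prefix.
    Near 1.0 means a stable cacheable prefix; near 0.0 means volatile
    content (timestamps, IDs) is fragmenting the prefix early."""
    if len(prompts) < 2:
        return 0.0
    ratios = [common_prefix_len(prev, cur) / max(len(cur), 1)
              for prev, cur in zip(prompts, prompts[1:])]
    return sum(ratios) / len(ratios)

logged = ["SYSTEM: help users. Q: refund?", "SYSTEM: help users. Q: shipping?"]
print(f"{prefix_stability(logged):.0%}")
```

A low score on this measure is exactly the 7.4%-hit-rate failure mode described earlier: something volatile sits before the stable block.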

What the Savings Actually Look Like

Two concrete monthly cost calculations for a team running Sonnet 4.6:

Customer support with 10K tickets/day, 1,500-token system prompt, 4,500 tokens of fresh per-ticket context (history, tool output):

  • Without caching: $180/day, $5,400/month
  • With caching at a 99% hit rate: ~$140/day, ~$4,200/month
  • Savings: ~22% — modest, because the system prompt is only 25% of input

Document Q&A with 1K queries/day, 50K-token document, 200-token query:

  • Without caching: ~$151/day, ~$4,520/month
  • With caching at 90% hit rate: ~$33/day, ~$985/month
  • Savings: 78% — dramatic, because the document dominates input
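Both scenarios fall out of the same per-day arithmetic. A sketch, with `daily_cost` as an illustrative helper (a hit rate of 0 with a 1.0 write multiplier reproduces the no-caching baseline):

```python
def daily_cost(requests, cached_tokens, fresh_tokens, hit_rate,
               base=3.00, write_mult=1.25, read_mult=0.10):
    """Daily input cost in dollars with prefix caching at a given hit rate.
    Every miss re-writes the prefix; fresh tokens always pay the base rate."""
    c, f = cached_tokens / 1e6, fresh_tokens / 1e6
    hits = requests * hit_rate
    writes = requests - hits
    return (writes * c * base * write_mult
            + hits * c * base * read_mult
            + requests * f * base)

# Document Q&A: 1K queries/day against one 50K-token document
no_cache = daily_cost(1_000, 50_000, 200, hit_rate=0.0, write_mult=1.0)
with_cache = daily_cost(1_000, 50_000, 200, hit_rate=0.9)
print(f"${no_cache:.0f}/day -> ${with_cache:.0f}/day")  # $151/day -> $33/day
```

Plugging in your own observed hit rate, rather than an optimistic target, is the last verification step before shipping.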

The difference comes down to the ratio of cached tokens to total input tokens. A large, stable cached prefix is where the economic leverage lives. A small system prompt above mostly unique fresh content produces small savings regardless of hit rate.

The Right Way to Think About It

Prompt caching is a bet on traffic structure, not a feature you enable. The API provides a 90% discount on repeated prefix tokens; whether that discount exceeds the write premium depends entirely on your request patterns.

Get the math right before instrumenting it: calculate your expected hit rate from your actual traffic, verify that the hit count per cache window exceeds your break-even threshold, and eliminate any dynamic content from your cached prefixes before measuring. Teams that skip those steps consistently undercount the cost of cache writes and overcount the benefit of cache reads.

When the conditions are right — stable large prefixes, moderate-to-high request density, no volatile content contaminating the prefix — the savings are substantial and the implementation is cheap. When those conditions aren't met, you're paying a 25% surcharge on every cache write that never earns a read. Know which situation you're in before you ship.
