Skip to main content

The CDN Edge Cache Your AI Feature Could Not Use Because the Response Varies Per User

· 9 min read
Tian Pan
Software Engineer

The product team set the SLO for the new AI summarizer at 200ms TTFB because that is what the rest of the product hits at p50. Nobody on the call asked where the 200ms came from. It came from a decade of static assets and JSON responses served out of a CDN edge cache with an 85% hit rate, where most requests never reached origin and the ones that did were small. The summarizer is per-user, generated fresh each call, and travels edge → origin → model provider on every request. The SLO was structurally unmeetable on day one. The team discovered this in week six, after the dashboard had been red the whole time.

This is a recurring pattern in AI feature launches. The latency bar an organization built on top of one set of physics gets inherited by a feature with completely different physics, and the gap between the inherited target and the achievable floor becomes a months-long mitigation project instead of a Day 0 design constraint. The numbers do not care that the SLO was negotiated with a customer in good faith.

The Hidden Gift of a Decade of Static Assets

The cache hit rates that quiet product engineers have grown up with are not a property of the application — they are a property of the workload. Static assets are byte-identical across users. API responses for the catalog page are byte-identical across logged-out users and shareable across cohorts of logged-in ones. The CDN sits a few milliseconds from the user, fields most requests entirely, and only escalates the long tail to origin. The team treats edge latency as the typical case because, for that workload, it is.

The numbers behind this gift are large. In typical setups, TTFB for cached content runs around 37ms versus 136ms for uncached — a roughly 73% penalty on the miss path before you have done any work. And that is just cache-miss-but-origin-is-fast. When origin itself is a fan-out to a model provider hundreds of milliseconds away, the penalty is not a factor of three. It is a factor of ten or more.

The crucial part is that this performance was never something the application team earned. It was a hidden subsidy from how the web was built. Static assets are cacheable because URLs are stable identifiers and bytes are deterministic. JSON catalog responses are cacheable because the relevant inputs are coarse — locale, region, maybe an A/B bucket — and a cache keyed on those fields hits often. The team that built on top of this never had to think about cacheability as a property they were responsible for maintaining. It came in the box.

A personalized AI feature does not come in that box. The prompt prefix contains the user's history, the user's preferences, and the user's most recent action. The output is generated rather than retrieved, and the model is a stateless function whose input distribution is essentially unique per request. Every request misses every cache by construction. The team inherits the SLO and not the physics.

The Latency Budget That Was a Property of a Different Stack

The discipline that gets skipped here is restating the latency budget when the cacheability model changes. The conversation that should happen at design time is uncomfortable because it forces a renegotiation with whoever signed off on the customer-facing SLO, but it is the cheapest place to have it. Six weeks later, with a dashboard full of red and a contract obligation in the legal queue, the same conversation costs a quarter.

The mental model that fails here is treating the SLO as a portable target rather than as a property of the workload that produced it. A 200ms TTFB on a cached catalog response and a 200ms TTFB on a per-user generated summary are not the same kind of number. The first is a measurement of how close the CDN edge is to the user. The second would be a measurement of how fast a model can generate the first token of a response that did not exist before the request arrived. The two numbers happen to have the same units, but they belong to different problems.

A useful question to ask before any AI feature launch: what is the floor of the latency this feature can achieve given the caching properties it has, not the caching properties of the surfaces it lives next to? The floor is not the average. The floor is the irreducible minimum: network round-trips you cannot collapse, model TTFT you cannot beat, plus any synchronous setup. If your customer-facing SLO is below the floor, the SLO is broken before you have written code. You can argue the SLO down to the floor before launch, or you can argue with the dashboard for the next two quarters.

Feature-Class Taxonomy: Latency Budgets That Match Cacheability

The first concrete pattern that fixes this is refusing the unified product-wide latency SLO and instead defining a feature-class taxonomy where each class has its own budget calibrated to its cacheability model. A few useful classes:

  • Statically cacheable: deterministic responses that are identical across users (autocomplete dictionaries, common documentation lookups). p50 TTFB target measured at the edge, no model call required for most hits.
  • Semantically cacheable: responses that are not byte-identical but are semantically equivalent across enough users that a vector-similarity cache fires often (FAQ-style answers, common code-review patterns). p50 budget reflects the embedding lookup plus a vector index hit; p99 reflects the cold path through the model.
  • Per-user generated: responses whose prompt prefix contains user-specific data that varies on every call. The latency floor is dominated by the model provider's TTFT; the team has no path under it without changing the model or the prefix structure.

The honest taxonomy admits that the per-user-generated class will never hit the SLO of the cacheable classes and refuses to inherit their numbers. The customer-facing surface then either accepts the higher latency, or the feature design moves work out of the per-user-generated class and into one of the cacheable ones — usually by separating the personalization layer from the generation layer.

What Cache You Can Still Build at the Provider Boundary

Refusing to inherit the wrong SLO does not mean giving up on caching entirely. It means moving the cache from the CDN edge to the model provider boundary, where the unit of cacheability is different. Two layers are worth building deliberately.

Prompt prefix caching. Both major providers ship this now. Anthropic's cache_control breakpoints let you mark a stable prefix and pay roughly 10% of the input price on cache reads, against a 25% write premium. OpenAI auto-caches stable prefixes above a threshold and bills cached tokens at 50% of normal input. The shape of the win is the same: if you can hold a large system prompt or a large tool catalog or a large retrieved chunk constant across many requests, you stop paying for it on each call and you cut TTFT meaningfully.

The pattern that breaks prompt prefix caching is exactly the pattern that personalization tends to ship by default: putting the user's name, account ID, or recent history early in the prompt, before any of the stable scaffolding. The cache key is the prefix; if the prefix varies per user, the cache never warms. The fix is mechanical but easy to forget — move all per-request, per-user content after the cached boundary, and keep the system prompt and tool definitions and any cohort-shared context above it. A few minutes of prompt restructuring can be the difference between a 90% cache discount and zero.

Semantic caching. This is the other layer, and it is fundamentally different from prompt caching. Prompt caching cuts the cost and latency of a call that still happens; semantic caching eliminates the call. You embed the incoming query, look it up against an embedding index of past queries above a similarity threshold, and serve the past response. Reported hit rates in the wild land in the 60-70% range for query-shaped traffic — high enough that the saved cost dominates the embedding-plus-vector-lookup overhead.

Semantic caching has sharper edges than prompt caching. The similarity threshold is a tuning problem with real failure modes: too loose and you serve a wrong answer to a question that looked superficially similar; too tight and the hit rate collapses. And critically, semantic caching for a personalized feature has the same scope problem as edge caching — if the response depends on user history, the cache has to be scoped per-user, which collapses the hit rate because each user's query distribution is small. The win is real for shared question patterns; it does not transfer to genuinely per-user-generated content.

The Architectural Realization You Have to Make Out Loud

The underlying realization that gets missed: personalized AI features inherit different physics from the cached web. The CDN economics that quietly powered every prior feature were not a free property of the stack — they were a property of the workload. Static assets and shareable JSON responses earned the cache hit rate; personalized generated content does not.

The teams that ship AI features without naming this end up paying origin-grade latency on every request while their dashboards still display targets calibrated to edge-grade workloads. The dashboards are not lying; the targets are just from the wrong universe.

The team that makes this realization out loud, early, does three things. They publish a feature-class taxonomy and refuse to negotiate a single product-wide latency SLO across classes with different floors. They build the caches that the model provider boundary actually supports — prompt prefix caching for the system prompt, semantic caching for shared query patterns — and they design the prompt structure to make those caches fire. And they tell the customer-facing stakeholders, in writing, that the personalized-generation class has a different latency floor than the rest of the product and that the SLO has to reflect it.

The team that does not make the realization keeps debugging a dashboard that was red the moment the feature shipped. The work to migrate to the right SLO is the same either way. The only difference is whether it gets done before the contracts are signed.

References:Let's stay in touch and Follow me for more thoughts and updates