The Cached Prompt Prefix That Grew Arms and Legs
Six months ago your prompt prefix was 4,000 tokens. It was stable, cache-warm, and amortized to almost nothing — the per-call surcharge for system instructions was a rounding error against the per-call cost of the response. Today that prefix is 11,000 tokens, your cache hit rate has slid from 92% to 31%, and your inference bill is up 4x. Nobody on the team can point to the PR that did it. There is no commit message saying "increase prompt tokens by 7,000." Every change was small, every change was defended, every change shipped clean.
The prefix grew arms and legs the way a basement collects boxes. One team needed the user's tier injected so the agent could explain plan limits. Another needed today's date in the user's timezone for "remind me tomorrow" to work. A third stapled in the active A/B variant name so eval traces could be sliced. Marketing added the current promo banner so the agent could mention it on prompt. Compliance added a feature-flag manifest so the model could refuse beta features for users not in the rollout. Each was a one-line addition. Each was defensible in isolation. The aggregate destroyed your cache.
This is the new dependency hell. It compounds silently, hides from code review, and only surfaces in the line item your CFO doesn't read until it crosses some threshold. The defenses your team has built against other forms of regression — type checks, CI, eval gates — do not look at this surface. The PR that added one line to the system prompt did not change a test. It did not flip a benchmark. It changed the cacheability of every subsequent request, and the cost of that change is paid one token at a time, by every user, forever, until someone pulls the line back out.
Why The Prefix Is A Shared Resource Nobody Owns
The system prompt sits in a strange place in the codebase. It is technically just a string in a config file, but functionally it is a shared API surface. Every product team that ships an AI feature has a reason to add something to it. The PM wants the agent to know about the current promotion. The growth team wants the agent to recommend the upgrade path. Support wants the agent to never reference deprecated features. Legal wants the disclosure paragraph. The prompt is the place where all of these requirements land, because the prompt is the only place where everyone agrees the model will see the constraint.
The problem is that the prompt has no maintainer. The repo has owners for every other shared resource. The Dockerfile has someone whose job it is to keep the build slim. The CI config has someone who tunes it for cost. The package.json has a steward who pushes back on adding dependencies. The system prompt has no equivalent. The first engineer who wrote it has moved teams. The on-call rotation does not include it. When a new team needs a line added, they find the file, they add the line, they merge. There is no review for prompt-prefix bloat because there is no reviewer who has been told that's their job.
This produces a tragedy of the commons in token space. Each contributing team is rationally optimizing their own feature; none of them sees the aggregate cost of their additions. The cost is real but diffuse. The benefit is concentrated and visible — the feature works, the demo lands, the PR ships. Engineering economics 101 predicts the outcome. The commons collapse, and the bill arrives a quarter later, and the response is a panic optimization sprint instead of the steady-state ownership that would have prevented it.
What Actually Killed The Cache Hit Rate
Prompt caches across major providers share a structural assumption: they key on the bytes of the prefix, in order, up to the first divergent token. Anthropic requires explicit cache_control markers and treats the cache as a strict prefix match. OpenAI applies caching automatically at the 1,024-token threshold but uses the same exact-prefix mechanic underneath. Google's implementation differs in detail but not in spirit. In all three cases, the cache is killed not by content changes deep inside the prompt but by any change at all that happens before the cacheable boundary.
This is what makes the dynamic-injection pattern so destructive. When a team writes "Today is {date}" at line 47 of the system prompt, they have not added 11 tokens of bloat. They have added 11 tokens of bloat and invalidated every byte that follows them in the prefix. The cache reads from position 0 forward and stops at the first byte that doesn't match. Inject a per-request value at position 47, and positions 48 through 11,000 are now cache-cold, no matter how stable they are in isolation.
Engineers learn this the hard way. The first hint is usually a cost spike that doesn't track to a feature launch. Latency rises in lockstep with cost because the model is now processing tokens it used to read from cache memory. The eval scores don't move because the prompt content is still correct; the cache key just doesn't match. By the time the team understands what happened, the bill has compounded for a billing cycle, and the engineer who added the date-injection line is in a different sprint defending a different priority.
The Per-Byte Ownership Log
The pattern that holds the line is treating the prompt prefix like a shared dependency manifest. Every byte in the prefix has an owner, an introduction date, and a dependency. When a team requests an addition, the request is reviewed against the same questions a package.json dependency would face. What does this enable? Who breaks if it's removed? How does it interact with the bytes around it? When does it expire?
- https://genta.dev/resources/prompt-caching-llm-guide
- https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025
- https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching
- https://www.prompthub.us/blog/prompt-caching-with-openai-anthropic-and-google-models
- https://www.digitalocean.com/blog/prompt-caching-with-digital-ocean
- https://spring.io/blog/2025/10/27/spring-ai-anthropic-prompt-caching-blog/
- https://inference.net/content/llm-observability-monitoring-production-deployments/
- https://bentoml.com/llm/inference-optimization/prefix-caching
