19 posts tagged with "prompt-caching"

Prompt Caching's Hidden Tax: When a Cache Hit Serves the Wrong User's Context

May 17, 2026 · 11 min read

Software Engineer

Prompt caching is sold as a free win. Cache the long shared prefix — your system prompt, your tool definitions, your retrieved context — pay full price only for the short tail that changes, and watch the bill drop. The numbers are real: a cache read costs roughly a tenth of a fresh input token, so a workload with a heavy stable prefix can see its input cost fall by 80% or more. Teams adopt it for that reason, tune it for that reason, and report on it with a single metric: cache hit rate, trending up.

What that framing hides is that the boundary you just drew — the line between the cached prefix and the uncached tail — is not a billing knob. It is a correctness boundary. Everything above the cache breakpoint is content the system has decided is interchangeable across requests. If you draw that line to maximize hit rate, you are letting a finance metric decide which facts in your prompt are allowed to be shared between users, between tenants, and across time. That is an isolation decision, and it deserves to be made on purpose.

The failure mode is quiet because it never throws. A cache hit that serves one user's context shaped by another user's profile returns a perfectly well-formed response. A cache hit that serves personalization that was true when the prefix was warmed and false by the time it is reused returns a confident, coherent, wrong answer. Nothing in your latency graph or your error rate moves. The only signal is a hit rate that looks great — because the key is too coarse.

Pre-Commit Hooks for Prompts: The Inner-Loop Tooling LLM Teams Keep Shipping Without

May 14, 2026 · 10 min read

Tian Pan

Software Engineer

Open a prompt file in any production LLM repo and watch the reviewer's eyes glaze over. The diff is fifteen lines of natural language with a tweaked few-shot example, a reworded instruction, and a stray trailing space the editor left behind. There is no syntax check that ran on it, no linter complaining about contradictory instructions, no scanner that noticed the few-shot example contains a real customer's email address from last Tuesday's support trace, and no smoke eval that confirmed the change didn't tank latency on the prompts the system actually serves. The reviewer approves on vibes — the same way teams approved HTML template diffs in 2008 — and then production telemetry catches the regression six hours later.

The inner-loop tooling around code has had two decades to mature. The inner-loop tooling around prompts is somewhere between "we have a .md file in git" and "we ran promptfoo once after onboarding." The gap is widening because prompts are now the higher-leverage edit in many systems: a thirty-line system-prompt change moves more behavior than a thousand-line service rewrite, and it ships through a review process that treats it like a Word document.

The Sliding-Window Tax: Why a 30-Turn Conversation Costs More Than 30x a Single Turn

May 13, 2026 · 9 min read

Tian Pan

Software Engineer

The conversation looks healthy on the dashboard. Average tokens per call is sane, the p50 input length is comfortably inside the cached prefix, the provider invoice ticks up at the rate finance approved. Then someone exports a single 200-turn coding session and the line item for that one user is larger than the rest of the team's daily traffic combined. The dashboard wasn't lying — it was averaging. The bill comes from the long tail, and the long tail does not scale linearly with turn count.

Every multi-turn AI feature eventually meets this surprise. The per-call token count is the wrong unit of measurement, because the cost of a 30-turn conversation is not 30 times the cost of a single turn — it's something between 50× and 200×, depending on how the history is structured, how the prompt cache decays, and what tier the request lands in once the input crosses 200K tokens. The team that priced the feature off the per-call number is underwriting a tail it never modeled.

Prompt Cache as Covert Channel: TTFT Probing Leaks Cross-Tenant Prompts

May 10, 2026 · 11 min read

Tian Pan

Software Engineer

Prompt caching is the optimization that pays for itself the moment you turn it on. A long system prompt is hashed once, the KV state lives in GPU memory, and every subsequent request that reuses the prefix skips the prefill cost. Providers report 80% latency reduction and 90% input-cost reduction on cached requests, and at scale the math is irresistible: a single shared prefix amortized across millions of calls turns a line item into a rounding error.

The mechanism that makes the savings work is a shared resource whose hit-or-miss state is observable as latency. That observability is the side channel. A cache hit and a cache miss are distinguishable from outside the network, the difference is large and deterministic, and the optimization that earned its place on the cost dashboard has a second job nobody scoped: it leaks information about what other tenants on the same provider are doing right now.

Prompt Cache Thrashing: When Your Largest Tenant's Launch Triples Everyone's Bill

April 28, 2026 · 10 min read

Tian Pan

Software Engineer

The bill arrives on the first of the month and it is three times what your spreadsheet said it would be. Nobody pushed a system prompt change. The dashboard says request volume is flat. p95 latency looks normal. The token-per-correct-task ratio is unchanged. And yet you owe the inference vendor an extra forty thousand dollars, and the only signal in the observability stack that even hints at why is a metric most teams never alarm on: cache hit rate, which dropped from 71% to 18% somewhere in the second week of the billing cycle, on a Tuesday, at 9:47 AM Pacific, which is when your largest tenant's customer-success team kicked off a coordinated onboarding push for two hundred new users.

Welcome to prompt cache thrashing — the multi-tenant failure mode that the SaaS playbook was supposed to have eliminated a decade ago, reintroduced through the back door by your inference provider's shared prefix cache. The provider's cache is shared across your organization's traffic. Your tenants share that cache with each other whether you want them to or not, and a single tenant whose prefix shape shifts overnight can evict the prefixes everyone else's unit economics depended on. The bill spikes for tenants who did nothing differently. Finance pages engineering. Engineering points at the dashboard, which shows nothing wrong, because the dashboard isn't measuring the thing that broke.

Prompt Cache Hit Rate: The Production Metric Your Cost Dashboard Is Missing

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

The first time your team enables prompt caching, it feels like free money. Within hours, your token cost drops 40–60% and latency shrinks. Engineers celebrate and move on. Three months later, someone notices costs have quietly crept back up. The cache hit rate that started at 72% is now 18%. Nothing was deliberately broken. Nobody noticed.

This is the most common arc in production LLM deployments: caching is enabled once, never monitored, and silently degrades as the codebase evolves. Cache hit rate is the most impactful cost lever in an LLM stack, and most teams treat it as a one-time setup task rather than a production metric.

Cold Cache, Hot Cache: Why Your LLM Latency Numbers Lie in Staging

April 10, 2026 · 9 min read

Tian Pan

Software Engineer

Your staging environment says p50 latency is 400ms. Your production dashboard says 1.8 seconds. You check the code — same model, same prompt, same provider. Nothing changed between deploy and release. The numbers shouldn't diverge this much, but they do.

The culprit is almost always cache state. Prompt caching — the single biggest latency optimization most teams rely on — behaves fundamentally differently under staging traffic patterns than production traffic patterns. And if you don't account for that difference, every latency number you collect before launch is fiction.

About Tian Pan