3 posts tagged with "prompt-caching"

Prompt Cache Thrashing: When Your Largest Tenant's Launch Triples Everyone's Bill

10 min read
Tian Pan
Software Engineer

The bill arrives on the first of the month and it is three times what your spreadsheet said it would be. Nobody pushed a system prompt change. The dashboard says request volume is flat. p95 latency looks normal. The token-per-correct-task ratio is unchanged. And yet you owe the inference vendor an extra forty thousand dollars, and the only signal in the observability stack that even hints at why is a metric most teams never alarm on: cache hit rate. It dropped from 71% to 18% somewhere in the second week of the billing cycle, on a Tuesday at 9:47 AM Pacific, the moment your largest tenant's customer-success team kicked off a coordinated onboarding push for two hundred new users.
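That arithmetic is worth making concrete, because hit rate alone really can move a bill by that much. Here is a back-of-the-envelope sketch; the pricing multipliers (cache reads at ~10% of the base input rate, cache writes at a ~25% premium) are assumptions modeled on common provider pricing, not any specific price sheet:

```python
# Back-of-the-envelope: blended input-token cost as a function of cache hit rate.
# The multipliers below are ASSUMPTIONS modeled on common provider pricing
# (cache read ~10% of base input rate, cache write ~25% premium); use yours.

CACHE_READ = 0.10   # assumed: cached prompt tokens billed at 10% of base
CACHE_WRITE = 1.25  # assumed: uncached prompt tokens pay a 25% cache-write premium

def blended_cost(hit_rate: float) -> float:
    """Expected price per prompt token, normalized to a base input rate of 1.0."""
    return hit_rate * CACHE_READ + (1.0 - hit_rate) * CACHE_WRITE

before = blended_cost(0.71)  # healthy cache
after = blended_cost(0.18)   # post-launch
print(f"before: {before:.2f}  after: {after:.2f}  ratio: {after / before:.1f}x")
```

Under these assumed multipliers the blended rate jumps roughly 2.4x with zero change in request volume, prompt length, or output tokens. That is most of a tripled bill, hiding in a single ratio.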

Welcome to prompt cache thrashing — the multi-tenant failure mode that the SaaS playbook was supposed to have eliminated a decade ago, reintroduced through the back door by your inference provider's shared prefix cache. The provider's cache is shared across your organization's traffic. Your tenants share that cache with each other whether you want them to or not, and a single tenant whose prefix shape shifts overnight can evict the prefixes everyone else's unit economics depended on. The bill spikes for tenants who did nothing differently. Finance pages engineering. Engineering points at the dashboard, which shows nothing wrong, because the dashboard isn't measuring the thing that broke.
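The eviction dynamic itself is easy to reproduce in miniature. The sketch below models the provider's shared cache as a toy LRU map keyed by prompt prefix; the capacity, tenant names, and traffic shape are invented for illustration, but the mechanism is the same one your real traffic exercises:

```python
from collections import OrderedDict

class PrefixCache:
    """Toy LRU prefix cache standing in for a provider's shared cache."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: OrderedDict[str, None] = OrderedDict()
        self.hits = self.misses = 0

    def lookup(self, prefix: str) -> None:
        if prefix in self.entries:
            self.entries.move_to_end(prefix)  # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            self.entries[prefix] = None       # cache the prefix on miss
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used

cache = PrefixCache(capacity=100)

# Steady state: five tenants with stable system-prompt prefixes.
steady = [f"tenant-{t}/prompt-{p}" for t in range(5) for p in range(10)]
for _ in range(20):
    for prefix in steady:
        cache.lookup(prefix)
print(f"steady-state hit rate: {cache.hits / (cache.hits + cache.misses):.0%}")

# Tuesday, 9:47 AM: tenant 0 onboards 200 users, each with a unique prefix.
cache.hits = cache.misses = 0
burst = [f"tenant-0/user-{u}" for u in range(200)]
for _ in range(20):
    for prefix in steady + burst:
        cache.lookup(prefix)
print(f"post-launch hit rate: {cache.hits / (cache.hits + cache.misses):.0%}")
```

Run it and the steady-state hit rate comes out around 95%; after the burst, the cyclic access pattern exceeds the cache capacity and LRU collapses to roughly 0% for everyone, including the four tenants that changed nothing.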

Prompt Cache Hit Rate: The Production Metric Your Cost Dashboard Is Missing

10 min read
Tian Pan
Software Engineer

The first time your team enables prompt caching, it feels like free money. Within hours, your token cost drops 40–60% and latency shrinks. Engineers celebrate and move on. Three months later, someone notices costs have quietly crept back up. The cache hit rate that started at 72% is now 18%. Nothing was deliberately broken. Nobody noticed.

This is the most common arc in production LLM deployments: caching is enabled once, never monitored, and silently degrades as the codebase evolves. Cache hit rate is the most impactful cost lever in an LLM stack, and most teams treat it as a one-time setup task rather than a production metric.
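Treating hit rate as a production metric doesn't require new infrastructure; the counters are already in the API response. A minimal sketch, assuming usage fields shaped like Anthropic's cache_read_input_tokens / cache_creation_input_tokens (other providers expose equivalents under different names), with a placeholder alert threshold:

```python
from collections import deque

class CacheHitRateMonitor:
    """Rolling prompt-cache hit rate over the last `window` requests."""

    def __init__(self, window: int = 1000, alert_below: float = 0.50):
        self.samples: deque[tuple[int, int]] = deque(maxlen=window)
        self.alert_below = alert_below  # placeholder; tune to your baseline

    def record(self, usage: dict) -> None:
        # Field names follow the shape of Anthropic's usage block; map your
        # provider's equivalents here. `input_tokens` counts uncached input.
        cached = usage.get("cache_read_input_tokens", 0)
        total = (cached
                 + usage.get("cache_creation_input_tokens", 0)
                 + usage.get("input_tokens", 0))
        self.samples.append((cached, total))

    def hit_rate(self) -> float:
        total = sum(t for _, t in self.samples)
        return sum(c for c, _ in self.samples) / total if total else 0.0

    def check(self) -> None:
        rate = self.hit_rate()
        if rate < self.alert_below:
            # Stand-in for a real alert: page before finance does.
            print(f"ALERT: prompt cache hit rate {rate:.0%} < {self.alert_below:.0%}")
```

Call record() after every model response and check() on a timer; a hit rate drifting from 72% toward 18% then pages engineering months before the invoice does.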

Cold Cache, Hot Cache: Why Your LLM Latency Numbers Lie in Staging

9 min read
Tian Pan
Software Engineer

Your staging environment says p50 latency is 400ms. Your production dashboard says 1.8 seconds. You check the code — same model, same prompt, same provider. Nothing changed between deploy and release. The numbers shouldn't diverge this much, but they do.

The culprit is almost always cache state. Prompt caching — the single biggest latency optimization most teams rely on — behaves fundamentally differently under staging traffic patterns than under production ones. And if you don't account for that difference, every latency number you collect before launch is fiction.
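One way to keep pre-launch numbers honest is to measure the two cache states as separate distributions instead of letting a replay loop blend them into one p50. A minimal sketch; call_model and make_prompt are hypothetical stand-ins for your client and prompt builder, and make_prompt(i) is assumed to salt the prefix so the provider has never cached it:

```python
import statistics
import time

def timed(call_model, prompt: str) -> float:
    """Wall-clock seconds for a single model call."""
    start = time.perf_counter()
    call_model(prompt)
    return time.perf_counter() - start

def cold_vs_warm(call_model, make_prompt, n: int = 30) -> None:
    """Report cold-cache and warm-cache p50 as separate numbers.

    make_prompt(i) must return a prompt with a never-before-seen prefix
    (e.g. a salted system prompt), so the first call is a guaranteed
    cache write and the immediate replay is a guaranteed cache read.
    """
    cold, warm = [], []
    for i in range(n):
        prompt = make_prompt(i)
        cold.append(timed(call_model, prompt))  # first sight: cache miss/write
        warm.append(timed(call_model, prompt))  # immediate replay: cache hit
    print(f"cold p50: {statistics.median(cold) * 1000:.0f} ms")
    print(f"warm p50: {statistics.median(warm) * 1000:.0f} ms")
```

Production latency then sits between those two lines, weighted by your real hit rate; a staging harness that replays one canned prompt is measuring only the warm one.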