2 posts tagged with "cost-engineering"

The Cache Stampede That Hit Your Model Provider Instead of Your Database

· 10 min read
Tian Pan
Software Engineer

The pager went off at 14:02 UTC. Not for latency, not for errors, but for spend. The cost dashboard showed a vertical line: three minutes of input-token billing at roughly nine times the trailing hourly average, then back to normal. No regression had shipped. No tenant had onboarded. Traffic was flat to the minute. The only thing that had changed was that a single prompt prefix, the 14K-token system message that every agent in the fleet shared, had quietly expired on the provider side, and a thousand workers had all decided, within the same 200ms window, that they were the ones who needed to write it back.

This is a cache stampede. It is the same bug operators have been writing post-mortems about since memcached shipped in 2003. What is new in 2026 is that the cache it stampedes is no longer yours. It lives inside your model provider, you cannot inspect its state, and every miss costs real money instead of a few extra database queries. The synchronization bug that database engineers learned to jitter away two decades ago has quietly reappeared on a bill line item nobody thought to defend.
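The "jitter it away" fix mentioned above can be sketched in a few lines. This is a minimal illustration, not the provider's mechanism: the function name `jittered_refresh_delay`, the 300-second TTL, and the 20% jitter band are all assumptions chosen for the example. Each worker picks a random refresh time inside the last part of the TTL, so refreshes spread out instead of landing in the same 200ms window.

```python
import random


def jittered_refresh_delay(ttl_seconds, jitter_fraction=0.2, rng=None):
    """Return how long a worker waits before refreshing a shared cache entry.

    Hypothetical helper: the delay is drawn uniformly from the last
    `jitter_fraction` of the TTL, so workers refresh early and at
    different moments rather than all at expiry.
    """
    if not 0.0 <= jitter_fraction <= 1.0:
        raise ValueError("jitter_fraction must be between 0 and 1")
    rng = rng or random.Random()
    earliest = ttl_seconds * (1.0 - jitter_fraction)
    return rng.uniform(earliest, ttl_seconds)


if __name__ == "__main__":
    # Assumed values: a 300-second TTL and 1,000 workers.
    rng = random.Random(42)
    delays = [jittered_refresh_delay(300.0, 0.2, rng) for _ in range(1000)]
    print(f"refreshes spread between {min(delays):.1f}s and {max(delays):.1f}s")
```

Jitter only spreads the refreshes out; it does not guarantee that exactly one worker pays for the write. Whether that is enough depends on how the provider's cache behaves, which, as noted above, you cannot inspect.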

The Hidden Token Tax: Where 30-60% of Your Context Window Disappears Before Users Say a Word

· 8 min read
Tian Pan
Software Engineer

You're paying for a 200K-token context window. Your users get maybe 80K of it. The rest vanishes before their first message arrives — consumed by system prompts, tool definitions, safety preambles, and chat history padding. This is the hidden token tax, and most teams don't realize they're paying it until they hit context limits in production.

The gap between advertised context window and usable context window is one of the most expensive blind spots in production LLM systems. It compounds across multi-turn conversations, inflates latency through attention overhead, and silently degrades output quality as useful information gets pushed into the "lost in the middle" zone where models stop paying attention.
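The arithmetic behind the tax is easy to make concrete. The sketch below uses the 200K window and roughly 80K usable figure from the text; the per-category breakdown is hypothetical, picked only so the numbers add up, and the function name `usable_context` is an assumption for illustration.

```python
def usable_context(window_tokens, overhead_tokens):
    """Return (usable_tokens, tax_fraction) for a context window.

    `overhead_tokens` maps an overhead category to the tokens it
    consumes before the user's first message arrives.
    """
    overhead = sum(overhead_tokens.values())
    if overhead > window_tokens:
        raise ValueError("overhead exceeds the context window")
    return window_tokens - overhead, overhead / window_tokens


if __name__ == "__main__":
    # Hypothetical breakdown; only the 200K window and the ~80K
    # remainder come from the text above.
    overhead = {
        "system_prompts": 30_000,
        "tool_definitions": 45_000,
        "safety_preambles": 5_000,
        "chat_history_padding": 40_000,
    }
    usable, tax = usable_context(200_000, overhead)
    print(f"usable: {usable} tokens, tax: {tax:.0%}")
```

Measuring the real per-category counts with your provider's tokenizer, rather than estimating them, is what turns this from a guess into a budget.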