Why Token Forecasts Drift After Launch — and How to Catch the Spike Before Finance Does
The pre-launch cost model is a beautiful spreadsheet. It assumes a synthetic traffic mix run through a representative prompt at a tested cache hit rate and a clean tool-call path. The post-launch reality is that none of those assumptions survive the moment the feature actually starts working. The intents your synthetic traffic didn't cover are precisely the ones that stick. A marketing surge from a campaign nobody told engineering about lands on the highest-cost branch in your routing tree. The heavy-user cohort that uses 40× the median doesn't show up until week three.
The industry-wide version of this problem is now well-documented: surveys put the share of enterprises missing their AI cost forecasts by more than 25% at around 80%, and report routine cost increases of 5–10× in the months immediately after a successful launch. The crucial detail in those numbers is the word successful. Failed AI features stay on budget. The drift is driven by the feature working, not by the team doing something wrong. That makes it a planning artifact problem, not an engineering problem — and the planning artifact most teams reach for, the monthly bill, is the worst possible detector.
The Three Distributions Your Forecast Got Wrong
Pre-launch cost models almost always quote a single number per query: average input tokens, average output tokens, average tool-call depth. That single number is the mean of a distribution that nobody has observed yet. Three things happen on launch day that make the mean useless.
The intent distribution shifts. The synthetic eval set was built by a small team imagining what users would type. The actual traffic is dominated by intents the team didn't think to test. Support queues seeing this pattern report that intent distributions reshape after launch and cause misroutes in the high-volume tier — and a misroute in an LLM workflow is not just a wrong answer, it's a full extra round-trip into a more expensive branch. Long-tail intents that account for 5% of queries can account for 30% of token spend, because they're the ones that fall through to the model with the largest context and the deepest tool chain.
The per-user distribution is power-law, not normal. The mean is meaningless when one user issues forty times the median. This is not unusual; it's the rule. Once a feature is useful, a small cohort of professional users will run it like an automation rather than a chat. Your forecast modeled the median user; the bill is dominated by the 99th percentile. Unless you slice cost by cohort from day one, the heavy users are invisible until somebody in finance asks why the curve bent.
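To make the cohort slicing concrete, here is a minimal sketch of a concentration summary over per-user spend. The function name `cost_concentration`, the 1% cutoff, and the sample data are illustrative assumptions, not taken from any particular billing tool.

```python
from statistics import median


def cost_concentration(cost_by_user: dict[str, float], top_fraction: float = 0.01) -> dict[str, float]:
    """Summarize how concentrated spend is across users (hypothetical helper)."""
    costs = sorted(cost_by_user.values(), reverse=True)
    if not costs:
        raise ValueError("need at least one user")
    top_n = max(1, round(len(costs) * top_fraction))
    med = median(costs)
    return {
        "median_cost": med,
        # How far the heaviest user sits above the median user.
        "max_over_median": costs[0] / med if med else float("inf"),
        # Share of total spend attributable to the top cohort.
        "top_share": sum(costs[:top_n]) / sum(costs),
    }


# Made-up data: 99 typical users at $1 each and one automation-style user at $40.
usage = {f"user{i}": 1.0 for i in range(99)}
usage["power_user"] = 40.0
summary = cost_concentration(usage)
print(summary)
```

In this toy data the single heavy user is 1% of the user base but roughly 29% of spend, which a per-request average would hide entirely.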
The tool-call multiplicity distribution has a fat right tail driven by failure. Three attempts at each layer of a five-service call chain can produce up to 3^5 = 243 backend calls at the deepest service for a single user request. The synthetic eval set, which used clean fixtures, produced one. Production teams have reported recursive agent loops that climbed from $127/week to $47,000/week over eleven days, and overnight retry loops that made thousands of identical billable tool calls before anyone woke up. The forecast assumed a happy path that the model itself doesn't always take.
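The retry arithmetic can be checked with a few lines. This sketch assumes the worst case, where every layer exhausts its attempts, and reads the figure above as three attempts in total per layer; the function names are illustrative.

```python
def deepest_layer_calls(attempts_per_layer: int, layers: int) -> int:
    """Calls reaching the deepest service when every layer uses all its attempts."""
    return attempts_per_layer ** layers


def calls_across_all_layers(attempts_per_layer: int, layers: int) -> int:
    """Total calls summed over every layer: layer k receives attempts**k calls."""
    return sum(attempts_per_layer ** k for k in range(1, layers + 1))


deepest = deepest_layer_calls(3, 5)
total = calls_across_all_layers(3, 5)
print(deepest, total)
```

If "three retries" instead means one original call plus three retries (four attempts per layer), the same worst case is `deepest_layer_calls(4, 5)`, which is 1024.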
The takeaway: any forecast that quotes a single average tokens-per-request number has already lost. The unit you need to forecast is the distribution, and the distribution shifts when the feature ships.
What the Decomposition Has to Look Like
A useful unit-economics breakdown for a launched AI feature has at least four axes, and missing any one of them hides the spike when it happens.
Input vs. output tokens, sliced by intent class. Output tokens cost roughly 3–8× as much as input tokens on most frontier APIs, and the median 2026 output-to-input price ratio sits around 4×. A mix shift toward intents that produce long responses (explanations, summaries, generated documents) moves your blended price-per-request even when nothing else changes. You cannot detect this from a "total tokens" graph; the line keeps growing in a way that looks like usage but is actually a price increase per query.
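A small worked example shows why the token graph understates the change. The prices ($3 and $12 per million tokens, a 4× ratio), the intent names, and the token counts are all made-up assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical prices in dollars per million tokens (4x output-to-input ratio).
INPUT_PRICE_PER_MTOK = 3.0
OUTPUT_PRICE_PER_MTOK = 12.0


@dataclass(frozen=True)
class IntentProfile:
    share: float          # fraction of requests with this intent
    input_tokens: float   # mean input tokens per request
    output_tokens: float  # mean output tokens per request


def blended_cost_per_request(mix: dict[str, IntentProfile]) -> float:
    """Expected dollar cost of one request under the given intent mix."""
    if abs(sum(p.share for p in mix.values()) - 1.0) > 1e-9:
        raise ValueError("intent shares must sum to 1")
    return sum(
        p.share
        * (p.input_tokens * INPUT_PRICE_PER_MTOK + p.output_tokens * OUTPUT_PRICE_PER_MTOK)
        / 1_000_000
        for p in mix.values()
    )


def tokens_per_request(mix: dict[str, IntentProfile]) -> float:
    """Expected total tokens per request, the number a 'total tokens' graph tracks."""
    return sum(p.share * (p.input_tokens + p.output_tokens) for p in mix.values())


before = {"lookup": IntentProfile(0.8, 1000, 100), "summary": IntentProfile(0.2, 1000, 1000)}
after = {"lookup": IntentProfile(0.5, 1000, 100), "summary": IntentProfile(0.5, 1000, 1000)}

token_growth = tokens_per_request(after) / tokens_per_request(before) - 1
cost_growth = blended_cost_per_request(after) / blended_cost_per_request(before) - 1
print(f"tokens/request: {token_growth:+.0%}, cost/request: {cost_growth:+.0%}")
```

Under these assumptions the shift toward long-output intents raises tokens per request by about 21% but cost per request by about 51%, with no change in request volume.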
Tool-call multiplicity per workflow. Track the count of model calls and tool calls per user-initiated request, not per API call. A workflow that quietly grew from 3 model calls to 7 because somebody added a new validation step shows up as a cost increase of roughly 130% (assuming calls of similar size) that no individual prompt change can explain. Multiplicity drift is the number-one source of "we changed nothing and the bill doubled" stories.
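One way to track multiplicity is to group call events by the user-initiated request that caused them. The sketch below assumes each logged event carries a workflow name, a request id, and a call kind; that event shape and the sample data are hypothetical.

```python
from collections import defaultdict


def mean_calls_per_request(events: list[tuple[str, str, str]]) -> dict[str, float]:
    """Average model + tool calls per user-initiated request, keyed by workflow.

    Each event is (workflow, request_id, kind), where kind is "model" or "tool".
    """
    per_request: dict[tuple[str, str], int] = defaultdict(int)
    for workflow, request_id, _kind in events:
        per_request[(workflow, request_id)] += 1

    by_workflow: dict[str, list[int]] = defaultdict(list)
    for (workflow, _request_id), count in per_request.items():
        by_workflow[workflow].append(count)
    return {w: sum(counts) / len(counts) for w, counts in by_workflow.items()}


# Made-up traces: two requests per week, before and after a validation step was added.
last_week = [("summarize", f"r{r}", "model") for r in range(2) for _ in range(3)]
this_week = [("summarize", f"r{r}", "model") for r in range(2) for _ in range(7)]

baseline = mean_calls_per_request(last_week)["summarize"]
current = mean_calls_per_request(this_week)["summarize"]
increase = (current - baseline) / baseline
print(f"{baseline:.1f} -> {current:.1f} calls per request ({increase:+.0%})")
```

Comparing this ratio week over week per workflow surfaces the 3-to-7 style of drift even when per-call token counts look unchanged.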
Cache hit ratio drift week over week. Cache hit rate has been described as the single highest-leverage metric in production LLM cost, and the most commonly cited reason it drifts is mundane: somebody added a request timestamp to the system prompt for debugging, which changed every second and invalidated the cache on every request. Caching savings can move from 80%-class to single digits because of a one-line edit, and the only thing that catches it before the invoice is monitoring the hit rate as a first-class metric. Treat any decline of more than a few percentage points week-over-week as a deploy-correlation hunt, not noise.
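A week-over-week check on the hit rate might look like the following sketch. The 3-point threshold is an assumed stand-in for "a few percentage points", and the weekly token counts are invented for illustration.

```python
def weekly_hit_rates(weeks: list[tuple[str, int, int]]) -> list[tuple[str, float]]:
    """Convert (label, cached_input_tokens, total_input_tokens) into hit rates in percent."""
    return [(label, 100.0 * cached / total) for label, cached, total in weeks if total > 0]


def flag_drops(weeks: list[tuple[str, int, int]], threshold_points: float = 3.0) -> list[tuple[str, float, float]]:
    """Return (label, previous_rate, rate) for each week whose hit rate fell by more
    than threshold_points percentage points relative to the week before it."""
    rates = weekly_hit_rates(weeks)
    flagged = []
    for (_prev_label, prev_rate), (label, rate) in zip(rates, rates[1:]):
        if prev_rate - rate > threshold_points:
            flagged.append((label, prev_rate, rate))
    return flagged


# Made-up data in chronological order; week 3 mimics a timestamp landing in the system prompt.
weeks = [
    ("week 1", 800_000, 1_000_000),
    ("week 2", 790_000, 1_000_000),
    ("week 3", 60_000, 1_000_000),
]
flagged = flag_drops(weeks)
print(flagged)
```

Each flagged week is a prompt to diff the deploys from that window, starting with anything that touched the cached prefix of the prompt.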
Retry-on-failure overhead by error class. A naive retry loop without backoff, jitter, or an attempt cap hammers an already-overloaded endpoint and burns through any retry allowance within milliseconds. Tag every retried call by error class (rate-limit, timeout, model-side 5xx, tool-side error) and graph cost-per-error-class. A new dependency degrading silently shows up here long before it shows up anywhere else.
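As a rough sketch of the alternative, the wrapper below caps attempts per call, sleeps with exponential backoff and full jitter, and tags each retry by error class. The `classify` mapping is hypothetical; a real one would inspect the status codes or exception types of whatever SDK is in use, and the per-call attempt cap is only a simple stand-in for a shared retry budget.

```python
import random
import time
from collections import Counter


def classify(exc: Exception) -> str:
    """Hypothetical mapping from an exception to an error class label."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    return "other"


def call_with_retries(fn, retry_counts: Counter, max_attempts: int = 3,
                      base_delay: float = 0.5, max_delay: float = 8.0,
                      sleep=time.sleep):
    """Call fn(), retrying up to max_attempts total and counting retries by error class."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the error instead of looping
            retry_counts[classify(exc)] += 1
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0.0, cap))


# Usage: a fake call that times out twice and then succeeds.
failures = iter([TimeoutError("slow"), TimeoutError("slow")])


def flaky():
    exc = next(failures, None)
    if exc is not None:
        raise exc
    return "ok"


retry_counts = Counter()
result = call_with_retries(flaky, retry_counts, sleep=lambda _seconds: None)
print(result, dict(retry_counts))
```

Multiplying each class's retry count by the cost of the retried call gives the cost-per-error-class series to graph.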
