
Why Token Forecasts Drift After Launch — and How to Catch the Spike Before Finance Does

10 min read
Tian Pan
Software Engineer

The pre-launch cost model is a beautiful spreadsheet. It assumes a synthetic traffic mix run through a representative prompt at a tested cache hit rate and a clean tool-call path. The post-launch reality is that none of those assumptions survive the moment the feature actually starts working. The intents your synthetic traffic didn't cover are precisely the ones that stick. The marketing surge from a campaign that engineering never got the meeting invite for lands on the highest-cost branch in your routing tree. The heavy-user cohort that uses 40× the median doesn't show up until week three.

The industry-wide version of this problem is now well-documented: surveys put the share of enterprises missing their AI cost forecasts by more than 25% at around 80%, and report routine cost increases of 5–10× in the months immediately after a successful launch. The crucial detail in those numbers is the word successful. Failed AI features stay on budget. The drift is driven by the feature working, not by the team doing something wrong. That makes it a planning artifact problem, not an engineering problem — and the planning artifact most teams reach for, the monthly bill, is the worst possible detector.

The Three Distributions Your Forecast Got Wrong

Pre-launch cost models almost always quote a single number per query: average input tokens, average output tokens, average tool-call depth. That single number is the mean of a distribution that nobody has observed yet. Three things happen on launch day that make the mean useless.

The intent distribution shifts. The synthetic eval set was built by a small team imagining what users would type. The actual traffic is dominated by intents the team didn't think to test. Teams seeing this pattern in support queues report that intent distributions reshape after launch and cause misroutes in the high-volume tier — and a misroute in an LLM workflow is not just a wrong answer, it's a full extra round-trip into a more expensive branch. Long-tail intents that account for 5% of queries can account for 30% of token spend, because they're the ones that fall through to the model with the largest context and the deepest tool chain.

The per-user distribution is power-law, not normal. The mean is meaningless when one user issues forty times the median. This is not unusual; it's the rule. Once a feature is useful, a small cohort of professional users will run it like an automation rather than a chat. Your forecast modeled the median user; the bill is dominated by the 99th percentile. Unless you slice cost by cohort from day one, the heavy users are invisible until somebody on finance asks why the curve bent.

The tool-call multiplicity distribution has a fat right tail driven by failure. Three attempts at each layer of a five-service call chain multiply out to 3^5 = 243 backend calls per user request — and the synthetic eval set, which used clean fixtures, produced one. Production teams have reported recursive agent loops that climbed from $127/week to $47,000/week over eleven days, and overnight retry loops that made thousands of identical billable tool calls before anyone woke up. The forecast assumed a happy path that the model itself doesn't always take.
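To make that fan-out concrete, the arithmetic from the example above fits in a two-line sketch; the attempt count and chain depth are the worked numbers from this paragraph, not measurements from any real system.

```python
# Sketch: retry amplification across a layered call chain.
def backend_calls(attempts_per_layer: int, layers: int) -> int:
    return attempts_per_layer ** layers

print(backend_calls(3, 5))   # 243 calls at the bottom of a five-layer chain
print(backend_calls(1, 5))   # 1 call on the clean-fixture happy path
```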

The takeaway: any forecast that quotes a single average tokens-per-request number has already lost. The unit you need to forecast is the distribution, and the distribution shifts when the feature ships.

What the Decomposition Has to Look Like

A useful unit-economics breakdown for a launched AI feature has at least four axes, and missing any one of them hides the spike when it happens.

Input vs. output tokens, sliced by intent class. Output tokens cost roughly 3–8× their input counterparts on most frontier APIs, and the median 2026 output-to-input ratio sits around 4×. A mix shift toward intents that produce long responses — explanations, summaries, generated documents — moves your blended price-per-request even when nothing else changes. You cannot detect this from a "total tokens" graph; the line keeps growing in a way that looks like usage but is actually a price increase per query.
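As a concrete illustration of that slice, here is a minimal sketch with made-up per-intent token averages and made-up prices (the 4× output premium is an assumption for the example, not any vendor's price sheet). It shows the blended cost per request moving on a mix shift alone, with every per-intent number held fixed.

```python
# Sketch: blended cost per request under two intent mixes.
# Prices and token counts are illustrative, not real vendor pricing.
PRICE_IN = 3.00 / 1_000_000    # $ per input token (assumed)
PRICE_OUT = 12.00 / 1_000_000  # $ per output token (assumed 4x the input price)

# average tokens per request by intent class (assumed): (input, output)
INTENTS = {
    "lookup":         (800, 150),
    "explanation":    (1200, 900),
    "doc_generation": (2000, 2500),
}

def blended_cost(mix: dict) -> float:
    """Cost per request for a given intent mix (shares sum to 1)."""
    total = 0.0
    for intent, share in mix.items():
        tokens_in, tokens_out = INTENTS[intent]
        total += share * (tokens_in * PRICE_IN + tokens_out * PRICE_OUT)
    return total

launch_mix = {"lookup": 0.7, "explanation": 0.2, "doc_generation": 0.1}
month2_mix = {"lookup": 0.4, "explanation": 0.3, "doc_generation": 0.3}

print(f"launch:  ${blended_cost(launch_mix):.4f}/request")
print(f"month 2: ${blended_cost(month2_mix):.4f}/request")
# Same prices, same per-intent behavior; the mix shift alone raises the
# blended price per request.
```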

Tool-call multiplicity per workflow. Track the count of model calls and tool calls per user-initiated request, not per API call. A workflow that quietly grew from 3 model calls to 7 because somebody added a new validation step shows up as a 130% cost increase that no individual prompt change can explain. Multiplicity drift is the number-one source of "we changed nothing and the bill doubled" stories.
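A minimal sketch of what "per user-initiated request, not per API call" means in practice, assuming the gateway emits one event per model or tool call keyed by a request_id; the field names are placeholders for whatever your schema actually uses.

```python
# Sketch: multiplicity per user-initiated request from gateway events.
# Field names (request_id, kind) are assumptions about the event schema.
from collections import Counter, defaultdict

events = [
    {"request_id": "r1", "kind": "model_call"},
    {"request_id": "r1", "kind": "tool_call"},
    {"request_id": "r1", "kind": "model_call"},
    {"request_id": "r2", "kind": "model_call"},
]

calls_per_request = defaultdict(Counter)
for event in events:
    calls_per_request[event["request_id"]][event["kind"]] += 1

for request_id, counts in calls_per_request.items():
    print(request_id, dict(counts), "total:", sum(counts.values()))
# The metric to trend is the distribution of that total per user-initiated
# request, not the raw count of API calls.
```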

Cache hit ratio drift week over week. Cache hit rate has been described as the single highest-leverage metric in production LLM cost, and the most commonly cited reason it drifts is mundane: somebody added a request timestamp to the system prompt for debugging, which changed every second and invalidated the cache on every request. Cache savings can fall from the 80% class to single digits because of a one-line edit, and the only thing that catches it before the invoice is monitoring the hit rate as a first-class metric. Treat any decline of more than a few percentage points week-over-week as a deploy-correlation hunt, not noise.
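Here is the shape of that check as a sketch; the three-point threshold and weekly window are illustrative defaults, not recommendations, and the alerting call is a stand-in for whatever pages your on-call.

```python
# Sketch: flag a week-over-week cache hit-rate drop as a deploy-correlation
# hunt. The 3-point threshold is illustrative.
def hit_rate(hits: int, total: int) -> float:
    return hits / total if total else 0.0

def check_cache_drift(last_week, this_week, max_drop_pts=3.0):
    """last_week / this_week are (hits, total_requests) counters."""
    prev, curr = hit_rate(*last_week), hit_rate(*this_week)
    drop_pts = (prev - curr) * 100
    if drop_pts > max_drop_pts:
        # In production this would page and attach the deploys that landed
        # between the two windows.
        print(f"ALERT: cache hit rate fell {drop_pts:.1f} pts "
              f"({prev:.1%} -> {curr:.1%}); correlate with recent deploys")

# e.g. the timestamp-in-the-system-prompt bug: the rate collapses overnight
check_cache_drift(last_week=(81_000, 100_000), this_week=(6_000, 100_000))
```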

Retry-on-failure overhead by error class. A naive retry loop without jitter or budget hammers an already-overloaded endpoint and drains your retry budget within milliseconds. Tag every retried call by error class (rate-limit, timeout, model-side 5xx, tool-side error) and graph cost-per-error-class. A new dependency degrading silently shows up here long before it shows up anywhere else.
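A sketch of the retry shape this paragraph argues for: capped attempts, exponential backoff with full jitter, a shared budget, and a metric emitted per retried call tagged by error class. classify_error and emit_metric are stand-ins for your own error taxonomy and telemetry client.

```python
# Sketch: retries with jitter and a shared budget, tagged by error class.
import random
import time

RETRY_BUDGET = {"remaining": 50}   # shared per-process budget, illustrative

def classify_error(exc: Exception) -> str:
    # Stand-in: map client exceptions to rate_limit / timeout / 5xx / tool_error
    return type(exc).__name__

def emit_metric(name: str, tags: dict) -> None:
    # Stand-in for your metrics client
    print("metric:", name, tags)

def call_with_retry(call, max_attempts: int = 3, base_delay: float = 0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception as exc:
            if attempt == max_attempts or RETRY_BUDGET["remaining"] <= 0:
                raise   # out of attempts or budget: fail fast, don't hammer
            RETRY_BUDGET["remaining"] -= 1
            emit_metric("retried_call", tags={"error_class": classify_error(exc),
                                              "attempt": attempt})
            # full jitter: random delay up to the exponential backoff ceiling
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

Graphing the retried-call metric grouped by error class is what makes the cost-per-error-class view in the paragraph above possible.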

If your dashboard does not slice on these four axes, you are watching a single line that conflates four independent mechanisms. The line will only tell you what happened, never why — and "why" is the only question that matters when you're talking to finance.

The Alerting Surface That Fires Before the Bill Arrives

The monthly bill is a 30-day-lagged scalar. You cannot run a feature on it. The alerting surface has to fire on signals that lead the bill by days, not lag it by weeks.

A working setup uses three layers. Rolling P95 spend per user cohort catches the heavy-user emergence pattern: when a small group's cost-per-day rises into a tail that the budget assumed didn't exist, you want to know in the same week, not the same quarter. Tag every API call with user_id, cohort_id, and feature_id at the gateway, then run a streaming aggregation. Anomaly detection on cost per query type catches the mix shift and the multiplicity drift: take cost-per-intent as a metric, fit a band, and alarm when the last 24 hours leaves it. Budget-burn-rate alarms catch the cliff: if a key's spend rate exceeds 3× its trailing 7-day average in a 15-minute window, auto-throttle and page. The gateway-style budget circuit breaker pattern — per-request ceilings, per-session rolling budgets, per-key caps — exists precisely so the runaway loop doesn't get to run for eight hours.
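As a sketch of the third layer, here is the burn-rate breaker reduced to its core comparison; the 3× multiplier and 15-minute window come from the paragraph above, while the class shape and the spend lookups are assumptions about how the numbers are stored.

```python
# Sketch: budget burn-rate circuit breaker per API key.
from dataclasses import dataclass

@dataclass
class BurnRateBreaker:
    multiplier: float = 3.0
    tripped: bool = False

    def check(self, spend_last_15m: float, spend_last_7d: float) -> bool:
        # normalize the trailing week to a per-15-minute rate
        baseline_15m = spend_last_7d / (7 * 24 * 4)
        if baseline_15m > 0 and spend_last_15m > self.multiplier * baseline_15m:
            self.tripped = True   # gateway throttles this key and pages on-call
        return self.tripped

breaker = BurnRateBreaker()
# trailing week: $672 (about $1.00 per 15-minute window); last 15 minutes: $4.20
if breaker.check(spend_last_15m=4.20, spend_last_7d=672.0):
    print("auto-throttle key and page")
```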

Two non-obvious things make this work in practice. First, the alarms have to be tied to projection, not threshold. "You are at 60% of budget on day 18" is a useless alarm for a feature whose cost-per-day is growing 20% per week — the projection shows the cap blown by day 25, but the next threshold alarm won't fire until day 22. The right surface is the AWS-style "you'll exceed the cap by Tuesday" projection, computed from rate of change rather than absolute value. Second, the circuit breaker is the part most teams add last and regret first. Bounded backoff and a per-session token ceiling at the gateway are what stop the $47,000-week story from happening to you while you're asleep.
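A sketch of that projection, assuming daily spend per feature is stored somewhere queryable; the growth-rate fit is deliberately crude, and the numbers mirror the day-18, 20%-per-week example above.

```python
# Sketch: "you'll exceed the cap by <date>" projected from rate of change.
from datetime import date, timedelta

def projected_breach_date(daily_spend, cap, start, horizon_days=60):
    """Extrapolate recent daily growth; return the first date cumulative
    spend exceeds the cap, or None if it stays under within the horizon."""
    spent = sum(daily_spend)
    # crude growth estimate: last day's spend vs. the same day a week earlier
    weekly_growth = daily_spend[-1] / daily_spend[-8] if len(daily_spend) >= 8 else 1.0
    daily_growth = weekly_growth ** (1 / 7)
    rate = daily_spend[-1]
    for day in range(1, horizon_days + 1):
        rate *= daily_growth
        spent += rate
        if spent > cap:
            return start + timedelta(days=len(daily_spend) + day)
    return None

# 18 days of spend growing roughly 20% per week against a $10,000 monthly cap
history = [200 * 1.026 ** i for i in range(18)]
print("projected breach:", projected_breach_date(history, cap=10_000,
                                                 start=date(2026, 3, 1)))
# Alert on the projected date, not on the percentage consumed so far.
```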

This is the structural realization the rest of the FinOps world reached years ago for cloud spend and is now applying to AI: token cost is not an invoice you reconcile at month-end, it is a stream of telemetry events that you treat exactly like latency or error rate. The instant you treat it that way, the dashboards, the alerting, and the on-call response all snap into shapes that already exist in the rest of your observability stack.

The Two-Dashboard Problem

There is an organizational pathology that breaks the conversation about cost before it starts. Engineering builds a dashboard backed by gateway telemetry, denominated in tokens, sliced by feature and cohort, and updated in real time. Finance builds a dashboard backed by the vendor invoice, denominated in dollars, allocated to cost centers, and updated monthly. The numbers never match. The numbers cannot match — they're computing different things over different windows on different attribution rules — and the meeting where this discrepancy first surfaces consumes ninety minutes that nobody intended to spend on reconciliation.

The fix is not a better dashboard; it's a single source of truth that both sides query. Tag at the gateway with everything finance needs (cost center, product line, customer ID for chargeback) so engineering's stream and finance's allocation are derived from the same span data. Reconcile against the invoice once a month as a check, not as a primary signal. The gap between gateway-computed cost and invoiced cost is itself a useful metric — it surfaces vendor pricing changes, missing telemetry, and untagged calls — but it is a debug signal, not the headline.
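One way to picture that single source of truth is as an event schema. The field names below are illustrative, but the property that matters is that every downstream view, real-time or monthly, is an aggregation of this one record.

```python
# Sketch: the atomic event both dashboards derive from. Field names are
# illustrative; engineering and finance views are aggregations of the same
# record rather than separate pipelines.
from dataclasses import dataclass, asdict

@dataclass
class TokenSpan:
    request_id: str
    user_id: str
    cohort_id: str
    feature_id: str
    intent: str
    cost_center: str         # what finance allocates on
    product_line: str
    customer_id: str         # for chargeback
    input_tokens: int
    output_tokens: int
    gateway_cost_usd: float  # computed at call time from the vendor price sheet

span = TokenSpan("r1", "u42", "power_users", "doc_assist", "doc_generation",
                 "cc-3110", "platform", "acme", 2000, 2500, 0.036)
print(asdict(span))
# Engineering groups by feature_id and intent in real time; finance groups by
# cost_center monthly; the gateway-vs-invoice gap is tracked as its own metric.
```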

The deeper point is that AI feature unit economics is a cross-functional metric. Product needs cost-per-outcome to price the feature. Engineering needs cost-per-request to optimize the architecture. Finance needs cost-per-cost-center for the GL. None of those views are wrong, but they all need to derive from the same atomic event — the tagged span — or the conversation falls apart in a forest of footnotes.

Treat Token Cost Like Latency

The architectural realization underneath all of this is simple and slightly demoralizing if you've shipped a feature without it: token cost is a real-time signal, not a monthly statement. The same way nobody would ship a user-facing feature without a P95 latency dashboard and a paging alert on regressions, no team should ship an AI feature without a cost-per-request dashboard sliced by cohort and intent, with anomaly detection wired to the same on-call rotation that watches the latency.

The metaphor is exact. Latency drifts when a downstream dependency degrades, when a code path adds a synchronous call, when a cache regresses, when traffic mix shifts toward the slow branch. Cost does the same things, for the same reasons, on the same timescales — and the cost dashboard, if you build it right, looks structurally identical to the latency dashboard. The metric has changed; the playbook has not.

The teams that figure this out before the first invoice arrives land in a different posture than the ones that don't. They argue with finance about projections instead of explanations. They roll out cohort-targeted pricing changes instead of feature-wide rollbacks. They catch the misconfigured timestamp that broke the cache on Wednesday afternoon, not three months later when the trend line finally crosses an arbitrary threshold somebody set in a budget meeting.

Forecast drift after launch is not a forecasting failure. It is a detection failure. The forecast was always going to be wrong; the question is whether you find out in real time or in retrospect. Build the telemetry as if you already know the answer is wrong, and the surprise gets a lot smaller — and a lot cheaper.
