The Heavy Tail Your Token Forecast Never Priced

· 9 min read
Tian Pan
Software Engineer

The cost forecast for your AI feature was modeled on a 50-user pilot. Those users typed three-sentence prompts because that is what people type into a beta they were asked to evaluate. Production launched, you crossed ten thousand users, and the finance team flagged that your model bill is running at three times the per-user number from the deck. You went looking for the bug. There is no bug. Your pilot was sampling from one distribution and production is sampling from another, and the difference between them is a long tail of users who learned about your product on Twitter and are pasting thirty kilobytes of unstructured context they copied from a thread.

This is the same financial mistake every consumer internet company learned in the 2010s, transplanted onto LLM economics. The pilot's median user is not the production p99.5, and a token cost model that uses the mean as its forecasting input has already lost the argument with the bill.

The Pilot Was Structurally Incapable of Showing You the Tail

A fifty-user pilot does not have a heavy tail. It cannot. The shape of token consumption in production is a distribution whose 99.5th percentile sits orders of magnitude above its median, and at fifty users the expected number of users beyond that percentile is a quarter of one person. You need thousands of samples before the tail reliably appears in the data, and tens of thousands before you can measure its shape. Pilot users are not the long tail. They are friends-of-the-team, design partners, beta testers nudged into the funnel by a product manager, and the engineer's spouse. They behave the way pilot users always behave: politely, briefly, and within the demo's happy path.
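The sample-size argument can be checked with a few lines of arithmetic. The sketch below assumes, generously, that pilot users are independent draws from the production distribution; the function names are illustrative and not from any library.

```python
def prob_sample_sees_tail(n_users: int, tail_fraction: float) -> float:
    """Probability that at least one of n_users lands in the top tail_fraction.

    Assumes users are independent draws from the production distribution,
    which flatters the pilot: real pilot users are recruited, not sampled.
    """
    return 1.0 - (1.0 - tail_fraction) ** n_users


def expected_tail_users(n_users: int, tail_fraction: float) -> float:
    """Expected number of users beyond the given tail boundary."""
    return n_users * tail_fraction


TAIL = 0.005  # users above the 99.5th percentile

for n in (50, 1_000, 10_000):
    p = prob_sample_sees_tail(n, TAIL)
    e = expected_tail_users(n, TAIL)
    print(f"{n:>6} users: P(at least one tail user) = {p:.3f}, expected tail users = {e:.2f}")
```

Under these assumptions a fifty-user pilot has roughly a 22 percent chance of containing even one p99.5 user, while ten thousand users are expected to contain about fifty of them, which is closer to what it takes to see the tail's shape rather than a single outlier.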

Production is different in three ways the pilot can never simulate. First, the volume is large enough that the tail exists at all. Second, the acquisition mix has changed — the people arriving now read about you on social media, watched a YouTube demo, or got the link forwarded from a colleague who said "try this for the giant document you have to deal with." That is a self-selection mechanism that loads the input with users whose first interaction is the hardest task they have. Third, no one is watching them. Pilot users perform for the team that recruited them. Production users do whatever produces the result they want, which often means pasting an entire PDF directly into the prompt because that is the path of least resistance.

The team that priced the feature on pilot averages has not made a forecasting error in the technical sense. The pilot data was accurate. It was sampling the wrong distribution.

The Distribution Is Heavy-Tailed, Not Normal

Token consumption per user follows a power-law shape, not a bell curve. A small fraction of users consumes most of the tokens. Practitioners report ratios where the top one percent of users account for thirty to fifty percent of total token spend, and the gap between a median request and a p99.5 request can run two to three orders of magnitude. The same skew holds at the request level: a fifty-token prompt and a ten-thousand-token prompt both count as one request to the same endpoint, but the second consumes two hundred times the input tokens, and the compute cost scales accordingly.
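Measuring this concentration on your own usage logs takes only a few lines. The sketch below is a minimal illustration: `top_share` is a hypothetical helper, and the sample data is invented so the arithmetic is easy to follow, not drawn from any real product.

```python
def top_share(tokens_per_user: list[int], top_fraction: float) -> float:
    """Share of total tokens consumed by the heaviest top_fraction of users."""
    ranked = sorted(tokens_per_user, reverse=True)
    # Always include at least one user so small samples still return a share.
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Invented sample: 99 users at 1,000 tokens each, one user at 99,000 tokens.
sample = [1_000] * 99 + [99_000]

share = top_share(sample, 0.01)
print(f"top 1% of users consume {share:.0%} of tokens")
```

In this constructed sample the single heaviest user consumes as much as the other ninety-nine combined, so the top one percent holds half of the spend. Running the same function over real per-user totals shows whether your product sits inside the thirty-to-fifty percent range that practitioners report.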

Heavy tails change what a "cost per user" KPI actually means. If your team reports the average, you are reporting a number that no real user produces. The median user costs you a fraction of the mean. The mean is being pulled up by the tail, and the variance behind that mean is doing the actual financial damage. Every month, finance is surprised by the gap between forecast and bill not because the average drifted, but because the variance the average was hiding is now showing up on the invoice.

The right operating frame is to model the consumption as a distribution and to make decisions on quantiles, not points. What does your p50 user cost? Your p90? Your p99? The cost forecast should be a curve, and the gross margin question is a different question at each point along the curve.
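A minimal sketch of that curve, assuming a flat subscription and a single blended token price. Both constants and the usage figures are hypothetical placeholders, and `percentile` uses the simple nearest-rank definition rather than any interpolation scheme.

```python
import math

PRICE_PER_1K_TOKENS = 0.01  # hypothetical blended rate in USD
FLAT_PRICE = 20.0           # hypothetical monthly subscription in USD


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ranked = sorted(values)
    rank = max(1, math.ceil(pct * len(ranked) / 100))
    return ranked[rank - 1]


def cost_curve(monthly_tokens: list[int], pcts=(50, 90, 99)) -> dict:
    """Model cost and gross margin at each quantile of per-user monthly token usage."""
    curve = {}
    for pct in pcts:
        tokens = percentile(monthly_tokens, pct)
        cost = tokens / 1000 * PRICE_PER_1K_TOKENS
        curve[pct] = {
            "tokens": tokens,
            "cost": cost,
            "margin": (FLAT_PRICE - cost) / FLAT_PRICE,
        }
    return curve


# Invented per-user monthly token totals for ten users.
usage = [100_000, 120_000, 150_000, 200_000, 250_000,
         400_000, 600_000, 900_000, 1_500_000, 12_000_000]

for pct, row in cost_curve(usage).items():
    print(f"p{pct}: {row['tokens']:>10,} tokens  cost ${row['cost']:.2f}  margin {row['margin']:.0%}")
```

With these made-up numbers the p50 user is comfortably profitable, the p90 user is marginal, and the p99 user costs several times the subscription price. The point of the exercise is that a single "cost per user" figure cannot describe all three.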

The Power Users Are Also the Highest-LTV Users

Here is the part that turns the unit economics from a math problem into a strategy problem. The users who are blowing through tokens are not adversarial. They are not bots. They are not random. They are the users who pasted the giant document because the product helped them with something that genuinely mattered to them. They are tweeting about your workflow. They are recruiting more users like themselves.

Capping their usage is a churn risk. Not capping them is a margin risk. The flat-rate pricing you launched with quietly subsidizes the power user by taxing the median user, and that subsidy is invisible to both — the power user thinks the product is fairly priced because the bill stays the same regardless of usage, and the median user thinks they are getting fair value because they have no idea what the tail is costing you. You are running a redistribution scheme on your own customer base, paying for it out of margin.
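The size of that redistribution can be made explicit. The sketch below splits users into those who cost less than the flat price and those who cost more; `subsidy_report` is a hypothetical helper, and the per-user costs are invented for illustration.

```python
def subsidy_report(monthly_costs: list[float], flat_price: float) -> dict:
    """Split per-user contribution (price minus model cost) into surplus and deficit."""
    contributions = [flat_price - cost for cost in monthly_costs]
    surplus = sum(c for c in contributions if c > 0)   # earned on light users
    deficit = -sum(c for c in contributions if c < 0)  # spent on heavy users
    return {
        "surplus": surplus,
        "deficit": deficit,
        "net": surplus - deficit,
        "subsidized_users": sum(1 for c in contributions if c < 0),
    }


# Invented monthly model cost per user in USD, on a hypothetical $20 flat plan.
costs = [2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 8.0, 15.0, 60.0, 120.0]

report = subsidy_report(costs, flat_price=20.0)
print(report)
```

In this example eight light users generate $118 of surplus, two heavy users consume $140, and the plan loses $22 overall. Neither group sees any of this on their own invoice.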

The instinct to fix this with usage caps is correct in principle and almost always wrong in execution. A flat cap at the p99 boundary destroys the people who love the product most. A cap at p50 hides the product's most compelling demo. The realistic move is to convert usage into pricing — tier the product so that power users pay for the value they extract, and surface the consumption to the customer as a feature rather than a billing surprise.
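One way to picture the tiering move is a small lookup that maps observed usage to a plan and produces the usage line a customer would see. Everything here is an assumption made for illustration: the tier names, prices, and token allowances are hypothetical, and a real pricing design would involve far more than a threshold table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    monthly_price: float
    included_tokens: int


# Hypothetical tiers, ordered from smallest allowance to largest.
TIERS = [
    Tier("starter", 20.0, 1_000_000),
    Tier("pro", 60.0, 5_000_000),
    Tier("scale", 200.0, 25_000_000),
]


def recommend_tier(monthly_tokens: int) -> Tier:
    """Return the smallest tier whose allowance covers the observed usage."""
    for tier in TIERS:
        if monthly_tokens <= tier.included_tokens:
            return tier
    return TIERS[-1]  # heaviest users land on the top tier


def usage_summary(monthly_tokens: int, tier: Tier) -> str:
    """Customer-facing usage line, shown in the product rather than only on the bill."""
    used = monthly_tokens / tier.included_tokens
    return (f"{monthly_tokens:,} of {tier.included_tokens:,} tokens used "
            f"({used:.0%}) on {tier.name}")


for tokens in (250_000, 2_500_000, 12_000_000):
    tier = recommend_tier(tokens)
    print(usage_summary(tokens, tier))
```

The design choice worth noting is that the heaviest users are routed to a plan rather than blocked, and the same function that drives billing also drives what the customer sees, so consumption reads as a feature instead of a surprise.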
