Why You Can't Budget an AI Feature With a Single Number

May 17, 2026 · 9 min read

Software Engineer

Finance asks one question about every feature you ship: "What does it cost per user?" For a traditional feature, the answer is a number. A page render, a database query, a push notification — each has a marginal cost that barely moves from one request to the next. You measure it once, multiply by your user count, and the forecast holds.

An AI feature breaks that contract. Ask "what does this agent cost per request" and the honest answer is not a number but a histogram. The same agent that resolves one ticket for two cents will burn four dollars on the next one, because that user asked a vague question, the agent looped through eleven tool calls, and each call dragged the entire growing conversation back through the model. The mean of those two requests, about two dollars, describes neither of them, and it definitely doesn't describe the bill.

That is the trap. When you hand finance a single average cost, you are not simplifying a messy reality. You are reporting a number that is wrong in a specific, expensive direction.

The cost per request is a fat-tailed distribution

A normal feature's per-request cost clusters tightly around its mean. If you plotted it, you'd get a narrow spike. The average is a faithful summary because almost every request actually costs about the average.

An agentic feature does not behave that way. Its cost is governed by loop depth, and loop depth is non-deterministic. An agent reasons, calls a tool, observes the result, and reasons again, and every step is a token-billed model call against the full accumulated context. A naive agent loop therefore compounds cost at roughly O(N²) in the number of turns N, because the API charges for the entire conversation history on every turn. A short interaction touches the model a few times. A long one (an underspecified goal, a retrieval-heavy query, a tool that returns noisy output the agent keeps re-reading) can touch it hundreds of times before it finishes or gives up.
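To make the quadratic growth concrete, here is a minimal sketch. It assumes every turn adds a fixed number of tokens to the history and that each call resends the whole history with no caching or truncation. The function names and the 500-token figure are illustrative, not taken from any particular API.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens billed over a loop that resends the full history.

    Assumption: turn k sends k * tokens_per_turn tokens (everything so far),
    with no prompt caching and no context truncation.
    """
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def closed_form(turns: int, tokens_per_turn: int) -> int:
    # Same quantity as the sum above: t * N * (N + 1) / 2, which grows as N^2.
    return tokens_per_turn * turns * (turns + 1) // 2


if __name__ == "__main__":
    for n in (3, 11, 100):
        total = cumulative_input_tokens(n, 500)
        print(f"{n:>3} turns -> {total:>9,} input tokens billed")
```

Under these assumptions, going from 3 turns to 100 turns multiplies the turn count by about 33 but the billed input tokens by more than 800.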

Plot that and you don't get a spike. You get a distribution with a long, heavy right tail: most requests cheap, a thin band of requests staggeringly expensive. The expensive band is small in count and dominant in spend. In multi-tenant LLM products, practitioners consistently report the same shape — a few percent of tenants consuming a majority of tokens. One customer quietly burning the lion's share of your token budget is not an edge case to plan around. It is the default outcome unless you instrument against it.

When a distribution has a heavy tail, the mean stops being a measure of the typical case and becomes a measure of the tail. It gets dragged upward by the rare expensive requests, so it overstates what an ordinary user costs. At the same time it hides the tail completely, because "two dollars average" gives finance no hint that some requests cost forty times that. The single number is wrong twice, in opposite directions, at once.
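A small sketch of that double failure, using a made-up sample rather than real billing data: 97 requests at two cents and 3 at four dollars.

```python
from statistics import mean, median

# Hypothetical sample: 97 cheap requests at $0.02 and 3 expensive ones at $4.00.
costs = [0.02] * 97 + [4.00] * 3

avg = mean(costs)
mid = median(costs)

print(f"mean   = ${avg:.4f}")
print(f"median = ${mid:.4f}")
# How far the mean overstates the typical request.
print(f"mean / median = {avg / mid:.1f}x")
# How far the mean understates the most expensive request.
print(f"max / mean    = {max(costs) / avg:.1f}x")
```

In this sample the mean lands at roughly seven times the median, while the costliest request is still nearly thirty times the mean. Neither fact is visible from the average alone.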

Why averaging breaks capacity planning and pricing

The damage from a misleading average isn't abstract. It lands in three concrete places.

Forecasting. A budget built on mean unit cost assumes next quarter looks like a scaled copy of this quarter. But the tail doesn't scale linearly with users — it scales with whatever shifts the mix of requests. Ship a feature that invites open-ended questions, onboard one enterprise customer with gnarly data, or change a prompt so the agent retries more, and the tail thickens while your average barely flinches until the invoice arrives. This is why surveys keep finding that the large majority of enterprises miss their AI infrastructure forecasts — and miss them badly, not by a rounding error. They forecasted a point; they got a distribution.

Pricing. Flat per-seat pricing on top of a fat-tailed cost is a structural bet that your cheap users will always outnumber and out-fund your expensive ones. The tail decides whether that bet holds. If 3% of accounts drive 60% of token spend, your real question is not "what's the average margin" but "what's the margin on the 97th-percentile user, and does the rest of the book cover them." An average cost can't answer that. It mixes the subsidizers and the subsidized into one figure that describes no actual customer and quietly hides a margin hole.
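The subsidy question can be sketched in a few lines. Everything here is hypothetical: the `Segment` type, the $30 seat price, and the per-account costs, which were chosen so that 3% of accounts drive 60% of token spend.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    accounts: int
    monthly_cost_per_account: float  # token spend in dollars (hypothetical)


SEAT_PRICE = 30.0  # hypothetical flat monthly price per account

# Hypothetical book of business: 3% of accounts, 60% of token spend.
segments = [
    Segment("typical", accounts=970, monthly_cost_per_account=8.0),
    Segment("heavy", accounts=30, monthly_cost_per_account=388.0),
]

total_accounts = sum(s.accounts for s in segments)
total_cost = sum(s.accounts * s.monthly_cost_per_account for s in segments)
heavy = segments[1]
heavy_share = heavy.accounts * heavy.monthly_cost_per_account / total_cost

# The blended view: one figure that describes no actual customer.
blended_margin = SEAT_PRICE - total_cost / total_accounts
print(f"heavy share of spend: {heavy_share:.0%}")
print(f"blended margin per account: ${blended_margin:.2f}")

# The per-segment view: who subsidizes whom.
for s in segments:
    margin = SEAT_PRICE - s.monthly_cost_per_account
    print(f"{s.name:>8}: margin per account ${margin:.2f}")
```

With these numbers the blended margin is positive, yet every heavy account loses hundreds of dollars a month. The book only works while the typical accounts keep covering them.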

Capacity and gross margin. Infrastructure has to absorb peak concurrent load and the tail of long-running sessions, not the average. Provision against the mean and you are under-provisioned exactly when several heavy sessions overlap. Gross margin behaves the same way: it is healthy on the median request and underwater on the tail, and only percentile-level visibility tells you whether the blend nets out.

What to report instead: p50, p99, and the tail that sets the bill

Stop reporting a number. Report a shape. Three figures carry most of the signal:

p50 (median). What a genuinely typical request costs. This is the number that should drive per-seat pricing and the "is this feature healthy" gut check — and it is almost always lower than the average, because the tail pulls the mean up and away from the middle.
p99. What your expensive-but-not-pathological requests cost. This is the number capacity planning and worst-case budgeting should use. The gap between p50 and p99 is the single best one-glance measure of how unpredictable the feature is.
Total tail spend. The share of the entire bill coming from the top 1% of requests. If p99 is 30× p50 and the top 1% is 40% of spend, you don't have a cost problem — you have a tail problem, and that tells you exactly where to aim.
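As a rough sketch, the three figures above can be computed from a list of per-request costs. The nearest-rank percentile used here is one of several common definitions and is an assumption, as are the function names and the sample data.

```python
import math


def percentile(sorted_costs: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of
    the data at or below it. Expects an ascending, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_costs) / 100))
    return sorted_costs[rank - 1]


def cost_shape(costs: list[float], tail_percent: float = 1.0) -> dict[str, float]:
    """Summarize per-request costs as p50, p99, and top-tail share of spend."""
    ordered = sorted(costs)
    tail_count = max(1, math.ceil(tail_percent * len(ordered) / 100))
    p50 = percentile(ordered, 50)
    p99 = percentile(ordered, 99)
    return {
        "p50": p50,
        "p99": p99,
        "p99_over_p50": p99 / p50,
        "tail_share": sum(ordered[-tail_count:]) / sum(ordered),
    }


# Hypothetical sample of 200 requests, in dollars.
sample = [0.02] * 196 + [0.60] * 2 + [4.00] * 2
shape = cost_shape(sample)
for name, value in shape.items():
    print(f"{name:>13}: {value:.4f}")
```

For this sample p99 is about 30 times p50, and the top 1% of requests (two out of 200) account for roughly 61% of spend. In practice the input would come from per-request cost logs, which this sketch assumes you already collect.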

About Tian Pan
