Per-Customer Cost Concentration: Why AI Cost Dashboards Hide the Power Law
Your AI feature's cost is a distribution, not a number. The dashboard hanging on the wall of the eng-finance war room says $187,000 last month, broken out by feature, by model, and by region. None of those views answers the question the CFO is actually about to ask: "Who is paying us $40 a month and costing us $4,000?" When you sort by customer_id instead of by feature, the line that was a comfortable bar chart becomes a hockey stick, and the team that designed against the average customer discovers it has been quietly underwriting the top of the tail for a quarter.
The pattern is so consistent it deserves to be called a law. Across production LLM workloads, the top 1% of users routinely drive 30–50% of token spend, with similar shapes showing up at the top 0.1% and the top 0.01%. This isn't a quirk of any one product — it's what happens when you ship a feature whose marginal cost is variable and whose pricing is flat. Average-user margins look fine. Median-user margins look great. The integral over the heavy tail is where the quarter goes.
Why average-cost dashboards lie
Most AI cost dashboards are built by people who used to monitor traditional SaaS infrastructure, where per-tenant compute cost was rounded to zero. Those dashboards aggregate by the dimensions that matter when COGS is dominated by fixed cost: feature, model, region, environment. The implicit assumption is that user-level cost variance is small enough to ignore — that what matters is the average user's cost times the number of users.
That assumption holds when marginal cost is near zero. It collapses when marginal cost is $0.30 per agent invocation and usage across users spans three orders of magnitude. The user who runs the summarizer once a week and the user who has wired the summarizer into a cron job that fires every 90 seconds both show up as "1 active user" in your DAU chart. They are not the same customer to your gross margin.
The dashboard tells the truth at the level of aggregates and lies at the level of decisions. "Spend grew 8% month over month, in line with active users" is technically accurate and operationally useless if 60% of the delta came from forty accounts. Aggregating before you've checked the distribution is a way of pretending the variance doesn't exist.
The shape of the curve
When you actually plot token spend per customer on a log-log axis, the line is almost always straight. Power law. The same shape you see in city populations, file sizes, and Wikipedia edits — and for the same structural reason: a multiplicative process where the rich get richer. The customer who automates one workflow against your API tends to automate three more. The team that found the agent loop useful builds a second agent on top of it.
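To see why compounding alone produces this shape, here is a toy simulation, illustrative only and not fitted to any real workload: every customer starts at identical spend and takes an independent multiplicative growth shock each month. Strictly speaking this yields a lognormal rather than a pure power law, but the concentration signature is the same, and the rich-get-richer feedback described above only fattens the tail further.

```python
import random

random.seed(0)

# Toy multiplicative process: 10,000 customers start at the same
# monthly spend; each month every customer's usage is scaled by a
# random growth factor. Compounding multiplicative shocks produces
# a heavy right tail on their own.
spend = [1.0] * 10_000
for _ in range(24):  # 24 months of compounding
    spend = [s * random.lognormvariate(0.0, 0.5) for s in spend]

spend.sort(reverse=True)
top_1pct_share = sum(spend[:100]) / sum(spend)  # top 100 of 10,000
print(f"top 1% share after 24 months: {top_1pct_share:.0%}")  # ~50%
```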
A few rules of thumb that hold up across production workloads:
- Top 1% drives 30–50% of spend. This is the canonical heavy-tail signature. If your top 1% is closer to 10%, your usage is unusually uniform, which usually means the feature is too constrained for power users to express themselves yet.
- Top 0.1% drives 10–25% of spend. A single account at this level often spends more than the bottom 50% combined.
- The tail extends two orders of magnitude further than you think. Whatever your worst customer looks like in May, by August someone will be doing 10× that, and you will have no idea they exist until the bill arrives.
The architectural implication is that you can't reason about cost using means and standard deviations. The distribution has no well-defined standard deviation in the regime you care about — the variance is dominated by tail events. The right summary statistics are the top-N quantiles and the integral above each quantile.
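As a concrete starting point, a minimal sketch of those summary statistics, assuming per-customer spend has already been attributed; the function name and input shape here are illustrative:

```python
def concentration_report(spend_by_customer: dict[str, float],
                         tops: tuple[float, ...] = (0.01, 0.001, 0.0001),
                         ) -> dict[float, float]:
    """Share of total spend held by the top-q fraction of customers.

    These top-N shares (plus the absolute dollars above each cut) are
    the statistics that survive a heavy tail; means and standard
    deviations do not.
    """
    spends = sorted(spend_by_customer.values(), reverse=True)
    total = sum(spends)
    shares = {}
    for q in tops:
        k = max(1, int(len(spends) * q))  # at least one customer per cut
        shares[q] = sum(spends[:k]) / total
    return shares

# e.g. {0.01: 0.42, ...} reads as "top 1% of customers drive 42% of spend"
print(concentration_report({"acme": 4200.0, "globex": 310.0, "initech": 12.5}))
```

Track these shares month over month; the trend in the top-0.1% share is the early warning that the aggregate line will never give you.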
How the eng team finds out at quarter-close
The failure mode looks the same every time. Finance runs the books at month-end. The aggregate AI line is up more than the user base. The CFO emails the VP of Engineering. The VP forwards it to the AI platform lead. The platform lead opens the dashboard, sees the per-feature breakdown is roughly proportional to last month, and replies "model prices moved, we'll re-benchmark." A week later, finance circles back: it's not the unit price, it's the volume, and the volume is concentrated in a way the per-feature breakdown can't see.
By the time someone instruments per-customer attribution and re-runs the report, the quarter has closed and the actual question — "which customer behaviors drove the concentration, and what changed about them this month?" — is unanswerable. The logs have rolled, the agent traces have been sampled away, and the only thing the team can recover is the aggregate. They guess. The guesses become the architecture brief for next quarter.
The deeper version of this story made the rounds in late 2025: four agents in a market-research pipeline ping-ponged for eleven days and cost $47,000 before anyone noticed, because the team's cost monitoring fired alerts but had no mechanism to terminate a session. Alerts are a postmortem; they are not a guardrail. The same team had per-feature cost dashboards. The cost was visible. What was missing was loop closure between observation and enforcement, at the per-principal level, with latency short enough that a thousand-dollars-an-hour burn gets caught in minutes.
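What loop closure could look like, as a hedged sketch: a spend guard that sits synchronously in the request path and terminates the session rather than paging a human when a principal crosses a rolling budget. The class and exception names below are illustrative, not any vendor's API.

```python
import time
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised to terminate the session, not just to page a human."""


class SpendGuard:
    """Per-principal spend ceiling over a rolling hour, checked in-path.

    Illustrative sketch: because the check runs on every call, a runaway
    loop is stopped within one request of crossing its budget instead of
    surfacing at month-end.
    """

    def __init__(self, hourly_budget_usd: float):
        self.hourly_budget_usd = hourly_budget_usd
        self._events: dict[str, list[tuple[float, float]]] = defaultdict(list)

    def charge(self, principal_id: str, cost_usd: float) -> None:
        now = time.monotonic()
        # Keep only the last hour of charges for this principal.
        recent = [(t, c) for t, c in self._events[principal_id]
                  if now - t < 3600.0]
        recent.append((now, cost_usd))
        self._events[principal_id] = recent
        if sum(c for _, c in recent) > self.hourly_budget_usd:
            raise BudgetExceeded(
                f"{principal_id}: rolling-hour spend exceeded "
                f"${self.hourly_budget_usd:.2f}; terminating session")
```

The gateway calls charge() after pricing each response and lets BudgetExceeded propagate to whatever owns the agent session; checking an estimated cost before forwarding the call is the natural refinement.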
The instrumentation you actually need
There's a small set of primitives that, once in place, make this class of problem tractable. The list isn't long, but every item on it has to be load-bearing from request inception — bolting them on later means re-instrumenting every call site, and the call sites multiply faster than the bolt-on work.
- Per-customer attribution from request inception, not month-end aggregation. Every LLM call carries a customer_id (and for multi-tenant accounts, a principal_id distinct from the account so you can find the one user inside Acme Corp who is responsible). The attribution is captured in the same trace as the call, not derived later by joining heterogeneous logs. The right place to attach it is the gateway: a single front door for all LLM calls is also a single checkpoint for tagging, as sketched below.
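A minimal sketch of gateway-side tagging, under the assumption of an OpenAI-style usage object carrying prompt and completion token counts; every type and field name below is illustrative:

```python
import time
from dataclasses import dataclass


@dataclass
class Usage:
    # Stand-in for the token counts a provider returns with each response.
    prompt_tokens: int
    completion_tokens: int


@dataclass
class LLMCallRecord:
    # Attribution lives in the same record as the call itself, never
    # reconstructed later by joining heterogeneous logs.
    customer_id: str    # the paying account, e.g. "acme-corp"
    principal_id: str   # the specific user or agent inside the account
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    ts: float


def record_call(customer_id: str, principal_id: str, model: str,
                usage: Usage, price_per_mtok: tuple[float, float],
                ) -> LLMCallRecord:
    """Tag and price one LLM call at the gateway, the single front door.

    price_per_mtok is (input, output) USD per million tokens; the
    pricing is a per-model lookup in practice.
    """
    in_price, out_price = price_per_mtok
    cost = (usage.prompt_tokens * in_price
            + usage.completion_tokens * out_price) / 1_000_000
    return LLMCallRecord(customer_id, principal_id, model,
                         usage.prompt_tokens, usage.completion_tokens,
                         cost, time.time())
```

Emitting this record in the same trace as the call is what makes the concentration report above a query instead of a re-instrumentation project.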
Sources
- https://dev.to/waxell/the-47000-agent-loop-why-token-budget-alerts-arent-budget-enforcement-389i
- https://www.traceloop.com/blog/from-bills-to-budgets-how-to-track-llm-token-usage-and-cost-per-user
- https://www.thesaascfo.com/your-ai-feature-is-quietly-destroying-your-gross-margin/
- https://www.drivetrain.ai/post/unit-economics-of-ai-saas-companies-cfo-guide-for-managing-token-based-costs-and-margins
- https://sfailabs.com/guides/the-ai-project-gross-margin-reset-every-saas-company-is-about-to-face
- https://www.truefoundry.com/blog/breaking-down-llm-usage-customer-and-user-level-analytics
- https://www.truefoundry.com/blog/rate-limiting-ai-agents-preventing-llm-api-exhaustion
- https://ertyurk.com/posts/circuit-breakers-rate-limits-and-cost-ceilings/
- https://siliconangle.com/2026/04/23/portal26-launches-agentic-token-controls-cap-runaway-ai-agent-spend/
- https://www.cloudzero.com/blog/ai-agent-pricing-models/
