
Token Economics for AI-Powered API Products: Pricing What You Cannot Predict

10 min read
Tian Pan
Software Engineer

A team ships a customer-facing AI assistant. They price it at $49/month per seat, targeting 70% gross margins based on a spreadsheet that assumed "average 500 tokens per query." Three months later, finance flags that their heaviest users are consuming 15,000 tokens per session. The pricing model collapses not because the feature failed, but because the product team priced something they didn't yet understand.

This isn't a failure of forecasting. It's a structural problem: the cost basis of an LLM-powered product is fundamentally unlike anything traditional SaaS pricing was designed to handle. Every API call has unpredictable and material token cost. The inputs vary wildly by user, task, and time of day. The outputs compound in ways that only show up weeks later on your cloud bill. And once you layer in agentic patterns — tool calls, multi-turn reasoning, subagent orchestration — a single user interaction can cost $0.02 or $20 depending on what the model decides to do.

Why Per-Seat and Per-Query Pricing Both Break

Traditional SaaS built its margin model on two assumptions: marginal cost approaches zero at scale, and each user consumes roughly the same amount of the product. Both fail for AI APIs.

Per-seat pricing fails because token consumption is power-law distributed. A $10,000/month enterprise seat can generate anywhere from $100 to $100,000 in actual token costs depending on usage patterns. The mean is useless as a pricing signal because it's driven by the top 5% of users. Cross-subsidy works fine in a gym membership, where the members who never show up cover the ones who live there; in an LLM product, one power user can destroy your margin for an entire cohort.
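
A quick simulation makes the shape of the problem concrete; the Pareto shape and scale below are illustrative assumptions, not measured usage data:

```python
import random

# Simulate monthly token consumption for 10,000 users under a heavy-tailed
# (Pareto) usage distribution. Shape (1.2) and scale (50k) are illustrative.
random.seed(7)
usage = sorted((int(random.paretovariate(1.2) * 50_000) for _ in range(10_000)),
               reverse=True)

mean = sum(usage) / len(usage)
median = usage[len(usage) // 2]
top_5pct_share = sum(usage[: len(usage) // 20]) / sum(usage)

print(f"mean tokens/user:   {mean:>12,.0f}")
print(f"median tokens/user: {median:>12,.0f}")
print(f"top 5% of users consume {top_5pct_share:.0%} of all tokens")
```

Re-run with different seeds and the mean swings with whoever lands in the extreme tail, while the median barely moves; that volatility is what a flat per-seat price quietly absorbs.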

Per-query pricing fails for the inverse reason. Two customers making structurally identical API calls can generate 10x different token costs depending on prompt length, whether the model invokes tools, how many retry loops occur, and whether you're hitting cached or uncached context. Charging identical prices for queries whose cost to you differs by an order of magnitude is either a customer-acquisition trap (subsidizing expensive users) or a market-exit trap (overcharging cheap ones).

The deeper issue is that neither model was designed to accommodate variable cost paths — situations where the cost of serving a request isn't knowable until after the request completes.

The O(N²) Problem Inside Agent Loops

If per-seat and per-query pricing fail for simple chatbots, they catastrophically fail for agentic products where an agent orchestrates multiple steps, tools, and model calls per user request.

The reason is context compounding. Transformer models pay for the entire context window on every inference call, not just the new tokens being generated. Each turn in a multi-turn agent loop accumulates all prior context as input:

  • Turn 1: 4,000-token system prompt + 500 tokens of user input = 4,500 tokens billed
  • Turn 2: 4,000 system prompt + 500 input + 800 tokens of prior output = 5,300 tokens billed
  • Turn 10: system prompt + input + nine turns of accumulated output and tool results ≈ 25,000+ tokens billed

Total input cost follows an n(n+1)/2 growth curve rather than scaling linearly with turn count: in the limit where accumulated context dwarfs the fixed prompt, a 10-turn agent run costs roughly 55x a single query, not 10x. Teams that model agentic cost as "turns × average cost per turn" routinely underestimate by 3-5x before the first production invoice arrives.
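
A minimal sketch of that curve, reusing the prompt and request sizes from the list above; the assumption that each turn adds roughly 2,300 tokens of new context (output plus tool results) is mine, chosen so turn 10 lands near the ~25,000-token figure:

```python
SYSTEM_PROMPT = 4_000      # tokens, re-billed as input on every turn
USER_INPUT = 500           # the original request, carried in every turn
CONTEXT_PER_TURN = 2_300   # assumed output + tool-result tokens added per turn

def billed_input(turn: int) -> int:
    """Input tokens billed on one turn: the fixed prompt plus everything accumulated so far."""
    return SYSTEM_PROMPT + USER_INPUT + (turn - 1) * CONTEXT_PER_TURN

def total_billed(turns: int) -> int:
    """Cumulative input tokens across an n-turn run; grows quadratically, not linearly."""
    return sum(billed_input(t) for t in range(1, turns + 1))

for n in (1, 5, 10, 20):
    print(f"{n:>2} turns: {billed_input(n):>6,} billed on the final turn, "
          f"{total_billed(n):>7,} total ({total_billed(n) / billed_input(1):.0f}x one query)")
```

With these numbers, a 10-turn run bills about 3.3x what a naive "turns × first-turn cost" estimate predicts, which is exactly the underestimate range above.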

An agent with that same 4,000-token system prompt running 20 turns pays for roughly 80,000 input tokens of system-prompt repetition alone. At frontier input prices, repeated across a realistic monthly volume of runs, that is hundreds of dollars spent on prompts before the model does any of the actual work.

This makes the architecture of your agent as important as your pricing model. A product that lets agents run unbounded tool call loops or accumulates unlimited context is a product that will have unbounded and unpredictable cost.
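
One way to make that bound explicit is to cap the loop itself; run_step and count_tokens below are hypothetical stand-ins for your own agent step and tokenizer:

```python
MAX_TURNS = 12            # hard cap on loop length
TOKEN_BUDGET = 150_000    # max billed tokens for a single user request

def run_agent(request: str, run_step, count_tokens):
    """Run an agent loop that stops on completion, turn cap, or token budget."""
    context = [request]
    tokens_spent = 0
    for _ in range(MAX_TURNS):
        step = run_step(context)                          # one model/tool call
        tokens_spent += count_tokens(context) + count_tokens([step["text"]])
        context.append(step["text"])
        if step["done"]:
            return step, tokens_spent
        if tokens_spent >= TOKEN_BUDGET:
            break   # stop before the next call makes cost worse
    # Degrade gracefully: summarize, hand off to a cheaper model, or ask the user.
    return {"done": False, "reason": "turn or token budget exhausted"}, tokens_spent
```

The two caps double as pricing inputs: the worst-case cost of a single request becomes a number you can actually put in the margin spreadsheet.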

What Pricing Structures Actually Work

The industry has converged on several patterns that reflect cost variability better than flat-rate models.

Hybrid pricing (fixed base + usage overage) is now the majority choice. You charge a fixed monthly subscription covering a defined token allocation, then apply overage charges per million tokens beyond that baseline. The fixed component gives customers a predictable floor for budgeting; the overage component means heavy users pay for what they actually consume. The key failure mode to avoid is setting the baseline allocation so high that overages are theoretically possible but practically never triggered — that just recreates the per-seat pricing problem.
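
The billing arithmetic itself is simple; the plan numbers below are illustrative assumptions, not recommendations:

```python
BASE_FEE = 99.00                # fixed monthly subscription (USD)
INCLUDED_TOKENS = 5_000_000     # allocation covered by the base fee
OVERAGE_PER_MILLION = 4.00      # USD per million tokens beyond the allocation

def monthly_invoice(tokens_used: int) -> float:
    """Fixed floor plus usage-based overage."""
    overage = max(0, tokens_used - INCLUDED_TOKENS)
    return BASE_FEE + overage / 1_000_000 * OVERAGE_PER_MILLION

print(monthly_invoice(1_200_000))    # light user pays the floor: 99.00
print(monthly_invoice(23_000_000))   # heavy user pays for consumption: 171.00
```

Setting INCLUDED_TOKENS near the median user's consumption keeps the floor predictable while making sure the overage line actually fires for the heavy tail.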

Credit bundle systems abstract per-token complexity by denominating usage in credits that customers prepay. Credits trade at a published rate against different model and feature tiers. This works well for products that route between multiple models (credits naturally handle that routing without exposing confusing per-model rates), but poorly when customers want to forecast cost because credit-to-token ratios differ across vendors and model generations. If you use credits, publish the token conversion table.
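
If you publish that table, it can be as small as a rate per tier; the tier names and rates below are hypothetical:

```python
# Published conversion table: credits charged per million tokens, by model tier.
CREDITS_PER_MILLION_TOKENS = {
    "fast": 1.0,        # small, cheap model tier
    "standard": 4.0,
    "frontier": 15.0,   # largest model tier
}

def credits_for_call(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Credits consumed by one call, at the published rate for its tier."""
    rate = CREDITS_PER_MILLION_TOKENS[tier]
    return (input_tokens + output_tokens) / 1_000_000 * rate

frontier = credits_for_call("frontier", 80_000, 20_000)
fast = credits_for_call("fast", 80_000, 20_000)
print(f"{frontier:.2f} vs {fast:.2f} credits for the same 100k-token call")
```

The same table is what lets a customer turn last month's token report into next month's credit budget.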
