Pricing Your AI Product: Escaping the Compute Cost Trap

· 10 min read
Tian Pan
Software Engineer

There is a company charging £50 per month per user. Their AI feature consumes £30 in API fees. That leaves £20 to cover hosting, support, and profit — before accounting for a single refund or churned seat. They built a product users love, grew to thousands of subscribers, and unknowingly constructed a business where more customers means more losses.

This is not a cautionary tale about a bad idea. It is a cautionary tale about a pricing architecture imported from a world where the marginal cost of serving the next user was effectively zero. That world no longer fully applies when your product calls a language model.

Traditional SaaS gross margins run 70–90%. AI-forward companies are reporting 50–60% — and the gap is mostly explained by one line item: inference. When tokens are 20–40% of your cost of goods sold, the standard SaaS playbook inverts.

Why Flat-Rate Pricing Breaks Under Token Pressure

The economics of traditional software are beautiful in their simplicity. Once you pay for the servers, each additional user costs nearly nothing. Pricing in that world is a question of willingness to pay and competitive positioning — cost is a rounding error.

LLM-powered features do not have this property. Every query triggers a real API call. A user who asks 400 questions a month costs twice as much to serve as a user who asks 200 — and that ratio does not compress as you scale. It compounds.

Consider a product with $20 monthly ARPU. A light user consuming 200K tokens per month at mid-range model pricing ($1 per million tokens) costs $0.20 in tokens. That is manageable — a rounding error at a 92% gross margin. But a power user at 2 million tokens per month on a premium model ($5 per million tokens) costs $10.00 — half of the monthly revenue, before a single server is provisioned or support ticket answered.
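The per-user math above can be sketched in a few lines. This is a minimal illustration using the article's own numbers; the function names are illustrative, not from any real billing system.

```python
def token_cost(tokens: int, price_per_million: float) -> float:
    """Inference cost in dollars for one month of usage."""
    return tokens / 1_000_000 * price_per_million

ARPU = 20.00  # monthly revenue per user, from the example above

# Light user: 200K tokens at $1/M; power user: 2M tokens at $5/M.
light = token_cost(200_000, 1.00)
power = token_cost(2_000_000, 5.00)

print(f"light user: ${light:.2f} in tokens ({light / ARPU:.0%} of revenue)")
print(f"power user: ${power:.2f} in tokens ({power / ARPU:.0%} of revenue)")
# light user: $0.20 in tokens (1% of revenue)
# power user: $10.00 in tokens (50% of revenue)
```

The point of writing it down is the ratio: the power user is not marginally more expensive, they consume fifty times the token budget of the light user against identical revenue.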

The dangerous part is that these users often look like your best customers. They engage the most, use every feature, and generate the testimonials that drive growth. They are also, quietly, the users with the worst unit economics.

OpenAI discovered this directly with ChatGPT Pro. Even at $200 per month — the highest consumer AI subscription price on the market — the plan was losing money on a segment of users hitting 20,000+ queries monthly. The price that looked premium was still insufficient when usage was unconstrained.

The Four Pricing Architectures, and When Each Fails

Teams navigating AI pricing tend to converge on one of four patterns, each with a specific failure mode.

Bundled flat fee — AI features are included in existing tiers at no explicit charge. Simplest to ship, fastest to adopt. The failure mode is invisible: if usage spikes, margin compresses silently. You won't see it in acquisition metrics. You'll see it in the quarterly finance review when gross margin is ten points lower than forecast.

Tiered usage caps — each tier includes an allowance (1M tokens per month, 50 chat messages, 2,000 code completions) and power users hit a ceiling or upgrade. GitHub Copilot Free offers exactly this: 2,000 completions and 50 chat messages, then stops. This is the most widely deployed approach because it segments users economically — the 5% who drive 80% of token spend are also likely the 5% most willing to pay for more. The failure mode is churn at the ceiling. If your cap is set too aggressively, you frustrate the power users who advocate for you most loudly.
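A tiered-allowance gate like the one described is a small amount of code. The tier names and limits below are illustrative, loosely modeled on the Copilot Free numbers quoted above.

```python
# Monthly allowances per tier; None means uncapped.
TIERS = {
    "free": {"completions": 2_000, "chat_messages": 50},
    "pro":  {"completions": None, "chat_messages": None},
}

def allow_request(tier: str, metric: str, used_this_month: int) -> bool:
    """True if one more request fits within the tier's monthly allowance."""
    cap = TIERS[tier][metric]
    return cap is None or used_this_month < cap

print(allow_request("free", "chat_messages", 49))  # True: one message left
print(allow_request("free", "chat_messages", 50))  # False: ceiling hit
```

The economically interesting part is not the gate itself but where the caps sit — which is a pricing decision, not an engineering one.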

Metered overage — a base subscription covers a token budget, and users pay a per-unit rate beyond that threshold. Example: a $1,000/month platform fee includes 1M tokens, then $2 per 100K thereafter. The failure mode is surprise billing. Users who don't monitor their usage receive unexpectedly large invoices and churn or dispute them. This architecture requires robust usage dashboards and proactive alerts to work.
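The overage arithmetic from that example, as a sketch. Defaults mirror the numbers above and are assumptions, not a recommended price point.

```python
def monthly_bill(tokens_used: int,
                 base_fee: float = 1_000.0,
                 included_tokens: int = 1_000_000,
                 overage_per_100k: float = 2.0) -> float:
    """Base subscription plus metered overage beyond the included budget."""
    overage = max(0, tokens_used - included_tokens)
    return base_fee + (overage / 100_000) * overage_per_100k

print(monthly_bill(800_000))    # 1000.0 — within the included budget
print(monthly_bill(1_500_000))  # 1010.0 — 500K tokens of overage at $2/100K
```

Note how gentle the overage curve is at these rates — the surprise-billing failure mode comes from users who are orders of magnitude over budget, which is exactly the tail that alerts need to catch.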

Outcome-based pricing — charge for resolved tickets, completed documents, closed deals, or other downstream outcomes rather than for token consumption. The failure mode is attribution complexity: you need a clear, defensible definition of what counts as an outcome, and customers will find edge cases.

None of these architectures is universally correct. Fifty-six percent of AI SaaS companies now use hybrid models that blend subscription predictability with some form of usage signal.

Margin-Defense Patterns That Actually Work

The goal is not to punish power users — they are your best advocates. The goal is to avoid cross-subsidizing them with revenue from users who barely touch the AI feature.

Model routing is the most underrated lever. Not every query needs a frontier model. A company routing 80% of queries to DeepSeek ($0.55 per million tokens) and reserving Claude Opus for genuinely complex tasks ($15 per million tokens) operates at a very different cost floor than a company calling the most capable model for every autocomplete. The cost differential is real: a 500-word GPT-4 response costs roughly $0.084; an equivalent Llama 2 response costs approximately $0.0007 — a 120x difference. Teams that build model selection into their architecture rather than hardcoding a single model have a structural margin advantage.
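A minimal sketch of complexity-based routing. How you score complexity is the hard part and is elided here; the threshold, model names, and prices are assumptions for illustration only.

```python
# Routes ordered cheapest-first: (complexity ceiling, model, $ per million tokens).
ROUTES = [
    (0.7, "deepseek-chat", 0.55),   # handles the routine ~80% of traffic
    (1.0, "claude-opus",  15.00),   # reserved for genuinely hard tasks
]

def route(complexity: float) -> tuple[str, float]:
    """Pick the cheapest model whose ceiling covers the query's complexity."""
    for ceiling, model, price in ROUTES:
        if complexity <= ceiling:
            return model, price
    _, model, price = ROUTES[-1]    # fall back to the frontier model
    return model, price

print(route(0.3))  # ('deepseek-chat', 0.55)
print(route(0.9))  # ('claude-opus', 15.0)
```

The structural point is that the router is a seam: when a cheaper model ships next quarter, you change a table entry, not your product.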

Dual-metric tracking separates what customers see from what you manage internally. Externally, users see "credits," "messages," or "requests." Internally, you track token consumption, compute hours, and per-user COGS. This is not deception — it is abstraction. Customers do not want to reason about tokens any more than they want to reason about database query costs. But you need the granular data to catch the users whose usage economics are broken before the problem shows up in your P&L.
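One way to sketch that separation: a single ledger that exposes credits to the customer dashboard and token COGS to finance. The credit ratio and blended price are illustrative assumptions.

```python
class UsageLedger:
    """Tracks one metric internally, presents two views of it."""
    TOKENS_PER_CREDIT = 10_000   # the external abstraction customers see
    COST_PER_MILLION = 3.00      # assumed blended inference price, internal only

    def __init__(self) -> None:
        self.tokens = 0

    def record(self, tokens: int) -> None:
        self.tokens += tokens

    @property
    def credits_shown(self) -> int:   # customer-facing dashboard number
        return self.tokens // self.TOKENS_PER_CREDIT

    @property
    def cogs(self) -> float:          # finance-facing cost of goods sold
        return self.tokens / 1_000_000 * self.COST_PER_MILLION

ledger = UsageLedger()
ledger.record(250_000)
print(ledger.credits_shown)      # 25
print(f"${ledger.cogs:.2f}")     # $0.75
```

Because both views derive from the same token counter, the abstraction cannot drift out of sync with the economics underneath it.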

Soft limits before hard limits reduce churn at the ceiling. A user who receives a notification at 80% of their monthly allowance has time to upgrade before they hit a wall. A user who suddenly sees a degraded product at 100% churns or escalates to support. Anthropic's approach of targeting rate limits at the top 5% of intensive users — rather than blunt caps — is a version of this: protect the median experience without punishing it.
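The soft-limit-then-hard-limit ladder reduces to a three-state classifier. The 80% threshold matches the example above; treat it as a tunable assumption.

```python
def usage_status(used: int, allowance: int, warn_at: float = 0.80) -> str:
    """Classify usage: 'ok', 'warn' (soft limit crossed), or 'blocked'."""
    ratio = used / allowance
    if ratio >= 1.0:
        return "blocked"
    if ratio >= warn_at:
        return "warn"
    return "ok"

print(usage_status(79, 100))   # ok
print(usage_status(80, 100))   # warn — time to send the upgrade nudge
print(usage_status(100, 100))  # blocked — the state you want users never to see
```

The 'warn' state is where the upgrade conversation happens; if most users who hit 'blocked' never saw 'warn', the alerting is broken, not the cap.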

Fair-use throttles as economic signals work when customers don't see the token math directly. If a user is consuming tokens at a rate that makes their account unprofitable, the appropriate response is not to immediately shut them off. It is to throttle gracefully, observe whether they notice or care, and use that signal to decide whether they belong in a higher tier.

When Value-Based Pricing Becomes the Only Viable Model

The long-run trajectory of LLM inference costs is deflationary — aggressively so. Median inference prices dropped roughly 50x per year between 2023 and 2025. Some benchmarks showed 900x annual declines for specific task categories post-January 2024. Gartner's forecast has inference costs falling more than 90% by 2030 relative to 2025 levels.

This creates a specific danger for usage-based pricing. If you charge $2 per 100K tokens today, and those tokens cost $0.005 to generate in 2028, your pricing has compressed by 95% while your cost of delivering the service dropped by 99%. You retained some margin, but you also handed your customers a 95% price reduction they didn't ask for and may not have even noticed — and you built a business whose revenue is mechanically tied to a commodity whose price falls every quarter.
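A back-of-envelope version of that compression, with illustrative numbers: assume today's $2 per 100K tokens is forced down to $0.10 by 2028 while the underlying generation cost falls from an assumed $0.50 to $0.005.

```python
# $ per 100K tokens; all four figures are illustrative assumptions.
price_today, price_2028 = 2.00, 0.10    # 95% price compression
cost_today,  cost_2028  = 0.50, 0.005   # 99% cost decline

margin_today = (price_today - cost_today) / price_today
margin_2028  = (price_2028 - cost_2028) / price_2028
revenue_change = (price_2028 - price_today) / price_today

print(f"gross margin: {margin_today:.0%} -> {margin_2028:.0%}")
print(f"revenue per 100K tokens: {revenue_change:.0%}")
```

Margin percentage actually improves in this scenario — which is exactly the trap: the percentage looks healthier while absolute revenue per unit has collapsed by 95%.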

Outcome-based pricing breaks this linkage. If you charge $0.99 per resolved support ticket, it does not matter whether the underlying model cost dropped from $0.30 to $0.003 per resolution. Your revenue is tied to the value delivered, not the infrastructure consumed. As costs fall, your margins expand rather than your prices collapsing.

Intercom Fin validated this at scale. Charging $0.99 per resolved customer issue rather than per token or per seat, it grew from $1M to over $100M in ARR in roughly two years while handling 80% of support volume across its customer base. The model became more sustainable as inference costs declined, not less — because the outcome price stayed anchored to the value a resolved ticket represents to the customer, not to the cost of the API call that resolved it.

The threshold for switching to outcome-based pricing is not technical — it's economic. You need a clear, verifiable outcome that customers agree represents value delivered. You need attribution that isn't easily gamed. And you need the confidence that the outcome's value to the customer exceeds your fully-loaded cost of delivering it by a comfortable margin.

When a resolved support ticket is worth $50 to a customer and costs $0.50 to deliver, you can charge $0.99 and still clear a roughly 50% gross margin regardless of which model you use — and every further drop in inference cost widens it. That math survives almost any cost trajectory.

The Practical Decision Framework

Before shipping an AI feature, the questions that determine pricing architecture:

  • Who are the power users, and what does serving them actually cost? Run the math at P95 usage, not median. If P95 economics are profitable, you have a bundled tier. If they are not, you need caps or outcome pricing.
  • Is the value of the outcome measurable and customer-verifiable? If yes, outcome pricing is available to you. If the AI feature improves productivity by an amount nobody can measure, you're left with consumption-based proxies.
  • How fast is your inference cost declining? If you are using a model class where costs are falling 50x per year, anchoring revenue to consumption is anchoring revenue to a falling number. Outcome pricing insulates you.
  • Does your customer segment have predictable usage patterns? If usage variance is low (e.g., a fixed workflow), bundled tiers work. High variance across customers almost always requires usage caps or metering.
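The first check above — run the math at P95, not the median — can be made concrete. The usage distribution, prices, and the 50%-of-ARPU profitability gate below are all illustrative assumptions.

```python
def p95(values: list[int]) -> int:
    """95th-percentile value (nearest-rank, good enough for a gut check)."""
    s = sorted(values)
    return s[min(len(s) - 1, int(0.95 * len(s)))]

# Hypothetical long-tailed usage: 90 light users, 10 heavy users (tokens/month).
monthly_tokens = [150_000] * 90 + [2_500_000] * 10
arpu, price_per_million = 20.00, 5.00

p95_cost = p95(monthly_tokens) / 1_000_000 * price_per_million
print(f"P95 token cost: ${p95_cost:.2f} against ${arpu:.2f} ARPU")
print(p95_cost <= arpu * 0.5)  # False here: caps or outcome pricing needed
```

The median user in this distribution costs $0.75 a month and looks fine; only the P95 view reveals that the bundled tier is underwater.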

The cost-plus reflex — take what tokens cost, multiply by a margin target, arrive at a price — produces the right answer in exactly one scenario: when customers have a direct alternative that commoditizes your offering and forces you to compete on cost. In most AI product contexts, that scenario hasn't arrived yet. Until it does, pricing at your cost floor is leaving margin on the table and building fragility into your unit economics at the same time.

Looking Forward: Agentic AI Changes the Math Again

Agentic workflows consume substantially more tokens per task than single-shot queries. A reasoning loop that validates its own output, calls external tools, and retries on uncertainty may consume 10–50x the tokens of a simple generation. As teams ship more autonomous AI features, the variance in per-user token consumption increases — and the mean shifts upward.

This makes outcome-based pricing more attractive, not less. Charging per agent run, per completed task, or per resolved outcome insulates your pricing from the compound unpredictability of agentic token consumption. It also forces a useful discipline: if you're charging per outcome, you're thinking carefully about what an outcome is, which is exactly the thinking that improves product quality in agentic systems.

The teams that will have sustainable unit economics in 2027 are building the instrumentation and pricing architecture for outcome-based models now, before agentic workflows push per-user token consumption into ranges that break every flat-fee assumption still in production.
