Pricing AI Features: The Unit Economics Framework Engineering Teams Always Skip
Cursor reportedly hit $150 million in annual revenue doing it. Nearly every dollar customers paid went straight to LLM API providers, with nothing left for engineering, support, or infrastructure overhead. This wasn't a scaling problem; it was a unit economics problem that stayed invisible until it was catastrophic.
Most engineering teams building AI features make the same mistake: they treat inference cost as a minor line item, ship a flat-rate subscription, and assume the economics will work out later. They don't. Variable inference costs don't behave like any other COGS in software, and the pricing architectures that work for traditional SaaS will bleed you dry the moment your heaviest users find your most expensive feature.
This is the framework for getting it right before production, not after the margin crisis.
Why Variable Inference Cost Breaks SaaS Assumptions
Traditional SaaS pricing rests on a simple premise: your marginal cost per additional user is close to zero. Hosting is cheap, bandwidth is cheap, database reads are cheap. You price on value, not cost, and as volume grows, your gross margins expand.
AI inference inverts this. Every API call has a real, variable cost that scales directly with user behavior. A single chat feature can easily run to $10,000 per month in inference costs alone, before you add infrastructure overhead, fallback models, observability tooling, and retry logic.
The cost multiplier from pilot to production consistently surprises teams. A feature that looks cheap in a controlled pilot can cost $3–5 per call in production once you account for error retries, output validation loops, context padding from conversation history, and the observability stack needed to debug it. Teams that price against pilot benchmarks discover this the hard way.
The situation gets more extreme with agentic workflows. A simple single-call inference might cost a fraction of a cent, while an agentic task that loops through tool calls and self-correction can run $0.50–$20. If you priced a $20/month subscription assuming single-call costs, one agentic power user can consume your entire monthly revenue in a few hours.
Building the Per-Workflow Cost Model
The antidote is cost modeling at the workflow level, not the API level. Before shipping any AI feature, you need a cost sheet that answers three questions: what does one activation cost, what does the 90th-percentile activation cost, and what happens when a user runs it 500 times a day?
Start with the four cost axes for every workflow:
Model selection is where the biggest lever lives. Modern LLMs span a 100x price range. A classification task that determines customer intent doesn't need the same model as a complex multi-step reasoning task. Routing simple operations to a budget model (Claude Haiku, GPT-4o mini) and reserving premium models for tasks that genuinely require them can cut average inference cost by 60–80% with negligible quality impact.
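As a sketch, the routing layer can start as a simple lookup from task type to model tier. The model names and per-token prices below are illustrative placeholders, not current list prices, and a real router would also consider context length and quality thresholds:

```python
# Cost-based model routing sketch. Model names and prices are assumptions
# for illustration, not any provider's actual catalog.
PRICES_PER_M_INPUT = {
    "budget-mini": 0.15,    # a Haiku / 4o-mini-class model (assumed price)
    "premium-large": 3.00,  # a frontier model (assumed price)
}

# Well-bounded tasks that rarely need a premium model.
SIMPLE_TASKS = {"classification", "extraction", "routing", "short_summary"}

def pick_model(task_type: str) -> str:
    """Route cheap, well-bounded tasks to the budget tier."""
    return "budget-mini" if task_type in SIMPLE_TASKS else "premium-large"

def call_cost(task_type: str, input_tokens: int) -> float:
    """Input-token cost only, for illustration."""
    model = pick_model(task_type)
    return input_tokens / 1_000_000 * PRICES_PER_M_INPUT[model]
```

With these placeholder prices, routing a classification call to the budget tier is 20x cheaper than sending it to the premium model, which is where the 60–80% average savings comes from when most traffic is simple.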
Token management is the second lever. Input tokens are cheaper than output tokens—typically by a factor of 4–5x. Every token you can eliminate from your prompts without sacrificing quality is a direct cost reduction. Common culprits: bloated system prompts with redundant instructions, unnecessary conversation history padding, and RAG retrievals that pull in far more context than the model actually uses.
Prompt caching is underused and high-return. When your system prompt and injected documents remain constant across many calls, cached tokens cost 10–15% of standard input prices. Teams running large document analysis pipelines have cut LLM costs by 50–90% through caching alone, simply by structuring prompts so the static content appears before the dynamic query.
Batching offers a flat 50% discount from both major API providers for non-real-time workloads. Document processing, data enrichment, background summarization—any task that doesn't need an immediate synchronous response can go through the batch API and immediately halve its cost.
The output of your cost model should be: median cost per workflow activation, 90th percentile cost, and a daily cost cap per user that, if exceeded, signals an anomaly worth investigating.
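Those three outputs can be computed directly from logged token usage per activation. A minimal sketch, assuming placeholder per-million-token prices and a log of (input, output) token counts:

```python
# Minimal per-workflow cost sheet. Prices are illustrative placeholders;
# plug in your provider's actual rates.
import statistics

INPUT_PRICE_PER_M = 3.0    # $/1M input tokens (assumed)
OUTPUT_PRICE_PER_M = 15.0  # $/1M output tokens (assumed)

def activation_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one workflow activation."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

def cost_sheet(samples: list[tuple[int, int]]) -> dict[str, float]:
    """samples: logged (input_tokens, output_tokens) per activation."""
    costs = sorted(activation_cost(i, o) for i, o in samples)
    p90 = costs[min(len(costs) - 1, int(0.9 * len(costs)))]
    return {
        "median": statistics.median(costs),
        "p90": p90,
        # What one 500-activation/day user costs at p90 intensity.
        "daily_cap_500x": 500 * p90,
    }
```

The `daily_cap_500x` figure answers the third question in the framework: if that number exceeds what the user pays in a month, the workflow cannot ship behind a flat rate without caps.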
The Heavy-User Subsidy Problem
Here's the math that breaks flat-rate AI subscriptions:
Assume you offer an AI writing assistant at $20/month. Your user base splits roughly into three groups:
- Light users (80% of customers): 5–10 queries per day, $1–2/month in actual inference costs
- Regular users (18%): 50 queries per day, $15–20/month in inference costs
- Power users (2%): 300–500 queries per day, $100–200/month in inference costs
In a typical AI SaaS usage distribution, the top 20% of users consume 80% of your compute. The top 1–2% can account for 40–50% of total inference costs while paying the same $20/month as everyone else.
Light users don't cross-subsidize power users in traditional SaaS because the marginal cost is negligible. In AI, they subsidize them dollar for dollar. At 1,000 customers: 800 light users generate ~$800–1,600 in monthly inference costs, 180 regular users ~$2,700–3,600, and 20 power users ~$2,000–4,000, for a total of roughly $5,500–9,200. Revenue: $20,000/month. Take the low end, apply a 2x infrastructure multiplier, and COGS lands at ~$11,000, for a gross margin of ~45%. Acceptable, but only if you've modeled it.
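The subsidy arithmetic is worth writing down explicitly. This sketch uses the low ends of the per-segment cost ranges above (all figures illustrative):

```python
# Low-end subsidy arithmetic for 1,000 customers on a $20/month flat plan.
PRICE = 20.0            # $/month subscription
INFRA_MULTIPLIER = 2.0  # true COGS ~ 2x raw inference cost

segments = [
    # (users, inference $/user/month) -- low ends of the ranges above
    (800, 1.0),    # light
    (180, 15.0),   # regular
    (20, 100.0),   # power
]

revenue = sum(users for users, _ in segments) * PRICE      # $20,000
inference = sum(users * cost for users, cost in segments)  # $5,500
cogs = inference * INFRA_MULTIPLIER                        # $11,000
margin = (revenue - cogs) / revenue                        # 0.45
```

Rerun the same arithmetic with power users at 5% of the base instead of 2% and the margin collapse described below falls straight out of the numbers.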
Now consider what happens when your product gains traction and the power user ratio shifts from 2% to 5%. Same subscription price, same feature set—but COGS as a percentage of revenue jumps dramatically. Many teams discover this shift only after margins turn negative.
The fix is identifying power users early and designing your pricing to either capture their value or gate their usage. Track cost-per-user weekly. Flag any account exceeding 2x average inference cost for their tier. If your top 10 users are consuming 50x the median, you have a subsidy problem that will only grow.
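The weekly check described above is a few lines once per-user spend is aggregated, for example from your LLM gateway logs. A minimal sketch (the 2x factor matches the heuristic above; everything else is an assumption):

```python
# Weekly cost-anomaly check over aggregated per-user inference spend.
from statistics import mean

def flag_subsidy_risks(user_costs: dict[str, float],
                       factor: float = 2.0) -> list[str]:
    """Return user IDs whose weekly inference cost exceeds factor x the mean."""
    if not user_costs:
        return []
    threshold = factor * mean(user_costs.values())
    return sorted(u for u, c in user_costs.items() if c > threshold)
```

Flagged accounts are candidates for a tier upsell conversation, not an automatic cutoff; the point is to see the subsidy forming before it compounds.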
Consumption Cap Design: Soft, Medium, Hard
Unlimited AI features are a liability, not a differentiator—unless you've explicitly modeled and priced the cost of unlimited.
The standard pattern for sustainable AI features uses a three-tier threshold system:
Soft cap (80% of budget): Send the user a notification. No service degradation. This creates transparency without friction and surfaces heavy users who might convert to a higher tier voluntarily.
Medium cap (95% of budget): Begin throttling: route to a cheaper model tier, reduce the request rate, or return slightly slower responses. The user can still work, but the economic exposure is controlled. Done transparently, most users accept this gracefully.
Hard cap (100% of budget): Halt new requests. For consumer products, this typically means a paywall or upsell prompt. For enterprise, it triggers a human review before resumption.
Critically, these caps should be enforced at multiple granularity levels simultaneously:
- Monthly budget per subscription tier
- Daily per-feature limit (prevents a single workflow from consuming a month of quota in one day)
- Hourly per-user rate limit (catches runaway scripts and automation loops)
- Per-call output token cap (prevents single requests from generating unbounded responses)
The per-call cap on output tokens is frequently overlooked. An agent set loose without an output limit can generate 50,000 tokens in a single response when given an open-ended task. At a premium output rate on the order of $0.06 per 1K tokens, that's a $3 call, and a user running 100 such calls per day turns your pricing model into fiction.
Enforcement should happen in a centralized gateway that intercepts all inference calls before they hit the provider API, not scattered across individual feature implementations. If each feature enforces its own limits independently, a user can simply use multiple features simultaneously to blow past any individual cap.
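A minimal sketch of such a gateway, combining the three-tier thresholds with per-granularity budgets. Budgets, thresholds, and the in-memory store are illustrative; a production version needs persistent counters and time-window resets, which are omitted here:

```python
# Centralized enforcement gateway sketch: one chokepoint that evaluates
# every budget level before a request reaches the provider API.
from dataclasses import dataclass

SOFT, MEDIUM, HARD = 0.80, 0.95, 1.00  # fractions of budget consumed

@dataclass
class Budgets:
    monthly_usd: float
    daily_usd: float
    hourly_usd: float

@dataclass
class Usage:
    monthly: float = 0.0
    daily: float = 0.0
    hourly: float = 0.0

class InferenceGateway:
    def __init__(self, budgets: Budgets):
        self.budgets = budgets
        self.usage: dict[str, Usage] = {}

    def record(self, user_id: str, cost_usd: float) -> None:
        """Accumulate the realized cost of a completed call."""
        u = self.usage.setdefault(user_id, Usage())
        u.monthly += cost_usd
        u.daily += cost_usd
        u.hourly += cost_usd

    def check(self, user_id: str) -> str:
        """Action for the next call: 'ok', 'notify', 'throttle', or 'block'."""
        u = self.usage.setdefault(user_id, Usage())
        worst = max(
            u.monthly / self.budgets.monthly_usd,
            u.daily / self.budgets.daily_usd,
            u.hourly / self.budgets.hourly_usd,
        )
        if worst >= HARD:
            return "block"     # hard cap: paywall, upsell, or human review
        if worst >= MEDIUM:
            return "throttle"  # medium cap: cheaper model, slower responses
        if worst >= SOFT:
            return "notify"    # soft cap: inform the user, no degradation
        return "ok"
```

Taking the maximum across all windows is what closes the multi-feature loophole: whichever budget is closest to exhaustion governs the next call, regardless of which feature issues it.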
Monetization Architectures That Hold at Scale
Five pricing architectures are emerging as sustainable for AI features. They're not mutually exclusive—the strongest implementations combine elements from multiple approaches.
Hybrid base + usage: A subscription floor that covers light and median use, with overage pricing that kicks in beyond a defined quota. This is the most straightforward model to communicate and implement. The key is setting the quota boundary where median users never hit it (reducing friction) while power users reliably do (creating monetization opportunities).
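The billing logic for this model is deliberately simple, which is part of its appeal. A sketch with placeholder figures (the $20 floor, 500-unit quota, and $0.05 overage rate are all assumptions):

```python
# Hybrid base + usage billing sketch. All figures are illustrative.
def monthly_invoice(units_used: int, base_fee: float = 20.0,
                    included_units: int = 500,
                    overage_per_unit: float = 0.05) -> float:
    """Subscription floor plus metered overage beyond the included quota."""
    overage = max(0, units_used - included_units)
    return base_fee + overage * overage_per_unit
```

A median user at 300 units pays the flat $20 and never sees a meter; a power user at 1,500 units pays $70, which is the monetization opportunity the quota boundary is designed to create.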
Credit systems: Abstract the token cost behind a credit unit. A credit typically represents about $0.01 of inference COGS and is sold at a 50–100% markup. Different features cost different credits per activation based on their actual inference cost. The advantage is that credit prices are more psychologically digestible than token prices, and you can adjust credit costs for individual features without changing the subscription price. The disadvantage is opacity, which can frustrate technical users who want to understand what they're paying for.
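The mapping from feature to credits is just the cost model resurfaced in a friendlier unit. A sketch, where the feature names, per-activation costs, and 75% markup are illustrative assumptions:

```python
# Credit-system sketch: price each feature in credits proportional to its
# measured inference COGS, then sell credits at a markup. All figures are
# illustrative assumptions.
import math

CREDIT_COGS = 0.01   # $ of inference cost one credit represents
MARKUP = 0.75        # sell credits at 75% over COGS (within the 50-100% band)
CREDIT_PRICE = CREDIT_COGS * (1 + MARKUP)  # $0.0175 per credit

FEATURE_COGS = {          # measured $ per activation (hypothetical features)
    "quick_answer": 0.004,
    "doc_summary": 0.06,
    "agent_run": 1.20,
}

def credits_for(feature: str) -> int:
    """Round a feature's cost up to a whole number of credits."""
    ratio = round(FEATURE_COGS[feature] / CREDIT_COGS, 6)  # dodge float noise
    return max(1, math.ceil(ratio))
```

When a feature's inference cost changes, you re-derive its credit price from the new measurement and the subscription price never moves, which is exactly the flexibility the model exists to provide.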
Tiered access with model gating: Lower tiers get access to budget models only; higher tiers unlock premium models. This is elegant because the cost difference between tiers maps closely to the actual cost difference between model tiers, making margin management straightforward. It also creates clear upgrade motivation—if a user wants better quality, they pay more, and you're not cross-subsidizing the quality improvement.
Outcome-based pricing: Charge per successful completion (per document processed, per commit generated, per support ticket resolved) rather than per inference call. This decouples your pricing from token cost entirely, which is ideal when you can measure outcomes reliably. The margin risk shifts from "how much does this call cost?" to "can we deliver outcomes efficiently enough?" Replit's agent pricing moves in this direction, charging by effort: simple runs cost little, while complex runs can land at $3–5.
Time-based soft limits with purchase options: A daily token budget resets on a rolling basis, with one-click overage purchases when users hit the limit. This works well for consumer products where impulse purchases are viable and users have variable daily usage patterns.
The Six Mistakes That Sink AI Feature Economics
Pricing it after shipping it. Once users have adopted a feature at a given price point, changing the economics is brutal. Engineer pricing alongside the feature, not after.
COGS fiction. Inference cost is only part of your true COGS. Monitoring, observability tooling, fallback model infrastructure, retry logic, and the support load from AI-related bugs all contribute. True production COGS is typically 1.5–2.5x raw inference cost.
Single-model dependency. If you've hard-coded a single provider and they raise prices or change their model lineup, your margins change overnight. Multi-model routing from the start lets you shift load to cheaper alternatives without a product change.
Unlimited tiers without circuit breakers. Even "unlimited" tiers need a soft limit that triggers human review. A single user running a query loop can generate $10,000 in inference costs in a day.
Ignoring the agentic cost explosion. Agentic features cost 5–30x more per task than single-call inference. If you price a subscription for chat-style interaction, then ship an autonomous agent that loops, calls tools, and self-corrects, you've effectively multiplied your per-user COGS by 10–30x for the users who adopt the agent.
Margin-blind benchmarking. Copying a competitor's pricing without knowing their cost structure is dangerous. They might have preferred API contracts, older pricing locked in, or simply be operating at a loss to capture market share. Your floor price is COGS divided by your target gross margin. Below that number, you're paying customers to use your product.
What to Build Before You Launch
Before any AI feature ships to production, three things need to exist:
A cost model spreadsheet: median activation cost, 90th percentile activation cost, projected monthly cost at 1,000 MAU at three usage intensity levels (light, median, heavy). If the heavy-user scenario looks unsustainable, solve it in the design phase, not in the incident postmortem.
A per-user cost dashboard: visibility into inference spend broken down by user and by feature, updated at least daily. This is the early warning system for both the subsidy problem and for runaway usage patterns.
A tiered enforcement gateway: centralized cap enforcement with soft, medium, and hard thresholds, before any code hits production. Retrofitting this onto an existing feature is painful and often requires breaking changes.
Teams that skip these steps discover the same pattern: the product gains traction, heavy users self-select in, margins compress, and the team faces an impossible choice between degrading a feature users love or losing money on every active user. Neither is a good outcome. The unit economics framework isn't glamorous, but it's the difference between a feature that scales and one that gets quietly sunset once finance notices the COGS trend.
