
AI Feature Billing Is an Engineering Problem Nobody Planned For

· 9 min read
Tian Pan
Software Engineer

Microsoft's Copilot launched with a clean story: $30/user/month, productivity multiplied. The actual math was uglier. Once you factored in the base enterprise license, compute costs per active user, and support overhead, Microsoft was losing over $20 per user per month on the feature. Finance didn't catch this immediately because the costs lived in the infrastructure budget, not the product P&L. Engineering knew the token bills were large. Nobody had connected the two lines.

This is the billing problem that most AI teams build into their products without realizing it. Not the pricing strategy problem — that's a product decision. The engineering problem: you have no infrastructure to measure what AI features actually cost per customer, per feature, and per request at the granularity required to make any pricing model work.

The Reason Traditional SaaS Pricing Fails for AI

Per-seat pricing assumes costs scale with users. AI breaks that assumption immediately.

With a conventional SaaS product, a power user and a light user cost roughly the same to serve. Storage and compute are cheap relative to revenue, so the variance doesn't matter. With AI features, a power user can trigger agent workflows that consume 50x more compute than a light user doing a single query. You've priced them identically.

Per-query pricing solves the linearity problem but creates an adoption problem. Salesforce learned this the hard way with Agentforce. At $2 per conversation, CFOs could calculate that five support agents handling a typical day's volume would generate $20,000 in monthly fees. Adoption stalled — not because the product was bad, but because the unpredictability itself was the blocker. When customers can't forecast what they'll spend, they don't spend at all. Salesforce iterated through three pricing models in eighteen months before finding a configuration that got traction.

The deeper issue is that agents are non-linear consumers. A single user request can chain through planning, tool selection, execution, verification, and response generation — each step consuming tokens, each step variable. The "happy path" might cost $0.02. An ambiguous request that triggers two retry loops and three tool calls might cost $0.40. You can't know in advance which path a given request will take. This makes any static pricing model a gamble on the distribution of your users' actual behavior.

What You Actually Need to Build

The cost attribution stack for AI features has three distinct layers, and most teams are missing at least two of them.

The event ingestion layer sits in front of every LLM call in your system. Every request — chat completion, embedding, agent step, tool call — flows through a single metering point that records the raw usage: input tokens, output tokens, cached tokens, model, latency, and a set of custom tags (user ID, session ID, feature name, team). This layer needs to handle the throughput of your production system and write to a persistent store without adding meaningful latency to the request path. A proxy like LiteLLM can handle this for multi-provider environments, normalizing token counts across OpenAI, Anthropic, and Bedrock into a consistent format.
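A minimal sketch of what that metering point can look like in application code, assuming an OpenAI-style client. The names here (`EVENTS`, `UsageEvent`, `metered_completion`) are illustrative, and a production version would write asynchronously to a durable queue rather than an in-memory list, so metering never adds latency to the request path:

```python
import time
from dataclasses import dataclass, asdict

EVENTS = []  # stand-in for a durable event store / queue


@dataclass
class UsageEvent:
    # Raw usage plus the tags needed to slice costs later.
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    latency_ms: float
    user_id: str
    session_id: str
    feature: str


def record_event(event: UsageEvent) -> None:
    # In production: async write to a durable queue, off the request path.
    EVENTS.append(asdict(event))


def metered_completion(client, tags: dict, **kwargs):
    """Single metering point that every LLM call flows through."""
    start = time.monotonic()
    response = client.chat.completions.create(**kwargs)
    usage = response.usage
    record_event(UsageEvent(
        model=kwargs["model"],
        input_tokens=usage.prompt_tokens,
        output_tokens=usage.completion_tokens,
        cached_tokens=getattr(usage, "cached_tokens", 0) or 0,
        latency_ms=(time.monotonic() - start) * 1000,
        **tags,
    ))
    return response
```

The key design choice is that the wrapper is the only way application code reaches a model, so no call can escape attribution.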

The pricing engine takes the raw events and converts them to costs using current provider rates. This sounds simple until you're managing multiple models across multiple providers, each with different input/output pricing, cache discounts, and tier structures that change without much notice. The engine needs to apply your margin on top of provider costs and support the pricing structures you expose to customers — usage tiers, credits, hard caps, overage rates.
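A stripped-down version of that conversion might look like the following. The rates and the 40% margin are illustrative placeholders, not current provider pricing, and a real engine would load rates from a versioned config so price changes don't require a deploy:

```python
# Per-million-token rates; illustrative numbers, not current provider pricing.
RATES = {
    "gpt-x":    {"input": 2.50, "output": 10.00, "cached_input": 1.25},
    "claude-y": {"input": 3.00, "output": 15.00, "cached_input": 0.30},
}

MARGIN = 1.40  # 40% markup over provider cost


def event_cost(event: dict, rates=RATES, margin=MARGIN) -> float:
    """Convert one raw usage event into a billable cost."""
    r = rates[event["model"]]
    uncached_input = event["input_tokens"] - event["cached_tokens"]
    provider_cost = (
        uncached_input * r["input"]
        + event["cached_tokens"] * r["cached_input"]
        + event["output_tokens"] * r["output"]
    ) / 1_000_000
    return provider_cost * margin
```

Separating provider cost from margin in the computation matters for the reports later: gross margin analysis needs both numbers, not just the billed total.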

The billing and reporting layer turns costs into invoices and gives customers visibility into their consumption. The engineering challenge here is building something that's fast enough for real-time dashboards but accurate enough for invoice generation. These have different consistency requirements. A dashboard can tolerate a few minutes of lag. An invoice cannot.

Building this stack from scratch typically takes three to six months of focused engineering. Companies attempting it on legacy platforms tend to allocate 25-40% of ongoing engineering resources to billing-related work — data pipeline maintenance, pricing configuration, reconciliation jobs — long after the initial build is complete.

The Margin Math Engineering Teams Ignore

Most engineering teams don't think about gross margin. That's the finance team's problem. But AI changes the relationship between technical decisions and unit economics in ways that make it an engineering problem whether you want it to be or not.

The core issue: traditional SaaS companies run at 80-90% gross margins because serving an additional user costs almost nothing. AI companies run at 50-60% margins because every feature invocation has a real, non-trivial cost. This gap compounds as features scale.

Here's a simplified version of the math worth running on any AI feature before you ship it:

Take your feature's average cost per invocation (you need the metering layer to know this). Multiply by expected invocations per user per month. Divide by your monthly price per user. If that ratio is above 0.4, you're likely margin-negative on that feature at scale.

The Microsoft example: at $30/user/month for Copilot, a user who runs five substantial coding queries per day generates roughly $20-30 in compute costs alone. The feature was priced assuming average usage would be much lower. It wasn't. The customers who found it valuable were exactly the ones who used it heavily — and heavy usage was expensive.

The pattern repeats. Build an AI feature, observe that engaged users love it, watch engaged users generate engaged costs, discover the feature is underwater. Finance catches it six months after engineering could have.

The instrumentation that prevents this isn't complex. You need:

  • Cost per invocation tracked from day one, not after you've noticed a billing spike
  • Segmentation by feature and by user cohort, not just aggregate
  • A breakeven threshold calculated before launch, not after
  • Per-user cost trending as a monitored metric alongside latency and error rate

Agentic Systems Break Even Simple Cost Models

Everything above assumes single-step LLM requests with predictable costs. Agents break that assumption completely.

When an agent fails and needs to restart, every LLM call it already made gets repeated. The cost of a failed agent run isn't zero — it's the full cost of all steps executed before failure. In multi-agent systems, a single user request can trigger sub-agent spawning, parallel tool execution, and coordination rounds. Production multi-agent deployments routinely consume 3-5x the token budget of equivalent single-agent implementations.

The non-determinism extends to cost itself. The same user prompt, sent twice, might execute via different reasoning paths and cost materially different amounts. Average token counts from test environments are dangerous inputs to financial models. The "unhappy path" — ambiguous inputs, retries, reasoning loops — can cost five times the happy path, and it's the unhappy path that tends to cluster around your most complex (and often most valuable) users.

Managing costs in this environment requires enforcement in the execution path, not just reporting after the fact. A budget that lives in a dashboard does nothing to a runaway agent workflow. Useful controls operate at multiple levels simultaneously:

  • Cost budgets: hard caps in real-time currency terms per workflow execution
  • Token budgets: per-request limits that prevent runaway prompts
  • Step budgets: maximum number of tool calls or reasoning iterations before forced termination
  • Time budgets: wall-clock limits that prevent stuck workflows from accumulating costs indefinitely

The enforcement point matters as much as the limit. A check that runs before invoking a model — before spawning an agent step, before triggering a tool — can prevent cost overruns. A check that runs after the fact can only report them.
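One way to sketch those layered controls, with the check running before each step rather than after it. The class and the specific limits are illustrative; the point is that `check()` sits in the execution path and raises before the next invocation, not in a reporting job afterward:

```python
import time


class BudgetExceeded(Exception):
    pass


class WorkflowBudget:
    """Cost, token, step, and time budgets enforced per workflow execution."""

    def __init__(self, max_cost: float, max_tokens: int,
                 max_steps: int, max_seconds: float):
        self.max_cost, self.max_tokens = max_cost, max_tokens
        self.max_steps, self.max_seconds = max_steps, max_seconds
        self.cost = 0.0
        self.tokens = 0
        self.steps = 0
        self.start = time.monotonic()

    def check(self) -> None:
        # Runs BEFORE each model or tool invocation: prevent, not report.
        if self.cost >= self.max_cost:
            raise BudgetExceeded("cost budget exhausted")
        if self.tokens >= self.max_tokens:
            raise BudgetExceeded("token budget exhausted")
        if self.steps >= self.max_steps:
            raise BudgetExceeded("step budget exhausted")
        if time.monotonic() - self.start >= self.max_seconds:
            raise BudgetExceeded("time budget exhausted")

    def charge(self, cost: float, tokens: int) -> None:
        # Called after each step with actuals from the metering layer.
        self.cost += cost
        self.tokens += tokens
        self.steps += 1
```

The agent loop calls `check()` before every step and `charge()` after it, so a runaway workflow terminates at the budget boundary instead of at the invoice.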

What Good Cost Attribution Looks Like

Once you have the metering infrastructure, the useful question isn't "how much did AI cost this month?" — it's "which features, for which customers, at what margin, and trending in which direction?"

That question requires tagging every event at ingestion time with enough context to slice the data later. Feature name, user segment, request type, session ID. The tags cost almost nothing to attach at ingestion time and are extremely expensive to reconstruct retroactively.

The three reports that matter for running AI features sustainably:

Per-feature margin trending: Cost per invocation over time, segmented by feature. This catches features drifting underwater as usage patterns evolve or model prices change. A feature that was profitable at launch can become unprofitable six months later as engaged users' usage patterns shift.

Per-customer cost profile: Which customers cost you the most to serve, relative to what they pay? Power users in the AI context are often customers who derive genuine value — they're also often the most expensive. Understanding this segmentation informs both pricing tier design and support prioritization.

Heavy-tail cost analysis: What does the 95th percentile request cost? For agents, the distribution of costs per request can have a very heavy right tail. A small percentage of requests might consume 20x the median. Understanding the tail tells you whether your pricing can absorb outliers or whether you need hard caps.
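With tagged events in hand, the tail analysis itself is a few lines. The cost distribution below is synthetic, chosen only to show the shape of a heavy right tail:

```python
def percentile(values, p):
    """Nearest-rank percentile; good enough for cost-tail reporting."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]


# Synthetic per-request costs with a heavy right tail (illustrative only)
costs = [0.02] * 90 + [0.10] * 8 + [0.40, 1.50]

p50 = percentile(costs, 50)  # 0.02: the median request is cheap
p95 = percentile(costs, 95)  # 0.10: 5x the median
p99 = percentile(costs, 99)  # 0.40: 20x the median
```

If p99 divided by p50 is larger than your pricing can absorb, that is the signal for hard caps.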

The Infrastructure Ecosystem Is Catching Up

Stripe's acquisition of Metronome (the usage-based billing platform running infrastructure for OpenAI and Anthropic) signals where the market is heading. Dedicated AI billing infrastructure is becoming a standard product category, not something every company needs to build from scratch. Stripe's token billing product handles real-time event ingestion, provider price syncing, and invoice generation for LLM usage natively.

The open-source layer has similar coverage. LiteLLM handles proxy-level metering across providers. OpenTelemetry extensions for LLM tracing standardize event schemas. The raw infrastructure for metering exists; what most teams lack is the organizational decision to treat cost attribution as a first-class engineering concern from day one rather than a retroactive project after the first unexpected billing spike.

Token prices are falling — roughly 10x annually over the last three years. That trend helps margins but doesn't eliminate the problem. Lower costs per token still compound when agent workflows multiply the token count per user action. And as models get cheaper, the market expectation is that AI features should also get cheaper, passing the deflation through to customers rather than banking it as margin.

The teams building durable AI businesses are the ones who built cost visibility first, pricing models second. Not because they had better business instincts — because they had better instrumentation and could actually see what was happening.

Building a production AI feature without cost attribution isn't an engineering shortcut. It's committing to not knowing whether the feature is working until the invoice arrives.
