Cost Per Feature, Not Cost Per Token: The Allocation Gap in AI Budgets
Your finance team can tell you, to the dollar, what you spent on Anthropic and OpenAI last month. Your product team can tell you which features users touched the most. Nobody in the building can tell you whether Draft-Email is profitable, whether Summarize-Thread should stay in the free tier, or whether the new Rewrite-Tone feature is eating Draft-Email's lunch on a per-user basis. You have two dashboards that claim to track the same dollars and neither answers the question that actually drives product decisions.
This is the allocation gap. You measure token spend per endpoint because that is what the provider API gives you. But /chat serves twelve features that happen to share a prompt template, and "per endpoint" collapses all twelve into one line item. Pricing tiers, feature gating, deprecation calls, and the "do we ship this?" conversation all float on gut feel until someone does the plumbing to route token costs back to the features that incurred them.
The plumbing is not glamorous. It is request-level tagging, trace-to-telemetry joins, and a disciplined refusal to ship an AI feature without its own cost label. Teams that treat this as infrastructure investment end up with per-feature margin reports segmented by user cohort. Teams that defer it to next quarter end up making pricing decisions from vibes for eighteen months and discovering, after the fact, that a single customer segment was responsible for half the inference bill at negative margins.
Why Endpoint-Level Accounting Lies
The provider's dashboard is shaped by the API, not by your product. Anthropic and OpenAI bill by token, meter by API key, and optionally break things down by model. That is the accounting universe you get for free. The moment you expose more than one feature behind the same service, that universe stops mapping to anything a PM cares about.
A realistic mid-stage SaaS backend has three or four AI-powered features sharing the same inference gateway: a summarization call, a classification call, a chat handler, a Q&A retriever. They call the same model. They use the same API key. They share a system prompt that gets edited in one place. The billing export shows one line: claude-sonnet-4.5: $47,204. The product analytics show four usage graphs. You cannot divide these two things by each other and get a useful number.
The naive workaround is to infer costs from endpoint paths, but modern AI features are path-pluralistic. A draft-email feature may call an outline step, a tone classifier, a retrieval step, and a drafting step — four inference calls, two models, one user action. Attributing cost by HTTP endpoint puts all of this into whatever route triggered the drafting flow, hides the retrieval cost under a different bucket, and silently over-counts when the tone classifier runs for autocomplete too. By the time you realize the endpoint map has drifted from the feature map, your cost dashboards have been quietly wrong for months.
The Tagging Discipline: Labels at Call Time, Not Log Time
The only cost attribution that survives contact with agent traces is tagging at request creation. Retroactive inference — parsing logs, regex-matching prompt prefixes, grepping the request body — works for the golden path and quietly loses data on the long tail. Retry loops, tool-augmented agents, and re-plan branches generate tokens that have no obvious claim to a feature unless you labeled them going in.
Three tags belong on every inference call the first time it leaves your service:
- Feature tag. A stable, product-meaningful identifier: draft-email, summarize-thread, rewrite-tone. Not the endpoint path, not the route name — the feature as marketing and product know it. When a feature is deprecated or renamed, the tag stays historical so the cost time series does not break.
- User cohort. Tenant ID, plan tier, and user ID together. Plan tier matters because it is the dimension pricing decisions need. Tenant ID matters because enterprise customers are the long tail. User ID matters because within a tenant, one power user can consume more than the other nine hundred users combined.
- Request-tree root ID. The identifier for the top-level user action that initiated the inference. In an agent loop that makes fourteen tool-augmented calls to draft one email, all fourteen calls carry the same root ID. This is the tag that lets you answer "how many tokens does one draft-email action cost on average" without doing trace-graph surgery later.
Two more tags are worth carrying even though they are not load-bearing for cost: environment, to prevent staging traffic from contaminating production cost series, and model-plus-provider, so "we are thirty percent cheaper on Haiku" becomes visible without a separate query. Five tags total, set at the request middleware layer. This is boring code. It is also the piece teams defer because it is boring and then regret for a year.
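A minimal sketch of what call-time tagging can look like, assuming the Anthropic Python SDK as the provider client and a structured-JSON log stream as the trace sink; the InferenceTags container and tagged_completion wrapper are illustrative names, not a vendor API:

```python
# Illustrative sketch: stamp the tags on every outbound inference call at call time.
# Assumes the Anthropic Python SDK; swap in your own provider client and trace sink.
import json
import time
import uuid
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class InferenceTags:
    feature: str      # product-meaningful: "draft-email", not "/v1/chat"
    tenant_id: str
    plan_tier: str    # e.g. "free" | "pro" | "enterprise"
    user_id: str
    root_id: str      # one ID per top-level user action, shared by every call it spawns
    environment: str = "production"

def tagged_completion(client, model: str, messages: list, tags: InferenceTags):
    """Call the provider and emit one trace span carrying the tags."""
    started = time.monotonic()
    response = client.messages.create(model=model, messages=messages, max_tokens=1024)
    span = {
        "span_id": str(uuid.uuid4()),
        "model": model,
        "provider": "anthropic",
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        # Cache fields matter later for cache-aware cost accounting.
        "cache_read_tokens": getattr(response.usage, "cache_read_input_tokens", 0) or 0,
        "cache_write_tokens": getattr(response.usage, "cache_creation_input_tokens", 0) or 0,
        "latency_ms": round((time.monotonic() - started) * 1000),
        **asdict(tags),
    }
    # A structured-JSON log stream into your warehouse is enough to start.
    print(json.dumps(span))
    return response
```

The point of the frozen dataclass is that no call can be constructed without every tag, which is what "labels at call time, not log time" means in code.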
The Join: Inference Traces Meet Product Telemetry
Tags on a trace are necessary but not sufficient. The question "cost per active user per month for Draft-Email" is not answered inside the inference pipeline — it is answered by joining two data sources the company already has and rarely co-queries.
On one side: an inference trace store where every span carries the five tags above plus input-token, output-token, cache-hit, and latency fields. Most LLM-observability vendors will hand this to you with modest engineering effort, though several teams end up writing their own because the off-the-shelf options do not expose raw trace data for SQL-style joins.
On the other side: the product analytics warehouse that knows who is a monthly active user of Draft-Email, when they started, what plan they pay for, and how often they retain. This side already exists; data engineering has been maintaining it for years.
The allocation pipeline is the join that pairs these: tokens-by-feature-by-tenant on one side, active-users-by-feature-by-tenant on the other. Divide, and the number you have been missing pops out: cost per active user per feature per month, segmented by plan tier, with variance bars. The specific number is not what matters. What matters is that the moment this number exists, three conversations that had been stuck in subjective debate suddenly have objective inputs.
Pricing tier design gets an answer to "which features can we include in the $19 plan without losing money on the median user." Feature gating gets an answer to "which features are profitable at the free tier." Deprecation gets an answer to "Summarize-Thread costs us $4.80 per MAU and ships to three percent of users — is it earning its floor space." These are arguments that, absent data, calcify into tribal positions correlated with which exec owned the original launch. With data they become spreadsheets.
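The join that produces those numbers is simple enough to sketch in a few lines of pandas; the frame names, columns, and figures below are assumptions standing in for your own trace-store and analytics extracts, not a specific warehouse schema:

```python
# Sketch: tokens-by-feature-by-tenant joined to MAU-by-feature-by-tenant.
# All names and numbers are illustrative; adapt to your own warehouse.
import pandas as pd

# Aggregated from the inference trace store (spans carrying the five tags).
trace_costs = pd.DataFrame({
    "month":     ["2025-01", "2025-01"],
    "feature":   ["draft-email", "summarize-thread"],
    "tenant_id": ["acme", "acme"],
    "plan_tier": ["pro", "pro"],
    "cost_usd":  [1840.00, 312.50],
})

# Aggregated from product analytics: monthly active users of each feature.
feature_mau = pd.DataFrame({
    "month":        ["2025-01", "2025-01"],
    "feature":      ["draft-email", "summarize-thread"],
    "tenant_id":    ["acme", "acme"],
    "active_users": [460, 65],
})

report = trace_costs.merge(feature_mau, on=["month", "feature", "tenant_id"])
report["cost_per_mau"] = report["cost_usd"] / report["active_users"]
print(report[["month", "feature", "plan_tier", "cost_per_mau"]])
```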
What Breaks in Practice
Three failure modes show up reliably even after the tagging discipline is in place.
Long-tail margin erosion hiding inside a healthy portfolio number. Aggregate metrics like "our AI features run at 71 percent gross margin" are lies by construction in any long-tailed customer distribution. A closer look typically reveals that a handful of tenants consume half the tokens at 30-40 percent margins, while the median customer sits above 85 percent. The portfolio number averages them into fiction. The tag-joined view shows the shape. From there, per-tenant throttles, tiered rate limits, or targeted contract renegotiations become possible.
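A toy illustration of that arithmetic, with made-up monthly figures in the same shape as the pattern described above:

```python
# Illustrative only: how a healthy-looking portfolio margin hides a thin-margin tail.
# Two heavy tenants plus twenty median tenants, revenue and inference cost in dollars.
whales = [{"tenant": f"whale-{i}", "revenue": 3_000, "inference_cost": 2_000} for i in (1, 2)]
medians = [{"tenant": f"median-{i}", "revenue": 1_000, "inference_cost": 100} for i in range(20)]
tenants = whales + medians

total_rev = sum(t["revenue"] for t in tenants)
total_cost = sum(t["inference_cost"] for t in tenants)
print(f"portfolio gross margin: {1 - total_cost / total_rev:.0%}")  # ~77%: looks fine

for t in whales:
    margin = 1 - t["inference_cost"] / t["revenue"]
    print(f'{t["tenant"]}: {margin:.0%}')  # ~33%: the part the average hides
```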
The gateway opacity tax. If your architecture puts a gateway between your app and the LLM provider, the gateway authenticates to the provider with a single service credential. The provider sees one API key. Per-user attribution dies at the gateway boundary unless the gateway itself enriches each upstream request with your tags and logs them to your own trace store. Relying on the provider's per-key breakdown looks like free instrumentation until you grow beyond a few tenants, and then retrofitting real attribution becomes a multi-month refactor.
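What that enrichment can look like, sketched framework-agnostically: the app forwards the tags as request headers (the header names here are internal conventions, not a standard), and the gateway copies them onto the span it logs before the request goes upstream under the shared credential. request, provider_client, and trace_sink stand in for whatever objects your gateway already has:

```python
# Sketch of the enrichment a gateway has to do so attribution survives the hop.
# Header names are illustrative internal conventions, not a standard.
TAG_HEADERS = {
    "x-feature": "feature",
    "x-tenant-id": "tenant_id",
    "x-plan-tier": "plan_tier",
    "x-user-id": "user_id",
    "x-root-id": "root_id",
}

def handle_inference(request, provider_client, trace_sink):
    # The provider only ever sees the gateway's single service credential,
    # so the tags have to be captured here or they are gone for good.
    tags = {field: request.headers.get(header, "unknown")
            for header, field in TAG_HEADERS.items()}
    response = provider_client.messages.create(**request.payload)
    trace_sink.write({
        **tags,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    })
    return response
```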
Cache accounting inflating the bill. Anthropic and OpenAI do not bill cached input tokens at the list rate: cache reads typically cost around 10 percent of the list input price, while cache writes carry a premium of up to 125 percent. A cost report that multiplies every token by list price therefore over-reports spend by 30-50 percent on cache-heavy workloads. On a workload that should be 70 percent cache hits, the report shows none of the savings caching actually delivers, which is exactly the kind of number that gets caching turned off as an architecture decision. Attribution that does not split cache-hit and cache-miss token counts actively misleads.
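A back-of-the-envelope sketch of the gap, using the multipliers above and placeholder list prices rather than current rate-card numbers:

```python
# Cache-aware cost for one span versus naive list-price accounting.
# Multipliers follow the figures cited above; list prices are placeholders.
INPUT_PRICE_PER_MTOK = 3.00    # placeholder, $/million input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # placeholder, $/million output tokens
CACHE_READ_MULT, CACHE_WRITE_MULT = 0.10, 1.25

def span_cost(uncached_in, cache_read_in, cache_write_in, output_tokens):
    in_cost = (
        uncached_in
        + cache_read_in * CACHE_READ_MULT
        + cache_write_in * CACHE_WRITE_MULT
    ) * INPUT_PRICE_PER_MTOK / 1_000_000
    out_cost = output_tokens * OUTPUT_PRICE_PER_MTOK / 1_000_000
    return in_cost + out_cost

# One span with 50k input tokens, 16k of which were served from cache.
naive = span_cost(50_000, 0, 0, 2_000)        # price everything at list
actual = span_cost(34_000, 16_000, 0, 2_000)  # split uncached vs cache-read tokens
print(f"over-report: {naive / actual - 1:.0%}")  # roughly a third; widens as hit rate climbs
```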
The Finance-Versus-Product Conversation That Gets Unblocked
The reason this plumbing is worth building is not the dashboards. It is the conversation that the absence of the dashboards prevents.
In every organization past a certain size, finance and product develop different theories of where AI money goes. Finance sees the monthly invoice and asks why it keeps growing. Product sees the engagement graph and asks why finance is blocking feature expansion. Both are right within their frame. Both are working from inadequate data. The conversation stays stuck in "we should be more careful" versus "we need to keep innovating" until someone produces numbers that let you decide.
The feature-level margin report is that artifact. It converts "AI spend is up 40 percent QoQ" into "Draft-Email spend is up 60 percent because MAU doubled and per-MAU cost dropped 15 percent — this is the good kind of growth." It converts "let's cut the budget" into "Summarize-Thread is the one that is up 200 percent on flat MAU — that is where to look." It takes the finance-product debate from values-based to measurement-based, and that is the conversation that actually produces different decisions.
What to Build First
If you are starting from zero, resist the urge to buy the enterprise observability platform and instead do the smallest thing that preserves option value. A request-middleware layer that stamps the five tags above onto every outbound inference call, logs them to whatever trace store you already have — a structured-JSON log stream into your warehouse counts — and joins weekly against product telemetry will give you 80 percent of the value of the full platform.
The tags are the commit. Once every request is labeled at call time, you can change storage vendors, replace your analytics warehouse, rewrite the join, and the attribution survives. Without the tags, no vendor can save you. All they can sell you is a better way to look at data you never captured. The discipline worth investing in is the one-line decorator that refuses to let an AI call leave your service without telling the world, honestly, which feature asked for it.
