Your Inference Chargeback Is Quietly Taxing Eval Discipline
The FinOps team rolled out chargeback for AI a year ago. The dashboard is gorgeous. Every feature team can see, to the cent, what their inference bill was last month, and the platform PM has slides showing line-of-business attribution at the SKU level. The org has more AI features than it had a year ago. It also has worse AI quality. Nobody has connected the two facts yet, but they are the same fact.
Here is the failure mode in one sentence: chargeback prices the inference token and silently fails to price the eval token, so every PM on the org chart faces an incentive structure that rewards model upgrades and punishes evaluation discipline. Twelve months later, eval coverage is shrinking while the bill is growing — the precise opposite of what the FinOps initiative thought it was incentivizing. This is not a bug in the dashboard. It is the chargeback model functioning exactly as designed, in a domain where the design assumptions from cloud-cost FinOps no longer hold.
The 2026 State of FinOps survey reports that 98% of practitioners now manage some form of AI spend, up from 63% the year prior, and that AI Cost Management is the #1 capability teams plan to add. The maturity curve for cloud chargeback took a decade to climb. Teams are trying to compress that into eighteen months for a workload class where the unit economics are still moving — and where the thing being measured (token spend) is decoupled from the thing that matters (cost per correct answer) in ways the cloud-compute analogue never had to grapple with.
The Asymmetry That Eats Eval Coverage
Chargeback works by aligning the spending incentive with the spending team. It is one of the cleaner ideas in cloud FinOps because the unit being charged — a CPU-second, a gigabyte-month — is also approximately the unit the engineer optimizes against. Make the team that consumes the CPU-second pay for it, and the team will write more efficient code. The optimization gradient and the accounting gradient point the same way.
In LLM workloads, those two gradients diverge. The unit being charged is the token. The unit the team should optimize against is cost per correctness — token spend per task that lands at acceptable quality. Tokens are cheap to count and visible on every invoice. Correctness is expensive to measure, requires labeled data, and only shows up if the team has built and maintained an eval harness. Chargeback prices the cheap-to-measure side of the ratio and ignores the expensive-to-measure side.
The PM-level incentive then plays out predictably:
- Investing in eval: cost is engineer time and labeling budget, paid up-front, on the team's ledger. Benefit is "we caught a regression before it shipped" — a non-event nobody celebrates, no slide in the wins deck, no invoice line that goes down.
- Upgrading to a bigger model: cost is dollars, paid per-token, attributable, on the team's ledger but spread across the year. Benefit is "the feature got better" — a slide in the wins deck, a customer-facing change, a metric that moves at the next review.
A rational PM, looking at this incentive structure, escalates to a bigger model and defers the eval investment. So does the next PM. Multiply by every feature team in the org, and after six months the eval coverage curve has bent down while the inference curve has bent up. The chargeback model is doing what FinOps has always done — attributing spend to the team that incurs it — and that is precisely the problem. The behavior FinOps wanted (efficiency) and the behavior the chargeback rewards (capacity escalation) are not the same behavior, because the implicit assumption that "engineers will optimize against the metric they pay for" routes through a quality gate the chargeback never priced.
Why "Cost Per Correctness" Is the Number That Should Be on the Dashboard
The FinOps community has started talking about per-feature cost attribution as the next maturity step beyond per-team. This is correct and insufficient. Per-feature cost attribution still measures the numerator of cost-per-correctness without the denominator. A feature whose inference bill grew 40% might be a feature that grew 40% in usage with stable quality, or a feature that stayed flat in usage and silently degraded while retries, longer outputs, and escalations to bigger models inflated its token count — those are different stories with the same line item.
The honest unit is cost-per-correct-task, which requires three things the typical chargeback dashboard does not have:
- A live eval set that scores production traffic, not just nightly batches.
- A correctness rate that is denominated against current user intent, not against last quarter's gold set.
- A graph that plots dollars-per-task-completed against time, with the eval-coverage percentage overlaid on the same axis so a PM cannot look at one without the other.
When that graph exists, the conversation in the cost review changes. Instead of "Feature X spent more this month, can we move it to a cheaper tier," the conversation becomes "Feature X's cost-per-correct-task went up while its eval coverage dropped, and we cannot tell whether the spend grew because usage grew or because the model is now wrong more often." That is a different conversation, and it is the conversation chargeback was supposed to enable in the first place.
The reason teams do not have this graph is not that the math is hard. It is that the eval-investment side of the ratio sits on the feature team's ledger as a pure cost and the inference side sits on the same ledger as both a cost and an attribution. Asymmetric ledgers produce asymmetric behavior. The fix is not to discipline the PM. The fix is to make the ledger symmetric.
Four Mechanisms That Re-Symmetrize the Ledger
A chargeback model that wants to incentivize quality alongside cost has to price eval coverage as something other than overhead. The mechanics are not exotic, but they require the FinOps team to give up the assumption that the inference bill is the only number worth attributing.
Eval credits against the inference bill. Price under-evaluated features at a higher per-token rate than well-evaluated ones. The premium is not a tax — it is a representation of the higher operational risk an under-evaluated feature carries, in the same way a service without monitoring carries a higher reliability risk. A team with 80% eval coverage on the production-weighted distribution pays the base rate. A team with 20% coverage pays a multiplier. The multiplier is recoverable: invest in eval, get the credit. This converts the eval investment from a sunk cost into a budgetary line item the PM can defend at the planning meeting because it shows up on the same dashboard the CFO is reading.
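One way to price that multiplier is a linear ramp between a full-coverage threshold and a maximum premium. This is a sketch under assumed policy knobs — the 80% threshold comes from the paragraph above, but the 2x ceiling and the linear shape are illustrative choices an org would tune:

```python
def effective_token_rate(base_rate: float, coverage: float,
                         full_coverage: float = 0.8,
                         max_multiplier: float = 2.0) -> float:
    """Per-token chargeback rate as a function of eval coverage.

    Teams at or above `full_coverage` (production-weighted) pay the
    base rate. Below it, the rate ramps linearly toward
    `max_multiplier` at zero coverage. The premium is recoverable:
    raise coverage and the rate falls back to base.
    """
    coverage = max(0.0, min(1.0, coverage))
    if coverage >= full_coverage:
        return base_rate
    shortfall = (full_coverage - coverage) / full_coverage  # 0..1
    return base_rate * (1.0 + (max_multiplier - 1.0) * shortfall)
```

With these knobs, a team at 20% coverage pays 1.75x the base rate; at 80% it pays exactly base. The shape matters less than the property that the premium is continuous and recoverable, so the PM can model the payback period of an eval investment on the same spreadsheet as the inference bill.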
Platform-funded eval-tooling subsidies. The reason eval is on the feature team's ledger in the first place is that nobody else's ledger has a slot for it. A platform team that owns the eval framework, the labeling pipeline, the trace-capture infrastructure, and the judge-model invocations — and that absorbs those costs into the platform's own budget — converts a per-team build cost into a shared infrastructure cost. This is exactly the move that worked for observability a decade ago: when every team had to roll its own metrics pipeline, observability was patchy; when the platform team paid for the metrics pipeline, observability became a default. Eval is now where observability was in 2014.
An SLO on eval freshness, on the same dashboard as cost. A cost-per-correctness number is only as honest as the eval set behind it. Eval sets drift. A six-month-old gold set is calibrating against a workflow distribution that no longer exists. An SLO denominated in eval-set age relative to production traffic shift — the fraction of recent production traffic that still falls inside the eval set's distribution — turns eval freshness into a tracked metric with an alert threshold. When the SLO breaches, the chargeback rate floats up until the eval is refreshed. PMs see the freshness number next to the cost number, in the same place, in the same review.
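The floating rate can be sketched the same way as the coverage premium. This assumes the org already has some estimate of the in-distribution fraction (an embedding-distance test against the eval set is one common approach, but how it is computed is out of scope here); the SLO target and escalation slope are illustrative:

```python
def freshness_rate_multiplier(in_distribution_fraction: float,
                              slo: float = 0.9,
                              escalation: float = 0.5) -> float:
    """Chargeback multiplier driven by eval-set freshness.

    `in_distribution_fraction` is the share of recent production
    traffic that falls inside the eval set's distribution. At or
    above the SLO the multiplier is 1.0; below it, the rate floats
    up in proportion to the size of the breach, capped implicitly
    at 1.0 + `escalation` when the fraction hits zero.
    """
    if in_distribution_fraction >= slo:
        return 1.0
    breach = (slo - in_distribution_fraction) / slo  # 0..1
    return 1.0 + escalation * breach
```

The design choice worth noting is that the multiplier is automatic and self-correcting: refreshing the eval set pushes the in-distribution fraction back over the SLO and the rate snaps back to base, so the incentive to refresh lives in the bill rather than in a policy document.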
A quarterly cost-per-correctness review with an eval mandate. The conversation that currently sounds like "your feature is expensive, can you tier down to a cheaper model" should sound like "your cost-per-correct-task has risen 30% — before we discuss model tiering, what is the eval coverage you are tiering against, because we will not approve a model swap until we can grade the swap." The cheaper-tier conversation is correct and necessary; it is also the conversation most likely to ship a regression if the eval is not in place to catch it. Forcing the eval-investment conversation to happen first, with budget attached, gates the model swap behind a quality measurement.
None of these mechanisms is conceptually new. Cloud FinOps has analogues for each — reserved-capacity discounts, platform-owned shared services, observability SLOs, quarterly architecture reviews. The work is not invention. The work is recognizing that AI workloads have a quality dimension cloud workloads did not, and that the chargeback model has to grow a price on that dimension or the org will continue to optimize around it.
Chargeback Models Encode the Org's Actual Values
There is a quieter problem behind the mechanics, which is what the chargeback model says about what the org thinks AI is. A chargeback that prices only inference is, implicitly, a statement that inference is the load-bearing cost of the AI feature. This is plausibly true in dollar terms. It is rarely true in value-at-risk terms. The customer-visible blast radius of a model regression — a confidently wrong answer that ships, a refusal where there should have been an action, a tool call into a stale state — is not bounded by the inference token bill. It is bounded by the trust contract with the user, which is many orders of magnitude more expensive to repair than the inference bill is to spend.
A chargeback that prices only inference is telling every PM, every quarter, that the org thinks the bill is the risk. The leadership all-hands can say whatever it wants about quality being a first-class concern. The dashboard is the org's actual values, written in numbers, where every PM reads it before every planning meeting. If the eval column on that dashboard says "coverage: not tracked" and the inference column says "spend: $48,213.41," the org has communicated its priorities with more clarity than any all-hands can undo.
There is a useful Goodhart's Law lens here. When a measure becomes a target, it stops being a good measure — and the inference bill, made into the chargeback target, has stopped being a good measure of feature health. It is now an optimization target that the org will optimize against, in ways both honest (the team builds prompt caching) and perverse (the team defers eval). The fix that the literature has converged on for cloud metrics is to measure several characteristics at once, because it is harder to game five numbers than one. The AI version of that fix is to put the eval-coverage number, the eval-freshness number, and the cost-per-correct-task number on the same dashboard as the inference bill, with the same level of prominence, and to deny the platform PM the right to ship a chargeback dashboard that shows only the bill.
The Year-End Review That Vindicates the Chargeback
Twelve months from now, the FinOps team will run a year-end review of the chargeback program. Two outcomes are possible. In one, the review finds that inference spend is well-attributed, that under-evaluated features pay a premium that has driven eval coverage up across the org, that the cost-per-correct-task curve has flattened or fallen across most features, and that the chargeback model has become the load-bearing artifact for AI quality investment as well as cost discipline. The dashboard tells a story the CFO and the head of AI both endorse, because it is the same story.
In the other, the review finds that inference spend is well-attributed, that the org has tiered down where it could and tiered up where it had to, that the bill came in roughly on plan, and that the user-perceived quality of the AI features has degraded in ways nobody can quite locate because no team funded the eval that would have caught it. The dashboard is correct. The features are worse. The chargeback program has hit its number and missed its purpose.
The difference between those two outcomes is whether the chargeback model priced the eval token alongside the inference token, or only the inference token. That is a design choice the FinOps team is making this quarter, in code, in a dashboard schema, in a meeting where someone has to advocate for a metric that does not yet have a clean unit. The team that wins is the team that puts the eval column on the dashboard before the year-end review forces them to. The team that does not put it there is, today, choosing to discover at the next year-end that the org's AI feature portfolio has been quietly regressing, on a budget that came in on plan, in a chargeback model that worked exactly as designed.
- https://cloudchipr.com/blog/finops-for-ai
- https://www.finops.org/wg/finops-for-ai-overview/
- https://www.finops.org/wg/how-to-build-a-generative-ai-cost-and-usage-tracker/
- https://data.finops.org/
- https://www.virtasant.com/blog/state-of-finops-2026
- https://www.vantage.sh/blog/finops-for-ai-token-costs
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://hamel.dev/blog/posts/evals-faq/
- https://en.wikipedia.org/wiki/Goodhart's_law
- https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide
- https://epoch.ai/data-insights/llm-inference-price-trends
