
The Chargeback Ledger for Compound AI Systems

· 10 min read
Tian Pan
Software Engineer

The first time the CFO asks "what does the assistant cost us per month," the engineering team produces a number. The second time, a different team produces a different number. The third time, finance produces a third number, and somebody opens a spreadsheet that re-derives the bill from spans because nobody trusts any of the previous answers. This is the moment a compound AI system stops being an architecture problem and becomes an accounting problem.

The shape of the failure is structural. A single user request to "summarize my last quarter's customer feedback" triggers an agent owned by team A, which calls a retrieval tool maintained by team B, which calls a model hosted by provider X, which streams results back through a re-ranking tool from team C, which calls a different model from provider Y. One click; five owners; two invoices that arrive a month apart. Standard FinOps primitives — cost centers, allocation tags, account-level rollups — were designed to slice infrastructure that already had stable owners. They do not compose cleanly across an internal call graph that crosses team boundaries on every request.

The 2026 State of FinOps report puts 98% of FinOps teams on the hook for AI spend, and the same survey lists real-time visibility into AI costs as the top tooling gap. That gap is not "we cannot see the bill." The gap is "we cannot see who caused what slice of the bill, fast enough that anyone changes their behavior before the bill arrives."

Per-span attribution is the only ledger that survives audit

The instinct to attribute cost at the API-key level is wrong. API keys map to applications, but a compound AI request is not an application — it is a graph of calls. Attribute at the key level and you get one bucket per agent and zero visibility into which tools or sub-agents drove the cost inside that bucket.

The architecture that survives is per-span attribution. Every LLM call, every tool invocation, every retrieval step emits a span, and each span carries enough metadata that the cost can be reconstructed from the trace alone. OpenTelemetry's GenAI semantic conventions formalize the minimum: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.request.model; a cost attribute such as gen_ai.usage.cost_usd sits on top of the conventions, populated by your pricing rules at ingest. The cost lives at the leaf span. Parents inherit nothing automatically; the rollup is a query.
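Here is a minimal sketch of such a leaf span, using the OpenTelemetry Python API. The pricing table and the compute_cost_usd helper are illustrative assumptions (real pipelines often stamp the cost at ingest rather than at emit), and gen_ai.usage.cost_usd is a custom attribute layered on top of the conventions:

```python
from opentelemetry import trace

tracer = trace.get_tracer("chargeback.demo")

# Illustrative per-million-token prices; real values come from your pricing rules.
PRICING = {"gpt-4o": {"input": 2.50, "output": 10.00}}

def compute_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def record_llm_call(model: str, input_tokens: int, output_tokens: int) -> None:
    # The cost lives at the leaf span; parents inherit nothing, rollups are queries.
    with tracer.start_as_current_span("gen_ai.chat") as span:
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
        span.set_attribute("gen_ai.usage.cost_usd",
                           compute_cost_usd(model, input_tokens, output_tokens))
```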

What the OTel spec doesn't dictate is the attribution layer on top. That is yours to design, and it has to answer three questions for every span:

  • Who made this call? The principal — the human user or system actor whose intent triggered the request.
  • Who is being charged for it? The cost center — usually the team that owns the agent, but not always. Sometimes the tool team takes the bill because they own the egress contract with the vendor.
  • Who authorized the chain that led here? The on-behalf-of principal — propagated from the parent span so that a tool team's costs can be sliced by which calling agent invoked them.

These three identifiers are different. A user from product P can use an agent from team A that calls a tool from team B that pays a vendor V. The principal is the user, the immediate cost center is team A's agent, the on-behalf-of chain runs A→B, and the vendor is V. Without all three, you cannot answer "how much did product P spend on team B's tools this month?" — which is the question that actually starts budget fights.
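A sketch of one way to carry all three across hops, using OpenTelemetry baggage so every leaf span can be stamped with them. The attr.* key names and the ">"-delimited chain encoding are assumptions for illustration, not any standard:

```python
from opentelemetry import baggage, context, trace

tracer = trace.get_tracer("chargeback.demo")

def begin_request(principal: str, agent_cost_center: str):
    # Set once at the edge, when the user-visible request arrives.
    ctx = baggage.set_baggage("attr.principal", principal)
    ctx = baggage.set_baggage("attr.cost_center", agent_cost_center, context=ctx)
    ctx = baggage.set_baggage("attr.on_behalf_of", agent_cost_center, context=ctx)
    return context.attach(ctx)

def invoke_tool(tool_cost_center: str) -> None:
    # Each hop appends itself, so team B's spans can be sliced by calling agent.
    chain = f"{baggage.get_baggage('attr.on_behalf_of')}>{tool_cost_center}"
    token = context.attach(baggage.set_baggage("attr.on_behalf_of", chain))
    try:
        with tracer.start_as_current_span("tool.invoke") as span:
            span.set_attribute("attr.principal",
                               str(baggage.get_baggage("attr.principal")))
            span.set_attribute("attr.cost_center", tool_cost_center)  # who is charged
            span.set_attribute("attr.on_behalf_of", chain)            # who authorized
    finally:
        context.detach(token)

req = begin_request(principal="user-123", agent_cost_center="team-a")
invoke_tool("team-b")  # span carries principal=user-123, chain team-a>team-b
context.detach(req)
```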

The political surface is sharper than the technical one

The technical work is finite. Add headers, propagate them through tool calls, store cost-tagged spans, build a query that rolls up by any dimension. A competent platform team can ship that in a quarter.

The political surface is where chargeback dies. Three patterns recur:

The "you called me" defense. Team B builds a search tool. Team A's agent invokes it 200,000 times a month. Team B insists team A pays — they invoked it. Team A insists team B pays — they own the tool, they chose the vendor, they negotiated the contract, they could have used a cheaper model. Both are correct. The unstated assumption is that there is a single "owner" of a cost, and there usually is not. The functional resolution is to bill the variable cost (per-call vendor charges) to the caller and the fixed cost (the tool's amortized infra and engineering load) to the tool team. Almost nobody does this on day one because it requires the chargeback ledger to distinguish two cost types per span, and most ledgers start with a single number.

The "we never agreed to this volume" complaint. Team B sized their tool for a known set of agents. A new agent from team D ships, and traffic doubles overnight. Team B's vendor bill triples because the new agent fans out three searches per call. Whose problem is that? The honest answer is that an internal tool's cost contract was implicit, and "implicit contract" is FinOps for "deferred fight." The pattern that survives is treating internal AI tools the way you treat external SaaS vendors: documented per-call price, projected volume on intake, alerts when actuals exceed projection by more than a threshold.

The "settlement currency" mismatch. Team A pays one vendor in burst-priced API credits with monthly committed-use discounts. Team B pays a different vendor with provisioned throughput units billed hourly. Team C runs models on internal GPUs whose cost is "amortized infrastructure" denominated in some allocated-capacity unit. Charging team A back at "tokens" works; charging team B back at "tokens" hides the throughput commit; charging team C back at "tokens" requires inventing a synthetic price-per-token from a fixed-cost pool. There is no clean answer. The least-bad approach is to settle internally in a single unit (USD-equivalent at end-of-month), publish the conversion rules, and let teams plan in their native vendor's pricing while the ledger normalizes at month close.

"Cost per token" is easy. "Cost per outcome" is the metric leadership actually wants.

Practitioners tend to converge on cost per token because tokens are countable, vendor-stamped, and survive the trace. Cost per token is the metric that gets dashboards. It is also the metric that gets ignored at the executive level after the second board meeting.

The reason is that token cost falls — model providers cut prices roughly in half each year — while tokens consumed per outcome rise faster, because compound systems use longer contexts, more retrieval rounds, and more tool calls per resolution. The unit price of intelligence is going down; the all-in cost of a successful resolution is going up. A dashboard that only tracks cost per token reports "we are getting cheaper" while the absolute bill grows quarter over quarter. This is the token cost illusion, and finance will catch it before engineering does.
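A worked instance of the illusion, with assumed numbers:

```python
# Assumption: unit price halves year over year while tokens per resolution triple.
price_y1, price_y2 = 10.00, 5.00        # USD per million tokens
tokens_y1, tokens_y2 = 40_000, 120_000  # tokens consumed per resolution

cost_y1 = price_y1 * tokens_y1 / 1_000_000  # 0.40 per resolution
cost_y2 = price_y2 * tokens_y2 / 1_000_000  # 0.60 per resolution
print(f"cost/token down 50%, cost/resolution up {cost_y2 / cost_y1 - 1:.0%}")
```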

The metric that matters is cost per validated business outcome — the all-in cost (tokens, tools, infra, human review) of one successful invocation, where "successful" is defined by the same rubric your eval suite uses. Building this metric requires three pieces; a minimal rollup sketch follows the list:

  • Outcome attribution. Every user-visible request gets a top-level span with an outcome label. The label is populated when the request resolves: success, partial, failure, escalated-to-human. Without an outcome label, every cost rollup is "tokens spent," not "tokens spent on things that worked."
  • Failure cost accounting. Failed and partial outcomes still cost money. The honest cost-per-success number divides the total spend (success + failure tokens) by the count of successes. This number is two to four times the naive cost-per-token-times-tokens-per-success calculation, and it is the number that determines whether the feature is unit-economic.
  • Human-in-the-loop costs. When an agent escalates and a human resolves, the human's time is part of the outcome's cost. Most ledgers omit it because it lives in a different system. It is also the cost that scales with adoption and most threatens the unit economics.
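The rollup itself is small once spans carry outcome labels. In this sketch the request records, the field names, and the $90/hour loaded review rate are all illustrative assumptions:

```python
requests = [
    {"outcome": "success",            "spend_usd": 0.60, "human_minutes": 0},
    {"outcome": "failure",            "spend_usd": 0.55, "human_minutes": 0},
    {"outcome": "escalated-to-human", "spend_usd": 0.70, "human_minutes": 6},
    {"outcome": "success",            "spend_usd": 0.58, "human_minutes": 0},
]
HUMAN_RATE_USD_PER_MIN = 90 / 60  # assumed $90/hour loaded review cost

successes = [r for r in requests if r["outcome"] == "success"]
naive = sum(r["spend_usd"] for r in successes) / len(successes)
spend_all = sum(r["spend_usd"] for r in requests)
all_in = spend_all + sum(r["human_minutes"] for r in requests) * HUMAN_RATE_USD_PER_MIN

print(f"naive per-success:  {naive:.2f}")                       # ~0.59: successes only
print(f"honest per-success: {spend_all / len(successes):.2f}")  # ~1.21: all spend
print(f"all-in per-success: {all_in / len(successes):.2f}")     # ~5.71: plus review time
```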

Cost per token belongs on the platform team's dashboard. Cost per validated outcome belongs on the product team's dashboard. The two dashboards should reconcile, but they answer different questions, and confusing them produces the worst kind of internal report — one that is technically correct and operationally useless.

What "directional for two quarters" actually means

The most common organizational arc is predictable. A platform team ships per-span attribution and a chargeback dashboard. Leadership tells teams the dashboard is "directional for now" — meaning costs are visible but not yet enforced, and no one's budget is hit. Teams nod, ignore it, and ship.

Two quarters later, an unrelated cost dispute — usually a single team blowing through their cloud budget — pulls the AI spend dashboard into a finance review. Suddenly "directional" becomes "billed against your Q3 budget retroactively," and the political surface that was deferred lands all at once. The teams that ignored the dashboard for six months now discover that their agents are 3x more expensive than those of the team that did optimize, and their Q4 budget is gone.

The pattern that avoids this is unromantic: ship attribution with showback first (visibility, no budget impact), publish a target conversion-to-chargeback date with at least one quarter of warning, run a parallel month where the dashboard's numbers are reconciled against the actual vendor invoices (driving tagged attribution above 80% of the invoice total before going live), and let teams optimize against the showback before the chargeback bites. The showback period is not a courtesy — it is the integration test for the ledger. Numbers that look fine in a dashboard tend to fall apart at month-close when the vendor's invoice arrives with allocation categories the dashboard never modeled.
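A sketch of the reconciliation gate for that parallel month, with invented totals; the 80% bar is the go-live target above:

```python
def reconcile(ledger_tagged_usd: float, invoice_total_usd: float,
              go_live_coverage: float = 0.80) -> bool:
    # Compare the ledger's attributed rollup to the vendor's actual invoice.
    coverage = ledger_tagged_usd / invoice_total_usd
    print(f"tagged coverage: {coverage:.0%} "
          f"(untagged gap: ${invoice_total_usd - ledger_tagged_usd:,.2f})")
    return coverage >= go_live_coverage

reconcile(ledger_tagged_usd=71_300.00, invoice_total_usd=83_900.00)  # 85%: go live
```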

The teams who treat compound AI as an accounting system from day one don't avoid budget fights. They have the budget fights with the data on their side, in a forum where the data is trusted, and they have them six months earlier than the teams who deferred the work.

The takeaway for builders

If your compound AI system has more than two internal teams' tools in the call graph, the ledger is not a future problem — it is a present one, and the deferral is the failure mode. The minimum viable chargeback architecture is per-span cost attribution with three identifiers (principal, cost center, on-behalf-of), an outcome label on the top-level span, and a documented settlement convention for cross-vendor normalization. The maximum viable one adds variable-vs-fixed cost decomposition per tool, human-in-the-loop cost integration, and a published showback-to-chargeback transition date.

The unit economics of compound AI will be settled at the team level inside companies long before the industry settles it at the vendor level. Whoever builds the ledger first writes the rules everyone else negotiates against.
