
Token Accounting Drift: When Your Trace Logs Don't Match the Provider Invoice

9 min read
Tian Pan
Software Engineer

There is a finance meeting that happens at every company shipping a hosted LLM feature, usually around month four. The engineering team has been logging token counts from every request. The finance team has the provider's invoice. The numbers don't agree. Sometimes the gap is five percent. Sometimes it is thirty. The engineers say the invoice is wrong. The finance team says the logs are wrong. Both teams are technically correct, and neither owns the reconciliation.

The drift is not fraud. It is a structural measurement problem, and the structure has at least six independent failure modes that compound. A team that does not own those failure modes will spend the next quarter writing apology emails to FP&A about why the forecast slipped, when the real story is that nobody on the engineering side ever audited what "token" meant in their own logs.

The reason this matters more than the typical FinOps drift is that tokens are the unit of measurement the whole AI cost model is built on. You can forecast spend, set per-tenant budgets, attribute cost to features, and negotiate volume discounts only to the extent that your logged tokens reconcile with billed tokens. A system that cannot reconcile its observed bytes with its billed bytes does not have a working cost model. It has a guess.

The six structural sources of drift

Every team I have seen audit this seriously finds the same six contributors, in roughly this order of magnitude.

Tokenizer version skew. The tokenizer that ships with your SDK is a snapshot. The server runs the current one. When Anthropic shipped a new tokenizer with Opus 4.7 that produces up to 35% more tokens for the same input compared to Opus 4.6, every team that estimated cost with the old client-side tokenizer underestimated their own spend by up to a third on the same English text. The per-token price did not change. The token count did. If your SDK pin is six months stale, your cost model is six months stale.
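One way to make this skew visible is to log both numbers per request and compute the deviation relative to your client-side estimate. A minimal sketch (the function name and the token counts are illustrative, not from any SDK):

```python
def tokenizer_drift_pct(client_estimate: int, server_reported: int) -> float:
    """How far the server's authoritative token count deviates from the
    client-side estimate, as a percentage of the estimate.
    Positive means the client undercounted."""
    return (server_reported - client_estimate) / client_estimate * 100

# Illustrative: a 1,000-token client estimate against a server count
# that is 35% higher after a tokenizer revision.
drift = tokenizer_drift_pct(client_estimate=1000, server_reported=1350)
# drift == 35.0
```

Tracking this per model version turns a silent tokenizer bump into a visible step change on a dashboard.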

Cached prefix accounting. Both Anthropic and OpenAI now offer prompt caching with roughly a 90% discount on cache reads, but Anthropic also charges 1.25× or 2× the base input price on cache writes depending on TTL. Your trace probably logs total input tokens and a separate cache_read count, but the provider's invoice applies four different prices to four slices of those tokens: standard input, cache write at 1.25×, cache write at 2×, and cache read at 0.1×. If your cost model multiplies total input tokens by the standard rate, you will overcount cost by a wide margin when caching is working and undercount it when caching is silently failing — both directions hide the same bug.
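The four-tier arithmetic is easy to get wrong in a one-line cost formula. A sketch of what the invoice actually computes, using the multipliers above (the function signature and TTL labels are hypothetical; substitute your provider's published rates):

```python
def cached_input_cost(
    standard_tokens: int,
    cache_write_short_ttl: int,  # billed at 1.25x base in this sketch
    cache_write_long_ttl: int,   # billed at 2x base in this sketch
    cache_read_tokens: int,      # billed at 0.1x base
    base_price_per_mtok: float,
) -> float:
    """Apply the four input-price tiers to the four token slices."""
    p = base_price_per_mtok / 1_000_000  # price per single token
    return (
        standard_tokens * p
        + cache_write_short_ttl * 1.25 * p
        + cache_write_long_ttl * 2.0 * p
        + cache_read_tokens * 0.10 * p
    )

# Naive model: all 100k input tokens at the standard rate.
naive = 100_000 * 3.0 / 1_000_000
# Actual: 90k of those were cache reads at 0.1x.
actual = cached_input_cost(10_000, 0, 0, 90_000, base_price_per_mtok=3.0)
```

With caching working, the naive model overstates this request's input cost by more than 5×; if caching silently breaks, the same naive model becomes roughly correct, which is exactly why the bug hides.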

Retry double-counting. Your client retries on 429s and timeouts. Whether each retry shows up on the bill depends on whether the request reached the inference layer before the failure. Network timeouts that fail before the server received the prompt are not billed. Server timeouts after the prompt was tokenized but before completion finished often are. Your client logs every retry the same way. The provider's usage API knows the difference. The gap is small per request and large in aggregate, especially on flaky days.
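You cannot resolve this from the client alone, but you can log enough to classify it later. A sketch of tagging each attempt with the stage at which it failed, so the reconciliation job can explain logged-but-unbilled rows (the enum values and the heuristic are assumptions, not provider semantics — the usage API remains the ground truth):

```python
from dataclasses import dataclass
from enum import Enum

class FailureStage(Enum):
    NONE = "none"                        # completed normally
    BEFORE_SEND = "before_send"          # connect/DNS failure: prompt never arrived
    BEFORE_RESPONSE = "before_response"  # sent, no response: billing ambiguous
    MID_STREAM = "mid_stream"            # server accepted the prompt: usually billed

@dataclass
class AttemptLog:
    request_id: str
    attempt: int
    stage: FailureStage

def likely_billed(log: AttemptLog) -> bool:
    """Heuristic classification only; reconcile against the usage API."""
    return log.stage in (FailureStage.NONE, FailureStage.MID_STREAM)
```

The point is not the heuristic itself but that every retry row carries a stage, so aggregate gaps on flaky days decompose into explainable buckets.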

Streaming usage frames. When you stream a response, the token count for the request arrives in a final usage frame after the content tokens, immediately before the stream terminator. Many SDKs require an explicit option (stream_options={"include_usage": True} in the OpenAI SDK; analogous in others) to emit it at all. If you forgot to set that option six months ago when streaming launched, your traces have been recording zero for output tokens on every streamed request. The invoice has not.
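The failure mode is a consumer loop that only reads content deltas and never looks for the trailing usage frame. A self-contained sketch, with plain dicts standing in for the SDK's stream chunks (the chunk shape here is simplified; real SDK objects differ):

```python
def consume_stream(chunks):
    """Accumulate streamed content and capture the trailing usage frame.
    With the OpenAI SDK you must also pass
    stream_options={"include_usage": True} at request time, or the final
    usage frame is never emitted and output tokens log as zero."""
    text_parts, usage = [], None
    for chunk in chunks:
        if chunk.get("content"):
            text_parts.append(chunk["content"])
        if chunk.get("usage"):  # arrives once, after all content chunks
            usage = chunk["usage"]
    return "".join(text_parts), usage

# Simulated stream: two content chunks, then the usage frame.
stream = [
    {"content": "Hello"},
    {"content": ", world"},
    {"content": None, "usage": {"prompt_tokens": 12, "completion_tokens": 4}},
]
text, usage = consume_stream(stream)
# text == "Hello, world"; usage carries the billable counts
```

If `usage` comes back `None` in production, that is the six-months-of-zeros bug: the request-time option was never set.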

Hidden provider-side tokens. The provider injects system prompts for safety, content moderation, and tool use scaffolding that you never wrote and cannot see. Some providers bill those tokens. Some do not. Some bill them only under certain conditions (vision inputs, tool use, structured output). Your trace cannot see what your trace cannot see. The bill can.

Batch and tier discounts. Both major providers offer roughly a 50% batch-mode discount. If your logging layer records the on-demand price for every request regardless of how it was submitted, and a fraction of your traffic is going through batch, your logged cost will overstate the bill by the batch share times 50%. Reverse direction: if a request you logged as batch was actually routed to on-demand due to a capacity fallback, you understate.
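The arithmetic is worth writing down once, because "batch share times 50%" is easy to apply in the wrong direction. A sketch (parameter names are illustrative):

```python
def billed_cost(logged_on_demand_cost: float, batch_share: float,
                batch_discount: float = 0.5) -> float:
    """True bill when `batch_share` of spend logged at on-demand prices
    actually ran through batch mode at `batch_discount` off."""
    return logged_on_demand_cost * (1 - batch_share * batch_discount)

# $10,000 logged at on-demand prices, 40% of it actually batched:
# the invoice reads $8,000 and the logs overstate it by 20%.
bill = billed_cost(10_000, batch_share=0.4)
# bill == 8000.0
```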

Any one of these alone might be a few percent. All six together explain why a thirty-percent gap is not unusual the first time a team looks.

The reconciliation primitive nobody builds

The fix is not to write a better client-side tokenizer. Tokenizers will keep drifting; the server's view is always authoritative. The fix is to treat your logged token count as an estimate and the provider's reported usage as ground truth, and to wire the second into your observability stack.

Every major provider exposes a usage API: Anthropic's Admin Usage API, OpenAI's /v1/organization/usage/completions, the equivalents at AWS Bedrock and Google Vertex. These APIs return token counts and costs at request-level granularity (or hourly aggregates, depending on tier). The reconciliation primitive is a daily job that pulls the previous day's usage data from the provider, joins it against your own request logs on request ID, and writes the diff to a cost_reconciliation table.

What you need from that table:

  • Per-request canonical cost field. The value the provider reported, not what your client guessed. This becomes the source of truth for cost attribution, per-tenant billing, and forecasting.
  • Drift metric. Logged tokens minus billed tokens, expressed as a percentage. This is your tokenizer-version health signal. When it crosses a threshold (5%? 10%?), page someone.
  • Unattributed cost. Tokens on the bill that no request log accounts for. This is your hidden-system-prompt and your double-billing-detection signal in one number.
  • Logged-but-unbilled. Requests in your logs with no matching usage entry. This is your retry-double-counting and your client-side tokenizer false-positive signal.
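All four fields fall out of a single join on request ID. A minimal in-memory sketch of the daily job's core (table shape, field names, and the sample numbers are hypothetical; in production the inputs come from the provider's usage API and your trace store):

```python
def reconcile(local_logs: dict, provider_usage: dict) -> dict:
    """Join local request logs against provider-reported usage.
    Both inputs map request_id -> input+output token count."""
    matched = local_logs.keys() & provider_usage.keys()
    logged = sum(local_logs[r] for r in matched)
    billed = sum(provider_usage[r] for r in matched)
    return {
        # canonical per-request basis: the provider's numbers, not ours
        "canonical_tokens": {r: provider_usage[r] for r in matched},
        # tokenizer-health signal, as a percentage of billed tokens
        "drift_pct": (logged - billed) / billed * 100 if billed else 0.0,
        # on the bill, absent from our logs: hidden tokens / double billing
        "unattributed": {r: provider_usage[r]
                         for r in provider_usage.keys() - local_logs.keys()},
        # in our logs, never billed: retries and false positives
        "logged_but_unbilled": sorted(local_logs.keys() - provider_usage.keys()),
    }

report = reconcile(
    local_logs={"req-1": 1000, "req-2": 800, "req-4": 500},
    provider_usage={"req-1": 1350, "req-2": 800, "req-3": 120},
)
# drift_pct is negative here: the client undercounted req-1 by 350 tokens
```

In practice this runs as a daily job against yesterday's usage export, writes one row per request to the cost_reconciliation table, and alerts on the aggregates.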

The reconciliation job is not glamorous and it does not ship features, which is why nobody builds it until they have to. The teams that have it earn the right to forecast the next quarter; the teams that don't have it show up to the finance meeting with two numbers and no story for the gap.

Why finance and engineering both keep being wrong

The org failure mode is older than LLMs. It is the same pattern that broke cloud cost accounting in 2014 and CDN billing in 2008: one team treats the invoice as ground truth because it is the document the company actually paid, and the other team treats the logs as ground truth because they are the artifact the engineers actually produced. Neither team owns the reconciliation, so the gap stays.

The version that's specific to LLMs is that the engineering team often does not realize the gap exists at all until finance escalates. The dashboards show token counts. The dashboards do not show "tokens the provider would have billed us for if we had asked." Building the second dashboard is the work.

A pattern that works: assign a single person — usually on the platform or infrastructure team — as the named owner of "cost reconciliation." Their job is not to fix every source of drift. Their job is to maintain the reconciliation dashboard, to alert when drift crosses threshold, and to be the human who walks into the monthly invoice meeting with one number and a story for it. Without a named owner, the work falls into the gap between SRE, FinOps, and the application team that owns the AI feature, and nothing ships.

The tokenizer-pin regression test

A specific discipline worth calling out: pin the tokenizer version in your SDK lockfile, and write a regression test that fails when the lockfile updates without an accompanying bill-reconciliation review.

This sounds excessive until you watch a tokenizer change ship in a minor SDK release and silently change the input-token count for every request by double digits. The whole point of a pinned tokenizer with a regression-style gate is that the change is not allowed to land in production without somebody updating the cost model in the same PR. The test does not need to be sophisticated; it can be a snapshot test that asserts a known string tokenizes to a known count. The value is the gate, not the cleverness of the assertion.
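The gate can be as small as this. A sketch of the snapshot test, with a whitespace split standing in for your pinned tokenizer (swap in tiktoken or your provider SDK's token-counting call; the snapshot string and count are illustrative):

```python
# The expected count below must be re-baselined deliberately, in the same
# PR as any tokenizer bump, together with the cost model.
SNAPSHOT_TEXT = "The quick brown fox jumps over the lazy dog."
EXPECTED_TOKENS = 9

def count_tokens(text: str) -> int:
    # Stand-in for the pinned tokenizer; whitespace split for the sketch.
    return len(text.split())

def test_tokenizer_snapshot():
    assert count_tokens(SNAPSHOT_TEXT) == EXPECTED_TOKENS, (
        "Token count changed: the tokenizer version moved. "
        "Update the cost model in this PR before re-baselining this number."
    )
```

The assertion message does the real work: it turns a silent dependency bump into a forced conversation about the cost model.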

The corollary: when you do roll a tokenizer update, treat it as a cost-model migration, not a dependency bump. Re-baseline the drift metric. Update the forecast. Tell finance before they tell you.

Tokens are bytes; bills are inventory

The frame that I think unlocks this conversation: tokens are a unit of measurement the same way bytes are. A storage system that can't reconcile its observed bytes with its billed bytes is a storage system without a working cost model — it might be cheap, it might be expensive, you cannot tell. The same is true for inference.

The teams treating tokens as something fuzzier than bytes — an estimate, a vibe, a number on a dashboard — are the same teams that get blindsided by month-five budget overruns and end up with finance running the cost negotiation because engineering had no story. The teams that treat tokens as bytes, with all the rigor that implies (versioned tokenizers, reconciliation jobs, drift alerts, named ownership), get to set the cost narrative themselves.

The bill is inventory. Your trace is the warehouse log. If they don't match, you don't have a billing problem; you have a measurement problem, and the measurement problem is yours to fix.
