The Token Count Your Client Estimated And Your Provider Invoiced
Your application counted tokens locally with a tokenizer library matching what you believed the provider used. The SDK reported "estimated 4,200 tokens" before each call. Your budget logic admitted the request. Then the provider's invoice came back at 6,800 tokens for the same payload. Multiply that roughly 60% gap by a few million calls a month and the line item your finance team cannot reconcile against your own logs starts to look like an architectural mistake rather than a rounding error.
The mistake is not that the local tokenizer was wrong. The mistake is treating the local tokenizer as a contract instead of a guess. Tokenization is something the provider does inside their serving stack — your library is a model of that process, not the process itself, and the two drift in ways that are small per call and structural across the population of calls you actually make.
This is the failure mode that shows up months in. The first week, your cost-per-conversation matches your estimate. The first month, the discrepancy looks like noise. By the time someone in finance asks why the AI line item is 30% over budget, you have built caching, rate limits, per-customer quotas, and SLO alerts on top of a number you cannot independently verify. The number you trusted came from a library that approximates the provider's tokenizer; the number they bill you with comes from the tokenizer itself, and the two are not the same.
The Tokenizer In Your SDK Is Not The Tokenizer In Their Datacenter
Every major provider ships, or used to ship, some way to count tokens before you call: the tiktoken library for OpenAI, a count-tokens endpoint for Anthropic, a counting helper for Gemini. The pitch is reasonable: count before you call so you can budget, truncate, or refuse the request. The reality is that a tokenizer library you imported is a reference implementation that lags the serving stack in three predictable ways.
First, the encoding model your SDK was built against may not be the encoding model the server now uses. Tokenizers get updated. Anthropic's recent generations introduced a tokenizer that, on the same text, can count substantially more tokens than the prior one — public guidance suggests up to 35% more for certain content. If your SDK was pinned six months ago and your account was silently migrated to the new model family, you are estimating against last year's tokenizer and being billed against this year's.
Second, the serving stack adds tokens your library does not see. Chat models wrap your messages in a chat template — role headers, separators, system preamble, occasionally a date stamp the provider injects on their side. The exact bytes of that template are a provider detail, and they evolve. The token-counting issue lists for these libraries are full of bug reports where the local count is off by anywhere from one to several hundred tokens depending on what tool calls or system prompts were in the message, because the local library cannot fully replicate the template the server applies.
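One way to keep template overhead from staying invisible is to fit it from observed calls rather than guess it. The sketch below assumes you already log, per call, the number of messages, your local count over the message text, and the input tokens the provider reported; `CallSample` and both helper functions are hypothetical names. Treating the overhead as linear in the number of messages is itself an assumption, and tool calls or system prompts can break it.

```python
import math
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class CallSample:
    n_messages: int            # messages sent in the request
    local_text_tokens: int     # local tokenizer count over message text only
    billed_input_tokens: int   # input tokens from the provider's usage field


def fit_per_message_overhead(samples: list[CallSample]) -> float:
    """Median number of tokens per message that the local count did not explain."""
    residuals = [
        (s.billed_input_tokens - s.local_text_tokens) / s.n_messages
        for s in samples
        if s.n_messages > 0
    ]
    if not residuals:
        raise ValueError("need at least one sample with messages")
    return median(residuals)


def estimate_input_tokens(n_messages: int, local_text_tokens: int,
                          per_message_overhead: float) -> int:
    """Local text count plus the fitted template overhead, rounded up."""
    return local_text_tokens + math.ceil(per_message_overhead * n_messages)


# Illustrative numbers only, not measurements from any provider.
observed = [
    CallSample(n_messages=2, local_text_tokens=100, billed_input_tokens=108),
    CallSample(n_messages=3, local_text_tokens=200, billed_input_tokens=212),
    CallSample(n_messages=1, local_text_tokens=50, billed_input_tokens=55),
]
overhead = fit_per_message_overhead(observed)
```

The median is used instead of the mean so that a handful of calls with large tool schemas do not drag the fitted overhead upward for ordinary chat traffic.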
Third, multimodal and structured inputs are counted by rules the local tokenizer simply does not implement. An image is not text. Audio is not text. A function-calling tool schema is text on the wire but is normalized and rewritten on the server before being counted. The local helper can give you a number for the prose portion of your prompt; the image cost, the audio cost, and the tool-call overhead are computed by a different code path on the other side of the network.
Each of these is small in isolation. Together, they produce a gap that is not random — it is biased upward on exactly the inputs you care about most.
The Gap Is Largest On The Inputs That Matter
A 1–3% drift on average is annoying. A 1–3% drift on average that is actually a 0% drift on small English prompts and a 40% drift on multilingual code-paste prompts is a different problem. The population of inputs you actually serve is not uniform, and the discrepancy concentrates in specific places.
Non-Latin scripts are the most consistent source of expansion. The same sentence in English and Chinese is roughly the same length to a human reader, but tokenizes very differently. A tokenizer that was trained primarily on English text often encodes a single Chinese character into multiple tokens, and the multiplier depends on the script and the version of the vocabulary. If your product launches a Japanese tier and your forecasts were built on English usage, the cost per conversation is not what your spreadsheet said.
Pasted code has the same property. Source code contains long runs of identifiers, punctuation, and whitespace that compress poorly. A 500-line file pasted into a chat may tokenize at twice the rate of 500 lines of prose. If you operate a coding agent, every diff applied, every stack trace inspected, every test output read sits in this expansion-prone region of the tokenizer's behavior.
Emoji and rare Unicode sit in the worst part of the curve. A single emoji can take more tokens than a short word. Bullet characters, smart quotes, mathematical symbols, and the decorative whitespace people paste from rich-text editors all add cost that is invisible in a word-count and visible in the invoice.
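A rough way to see why these inputs expand, without calling any tokenizer, is to look at UTF-8 bytes per character. For byte-level BPE tokenizers (the family tiktoken belongs to), every token covers at least one byte, so the byte count is an upper bound on the token count, and text the vocabulary covers poorly drifts toward that bound. The sketch below is only a proxy under that assumption; `utf8_bytes_per_char` is a hypothetical helper and the real multiplier depends on the vocabulary.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


samples = {
    "english": "hello",
    "chinese": "你好",
    "emoji": "😀",
    "smart_quote": "\u201c",  # left double quotation mark from a rich-text paste
}

# Higher values mean more raw bytes for the tokenizer to merge per character.
bytes_per_char = {name: utf8_bytes_per_char(s) for name, s in samples.items()}
```

A slice of traffic whose bytes-per-character average climbs is a reasonable early signal that its token cost per character is climbing too, even before the invoice confirms it.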
Tool schemas deserve their own line. A function-calling agent ships a JSON schema with every request describing the tools available — names, descriptions, argument types, enum values. That schema lives in the input window and gets billed on every call. Provider-side normalization may inflate or compress this differently than your local pre-flight count expects, and the GitHub issue queues for major SDK tokenizers contain explicit acknowledgments that token counts diverge meaningfully once tool calls are in play.
The takeaway is that an average-case benchmark of your tokenizer against the API will tell you the gap is small. The cases that drive your bill are the worst cases, and they are systematically underestimated.
Hidden Tokens You Did Not Send But Were Billed For
The second class of discrepancy is more uncomfortable. The provider can add tokens to your input that your client never saw and never had the opportunity to count.
Some of these are benign and disclosed. Anthropic's documentation notes that system optimizations may add tokens, but explicitly states billing reflects only your content. OpenAI's chat-template overhead is documented well enough that experienced practitioners add a fudge factor. These are accounting details, not surprises.
Others are more contested. Reasoning models charge for hidden chain-of-thought tokens you never see in the response. A research paper from 2025 on "predictive auditing of hidden tokens in LLM APIs" argues that providers concealing intermediate reasoning while billing for it creates an information asymmetry the customer cannot easily verify — they can only check that the bill is plausible, not that the underlying count is correct. Whether or not that asymmetry is exploited, it exists structurally because the customer has no independent way to audit a token count generated inside the provider's serving infrastructure.
The defensive posture is not paranoia. It is to assume the input side is approximate and the output side is partly opaque, and to treat the provider's returned usage metadata, not your pre-flight estimate, as the source of truth for what actually got billed.
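To make the opaque part of the output side at least visible in your own logs, you can split billed output into what you saw and what you did not. The sketch below assumes a usage dictionary shaped like the one OpenAI's Chat Completions API returns for reasoning models, where `completion_tokens_details.reasoning_tokens` is counted inside `completion_tokens`; other providers use different field names, and `split_output_tokens` is a hypothetical helper.

```python
def split_output_tokens(usage: dict) -> dict:
    """Split billed output tokens into visible and hidden-reasoning portions.

    Assumes reasoning tokens are already included in `completion_tokens`.
    """
    completion = int(usage.get("completion_tokens", 0))
    details = usage.get("completion_tokens_details") or {}
    reasoning = int(details.get("reasoning_tokens", 0))
    return {
        "billed_output": completion,
        "hidden_reasoning": reasoning,
        "visible_output": max(completion - reasoning, 0),
        "hidden_share": reasoning / completion if completion else 0.0,
    }


# Illustrative values only.
example_usage = {
    "prompt_tokens": 1200,
    "completion_tokens": 900,
    "completion_tokens_details": {"reasoning_tokens": 640},
}
breakdown = split_output_tokens(example_usage)
```

This does not audit the provider's count. It only records how much of each bill rests on tokens you never received, which is the quantity worth tracking per model and per prompt template.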
What "Closing The Gap" Actually Looks Like
The right design treats your local token count as a budgeting hint, not a budgeting authority. A few patterns make this work in production.
Returned usage is the budget ledger, not the local estimate. Every response from a modern LLM API carries a usage field with the actual input and output tokens billed for that call. Wire that into the same place you keep your local pre-flight estimate, and feed the difference back into the per-tenant, per-feature, per-prompt-template cost model. If you only log estimates, you are tracking a number that diverges from your invoice on a schedule nobody owns.
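A minimal sketch of such a ledger entry follows. It assumes two common usage shapes: `input_tokens`/`output_tokens` (as in Anthropic's Messages API) and `prompt_tokens`/`completion_tokens` (as in OpenAI's Chat Completions API). `LedgerEntry`, `billed_tokens`, and `record_call` are hypothetical names, and cache-related usage fields are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    tenant: str
    template: str
    estimated_input: int   # local pre-flight estimate
    billed_input: int      # from the provider's usage field
    billed_output: int     # from the provider's usage field

    @property
    def input_ratio(self) -> float:
        """Billed input divided by estimated input."""
        if self.estimated_input <= 0:
            return float("inf")
        return self.billed_input / self.estimated_input


def billed_tokens(usage: dict) -> tuple[int, int]:
    """Return (input, output) tokens from either assumed usage shape."""
    if "input_tokens" in usage:
        return int(usage["input_tokens"]), int(usage.get("output_tokens", 0))
    if "prompt_tokens" in usage:
        return int(usage["prompt_tokens"]), int(usage.get("completion_tokens", 0))
    raise KeyError("unrecognized usage shape")


def record_call(tenant: str, template: str, estimated_input: int,
                usage: dict) -> LedgerEntry:
    """Store the estimate and the billed counts side by side."""
    billed_in, billed_out = billed_tokens(usage)
    return LedgerEntry(tenant, template, estimated_input, billed_in, billed_out)


# The 4,200 versus 6,800 example from the opening, as a ledger entry.
entry = record_call("acme", "support_v3", 4200,
                    {"input_tokens": 6800, "output_tokens": 512})
```

Keeping the estimate and the billed count in one record is the design choice that matters here: the ratio can then be computed for any slice without joining two log streams after the fact.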
Admit requests with a margin, not the raw estimate. If your local count says 4,200 tokens against a 4,500-token cap, you do not actually know that the request fits. Pick a margin based on the empirical distribution of actual / estimate for the kind of input you are about to send. That ratio should be measured per language, per content type, and per tool-schema version, not as a single global constant. A coding agent and an English chatbot do not share a margin.
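As a sketch of margin-based admission, the code below takes the ratios of billed tokens to estimated tokens observed for one slice of traffic, picks a high percentile as the margin, and admits only if the inflated estimate still fits under the cap. The function names, the nearest-rank percentile, and the 0.95 default are illustrative assumptions, and the ratio lists are made-up numbers.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Nearest-rank percentile for 0 < q <= 1."""
    if not values:
        raise ValueError("no observed ratios for this slice")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def admit(local_estimate: int, cap: int, slice_ratios: list[float],
          q: float = 0.95) -> bool:
    """Admit only if the estimate, inflated by the slice's margin, fits the cap."""
    margin = percentile(slice_ratios, q)
    return math.ceil(local_estimate * margin) <= cap


# Billed / estimated ratios per slice (illustrative values only).
english_chat = [1.00, 1.01, 1.02, 1.02, 1.03]
pasted_code = [1.10, 1.25, 1.30, 1.40, 1.62]

fits_as_chat = admit(4200, 4500, english_chat)
fits_as_code = admit(4200, 4500, pasted_code)
```

The same 4,200-token estimate is admitted for the English chat slice and refused for the pasted-code slice, which is the point of measuring the margin per slice rather than globally.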
Schedule divergence audits. Once a week, sample a few thousand recent calls, compute the ratio of returned usage to local estimate, slice it by prompt template and content language, and alert when any slice's drift exceeds a threshold. This catches two things at once: a silent tokenizer change on the provider side, and a content-mix shift on your side that walked you into a worse part of the tokenizer's behavior. The artifact this produces is also what your finance team needs when they ask you to reconcile the bill.
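A divergence audit can be sketched in a few lines. The version below assumes each sampled call is a dictionary with `template`, `language`, `estimated`, and `billed` keys; those key names, the 1.10 threshold, and the minimum sample size are all placeholders to tune against your own traffic.

```python
from collections import defaultdict
from statistics import median


def drifting_slices(calls: list[dict], threshold: float = 1.10,
                    min_samples: int = 3) -> dict:
    """Return {(template, language): median billed/estimated} for slices over threshold."""
    ratios = defaultdict(list)
    for call in calls:
        if call["estimated"] > 0:
            key = (call["template"], call["language"])
            ratios[key].append(call["billed"] / call["estimated"])

    flagged = {}
    for key, values in ratios.items():
        if len(values) < min_samples:
            continue  # too few samples to say anything about this slice
        mid = median(values)
        if mid > threshold:
            flagged[key] = mid
    return flagged


# Illustrative sample: English stays close, Japanese drifts.
sample = [
    {"template": "support", "language": "en", "estimated": 1000, "billed": 1010},
    {"template": "support", "language": "en", "estimated": 1000, "billed": 1020},
    {"template": "support", "language": "en", "estimated": 1000, "billed": 1000},
    {"template": "support", "language": "ja", "estimated": 1000, "billed": 1400},
    {"template": "support", "language": "ja", "estimated": 1000, "billed": 1350},
    {"template": "support", "language": "ja", "estimated": 1000, "billed": 1500},
]
alerts = drifting_slices(sample)
```

Comparing this week's flagged slices with last week's separates the two causes: a slice whose ratio jumped while its content stayed the same points at the provider, while a newly populated slice points at your own traffic mix.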
Treat the SDK tokenizer version as part of your dependency contract. Pin it, log it on every call, and treat an upgrade as a change worthy of an eval pass — not because the tokenizer logic itself broke anything semantic, but because every cost forecast and every budget guard downstream of it just shifted under your feet. The same discipline you apply to model version pinning belongs on the tokenizer.
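Logging the tokenizer version on every call takes only the standard library. The sketch below uses `importlib.metadata.version` to read the installed package version; the package name `tiktoken` is just an example, and `call_log_fields` is a hypothetical helper whose field names you would adapt to your own logging schema.

```python
from importlib.metadata import PackageNotFoundError, version


def tokenizer_version(package: str = "tiktoken") -> str:
    """Installed version of the tokenizer package, or a marker if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return "not-installed"


def call_log_fields(model: str, estimated_input: int,
                    package: str = "tiktoken") -> dict:
    """Fields to attach to every call log so estimates can be tied to a tokenizer build."""
    return {
        "model": model,
        "estimated_input": estimated_input,
        "tokenizer_package": package,
        "tokenizer_version": tokenizer_version(package),
    }


fields = call_log_fields("example-model", 4200)
```

With the version in every log line, a shift in the billed-to-estimated ratio can be checked against the date the dependency changed before anyone blames the provider.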
Make the provider's count_tokens endpoint your pre-flight oracle when it exists, and your local library only when it does not. Anthropic and AWS Bedrock both offer pre-flight counting endpoints that run on the provider's side rather than in your process. Anthropic's documentation still describes the result as an estimate, so returned usage remains the final word. These endpoints are slower than a local library and they cost you a network round trip, but the gap between estimate and invoice closes substantially when the estimate is computed by the provider's own code. Use them for the high-stakes path (admission control for large requests, budget gates on expensive tools) and accept the local approximation only for the cheap, high-volume cases where the round trip would dominate latency.
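The routing described above can be sketched with the two counters passed in as plain callables. With Anthropic's Python SDK the remote counter would typically wrap `client.messages.count_tokens(...)` and read `input_tokens` from the result; that wiring is left out so the sketch stays self-contained. The threshold, the margin, and the fallback behaviour are assumptions, not recommendations.

```python
import math
from typing import Callable

Counter = Callable[[str], int]


def preflight_count(prompt: str, local_count: Counter, remote_count: Counter,
                    local_margin: float = 1.2,
                    high_stakes_threshold: int = 2000) -> tuple[int, str]:
    """Return (token estimate, source) using the remote oracle only for large requests."""
    rough = local_count(prompt)
    padded = math.ceil(rough * local_margin)
    if rough < high_stakes_threshold:
        return padded, "local"  # cheap path: skip the network round trip
    try:
        return remote_count(prompt), "remote"
    except OSError:  # covers ConnectionError and TimeoutError
        return padded, "local-fallback"


# Stand-in counters for illustration; neither is a real tokenizer.
def words(prompt: str) -> int:
    return len(prompt.split())


def fake_remote(prompt: str) -> int:
    return 3000


def broken_remote(prompt: str) -> int:
    raise TimeoutError("count endpoint timed out")


small = preflight_count("a b c", words, fake_remote)
large = preflight_count("w " * 2500, words, fake_remote)
degraded = preflight_count("w " * 2500, words, broken_remote)
```

Returning the source alongside the count lets the ledger record which estimates came from the provider and which from the local approximation, so the two can be audited separately.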
The Harder Problem: Two Tokenizers, Both Official
The deepest version of this problem is the one you cannot fix by reaching for the official library. The "official" tokenizer ships in two places — the one bundled with the SDK and the one running in the serving stack — and they can disagree by version. The SDK build cadence is governed by the open-source repo and the package release schedule. The serving stack build cadence is governed by a rolling deployment across the provider's fleet. There is no contract that says they must match on a given day.
A typical pattern: a provider rolls out a tokenizer change to address a quality issue in their generation pipeline. The serving fleet picks it up within a deploy window. The published library on PyPI or npm picks it up on the next release, days or weeks later. Anyone running their pre-flight count against the published library during that window is, briefly, counting against a tokenizer that no longer exists on the server. The drift is small. It is also durable, because if you cached cost estimates or set budget caps during that window, the caches and caps are wrong until you rebuild them.
The takeaway is not that you should avoid local tokenizers — they are still by far the cheapest way to bound a request before sending it. The takeaway is that you should never let your system make an irreversible decision (admit, reject, charge, allocate) on a local estimate alone. The provider's returned usage closes the loop. If your architecture has no place to feed that returned usage back into the budget logic, you have built a system that is structurally incapable of reconciling itself with the invoice.
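One way to give returned usage a place in the budget logic is a reserve-then-settle ledger: the local estimate only places a provisional hold, and the billed count replaces it once the response arrives. The sketch below is a single-process illustration with hypothetical names (`TokenBudget`, `reserve`, `settle`) and an arbitrary 1.25 margin; a production version would need persistence and concurrency control.

```python
import math


class TokenBudget:
    """Per-tenant budget where estimates hold capacity and billed usage settles it."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.settled = 0                      # tokens confirmed by provider usage
        self._holds: dict[str, int] = {}      # provisional holds keyed by call id

    @property
    def remaining(self) -> int:
        return self.limit - self.settled - sum(self._holds.values())

    def reserve(self, call_id: str, estimate: int, margin: float = 1.25) -> bool:
        """Place a provisional hold; refuse if the padded estimate does not fit."""
        hold = math.ceil(estimate * margin)
        if hold > self.remaining:
            return False
        self._holds[call_id] = hold
        return True

    def settle(self, call_id: str, billed: int) -> int:
        """Replace the hold with the billed count; return billed minus hold."""
        hold = self._holds.pop(call_id)
        self.settled += billed
        return billed - hold


budget = TokenBudget(limit=10_000)
admitted = budget.reserve("call-1", estimate=4200)
overrun = budget.settle("call-1", billed=6800)
second_admitted = budget.reserve("call-2", estimate=4200)
```

The first call is admitted on its estimate, settles 1,550 tokens above its hold, and that overrun is what causes the second call to be refused. A budget driven by estimates alone would have admitted both.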
The Engineering Discipline This Asks For
Token counting looks like a measurement problem. It is really a contract problem. The tokens you send and the tokens you are billed for are governed by code on the other side of a network boundary that you do not control, and the library you imported is an approximation of that code that is allowed to drift.
The teams that handle this well have three habits in common. They log the provider's returned usage alongside their own estimate on every call. They alert on the ratio between the two, not on the absolute value of either. And they treat tokenizer pinning, prompt-template changes, and provider model upgrades as cost-affecting events that deserve the same review discipline as a database migration — because each of them can move a line item by a percentage that compounds across the entire request volume.
The cost discrepancy is fixable. What is not fixable, once you have spent six months building on top of an unverified local estimate, is the institutional confidence that your AI bill matches your AI logs. Build the reconciliation loop early, while the volumes are small and the gaps are still negotiable. Once the line item is large enough for finance to ask about, the answer they want is not "we think the provider's tokenizer changed" — it is a per-tenant, per-month chart of estimate-versus-actual that you can hand them without flinching.
- https://github.com/openai/tiktoken/issues/474
- https://community.openai.com/t/discrepancy-in-token-counts-between-tiktoken-and-api-usage-for-o4-mini-gpt-4o-mini/1271170
- https://platform.claude.com/docs/en/build-with-claude/token-counting
- https://docs.anthropic.com/en/api/messages-count-tokens
- https://www.propelcode.ai/blog/token-counting-tiktoken-anthropic-gemini-guide-2025
- https://galileo.ai/blog/tiktoken-guide-production-ai
- https://portkey.ai/blog/tracking-llm-token-usage-across-providers-teams-and-workloads/
- https://flexprice.io/blog/how-to-meter-llm-tokens-usage-for-billing
- https://arxiv.org/html/2508.00912v1
- https://docs.aws.amazon.com/bedrock/latest/userguide/count-tokens.html
