
Thinking Tokens Are Invisible in Your Logs and Loud on Your Bill

Tian Pan · Software Engineer · 9 min read

The first person to notice your reasoning-model regression is almost never on the engineering team. It is the finance analyst who pings your manager on a Tuesday afternoon because the previous month's Anthropic invoice came in 2.4x higher than the prior one, and "we didn't ship anything that should have done that." You open the dashboard, look at request volume — flat. Latency p99 — flat. Output tokens per response — flat. Error rate — flat. Every panel you wired up six months ago says the system is healthy. Finance is looking at a different number, and they are right.

The number they are looking at is reasoning tokens, and most observability stacks were built before the field existed.

When a reasoning model answers a question, the output it bills you for has two halves: the visible response that ends up in your application's response field, and the chain-of-thought it ran through before producing that response. On Claude's extended-thinking API, that chain-of-thought arrives in a separate thinking content block; on OpenAI's Responses API, it's tucked under usage.output_tokens_details.reasoning_tokens. Both are billed at the same rate as output tokens. Neither shows up if your logger is reading choices[0].message.content and counting characters.
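To make the two shapes concrete, here is a minimal sketch of where each provider reports those tokens, assuming the current Python SDKs; the model IDs and prompt are placeholders, not recommendations.

```python
from openai import OpenAI
import anthropic

# OpenAI Responses API: reasoning tokens live in a usage sub-field.
openai_client = OpenAI()
resp = openai_client.responses.create(model="o3", input="Plan a migration.")
print(resp.usage.output_tokens)                           # billed total
print(resp.usage.output_tokens_details.reasoning_tokens)  # the hidden half

# Claude extended thinking: the chain-of-thought arrives as a separate
# content block, and usage.output_tokens already includes it.
anthropic_client = anthropic.Anthropic()
msg = anthropic_client.messages.create(
    model="claude-opus-4-20250514",  # placeholder model ID
    max_tokens=16_000,
    thinking={"type": "enabled", "budget_tokens": 8_000},
    messages=[{"role": "user", "content": "Plan a migration."}],
)
thinking_blocks = [b for b in msg.content if b.type == "thinking"]
print(msg.usage.output_tokens)  # thinking + visible, billed together
```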

The gap between what you log and what you pay for is where the surprise lives. A complex query on Claude Opus with extended thinking turned on can legitimately spend 20,000 to 40,000 thinking tokens before emitting 500 visible tokens. At Opus 4.7 output rates of $25 per million tokens, that is fifty cents to a dollar of invisible spend per call. Multiply by traffic. The order-of-magnitude shift between what your dashboard reports and what shows up on the invoice is not a billing system error — it is your instrumentation reading the wrong field.
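The arithmetic as a sketch; the $25/M rate is the one quoted above, and the traffic figure is hypothetical:

```python
OUTPUT_RATE_PER_MTOK = 25.00  # dollars per million output tokens, as above

def invisible_spend(reasoning_tokens: int) -> float:
    """Dollars of reasoning spend a visible-output logger never sees."""
    return reasoning_tokens / 1_000_000 * OUTPUT_RATE_PER_MTOK

print(invisible_spend(20_000))  # 0.5  -> fifty cents per call
print(invisible_spend(40_000))  # 1.0  -> a dollar per call

calls_per_day = 10_000  # hypothetical traffic
print(invisible_spend(30_000) * calls_per_day)  # 7500.0 dollars per day
```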

Why Most Observability Stacks Miss It

The reason this gap exists is historical. When teams instrumented their first LLM calls in 2023 and 2024, output tokens meant one thing: the tokens that came back in the response body. The mental model — "input goes in, output comes back, count both" — matched what the APIs returned. The OpenAI usage object had completion_tokens, the Anthropic usage object had output_tokens, and that was the whole story.

Reasoning models broke that mental model. When OpenAI shipped the o-series and Anthropic shipped extended thinking, both providers added a new sub-field — reasoning_tokens on OpenAI, the thinking content block on Claude — and quietly rolled those tokens into the output token count for billing. The total billed output is the sum of visible output plus reasoning. The visible-output field your logger has been pulling for two years still exists, still parses, still emits a number — but that number is no longer the number on your bill.

The OpenTelemetry GenAI semantic conventions caught up. There is now a defined attribute gen_ai.usage.reasoning.output_tokens that maps to usage.output_tokens_details.reasoning_tokens. The opentelemetry-instrumentation-openai-v2 package emits it. The Python and TypeScript SDKs from most providers expose it. The data is there. What is missing in most production stacks is the wiring — somebody has to write the line of instrumentation code that pulls reasoning_tokens out of the response and attaches it to the same span as input_tokens and output_tokens. Until that happens, the span looks complete, the dashboard looks healthy, and the regression hides in the gap.
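What that one line of wiring looks like, as a sketch against the OpenTelemetry Python API and the Responses usage shape; record_usage is a hypothetical helper, and the attribute name follows the convention named above:

```python
from openai import OpenAI
from opentelemetry import trace

client = OpenAI()
tracer = trace.get_tracer("llm.instrumentation")

def record_usage(span, usage) -> None:
    # The two fields most stacks already attach.
    span.set_attribute("gen_ai.usage.input_tokens", usage.input_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", usage.output_tokens)
    # The missing line: pull reasoning tokens onto the same span.
    details = getattr(usage, "output_tokens_details", None)
    if details is not None and details.reasoning_tokens is not None:
        span.set_attribute("gen_ai.usage.reasoning.output_tokens",
                           details.reasoning_tokens)

with tracer.start_as_current_span("openai.responses") as span:
    resp = client.responses.create(model="o3", input="Summarize the incident.")
    record_usage(span, resp.usage)
```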

The Three Failure Modes That Surprise Finance

The pattern repeats often enough that it deserves names. There are three distinct ways a team gets blindsided by reasoning-token spend, and each one needs a different defensive move.

The first is the silent prompt regression. An engineer tunes a system prompt — adds a new instruction, changes the response format, tightens the schema. The visible output tokens are the same. The latency is barely different. But the change happened to nudge the model into longer reasoning chains, and per-call thinking tokens go from 1,800 to 5,400. The eval scores look fine, the visible-output dashboard looks fine, and the cost per call has tripled. The change merges. Finance notices in 30 days.
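The defensive move here is a budget gate in the eval pipeline, not just a score check. A hypothetical sketch, where usages is the list of provider usage objects collected during the eval run:

```python
# Score evals can pass while reasoning drifts, so assert the
# reasoning-token budget explicitly in CI.
BASELINE_REASONING_PER_CALL = 1_800  # measured before the prompt change
MAX_DRIFT = 1.5                      # fail at +50%, well before 3x

def reasoning_budget_gate(usages) -> None:
    mean = sum(u.output_tokens_details.reasoning_tokens
               for u in usages) / len(usages)
    if mean > BASELINE_REASONING_PER_CALL * MAX_DRIFT:
        raise AssertionError(
            f"reasoning tokens/call {mean:.0f} exceeds budget "
            f"{BASELINE_REASONING_PER_CALL * MAX_DRIFT:.0f}"
        )
```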

The second is the model migration cliff. Your team upgrades from a non-reasoning model to a reasoning one — Sonnet to Opus with thinking enabled, or GPT-4o to GPT-5 with reasoning effort set to medium. The migration test plan checks for correctness, latency, and visible output length. It does not check for thinking-token consumption, because the prior model did not have any. You ship the upgrade on a Friday. By the following Monday, your per-call spend has gone up 8x on a workload where the team budgeted for 1.5x.
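A migration test plan closes this hole with one more assertion: replay a sample of production prompts against both models and compare billed output tokens, not visible text length. A sketch, with run standing in for whatever harness sends a prompt to a model and returns the provider usage object:

```python
# Billed output already folds reasoning into output_tokens on both
# providers, so output_tokens, not len(text), is the number to compare.
def migration_cost_ratio(run, prompts, old_model, new_model) -> float:
    old = sum(run(old_model, p).output_tokens for p in prompts)
    new = sum(run(new_model, p).output_tokens for p in prompts)
    return new / old

# Budgeted for 1.5x? Then 8x should never survive review:
# assert migration_cost_ratio(run, sample_prompts, "gpt-4o", "gpt-5") < 1.5
```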

The third is the agentic loop blow-up. An agent does five turns of tool calls. Each turn invokes a reasoning model. Each call's reasoning tokens are modest individually — maybe 1,200 tokens per turn — but the previous turn's reasoning content gets appended back into the next turn's context. By turn five, you are paying for reasoning over reasoning. A workflow that looks like "five small calls" is actually one call's worth of input followed by a quadratic accumulation of thinking. The bill is shocking; the per-call dashboard, which averages across turns, hides the shape.
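The shape is easy to see with the scenario's own numbers: if each turn's ~1,200 reasoning tokens are appended back into the next turn's context, the input side grows triangularly while the per-turn output stays flat.

```python
REASONING_PER_TURN = 1_200
TURNS = 5

carried = 0       # reasoning tokens accumulated in context so far
reread_input = 0  # input tokens spent re-reading earlier reasoning
for _ in range(TURNS):
    reread_input += carried        # this turn re-reads all prior reasoning
    carried += REASONING_PER_TURN  # and adds its own to the context

print(REASONING_PER_TURN * TURNS)  # 6000 reasoning tokens billed as output
print(reread_input)                # 12000 extra input: 0+1200+2400+3600+4800
```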

In all three cases the regression existed in the data the provider was returning. The team simply was not looking at the right field.
