9 posts tagged with "reasoning-models"

Thinking Tokens Are Invisible in Your Logs and Loud on Your Bill

· 9 min read
Tian Pan
Software Engineer

The first person to notice your reasoning-model regression is almost never on the engineering team. It is the finance analyst who pings your manager on a Tuesday afternoon because last month's Anthropic invoice came in at 2.4x the one before, and "we didn't ship anything that should have done that." You open the dashboard and look at request volume: flat. Latency p99: flat. Output tokens per response: flat. Error rate: flat. Every panel you wired up six months ago says the system is healthy. Finance is looking at a different number, and they are right.

The number they are looking at is reasoning tokens, and most observability stacks were built before the field existed.
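As a minimal sketch of what "instrumenting the missing field" could look like, the snippet below keeps reasoning tokens as their own metric instead of letting them disappear behind the visible output count. The `TokenUsage` fields and the sample numbers are assumptions for illustration; map them from whatever usage payload your provider actually returns, and check whether your provider bills reasoning tokens as output before relying on `billed_output_tokens`.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    # Hypothetical field names, not any provider's real schema.
    input_tokens: int
    visible_output_tokens: int
    reasoning_tokens: int


def usage_metrics(usage: TokenUsage) -> dict:
    """Build dashboard metrics that keep reasoning tokens visible."""
    # Assumption: reasoning tokens are billed alongside visible output tokens.
    billed_output = usage.visible_output_tokens + usage.reasoning_tokens
    share = usage.reasoning_tokens / billed_output if billed_output else 0.0
    return {
        "input_tokens": usage.input_tokens,
        "visible_output_tokens": usage.visible_output_tokens,
        "reasoning_tokens": usage.reasoning_tokens,
        "billed_output_tokens": billed_output,
        "reasoning_share": share,
    }


# Illustrative numbers: the visible output is identical in both months,
# so an "output tokens per response" panel stays flat.
before = usage_metrics(TokenUsage(1200, 500, 1000))
after = usage_metrics(TokenUsage(1200, 500, 4000))
print(before["billed_output_tokens"], after["billed_output_tokens"])
```

With these made-up inputs the visible output never moves while the billed output triples, which is the shape of regression that only the invoice would otherwise reveal.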

Reasoning-Model Arbitrage: The Slow Expensive Model Is Cheaper on the Hard Prompts

· 10 min read
Tian Pan
Software Engineer

The cheapest line on the pricing page is rarely the cheapest line on the invoice. A team picks the workhorse model — Sonnet, Haiku, Flash, GPT-mini — because the per-token math is friendly, ships a feature, and watches the cost dashboard report a happy unit-economics story for a quarter. Then the long tail catches up: a slice of requests the workhorse can't quite handle starts retrying, then partially answering, then escalating to a human reviewer, and the per-feature P&L stops resembling the per-call dashboard.

The arbitrage is that, on those hard requests, a reasoning model the team would never default to — Opus, o3, the slow expensive one — frequently lands the answer on the first attempt. The all-in cost of one $0.50 reasoning call beats five $0.05 workhorse calls plus the escalation queue and the engineer who debugs the failure on Monday. The procurement question (which model is cheapest per token?) and the architecture question (which model is cheapest per resolved request?) are different questions, and the team that conflates them is paying the difference.
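The per-resolved-request comparison can be sketched as a small calculation. The call prices and attempt counts come from the paragraph above; the escalation rates and the cost of one human escalation are assumptions chosen only to make the arithmetic concrete.

```python
def cost_per_resolved_request(call_cost, attempts, escalation_rate, escalation_cost):
    """Expected all-in cost of resolving one hard request, in dollars."""
    model_spend = call_cost * attempts
    expected_escalation = escalation_rate * escalation_cost
    return model_spend + expected_escalation


ESCALATION_COST = 4.00  # assumed cost of one human review, in dollars

# Workhorse: five $0.05 attempts, and (assumed) 30% still escalate.
workhorse = cost_per_resolved_request(0.05, 5, 0.30, ESCALATION_COST)

# Reasoning model: one $0.50 attempt, and (assumed) 2% escalate.
reasoning = cost_per_resolved_request(0.50, 1, 0.02, ESCALATION_COST)

# Model spend alone, which is all a per-call dashboard sees.
workhorse_calls_only = 0.05 * 5

print(f"workhorse ${workhorse:.2f} vs reasoning ${reasoning:.2f}")
```

Note that on model spend alone the workhorse still looks cheaper ($0.25 against $0.50). Under these assumptions the ordering only flips once the escalation cost is counted, which is why the per-call dashboard and the per-feature P&L can disagree.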

The Show Your Work UX Trap: When the Reasoning Trace Is Debug Output Wearing a Product Costume

· 11 min read
Tian Pan
Software Engineer

A reasoning model emits a chain-of-thought trace because that is how it computes. A product team renders that trace in the UI because hiding it feels like throwing away tokens the user paid for. Those are two different decisions, and almost nobody on the product side notices they made the second one. The trace becomes a panel, the panel becomes a feature, the feature gets a docs page, and six months later someone in a quarterly review asks why the support queue is full of users arguing with the reasoning instead of the answer.

The trace is debug output. It exists for engineers who need to know why the model picked one tool, hedged on a date, or quietly switched personas mid-paragraph. Pushing it to the end user without a design pass is the AI-product equivalent of leaving console.log calls in production and calling them "transparency." It looks like a feature, it costs almost nothing to render, and it quietly degrades trust in ways that don't show up in any of the dashboards the team built.

Reasoning-Effort Budgeting: When Thinking Tokens Become a Finance Line Item

· 11 min read
Tian Pan
Software Engineer

The first time your finance team asks why a single user racked up a fifty-cent answer to a one-tenth-of-a-cent question, the call will not be about the model. It will be about the line on the invoice that did not exist twelve months ago: reasoning tokens. They look like output tokens on the bill, they bill at output-token rates on most providers, and they have no natural ceiling. A query that would have produced a four-hundred-token reply on a non-reasoning model can quietly burn eight thousand internal thinking tokens to get there — and the only person who notices is the one reconciling the spend.

For most of the API era, "tokens used" was an honest number. You sent a prompt in, you got a response out, and the bill was a clean function of both. Reasoning models broke that intuition. The model now generates a hidden, billable chain of thought that stays internal before it emits the answer the caller will read, and the size of that chain depends on the model's own assessment of how hard the question was. The user-visible output may be a single sentence. The bill may be for ten pages.
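To make the gap between the visible reply and the bill concrete, here is a rough worked calculation using the four-hundred-token reply and eight thousand thinking tokens mentioned earlier. The price per million output tokens is an assumed placeholder, not a quoted rate, and the sketch assumes thinking tokens bill at the output rate.

```python
def output_cost_usd(visible_tokens, reasoning_tokens, usd_per_million_output):
    """Output-side cost when reasoning tokens bill at the output rate."""
    billed_tokens = visible_tokens + reasoning_tokens
    return billed_tokens * usd_per_million_output / 1_000_000


RATE = 15.0  # assumed dollars per million output tokens

plain = output_cost_usd(400, 0, RATE)         # non-reasoning reply
reasoning = output_cost_usd(400, 8000, RATE)  # same reply plus hidden thinking
multiplier = reasoning / plain

print(f"${plain:.4f} vs ${reasoning:.4f} ({multiplier:.0f}x)")
```

The multiplier does not depend on the assumed rate: 8,400 billed tokens against 400 is 21x for the same visible answer.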

The Reasoning-Model Tax at Tool Boundaries

· 10 min read
Tian Pan
Software Engineer

Extended thinking wins benchmarks on novel reasoning. At a tool boundary — the moment your agent has to pick which function to call, when to call it, and what arguments to pass — that same thinking budget often makes things worse. The model weighs three equivalent tools that a fast model would have disambiguated in one token. It manufactures plausible-sounding ambiguity where none existed. It burns a thousand reasoning tokens to second-guess the obvious search call, then calls search anyway. You paid the reasoning tax on a decision that didn't need reasoning.

This is the quiet cost center of agentic systems in 2026: not the reasoning model itself, which is priced fairly for what it does well, but the reasoning model deployed at the wrong step of the loop. The anti-pattern hides in plain sight because the top-of-loop task looks hard ("answer the user's question"), so teams wrap the entire loop in high-effort thinking mode and never notice that 80% of the thinking budget is being spent deliberating on tool-choice micro-decisions the model already got right on its first instinct.
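One way to read the paragraph above is that thinking effort should be chosen per step rather than per loop. The sketch below is a hypothetical illustration of that idea: the step names and the "low"/"high" effort labels are invented here and do not correspond to any particular provider's parameter.

```python
from enum import Enum


class Step(Enum):
    # Hypothetical step kinds in an agent loop.
    TOOL_CHOICE = "tool_choice"
    ARGUMENT_FILL = "argument_fill"
    FINAL_SYNTHESIS = "final_synthesis"


# Assumed policy: spend thinking only where the step is actually hard.
EFFORT_BY_STEP = {
    Step.TOOL_CHOICE: "low",
    Step.ARGUMENT_FILL: "low",
    Step.FINAL_SYNTHESIS: "high",
}


def effort_for(step: Step) -> str:
    """Pick a thinking-effort label for one step, defaulting to low."""
    return EFFORT_BY_STEP.get(step, "low")


# A toy loop: two tool calls, then the final answer.
plan = [Step.TOOL_CHOICE, Step.ARGUMENT_FILL, Step.TOOL_CHOICE,
        Step.ARGUMENT_FILL, Step.FINAL_SYNTHESIS]
efforts = [effort_for(step) for step in plan]
print(efforts)
```

In this toy plan only one of five calls runs at high effort, in contrast to wrapping the whole loop in high-effort mode.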

Time-to-First-Token Is the Latency SLO You Aren't Instrumenting

· 11 min read
Tian Pan
Software Engineer

Pull the last week of production traces and look at your latency dashboard. You almost certainly have p50 and p99 on total request latency. You probably have token throughput. You may even have a tokens-per-second chart, because a provider benchmark talked you into it. What you almost certainly do not have is a per-model, per-route, per-tenant histogram of time to first token — the single number that governs how fast your product feels.

This is not a small oversight. For any streaming interface — chat, code completion, agent sidebars, voice — perceived speed is set by how long the user stares at a blinking cursor before anything appears. Once the first token lands, the user is reading; subsequent tokens compete with their reading speed, not with their patience. Total latency matters for throughput planning and budget. TTFT matters for whether the product feels alive.

The gap between these two numbers is widening. Reasoning models can produce identical total latency to their non-reasoning siblings while pushing TTFT from 400 ms to 30 seconds. A routing change that "keeps latency flat" can silently turn a snappy assistant into a hanging window. If you are not graphing TTFT, you are shipping UX regressions you cannot see.
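Measuring TTFT separately from total latency takes very little code. The sketch below wraps any iterable of streamed text chunks and records both numbers; `fake_stream` is a stand-in invented for this example, with a short sleep playing the role of a reasoning phase. With a real client you would pass the provider's streaming iterator instead.

```python
import time


def measure_stream(chunks, clock=time.monotonic):
    """Consume a chunk stream; return (text, ttft_seconds, total_seconds)."""
    start = clock()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        # TTFT is the arrival time of the first non-empty chunk.
        if first_chunk_at is None and chunk:
            first_chunk_at = clock()
        parts.append(chunk)
    end = clock()
    ttft = first_chunk_at - start if first_chunk_at is not None else float("nan")
    return "".join(parts), ttft, end - start


def fake_stream(think_seconds, tokens, gap_seconds):
    """Hypothetical stream: silent 'thinking' delay, then steady tokens."""
    time.sleep(think_seconds)
    for token in tokens:
        yield token
        time.sleep(gap_seconds)


text, ttft, total = measure_stream(fake_stream(0.05, ["a", "b"], 0.01))
print(f"ttft={ttft:.3f}s total={total:.3f}s")
```

Emitting `ttft` as its own histogram, tagged by model, route, and tenant, is what lets a routing change that keeps total latency flat still show up as a regression.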

Prompting Reasoning Models Differently: Why Your Existing Patterns Break on o1, o3, and Claude Extended Thinking

· 10 min read
Tian Pan
Software Engineer

Most teams adopting reasoning models do the same thing: they copy their existing system prompt, point it at o1 or Claude Sonnet with extended thinking, and assume the model upgrade will do the rest. Benchmarks improve. Production accuracy stays flat — or drops. The issue isn't the model. It's that the mental model for prompting never changed.

Reasoning models don't work like instruction-following models. The strategies that squeeze performance out of GPT-4o — elaborate system prompts, carefully curated few-shot examples, explicit "think step by step" instructions — were designed for a different inference architecture. Applied to reasoning models, they constrain the exact thing that makes these models valuable.

This post is a practical guide to the differences that matter and the adjustments that actually work.

The Reasoning Model Premium in Agent Loops: When Thinking Pays and When It Doesn't

· 10 min read
Tian Pan
Software Engineer

Here is a number that should give you pause before adopting a reasoning model for your agent: a single query that costs 7 tokens with a standard fast model costs 255 tokens with Claude extended thinking and 603 tokens with an aggressively-configured reasoning model. For an isolated chatbot query, that is manageable. But inside an agent loop that calls the model twelve times per task, you are not paying that premium once (roughly 36x to 86x on these figures). You are paying it twelve times, compounded further by the growing context window that gets re-fed on every turn. Billing surprises have killed agent projects faster than accuracy problems.
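The multiplication is easy to check. This sketch uses the per-call token counts quoted above and the twelve calls per task; it deliberately leaves out the growing context that gets re-fed each turn, so the totals are a floor rather than an estimate.

```python
# Per-call output tokens for the same query, as quoted above.
per_call_tokens = {
    "fast": 7,
    "extended_thinking": 255,
    "aggressive_reasoning": 603,
}
CALLS_PER_TASK = 12

# Tokens generated across one task's loop, ignoring context re-feeding.
totals = {name: n * CALLS_PER_TASK for name, n in per_call_tokens.items()}

# Per-call multiple relative to the fast model.
premium = {name: n / per_call_tokens["fast"] for name, n in per_call_tokens.items()}

for name in per_call_tokens:
    print(f"{name}: {totals[name]} tokens per task, {premium[name]:.1f}x per call")
```

On these figures the per-call multiple is about 36x for extended thinking and about 86x for the aggressive configuration, and the loop pays it on each of the twelve calls.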

The question is not whether reasoning models are better. On hard tasks, they clearly are. The question is whether they are better for your specific workload, at your specific position in the agent loop, and by a margin that justifies the cost. Most teams answer this incorrectly in both directions — they either apply reasoning models uniformly (burning budget on tasks that don't need them) or avoid them entirely (leaving accuracy gains on the table for the tasks that do).

When Thinking Models Actually Help: A Production Decision Framework for Inference-Time Compute

· 10 min read
Tian Pan
Software Engineer

There is a study where researchers asked a reasoning model to compare two numbers: 0.9 and 0.11. One model took 42 seconds to answer. The math took a millisecond. The model spent the rest of those 42 seconds thinking, and thinking badly. It re-examined its answer, doubted itself, reconsidered, and arrived at the correct conclusion it had already reached in its first three tokens.

This is the overthinking problem, and it is not a corner case. It is what happens when you apply inference-time compute indiscriminately to tasks that don't need it.

The emergence of reasoning models — o1, o3, DeepSeek R1, Claude with extended thinking — represents a genuine capability leap for hard problems. It also introduces a new class of production mistakes: deploying expensive, slow deliberation where fast, cheap generation was perfectly adequate. Getting this decision right is increasingly central to building AI systems that actually work.