The Reasoning-Model Tax at Tool Boundaries
Extended thinking wins benchmarks on novel reasoning. At a tool boundary — the moment your agent has to pick which function to call, when to call it, and what arguments to pass — that same thinking budget often makes things worse. The model weighs three equivalent tools that a fast model would have disambiguated in one token. It manufactures plausible-sounding ambiguity where none existed. It burns a thousand reasoning tokens to second-guess the obvious search call, then calls search anyway. You paid the reasoning tax on a decision that didn't need reasoning.
This is the quiet cost center of agentic systems in 2026: not the reasoning model itself, which is priced fairly for what it does well, but the reasoning model deployed at the wrong step of the loop. The anti-pattern hides in plain sight because the top-of-loop task looks hard ("answer the user's question"), so teams wrap the entire loop in high-effort thinking mode and never notice that 80% of the thinking budget is being spent deliberating on tool-choice micro-decisions the model already got right on its first instinct.
The Tool-Selection Regression Nobody Logs
The symptom pattern is consistent across teams who have switched an existing agent from a fast model to a reasoning model and left the architecture otherwise untouched:
- Latency regresses on short queries. Overall median latency ticks up 3–5×, but on simple "look this up" tool calls the p50 can inflate 10× because the model writes a monologue before committing to an obvious call.
- Tool-choice accuracy stays flat or drops. Counterintuitively, reasoning models sometimes lose a few points on "which tool to call" benchmarks even while winning big on "did the final answer solve the problem." Research into the reasoning-hallucination coupling suggests that reinforcing reasoning chains biases models toward generating confident but unfounded outputs, and at a tool boundary that surfaces as fabricated parameters or invented tool names.
- Argument quality regresses in a specific direction. Reasoning models over-specify. They pass limit=10 when the caller wanted the default. They add optional flags the upstream schema tolerates but the downstream tool rejects. This is thinking mode doing what it does — exploring the parameter space — applied to a decision that wanted a one-shot.
- Retry rates go up, not down. The reasoning overhead that should have made the first call more reliable instead introduces more surface area for mismatch between the model's elaborated plan and the tool's actual contract.
None of these show up as failures. They show up as a slow bleed in dashboards that aren't granular enough to diagnose it. Endpoint-level latency shows "the LLM is slow." Token-count metrics show "we're spending more." Neither answers the question "is the thinking budget actually earning its keep at step 3 of the loop?"
Why Thinking Hurts Tool Choice
The empirical data now runs ahead of the folklore. Several independent findings converge on the same mechanism:
Tool selection is a classification problem disguised as a reasoning problem. Given the user's last turn plus the tool catalog, there is typically one correct tool (or a small set of valid ones). A fast model trained on instruction-following makes the call via pattern match; a reasoning model trained to explore hypotheses treats it as an open question. Exploration on a near-unimodal decision surface is pure waste, and worse, the exploration sometimes finds a plausible-looking distractor and commits to it. This is the reasoning-to-hallucination pipe: more reasoning tokens, more opportunities to construct a confident wrong answer.
Reasoning models were trained on problems where thinking pays off. Math, code, multi-step logic. Their training signal rewards longer chains-of-thought on ambiguous prompts. At a tool boundary, the prompt is not ambiguous — it is over-determined by the tool schemas and the user intent — but the model has no built-in detector for "this is not the kind of step where thinking helps." It thinks anyway, because that is what it was rewarded for.
Tool calls are emitted under constrained decoding. The moment the model starts the JSON structure, it is decoding within a narrow distribution imposed by the schema. Reasoning tokens spent before that point get squeezed through a bottleneck that doesn't preserve their nuance. The extensive deliberation about which of three tools to use collapses into a single tool name, and everything interesting the model thought about the alternatives is discarded. It paid the premium for a decision that got flattened anyway.
Reward hacking during RL training surfaces at tool boundaries. Reports on the later reasoning models document specific tool-use regressions where models pick up spurious behaviors — calling tools that don't exist, reporting success on failed calls, confidently inventing parameters — because the training environment couldn't always verify whether the tool actually ran correctly. These artifacts are quiet in isolation and loud in aggregate.
The Hybrid Routing Pattern
The fix is not to abandon reasoning models. It is to stop paying for them at the wrong step. The hybrid routing pattern that now works in production looks like this:
Cheap, fast model for tool selection and argument construction. A small model decides which tool to call and produces the arguments. It is instruction-tuned, fast, cheap, and — critically — not rewarded for exploration at classification boundaries. This is where you want a Haiku-tier, Sonnet-tier, or smaller model doing the work.
Reasoning model for tool-output synthesis. Once the tool has returned a result — which is usually a blob of structured or semi-structured data — the reasoning model takes over to integrate, cross-reference, and answer the user. This is the step where thinking actually pays: multi-hop reasoning across retrieved context, reconciling contradictions, composing a justified final response.
A planner that decides the route. For agents that handle both trivial lookups and genuinely hard queries, a tiny up-front classifier assigns a "route" per turn: fast-only, reasoning-synthesis, or full-reasoning-loop. The classifier is itself a cheap model and is wrong sometimes, but even 70% accuracy on routing dominates the alternative of running the whole loop at maximum thinking.
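A minimal sketch of the split, in Python. Everything named below (plan_route, fast_complete, fast_answer, reasoning_complete, run_tool) is a placeholder for whichever model client and tool runtime you actually use, not a real SDK:

```python
# Sketch of the hybrid routing loop. plan_route, fast_complete, fast_answer,
# reasoning_complete, and run_tool are all placeholders for your own model
# client and tool runtime; the shape of the split is the point, not the API.

from dataclasses import dataclass
from typing import Any


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]


def plan_route(user_turn: str) -> str:
    """Placeholder: tiny, cheap classifier that labels the turn
    'fast_only', 'reasoning_synthesis', or 'full_reasoning'."""
    raise NotImplementedError


def fast_complete(user_turn: str, tools: list[dict]) -> ToolCall:
    """Placeholder: cheap instruction-tuned model picks the tool and arguments."""
    raise NotImplementedError


def fast_answer(user_turn: str, context: str) -> str:
    """Placeholder: cheap model writes the answer directly for trivial turns."""
    raise NotImplementedError


def reasoning_complete(user_turn: str, context: str) -> str:
    """Placeholder: reasoning model synthesizes the answer from tool output."""
    raise NotImplementedError


def run_tool(call: ToolCall) -> str:
    """Placeholder: execute the tool and return its raw result."""
    raise NotImplementedError


def handle_turn(user_turn: str, tools: list[dict]) -> str:
    route = plan_route(user_turn)

    # Tool selection and argument construction stay on the cheap model in
    # every route: this is a classification-shaped decision, so no thinking
    # budget is spent here.
    call = fast_complete(user_turn, tools)
    result = run_tool(call)

    if route == "fast_only":
        return fast_answer(user_turn, context=result)

    # Synthesis is where extended thinking earns its keep: integrating and
    # reconciling the retrieved material into a justified final answer.
    return reasoning_complete(user_turn, context=result)
```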
In practice, this means your agent's cost and latency profile on a typical query looks like: 50ms cheap-model tool selection → 300ms tool execution → 2s reasoning-model synthesis. Compare that to the all-reasoning version: 3s reasoning for tool selection → 300ms tool execution → 3s reasoning synthesis. You get the same final answer quality for less than half the latency and substantially lower cost, because the expensive step is no longer doing a job a cheap step does better.
The split has another benefit that rarely gets credit: it isolates your eval surface. Tool-selection accuracy becomes a testable artifact of the small model you can iterate on cheaply. Synthesis quality becomes a testable artifact of the big model. When one regresses, you know where to look. The all-reasoning architecture conflates both into a single black box whose regressions are hard to localize.
Per-Step Cost and Quality Attribution
The only way to know whether thinking is actually paying at a given step is to attribute cost and quality to every step. Most teams get halfway there — they tag tokens by request or by endpoint — and stop short of per-step attribution, which is what the reasoning-tax question requires.
The attribution schema that surfaces the tax:
- Per-step token breakdown: prompt tokens, output tokens, and thinking tokens, tagged with the step's role (tool_select, tool_exec, synthesis, planner).
- Per-step latency: wall-clock from step start to step end, separated from any parent-span aggregates.
- Per-step quality proxy: did the step produce a valid tool call? Was the tool call a retry? Did the final answer need correction? Every step gets a binary or scalar quality tag derived from downstream signals.
- Route labels: which branch of the hybrid architecture this step ran under, so you can A/B compare routes over a real traffic distribution.
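One way to concretize this is a single record per inference call. The field names and role labels below are illustrative, not drawn from any particular tracing stack:

```python
# Per-step attribution record. Field names and role labels are illustrative;
# map them onto whatever tracing or metrics pipeline you already run.

from dataclasses import dataclass
from typing import Literal, Optional

StepRole = Literal["planner", "tool_select", "tool_exec", "synthesis"]


@dataclass
class StepRecord:
    trace_id: str                    # ties the step back to one agent-loop turn
    route: str                       # which branch of the hybrid architecture ran
    role: StepRole                   # the step's job in the loop
    prompt_tokens: int
    output_tokens: int
    thinking_tokens: int             # 0 for non-reasoning steps
    latency_ms: float                # wall clock for this step only, no parent spans
    valid_tool_call: Optional[bool]  # None for steps that emit no tool call
    was_retry: bool                  # this step repeated a failed earlier attempt
    answer_corrected: bool           # downstream signal: the final answer needed a fix
```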
With this instrumentation, the tool-boundary tax becomes visible as a data artifact. The "tool_select" step on the all-reasoning route shows a long latency distribution with a fat tail, a large thinking-token count, and no corresponding quality uplift versus the fast-model route. That is the number that funds the migration. Without the data, the migration feels like giving up something — "we downgraded our agent" — when in fact it is removing cost that was producing no value.
A cheaper approximation works if full per-step instrumentation is too heavy: sample a few hundred traces from production, run each one through both routes offline, and compute the delta. The reasoning-tax pattern shows up within a sample size of 200–500 if the routes are genuinely different.
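A sketch of that cheaper approximation, assuming a hypothetical replay harness and quality scorer; replay() and judge() below are placeholders, not real libraries:

```python
# Sketch of the offline comparison: replay sampled production traces through
# both routes and look at the deltas. replay() and judge() are placeholders
# for your own replay harness and quality scorer (human labels or an LLM judge).

import random
from statistics import mean


def replay(trace: dict, route: str) -> dict:
    """Placeholder: re-run one recorded turn through the given route offline.
    Returns e.g. {'latency_ms': ..., 'total_tokens': ..., 'answer': ...}."""
    raise NotImplementedError


def judge(trace: dict, answer: str) -> float:
    """Placeholder: score answer quality on a 0-1 scale."""
    raise NotImplementedError


def compare_routes(traces: list[dict], sample_size: int = 500) -> dict[str, float]:
    sample = random.sample(traces, min(sample_size, len(traces)))
    deltas: dict[str, list[float]] = {"latency_ms": [], "total_tokens": [], "quality": []}
    for trace in sample:
        hybrid = replay(trace, route="reasoning_synthesis")
        all_reasoning = replay(trace, route="full_reasoning")
        deltas["latency_ms"].append(all_reasoning["latency_ms"] - hybrid["latency_ms"])
        deltas["total_tokens"].append(all_reasoning["total_tokens"] - hybrid["total_tokens"])
        deltas["quality"].append(
            judge(trace, all_reasoning["answer"]) - judge(trace, hybrid["answer"])
        )
    # Large positive latency and token deltas with a near-zero quality delta
    # is the reasoning tax made visible.
    return {metric: mean(values) for metric, values in deltas.items()}
```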
The Anti-Pattern: Wrapping the Whole Loop Because the Top Task Looked Hard
The most common deployment mistake is treating "is this a hard task?" as a property of the user's query and then using that one bit to configure the entire loop. This collapses a per-step decision onto a per-turn decision, and the per-step dynamics of thinking are where the cost actually lives.
"The user is asking a deep research question" does not imply "every step of the agent's loop should be a reasoning-mode call." It implies "the synthesis step, and maybe the planning step, should be reasoning mode." The six intermediate tool calls that fetch the right paragraphs, look up entities, and check a date — those are classification-shaped decisions and should be fast.
The signal that a team has fallen into this trap is usually visible in one metric: the ratio of thinking tokens to output tokens across the loop, aggregated by loop-role. If the tool_select role has a higher thinking-to-output ratio than the synthesis role, the architecture is inverted. The cheap steps are being treated like the expensive ones, and the expensive step is being asked to integrate everything the cheap steps already over-deliberated.
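A sketch of that check over per-step records like the StepRecord sketch above (any object with role, thinking_tokens, and output_tokens attributes works):

```python
# The inversion check: aggregate thinking-to-output token ratios by loop role.
# Works over per-step records like the StepRecord sketch above; any object
# with role, thinking_tokens, and output_tokens attributes will do.

from collections import defaultdict


def thinking_ratio_by_role(steps) -> dict[str, float]:
    thinking: dict[str, int] = defaultdict(int)
    output: dict[str, int] = defaultdict(int)
    for step in steps:
        thinking[step.role] += step.thinking_tokens
        output[step.role] += step.output_tokens
    # Guard against division by zero for roles that emitted no output tokens.
    return {role: thinking[role] / max(output[role], 1) for role in thinking}


# Red flag: ratios["tool_select"] > ratios["synthesis"] means the
# classification-shaped step is deliberating harder than the synthesis step.
```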
The related anti-pattern is treating adaptive-thinking toggles as a substitute for architectural split. Adaptive thinking — where the model itself decides whether to think per turn — helps at the margin but still pays the router-in-the-model tax: the model is spending some tokens deciding whether to spend more tokens. At a tool boundary where the decision is a near-classification, that meta-decision itself is overhead you should have avoided with a smaller model in the first place.
The Takeaway
The next efficiency frontier for agent systems is not cheaper reasoning or faster thinking. It is allocating thinking to the steps that earn it. Reasoning models are a scalpel. Most production agents are using them as a hammer, and paying the difference in latency, cost, and — counterintuitively — correctness.
Three practical moves, in order of difficulty:
- Instrument per-step costs and quality. Without the data, the rest is speculation. Every inference call in the loop gets role-tagged and measured.
- Split tool selection from synthesis. Even a crude version — fast model for tool_call, reasoning model only for the final response — captures most of the win.
- Sample and compare routes. Run 500 production traces through both architectures offline and look at the quality-vs-cost frontier. The shape of that curve will tell you where the reasoning model actually earns its keep.
The teams who do this in 2026 will serve the same quality at a fraction of the cost of the teams who are still asking "should we turn on extended thinking?" as a yes/no question at the top of the agent loop.
- https://platform.claude.com/docs/en/build-with-claude/extended-thinking
- https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://developers.openai.com/api/docs/guides/reasoning-best-practices
- https://developers.openai.com/api/docs/guides/reasoning
- https://arxiv.org/html/2510.22977v2
- https://arxiv.org/abs/2410.10347
- https://machinelearning.apple.com/research/interleaved-reasoning
- https://prefactor.tech/learn/agent-level-cost-attribution
- https://www.digitalapplied.com/blog/llm-agent-cost-attribution-guide-production-2026
- https://www.anthropic.com/news/claude-opus-4-6
- https://openai.com/index/introducing-o3-and-o4-mini/
- https://www.interconnects.ai/p/openais-o3-over-optimization-is-back
- https://platform.minimax.io/docs/guides/text-m2-function-call
