The Planning Tax: Why Your Agent Spends More Tokens Thinking Than Doing
Your agent just spent 0.12. If you've built agentic systems in production, this ratio probably doesn't surprise you. What might surprise you is where those tokens went: not into tool calls, not into generating the final answer, but into the agent reasoning about what to do next. Decomposing the task. Reflecting on intermediate results. Re-planning when an observation didn't match expectations. This is the planning tax — the token overhead your agent pays to think before it acts — and for most agentic architectures, it consumes 40–70% of the total token budget before a single useful action fires.
The planning tax isn't a bug. Reasoning is what separates agents from simple prompt-response systems. But when the cost of deciding what to do exceeds the cost of actually doing it, you have an engineering problem that no amount of cheaper inference will solve. Per-token prices have dropped roughly 1,000x since late 2022, yet total agent spending keeps climbing — a textbook Jevons paradox where cheaper tokens just invite more token consumption.
The Anatomy of Planning Overhead
To understand the planning tax, trace where tokens actually go in a standard ReAct agent loop. Each cycle follows the same pattern: the model generates a Thought (reasoning about the current state), selects an Action (choosing a tool and parameters), receives an Observation (the tool's output), and then reasons again about what to do next. Every step in this chain costs tokens, but the Thought steps are where the budget bleeds.
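The loop described above can be sketched in a few lines to show where tokens accumulate. This is an illustrative sketch only: `fake_model`, `TOOLS`, and the whitespace-based `count_tokens` are hypothetical stand-ins (not any framework's API) so the example runs without a model or an API key.

```python
from dataclasses import dataclass

def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())

# Hypothetical tool registry; a real agent would call external APIs here.
TOOLS = {"lookup": lambda q: f"result for {q}"}

def fake_model(history: str, step: int) -> dict:
    """Scripted stand-in for an LLM call: two tool steps, then a final answer."""
    if step < 2:
        return {"thought": f"I still need data so I will call lookup for item {step}",
                "action": "lookup", "arg": f"item{step}"}
    return {"thought": "I have everything I need", "action": "finish", "arg": "done"}

@dataclass
class Ledger:
    thought: int = 0   # output tokens spent on reasoning
    action: int = 0    # output tokens spent on the tool call itself
    reread: int = 0    # input tokens re-read across all model calls

def run_react(task: str, max_steps: int = 8) -> tuple[str, Ledger]:
    history, ledger = task, Ledger()
    for step in range(max_steps):
        ledger.reread += count_tokens(history)  # the whole history is re-sent
        out = fake_model(history, step)
        ledger.thought += count_tokens(out["thought"])
        ledger.action += count_tokens(out["action"] + " " + out["arg"])
        if out["action"] == "finish":
            return out["arg"], ledger
        observation = TOOLS[out["action"]](out["arg"])
        history += (f"\nThought: {out['thought']}"
                    f"\nAction: {out['action']}({out['arg']})"
                    f"\nObservation: {observation}")
    return "step limit reached", ledger
```

Even in this toy run, the `thought` and `reread` counters outgrow the `action` counter, which is the pattern the rest of this section quantifies.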
Consider a moderately complex task — answering a question that requires querying two APIs and synthesizing the results. A ReAct agent might execute 5–8 reasoning cycles. In each cycle, the Thought step typically generates 200–500 tokens of chain-of-thought reasoning, while the Action itself might be a 50-token function call. The Observation gets injected back into the context window, so every subsequent Thought step re-reads the entire conversation history. By the eighth cycle, your agent is reasoning over a context that's 80% self-generated narration.
The token math compounds quickly:
- Reasoning tokens cost more. Output tokens are priced 3–8x higher than input tokens across major providers. The 2026 median sits at roughly 4:1, with premium reasoning models hitting 8:1. Every chain-of-thought step pays this premium.
- Context windows balloon. Each observation gets appended to the conversation, so the input token count of each call grows roughly linearly with the cycle number, and the cumulative input billed across the whole run grows roughly quadratically. An agent that started with a 2,000-token prompt might be processing 30,000 tokens of context by its fifth tool call.
- Reflexion loops multiply everything. Agents that self-critique and retry (Reflexion, self-consistency, or tree-of-thought patterns) can consume 50x the tokens of a single direct completion. Ten retry cycles on a task that already requires five reasoning steps means 50 chain-of-thought generations where a direct completion would need one, and most of them produce the same conclusion with slight variations.
The result: agents make 3–10x more LLM calls than simple chatbots, and a single request can easily consume 5x the token budget of a direct completion. For unconstrained software engineering agents, the per-task cost reaches $5–8 in API fees alone.
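The compounding described in the bullets above can be checked with back-of-the-envelope arithmetic. The sketch below is a simplified cost model, not a billing calculator: the default token counts are illustrative picks within the ranges quoted in this article, the 1,000-token observation size is an assumption, and prices are in arbitrary units where only the 4:1 output-to-input ratio matters.

```python
def react_cost(cycles, prompt=2_000, thought=350, action=50, observation=1_000,
               price_in=1.0, price_out=4.0):
    """Return (input_tokens, output_tokens, cost) for a ReAct run.

    Assumes the full history is re-sent on every call and that each cycle
    appends one Thought, one Action, and one Observation to the context.
    """
    context = prompt
    total_in = total_out = 0
    for _ in range(cycles):
        total_in += context             # the full history is re-read each call
        total_out += thought + action   # the model emits a Thought and an Action
        context += thought + action + observation
    cost = total_in * price_in + total_out * price_out
    return total_in, total_out, cost

def reflexion_generations(retries, steps_per_attempt):
    """Number of chain-of-thought generations when every attempt is retried."""
    return retries * steps_per_attempt
```

Under these assumptions, going from 4 to 8 cycles raises cumulative input tokens from 16,400 to 55,200, more than triple, while output tokens only double. `reflexion_generations(10, 5)` gives the 50 generations mentioned above.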
The Efficiency-Intelligence Trade-off
There's a deeper problem buried in the planning tax: more planning doesn't always mean better outcomes. Apple's 2025 "Illusion of Thinking" paper tested reasoning models on logic puzzles and found a striking pattern. For easy problems, standard models without chain-of-thought were faster and sometimes more accurate — the extra reasoning was pure waste. For medium-difficulty problems, reasoning genuinely helped. But for hard problems, both reasoning and non-reasoning models collapsed, and the reasoning models actually used fewer thinking tokens on the hardest instances, as if the model recognized it couldn't solve the problem and gave up early.
This creates a bimodal failure mode for agentic planning:
- Overthinking simple tasks. An agent asked to look up a stock price doesn't need five steps of decomposition. But if your architecture always runs the full ReAct loop, you pay the planning tax even when a single tool call would suffice.
- Under-planning hard tasks. Paradoxically, the tasks that most need careful planning are the ones where extended reasoning provides the least marginal benefit, because the model hits the limits of its capability regardless of how many tokens it spends thinking.
The practical implication: there's a task complexity threshold where planning overhead exceeds execution cost. Below that threshold, planning is waste. Above it, planning helps but with diminishing returns. The sweet spot — where additional reasoning tokens produce proportionally better outcomes — is narrower than most agent architectures assume.
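One way to act on this threshold is to route tasks before the reasoning loop starts. The sketch below is a hedged illustration: the step-count heuristic, the thresholds `simple_max` and `hard_min`, and the budget values are all assumptions chosen for the example, not measured cut-offs.

```python
from enum import Enum

class Strategy(Enum):
    DIRECT = "direct"   # one tool call, no reasoning loop
    PLAN = "plan"       # planning is expected to pay for itself
    CAPPED = "capped"   # plan, but under a hard token ceiling

def choose_strategy(estimated_steps: int, simple_max: int = 1,
                    hard_min: int = 10) -> Strategy:
    """Pick a strategy from a rough estimate of how many tool calls a task needs."""
    if estimated_steps <= simple_max:
        return Strategy.DIRECT   # below the threshold, planning is waste
    if estimated_steps >= hard_min:
        return Strategy.CAPPED   # returns diminish, so bound the spend
    return Strategy.PLAN

# Hypothetical per-strategy ceilings on reasoning tokens.
THINKING_BUDGET = {Strategy.DIRECT: 0, Strategy.PLAN: 2_000, Strategy.CAPPED: 4_000}
```

A stock-price lookup with `estimated_steps=1` would be routed to `Strategy.DIRECT` and skip the loop entirely. How `estimated_steps` is obtained (a classifier, a cheap model call, or static rules per endpoint) is left open here.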
Architectural Patterns That Reduce the Tax
The good news: the planning tax is an architectural choice, not an immutable property of LLM agents. Several patterns have emerged that reclaim token budget without sacrificing capability.
Plan-Then-Execute (ReWOO)
The ReWOO architecture (Reasoning WithOut Observation) concentrates all planning into a single upfront pass. A Planner module generates a complete blueprint — every tool call, every dependency — before any execution begins. A Worker module then executes the plan without triggering additional LLM calls for intermediate reasoning. A Solver module synthesizes the results at the end.
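The Planner, Worker, and Solver split can be sketched as follows. This is a simplified illustration rather than the reference implementation: `planner` and `solver` are hard-coded stand-ins for the two LLM calls, the tools are hypothetical, and the `#E1`-style evidence variables follow the placeholder convention used in the ReWOO paper.

```python
import re

# Hypothetical tools standing in for real APIs.
TOOLS = {
    "search": lambda q: f"population of {q}: 100",
    "calc": lambda expr: str(sum(int(n) for n in re.findall(r"\d+", expr))),
}

def planner(task: str) -> list[tuple[str, str, str]]:
    """Stand-in for the single planning LLM call: (variable, tool, argument) steps."""
    return [
        ("#E1", "search", "A"),
        ("#E2", "search", "B"),
        ("#E3", "calc", "#E1 plus #E2"),
    ]

def worker(plan: list[tuple[str, str, str]]) -> dict[str, str]:
    """Execute every step in order; no LLM call happens in this loop."""
    evidence: dict[str, str] = {}
    for var, tool, arg in plan:
        # Substitute earlier evidence into the argument before calling the tool.
        for name, value in evidence.items():
            arg = arg.replace(name, value)
        evidence[var] = TOOLS[tool](arg)
    return evidence

def solver(task: str, evidence: dict[str, str]) -> str:
    """Stand-in for the final LLM call that synthesizes an answer."""
    return f"{task} -> {evidence['#E3']}"

def run_rewoo(task: str) -> tuple[str, int]:
    plan = planner(task)       # LLM call 1
    evidence = worker(plan)    # tool calls only
    answer = solver(task, evidence)  # LLM call 2
    return answer, 2           # two model calls regardless of plan length
```

The design point is that the number of model calls stays fixed at two no matter how many tool steps the plan contains, whereas a ReAct loop pays for one model call per step. The plain string substitution would need care once a plan exceeds nine steps, since `#E1` is a prefix of `#E10`.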
The token savings are substantial. The ReWOO paper reports about 5x token efficiency and a 4% accuracy improvement over ReAct on the HotpotQA benchmark; averaged across its evaluated benchmarks, it reports roughly 64% fewer tokens (about a 2.8x reduction) with a 4.4% absolute accuracy gain. The key insight is that most intermediate "reasoning" in a ReAct loop is redundant: the model is re-deriving the same plan it had from the beginning, just with updated context.
The trade-off is adaptability. ReWOO can't improvise mid-execution. If the third tool call returns unexpected data, the plan doesn't adjust. For tasks where the execution path is predictable, this is a non-issue. For exploratory tasks where each observation genuinely changes the strategy, you need a hybrid approach.
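One possible shape for such a hybrid is to execute a precomputed plan and call the planner again only when a result looks wrong. The sketch below is an assumption about how that could be wired, not a published architecture: `make_plan`, `run_step`, and `is_unexpected` are hypothetical callables supplied by the caller.

```python
def execute_with_replan(task, make_plan, run_step, is_unexpected, max_replans=2):
    """Run a precomputed plan, replanning only when a step result looks wrong.

    make_plan(task, seen) returns a list of steps given the results seen so far;
    run_step(step) executes one step; is_unexpected(result) flags surprises.
    """
    plan = make_plan(task, [])
    results, replans = [], 0
    i = 0
    while i < len(plan):
        result = run_step(plan[i])
        if is_unexpected(result) and replans < max_replans:
            replans += 1
            # Keep the completed prefix and replan the remaining work,
            # passing along what has been learned, including the surprise.
            plan = plan[:i] + make_plan(task, results + [result])
            continue
        results.append(result)
        i += 1
    return results, replans
```

On the happy path this costs the same single planning call as plan-then-execute. Each surprise adds one more planning call, bounded by `max_replans`, after which unexpected results are accepted as-is.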
