The Planning Tax: Why Your Agent Spends More Tokens Thinking Than Doing
Your agent just spent several dollars to complete $0.12 worth of actual work. If you've built agentic systems in production, this ratio probably doesn't surprise you. What might surprise you is where those tokens went: not into tool calls, not into generating the final answer, but into the agent reasoning about what to do next. Decomposing the task. Reflecting on intermediate results. Re-planning when an observation didn't match expectations. This is the planning tax — the token overhead your agent pays to think before it acts — and for most agentic architectures, it consumes 40–70% of the total token budget before a single useful action fires.
The planning tax isn't a bug. Reasoning is what separates agents from simple prompt-response systems. But when the cost of deciding what to do exceeds the cost of actually doing it, you have an engineering problem that no amount of cheaper inference will solve. Per-token prices have dropped roughly 1,000x since late 2022, yet total agent spending keeps climbing — a textbook Jevons paradox where cheaper tokens just invite more token consumption.
The Anatomy of Planning Overhead
To understand the planning tax, trace where tokens actually go in a standard ReAct agent loop. Each cycle follows the same pattern: the model generates a Thought (reasoning about the current state), selects an Action (choosing a tool and parameters), receives an Observation (the tool's output), and then reasons again about what to do next. Every step in this chain costs tokens, but the Thought steps are where the budget bleeds.
Consider a moderately complex task — answering a question that requires querying two APIs and synthesizing the results. A ReAct agent might execute 5–8 reasoning cycles. In each cycle, the Thought step typically generates 200–500 tokens of chain-of-thought reasoning, while the Action itself might be a 50-token function call. The Observation gets injected back into the context window, so every subsequent Thought step re-reads the entire conversation history. By the eighth cycle, your agent is reasoning over a context that's 80% self-generated narration.
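The loop above can be sketched in a few lines. This is a toy trace, not a real agent: `fake_llm` and `fake_tool` stand in for API calls, the token counts are illustrative, and the "tokenizer" is a whitespace split. The point it demonstrates is structural: every cycle re-reads the full history as input, and the Thought step dominates the output.

```python
def fake_llm(context: str) -> tuple[str, str]:
    """Stub for an LLM call; a real call would consume `context` via an API."""
    thought = "I should query the next API and compare it with what I have. " * 8
    action = 'call_api(endpoint="/data", params={"q": "revenue"})'
    return thought, action

def fake_tool(action: str) -> str:
    """Stub for tool execution; returns a verbose observation."""
    return "HTTP 200: " + "field: value " * 40

def tokens(text: str) -> int:
    return len(text.split())  # crude whitespace tokenizer, good enough here

def react_trace(task: str, cycles: int = 5) -> tuple[int, int, int]:
    context = task
    thought_tokens = action_tokens = input_tokens = 0
    for _ in range(cycles):
        input_tokens += tokens(context)   # the full history is re-read every cycle
        thought, action = fake_llm(context)
        thought_tokens += tokens(thought)
        action_tokens += tokens(action)
        observation = fake_tool(action)
        # the transcript grows, so the next cycle's input is strictly larger
        context += f"\nThought: {thought}\nAction: {action}\nObservation: {observation}"
    return thought_tokens, action_tokens, input_tokens

t, a, i = react_trace("Compare Q3 revenue across two APIs.", cycles=5)
print(f"thought tokens: {t}, action tokens: {a}, cumulative input tokens: {i}")
```

Running this with more cycles shows cumulative input tokens growing faster than linearly, which is exactly the "80% self-generated narration" effect.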
The token math compounds quickly:
- Reasoning tokens cost more. Output tokens are priced 3–8x higher than input tokens across major providers. The 2026 median sits at roughly 4:1, with premium reasoning models hitting 8:1. Every chain-of-thought step pays this premium.
- Context windows balloon. Each observation gets appended to the conversation, so the input token count grows linearly (or worse) with each cycle. An agent that started with a 2,000-token prompt might be processing 30,000 tokens of context by its fifth tool call.
- Reflexion loops multiply everything. Agents that self-critique and retry — Reflexion, self-consistency, or tree-of-thought patterns — can consume 50x the tokens of a single pass. Ten retry cycles on a task that already requires five reasoning steps means 50 chain-of-thought generations, most of which produce the same conclusion with slight variations.
The result: agents make 3–10x more LLM calls than simple chatbots, and a single request can easily consume 5x the token budget of a direct completion. For unconstrained software engineering agents, the per-task cost reaches $5–8 in API fees alone.
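The compounding above is easy to model on the back of an envelope. The prices and per-step token counts below are illustrative placeholders (using the roughly 4:1 output premium mentioned earlier), not any provider's actual rates; the shape of the curve is what matters.

```python
INPUT_PRICE = 2.50 / 1_000_000    # $/token, hypothetical input rate
OUTPUT_PRICE = 10.00 / 1_000_000  # 4:1 output premium, per the median above

def react_cost(cycles: int, prompt: int = 2_000, thought: int = 350,
               action: int = 50, observation: int = 800) -> float:
    """Cost of a ReAct run where the full history is re-read each cycle."""
    cost, context = 0.0, prompt
    for _ in range(cycles):
        cost += context * INPUT_PRICE              # re-read the whole history
        cost += (thought + action) * OUTPUT_PRICE  # reasoning + function call
        context += thought + action + observation  # history grows every cycle
    return cost

for n in (1, 5, 8):
    print(f"{n} cycles: ${react_cost(n):.4f}")
```

Because the input term sums over an ever-growing context, cost grows quadratically in the number of cycles, not linearly, which is why trimming one or two redundant reasoning cycles saves more than its face value.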
The Efficiency-Intelligence Trade-off
There's a deeper problem buried in the planning tax: more planning doesn't always mean better outcomes. Apple's 2025 "Illusion of Thinking" paper tested reasoning models on logic puzzles and found a striking pattern. For easy problems, standard models without chain-of-thought were faster and sometimes more accurate — the extra reasoning was pure waste. For medium-difficulty problems, reasoning genuinely helped. But for hard problems, both reasoning and non-reasoning models collapsed, and the reasoning models actually used fewer thinking tokens on the hardest instances, as if the model recognized it couldn't solve the problem and gave up early.
This creates a bimodal failure mode for agentic planning:
- Overthinking simple tasks. An agent asked to look up a stock price doesn't need five steps of decomposition. But if your architecture always runs the full ReAct loop, you pay the planning tax even when a single tool call would suffice.
- Under-planning hard tasks. Paradoxically, the tasks that most need careful planning are the ones where extended reasoning provides the least marginal benefit, because the model hits the limits of its capability regardless of how many tokens it spends thinking.
The practical implication: there's a task complexity threshold where planning overhead exceeds execution cost. Below that threshold, planning is waste. Above it, planning helps but with diminishing returns. The sweet spot — where additional reasoning tokens produce proportionally better outcomes — is narrower than most agent architectures assume.
Architectural Patterns That Reduce the Tax
The good news: the planning tax is an architectural choice, not an immutable property of LLM agents. Several patterns have emerged that reclaim token budget without sacrificing capability.
Plan-Then-Execute (ReWOO)
The ReWOO architecture (Reasoning WithOut Observation) concentrates all planning into a single upfront pass. A Planner module generates a complete blueprint — every tool call, every dependency — before any execution begins. A Worker module then executes the plan without triggering additional LLM calls for intermediate reasoning. A Solver module synthesizes the results at the end.
The token savings are dramatic: ReWOO reduces token usage by 5–10x compared to ReAct traces on the same tasks, while matching or exceeding accuracy. On the HotpotQA benchmark, it achieved 64% fewer tokens with a 4.4% accuracy improvement. The key insight is that most intermediate "reasoning" in a ReAct loop is redundant — the model is re-deriving the same plan it had from the beginning, just with updated context.
The trade-off is adaptability. ReWOO can't improvise mid-execution. If the third tool call returns unexpected data, the plan doesn't adjust. For tasks where the execution path is predictable, this is a non-issue. For exploratory tasks where each observation genuinely changes the strategy, you need a hybrid approach.
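The Planner/Worker/Solver split can be sketched as follows. This is a minimal illustration in the spirit of ReWOO, not the paper's implementation: the planner and solver are stubbed (each would be one LLM call in a real system), the `#E` evidence variables are the paper's convention for wiring step outputs into later steps, and the tools are fakes.

```python
import re

def plan(task: str) -> list[dict]:
    """Planner: one upfront call yields the full blueprint (stubbed here)."""
    return [
        {"id": "#E1", "tool": "search", "args": "Q3 revenue, database A"},
        {"id": "#E2", "tool": "search", "args": "Q3 revenue, database B"},
        {"id": "#E3", "tool": "compare", "args": "#E1 vs #E2"},
    ]

TOOLS = {
    "search": lambda q: f"result-for({q})",
    "compare": lambda q: f"comparison({q})",
}

def execute(steps: list[dict]) -> dict:
    """Worker: runs every step with zero intermediate LLM calls."""
    results = {}
    for step in steps:
        # splice earlier evidence into the args by plain substitution
        args = re.sub(r"#E\d+", lambda m: results[m.group(0)], step["args"])
        results[step["id"]] = TOOLS[step["tool"]](args)
    return results

def solve(task: str, results: dict) -> str:
    """Solver: one final synthesis call (stubbed here)."""
    return f"Answer to {task!r} based on {results['#E3']}"

print(solve("compare Q3 revenue", execute(plan("compare Q3 revenue"))))
```

The token accounting is visible in the structure: two LLM calls total (plan, solve) regardless of how many tool steps the plan contains, versus one reasoning call per step in ReAct.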
Plan Caching
Agentic Plan Caching (APC) extracts structured plan templates from completed agent executions and reuses them for semantically similar future tasks. When a new request arrives, the system matches it against cached plans using keyword extraction, then uses a lightweight model to adapt the template to the specific context.
The results are compelling: 50% average cost reduction and 27% latency improvement while maintaining task accuracy. Unlike semantic caching for chatbots (which just returns cached responses), APC caches the plan structure while allowing the execution details to vary. This means the first user to ask "compare Q3 revenue across these three databases" pays the full planning tax, but subsequent similar queries skip the decomposition entirely.
Hierarchical Decomposition
Instead of a flat planning loop where one model does everything, hierarchical decomposition uses a lightweight model for high-level task breakdown and reserves the expensive model for steps that genuinely require sophisticated reasoning. This mirrors the organizational pattern of having a senior architect design the approach while junior developers execute the individual steps.
In practice, this means routing 90% of planning queries to a smaller, cheaper model and only escalating to the frontier model for the 10% of decisions that actually need it. Teams report 87% cost reductions using this cascade approach.
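The routing logic is a small amount of glue code. Both models and the confidence signal are stubbed below; in a real system the gate might come from logprobs, a verifier model, or a task classifier, and the threshold would be tuned against your own traffic.

```python
def small_model(task: str) -> tuple[str, float]:
    """Cheap planner: returns (plan, confidence). Stubbed for illustration."""
    if "prove" in task or "novel" in task:
        return ("?", 0.3)  # pretend hard tasks yield low confidence
    return (f"3-step plan for {task!r}", 0.9)

def frontier_model(task: str) -> str:
    """Expensive planner, reserved for escalations. Stubbed for illustration."""
    return f"deep plan for {task!r}"

def route(task: str, threshold: float = 0.7) -> tuple[str, str]:
    """Try the cheap model first; escalate only below the confidence gate."""
    plan, conf = small_model(task)
    if conf >= threshold:
        return plan, "small"
    return frontier_model(task), "frontier"

print(route("look up the AAPL stock price"))  # stays on the cheap model
print(route("prove this scheduling bound"))   # escalates to the frontier model
```

The economics come from the asymmetry: the cheap call is paid on every request, but if 90% of requests stop there, the blended cost approaches the small model's price.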
Thinking Budgets
Modern reasoning models support explicit thinking budgets — you can tell the model to spend no more than N tokens on internal reasoning. This is a blunt instrument but surprisingly effective. Research on token-budget-aware reasoning shows that constraining the thinking budget forces the model to be more decisive, reducing hedging and repetitive self-reflection without proportional accuracy loss.
The practical heuristic: set thinking budgets proportional to task complexity. A tool-selection step doesn't need 4,000 tokens of deliberation. A multi-step synthesis might. Adaptive approaches that start with a small budget and expand it only when the model signals low confidence can capture most of the benefit with a fraction of the cost.
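The adaptive variant can be sketched as an escalating-budget wrapper. The `call_model` stub and its `thinking_budget` parameter are hypothetical (real providers expose this knob under different names and shapes), and the confidence signal is faked; the ladder of budgets and the stopping rule are the part being illustrated.

```python
def call_model(task: str, thinking_budget: int) -> tuple[str, float]:
    """Stub: pretend harder tasks need a larger budget to reach confidence."""
    needed = 4_000 if "synthesize" in task else 500
    confidence = 0.95 if thinking_budget >= needed else 0.4
    return f"answer({task})", confidence

def answer_with_budget(task: str,
                       budgets: tuple[int, ...] = (512, 2_048, 8_192),
                       floor: float = 0.8) -> tuple[str, int]:
    """Start with a small thinking budget; expand only on low confidence."""
    spent = 0
    for budget in budgets:
        answer, conf = call_model(task, thinking_budget=budget)
        spent += budget
        if conf >= floor:  # decisive enough, stop escalating
            return answer, spent
    return answer, spent   # best effort at the maximum budget

print(answer_with_budget("pick a tool"))                   # cheap: stops at 512
print(answer_with_budget("synthesize three API results"))  # escalates twice
```

Simple tool-selection steps stop at the smallest rung, so the 4,000-token deliberation is paid only where the task actually demands it.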
The Tool Tax: A Hidden Multiplier
Planning overhead doesn't just come from reasoning tokens. Every tool in your agent's toolkit adds to the tax, even when the tool isn't used. Tool schemas get serialized into the prompt as function definitions, and they're tokenized and billed on every request. An agent with 30 available tools might spend 3,000–5,000 tokens per request just describing tools it never calls.
The fix is straightforward: filter available tools by relevance before each LLM call. If the agent is in a "data retrieval" phase, it doesn't need the code execution or email-sending tools in its context. This is the LLM equivalent of loading only the libraries you need — obvious in retrospect, but routinely overlooked because the cost is invisible until you instrument your token usage.
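A phase-based filter is only a few lines. The tool names, phases, and schema placeholders below are made up for illustration; the mechanism (serialize only the schemas the current phase can use) is the point.

```python
TOOL_SCHEMAS = {
    "sql_query":  {"phase": "retrieval", "schema": "...~150 tokens of JSON schema..."},
    "web_search": {"phase": "retrieval", "schema": "...~120 tokens of JSON schema..."},
    "run_python": {"phase": "execution", "schema": "...~200 tokens of JSON schema..."},
    "send_email": {"phase": "delivery",  "schema": "...~180 tokens of JSON schema..."},
}

def tools_for(phase: str) -> dict[str, str]:
    """Return only the schemas the agent's current phase can actually use."""
    return {name: t["schema"] for name, t in TOOL_SCHEMAS.items()
            if t["phase"] == phase}

# A retrieval-phase request now carries 2 schemas instead of all 4
print(sorted(tools_for("retrieval")))
```

With 30 tools and a handful of phases, the same filter turns thousands of tokens of dead schema weight per request into a few hundred.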
Similarly, conditional system prompts — including tool-specific instructions only when those tools are relevant — can eliminate thousands of tokens of superfluous context per request.
Measuring Your Planning Tax
You can't optimize what you don't measure. Most teams track total token usage per request, but this single number hides the planning/execution split. To diagnose your planning tax, instrument these metrics:
- Planning ratio: tokens spent on reasoning steps divided by total tokens. If this exceeds 50%, you're likely overtaxing.
- Reasoning-to-action ratio: number of reasoning tokens per tool call. A healthy ratio depends on your domain, but 10:1 or higher suggests the model is overthinking.
- Retry rate: how often the agent re-plans after an observation. Frequent re-planning indicates either poor initial plans or unpredictable tool behavior.
- Context growth rate: how quickly the conversation context grows per cycle. Linear growth is expected; superlinear growth (from verbose observations or repeated self-reflection) is a red flag.
- Marginal accuracy per token: does doubling the thinking budget measurably improve outcomes? If not, you're past the diminishing returns threshold.
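Several of the metrics above fall out of a per-step token trace. The trace format below is an assumption; adapt the field names to whatever your agent framework actually logs, and note that the numbers here are illustrative.

```python
trace = [  # one dict per agent step; token counts are illustrative
    {"kind": "thought", "tokens": 420},
    {"kind": "action",  "tokens": 50},
    {"kind": "thought", "tokens": 510, "replan": True},
    {"kind": "action",  "tokens": 45},
    {"kind": "thought", "tokens": 380},
    {"kind": "answer",  "tokens": 200},
]

def planning_metrics(trace: list[dict]) -> dict[str, float]:
    """Derive planning-tax metrics from a per-step token trace."""
    by_kind: dict[str, int] = {}
    for step in trace:
        by_kind[step["kind"]] = by_kind.get(step["kind"], 0) + step["tokens"]
    total = sum(by_kind.values())
    actions = sum(1 for s in trace if s["kind"] == "action")
    return {
        "planning_ratio": by_kind.get("thought", 0) / total,
        "reasoning_per_action": by_kind.get("thought", 0) / max(actions, 1),
        "retry_rate": sum(1 for s in trace if s.get("replan")) / len(trace),
    }

m = planning_metrics(trace)
print({k: round(v, 2) for k, v in m.items()})
```

On this toy trace the planning ratio comes out above 0.8, which by the thresholds above would flag the agent as overtaxed before you ever look at a dollar figure.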
Once you have visibility, the optimization path usually follows a predictable sequence: first, cut tool schemas and conditional prompts (easy wins, 20–30% savings). Then implement plan caching for your most common query patterns (another 30–50% on cache hits). Finally, evaluate whether your architecture needs the full ReAct loop or whether a plan-then-execute pattern fits your use case.
What Comes Next
The planning tax is a transitional problem. As models get better at reasoning efficiently — spending tokens only where they add value — the overhead will shrink. Adaptive thinking approaches, where the model self-regulates its reasoning depth based on task difficulty, are already shipping in production models. Plan caching and hierarchical decomposition are becoming standard infrastructure rather than research curiosities.
But the deeper lesson will outlast the specific techniques: in any system where thinking and doing are both metered resources, you need to budget them separately. Treating token usage as a single undifferentiated number is like tracking "compute" without distinguishing CPU from I/O — technically correct but operationally useless. The teams that get agentic costs under control are the ones that ask not just "how many tokens did we use?" but "how many of those tokens actually moved us toward the answer?"
The planning tax is real, it's measurable, and for most production agents today, it's the single largest line item in your inference budget. The question isn't whether to pay it — some planning is essential — but whether you're paying the right amount for the decisions your agent actually needs to make.
Sources
- https://zylos.ai/research/2026-02-19-ai-agent-cost-optimization-token-economics
- https://medium.com/elementor-engineers/optimizing-token-usage-in-agent-based-assistants-ffd1822ece9c
- https://labs.adaline.ai/p/token-burnout-why-ai-costs-are-climbing
- https://arxiv.org/abs/2506.14852
- https://capabl.in/blog/agentic-ai-design-patterns-react-rewoo-codeact-and-beyond
- https://arxiv.org/abs/2305.18323
- https://arxiv.org/html/2412.18547v5
- https://redis.io/blog/llm-token-optimization-speed-up-apps/
