The Token Economy of Multi-Turn Tool Use: Why Your Agent Costs 5x More Than You Think
Every team that builds an AI agent does the same back-of-the-envelope math: take the expected number of tool calls, multiply by the per-call cost, add a small buffer. That estimate is wrong before it leaves the whiteboard — not by 10% or 20%, but by 5 to 30 times, depending on agent complexity. Forty percent of agentic AI pilots get cancelled before reaching production, and runaway inference costs are the single most common reason.
The problem is structural. Single-call cost estimates assume each inference is independent. In a multi-turn agent loop, they are not. Every tool call grows the context that every subsequent call must pay for. The result is a quadratic cost curve masquerading as a linear one, and engineers don't discover it until the bill arrives.
Why the Math Is Wrong From the Start
The intuitive model treats agent cost like a loop counter: N tool calls at cost C each, for a total of N × C. That holds only if each call sees the same context, which is never true in an agent loop.
Consider a coding agent using Claude Sonnet at standard pricing. A single-pass call with a 9,000-token context costs about $0.03. Run that agent for ten steps, naively appending tool results and conversation history at each turn, and the total context across all calls reaches roughly 472,000 tokens — a 43x increase in cost compared to the single-call baseline.
The underlying formula is:
Total input tokens = N × S + u × N(N+1)/2 + r × N(N-1)/2
Where N is the number of turns, S is the static prefix (system prompt + tool definitions), u is the average user message size, and r is the average tool result size. The triangular term N(N+1)/2 is the culprit: it is what makes agent cost O(N²) rather than O(N). For a 20-turn agent, accumulated token exposure grows not 20x but closer to 200x, since N(N+1)/2 = 20×21/2 = 210.
Real measurements confirm this. In a documented five-step agent run, per-call token consumption grew as: 888 → 3,400 → 8,900 → 14,200 → 18,900. The cost per step didn't stay flat; it ballooned with each turn because every previous turn stayed in context.
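To see how quickly the triangular terms dominate, here is a minimal sketch of the formula above; the sizes chosen for the prefix, user messages, and tool results are illustrative assumptions, not measurements from the run just described.

```python
# Minimal sketch of the accumulation formula. The sizes for the static
# prefix (s), user messages (u), and tool results (r) are illustrative
# assumptions, not measured values.

def total_input_tokens(n_turns: int, s: int, u: int, r: int) -> int:
    """Total input tokens paid across all calls of an n-turn agent loop."""
    return (
        n_turns * s                           # static prefix, re-sent every call
        + u * n_turns * (n_turns + 1) // 2    # user messages accumulate
        + r * n_turns * (n_turns - 1) // 2    # tool results accumulate from turn 2
    )

baseline = total_input_tokens(1, s=3000, u=200, r=1500)
for n in (1, 5, 10, 20):
    total = total_input_tokens(n, s=3000, u=200, r=1500)
    print(f"{n:>2} turns: {total:>9,} input tokens ({total / baseline:5.1f}x one call)")
```

Even with these modest per-turn sizes, the 20-turn loop pays about 120x the single-call token bill, roughly six times what the linear N × C intuition predicts.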
The Hidden Fixed Costs That Multiply With Every Call
Token growth from accumulated history is the obvious problem. Less visible is the set of fixed costs that get paid fresh on every inference call.
System prompts. A detailed system prompt runs 2,000–5,000 tokens for a typical production agent. At one million API calls — not unusual for a customer-facing product — that's 2–5 billion tokens of instruction overhead before a single user message is processed. At scale, system prompts become one of the largest line items in inference spend.
Tool definitions. Every available tool gets serialized into the context for every call, whether the model uses it or not. A single modest tool definition costs 50–100 tokens, and richer schemas run far higher. A setup with 100 tools can consume roughly 22% of a 128K context window before the user query begins, and measured production setups have found 55,000–134,000 tokens of tool-definition overhead in a single call. One team cut this from 134K to 8,700 tokens, a reduction of roughly 94%, by switching from always-on tool definitions to dynamic loading, sketched below.
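A minimal sketch of that dynamic-loading idea, assuming a keyword-overlap filter as a stand-in for whatever selection mechanism the team actually used (embedding search and router models are common choices):

```python
# Minimal sketch of dynamic tool loading: send only the definitions
# relevant to the current query instead of the full catalog. The keyword
# filter and catalog entries are illustrative stand-ins for a real
# retrieval step over real schemas.

TOOL_CATALOG = {
    "search_orders": {
        "keywords": {"order", "shipment", "tracking"},
        "definition": {"name": "search_orders", "description": "Find an order by ID."},
    },
    "refund_payment": {
        "keywords": {"refund", "charge", "payment"},
        "definition": {"name": "refund_payment", "description": "Refund a charge."},
    },
    "update_address": {
        "keywords": {"address", "shipping", "move"},
        "definition": {"name": "update_address", "description": "Change a shipping address."},
    },
    # ... the rest of a catalog that would otherwise ship on every call
}

def select_tools(user_query: str, max_tools: int = 5) -> list[dict]:
    """Return only the tool definitions whose keywords overlap the query."""
    words = set(user_query.lower().split())
    relevant = [
        (len(spec["keywords"] & words), name)
        for name, spec in TOOL_CATALOG.items()
        if spec["keywords"] & words
    ]
    relevant.sort(reverse=True)  # most keyword overlap first
    return [TOOL_CATALOG[name]["definition"] for _, name in relevant[:max_tools]]

print(select_tools("please refund the duplicate charge on my payment"))
# Only refund_payment's definition enters the context; the rest stay home.
```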
Retry overhead. Failed tool calls don't disappear from the context. The error response, the model's next attempt, and any intermediate reasoning all accumulate in the conversation history and get resent with every subsequent call. A 10% failure rate per step, compounded across 10 steps without circuit breakers, can silently multiply costs several times over. One engineering team reduced their per-task tool call count from 14 to 2 by adding clear terminal states (SUCCESS/FAILED) to tool responses — the model stopped retrying ambiguous outcomes.
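A minimal sketch of both fixes together; the SUCCESS/FAILED terminal states follow the pattern described above, while the retry budget and the example tool are illustrative assumptions:

```python
# Terminal-state tool results plus a retry circuit breaker. Ambiguous
# outcomes are what the model retries, so every result carries an
# explicit status, and a budget caps how many attempts can pile up
# in (and be re-billed with) the conversation history.
import json

TOOLS = {"lookup_order": lambda order_id: {"order_id": order_id, "state": "shipped"}}
MAX_ATTEMPTS = 2  # illustrative per-tool budget for a single task

def run_tool(name: str, args: dict, attempts: dict[str, int]) -> str:
    """Execute a tool and return a result with an unambiguous terminal state."""
    attempts[name] = attempts.get(name, 0) + 1
    if attempts[name] > MAX_ATTEMPTS:
        # Circuit breaker: a hard FAILED stops the loop from retrying
        # forever, with every retry re-paying the entire context.
        return json.dumps({"status": "FAILED",
                           "error": f"{name} exceeded its retry budget; stop retrying"})
    try:
        return json.dumps({"status": "SUCCESS", "result": TOOLS[name](**args)})
    except Exception as exc:
        return json.dumps({"status": "FAILED", "error": str(exc)})

attempts: dict[str, int] = {}
print(run_tool("lookup_order", {"order_id": "A-1001"}, attempts))
# {"status": "SUCCESS", ...}: nothing ambiguous left for the model to retry.
```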
Combined, these hidden fixed costs mean that the real undercount for moderate-complexity agents (3–5 tools, multi-step workflows) is typically 5–10x. For complex multi-agent systems, the multiplier reaches 20–50x.
The Framework Tax Nobody Budgets For
Before any task-specific cost is incurred, the orchestration framework itself burns tokens. Measurements across common frameworks show:
- LangGraph: 1.3–1.8x overhead on baseline task cost
- CrewAI: ~2x overhead due to autonomous deliberation before tool calls
- Multi-agent orchestration: roughly 7x per additional agent added to the pipeline
A four-agent research pipeline running 20 steps doesn't cost 4x a single-agent system; at roughly 7x per agent, it lands closer to 28x, and that's before any retry loops or context accumulation.
This matters because teams usually prototype with a single agent and then scale to multi-agent architectures to handle complexity, treating the architecture change as a quality improvement rather than a cost event. It's both.
Five Levers That Actually Work
1. Parallel Tool Calls
Sequential tool calls pay the full input token cost of the entire conversation multiple times — once per call. Parallel tool calls batch independent operations into a single inference round, paying input tokens once for a set of work that would otherwise require multiple round trips.
The practical gains are meaningful. In benchmarks, parallel execution produces 1.4x–3.7x latency improvements, and the cost reduction compounds because fewer round trips mean less accumulated context per unit of work completed. Not every tool use can be parallelized (some calls depend on the outputs of prior ones), but identifying and batching independent operations is the highest-leverage architectural change available.
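A minimal sketch of the batching pattern, assuming tools are exposed as async functions; the tool names and message shapes are illustrative, but the structure mirrors how a model emits several independent tool calls in one assistant turn:

```python
# Execute a batch of independent tool calls from one inference round
# concurrently, then return all results in a single follow-up message:
# one round trip, one payment of the accumulated context.
import asyncio

async def fetch_weather(city: str) -> str:
    await asyncio.sleep(1)  # stand-in for a real API call
    return f"{city}: 18C, clear"

async def fetch_flights(route: str) -> str:
    await asyncio.sleep(1)
    return f"{route}: 3 flights found"

TOOLS = {"fetch_weather": fetch_weather, "fetch_flights": fetch_flights}

async def run_tool_batch(tool_calls: list[dict]) -> list[dict]:
    """Run independent tool calls concurrently and collect their results."""
    results = await asyncio.gather(
        *(TOOLS[call["name"]](**call["args"]) for call in tool_calls)
    )
    return [
        {"tool_call_id": call["id"], "content": result}
        for call, result in zip(tool_calls, results)
    ]

calls = [
    {"id": "1", "name": "fetch_weather", "args": {"city": "Lisbon"}},
    {"id": "2", "name": "fetch_flights", "args": {"route": "SFO-LIS"}},
]
print(asyncio.run(run_tool_batch(calls)))  # finishes in ~1s, not ~2s
```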
2. Prompt Caching for Repeated Prefixes
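The static prefix S in the formula above, the system prompt plus the tool definitions, is byte-for-byte identical on every call, which makes it the natural target for prompt caching. Providers now discount cached prefix tokens steeply: Anthropic prices cache reads at roughly a tenth of the base input rate (with a surcharge on the initial cache write), and OpenAI automatically discounts repeated prompt prefixes. For an agent loop that re-sends a multi-thousand-token prefix on all N calls, caching shrinks the N × S term to a fraction of its nominal price without touching the architecture.

A minimal sketch using the Anthropic Python SDK; the model name, system prompt, and tool schema are illustrative placeholders, while cache_control is the SDK's real caching mechanism:

```python
# Mark the static prefix (tools + system prompt) as cacheable so every
# subsequent call in the agent loop reads it at the discounted rate.
# SYSTEM_PROMPT, TOOLS, and the model name are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "You are a support agent. <several thousand tokens of policy>"

TOOLS = [
    {
        "name": "lookup_order",
        "description": "Look up an order by ID.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    # ... the rest of the (stable) tool catalog
]

def agent_turn(conversation: list[dict]) -> anthropic.types.Message:
    return client.messages.create(
        model="claude-sonnet-4-5",  # illustrative; use a caching-capable model
        max_tokens=1024,
        tools=TOOLS,
        # cache_control marks everything up to and including this block
        # (tool definitions are serialized ahead of the system prompt) as
        # a reusable prefix: the first call writes the cache, later calls
        # read it at a fraction of the base input price.
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=conversation,  # only this part grows from turn to turn
    )
```

The catch is prefix stability: caching matches exact prefixes only, so anything dynamic (timestamps, user IDs, session state) belongs in the message body rather than the system prompt, or every call becomes a cache miss that writes instead of reads.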
