The Token Economy of Multi-Turn Tool Use: Why Your Agent Costs 5x More Than You Think
Every team that builds an AI agent does the same back-of-the-envelope math: take the expected number of tool calls, multiply by the per-call cost, add a small buffer. That estimate is wrong before it leaves the whiteboard — not by 10% or 20%, but by 5 to 30 times, depending on agent complexity. Forty percent of agentic AI pilots get cancelled before reaching production, and runaway inference costs are the single most common reason.
The problem is structural. Single-call cost estimates assume each inference is independent. In a multi-turn agent loop, they are not. Every tool call grows the context that every subsequent call must pay for. The result is a quadratic cost curve masquerading as a linear one, and engineers don't discover it until the bill arrives.
Why the Math Is Wrong From the Start
The intuitive model treats agent cost like a loop counter: N tool calls at unit cost C should total N × C. This is accurate only if each call sees the same context, which is never true in an agent loop.
Consider a coding agent using Claude Sonnet at standard pricing. A single-pass call with a 9,000-token context costs about $0.03. Run that agent for ten steps, naively appending tool results and conversation history at each turn, and the total context across all calls reaches roughly 472,000 tokens — a 43x increase in cost compared to the single-call baseline.
The underlying formula is:
Total input tokens = N × S + u × N(N+1)/2 + r × N(N-1)/2
Where N is the number of turns, S is the static prefix (system prompt + tool definitions), u is the average user message size, and r is the average tool result size. The triangular terms are the culprit: they make agent cost O(N²) rather than O(N). For a 20-turn agent, the accumulated token exposure is not 20x a single call's but closer to 200x, since N(N+1)/2 = 210 for N = 20.
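The growth is easy to check numerically. The sketch below implements the formula above directly, assuming one user message per turn and one tool result per completed turn (function names are illustrative):

```python
def total_input_tokens(n_turns: int, static: int, user: int, result: int) -> int:
    """Sum input tokens over an N-turn loop that naively appends history.

    Call i resends the static prefix, i user messages, and i - 1 tool results.
    """
    total = 0
    for i in range(1, n_turns + 1):
        total += static + user * i + result * (i - 1)
    return total


def closed_form(n: int, static: int, user: int, result: int) -> int:
    # N*S + u*N(N+1)/2 + r*N(N-1)/2, as in the text
    return n * static + user * n * (n + 1) // 2 + result * n * (n - 1) // 2


# A 20-turn agent with a 2,000-token prefix, 500-token user messages,
# and 1,500-token tool results:
assert total_input_tokens(20, 2000, 500, 1500) == closed_form(20, 2000, 500, 1500)
print(total_input_tokens(20, 2000, 500, 1500))  # 430000
```

Doubling N from 10 to 20 roughly quadruples the triangular terms, which is the quadratic curve in action.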
Real measurements confirm this. In a documented five-step agent run, per-call token consumption grew as: 888 → 3,400 → 8,900 → 14,200 → 18,900. The cost per step didn't stay flat; it ballooned with each turn because every previous turn stayed in context.
The Hidden Fixed Costs That Multiply With Every Call
Token growth from accumulated history is the obvious problem. Less visible is the set of fixed costs that get paid fresh on every inference call.
System prompts. A detailed system prompt runs 2,000–5,000 tokens for a typical production agent. At one million API calls — not unusual for a customer-facing product — that's 2–5 billion tokens of instruction overhead before a single user message is processed. At scale, system prompts become one of the largest line items in inference spend.
Tool definitions. Every available tool gets serialized into the context for every call, whether the model uses it or not. A modest tool definition costs 50–100 tokens; richer definitions with full parameter schemas and examples run several hundred. A setup with 100 tools can consume roughly 22% of a 128K context window before the user query begins. Measured production setups have found 55,000–134,000 tokens of tool-definition overhead in a single call. One team cut this from 134K to 8,700 tokens (an 85% reduction) by switching from always-on tool definitions to dynamic loading.
Retry overhead. Failed tool calls don't disappear from the context. The error response, the model's next attempt, and any intermediate reasoning all accumulate in the conversation history and get resent with every subsequent call. A 10% failure rate per step, compounded across 10 steps without circuit breakers, can silently multiply costs several times over. One engineering team reduced their per-task tool call count from 14 to 2 by adding clear terminal states (SUCCESS/FAILED) to tool responses — the model stopped retrying ambiguous outcomes.
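The terminal-state fix amounts to wrapping every tool so its response carries an explicit status the model cannot misread as "try again." A minimal sketch, with illustrative field names and a hypothetical `lookup_order` tool:

```python
import json


def run_tool(fn, *args, **kwargs) -> str:
    """Wrap a tool call so its response always ends in an unambiguous state.

    The model sees SUCCESS or FAILED, never a raw traceback it might retry on.
    """
    try:
        result = fn(*args, **kwargs)
        return json.dumps({"status": "SUCCESS", "result": result})
    except Exception as exc:
        # FAILED with retryable=False tells the model to stop, not to
        # attempt a slightly different variation of the same call.
        return json.dumps({"status": "FAILED", "reason": str(exc),
                           "retryable": False})


def lookup_order(order_id: str) -> dict:
    """Hypothetical example tool."""
    if order_id != "A-100":
        raise KeyError(f"no such order: {order_id}")
    return {"order_id": order_id, "state": "shipped"}


print(run_tool(lookup_order, "A-100"))   # status: SUCCESS
print(run_tool(lookup_order, "B-999"))   # status: FAILED, retryable false
```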
Combined, these hidden fixed costs mean that the real undercount for moderate-complexity agents (3–5 tools, multi-step workflows) is typically 5–10x. For complex multi-agent systems, the multiplier reaches 20–50x.
The Framework Tax Nobody Budgets For
Before any task-specific cost is incurred, the orchestration framework itself burns tokens. Measurements across common frameworks show:
- LangGraph: 1.3–1.8x overhead on baseline task cost
- CrewAI: ~2x overhead due to autonomous deliberation before tool calls
- Multi-agent orchestration: Roughly 7x per additional agent added to the pipeline
A four-agent research pipeline running 20 steps doesn't cost 4× a single-agent system. It costs closer to 28× — and that's before any retry loops or context accumulation.
This matters because teams usually prototype with a single agent and then scale to multi-agent architectures to handle complexity, treating the architecture change as a quality improvement rather than a cost event. It's both.
Five Levers That Actually Work
1. Parallel Tool Calls
Sequential tool calls pay the full input token cost of the entire conversation multiple times — once per call. Parallel tool calls batch independent operations into a single inference round, paying input tokens once for a set of work that would otherwise require multiple round trips.
The practical gains are meaningful. In benchmarks, parallel execution produces 1.4x–3.7x latency improvements, and the cost reduction compounds because fewer round trips means less accumulated context per unit of work completed. Not every tool use can be parallelized — some calls depend on the outputs of prior ones — but identifying and batching independent operations is the highest-leverage architectural change available.
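The batching idea can be sketched with `asyncio`. The two tool functions below are stand-ins with simulated I/O latency; the point is that independent calls fan out concurrently and return together in one round:

```python
import asyncio
import time


async def fetch_weather(city: str) -> str:
    """Stand-in for a real tool with ~100ms of I/O latency."""
    await asyncio.sleep(0.1)
    return f"{city}: 18C"


async def fetch_stock(ticker: str) -> str:
    """Another independent stand-in tool."""
    await asyncio.sleep(0.1)
    return f"{ticker}: 142.50"


async def run_parallel_tool_calls(calls):
    """Execute independent tool calls concurrently. Their results return in
    one batch, so the conversation's input tokens are paid once for the
    round rather than once per call."""
    return await asyncio.gather(*(fn(arg) for fn, arg in calls))


start = time.perf_counter()
results = asyncio.run(run_parallel_tool_calls(
    [(fetch_weather, "Berlin"), (fetch_stock, "ACME")]))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")  # both finish in ~0.1s, not 0.2s
```

In a real agent, the batch corresponds to the model emitting multiple tool-use blocks in a single response; the dependency analysis (which calls are safe to batch) is the part that requires care.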
2. Prompt Caching for Repeated Prefixes
The system prompt and tool definitions don't change between calls. Most major LLM providers now offer prefix caching: once a prompt prefix is processed and cached, subsequent calls that share that prefix pay a dramatically reduced input cost. Anthropic charges 90% less for cached reads. OpenAI's automatic caching offers roughly 50% savings on repeated prefixes.
The key implementation detail is cache structure. Caching requires that the cacheable prefix be byte-for-byte identical across calls; dynamic content inserted before the static portions destroys the cache hit. One team improved their cache hit rate from 7% to 74% by moving dynamic content out of the cacheable prefix and into the user message. After full optimization, they reached 84% cache hit rates and cut overall inference costs by 59%, serving 9.8 billion tokens from cache rather than recomputing them.
For a 40-step task with a 20,000-token system prompt, caching means paying full price once and 10% of full price 39 times, instead of full price 40 times. The savings compound with task length.
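The arithmetic is worth writing down. The sketch below assumes a 90% discount on cached reads (Anthropic's rate) and ignores cache-write surcharges for simplicity:

```python
def prefix_cost(steps: int, prefix_tokens: int, price_per_mtok: float,
                cached_read_discount: float = 0.90) -> tuple[float, float]:
    """Cost of resending a static prefix across a multi-step task:
    uncached (full price every call) vs. cached (full price once,
    discounted reads for the remaining calls)."""
    full = prefix_tokens / 1e6 * price_per_mtok
    uncached = steps * full
    cached = full + (steps - 1) * full * (1 - cached_read_discount)
    return uncached, cached


# 40 steps, 20,000-token system prompt, $3 per million input tokens
uncached, cached = prefix_cost(40, 20_000, 3.0)
print(f"uncached ${uncached:.2f}, cached ${cached:.2f}")
# uncached $2.40, cached $0.29
```

The longer the task, the closer the savings approach the discount rate itself, which is why caching matters most for long-horizon agents.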
3. Context Compression Between Turns
The default behavior — append everything and pass it forward — is the most expensive behavior. There are several alternatives, roughly ordered by implementation effort:
Rolling summarization. Replace older turns with a living summary rather than keeping the full transcript. This requires explicit summary management but caps context growth. Research shows 50–80% token reduction is achievable while preserving the information needed to complete the task.
Structured note-taking. Instead of relying on the conversation history as the agent's memory, maintain an external notes file that tracks progress, decisions, and partial results. The notes get updated in compact form; the conversation window stays short. This is the approach recommended in Anthropic's context engineering research for long-horizon tasks.
Sub-agent isolation. Break long tasks into sub-agents, each of which operates on a scoped context and returns a 1,000–2,000 token summary rather than its full working history. The orchestrating agent sees only summaries, keeping the top-level context manageable.
Be careful with aggressive compression. One study found that 99.3% compression of context — technically feasible — actually increased total cost because it forced the agent to re-fetch information it had already retrieved. The sweet spot is usually 50–80% reduction.
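The rolling-summarization option can be sketched in a few lines. This is an illustrative skeleton, not a production implementation: `count_tokens` is a whitespace placeholder for a real tokenizer, and `summarize` stands in for a cheap model call.

```python
def count_tokens(text: str) -> int:
    """Placeholder tokenizer; use a real one (e.g. tiktoken) in practice."""
    return len(text.split())


def summarize(turns: list[str]) -> str:
    """Stand-in for a cheap LLM summarization call over older turns."""
    return "SUMMARY(" + "; ".join(t[:30] for t in turns) + ")"


def compact_history(history: list[str], budget: int,
                    keep_recent: int = 4) -> list[str]:
    """Fold older turns into a living summary once the window exceeds
    `budget`, keeping the most recent turns verbatim. Stops folding once
    only the summary plus recent turns remain, even if still over budget."""
    while (sum(count_tokens(t) for t in history) > budget
           and len(history) > keep_recent + 1):
        old, recent = history[:-keep_recent], history[-keep_recent:]
        history = [summarize(old)] + recent
    return history
```

The budget and `keep_recent` values are the tuning knobs: too tight and the agent re-fetches what it forgot, which is exactly the over-compression failure described above.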
4. Dynamic Tool Loading
Serving all tool definitions in every call is the simplest implementation but the most expensive. A more efficient approach maintains a registry of available tools but passes only two tools to the model by default: one to search the registry and one to invoke a selected tool. When the model decides it needs a specific capability, it searches the registry and loads that tool's definition just-in-time.
One documented implementation reduced tool-definition overhead from 134,000 tokens to 8,700 tokens — an 85% reduction — while maintaining full agent capability. A team with 1,000 tools in their registry would, without this approach, consume more than 50% of a 128K context window on tool definitions alone before the model processes a single word of the user's request.
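The two-meta-tool pattern can be sketched with a registry and keyword search. Names like `search_tools` and `load_tool` are illustrative, and a real system would likely use embedding search rather than substring matching:

```python
# The full catalog lives outside the context window; only the two
# meta-tools below are serialized into every call.
TOOL_REGISTRY = {
    "get_weather": "Return current weather for a city.",
    "send_email": "Send an email to a recipient.",
    "query_orders": "Look up an order by id.",
    # ...hundreds more, never sent to the model by default
}

META_TOOLS = ["search_tools", "load_tool"]  # the only definitions sent up front


def search_tools(query: str, k: int = 3) -> list[str]:
    """Meta-tool: search the registry for relevant capabilities."""
    q = query.lower()
    hits = [name for name, desc in TOOL_REGISTRY.items()
            if q in desc.lower() or q in name]
    return hits[:k]


def load_tool(name: str) -> dict:
    """Meta-tool: return one tool definition just-in-time for the next call."""
    return {"name": name, "description": TOOL_REGISTRY[name]}


# The model asks for "weather" and gets one definition, not the catalog.
print(search_tools("weather"))   # ['get_weather']
print(load_tool("get_weather"))
```

The trade-off is one extra inference round for the search step, which is almost always cheaper than carrying every definition in every call.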
5. Budget Enforcement and Early Stopping
One of the most expensive failure modes is an agent that runs past the point of useful work — retrying failed operations, re-deriving conclusions it already reached, or entering loops without termination conditions. Infrastructure-level token budgets are not primarily a cost-control feature; they are a correctness feature that happens to prevent runaway billing.
Practical early stopping signals include:
- Clear terminal states in tool responses (SUCCESS/FAILED, not just error codes)
- Step counters with hard ceilings per task
- Token budget awareness injected into the system prompt, so the model can modulate its behavior as it approaches the limit
- Circuit breakers that abort on repeated identical tool calls
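Several of these signals combine naturally into one guard object checked before every tool call. A sketch, with illustrative thresholds:

```python
from collections import Counter


class BudgetGuard:
    """Abort an agent loop on step ceilings, token budgets, or repeated
    identical tool calls (a simple circuit breaker)."""

    def __init__(self, max_steps: int = 25, token_budget: int = 200_000,
                 max_identical_calls: int = 3):
        self.max_steps = max_steps
        self.token_budget = token_budget
        self.max_identical = max_identical_calls
        self.steps = 0
        self.tokens = 0
        self.call_counts = Counter()

    def check(self, tool_name: str, tool_args: str,
              tokens_this_call: int) -> None:
        """Raise RuntimeError if any limit is exceeded; call once per step."""
        self.steps += 1
        self.tokens += tokens_this_call
        self.call_counts[(tool_name, tool_args)] += 1
        if self.steps > self.max_steps:
            raise RuntimeError("step ceiling reached")
        if self.tokens > self.token_budget:
            raise RuntimeError("token budget exhausted")
        if self.call_counts[(tool_name, tool_args)] > self.max_identical:
            raise RuntimeError(
                f"circuit breaker: {tool_name} repeated with identical args")
```

The RuntimeError is the point: a hard abort surfaces the failure to the caller instead of letting the loop quietly convert it into tokens.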
Research on budget-aware tool use found that agents given explicit token budget information make meaningfully different choices: they prioritize high-information tool calls, skip redundant verifications, and terminate gracefully instead of continuing until exhausted.
What to Measure Before You Scale
Teams that get surprised by agent costs usually share one characteristic: they measured cost per call but not cost per task. Per-call cost is a lagging indicator — it reflects what already happened. Per-task cost, tracked across the full agent loop, reveals the compounding dynamics before they become a billing event.
The useful metrics for production agent cost management are:
- Tokens per task completion: includes all turns and retries, not just successful calls
- Cache hit rate by prompt layer: system prompt, tool definitions, and conversation window each have different hit rates and different optimization strategies
- Context growth rate: tokens sent in call N versus call N-1, averaged across tasks
- Retry fraction: what percentage of total task cost comes from retried operations
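Three of these metrics reduce to a few lines over per-call logs, assuming each record carries its input token count and a retry flag (the record shape here is an assumption, not a standard):

```python
def task_metrics(calls: list[dict]) -> dict:
    """Per-task cost metrics from an ordered list of call records,
    each shaped like {"input_tokens": int, "retry": bool}."""
    total = sum(c["input_tokens"] for c in calls)
    retry_tokens = sum(c["input_tokens"] for c in calls if c["retry"])
    # context growth: tokens in call N minus tokens in call N-1
    growth = [b["input_tokens"] - a["input_tokens"]
              for a, b in zip(calls, calls[1:])]
    return {
        "tokens_per_task": total,
        "retry_fraction": retry_tokens / total if total else 0.0,
        "avg_context_growth": sum(growth) / len(growth) if growth else 0.0,
    }


calls = [{"input_tokens": 900, "retry": False},
         {"input_tokens": 3400, "retry": False},
         {"input_tokens": 8900, "retry": True}]
print(task_metrics(calls))
```

Cache hit rate is the exception: it has to come from the provider's API response metadata rather than from your own logs.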
With these signals, the optimization opportunities become visible. Without them, you are estimating a quadratic curve using linear assumptions — and the next billing cycle will explain the difference.
The Cost Structure Won't Stay Fixed
Per-token pricing has been falling steadily and is projected to continue falling. The temptation is to treat cost overruns as a temporary problem — cheaper inference will solve it. But cheaper tokens don't fix quadratic scaling. An agent that costs 30x its single-call estimate at today's prices costs 30x at tomorrow's prices too, just in smaller absolute dollars. And as token prices fall, teams tend to run more agents, longer, on harder problems — which means the N in N(N+1)/2 grows, not shrinks.
The fundamental issue is architectural, not pricing. Building agents that hold their context flat, use caching aggressively, and stop when they're done is not a cost optimization applied after the fact — it is how production agents are supposed to be designed. The teams shipping economically viable long-horizon agents in 2026 made these decisions at the architecture stage, not after their first cloud bill landed.
Sources
- https://arxiv.org/html/2601.14470
- https://arxiv.org/html/2603.29919
- https://arxiv.org/html/2511.17006v1
- https://www.augmentcode.com/guides/ai-agent-loop-token-cost-context-constraints
- https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching/
- https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- https://galileo.ai/blog/hidden-cost-of-agentic-ai
- https://arxiv.org/pdf/2407.08892
- https://www.sitepoint.com/optimizing-token-usage-context-compression-techniques/
- https://milvus.io/blog/why-ai-agents-like-openclaw-burn-through-tokens-and-how-to-cut-costs.md
- https://blog.devgenius.io/ai-agent-tool-overload-cut-token-usage-by-99-while-scaling-to-1-000-tools-fc91f8e2b6ab
- https://www.techtarget.com/searchenterpriseai/tip/Practical-tips-for-agentic-ai-cost-optimization
- https://iternal.ai/token-usage-guide
- https://ngrok.com/blog/prompt-caching
