The Hidden Token Tax: How Overhead Silently Drains Your LLM Context Window
Most teams know how many tokens their users send. Almost none know how many tokens they spend before a user says anything at all.
In a typical production LLM pipeline, system prompts, tool schemas, chat history, safety preambles, and RAG prologues silently consume 30–60% of your context window before the actual user query arrives. For agentic systems with dozens of registered tools, that overhead can hit about 43% of a 128k window, roughly 55,000 tokens, spent on tool definitions that never get called.
This is the hidden token tax. It inflates costs, increases latency, and degrades output quality — yet it never shows up in any user-facing metric.
Anatomy of a Taxed Request
Consider what happens when a user sends "What meetings do I have today?" Here's what ships alongside that 8-token query:
- System prompt (behavior rules, persona, guardrails): 1,500–3,000 tokens
- Tool/function definitions (names, descriptions, parameter schemas): 5,000–55,000 tokens
- Chat history (prior turns for conversational context): 2,000–10,000 tokens
- RAG context (retrieved documents or knowledge base chunks): 1,000–5,000 tokens
- Safety preambles and output format instructions: 500–1,000 tokens
- The actual user message: 8 tokens
That's at least 10,000 tokens of overhead for an 8-token question, and with large tool registries, easily 60,000+. Every token gets billed at your input rate and competes for the model's finite attention budget.
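The arithmetic can be checked with a short sketch. The ranges are the ones listed in this section; the dictionary and function names are hypothetical, chosen only for illustration.

```python
# Token ranges (low, high) per overhead component, taken from the list above.
OVERHEAD_RANGES = {
    "system_prompt": (1_500, 3_000),
    "tool_definitions": (5_000, 55_000),
    "chat_history": (2_000, 10_000),
    "rag_context": (1_000, 5_000),
    "safety_and_format": (500, 1_000),
}
USER_MESSAGE_TOKENS = 8


def overhead_bounds(ranges):
    """Return the (low, high) total across all overhead components."""
    low = sum(lo for lo, _ in ranges.values())
    high = sum(hi for _, hi in ranges.values())
    return low, high


def user_share(overhead_tokens, user_tokens=USER_MESSAGE_TOKENS):
    """Fraction of the input that is the user's own message."""
    return user_tokens / (overhead_tokens + user_tokens)


low, high = overhead_bounds(OVERHEAD_RANGES)
print(f"overhead: {low:,} to {high:,} tokens")
print(f"user share at low end: {user_share(low):.3%}")
print(f"user share at high end: {user_share(high):.3%}")
```

The low ends sum to 10,000 tokens and the high ends to 74,000, so even in the cheapest case the user's message is well under 0.1% of the input.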
Multi-turn conversations compound the problem. A 20-turn conversation accumulates 5,000–10,000 tokens of history, yet only the last few turns typically matter. You pay for all of them on every single call: the per-call tax grows linearly with conversation length, so the cumulative bill grows roughly quadratically, and it never shrinks on its own.
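To see how re-sent history adds up, here is a minimal sketch. It assumes every turn adds a fixed number of tokens to the history and that the full history is re-sent on every call. The 400-tokens-per-turn figure is illustrative, not a measurement, and the function names are hypothetical.

```python
def history_tokens_per_call(turns, tokens_per_turn):
    """History size sent with each call: call n carries the n-1 prior turns."""
    return [(n - 1) * tokens_per_turn for n in range(1, turns + 1)]


def cumulative_history_tokens(turns, tokens_per_turn):
    """Total history tokens billed across the whole conversation."""
    return sum(history_tokens_per_call(turns, tokens_per_turn))


TOKENS_PER_TURN = 400  # assumed average; not a figure from the article
per_call = history_tokens_per_call(20, TOKENS_PER_TURN)
total = cumulative_history_tokens(20, TOKENS_PER_TURN)
print(f"history on call 20: {per_call[-1]:,} tokens")
print(f"history billed over 20 calls: {total:,} tokens")
```

Under these assumptions the 20th call carries 7,600 tokens of history, and the conversation as a whole is billed for 76,000 history tokens.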
Tool Schemas: The Biggest Silent Offender
Tool definitions are the single largest source of hidden overhead. Each definition carries a surprising cost:
- Tool name: 5–10 tokens
- Description: 50–150 tokens
- Argument schema (types, required fields): 100–300 tokens
- Field descriptions and constraints: 50–200 tokens
- Few-shot examples for reliable invocation: 200–500 tokens
That totals roughly 400–1,200 tokens per tool (405 at the low end of every range, 1,160 at the high end). Most teams never notice because their framework injects these definitions automatically, so the tax hides behind the abstraction.
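One way to make this cost visible is to serialize each tool definition and estimate its size before it reaches the model. The sketch below is a rough illustration, not a measurement tool: it assumes about 4 characters per token (a common rule of thumb, not a real tokenizer), and the tool definition and function names are hypothetical. For real numbers, swap in your provider's tokenizer.

```python
import json
import math

CHARS_PER_TOKEN = 4  # rough rule of thumb; an assumption, not a tokenizer


def estimate_tokens(text):
    """Crude token estimate based on character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def tool_overhead(tools):
    """Map each tool name to the estimated tokens of its serialized definition."""
    return {tool["name"]: estimate_tokens(json.dumps(tool)) for tool in tools}


# Hypothetical tool definition, shaped like a typical JSON-schema function spec.
tools = [
    {
        "name": "list_calendar_events",
        "description": "List the user's calendar events within a date range.",
        "parameters": {
            "type": "object",
            "properties": {
                "start": {"type": "string", "description": "ISO 8601 start date"},
                "end": {"type": "string", "description": "ISO 8601 end date"},
            },
            "required": ["start", "end"],
        },
    }
]

overhead = tool_overhead(tools)
for name, tokens in sorted(overhead.items(), key=lambda kv: -kv[1]):
    print(f"{name}: ~{tokens} tokens")
print(f"registry total: ~{sum(overhead.values())} tokens")
```

Logging this per-tool breakdown on every deploy is one simple way to catch a registry that has quietly grown.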
Real-world measurements from agents connecting to MCP servers reveal the scale of the problem:
- GitHub (35 tools): ~26,000 tokens
- Slack (11 tools): ~21,000 tokens
- Observability tools: ~8,000 tokens
Together that is about 55,000 tokens, roughly 43% of a 128k context window, gone before the developer types a single character. Every token is billed whether or not any tool gets called: a simple "summarize this document" still pays the full tax.
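The share of the window can be worked out directly from the three measurements. This sketch assumes "128k" means 128,000 tokens; the variable names are hypothetical.

```python
CONTEXT_WINDOW = 128_000  # assumes "128k" means 128,000 tokens

# (approximate tokens, tool count) per server, from the measurements above.
# The tool count for the observability server is not given, so it is None.
SERVERS = {
    "GitHub": (26_000, 35),
    "Slack": (21_000, 11),
    "Observability": (8_000, None),
}

total = sum(tokens for tokens, _ in SERVERS.values())
share = total / CONTEXT_WINDOW

for name, (tokens, count) in SERVERS.items():
    per_tool = f"~{tokens / count:,.0f} per tool" if count else "per-tool cost unknown"
    print(f"{name}: {tokens:,} tokens ({per_tool})")
print(f"total: {total:,} tokens, {share:.1%} of the window")
```

The three servers total 55,000 tokens, or roughly 43% of the window. The per-tool averages also differ sharply: about 740 tokens per GitHub tool against about 1,900 per Slack tool, so tool count alone is a poor proxy for cost.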
Selection accuracy also degrades as the registry grows:
- 5–10 tools: over 90% selection accuracy
- 50+ tools: drops to around 49% — coin-flip territory
More tokens, worse results.
The Tax Multiplies Across Chained Calls
The token tax is not a one-time charge: it multiplies with every call in the chain. An agentic workflow chaining three LLM calls (intent classification, database query, response formatting), each carrying 20,000 tokens of overhead, burns 60,000 tokens of structural cost for a 200-token answer. That's a 300:1 overhead-to-value ratio.
Agent loops hit even harder. An agent that takes 10 steps, each carrying the full system prompt and tool definitions (20,000–50,000 tokens per step), burns 200,000–500,000 tokens on overhead alone. At $3 per million input tokens, that's $0.60–$1.50 per task just for the tax, before counting the tokens doing useful work.
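The chained-call and agent-loop figures follow from simple multiplication, sketched below. The function names are hypothetical, and the per-step overhead of 20,000 to 50,000 tokens is inferred from the 200,000–500,000 range over 10 steps rather than stated as a measurement.

```python
PRICE_PER_MILLION_INPUT = 3.00  # USD, the rate used in the example above


def overhead_tokens(calls, overhead_per_call):
    """Structural tokens re-sent across a chain of calls."""
    return calls * overhead_per_call


def overhead_cost(tokens, price_per_million=PRICE_PER_MILLION_INPUT):
    """Input cost in USD for a given number of overhead tokens."""
    return tokens / 1_000_000 * price_per_million


# Three chained calls at 20,000 overhead tokens each, for a 200-token answer.
chain = overhead_tokens(calls=3, overhead_per_call=20_000)
print(f"chain overhead: {chain:,} tokens, ratio {chain // 200}:1")

# Ten agent steps; 20,000 to 50,000 per step reproduces the 200k to 500k range.
for per_step in (20_000, 50_000):
    tokens = overhead_tokens(calls=10, overhead_per_call=per_step)
    print(f"{per_step:,}/step: {tokens:,} tokens, ${overhead_cost(tokens):.2f}")
```

Plugging in your own step count, per-call overhead, and input price gives a quick estimate of what the tax costs per task.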
