
The Hidden Token Tax: How Overhead Silently Drains Your LLM Context Window

· 8 min read
Tian Pan
Software Engineer

Most teams know how many tokens their users send. Almost none know how many tokens they spend before a user says anything at all.

In a typical production LLM pipeline, system prompts, tool schemas, chat history, safety preambles, and RAG prologues silently consume 30–60% of your context window before the actual user query arrives. For agentic systems with dozens of registered tools, tool definitions alone can take roughly 55,000 tokens, more than 40% of a 128k window, whether or not any of those tools is ever called.

This is the hidden token tax. It inflates costs, increases latency, and degrades output quality — yet it never shows up in any user-facing metric.

Anatomy of a Taxed Request

Consider what happens when a user sends "What meetings do I have today?" Here's what ships alongside that 8-token query:

  • System prompt (behavior rules, persona, guardrails): 1,500–3,000 tokens
  • Tool/function definitions (names, descriptions, parameter schemas): 5,000–55,000 tokens
  • Chat history (prior turns for conversational context): 2,000–10,000 tokens
  • RAG context (retrieved documents or knowledge base chunks): 1,000–5,000 tokens
  • Safety preambles and output format instructions: 500–1,000 tokens
  • The actual user message: 8 tokens

That's at least 10,000 tokens of overhead for an 8-token question, and with large tool registries easily 60,000+. Every token gets billed at your input rate and competes for the model's finite attention budget.
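To make the breakdown concrete, here is a minimal sketch that sums the per-component ranges listed above and computes how much of the request is overhead. The dictionary keys and function names are illustrative, not part of any framework.

```python
# Token ranges (low, high) per component, copied from the breakdown above.
OVERHEAD = {
    "system_prompt": (1_500, 3_000),
    "tool_definitions": (5_000, 55_000),
    "chat_history": (2_000, 10_000),
    "rag_context": (1_000, 5_000),
    "safety_and_format": (500, 1_000),
}
USER_MESSAGE_TOKENS = 8


def overhead_range(components):
    """Return (low, high) total overhead tokens across all components."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


def overhead_share(overhead_tokens, user_tokens):
    """Fraction of the request's input tokens that is overhead."""
    return overhead_tokens / (overhead_tokens + user_tokens)


low, high = overhead_range(OVERHEAD)
print(f"overhead: {low:,} to {high:,} tokens")
print(f"share at the low end: {overhead_share(low, USER_MESSAGE_TOKENS):.2%}")
```

Even at the low end of every range, the user's message is a rounding error next to the scaffolding around it.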

Multi-turn conversations compound the problem. A 20-turn conversation accumulates 5,000–10,000 tokens of history, yet only the last few turns typically matter. You pay for all of them on every single call: the per-call tax grows linearly with conversation length, the cumulative bill grows roughly quadratically, and neither shrinks on its own.
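A rough sketch of how resent history adds up over a conversation. The 375 tokens per turn is an assumed average (it puts a 20-turn conversation inside the 5,000–10,000 range above), and keeping only the last 4 turns is an illustrative policy, not a recommendation from any specific framework.

```python
def history_tokens_per_call(turn, tokens_per_turn, keep_last=None):
    """History tokens sent with the call for `turn` (1-indexed).

    By default every prior turn is resent; with `keep_last`, only the
    most recent `keep_last` turns are.
    """
    prior_turns = turn - 1
    if keep_last is not None:
        prior_turns = min(prior_turns, keep_last)
    return prior_turns * tokens_per_turn


def cumulative_history_tokens(turns, tokens_per_turn, keep_last=None):
    """Total history tokens billed across a whole conversation."""
    return sum(
        history_tokens_per_call(t, tokens_per_turn, keep_last)
        for t in range(1, turns + 1)
    )


TOKENS_PER_TURN = 375  # assumed average, for illustration only

full = cumulative_history_tokens(20, TOKENS_PER_TURN)
windowed = cumulative_history_tokens(20, TOKENS_PER_TURN, keep_last=4)
print(f"full history resent every call: {full:,} tokens billed")
print(f"last 4 turns only:              {windowed:,} tokens billed")
```

The per-call history grows by one turn each time, so the total billed over the conversation grows with the square of its length unless something trims it.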

Tool Schemas: The Biggest Silent Offender

Tool definitions are the single largest source of hidden overhead. Each definition carries a surprising cost:

  • Tool name: 5–10 tokens
  • Description: 50–150 tokens
  • Argument schema (types, required fields): 100–300 tokens
  • Field descriptions and constraints: 50–200 tokens
  • Few-shot examples for reliable invocation: 200–500 tokens

Summing those ranges gives roughly 400–1,160 tokens per tool. Most teams never notice because their framework injects these definitions automatically, so the tax hides behind the abstraction.
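The per-tool estimate can be reproduced by summing the itemized ranges and scaling by registry size. This is a back-of-the-envelope sketch: the component names are illustrative, and the 50-tool registry is a hypothetical example.

```python
# Per-tool token ranges (low, high), copied from the list above.
TOOL_COMPONENTS = {
    "name": (5, 10),
    "description": (50, 150),
    "argument_schema": (100, 300),
    "field_descriptions": (50, 200),
    "few_shot_examples": (200, 500),
}


def per_tool_range(components):
    """Return (low, high) tokens for a single tool definition."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


def registry_range(n_tools, components=TOOL_COMPONENTS):
    """Return (low, high) tokens for a registry of n_tools definitions."""
    low, high = per_tool_range(components)
    return n_tools * low, n_tools * high


tool_low, tool_high = per_tool_range(TOOL_COMPONENTS)
reg_low, reg_high = registry_range(50)  # hypothetical 50-tool registry
print(f"per tool: {tool_low:,} to {tool_high:,} tokens")
print(f"50 tools: {reg_low:,} to {reg_high:,} tokens")
```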

Real-world measurements from agents connecting to MCP servers reveal the scale of the problem:

  • GitHub (35 tools): ~26,000 tokens
  • Slack (11 tools): ~21,000 tokens
  • Observability tools: ~8,000 tokens

Together that's about 55,000 tokens, more than 40% of a 128k context window, gone before the developer types a single character. Every one of those tokens is billed whether or not any tool gets called: a simple "summarize this document" still pays the full tax.
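A small sketch for checking how much of the window a set of connected servers consumes, using the measurements listed above. Treating "128k" as 128,000 tokens is an assumption, and the function names are illustrative.

```python
# Approximate tool-definition tokens per MCP server, from the list above.
MCP_TOOL_TOKENS = {
    "github": 26_000,
    "slack": 21_000,
    "observability": 8_000,
}
CONTEXT_WINDOW = 128_000  # assumes "128k" means 128,000 tokens


def total_tool_tokens(servers):
    """Total tokens spent on tool definitions across all servers."""
    return sum(servers.values())


def window_share(servers, window):
    """Fraction of the context window consumed by tool definitions."""
    return total_tool_tokens(servers) / window


total = total_tool_tokens(MCP_TOOL_TOKENS)
remaining = CONTEXT_WINDOW - total
print(f"tool definitions: {total:,} tokens")
print(f"share of window:  {window_share(MCP_TOOL_TOKENS, CONTEXT_WINDOW):.1%}")
print(f"left for everything else: {remaining:,} tokens")
```

Running the same calculation against your own registry is a quick way to see what is left for history, retrieval, and the actual request.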

Selection accuracy also degrades as the registry grows:

  • 5–10 tools: over 90% selection accuracy
  • 50+ tools: drops to around 49% — coin-flip territory

More tokens, worse results.

The Tax Multiplies Across Chained Calls

The token tax isn't paid once per task; it is paid again on every call. An agentic workflow chaining three LLM calls (intent classification, database query, response formatting), each carrying 20,000 tokens of overhead, burns 60,000 tokens of structural cost for a 200-token answer. That's a 300:1 overhead-to-value ratio.

Agent loops hit even harder. An agent that takes 10 steps, each carrying 20,000–50,000 tokens of system prompt and tool definitions, burns 200,000–500,000 tokens on overhead alone. At $3 per million input tokens, that's $0.60–$1.50 per task just for the tax, before counting the tokens doing useful work.
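The arithmetic in this section can be sketched as follows. The 20,000–50,000 tokens per step is what the 200,000–500,000 total over 10 steps implies, and the helper names are illustrative.

```python
def overhead_tokens(calls, overhead_per_call):
    """Overhead is resent on every call, so it scales with call count."""
    return calls * overhead_per_call


def input_cost_usd(tokens, usd_per_million):
    """Input cost at a flat per-million-token rate."""
    return tokens * usd_per_million / 1_000_000


# Three chained calls, 20,000 overhead tokens each, for a 200-token answer.
chain_overhead = overhead_tokens(3, 20_000)
chain_ratio = chain_overhead / 200

# Ten-step agent loop at the implied 20,000 to 50,000 tokens per step.
agent_low = overhead_tokens(10, 20_000)
agent_high = overhead_tokens(10, 50_000)
cost_low = input_cost_usd(agent_low, 3.0)
cost_high = input_cost_usd(agent_high, 3.0)

print(f"chain: {chain_overhead:,} overhead tokens, {chain_ratio:.0f}:1 ratio")
print(f"agent: {agent_low:,} to {agent_high:,} overhead tokens")
print(f"agent tax at $3/M input: ${cost_low:.2f} to ${cost_high:.2f} per task")
```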
