The MCP Capability Disclosure Tax: When Every Connected Server Bills Your Context Window
Connect a single GitHub MCP server to your agent and you've already spent twelve to forty thousand tokens before the user types a word. Add a filesystem server, a calendar, a database, an internal CRM, and a third-party tool catalog, and you're in the territory of the heavy desktop configurations that have been measured at sixty-six thousand tokens of pure tool disclosure — nearly a third of Claude Sonnet's 200K window, paid every single planning turn. The agent hasn't done anything yet. The user hasn't asked anything yet. The bill is already running.
This is the disclosure tax, and it is the most underpriced line item in agentic systems shipping right now. Teams add MCP servers the way they once added microservices — each integration looks like a free composition primitive, the procurement story writes itself ("more tools = more capability"), and the unit economics dashboard never surfaces the per-server cost because the cost lives inside a token bucket nobody attributes back to the connector. The result is an agent that gets slower, dumber, and more expensive every time someone adds another integration, and a team that explains the regression by re-tuning prompts and chasing the model vendor for a new version.
The frame this post wants to install: MCP is not a free composition primitive. It is a context-budget consumer with a measurable per-server tax, and the team that doesn't price the tax explicitly is going to discover its agent's effective context shrinks every quarter while the tooling sprawl grows.
The Disclosure Is Re-Paid Every Turn, Not Once
The most common misconception about MCP is that capability disclosure is a setup cost — a one-time handshake, an init-time discovery, a configuration step that happens before the conversation starts. The intuition is wrong, and the cost model that follows from the intuition is wrong by an order of magnitude.
Tool definitions are injected into the model's context on every planning step. Each turn, the planner re-reads the full tool catalog. The median tool definition runs around seven hundred tokens once you count the name, description, input schema, parameter docs, and the schema-format boilerplate the protocol requires, so even a modest setup of a couple of servers exposing a few dozen tools between them lands somewhere between fifteen and thirty thousand tokens of disclosure overhead per turn. A multi-turn agent with eight MCP servers averaging forty tools each sends three hundred and twenty tool definitions into every planning call, which at that median is north of two hundred thousand tokens and doesn't fit in the window at all. Whatever the figure is for your stack, it is repeated for every reasoning step in a multi-step task, before the user message, before the conversation history, before any retrieved context.
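If you want that number for your own stack rather than an average from someone else's benchmark, the arithmetic is easy to script. The sketch below is a rough estimator, not a billing tool: it approximates tokens at four characters each rather than using the model's real tokenizer, and the tool definitions it reads are whatever your runtime actually puts in the tools array; the placeholder data at the bottom is purely illustrative.

```python
import json

def estimate_tokens(text: str) -> int:
    # Rough heuristic: roughly four characters per token. Real tokenizers
    # (Claude's included) will differ; treat this as a ballpark.
    return max(1, len(text) // 4)

def disclosure_tax(servers: dict[str, list[dict]], steps_per_task: int = 5) -> None:
    # `servers` maps a server name to the list of tool definitions
    # (name, description, inputSchema) it contributes to the tools array.
    total = 0
    for name, tools in servers.items():
        per_server = sum(estimate_tokens(json.dumps(t)) for t in tools)
        total += per_server
        print(f"{name:<20} {len(tools):>3} tools  ~{per_server:>6} tokens per turn")
    print(f"{'total':<20} ~{total} tokens of disclosure per planning turn")
    print(f"over a {steps_per_task}-step task: ~{total * steps_per_task} tokens before any real work")

if __name__ == "__main__":
    # Placeholder shape only: dump your runtime's actual tools array here.
    example = {
        "github": [
            {"name": "create_issue",
             "description": "Create an issue in a repository...",
             "inputSchema": {"type": "object",
                             "properties": {"title": {"type": "string"}}}},
        ],
    }
    disclosure_tax(example)
```

Run it once per connector and the per-server line item stops being invisible.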
The compounding effect shows up in three places nobody attributes correctly. First, the per-request token bill scales linearly with installed servers while the marginal user value per server flattens — the eighth integration delivers a fraction of the value of the first, but pays the same disclosure cost. Second, the prompt cache breakpoint, which most teams place at the end of the tool definitions block to amortize the disclosure, sits in the most volatile region of the prompt — any server added, removed, reordered, or returning a slightly different description (a timestamp in a tool description, a dynamic file count, an MCP server whose async init produces a different ordering across restarts) invalidates the cache and the team pays full price until the next stable run. Third, the model's attention is spread across hundreds of tool definitions, and tool-selection accuracy degrades in a way that looks like a prompt regression but is actually a disclosure-load regression.
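The second effect is the easiest to instrument, because the drift that busts the cache is visible in the bytes you send. A minimal sketch, assuming you can intercept the tools array just before each planning call: fingerprint each server's slice of the disclosure and log when it changes between calls. The server-name-prefix convention used for grouping ("github__create_issue") is an assumption about your runtime's tool naming, not something the protocol guarantees.

```python
import hashlib
import json

_last_fingerprints: dict[str, str] = {}

def fingerprint_tools(tools: list[dict]) -> dict[str, str]:
    # Hash each server's slice of the tools array so a cache-busting
    # change can be attributed to the server that caused it.
    by_server: dict[str, list[dict]] = {}
    for tool in tools:
        server = tool["name"].split("__", 1)[0]
        by_server.setdefault(server, []).append(tool)
    return {
        server: hashlib.sha256(
            json.dumps(slice_, sort_keys=True).encode()
        ).hexdigest()[:12]
        for server, slice_ in by_server.items()
    }

def log_prefix_drift(tools: list[dict]) -> None:
    global _last_fingerprints
    current = fingerprint_tools(tools)
    for server, digest in current.items():
        previous = _last_fingerprints.get(server)
        if previous is not None and previous != digest:
            # This server's definitions changed between planning calls:
            # every cached prefix that includes it is now cold.
            print(f"[cache-bust] {server} tool definitions changed ({previous} -> {digest})")
    _last_fingerprints = current
```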
The Attention Cost Is Larger Than the Token Cost
Engineers who optimize for token cost alone are pricing half the bill. The disclosure tax has a second invoice that doesn't show up on the usage dashboard at all: model accuracy.
The benchmark numbers are stark. At ten available tools, capable models score effectively perfect on tool selection. At twenty, the best models still hit 95%. By the time the tool count crosses around twenty-five, accuracy begins to slip measurably. At one hundred and seven tools, models fail outright — task success collapses, and tool-selection accuracy in some studies drops from above 40% to below 14%, a threefold degradation that leaves the agent picking the wrong tool roughly seven times out of eight. GitHub trimmed their Copilot MCP integration from forty built-in tools to thirteen, recovered two to five percentage points on SWE-Lancer and SWE-bench Verified, and shaved roughly four hundred milliseconds of latency. The trim wasn't a cost optimization — it was an accuracy fix priced as a cost optimization.
The mechanism is "lost in the middle" amplified by tool overload. The model has to attend across a context that is largely tool descriptions for tools that won't be called. Signal from the actual task gets diluted by noise from the dozens or hundreds of tool definitions the planner doesn't need this turn. The team observes a tool-selection regression, runs a few prompt-engineering experiments, and ends up writing a sharper system prompt that produces a transient lift, which fades as the team adds two more servers. The underlying problem is the disclosure load. The fix is not in the prompt.
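The fix that does work looks like GitHub's forty-to-thirteen trim: expose fewer definitions per planning call. Below is a minimal sketch of per-turn tool selection, assuming your runtime lets you filter the tools array before the request goes out; the keyword-overlap scoring is deliberately naive and stands in for whatever relevance signal you actually trust (routing rules, embeddings, recent usage).

```python
def select_tools(all_tools: list[dict], task_text: str, budget: int = 15) -> list[dict]:
    # Return at most `budget` tool definitions for this planning turn.
    task_terms = set(task_text.lower().split())

    def score(tool: dict) -> int:
        haystack = (tool["name"] + " " + tool.get("description", "")).lower()
        return sum(1 for term in task_terms if term in haystack)

    # Highest-overlap tools first; everything else stays undisclosed this turn.
    return sorted(all_tools, key=score, reverse=True)[:budget]
```

The planner now attends over fifteen definitions instead of hundreds, and the disclosure tax on the turn shrinks with it.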
Prompt Cache Breakpoints Cannot Save You From Volatility
The first instinct when the disclosure cost shows up on the invoice is to lean harder on prompt caching. Place a cache breakpoint at the end of the tools block, pay the disclosure cost once, amortize it across the cache window. On paper, this cuts input cost on the tool prefix by ninety percent. In practice, the cache savings are real but fragile, and the team that depends on them is one config change away from a surprise.
The fragility is structural. Prompt caching requires byte-for-byte prefix matching in the strict order tools → system → messages. MCP servers initialize asynchronously, which means the tool array can come back in a different order on different restarts unless the agent runtime sorts it deterministically. A tool whose description includes a timestamp, a build hash, or a dynamic count of available resources busts the cache on every call. A tool whose schema is regenerated from a typed source on each server boot can produce subtly different output between versions of the source library. Any of these silently turns a cached prefix into a cold one: the latency doubles, the input cost on the affected calls goes up roughly tenfold, and the team blames the model vendor.
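Both of the controllable failure modes, non-deterministic ordering and volatile descriptions, live at the point where the runtime assembles the request, which is also where the breakpoint goes. Here is a sketch against the Anthropic Messages API, where the breakpoint is expressed as cache_control on the last tool definition; the sorting and the volatility heuristic around it are assumptions about your own runtime, not requirements of MCP or the SDK.

```python
import re

def prepare_tools_for_caching(tools: list[dict]) -> list[dict]:
    # Sort the tools array deterministically and place the prompt-cache
    # breakpoint on the last definition, so the whole disclosure block
    # becomes a stable, cacheable prefix.

    # 1. Async MCP server init can return tools in a different order on
    #    every restart; sort by name so the serialized prefix is stable.
    tools = sorted(tools, key=lambda t: t["name"])

    # 2. Warn on descriptions that look volatile (dates, dynamic counts),
    #    since a single changing byte re-cools the entire prefix.
    volatile = re.compile(r"\d{4}-\d{2}-\d{2}|\b\d+ (?:files|items|rows)\b")
    for tool in tools:
        if volatile.search(tool.get("description", "")):
            print(f"[warn] possibly volatile description on {tool['name']}")

    # 3. Mark the end of the tools block as a cache breakpoint
    #    (Anthropic Messages API: cache_control on the last tool).
    if tools:
        tools[-1] = {**tools[-1], "cache_control": {"type": "ephemeral"}}
    return tools
```

Pass the result straight into the tools parameter of messages.create and the disclosure block becomes a prefix that survives restarts instead of one that cools every time a server wakes up in a different order.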
Sources
- https://www.anthropic.com/engineering/code-execution-with-mcp
- https://docs.bswen.com/blog/2026-04-24-mcp-token-overhead/
- https://www.mmntm.net/articles/mcp-context-tax
- https://mcpplaygroundonline.com/blog/mcp-token-counter-optimize-context-window
- https://thenewstack.io/how-to-reduce-mcp-token-bloat/
- https://www.stackone.com/blog/mcp-token-optimization/
- https://www.atlassian.com/blog/developer/mcp-compression-preventing-tool-bloat-in-ai-agents
- https://dev.to/nebulagg/mcp-tool-overload-why-more-tools-make-your-agent-worse-5a49
- https://matthewkruczek.ai/blog/progressive-disclosure-mcp-servers.html
- https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1576
- https://layered.dev/mcp-tool-schema-bloat-the-hidden-token-tax-and-how-to-fix-it/
- https://www.agentpmt.com/articles/thousands-of-mcp-tools-zero-context-left-the-bloat-tax-breaking-ai-agents
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/tool-use-with-prompt-caching
