MCP Server Sprawl: The Unbounded Tool Surface Nobody Owns
The Model Context Protocol did exactly what it set out to do: it made giving an agent a new capability almost free. Wiring in a calendar server, a database server, an internal company server, or one of the 30,000-tool catalogs that vendors now publish is a config change, not a project. That frictionlessness is the feature. It is also the problem.
Because adding a tool is cheap, every team adds tools. The data team wires in a warehouse server. The support team adds a ticketing server. Someone connects a filesystem server for a one-off task and never removes it. None of these decisions is wrong. But there is no decision that owns their sum — the aggregate tool surface your agent now carries on every single request. The tool list has become a dependency graph with a real carrying cost, and in most organizations it is the one dependency graph nobody is responsible for.
The result is sprawl: a tool catalog that grows monotonically, gets reviewed by no one, costs more every quarter, and quietly makes the agent worse. This is the unowned surface, and it deserves the same scrutiny you already give your API surface and your npm tree.
Every Listed Tool Is Billed, Whether or Not It Runs
The first cost of sprawl is the one teams underestimate, because it does not show up where they look for it. A tool you never call still costs you on every turn.
Each connected MCP tool contributes its name, description, JSON schema, field documentation, and enum values to the model's context. Estimates vary by tool complexity, but a single tool commonly runs 500 to 1,400 tokens of pure definition. Multiply that across a moderately connected agent and the numbers get loud fast: one survey of real deployments found 81 tools consuming over 20,000 tokens before a single user message was processed. Other measured setups were worse: three servers burning 143,000 tokens of a 200,000-token window, leaving the agent 57,000 tokens, under 30% of its context, for the actual conversation, retrieved documents, and reasoning.
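To make the per-tool footprint concrete, here is a minimal sketch that estimates how many tokens a set of tool definitions occupies. It assumes a rough rule of thumb of about four characters per token, which is only an approximation; a real audit would use the target model's own tokenizer. The `search_tickets` tool and the helper names are hypothetical, and the toy definition is far smaller than the 500 to 1,400 tokens quoted above.

```python
import json

# Assumption: ~4 characters per token is a rough heuristic, not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(tool: dict) -> int:
    """Approximate the context cost of one serialized tool definition."""
    serialized = json.dumps(tool, separators=(",", ":"))
    return max(1, len(serialized) // CHARS_PER_TOKEN)


def disclosure_cost(tools: list) -> int:
    """Approximate tokens spent listing every connected tool on one turn."""
    return sum(estimate_tokens(tool) for tool in tools)


def context_share(tools: list, window: int) -> float:
    """Fraction of the context window consumed by tool definitions alone."""
    return disclosure_cost(tools) / window


# Hypothetical tool definition, shaped like an MCP tool (name, description, inputSchema).
ticket_search = {
    "name": "search_tickets",
    "description": "Search support tickets by free-text query and status.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Free-text search terms."},
            "status": {"type": "string", "enum": ["open", "pending", "closed"]},
        },
        "required": ["query"],
    },
}

# 81 copies stand in for an 81-tool catalog; real tools would differ in size.
catalog = [ticket_search] * 81
print(estimate_tokens(ticket_search), "tokens per tool (estimated)")
print(disclosure_cost(catalog), "tokens for the whole catalog (estimated)")
print(f"{context_share(catalog, 200_000):.1%} of a 200,000-token window")
```

Running an estimate like this against the real tool list, per server, is one way to see which integrations dominate the bill.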
The intuition that breaks here is the belief that capability disclosure is a one-time setup cost. It is not. The tool definitions are re-sent on every planning step the agent takes. An agent that runs eight reasoning hops to finish a task pays the full disclosure bill eight times. The token cost scales linearly with the number of connected servers, while the marginal user value of the eighth server flattens. That is the shape of a tax, not an investment.
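The arithmetic is simple enough to write down. The sketch below multiplies the per-turn disclosure cost by the number of hops. It uses 500 tokens per tool (the low end of the range quoted earlier) and a hypothetical 10 tools per server, and it deliberately ignores any prompt-caching discount, so treat the output as an illustration of the shape rather than a measurement.

```python
def disclosure_tokens_per_task(servers: int, tools_per_server: int,
                               tokens_per_tool: int, hops: int) -> int:
    """Tokens spent re-sending tool definitions across one multi-hop task."""
    per_turn = servers * tools_per_server * tokens_per_tool
    return per_turn * hops


# Hypothetical fleet: 10 tools per server, 500 tokens per tool, 8 reasoning hops.
for servers in (1, 2, 4, 8):
    total = disclosure_tokens_per_task(
        servers, tools_per_server=10, tokens_per_tool=500, hops=8
    )
    print(f"{servers} servers -> {total:,} disclosure tokens per task")
```

Under the same logic, the 20,000-token catalog mentioned above costs 160,000 tokens of pure disclosure over an eight-hop task.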
It also corrupts something subtler: your prompt cache. Cache breakpoints want a stable prefix. When the tool-disclosure block sits in the middle of that prefix and a team adds or removes one server, everything cached from that point onward is invalidated. Your hit rate degrades and nobody connects it to last week's "harmless" integration.
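A toy model shows why one added tool ripples downstream. The sketch below assumes a prefix-style cache in which each breakpoint's key is a hash of everything up to that point. This is an illustration, not any provider's actual cache implementation, and the segment contents and tool names are hypothetical.

```python
import hashlib
import json


def render_tools(tools: list) -> str:
    """Serialize tool definitions deterministically so ordering cannot change the bytes."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return json.dumps(ordered, sort_keys=True)


def prefix_keys(segments: list) -> list:
    """One cache key per breakpoint: a hash of all content up to and including it."""
    keys = []
    running = hashlib.sha256()
    for segment in segments:
        running.update(segment.encode("utf-8"))
        running.update(b"\x00")  # separator so segment boundaries stay unambiguous
        keys.append(running.copy().hexdigest())
    return keys


system = "You are a support agent."
history = "user: where is my order?"
tools = [{"name": "search_tickets"}, {"name": "get_user"}]

before = prefix_keys([system, render_tools(tools), history])
# Someone connects one more (hypothetical) server with a single tool.
after = prefix_keys([system, render_tools(tools + [{"name": "read_file"}]), history])

for label, old, new in zip(("system", "tools", "history"), before, after):
    print(label, "hit" if old == new else "miss")
# prints: system hit, tools miss, history miss
```

Sorting the tool list before serializing at least keeps reordering from causing misses; it does nothing about a genuine addition or removal.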
More Tools Do Not Degrade the Agent Gracefully — They Break It
If the token tax were the whole story, you could buy your way out with a bigger context window. The harder problem is that sprawl attacks accuracy, and it does so on a cliff rather than a slope.
When researchers measured tool selection against bloated catalogs, accuracy did not erode politely. It collapsed — from around 43% down to under 14% as the tool set grew, a threefold degradation. Controlled tests show the same cliff: at 10 tools, selection was effectively perfect; at 20 tools, a strong model still scored 19 out of 20; at roughly 100-plus tools, both large and small models failed outright. Past a threshold, models do not get a little worse with each added tool. They fall off.
The mechanism is the same instruction-following degradation that makes long prompts drift off-task. The model's attention is finite, and you are spreading it across hundreds of similarly described options. It starts conflating parameters between tools and, worse, inventing tool names that do not exist.
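Invented tool names are at least cheap to detect after the fact. As a minimal sketch, the check below compares a requested tool name against the connected catalog and uses the standard library's `difflib.get_close_matches` to suggest what the model may have meant. The function name and the sample catalog are hypothetical, and this only catches nonexistent names, not a wrong choice between real tools.

```python
import difflib


def check_tool_call(name: str, catalog: set) -> tuple:
    """Return (is_known, suggestions) for a tool name the model asked to call."""
    if name in catalog:
        return True, []
    # Closest real tool names by string similarity; empty if nothing is close.
    suggestions = difflib.get_close_matches(name, sorted(catalog), n=3, cutoff=0.6)
    return False, suggestions


# Hypothetical catalog of connected tool names.
catalog = {"get_user", "search", "execute_query"}

print(check_tool_call("get_user", catalog))
print(check_tool_call("get_users", catalog))
```

Logging these misses per tool name gives a direct signal of how often the catalog, rather than the task, is what the model is struggling with.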
Sprawl makes this dramatically worse by introducing collisions. A survey of nearly 1,500 MCP servers found 775 tool-name collisions: the name "search" appeared in 32 different servers, and "get_user" and "execute_query" showed up eleven times each. Exact-match collisions are the visible tip; semantic near-duplicates ("find_customer" versus "lookup_contact" versus "get_account") confuse selection just as badly and never trip a name-conflict check. When your agent picks the wrong "search", the downstream eval blames the model. The model was not the problem. The catalog was.
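An exact-match collision check is a few lines of code, which is part of why it is the only check most catalogs get. The sketch below assumes a plain mapping from server name to its tool names; the server names and tool lists are hypothetical. It finds only exact duplicates, so semantic near-duplicates such as "find_customer" and "lookup_contact" pass straight through.

```python
from collections import defaultdict


def find_name_collisions(servers: dict) -> dict:
    """Map each tool name exposed by more than one server to the servers exposing it."""
    owners = defaultdict(set)
    for server, tool_names in servers.items():
        for tool_name in tool_names:
            owners[tool_name].add(server)
    return {name: sorted(found) for name, found in owners.items() if len(found) > 1}


# Hypothetical connected servers and their tool names.
servers = {
    "crm": ["search", "get_user", "find_customer"],
    "tickets": ["search", "lookup_contact"],
    "warehouse": ["execute_query", "get_user"],
}

for name, found in find_name_collisions(servers).items():
    print(f"{name}: {', '.join(found)}")
```

Here "search" and "get_user" are flagged, while "find_customer" and "lookup_contact" are not, even though they plausibly do the same job.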
Sprawl Happens Because Ownership Is Diffuse by Construction
- https://www.microsoft.com/en-us/research/blog/tool-space-interference-in-the-mcp-era-designing-for-agent-compatibility-at-scale/
- https://www.anthropic.com/engineering/code-execution-with-mcp
- https://dev.to/nebulagg/mcp-tool-overload-why-more-tools-make-your-agent-worse-5a49
- https://www.agentpmt.com/articles/thousands-of-mcp-tools-zero-context-left-the-bloat-tax-breaking-ai-agents
- https://konghq.com/blog/engineering/mcp-tool-governance-security-meets-context-efficiency
- https://www.jenova.ai/en/resources/mcp-tool-scalability-problem
- https://www.arcade.dev/blog/mcp-gateway-pattern/
- https://dxheroes.io/insights/mcp-governance-landscape-early-2026
- https://www.stackone.com/blog/mcp-token-optimization/
