
The Over-Tooled Agent Problem: Why More Tools Make Your LLM Dumber

9 min read
Tian Pan
Software Engineer

When a team at Writer instrumented their RAG-MCP benchmark, they found that baseline tool selection accuracy — with no special handling — was 13.62% when the agent had access to a large set of tools. Not 80%. Not 60%. Thirteen percent. The same agent, with retrieval-augmented tool selection exposing only the most relevant subset, reached 43%. The tools didn't change. The model didn't change. Only the number of tool definitions visible at reasoning time changed.

This is the over-tooled agent problem, and it's quietly wrecking production AI systems at scale.

Engineers building AI agents instinctively want to give them more capability. More tools mean more flexibility, right? The agent can handle edge cases, cover more user intents, adapt to novel situations. The reasoning is intuitive. It's also wrong. Beyond a surprisingly low threshold, adding tools to an LLM agent does not increase capability — it destroys it.

What Actually Happens When You Add Too Many Tools

Every tool you add to an agent's context has a cost that most teams don't measure until it's too late.

Tool definitions are not free tokens. A typical function schema with description, parameter names, types, and docstrings consumes 150–400 tokens. A production MCP server with 20 tools consumes 3,000–8,000 tokens before the agent has processed a single user message. Teams building multi-server setups routinely accumulate 50,000–200,000 tokens of schema overhead. That's often more context than the actual task requires.
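The arithmetic is worth making explicit. A back-of-envelope sketch, using the per-tool token range above (the 150–400 figure is the only input; everything else is simple multiplication):

```python
# Back-of-envelope schema overhead, using the per-tool token range above.
TOKENS_PER_TOOL = (150, 400)  # typical range for one function schema

def schema_overhead(tool_count: int) -> tuple[int, int]:
    """Return the (low, high) token cost of declaring `tool_count` tool schemas."""
    lo, hi = TOKENS_PER_TOOL
    return tool_count * lo, tool_count * hi

print(schema_overhead(20))   # one 20-tool MCP server
print(schema_overhead(500))  # a few hundred tools across a multi-server setup
```

Twenty tools already costs 3,000–8,000 tokens per request; a few hundred tools puts you squarely in the 50,000–200,000 range, paid on every single call.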

But the cost isn't just latency and money. The deeper problem is cognitive overload at the selection step. When a model must choose from 40 possible functions, several failure modes emerge:

  • Selection confusion: Similar tool names and overlapping descriptions create ambiguity the model can't resolve. It picks the wrong one, or hedges by calling multiple tools unnecessarily.
  • Context rot: Functions blur together in the attention mechanism. The model loses track of semantic distinctions between tools that a human designer saw as obvious.
  • Schema noise: The sheer volume of irrelevant parameter descriptions occupies working memory that should be applied to understanding the user's intent.
  • Hallucinated tools: Under high tool counts, models occasionally invoke tools that don't exist or combine parameters from different schemas into a chimera call.

The Berkeley Function Calling Leaderboard data shows that even the strongest models exhibit measurable accuracy degradation as tool catalog size grows. This isn't a weakness of a specific model. It's a fundamental property of how attention-based architectures handle large discrete choice sets.

The Threshold Is Lower Than You Think

Teams typically discover the problem in one of two ways. Either they observe mysterious agent failures that don't correlate with prompt quality or model version, or they run a controlled experiment and discover the degradation for themselves.

The "safe zone" in practice is roughly 10–20 tools per reasoning context. Below that threshold, most capable models handle selection well. Above it, quality begins degrading. Above 40–50, many production workflows become unreliable enough to require architectural intervention.

The precise threshold varies by model and task domain, but the shape of the curve is consistent: a plateau of acceptable performance, followed by a cliff.

What makes this particularly dangerous is that the cliff isn't always visible. Agents don't throw errors when they pick the wrong tool. They return plausible-looking results from the wrong operation. Users see subtle degradations — wrong data returned, side effects from unintended function calls, responses that are almost right. These are hard to catch in testing because the outputs pass surface-level quality checks.

Tool Routing Layers: The Architectural Fix

The core insight behind every effective solution is the same: don't give the model all the tools. Give it the right tools for this specific request.

A tool routing layer sits between the user request and the agent's reasoning loop. Its job is to reduce the full tool catalog to a context-appropriate subset — ideally 5–15 tools — before the agent ever begins thinking about which one to call.

Routing can be implemented at different levels of sophistication:

Semantic retrieval is the most common approach. Maintain an external index (ChromaDB or similar) over tool descriptions. When a request arrives, embed it and retrieve the top-N most semantically similar tools. The agent only sees those. This is the approach behind RAG-MCP and similar systems — it's what drove that 3x accuracy improvement from 13% to 43%.
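The mechanics can be sketched in a few lines. A real system would use an embedding model and a vector store such as ChromaDB; here a bag-of-words cosine similarity stands in for the embedding so the sketch stays self-contained, and the tool names and descriptions are hypothetical:

```python
import math
from collections import Counter

# Hypothetical tool catalog: name -> description. In production this index
# would live in a vector store, keyed by real embedding vectors.
TOOLS = {
    "get_invoice": "Retrieve a billing invoice for a customer account",
    "run_sql_query": "Execute a read-only SQL query against the analytics warehouse",
    "create_calendar_event": "Create a new event on the user's calendar",
    "send_email": "Send an email message to a recipient",
}

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_route(request: str, top_n: int = 2) -> list[str]:
    """Return the top-N tool names most similar to the request."""
    q = embed(request)
    ranked = sorted(TOOLS, key=lambda t: cosine(q, embed(TOOLS[t])), reverse=True)
    return ranked[:top_n]

print(semantic_route("show me the invoice for this customer account"))
```

Only the retrieved subset is passed into the agent's tool context; the other schemas never consume a token.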

Rule-based routing uses explicit logic to assign tool subsets by domain, user role, or request type. A billing-related query gets the billing tools; a data query gets the analytics tools. This is less flexible than semantic routing but more predictable and easier to audit.
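A minimal version of this is just an explicit domain map plus keyword rules (all tool and domain names below are hypothetical):

```python
# Rule-based router sketch: explicit domain -> toolset mapping, with keyword
# rules deciding the domain for each request. Easy to audit, fully predictable.
TOOLSETS = {
    "billing": ["get_invoice", "refund_charge", "list_subscriptions"],
    "analytics": ["run_sql_query", "export_report"],
    "default": ["search_docs"],
}

RULES = [
    (("invoice", "refund", "charge", "subscription"), "billing"),
    (("query", "report", "metric", "dashboard"), "analytics"),
]

def rule_route(request: str) -> list[str]:
    """Pick a tool subset by matching request keywords against domain rules."""
    text = request.lower()
    for keywords, domain in RULES:
        if any(k in text for k in keywords):
            return TOOLSETS[domain]
    return TOOLSETS["default"]

print(rule_route("I want a refund on my last invoice"))
```

The tradeoff is visible in the code: every routing decision is traceable to a specific rule, but any request phrasing the rules don't anticipate falls through to the default set.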

LLM-based routing uses a lightweight, fast model as a classifier that decides which tool domain or agent specialization to engage before routing to the full reasoning model. The routing model processes a small context; only the target model sees the full tool set for that domain.

The key property shared by all three is that they decouple the tool catalog size from the tool context size. Your system can support 200 tools; any given agent call sees 10.

Hierarchical Toolsets: Scaling Tool Architecture

Routing solves the immediate problem, but as systems grow more complex, routing alone becomes insufficient. You need structural hierarchy.

The HTAA (Hybrid Toolset Agentization and Adaptation) framework, developed in 2026, formalizes what many teams have discovered independently: frequently co-used tools should be encapsulated together into specialized agents. Rather than one agent with a flat list of 50 tools, you have an orchestrator agent and a set of specialist agents — each with 5–10 tightly focused tools.

This creates several compounding benefits:

  • Reduced planner complexity: The orchestrator decides which specialist to call, not which of 50 functions. That's a much easier selection problem.
  • Better tool documentation: When tools live in a domain context, their descriptions can assume domain knowledge rather than explaining everything from scratch.
  • Cleaner failure modes: When something goes wrong, it's scoped to a specialist's domain. Debugging is easier.
  • Independent scaling: High-traffic tool domains can have their specialist scaled separately.

The agents-as-tools pattern implements this hierarchy concretely. Specialist agents — "data retrieval agent," "calendar management agent," "code execution agent" — are themselves exposed as callable tools to an orchestrating agent. The orchestrator's tool list stays small. Each specialist's tool list stays focused. The full capability of the system scales without degrading any individual reasoning context.
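The shape of this pattern can be sketched without any LLM machinery. The class and tool names below are hypothetical, and the specialist's `run` method is a stub where a real reasoning loop would go:

```python
# Agents-as-tools sketch: each specialist is a callable unit with a small,
# focused toolset; the orchestrator's "catalog" contains only the specialists.
class SpecialistAgent:
    def __init__(self, name: str, tools: dict):
        self.name = name
        self.tools = tools  # 5-10 tightly focused tools, never the full catalog

    def run(self, task: str) -> str:
        # A real implementation would run an LLM reasoning loop over self.tools;
        # here we just report which small toolset the task would see.
        return f"{self.name} handling {task!r} with {len(self.tools)} tools"

class Orchestrator:
    def __init__(self, specialists: list[SpecialistAgent]):
        # The orchestrator's selection problem is over specialists, not functions.
        self.specialists = {s.name: s for s in specialists}

    def dispatch(self, specialist_name: str, task: str) -> str:
        return self.specialists[specialist_name].run(task)

data_agent = SpecialistAgent("data_retrieval", {"run_sql_query": ..., "export_report": ...})
cal_agent = SpecialistAgent("calendar", {"create_event": ..., "list_events": ...})
orch = Orchestrator([data_agent, cal_agent])
print(orch.dispatch("calendar", "book a meeting"))
```

Note what the orchestrator never sees: the union of all specialist toolsets. Its decision space is two specialists, not four-plus functions, and that gap widens as the system grows.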

Lazy-Loading Registries: Just-in-Time Tool Exposure

A complementary approach is lazy-loading: instead of declaring all available tools at session start, the agent requests tools as it needs them.

MCP-Zero and similar implementations reframe tool access as a discovery problem. The agent emits a structured query — "I need a tool that can read from a Postgres database" — and the registry returns the appropriate schema. The agent only holds schemas for tools it has explicitly requested in the current reasoning chain.

This approach has unique advantages for long-running or multi-turn agents. As a conversation evolves, the tool context stays clean rather than accumulating schemas for every capability the agent might conceivably need. Token overhead grows with actual usage, not potential usage.

The practical tradeoff is that discovery adds a round-trip latency. For workflows where tool selection happens once upfront, retrieval-at-session-start is simpler. For complex, branching multi-step workflows, lazy loading can meaningfully reduce average token consumption.

Designing Tools That Don't Hurt You

Architecture fixes the structural problem, but tool design quality determines how much headroom you have.

The single most important principle is single responsibility. Each tool should do exactly one thing, have an unambiguous name, and have a description that makes its boundaries clear. When two tools have overlapping capabilities, the model will make selection errors. Eliminate the overlap by splitting or combining.

Tool names must be treated with the same care as API surface design. Names that differ by a single word ("get_user_profile" vs "fetch_user_data") create unnecessary ambiguity. Names should encode the action and the entity unambiguously, with no synonyms between tools.

Tool documentation is a prompt, not a comment. The model reads your description at inference time and uses it to decide whether to call this function. Write it for a model, not a human reader. Specify what the tool does, when to use it, and — critically — when not to use it. Stating the negative case is often more valuable than the positive case for disambiguation.

Keep parameter schemas strict. Loose typing and optional parameters invite hallucinations. If a parameter is always required, mark it required. If a value must be one of a fixed set, use an enum. Precision in schema design reduces the probability that the model will pass malformed arguments.
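These principles combine in a single schema. Here is a sketch of one (the `refund_charge` tool and its fields are hypothetical): the description names the negative case explicitly, every parameter is required, and the fixed-set value uses an enum:

```python
import json

# A strict function schema for a hypothetical "refund_charge" tool.
# Required fields are marked required, fixed value sets use enums, and
# additionalProperties is closed off - little room to hallucinate arguments.
REFUND_CHARGE_SCHEMA = {
    "name": "refund_charge",
    "description": (
        "Refund a specific completed charge on a customer account. "
        "Do NOT use for pending or disputed charges."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "charge_id": {"type": "string", "description": "ID of the charge to refund"},
            "reason": {
                "type": "string",
                "enum": ["duplicate", "fraudulent", "requested_by_customer"],
            },
            "amount_cents": {"type": "integer", "minimum": 1},
        },
        "required": ["charge_id", "reason", "amount_cents"],
        "additionalProperties": False,
    },
}

print(json.dumps(REFUND_CHARGE_SCHEMA, indent=2))
```

Compare this to the loose alternative: a free-text `reason` string and optional `amount_cents` invites the model to invent values or omit the amount entirely.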

Measuring the Problem Before You Solve It

Most teams don't realize they're over-tooled until quality degrades. By then, they've shipped a system architecture that's expensive to restructure.

The right time to measure tool selection quality is before you deploy. Build a small evaluation suite of representative queries. For each query, record which tool was called, whether it was correct, and what the selection confidence was. Run this evaluation when you add tools. If accuracy drops, you've crossed a threshold — don't ship without addressing it.
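The harness doesn't need to be elaborate. A minimal sketch, where `select_tool` is a stand-in for your actual agent's selection step and the queries and tool names are hypothetical:

```python
# Minimal tool-selection eval: labeled queries in, accuracy out. Re-run this
# every time the catalog grows; a drop means you've crossed a threshold.
EVAL_SET = [
    {"query": "refund the duplicate charge", "expected": "refund_charge"},
    {"query": "export last month's report", "expected": "export_report"},
    {"query": "book a meeting tomorrow", "expected": "create_event"},
]

def select_tool(query: str) -> str:
    # Stand-in selector; replace with a real agent call in practice.
    q = query.lower()
    if "refund" in q or "charge" in q:
        return "refund_charge"
    if "report" in q or "export" in q:
        return "export_report"
    return "create_event"

def evaluate(eval_set: list[dict]) -> float:
    """Fraction of queries where the selected tool matches the label."""
    correct = sum(select_tool(case["query"]) == case["expected"] for case in eval_set)
    return correct / len(eval_set)

print(f"tool selection accuracy: {evaluate(EVAL_SET):.0%}")
```

The eval set should grow alongside the catalog: each new tool gets a few representative queries, so regressions caused by that tool's addition show up immediately.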

In production, instrument tool call selection explicitly. Log which tools were selected for each request. Track the distribution. If you see a long tail of rarely-called tools consuming schema tokens on every request, they're candidates for either lazy-loading or a specialized routing path.
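Identifying that long tail is a one-liner over the selection log. A sketch, with hypothetical tool names and a 5% call-share cutoff chosen arbitrarily for illustration:

```python
from collections import Counter

# Production instrumentation sketch: count each tool selection, then flag
# tools whose call share falls below a threshold - they pay schema tokens
# on every request while almost never being used.
selection_log = Counter()

def record_selection(tool_name: str) -> None:
    selection_log[tool_name] += 1

# Simulated traffic: one tool dominates, one is rarely called.
for t in ["run_sql_query"] * 40 + ["get_invoice"] * 8 + ["send_fax"] * 1:
    record_selection(t)

total = sum(selection_log.values())
long_tail = [t for t, n in selection_log.items() if n / total < 0.05]
print(long_tail)  # candidates for lazy-loading or a dedicated routing path
```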

The goal isn't to minimize the number of tools in your system. It's to minimize the number of tools in any given reasoning context. Those are very different problems, and solving the second one unlocks the ability to build systems of arbitrary capability without sacrificing accuracy.

Conclusion

The intuition that more tools means more capable agents is exactly backwards. Beyond a threshold of roughly 10–20 tools per context, adding capability destroys the ability to use it. The fix isn't to build smaller systems — it's to build systems where routing, hierarchy, and lazy-loading ensure each agent call sees only the tools it needs.

Tool routing layers, hierarchical multi-agent architectures, and just-in-time tool discovery are converging into a common architectural pattern for production AI systems at scale. Teams that adopt these patterns early will find that agent reliability improves not despite having more tools, but because more tools are now managed well. Teams that don't will keep debugging mysterious accuracy regressions in systems that look fine in isolation and break in production.

The tool catalog is infrastructure. Treat it like infrastructure: design for access control, not just availability.
