
The Tool Selection Problem: How Agents Choose What to Call When They Have Dozens of Tools

· 10 min read
Tian Pan
Software Engineer

Most agent demos work with five tools. Production systems have fifty. The gap between those two numbers is where most agent architectures fall apart.

When you give an LLM four tools and a clear task, it usually picks the right one. When you give it fifty tools, something more interesting happens: accuracy collapses, token costs balloon, and the failure mode often looks like the model hallucinating a tool call rather than admitting it doesn't know which tool to use. Research from the Berkeley Function Calling Leaderboard found accuracy dropping from 43% to just 2% on calendar scheduling tasks when the number of tools expanded from 4 to 51 across multiple domains. That is not a graceful degradation curve.

The tool selection problem sits at the intersection of retrieval, context management, and model reasoning — and practitioners building real agent systems usually discover it the hard way, after they've already shipped.

Why Dumping All Tools Into Context Fails

The naive approach is also the obvious one: define all your tools in the system prompt and let the model figure it out. This works up to around 10-15 tools, depending on the model and the specificity of each tool's description. Beyond that, two things break simultaneously.

Token explosion. Anthropic's internal testing found that 58 tool definitions consume roughly 55,000 tokens. At production inference costs, that means every single agent turn carries a baseline overhead equivalent to summarizing a short novel — before the user's query even enters the picture. For high-throughput systems, this is the hidden budget item that makes the unit economics unworkable.
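As a back-of-envelope sketch of why this overhead matters, using the ~55,000 tokens for 58 tools figure above (the per-token price below is an assumed placeholder, not a quoted rate):

```python
# Rough arithmetic for the baseline overhead of static tool inclusion.
# Figure from the text: 58 tool definitions ~= 55,000 tokens.
# PRICE_PER_MTOK is an assumption for illustration, not a real price quote.

TOKENS_PER_TOOL = 55_000 / 58           # ~948 tokens per tool definition
PRICE_PER_MTOK = 3.00                   # assumed $ per million input tokens

def baseline_cost(n_tools: int, turns_per_day: int) -> float:
    """Daily input-token cost of carrying n_tools definitions on every turn."""
    tokens_per_turn = n_tools * TOKENS_PER_TOOL
    return tokens_per_turn * turns_per_day * PRICE_PER_MTOK / 1_000_000

# 50 tools at 100k turns/day: the definitions alone cost on the order of $14k/day,
# before a single token of user query or model output is counted.
print(round(baseline_cost(50, 100_000)))
```

Whatever the exact per-token rate, the cost scales linearly with both tool count and traffic, which is why it dominates unit economics at high throughput.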

Selection accuracy degradation. More tools don't just add tokens — they degrade the model's ability to choose. The failure mode is subtle: the model rarely says "I don't know which tool to use." Instead, it picks a plausible-sounding tool that's wrong, or it combines multiple tools incorrectly, or it tries to call a tool with parameters that fit a different tool's schema. These errors are hard to detect because the response format is often still valid — just semantically incorrect.

There's also a third, underappreciated failure: documentation quality variance. In a set of 50 tools, some will have precise, well-scoped descriptions and others will have vague, overlapping ones written by different teams over time. A tool called send_notification and another called push_alert with similar descriptions will reliably confuse models, especially when the correct choice depends on subtle behavioral differences that aren't explicit in the description text.

Static Retrieval: Better Than Nothing, But Not Enough

The first fix most teams reach for is RAG-based tool selection: embed all your tool descriptions, embed the user query, find the top-k most similar tools, and only pass those to the model. This is meaningfully better than static inclusion — it cuts token usage and narrows the selection space.
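The mechanics of that retrieval step can be sketched as follows, with a toy bag-of-words vector standing in for a real embedding model (the tool names and descriptions here are illustrative):

```python
# Minimal sketch of embedding-based tool retrieval. A production system would
# use a trained embedding model; a bag-of-words Counter stands in for it here.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: a word-count vector."""
    return Counter(text.lower().replace("_", " ").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_tools(query: str, tools: dict, k: int = 3) -> list:
    """Rank tools by similarity between the query and each tool description,
    and pass only the top-k to the model."""
    q = embed(query)
    ranked = sorted(tools, key=lambda name: cosine(q, embed(tools[name])),
                    reverse=True)
    return ranked[:k]

tools = {
    "calendar_event_update": "update or reschedule an existing calendar event",
    "calendar_event_create": "create a new calendar event",
    "send_email": "send an email message to a recipient",
}
print(top_k_tools("reschedule the meeting", tools, k=2))
```

The sketch also hints at the vocabulary-mismatch problem discussed below: this toy example only works because "reschedule" happens to appear in both the query and the description.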

But pure embedding retrieval has its own failure modes that practitioners learn quickly.

Query-description vocabulary mismatch. A user asking to "reschedule the meeting" may not have high cosine similarity with a tool described as calendar_event_update, especially if the embedding model wasn't trained on API documentation. The user's language and the tool's documentation language often live in different parts of the embedding space.

Static retrieval in dynamic workflows. The right tool for step 2 of a multi-step task often depends on the output of step 1. If you retrieve tools only against the original user query, you'll systematically miss tools that become relevant mid-execution. Conditioned retrieval — where tool selection incorporates both the original intent and the evolving execution context — consistently outperforms query-only retrieval on multi-step benchmarks.

Top-k truncation eliminates the second-best choice. When two tools are nearly identical in relevance, embedding similarity rankings can place the correct one at position k+1, just outside the window. The model never sees the right option and picks incorrectly from what it does see.

The semantic covering attack (which is mostly a reliability issue, not just a security one). If several tools have overlapping, generic descriptions, they consume all the top-k slots regardless of actual relevance. Even in benign settings, poorly differentiated tool descriptions cause this problem constantly — the top-5 retrieved results end up being five variations of the same broad capability rather than a diverse, complementary set.

What Actually Works: Layered Routing

Production systems that handle large tool inventories reliably tend to use layered routing rather than a single retrieval pass. The structure looks roughly like this:

Tier 1: Intent classification. A lightweight classifier (or a small LLM call) maps the incoming request to a domain or tool category. This doesn't require semantic search — it's a structured dispatch step. The categories can be coarse: calendar_operations, document_management, data_queries, user_management. The goal is eliminating 80% of the tool inventory before any similarity calculation happens.
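A Tier 1 dispatcher can be as simple as keyword-scored rules; a sketch, with made-up category keywords:

```python
# Tier-1 dispatch sketch: a rule-based classifier that maps a request to a
# tool category before any similarity search runs. Keywords are illustrative;
# a small LLM call could replace this function without changing the structure.
CATEGORY_KEYWORDS = {
    "calendar_operations": {"meeting", "schedule", "reschedule", "calendar", "event"},
    "document_management": {"document", "file", "folder", "upload"},
    "data_queries": {"report", "query", "metrics", "dashboard"},
    "user_management": {"user", "account", "permission", "role"},
}

def classify_intent(request: str) -> str:
    """Pick the category with the most keyword hits; fall back to a default
    bucket when nothing matches rather than guessing."""
    words = set(request.lower().split())
    scores = {cat: len(words & kw) for cat, kw in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"

print(classify_intent("reschedule the team meeting to friday"))  # calendar_operations
```

The value of this tier is not precision but pruning: a cheap, auditable step that removes most of the inventory before the expensive retrieval runs.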

Tier 2: Hybrid retrieval within the category. Within the relevant category, use both dense (embedding) and sparse (BM25/keyword) retrieval, then merge the ranked lists. Hybrid retrieval consistently outperforms pure semantic search for tool selection because API documentation contains exact terminology — method names, parameter names, domain-specific jargon — that keyword matching handles better than embeddings.
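One common way to merge the dense and sparse ranked lists is reciprocal rank fusion (RRF); a sketch with illustrative rankings (the k=60 constant is the conventional RRF default):

```python
# Merging dense and sparse ranked lists with reciprocal rank fusion (RRF):
# score(item) = sum over lists of 1 / (k + rank). The input rankings below
# are made up for illustration.
def rrf_merge(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists into one, rewarding items that rank
    highly in any list without needing comparable raw scores."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["calendar_event_update", "calendar_event_create", "send_email"]
sparse = ["calendar_event_update", "send_email", "calendar_event_list"]
print(rrf_merge([dense, sparse]))
```

RRF is a convenient choice here precisely because dense and sparse scores live on incomparable scales; rank-based fusion sidesteps score normalization entirely.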

Tier 3: Context-conditioned re-ranking. For multi-step workflows, re-rank the candidate tools against the current execution state, not just the original query. A re-ranker that sees what tools were already called and what their outputs were will systematically surface better options than one working only from the initial request.
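A sketch of what context-conditioned re-ranking might look like, with assumed scoring weights and hypothetical tool metadata:

```python
# Tier-3 sketch: re-score candidate tools against execution state, not just
# the original query. The boost/penalty weights and the tool-input metadata
# are illustrative assumptions.
def rerank(candidates: dict,
           already_called: set,
           available_fields: set,
           tool_inputs: dict) -> list:
    """Boost tools whose required inputs were produced by earlier steps;
    demote tools already called this turn."""
    def score(tool: str) -> float:
        s = candidates[tool]                      # base retrieval score
        if tool_inputs.get(tool, set()) <= available_fields:
            s += 0.5                              # all required inputs exist
        if tool in already_called:
            s -= 0.3                              # discourage repeat calls
        return s
    return sorted(candidates, key=score, reverse=True)

candidates = {"calendar_event_update": 0.6, "calendar_event_lookup": 0.7}
tool_inputs = {"calendar_event_update": {"event_id"},
               "calendar_event_lookup": {"query"}}
# After a lookup step produced an event_id, update should outrank lookup
# even though lookup scored higher against the original query.
order = rerank(candidates, {"calendar_event_lookup"},
               {"event_id", "query"}, tool_inputs)
print(order)
```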

The result: the model sees 5-8 tools per call instead of 50, all of them relevant to the current step. Selection accuracy climbs. Token costs drop. Failure analysis becomes tractable because you can log and audit which tools were retrieved and why.

Tool-to-Agent Retrieval for Multi-Agent Systems

When tool inventories are organized across multiple specialized agents — one agent for calendar, one for documents, one for CRM — routing becomes a two-level problem: which agent to invoke, and which tool within that agent.

The tempting shortcut is agent-level routing only: classify the intent, route to the right agent, let the agent handle tool selection internally. This breaks in two ways. First, it hides fine-grained tool capabilities behind coarse agent descriptions, forcing routing decisions to happen without the information needed to make them correctly. Second, some tasks genuinely span multiple agents and require coordinated tool bundles — a purely agent-first approach can't reason about those cross-agent dependencies.

Embedding both tools and agents in a shared vector space, linked through metadata relationships, handles this better. Retrieval can match against individual tool capabilities for precision tasks, or against agent bundles for coordinated multi-step workflows, depending on what the query similarity pattern suggests. On benchmarks with 70+ MCP servers and 527+ tools, this approach shows roughly 17-18% improvement in retrieval quality metrics over agent-only or tool-only routing.
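One way to sketch such a shared index, with word overlap standing in for embedding similarity and all names hypothetical:

```python
# Sketch of a shared index holding both tools and agents, linked through
# metadata. Word overlap stands in for embedding similarity; the entries
# and names are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class IndexEntry:
    name: str
    kind: str               # "tool" or "agent"
    description: str
    agent: Optional[str]    # owning agent for tools; None for agent entries

INDEX = [
    IndexEntry("calendar_agent", "agent",
               "manages calendar events and scheduling", None),
    IndexEntry("calendar_event_update", "tool",
               "reschedule an existing event", "calendar_agent"),
    IndexEntry("crm_agent", "agent", "manages customer records", None),
    IndexEntry("crm_contact_lookup", "tool",
               "look up a customer contact", "crm_agent"),
]

def retrieve(query: str, k: int = 2) -> list:
    """Score tools and agents in the same index. Returning (name, owning
    agent) pairs lets the router decide both which tool to call and which
    agent hosts it in one pass."""
    q = set(query.lower().split())
    scored = sorted(INDEX,
                    key=lambda e: len(q & set(e.description.lower().split())),
                    reverse=True)
    return [(e.name, e.agent) for e in scored[:k]]

print(retrieve("reschedule the quarterly review event"))
```

Because tool entries carry a pointer to their owning agent, a precise tool match still resolves to the right agent, while broad queries can match agent-level descriptions directly.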

The Execution Context Problem

There's a failure mode that retrieval improvements alone won't solve: the model selecting the correct tool but calling it incorrectly because it lacks world model awareness.

In stateful systems, tool calls have side effects that change what subsequent calls should do. An agent that deletes a record doesn't understand that this changes the result of the next SELECT. An agent that sends an email doesn't model that retrying the send will duplicate the message. The tool selection layer can surface the right tool, but execution correctness depends on the model understanding the stateful consequences of its actions.

The practical mitigation is pre-execution planning: having the agent sketch out the full call sequence before executing any step, identifying dependencies and irreversibility constraints. Models that plan before they act make meaningfully fewer irreversible errors than those that select and execute tool-by-tool. The NESTFUL benchmark found that even GPT-4o achieves only 28% accuracy on full sequences of dependent API calls — a stark reminder that tool selection is only one part of the problem.
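A minimal sketch of one such pre-execution check, flagging irreversible calls that are not preceded by a verification step (the irreversibility set and the read-prefix naming convention are assumptions):

```python
# Sketch of a pre-execution plan check: before running anything, scan the
# planned call sequence and flag irreversible steps that are not preceded
# by a read/verify step. IRREVERSIBLE and the name prefixes are assumptions.
IRREVERSIBLE = {"send_email", "delete_record"}

def validate_plan(plan: list) -> list:
    """Return warnings for irreversible calls with no prior verification
    step, so the agent can reorder the plan before executing it."""
    warnings = []
    seen_read = False
    for step in plan:
        if step in IRREVERSIBLE and not seen_read:
            warnings.append(f"{step} runs before any verification step")
        if step.startswith(("get_", "list_", "lookup_")):
            seen_read = True
    return warnings

print(validate_plan(["delete_record", "get_record"]))   # flags the delete
print(validate_plan(["get_record", "delete_record"]))   # clean
```

Real systems would attach richer constraints (dependencies between step outputs and inputs, idempotency flags), but even this shape forces the plan to exist before anything irreversible happens.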

Writing Tool Descriptions for Retrieval

The tools themselves are a retrieval artifact, not just API documentation. Descriptions written for human readers often fail as retrieval targets.

Useful tool descriptions for agent systems include:

  • Canonical use cases: concrete examples of when this tool should be called, not just what it does
  • Anti-examples: explicit statements of when to use a related tool instead
  • Preconditions and postconditions: what must be true before calling this tool, and what will be different after
  • Failure modes: what the tool returns when inputs are out of range, when the underlying service is unavailable, or when the operation partially succeeds

Adding these fields to tool metadata and indexing them alongside the primary description improves retrieval quality substantially. Research on document expansion for tool retrieval found that adding structured fields like when_to_use, tags, and limitations yielded the largest improvements in retrieval metrics — more than improving the base description itself.
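A hypothetical metadata record combining the fields above with those expansion fields might look like the following (all field names beyond when_to_use, tags, and limitations are assumptions for illustration):

```python
# Illustrative tool metadata record. The schema is hypothetical: it combines
# the fields discussed above (use cases, anti-examples, pre/postconditions,
# failure modes) with the expansion fields when_to_use, tags, limitations.
SEND_NOTIFICATION = {
    "name": "send_notification",
    "description": "Deliver an in-app notification to a user.",
    "when_to_use": [
        "User asks to notify someone inside the product",
        "A workflow step requires acknowledgement from another user",
    ],
    "when_not_to_use": "For device-level push delivery, use push_alert instead.",
    "preconditions": ["recipient user_id exists",
                      "user has notifications enabled"],
    "postconditions": ["notification appears in the recipient's inbox"],
    "failure_modes": {
        "unknown_user": "returns 404, no notification created",
        "notifications_disabled": "returns 200 with delivered=false",
    },
    "tags": ["notification", "messaging", "in-app"],
    "limitations": "Not delivered to users offline for more than 30 days.",
}

# Index the description together with when_to_use and tags as one retrieval
# target, rather than indexing the description alone.
retrieval_text = " ".join(
    [SEND_NOTIFICATION["description"],
     *SEND_NOTIFICATION["when_to_use"],
     *SEND_NOTIFICATION["tags"]]
)
```

Note how the when_not_to_use field directly addresses the send_notification/push_alert confusion described earlier: the differentiation lives in the metadata the retriever and the model both see.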

The Routing Architecture Decision

Whether to add a routing layer or to simplify the tool inventory instead is a judgment call that depends on your actual numbers. A reasonable heuristic:

  • Under 15 tools: static inclusion, invest in description quality instead
  • 15-40 tools: single-pass semantic retrieval with hybrid scoring
  • 40+ tools: layered routing with intent classification, hybrid retrieval, and context-conditioned re-ranking
  • 200+ tools across multiple agents: tool-to-agent retrieval in a shared embedding space

The mistake practitioners make is reaching for routing complexity before they've exhausted description quality improvements. A well-differentiated set of 30 tools with crisp, precise descriptions will outperform a retrieval system built on top of 30 poorly described tools. Retrieval doesn't fix confusion — it just narrows which confusing options the model sees.

Measuring Whether It's Working

Agent tool selection is one of the harder components to evaluate because failures are often silent. The model calls a tool, the tool returns a result, and the agent continues. Whether the right tool was called is only visible if you're logging and auditing at the tool-call level, not just the final response.

Useful signals to track:

  • Tool selection accuracy per intent class: which categories show degraded accuracy as your tool inventory grows?
  • Retrieval recall@k: of the correct tools for a given query, how often do they appear in the top-k retrieved set?
  • Tool call retry rate: how often does the model call a tool, get an unexpected result, and call a different tool? High retry rates indicate upstream selection errors.
  • Mean tools per turn: rising over time suggests retrieval isn't narrowing the selection space effectively.
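Recall@k in particular is straightforward to compute from tool-call logs; a sketch with made-up records:

```python
# Computing retrieval recall@k from logged tool calls. Each record pairs the
# gold (correct) tool set for a query with the ranked retrieved list; the
# records below are made up for illustration.
def recall_at_k(records: list, k: int) -> float:
    """Fraction of gold tools appearing in the top-k retrieved set,
    averaged over all gold tools across queries."""
    total = hits = 0
    for gold, retrieved in records:
        topk = set(retrieved[:k])
        total += len(gold)
        hits += len(gold & topk)
    return hits / total if total else 0.0

records = [
    ({"calendar_event_update"}, ["calendar_event_update", "send_email"]),
    ({"send_email"}, ["push_alert", "send_notification", "send_email"]),
]
print(recall_at_k(records, 2))  # 0.5: the second gold tool sits at rank 3
```

Tracking this per intent class, not just globally, is what exposes the categories whose descriptions need work as the inventory grows.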

Build evals that cover the retrieval layer specifically, not just end-to-end task completion. By the time a task completion eval degrades, the retrieval problem has usually been present for much longer — you just couldn't see it.

The tool selection problem is ultimately a retrieval problem disguised as a model problem. Engineers who treat it as "the model isn't smart enough" reach for larger, more expensive models. Engineers who treat it as "the retrieval layer isn't giving the model the right information" build systems that work at 500 tools with the same accuracy they had at 5.
