
Tool Discovery at Scale: Why Embedding-Only Retrieval Fails Past 20 Tools

10 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same problem on their fifth sprint: the agent can't reliably pick the right tool anymore. At ten tools, it mostly works. At twenty, accuracy starts to slip. At fifty, you're watching the agent call search_documents when it should call update_record, and the logs offer no explanation. The usual reaction is to tweak the tool descriptions — add more context, be more explicit, rewrite the examples. This occasionally helps. But it misses the root cause: flat embedding retrieval is architecturally wrong for large tool inventories, and better descriptions cannot fix an architectural problem.

Tool selection is retrieval, and retrieval has known scaling limits. Understanding those limits — and the structured metadata patterns that work around them — is what separates agent systems that hold up in production from ones that require constant babysitting.

Why Embedding-Only Retrieval Degrades at Scale

Embedding-based tool retrieval works by encoding a user's query and all tool descriptions into a shared vector space, then returning the tools nearest to the query. This is fast, generic, and surprisingly effective in the small-inventory case. The problem is mathematical.

Recent theoretical work demonstrates a fundamental constraint: the number of distinct top-k subsets an embedding model can return is bounded by its output dimensionality. Once your tool inventory grows past a critical threshold relative to that bound, the model simply cannot distinguish between similar tools. Two tools that differ only in their input contract — say, create_user versus create_service_account — collapse to nearly identical embeddings if their descriptions share enough vocabulary. The model picks whichever one appears slightly more frequently in its training distribution, not the one the user actually needs.
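To make the flat-retrieval setup concrete, here is a minimal sketch, assuming sentence-transformers as the embedding backend; the model name and the tool descriptions are illustrative, not taken from any particular deployment:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

# Invented tool descriptions for illustration; note the heavy vocabulary overlap
# between the two "create_*" tools.
TOOLS = {
    "create_user": "Create a new user account with a name, email, and initial role.",
    "create_service_account": "Create a new service account with a name and initial role.",
    "search_documents": "Search indexed documents by keyword and return matching results.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def top_k_tools(query: str, k: int = 2) -> list[tuple[str, float]]:
    """Flat retrieval: rank every tool against the query, with no routing layer."""
    names = list(TOOLS)
    vectors = model.encode([TOOLS[n] for n in names])
    q = model.encode([query])[0]
    # Cosine similarity between the query and each tool description.
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    ranked = sorted(zip(names, sims.tolist()), key=lambda p: p[1], reverse=True)
    return ranked[:k]

# The two account-creation tools share most of their vocabulary, so their scores
# tend to land close together and the ranking becomes brittle.
print(top_k_tools("provision an account for the new CI pipeline"))
```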

The failure modes are predictable:

Vocabulary overlap collapses distance. When tools share domain language — as they inevitably do in any real API surface — embedding similarity becomes a poor proxy for functional distinction. A tool that updates a user's billing address and a tool that updates their shipping address will embed close together even if their downstream effects differ completely.

Embedding drift across model updates. When the embedding model is updated, indexes generated with the old model become misaligned. Teams often attribute the resulting quality degradation to prompt drift or data issues, spending weeks in the wrong debugging loop.

Code-switching creates disjoint clusters. In mixed technical/natural-language tool inventories, queries phrased in developer terminology ("POST to the endpoint") and queries phrased in user terminology ("save my changes") land in different regions of the embedding space, even when they should resolve to the same tool.

Benchmarks put numbers on the intuition. On the Composio benchmark — which tests tool selection across 50 problems drawn from 8 function schemas — accuracy ranges from roughly 33% with no optimization to 74% with multiple strategies applied. The mean accuracy across structured tool-calling tasks in published studies sits around 69%, with a standard deviation exceeding 20 percentage points. That variance tells you the problem is sensitive to implementation, not just model capability.

What Self-Describing Tools Actually Look Like

The instinct when retrieval degrades is to improve tool descriptions. This is not wrong, but it is incomplete. Prose descriptions optimize for human readability; what retrieval systems need is machine-parseable structure that encodes more than intent.

Effective capability metadata has four components:

Intent class. A controlled vocabulary label — not a prose summary — that places the tool in a taxonomy. For example: {domain: "identity", action: "mutate", entity: "user-role"}. Intent classes enable exact-match routing before semantic search even runs. A query classified as {action: "read"} never reaches {action: "mutate"} tools.

Input contract. The typed schema of required and optional parameters, including semantic constraints that the JSON type system cannot express. "This field takes a user ID, not an email address" is information that should live in structured metadata, not in a prose description field that an LLM reads at inference time.

Preconditions. What must be true about the world state before this tool can succeed. create_invoice requires that a customer record exists and the billing period is open. Making preconditions explicit lets the agent — or a router layer — verify eligibility before calling a tool and recover gracefully when the precondition fails.

Side-effect profile. Whether the tool is read-only, idempotent, or destructive; whether it triggers downstream workflows; whether it is reversible. An agent that doesn't know that archive_record is irreversible will cheerfully use it when soft_delete was the right choice.

A tool with all four of these fields populated is self-describing in the sense that a router can make routing decisions without needing the LLM to interpret prose. The query "enroll the new hire in compliance training" can match workday_create_learning_enrollment at 92.8% accuracy despite zero keyword overlap — because intent classification, not string similarity, is doing the matching.
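As one possible shape for that metadata, here is an illustrative sketch, not a standard schema; the intent values, contract fields, and side-effect flags shown for the enrollment tool are assumptions made for the example:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any

class SideEffect(Enum):
    READ_ONLY = "read_only"
    IDEMPOTENT = "idempotent"
    DESTRUCTIVE = "destructive"

@dataclass
class ToolCapability:
    """Capability metadata a router can act on without asking an LLM to interpret prose."""
    name: str
    description: str                  # prose, kept for stage-two semantic matching
    intent: dict[str, str]            # controlled vocabulary: domain / action / entity
    input_contract: dict[str, Any]    # typed params plus semantic constraints
    preconditions: list[str]          # world-state checks a router can verify up front
    side_effect: SideEffect
    reversible: bool

# Illustrative record for the enrollment example used in the text.
enroll_tool = ToolCapability(
    name="workday_create_learning_enrollment",
    description="Enroll a worker in a learning course in Workday.",
    intent={"domain": "learning", "action": "mutate", "entity": "enrollment"},
    input_contract={
        "worker_id": {"type": "string", "semantics": "Workday worker ID, not an email address"},
        "course_id": {"type": "string", "semantics": "learning course identifier"},
    },
    preconditions=[
        "worker record exists and is active",
        "course is open for enrollment",
    ],
    side_effect=SideEffect.IDEMPOTENT,
    reversible=True,
)
```

Keeping the controlled-vocabulary intent separate from the prose description is what allows the first stage of the router described below to filter by exact match before any embedding comparison runs.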

Hierarchical Routers Outperform Flat Retrieval

Once you have structured metadata, you can build a two-stage discovery architecture that behaves predictably as inventory grows.

Stage one: intent routing. A fast classifier (it does not need to be a large model) maps the user's request to an intent class or cluster. This reduces a 200-tool inventory to a candidate set of 10–15 tools that share the relevant domain and action type. The classifier is trained on controlled vocabulary labels, not prose, so it degrades gracefully as inventory grows.

Stage two: semantic selection within cluster. With a small candidate set, embedding similarity works well. The retrieval system is now comparing tools that are already contextually related, so the vocabulary-overlap failure mode is largely eliminated.
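A compact sketch of the two stages, building on the ToolCapability records above. The classify_intent stub uses keyword rules only to keep the example self-contained; in a real system it would be a small trained classifier, and the embedding model name is again an assumption:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

model = SentenceTransformer("all-MiniLM-L6-v2")

def classify_intent(query: str) -> dict[str, str]:
    """Stage-one stand-in: map a request to a controlled-vocabulary intent class."""
    mutating = ("create", "update", "enroll", "delete", "archive", "save")
    action = "mutate" if any(w in query.lower() for w in mutating) else "read"
    return {"action": action}

def route(query: str, tools: list[ToolCapability], k: int = 3) -> list[str]:
    # Stage one: exact-match filtering on intent class. A 200-tool inventory
    # collapses to the handful of tools sharing the relevant action type.
    intent = classify_intent(query)
    candidates = [t for t in tools if all(t.intent.get(f) == v for f, v in intent.items())]
    if not candidates:
        return []

    # Stage two: embedding similarity, but only within the candidate cluster,
    # where shared vocabulary no longer drags unrelated tools into the top-k.
    vectors = model.encode([t.description for t in candidates])
    q = model.encode([query])[0]
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    ranked = sorted(zip(candidates, sims.tolist()), key=lambda p: p[1], reverse=True)
    return [t.name for t, _ in ranked[:k]]
```

The important property is that stage two never sees tools from the wrong intent class, so its candidate set stays small no matter how large the full inventory grows.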

Dynamic approaches extend this further. Rather than selecting all tools upfront from the initial query, Dynamic Tool Dependency Retrieval conditions tool selection on the evolving plan: after each step, it re-queries the tool inventory based on what has already been done and what still needs to happen. A three-step workflow doesn't load all three tools at the start; it loads each one as the agent's reasoning reaches the point where it's needed. This reduces the context overhead that large tool inventories impose — the "MCP tax" of 10,000–60,000 tokens per turn in multi-server deployments — while improving selection accuracy in multi-step tasks.
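The per-step variant is a small change to the agent loop rather than a new retrieval system. A rough sketch of the pattern follows; it is not the published Dynamic Tool Dependency Retrieval implementation, route is the two-stage router above, and planner and executor are placeholders for whatever planning and execution components the agent already has:

```python
def run_task(task: str, tools: list[ToolCapability], planner, executor, max_steps: int = 10):
    """Load tools one step at a time instead of selecting everything up front.

    `planner` proposes the next step given the task and the work completed so far;
    `executor` runs a step with only the tools retrieved for that step.
    """
    completed: list[str] = []
    for _ in range(max_steps):
        # Re-query the inventory conditioned on what has already been done,
        # not just on the original request.
        next_step = planner(task=task, history=completed)
        if next_step is None:  # planner signals the task is finished
            break
        step_tools = route(next_step, tools, k=3)  # only these enter the context window
        result = executor(step=next_step, tools=step_tools)
        completed.append(f"{next_step} -> {result}")
    return completed
```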
