
Your Tool Catalog Is a Power Law and You're Optimizing the Long Tail

· 11 min read
Tian Pan
Software Engineer

Pull a week of tool-call traces from any production agent and the shape is the same: three or four tools handle 90% of the calls, and a couple of dozen others split the remaining 10%. The catalog is a power law, but the framework treats it like a uniform list. Every tool description ships in every system prompt, every selection rubric weights tools equally, every eval samples the catalog as if a search-files call and a refund-issue call were drawn from the same distribution. They are not.
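One way to check this shape on your own traces is to count calls per tool and find the smallest head that reaches a target share of traffic. The sketch below assumes each trace entry is simply a tool name; the tool names and counts are made up for illustration.

```python
from collections import Counter

def head_coverage(tool_calls, target=0.90):
    """Return (head, share): the most-called tools needed to reach `target` of all calls."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    head, covered = [], 0
    for name, n in counts.most_common():
        head.append(name)
        covered += n
        if covered / total >= target:
            break
    return head, covered / total

# Hypothetical week of traces: three busy tools and twenty long-tail tools.
trace = (
    ["search_files"] * 500
    + ["read_file"] * 300
    + ["run_tests"] * 110
    + [f"tail_tool_{i}" for i in range(20) for _ in range(5)]
)
head, share = head_coverage(trace)
print(head, round(share, 3))
```

If the head comes back as a handful of names out of a couple dozen, the catalog has the shape described above.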

The cost of that flatness is invisible until it isn't. A team adds the eighteenth tool, the planner's accuracy on the original three drops two points, nobody can localize the regression to a specific change because everything moved at once, and the eval suite — itself uniform across the catalog — averages the slip into a number that still looks fine. Meanwhile the tokens spent describing tools the model will not call this turn now exceed the tokens spent on the user's actual prompt.
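A small sketch of how a uniform average can hide that slip. All numbers are invented: three hot tools carry 90% of calls, fifteen cold tools share the rest, and the hot tools lose two points while the cold ones gain one.

```python
def uniform_accuracy(per_tool_acc):
    """Mean accuracy with every tool weighted equally."""
    return sum(per_tool_acc.values()) / len(per_tool_acc)

def weighted_accuracy(per_tool_acc, call_share):
    """Mean accuracy weighted by each tool's share of production calls."""
    return sum(acc * call_share[tool] for tool, acc in per_tool_acc.items())

hot = [f"hot_{i}" for i in range(3)]
cold = [f"cold_{i}" for i in range(15)]

# Hypothetical traffic split: 90% on the hot tools, 10% spread over the tail.
call_share = {t: 0.90 / len(hot) for t in hot}
call_share.update({t: 0.10 / len(cold) for t in cold})

before = {**{t: 0.95 for t in hot}, **{t: 0.80 for t in cold}}
after = {**{t: 0.93 for t in hot}, **{t: 0.81 for t in cold}}

for label, acc in (("before", before), ("after", after)):
    print(label, round(uniform_accuracy(acc), 3), round(weighted_accuracy(acc, call_share), 3))
```

In this toy setup the uniform number ticks up slightly while the traffic-weighted number drops by more than a point, which is the regression users actually feel.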

The fix is not a better selection rubric. It is admitting that "tool catalog" is two different data structures glued together by convention, and that the gluing is what's making the planner worse.

The shape of the data: accuracy collapses non-linearly with catalog size

The empirical curve is uglier than most teams assume. Independent benchmarks running the same prompts against catalogs of varying size find that with around 50 tools, modern models hold 84–95% selection accuracy. Push to ~200 tools and the range fragments to 41–83% depending on model. At ~740 tools, accuracy collapses to 0–20% across most models. The RAG-MCP study reports baseline accuracy of 78% at 10 tools dropping to 13.6% at 100+, a relative degradation of about 82%, and not on a smooth curve. Performance falls off a cliff at thresholds the team did not measure for, because the eval did not segment on catalog size.

Two failure modes are doing the work. The first is attention dilution: the planner has to read every tool description on every turn, and the prompt that used to fit in a focused window now spreads attention thin across a directory of definitions. The second is positional bias, the well-known "lost in the middle" effect, which tool catalogs hit harder than text retrieval because the catalog has no narrative momentum. With 741 tools, middle positions (40–60% of the catalog) score 22–52% selection accuracy versus 31–32% at the edges. The planner misses the right tool not because it doesn't know the tool exists, but because it doesn't read that part of the prompt with the same care.
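Both effects only show up if the eval is segmented. A minimal sketch, assuming each eval record carries the catalog size it ran against, the index of the correct tool within the catalog, and whether the planner picked it (the field names here are hypothetical):

```python
from collections import defaultdict

def segment_accuracy(results, key):
    """Group eval records by `key(record)` and return accuracy per group."""
    buckets = defaultdict(lambda: [0, 0])  # group -> [hits, total]
    for r in results:
        bucket = buckets[key(r)]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
    return {group: hits / total for group, (hits, total) in buckets.items()}

def position_bucket(record):
    """Label where the correct tool sat in the catalog: head, middle (40-60%), or tail."""
    frac = record["target_index"] / max(record["catalog_size"] - 1, 1)
    if frac < 0.4:
        return "head"
    if frac <= 0.6:
        return "middle"
    return "tail"

# Hypothetical eval records.
results = [
    {"catalog_size": 50, "target_index": 0, "correct": True},
    {"catalog_size": 50, "target_index": 25, "correct": True},
    {"catalog_size": 200, "target_index": 100, "correct": False},
    {"catalog_size": 200, "target_index": 199, "correct": True},
]
by_size = segment_accuracy(results, key=lambda r: r["catalog_size"])
by_position = segment_accuracy(results, key=position_bucket)
print(by_size, by_position)
```

Reporting these two breakdowns next to the headline number is what lets a team see a threshold before it crosses one.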

The token cost compounds the accuracy cost. A five-server MCP setup (GitHub, Slack, Sentry, Grafana, Splunk) burns roughly 55,000 tokens just on tool definitions before the user prompt has loaded. Twenty MCP servers exposing twenty tools each, a routine enterprise scale, produce 400 tool definitions in JSON Schema, which is among the more token-hostile serializations in common use. The cumulative overhead consumes a substantial fraction of the context window before reasoning begins.
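A back-of-the-envelope version of that overhead, as a sketch. It assumes the five-server figure scales linearly per server, which real catalogs may not, and it assumes a 200,000-token context window, which is not a number from the text.

```python
def definition_overhead(tokens_per_server, n_servers, context_window):
    """Estimate tokens spent on tool definitions and their share of the context window."""
    total = tokens_per_server * n_servers
    return total, total / context_window

# Per-server average implied by the five-server example above.
per_server = 55_000 // 5

# Assumed: linear scaling to twenty servers and a 200k-token window.
total, fraction = definition_overhead(per_server, n_servers=20, context_window=200_000)
print(total, fraction)
```

Under those assumptions the definitions alone would not fit in the window, which is why trimming each description is a smaller lever than not loading most of them at all.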

Hot path and cold catalog are different products

The architectural lever that recovers this performance is partitioning. There is a hot path of high-frequency tools that earn their place in the prompt — they are loaded eagerly, described in detail, covered by tight eval rubrics, and treated as part of the planner's permanent vocabulary. And there is a cold catalog of long-tail tools that is discovered on demand: the planner does not see them at all on most turns, but it has a lookup primitive (a tool-search tool, a retriever, an MCP-zero-style discovery API) that surfaces relevant cold tools when the request shape suggests one is needed.
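The partition can be sketched as two collections with one lookup primitive between them. This is an illustration only: `PartitionedCatalog`, `search_tools`, and the tools themselves are hypothetical names, and the word-overlap ranking is a stand-in for whatever retriever you would actually use.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    description: str

def _words(text):
    """Lowercased alphanumeric words; underscores in tool names act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class PartitionedCatalog:
    def __init__(self, hot, cold):
        self.hot = list(hot)    # always in the prompt
        self.cold = list(cold)  # only reachable through search_tools

    def prompt_tools(self):
        """Names the planner sees on every turn: the hot path plus the lookup primitive."""
        return [t.name for t in self.hot] + ["search_tools"]

    def search_tools(self, query, k=3):
        """Rank cold tools by word overlap with the query and return the top k matches."""
        q = _words(query)
        scored = [(len(q & _words(f"{t.name} {t.description}")), t) for t in self.cold]
        scored = [(score, t) for score, t in scored if score > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1].name))
        return [t for _, t in scored[:k]]

catalog = PartitionedCatalog(
    hot=[Tool("search_files", "Search files by name or content."),
         Tool("read_file", "Read the contents of a file.")],
    cold=[Tool("issue_refund", "Refund a customer order to the original payment method."),
          Tool("rotate_api_key", "Rotate an API key for a service account.")],
)
print(catalog.prompt_tools())
print([t.name for t in catalog.search_tools("refund order 1234 for the customer", k=1)])
```

The point of the split is that `prompt_tools()` stays the same size no matter how many tools land in `cold`.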

Anthropic's recently released tool search feature is a concrete instantiation of this pattern. Tools marked defer_loading: true stay out of the initial context, so the planner sees only the tool-search primitive itself plus the eagerly loaded hot path. When Claude searches for a capability, only the matching tools expand into full definitions. The published numbers are the kind that make you reread them: an 85% reduction in token usage for tool definitions, paired with selection accuracy improvements from 49% to 74% on Opus 4 and from 79.5% to 88.1% on Opus 4.5. Three measurements move at once: tokens drop, accuracy rises, and the catalog can grow without proportionally damaging either.
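A rough sketch of what assembling such a request's tool list might look like. Only the `defer_loading: true` flag comes from the description above; `build_tools` and `tool` are hypothetical helpers, and the search tool entry is deliberately left as a placeholder, since its exact `type`, `name`, and any required beta header should be copied from Anthropic's current documentation rather than from this example.

```python
def build_tools(hot_tools, cold_tools, search_tool):
    """Assemble a tools list: search primitive, eager hot path, deferred cold catalog."""
    tools = [dict(search_tool)]
    tools += [dict(t) for t in hot_tools]                         # loaded eagerly
    tools += [{**t, "defer_loading": True} for t in cold_tools]   # surfaced via search
    return tools

def tool(name, description):
    """Hypothetical helper: a minimal tool definition with an empty input schema."""
    return {
        "name": name,
        "description": description,
        "input_schema": {"type": "object", "properties": {}},
    }

# Placeholder only: take the real search-tool entry from the API documentation.
SEARCH_TOOL = {"type": "<tool-search-type-from-docs>", "name": "<tool-search-name-from-docs>"}

hot = [tool("search_files", "Search files by name or content."),
       tool("read_file", "Read the contents of a file.")]
cold = [tool("issue_refund", "Refund a customer order.")]

tools = build_tools(hot, cold, SEARCH_TOOL)
print([(t["name"], t.get("defer_loading", False)) for t in tools])
```

Keeping the hot/cold decision in one function like this also gives you a single place to promote a tool once its call share justifies a permanent spot in the prompt.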

The retrieval-augmented variant, often packaged as RAG-MCP, hits similar economics from the framework side rather than the API side. The reported numbers track the same shape: 1,084 prompt tokens versus 2,134 for the all-tools baseline (roughly a 49% reduction), and tool selection accuracy of 43.13% versus 13.62% (more than 3×). The dominant token-savings lever once your catalog grows past 20 tools is keeping cold tools out of the prompt entirely, not making each description shorter.

What both patterns require, and what most teams underestimate, is a separate index. The tool-search step needs something to search against — descriptions, example queries, parameter shapes, embeddings, or some hybrid — and that index has its own freshness requirements, its own quality bar, and its own failure modes that don't map onto the planner's eval suite. Treating retrieval as a black box because "it's just RAG" is how you ship an agent whose tool selection is worse than the all-tools baseline, just cheaper.
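One way to give the index its own quality bar is to measure recall@k on labeled queries, separately from the planner's eval. The sketch below uses a toy word-overlap index; the tool names, index text, and queries are all invented, and a real index would more likely use embeddings or a hybrid.

```python
import re

# Hypothetical index: one searchable text per cold tool.
INDEX = {
    "issue_refund": "refund a customer order payment",
    "rotate_api_key": "rotate api key credentials service account",
    "export_invoice": "export invoice pdf billing",
}

def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, k):
    """Return the k tool names whose index text shares the most words with the query."""
    q = words(query)
    ranked = sorted(INDEX, key=lambda name: (-len(q & words(INDEX[name])), name))
    return ranked[:k]

def recall_at_k(retrieve_fn, labeled_queries, k):
    """Fraction of queries whose expected tool appears in the top-k retrieved tools."""
    hits = sum(1 for query, expected in labeled_queries if expected in retrieve_fn(query, k))
    return hits / len(labeled_queries)

labeled = [
    ("refund order 1234", "issue_refund"),
    ("rotate the staging api key", "rotate_api_key"),
    ("send the customer their bill as a pdf", "export_invoice"),
    ("give the customer their money back", "issue_refund"),
]
print(recall_at_k(retrieve, labeled, k=1), recall_at_k(retrieve, labeled, k=2))
```

A miss here is an index problem (thin text for `export_invoice` in this toy case), and the planner never gets a chance to pick a tool that retrieval failed to surface.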

Each tool earns its description budget
