
Your Tool Catalog Is a Power Law and You're Optimizing the Long Tail

11 min read
Tian Pan
Software Engineer

Pull a week of tool-call traces from any production agent and the shape is the same: three or four tools handle 90% of the calls, and a couple of dozen others split the remaining 10%. The catalog is a power law, but the framework treats it like a uniform list. Every tool description ships in every system prompt, every selection rubric weights tools equally, every eval samples the catalog as if a search-files call and a refund-issue call were drawn from the same distribution. They are not.

The cost of that flatness is invisible until it isn't. A team adds the eighteenth tool, the planner's accuracy on the original three drops two points, nobody can localize the regression to a specific change because everything moved at once, and the eval suite — itself uniform across the catalog — averages the slip into a number that still looks fine. Meanwhile the tokens spent describing tools the model will not call this turn now exceed the tokens spent on the user's actual prompt.

The fix is not a better selection rubric. It is admitting that "tool catalog" is two different data structures glued together by convention, and that the gluing is what's making the planner worse.

The shape of the data: accuracy collapses non-linearly with catalog size

The empirical curve is uglier than most teams assume. Independent benchmarks running the same prompts against catalogs of varying size find that with around 50 tools, modern models hold 84–95% selection accuracy. Push to ~200 tools and the range fragments to 41–83% depending on model. At ~740 tools, accuracy collapses to 0–20% across most models. The RAG-MCP study reports baseline accuracy of 78% at 10 tools dropping to 13.6% at 100+ — an 82% degradation, and not on a smooth curve. Performance falls off a cliff at thresholds the team did not measure for, because the eval did not segment on catalog size.
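
The only way to see that cliff before production does is to run the same selection eval at several catalog sizes and never collapse the results into one number. A minimal sketch of what that segmentation might look like follows; `planner.select_tool` and the eval-case fields are assumed interfaces, not any particular framework's API, and the probe sizes are placeholders.

```python
import random

# Probe the same selection eval at several catalog sizes so the cliff shows
# up in the report rather than in production. Sizes are illustrative.
CATALOG_SIZES = [10, 50, 200, 740]

def selection_accuracy(planner, eval_cases, catalog):
    """Fraction of cases where the planner picks the expected tool."""
    hits = sum(planner.select_tool(c.prompt, catalog) == c.expected_tool for c in eval_cases)
    return hits / len(eval_cases)

def accuracy_by_catalog_size(planner, eval_cases, full_catalog, seed=0):
    rng = random.Random(seed)
    needed = {c.expected_tool for c in eval_cases}          # tools the cases require
    distractors = [t for t in full_catalog if t not in needed]
    report = {}
    for size in CATALOG_SIZES:
        n_extra = min(max(size - len(needed), 0), len(distractors))
        catalog = list(needed) + rng.sample(distractors, n_extra)
        report[size] = selection_accuracy(planner, eval_cases, catalog)
    return report      # e.g. {10: 0.92, 50: 0.88, 200: 0.61, 740: 0.15}
```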

Two failure modes are doing the work. The first is attention dilution: the planner has to read every tool description on every turn, and the prompt that used to fit in a focused window now spreads attention thin across a directory of definitions. The second is positional bias — the well-known "lost in the middle" effect — which tool catalogs hit harder than text retrieval because the catalog has no narrative momentum. With 741 tools, middle positions (40–60% of the catalog) score 22–52% selection accuracy versus 31–32% at the edges. The planner is less likely to pick the right tool not because it doesn't know the tool, but because it didn't read that part of the prompt with the same care.

The token cost compounds the accuracy cost. A five-server MCP setup (GitHub, Slack, Sentry, Grafana, Splunk) burns roughly 55,000 tokens just on tool definitions before the user prompt has loaded. Twenty MCP servers exposing twenty tools each — a routine enterprise scale — produce 400 tool definitions in JSON Schema, which is the most token-hostile serialization available. The cumulative overhead consumes a substantial fraction of the context window before reasoning begins.
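
The arithmetic is worth doing for your own catalog. The sketch below assumes an average per-definition cost of 140 tokens, which is an assumption to replace with a measurement from your own schemas, not a benchmark figure.

```python
# Back-of-the-envelope only; the per-definition token cost is an assumption
# (real JSON Schema definitions vary widely with parameter count).
AVG_TOKENS_PER_DEFINITION = 140        # assumed average, not measured

def definition_overhead(num_tools: int) -> int:
    """Tokens spent on tool definitions before the user prompt is read."""
    return num_tools * AVG_TOKENS_PER_DEFINITION

print(definition_overhead(400))        # 20 servers x 20 tools -> ~56,000 tokens
```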

Hot path and cold catalog are different products

The architectural lever that recovers this performance is partitioning. There is a hot path of high-frequency tools that earn their place in the prompt — they are loaded eagerly, described in detail, covered by tight eval rubrics, and treated as part of the planner's permanent vocabulary. And there is a cold catalog of long-tail tools that is discovered on demand: the planner does not see them at all on most turns, but it has a lookup primitive (a tool-search tool, a retriever, an MCP-zero-style discovery API) that surfaces relevant cold tools when the request shape suggests one is needed.
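
Structurally the partition is small. A minimal sketch, assuming nothing beyond the two tiers and a lookup primitive; the class and field names are illustrative, not a framework API, and the keyword-overlap search stands in for whatever retriever you actually trust.

```python
from dataclasses import dataclass, field

@dataclass
class Tool:
    name: str
    description: str
    partition: str = "cold"          # "hot" or "cold"

@dataclass
class ToolCatalog:
    tools: list[Tool] = field(default_factory=list)

    def hot_path(self) -> list[Tool]:
        """Eagerly loaded every turn: the planner's permanent vocabulary."""
        return [t for t in self.tools if t.partition == "hot"]

    def search(self, query: str, k: int = 3) -> list[Tool]:
        """The lookup primitive the planner calls when no hot tool fits.
        Naive word overlap stands in for a real retriever."""
        cold = [t for t in self.tools if t.partition == "cold"]
        words = set(query.lower().split())
        scored = sorted(cold,
                        key=lambda t: len(words & set(t.description.lower().split())),
                        reverse=True)
        return scored[:k]

# Only hot tools plus the search primitive go into the prompt each turn;
# cold tools expand into full definitions only after search surfaces them.
```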

Anthropic's recently released tool search feature is a concrete instantiation of this pattern. Tools marked defer_loading: true stay out of the initial context — the planner only sees the tool-search primitive itself plus the eagerly-loaded hot path. When Claude searches for a capability, matching tools expand into full definitions only when needed. The published numbers are the kind that make you reread them: an 85% reduction in token usage for tool definitions, paired with selection accuracy improvements from 49% to 74% on Opus 4 and from 79.5% to 88.1% on Opus 4.5. Three measurements move at once: tokens drop, accuracy rises, and the catalog can grow without proportionally damaging either.
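
In request terms, the pattern looks roughly like the sketch below. The `defer_loading` flag is the one named in Anthropic's announcement; the tool-search primitive's exact type string and any beta headers are deliberately omitted because they are version-specific, so treat this as an illustration of the split rather than a copy-pasteable request.

```python
# Illustrative only: which tools ride along eagerly vs. stay deferred.
tools = [
    {"name": "search_files", "description": "..."},                  # hot path: loaded eagerly
    {"name": "issue_refund", "description": "..."},                  # hot path: loaded eagerly
    {"name": "asana_projects_search", "description": "...",
     "defer_loading": True},                                         # cold: discovered via tool search
    {"name": "grafana_query_dashboard", "description": "...",
     "defer_loading": True},                                         # cold: discovered via tool search
]
```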

The retrieval-augmented variant, often packaged as RAG-MCP, hits similar economics from the framework side rather than the API side. The reported numbers — 1,084 prompt tokens versus 2,134 for the all-tools baseline (50%+ reduction), tool selection accuracy of 43.13% versus 13.62% (more than 3×) — track the same shape: the dominant token-savings lever once your catalog grows past 20 is keeping cold tools out of the prompt entirely, not making each description shorter.

What both patterns require, and what most teams underestimate, is a separate index. The tool-search step needs something to search against — descriptions, example queries, parameter shapes, embeddings, or some hybrid — and that index has its own freshness requirements, its own quality bar, and its own failure modes that don't map onto the planner's eval suite. Treating retrieval as a black box because "it's just RAG" is how you ship an agent whose tool selection is worse than the all-tools baseline, just cheaper.
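
One concrete freshness mechanism is to fingerprint each tool's retrieval text and re-embed only on change, so the index can drift out of date in a way your monitoring actually sees. A sketch under those assumptions; `embed` is a stand-in for whatever embedding call you use, not a real API.

```python
import hashlib

def embed(text: str) -> list[float]:
    raise NotImplementedError("stand-in for a real embedding model")

class ColdToolIndex:
    def __init__(self):
        self.entries: dict[str, tuple[str, list[float]]] = {}   # name -> (fingerprint, vector)

    def upsert(self, name: str, retrieval_text: str) -> None:
        fingerprint = hashlib.sha256(retrieval_text.encode()).hexdigest()
        current = self.entries.get(name)
        if current is None or current[0] != fingerprint:
            # Re-embed only when the retrieval text actually changed; stale
            # entries are the index's own failure mode, invisible to the
            # planner's eval suite.
            self.entries[name] = (fingerprint, embed(retrieval_text))
```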

Each tool earns its description budget

Once you accept the partition, a second budgeting question opens. Hot-path tools deserve verbose descriptions, parameter examples, and disambiguation hints — the planner reads those every turn, and clarity at the top of the funnel pays for itself across millions of calls. Cold-path tools, conversely, need descriptions optimized for retrieval, not for execution. The cold-path retriever has to find the tool from a vague user phrase; once found, the full definition expands into context. So a cold-path tool's description is two artifacts: a retrieval blob (rich in synonyms, example queries, and intent phrases) and an execution blob (rich in parameter semantics and edge-case guidance), with different optimization criteria for each.
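
Concretely, that means a cold-path tool record carries both blobs side by side. The field names below are illustrative, not from any particular framework, and the example tool is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ColdToolSpec:
    name: str
    retrieval_blob: str    # synonyms, example queries, intent phrases: optimized to be found
    execution_blob: str    # parameter semantics, edge cases: optimized to be used correctly

refund_lookup = ColdToolSpec(
    name="billing_refund_status_lookup",
    retrieval_blob=(
        "refund status, where is my refund, money back, chargeback progress, "
        "check whether a refund was issued for an order"
    ),
    execution_blob=(
        "Takes order_id (string). Returns the refund's current state; "
        "returns NOT_FOUND rather than erroring if no refund exists."
    ),
)
# The retriever indexes retrieval_blob; only execution_blob is expanded
# into the planner's context after the tool is surfaced.
```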

This split also makes it tractable to enforce description quality. The single biggest disambiguation failure in production agents is overlapping or vague tool names — notification-send-user versus notification-send-channel, or two search tools from different teams. Anthropic's own guidance is explicit about namespacing by service and resource (asana_search, jira_search, asana_projects_search, asana_users_search) precisely because the planner's selection accuracy is dominated by name disambiguation when descriptions blur. On the hot path, you have the budget to describe each tool's distinguishing job in a sentence the planner cannot misread. On the cold path, you have a retriever doing the disambiguation, so descriptions optimize for being findable, not for being prose.

A useful sanity check: if you can't say in one line what a hot-path tool does that no other hot-path tool does, the tool isn't ready for the hot path. It either needs to be merged with its near-neighbor, renamed to disambiguate, or demoted to the cold catalog where the retriever can disambiguate by intent rather than the planner by name.
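
That sanity check can be partially automated: compare the one-line jobs of hot-path tools pairwise and flag the pairs that blur together. The sketch below uses Jaccard word overlap as a crude stand-in for whatever similarity measure you trust; the threshold is an assumption to tune.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def lint_hot_path(one_liners: dict[str, str], threshold: float = 0.5) -> list[tuple[str, str]]:
    """Return pairs of hot-path tools whose distinguishing jobs blur together."""
    return [
        (x, y)
        for (x, ja), (y, jb) in combinations(one_liners.items(), 2)
        if jaccard(ja, jb) >= threshold
    ]

flagged = lint_hot_path({
    "notification_send_user": "send a notification to a single user",
    "notification_send_channel": "send a notification to a whole channel",
})
# Flagged pairs are candidates to merge, rename, or demote to the cold catalog.
```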

Eval the partition, not the catalog

The eval discipline that catches long-tail regressions is to stop reporting one accuracy number. Hot-path-conditional accuracy and cold-path-conditional accuracy are different metrics measuring different systems, and aggregating them lies twice — first by mixing two distributions whose error modes don't compose, second by hiding regressions in the smaller cohort.

The hot-path eval should be tight: every tool covered, every parameter shape exercised, multi-turn flows that compose hot tools, regression tests on every prompt change. The cold-path eval is structurally different — it samples user phrasings that should activate the retriever, scores whether the right tool was retrieved (not just selected), and tracks the retriever-then-planner success rate as a separate funnel. A cold-path miss can fail in two places: the retriever didn't surface the right tool, or the retriever surfaced it but the planner didn't pick it from the small candidate set. Logs need to disambiguate.
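
A minimal shape for that cold-path funnel, assuming `retriever` and `planner` interfaces that do not exist under these names in any particular framework: every miss is attributed to exactly one stage, so the logs disambiguate by construction.

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class ColdCase:
    phrasing: str          # user wording that should activate the retriever
    expected_tool: str

def run_cold_path_eval(cases, retriever, planner, k: int = 5) -> Counter:
    funnel = Counter()
    for case in cases:
        candidates = retriever.search(case.phrasing, k=k)
        if case.expected_tool not in candidates:
            funnel["retrieval_miss"] += 1          # the retriever never surfaced it
            continue
        chosen = planner.select_tool(case.phrasing, candidates)
        if chosen == case.expected_tool:
            funnel["hit"] += 1
        else:
            funnel["selection_miss"] += 1          # surfaced, but the planner picked wrong
    return funnel

# Report retrieval_miss and selection_miss separately; a single "cold accuracy"
# number hides which half of the funnel broke.
```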

Crucially, the hot path needs a promotion test and the cold path needs a demotion test. When traffic shifts and a previously cold tool starts handling 8% of calls, somebody has to notice and consider whether to move it to the hot path — eagerly loading it, expanding its description, adding eval coverage — and the converse for hot tools whose share has decayed. Without this, the partition is set once at design time and degrades against actual traffic. With it, the partition becomes a property of the running system, not of a static config.
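
The promotion/demotion check itself is a few lines over a rolling traffic window. The thresholds below are assumptions to tune, not recommendations, and a human still approves the move, since promotion also means new descriptions and new eval coverage.

```python
PROMOTE_ABOVE = 0.05    # a cold tool above 5% of calls gets reviewed for the hot path
DEMOTE_BELOW = 0.01     # a hot tool below 1% gets reviewed for demotion

def partition_proposals(call_counts: dict[str, int], partitions: dict[str, str]) -> list[str]:
    total = sum(call_counts.values()) or 1
    proposals = []
    for tool, count in call_counts.items():
        share = count / total
        if partitions.get(tool) == "cold" and share >= PROMOTE_ABOVE:
            proposals.append(f"promote {tool}: {share:.1%} of calls")
        if partitions.get(tool) == "hot" and share <= DEMOTE_BELOW:
            proposals.append(f"demote {tool}: {share:.1%} of calls")
    return proposals
```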

The benchmark literature has not converged on standard hot-path/cold-path metrics yet, which means most teams will have to define their own. The minimum viable split: tag every tool with a partition, slice every selection-accuracy report by partition, and look at the cold-path number first when something feels off. The aggregate number will lie to you for at least a quarter after a regression starts.
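
The minimum viable split fits in a dozen lines: carry the partition tag on every selection result and never average across partitions. A small sketch, with made-up numbers purely to show the shape of the report.

```python
from collections import defaultdict

def sliced_accuracy(results):
    """results: iterable of (partition, correct) pairs, e.g. ("hot", True)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for partition, correct in results:
        totals[partition] += 1
        hits[partition] += int(correct)
    return {p: hits[p] / totals[p] for p in totals}

print(sliced_accuracy([("hot", True), ("hot", True), ("cold", False), ("cold", True)]))
# {'hot': 1.0, 'cold': 0.5}  -- the aggregate 0.75 would have hidden the cold slip
```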

The org failure mode is unowned partitioning

The reason most teams ship a flat catalog is not that they don't know better. It's that the partitioning question has no owner. Each team that adds a tool wants their tool in the global catalog because that's how it becomes discoverable; nobody wants their tool demoted to the cold path because they read "cold" as second-class. The platform team that owns the planner has the strongest incentive to enforce a partition but the weakest political mandate, because they don't ship the user-facing features that depend on individual tools.

The pattern that works is making the partition a property of the tool registration, not of a downstream config. A tool gets registered with a declared expected call frequency, a hot-path budget gate enforced at the platform layer (e.g., a maximum of N tools on the hot path, with a quarterly review), and an automated promotion/demotion proposal driven by traffic. The conversation shifts from "is my tool important?" — a status question with no falsifier — to "does my tool's traffic justify its slot?" — an empirical question with a number on it.
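
As a sketch of partition-as-registration, under the assumption of a fixed hot-path budget enforced at registration time; the budget value and field names are illustrative.

```python
from dataclasses import dataclass

HOT_PATH_BUDGET = 12          # assumed maximum hot slots, reviewed quarterly

@dataclass
class ToolRegistration:
    name: str
    owner_team: str
    expected_calls_per_day: int
    requested_partition: str          # "hot" or "cold"

class Registry:
    def __init__(self, hot_budget: int = HOT_PATH_BUDGET):
        self.hot_budget = hot_budget
        self.hot: list[ToolRegistration] = []
        self.cold: list[ToolRegistration] = []

    def register(self, reg: ToolRegistration) -> str:
        if reg.requested_partition == "hot" and len(self.hot) >= self.hot_budget:
            # The gate turns "is my tool important?" into "does its traffic
            # justify a slot?" -- the tool lands cold and can earn promotion.
            self.cold.append(reg)
            return f"{reg.name}: hot path full, registered cold; eligible for promotion by traffic"
        (self.hot if reg.requested_partition == "hot" else self.cold).append(reg)
        return f"{reg.name}: registered {reg.requested_partition}"
```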

A second org pattern: the eval suite for hot-path tools is owned by the platform; the eval suite for cold-path tools is owned by the team that registered the tool. This is the only sustainable allocation — the platform team cannot author quality evals for 200 cold tools they don't use, and the registering team cannot maintain consistent rubrics across a hot path they don't own. Drawing the boundary at the partition draws it where ownership naturally falls.

What changes when you accept the partition

The mental model that survives is that an agent's tool catalog is not a set of permissions. It is the generator of a permission set the security team never enumerated, served against a traffic distribution the framework never measured, with a quality bar the eval suite never partitioned to detect. Each of those three problems has a fix, but the fixes only work together: the partition without the eval split hides regressions; the eval split without the registration discipline lets the partition drift; the registration discipline without the retrieval primitive demotes tools into a cold catalog the planner cannot reach.

The forward-looking version is that the next generation of agent frameworks will probably treat the partition as a first-class abstraction the way HTTP frameworks treat routes — with declared frequency tiers, automatic promotion/demotion, retrieval baked into the planner's input pipeline, and benchmarks that report partitioned accuracy by default. We are not there yet. In the meantime, the team that runs the partition by hand, with explicit hot/cold tags and segmented evals, ships a measurably better agent than the team that doesn't — and the one that doesn't will spend a quarter chasing a regression they cannot localize, because the numbers were averaged against a distribution that didn't exist.
