The Dependency Bomb in Your Tool Catalog: When Adding One Tool Breaks Five Agents
A team I know shipped a new lookup_customer_v2 tool to their support agent's catalog on a Tuesday. The tool was scoped narrowly, well-tested in isolation, and approved by review. By Thursday, an unrelated workflow — refund processing — was failing on roughly four percent of cases that used to succeed. The refund tool hadn't changed. The refund prompt hadn't changed. The model hadn't changed. What changed was that the planner was now picking lookup_customer_v2 for refund-eligibility queries that had previously routed cleanly to get_account_status, because the new tool's description happened to contain the word "eligibility" and ranked higher under whatever similarity heuristic the model uses internally.
This is the dependency bomb. Teams treat the tool registry as additive — "we're just adding one thing, what could go wrong" — but the planner doesn't see your registry as a list of independent capabilities. It sees a probability distribution over choices, and every entry redistributes the mass. Adding a tool can quietly subtract behavior somewhere else, and your eval suite will probably miss it because nobody wrote a regression test that says "the agent should still pick the old tool for this case."
The catalog is a global namespace, and the planner is a softmax
Engineers reason about tool registries the way they reason about a function library: each entry is a discrete capability, and adding one expands the set of things you can do without affecting the rest. That mental model is wrong for LLM-driven planners.
When a model decides which tool to call, it is implicitly comparing every tool's name and description against the user's intent and the current trace. The decision is competitive, not independent. Add a new tool whose description overlaps even loosely with an existing tool — synonyms, shared verbs, the same domain noun — and you've changed the relative scores for every prompt that used to land confidently on the old tool. Microsoft Research's study of MCP tool spaces found 775 tools with name collisions across the public ecosystem, and the verb "search" alone appears across 32 distinct MCP servers. Those collisions don't just cause hard errors when two servers claim the same name; they cause silent re-routing across semantic neighbors well before any literal name clash.
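A toy way to see the competitive dynamic: score a fixed query against every tool description and watch the winner change when a single entry is added. The bag-of-words cosine below is a crude stand-in for whatever scoring the planner applies internally, and the tool names and descriptions are invented for illustration; the mechanism, not the math, is the point.

```python
# Toy illustration: tool selection as a competitive ranking over descriptions.
# Bag-of-words cosine is a crude stand-in for the planner's internal scoring;
# the catalog entries below are hypothetical.
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(wa[t] * wb[t] for t in wa)
    norm = sqrt(sum(v * v for v in wa.values())) * sqrt(sum(v * v for v in wb.values()))
    return dot / norm if norm else 0.0

def rank(query: str, catalog: dict[str, str]) -> list[tuple[str, float]]:
    scored = [(name, cosine(query, desc)) for name, desc in catalog.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

catalog = {
    "get_account_status": "return the account standing and refund status for a customer",
    "refund_order": "issue a refund for a completed order",
}
query = "check refund eligibility for this customer"
print(rank(query, catalog)[0][0])  # get_account_status wins before the change

# Add one tool whose description overlaps on "eligibility" and "customer"...
catalog["lookup_customer_v2"] = "look up a customer record and check offer eligibility"
print(rank(query, catalog)[0][0])  # ...and it now outranks get_account_status for the same query
```

Nothing about the existing tools changed; only the field of competitors did.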
The same dynamic shows up at scale even without overlap. OpenAI's guidance is to keep fewer than twenty functions available at the start of a turn for highest accuracy. Anthropic recommends switching to dynamic Tool Search once you cross thirty tools — and the data behind that recommendation is striking: a controlled experiment on Claude Opus 4.5 found tool-selection accuracy rose from 79.5% to 88.1% just by deferring tool definitions behind a search step instead of stuffing them all into context. Token cost on the same workload dropped from roughly 77k to 8.7k. The MCPVerse benchmark generalizes the finding: most models degrade as the catalog grows, and the rate of degradation depends on the model, not just the tools.
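The shape of the fix those numbers point at is simple: don't put the whole catalog in front of the model at once. The sketch below shows the idea of deferring tool definitions behind a retrieval step, using a crude shared-word score; it is not Anthropic's Tool Search API or OpenAI's function-calling surface, and the registry entries are hypothetical.

```python
# Minimal sketch of dynamic tool discovery: retrieve only the top-k relevant
# tool schemas for this turn instead of sending the whole catalog.
# The scoring is deliberately crude (shared-word count standing in for embeddings).
def score(query: str, description: str) -> int:
    return len(set(query.lower().split()) & set(description.lower().split()))

def select_tools(query: str, registry: dict[str, dict], k: int = 5) -> list[dict]:
    """Return the k most relevant tool schemas to place in the model's context."""
    ranked = sorted(
        registry.values(),
        key=lambda tool: score(query, tool["description"]),
        reverse=True,
    )
    return ranked[:k]

registry = {
    "billing.refund_order": {
        "name": "billing.refund_order",
        "description": "issue a refund for a completed order",
        "parameters": {"type": "object", "properties": {"order_id": {"type": "string"}}},
    },
    # ...hundreds more entries stay out of the model's context until they match...
}

tools_for_this_turn = select_tools("refund a duplicate charge on order 1182", registry, k=5)
```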
The point is not that your catalog is too big. The point is that the catalog is a coupled system. Every entry is in tension with every other entry, and there is no way to add a tool without making a probabilistic claim about all the others.
The eval suite passes because it never asked the right question
Most teams have eval cases that look like "given this user query, the agent should produce this answer" or "the agent should call refund_order with these arguments." Those cases catch the gross failure modes: hallucinated arguments, wrong endpoint, malformed JSON. They do not catch a regression where the planner used to pick tool A, now picks tool B, and tool B happens to return a plausible-but-wrong answer that the eval's output check still accepts.
This is the structural blind spot. Trajectory-level evals score the path the agent took, not just the final answer. Without trajectory checks, a refactor that makes the agent take a less efficient or less correct route is invisible — until a quarter later, when an unrelated incident sends someone replaying traces and they notice the workflow has been silently degraded for weeks.
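A trajectory check can be very small. The sketch below assumes a recorded trace is a list of step dicts and asserts the ordered tool names match an expected path; the trace format and expected paths are hypothetical stand-ins for whatever your tracing layer emits.

```python
# Minimal sketch of a trajectory-level check: score the path, not just the answer.
# The trace format (a list of step dicts) is hypothetical.
def tool_path(trace: list[dict]) -> list[str]:
    """Extract the ordered tool names from a recorded agent trace."""
    return [step["tool"] for step in trace if step.get("type") == "tool_call"]

def assert_trajectory(trace: list[dict], expected_path: list[str]) -> None:
    actual = tool_path(trace)
    assert actual == expected_path, f"trajectory drifted: expected {expected_path}, got {actual}"

recorded_trace = [
    {"type": "tool_call", "tool": "get_account_status", "args": {"customer_id": "c_42"}},
    {"type": "message", "content": "Account is in good standing."},
    {"type": "tool_call", "tool": "refund_order", "args": {"order_id": "o_7"}},
]
assert_trajectory(recorded_trace, ["get_account_status", "refund_order"])
```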
The fix is selection-stability tests. For each canonical user prompt in your suite, pin the expected tool — not the answer, the tool — and gate merges on whether the planner's first call still resolves to that tool. These tests are cheap to write and cheap to run; they don't require a labeled ground-truth answer, just an expected tool name. The investment pays for itself the first time someone adds a tool whose description shadows an existing one, and selection-stability fails before merge instead of in production.
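A minimal selection-stability suite might look like the pytest sketch below. The helper plan_first_tool_call is hypothetical: a thin wrapper you would write around your agent framework that runs the planner on one prompt and returns the name of the first tool it resolves.

```python
# Sketch of a selection-stability suite: pin the tool, not the answer.
import pytest

from agent_harness import plan_first_tool_call  # hypothetical helper, not a real package

# (prompt, expected first tool) pairs for the workflows you care about
STABILITY_CASES = [
    ("is order 1182 eligible for a refund?", "get_account_status"),
    ("refund order 1182, the charge was duplicated", "refund_order"),
    ("what plan is customer c_42 on?", "get_account_status"),
]

@pytest.mark.parametrize("prompt,expected_tool", STABILITY_CASES)
def test_planner_still_selects_pinned_tool(prompt, expected_tool):
    chosen = plan_first_tool_call(prompt)
    assert chosen == expected_tool, (
        f"selection drift: {prompt!r} is pinned to {expected_tool}, planner now picks {chosen}"
    )
```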
A reasonable rubric:
- Every workflow with a "happy path" tool gets at least one stability test pinning that tool for a representative prompt.
- Every tool added or renamed triggers an ablation: run the full stability suite with the catalog as-is, then run it with the new/renamed tool removed, and diff the planner's choices (see the sketch after this list). Any prompt whose chosen tool changed is a candidate regression — not necessarily a bug, but something that needs a human to look at.
- Description edits are versioned and treated as behavior-changing releases, not copy edits. In most teams, a quick pass to "tighten the wording" of a tool description ships under the radar; that wording is part of the prompt, and changing it is a release.
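The ablation in the second bullet can be mechanized as a small diff routine. In the sketch below, the planner callable and the canonical prompt set are placeholders for your own harness; the function just runs every prompt against the catalog with and without the changed tool and reports any routing that moved.

```python
# Sketch of a pre-merge ablation: diff the planner's first choice with and
# without the added or renamed tool. The planner callable and prompt list are
# placeholders for your own harness.
from typing import Callable

def ablation_diff(
    prompts: list[str],
    planner: Callable[[str, dict], str],   # (prompt, catalog) -> chosen tool name
    catalog: dict,
    changed_tool: str,
) -> list[tuple[str, str, str]]:
    """Return (prompt, choice_without, choice_with) for every prompt whose routing changed."""
    without = {name: spec for name, spec in catalog.items() if name != changed_tool}
    diffs = []
    for prompt in prompts:
        before = planner(prompt, without)
        after = planner(prompt, catalog)
        if before != after:
            diffs.append((prompt, before, after))
    return diffs

# Each entry is a candidate regression for a reviewer to acknowledge, not an automatic block:
# for prompt, before, after in ablation_diff(CANONICAL_PROMPTS, planner, CATALOG, "lookup_customer_v2"):
#     print(f"{prompt!r}: {before} -> {after}")
```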
Why "just add it, ablate later" doesn't work
There's a tempting workflow: add the tool, deploy, watch the dashboards. If something regresses, roll back. This sounds like classical canary deploy logic, but it breaks down for tool catalogs in two specific ways.
First, the regression often doesn't show up in the metric you're watching. If the new tool successfully serves its intended workflow at a high rate, your "success rate of tool X" goes up. The fact that it also siphons traffic from get_account_status and answers refund-eligibility questions slightly worse than the old tool did is invisible at the per-tool aggregate. You'd need to watch end-to-end task completion broken down by inferred intent — and most teams don't have that fidelity.
Second, the time-to-detect on this kind of regression is long because the symptoms are statistical. A four percent regression on refund eligibility is exactly the noise level where it gets dismissed as variance for several weeks before someone notices a trend. By then, you've shipped two more tools on top, and the bisect surface is awful.
This is the same shape as the tool-use tax that recent research has been quantifying: the protocol overhead and selection cost of tool calling can outweigh the capability gain on real distributions, especially under semantic noise. Adding tools is not free even when each tool individually works.
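To put rough numbers on the variance point: at modest traffic, a drop of a few percentage points sits inside day-to-day noise for weeks. The simulation below uses made-up volumes and rates (200 refund-eligibility queries a day, a baseline success rate of 97%) purely to illustrate the overlap.

```python
# Back-of-envelope simulation: a few-point regression hides in daily variance
# at modest traffic. Volumes and rates are assumptions, not measurements.
import random

random.seed(0)
DAILY_REFUND_QUERIES = 200      # assumed daily volume for the affected workflow
BASELINE_SUCCESS = 0.97
REGRESSED_SUCCESS = 0.93        # roughly four points worse after the new tool ships

def daily_rate(p: float, n: int = DAILY_REFUND_QUERIES) -> float:
    return sum(random.random() < p for _ in range(n)) / n

before = [daily_rate(BASELINE_SUCCESS) for _ in range(14)]
after = [daily_rate(REGRESSED_SUCCESS) for _ in range(14)]
print("worst pre-change day:", min(before))
print("best post-change day:", max(after))
# At n=200/day the two periods can overlap on individual days; only a
# multi-week trend separates them, which is exactly the lag described above.
```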
What discipline actually looks like
Treating the tool catalog as a release surface, not a config file, costs more upfront and saves enormously later. Concretely:
- Selection-stability tests in CI. Pin the expected tool for canonical prompts. Every PR that touches the registry runs them. Failures require explicit acknowledgment ("yes, this prompt should now route to the new tool") rather than silent passes.
- Pre-merge ablation. When adding or renaming a tool, automatically run the stability suite with and without the change and surface the diff to the reviewer. Most PRs will show zero diffs and fly through. The ones that show diffs are exactly the cases that deserve a second look.
- Description churn is a release. A change to a tool's description string goes through the same process as a code change to the tool's behavior. No "drive-by wording cleanup" merges from tickets unrelated to the agent.
- Namespace your tools. If you're aggregating multiple MCP servers or internal tool sources, prefix consistently — crm.lookup_customer, billing.refund_order. This doesn't eliminate semantic interference, but it kills the literal-collision class of failures and makes the planner's decision more legible in traces. It also means a future migration where you swap the underlying server for one tool doesn't accidentally rename a different team's tool out from under them (see the merge sketch after this list).
- Cap initial tool count, lean on retrieval for the long tail. If your catalog crosses the twenty-to-thirty range, accept that you're now in the regime where dynamic discovery (Tool Search, semantic pre-filtering, hierarchical decomposition) outperforms a flat registry. The token-cost and accuracy data are unambiguous.
- Ship a tool-changelog with the why. Not just "added X" but "X was added because Y; we expect it to be selected for prompts of shape Z; here are the stability tests that pin it." This is the artifact your future self needs when bisecting a regression six months from now.
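The namespacing bullet can be enforced at the point where registries are merged. The sketch below is a hypothetical aggregation step: source names and tool entries are invented, and the only behavior it guarantees is that every exposed tool name carries a stable prefix owned by its source and that literal collisions fail loudly.

```python
# Minimal sketch of namespacing when aggregating tools from several sources.
# Source names and tool entries are hypothetical.
def merge_registries(sources: dict[str, dict[str, dict]]) -> dict[str, dict]:
    """Merge per-source registries into one catalog, prefixing each tool with its source."""
    catalog: dict[str, dict] = {}
    for source_name, tools in sources.items():
        for tool_name, schema in tools.items():
            namespaced = f"{source_name}.{tool_name}"
            if namespaced in catalog:
                raise ValueError(f"duplicate tool after namespacing: {namespaced}")
            catalog[namespaced] = {**schema, "name": namespaced}
    return catalog

catalog = merge_registries({
    "crm": {"lookup_customer": {"description": "look up a customer record"}},
    "billing": {"refund_order": {"description": "issue a refund for a completed order"}},
})
print(sorted(catalog))   # ['billing.refund_order', 'crm.lookup_customer']
```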
The realization
The tool catalog isn't a kit of independent screwdrivers; it's a single language the planner reads as a whole. Every entry constrains the meaning of every other entry, the way a new word in a dictionary subtly shifts the boundaries of its neighbors. Treat additions and renames with the same care you'd give to changing a function signature in a public API — because functionally, that's what they are.
The teams that learn this cheaply are the ones that wrote the selection-stability harness before they crossed twenty tools. The teams that learn it expensively are the ones that found out from a customer ticket six months in, bisected through three release cycles, and discovered that the regression started the Tuesday someone "just added one tool."
- https://www.microsoft.com/en-us/research/blog/tool-space-interference-in-the-mcp-era-designing-for-agent-compatibility-at-scale/
- https://arxiv.org/abs/2511.14650
- https://arxiv.org/html/2508.16260v1
- https://arxiv.org/html/2605.00136
- https://www.jenova.ai/en/resources/mcp-tool-scalability-problem
- https://dev.to/nebulagg/mcp-tool-overload-why-more-tools-make-your-agent-worse-5a49
- https://forum.cursor.com/t/mcp-tools-name-collision-causing-cross-service-tool-call-failures/70946
- https://github.com/openai/openai-agents-python/issues/464
- https://www.letsdodevops.com/p/fixing-mcp-tool-name-collisions-when
- https://www.agentpatterns.tech/en/testing-ai-agents/regression-testing
- https://platform.openai.com/docs/guides/function-calling
