The Composability Tax: Why Adding Tools Makes Your Planner Worse

· 9 min read
Tian Pan
Software Engineer

The team starts with five tools and a planner that hits the right one 95% of the time on production traffic. Eighteen months later they have fifty-one, the planner is sitting at 26%, and the simple cases the original five handled cleanly — book a meeting, look up a customer, file a ticket — now sometimes route to the wrong tool because there are three plausible-sounding lookalikes in the catalog. Nobody decided to make the planner worse. Every tool addition was individually defensible. The cumulative bill is the composability tax, and it is paid by every product whose tool catalog grows without a retirement discipline.

The tax is a curve, not a cliff. The Berkeley Function Calling Leaderboard measured it directly: on calendar scheduling, accuracy fell from 43% with four tools to 2% once the catalog grew to fifty-one tools spanning multiple domains. On customer-support-style tasks, GPT-4o dropped from 58% (single domain, nine tools) to 26% (seven domains, fifty-one tools). Llama-3.3-70B went from 21% to 0% over the same expansion. The shape repeats across models and task types: every additional tool moves the planner further down the curve, and the marginal damage gets worse as the catalog grows because new entries are increasingly indistinguishable from incumbents.

The failure mode is not the planner refusing to act. The planner almost never refuses. It picks a plausible-sounding tool that is wrong, or it combines two tools in a way that does not compose, or it calls a tool with parameters shaped for a different tool's schema. The response is structurally valid JSON. The trace looks clean. The eval suite — if it was built around the original five tools — does not catch the regression because the new failure cases are not yet in it. This is what makes the composability tax invisible until a customer reports it.

The curve is real and the inflection is yours to measure

The temptation is to treat tool count as a constraint to engineer around. Anthropic's Tool Search keeps full schemas out of the prompt and discovers tools on demand, lifting Opus 4 from 49% to 74% on MCP evaluations with large tool libraries. Code-mode execution collapses an eight-turn agent loop into a single call. Schema compression can take a 17,600-token GitHub MCP server down to 500 tokens. These approaches work, and you should use them. But none of them changes the underlying fact: the model still has to pick the right tool from a set, and the set's cardinality is a knob the team controls.
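To make the "discover on demand" idea concrete, here is a minimal sketch of the general pattern — not Anthropic's actual Tool Search API — in which tool descriptions are embedded, only the top-k candidates are retrieved per request, and just those schemas reach the planner. The `embed` callable and the catalog shape are assumptions standing in for whatever retrieval stack you already run.

```python
# Generic illustration of deferred tool discovery (not Anthropic's Tool Search
# API): keep full schemas out of the prompt and retrieve only the top-k
# candidates per request. `embed` is a hypothetical embedding function.
import numpy as np

def select_tools(query: str, catalog: list[dict], embed, k: int = 5) -> list[dict]:
    """Return the k tools whose descriptions are closest to the user query."""
    q = embed(query)                                                   # shape: (d,)
    tool_vecs = np.stack([embed(t["description"]) for t in catalog])   # shape: (n, d)
    sims = tool_vecs @ q / (
        np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(q) + 1e-9
    )
    top = np.argsort(-sims)[:k]
    return [catalog[i] for i in top]

# Only these k schemas go into the planner prompt; the other n - k stay out.
# That cuts tokens, but the planner still has to choose among the retrieved set.
```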

Treating tool count as a knob means measuring the curve for your product. Most teams cannot tell you, today, what their planner's accuracy looks like with twenty tools versus thirty versus fifty. They can tell you the catalog size. They cannot tell you the inflection point — the count at which an additional tool stops being worth the accuracy it costs. The inflection is product-specific. A team with five highly overlapping CRM lookup tools will inflect earlier than a team with fifty tools that span clearly disjoint domains. The composability tax is not paid per tool; it is paid per indistinguishable tool, weighted by how often the indistinguishable tools collide on real traffic.

The minimum viable measurement is a held-out eval suite that the team scores against the catalog as it stands, then re-scores under ablations: same suite, twenty tools removed; same suite, half the catalog removed; same suite, only the top ten most-used tools kept. The curve falls out. If accuracy with thirty tools matches accuracy with fifty, those last twenty tools are not paying their way — they are taxing every request for no return.
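A sketch of that ablation loop, assuming a held-out suite where each case records its expected tool and a `run_planner` harness that takes a case plus a candidate catalog; both are hypothetical stand-ins for your own eval infrastructure.

```python
# Re-score the same held-out suite under progressively smaller tool catalogs.
# `run_planner`, the eval-case fields, and the usage counts are assumptions.
from typing import Callable

def accuracy_under_catalog(eval_suite: list[dict],
                           catalog: list[dict],
                           run_planner: Callable[[dict, list[dict]], str]) -> float:
    """Score the held-out suite against one specific catalog."""
    correct = 0
    for case in eval_suite:
        chosen_tool = run_planner(case, catalog)   # planner only sees this catalog
        correct += int(chosen_tool == case["expected_tool"])
    return correct / len(eval_suite)

def tool_count_curve(eval_suite: list[dict],
                     full_catalog: list[dict],
                     run_planner: Callable[[dict, list[dict]], str],
                     usage_counts: dict[str, int]) -> dict[int, float]:
    """Map catalog size -> accuracy, dropping the least-used tools first."""
    ranked = sorted(full_catalog,
                    key=lambda t: usage_counts.get(t["name"], 0),
                    reverse=True)
    curve = {}
    for keep in (10, len(ranked) // 2, len(ranked) - 20, len(ranked)):
        if keep <= 0:
            continue
        subset = ranked[:keep]
        curve[len(subset)] = accuracy_under_catalog(eval_suite, subset, run_planner)
    return curve   # {catalog_size: accuracy}; plot it and look for the flat region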

The retirement metric: frequency × success × downstream lift

Adding tools is easy: a PR, a schema, a description, a deploy. Retiring them is hard, because every tool has a sponsor who built it, a Slack thread that justified it, and a non-zero call count that someone can point to. The discipline a team needs is a metric that decides retirement on evidence, not on sponsor pressure.

The metric has three multiplicative pieces:

  • Tool-use frequency: how often the planner calls the tool on real traffic. A tool called twice a quarter is paying its description tokens on every request for almost nothing.
  • Tool-call success rate: when the planner does call the tool, does the call succeed (right tool, right arguments, downstream system accepts it)? A tool with high frequency but 30% success is a tool the planner is misrouting to — it is actively making the planner worse for traffic that should have gone elsewhere.
  • Downstream eval lift: when the tool succeeds, does it actually move a user-visible outcome — task completion, resolution rate, time-to-resolve — relative to a counterfactual where the planner used the next-best tool? A tool that succeeds but does not lift the outcome is a tool whose closest alternative was good enough.

Multiplying these three is intentional. A tool that scores zero on any of them is a tool the catalog can shed without the planner noticing the loss — and with the planner getting a measurable accuracy bump on the simple cases that the deprecated tool was siphoning attention from. Tools that score low on all three are the long tail eating the planner's accuracy budget.
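A minimal sketch of the score, assuming per-tool logs of call volume, call success, and paired eval lift already exist; all field names and the threshold are hypothetical.

```python
# Retirement score = frequency x success x downstream lift.
# A zero on any axis sends the whole score to zero, which is the point.
from dataclasses import dataclass

@dataclass
class ToolStats:
    name: str
    calls_per_week: float      # tool-use frequency on real traffic
    success_rate: float        # 0..1: right tool, right args, downstream accepted
    downstream_lift: float     # 0..1: eval lift vs. the next-best tool when it succeeds

def retirement_score(t: ToolStats, traffic_per_week: float) -> float:
    """Multiplicative score over the three axes described above."""
    frequency = t.calls_per_week / max(traffic_per_week, 1.0)
    return frequency * t.success_rate * t.downstream_lift

def retirement_candidates(stats: list[ToolStats],
                          traffic_per_week: float,
                          threshold: float = 1e-4) -> list[str]:
    """Tools whose score falls below the threshold are candidates to retire."""
    scored = sorted((retirement_score(t, traffic_per_week), t.name) for t in stats)
    return [name for score, name in scored if score < threshold]
```

The output is a ranked shortlist, not an automatic deletion: the point is that the retirement conversation starts from the score, not from whoever sponsored the tool.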
