
The Agent's I-Don't-Know Rate That Fell After You Added More Tools

Tian Pan
Software Engineer

You added the search tool, then the calendar tool, then the CRM tool, then four database wrappers and a calculator. The dashboard moved the way you wanted: task completion ticked up, latency held, and the "I don't know" rate dropped from 14% to 4%. It looks like a capability win. It is not. The planner did not learn more; it learned to abstain less. Every question now looks answerable because there is always some tool that pattern-matches the query well enough to call. The 10 percentage points of "I don't know" you removed did not turn into correct answers. They turned into confident wrong ones, distributed across the long tail where nobody is grading carefully.

This is the false-competence trap of tool surface expansion. It is one of the most common ways a team ships a regression while celebrating an improvement. The eval rubric measures whether the agent attempted the task and produced a plausible-shaped answer; it does not measure whether the agent should have refused. Abstention is not free, but it is the cheapest correct behavior available, and you lose sight of it the moment your tool palette gets large enough that something always fires.

The mechanism is not mysterious, and it is now well documented. Larger tool sets dilute the planner's attention across more candidates. Semantically similar tools blur into each other. And reasoning-tuned models — the ones you upgraded to last quarter — turn out to amplify rather than suppress tool hallucination, because the chain-of-thought rationalizes any plausible-looking call into a confident one. Your "I don't know" rate did not fall because the agent got smarter. It fell because the agent got more willing to guess and you gave it more shapes to guess into.

What you actually measured

The abstention rate on its own is a useless number. It is a ratio whose numerator silently shrinks every time you add a tool, whether or not the agent got any better at answering. A query that previously returned "I can't help with that" because no tool fit now returns a tool call to whichever entry in the manifest had the closest-looking name. The agent did not start handling the query correctly; the agent started handling it at all. Whether the handling was right is a question your "task completion" metric is almost certainly not asking, because task completion is usually scored on whether any answer was produced and whether it parses, not on whether refusal would have been the correct outcome.

Three traps follow from this:

  • The "answered" bucket has grown without you measuring whether the new entries are correct.
  • The "abstained" bucket has shrunk, and you cannot tell whether the lost abstentions were correctly lost (the new tool actually solved a previously-unsolvable query) or incorrectly lost (the new tool gave the planner license to fake an answer).
  • The eval suite, written before the tool expansion, did not anticipate that the planner would now attempt categories it used to skip — so the cases that exercise the new failure modes are not in the test set.

The number that matters is not the abstention rate. It is the abstention rate conditional on the query being unanswerable. If you cannot construct a held-out slice of queries you know the agent should refuse, you do not have an abstention metric. You have an attempt-rate metric dressed up as a quality metric.
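To make the distinction concrete, here is a minimal sketch of the two numbers side by side. The `GradedQuery` record and its field names are assumptions for illustration; the only real requirement is a human label saying whether each query should have been refused.

```python
from dataclasses import dataclass


@dataclass
class GradedQuery:
    should_refuse: bool  # human label: nothing in the manifest can answer this
    abstained: bool      # the agent declined or said "I don't know"


def raw_abstention_rate(rows):
    """Share of all queries where the agent abstained."""
    return sum(r.abstained for r in rows) / len(rows)


def conditional_abstention_rate(rows):
    """Abstention rate restricted to queries labeled as unanswerable."""
    unanswerable = [r for r in rows if r.should_refuse]
    if not unanswerable:
        raise ValueError("no refusal slice: this is only an attempt-rate metric")
    return sum(r.abstained for r in unanswerable) / len(unanswerable)


# Illustrative data: 10 queries that should be refused, 90 that should not.
rows = (
    [GradedQuery(should_refuse=True, abstained=True)] * 4
    + [GradedQuery(should_refuse=True, abstained=False)] * 6
    + [GradedQuery(should_refuse=False, abstained=False)] * 90
)

print(raw_abstention_rate(rows))          # 0.04
print(conditional_abstention_rate(rows))  # 0.4
```

In this made-up sample the dashboard would show a 4% abstention rate, while the conditional view shows the agent attempting 6 of the 10 queries it should have declined.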

Why more tools push the planner toward false competence

Three forces compound, and each one is invisible in isolation.

Context dilution. Every tool definition you load costs tokens in the system prompt or tool manifest, and the model's attention has to spread across all of them before the user's query arrives. Research on the "lost in the middle" effect has shown that LLM accuracy degrades markedly when relevant context is buried in the middle of a long prompt, and a 50-tool manifest creates exactly that condition. The signal from the user's intent gets diluted by the noise of 49 tools the planner will not use. Selection accuracy drops not because any individual tool is harder to call, but because picking the right one out of a crowd is harder than picking the only one in the room.

Semantic blur. Tools you add over time tend to overlap. search_docs and search_kb and search_articles look different to you because you wrote them and remember the boundary. To the planner they are three points in a small region of embedding space, and the planner picks whichever description's surface features best match the query — which is not the same as whichever tool will actually return the right answer. Researchers studying tool-selection bias have shown that structurally similar API names alone are enough to drive misselection rates up sharply, even when the underlying tools behave very differently.
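One cheap way to catch the blur before the planner does is to lint the manifest for confusable names. The sketch below uses character-level similarity from Python's standard `difflib`, which is only a crude proxy; comparing description embeddings would probably be closer to what the planner sees. The `confusable_pairs` helper and the 0.6 threshold are assumptions, not values from any study.

```python
from difflib import SequenceMatcher
from itertools import combinations


def confusable_pairs(tool_names, threshold=0.6):
    """Return (name_a, name_b, ratio) for tool names that look alike."""
    flagged = []
    for a, b in combinations(sorted(tool_names), 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            flagged.append((a, b, round(ratio, 2)))
    return flagged


names = ["search_docs", "search_kb", "search_articles", "calculator"]
for a, b, ratio in confusable_pairs(names):
    print(f"{a} vs {b}: {ratio}")
```

Any pair this flags is a candidate for merging, renaming, or at least a description that states the boundary explicitly.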

Permissive priors. Reasoning-tuned models, the ones that produce a visible chain of thought before acting, have been shown to increase tool hallucination relative to their non-reasoning counterparts. The intuition is uncomfortable but clear: a reasoning model that has been rewarded for confident multi-step plans treats abstention as a missing step. Its reasoning trace looks for a path through the available tools, finds one that is locally plausible, and follows it. The chain of thought rationalizes the call rather than questioning whether one should happen.

Put together: a bigger manifest gives the planner more candidates, the candidates are more similar than they look, and the planner has been trained to prefer action over inaction. The "I don't know" rate falls because the architecture is now hostile to abstention. The drop is a leading indicator of a regression, not a lagging indicator of progress.

A correct query is not the same as an attempted query

Teams notice this kind of drift late because the eval suite was built around what the agent could do, not what it should refuse to do. Refusal-as-correctness is the missing axis.

A useful reframing is to split queries into three populations before you score anything:

  • Answerable with this tool surface. A correct response means the right tool was called and the output was right.
  • Not answerable with this tool surface but answerable in principle. A correct response means a clear, scoped refusal — "I don't have a way to do that, here is why" — without a hallucinated attempt.
  • Underspecified. A correct response is a clarifying question, not a tool call.

The trap is that without this split, every query gets graded against the first category, and refusals look like failures even when they are the right call. Teams unconsciously train themselves to celebrate attempt rates, and the planner, downstream of any preference signal you produce, learns the same lesson.
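A minimal sketch of grading against the three populations might look like the following. The enum names and the `is_correct` function are illustrative assumptions; the point is only that the expected action depends on the population, so a refusal can score as correct.

```python
from enum import Enum


class Population(Enum):
    ANSWERABLE = "answerable"
    OUT_OF_SCOPE = "out_of_scope"      # not answerable with this tool surface
    UNDERSPECIFIED = "underspecified"


class Action(Enum):
    TOOL_CALL = "tool_call"
    REFUSAL = "refusal"
    CLARIFYING_QUESTION = "clarifying_question"


# The correct kind of response for each population.
EXPECTED_ACTION = {
    Population.ANSWERABLE: Action.TOOL_CALL,
    Population.OUT_OF_SCOPE: Action.REFUSAL,
    Population.UNDERSPECIFIED: Action.CLARIFYING_QUESTION,
}


def is_correct(population, action, tool_output_correct=False):
    """Grade a response against its own population, not against 'answerable'."""
    if action is not EXPECTED_ACTION[population]:
        return False
    if population is Population.ANSWERABLE:
        # A tool call only counts if the right tool returned the right output.
        return tool_output_correct
    return True


print(is_correct(Population.OUT_OF_SCOPE, Action.REFUSAL))    # True
print(is_correct(Population.OUT_OF_SCOPE, Action.TOOL_CALL))  # False
```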

If you want abstention to survive tool expansion, the eval needs at least one slice where the correct answer is "no tool fits." That slice should be larger after each tool expansion, not smaller, because each new tool creates a new category of near-miss queries where the planner will be tempted to call it inappropriately. The honest version of "we added five tools" is "we added five tools and five new ways the agent can be wrong, and the abstention test set grew accordingly."

The patterns that hold abstention steady

Once you accept that tool expansion is also abstention expansion, a few patterns become non-negotiable.

A refusal slice in the eval, sized to the manifest. Every new tool ships with held-out queries that look like they belong to the new tool but should be refused. The pattern is the same as adversarial testing in classifiers: the near-miss is where calibration lives. If you cannot generate the near-miss set, you do not understand the tool's boundaries well enough to add it.
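One way to enforce this is a check that fails when a tool ships without its near-miss cases. This is a sketch under assumptions: the `tempting_tool` field, the `min_cases` default, and the tool names are all hypothetical.

```python
from collections import Counter


def tools_missing_near_misses(manifest, refusal_slice, min_cases=5):
    """Return tools that have fewer than min_cases near-miss refusal queries."""
    counts = Counter(case["tempting_tool"] for case in refusal_slice)
    return [tool for tool in manifest if counts[tool] < min_cases]


manifest = ["search_docs", "crm_lookup"]
refusal_slice = [
    # Looks like a docs search, but asks for something the docs do not cover.
    {"query": "search the docs for next quarter's pricing", "tempting_tool": "search_docs"},
]

print(tools_missing_near_misses(manifest, refusal_slice, min_cases=1))
# ['crm_lookup']
```

Run as part of CI, a non-empty result would block the manifest change until the refusal slice catches up.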

A confidence signal the planner is forced to consult. Verbalized confidence from the model is unreliable — multiple studies have shown that a model saying "I'm not sure" does not predict its accuracy well. But a separate, calibrated gate — a small classifier, a conformal prediction layer, or a deliberate two-stage planner where one step picks a tool and a second step checks fit — can hold the abstention rate where it should be. The key is that the gate has to be outside the main reasoning loop. Inside the loop, the model will rationalize past it.
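The two-stage shape can be sketched in a few lines. Here `pick_tool` and `fit_score` are placeholders for your planner and for whatever calibrated scorer you trust (a small classifier, a conformal layer); the function names, the toy stand-ins, and the 0.7 threshold are assumptions, and the threshold in particular would need calibrating on a refusal slice.

```python
def gated_tool_choice(query, pick_tool, fit_score, threshold=0.7):
    """Return a tool name, or None to abstain.

    The fit check runs after the planner has committed to a candidate,
    so the planner's own reasoning cannot argue its way past the gate.
    """
    candidate = pick_tool(query)
    if candidate is None:
        return None
    if fit_score(query, candidate) < threshold:
        return None
    return candidate


# Toy stand-ins, for illustration only.
def toy_planner(query):
    return "calendar_lookup"  # a planner that always finds something to call


def toy_fit(query, tool):
    return 0.9 if "meeting" in query else 0.2


print(gated_tool_choice("when is my next meeting", toy_planner, toy_fit))
print(gated_tool_choice("what is the weather in Lisbon", toy_planner, toy_fit))
```

The first call passes the gate; the second returns `None`, which the caller should turn into a scoped refusal rather than a retry.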

A scoped tool manifest per query. The "expose every tool always" pattern is the source of most of the dilution. Most modern agent frameworks now support some form of tool retrieval — pick the top-k relevant tools for the query, then plan against that smaller set. This is the structural fix, and it is more important than any prompt-level intervention. A 10-tool manifest selected from a 200-tool catalog routinely outperforms the 200-tool manifest on selection accuracy, and the abstention rate stays meaningful because the planner cannot pattern-match to whatever happens to be in the room.
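As a rough illustration of per-query scoping, the sketch below ranks tools by token overlap between the query and each tool description. Real systems would more likely use embedding retrieval; token overlap is swapped in here only to keep the example self-contained. The `select_tools` name, the catalog entries, and the `min_score` cutoff are hypothetical.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_tools(query, catalog, k=10, min_score=0.2):
    """Return up to k tool names whose descriptions overlap the query.

    Scores are Jaccard similarity over tokens. An empty result means
    no tool cleared the bar, which is a signal to abstain.
    """
    q = tokens(query)
    scored = []
    for name, description in catalog.items():
        d = tokens(description)
        union = q | d
        score = len(q & d) / len(union) if union else 0.0
        if score >= min_score:
            scored.append((score, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


catalog = {
    "search_docs": "search product documentation pages",
    "calendar_lookup": "find meetings on the calendar",
    "calculator": "evaluate arithmetic expressions",
}

print(select_tools("find my meetings on the calendar tomorrow", catalog))
# ['calendar_lookup']
print(select_tools("what is the weather in Lisbon", catalog))
# []
```

The empty list in the second call is the useful part: with nothing in the room to pattern-match against, the planner has no plausible wrong tool to reach for.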

An abstention budget, monitored as a first-class metric. The team's mental model should be: the floor of "I don't know" responses for our query mix is X%, set by the share of underspecified or out-of-scope queries we know we receive. If the abstention rate drops below X%, that is a regression alert, not a celebration. Monitor the abstention rate as seriously as you monitor error rate, but alert on a floor rather than a ceiling, and the false-competence trap stops being silent.
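The alert itself is a few lines. In this sketch the `abstention_alert` function and the 10% floor are assumptions; the floor should come from your own estimate of out-of-scope and underspecified traffic.

```python
def abstention_alert(abstained, total, floor):
    """Return an alert message if the abstention rate fell below the floor."""
    if total == 0:
        return None  # no traffic in this window, nothing to judge
    rate = abstained / total
    if rate < floor:
        return f"REGRESSION: abstention rate {rate:.1%} is below the {floor:.1%} floor"
    return None


print(abstention_alert(abstained=40, total=1000, floor=0.10))
# REGRESSION: abstention rate 4.0% is below the 10.0% floor
print(abstention_alert(abstained=140, total=1000, floor=0.10))
# None
```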

What to do tomorrow

Pull the last 30 days of agent traces. Bucket them by whether the agent called a tool and whether the resulting interaction completed successfully — but also pull a random sample of "tool called, completion succeeded" rows and have a human read whether the right answer would have been to decline.
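A minimal sketch of that audit, assuming each trace is a dict with boolean `tool_called` and `completed` fields (hypothetical names; adapt them to whatever your trace store records):

```python
import random
from collections import Counter


def bucket_counts(traces):
    """Count traces by (tool_called, completed)."""
    return Counter((t["tool_called"], t["completed"]) for t in traces)


def review_sample(traces, n=50, seed=0):
    """Random sample of 'tool called, completion succeeded' rows for human review."""
    green = [t for t in traces if t["tool_called"] and t["completed"]]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(green, min(n, len(green)))


traces = (
    [{"id": i, "tool_called": True, "completed": True} for i in range(80)]
    + [{"id": 80 + i, "tool_called": True, "completed": False} for i in range(15)]
    + [{"id": 95 + i, "tool_called": False, "completed": True} for i in range(5)]
)

print(bucket_counts(traces))
for trace in review_sample(traces, n=5):
    print(trace["id"])  # hand these to a reviewer: should the agent have declined?
```

The reviewer's yes/no answers on the sample give you a first estimate of how much false competence sits inside the green bucket.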

You will likely find a chunk of false-competence cases hiding under the green completion metric. They are the ones where the planner picked a tool that almost fit, returned a plausible-shaped result, and the user moved on without complaining because the wrong answer was not obviously wrong. That chunk is the cost of every "I don't know" your tool expansion silently converted into a confident attempt.

The unfashionable truth about agent reliability is that abstention is a capability, not a failure mode. An agent that knows when not to act is more valuable than one with a larger tool palette, and the easiest way to lose the first is to expand the second without noticing. The dashboard will not warn you. You have to instrument for refusal specifically, or you will keep mistaking attempts for answers.
