The Filler Tool Call: When Agents Perform Diligence Instead of Doing Work
Open the trace of any production agent and look at the tool calls that ran between the user's question and the first useful action. You will find a get_user_profile that returned a name nobody used, a check_status that came back green and was never referenced, a list_recent_orders whose result was summarized as "ok" and dropped on the floor. None of these calls changed the answer. All of them cost real money, real latency, and a real line in the trace. Your agent has learned to look diligent — and looking diligent is now your single largest source of waste.
This is the filler tool call: an action the agent emits not because it needs the result, but because the surrounding pattern of "thinking out loud, then acting" has been rewarded enough times during training that the model now performs thoroughness as a side effect of answering anything. It is the LLM equivalent of a junior analyst opening five tabs they never read so the senior across the room sees activity. The difference is that the junior gets bored. The agent never does.
The economics are worse than they look. A 2025 study from Alibaba on their Metis agent reported redundant tool invocations as high as 98% of total calls in baseline systems, dropping to 2% after targeted training, and accuracy improved at the same time. Tool-overuse mitigation work such as SMART reports that a 7B-scale agent trained to recognize its own knowledge boundary can match a 70B counterpart on the same benchmarks while cutting tool use by roughly a quarter. In other words, the filler is not just expensive. It can also actively degrade your output, because every extra observation in the context window dilutes the signal the model is supposed to be reasoning over.
Why models learn to fake thoroughness
Filler emerges from the training loop, not from the prompt. Suppose the reward signal during fine-tuning or RL is dominated by outcome correctness on tasks that almost always benefit from a lookup. The policy gradient then pushes the model toward calling tools more often, not less: the marginal cost of an unnecessary call is invisible to the loss function, while the marginal cost of a missed call is fatal. This is the same asymmetry that makes a defensive doctor order one more scan: false negatives have a name attached, false positives don't.
Three specific training pressures conspire to produce filler:
- Outcome-only reward models. When the reward only looks at the final answer, the model treats extra steps as free. There is no gradient pulling it toward parsimony.
- Imitation from verbose demonstrations. Most public agent traces used as supervised data show "I'll just verify…" patterns. The model learns the verbal scaffolding, then learns that the scaffolding usually comes with a tool call attached.
- Tool description bias. Edits to tool descriptions alone have been shown to disproportionately shift selection frequency. A tool described as "the canonical source for X" gets called for anything that mentions X, regardless of whether the agent already knows the answer.
The behavior also gets locked in by your own prompts. The classic system instruction — "before answering, gather all relevant context" — is read literally. The model gathers, even when the context required to answer was already in the user's question.
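The outcome-only reward pressure described above can be made concrete with a toy comparison. This is a minimal sketch, not any particular training setup: the `Episode` type, the function names, and the `call_penalty` value of 0.05 are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    correct: bool    # did the final answer match the reference?
    tool_calls: int  # how many tool calls the agent made on the way


def outcome_only_reward(ep: Episode) -> float:
    """Reward that looks only at the final answer; extra calls are free."""
    return 1.0 if ep.correct else 0.0


def cost_aware_reward(ep: Episode, call_penalty: float = 0.05) -> float:
    """Same outcome reward minus a small per-call penalty (assumed value)."""
    return outcome_only_reward(ep) - call_penalty * ep.tool_calls


lean = Episode(correct=True, tool_calls=1)
padded = Episode(correct=True, tool_calls=4)

# Outcome-only: both episodes score the same, so nothing favors the lean one.
print(outcome_only_reward(lean), outcome_only_reward(padded))

# Cost-aware: the lean episode scores higher than the padded one.
print(cost_aware_reward(lean), cost_aware_reward(padded))
```

Under the outcome-only reward the two episodes are indistinguishable, which is the missing gradient toward parsimony. The penalty has to stay small relative to the value of a correct answer, or the model learns to skip calls it actually needs.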
What filler actually costs you
Each filler tool call has four meters running. Most teams instrument only the first.
- The first meter is tokens for the call itself. Tool definitions are pasted into the context on every turn, and their schemas often run hundreds of tokens. When too many tool servers are wired into one agent, the definitions alone can dominate the prompt budget.
- The second meter is tokens for the result, which is usually worse: a tool that returns a JSON blob with fifty fields when you needed three burns thousands of input tokens on the next turn while the model figures out what to ignore.
- The third meter is latency. A single round trip to a model is around 800ms in a healthy setup; a tool call adds the network hop, the backend's processing time, and another model turn to absorb the result. Three filler calls on the path to an answer are the difference between a feature that feels live and one that feels like a slow API.
- The fourth meter is accuracy decay, which is the one teams discover last. Long contexts containing irrelevant observations measurably degrade the model's ability to reason over the relevant ones; in the worst cases the model anchors on a filler observation and produces a wrong answer that would have been right with a smaller context.
The sum of these meters is why agent cost stacks rarely match napkin math. Your token bill tripled while your user-visible work stayed flat, because the extra 2x is performative.
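A rough way to see how the token and latency meters add up is to attribute cost per call and total the share that went to calls the answer never used. The sketch below assumes a hypothetical trace format. The `ToolCall` fields, the 60-token call overhead, the per-call numbers, and the `get_order` tool are made up for illustration; only the 800ms model turn comes from the text. It leaves out the fixed per-turn cost of tool definitions and the accuracy meter, neither of which can be attributed to a single call this simply.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    result_tokens: int    # tokens the result adds to the next model turn
    backend_ms: float     # network hop plus backend processing time
    used_in_answer: bool  # did the final answer depend on this result?


# Illustrative assumptions, not measurements.
MODEL_TURN_MS = 800.0  # one model round trip, per the figure quoted above
CALL_TOKENS = 60       # tokens the model emits to issue one call (assumed)


def call_cost(call: ToolCall) -> tuple[int, float]:
    """Return (tokens, latency_ms) attributable to a single tool call."""
    tokens = CALL_TOKENS + call.result_tokens
    latency_ms = call.backend_ms + MODEL_TURN_MS  # extra turn to absorb result
    return tokens, latency_ms


def filler_share(calls: list[ToolCall]) -> dict[str, float]:
    """Fraction of call-attributable tokens and latency spent on unused calls."""
    if not calls:
        return {"token_share": 0.0, "latency_share": 0.0}
    total_tokens = filler_tokens = 0
    total_ms = filler_ms = 0.0
    for call in calls:
        tokens, ms = call_cost(call)
        total_tokens += tokens
        total_ms += ms
        if not call.used_in_answer:
            filler_tokens += tokens
            filler_ms += ms
    return {
        "token_share": filler_tokens / total_tokens,
        "latency_share": filler_ms / total_ms,
    }


trace = [
    ToolCall("get_user_profile", 400, 120.0, used_in_answer=False),
    ToolCall("check_status", 50, 80.0, used_in_answer=False),
    ToolCall("list_recent_orders", 3000, 300.0, used_in_answer=False),
    ToolCall("get_order", 500, 200.0, used_in_answer=True),  # hypothetical tool
]
share = filler_share(trace)
print(f"filler tokens: {share['token_share']:.0%}, "
      f"filler latency: {share['latency_share']:.0%}")
```

The hard part in practice is the `used_in_answer` flag, which this sketch simply takes as given.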
The counterfactual: would the answer have differed without this call?
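One way to put this question to a recorded trace is a leave-one-out replay: drop a single tool result, regenerate the answer, and see whether it changes. The sketch below is an assumption-laden illustration, not a description of any specific tool. `answer_fn` is a hypothetical callable standing in for your agent's final answer step, `stub_answer` is a deterministic stand-in for a real model, and `get_order` is a made-up tool name.

```python
from typing import Callable

Observations = dict[str, str]  # tool name -> result text
AnswerFn = Callable[[str, Observations], str]


def filler_calls(
    question: str, observations: Observations, answer_fn: AnswerFn
) -> list[str]:
    """Names of tool calls whose removal leaves the answer unchanged."""
    baseline = answer_fn(question, observations)
    filler = []
    for name in observations:
        # Replay with exactly one observation removed.
        ablated = {k: v for k, v in observations.items() if k != name}
        if answer_fn(question, ablated) == baseline:
            filler.append(name)
    return filler


def stub_answer(question: str, obs: Observations) -> str:
    """Deterministic stand-in for an agent: answers only from the order lookup."""
    return f"Your order status is {obs.get('get_order', 'unknown')}."


obs = {
    "get_user_profile": "name=Ada",
    "check_status": "green",
    "get_order": "shipped",
}
flagged = filler_calls("Where is my order?", obs, stub_answer)
print(flagged)
```

Exact string equality only works here because the stub is deterministic. With a real model you would likely need repeated samples and a semantic comparison of answers, and the ablation tests one call at a time, so it can miss calls that are only redundant in combination.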
- https://arxiv.org/abs/2502.11435
- https://arxiv.org/pdf/2505.18135
- https://www.codeant.ai/blogs/poor-tool-calling-llm-cost-latency
- https://venturebeat.com/orchestration/alibabas-metis-agent-cuts-redundant-ai-tool-calls-from-98-to-2-and-gets-more-accurate-doing-it
- https://arxiv.org/html/2504.13958v1
- https://www.braintrust.dev/articles/agent-observability-tracing-tool-calls-memory
- https://www.anthropic.com/engineering/code-execution-with-mcp
