Tool Hallucination Rate: The Probe Suite Your Agent Team Isn't Running
Ask an agent team what their tool-call success rate is and you will get an answer. Ask them what their tool-hallucination rate is and the room goes quiet. Most teams do not track it, and the ones who do usually only count the catastrophic version — a function name that does not exist in the catalog — while the quieter, more expensive variants travel through production unmetered.
A hallucinated tool call is not only the case where the model invents delete_orphaned_users(older_than="30d") and your dispatcher throws UnknownToolError. That is the easy case. The harder cases are when the fabricated name gets routed to an adjacent real tool through fuzzy matching, or when the tool name is correct but the agent invents an argument value your schema happily accepts because you marked the field optional. Both of those pass your "did a tool call succeed" dashboard. Neither is what the user asked for.
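To make the fuzzy-matching case concrete, here is a minimal sketch of a dispatcher lookup that records how each name was resolved. The registry contents, the resolve helper, and the 0.6 cutoff are all hypothetical; difflib stands in for whatever permissive matcher your framework may apply, which will behave differently in its details.

```python
import difflib

# Hypothetical registry: tool name -> callable. Not taken from any real framework.
REGISTRY = {
    "update_user": lambda **kwargs: {"updated": sorted(kwargs)},
    "delete_user": lambda **kwargs: {"deleted": True},
}

def resolve(name, cutoff=0.6):
    """Return (resolved_name, how), where how is 'exact', 'fuzzy', or 'unknown'."""
    if name in REGISTRY:
        return name, "exact"
    # Fuzzy fallback: an assumption standing in for a generous framework matcher.
    close = difflib.get_close_matches(name, list(REGISTRY), n=1, cutoff=cutoff)
    if close:
        return close[0], "fuzzy"
    return None, "unknown"

print(resolve("update_user"))          # registered name, resolved exactly
print(resolve("update_user_profile"))  # fabricated name, absorbed by the matcher
print(resolve("purge_audit_log"))      # fabricated name, nothing close enough
```

The useful part is the second element of the tuple. A dispatcher that only reports success or failure treats the first two calls identically; one that logs the resolution path can tell them apart.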
The point of this post is to treat tool hallucination as a measurable quantity, not a folklore category of failure. That means three metrics, a probe suite that every agent team should run but almost none does, and a discipline around what the prompt, the catalog, and the dispatcher each contribute to the rate.
Three Rates, Not One
"Tool hallucination rate" is a suitcase term. If your only metric is the count of UnknownToolError exceptions per thousand turns, you are measuring your dispatcher's strictness more than your model's behavior. Split it into three:
Unknown-tool rate: how often the model emits a function name that does not exist in the registered tool set for this turn. This is the classic hallucinated call. The dispatcher's error log gives you a lower bound on it; the true rate also includes whatever your fuzzy-matching layer silently absorbed.
Shadow-call rate: how often the model emits a name close enough to a real tool that the dispatcher (or the framework's fuzzy matcher) routes it to something the model did not intend. update_user_profile getting routed to update_user because the matcher was feeling generous is the worst of both worlds: the model hallucinated, the call "succeeded," and you have now written the wrong fields to a real record. This rate is invisible unless you instrument for it.
Hallucinated-argument rate: how often the tool name is correct but an argument is fabricated. Consider search_documents(query=..., tenant_id="enterprise_tier"), where tenant_id should be the literal UUID of the requester, not a description of them. The call succeeds at the schema level and fails at the semantic level. In the Reliability Alignment literature this shows up as "tool content hallucination" and is one of the subtlest failure modes: the JSON parses, the types validate, and the side effect is wrong.
A recent taxonomy from the agent-hallucination survey literature decomposes execution hallucinations into tool-type, tool-timing, tool-format, and tool-content categories, and the last two, format and content, dominate in production agents that expose large catalogs. The industry conversation skews toward unknown-tool rate because it is easy to count. Your rate card should report all three rates separately.
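As a rough illustration of splitting the three rates, the sketch below labels logged calls and reports each category per thousand calls. Everything here is assumed for the example: the record fields (name, resolution, args, ctx), the ARG_CHECKS table, and the rule that tenant_id must equal the requester's tenant ID. It also normalizes by calls, not turns, so adjust the denominator if your logs are turn-based.

```python
from collections import Counter

def tenant_matches_requester(args, ctx):
    # Hypothetical semantic check: the argument must equal the requester's tenant ID.
    return args.get("tenant_id") == ctx["tenant_id"]

# Hypothetical per-tool semantic checks; tools without an entry are not checked.
ARG_CHECKS = {"search_documents": tenant_matches_requester}

def label(call):
    """Assign one of four labels to a logged call record."""
    if call["resolution"] == "unknown":
        return "unknown_tool"
    if call["resolution"] == "fuzzy":
        return "shadow_call"
    check = ARG_CHECKS.get(call["name"])
    if check is not None and not check(call["args"], call["ctx"]):
        return "hallucinated_argument"
    return "ok"

def rates_per_thousand(calls):
    """Return each hallucination category as a rate per 1,000 calls."""
    counts = Counter(label(c) for c in calls)
    total = len(calls)
    keys = ("unknown_tool", "shadow_call", "hallucinated_argument")
    return {k: (1000 * counts[k] / total if total else 0.0) for k in keys}

ctx = {"tenant_id": "3f2b8c1e-0000-4000-8000-000000000001"}  # made-up UUID
calls = [
    {"name": "search_documents", "resolution": "exact",
     "args": {"query": "q", "tenant_id": ctx["tenant_id"]}, "ctx": ctx},
    {"name": "search_documents", "resolution": "exact",
     "args": {"query": "q", "tenant_id": "enterprise_tier"}, "ctx": ctx},
    {"name": "update_user_profile", "resolution": "fuzzy", "args": {}, "ctx": ctx},
    {"name": "delete_orphaned_users", "resolution": "unknown", "args": {}, "ctx": ctx},
]
print(rates_per_thousand(calls))
```

The semantic checks are the expensive part to write, and they are per tool. That cost is probably why hallucinated-argument rate is the one most often left unmeasured.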
Why the Rate Is Higher Than You Think
Catalog size drives the rate nonlinearly. Teams that expose 5–10 tools per turn see tool-hallucination rates that are comfortable to ignore. Teams that expose 50+ tools per turn see the rate inflate in ways that pattern-matching alone cannot explain — the model starts confabulating across tool families because the context itself suggests a larger action space than what is actually registered. The newer a tool is in the catalog, the more likely a nearby hallucinated cousin will appear, because the training distribution has familiar neighbors and the model interpolates.
Naming is a leak. If your catalog has create_invoice, list_invoices, and get_invoice, the model will confidently call delete_invoice the first time a user asks to cancel one. The shape of the catalog implies completeness. Pair that with a system prompt that overclaims — "you have access to a full suite of billing tools" — and you have taught the model that the missing tool must exist, because the prompt said so.
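One way to turn that observation into probe targets is to enumerate the names your catalog's shape implies but does not register. The sketch below is an assumption-heavy illustration: it presumes verb_noun naming, a fixed hypothetical verb list, and a deliberately naive plural rule that will mangle nouns such as "status".

```python
from collections import defaultdict

# Hypothetical verb family; extend or replace to match your own naming scheme.
CRUD_VERBS = ("create", "get", "list", "update", "delete")

def implied_missing_tools(catalog):
    """Return verb_noun names a CRUD-shaped catalog implies but does not register."""
    verbs_by_noun = defaultdict(set)
    for name in catalog:
        verb, _, noun = name.partition("_")
        if verb in CRUD_VERBS and noun:
            # Naive singularization so list_invoices groups with get_invoice.
            if noun.endswith("s"):
                noun = noun[:-1]
            verbs_by_noun[noun].add(verb)
    missing = []
    for noun in sorted(verbs_by_noun):
        for verb in CRUD_VERBS:
            if verb not in verbs_by_noun[noun]:
                missing.append(f"{verb}_{noun}")
    return missing

catalog = ["create_invoice", "list_invoices", "get_invoice"]
print(implied_missing_tools(catalog))  # ['update_invoice', 'delete_invoice']
```

Each name in the output is a candidate for a probe: a user request that would be served by that tool if it existed, run against the real catalog to see whether the agent refuses, asks, or invents.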
Training distributions leak in the other direction. Models have seen send_email, send_message, and send_notification implemented in a thousand codebases. Register only one, and the other two are in the priors. The ToolTweak line of research has shown that surface-level tool metadata is adversarially manipulable precisely because tool names carry outsized weight in selection — the same weight that, in benign conditions, drives hallucinations toward plausible-but-absent neighbors.
The prompt contributes too. Agents that are instructed to "be resourceful" or "try alternative approaches when a tool fails" have a measurable uptick in hallucinated calls after a first-turn failure. The instruction gives the model license to improvise, and the cheapest improvisation is a new tool name.
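If you want to check this effect in your own logs, one rough approach is to compare the hallucination rate on turns that immediately follow a failed tool call against all other turns. The sketch below assumes each turn record carries two boolean fields, tool_failed and hallucinated; both field names and the toy sessions are hypothetical.

```python
def hallucination_rate_by_prior_failure(sessions):
    """Compare hallucination rates on turns right after a tool failure vs. other turns.

    sessions: list of sessions, each a list of turn dicts with boolean
    'tool_failed' and 'hallucinated' fields (hypothetical log schema).
    Returns a rate per bucket, or None for a bucket with no turns.
    """
    tallies = {"after_failure": [0, 0], "otherwise": [0, 0]}  # [hallucinated, total]
    for turns in sessions:
        prev_failed = False  # the first turn of a session has no prior failure
        for turn in turns:
            key = "after_failure" if prev_failed else "otherwise"
            tallies[key][0] += int(turn["hallucinated"])
            tallies[key][1] += 1
            prev_failed = turn["tool_failed"]
    return {k: (h / n if n else None) for k, (h, n) in tallies.items()}

# Toy data, invented for illustration only.
sessions = [
    [{"tool_failed": True, "hallucinated": False},
     {"tool_failed": False, "hallucinated": True},
     {"tool_failed": False, "hallucinated": False}],
    [{"tool_failed": False, "hallucinated": False},
     {"tool_failed": True, "hallucinated": False},
     {"tool_failed": False, "hallucinated": True}],
]
print(hallucination_rate_by_prior_failure(sessions))
```

Running the same comparison with and without the "be resourceful" style of instruction in the system prompt would show how much of the gap is attributable to the prompt and how much to the failure itself.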
The Probe Suite You Should Be Running
- https://arxiv.org/html/2412.04141v1
- https://arxiv.org/html/2509.18970v1
- https://arxiv.org/html/2510.02554v1
- https://arxiv.org/html/2504.17550v1
- https://blog.langchain.com/few-shot-prompting-to-improve-tool-calling-performance/
- https://deepeval.com/docs/metrics-tool-correctness
- https://www.giskard.ai/knowledge/function-calling-in-llms-testing-agent-tool-usage-for-ai-security
- https://community.openai.com/t/prompting-best-practices-for-tool-use-function-calling/1123036
- https://www.lakera.ai/blog/guide-to-hallucinations-in-large-language-models
