
Tool Hallucination Rate: The Probe Suite Your Agent Team Isn't Running

9 min read
Tian Pan
Software Engineer

Ask an agent team what their tool-call success rate is and you will get an answer. Ask them what their tool-hallucination rate is and the room goes quiet. Most teams do not track it, and the ones who do usually only count the catastrophic version — a function name that does not exist in the catalog — while the quieter, more expensive variants travel through production unmetered.

A hallucinated tool call is not only when the model invents delete_orphaned_users(older_than="30d") and your dispatcher throws ToolNotFoundError. That is the easy case. The harder case is when the fabricated call shadows into an adjacent real tool through fuzzy matching, or when the tool name is correct but the agent invents an argument your schema happily accepts because you marked it optional. Both of those pass your "did a tool call succeed" dashboard. Neither is what the user asked for.

The point of this post is to treat tool hallucination as a measurable quantity, not a folklore category of failure. That means three metrics, a probe suite that every agent team should run but almost none does, and a discipline around what the prompt, the catalog, and the dispatcher each contribute to the rate.

Three Rates, Not One

"Tool hallucination rate" is a suitcase term. If your only metric is the count of UnknownToolError exceptions per thousand turns, you are measuring your dispatcher's strictness more than your model's behavior. Split it into three:

Unknown-tool rate — how often the model emits a function name that does not exist in the registered tool set for this turn. This is the classic hallucinated call. The dispatcher's error log gives you a lower bound; the true rate also includes whatever your fuzzy-matching layer silently absorbed.

Shadow-call rate — how often the model emits a name that is close enough to a real tool that the dispatcher (or the framework's fuzzy matcher) routes it to something the model did not intend. update_user_profile getting routed to update_user because the matcher was feeling generous is the worst of both worlds: you hallucinated, it "succeeded," and now you wrote the wrong fields to a real record. This rate is invisible unless you instrument for it.

Hallucinated-argument rate — how often the tool name is correct but an argument is fabricated. search_documents(query=..., tenant_id="enterprise_tier") where tenant_id should be the literal UUID of the requester, not a description of them. The call succeeds at the schema level and fails at the semantic level. In the Reliability Alignment literature this shows up as "tool content hallucination" and is one of the subtlest failure modes: JSON parses, types validate, the side effect is wrong.
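
If you want these three numbers, the logging has to preserve the raw name the model emitted before any fuzzy matching touches it. Here is a minimal classification sketch under that assumption; `CATALOG`, `LoggedCall`, and the tenant check are illustrative stand-ins, not any particular framework's schema:

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative per-turn catalog: the tools actually registered right now.
CATALOG = {"search_documents", "update_user"}

@dataclass
class LoggedCall:
    raw_name: str               # what the model emitted, BEFORE fuzzy matching
    dispatched_to: str | None   # what the dispatcher actually ran, if anything
    args: dict

def classify(call: LoggedCall, requester_tenant_id: str) -> str:
    if call.raw_name not in CATALOG:
        # The matcher may still have coerced the bad name onto a real tool.
        return "shadow_call" if call.dispatched_to in CATALOG else "unknown_tool"
    # Schema-valid but fabricated values only surface against ground truth;
    # tenant_id is the example from the text.
    if call.args.get("tenant_id") not in (None, requester_tenant_id):
        return "hallucinated_argument"
    return "ok"

def rates(calls: list[LoggedCall], tenant_id: str) -> dict[str, float]:
    counts = Counter(classify(c, tenant_id) for c in calls)
    n = len(calls) or 1
    return {k: counts[k] / n
            for k in ("unknown_tool", "shadow_call", "hallucinated_argument")}
```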

A recent taxonomy from the agent-hallucination survey literature decomposes execution hallucinations into tool-type, tool-timing, tool-format, and tool-content categories, and the rates for those last two — format and content — dominate in production agents that expose large catalogs. The industry conversation skews toward unknown-tool rate because it is easy to count. Your rate-card should split all three.

Why the Rate Is Higher Than You Think

Catalog size drives the rate nonlinearly. Teams that expose 5–10 tools per turn see tool-hallucination rates that are comfortable to ignore. Teams that expose 50+ tools per turn see the rate inflate in ways that pattern-matching alone cannot explain — the model starts confabulating across tool families because the context itself suggests a larger action space than what is actually registered. The newer a tool is in the catalog, the more likely a nearby hallucinated cousin will appear, because the training distribution has familiar neighbors and the model interpolates.

Naming is a leak. If your catalog has create_invoice, list_invoices, and get_invoice, the model will confidently call delete_invoice the first time a user asks to cancel one. The shape of the catalog implies completeness. Pair that with a system prompt that overclaims — "you have access to a full suite of billing tools" — and you have taught the model that the missing tool must exist, because the prompt said so.

Training distributions leak in the other direction. Models have seen send_email, send_message, and send_notification implemented in a thousand codebases. Register only one, and the other two are in the priors. The ToolTweak line of research has shown that surface-level tool metadata is adversarially manipulable precisely because tool names carry outsized weight in selection — the same weight that, in benign conditions, drives hallucinations toward plausible-but-absent neighbors.

The prompt contributes too. Agents that are instructed to "be resourceful" or "try alternative approaches when a tool fails" have a measurable uptick in hallucinated calls after a first-turn failure. The instruction gives the model license to improvise, and the cheapest improvisation is a new tool name.

The Probe Suite You Should Be Running

A probe suite for tool hallucination is a held-out eval that does not overlap with your production traces. The shape that works:

  • Decoy catalogs. For each real task, assemble a catalog that deliberately omits the tool the task wants. The goal is not to test task completion — it is to measure what the model reaches for when the right tool is missing. Score the output: did it refuse, did it ask a clarifying question, did it invent a tool, did it misuse a present tool? "Invented" is your unknown-tool rate on this probe. The distribution over those four outcomes is the signal; a harness sketch follows this list.

  • Lookalike traps. For each real tool, add a catalog variant where one decoy with a similar name sits next to the real tool. search_docs next to search_documents. user_info next to user_profile. Measure how often the model calls the decoy when the task unambiguously wants the real tool. That is your shadow-call rate. If your framework does fuzzy matching, turn it off for this eval — you want to see what the model produced, not what the matcher coerced.

  • Optional-argument fabrications. For each real tool with optional arguments, run tasks that do not justify those arguments, and measure how often the model fills them anyway with fabricated values. This is hallucinated-argument rate.

  • Overclaimed-prompt probes. Deliberately overclaim what the agent can do in the system prompt ("you have a tool for X") while leaving X out of the catalog. The delta in hallucination rate between matched and overclaimed prompts quantifies how much your prompt is the problem.
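
The suite is cheap to mechanize. A sketch of the decoy-catalog harness, assuming a `run_agent` function that calls your model with fuzzy matching disabled; the outcome labels match the four buckets above, and the text triage is deliberately crude:

```python
from collections import Counter

# Hypothetical stand-in for the agent under test: takes a task and a tool
# catalog, returns either a tool-call dict or a plain-text reply.
def run_agent(task: str, catalog: list[dict]) -> dict | str:
    raise NotImplementedError("wire this to your model, fuzzy matching OFF")

def score_decoy_probe(task: str, full_catalog: list[dict], omitted: str) -> str:
    """One decoy-catalog probe: the tool the task wants is deliberately absent."""
    decoy_catalog = [t for t in full_catalog if t["name"] != omitted]
    known = {t["name"] for t in decoy_catalog}
    out = run_agent(task, decoy_catalog)
    if isinstance(out, str):
        # Crude text triage; a real suite uses labels or a grader model.
        return "clarifying_question" if out.rstrip().endswith("?") else "refusal"
    if out["name"] not in known:
        return "invented"              # feeds the unknown-tool rate
    return "misused_present_tool"

def run_suite(probes: list[tuple]) -> dict[str, float]:
    outcomes = Counter(score_decoy_probe(*p) for p in probes)
    total = sum(outcomes.values()) or 1
    return {k: v / total for k, v in sorted(outcomes.items())}
```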

Run this suite on every prompt change, every catalog change, and every model upgrade. It is the tool-calling equivalent of a regression test. A 3% unknown-tool rate on your probe suite is not a passing grade — it is the floor you are trying to push down. The DeepEval-style frameworks give you tool-correctness as a metric, but tool-correctness is "did it call the right tool" — it does not by itself tell you which of the three hallucination rates is responsible for the misses.

Prevention, in Descending Order of Effectiveness

The highest-leverage defense is schema-enforced generation, not schema-enforced dispatch. Strict mode on OpenAI function calling and Anthropic's structured-outputs beta compile the tool schema into a constrained decoding grammar, which means the model literally cannot emit tokens that violate the schema — unknown tool names become unreachable during generation, not rejected after. That collapses unknown-tool rate toward zero. It does nothing for shadow-calls or hallucinated arguments where the schema accepts the fabrication.
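
For concreteness, this is roughly the shape strict mode takes on the OpenAI Chat Completions API as of this writing; the exact fields have moved between releases, so treat it as a sketch and check the current docs:

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Full-text search over the requester's documents.",
        "strict": True,  # opt in to constrained decoding against this schema
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                # Strict mode disallows truly optional fields; "optional"
                # is expressed as a nullable type instead.
                "tenant_id": {"type": ["string", "null"]},
            },
            "required": ["query", "tenant_id"],
            "additionalProperties": False,
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative; use whatever model your stack pins
    messages=[{"role": "user", "content": "Find the Q3 onboarding doc."}],
    tools=tools,
)
```

Note the side effect of strict mode's rules: every property must appear in required and additionalProperties must be false, which closes the extra-key hole, though a fabricated value for a legitimate field still sails through.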

The second defense is dispatcher strictness. Turn off fuzzy matching. If a tool name does not match exactly, fail the turn and surface the intended name back to the model as an error message that lists the actual catalog. Most frameworks ship with permissive defaults because the permissive path looks like it helps; in production it hides shadow-calls and makes your observability lie.
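
A strict dispatcher is a few lines; the part teams skip is the error body. A sketch, with the registry and the lambda as placeholders:

```python
class UnknownToolError(Exception):
    pass

def dispatch(name: str, args: dict, registry: dict):
    """Exact-match dispatch: no fuzzy matching, no silent coercion."""
    if name not in registry:
        raise UnknownToolError(
            f"No tool named {name!r}. Registered tools: {sorted(registry)}"
        )
    return registry[name](**args)

# Illustrative registry and turn loop.
registry = {"update_user": lambda **kw: {"updated": kw}}

try:
    result = dispatch("update_user_profile", {"user_id": "u_42"}, registry)
except UnknownToolError as exc:
    # Fail the turn, log the unknown-tool event for the rate card, and hand
    # the actual catalog back to the model so the retry is grounded in what
    # exists rather than in its priors.
    result = {"error": str(exc)}
```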

The third defense is narrow catalogs per task surface. One agent with 50 tools always loses to five agents with 10 tools each, on hallucination rate and on latency. Route at the orchestration layer, not inside the model's head. If the model only sees the subset of tools relevant to the current phase of work, the priors for "this sibling tool probably exists" have nowhere to land.
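
A sketch of what that routing can look like; the phase names and the `PHASE_TOOLS` mapping are invented for illustration:

```python
# Phase-scoped catalogs: the orchestrator (a router, a state machine, or a
# classifier) picks the subset, so the model never sees the other forty
# tools and the "sibling tool probably exists" prior has nothing to land on.
PHASE_TOOLS = {
    "triage":  ["search_documents", "get_invoice"],
    "billing": ["create_invoice", "list_invoices", "get_invoice"],
    "account": ["user_profile", "update_user"],
}

def tools_for(phase: str, full_registry: dict) -> dict:
    return {name: full_registry[name] for name in PHASE_TOOLS[phase]}
```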

The fourth defense is negative few-shot examples. Prompts that include explicit don't-call patterns — "if the task is X, do not call cancel_subscription; ask the user to confirm first" — measurably reduce the hallucinated-argument rate on nearby real tools. The LangChain work on few-shot prompting for tool calling is an underused reference here. Positive examples alone teach pattern; negative examples teach boundary.
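
A hypothetical few-shot pair in Chat Completions message format, paraphrasing the boundary above; the tool-call ID and arguments are invented:

```python
# The first exchange is the negative example: the right move is a
# confirmation, not a cancel_subscription call. The second is the positive
# example: the same call is justified once the user confirms.
FEW_SHOT = [
    {"role": "user", "content": "Cancel my subscription."},
    {"role": "assistant",
     "content": ("Cancellation is irreversible, so I won't call the "
                 "cancellation tool yet. Can you confirm you want to "
                 "cancel the Pro plan?")},
    {"role": "user", "content": "Yes, cancel it."},
    {"role": "assistant", "content": None,
     "tool_calls": [{
         "id": "call_1", "type": "function",
         "function": {"name": "cancel_subscription",
                      "arguments": '{"plan": "pro"}'},
     }]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": '{"status": "cancelled"}'},
]
```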

The fifth defense is argument validation with semantic checks, not just type checks. A tenant_id that is a well-formed UUID but does not belong to the requesting principal is the hallucinated-argument failure mode. The dispatcher needs to check identity and scope before calling the tool, and log the refusal as a hallucination signal rather than a generic authorization error — they look the same in the transcript and mean different things for your eval feedstock.
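
A sketch of that semantic layer, with `Principal` and the logger name as illustrative stand-ins:

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("tool_hallucination")

@dataclass
class Principal:
    user_id: str
    tenant_id: str

class FabricatedArgumentError(Exception):
    """Schema-valid argument that fails an identity or scope check."""

def validate_args(name: str, args: dict, principal: Principal) -> None:
    # Type checks already passed; this is the semantic layer.
    if name == "search_documents":
        tenant = args.get("tenant_id")
        if tenant is not None and tenant != principal.tenant_id:
            # Log as a hallucination signal, not a generic authorization
            # error -- they look identical in the transcript but feed
            # different metrics.
            log.warning("hallucinated_argument tool=%s arg=tenant_id", name)
            raise FabricatedArgumentError(
                f"tenant_id {tenant!r} does not belong to {principal.user_id}"
            )
```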

What This Changes in Your Agent Review

If you already run a weekly agent-quality review, the change is small. Add three numbers to the dashboard — unknown-tool rate, shadow-call rate, hallucinated-argument rate — each measured on a frozen probe suite so the number is comparable week to week. Track them per model, per prompt version, per catalog version. When any of them spikes, the review becomes an investigation: did the catalog grow, did the prompt start overclaiming, did a model revision shift the priors?
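
A minimal sketch of that rate-card aggregation, assuming each probe result is logged with its model, prompt version, and catalog version; the field names are invented:

```python
from collections import Counter, defaultdict

RATES = ("unknown_tool", "shadow_call", "hallucinated_argument")

def rate_card(results: list[dict]) -> dict:
    """Aggregate frozen-probe-suite outcomes per (model, prompt, catalog)."""
    buckets: dict[tuple, Counter] = defaultdict(Counter)
    for r in results:
        key = (r["model"], r["prompt_version"], r["catalog_version"])
        buckets[key][r["outcome"]] += 1
    return {
        key: {rate: c[rate] / max(sum(c.values()), 1) for rate in RATES}
        for key, c in buckets.items()
    }
```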

The probe suite is the artifact you are investing in. A good one covers the long tail of your catalog, not just the popular tools. It gets updated when tools are deprecated so that the decoy-catalog probes reflect the current shape of the registry. It runs in CI before a prompt change ships, not after it breaks production.

None of this is glamorous. It does not produce a line chart that goes up and to the right for a launch deck. It produces three numbers you would prefer not to look at, and it makes them visible enough that the next time the model invents a function you never wrote, you see it on the dashboard instead of in a customer ticket.
