The Hallucinated Tool Argument That Passed Schema Validation

June 2, 2026 · 9 min read

Software Engineer

The agent calls fetch_order with order_id: "ORD-739241". The schema accepts it — three letters, a dash, six digits, matches the pattern exactly. The tool returns 404. The agent hedges, generates "ORD-739242", calls again, gets another 404, generates "ORD-739243". Your dashboard records three successful tool invocations and three clean schema validations. The customer waits. Somewhere in the trace, every layer of your safety stack is reporting green while the model invents identifiers at full speed.

The team's belief is that the schema would have caught a bad argument. The schema caught what it could catch: shape. It checked that the argument was a string, that it matched a regex, that the required field was present. The schema cannot check that ORD-739241 corresponds to a real order in your database, because the schema does not know your database exists. That gap, between syntactic plausibility and semantic correctness, is where many production tool-calling bugs live, and the failure is so quiet that the only signal is a customer's confusion.

This is not a model problem. It is an architectural mistake that predates the model. Treating JSON Schema as a safety layer is the same category of error as treating a regex on an email field as a deliverability check: the format is necessary, but the format is not the property you cared about. The model's job is to propose a syntactically valid call. Your job is to decide whether the call refers to anything real before the tool fires.

Shape Is Not Truth

Strict structured output and constrained decoding have made it nearly free to enforce that a model's tool call is the right shape. The provider's grammar-based decoder masks invalid tokens at every step, so by construction the output is a well-formed JSON object, the required fields are present, the enum is one of the listed values, and, where the decoder supports numeric bounds, the integer is in the declared range. This is a meaningful guarantee (the year before, half the production stack was string-repair code patching invalid JSON), and it has been so liberating that teams routinely overestimate it.

The guarantee is exclusively about shape. A fabricated string such as "ORD-999999" satisfies the pattern ^ORD-\d{6}$ just as cleanly as a real ID. An ISO date in 1812 passes the date-format check. A user UUID belonging to a different tenant passes the UUID format. The schema says the model returned an object of the right kind. It does not say the object refers to anything.
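
To make the distinction concrete, here is a minimal sketch that runs the same candidates through a shape check and a referent check. The KNOWN_ORDERS set is a hypothetical stand-in for the real order store, and the function names are illustrative, not part of any library.

```python
import re

# The pattern from the text: three letters, a dash, six digits.
ORDER_ID_PATTERN = re.compile(r"^ORD-\d{6}$")

# Hypothetical stand-in for the authoritative order store (an assumption).
KNOWN_ORDERS = {"ORD-104233", "ORD-558120"}


def shape_ok(order_id: str) -> bool:
    """What the schema can check: the string matches the declared pattern."""
    return ORDER_ID_PATTERN.fullmatch(order_id) is not None


def referent_ok(order_id: str) -> bool:
    """What the schema cannot check: the ID names an order that exists."""
    return order_id in KNOWN_ORDERS


for candidate in ("ORD-104233", "ORD-739241", "ORD-999999"):
    print(candidate, "shape:", shape_ok(candidate), "exists:", referent_ok(candidate))
```

Only the first candidate passes both checks; the other two are well-formed and refer to nothing.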

The same property holds in the opposite direction. A failed schema check tells you nothing about whether the underlying intent was reasonable; the model may have hallucinated a priority field that does not exist, and the schema rightly rejects it, but the schema cannot tell you whether the request was an attempt to express something legitimate that your tool surface does not support. Structured output makes shape errors loud. It makes referent errors silent.

The Retry Loop Is Your Cost Center

The failure mode is not the first 404. The failure mode is the second, third, and fourth.

A model that has been trained on examples of helpful agents will, on receiving an error, try again. If the agent harness exposes the error verbatim — "Order ORD-739241 not found" — the model's most likely next action is to assume it got the digits wrong and propose a near-neighbor. Each call passes schema validation. Each call costs tokens. Each call delays the user-facing answer. In an environment with no iteration cap, the loop terminates only when the model gives up or the harness intervenes; in an environment with a cap, the loop terminates having spent your budget on synthetic IDs.

The cost is not only money. The cost is also that your audit trail now contains a sequence of tool calls that look, to any reviewer, like the agent doing reasonable work. Three lookups for similar-looking order IDs is what a confused human operator would also do; reading the trace later does not tell you that the IDs were never extracted from any source. The retries launder the hallucination into something that pattern-matches investigation. By the time someone notices, the explanation lives in nobody's working memory.

Validate the Referent Before the Tool Fires

The missing layer is an existence check that runs between the model's proposed call and the tool's execution. It does not need to be sophisticated. It needs to answer one question: does the argument refer to a thing that exists in the authoritative system the tool will query?

For ID-shaped arguments, this is usually a cheap lookup against the same store the tool would have hit: an indexed EXISTS query, a Redis set membership test, or a Bloom filter when the ID set is too large to hold exactly. A Bloom filter can return false positives, so it can rule an ID out but never confirm one. The cost is one extra round-trip per tool call; the benefit is that the tool only runs against arguments that resolve, and the retry loop never starts because the harness, not the tool, tells the model the ID was not found.
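
One way the gate might look, as a sketch under stated assumptions: ORDER_INDEX stands in for the indexed EXISTS query or set membership test, fetch_order is a placeholder for the real tool, and guarded_call is a hypothetical harness function, not an API from any framework.

```python
from typing import Any, Callable

# Hypothetical stand-in for an indexed EXISTS query or a set membership test.
ORDER_INDEX = {"ORD-104233", "ORD-558120"}

# Records every ID the tool was actually invoked with.
calls: list[str] = []


def order_exists(order_id: str) -> bool:
    """Cheap existence check against the authoritative store (assumed)."""
    return order_id in ORDER_INDEX


def fetch_order(order_id: str) -> dict[str, Any]:
    """Placeholder for the real tool; it only runs on IDs that resolved."""
    calls.append(order_id)
    return {"order_id": order_id, "state": "shipped"}


def guarded_call(
    tool: Callable[[str], dict[str, Any]],
    exists: Callable[[str], bool],
    arg: str,
) -> dict[str, Any]:
    """Run the existence check first; execute the tool only if it passes."""
    if not exists(arg):
        # The harness, not the tool, reports the miss.
        return {"status": "not_found", "argument": arg}
    return {"status": "ok", "result": tool(arg)}


print(guarded_call(fetch_order, order_exists, "ORD-739241"))
print(guarded_call(fetch_order, order_exists, "ORD-104233"))
```

The tool body never sees the fabricated ID, so the audit trail records a rejected argument rather than a lookup that looks like investigation.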

The structured return path matters as much as the check itself. When the lookup fails, the harness should not raise an exception or hand back the 404 verbatim. It should return a result the model has been instructed to handle: a JSON object with a status: "not_found" field, an optional did_you_mean list of near-matches drawn from the actual data (not from the model's imagination), and a directive that the next action should be to ask the user for clarification rather than try another ID. The model is much better at "give up gracefully" when the harness gives it a graceful-give-up shape to fill in.
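
A sketch of that return shape follows. The field names status and did_you_mean come from the description above; next_action, the 0.8 similarity cutoff, and the use of Python's difflib for near-matches are assumptions made for illustration. In practice the candidate set would presumably be scoped to the current user's own orders so that suggestions never leak another customer's IDs.

```python
import difflib
from typing import Any


def not_found_result(
    order_id: str, known_ids: set[str], limit: int = 3
) -> dict[str, Any]:
    """Build the structured result handed back to the model on a failed lookup."""
    # Near-matches are drawn from real data, never generated by the model.
    suggestions = difflib.get_close_matches(
        order_id, sorted(known_ids), n=limit, cutoff=0.8
    )
    return {
        "status": "not_found",
        "argument": order_id,
        "did_you_mean": suggestions,
        # Directive for the model: stop guessing and ask the user.
        "next_action": "ask_user_for_clarification",
    }


# Hypothetical set of orders belonging to the current user.
user_orders = {"ORD-739214", "ORD-104233"}
print(not_found_result("ORD-739241", user_orders))
```

The system prompt or tool description would need to tell the model how to interpret next_action; the field by itself enforces nothing.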

For arguments that are not IDs but are still referential — a customer name, a SKU, a date in a free-form field — the same pattern applies with a fuzzy match. Resolve the argument to a real entity before the tool fires, surface ambiguity to the user as a clarifying question, and refuse to call the tool on an unresolved reference. The agent's perceived "smartness" goes down, because it asks more questions; the agent's actual correctness goes up, because it stops acting on imagined entities.
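
The same gate for a free-form reference could be sketched as below. The Resolution type, the three status names, and the 0.75 cutoff are all illustrative assumptions; difflib is used only because it ships with Python, and a production system would more likely lean on the search index it already has.

```python
import difflib
from dataclasses import dataclass


@dataclass
class Resolution:
    status: str         # "resolved", "ambiguous", or "unresolved"
    matches: list[str]  # real entities only, never model output


def resolve_reference(
    text: str, candidates: list[str], cutoff: float = 0.75
) -> Resolution:
    """Map a free-form reference onto real entities before any tool fires."""
    by_lower = {c.lower(): c for c in candidates}
    key = text.strip().lower()
    if key in by_lower:
        return Resolution("resolved", [by_lower[key]])
    close = difflib.get_close_matches(key, list(by_lower), n=5, cutoff=cutoff)
    matches = [by_lower[c] for c in close]
    if len(matches) == 1:
        return Resolution("resolved", matches)
    if matches:
        # More than one plausible entity: ask the user, do not pick.
        return Resolution("ambiguous", matches)
    return Resolution("unresolved", [])


customers = ["Jon Smith", "Jon Smyth", "Maria Lopez"]
print(resolve_reference("maria lopez", customers))
print(resolve_reference("Jon Smth", customers))
print(resolve_reference("zzzz", customers))
```

Only a "resolved" result should reach the tool. An "ambiguous" result becomes a clarifying question that lists the matches, and an "unresolved" one becomes a request for more detail.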

Confidence Is a Cheap Second Signal

For entities extracted from the user's message — "the order I placed last Tuesday" — there is a second signal worth using: how confident the extraction was. If your entity-extraction step is itself a model call, the logprobs of the extracted tokens give you a usable proxy. If it is a separate classifier, you already have a score. Either way, the threshold question is the same: above what confidence does the agent act, and below what confidence does it ask?

Practitioners report that anything above roughly 0.85 on the joint confidence of the entity tokens is usable without confirmation, anything between 0.70 and 0.85 deserves a clarifying question, and anything below should not fire a tool call at all. The exact thresholds are domain-dependent — a customer-support agent and a payments agent have very different tolerances for being wrong — but the architectural point is invariant: confidence is a number you already have, and refusing to look at it is a choice. Self-consistency sampling, where the same extraction is run several times and disagreement is treated as low confidence, is a more expensive but more robust alternative for the cases where logprobs are not exposed.
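
As a rough illustration of the routing, the sketch below computes a joint confidence from per-token logprobs and maps it onto the three bands. The thresholds are the ones quoted above. The function names, the act/ask/do_not_call labels, and the choice to reuse the same bands for a self-consistency agreement score are assumptions.

```python
import math
from collections import Counter

ACT_THRESHOLD = 0.85  # at or above: act without confirmation
ASK_THRESHOLD = 0.70  # between the two thresholds: ask a clarifying question


def joint_confidence(token_logprobs: list[float]) -> float:
    """Joint probability of the entity tokens: exp of the summed logprobs."""
    return math.exp(sum(token_logprobs))


def agreement(samples: list[str]) -> float:
    """Self-consistency proxy: share of samples that match the most common one."""
    if not samples:
        return 0.0
    top_count = Counter(samples).most_common(1)[0][1]
    return top_count / len(samples)


def route(confidence: float) -> str:
    """Map a confidence score onto the three bands."""
    if confidence >= ACT_THRESHOLD:
        return "act"
    if confidence >= ASK_THRESHOLD:
        return "ask"
    return "do_not_call"


# Two entity tokens at 0.9 probability each give a joint score near 0.81.
score = joint_confidence([math.log(0.9), math.log(0.9)])
print(round(score, 2), route(score))

# Three samples, two in agreement, give an agreement score of 2/3.
print(route(agreement(["ORD-104233", "ORD-104233", "ORD-104239"])))
```

Note how quickly the joint score decays with length: an entity spanning many tokens needs very high per-token probability to clear the top band, which is one reason the thresholds have to be tuned per domain.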

The thing to avoid is letting the model self-report its own confidence as part of the structured output. "I am 95% sure the order ID is ORD-739241" is a hallucination wearing a number; the model has no calibrated access to its own uncertainty in natural language. Use a signal computed outside the model — sampling, logprobs, or a separate verifier — or use no signal at all.

The Architectural Realization

The schema and the referent live in different layers of your stack, and treating them as a single safety boundary is what makes the bug invisible.

JSON Schema is part of the tool-calling protocol. It belongs to the contract between the model and the function-calling API, and it validates that the model is emitting a syntactically valid request against a declared interface. The referent check belongs to your application layer. It validates that the request, well-formed as it may be, points at a thing you can act on. The schema is owned by the platform team; the referent check is owned by whoever owns the data the tool reads from. When nobody owns the second layer, the first layer's green checkmark is everything the team sees, and the team concludes the system is safe.

Naming this in the architecture review is the cheapest intervention. Every tool that takes a referential argument needs a documented pre-execution check, and the absence of that check should be a blocker, not a tech-debt item. Tools that are pure functions of their arguments (a calculator, a unit converter, a regex tester) do not need it; tools that look something up or mutate something do. The dividing line is whether the argument refers to state, and almost every interesting tool refers to state.
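
One way to turn that review rule into something mechanical is a registry check that fails when a stateful tool ships without a referent check. This is a sketch only: ToolSpec, its fields, and missing_checks are hypothetical names, and the lambda is a placeholder for a real existence check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolSpec:
    name: str
    refers_to_state: bool  # does the tool look something up or mutate something?
    referent_check: Optional[Callable[[str], bool]] = None


def missing_checks(tools: list[ToolSpec]) -> list[str]:
    """Names of stateful tools that have no pre-execution referent check."""
    return [
        t.name for t in tools if t.refers_to_state and t.referent_check is None
    ]


registry = [
    ToolSpec("unit_converter", refers_to_state=False),
    ToolSpec(
        "fetch_order",
        refers_to_state=True,
        referent_check=lambda order_id: order_id in {"ORD-104233"},
    ),
    ToolSpec("refund_order", refers_to_state=True),
]

# Prints the tools that would block a release under this rule.
print(missing_checks(registry))
```

Run in CI, a non-empty result would block the merge, which keeps the absence of a check from quietly becoming a tech-debt item.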

The team that conflates schema with safety is paying for hallucinated retries on every silent ambiguity, and the bill arrives as a customer complaint rather than an alert. The team that names the gap and builds the referent check spends one extra round-trip and stops fighting a class of bug. The schema is doing what it was designed to do. The hallucination is happening in the layer above it, and that layer is yours to build.
