
Argument Hallucination Is a Drift Signal, Not a Model Bug

10 min read
Tian Pan
Software Engineer

The ticket says "model hallucinated a user ID." The triage label is model-quality. The fix is one more sentence in the system prompt. Six weeks later a different tool starts hallucinating a date format, and the loop runs again. After a year of this, the prompt has grown into a 4,000-token apology for the entire backend, and the team is convinced the model is just unreliable on tool arguments.

The model isn't unreliable. The model is a contract-conformance machine reading the contract you gave it — and the contract you gave it has been quietly drifting away from the contract on the other side of the wire. Most production "argument hallucinations" are not model failures. They are integration tests your tool description is silently failing, surfacing as model output because that is the only place in the stack where the divergence becomes visible.

This reframe matters because every downstream decision flips. If the model is bad, you patch the prompt and tune the temperature. If the description is stale, you instrument the gap between description and API, gate description edits in CI, and treat each "hallucinated" argument as a leading indicator of contract drift. One of those scales. The other accumulates technical debt that gets harder to remove every quarter.

The model generates from the description it was given

An LLM tool call is a translation from natural-language schema to JSON payload. The tool description — including the description field on the tool, the parameter docstrings, the enum values, the type hints — is the source text. The model's job is to render an instance of that source text that satisfies the user's intent. When the rendering is wrong, the bug is almost always in one of three places:

  • The source text disagrees with the destination API.
  • The source text is ambiguous enough that multiple renderings are equally valid.
  • The destination API has stricter requirements at runtime than the schema advertises.
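
To make "source text" concrete, here is a hypothetical tool definition in the OpenAI-style function-calling format (the tool name and fields are invented for illustration). Every string in this object, not just the type annotations, is prose the model treats as the contract:

```python
# A hypothetical tool definition in the OpenAI-style function-calling format.
# Every string here -- the top-level description, the per-field docs, the
# example values -- is "source text" the model renders into a JSON payload.
lookup_customer_tool = {
    "type": "function",
    "function": {
        "name": "lookup_customer",
        "description": "Fetch a customer record by ID. IDs are UUID strings.",
        "parameters": {
            "type": "object",
            "properties": {
                "customer_id": {
                    "type": "string",
                    "description": "Customer UUID, e.g. 'a1b2c3d4-...'",
                },
                "include_history": {
                    "type": "boolean",
                    "description": "Whether to include past orders.",
                },
            },
            "required": ["customer_id"],
        },
    },
}
```

If the backend renames a field or tightens a type and this object is not updated, the model keeps rendering the old contract.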

None of those are model defects. A March 2026 incident report from a fintech team made the failure mode concrete: a backend rename from annual_income_verified to verified_annual_income shipped on a Tuesday. The tool description still mentioned the old field name in its prose. The model started returning null for the field, which was logged as a hallucination. Three days of "model regression" investigation later, the actual fix was a one-line edit to the description.

Multiply that across a tool surface of 30 functions, each with five parameters and a description that was written six months ago and never re-read, and the math is depressing. Every backend rename is a latent argument hallucination waiting for a triage pager. The model is doing exactly what was asked.

The base rate confirms it. When teams instrument the gap rigorously, schema mismatches dominate the ranked list of root causes — ahead of context truncation, deep nesting, escaping bugs, temperature randomness, and corrupted message history. Frontier models hit 95–99% argument validity on well-described tools; that drops to 70–85% on the same tools when descriptions go stale or inherit ambiguity from auto-generated OpenAPI specs. The model isn't the variable. The description is.

Three concrete drift modes that look like hallucination

Argument hallucinations cluster into a handful of repeatable patterns. Each one is a different gap between description and runtime, and each one wants a different fix. If your post-mortems lump them all under "model error," you'll never see the patterns.

Type drift. The description says customer_id: string, the runtime now expects string but used to accept int, and the model still generates 12345 because three of the few-shot examples in the description show integers. The model is rendering from the few-shots, not the type annotation. Fix: prune the contradictory examples; don't prompt the model harder.
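
A minimal sketch of that contradiction, with invented names: the annotation says string, but the examples embedded in the description still show the old integer form, and the examples win.

```python
# Hypothetical type-drift setup: the annotation and the embedded examples disagree.
# Models tend to imitate the examples, so the "string" annotation loses.
customer_id_param = {
    "type": "string",  # the runtime now accepts string IDs only
    "description": (
        "Customer ID. Examples of valid calls: "
        "lookup_customer(customer_id=12345), "   # stale: old integer form
        "lookup_customer(customer_id=67890)"     # stale: old integer form
    ),
}

# The fix is to make the examples agree with the annotation:
customer_id_param_fixed = {
    "type": "string",
    "description": "Customer ID as a string, e.g. lookup_customer(customer_id=\"12345\").",
}
```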

Field-name drift. A renamed field stays in the description because the rename PR didn't touch the agent's tools file. The model produces calls with the old name; the API returns 422; the model retries; the loop burns tokens. Fix: a CI diff between the tool's natural-language description and the API's source-of-truth schema, failing the build on divergence.

Required-vs-optional drift. The schema says a field is optional, the runtime quietly made it required, and the model — rationally — omits it. The runtime returns a vague error, the model reads the error, and tries to fix it by inventing a value because the description offered no anchor for what a valid value looks like. Fix: align the schema's required set with runtime reality, and return errors that are themselves structured (the field name plus a one-line description of valid values), not opaque stack traces.
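
As a sketch of what a structured error could look like (the field names are invented for illustration), compare a payload the model can repair from with one that invites it to guess:

```python
# A hypothetical structured validation error the tool runtime could return.
# The model can repair the call from this; it cannot repair from a stack trace.
structured_error = {
    "error": "invalid_arguments",
    "field": "effective_date",
    "problem": "missing required field",
    "expected": "ISO 8601 date string, e.g. '2026-03-14'",
}

# Versus the opaque version that invites the model to invent a value:
opaque_error = "500 Internal Server Error: NullPointerException at LoanService.java:312"
```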

The unifying claim across all three: the model is generating from a stale or ambiguous source text. The "hallucination" is the rendering of that staleness into the JSON payload. Every other diagnostic — confidence scores, retry budgets, judge models — is downstream of getting the source text right.

Argument validity is a per-tool, time-series metric

Most teams do not have an argument-validity dashboard. They have a tool-call success metric, which lumps together "the API returned 200" and "the API returned 200 because the model recovered after three retries." Those two outcomes have radically different cost profiles, and only the second is a hallucination signal.

The instrumentation that earns its keep is narrower and more boring. Per tool, log the argument payload and a binary verdict: valid against the schema, or not. Slice the time series by tool name and by argument field. The result is a dashboard where each tool has its own argument-validity line, and a description edit that introduces drift shows up as a step function within hours.
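
A minimal sketch of that logging, assuming the jsonschema library and a hypothetical in-process registry of per-tool schemas; the logging call stands in for whatever metrics pipeline you already run:

```python
import json
import logging
from jsonschema import Draft202012Validator

logger = logging.getLogger("tool_arg_validity")

# Hypothetical registry mapping tool name -> JSON Schema for its arguments.
TOOL_SCHEMAS: dict[str, dict] = {
    "lookup_customer": {
        "type": "object",
        "properties": {"customer_id": {"type": "string"}},
        "required": ["customer_id"],
        "additionalProperties": False,
    },
}

def record_argument_validity(tool_name: str, raw_arguments: str) -> bool:
    """Validate one tool call's arguments and emit a per-tool, per-field verdict."""
    schema = TOOL_SCHEMAS[tool_name]
    try:
        payload = json.loads(raw_arguments)
    except json.JSONDecodeError:
        logger.info("tool=%s valid=false reason=malformed_json", tool_name)
        return False

    errors = list(Draft202012Validator(schema).iter_errors(payload))
    for err in errors:
        field = ".".join(str(p) for p in err.path) or "<root>"
        logger.info("tool=%s valid=false field=%s reason=%s", tool_name, field, err.message)
    if not errors:
        logger.info("tool=%s valid=true", tool_name)
    return not errors
```

Each log line carries the tool name and the offending field, so the time series slices exactly the way the dashboard needs.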

This is not exotic. It is the same instrumentation any team would deploy on a public API: per-endpoint 4xx rates, sliced by client. The only twist is that the "client" here is the model, and the natural-language description is the API doc the client is reading. When you frame it that way, the operational playbook is well-trodden — and the alert thresholds map directly. A 30% drop in argument validity for one tool is the same kind of incident as a 30% spike in 422s on the corresponding endpoint. They want the same on-call response.

The aggregate metric — overall agent success rate — actively obscures this signal. A single tool can rot for weeks while overall success stays flat, because the agent is silently retrying around it and the cost is hiding in the token bill. The unit of measurement that catches drift is the per-tool, per-field validity rate. Anything more aggregate is too coarse to be actionable.

A CI gate beats a runbook

If the description is the contract and the API is the implementation, the gap between them is a contract test that's missing from CI. Most teams discover this gap reactively: a description edit ships, a customer reports a flaky agent, the team realizes the description didn't match runtime behavior, and someone writes a runbook entry about "remember to update the tool description when you rename a field."

Runbook entries decay. CI gates do not.

The shape of the gate is straightforward. The tool's natural-language description and parameter schema are versioned in source control alongside the API's source-of-truth schema (OpenAPI, protobuf, GraphQL — whichever flavor). A pre-merge check parses both, flags divergence at the field-name, type, and required-set level, and fails the build on any mismatch the author hasn't explicitly waived. The waiver flag exists because some divergences are intentional: the API exposes a field the agent shouldn't touch, or the description deliberately uses a more user-facing name. Forcing an explicit acknowledgment makes drift visible at the moment of the change, when the cost of fixing it is lowest.
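
A sketch of what that pre-merge check could look like, assuming an OpenAPI document as the source of truth, a tools.json holding the agent's parameter schemas, and a component schema named after each tool; all three are simplifying assumptions, and the waiver set is the explicit-acknowledgment mechanism:

```python
"""Pre-merge check: fail the build when a tool's parameter schema diverges
from the API's source-of-truth schema. Paths, schema locations, and the
waiver entries are project-specific placeholders."""
import json
import sys

# Intentional divergences, explicitly acknowledged as (tool, field) pairs.
WAIVERS = {("lookup_customer", "internal_flags")}

def load(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

def diff_fields(tool_name: str, tool_params: dict, api_schema: dict) -> list[str]:
    problems = []
    tool_props = tool_params.get("properties", {})
    api_props = api_schema.get("properties", {})
    # Field-name drift: the description mentions fields the API no longer has.
    for name in tool_props.keys() - api_props.keys():
        if (tool_name, name) not in WAIVERS:
            problems.append(f"{tool_name}.{name}: in tool description, not in API schema")
    # Type drift: same field, different declared type.
    for name in tool_props.keys() & api_props.keys():
        if tool_props[name].get("type") != api_props[name].get("type"):
            problems.append(f"{tool_name}.{name}: type mismatch "
                            f"({tool_props[name].get('type')} vs {api_props[name].get('type')})")
    # Required-set drift: the API requires a field the tool schema marks optional.
    for name in set(api_schema.get("required", [])) - set(tool_params.get("required", [])):
        if (tool_name, name) not in WAIVERS:
            problems.append(f"{tool_name}.{name}: required by API, optional in tool description")
    return problems

if __name__ == "__main__":
    tools = load("agent/tools.json")    # {"lookup_customer": {...parameter schema...}, ...}
    api = load("api/openapi.json")
    problems = []
    for tool_name, params in tools.items():
        api_schema = api["components"]["schemas"].get(tool_name, {})
        problems += diff_fields(tool_name, params, api_schema)
    if problems:
        print("\n".join(problems))
        sys.exit(1)
```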

A team that lands this gate stops shipping a category of incident entirely. The drift is caught at PR review, not at 2 a.m. when the agent's argument-validity rate craters and the on-call has to bisect across three repos to find the cause.

Eval against valid-call sets, not "did the call succeed"

The standard tool-correctness eval grades on a single binary: the agent invoked the right tool. That captures tool-selection failures and misses argument-validity failures entirely. A tool can be correctly selected and still receive completely fabricated arguments — the eval scores the call as a pass, while in production it hits the API, returns 422, and gets logged as a generic API failure rather than an argument problem.

The eval that catches argument hallucination is a golden set of valid calls per tool, with the rubric scoring on field-by-field match against the expected payload. The dataset is small per tool — twenty to fifty examples covers most argument shapes — and grows organically from production traces. A field-by-field score reveals which parameter is drifting, not just that something is wrong, which means the eval failure points at the description edit responsible.
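
One way to do the field-by-field scoring, sketched with an invented golden example; how the candidate call is produced is left out, since that part is whatever harness already drives the agent:

```python
# Field-by-field scoring of a generated argument payload against a golden call.
# The golden example and the tool name here are hypothetical.
GOLDEN_CALLS = [
    {
        "tool": "lookup_customer",
        "prompt": "Pull up the customer with ID a1b2c3d4 and include their order history.",
        "expected": {"customer_id": "a1b2c3d4", "include_history": True},
    },
]

def score_call(expected: dict, actual: dict) -> dict[str, bool]:
    """Return a per-field verdict instead of a single pass/fail."""
    fields = expected.keys() | actual.keys()
    return {f: expected.get(f) == actual.get(f) for f in fields}

# Usage: aggregate per-field scores across the golden set so the eval report
# says "include_history drifted in 40% of cases", not just "12/50 failed".
verdict = score_call(
    GOLDEN_CALLS[0]["expected"],
    {"customer_id": "a1b2c3d4", "include_history": False},  # a candidate call from the agent
)
# verdict == {"customer_id": True, "include_history": False}
```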

The non-obvious payoff: this eval doubles as a regression test for description edits. When the prompt-engineering team rewrites a tool description to be more concise, the eval re-runs and either holds or fails. A description that "reads cleaner" but tanks the argument-validity score is a description that has lost information the model needed. The eval gives that loss a name and a number, instead of leaving it as a vague suspicion that surfaces three weeks later as a customer complaint.

The benchmarks that matter for argument validity are project-specific, not shared. BFCL and similar public benchmarks measure aggregate function-calling competence across thousands of synthetic tools. They do not catch the drift in your specific tool's specific description, because the drift is local to your codebase. The golden set has to live next to the description it grades.

Argument hallucinations are a contract-drift signal — start treating them like one

The pattern is not subtle once you see it. Every "model invented an argument" ticket is a hypothesis with two competing explanations: the model failed, or the description drifted. The base rate strongly favors the second explanation. The triage that lumps them together blocks the team from ever instrumenting the difference, which means description drift compounds silently until the agent's reliability story collapses.

A few things to check this week. Pull the last 100 argument-related agent failures and re-classify them. How many are explainable by a stale field name, an outdated type, or an example that contradicts the schema? If the number is above 30%, the cheapest engineering investment in the agent's reliability is not a better model — it is a CI gate between the tool description and the API schema, plus a per-tool argument-validity dashboard, plus a golden-call eval set for the top ten tools by traffic.

That investment compounds. The model gets cheaper every quarter. The cost of letting your tool descriptions and your API drift apart does not. Argument hallucination is the signal a system gives you when those two contracts are diverging — and the team that learns to read it as a drift alarm, instead of a model defect, is the team whose agents stop getting flakier as their tool surface grows.
