
The Tool Description That Drifted Out of Sync With the Tool It Described

Tian Pan
Software Engineer

A backend engineer renames a parameter from user_id to account_id because the two stopped being the same thing six months ago, and a support ticket finally made the ambiguity intolerable. The JSON schema for the tool gets updated in the pull request that ships the rename. The tool's prose description — the one paragraph the model actually reads to decide whether to call the tool and how — lives in a different repository, owned by a different team, updated through a ticket queue, and still reads "pass the user_id to look up the account." Nobody flags it. The model dutifully calls the tool with the right schema, fills the right field, and gets the right answer on every single happy-path query. The bug is invisible until the day a user types something where their authenticated user_id and the account_id they were asking about are two different entities, and the agent confidently returns somebody else's data.

This is the genre of failure where the system is doing exactly what you told it to do — twice, in two different places that disagree. The tool call is well-formed. The arguments pass schema validation. The eval suite is green. The model's reasoning trace, if you look at it, cites the prose description as justification for what it did, and the prose description is wrong. Your model has been quietly running on a specification you stopped maintaining.

Two Sources of Truth, No Authority Over Either

The schema and the description describe the same thing from different angles, but they live in different parts of your codebase and are updated by different humans on different cadences. The JSON schema lives next to the implementation. A backend engineer renames a field, and the type system or the API generator forces the schema to follow: there is no way to ship the code without updating the schema, because the runtime won't accept the old shape. The schema is load-bearing in the most direct way possible: it cannot drift without breaking, and so it stays honest.

The prose description has none of that pressure. It lives in a tool registry, in a documentation site, in a YAML file the platform team maintains, in a paragraph somebody pasted into a Slack thread three quarters ago. It is updated when somebody remembers to file a ticket, or when a release manager catches it during review, or never. Nothing on the runtime path enforces that it stays accurate. Nothing in CI fails when it lies. The schema is treated as code; the description is treated as documentation, and documentation, every team in software has learned, drifts.

The model does not know any of this. The model is given two artifacts that purport to describe the same tool and is left to reconcile them on its own. When they agree, the model's behavior is what you would expect. When they disagree, the model's behavior is shaped by whichever artifact it weighted more strongly during selection, and for the call-site decision — "which tool should I use, and what does its argument mean?" — that is overwhelmingly the prose description. The schema tells the model what fields exist; the description tells the model what they mean. Meaning wins.

The Description Is Part of Your Prompt, You Just Don't Treat It Like One

Engineering teams that would never let a system prompt change without code review will happily let tool descriptions change through a wiki edit. The asymmetry has nothing to do with technical importance and everything to do with where the artifact happens to live. A description in your system prompt is reviewed because the system prompt is in git. A description in your MCP server's tool definition is reviewed if and only if the MCP server lives somewhere reviewed. A description in a third-party tool you mount into the agent is whatever the vendor decides it is on any given Tuesday.

This matters because the description is, functionally, part of the system prompt at runtime. The agent runtime places it in the model's context every time the tool is available, and the model weighs it during selection and uses it to infer what the arguments mean. A description that says "use this to look up the user's account" frames the tool as account-lookup; a description that says "use this to retrieve transaction history for an account when the user asks about past purchases" frames the same endpoint as transaction-retrieval and changes which queries the model will route to it. The schema is identical in both cases. The model's behavior is not.

So when the description drifts — when "user_id" stays in the prose after the schema field becomes "account_id" — the model is not just reading stale documentation. It is being told, in production, that the field semantics are something they are not. The model will reason about the tool using a definition the runtime no longer honors, and will keep doing so until the description is updated or until the discrepancy produces a failure visible enough to investigate. On the common path, neither happens.

How the Failure Looks in Production

The failure has a particular signature that makes it hard to catch in normal review. The tool call succeeds. The argument validates. The response has the right shape. The model's downstream behavior (summarizing the result, planning the next step) looks reasonable. If you sampled the conversation, you would say the agent did the right thing. The bug only surfaces in the slice of traffic where the description's stale definition produces a wrong call, and those are exactly the cases where the two concepts the rename was meant to separate actually differ.

That is the worst-case selection bias for a regression: the failures cluster in precisely the queries where the underlying entity-distinction the rename was meant to fix is load-bearing. Your healthy users — whose user_id equals their account_id because they have a one-to-one relationship — see nothing. Your business customers — who manage many accounts under one identity, which was the reason for the rename in the first place — see cross-account leakage that looks like a permissions bug. Your eval set, if it was built before the rename, contains exactly zero examples of the distinction and grades the model as healthy.

Worse, the model's confidence is not affected. There is no hedging, no "I'm not sure if this is the right account" preamble, because from the model's perspective nothing is ambiguous: it read a description that said user_id, it looked up user_id, it returned the result. The reasoning trace will read as coherent and well-justified, and the post-incident review will spend a long time deciding whether to call this a model error or a tool error before realizing it is a documentation error that happened to be load-bearing at inference time.

What Closing the Gap Actually Requires

The fix is not "be more careful with the description." Teams have been trying to be more careful with documentation for decades, and documentation keeps drifting, because the pressure to update it is always softer than the pressure to ship the underlying change. The fix is to remove the description's status as an independent artifact and make it a derived view of the schema.

The strongest pattern is generation: the prose description is produced from the schema annotations themselves, with parameter names, types, and per-field descriptions pulled from the same source the runtime validates against. A parameter rename becomes a single change to the schema, and the description that ships to the model is regenerated from that schema on every build. There is no longer a second place for the truth to live, because the second artifact does not exist as a writable thing. This is the pattern OpenAPI-to-MCP generators implement, and it is the right default for any tool surface large enough that drift becomes statistical.
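A minimal sketch of the generation pattern, assuming the tool's schema is a JSON Schema object with per-field `description` annotations. The function name `render_description`, the one-line summary argument, and the `LOOKUP_SCHEMA` example are hypothetical; they illustrate the idea and are not taken from any particular generator.

```python
from typing import Any


def render_description(summary: str, schema: dict[str, Any]) -> str:
    """Build the model-facing description from the schema's own annotations."""
    required = set(schema.get("required", []))
    lines = [summary.strip(), "", "Parameters:"]
    for name, spec in schema.get("properties", {}).items():
        kind = spec.get("type", "any")
        flag = "required" if name in required else "optional"
        detail = spec.get("description", "").strip()
        # Parameter names and types come straight from the schema, so a
        # rename in the schema is a rename in the prose.
        lines.append(f"- {name} ({kind}, {flag}): {detail}".rstrip())
    return "\n".join(lines)


# Hypothetical schema for the lookup tool discussed above.
LOOKUP_SCHEMA = {
    "type": "object",
    "properties": {
        "account_id": {
            "type": "string",
            "description": "ID of the account to look up.",
        },
    },
    "required": ["account_id"],
}

DESCRIPTION = render_description("Look up an account.", LOOKUP_SCHEMA)
```

In this sketch the only hand-written prose is the one-line summary and the per-field annotations, and the annotations sit in the same artifact the runtime validates against. The summary can still go stale, which is why a referential check on it remains useful.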

When full generation is not possible — because the prose contains pedagogical framing that the schema annotations cannot capture, or because the description is co-authored with non-engineers — the second-strongest pattern is contract testing on the description's references to schema fields. A test parses the prose, extracts every token that looks like a parameter name (every backticked identifier, every snake_case word that matches the field-naming convention), and asserts that every one of those tokens is a current field in the schema. The test fails when the description mentions user_id after the schema's field has become account_id. It does not require the description to be perfect, only to be referentially honest about the names of the things it describes.
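One way such a contract test could look, as a sketch. It assumes fields follow a lowercase snake_case convention with at least one underscore and that only top-level `properties` matter; the helper names `referenced_fields` and `stale_references` are made up for this example.

```python
import re
from typing import Any

# Backticked identifiers, plus bare snake_case words (at least one underscore).
BACKTICKED = re.compile(r"`([A-Za-z_][A-Za-z0-9_]*)`")
SNAKE_CASE = re.compile(r"\b[a-z][a-z0-9]*(?:_[a-z0-9]+)+\b")


def referenced_fields(description: str) -> set[str]:
    """Collect tokens in the prose that look like parameter names."""
    return set(BACKTICKED.findall(description)) | set(SNAKE_CASE.findall(description))


def stale_references(description: str, schema: dict[str, Any]) -> set[str]:
    """Return field-like tokens the prose mentions but the schema does not declare."""
    declared = set(schema.get("properties", {}))
    return referenced_fields(description) - declared


# Hypothetical schema and descriptions mirroring the rename in this article.
SCHEMA = {"type": "object", "properties": {"account_id": {"type": "string"}}}
STALE = "Pass the user_id to look up the account."
FRESH = "Pass the `account_id` to look up the account."
```

A test suite would then assert that `stale_references(description, schema)` is empty for every registered tool. Nested objects and identifiers that legitimately refer to something other than a field (a tool name, a response key) would need an allowlist, which this sketch leaves out.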

Registry-side validation closes the same gap from the publishing direction: an MCP registry, an internal tool catalog, or a function-calling gateway can refuse to accept a tool definition whose description references a field the schema does not declare. The check is cheap, the failure mode is loud, and it relocates the burden from "remember to update both" to "you cannot ship the broken state." The pattern is identical to the schema-validation step that already prevents wrongly-typed tool calls at runtime — it is just applied to the description as an input rather than to the arguments as an output.
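The same referential check can sit at the publishing boundary. The sketch below uses a hypothetical in-memory `ToolRegistry` with a `publish` method; a real registry or gateway would expose its own interface, and the snake_case token rule is an assumption about the field-naming convention.

```python
import re
from dataclasses import dataclass, field
from typing import Any

# Assumed convention: field names are lowercase snake_case with an underscore.
FIELD_TOKEN = re.compile(r"\b[a-z][a-z0-9]*(?:_[a-z0-9]+)+\b")


class DescriptionDriftError(ValueError):
    """Raised when a description names a field its schema does not declare."""


@dataclass
class ToolRegistry:
    """Hypothetical registry that validates tool definitions on publish."""

    tools: dict[str, dict[str, Any]] = field(default_factory=dict)

    def publish(self, name: str, description: str, schema: dict[str, Any]) -> None:
        # The tool's own name is allowed to appear in its description.
        allowed = set(schema.get("properties", {})) | {name}
        unknown = set(FIELD_TOKEN.findall(description)) - allowed
        if unknown:
            raise DescriptionDriftError(
                f"{name}: description references undeclared fields {sorted(unknown)}"
            )
        self.tools[name] = {"description": description, "schema": schema}
```

Raising at publish time means the drifted definition never becomes visible to an agent, so the author of the change sees the failure instead of a user.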

For the descriptions of third-party tools — vendor MCP servers, hosted function-calling APIs, anything you do not control — the right discipline is treating each tool description as a pinned dependency. The version of the description that you accepted goes into your tool manifest, the new version that arrives next month is reviewed before it ships to production, and the change to the prose is treated with the same gravity as a change to the schema. A vendor renaming a description's framing is, for your model's behavior, a breaking change that did not bump a version number, and you should be prepared to catch it the way you would catch any other silent upstream change.
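Pinning can be as simple as storing a fingerprint per tool and diffing it when the vendor's definitions are fetched again. This is a sketch under assumptions: each tool is a dict with `name`, `description`, and `inputSchema` keys (the shape MCP tool listings use), and the manifest is a plain mapping from tool name to SHA-256 hex digest. The function names are illustrative.

```python
import hashlib
import json
from typing import Any


def fingerprint(tool: dict[str, Any]) -> str:
    """Hash description and schema together, so a prose-only change is visible."""
    canonical = json.dumps(
        {"description": tool["description"], "schema": tool["inputSchema"]},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_tools(pinned: dict[str, str], current: list[dict[str, Any]]) -> list[str]:
    """Names of tools that are new or whose fingerprint differs from the pinned one."""
    return sorted(
        tool["name"] for tool in current if pinned.get(tool["name"]) != fingerprint(tool)
    )
```

Any name returned by `changed_tools` is a candidate for human review before the new definition reaches production, whether or not the vendor bumped a version.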

The Leadership Problem Hiding Inside a Documentation Bug

There is a leadership-level version of this problem that the technical fix does not fully address. The reason the description drifted in the first place is that two teams own two artifacts that describe the same thing, and neither team's review process treats the other team's artifact as part of their own deliverable. The backend engineer who renamed the field did everything right by the standards of their own discipline. The platform engineer who maintains the description was never told the rename happened. The handoff between them is the thing that broke.

Treating tool descriptions as part of the system prompt — and therefore part of the product surface that ships to users — pulls them inside the same review boundary as the rest of the prompt. That means a code owner who has to approve description changes, a CI check that fails when a tool description and its schema diverge, and a release process that does not consider a tool change complete until the description has been re-derived from the schema and re-reviewed. None of those mechanisms are technically difficult. The difficulty is the organizational decision to say that the description is product, not documentation, and to put it under the discipline that distinction implies.

The teams that get this right tend to converge on the same operating principle: every artifact that the model reads at inference time is part of the prompt, and every part of the prompt is a contract with the user. The schema, the description, the system prompt, the tool-result formatter, the retrieval context — all of them shape what the model says, and all of them deserve the same review gravity. The description is not the smallest of those; it is the one that defines what the schema means. Letting it drift is letting the meaning of your tool surface drift, and the model — being a model — will follow the meaning rather than the type.

What to Watch For

If you want to find this failure mode in your own system before a customer does, three checks surface it quickly. First, parse every tool description for identifier tokens and grep them against the live schema; false positives are typically few, and the true positives are exactly the drift cases. Second, sample a week of production tool calls and compare the argument the model passed against the prose description's framing of what the field is for; mismatches between the two are the cases where the description shaped the model's interpretation away from the schema's intent. Third, look at how failures are distributed across user segments: drifted-description bugs concentrate in users whose data violates the simplification the old description assumed, so the failure rate looks like a sharp segment-specific spike rather than a smooth baseline rise.
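The third check reduces to a grouped failure rate. The sketch below assumes each sampled tool call has already been labeled with a user segment and a failed/ok flag (for example from incident reports or a manual audit); the segment names and the `failure_rate_by_segment` helper are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def failure_rate_by_segment(calls: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Map each segment to the fraction of its tool calls that failed."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for segment, failed in calls:
        totals[segment] += 1
        failures[segment] += int(failed)
    return {segment: failures[segment] / totals[segment] for segment in totals}


# Hypothetical sample: single-account users never fail, multi-account users do.
SAMPLE = [("single_account", False)] * 8 + [
    ("multi_account", True),
    ("multi_account", False),
]
RATES = failure_rate_by_segment(SAMPLE)
```

A near-zero rate in one segment next to a high rate in another is the spike shape described above; a uniform rise across segments points to a different kind of regression.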

The tool description is the part of your prompt that nobody on your team thinks of as prompt, and that is exactly why it is the part most likely to be lying to your model right now. The schema is honest because it cannot afford to lie. The description has no such constraint, and unless you build one for it, it will drift the way documentation has always drifted — quietly, asymmetrically, and into precisely the cases where the truth matters most.
