
The Tool Schema You Changed Without Telling the Agent

Tian Pan
Software Engineer

A backend engineer renames a field. user_id becomes customer_id, because the team finally standardized on the word "customer" across services. They add one more argument, region, because billing now needs it. The change ships behind a normal pull request with two approvals. Every downstream service that calls the endpoint gets updated in the same release. The integration tests are green. By every measure a backend team uses, this is a routine, well-executed API change.

A week later, support tickets start climbing. The agent that places orders is occasionally placing them with no customer attached, or attaching them to the wrong region. Nobody changed the agent. Nobody changed the prompt. The model is the same version it was last month. And yet the agent is now wrong in a way it was not wrong before.

The cause is not a bug in the model and not a bug in the backend. It is that the tool schema has two consumers, and only one of them was in the room when the change was reviewed.

A schema is a contract with two readers

When you expose a function to an agent, you write a tool definition: a name, a description, and a JSON Schema for the arguments. It is easy to think of that schema the way you think of an OpenAPI spec — a machine-readable description that your own code validates against. That mental model is half right, and the missing half is what bites you.
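To make the three parts concrete, here is a minimal sketch of what such a definition might look like before the rename. The tool name place_order, the description text, and the sku and quantity fields are hypothetical; only user_id comes from the story above. The key that holds the schema also differs by provider (for example parameters in some APIs, input_schema in others), so treat the outer shape as an assumption.

```python
import json

# Hypothetical tool definition: a name, a description, and a JSON Schema
# for the arguments. Only `user_id` is taken from the article's example.
PLACE_ORDER_TOOL = {
    "name": "place_order",
    "description": "Place an order on behalf of a user.",
    "parameters": {  # assumed key; some providers call this `input_schema`
        "type": "object",
        "properties": {
            "user_id": {"type": "string", "description": "ID of the user placing the order."},
            "sku": {"type": "string", "description": "Product identifier."},
            "quantity": {"type": "integer", "description": "Number of units."},
        },
        "required": ["user_id", "sku", "quantity"],
        "additionalProperties": False,
    },
}

if __name__ == "__main__":
    print(json.dumps(PLACE_ORDER_TOOL, indent=2))
```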

The first consumer of the schema is the developer who writes the integration code. They read the field names, wire up the call site, and handle the response. If the backend renames a field, this consumer finds out immediately: the build breaks, the types stop matching, the linter complains. The breaking change is loud for the human, because the human's relationship to the schema is mediated by a compiler.

The second consumer is the model. When you pass a tools array to a model API, the provider does not hand the schema to some separate validation subsystem. It typically serializes your tool definitions (names, descriptions, parameter schemas, and any examples) into prompt text, often as part of a system prompt, and feeds that text to the model. The schema is not config that sits next to the model. It is prompt text. The model reads it the way it reads every other instruction: as natural-language context that shapes the next token.
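Providers do not publish one shared serialization format, so the following is only an illustration of the idea, not any vendor's real template. The function render_tools_as_prompt and the wording of the rendered text are hypothetical. The point is that field names end up as ordinary tokens in the prompt.

```python
import json
from typing import Any


def render_tools_as_prompt(tools: list[dict[str, Any]]) -> str:
    """Flatten tool definitions into plain text, the way a prompt would carry them.

    Hypothetical format: real providers use their own templates.
    """
    lines = ["You can call the following tools:"]
    for tool in tools:
        lines.append(f"- {tool['name']}: {tool['description']}")
        schema_text = json.dumps(tool["parameters"], sort_keys=True)
        lines.append(f"  arguments (JSON Schema): {schema_text}")
    return "\n".join(lines)


# Hypothetical tool, still using the pre-rename field name.
tools = [
    {
        "name": "place_order",
        "description": "Place an order on behalf of a user.",
        "parameters": {
            "type": "object",
            "properties": {"user_id": {"type": "string"}},
            "required": ["user_id"],
        },
    }
]

prompt_text = render_tools_as_prompt(tools)
```

If the backend renames the field but this definition is not regenerated, prompt_text still says user_id, and that text is all the model has to go on.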

That distinction is the whole problem. The model's relationship to the schema is not mediated by a compiler. It is mediated by attention over a prompt. When the backend renames user_id to customer_id, the model does not get a build error. It gets nothing. As long as the tool definition and examples in its prompt still say user_id, it keeps emitting user_id, and nothing else in the stack tells it otherwise.

Why the model breaks quietly

A human consumer of an API fails fast and fails completely. A model consumer fails slow and fails partially, and both of those properties make the failure harder to catch.

It fails slow because the model is not re-derived on every request. Whatever shape it learned — from the tool description, from few-shot examples, from the statistical pull of its training data toward common field names like user_id over rarer ones like customer_id — that shape is baked into how it generates the call. A backend can change the schema on Tuesday and the model will keep producing Monday's arguments until somebody updates the prompt and examples it reads from.

It fails partially because tool-call accuracy is statistical, not binary. The model does not switch from 100% correct to 0% correct. It drifts. Perhaps 85% of calls still come out right, because the new field name is close enough to the old one, or because the model sometimes copies the corrected name out of the updated description and sometimes falls back to its prior. A failure rate that moves from near zero to 15% does not trip an alert designed to catch outages. It shows up as a vague quality regression that someone notices a week later, after the data is already dirty.
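A small worked example shows why the outage alert stays silent. The thresholds here are assumptions chosen for illustration (a 50% error rate for an outage alert, a 5-point rise over baseline for a drift alert), and the 1% baseline stands in for "near zero".

```python
def failure_rate(failed: list[bool]) -> float:
    """Fraction of tool calls whose arguments failed validation (True = failed)."""
    return sum(failed) / len(failed) if failed else 0.0


OUTAGE_THRESHOLD = 0.50  # assumed: alert built to catch "the tool is down"
DRIFT_MARGIN = 0.05      # assumed: tolerated rise over the per-tool baseline

# 100 calls before the schema change: 1 malformed.
baseline = failure_rate([False] * 99 + [True] * 1)
# 100 calls after the schema change: 15 malformed.
current = failure_rate([False] * 85 + [True] * 15)

outage_alert = current > OUTAGE_THRESHOLD        # 0.15 is well under 0.50
drift_alert = current - baseline > DRIFT_MARGIN  # 0.14 rise exceeds 0.05
```

Only an alert that compares the per-tool failure rate against its own baseline fires here, and even that assumes malformed calls are being counted as failures in the first place.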

The drift taxonomy is worth naming explicitly, because each variant fails in its own quiet way:

  • Renamed field. The model emits the old key. Your tool receives undefined for the new key. If you do not validate, the call proceeds with a missing argument.
  • Added required parameter. Nothing in the model's anchored context mentions the new argument, so it omits it entirely. The call is structurally incomplete.
  • Changed enum values. The model emits a value from the old set. It is a plausible string that is no longer valid.
  • Changed type. The model sends a string where a number is now expected. A permissive backend coerces it; a strict one rejects it; either way the intent is mangled.
  • Removed field. The model still sends the retired argument, and it is silently dropped.
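Most of these variants can be caught at the tool boundary by checking each call against the current schema before executing it. The sketch below hand-rolls a small subset of JSON Schema (required, type, enum, unknown keys) to stay self-contained; a real system would more likely use a full validator library. The function name check_call, the enum values, and the sample arguments are hypothetical.

```python
from typing import Any

# JSON Schema type name -> Python types accepted for it (subset only).
_JSON_TYPES: dict[str, tuple[type, ...]] = {
    "string": (str,),
    "number": (int, float),
    "integer": (int,),
    "boolean": (bool,),
}


def check_call(schema: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a model-emitted argument dict."""
    problems: list[str] = []
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing required argument: {key}")
    for key, value in args.items():
        if key not in props:
            problems.append(f"unknown argument: {key}")
            continue
        spec = props[key]
        type_name = spec.get("type")
        expected = _JSON_TYPES.get(type_name)
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        wrong_type = expected is not None and (
            not isinstance(value, expected)
            or (isinstance(value, bool) and type_name != "boolean")
        )
        if wrong_type:
            problems.append(f"wrong type for {key}: expected {type_name}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"invalid value for {key}: {value!r}")
    return problems


# Current schema after the change (enum values are illustrative).
NEW_SCHEMA = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string"},
        "region": {"type": "string", "enum": ["us", "eu"]},
        "quantity": {"type": "integer"},
    },
    "required": ["customer_id", "region"],
}

# A call shaped by the old schema: old key, old enum value, old type.
stale_args = {"user_id": "u_123", "region": "emea", "quantity": "2"}
problems = check_call(NEW_SCHEMA, stale_args)
```

For stale_args this reports the missing customer_id, the unknown user_id, the invalid region, and the string quantity. Rejecting the call with those messages turns a silent failure into a loud one, which is the property the human consumer gets from the compiler.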

Every one of these is a breaking change by the standard definition used in API versioning — and the standard definition was written with the human consumer in mind. The model is a consumer too, and it is the more fragile one, because it was tuned against the old shape and it has no contract test guarding it.
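One way to give the model a contract test of its own is to diff the schema the agent was built against with the schema the backend now serves, and fail the build on anything breaking. This is a minimal sketch under assumptions: the function breaking_changes is hypothetical, it only looks at top-level properties, and it cannot tell a rename from a removal, so it reports both the same way.

```python
from typing import Any


def breaking_changes(old: dict[str, Any], new: dict[str, Any]) -> list[str]:
    """List changes between two argument schemas that would break an existing caller."""
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})
    old_req = set(old.get("required", []))
    new_req = set(new.get("required", []))
    changes: list[str] = []

    for name in sorted(old_props.keys() - new_props.keys()):
        changes.append(f"removed or renamed field: {name}")
    for name in sorted(new_req - old_req):
        changes.append(f"new required parameter: {name}")
    for name in sorted(old_props.keys() & new_props.keys()):
        before, after = old_props[name], new_props[name]
        if before.get("type") != after.get("type"):
            changes.append(f"type changed: {name}")
        if "enum" in after:
            if "enum" not in before:
                changes.append(f"enum restriction added: {name}")
            else:
                dropped = sorted(set(before["enum"]) - set(after["enum"]))
                if dropped:
                    changes.append(f"enum values removed from {name}: {dropped}")
    return changes


# The change from the opening story, expressed as two schemas.
OLD = {
    "type": "object",
    "properties": {"user_id": {"type": "string"}},
    "required": ["user_id"],
}
NEW = {
    "type": "object",
    "properties": {"customer_id": {"type": "string"}, "region": {"type": "string"}},
    "required": ["customer_id", "region"],
}

report = breaking_changes(OLD, NEW)
```

Run in the backend's pull request checks, a non-empty report would put the second consumer in the room when the change is reviewed.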

Borrow the vocabulary you already have
