
The Tool Schema Evolution Trap: When One Optional Parameter Changed Your Planner's Prior

10 min read
Tian Pan
Software Engineer

A new optional parameter goes into a tool description on a Tuesday. The change is small — six lines in the diff, no breaking signature change, no callers updated, no eval cases touched. The PR description says "adds support for an optional language filter to the existing search tool." Two reviewers approve. It ships.

A week later, the cost dashboard shows that the search tool is being called eighteen percent more often than the prior baseline. Latency on the affected agent has crept up by roughly the same proportion. Nobody can point to a single failing eval. The new parameter, when used, behaves correctly. The new parameter, when not used, doesn't matter. And yet the planner has clearly changed its mind about when to reach for this tool — and the eval suite, which grades tool correctness, has nothing to say about a shift in tool frequency.

This is the tool schema evolution trap, and it catches almost every team that ships an agentic feature within the first eighteen months of production. The trap is invisible because the failure mode doesn't look like a failure. No call is broken. No output is malformed. The schema is backward-compatible by every reasonable definition. The only thing that changed is the model's prior: the implicit belief, formed from everything the planner step reads, about how capable and applicable this tool is for the queries flowing through the agent. And tool descriptions are part of the prompt surface. They are read on every turn. A description edit is a prompt edit.

A Tool Description Is a Prompt Addendum

Most teams treat tool schemas the way they treat OpenAPI specs: as machine-readable contracts that live in code, that are owned by engineering, and that change through normal code review. Prompts, by contrast, are treated as a behavioral surface — they have an A/B harness, they have eval gates, they have a version history annotated with the failure modes that motivated each edit.

The split makes sense in theory. The schema is structured; the prompt is freeform. The schema is consumed deterministically by the agent runtime; the prompt is interpreted probabilistically by the model. Two artifacts, two disciplines.

In practice, the model reads both artifacts on every turn, and it doesn't know which one your team thinks is which. A tool description that says "use this to search the product catalog when the user asks about availability" carries the same prior-shifting weight as a system prompt sentence that says "when the user asks about availability, call the search tool." The model can't tell them apart. They are concatenated into the same context window. They are tokenized by the same tokenizer. They are attended to by the same heads.
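A minimal sketch makes this concrete. The payload below follows the general shape of tool-use APIs, but the model name, the tool, and the prompt text are all illustrative; the point is only that the system prompt and the tool description travel in the same request and land in the same context window.

```python
# Illustrative request payload; field names follow the common tool-use APIs,
# but the model, tool, and prompt text are hypothetical.
request = {
    "model": "some-model",
    "system": "When the user asks about availability, call the search tool.",
    "tools": [
        {
            "name": "search_catalog",
            "description": (
                "Use this to search the product catalog when the user "
                "asks about availability."
            ),
            "input_schema": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "filters": {"type": "object"},
                },
                "required": ["query"],
            },
        }
    ],
    "messages": [{"role": "user", "content": "Is the blue kettle in stock?"}],
}
# The model tokenizes and attends over both strings the same way; the
# "schema" sentence and the "prompt" sentence are indistinguishable to it.
```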

The implication is uncomfortable for the way most teams organize work. The team that owns the agent's behavior — the prompt — does not own the tool description. The team that owns the tool description — engineering — does not have a behavioral test suite for it. The handoff between the two falls into a gap, and every edit on either side of the gap is a prompt edit on the other side that nobody graded.

Why Optional Parameters Are Especially Sticky

The specific failure mode — a new optional parameter inflating tool call frequency — has a mechanism worth naming. When a model sees a tool with parameters query and filters, it forms a prior over which queries this tool is suited for, conditioned on the description and the parameter set. When you add a third optional parameter — say, language — three things happen at once.

The description gets longer to explain the new parameter, and a longer description tends to read as a more capable tool. The parameter list itself expands the model's mental model of what the tool can do; even if the parameter is optional, its presence is a capability signal. And the surrounding examples or hints — if you added any to demonstrate the new parameter — tilt the planner toward queries that resemble those examples, even when the user's actual query is unrelated.
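A before-and-after sketch of a hypothetical search tool shows how small the diff looks and how much new description text rides along with it:

```python
# Hypothetical schema before the edit.
search_tool_v1 = {
    "name": "search_catalog",
    "description": "Searches the product catalog by query, with optional filters.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "filters": {"type": "object"},
        },
        "required": ["query"],
    },
}

# Hypothetical schema after the edit: one optional parameter, plus the extra
# description text needed to explain it. Nothing breaks and no caller changes,
# but every planner turn now reads a longer, more capable-sounding tool.
search_tool_v2 = {
    "name": "search_catalog",
    "description": (
        "Searches the product catalog by query, with optional filters. "
        "Supports an optional language filter, e.g. language='de' to "
        "restrict results to German-language listings."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "filters": {"type": "object"},
            "language": {"type": "string", "description": "ISO 639-1 code"},
        },
        "required": ["query"],
    },
}
```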

None of these effects are bugs. They are exactly what you'd want if the new parameter genuinely expanded the tool's reach. But the team adding the parameter is rarely thinking about reach. They are thinking about a specific upstream feature that needs language filtering. They expect the tool's behavior on the unaffected ninety-five percent of calls to be unchanged. The expectation is wrong, and the eval suite has no way to tell them.

Tool description bloat compounds the effect. Anthropic's internal testing showed that 58 tools could consume around 55,000 tokens, and on complex catalogs the combined schema can add several thousand tokens to every request — even for tools the agent never calls. When tool selection accuracy collapses from 43% to under 14% as catalogs bloat, the team can't always tell whether the regression came from the latest edit or from accumulated bloat across the past quarter.
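One way to make the bloat visible is to measure the catalog's serialized footprint directly. The sketch below assumes tiktoken is a close-enough proxy for your model's tokenizer and that tools are serialized roughly as JSON; swap in whatever your runtime actually does.

```python
import json
import tiktoken  # assumption: a close-enough proxy for your model's tokenizer

def catalog_token_footprint(tools: list[dict], encoding_name: str = "cl100k_base") -> int:
    """Rough token count for a tool catalog as it would be serialized into the prompt."""
    enc = tiktoken.get_encoding(encoding_name)
    return sum(len(enc.encode(json.dumps(tool, indent=2))) for tool in tools)

# Every request pays this cost up front, whether or not any of these tools is called:
# catalog_token_footprint([search_tool_v1, search_tool_v2])  # the sketches above
```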

Semver for Tool Schemas

The discipline that catches this class of regression starts with treating tool schemas the way you'd treat any other versioned API — but with a behavioral interpretation of "breaking change" that goes beyond signature compatibility.

A schema edit that changes the parameter list is at minimum a minor version. A description edit that changes phrasing in a way that could shift the planner's prior — "use this for" versus "this can" versus "you may" — is also at minimum a minor version. A truly trivial edit, like fixing a typo that doesn't change meaning, is a patch. The point of the versioning isn't compliance theater. It's to mark, at PR time, which edits demand a behavioral eval gate and which don't.
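A PR-time check for this convention can stay small. The sketch below is conservative by design and leaves the judgment about typo-only edits to the reviewer; the field names follow the schema shape sketched earlier.

```python
def required_bump(old: dict, new: dict) -> str:
    """Classify a tool-schema edit for PR review. Conservative by design:
    anything that could shift the planner's prior gets a minor bump and a
    frequency-eval run; a reviewer can downgrade a true typo fix to a patch."""
    old_params = set(old["input_schema"]["properties"])
    new_params = set(new["input_schema"]["properties"])
    if old_params != new_params:
        return "minor"  # parameter list changed: the capability signal changed
    if old["description"] != new["description"]:
        return "minor"  # wording changed: assume prior-shifting until reviewed
    return "patch"
```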

The eval gate for a minor schema version has to grade frequency, not just correctness. The team needs a baseline of how often each tool is called across a representative slice of production traffic — call it the tool-call frequency baseline — and the gate fails any PR where the post-edit frequency on the same traffic differs by more than some threshold (five percent is a reasonable starting point; tune from production).
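Concretely, the gate can be a short script: replay the same sampled traffic against the old and new catalog, count tool calls per tool, and fail the PR when any tool's frequency moves past the threshold. In the sketch below, the transcript shape and the replay() helper are assumptions standing in for whatever your agent runtime records.

```python
from collections import Counter

THRESHOLD = 0.05  # five percent relative shift; a starting point, tune from production

def tool_call_frequencies(transcripts: list[list[dict]]) -> Counter:
    """Count tool calls per tool name across a set of agent transcripts."""
    counts: Counter = Counter()
    for transcript in transcripts:
        for step in transcript:
            if step.get("type") == "tool_call":
                counts[step["name"]] += 1
    return counts

def frequency_gate(baseline: Counter, candidate: Counter) -> list[str]:
    """Return the tools whose call frequency shifted past the threshold."""
    failures = []
    for tool in set(baseline) | set(candidate):
        before, after = baseline[tool], candidate[tool]
        if before == 0:
            if after > 0:
                failures.append(tool)  # a tool the planner never reached for is now being called
            continue
        if abs(after - before) / before > THRESHOLD:
            failures.append(tool)
    return failures

# baseline  = tool_call_frequencies(replay(traffic_sample, schema_v1))  # replay() is hypothetical
# candidate = tool_call_frequencies(replay(traffic_sample, schema_v2))
# assert not frequency_gate(baseline, candidate), "tool-call frequency shifted past the gate"
```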

This is the eval that's missing in most stacks today. Tool-calling evaluations focus on whether the model picked the right tool for a given query, and whether the parameters it generated conformed to the schema. Both are correctness measures. Neither answers the question of whether the planner is now reaching for a tool more often than it should, or whether a tool that used to be a last resort is now a first instinct.

Description Style as a Behavioral Surface

Beyond versioning, the team needs a description-style guide that names the phrasings known to shift the planner's prior. The guide is short and specific because the effects are specific.

"Use this for X" reads as a directive and biases the planner toward calling. "This can do X" reads as a capability statement and is more neutral. "You may use this when X" hedges further and biases the planner away from calling. Capability lists biased toward verbs ("searches, filters, sorts") read as broader than capability lists biased toward nouns ("a product search interface"). Example queries embedded in the description anchor the planner to the surface form of the examples, often more strongly than the abstract capability description that precedes them.

None of this is folklore. It is observable in any A/B harness that compares two tool descriptions on the same traffic and measures call frequency. The reason most teams don't observe it is that they don't have the harness, and the reason they don't have the harness is that they treat tool descriptions as code rather than as prompts. The fix is to set up the harness, run the comparison once on a description edit that didn't pass behavioral review, and let the team see the frequency shift in numbers. After that, the discipline is self-sustaining because the engineers who broke it once don't want to break it twice.
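The harness itself is small. The sketch below assumes a run_agent(query, tools=...) helper that executes one agent run and returns a transcript of steps shaped like the records above; everything else is counting.

```python
import random

def description_call_rate(queries: list[str], catalog: list[dict], tool_name: str,
                          run_agent, sample_size: int = 500) -> float:
    """Fraction of sampled queries on which the agent called the named tool.
    run_agent(query, tools=...) is assumed, not a real library call."""
    sample = random.sample(queries, min(sample_size, len(queries)))
    calls = 0
    for query in sample:
        transcript = run_agent(query, tools=catalog)
        calls += any(step.get("type") == "tool_call" and step["name"] == tool_name
                     for step in transcript)
    return calls / len(sample)

# rate_old = description_call_rate(traffic, catalog_with_old_description, "search_catalog", run_agent)
# rate_new = description_call_rate(traffic, catalog_with_new_description, "search_catalog", run_agent)
```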

A Deprecation Path for Behavioral Debt

Some of the parameters added in haste won't pay back their behavioral cost. A team adds a language filter for one upstream feature, the feature is deprioritized two quarters later, and the parameter sits in the description forever — still inflating call frequency, still costing tokens on every request, still encoding a capability the team no longer wants to advertise.

A deprecation path for tool parameters needs the same discipline as a deprecation path for any other piece of capability surface. Each optional parameter should have a metadata block — owner, motivation, behavioral cost when added, expected usage rate. A quarterly audit walks the catalog, looks at which parameters are actually being used in production calls (not just the ones the agent could call with them, but the ones the planner is actually choosing to use), and prunes the ones whose usage rate is below the threshold that would justify their frequency-inflation cost.
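Neither the metadata block nor the audit needs much machinery. A sketch, with the field names and the usage floor as assumptions:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ParameterRecord:
    tool: str
    parameter: str
    owner: str
    motivation: str             # the upstream feature that asked for it
    added_on: date
    frequency_cost: float       # call-frequency shift measured at the eval gate when it shipped
    expected_usage_rate: float

USAGE_FLOOR = 0.02  # assumption: below two percent of calls, a parameter is a prune candidate

def prune_candidates(records: list[ParameterRecord],
                     observed_usage: dict[tuple[str, str], float]) -> list[ParameterRecord]:
    """Quarterly audit: flag optional parameters whose observed production usage
    no longer justifies the frequency inflation they were measured to cost."""
    return [r for r in records
            if observed_usage.get((r.tool, r.parameter), 0.0) < USAGE_FLOOR]
```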

The audit is unglamorous. It is exactly the kind of work that doesn't make a PR description, doesn't show up in promo packets, and doesn't get noticed until the team that didn't do it starts seeing tool call costs that don't track to a single recent change. The teams that do it find one or two prune-worthy parameters per audit, save a few percentage points on tool selection accuracy, and recover some of the planner's attention budget for the calls that actually matter.

The Architectural Realization

Tool descriptions are part of the prompt surface. The phrase is short; the implication is structural. It means the artifact your engineering team versions in git belongs under the same discipline the prompt repo has been building for two years: eval gates, A/B harnesses, deprecation paths, style guides, behavioral changelogs.

The team that recognizes this early gets to design the seams between schema engineering and prompt engineering before the artifacts diverge. The team that recognizes it late discovers, the first time the cost dashboard spikes for no apparent reason, that the failure mode it just observed has been silently accumulating for months — that a dozen small description edits, each individually defensible, have collectively shifted the planner's prior across the entire tool catalog, and that backing out the regression means undoing edits whose original motivation has been forgotten.

The work isn't expensive once the discipline is in place. A frequency baseline is a single batch job over a sampled day of traffic. An A/B harness for tool descriptions is the same harness the team already runs for system prompts, pointed at a different artifact. A versioning convention is a CONTRIBUTING.md paragraph and a CODEOWNERS line. A quarterly audit is a calendar event and a query.

What's expensive is discovering the trap retroactively. The team that learns this lesson the hard way pays it in confused incident calls, post-hoc forensics across description edits nobody remembers writing, and the dawning realization that the version control system was never going to surface the regression because the regression wasn't in the diff. It was in what the diff meant to the model — and that meaning was something the team had decided, by organizational default, not to measure.
