
Schema Entropy: Why Your Tool Definitions Are Rotting in Production

· 10 min read
Tian Pan
Software Engineer

Your agent was working fine in January. By March, it started failing on 15% of tool calls. By May, it was silently producing wrong outputs on another 20%. Nothing in your deployment logs changed. No one touched the agent code. The tool definitions look exactly like they did six months ago — and that's the problem.

Tool schemas don't have to be edited to become wrong. The services they describe change underneath them. Enum values get added. Required fields become optional in a backend refactor. A parameter that used to accept strings now expects an ISO 8601 timestamp. The schema document stays frozen while the underlying API keeps moving, and your agent keeps calling it confidently, with no idea the contract has shifted.

This is schema entropy: the gradual divergence between the tool definitions your agent was trained to use and the tool behavior your production services actually exhibit. It is one of the most underappreciated reliability problems in production AI systems, and research suggests tool versioning issues account for roughly 60% of production agent failures.

What Schema Entropy Actually Looks Like

Schema entropy isn't a single failure mode — it's a category of failures with a shared root cause.

The most visible form is a hard break: you rename a required parameter, and agents immediately start generating calls that fail with 400 errors. These are actually the easy cases. You see the failure, you find the mismatch, you fix it.

The dangerous form is soft rot. Consider these scenarios:

  • You add a new enum value PENDING_REVIEW to a status field. Your agent's tool description still only lists the four values it knew about at launch. When the API starts returning PENDING_REVIEW, the agent tries to interpret it with its existing mental model — sometimes correctly by inference, sometimes not.
  • You make a parameter optional that used to be required. Calls that omit it now succeed at the API layer but trigger different behavior in the backend. Your agent doesn't know this branch exists.
  • You change a field from accepting a bare integer to requiring a string-formatted integer ("42" vs 42). The API silently coerces it for a while, then a framework upgrade stops the coercion. Agent calls start failing with cryptic type errors weeks after the backend change.
  • You add a more specific sibling tool (search_position_budgets) alongside an existing one (search_positions). Both look similar in your tool manifest. The agent starts mixing them up, routing 30% of budget queries to the wrong endpoint.
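The enum scenario is easy to demonstrate. A minimal sketch in Python, where the `get_order` tool name and its status values are hypothetical:

```python
# The tool definition as frozen at launch: only four status values documented.
# The tool name and enum values here are hypothetical.
STALE_TOOL_DEF = {
    "name": "get_order",
    "returns": {"status": {"enum": ["OPEN", "APPROVED", "REJECTED", "CLOSED"]}},
}

def check_response(response: dict) -> list[str]:
    """Flag response fields that have drifted from the frozen definition."""
    documented = set(STALE_TOOL_DEF["returns"]["status"]["enum"])
    status = response.get("status")
    if status not in documented:
        return [f"undocumented enum value: {status!r}"]
    return []
```

`check_response({"status": "PENDING_REVIEW"})` flags the drift, but the point is that nothing in a typical agent loop runs a check like this by default; the unknown value flows straight into the model's context.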

Research on agent tool testing found exactly this last pattern: improving tool descriptions to better distinguish similar tools had a larger impact on agent accuracy than improving the agent's own logic. The description was the bug.

Why Agents Make Bad Schema Consumers

When a human developer calls an API with a stale schema, the workflow includes reading error messages, consulting updated docs, and adjusting. Agents don't have that recovery loop by default. They receive a tool definition at the start of a run and treat it as ground truth for the duration.

More critically, agents fail in ways that are structurally different from code failures. A controlled study on schema-first tool APIs found it useful to distinguish three failure categories:

  1. Interface misuse: Structurally malformed calls — wrong types, missing required fields, hallucinated parameter names. These are the failures formal schemas prevent.
  2. Execution failures: Calls that are well-formed but trigger runtime preconditions the agent didn't know about.
  3. Semantic misuse: Schema-valid calls that are logically wrong for the task. The agent called the right tool with the right structure, just with semantically incorrect values.

Schema entropy primarily causes failure types 2 and 3. A tool definition can be syntactically correct and still point your agent toward the wrong behavior because the meaning of the schema has drifted from the behavior of the service.

This is compounded by a pernicious observability problem: many APIs return HTTP 200 even when an operation failed. The StackOne research on agent testing found that rate-limit errors buried in response bodies — returned as {"status": 200, "data": null} — were being interpreted by agents as "no records exist" rather than "request failed." When HTTP success doesn't mean business success, your agent has no signal that the schema has led it astray.
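A minimal guard against this pattern is to separate transport success from business success. The failure heuristics below are assumptions to be adapted per API, not a general rule:

```python
def business_success(status_code: int, body: dict) -> bool:
    """Treat HTTP success as a necessary, not sufficient, condition.

    The failure markers below are illustrative assumptions; adapt them
    to the conventions of the APIs your agent actually calls.
    """
    if not 200 <= status_code < 300:
        return False
    # Failure signal buried in a "successful" body, e.g. a rate limit
    # returned as {"status": 200, "data": null}.
    if body.get("data") is None and "error" not in body:
        return False
    # Explicit error fields despite the 2xx status code.
    if body.get("error") or body.get("errors"):
        return False
    return True
```

With this in the tool handler, `{"status": 200, "data": null}` becomes a failed call rather than "no records exist".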

The Three Places Schema Rot Starts

Understanding where entropy enters the system tells you where to put your defenses.

External services. You don't control when a third-party API changes its schema. Payment processors add new charge states. CRMs evolve their contact models. Every external tool dependency is a potential schema rot vector, and you won't get a webhook when their API changes.

Internal services crossing team boundaries. A backend team refactors a service endpoint. They update their API documentation. They do not update the LLM tool definition their colleagues wrote six months ago in a different repo. This is the most common scenario in practice — tool definitions live near the agent code while the services they describe live elsewhere.

Model-side changes. Provider updates can change how models interpret schemas. Strict mode enforcement, changes in how optional fields are treated, behavioral differences in how models handle ambiguous enum values — these shift the effective contract even when your JSON schema bytes haven't changed.

Backward Compatibility Rules for Tool Schemas

The core principle is asymmetric: it is always safe to add, never safe to remove, and only sometimes safe to change, depending on the direction of the change.

Safe changes:

  • Adding new optional parameters with documented defaults
  • Adding new enum values if the agent can fail gracefully on unknown values
  • Adding entirely new tools to the manifest
  • Making previously required parameters optional (if the default behavior is correct for existing call patterns)

Breaking changes:

  • Removing or renaming any parameter
  • Making an optional parameter required
  • Changing the type of an existing parameter (even if technically compatible at the wire level)
  • Removing enum values that agents may currently be sending
  • Changing the semantic meaning of a parameter without changing its name

The last one is the hardest to detect. A priority field that used to accept values 1-5 now accepts low/medium/high — the type changed, but the field name didn't, and nothing in a schema diff will surface the semantic shift.

Practical rule: treat any change to an existing parameter as a potential breaking change. New behavior should go in new parameters or new tool versions. Old parameters should be deprecated explicitly with sunset dates in the description field, not silently repurposed.
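The structural cases can be mechanized. A sketch that classifies parameter-level diffs, assuming a simplified definition format of name mapped to `{"type", "required"}`; note that semantic shifts, like the repurposed priority field above, are invisible to any diff like this:

```python
def classify_change(old: dict, new: dict) -> list[str]:
    """Return breaking changes between two simplified parameter maps.

    Each map is {name: {"type": str, "required": bool}}. This catches
    structural breaks only; semantic drift cannot be diffed.
    """
    breaking = []
    for name, spec in old.items():
        if name not in new:
            breaking.append(f"removed parameter: {name}")
            continue
        if new[name]["type"] != spec["type"]:
            breaking.append(f"type changed: {name}")
        if new[name]["required"] and not spec["required"]:
            breaking.append(f"made required: {name}")
    # Adding optional parameters is safe; adding required ones is not.
    for name, spec in new.items():
        if name not in old and spec["required"]:
            breaking.append(f"new required parameter: {name}")
    return breaking
```

A rename shows up as a removal plus a new required parameter, which is exactly how an agent experiences it.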

Versioning Your Tool Manifest

Applying semantic versioning to tool collections — not just individual tools — is the clearest way to communicate change impact to consumers.

A MAJOR bump signals a breaking change: a parameter was removed, a type changed, behavior diverged from the description. Any agent consuming this tool manifest should be re-evaluated before upgrading. A MINOR bump adds new optional parameters or new tools while preserving all existing contracts. A PATCH bump corrects documentation, improves descriptions, or clarifies edge cases without changing behavior.

Concretely, this means:

  • Storing tool manifests as versioned artifacts, not inline strings in agent prompts
  • Including a schema_version field in every event your agent logs, so you can correlate failures with schema versions in post-incident analysis
  • Running multiple tool manifest versions simultaneously during rollouts, assigning traffic to versions the same way you'd assign traffic to API versions
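The first two points can be sketched together. The manifest contents, tool name, and version below are illustrative, not a standard format:

```python
import json
import logging

# A versioned manifest stored as a build artifact rather than an inline
# prompt string. Fields and tool names here are illustrative.
MANIFEST = {
    "schema_version": "2.1.0",  # MAJOR.MINOR.PATCH, bumped per the rules above
    "tools": [
        {"name": "search_positions", "parameters": {"query": {"type": "string"}}},
    ],
}

def log_tool_call(tool: str, params: dict, status: str) -> dict:
    """Build and emit a tool-call event that carries the manifest version,
    so failures can be correlated with schema versions after an incident."""
    event = {
        "tool": tool,
        "schema_version": MANIFEST["schema_version"],
        "params": params,  # sanitize before logging in a real system
        "status": status,
    }
    logging.info(json.dumps(event))
    return event
```

Once every event carries `schema_version`, a post-incident query like "success rate by schema version" becomes a one-liner instead of an archaeology project.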

The Snowplow team developed a similar framework for event schemas called SchemaVer, which explicitly separates model-breaking changes, additions, and fixes. The same taxonomy maps cleanly to tool definitions, and the core insight holds: versioning is not bureaucracy, it's the mechanism that makes rollback possible.

Integration Tests That Catch Schema Rot

Contract testing borrowed from microservice architecture is the right model here. The core idea: for every tool your agent exposes, maintain a suite of test cases that are entirely decoupled from the LLM. The test calls the tool handler directly with known inputs and asserts on the output schema, error behavior, and edge cases.

This gives you two important properties. First, these tests run fast — they don't make LLM API calls. Second, they fail when the underlying service changes, not when your agent happens to exercise that path.
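A sketch of such a contract test, where `search_positions` is a hypothetical handler stubbed out so the test shape is visible; in a real suite it would call the backend service:

```python
# Contract tests for a tool handler, fully decoupled from any LLM.
# `search_positions` stands in for a real handler; the assertions on
# output schema and error behavior are the part that matters.
def search_positions(query: str, limit: int = 10) -> dict:
    if not isinstance(query, str):
        raise TypeError("query must be a string")
    return {"results": [], "total": 0}  # stub for a live backend call

def test_output_schema():
    out = search_positions("engineer", limit=5)
    assert set(out) == {"results", "total"}
    assert isinstance(out["results"], list)

def test_invalid_input_fails_loudly():
    try:
        search_positions(query=None)  # type: ignore[arg-type]
    except TypeError:
        pass
    else:
        raise AssertionError("expected TypeError on non-string query")
```

When the backend changes its response shape, `test_output_schema` fails in CI instead of in an agent trace.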

For external APIs you don't control, the discipline is different: maintain a schema snapshot and run a diff against the live API on a schedule. When the live schema diverges from your snapshot, the diff is your alert. You then decide whether the change is backward-compatible before updating your tool definition — not after your agent starts behaving strangely in production.
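A minimal flat-diff sketch of that snapshot comparison; real schemas need recursive handling, and the field names are illustrative:

```python
def diff_schemas(snapshot: dict, live: dict) -> list[str]:
    """Compare a stored schema snapshot against the live API's schema.

    Flat field -> type maps only; nested schemas need recursion. Any
    non-empty result is the alert that triggers a human review.
    """
    alerts = []
    for field in snapshot:
        if field not in live:
            alerts.append(f"removed: {field}")
        elif live[field] != snapshot[field]:
            alerts.append(f"changed: {field} {snapshot[field]!r} -> {live[field]!r}")
    for field in live:
        if field not in snapshot:
            alerts.append(f"added: {field}")  # possibly benign, still worth review
    return alerts
```

Run on a schedule against the live spec, this turns "the agent started behaving strangely" into "the `amount` field changed type last Tuesday".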

A minimal CI/CD check for tool schema health should verify:

  • All required parameters in your tool definitions are still accepted by the target service
  • All enum values your definitions document are still valid on the receiving end
  • Response schemas match what your tool's output parsing code expects
  • No existing test calls produce 2xx responses with failure signals buried in the body

The last check is the easiest to miss and the most important. The pattern of "HTTP 200 but actually failed" is common enough that it deserves an explicit test category.
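The first two checks reduce to comparing your definitions against whatever schema the target service advertises, such as an OpenAPI document. A sketch with illustrative dict shapes:

```python
def check_tool_health(tool_def: dict, live_spec: dict) -> list[str]:
    """Check that parameters and documented enums are still accepted.

    The dict shapes here are illustrative, not a standard tool-definition
    format; adapt the lookups to however your service publishes its schema.
    """
    problems = []
    live_params = live_spec.get("parameters", {})
    for name, spec in tool_def.get("parameters", {}).items():
        if name not in live_params:
            problems.append(f"parameter no longer accepted: {name}")
            continue
        live_enum = live_params[name].get("enum")
        if spec.get("enum") and live_enum is not None:
            stale = sorted(set(spec["enum"]) - set(live_enum))
            if stale:
                problems.append(f"enum values no longer valid for {name}: {stale}")
    return problems
```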

What to Instrument in Production

Prevention handles the changes you anticipate. Instrumentation handles the ones you don't.

Every tool call in production should log the tool name, tool manifest version, call parameters (sanitized), response status, and whether the response matched the expected output schema. Aggregating these logs lets you detect schema drift as a statistical signal rather than waiting for a loud failure.

Patterns that indicate schema entropy:

  • A specific tool's success rate drops over a rolling window without any deployment event
  • A new enum value starts appearing in responses where your schema doesn't list it
  • The distribution of parameter values shifts — agents start sending values outside the expected ranges
  • Tool call latency changes significantly, suggesting a different execution path on the service side

These signals won't fire immediately. Schema entropy is a gradual process. You need to be watching the right metrics continuously, not just at deploy time.
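The first signal can be sketched as a rolling-window monitor. The window size and thresholds below are illustrative and would need tuning per tool:

```python
from collections import deque

class DriftMonitor:
    """Track a tool's success rate over a rolling window and flag drops
    that happen without a deployment event. Thresholds are illustrative."""

    def __init__(self, window: int = 500, baseline: float = 0.98,
                 tolerance: float = 0.05):
        self.window = deque(maxlen=window)
        self.baseline = baseline
        self.tolerance = tolerance

    def record(self, success: bool) -> bool:
        """Record one call; return True if the tool appears to be drifting."""
        self.window.append(success)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet
        rate = sum(self.window) / len(self.window)
        return rate < self.baseline - self.tolerance
```

Wired into the tool-call logging path, this fires on a gradual success-rate slide long before anyone notices a loud failure.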

The Cultural Problem

All of this is technically straightforward. The hard part is organizational.

Tool definitions are owned by whoever writes the agent, but the services they describe are owned by other teams. When the payments team updates their transaction status enum, they don't think to notify the AI team. When the catalog team adds a new product type, no one runs the diff. The schema lives in the agent repo and the service lives somewhere else, and there's no mechanism that bridges them.

The fix isn't technical — it's treating tool definitions as shared contracts between producer and consumer teams, with the same care you'd give a public API. That means:

  • Tool definitions are checked into a location both the agent team and the service team watch
  • Service owners sign off on tool definition updates that touch their APIs
  • Breaking changes in backend services go through the same process as breaking API changes, because the agent is just another consumer

The underlying principle: if a team can change behavior without updating the tool definition, entropy is guaranteed. The schema will drift. The only question is how long before an agent starts making confidently wrong decisions based on a contract that no longer reflects reality.

Schema entropy is not a hard problem to solve once you can see it. The difficulty is that it's designed to be invisible — no exceptions, no stack traces, just an agent working as hard as ever against a model of the world that quietly stopped being true.
