Skip to main content

The Tool Description That Rotted While Your Agent Kept Calling It

· 10 min read
Tian Pan
Software Engineer

Your agent has been quietly wrong for six months and your error rate looks fine. The underlying API shipped a renamed error code, made one optional field required, and started rejecting calls without an idempotency header. The tool description in your agent's system prompt — pasted from a Notion page in Q4 of last year — describes none of this. The agent keeps calling the old shape, the orchestration layer keeps catching the failure and retrying with the same broken arguments, and the only signal in your telemetry is a slightly elevated retry count that nobody on call has the context to investigate.

Tool descriptions are interface contracts. They age the moment the underlying API does. And unlike a typed SDK, they break silently — the model just makes worse calls.

The reason this failure mode is endemic is that nobody on your team owns the description as a versioned artifact. The backend engineer who shipped the API change updated the OpenAPI spec, ran the SDK codegen, and merged the migration. The agent platform's tool definition lives in a different file — sometimes a different repo, often a YAML blob in a prompt template, occasionally a Markdown block in someone's onboarding doc. It was pasted from a draft six months ago. The codegen never touched it. The PR template doesn't ask about it. The on-call runbook doesn't mention it. The drift accrues one quiet schema change at a time until the day a customer escalation forces someone to read the prompt out loud and discover the description is describing an API that no longer exists.

Why drift here breaks silently instead of loudly

A typed SDK with a stale type signature fails at compile time. An HTTP client against a contract that changed surfaces a 400 or a deserialization error your monitoring catches. The LLM tool-calling surface has neither of those gates.

When the description says a field is optional and the API has since made it required, the model omits the field, the API rejects the call, the orchestration layer treats it as a transient failure, and the retry policy issues the same broken arguments again. Your dashboards show "elevated retry rate on tool X." They don't show "the tool description is lying to the model." The model has no way to learn from the failure within a session — it sees a 400, infers a server problem, apologizes to the user, and moves on.

When the API adds a new optional parameter the description doesn't mention, the model never passes it, and the API's behavior subtly changes — defaults the parameter to a value that was sensible six months ago and is now wrong for the customer's tenant. The agent does the wrong thing competently. There is no error. There is just a slow accumulation of incorrect outcomes that surface as user complaints whose root cause nobody can trace.

When the API renames an error code from RATE_LIMITED to THROTTLED_BY_TENANT_POLICY, the description's "if you see RATE_LIMITED, wait and retry" instruction silently stops triggering. The model sees an error code it doesn't recognize, can't apply the retry logic the description encoded, and either gives up or hallucinates a different recovery path. The orchestration layer's metrics will eventually notice, but the time-to-detect is measured in weeks because the failure shows up as user-reported flakiness rather than an alert.

The shared property of all three: the description is part of the agent's reasoning surface, and a wrong description corrupts the reasoning before the call ever lands.

The drift modes that don't even involve an API change

The harder version of this problem is that the surrounding system can drift while the API itself stays still. Most teams don't think about this category until it has burned them.

A downstream service introduces an idempotency-key requirement. The API the agent calls still accepts the old request shape, but the service behind it now retries differently when the key is absent, producing duplicate state changes that look like the agent is double-spending. The tool description, which never mentioned idempotency, can't possibly cue the model to send the key. The fix is upstream of the model, but the symptom looks like an agent bug.

A rate limit gets tightened. The agent has no awareness that the per-minute quota is now half what it was, and it continues issuing the call burst the description's "you can call this freely" language implies. Half the calls bounce. The other half succeed in unpredictable order. The user sees a session that "sometimes works."

A new compliance policy means the call now requires a consent token in the header. The API accepts the call without the token but logs a violation. The team finds out six weeks later when audit pulls the log. The description never had any reason to mention consent because consent didn't exist when the description was written.

In each case, the tool description is technically correct as a description of the API's contract. It's wrong as a description of the call the agent should be making in the current operational reality. The interface the agent reasons against is broader than the OpenAPI spec, and the drift surface is correspondingly broader.

A versioning discipline that treats descriptions as compiled artifacts

The fix is to stop treating tool descriptions as documentation and start treating them as code. Concretely:

Generate descriptions from the same source of truth as the SDK. If the API has an OpenAPI spec, the tool description should be generated from it on the same CI step that generates the typed client. The description's parameter list, type constraints, and required/optional flags are derived from the spec, not transcribed by a human reading the spec. The human-authored content is the prose — what the tool is for, when the agent should reach for it, what the agent should do with the result — and that prose sits next to the spec in the same review process.

Version the description alongside the API. Tools should follow semver. Backwards-incompatible changes to a description's shape — a renamed parameter, a removed field, a tightened type — trigger a major version bump and a new tool name (createInvoice_v2), not a silent edit to the existing description. Agents that haven't been upgraded keep calling createInvoice_v1 against the same shape they were tested against. The platform serves both until the older one is deprecated through the standard channel.

Treat description prose changes as breaking changes. This is the counterintuitive one. A change to the description's prose — even if the JSON schema is identical — can flip a model's tool selection decision. The model that used to call searchOrders for "find my last purchase" might now call listInvoices because the description was reworded. From the API's perspective nothing changed. From the agent's perspective the surface area shifted. CI should treat any description edit as a candidate breaking change and run an eval that proves the tool-selection distribution hasn't moved on a held-out trace corpus before allowing the deploy.

Add a contract-drift CI gate. A scheduled job pulls the live API's OpenAPI spec, diffs it against the spec the deployed tool descriptions were generated from, and fails the build (or pages the owning team) on any change. The signal is "the API moved and the description didn't" and the gate forces the team to either regenerate the description or explicitly acknowledge a deferral with a TTL.

Replay traces against current API behavior. The eval that catches description rot is not a static prompt set — it's a corpus of recent agent traces replayed against today's API. The replay produces a delta per call: "this argument shape no longer validates," "this parameter is now required," "this response field no longer exists." The delta is the punch list of descriptions that need to ship. The replay also catches the silent failure mode where the call still succeeds but with semantically different behavior — the eval grades the outcome, not just the call shape.

Vendor the descriptions. Storing the generated descriptions in your own repo (rather than fetching them at runtime from a vendor's MCP server) means upstream changes don't auto-apply. You see the diff in a PR, you accept it deliberately, and the moment of breakage is a code review event rather than a 3 a.m. page. The tradeoff is staleness, which is what the contract-drift gate above is designed to bound.

Who owns the contract

The organizational failure that makes this category sticky is that no single team owns the tool description end-to-end. The backend team owns the API. The platform team owns the agent runtime. The product team owns the prompt and the user-facing behavior. The description sits in the middle of all three and gets ownership only when something breaks badly enough to justify a postmortem.

The pattern I've seen work is to treat the tool description the same way the codebase treats the SDK — as a generated artifact whose source is in the API repo, whose review is on the API team's PR template, and whose deploys are gated by the same CI checks that block a bad SDK release. The agent platform consumes the description as a versioned package the same way the typed SDK consumers do. When the API team ships a breaking change, the description bump and the SDK bump are the same PR. When a consumer is on an old description version, they're on an old SDK version too, and the deprecation timeline applies uniformly.

The shift is small but consequential: the description stops being a piece of prompt content owned by whoever last edited the agent's system prompt and becomes a piece of interface contract owned by whoever owns the API. The team that ships the API change is now responsible for shipping the description change in the same commit, and the team that consumes it gets the same versioning guarantees they get from the SDK.

What this means for your roadmap

If your agent platform doesn't have a contract-drift CI gate today, your descriptions are stale by some amount and you do not know how much. The first thing to do is measure: pull the live OpenAPI spec for each tool your agent exposes, regenerate the description, and diff it against what's deployed. Every red diff is a latent agent failure mode you are paying for in retries, in misrouted calls, in user complaints whose root cause looks like "the model got it wrong" but is actually "the description told the model the wrong thing."

The deeper realization is that the LLM tool-calling surface is a new place where interface contracts live, and the industry hasn't yet built the muscle to treat it the way it treats SDKs, gRPC stubs, or protobuf schemas. The teams that get this right early will have agents whose failure modes degrade gracefully across API evolution. The teams that don't will spend the next two years debugging agents whose problems are upstream of the model, in a description nobody owns, written against an API that no longer exists.

A tool description is part of your agent's surface area. An agent surface area without a versioning discipline is a surface area whose contract you have chosen not to own — and someone, eventually, will discover the consequences of that choice on a day you would rather not have it discovered for you.

References:Let's stay in touch and Follow me for more thoughts and updates