
Retiring an Agent Tool the Planner Learned to Depend On

10 min read
Tian Pan
Software Engineer

You unregister lookup_account_v1 from the tool catalog, swap in lookup_account_v2, and edit one paragraph of the system prompt to point at the new name. Tests pass. Three days later, support tickets start mentioning that the assistant "keeps trying to call something that doesn't exist," or — more disturbingly — that it answers customer questions with confident, plausible numbers and never hits the database at all. The deprecation didn't fail at the wire. It failed in the planner.

This is the gap between treating a tool deprecation as a syntactic change and treating it as a behavioral migration. The agent didn't just have your function in a registry; it had months of plans, multi-step recipes, and few-shot examples that routed through that function as a checkpoint. Pulling it out is closer to retiring an internal API your downstream services have informally hardcoded — except the downstream service is a model whose habits you cannot grep, and whose fallback when its preferred tool disappears is to invent one.

Tool retirement is the part of agent engineering where the API-versioning playbook breaks down most expensively. A REST consumer that calls a removed endpoint gets a 404 and a stack trace. An agent that "calls" a removed function might emit a phantom tool call against the old name, get rejected by the runtime, retry with a hallucinated argument shape, and then — having burned through its retry budget — produce a fluent answer that satisfies the user-visible turn while quietly skipping the lookup it was supposed to perform. The wire-level failure is the easy one to catch. The behavioral failure is the one that ships.

What the agent actually internalized about your tool

When engineers think "the agent uses the search_orders tool," they tend to picture a clean dependency: prompt → planner → tool spec → tool call. The reality is messier. The model has at least four overlapping representations of the tool, each of which has to be updated when you retire it.

The first is the schema in the runtime registry. That is the only representation you actually own and can atomically change. The second is the in-context tool list the model sees this turn — typically rendered from the registry, but cached in your prompt template, sometimes pinned to a specific version for prompt-cache stability, occasionally still containing a copy of the old description nobody noticed. The third is the few-shot examples in your system prompt that demonstrate the canonical shape of a successful call: "to find an order by SKU, call search_orders with {sku, region}." Those examples were tuned six months ago against the old tool, and they are now lying to the planner about what's available.

The fourth, and most uncomfortable, is what the model absorbed during pretraining or fine-tuning about API patterns that look like yours. If your tool was named generically — search, get_user, lookup — there is a measurable chance the model will produce calls in that shape regardless of what your registry says, because that is a high-probability completion in many similar contexts. Updating the registry doesn't dent that prior. The remediation is structural: distinctive names, distinctive argument shapes, and aggressive eval coverage on prompts known to trigger the prior.

The first two you can update with a deploy. The third requires you to find every system prompt and few-shot example in your codebase that referenced the old tool — and the third is where most teams forget to look. The fourth requires you to assume hostile prior beliefs and design around them.

The staged retirement pattern

The discipline that ships clean retirements borrows from how protobuf operators handled field deprecations a decade ago: never break, always overlap, observe behavior, then remove.

Mark deprecated, do not remove. The first move is to update the tool's description in the registry to declare the deprecation in plain language the model can read. "DEPRECATED — use lookup_account_v2. This tool will be removed on YYYY-MM-DD." The model reads tool descriptions every turn. That sentence is your cheapest behavioral nudge, and it costs you only a handful of tokens. Crucially, the tool still functions during this window — calling it returns the same data. You are not breaking anything; you are advertising the change.
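
A minimal sketch of what that looks like in an OpenAI-style function-calling schema; the tool name, argument fields, and sunset date here are illustrative, not the article's actual contract:

```python
# The deprecated tool stays registered and fully functional during this
# window; only its description changes, and the model reads that
# description every turn. (Names, fields, and the date are illustrative.)
LOOKUP_ACCOUNT_V1 = {
    "name": "lookup_account_v1",
    "description": (
        "DEPRECATED -- use lookup_account_v2. "
        "This tool will be removed on 2025-09-01. "
        "Looks up an account by id and returns balance and status."
    ),
    "parameters": {
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
        "required": ["account_id"],
    },
}
```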

Dual-register the new tool with a different name. Resist the temptation to ship the new implementation as a drop-in replacement under the old name. Distinct names give the planner a freshness signal it can act on, and they let you build evals that diff the two. The new tool's description should explicitly reference the old: "Replaces lookup_account_v1. Identical contract except region is now required." That cross-reference is a migration hint the planner can consume without you having to rewrite every system prompt. A sketch of the overlap window follows.
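
Continuing the registry sketch above, the overlap window registers both versions side by side; the schema shape and field names remain assumptions for illustration:

```python
# The new tool's description points back at the old one, so a planner whose
# habits still reference v1 has an in-context path to the replacement.
LOOKUP_ACCOUNT_V2 = {
    "name": "lookup_account_v2",
    "description": (
        "Replaces lookup_account_v1. Identical contract except 'region' "
        "is now required. Looks up an account and returns balance and status."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "account_id": {"type": "string"},
            "region": {"type": "string"},
        },
        "required": ["account_id", "region"],
    },
}

# Both versions ship in the catalog until the soft-removal window closes.
TOOL_CATALOG = [LOOKUP_ACCOUNT_V1, LOOKUP_ACCOUNT_V2]
```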

Soft-removal window with telemetry. During the migration, every call to the deprecated tool succeeds — and emits a deprecation event your platform tracks. You want a dashboard that answers two questions: which sessions still route through the old tool, and which prompts produce that routing? The first tells you when it's safe to remove. The second tells you which few-shot examples and which downstream parsers in your codebase still need updating. The deprecation event is metadata, not a failure; it is the signal you use to schedule the actual removal.
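
One way to wire the soft window, sketched with Python's standard logging as the event sink; the handler, session field, and template field are hypothetical stand-ins for whatever your platform actually tracks:

```python
import logging

logger = logging.getLogger("tool_deprecation")

def call_lookup_account_v1(args: dict, session_id: str, prompt_template: str) -> dict:
    """Soft-removal window: the call still returns real data, but every
    invocation emits a deprecation event tagged with the session and the
    prompt template that produced the routing."""
    logger.info(
        "deprecated_tool_call",
        extra={
            "tool_name": "lookup_account_v1",
            "session": session_id,
            "template": prompt_template,
        },
    )
    return _lookup_account(args["account_id"])  # same data as before the deprecation

def _lookup_account(account_id: str) -> dict:
    # Stand-in for the real data-access path; unchanged during the soft window.
    return {"account_id": account_id, "balance": 0.0, "status": "ok"}
```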

Hard removal with a structured failure mode. When the soft window ends, do not silently 404 calls to the old tool. Return a structured error the model can recover from: "Tool lookup_account_v1 was retired on YYYY-MM-DD. Use lookup_account_v2 instead. Identical contract." The model can read that error and re-plan; a generic exception cannot. Teams that skip this step discover that their agents respond to the removal by hallucinating arguments to the dead function, falling into a retry loop, and finally producing an answer with no tool call at all.
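
A sketch of what that hard-removal dispatch could look like; the dispatch table and the error shape are assumptions, and the point is only that the rejection is structured and names the replacement:

```python
from typing import Any, Callable

# Live handlers for tools still in the catalog (illustrative stub).
TOOL_HANDLERS: dict[str, Callable[[dict], Any]] = {
    "lookup_account_v2": lambda args: {"balance": 0.0, "status": "ok"},
}

RETIREMENT_NOTICE = (
    "Tool lookup_account_v1 was retired on 2025-09-01. "  # hypothetical date
    "Use lookup_account_v2 instead. Identical contract except 'region' is now required."
)

def dispatch_tool_call(name: str, args: dict) -> dict:
    """Hard removal: a call to the retired name gets a structured error the
    planner can read and re-plan from, not a generic exception."""
    if name == "lookup_account_v1":
        return {
            "error": "tool_retired",
            "message": RETIREMENT_NOTICE,
            "replacement": "lookup_account_v2",
        }
    return TOOL_HANDLERS[name](args)
```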

The shape of this pattern should feel familiar. It is the same staged contract — deprecate, dual-publish, telemetry, remove — that any versioned API runs. What's different is the consumer: the planner is more forgiving than a strict client (it will pick up the new name if you point at it firmly enough) and more dangerous (when it can't, it will invent something).

The fallback-hallucination failure mode

The single worst outcome of a botched tool retirement is not the agent failing loudly. It is the agent failing silently, by producing an answer that looks plausible but never consulted the source of truth.

Here's the path. The planner has a multi-step recipe in its short-term context: "to answer balance questions, call lookup_account_v1, then format the result." You remove the tool. On the next balance question, the planner emits a call to lookup_account_v1. The runtime rejects it. The planner sees the rejection in its trace. Most modern agent harnesses will retry once, perhaps with a slightly mutated argument shape ("maybe it expected account_id instead of id"). That retry also fails. At this point, the planner has burned its tool budget for the turn, and the next step in the recipe — "format the result" — has no result to format. What it does next depends on the harness, the model, and the prompt.

A well-defended harness will refuse to answer and surface the failure. A loosely defended one will let the model produce a fluent paragraph that confidently discusses the user's "current balance," because the user's recent messages mentioned a number, the system prompt referenced typical balance ranges, and the model is in a context where producing a plausible-sounding numerical answer is the most common training-time completion. The user reads the answer, finds it plausible, and goes about their day. The team finds out from a customer complaint a week later, when the number turns out to be wrong.

The defense has two layers. First, your runtime should never let an agent emit a final answer on a turn whose required tool calls failed — that is a harness-level invariant, not a per-feature decision. Second, your retirement plan should include eval prompts that are known to historically route through the deprecated tool, and those evals should assert that the new plan reaches the same outcome. If the post-retirement plan produces a fluent answer with no successful tool call, the eval should fail. Both defenses are available today; neither is universal practice.
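
The first layer is small enough to sketch. The Turn and ToolCall shapes below are hypothetical, and the refusal text is a placeholder for whatever your product actually surfaces:

```python
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    required: bool
    succeeded: bool

@dataclass
class Turn:
    tool_calls: list[ToolCall]
    draft_answer: str  # the model's free-form answer for this turn

def finalize_turn(turn: Turn) -> str:
    """Harness-level invariant: if any required tool call failed this turn,
    refuse the model's draft and surface the failure instead of letting it
    improvise a plausible number."""
    if any(c.required and not c.succeeded for c in turn.tool_calls):
        return ("I couldn't retrieve the account data needed to answer this. "
                "Please try again in a moment.")
    return turn.draft_answer
```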

Behavioral diff evals are the actual test

The textbook eval for a tool retirement compares final-answer correctness before and after. That is necessary but insufficient. It tells you the system still gets the right answer; it does not tell you the system gets it the right way.

What you actually need is a behavioral diff: for a fixed set of input prompts that previously routed through the deprecated tool, record the full trajectory — every tool call, every argument, every intermediate observation — under the old configuration and under the new one. Compare them step by step. The questions you want answered are not "did the answer match" but "did the planner reach the new tool, did it pass the right arguments on the first try, did it require any retries, did its plan length grow or shrink, did it skip a step entirely?"
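
A minimal sketch of that comparison, assuming each recorded trajectory is a list of tool-call steps; real traces would carry arguments and intermediate observations as well:

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str
    args: dict
    succeeded: bool

def behavioral_diff(old: list[Step], new: list[Step], replacement: str) -> dict:
    """Compare two trajectories for the same prompt, recorded under the old
    and new tool configurations, step by step rather than answer by answer."""
    return {
        "old_plan": [s.tool for s in old],
        "new_plan": [s.tool for s in new],
        "reaches_replacement": any(s.tool == replacement and s.succeeded for s in new),
        "first_call_succeeded": next((s.succeeded for s in new if s.tool == replacement), False),
        "retries": sum(1 for s in new if not s.succeeded),
        "plan_length_delta": len(new) - len(old),
        "any_successful_tool_call": any(s.succeeded for s in new),
    }
```

A deploy gate can then fail when reaches_replacement or any_successful_tool_call comes back false for a prompt that used to route through the old tool.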

Trajectory-level evaluation is well-supported by the current agent eval ecosystem — frameworks like AgentEvals and behavior-regression tools track exactly this — but most teams use them for model upgrades rather than tool changes. Tool retirements are exactly the change shape these tools were designed for. The dataset you need is not enormous; thirty to fifty representative prompts that the deprecation telemetry confirms used to route through the old tool will reveal whether the new plan is healthy. Run that diff under the old and new tool configurations, and let the difference between trajectories — not just answers — gate the deploy.

The investment pays off again the next time. The same trajectory dataset is the regression suite for every future change you make to that tool, including the next deprecation, the next argument shape change, and the next system-prompt rewrite.

Treat the tool catalog as a versioned API

The architectural realization underneath all of this is simple: an agent's tool catalog has a deprecation contract the same way every other versioned API has a deprecation contract. The fact that the consumer is a model rather than a service does not exempt the catalog from the discipline; it raises the bar.

Versioned APIs have published sunset headers, dual-publication windows, telemetry on deprecated-endpoint usage, and migration guides. Tool catalogs need the equivalent: schema-level deprecation flags the model can read, dual-named tool versions, deprecation telemetry per session and per prompt template, and migration evals on trajectories rather than answers. The team that ships removals as code changes — git rm, redeploy, update the prompt — is treating a behavioral migration as a syntactic one. The team that ships removals through the staged contract is the team whose agent doesn't quietly start hallucinating account balances.

The forward-looking shift is to push this discipline upstream into the platform. If your tool registry tracks a deprecation flag, your eval harness can automatically build a trajectory diff between old and new versions before allowing the flag to flip from "soft" to "hard." If your prompt templates are scanned at build time for tool-name references, the build itself can fail when a deprecated tool's name still appears in a few-shot example. Most teams currently run these checks manually, by accident, after an incident. The teams that ship reliable agents run them by default, before the incident. The cost of building that discipline is one engineer-week of platform work. The cost of skipping it is one incident per quarter, indefinitely.
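
The build-time scan is a few lines; the prompts/ directory and the deprecated-name set below are assumptions about where your templates live and what the registry's deprecation flags export:

```python
import pathlib
import sys

# In practice this set would be exported from the registry's deprecation flags.
DEPRECATED_TOOL_NAMES = {"lookup_account_v1"}

def scan_prompt_templates(root: str = "prompts/") -> int:
    """Fail the build when a prompt template or few-shot example still
    references a deprecated tool by name."""
    offenders = []
    for path in pathlib.Path(root).rglob("*"):
        if not path.is_file():
            continue
        text = path.read_text(errors="ignore")
        offenders += [f"{path}: references {name}"
                      for name in DEPRECATED_TOOL_NAMES if name in text]
    for line in offenders:
        print(line, file=sys.stderr)
    return 1 if offenders else 0

if __name__ == "__main__":
    sys.exit(scan_prompt_templates())
```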
