The Dead Tool Nobody Can Remove From the Registry

· 10 min read
Tian Pan
Software Engineer

A tool has been sitting in your shared agent catalog for fourteen months. It was wired up by an engineer who has since left, for a workflow that was sunset two reorgs ago, against a backend service whose owners are no longer sure who they are. The tool definition is 380 tokens. It ships in every system prompt for every agent in the org, on every turn, because nobody can prove it is unused, and the cost of being wrong about that proof is higher than the cost of carrying it forever.

That tool is the database column nobody dares drop. It is the cron job whose log file rotated out years ago. It is the dead code path you can grep for and find zero references to, except eval() exists and you cannot be sure. The agentic version of this problem is worse, because the carrying cost is not merely some bytes on disk — it is paid in tokens, in selection accuracy, and in security surface, on every single inference your platform runs.

Most teams discover this only after the catalog crosses some threshold. Around fifteen to twenty tools, selection accuracy degrades sharply: a recent line of work has measured it dropping below eighty percent once an agent's action surface exceeds that count. The failure mode is not loud errors but quiet wrongness: the model picks a plausible-looking tool whose name is close to the right one, hallucinates parameters that pattern-match the schema, or omits a tool call it should have made. A dead tool in the catalog does not just take up tokens. It actively competes for selection.

The Ratchet Goes One Way

Adding a tool to a shared agent catalog is a one-line change someone makes on a Tuesday afternoon. Removing one is a multi-week audit. This asymmetry is structural, not cultural, and it shows up the same way across every agent platform that has run long enough to accumulate inventory.

The asymmetry is rooted in evidence. Adding a tool requires you to argue that some agent might benefit from it; the worst case if you are wrong is wasted context and a useless option in the action surface. Removing a tool requires you to argue that no agent in production depends on it, today or under any plausible user request, and the worst case if you are wrong is a silently broken capability the model used to invoke. The cost of adding a wrong tool is paid in cents per inference, smeared across millions of calls, attributable to nobody in particular. The cost of removing a tool that turned out to be load-bearing is a Slack thread with your name on it.

So the catalog grows. Every quarter, a new tool ships; no quarter, an old tool retires. The ratchet only goes one way. This is the same pathology that lets API surfaces grow by twenty endpoints a year and shrink by zero. The difference is that an unused REST endpoint costs you a route handler and some documentation. An unused tool costs you a slice of every model's reasoning budget, forever, until somebody pays the audit cost to remove it.

Telemetry Is the Precondition for Deletion

You cannot delete what you cannot measure. The honest deprecation conversation always starts the same way: "Show me how often this tool was called in the last ninety days, broken down by agent and by user-facing flow, and show me which calls came from a real user session versus a regression test." Most teams cannot answer that question.

This is not a hard problem to instrument once you decide to. The pattern that has converged across the OpenTelemetry-for-agents world is to emit a span per tool invocation tagged with the tool name, the calling agent identity, the conversation or session ID, the parent reasoning step, and the outcome (success, error, or timeout). Spanmetrics connectors then derive the RED triplet (rate, errors, duration) per tool. The output you actually want for deprecation decisions is a small panel that lists every tool in the registry along with: calls in the last 30 / 90 / 180 days, distinct agents that invoked it, distinct users whose sessions led to those calls, and last-called timestamp.
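The deprecation panel described above can be sketched in a few lines, assuming the span data has already been exported as flat records. The `ToolCall` fields and the `usage_panel` function are hypothetical names chosen for illustration, not part of OpenTelemetry or any registry product; a real pipeline would more likely compute this in the metrics backend.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation, mirroring the span attributes listed above."""
    tool: str
    agent: str
    session_id: str
    user: str
    outcome: str      # "success", "error", or "timeout"
    is_test: bool     # True for regression-test traffic, not a real session
    at: datetime


def usage_panel(registry, calls, now):
    """Summarize real (non-test) usage for every tool in the registry."""
    panel = {}
    for tool in registry:
        real = [c for c in calls if c.tool == tool and not c.is_test]
        row = {
            f"calls_{days}d": sum(1 for c in real if now - c.at <= timedelta(days=days))
            for days in (30, 90, 180)
        }
        row["distinct_agents"] = len({c.agent for c in real})
        row["distinct_users"] = len({c.user for c in real})
        last: Optional[datetime] = max((c.at for c in real), default=None)
        row["last_called"] = last
        panel[tool] = row
    return panel


# Usage: one live tool, one tool that only CI ever touches.
now = datetime(2025, 1, 1)
calls = [
    ToolCall("search_docs", "support-agent", "s1", "u1", "success", False, now - timedelta(days=3)),
    ToolCall("search_docs", "billing-agent", "s2", "u2", "error", False, now - timedelta(days=100)),
    ToolCall("legacy_export", "ci-agent", "s3", "ci", "success", True, now - timedelta(days=1)),
]
panel = usage_panel(["search_docs", "legacy_export"], calls, now)
```

Iterating over the registry rather than over the calls is deliberate: a tool with no traffic at all still gets a row, and those all-zero rows are the ones a deprecation review is looking for.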

The hard part is not the wiring. The hard part is that nobody set this up before the catalog had fifty tools in it, and now retrofitting telemetry to historical inventory feels like a project, so it does not happen, so the catalog continues to grow under the same blind conditions that created the problem. The cheapest moment to add per-tool usage telemetry is the moment you add the second tool. The second-cheapest moment is today.

Soft Deprecation as a Borrowed Pattern

API design has had decades to think about deprecation as a lifecycle rather than a single event, and almost none of the wisdom has crossed over into agent tooling. The pattern worth borrowing has four stages, and each stage has a specific exit criterion.

Mark as deprecated. The tool stays in the catalog but its description gains a leading warning. In API land this is a Deprecation header and a Sunset date; in agent land it is a line in the tool description like "DEPRECATED — do not select unless explicitly requested by the user." Models will mostly avoid deprecated tools when the instruction is unambiguous, which is itself useful telemetry: usage should drop, and if it doesn't, you have learned something about how the tool was actually being used.

Log every call. Every invocation against a deprecated tool gets tagged in your observability pipeline as a deprecation hit, with the calling agent, the session, and the parent reasoning step preserved. The exit criterion is a quantitative one: ninety days of near-zero traffic, with any spikes investigated and accounted for.

Gate behind explicit opt-in. The tool is removed from the default action surface and only attached to agents that explicitly request it in their configuration. This is the moment of truth: any agent that was silently depending on the tool now has to surface that dependency. The exit criterion is the absence of new opt-in requests over a sustained window.

Remove. The tool definition is deleted from the catalog. Its schema lives on in version control and in your registry's audit log, so a future archaeologist can reconstruct what it once did and why.

The whole process should be measured in weeks or months, not hours; the ninety-day logging window alone sets a floor. The mistake teams make is collapsing it into a single all-or-nothing PR, where the tool is either in the catalog or it is not, which forces every removal to be a high-stakes negotiation. Treating deprecation as a multi-stage funnel turns the same removal into four small low-stakes decisions, each of which can be audited and rolled back independently.
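One way to keep the four stages honest is to encode them as a small state machine in which a tool can only advance when its exit criterion holds. The sketch below is a minimal illustration under stated assumptions: the `Stage`, `Tool`, and `Evidence` types are hypothetical, the ninety-day threshold comes from the text above, and the thirty-day opt-in window is an assumed stand-in for "a sustained window".

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    ACTIVE = "active"
    DEPRECATED = "deprecated"   # warning in the description, every call logged
    OPT_IN = "opt_in"           # removed from the default action surface
    REMOVED = "removed"


DEPRECATION_PREFIX = "DEPRECATED: do not select unless explicitly requested by the user. "
QUIET_DAYS_REQUIRED = 90        # the ninety-day criterion from the text
OPT_IN_WINDOW_DAYS = 30         # assumed value; the text says only "sustained"


@dataclass
class Tool:
    name: str
    description: str
    stage: Stage = Stage.ACTIVE


@dataclass
class Evidence:
    quiet_days: int             # consecutive days of near-zero real traffic
    days_without_opt_in: int    # days since the last new opt-in request


def mark_deprecated(tool):
    """Stage 1: keep the tool listed but prepend the warning."""
    if tool.stage is Stage.ACTIVE:
        tool.description = DEPRECATION_PREFIX + tool.description
        tool.stage = Stage.DEPRECATED
    return tool


def advance(tool, evidence):
    """Move at most one stage forward, and only if the exit criterion holds."""
    if tool.stage is Stage.DEPRECATED and evidence.quiet_days >= QUIET_DAYS_REQUIRED:
        tool.stage = Stage.OPT_IN
    elif tool.stage is Stage.OPT_IN and evidence.days_without_opt_in >= OPT_IN_WINDOW_DAYS:
        tool.stage = Stage.REMOVED
    return tool


def default_surface(tools):
    """Tools attached to every agent; opt-in and removed tools are excluded."""
    return [t for t in tools if t.stage in (Stage.ACTIVE, Stage.DEPRECATED)]
```

Because `advance` moves a single step per call, each transition is a separate, reviewable event, and rolling one back means resetting one field rather than reverting a deletion.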

The Carrying Cost Is Three Costs, Not One

The shorthand "dead tools waste tokens" badly understates the problem. The carrying cost has three components, and the token cost is the smallest of them.

The first cost is tokens. A modern tool schema runs 200 to 800 tokens depending on how rich the parameter description is. A catalog of one hundred tools — easy to hit in any platform team a year in — represents 20,000 to 80,000 tokens of overhead before the first user message is even appended. At today's input pricing, a million daily inferences against that overhead is a six- or seven-figure annual line item, billed silently because nobody attributes it to any individual tool.

The second cost is selection accuracy. This is the cost that actually shapes user experience. Every tool you list expands the model's decision space; every irrelevant tool introduces semantic noise that competes with the right answer. A dead tool whose name partially overlaps with a live tool is the worst kind of noise: it is a confusable distractor that the model has no signal to avoid. The product symptom is not a stack trace; it is a slow rise in subtly wrong tool calls that show up in your eval set as a percentage drop you cannot quite explain.

The third cost is security surface. Every tool definition shipped in the system prompt is a capability the model can be persuaded — through prompt injection, through a malicious upstream document, through a creatively phrased user request — to invoke. A dead tool is not just useless; it is an attack surface whose owners have all rotated out, whose authentication paths nobody monitors, and whose code path nobody patches. The blast radius of an exploit against an unowned tool is the same as the blast radius of an exploit against an owned one, but the time-to-detection is unbounded.

A platform team that frames the carrying cost as "tokens" gets pushback from finance and nowhere else. A platform team that frames it as "tokens, plus accuracy regression we cannot otherwise explain, plus an unowned security surface" gets the budget to actually delete things.
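For the finance conversation, the token arithmetic from the first cost is easy to make concrete. The sketch below multiplies out the figures given earlier (one hundred tools, 200 to 800 tokens each, a million inferences a day). The price is a hypothetical placeholder, not a quoted rate from any provider; substitute your own, and note that prompt caching or per-agent tool subsets would change the result.

```python
def annual_overhead_usd(tools, tokens_per_tool, inferences_per_day, usd_per_million_tokens):
    """Yearly spend on tool definitions that ride along in every prompt."""
    tokens_per_inference = tools * tokens_per_tool
    tokens_per_year = tokens_per_inference * inferences_per_day * 365
    return tokens_per_year * usd_per_million_tokens / 1_000_000


# Hypothetical price in USD per million input tokens; an assumption, not a real rate.
ASSUMED_PRICE = 0.10

low = annual_overhead_usd(100, 200, 1_000_000, ASSUMED_PRICE)
high = annual_overhead_usd(100, 800, 1_000_000, ASSUMED_PRICE)
one_dead_tool = annual_overhead_usd(1, 380, 1_000_000, ASSUMED_PRICE)

print(f"whole catalog: ${low:,.0f} to ${high:,.0f} per year")
print(f"one 380-token dead tool: ${one_dead_tool:,.0f} per year")
```

At the assumed price, the whole catalog comes to roughly $730,000 to $2.9 million a year, and the single 380-token tool from the opening example to just under $14,000. The per-tool figure is the useful one, because it attributes a share of the bill to something a specific owner can be asked to defend.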

Who Owns the Lifecycle

The structural fix is to make sure every tool in the registry has a named owner with explicit responsibility for its lifecycle, and that the lifecycle has a default end state. This sounds obvious and is almost universally missing. The tool catalog inherits all the worst properties of shared infrastructure: contributions are welcomed, ownership is implicit, and the maintenance burden falls on whoever cares enough to do an audit.

The pattern that works is to treat the tool registry the way a mature service mesh treats the service catalog: every tool has a CODEOWNERS-equivalent that points at a real on-call rotation, a renewal cadence that requires the owner to reaffirm the tool's necessity at known intervals — quarterly is a reasonable default — and a default behavior where a tool whose owner has gone silent enters the deprecation pipeline automatically. The renewal can be a one-click acknowledgement; the point is not the friction, it is the existence of a moment where somebody who is currently employed at the company puts their name on the line for the tool's continued existence.

Without renewal, every tool persists by default. With renewal, every tool persists only because somebody chose, recently, to keep it. The same registry that today has a hundred undocumented orphans becomes a registry with thirty tools, each of which has a face behind it. The user experience improvement is immediate and measurable, and the audit anxiety goes away because the audit happens continuously instead of as a heroic quarterly project that nobody volunteers for.
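The renewal rule reduces to a small predicate that a scheduled job could run over the registry. This is a sketch under assumptions: the `Registration` record, the `active_owners` set (imagined as coming from an HR or on-call directory), and the ninety-day interval standing in for "quarterly" are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

RENEWAL_INTERVAL = timedelta(days=90)  # stand-in for the quarterly default


@dataclass
class Registration:
    tool: str
    owner: Optional[str]   # None when the owner has left or was never recorded
    last_renewed: date


def needs_deprecation(reg, today, active_owners):
    """True when nobody currently employed has recently vouched for the tool."""
    orphaned = reg.owner is None or reg.owner not in active_owners
    lapsed = today - reg.last_renewed > RENEWAL_INTERVAL
    return orphaned or lapsed


def renew(reg, owner, today):
    """The one-click acknowledgement: a current owner reaffirms the tool."""
    reg.owner = owner
    reg.last_renewed = today
    return reg


# Usage: the owner of one tool has left; the other was renewed last month.
today = date(2025, 6, 1)
active = {"priya", "marcus"}
regs = [
    Registration("legacy_export", "departed-engineer", date(2025, 5, 1)),
    Registration("search_docs", "priya", date(2025, 5, 1)),
]
to_deprecate = [r.tool for r in regs if needs_deprecation(r, today, active)]
```

The check treats a departed owner the same as a lapsed renewal, so a tool cannot stay alive on the strength of a recent acknowledgement from someone who is no longer there to answer for it.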

Build the Removal Muscle Before You Need It

The teams that handle this well are not the ones with the cleverest deprecation policy. They are the ones that removed their first tool early, before the catalog was big enough to matter, while the cost of getting the removal wrong was small and the institutional muscle for "we delete things here" was being built. The teams that handle this badly are the ones whose first attempted removal happens at scale, against a fourteen-month-old tool with no telemetry and no owner, because by then the deletion is a heroic act rather than a routine one.

The lesson is the same one infrastructure teams keep relearning: capabilities you do not exercise atrophy. If your platform has never removed a tool, the first removal is going to be terrifying regardless of how good your process documentation is. Force the exercise early. Remove something small and recent on purpose. Then remove something a quarter older. Build the muscle now, and the dead tool in your registry — the one you are thinking about right now as you read this — becomes a routine cleanup item instead of an open-ended audit that nobody owns.
