Your Tool Descriptions Are Prompts, Not API Docs
The tool description is not documentation. It is the prompt the model reads, every single turn, to decide whether this tool fires and how. You are not writing for the developer integrating against the tool — the developer already has the schema, the types, the examples in the PR. You are writing for a stochastic reader that has never seen this codebase, is holding twenty other tool descriptions in the same context window, and has to pick one in the next forward pass.
Most teams don't write for that reader. They paste the OpenAPI summary into the description field, stick the JSON Schema under it, and ship. Then the agent undercalls the tool, confidently calls the wrong adjacent tool, or fires the right tool with parameters that were "obviously" wrong to any human reading the schema. The team blames the model. The model was reading exactly what you wrote.
Recent work auditing 856 tools across 103 production MCP servers found that 97.1% of tool descriptions contained at least one quality smell, and 56% failed to clearly state what the tool is for. That is not a niche problem in one codebase. That is the default state of tool surfaces in 2026. When researchers mutated the descriptions under controlled conditions, fixing accuracy and functionality smells swung tool-selection outcomes by 8–12 percentage points on single changes. In competitive settings where multiple servers expose functionally equivalent tools, standard-compliant descriptions reached 72% selection probability versus a 20% baseline — a 3.6× gap decided entirely by the prose.
The Three Symptoms of Doc-Style Descriptions
Before fixing anything, you have to recognize what doc-style descriptions look like in production traces. The failure modes cluster into three signatures, and they all point at the description, not the model.
Undercalling is the quietest failure. The tool exists, the situation calls for it, and the agent does not fire it. Instead, the agent improvises an answer from its parametric knowledge, hallucinates a result, or asks the user a clarifying question the tool was supposed to eliminate. The giveaway in traces: reasoning tokens that mention the tool's general subject area ("the user wants weather") without naming the tool or attempting a call. This is what happens when the description is a noun phrase — "Weather service integration" — rather than a trigger condition the model can pattern-match against user intent.
Confident misuse is louder and more expensive. The agent fires the tool, but with parameters that would never come from a human reading the docs. Date ranges in the wrong format, city names where the tool expects IANA timezone identifiers, query strings with internal escape sequences the model made up to satisfy an underspecified string type. This is the symptom of schema-only descriptions: the model sees type: "string" and fills the most plausible-looking value. The schema constrains the structure, not the semantics, and the description did nothing to pin down the semantics.
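The gap is easy to see side by side. Here is a hypothetical weather tool's timezone parameter, schema-only versus with the semantics pinned in the parameter description (the tool and field names are illustrative):

```python
# Schema-only: "New York" passes the type check and fails at runtime.
loose = {"timezone": {"type": "string"}}

# Semantics pinned: the description tells the model what a valid value means.
pinned = {
    "timezone": {
        "type": "string",
        "description": (
            'IANA timezone identifier, e.g. "America/New_York". '
            "Do not pass city names or UTC offsets."
        ),
    }
}
```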
Wrong-tool selection is the most embarrassing failure because it looks like a reading-comprehension error the model should not make. Two adjacent tools exist — get_user and get_user_profile, or search_issues and list_issues — and the model picks the wrong one. Benchmark work on function-calling leaderboards shows this pattern is not uniformly distributed; smaller models and models under context pressure keyword-match aggressively, and five of eight models tested fired get_weather whenever the word "weather" appeared in the prompt, even when explicitly told not to check the weather. Adjacent tools with overlapping verbs and objects are a landmine that pure API documentation cannot defuse because documentation describes tools in isolation. The model reads them together.
Tool Descriptions as Prompts: The Four Moves
Treating a tool description as a prompt means taking four techniques you already use on system prompts and applying them to every single tool spec.
Move 1: Lead with trigger conditions, not capabilities. An API doc reader wants to know what the endpoint does. A model wants to know when to fire it. Replace "Retrieves order status from the fulfillment database" with "Use this tool when the user asks about the current status, shipping progress, or tracking information for a specific order. Requires an order ID (format: ORD-XXXXXX)." The first version describes the tool. The second version describes the decision the model is about to make. That decision is what the description is for.
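Seen as the spec the model actually receives, the difference is a single field, and it is the field doing all the work. A minimal sketch, reusing the hypothetical order tool from above:

```python
# Doc-style: describes the capability.
before = {
    "name": "get_order_status",
    "description": "Retrieves order status from the fulfillment database.",
}

# Prompt-style: describes the decision. Same tool, same schema; only the
# prompt surface changed.
after = {
    "name": "get_order_status",
    "description": (
        "Use this tool when the user asks about the current status, shipping "
        "progress, or tracking information for a specific order. Requires an "
        "order ID (format: ORD-XXXXXX)."
    ),
}
```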
Move 2: Include concrete example invocations. Providing input examples is the single highest-leverage addition to a tool description, especially for tools with nested parameters, optional fields, or format-sensitive inputs. Not a schema — an example. {"order_id": "ORD-847291", "include_items": true} teaches the model more about valid parameter ranges than three paragraphs of prose. Anthropic's guidance on writing tools for agents calls this out explicitly: concrete examples help with edge cases that schemas cannot express, like whether optional parameters should usually be included or omitted.
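One way to carry the example inside the description itself, continuing the hypothetical get_order_status tool. The last line, guidance on when to include the optional field, is exactly the kind of thing a schema cannot express:

```python
description = (
    "Use when the user asks about status, shipping progress, or tracking "
    "for a specific order.\n"
    'Example call: {"order_id": "ORD-847291", "include_items": true}\n'
    "Set include_items only when the user asks what the order contains; "
    "omit it otherwise."
)
```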
Move 3: Write negative examples and sibling disambiguation. This is where API docs stop and prompts start. If your description only says what the tool does, the model has no way to know when not to use it. Add an explicit "Do not use this for X" section pointing to the sibling tool that handles X. list_issues should end with something like: "Do not use to fetch a single issue by ID — use get_issue for that. Do not use to search issue content — use search_issues, which supports full-text queries." This feels redundant to a human reader holding all three specs in their head at once. The model, which reads the descriptions serially with limited attention, benefits every time.
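A sketch of the full trio, cross-referenced both ways, with GitHub-style tool names standing in for your own:

```python
sibling_descriptions = {
    "list_issues": (
        "Use to page through the issues in a repository, optionally filtered "
        "by state or label. Do not use to fetch a single issue by ID (use "
        "get_issue) or to search issue content (use search_issues)."
    ),
    "get_issue": (
        "Use to fetch one issue when you already have its ID or number. "
        "Do not use to browse (use list_issues) or to search text "
        "(use search_issues)."
    ),
    "search_issues": (
        "Use for full-text queries over issue titles and bodies. Do not use "
        "when the user wants every issue in a given state (use list_issues) "
        "or already has an issue ID (use get_issue)."
    ),
}
```

Note that each description names both siblings. Disambiguation only works when every member of the cluster points outward.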
Move 4: State the output shape in the description, not just the schema. The agent's next tool call depends on what the current one returned. If the model cannot predict the output shape from the description, it will over-fetch, call a second tool to re-derive information it already has, or truncate reasoning because it cannot tell what the response will contain. A one-line "Returns an object with status (enum: shipped, pending, delivered), estimated_delivery (ISO date or null), and tracking_url (string or null)" is worth dozens of failed tool loops.
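Putting the four moves together on the hypothetical get_order_status tool (input_schema follows Anthropic's tool format; the sibling names list_orders and create_refund are illustrative):

```python
get_order_status = {
    "name": "get_order_status",
    "description": (
        # Move 1: trigger condition first.
        "Use when the user asks about the current status, shipping progress, "
        "or tracking information for a specific order. Requires an order ID "
        "(format: ORD-XXXXXX).\n"
        # Move 2: a concrete example invocation.
        'Example call: {"order_id": "ORD-847291", "include_items": true}\n'
        # Move 3: negative cases pointing at siblings.
        "Do not use for order history (use list_orders) or for refunds "
        "(use create_refund).\n"
        # Move 4: the output shape, stated where the model plans its next step.
        "Returns an object with status (enum: shipped, pending, delivered), "
        "estimated_delivery (ISO date or null), and tracking_url (string or null)."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Format ORD-XXXXXX."},
            "include_items": {"type": "boolean"},
        },
        "required": ["order_id"],
    },
}
```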
The Eval Harness You Actually Need
Treating descriptions as prompts only pays off if you iterate on them the way you iterate on prompts — with an eval loop tight enough that a description change lands a measurable delta within a few minutes. Most teams don't have this. They have unit tests for the tool logic and a handful of demo scenarios they run by hand. That setup will catch a broken schema. It will not catch a description that sends the agent down a wrong-tool rabbit hole on 12% of relevant queries.
The harness has three layers, and you need all three to iterate confidently.
The first layer is the description-targeted eval set. For each tool, you collect 20–50 user queries that ought to trigger that tool, plus 20–50 queries that ought to trigger an adjacent tool or no tool at all. These are not the happy-path scenarios from your product demo. They are the messy, underspecified, keyword-adjacent queries that expose tool-selection failures. You score each run on three axes — did the right tool fire, did the wrong tool fire, did the model invent an answer without firing anything — and track those numbers separately. Collapsing them into a single accuracy score hides exactly the failure mode you most need to surface.
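A minimal scoring sketch, assuming a run_agent callable that stands in for your harness and returns the name of the tool the agent fired, or None:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    query: str
    expected_tool: str | None  # None: this query should fire no tool

def score(cases: list[EvalCase], run_agent) -> dict[str, float]:
    """Track the three axes separately; never collapse them."""
    right = wrong = no_call = 0
    for case in cases:
        called = run_agent(case.query)
        if called == case.expected_tool:
            right += 1
        elif called is not None:
            wrong += 1     # a tool fired, but the wrong one
        else:
            no_call += 1   # the model answered without firing anything
    n = len(cases)
    return {
        "right_tool": right / n,
        "wrong_tool": wrong / n,
        "no_call": no_call / n,
    }
```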
The second layer is the competitive-context eval. Real agents hold dozens of tools in their system prompt. A description that looks great in isolation may lose attention to a nearby description with slightly stronger trigger language. You run your eval with the full tool catalog loaded, not just the tool under test. Research on MCP servers in competitive settings shows the gap between a well-described tool and a badly-described one widens as the catalog grows — the description doesn't just compete with the user's intent, it competes with every other description in context.
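Continuing the sketch above, and assuming run_agent accepts the tool catalog as a parameter, the competitive run is the same cases scored at increasing catalog sizes (the catalog variables are placeholders for slices of your own tool list):

```python
# A description that holds up against 5 tools may lose selection share at 40.
for catalog in (core_tools, core_plus_search, full_catalog):
    results = score(cases, lambda q, c=catalog: run_agent(q, tools=c))
    print(f"{len(catalog)} tools in context: {results}")
```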
The third layer is the production trace sampler. Sample actual agent runs, group by tool selected, and flag the runs where the model's reasoning does not match the tool it called. This is the only way to find the failure modes you didn't think to test. The patterns you discover here become the next round of eval queries, and the descriptions you refine here become the next version of the system. Production is the corpus your eval harness is always chasing.
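A sketch of the sampler, assuming traces carry the model's stated reasoning and the tool it called, plus a judge callable (an LLM-as-judge prompt or a keyword heuristic) that checks whether the reasoning supports the call:

```python
from collections import defaultdict

def flag_mismatches(traces, judge):
    by_tool = defaultdict(list)
    for trace in traces:                      # group runs by the tool selected
        by_tool[trace["tool_called"]].append(trace)
    flagged = []
    for tool, runs in by_tool.items():
        for run in runs:
            if not judge(run["reasoning"], tool):
                flagged.append(run)           # candidate for the next eval round
    return flagged
```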
Budget and the Token Tradeoff
The immediate objection to writing tool descriptions like prompts is cost. Longer descriptions consume more input tokens on every turn, and modern agents load dozens to hundreds of tools into context. The math looks grim on the surface: a 50-token description for 40 tools is 2,000 tokens; a 200-token description for 40 tools is 8,000 tokens, repeated on every forward pass through a multi-turn conversation.
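The per-conversation totals are what make the surface math look grim. A quick sketch with illustrative figures:

```python
N_TOOLS, TURNS = 40, 12  # illustrative catalog size and conversation length

for desc_tokens in (50, 200):
    per_turn = desc_tokens * N_TOOLS       # descriptions ride along every request
    per_conversation = per_turn * TURNS    # and are resent on every turn
    print(f"{desc_tokens}-token descriptions: "
          f"{per_turn:,} tokens/turn, {per_conversation:,}/conversation")
# 50-token descriptions: 2,000 tokens/turn, 24,000/conversation
# 200-token descriptions: 8,000 tokens/turn, 96,000/conversation
```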
The math changes once you account for the cost of failure. A wrong-tool call that triggers a retry burns one full tool call's worth of tokens plus whatever reasoning the model does to recover. A hallucinated answer that the user has to correct burns the turn plus user trust. The research on augmenting MCP tool descriptions found that better descriptions improved task success by a median of 5.85 percentage points and partial goal completion by 15.12% — but also increased execution steps by 67% on some tasks, so the tradeoff is real and direction-dependent. The right move is not "make every description longer." It is "make the descriptions that gate the highest-stakes decisions detailed, and keep the ones for dead-simple, unambiguous tools terse." Your eval harness tells you which is which.
Prompt caching is the other half of this equation. If your tool descriptions sit in a stable prefix that gets cached across turns and across users, the per-request cost of a long description approaches zero while the quality benefit persists. Teams that have not moved their tool catalog into the cached prefix are paying the full price on every request and have a distorted view of the economics. Moving the tool list into a cache-friendly position is usually a higher-leverage intervention than shortening individual descriptions.
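With the Anthropic Messages API, for example, marking the last tool in the array as a cache breakpoint caches the whole catalog prefix, and subsequent requests read it at the discounted cached rate. A sketch (the catalog loader and model id are illustrative):

```python
import anthropic

client = anthropic.Anthropic()

tools = load_tool_catalog()  # hypothetical: your full catalog, in stable order
# Cache breakpoint on the last tool: everything up to and including it is
# written to the prompt cache once, then read cheaply on later requests.
tools[-1]["cache_control"] = {"type": "ephemeral"}

response = client.messages.create(
    model="claude-sonnet-4-5",  # illustrative model id
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Where is order ORD-847291?"}],
)
```

Stable ordering matters here: caching is prefix-based, so reordering the catalog between requests invalidates the cache.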
The Organizational Shift
The deeper change is treating tool descriptions as owned artifacts, not boilerplate. When a tool is built, the API team writes the endpoint and the OpenAPI spec. The prompt for the model that will call that endpoint should be owned by whoever owns the agent experience — which is often a different team, with different review standards. Today, most orgs either let the API team's doc string leak into the tool description or let the agent team rewrite descriptions without telling the API team, an arrangement that drifts out of sync as endpoints change.
Version the tool description alongside the prompt, not alongside the endpoint. Review description changes the way you review prompt changes — with an eval run, not a code review focused on grammar. When endpoints change, force a description review; when descriptions change, force a prompt regression run. The eval harness is the forcing function; without it, descriptions drift and nobody notices until a production incident.
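One lightweight forcing function, sketched here as a hypothetical CI test: pin a hash of every description, so any change fails the build until the eval suite is re-run and the pin is updated in the same PR (load_tool_catalog and the pin file are placeholders):

```python
import hashlib
import json

def test_descriptions_pinned():
    pins = json.load(open("tool_description_pins.json"))
    # load_tool_catalog() is a placeholder for however you assemble the catalog.
    for tool in load_tool_catalog():
        digest = hashlib.sha256(tool["description"].encode()).hexdigest()[:12]
        assert pins[tool["name"]] == digest, (
            f"{tool['name']} description changed: re-run the description eval "
            "and update tool_description_pins.json in the same PR."
        )
```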
The simple reframe is the whole thesis: your tool descriptions are part of the prompt the model reads, and every prompt-engineering discipline you apply to system prompts — clarity, concrete examples, negative cases, disambiguation, iteration against evals — applies to them. Teams that internalize this ship agents that call the right tool, with the right parameters, more often. Teams that keep writing OpenAPI summaries ship agents that look fine in demos and fail in traces.
- https://www.anthropic.com/engineering/writing-tools-for-agents
- https://arxiv.org/html/2602.14878v2
- https://arxiv.org/html/2602.18914
- https://apxml.com/courses/building-advanced-llm-agent-tools/chapter-1-llm-agent-tooling-foundations/tool-specifications-descriptions
- https://huggingface.co/blog/kelseye/general-fc
- https://arxiv.org/html/2411.13547v2
- https://gorilla.cs.berkeley.edu/leaderboard.html
- https://blog.quotientai.co/evaluating-tool-calling-capabilities-in-large-language-models-a-literature-review/
- https://www.llamaindex.ai/blog/building-better-tools-for-llm-agents-f8c5a6714f11
- https://community.openai.com/t/prompting-best-practices-for-tool-use-function-calling/1123036
