
Tool Docstring Archaeology: The Description Field Is Your Highest-Leverage Prompt

11 min read
Tian Pan
Software Engineer

The highest-leverage prompt in your agent is not your system prompt. It is the one-sentence description you wrote under a tool definition six months ago, committed alongside the implementation, and never touched again. The model reads it on every turn to decide whether to invoke the tool, which arguments to bind, and how to recover when the response doesn't match expectations. Engineers treat it as API documentation for humans. The model treats it as a prompt.

The gap between those two framings is where the worst kind of tool-use bugs live: the model invokes the right function name with the right arguments, and the right API call goes out — but for the wrong reasons, in the wrong situation, or in preference over a better tool sitting next to it. No exception fires. Your eval suite still passes. The regression only shows up as a slow degradation in whatever metric you use to measure whether the agent is actually helping.

A recent empirical study of 103 MCP servers spanning 856 tools found that augmenting tool descriptions alone produced a statistically significant 5.85-percentage-point lift in task success. That is a remarkable effect size for a change that costs nothing at inference time and doesn't require a new model. But the same study saw regressions in 16.67% of cases — meaning naive description edits hurt more than they help roughly one in six times. The description field is powerful, and it is sharp. Most teams are holding it by the wrong end.

The Docstring Is Compiled to Prompt Tokens, Not Rendered to Humans

When you register a tool with a framework like LangChain or the Anthropic SDK, the function's docstring is pulled directly into the system context on every model call. In LangChain's @tool decorator, the docstring becomes the tool description verbatim. With the Anthropic API, the description field of each tool schema is inlined into the prompt the model sees. It is not documentation. It is prompt text with extra steps.
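A minimal sketch of that round trip, using LangChain's @tool decorator with a hypothetical search_kb tool (the tool name and body are illustrative, not from any real codebase): the docstring you write is, verbatim, the description the model reads.

```python
from langchain_core.tools import tool

@tool
def search_kb(query: str) -> list[str]:
    """Search the customer support knowledge base for articles
    relevant to a user's question."""
    ...  # implementation omitted; the docstring is what matters here

# The docstring is now prompt text, injected on every model call.
print(search_kb.description)
# Search the customer support knowledge base for articles
# relevant to a user's question.
```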

This reframing changes what "good" looks like. Traditional docstring virtues — brevity, neutral voice, "what it does" over "when to use it" — are actively harmful when the reader is a model choosing between twelve similar tools. The model needs boundary conditions, not slogans. Consider the difference:

  • Bad: "Search across internal knowledge and surface the most relevant results."
  • Better: "Search the customer support knowledge base for articles relevant to a user's question. Use only for questions about product features, pricing, or troubleshooting. Do NOT use for questions about billing history, account state, or anything requiring personalized data — use get_account_status for those. Returns up to 5 ranked results; empty list means no match, not low confidence."

The second reads like terrible marketing copy and excellent prompt engineering. It names the tool it competes with, establishes when-not-to-call rules, and tells the model how to interpret empty output. None of that information helps a human reader browsing the codebase. All of it matters for an LLM trying to route correctly at 3am in production.
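Wired into an Anthropic API call, the better description lives in the description field of the tool schema. A sketch, with the same hypothetical search_kb tool and a placeholder model ID:

```python
import anthropic

tools = [{
    "name": "search_kb",
    "description": (
        "Search the customer support knowledge base for articles relevant "
        "to a user's question. Use only for questions about product "
        "features, pricing, or troubleshooting. Do NOT use for questions "
        "about billing history, account state, or anything requiring "
        "personalized data; use get_account_status for those. Returns up "
        "to 5 ranked results; empty list means no match, not low confidence."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; any tool-capable model
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
```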

At scale, this matters more than any individual tool. A typical five-server MCP setup with 58 tools consumes roughly 55,000 tokens of context before the user's first turn. Adding a Jira server alone can push that to 70,000. Every character in every description is paying rent in the context window, and every imprecise phrase in one tool description changes the probability distribution over every other tool's invocation. Tool descriptions are not independent. They form a joint prompt where each one defines itself by what it excludes.
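You can put a rough number on that rent. A sketch using tiktoken's cl100k_base encoding as an approximation; the exact count depends on the serving model's tokenizer and on how the provider serializes tool schemas into the prompt:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def description_rent(tools: list[dict]) -> int:
    """Approximate prompt tokens consumed by tool descriptions alone."""
    return sum(len(enc.encode(t.get("description", ""))) for t in tools)

# Hypothetical registry entry; in practice, load your real tool schemas.
tools = [{
    "name": "search_kb",
    "description": "Search the customer support knowledge base for "
                   "articles relevant to a user's question.",
}]
print(f"description tokens per turn: {description_rent(tools)}")
```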

Four Archaeological Strata of a Real-World Tool Description

Dig through any production agent codebase and you'll find tool descriptions that belong to different geological eras of the project:

  1. The MVP layer: written by whoever first wired the tool up, optimized for passing the "does it call at all" test. Usually accurate about the function's mechanics, silent about when to prefer it over alternatives because no alternatives existed yet.
  2. The bug-fix layer: lines added after specific production incidents. "Do not call this with empty string as argument." "Only use when user has explicitly confirmed." These read like compiler warnings masquerading as documentation.
  3. The feature-drift layer: the implementation changed, but the description did not. Parameter customer_id became account_id in code, in schemas, in every caller — except the natural-language description that still talks about customers.
  4. The capability-expansion layer: someone added a second use case by piling new bullet points onto a description that was originally scoped to one. The model now sees a tool that claims to do two loosely related things and confidently uses it for both, badly.

The sediment accumulates. What a new contributor sees is a description that looks coherent but is actually a palimpsest written by four different people over two years, with each layer optimized against a different failure mode, and no one holding the whole thing in their head. This is why description edits regress 16.67% of the time — the "fix" collides with an invariant established by an earlier layer that nobody remembers.

The archaeological fix is not to rewrite from scratch. It is to treat descriptions as first-class code artifacts with ownership, version history, and explicit change rationale. When you edit a description, say why — in the commit message, not the description itself. The description stays a prompt; the commit history becomes the archaeological record.
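One concrete way to enforce that ownership is a CI lint aimed at the feature-drift layer: require every schema parameter to appear somewhere in the description, so a rename like customer_id to account_id fails the build instead of stranding stale prose. A heuristic sketch, assuming tools are plain dicts in the Anthropic schema shape:

```python
def find_drift(tools: list[dict]) -> list[str]:
    """Flag schema parameters that a tool's description never mentions."""
    problems = []
    for t in tools:
        desc = t.get("description", "")
        for param in t.get("input_schema", {}).get("properties", {}):
            if param not in desc:
                problems.append(f"{t['name']}: '{param}' missing from description")
    return problems

# The feature-drift layer in miniature: the schema moved to account_id,
# but the description still talks about customer_id.
stale = {
    "name": "get_account_status",
    "description": "Look up billing and account state by customer_id.",
    "input_schema": {
        "type": "object",
        "properties": {"account_id": {"type": "string"}},
    },
}
assert find_drift([stale]) == [
    "get_account_status: 'account_id' missing from description"
]
```

It is a blunt check, since a parameter can legitimately go unmentioned, but it converts silent drift into a failing build with a reviewable diff.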

The Silent-Failure Mode That Passes Every Test

The scariest tool-use bug is not the model calling the wrong tool. That one at least produces a visible error when the tool fails or returns nonsense. The scary bug is the model calling the right tool for the wrong reason, in a situation where the tool happens to work and return a plausible-looking result.

Concrete pattern: a search_users tool whose description says "look up users by name, email, or user ID." A user asks "find the support ticket from dorathy." The model, lacking a search_tickets tool nearby in the description space, binds dorathy as the name and calls search_users. The API returns a matching user. The agent then hallucinates the ticket content from the user's profile. The user gets a confident, wrong answer. The tool worked. The API succeeded. No exception fired. Your eval suite — which tested search_users with names, emails, and IDs — passed.

Research on tool-selection hallucinations categorizes this as a tool-type hallucination combined with a parameter-binding hallucination. It surfaces disproportionately when the available tool list has coverage gaps the model doesn't know about. The model can only choose from the tools it sees; if none of the descriptions explicitly rule themselves out for the user's real intent, the closest-matching tool wins by default. In practice, "when not to call me" is more important information than "when to call me," because calling the wrong tool is strictly worse than calling none at all: the agent could at least ask a clarifying question.
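The missing test is a negative routing eval: present the out-of-coverage request and assert the model does not bind it to search_users. A sketch against the Anthropic API, with a hypothetical tool, a placeholder model ID, and a when-not-to-call clause added to the description:

```python
import anthropic

client = anthropic.Anthropic()

search_users = {
    "name": "search_users",
    "description": (
        "Look up users by name, email, or user ID. Do NOT use for "
        "questions about tickets, orders, or other records that merely "
        "mention a person; if no available tool covers the request, ask "
        "a clarifying question instead."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; any tool-capable model
    max_tokens=512,
    tools=[search_users],
    messages=[{"role": "user", "content": "find the support ticket from dorathy"}],
)

# The assertion encodes the desired routing behavior: no tool_use block
# binding this request to search_users.
called = [block.name for block in response.content if block.type == "tool_use"]
assert "search_users" not in called, "ticket query routed to a user lookup"
```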
