
Tool Docstring Archaeology: The Description Field Is Your Highest-Leverage Prompt

Tian Pan · Software Engineer · 11 min read

The highest-leverage prompt in your agent is not in your system prompt. It is the one-sentence description you wrote under a tool definition six months ago, committed alongside the implementation, and never touched again. The model reads it on every turn to decide whether to invoke the tool, which arguments to bind, and how to recover when the response doesn't match expectations. Engineers treat it as API documentation for humans. The model treats it as a prompt.

The gap between those two framings is where the worst kind of tool-use bugs live: the model invokes the right function name with the right arguments, and the right API call goes out — but for the wrong reasons, in the wrong situation, or in preference over a better tool sitting next to it. No exception fires. Your eval suite still passes. The regression only shows up as a slow degradation in whatever metric you use to measure whether the agent is actually helping.

A recent empirical study of 103 MCP servers spanning 856 tools found that augmenting tool descriptions alone produced a statistically significant 5.85-percentage-point lift in task success. That is a remarkable effect size for a change that costs nothing at inference time and doesn't require a new model. But the same study saw regressions in 16.67% of cases — meaning naive description edits hurt more than they help roughly one in six times. The description field is powerful, and it is sharp. Most teams are holding it by the wrong end.

The Docstring Is Compiled to Prompt Tokens, Not Rendered to Humans

When you register a tool with a framework like LangChain or the Anthropic SDK, the function's docstring is pulled directly into the system context on every model call. In LangChain's @tool decorator, the docstring becomes the tool description verbatim. With the Anthropic API, the description field of each tool schema is inlined into the prompt the model sees. It is not documentation. It is prompt text with extra steps.
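To make the compilation step concrete, here is a minimal stdlib-only sketch of the docstring-to-description path. It is not LangChain's or the SDK's actual internals (real frameworks also derive the input schema from type hints); `make_tool_schema` is a hypothetical helper:

```python
import inspect
import json

def make_tool_schema(fn):
    """Build an Anthropic-style tool schema from a Python function.

    Hypothetical helper: only the docstring-to-description path is
    shown; the input schema is left empty for brevity.
    """
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "input_schema": {"type": "object", "properties": {}},
    }

def search_kb(query: str):
    """Search the customer support knowledge base.

    Use only for questions about product features, pricing, or
    troubleshooting. Do NOT use for billing history or account state.
    """
    ...

schema = make_tool_schema(search_kb)
# The docstring, whitespace-normalized, is now prompt text the model
# reads on every turn:
print(json.dumps(schema, indent=2))
```

The point of the sketch is that nothing you write in that docstring is inert: every sentence ships to the model verbatim.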

This reframing changes what "good" looks like. Traditional docstring virtues — brevity, neutral voice, "what it does" over "when to use it" — are actively harmful when the reader is a model choosing between twelve similar tools. The model needs boundary conditions, not slogans. Consider the difference:

  • Bad: "Search across internal knowledge and surface the most relevant results."
  • Better: "Search the customer support knowledge base for articles relevant to a user's question. Use only for questions about product features, pricing, or troubleshooting. Do NOT use for questions about billing history, account state, or anything requiring personalized data — use get_account_status for those. Returns up to 5 ranked results; empty list means no match, not low confidence."

The second reads like terrible marketing copy and excellent prompt engineering. It names the tool it competes with, establishes when-not-to-call rules, and tells the model how to interpret empty output. None of that information helps a human reader browsing the codebase. All of it matters for an LLM trying to route correctly at 3am in production.

At scale, this matters more than any individual tool. A typical five-server MCP setup with 58 tools consumes roughly 55,000 tokens of context before the user's first turn. Adding a Jira server alone can push that to 70,000. Every character in every description is paying rent in the context window, and every imprecise phrase in one tool description changes the probability distribution over every other tool's invocation. Tool descriptions are not independent. They form a joint prompt where each one defines itself by what it excludes.
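A quick way to see your own rent bill is to sum the description lengths of every registered tool. The sketch below uses a crude 4-characters-per-token heuristic; swap in your model's real tokenizer for accurate numbers, and note the tool names and descriptions here are illustrative:

```python
# Rough audit of the context a tool list consumes before the first
# user turn. The 4-chars-per-token ratio is a heuristic, not a real
# tokenizer.
descriptions = {
    "search_kb": "Search the customer support knowledge base for articles "
                 "relevant to a user's question.",
    "get_account_status": "Look up billing history and account state for "
                          "a single authenticated user.",
    # ... one entry per registered tool
}

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

total = sum(approx_tokens(d) for d in descriptions.values())
print(f"{len(descriptions)} tools, ~{total} description tokens per turn")
```

Run it against your full tool list before and after adding a new MCP server; the delta is the per-turn price of that server's descriptions.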

Four Archaeological Strata of a Real-World Tool Description

Dig through any production agent codebase and you'll find tool descriptions that belong to different geological eras of the project:

  1. The MVP layer: written by whoever first wired the tool up, optimized for passing the "does it call at all" test. Usually accurate about the function's mechanics, silent about when to prefer it over alternatives because no alternatives existed yet.
  2. The bug-fix layer: lines added after specific production incidents. "Do not call this with empty string as argument." "Only use when user has explicitly confirmed." These read like compiler warnings masquerading as documentation.
  3. The feature-drift layer: the implementation changed, but the description did not. Parameter customer_id became account_id in code, in schemas, in every caller — except the natural-language description that still talks about customers.
  4. The capability-expansion layer: someone added a second use case by piling new bullet points onto a description that was originally scoped to one. The model now sees a tool that claims to do two loosely related things and confidently uses it for both, badly.

The sediment accumulates. What a new contributor sees is a description that looks coherent but is actually a palimpsest written by four different people over two years, with each layer optimized against a different failure mode, and no one holding the whole thing in their head. This is why description edits regress 16.67% of the time — the "fix" collides with an invariant established by an earlier layer that nobody remembers.

The archaeological fix is not to rewrite from scratch. It is to treat descriptions as first-class code artifacts with ownership, version history, and explicit change rationale. When you edit a description, say why — in the commit message, not the description itself. The description stays a prompt; the commit history becomes the archaeological record.

The Silent-Failure Mode That Passes Every Test

The scariest tool-use bug is not the model calling the wrong tool. That one at least produces a visible error when the tool fails or returns nonsense. The scary bug is the model calling the right tool for the wrong reason, in a situation where the tool happens to work and return a plausible-looking result.

Concrete pattern: a search_users tool whose description says "look up users by name, email, or user ID." A user asks "find the support ticket from dorathy." The model, lacking a search_tickets tool nearby in the description space, binds dorathy as the name and calls search_users. The API returns a matching user. The agent then hallucinates the ticket content from the user's profile. The user gets a confident, wrong answer. The tool worked. The API succeeded. No exception fired. Your eval suite — which tested search_users with names, emails, and IDs — passed.

Research on tool-selection hallucinations categorizes this as a tool-type hallucination combined with parameter binding hallucination. It surfaces disproportionately when the available tool list has coverage gaps the model doesn't know about. The model can only choose from the tools it sees; if none of the descriptions explicitly rule themselves out for the user's real intent, the closest-matching tool wins by default. In practice, "when not to call me" is more important information than "when to call me," because calling the wrong tool is strictly worse than calling none at all — the agent could at least ask a clarifying question.

Two patterns that help:

  • Negative examples in the description: "Do NOT use this tool to search for tickets, orders, or conversations; those have dedicated tools." Give the model the shape of the negative space.
  • Disambiguation cues for confusable tools: if you have get_user and search_users, the description for each should reference the other by name and explain the distinction (exact lookup vs. fuzzy match). The model cannot disambiguate tools it does not know are related.
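Cross-references are cheap to enforce in CI. A minimal sketch, with illustrative tool names and wording (not from any real codebase): declare which pairs are confusable, then assert each description names its sibling.

```python
# Two confusable tools whose descriptions explicitly reference each
# other, so the model can draw the exact-lookup vs. fuzzy-match line.
tools = {
    "get_user": (
        "Fetch a single user by exact user ID. For fuzzy lookup by "
        "name or email, use search_users instead."
    ),
    "search_users": (
        "Fuzzy-search users by name or email; returns candidates, not "
        "an exact match. If you already have a user ID, use get_user."
    ),
}

# Cheap CI check: every confusable pair must mention its sibling.
confusable_pairs = [("get_user", "search_users")]
for a, b in confusable_pairs:
    assert b in tools[a], f"{a} description never mentions {b}"
    assert a in tools[b], f"{b} description never mentions {a}"
```

The assertion fails the build the moment someone rewrites one description and silently drops the cross-reference.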

Description-Implementation Drift and the Lint Rules That Catch It

A tool description is a contract with the model. The implementation is a contract with the runtime. When those drift, the model correctly invokes a tool whose behavior no longer matches what the description promised. Every downstream reasoning step is now poisoned by a true-sounding premise that is no longer true.

Schema drift is the easy case to detect: if customer_id is renamed to account_id in the JSON schema but the description still talks about looking up customers, a basic linter comparing the description's named parameters to the schema's actual parameter list will flag it. Several open-source tools now do this as part of agent CI.
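A linter of that shape fits in a dozen lines. This sketch is deliberately simple (real linters also check types and required/optional status) and `lint_description_params` is a hypothetical name:

```python
import re

def lint_description_params(description: str, schema: dict) -> list[str]:
    """Flag snake_case identifiers the description mentions but the
    JSON schema no longer declares."""
    declared = set(schema.get("properties", {}))
    mentioned = set(re.findall(r"\b[a-z]+(?:_[a-z]+)+\b", description))
    return sorted(mentioned - declared)

# customer_id was renamed to account_id in the schema, but the
# description still talks about the old parameter:
schema = {"properties": {"account_id": {"type": "string"}}}
desc = "Look up a customer record by customer_id."
print(lint_description_params(desc, schema))  # ['customer_id']
```

Running this over every tool at CI time turns the feature-drift layer from an archaeological find into a failing check.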

The harder case is semantic drift. The description says "returns up to 5 results." Someone changes the implementation to return up to 10. The schema is unchanged — it was always results: array of objects. Nothing in the contract, narrowly construed, has broken. But the model has been making planning decisions based on the "up to 5" constraint for months, and now those plans are subtly miscalibrated (the agent might not paginate when it should, or might over-filter expecting fewer results than it gets).

Catching semantic drift requires something beyond structural linting:

  • Description-as-assertion tests: treat each factual claim in the description as a property to test. "Returns up to 5 results" becomes a test that calls the tool with a broad query and asserts len(results) <= 5. If you can't test it, don't claim it.
  • Golden invocation replay: keep a pinned set of canonical tool-call traces. When the description changes, replay the traces against a held-out eval set and diff the model's invocation decisions. A description rewrite that changes routing behavior should not ship silently.
  • Description changelogs: require a short rationale in the commit message for any description edit. This sounds bureaucratic; in practice it catches the "I tightened the wording" edits that accidentally tighten the semantics.
  • Integration tests that treat the description as input: the most under-used pattern. Your test calls the model with the tool schema, a synthetic user query, and asserts the model invokes the tool (or does not) in the expected way. This tests the description directly, not the implementation.
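The first pattern is the easiest to start with. A sketch of a description-as-assertion test, using a stub implementation (the real tool would hit your search backend; the point is the test shape, and the names are illustrative):

```python
import re

# Stub standing in for the real tool; the retrieval logic is not
# the point here.
def search_kb(query: str) -> list[dict]:
    return [{"title": f"article {i}"} for i in range(5)]

SEARCH_KB_DESCRIPTION = (
    "Search the support knowledge base. Returns up to 5 ranked results; "
    "an empty list means no match, not low confidence."
)

def test_result_cap_claim():
    # Extract the numeric claim from the description itself, so the
    # test breaks if either the wording or the behavior drifts.
    m = re.search(r"up to (\d+)", SEARCH_KB_DESCRIPTION)
    assert m, "description no longer states a result cap"
    cap = int(m.group(1))
    results = search_kb("a very broad query")
    assert len(results) <= cap

test_result_cap_claim()
```

Because the test parses the cap out of the description rather than hard-coding it, a semantic edit to either side of the contract shows up as a red build instead of a months-later miscalibration.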

Treating Descriptions as Production Prompt Assets

The practical shift is reframing: tool descriptions are not documentation, and they are not code comments. They are prompts, deployed continuously into production, that govern the behavior of every agent turn. The disciplines that apply to system prompts — version control with rationale, regression testing, A/B experiments, rollout gates — should apply to descriptions with equal rigor.

Three patterns worth stealing:

  • Description pull requests should get the same review as prompt changes, not the same as comment cleanups. If your team gates system-prompt edits behind eval runs, the same gates apply to description edits. If you don't gate system-prompt edits, you have a bigger problem than this post can fix.
  • Descriptions should have owners, not just authors. Assign a team member responsible for the quality of each tool's description the way you'd assign ownership of a critical API endpoint. Rotation is fine; absence of ownership is how you get the archaeological layers.
  • Write descriptions with the full tool list in mind, not the single tool. Run the existing tool list through a "confusability" check before adding a new tool: does any current description's scope overlap with the new one? If so, both descriptions need to explicitly name each other and draw the line.
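A confusability check doesn't need embeddings to be useful. Here is a sketch using plain word-overlap (Jaccard) similarity; the descriptions and the threshold are illustrative, and embedding similarity would additionally catch paraphrases:

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two descriptions. Crude but
    cheap; good enough to surface obvious scope collisions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

descriptions = {
    "search_kb": "Search the support knowledge base for articles.",
    "search_docs": "Search internal documentation for articles.",
    "get_account_status": "Fetch billing and account state for a user.",
}

THRESHOLD = 0.3  # tune against your own tool list
names = list(descriptions)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        score = jaccard(descriptions[a], descriptions[b])
        if score >= THRESHOLD:
            print(f"confusable: {a} vs {b} (similarity {score:.2f})")
```

Any pair the check flags is a pair whose descriptions should name each other and draw the line explicitly before the new tool ships.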

The upside of taking descriptions seriously is asymmetric. A 5.85-percentage-point lift from description work alone is more than most prompt-tuning rounds produce, and it compounds across every user interaction for the lifetime of the tool. The downside of ignoring it is a class of silent, plausible-looking failures that erode user trust in ways your dashboards can't easily measure.

Start With the Two-Hour Audit

If you ship an agent today, there is a concrete exercise worth running before the next sprint: read every tool description in your codebase, in order, out loud, as if you were the model seeing them for the first time in a context window. For each one, ask:

  • If I only had this description, would I know when not to call this tool?
  • Are there other tools in this list whose descriptions overlap with mine, and if so, do we reference each other?
  • Does every factual claim here match the current implementation?
  • What happens if the tool returns an empty result, an error, or a partial result — and does the description tell the model how to interpret each?

Most teams find that half their descriptions fail at least two of these questions. The first pass at fixing them usually takes an afternoon and produces a larger quality improvement than whatever prompt-engineering task was originally on the sprint. The second pass, done three months later after you've learned what the production failure modes actually are, is where the real gains live. The description field is the highest-leverage prompt in your agent because you are already writing it; the only question is whether you treat it like the production artifact it has quietly become.
