Tool Schemas Are Prompts, Not API Contracts
The most expensive line in your agent codebase is the one that auto-generates tool schemas from your existing OpenAPI spec. It looks like a clean engineering choice — single source of truth, no duplication, auto-sync on every API change. It is also why your agent picks searchUsersV2 when it should have picked searchUsersV3, fills limit=20 because your spec's example said so, and silently drops the tenant_id because it was buried in the seventh parameter slot.
Nothing about this shows up in unit tests. The schema validates. The endpoint exists. The agent's call is well-formed JSON. And yet the model uses the tool wrong, every time, in ways your QA pipeline never sees because it tests the API, not the agent's reading of the API.
The bug is conceptual. OpenAPI was designed to describe APIs to humans who write SDK code; tool schemas are read by an LLM at every single call as a piece of the prompt. Treating them as the same artifact is the same category mistake as auto-generating user-facing copy from your database column names.
OpenAPI describes; tool schemas instruct
An OpenAPI spec is a contract. Its job is to let a code generator emit a typed client and let a developer reading the rendered Swagger UI understand what to send. The description prose is documentation — humans skim it, then write code that obeys the types. Parameter ordering is cosmetic; the SDK exposes named arguments and the keys go into a JSON body where order is irrelevant.
A tool schema is none of these things. It is a chunk of text that gets concatenated into the model's context every time the agent considers calling the tool. The model has no SDK to fall back on, no autocomplete to remind it which fields exist, no compile error to catch a missing required field before runtime. Everything it knows about your tool comes from the description, the parameter names, the parameter descriptions, the type annotations, and the defaults. That artifact is a prompt.
Once you internalize that framing, several "bad" model behaviors stop looking like model failures and start looking like prompt failures. The model didn't fill the wrong field; the field name was ambiguous and the description didn't disambiguate. The model didn't pick the wrong tool; the two tools had near-identical descriptions auto-translated from API summary fields written for human skim-readers. The model didn't drop the optional security parameter; it appeared near the end of a fifteen-parameter list with a description that read, in full, "Optional. See docs."
Your OpenAPI spec is fine. It just isn't a prompt.
The description is the contract
In an OpenAPI spec, the description field is documentation. In a tool schema, it is the first thing the model reads — and often the only thing it reads carefully — when deciding whether to call the tool, which tool to call among similar ones, and what to put in each argument.
A good tool description tells the model four things: what the tool does, when to use it (and when not to), what it returns, and what the caller must know that isn't already implied by parameter types. None of these are reliably present in an OpenAPI summary auto-translated into a description field. OpenAPI summaries are written for engineers who already know what the endpoint is for and just want a one-line reminder; LLM tool descriptions are written for an agent that may have eight similar tools available and needs to disambiguate.
Compare two descriptions for the same endpoint. The auto-generated one says "Search users." That is a fine OpenAPI summary. The hand-tuned tool description says "Find users by name, email, or employee ID. Returns up to 50 matches sorted by relevance. Use this when you need to look up a person from partial information; do not use this to enumerate all users in a tenant — use listTenantUsers for that. Returns an empty array if no match; never errors on miss."
The second description does the work that the OpenAPI spec assumes a human will do by reading the surrounding documentation, the related endpoints, and the page header. The model gets none of that surrounding context. If the surrounding context isn't in the description, it doesn't exist.
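To make the gap concrete, here is a minimal sketch of the two versions as actual tool definitions, written in the JSON-Schema-based format most tool-calling APIs accept. The tool name, fields, and limits are illustrative, not taken from any real spec.

```python
# Illustrative sketch: tool name, fields, and limits are hypothetical.

# Auto-generated from the OpenAPI summary: validates fine, says almost nothing.
auto_generated = {
    "name": "searchUsers",
    "description": "Search users.",
    "input_schema": {
        "type": "object",
        "properties": {
            "q": {"type": "string"},
            "limit": {"type": "integer", "default": 100},
        },
        "required": ["q"],
    },
}

# Hand-tuned for the model: what it does, when to use it, when not to,
# and what it returns on a miss.
hand_tuned = {
    "name": "searchUsers",
    "description": (
        "Find users by name, email, or employee ID. Returns up to 50 matches "
        "sorted by relevance. Use this when you need to look up a person from "
        "partial information; do not use it to enumerate all users in a tenant "
        "(use listTenantUsers for that). Returns an empty array if no match; "
        "never errors on a miss."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Name fragment, email, or employee ID to match.",
            },
            "limit": {
                "type": "integer",
                "description": "Max results; keep small unless the user asks for more.",
                "default": 10,
            },
        },
        "required": ["query"],
    },
}
```

Both definitions validate identically. Only one of them tells the model when not to call the tool.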
Parameter ordering is a priority signal
In a typed SDK, the order of parameters in a function signature is a usability concern: required first, optional last, related parameters grouped. Once compiled, the call site uses named arguments and the order is gone. In a JSON request body, order is gone before the body even leaves the client.
LLMs read tool schemas top to bottom. The parameters appearing earlier get more attention than the ones appearing later — a positional bias that has been documented in agentic failure analyses and that practitioners notice the moment they reorder a schema and watch the agent's behavior change. If your most important disambiguating parameter is the seventh one in the list, the model will fill the first six on autopilot and treat the seventh as something it can probably skip.
This is not a bug to be patched with better prompting. It is the predictable behavior of a system that consumes its instructions as a sequence of tokens. The fix is to design the parameter list as a priority-ordered prompt: the parameters the model must reason about go first, the parameters that have safe defaults go last, and the parameters that exist solely for backward compatibility don't go in the schema at all.
This conflicts directly with OpenAPI conventions. An OpenAPI spec lists parameters in a sensible documentation order — path params, then query params, then body fields, often grouped by resource subobject. Auto-generation preserves that order. The result is a schema where the parameters the model needs to think about are scattered across whatever taxonomy the API designer used in 2021, with no relation to which ones the agent should reason about first.
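Here is a sketch of what priority ordering looks like in practice, using a hypothetical createTicket tool; every field name is illustrative. Python dicts preserve insertion order through json.dumps, so the order you write the properties in is the order the model reads them.

```python
import json

# Illustrative sketch: the createTicket tool and its fields are hypothetical.
# Properties are written in the order the model should reason about them,
# not the order the OpenAPI spec documents them.
create_ticket = {
    "name": "createTicket",
    "description": "Open a support ticket on behalf of the current customer.",
    "input_schema": {
        "type": "object",
        "properties": {
            # Needs reasoning on every call: goes first.
            "tenant_id": {
                "type": "string",
                "description": "Tenant the ticket belongs to; always take it from the current conversation.",
            },
            "title": {
                "type": "string",
                "description": "One-line summary of the issue in the user's words.",
            },
            "severity": {
                "type": "string",
                "enum": ["normal", "high", "low"],
                "description": "Judge from the customer impact the user describes.",
            },
            # Safe default: goes last.
            "assign_to_oncall": {"type": "boolean", "default": True},
            # Compatibility-only fields from the API (e.g. 'format', 'source_client')
            # are deliberately not in the schema at all.
        },
        "required": ["tenant_id", "title", "severity"],
    },
}

print(json.dumps(create_ticket, indent=2))
```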
Defaults are few-shot examples
In an OpenAPI spec, a default is a fallback. If the client doesn't send the field, the server uses the default; the default exists so the API can evolve without breaking old clients. The default value rarely communicates anything about typical usage — it's often a permissive value (limit=100) or a backward-compatible one (format="legacy").
In a tool schema, every default is a one-shot example. The model reads "limit (number, default: 100)" and treats 100 as the canonical answer to "what number goes here when nothing else suggests otherwise." This is true even when the default exists for a reason that has nothing to do with the agent's use case — for instance, when the default was chosen years ago to match the pagination behavior of an older client that no longer exists.
The same thing happens with example values, enum orderings, and field-level defaults inside nested objects. An enum listed as ["pending", "active", "archived"] will get pending selected disproportionately when the model has no other signal, because the first listed enum value is the natural fallback. A nested metadata.source with a default of "web" will see the agent fill in "web" even when the conversation context strongly suggests the source should be the integration the agent is running inside.
If you are auto-generating these defaults from an OpenAPI spec, you are inheriting a decade of accidental defaults as in-context examples without ever deciding what the model should believe is normal. The discipline is to set defaults that reflect the desired behavior of the agent, not the historical permissiveness of the API.
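A before-and-after sketch of the same fields, with hypothetical names and values: the inherited defaults come from the spec's history, the curated ones encode what the agent should treat as normal.

```python
# Illustrative sketch: field names and values are hypothetical.

# Inherited from the OpenAPI spec: a decade-old pagination limit, a "web"
# source that predates the agent, and an enum whose first value is an accident.
inherited = {
    "limit":  {"type": "integer", "default": 100},
    "status": {"type": "string", "enum": ["pending", "active", "archived"]},
    "metadata": {
        "type": "object",
        "properties": {"source": {"type": "string", "default": "web"}},
    },
}

# Curated for the agent: defaults are what the agent should usually send,
# and the enum leads with the value the agent should usually pick.
curated = {
    "limit": {
        "type": "integer",
        "default": 10,
        "description": "Keep small; raise only if the user asks for more results.",
    },
    "status": {
        "type": "string",
        "enum": ["active", "pending", "archived"],
        "description": "Use 'active' unless the user explicitly asks for pending or archived records.",
    },
    "metadata": {
        "type": "object",
        "properties": {"source": {"type": "string", "default": "support_agent"}},
    },
}
```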
What the discipline looks like in practice
The team that takes this seriously stops treating tool schemas as a generated artifact and starts treating them as a versioned prompt asset. That has a few practical consequences.
The first is that tool schemas live in a different file than the API spec, are reviewed by people who design prompts (not just people who design APIs), and ship through the same evaluation pipeline as system prompts. A change to a tool description gets an eval run, just like a change to the agent's instructions, because it is a change to the agent's instructions.
The second is that the surface area of tools the agent sees is smaller than the API surface. A real API might expose forty endpoints across its resources; the agent gets seven tools, each one a curated combination of operations under a single description, with parameters reduced to the fields the agent will plausibly fill. The remaining thirty-three endpoints exist; they just aren't exposed to the model, because exposing them inflates context, dilutes selection accuracy, and forces the agent to reason about distinctions a human SDK consumer would resolve through documentation.
The third is that tool design becomes an iterative discipline measured by agent behavior, not by API completeness. You watch which tool the agent picks for a given user intent, you watch which fields it fills correctly and which it skips, and you tune the description and parameter ordering until the behavior matches what you want. The OpenAPI spec doesn't change. The tool schema is a separate layer, edited based on what the model actually does.
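What "measured by agent behavior" can look like as code, sketched here against a hypothetical agent client; the case format and the first_tool_call helper are assumptions, not any particular SDK's API.

```python
# Minimal sketch of a tool-selection eval; adapt to whatever agent SDK you use.
cases = [
    {
        "intent": "what's jane.doe@acme.com's employee ID?",
        "expect_tool": "searchUsers",
        "expect_args": {"query": "jane.doe@acme.com"},
    },
    {
        "intent": "list everyone in the Acme tenant",
        "expect_tool": "listTenantUsers",
    },
]

def run_tool_selection_eval(agent, cases):
    failures = []
    for case in cases:
        call = agent.first_tool_call(case["intent"])  # hypothetical helper
        if call.name != case["expect_tool"]:
            failures.append((case["intent"], "picked " + call.name))
            continue
        for key, expected in case.get("expect_args", {}).items():
            if call.arguments.get(key) != expected:
                failures.append((case["intent"], f"{key}={call.arguments.get(key)!r}"))
    return failures
```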
This is the same discipline that turns a system prompt from "a paragraph someone wrote on day one" into "a versioned, evaluated, deliberately edited artifact." Tool schemas are part of the prompt. They deserve the same treatment.
The auto-generation antipattern
The most common path into this problem is well-meaning. A team has an OpenAPI spec, they have a new agent feature, and they want to avoid maintaining two copies of the schema. So they wire up a generator that emits tool definitions from the spec at build time. It feels disciplined — single source of truth, automatic updates, no manual sync.
What they have actually built is a pipeline that takes prose written for one audience and ships it as a prompt to a different audience, with no human reviewing the translation. Every time the API team adds an endpoint, the agent gets a new tool whose description was written assuming the reader is a developer with the rest of the docs open in another tab. Every time the API team renames a field for human ergonomics, the agent's mental model of the tool shifts in ways that don't surface in any test. Every time the API team adds a parameter for a use case the agent will never trigger, the agent's context inflates and its tool-selection accuracy degrades a little.
The fix is not to abandon the OpenAPI spec. The fix is to insert a curated layer between the spec and the agent — a layer that selects which endpoints become tools, rewrites descriptions in prompt-style prose, reorders parameters by reasoning priority, and sets defaults to match the agent's typical use case rather than the API's historical default. This layer is owned by whoever owns the agent's behavior, reviewed alongside the system prompt, and evaluated against an agent eval set rather than an API contract test.
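One possible shape for that layer, sketched under the assumption that you already turn each OpenAPI operation into a generated tool definition (a dict in the same JSON-Schema style as above); the tool names, override fields, and curate function are all illustrative.

```python
# Illustrative sketch of a curation layer between the OpenAPI spec and the agent.

EXPOSED = {"searchUsers", "listTenantUsers", "createTicket"}  # 7 tools, not 40 endpoints

OVERRIDES = {
    "searchUsers": {
        "description": "Find users by name, email, or employee ID. ...",
        "param_order": ["query", "limit"],             # reasoning priority, not spec order
        "drop_params": {"format", "include_deleted"},  # compat-only, never shown to the model
        "defaults": {"limit": 10},
    },
}

def curate(op_id, generated):
    """Turn a generated tool definition into a prompt-grade one, or hide it."""
    if op_id not in EXPOSED:
        return None  # the endpoint still exists; the agent just never sees it
    o = OVERRIDES.get(op_id, {})
    props = {
        name: spec
        for name, spec in generated["input_schema"]["properties"].items()
        if name not in o.get("drop_params", set())
    }
    for name, value in o.get("defaults", {}).items():
        if name in props:
            props[name] = {**props[name], "default": value}
    order = [n for n in o.get("param_order", []) if n in props]
    order += [n for n in props if n not in order]
    return {
        **generated,
        "description": o.get("description", generated["description"]),
        "input_schema": {
            **generated["input_schema"],
            "properties": {name: props[name] for name in order},
        },
    }
```

The point is not this particular data structure. It is that the overrides live in a file owned and reviewed by whoever owns the agent's behavior, while the OpenAPI spec stays untouched.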
The work is real. The failure mode of skipping it is a fleet of agents that look like they're calling your APIs correctly and aren't.
What to do tomorrow
If you have an agent in production that uses auto-generated tool schemas, the cheap diagnostic is to read your tool descriptions out loud as if they were prompt fragments. The ones that read like API documentation — GET /users/{id}: Retrieves a user resource — are the ones the model is most likely misusing. The ones that read like instructions to a colleague who has never seen this system before are the ones working as intended.
Pick the three highest-traffic tools, rewrite their descriptions in prompt-style prose, reorder their parameters so the ones requiring reasoning come first, audit their defaults against what the agent should actually fill in, and run your agent eval set against the change. If the eval moves, you have your evidence that the schema layer is load-bearing prompt — and your roadmap for the rest of the surface.
The failure mode this prevents is the quiet one: agents that look like they're working, ship to production, and slowly lose accuracy on the long tail of inputs while every dashboard says "tool calls succeeded." The schema never flagged anything as invalid; it was validating the wrong thing.
