
API Documentation Is Reliability Infrastructure: How Your Docs Determine Agent Success Rates

10 min read
Tian Pan
Software Engineer

Most engineering teams think of API documentation as a developer experience concern — something you improve to reduce support tickets and onboarding time. That framing made sense when your primary consumer was a human reading docs in a browser. It is no longer adequate.

When an AI agent calls your API via tool use, your documentation stops being a guide and becomes runtime behavior. A vague parameter description isn't a UX inconvenience — it is a direct instruction to the model, and a vague instruction produces hallucinated values. A missing error code isn't a gap in your reference docs — it is an ambiguous signal that can send an agent into a retry loop with no exit condition. The documentation you wrote three years ago for a human audience is now being parsed by a stateless language model that will execute confidently regardless of whether it understood correctly.

This is a reliability problem, not a documentation problem. And unlike most reliability concerns, it sits entirely in your control.

Why Tool Calling Makes Documentation Load-Bearing

When an LLM agent calls a tool, the model doesn't make an API request the way a human engineer would — by reading the docs, forming a mental model, then writing code. It receives a structured schema (typically an OpenAPI or JSON function definition) injected directly into its context window, interprets the description fields as instructions, and generates parameter values in a single pass. There is no "I'll try this and see what happens" — the first attempt is live.

This changes what documentation quality means at a structural level. Every description field in your function schema is, functionally, a prompt instruction. If you write "id: string — the user ID", the model has almost no information about which user, in which format, from where. If instead you write "id: string — the UUID of the authenticated user making the request, found in the session token returned by /auth/login", the model has a navigable instruction it can execute correctly on the first attempt.
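To make the contrast concrete, here is a minimal sketch of those two schemas in the common OpenAI-style function-calling format. The endpoint and field names are illustrative, not taken from any particular API; the only thing that differs between the two is the description text, which is also the only documentation the model ever sees.

```python
# Two hypothetical tool schemas for the same endpoint. Only the description
# strings differ, but they are the model's entire view of the API.

vague_tool = {
    "name": "get_user",
    "description": "Get a user.",
    "parameters": {
        "type": "object",
        "properties": {
            "id": {"type": "string", "description": "the user ID"},
        },
        "required": ["id"],
    },
}

precise_tool = {
    "name": "get_user",
    "description": "Fetch the profile of the authenticated user.",
    "parameters": {
        "type": "object",
        "properties": {
            "id": {
                "type": "string",
                "format": "uuid",
                "description": (
                    "UUID of the authenticated user making the request, "
                    "taken from the session token returned by /auth/login. "
                    "Do not pass an email address or username."
                ),
            },
        },
        "required": ["id"],
    },
}
```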

The performance gap between these two descriptions is not a matter of the model being smart or dumb. It is a documentation quality gap measured in success rate.

Research on tool-augmented LLMs has found that "tool misuse" — calling the right tool with the wrong parameters, sequence, or timing — is the most common failure mode in agentic systems. In production deployments, tool call failures occur in 3–15% of invocations, often silently: the agent receives a malformed response, misinterprets it, and continues rather than surfacing the error. Unlike a human developer who would notice the API response was wrong, the model treats its interpretation of that response as ground truth.

The Failure Modes That Bad Docs Produce

Documentation gaps manifest as specific, observable failure patterns in agentic systems. Understanding them helps you audit your own APIs.

Parameter hallucination is the most common. When a parameter is described too vaguely — or not described at all — the model infers its value from context. Sometimes that inference is correct. Often it isn't, and the result is an API call with a plausible-looking but semantically wrong value. Because the call doesn't error out (the value is the right type), the failure is silent: the wrong resource gets updated, the wrong filter gets applied, the wrong user gets charged.
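One mitigation is to encode the legal value space in the schema itself, so a hallucinated value fails validation instead of silently reaching your backend. The sketch below uses standard JSON Schema constraints (pattern, enum, minimum); the tool name and fields are hypothetical.

```python
# Making hallucinated parameters fail loudly instead of silently:
# constrain the schema so a plausible-but-wrong value is rejected
# at validation time rather than executed.

charge_tool = {
    "name": "create_charge",
    "description": "Charge a customer. Fails if the customer has no default payment method.",
    "parameters": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "pattern": "^cus_[A-Za-z0-9]{14}$",
                "description": "Customer ID with the 'cus_' prefix, as returned by list_customers.",
            },
            "currency": {
                "type": "string",
                "enum": ["usd", "eur", "gbp"],  # a guessed value outside this list never reaches the API
                "description": "ISO 4217 currency code, lowercase.",
            },
            "amount": {
                "type": "integer",
                "minimum": 1,
                "description": "Amount in the smallest currency unit (cents), not a decimal dollar amount.",
            },
        },
        "required": ["customer_id", "currency", "amount"],
        "additionalProperties": False,
    },
}
```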

Ambiguous error semantics produce agent loops. When an API returns an error that doesn't clearly indicate whether the condition is transient, permanent, or rate-limited, the agent must guess whether to retry. A 429 response without a Retry-After header, a 500 that's actually a validation error, or a 200 with an "error": true body are all documentation failures that translate into agent behavior failures. Ambiguous feedback is a loop trigger: agents have been observed retrying the same failing call hundreds of times because the error response gave them no reason to stop.
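For illustration, here is a sketch of the retry logic an agent harness can only implement when the API documents its error semantics explicitly. The "retryable" and "retry_after_seconds" fields are assumptions about what a well-specified error body might contain, not a standard.

```python
import time

MAX_RETRIES = 3

def call_with_retries(call, *args, **kwargs):
    """call is assumed to return (status_code, parsed_json_body)."""
    for attempt in range(MAX_RETRIES):
        status, body = call(*args, **kwargs)
        if status < 400:
            return body
        # The documented contract, not a guess, decides whether retrying makes sense.
        if status == 429 and "retry_after_seconds" in body:
            time.sleep(body["retry_after_seconds"])
            continue
        if body.get("retryable") is True:
            time.sleep(2 ** attempt)  # documented as transient: back off and retry
            continue
        # Documented as permanent: surface the failure instead of looping.
        raise RuntimeError(f"{status}: {body.get('message', 'unrecoverable error')}")
    raise RuntimeError("gave up after repeated transient failures")
```

Without the documented fields, every branch above collapses into a guess, and the safe-looking guess (retry) is exactly what produces the loops described here.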

Implicit state requirements cause sequencing failures. Many APIs have operations that are only valid when some prior state exists — you can only call /order/ship after /order/confirm, you can only upload a file after creating the upload session. For a human engineer, this is typically covered in a "Getting Started" guide or a conceptual overview they've already read. An agent operating from a tool schema has no access to that prior context. If the schema for /order/ship doesn't document the state precondition, the agent will call it out of sequence and receive an error it can't interpret.
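A rough sketch of what documenting that precondition could look like, both in the tool description the agent sees and in the error it gets back when the precondition is violated. The endpoint names and error code are hypothetical.

```python
# The state precondition lives in the schema the agent actually reads,
# and in the error it receives when the precondition is violated.

ship_order_tool = {
    "name": "ship_order",
    "description": (
        "Ship a confirmed order. Precondition: the order must be in the "
        "'confirmed' state; call confirm_order first. Calling this on an "
        "unconfirmed order returns error code ORDER_NOT_CONFIRMED."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "ID returned by create_order."},
        },
        "required": ["order_id"],
    },
}

# A precondition failure the agent can act on, because it names the missing step:
precondition_error = {
    "error": "ORDER_NOT_CONFIRMED",
    "message": "Order ord_123 is in state 'draft'. Call confirm_order before ship_order.",
    "retryable": False,
}
```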

Token cost compounding is a less obvious but financially significant failure mode. A 93-tool MCP server with verbose schema definitions costs approximately 55,000 tokens per request just to inject the schema. At typical API pricing, that translates to hundreds of dollars per day at scale — not because the agent is doing more work, but because the documentation is structured inefficiently. Redundant prose, example-heavy descriptions, and missing structured type definitions are documentation choices with direct cost implications in agentic contexts.
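As a rough way to see this cost on your own API, you can tokenize the serialized tool schemas directly. The sketch below uses OpenAI's tiktoken library; counts are approximate for other providers and tokenizers, and the pricing figure in the comment is an assumed example rate.

```python
import json
import tiktoken  # OpenAI's tokenizer library; other providers will count slightly differently

def schema_token_cost(tools, encoding_name="cl100k_base"):
    """Estimate the per-request token footprint of a list of tool schema dicts."""
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(json.dumps(tools)))

# Example back-of-envelope at an assumed $3 per million input tokens:
# tokens = schema_token_cost(my_tools)
# dollars_per_1k_requests = tokens * 1_000 * 3 / 1_000_000
```

Trimming redundant prose from descriptions and moving examples out of the schema shows up immediately in this number, which is paid on every single request.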

Why You Cannot Prompt Your Way Out of This

The intuitive response to documentation gaps is to add compensating instructions to the system prompt: "When calling the payment API, make sure to include the customer ID in UUID format." This works — once, for a known gap you've already diagnosed.

It fails as a general strategy for three reasons.

First, you can't know in advance which gaps will cause problems. The failure space is the product of your documentation quality, the range of tasks agents will attempt, and the model's prior knowledge of your domain. Most documentation gaps don't surface until a production agent hits them in an unexpected task context.

Second, system prompt instructions compete with everything else in the context window. As agent sessions grow in length, as more tool results accumulate, the carefully placed instruction from line 47 of your system prompt gets pushed further from the model's attention. The fix is fragile; the documentation gap is persistent.

Third, this is a scaling problem. A team maintaining one agent for one internal API can afford to manually audit gaps and patch them in prompts. A platform that exposes an API to third-party agent builders cannot. Your documentation quality is directly inherited by every agent that integrates with you. Every ambiguity in your schema is a failure rate multiplied across your entire agent-consuming user base.
