
Tool Output Schema Design: How Your Tool Responses Shape Agent Reasoning

9 min read
Tian Pan
Software Engineer

Most teams designing LLM agents spend considerable effort on tool selection and system prompt wording. Almost none of them think carefully about what their tools return. That's a mistake with compounding consequences — because the shape of a tool response determines how well the agent can reason about it, how much context window it consumes, and how often it hallucinates an interpretation the tool never intended.

Tool output schema design is infrastructure, not plumbing. Get it wrong and your agent will fail in ways that look like reasoning problems when they're actually schema problems.

The Reasoning Gap Between Good and Bad Schemas

A controlled study published in early 2026 found that schema-based tool interfaces reduce interface format errors relative to prose documentation. But it also found that semantic action quality and timeout-sensitive tasks remain the dominant failure modes — meaning good schema design doesn't guarantee reasoning success, but poor schema design makes reasoning success impossible.

The mechanism is straightforward: LLMs are next-token predictors trained on human-readable text. When a tool returns {"uuid": "d3fa...", "mime_type": "image/png"}, the model must infer what those values mean in context. When it returns {"name": "profile-photo.png", "image_url": "https://..."}, the model can reason directly. The cognitive overhead difference compounds across a multi-step agent loop.

Field naming alone matters more than most engineers expect. A field called ts requires inference. A field called created_at is self-evident. A field called urgency in one response and priority in another from the same tool breaks whatever reasoning pattern the agent built for the first response.
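
To make both points concrete, here is a hypothetical ticketing tool's response in an opaque shape and in a self-describing one (the field names and values are illustrative, not taken from any particular API):

```python
# Opaque and inconsistent (hypothetical): the agent must guess what "ts" means,
# and the priority concept changes names between responses from the same tool.
list_item = {"id": "d3fa9c", "ts": 1736951112, "urgency": 2}
detail    = {"id": "d3fa9c", "created": "2025-01-15", "priority": "high"}

# Self-describing and consistent: the same concepts keep the same names, in
# formats the model has seen countless times in its training data.
list_item = {
    "ticket_id": "T-4821",
    "created_at": "2025-01-15T14:25:12Z",  # ISO 8601, UTC
    "priority": "high",
}
```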

Six Anti-Patterns That Break Agent Reasoning

Bare JSON blobs with no semantic labels. An API that returns raw internal IDs, opaque type strings, and database UUIDs is designed for machine consumption by code that knows the schema in advance. Agents don't have that knowledge. They infer it from field names, and when field names are opaque, they guess — and their guesses show up as hallucinated interpretations downstream.

Error-in-success responses. A response that comes back with HTTP 200 but contains {"status": "error", "message": "Invalid parameter"} in the body violates HTTP semantics and confuses both the agent and any wrapper code handling failures. The agent sees a successful tool call, reads the body, and gets stuck trying to reconcile "the call succeeded" with "something went wrong." Well-designed tool responses make it structurally impossible to confuse success and failure: use appropriate HTTP status codes, and never bury errors inside success-shaped responses.
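
A minimal sketch of keeping the two outcomes structurally distinct at the tool boundary (the endpoint, field names, and error shape here are assumptions for illustration, not a prescribed format):

```python
import requests

def lookup_order(order_id: str) -> dict:
    # Hypothetical endpoint; the point is the shape of what gets returned.
    resp = requests.get(f"https://api.example.com/orders/{order_id}", timeout=10)

    if resp.status_code >= 400:
        # Failure path: one consistent, clearly labeled error shape.
        return {
            "ok": False,
            "error": {
                "type": "order_lookup_failed",
                "message": f"Order {order_id} could not be retrieved "
                           f"(HTTP {resp.status_code}).",
            },
        }

    # Success path: never carries a "status" or "error" field to reconcile.
    return {"ok": True, "order": resp.json()}
```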

Ambiguous nullability. There's a meaningful difference between a field that's absent (not applicable), a field that's present and null (applicable but unknown), and a field that's present with a value. When your schema conflates these — sometimes omitting a field, sometimes setting it to null, with no documented convention — the agent is left to guess the semantics. It will be wrong often enough to matter in production.
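
One workable convention, shown as a sketch with invented fields: absent means not applicable, null means applicable but unknown. Which convention you pick matters less than documenting it and applying it everywhere.

```python
# Field absent: shipping does not apply to this order at all.
digital_order = {"order_id": "O-1001", "fulfillment": "digital"}

# Field present but null: shipping applies, the timestamp just isn't known yet.
pending_order = {"order_id": "O-1002", "fulfillment": "physical", "shipped_at": None}

# Field present with a value.
shipped_order = {"order_id": "O-1003", "fulfillment": "physical",
                 "shipped_at": "2025-01-15T09:00:00Z"}
```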

Over-verbose responses. Output tokens cost 4–6× more than input tokens on most hosted models, and in a multi-turn agent loop, context costs grow roughly as n(n+1)/2 across turns — not linearly. Every unnecessary field in a tool response is a token that can't be used for reasoning. A 20-tool server that injects 2,000–4,000 tokens of schema overhead on every request, regardless of which tools are actually invoked, is silently degrading every downstream reasoning step. Pagination defaults matter here: a tool that returns 500 results when the agent needed 5 is burning context budget and diluting signal.
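
One common mitigation is to trim by default and let the agent ask for more. A rough sketch, with an invented `ticket_index` backend standing in for whatever actually serves the data:

```python
def search_tickets(query: str, limit: int = 5, include_body: bool = False) -> dict:
    results = ticket_index.search(query)  # hypothetical backing store

    items = []
    for t in results[:limit]:  # small default instead of hundreds of rows
        item = {"ticket_id": t.id, "title": t.title, "priority": t.priority}
        if include_body:       # richer output only when the agent asks for it
            item["body"] = t.body
        items.append(item)

    # total_matches tells the agent there is more without paying for it.
    return {"results": items, "total_matches": len(results)}
```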

Under-informative responses. The opposite failure is equally damaging. A tool that returns only a primary key when the agent needs to reason about the resource's state forces either an additional tool call (more latency, more tokens) or a hallucinated assumption (more errors). The target is the minimum information necessary for the agent to complete the next reasoning step — not less, not more.

Inconsistent field types across responses. When a field is sometimes a string and sometimes an integer, or sometimes a list and sometimes a single value, the agent's learned reasoning pattern for that tool breaks unpredictably. Production CRM integrations have failed in exactly this way: schema drift between response shapes that looked consistent in testing but diverged under real traffic patterns.

What Good Tool Output Design Looks Like

Good tool output design follows a few durable principles.

Return only what the agent needs to reason about next. This isn't about minimalism for its own sake — it's about signal-to-noise ratio. Every field in a response is a claim on the agent's context budget and an opportunity for misinterpretation. If a field doesn't affect downstream reasoning, it shouldn't be there.

Use stable, semantic identifiers. Prefer human-readable slugs over opaque UUIDs where practical. When you must return internal IDs, include the entity name alongside them. {"ticket_id": "T-4821", "title": "Payment timeout in checkout"} gives the agent enough to reason about without looking up the ticket separately.

Make error states unambiguous. Every tool invocation should be unambiguously either a success or a failure. Use HTTP status codes correctly, return a consistent error schema (with an actionable message, not a code requiring lookup), and never mix success data and error signals in the same response shape.
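
For illustration (both payloads are invented), the difference between an error the agent can act on and one it cannot:

```python
# Unhelpful: a code that requires a lookup table the model does not have.
opaque_error = {"error_code": "E4019"}

# Actionable: names the failed field, the constraint, and the fix.
actionable_error = {
    "error": {
        "type": "invalid_parameter",
        "message": "date_from must be an ISO 8601 date (got '15/01/2025'); "
                   "use the format YYYY-MM-DD.",
    }
}
```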

Document nullability explicitly. Mark every field as required, optional, or nullable in your schema, and be consistent. An agent reading your tool's JSON Schema is making inferences about your API contract from that schema — ambiguity in the schema becomes ambiguity in the agent's beliefs.
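
In JSON Schema terms, that means spelling out which fields are required, which may be null, and which may be absent. A sketch with invented field names:

```python
output_schema = {
    "type": "object",
    "properties": {
        "ticket_id": {"type": "string"},
        "assignee": {
            "type": ["string", "null"],
            "description": "Login of the current assignee; null if unassigned.",
        },
        "closed_at": {
            "type": "string",
            "format": "date-time",
            "description": "Present only once the ticket has been closed.",
        },
    },
    "required": ["ticket_id", "assignee"],  # closed_at may legitimately be absent
}
```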

Write field descriptions aimed at the LLM. In OpenAI function calling and Anthropic tool use, field descriptions appear in the model's context. These descriptions should explain the field's semantics, not restate its type. "The ISO 8601 timestamp when the order was placed, in UTC" is useful; "string" is not.
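
A short sketch of the difference, using invented fields; the description strings are what the model actually reads:

```python
# Restates the type; tells the model nothing it can reason with.
bad = {"placed_at": {"type": "string", "description": "string"}}

# Explains the semantics the agent actually needs.
good = {
    "placed_at": {
        "type": "string",
        "description": "ISO 8601 timestamp when the order was placed, in UTC.",
    },
    "total": {
        "type": "number",
        "description": "Order total after discounts, before shipping and tax, "
                       "in the currency given by `currency`.",
    },
}
```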

The Compounding Cost of Verbose Tool Outputs

In a stateless single-turn call, verbose tool outputs are merely wasteful. In a multi-turn agent loop, they're compounding. The total token cost of an n-turn loop scales roughly as n(n+1)/2 because the LLM API re-bills the full accumulated context on every call. A tool response that adds 1,000 unnecessary tokens on turn 1 adds those 1,000 tokens to every subsequent turn's input cost.
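
A back-of-the-envelope sketch of that scaling, assuming the full accumulated context is re-sent as input on every turn (the token counts are illustrative):

```python
def total_input_tokens(turns: int, tokens_added_per_turn: int) -> int:
    context = 0
    total = 0
    for _ in range(turns):
        context += tokens_added_per_turn  # the context grows every turn
        total += context                  # and the whole thing is billed again
    return total                          # = tokens_added_per_turn * n(n+1)/2

lean    = total_input_tokens(turns=10, tokens_added_per_turn=500)    # 27,500
verbose = total_input_tokens(turns=10, tokens_added_per_turn=1500)   # 82,500
```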

This math has an important design implication: tools that appear cheap in isolation become expensive in production loops. A team that tests a single-turn interaction at $0.02 and sees a 10-turn workflow cost $2.00 is right in the linear case. If the tool output is verbose and accumulates, that same workflow might actually cost $6–$10.

The mitigation is a combination of tool output trimming (returning less by default, offering richer output on demand) and context management patterns that mask or summarize stale tool results rather than carrying them through the full loop. Google's production multi-agent framework separates context into a stable prefix (system instructions, durable summaries) and a variable suffix (recent tool outputs, current task state) — verbose tool outputs poison this pattern by making the variable suffix grow unboundedly.
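
One way to keep that suffix bounded is to compact stale tool results before each call. A rough sketch, with an invented helper name and message structure:

```python
def compact_history(messages: list[dict], keep_recent: int = 3) -> list[dict]:
    stale, recent = messages[:-keep_recent], messages[-keep_recent:]
    compacted = []
    for m in stale:
        if m.get("role") == "tool" and len(m.get("content", "")) > 500:
            # Replace a verbose, stale tool result with a short placeholder;
            # the agent can re-run the tool if it truly needs the details.
            compacted.append({
                "role": "tool",
                "content": f"[{m.get('name', 'tool')} result elided; re-call if needed]",
            })
        else:
            compacted.append(m)
    return compacted + recent
```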

Output Contract Testing

The final piece that most teams skip: testing tool response schemas before they reach agents in production.

Pydantic models as output schemas are the baseline — define the expected response shape as a typed model and validate every tool response against it before it enters the agent loop. This catches type mismatches, missing required fields, and shape drift between what the schema documents and what the API actually returns.
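
A minimal sketch of that boundary check using Pydantic v2 (the model fields are illustrative):

```python
from pydantic import BaseModel, ValidationError

class TicketResult(BaseModel):
    ticket_id: str
    title: str
    priority: str
    assignee: str | None = None  # nullable, and declared as such

def validate_tool_response(raw: dict) -> TicketResult:
    try:
        return TicketResult.model_validate(raw)
    except ValidationError as exc:
        # Fail at the edge with a precise reason instead of letting a
        # malformed payload drift into the agent's context window.
        raise RuntimeError(f"tool returned an invalid shape: {exc}") from exc
```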

AWS's ToolSimulator framework takes this further, providing a scaffolding layer that validates tool responses against declared schemas before they're passed to the agent, catching malformed responses at the boundary rather than letting them propagate into reasoning failures. The same principle applies regardless of your stack: validate at the edge, before the agent.

For end-to-end testing, frameworks like DeepEval provide structured evaluation of agentic workflows — testing both deterministic properties (schema validity, format correctness) and probabilistic ones (whether the agent's interpretation of a tool response matches the intended semantics). A useful framing for how rigorously a tool interface is specified is a three-tier progression — natural language documentation, JSON Schema, and JSON Schema with field-level diagnostic annotations — with each tier giving progressively stronger guarantees. Most teams stop at tier one and wonder why their agents behave inconsistently.

The failure taxonomy from UC Berkeley's analysis of 1,600+ multi-agent execution traces across popular frameworks found that schema issues are among the most common root causes: parameters malformed, types mismatched, required fields missing. These failures manifest as agent reasoning failures, but they originate in schema design. Testing the schema independently from the agent catches them before the agent does.

Treat Tool Outputs as Part of the Agent's Reasoning Substrate

The most useful reframe is this: tool output schemas are not API concerns. They're reasoning substrate. Every field name, every nullable edge case, every implicit assumption in your response shape becomes material the agent reasons with. A poorly designed tool response doesn't just waste tokens — it trains the agent to make inferences that work in the happy path and fail on edge cases.

The teams building reliable agents in production treat tool design — input schemas, output schemas, error handling, and verbosity budgets — with the same care they give to system prompts and model selection. The ones who don't are debugging reasoning failures that are actually schema failures, and tracing back to the source takes longer than getting the schema right in the first place.

Start with the smallest response that lets the agent complete its next reasoning step. Name fields for the LLM, not for code. Make success and failure unambiguous. Validate outputs before they reach the context window. Then measure how much of your agent's apparent reasoning variance disappears.
