
Writing Tools for Agents: The ACI Is as Important as the API

· 9 min read
Tian Pan
Software Engineer

Most engineers approach agent tools the same way they approach writing a REST endpoint or a library function: expose the capability cleanly, document the parameters, handle errors. That's the right instinct for humans. For AI agents, it's exactly wrong.

A tool used by an agent is consumed non-deterministically, parsed token by token, and selected by a model that has no persistent memory of which tool it used last Tuesday. The tool schema you write is not documentation — it is a runtime prompt, injected into the model's context at inference time, shaping every decision the agent makes. Every field name, every description, every return value shape is a design decision with measurable performance consequences. This is the agent-computer interface (ACI), and it deserves the same engineering investment you'd put into any critical user-facing interface.

The Minimal Viable Tool Set Is a Feature, Not a Compromise

The most common mistake I see teams make when building agent systems is wrapping every available API endpoint as a tool. It feels complete. It feels powerful. It quietly degrades performance.

Every additional tool in an agent's context consumes attention budget and creates an additional ambiguous choice point. When an agent has twenty tools and three of them could plausibly answer the current query, the model distributes probability mass across all three. When it has five tools and only one is relevant, the correct selection is nearly certain. More tools do not mean more capability; they mean more failure surface.

The design test is simple: if a senior engineer on your team can't definitively say, without hesitation, which tool should be used in a given situation, an AI agent cannot be expected to do better. Ambiguity in your tool taxonomy is ambiguity in agent behavior.

Start with three to five tools that cover your highest-impact workflows. Test thoroughly. Expand only when evaluation data shows a workflow gap. In practice, the biggest gains come not from adding tools but from consolidating them. A get_customer_context tool that returns name, status, recent transactions, and open notes in one call outperforms three separate get_customer_by_id, list_transactions, and list_notes tools — because it reduces the number of sequential decisions the agent must make, and sequential failures compound multiplicatively.

That last point matters more than most teams realize. If each tool call has 90% accuracy, a three-step chain is accurate 73% of the time. A seven-step chain falls to 48%. The architecture of your tool set — how many sequential invocations a task requires — is a first-class reliability variable.
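The compounding arithmetic is worth making concrete. Under the simplifying assumption that each step succeeds independently, end-to-end reliability is just per-step accuracy raised to the chain length:

```python
def chain_accuracy(per_step: float, steps: int) -> float:
    """Probability that every call in a sequential tool chain succeeds,
    assuming independent per-step accuracy."""
    return per_step ** steps

# 90% per-call accuracy compounds quickly across a chain:
print(round(chain_accuracy(0.9, 3), 2))  # 0.73
print(round(chain_accuracy(0.9, 7), 2))  # 0.48
```

Halving the number of sequential calls a workflow requires often buys more reliability than any prompt refinement can.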

Tool Descriptions Are High-Stakes Prompts

Treat every tool description as if you are onboarding a junior developer who has never seen your codebase. Make the implicit explicit. Specify when to use the tool, when not to use it, what each parameter means, what format is expected, and how this tool differs from the similar one next to it.

The description for search_contacts should not read: "Search for contacts." It should read something like: "Search contacts by name, email, or company. Use this when you need to find a specific person. Returns up to 10 matching records. Do NOT use this to retrieve all contacts for bulk analysis — use export_contacts_csv for that. If the user provides a company name, pass it as the company parameter rather than including it in the name query."
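Embedded in a tool definition, that description might look like the following sketch. The schema layout (`name`/`description`/`parameters`) follows the common function-calling convention; the specific field names are illustrative, not tied to any one provider:

```python
# Hypothetical search_contacts tool definition in a generic
# function-calling schema. The description carries the usage
# boundaries, negative guidance, and parameter routing rules.
search_contacts_tool = {
    "name": "search_contacts",
    "description": (
        "Search contacts by name, email, or company. Use this when you "
        "need to find a specific person. Returns up to 10 matching "
        "records. Do NOT use this to retrieve all contacts for bulk "
        "analysis -- use export_contacts_csv for that. If the user "
        "provides a company name, pass it as the company parameter "
        "rather than including it in the name query."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string", "description": "Full or partial person name."},
            "email": {"type": "string", "description": "Exact email address."},
            "company": {"type": "string", "description": "Company name filter."},
        },
        "required": [],
    },
}
```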

That level of specificity feels excessive for a human reader. For a model, it eliminates entire categories of invocation errors. Refinements to tool descriptions — even minor ones — produce measurable accuracy gains with no code changes.

Two patterns are especially useful:

Negative guidance: State explicitly when not to use a tool. Without it, models will attempt to use tools that seem plausible even when another tool is more appropriate.

Embedded examples: For tools with non-obvious invocation patterns, include a worked example in the description. A single concrete input-output pair resolves ambiguity that paragraphs of prose cannot.

Return Values Are Context Budget, Not Just Data

Tool return values are consumed token by token and permanently occupy the model's context window. Every field you return that doesn't contribute to the agent's next decision is a cost with no benefit — it crowds out reasoning capacity for the rest of the task.

The instinct to return everything and let the model decide what's relevant is wrong. The model cannot throw away tokens it has already processed. Design return values for agent comprehension, not developer completeness.

Avoid raw UUIDs and cryptic internal identifiers. A UUID returned in a tool result consumes tokens with zero semantic content — the model cannot reason about f47ac10b-58cc-4372-a567-0e02b2c3d479. A human-readable name, status string, or natural language description gives the model something to reason with.
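In practice this means projecting internal records into an agent-facing shape before returning them. A minimal sketch, with hypothetical field names and status codes:

```python
# Hypothetical raw record from an internal store, projected into an
# agent-friendly view: readable names and status strings instead of
# UUIDs and internal codes.
STATUS_NAMES = {1: "active", 2: "suspended", 3: "closed"}

def to_agent_view(raw: dict) -> dict:
    return {
        "customer": raw["display_name"],
        "status": STATUS_NAMES.get(raw["status_code"], "unknown"),
        "plan": raw["plan_name"],
    }

raw = {
    "id": "f47ac10b-58cc-4372-a567-0e02b2c3d479",
    "display_name": "Ada Lovelace",
    "status_code": 1,
    "plan_name": "Enterprise",
}
print(to_agent_view(raw))
# {'customer': 'Ada Lovelace', 'status': 'active', 'plan': 'Enterprise'}
```

Note that the UUID is dropped entirely from the agent-facing result; if the agent needs a handle for follow-up calls, return a readable one.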

Implement pagination with sensible defaults. A tool that returns 500 records when called without filters will exhaust context on any realistic dataset. Default to 10 or 20 results, support a limit parameter, and include a truncation message that steers the agent toward a more targeted follow-up query rather than just saying "more results exist."
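A pagination wrapper along these lines (a sketch, with an invented `note` field for the steering message) keeps any list-returning tool bounded:

```python
def paginate(records: list, limit: int = 10) -> dict:
    """Return at most `limit` records, plus a message that steers the
    agent toward a narrower query when results are truncated."""
    page = records[:limit]
    result = {"results": page, "returned": len(page), "total": len(records)}
    if len(records) > limit:
        result["note"] = (
            f"Showing {limit} of {len(records)} results. Refine the "
            "query with a name or company filter instead of paging "
            "through the rest."
        )
    return result

out = paginate([{"name": f"contact-{i}"} for i in range(500)], limit=10)
```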

One pattern worth implementing for any tool that returns potentially large responses: a response_format parameter with enum values like concise and detailed. Agents in multi-step workflows often only need high-level information from an intermediate step; the concise format gives them that without burning context on fields they won't use.
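The pattern is cheap to implement: branch on the enum value and project down to the fields an intermediate step actually needs. A sketch with a hypothetical ticket record:

```python
def format_ticket(ticket: dict, response_format: str = "concise") -> dict:
    """response_format is an enum: 'concise' for intermediate workflow
    steps, 'detailed' when the agent needs the full record."""
    if response_format == "concise":
        return {
            "id": ticket["key"],
            "status": ticket["status"],
            "title": ticket["title"],
        }
    return ticket  # 'detailed': full record, full token cost

ticket = {
    "key": "OPS-1423",
    "status": "open",
    "title": "Checkout latency spike",
    "description": "Long incident narrative...",
    "comments": ["...", "..."],
}
```

Defaulting to `concise` means the agent pays the `detailed` token cost only when it explicitly asks for it.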

The token efficiency gains here can be dramatic. In code execution environments where data processing happens inside the execution sandbox before results are returned to the model, production systems have achieved greater than 90% reductions in token consumption per task.

Schema Design Is Performance Engineering

The JSON schema attached to each tool is injected into the model's system context on every API call. Its quality is not a documentation concern — it is a performance concern.

Enums are the most underused schema feature in practice. When a parameter has a bounded set of valid values, expressing that as an enum prevents the model from generating invalid values entirely. This is the agent equivalent of poka-yoke — mistake-proofing by making invalid states structurally unrepresentable. Use them for status fields, format selections, operation types, anything with a defined value set.

Require absolute paths rather than relative ones when tools interact with filesystems. Relative paths are a common model error; absolute path requirements eliminate the entire error class with a single schema constraint.

Automated schema generation from typed function signatures — using Pydantic's Field() annotations with model_json_schema() — is strongly preferable to manually maintained JSON. Manual schemas drift from implementation, develop inconsistencies, and don't benefit from type system validation. The schema becomes the source of truth only when it's generated from the code, not written alongside it.
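Both points combine naturally: a `Literal`-typed field becomes an enum constraint in the generated schema, so the mistake-proofing falls out of the type annotations. A sketch assuming Pydantic v2's `model_json_schema()`:

```python
from typing import Literal

from pydantic import BaseModel, Field

class SearchContacts(BaseModel):
    """Search contacts by name, email, or company."""

    query: str = Field(description="Name or email to search for.")
    scope: Literal["people", "companies"] = Field(
        default="people",
        description="Which record type to search.",
    )

# Generated from the types: the Literal becomes an enum constraint,
# and only fields without defaults are marked required.
schema = SearchContacts.model_json_schema()
```

Because the schema is derived from the model, a change to the function signature cannot silently diverge from what the agent sees.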

On naming: use consistent namespacing across tools that serve the same domain. If you have multiple search tools, the patterns asana_search and jira_search — or the more granular asana_projects_search and asana_tasks_search — allow agents to navigate large tool sets by reasoning about taxonomy rather than memorizing individual semantics.

Error Messages Are Correction Opportunities

When an agent's tool call fails, the error response is returned directly into the model's context. This is not just a logging event — it is the model's only input for deciding how to recover. The quality of your error messages determines whether the agent self-corrects successfully or enters a retry loop that burns tokens without converging.

Generic errors like {"error": "invalid input"} are useless. The model gets no signal about what was wrong or how to fix it. A useful error message names the specific parameter that was invalid, explains what was received, and suggests the correct form. If a date parameter fails validation, the error should say what format was expected, what was received, and give an example of a valid value.
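For the date-parameter case, a validator might return an error payload like this sketch (the field names are illustrative, not a standard):

```python
from datetime import datetime
from typing import Optional

def validate_date(value: str, param: str = "start_date") -> Optional[dict]:
    """Return None if valid, else an actionable error payload the agent
    can use to self-correct on its next call."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return None
    except ValueError:
        return {
            "error": "invalid_parameter",
            "parameter": param,
            "received": value,
            "expected_format": "YYYY-MM-DD",
            "example": "2024-03-15",
        }

print(validate_date("03/15/2024"))
```

Everything the model needs to fix the call is in the payload: which parameter, what it sent, what was expected, and a valid example.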

Handle the three error categories differently:

Structural errors — malformed arguments, wrong types, missing required fields — should return detailed, actionable feedback to the model for in-context correction.

Transient runtime errors — rate limits, timeouts, temporary API downtime — should trigger exponential backoff with jitter up to a sensible maximum. Always cap retry attempts; an uncapped retry loop will exhaust both time and token budget.

Permanent errors — invalid credentials, resource not found, authorization failures — should not be retried. They should escalate, surface to human oversight, or cause the agent to reformulate its approach.
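The transient-versus-permanent split maps directly onto a retry wrapper. A minimal sketch, using capped exponential backoff with full jitter and invented exception classes for the two categories:

```python
import random
import time

class TransientError(Exception):
    """Rate limit, timeout, temporary downtime: safe to retry."""

class PermanentError(Exception):
    """Bad credentials, not found, unauthorized: never retry."""

def call_with_retry(fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry transient failures with capped exponential backoff and
    full jitter; permanent failures surface immediately."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except PermanentError:
            raise  # escalate or reformulate, do not retry
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter
```

The hard cap on attempts is the important part: without it, a persistent transient failure burns the agent's time and token budget without converging.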

The structural error feedback loop is particularly important for iterative agent tasks. An agent that receives a precise error message can reformulate its call correctly in the very next step. An agent that receives an opaque error will either retry blindly or abandon the task.

Evaluating Tool Design the Right Way

Evaluation is how you discover that your tool descriptions are ambiguous, your return values are bloated, or your tool set has redundant overlap. Single-tool, single-invocation tests will not expose these problems. The failure modes that matter in production emerge from multi-step workflows.

Use evaluation tasks that require three to five sequential tool calls against realistic data. Measure not just whether the final task completed correctly, but which tools were called, in what order, how many times, and what the total token consumption was. Patterns in transcript analysis tell you things aggregate accuracy scores do not:

  • Agents consistently misidentifying which of two similar tools to use: description boundary problem
  • Repeated calls to the same tool in a single task: pagination parameters or response scoping issue
  • Low accuracy on one tool despite high aggregate performance: schema clarity problem specific to that tool
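These patterns fall out of simple aggregate counts over transcripts. A sketch, assuming each transcript is a list of tool-call records with a tool name and token count:

```python
from collections import Counter

def transcript_metrics(transcript: list) -> dict:
    """transcript: list of dicts like {"tool": str, "tokens": int}.
    Returns per-tool call counts, call order, and total token use."""
    calls = [step["tool"] for step in transcript]
    return {
        "call_counts": Counter(calls),
        "call_order": calls,
        "total_tokens": sum(step.get("tokens", 0) for step in transcript),
    }

t = [
    {"tool": "search_contacts", "tokens": 120},
    {"tool": "search_contacts", "tokens": 130},
    {"tool": "get_customer_context", "tokens": 400},
]
m = transcript_metrics(t)
# Repeated search_contacts calls in one task would hint at a
# pagination or response-scoping issue worth investigating.
```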

The discipline here is iterative: write descriptions, run evaluations, analyze transcripts, refine descriptions, re-evaluate. Accurate per-tool performance matters, but end-to-end task completion on realistic multi-step benchmarks is the metric that reflects production behavior.

The ACI Deserves Its Own Engineering Discipline

The way you build tools for agents determines the reliability ceiling of your entire agent system. You cannot compensate for a poorly designed tool set with better prompts or more capable models. The context budget wasted on irrelevant return fields, the accuracy lost to ambiguous tool descriptions, the reliability degraded by long sequential chains — these are architectural properties, and they have to be addressed at the architecture level.

The parallel to human interface design is instructive. We've spent decades learning that the best APIs are not necessarily the best interfaces for humans — the interaction model, the affordances, the error recovery paths need to be designed for how the actual user operates. AI agents are the new user. The tools you write are the interface. And the quality of that interface, down to individual field names and description sentences, determines whether the system works in the real world or only in demos.


Building agent systems? The ACI is where the work is.
