Skip to main content

6 posts tagged with "tool-design"

View all tags

The Production Logs Your Agent Cannot Read

· 9 min read
Tian Pan
Software Engineer

You wired your incident-response agent into Splunk. You gave it the query syntax in the system prompt, a tool to execute SPL, and a fresh API token. The first time it triaged a real page, it pulled the wrong logs, summarized the wrong service, and confidently named the wrong customer. The integration was perfect. The agent was useless.

Here is what you forgot. Fifteen years of log conventions, undocumented field names, severity strings that drifted from ERR to error to ERROR across three reorgs, and team-specific suffixes that turn customer_id into cust_id_v2_actual on the auth service and tenant.user.id on billing — none of that is in the prompt. You gave the agent access to the API. You did not give it access to the institutional knowledge that makes the API useful.

The shape of this failure is bigger than Splunk. It applies to any agent integration where the tool exposes a query language over a corpus the team has been shaping by hand for a decade. The agent has the verbs. It does not have the nouns.

The Tool Default Argument Is a Policy Decision in Disguise

· 10 min read
Tian Pan
Software Engineer

Open the trace of any agent run and look at a tool call. You see the tool name and the arguments the model chose to pass. What you do not see is everything it did not pass. A search call with query set and nothing else still ran with a page size, a timeout, a result ranking, and a visibility scope. The agent decided none of those. You did, months ago, when you wrote the tool's schema and left those parameters optional with a default.

That default is not a convenience. It is a policy decision wearing the costume of a sensible blank. The default page size caps how much of the world the agent can see in one call. The default timeout decides when the agent gives up and improvises. The default visibility scope decides whether "search the docs" means the public handbook or the entire internal wiki including the unreleased roadmap. The default dry_run flag decides whether the agent's action is a rehearsal or a real, irreversible event in production.

The Tool That Worked Until Two Agents Called It At Once

· 9 min read
Tian Pan
Software Engineer

A tool passes its tests. You called it from one agent, watched it read a record, transform it, write it back, and return a clean result. It did exactly that, every time, for weeks. Then you scaled the agent fleet from one worker to twelve, and a customer reported that their subscription got upgraded twice in the same minute. The tool did not change. The number of things calling it did.

This is the failure mode that single-agent testing cannot catch, because single-agent testing never produces the condition that triggers it. One caller is, by construction, a serial workload. Every concurrency assumption your tool quietly relies on — that nobody else is mid-write when it reads, that a counter it increments is its own, that the draft it is editing will still be there when it saves — holds trivially when there is exactly one caller. The tool is not correct. It is untested. Those are different things, and the difference stays invisible until a second agent shows up.

Pagination Is a Tool-Catalog Discipline: Why Agents Burn Context on List Returns

· 11 min read
Tian Pan
Software Engineer

Every well-designed HTTP API in your stack returns paginated results. Nobody loads a million rows into memory and hopes for the best. Yet the tools your agent calls return the entire list, and the agent dutifully reads it, because the function signature says list_orders() -> Order[] and the agent has no protocol for "give me the next page" the way a human user has scroll-and-load-more.

The agent burns tokens on rows it could have skipped. The long-tail customer with 50K records hits context-window failures the median customer never sees. The tool author cannot tell from the trace whether the agent needed all those rows or simply could not ask for fewer. And somewhere in your eval suite, the regression that would have flagged this never runs because every test fixture has fewer than 100 records.

Pagination is not a UI affordance. It is a load-shedding primitive — and the agent that consumes a tool without it is reimplementing every SELECT * FROM orders mistake the API designers in your company spent a decade learning to avoid.

No Results Is Not Absence: Why Agents Treat Retrieval Failure as Proof

· 10 min read
Tian Pan
Software Engineer

The most dangerous sentence in an agent transcript is not a hallucination. It is four calm words: "I could not find it." The agent sounds epistemically humble. It sounds like due diligence. It sounds, to any downstream reader or caller, exactly like a fact. And yet the statement carries no information about whether the thing exists. It only carries information about what happened when a specific tool, invoked with a specific query, consulted a specific index that the agent happened to have access to at that moment.

Between those two readings lies a production incident waiting to happen. A support agent tells a customer "we have no record of your order" because a replication lag delayed the write to the read replica by ninety seconds. A coding agent declares "there are no tests for this module" because it searched a directory that did not contain the test folder. A compliance agent replies "no prior violations on file" because the audit index had not ingested last week's report. In each case the agent's output is grammatically a negation, but epistemically it is a shrug that has been re-typed as a claim.

Writing Tools for Agents: The ACI Is as Important as the API

· 9 min read
Tian Pan
Software Engineer

Most engineers approach agent tools the same way they approach writing a REST endpoint or a library function: expose the capability cleanly, document the parameters, handle errors. That's the right instinct for humans. For AI agents, it's exactly wrong.

A tool used by an agent is consumed non-deterministically, parsed token by token, and selected by a model that has no persistent memory of which tool it used last Tuesday. The tool schema you write is not documentation — it is a runtime prompt, injected into the model's context at inference time, shaping every decision the agent makes. Every field name, every description, every return value shape is a design decision with measurable performance consequences. This is the agent-computer interface (ACI), and it deserves the same engineering investment you'd put into any critical user-facing interface.