Skip to main content

32 posts tagged with "tool-use"

View all tags

Your Tool Catalog Is a Power Law and You're Optimizing the Long Tail

· 11 min read
Tian Pan
Software Engineer

Pull a week of tool-call traces from any production agent and the shape is the same: three or four tools handle 90% of the calls, and a couple of dozen others split the remaining 10%. The catalog is a power law, but the framework treats it like a uniform list. Every tool description ships in every system prompt, every selection rubric weights tools equally, every eval samples the catalog as if a search-files call and a refund-issue call were drawn from the same distribution. They are not.

The cost of that flatness is invisible until it isn't. A team adds the eighteenth tool, the planner's accuracy on the original three drops two points, nobody can localize the regression to a specific change because everything moved at once, and the eval suite — itself uniform across the catalog — averages the slip into a number that still looks fine. Meanwhile the tokens spent describing tools the model will not call this turn now exceed the tokens spent on the user's actual prompt.

Tool-Composition Privilege Escalation: Your Security Review Cleared the Nodes, Not the Edges

· 10 min read
Tian Pan
Software Engineer

read_file is safe. send_email is safe. Your security review cleared each one against its own threat model: read-only access to a known directory, outbound mail through an authenticated relay with rate limits and recipient logging. Each passed. Both got registered. Then the agent composed them, and a single line of injected text in a customer support ticket turned the pair into an exfiltration tool that the original review had no language to describe.

The danger does not live in any node of the tool graph. It lives in the edges. Every per-tool security review you ran produced a verdict on a vertex; the actual permission surface of your agent is the set of paths through the catalog, and that set grows quadratically while your review process scales linearly. By the time your agent has fifteen registered tools, you have reviewed fifteen things and shipped roughly two hundred reachable two-step compositions, none of which any human auditioned.

The Reasoning-Model Tax at Tool Boundaries

· 10 min read
Tian Pan
Software Engineer

Extended thinking wins benchmarks on novel reasoning. At a tool boundary — the moment your agent has to pick which function to call, when to call it, and what arguments to pass — that same thinking budget often makes things worse. The model weighs three equivalent tools that a fast model would have disambiguated in one token. It manufactures plausible-sounding ambiguity where none existed. It burns a thousand reasoning tokens to second-guess the obvious search call, then calls search anyway. You paid the reasoning tax on a decision that didn't need reasoning.

This is the quiet cost center of agentic systems in 2026: not the reasoning model itself, which is priced fairly for what it does well, but the reasoning model deployed at the wrong step of the loop. The anti-pattern hides in plain sight because the top-of-loop task looks hard ("answer the user's question"), so teams wrap the entire loop in high-effort thinking mode and never notice that 80% of the thinking budget is being spent deliberating on tool-choice micro-decisions the model already got right on its first instinct.

Your Tool Descriptions Are Prompts, Not API Docs

· 10 min read
Tian Pan
Software Engineer

The tool description is not documentation. It is the prompt the model reads, every single turn, to decide whether this tool fires and how. You are not writing for the developer integrating against the tool — the developer already has the schema, the types, the examples in the PR. You are writing for a stochastic reader that has never seen this codebase, is holding twenty other tool descriptions in the same context window, and has to pick one in the next forward pass.

Most teams don't. They paste the OpenAPI summary into the description field, stick the JSON Schema under it, and ship. Then the agent undercalls the tool, confidently calls the wrong adjacent tool, or fires the right tool with parameters that were "obviously" wrong to any human reading the schema. The team blames the model. The model was reading exactly what you wrote.

Tool Manifest Lies: When Your Agent Trusts a Schema Your Backend No Longer Honors

· 10 min read
Tian Pan
Software Engineer

The most dangerous bug in a production agent isn't the one that throws. It's the one where a tool description says returns user_id and the backend quietly started returning account_id two sprints ago, and the model is still happily inventing user_id in downstream reasoning — because the manifest said so, and the few-shot history reinforced it, and nothing in the loop ever fetched ground truth.

This is manifest drift: the slow, silent divergence between what your tool descriptions claim and what your endpoints actually do. It rarely produces stack traces. It produces bad decisions with clean audit trails — the worst class of bug in agent systems.

Tool Schema Deprecation: Why You Can't Just Rename a Parameter

· 11 min read
Tian Pan
Software Engineer

You renamed query to search_query on a tool schema. The changelog says "non-breaking: clearer naming." The PR passed review. Three days later, your support queue fills up with reports that the assistant is "searching for blank results." What happened is not what anyone on the thread would tell you. The agents did not fail. They submitted the old field name, your tool server ignored the unknown key, defaulted search_query to the empty string, and returned zero hits. The model, seeing a legitimate-looking empty response, confidently explained to the user why their query returned nothing relevant.

This is the part of agent engineering that does not fit the mental model borrowed from REST API versioning. A REST client that sends a renamed field gets a 400 and a clear error — the field either exists in the validator or it doesn't. An agent that sends a renamed field gets a silent acceptance, a nonsense result, and a hallucinated rationalization. The failure is not at the wire; it is in the joint between the runtime schema and the model's in-context mental model of what the tool looks like.

Tool schemas live in two places. The first is the runtime spec — the JSON schema you publish to the MCP server or the function-calling registry. The second is the model's in-context representation of that spec, reinforced every turn by few-shot examples in your system prompt, by the serialized tool history the agent sees on multi-turn tasks, and by whatever the model already absorbed about your API during pretraining. You can atomically update the first. You cannot atomically update the second. That asymmetry is the whole problem, and it is why "additive only, reserve forever" — the discipline that protobuf and GraphQL operators internalized a decade ago — needs to migrate to the tool-schema layer now.

Sequential Tool Call Waterfalls: The Hidden Latency Tax in Agent Loops

· 10 min read
Tian Pan
Software Engineer

If you've profiled an AI agent that felt inexplicably slow, chances are you found a waterfall. The agent called tool A, waited, then called tool B, waited, then called tool C — even though B and C had no dependency on A's result. You just paid 3× the latency for 1× the work.

This pattern is not an edge case. It's the default behavior of virtually every agent framework. The model returns multiple tool calls in a single response, and the execution loop runs them one at a time, in order. Fixing it isn't complicated, but first you need a reliable way to identify which calls are actually independent.

The Over-Tooled Agent Problem: Why More Tools Make Your LLM Dumber

· 9 min read
Tian Pan
Software Engineer

When a team at Writer instrumented their RAG-MCP benchmark, they found that baseline tool selection accuracy — with no special handling — was 13.62% when the agent had access to a large set of tools. Not 80%. Not 60%. Thirteen percent. The same agent, with retrieval-augmented tool selection exposing only the most relevant subset, reached 43%. The tools didn't change. The model didn't change. Only the number of tool definitions visible at reasoning time changed.

This is the over-tooled agent problem, and it's quietly wrecking production AI systems at scale.

Tool Docstring Archaeology: The Description Field Is Your Highest-Leverage Prompt

· 11 min read
Tian Pan
Software Engineer

The highest-leverage prompt in your agent is not in your system prompt. It is the one-sentence description you wrote under a tool definition six months ago, committed alongside the implementation, and never touched again. The model reads it on every turn to decide whether to invoke the tool, which arguments to bind, and how to recover when the response doesn't match expectations. Engineers treat it as API documentation for humans. The model treats it as a prompt.

The gap between those two framings is where the worst kind of tool-use bugs live: the model invokes the right function name with the right arguments, and the right API call goes out — but for the wrong reasons, in the wrong situation, or in preference over a better tool sitting next to it. No exception fires. Your eval suite still passes. The regression only shows up as a slow degradation in whatever metric you use to measure whether the agent is actually helping.

Schema Entropy: Why Your Tool Definitions Are Rotting in Production

· 10 min read
Tian Pan
Software Engineer

Your agent was working fine in January. By March, it started failing on 15% of tool calls. By May, it was silently producing wrong outputs on another 20%. Nothing in your deployment logs changed. No one touched the agent code. The tool definitions look exactly like they did six months ago — and that's the problem.

Tool schemas don't have to be edited to become wrong. The services they describe change underneath them. Enum values get added. Required fields become optional in a backend refactor. A parameter that used to accept strings now expects an ISO 8601 timestamp. The schema document stays frozen while the underlying API keeps moving, and your agent keeps calling it confidently, with no idea the contract has shifted.

This is schema entropy: the gradual divergence between the tool definitions your agent was trained to use and the tool behavior your production services actually exhibit. It is one of the most underappreciated reliability problems in production AI systems, and research suggests tool versioning issues account for roughly 60% of production agent failures.

The N+1 Query Problem Has Infected Your AI Agent

· 10 min read
Tian Pan
Software Engineer

Your AI agent just made twelve API calls to answer a question that needed two. You didn't notice because there's no EXPLAIN ANALYZE for tool calls, no ORM profiler flagging the issue, and the agent got the right answer anyway — just two seconds late and three times over-budget on tokens.

This is the N+1 query problem, and it has quietly migrated from your database layer into your agent's tool call layer. The bad news: the failure mode is identical to what poisoned web applications in the 2010s. The good news: the solutions from that era port almost directly.

Parallel Tool Calls in LLM Agents: The Coupling Test You Didn't Know You Were Running

· 10 min read
Tian Pan
Software Engineer

Most engineers reach for parallel tool calling because they want their agents to run faster. Tool execution accounts for 35–60% of total agent latency depending on the workload — coding tasks sit at the high end, deep research tasks in the middle. Running independent calls simultaneously is the obvious optimization. What surprises most teams is what happens next.

The moment you enable parallel execution, every hidden assumption baked into your tool design becomes visible. Tools that work reliably in sequential order silently break when they run concurrently. The behavior that was stable turns unpredictable, and often the failure produces no error — just a wrong answer returned with full confidence.

Parallel tool calling is not primarily a performance feature. It is an involuntary architectural audit.