18 posts tagged with "tool-use"

The Reasoning-Model Tax at Tool Boundaries

· 10 min read
Tian Pan
Software Engineer

Extended thinking wins benchmarks on novel reasoning. At a tool boundary — the moment your agent has to pick which function to call, when to call it, and what arguments to pass — that same thinking budget often makes things worse. The model weighs three equivalent tools that a fast model would have disambiguated in one token. It manufactures plausible-sounding ambiguity where none existed. It burns a thousand reasoning tokens to second-guess the obvious search call, then calls search anyway. You paid the reasoning tax on a decision that didn't need reasoning.

This is the quiet cost center of agentic systems in 2026: not the reasoning model itself, which is priced fairly for what it does well, but the reasoning model deployed at the wrong step of the loop. The anti-pattern hides in plain sight because the top-of-loop task looks hard ("answer the user's question"), so teams wrap the entire loop in high-effort thinking mode and never notice that 80% of the thinking budget is being spent deliberating on tool-choice micro-decisions the model already got right on its first instinct.
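
A minimal sketch of the fix, assuming an OpenAI-style client with a per-request reasoning_effort knob (the model name and helper names are illustrative, not a prescribed API): pay the reasoning tax at the top of the loop, and keep tool-choice micro-decisions cheap.

```python
# Effort routing sketch: high effort for planning, low effort for tool choice.
# Assumes an OpenAI-style client exposing `reasoning_effort`; adapt to
# whatever effort/thinking knob your provider actually exposes.
from openai import OpenAI

client = OpenAI()

def plan(task: str) -> str:
    """Top-of-loop planning: genuinely novel reasoning, spend the budget here."""
    resp = client.chat.completions.create(
        model="o4-mini",  # illustrative reasoning-capable model
        reasoning_effort="high",
        messages=[{"role": "user", "content": f"Plan the steps for: {task}"}],
    )
    return resp.choices[0].message.content

def pick_tool(step: str, tools: list[dict]):
    """Tool-choice micro-decision: the first instinct is usually right."""
    resp = client.chat.completions.create(
        model="o4-mini",
        reasoning_effort="low",  # or route to a small non-reasoning model
        messages=[{"role": "user", "content": step}],
        tools=tools,
    )
    return resp.choices[0].message.tool_calls
```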

Your Tool Descriptions Are Prompts, Not API Docs

· 10 min read
Tian Pan
Software Engineer

The tool description is not documentation. It is the prompt the model reads, every single turn, to decide whether this tool fires and how. You are not writing for the developer integrating against the tool — the developer already has the schema, the types, the examples in the PR. You are writing for a stochastic reader that has never seen this codebase, is holding twenty other tool descriptions in the same context window, and has to pick one in the next forward pass.

Most teams don't write for that reader. They paste the OpenAPI summary into the description field, stick the JSON Schema under it, and ship. Then the agent undercalls the tool, confidently calls the wrong adjacent tool, or fires the right tool with parameters that were "obviously" wrong to any human reading the schema. The team blames the model. The model was reading exactly what you wrote.
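
A hedged illustration of the difference. The "before" is a pasted OpenAPI summary; the "after" is written for a model choosing among twenty neighbors. The tool names and fields here are hypothetical.

```python
# Before: technically true, useless at selection time.
search_docs_before = {
    "name": "search_docs",
    "description": "Searches documents.",
    "parameters": {
        "type": "object",
        "properties": {"q": {"type": "string"}},
        "required": ["q"],
    },
}

# After: tells the stochastic reader when to fire, when not to, and how to
# bind arguments - the three decisions it actually has to make per turn.
search_docs_after = {
    "name": "search_docs",
    "description": (
        "Full-text search over the internal knowledge base. Use this when the "
        "user asks about company policies, runbooks, or past incidents. Do NOT "
        "use it for live metrics (use query_metrics) or customer records "
        "(use lookup_customer). `q` should be 2-6 keywords, not a full sentence."
    ),
    "parameters": {
        "type": "object",
        "properties": {"q": {"type": "string", "description": "2-6 search keywords"}},
        "required": ["q"],
    },
}
```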

Tool Manifest Lies: When Your Agent Trusts a Schema Your Backend No Longer Honors

· 10 min read
Tian Pan
Software Engineer

The most dangerous bug in a production agent isn't the one that throws. It's the one where a tool description says returns user_id and the backend quietly started returning account_id two sprints ago, and the model is still happily inventing user_id in downstream reasoning — because the manifest said so, and the few-shot history reinforced it, and nothing in the loop ever fetched ground truth.

This is manifest drift: the slow, silent divergence between what your tool descriptions claim and what your endpoints actually do. It rarely produces stack traces. It produces bad decisions with clean audit trails — the worst class of bug in agent systems.
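
A minimal drift check, sketched under the assumption that each manifest entry declares the response fields it promises (the `returns` key and field names here are hypothetical). Run it on live responses so the loop fetches ground truth instead of trusting the manifest.

```python
import logging

MANIFEST = {
    # What the tool description claims the backend returns.
    "get_user": {"returns": {"user_id", "email"}},
}

def check_drift(tool_name: str, response: dict) -> dict:
    declared = MANIFEST[tool_name]["returns"]
    actual = set(response)
    missing = declared - actual  # promised but absent (e.g. user_id)
    extra = actual - declared    # present but undocumented (e.g. account_id)
    if missing or extra:
        logging.warning("manifest drift in %s: missing=%s extra=%s",
                        tool_name, missing, extra)
    return response
```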

Tool Schema Deprecation: Why You Can't Just Rename a Parameter

· 11 min read
Tian Pan
Software Engineer

You renamed query to search_query on a tool schema. The changelog says "non-breaking: clearer naming." The PR passed review. Three days later, your support queue fills up with reports that the assistant is "searching for blank results." What happened is not what anyone on the thread would tell you. The agents did not fail. They submitted the old field name, your tool server ignored the unknown key, defaulted search_query to the empty string, and returned zero hits. The model, seeing a legitimate-looking empty response, confidently explained to the user why their query returned nothing relevant.

This is the part of agent engineering that does not fit the mental model borrowed from REST API versioning. A REST client that sends a renamed field gets a 400 and a clear error — the field either exists in the validator or it doesn't. An agent that sends a renamed field gets a silent acceptance, a nonsense result, and a hallucinated rationalization. The failure is not at the wire; it is in the joint between the runtime schema and the model's in-context mental model of what the tool looks like.

Tool schemas live in two places. The first is the runtime spec — the JSON schema you publish to the MCP server or the function-calling registry. The second is the model's in-context representation of that spec, reinforced every turn by few-shot examples in your system prompt, by the serialized tool history the agent sees on multi-turn tasks, and by whatever the model already absorbed about your API during pretraining. You can atomically update the first. You cannot atomically update the second. That asymmetry is the whole problem, and it is why "additive only, reserve forever" — the discipline that protobuf and GraphQL operators internalized a decade ago — needs to migrate to the tool-schema layer now.
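
A minimal sketch of that discipline at the server boundary, assuming a Pydantic-based tool server (the field names are the hypothetical ones from the story above): keep the old field as a reserved alias forever, and make truly unknown keys fail loudly instead of degrading to a silent default.

```python
from pydantic import BaseModel, ConfigDict, Field, model_validator

class SearchArgs(BaseModel):
    # Forbid unknown keys: a stale caller sending a genuinely dead field
    # gets a loud validation error, not an empty-string default.
    model_config = ConfigDict(extra="forbid")

    search_query: str = Field(default="")
    query: str | None = None  # deprecated alias, reserved forever

    @model_validator(mode="after")
    def accept_old_name(self):
        # Honor the model's stale in-context mental model of the schema.
        if not self.search_query and self.query is not None:
            self.search_query = self.query
        return self
```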

Sequential Tool Call Waterfalls: The Hidden Latency Tax in Agent Loops

· 9 min read
Tian Pan
Software Engineer

If you've profiled an AI agent that felt inexplicably slow, chances are you found a waterfall. The agent called tool A, waited, then called tool B, waited, then called tool C — even though B and C had no dependency on A's result. You just paid 3× the latency for 1× the work.

This pattern is not an edge case. It's the default behavior of virtually every agent framework. The model returns multiple tool calls in a single response, and the execution loop runs them one at a time, in order. Fixing it isn't complicated, but first you need a reliable way to identify which calls are actually independent.
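
A minimal sketch of collapsing the waterfall, assuming calls arrive as (name, args) pairs and an async `execute` dispatcher you already have. The dependency check is deliberately naive and the "$result.N" placeholder convention is hypothetical; substitute whatever reference mechanism your framework uses.

```python
import asyncio

async def run_tool_calls(calls: list[tuple[str, dict]], execute) -> list:
    def depends_on_earlier(args: dict) -> bool:
        # Treat a call as dependent only if an argument references an
        # earlier call's output (hypothetical "$result.N" convention).
        return any(isinstance(v, str) and v.startswith("$result.")
                   for v in args.values())

    independent = [c for c in calls if not depends_on_earlier(c[1])]
    dependent = [c for c in calls if depends_on_earlier(c[1])]

    # Fan out the independent calls concurrently instead of one at a time.
    results = list(await asyncio.gather(
        *(execute(name, args) for name, args in independent)))
    for name, args in dependent:  # dependents still run in order
        results.append(await execute(name, args))
    return results
```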

The Over-Tooled Agent Problem: Why More Tools Make Your LLM Dumber

· 9 min read
Tian Pan
Software Engineer

When a team at Writer instrumented their RAG-MCP benchmark, they found that baseline tool selection accuracy — with no special handling — was 13.62% when the agent had access to a large set of tools. Not 80%. Not 60%. Thirteen percent. The same agent, with retrieval-augmented tool selection exposing only the most relevant subset, reached 43%. The tools didn't change. The model didn't change. Only the number of tool definitions visible at reasoning time changed.

This is the over-tooled agent problem, and it's quietly wrecking production AI systems at scale.
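
A minimal sketch in the spirit of retrieval-augmented tool selection: embed the tool descriptions once, then expose only the top-k most relevant definitions per query. The embedding model name is illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_index(tools: list[dict]) -> np.ndarray:
    # tools are function-calling definitions with "name" and "description"
    texts = [f"{t['name']}: {t['description']}" for t in tools]
    return encoder.encode(texts, normalize_embeddings=True)

def select_tools(query: str, tools: list[dict],
                 index: np.ndarray, k: int = 5) -> list[dict]:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q                 # cosine similarity on unit vectors
    top = np.argsort(scores)[::-1][:k]
    return [tools[i] for i in top]     # only these reach the context window
```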

Tool Docstring Archaeology: The Description Field Is Your Highest-Leverage Prompt

· 11 min read
Tian Pan
Software Engineer

The highest-leverage prompt in your agent is not in your system prompt. It is the one-sentence description you wrote under a tool definition six months ago, committed alongside the implementation, and never touched again. The model reads it on every turn to decide whether to invoke the tool, which arguments to bind, and how to recover when the response doesn't match expectations. Engineers treat it as API documentation for humans. The model treats it as a prompt.

The gap between those two framings is where the worst kind of tool-use bugs live: the model invokes the right function name with the right arguments, and the right API call goes out — but for the wrong reasons, in the wrong situation, or in preference over a better tool sitting next to it. No exception fires. Your eval suite still passes. The regression only shows up as a slow degradation in whatever metric you use to measure whether the agent is actually helping.
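
A hedged sketch of the cheapest guardrail for that slow degradation: a selection eval that asks, for a handful of prompts, whether the intended tool fires at all. `call_model_with_tools` stands in for your own inference wrapper, and the tool names are hypothetical.

```python
CASES = [
    ("refund order #1423", "issue_refund"),
    ("why was my card declined?", "lookup_payment"),
    ("what's our refund policy?", "search_docs"),  # must NOT pick issue_refund
]

def selection_accuracy(call_model_with_tools, tools: list[dict]) -> float:
    """Fraction of prompts for which the expected tool was among those chosen."""
    hits = 0
    for prompt, expected in CASES:
        chosen = call_model_with_tools(prompt, tools)  # returns chosen tool names
        hits += expected in chosen
    return hits / len(CASES)
```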

Schema Entropy: Why Your Tool Definitions Are Rotting in Production

· 10 min read
Tian Pan
Software Engineer

Your agent was working fine in January. By March, it started failing on 15% of tool calls. By May, it was silently producing wrong outputs on another 20%. Nothing in your deployment logs changed. No one touched the agent code. The tool definitions look exactly like they did six months ago — and that's the problem.

Tool schemas don't have to be edited to become wrong. The services they describe change underneath them. Enum values get added. Required fields become optional in a backend refactor. A parameter that used to accept strings now expects an ISO 8601 timestamp. The schema document stays frozen while the underlying API keeps moving, and your agent keeps calling it confidently, with no idea the contract has shifted.

This is schema entropy: the gradual divergence between the tool definitions your agent was trained to use and the tool behavior your production services actually exhibit. It is one of the most underappreciated reliability problems in production AI systems, and research suggests tool versioning issues account for roughly 60% of production agent failures.
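
One way to catch entropy before the agent does, sketched against a hypothetical endpoint: a scheduled contract probe that replays a known-good call and validates the live response against the schema the agent was given, so drift surfaces as a loud failure instead of a confident wrong answer.

```python
import jsonschema
import requests

RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "status": {"enum": ["active", "suspended"]},  # new enum value -> failure
        "created_at": {"type": "string"},             # type change -> failure
    },
    "required": ["status", "created_at"],             # field made optional -> failure
}

def probe() -> None:
    # Run on a schedule (cron, CI), not in the agent's hot path.
    resp = requests.get("https://api.example.com/v1/users/42", timeout=10)
    jsonschema.validate(resp.json(), RESPONSE_SCHEMA)  # raises on drift
```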

The N+1 Query Problem Has Infected Your AI Agent

· 10 min read
Tian Pan
Software Engineer

Your AI agent just made twelve API calls to answer a question that needed two. You didn't notice because there's no EXPLAIN ANALYZE for tool calls, no ORM profiler flagging the issue, and the agent got the right answer anyway — just two seconds late and three times over-budget on tokens.

This is the N+1 query problem, and it has quietly migrated from your database layer into your agent's tool call layer. The bad news: the failure mode is identical to what poisoned web applications in the 2010s. The good news: the solutions from that era port almost directly.
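
A sketch of one direct port of the old fix, with hypothetical tool names: coalesce the model's repeated single-item calls into one batch call before they hit the backend, the way an ORM eager-loads instead of looping.

```python
from collections import defaultdict

# Map each batchable per-item tool to its batch variant (hypothetical names):
# (batch_tool_name, per-item arg key, batch arg key)
BATCHABLE = {"get_user": ("get_users_batch", "id", "ids")}

def coalesce(calls: list[tuple[str, dict]]) -> list[tuple[str, dict]]:
    grouped: dict[str, list[dict]] = defaultdict(list)
    passthrough: list[tuple[str, dict]] = []
    for name, args in calls:
        (grouped[name] if name in BATCHABLE else passthrough).append(
            args if name in BATCHABLE else (name, args))
    batched = []
    for name, arg_list in grouped.items():
        batch_name, key, batch_key = BATCHABLE[name]
        # Twelve get_user calls collapse into one get_users_batch call.
        batched.append((batch_name, {batch_key: [a[key] for a in arg_list]}))
    return batched + passthrough
```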

Parallel Tool Calls in LLM Agents: The Coupling Test You Didn't Know You Were Running

· 10 min read
Tian Pan
Software Engineer

Most engineers reach for parallel tool calling because they want their agents to run faster. Tool execution accounts for 35–60% of total agent latency depending on the workload — coding tasks sit at the high end, deep research tasks in the middle. Running independent calls simultaneously is the obvious optimization. What surprises most teams is what happens next.

The moment you enable parallel execution, every hidden assumption baked into your tool design becomes visible. Tools that work reliably in sequential order silently break when they run concurrently. The behavior that was stable turns unpredictable, and often the failure produces no error — just a wrong answer returned with full confidence.

Parallel tool calling is not primarily a performance feature. It is an involuntary architectural audit.
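
A minimal reproduction of the audit in action, with a deliberately toy tool: shared session state that is harmless in sequence becomes a lost-update race the moment two calls interleave, and the failure is a wrong number, not an exception.

```python
import asyncio

session = {"credits": 10}  # shared state both tool calls silently depend on

async def charge(amount: int) -> None:
    balance = session["credits"]           # read
    await asyncio.sleep(0)                 # any await lets another task interleave
    session["credits"] = balance - amount  # write based on a now-stale read

async def main() -> None:
    await asyncio.gather(charge(3), charge(4))
    # Sequentially this would print 3. Concurrently, both coroutines read 10,
    # one update is lost, and this prints 6 - no error, full confidence.
    print(session["credits"])

asyncio.run(main())
```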

The Tool Result Validation Gap: Why AI Agents Blindly Trust Every API Response

· 10 min read
Tian Pan
Software Engineer

Your agent calls a tool, gets a response, and immediately reasons over it as if it were gospel. No schema check. No freshness validation. No sanity test against what the response should look like. This is the default behavior in every major agent framework, and it is silently responsible for an entire class of production failures that traditional monitoring never catches.

The tool result validation gap is the space between "the tool returned something" and "the tool returned something correct." Most teams obsess over getting tool calls right — selecting the right tool, generating valid arguments, handling timeouts. Almost nobody validates what comes back.
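
A minimal sketch of closing the gap, using Pydantic and a hypothetical weather tool: validate shape, plausibility, and freshness before the result ever reaches the model's context.

```python
from datetime import datetime, timedelta, timezone
from pydantic import BaseModel, field_validator

class WeatherResult(BaseModel):
    temp_c: float
    observed_at: datetime  # assumes the tool returns timezone-aware timestamps

    @field_validator("temp_c")
    @classmethod
    def sane_temperature(cls, v: float) -> float:
        if not -90 <= v <= 60:  # sanity test, not just a shape check
            raise ValueError(f"implausible temperature {v}")
        return v

    @field_validator("observed_at")
    @classmethod
    def fresh(cls, v: datetime) -> datetime:
        if datetime.now(timezone.utc) - v > timedelta(hours=6):  # freshness
            raise ValueError("stale observation")
        return v

def validated(raw: dict) -> WeatherResult:
    # Raises before the agent gets a chance to reason over garbage.
    return WeatherResult.model_validate(raw)
```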

The Tool Selection Problem: How Agents Choose What to Call When They Have Dozens of Tools

· 10 min read
Tian Pan
Software Engineer

Most agent demos work with five tools. Production systems have fifty. The gap between those two numbers is where most agent architectures fall apart.

When you give an LLM four tools and a clear task, it usually picks the right one. When you give it fifty tools, something more interesting happens: accuracy collapses, token costs balloon, and the failure mode often looks like the model hallucinating a tool call rather than admitting it doesn't know which tool to use. Research from the Berkeley Function Calling Leaderboard found accuracy dropping from 43% to just 2% on calendar scheduling tasks when the number of tools expanded from 4 to 51 across multiple domains. That is not a graceful degradation curve.
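
One common mitigation, sketched with hypothetical domains and tools: a cheap first pass routes the query to a domain, so the second pass chooses among a handful of definitions instead of fifty.

```python
# Two-stage routing sketch. Domain names, tool definitions, and the
# `classify` callable are all illustrative stand-ins.
DOMAINS: dict[str, list[dict]] = {
    "calendar": [{"name": "create_event", "description": "..."},
                 {"name": "list_events", "description": "..."}],
    "email":    [{"name": "send_email", "description": "..."}],
    # ... roughly ten domains, each with a handful of tools
}

def route(query: str, classify) -> list[dict]:
    """`classify` is any cheap classifier: a small model or keyword rules."""
    domain = classify(query, options=list(DOMAINS))
    return DOMAINS[domain]  # only these definitions enter the context window
```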