Skip to main content

32 posts tagged with "tool-use"

View all tags

The Composability Tax: Why Adding Tools Makes Your Planner Worse

· 9 min read
Tian Pan
Software Engineer

The team starts with five tools and a planner that hits the right one 95% of the time on production traffic. Eighteen months later they have fifty-one, the planner is sitting at 26%, and the simple cases the original five handled cleanly — book a meeting, look up a customer, file a ticket — now sometimes route to the wrong tool because there are three plausible-sounding lookalikes in the catalog. Nobody decided to make the planner worse. Every tool addition was individually defensible. The cumulative bill is the composability tax, and it is paid by every product whose tool catalog grows without a retirement discipline.

The tax is a curve, not a cliff. The Berkeley Function Calling Leaderboard measured it directly: on calendar scheduling, accuracy fell from 43% with four tools to 2% with fifty-one across multiple domains. On customer-support style tasks, GPT-4o dropped from 58% (single domain, nine tools) to 26% (seven domains, fifty-one tools). Llama-3.3-70B went from 21% to 0% over the same expansion. The shape repeats across models and task types: every additional tool moves the planner down the curve, and the marginal damage gets worse as the catalog gets larger because new entries are increasingly indistinguishable from incumbents.

Latency-Aware Tool Selection: When 'Good Enough Now' Beats 'Best Available Later'

· 10 min read
Tian Pan
Software Engineer

The tool description in your agent's system prompt is a six-month-old eval artifact. It says search_pricing returns "fresh inventory data with structured pricing" and the planner believes it, because nothing in the prompt has updated since the day the description was tuned. The actual search_pricing endpoint has been sitting at p95 of 11 seconds for the last forty minutes because the upstream vendor is rate-limiting your account, and the cheaper search_cache tool — which the prompt describes as "may be slightly stale" — would return the same answer in 200ms. The planner picks search_pricing anyway, because the description still reads like it did during eval, and the planner has no signal about what either tool costs to call right now.

This is the structural failure of static tool descriptions. The planner is making routing decisions on a snapshot of a world that has moved on. Tool selection isn't really a capability question — most production agents have two or three tools that overlap heavily in what they can answer — it's a cost-of-waiting question, and the cost of waiting is the thing your prompt template doesn't see.

The MCP Capability Disclosure Tax: When Every Connected Server Bills Your Context Window

· 11 min read
Tian Pan
Software Engineer

Connect a single GitHub MCP server to your agent and you've already spent twelve to forty thousand tokens before the user types a word. Connect a filesystem server, a calendar, a database, an internal CRM, and a third-party tool catalog, and a heavy desktop configuration has been measured at sixty-six thousand tokens of pure tool disclosure — nearly a third of Claude Sonnet's 200K window, paid every single planning turn. The agent hasn't done anything yet. The user hasn't asked anything yet. The bill is already running.

This is the disclosure tax, and it is the most underpriced line item in agentic systems shipping right now. Teams add MCP servers the way teams once added microservices — each integration looks like a free composition primitive, the procurement story writes itself ("more tools = more capability"), and the unit economics dashboard never surfaces the per-server cost because the cost lives inside a token bucket nobody attributes back to the connector. The result is an agent that gets slower, dumber, and more expensive every time someone adds another integration, and a team that explains the regression by re-tuning prompts and chasing the model vendor for a new version.

Two-Hop Tool Chains: Why 95% Tools Compose Into 80% Pipelines

· 10 min read
Tian Pan
Software Engineer

The per-tool dashboard in your observability stack tells a comforting lie. search_listings is green at 96%. book_appointment is green at 95%. The agent that uses them back-to-back has been at 78% for three weeks and nobody can explain why. The reason isn't in either tool. It's in the seam between them — the place no dashboard panel exists.

Composition is not addition. When tool A's output flows into tool B's input, the failure surface isn't 1 - (0.96 × 0.95) against B's narrow definition of "valid call." It's the full cartesian product of every way A can be subtly off by B's standards: a date string in MM/DD/YYYY when B expects ISO 8601, a price returned in cents when B parses dollars, a paginated cursor that points one item past the last result, an entity ID that was renamed on the upstream service yesterday. Any of these passes A's own contract tests cleanly. Each one breaks B. The team's per-tool reliability metrics never see it because each tool is, by its own standards, fine.

The MCP Cold Start Tax: How Tool-Server Overhead Compounds by Agent Step 7

· 11 min read
Tian Pan
Software Engineer

A 200-millisecond tool call looks like noise on a flame graph. Stack seven of them in an agent loop and the noise becomes the signal — the model finishes thinking in 800ms but the user waits 4.5 seconds because every tool invocation re-pays a startup cost the first call already absorbed. The cruel part is that this cost doesn't show up in any single trace as anomalous. It shows up as the difference between a snappy demo and a sluggish production agent, and most teams blame the model.

The Model Context Protocol has become the default integration surface for agent tooling, which means it has also become the default place where latency goes to die. MCP's design — JSON-RPC over stdio or streamable HTTP, capability negotiation, dynamic tool discovery — is correct for a protocol that has to bridge arbitrary clients and servers. But the per-call cost structure it implies is hostile to the access pattern that agents actually have, which is not "one tool call per session" but "seven tool calls per turn for forty turns per session."

This post is about that mismatch: where the cold start tax actually lives, why it compounds rather than amortizes in long-running agents, and the warm-pool discipline that turns a multi-second penalty into a sub-100ms one.

Streaming Tool Results Break Request-Response Agent Planners

· 10 min read
Tian Pan
Software Engineer

A SQL tool ships rows as they come off the wire. The agent calls it expecting a result. The harness, written a year earlier when every tool was request-response, dutifully buffers the whole stream into a single string before invoking the model. Forty seconds later, the buffer is 200 KB, the context window is half-eaten, and the agent is reasoning about row 47,000 of a query it could have stopped at row 30. Nobody designed this failure — it falls out of treating "the tool returned" as the only event the planner reacts to.

The shift to streaming tools is happening below the planner's awareness. SQL engines emit progressive result sets. Document fetchers yield pages. Search APIs return hits in batches as relevance scores stabilize. MCP's Streamable HTTP transport, the 2025-03-26 spec replacement for HTTP+SSE, makes incremental responses a first-class transport mode rather than an exotic capability. The wire is ready. The planners on top of it are not.

Tool Composition Sandbox Escape: When Three Safe Tools Compose Into Data Exfiltration

· 10 min read
Tian Pan
Software Engineer

The security review approved each of the three tools individually. Read-only access to the customer database was rated low risk because the agent could see records but not modify them. Send-email-to-self was rated low risk because the recipient was hardcoded to a service-account inbox the agent was already authorized to write to. Template-render was rated low risk because it was a deterministic Jinja-style transform with no I/O. Three weeks after launch, the data-loss-prevention dashboard flagged customer PII showing up in a Slack channel that two hundred employees could read, and the post-mortem traced the leak to the agent composing the three tools into a chain that no single ACL had granted: read a customer record, render it through a template, email the result to its own service account whose inbox auto-forwarded into the channel.

No single tool was misused. No prompt injection bypassed any check. The agent did exactly what its tool catalog said it could do, and the composition produced a capability the security review had never been asked to evaluate.

Tool Discovery at Scale: Why Embedding-Only Retrieval Fails Past 20 Tools

· 10 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same problem on their fifth sprint: the agent can't reliably pick the right tool anymore. At ten tools, it mostly works. At twenty, accuracy starts to slip. At fifty, you're watching the agent call search_documents when it should call update_record, and the logs offer no explanation. The usual reaction is to tweak the tool descriptions — add more context, be more explicit, rewrite the examples. This occasionally helps. But it misses the root cause: flat embedding retrieval is architecturally wrong for large tool inventories, and better descriptions cannot fix an architectural problem.

Tool selection is retrieval, and retrieval has known scaling limits. Understanding those limits — and the structured metadata patterns that work around them — is what separates agent systems that hold up in production from ones that require constant babysitting.

Tool Output Schema Design: How Your Tool Responses Shape Agent Reasoning

· 9 min read
Tian Pan
Software Engineer

Most teams designing LLM agents spend considerable effort on tool selection and system prompt wording. Almost none of them think carefully about what their tools return. That's a mistake with compounding consequences — because the shape of a tool response determines how well the agent can reason about it, how much context window it consumes, and how often it hallucinates an interpretation the tool never intended.

Tool output schema design is infrastructure, not plumbing. Get it wrong and your agent will fail in ways that look like reasoning problems when they're actually schema problems.

Stale Tool Descriptions Are Your Agent's Biggest Silent Failure

· 9 min read
Tian Pan
Software Engineer

You ship a tool that lets your agent fetch user profiles. The description reads: "Retrieves user information by user ID." Six weeks later, the backend team renames user_id to customer_uuid and adds a required tenant_id field. Nobody updates the tool schema. Your agent keeps calling the old signature, gets back a 400, interprets the empty result as "no user found," and helpfully creates a duplicate record.

No error in the logs. No alert fired. The agent was confident the whole time.

This is the tool documentation problem: schema drift that turns stale descriptions into silent failure vectors. It is probably the most underappreciated reliability hazard in production AI systems today, and it gets worse the longer your agent lives.

Retiring an Agent Tool the Planner Learned to Depend On

· 10 min read
Tian Pan
Software Engineer

You unregister lookup_account_v1 from the tool catalog, swap in lookup_account_v2, and edit one paragraph of the system prompt to point at the new name. Tests pass. Three days later, support tickets start mentioning that the assistant "keeps trying to call something that doesn't exist," or — more disturbingly — that it answers customer questions with confident, plausible numbers and never hits the database at all. The deprecation didn't fail at the wire. It failed in the planner.

This is the gap between treating a tool deprecation as a syntactic change and treating it as a behavioral migration. The agent didn't just have your function in a registry; it had months of plans, multi-step recipes, and few-shot examples that routed through that function as a checkpoint. Pulling it out is closer to retiring an internal API your downstream services have informally hardcoded — except the downstream service is a model whose habits you cannot grep, and whose fallback when its preferred tool disappears is to invent one.

The Expensive-to-Undo Tool Taxonomy: One Approval Gate Per Risk Class

· 9 min read
Tian Pan
Software Engineer

The "send email" tool and the "delete account" tool are sitting behind the same modal. Your user has clicked "Approve" forty times today, none of those clicks involved reading the diff, and the next click — the one that ships an irreversible mutation to a production database — will look identical to the forty before it. This is the failure mode of binary tool approval, and it is the default in almost every agent framework shipped today.

The framing problem is that "needs human approval" is treated as a single boolean attached to a tool, when it is actually a five-or-six-class taxonomy that depends on what kind of damage the tool can do and how recoverable the damage is. Teams that ship safe agents stop asking "does this tool need a confirm dialog" and start asking "what risk class does this tool belong to, and what gate corresponds to that class." The right number of approval gates is not one and not many. It is one per risk class, and you have to enumerate the classes before you can build the gates.