Skip to main content

41 posts tagged with "tool-use"

View all tags

The Streaming Abort That Left the Side Effect Billable

· 11 min read
Tian Pan
Software Engineer

A user is watching your agent stream a response. Two hundred milliseconds in, they hit stop. The UI clears the bubble, the spinner disappears, and the product behaves as if the request never happened. It did happen. The agent already called send_invoice_email. The vendor's mail relay returned 250 OK. The customer received a draft invoice the user never approved. Your billing meter charged the user for the tokens that streamed before the abort. It cannot bill back the email.

This is the failure mode every team with streaming tool use ships at least once, and most teams never even detect. The stream layer reports cancelled. The tool layer reports succeeded. Your customer-facing log picks one of them based on whichever subsystem flushes last, and the two halves of the same request now disagree about whether it occurred.

The Account Number Your LLM Could Not Actually Copy

· 10 min read
Tian Pan
Software Engineer

A support agent reads a customer ticket, pulls up the account, summarizes the recent activity, and issues a refund. The refund lands in the wrong account. Not a fabricated account — a real one, one digit off. The model wrote acct_7H9j2 when the customer's record was acct_7H9j3. The trace looks clean: a search call returned the right record, a summarize call produced the right summary, a refund call ran without error. Every step succeeded. The wrong customer got the money.

This is not a hallucination in the sense the postmortem will use. The model did not invent a customer. It transposed two characters of an existing one, and that is a different failure mode — one your eval suite probably never caught, because the synthetic identifiers in your test fixtures were unique by construction. Two account numbers in the same context, three characters of shared prefix, and the language model — which is a token predictor that has never been trained to copy random strings with fidelity — picked the wrong one.

The lesson is structural, not behavioral. The model does not have an attention mechanism that special-cases identifiers. To the model, acct_7H9j2 is a sequence of subword tokens whose continuation probability shifts with every other token in the window. If a near-twin identifier appears in the same prompt, the model is one bad sample away from a quiet substitution that the harness will happily execute.

The Agent That Burned an Hour Re-Trying the Same 400 Error

· 10 min read
Tian Pan
Software Engineer

An agent calls a tool. The tool returns 400 Bad Request with a perfectly structured error body: {"error": "missing required field", "field": "email"}. The agent's reasoning chain quotes the error back word for word, says "I need to include the email field," and then emits the next tool call with the exact same payload as before. The loop's retry policy, written years ago for a world of flaky HTTPS connections, treats the 400 as transient and fires it again. Three times. Eight. Fifteen. An hour later, the agent has burned through context, money, and a rate-limit budget without ever sending a request the tool could accept.

The failure looks like a model problem. It is not. The model read the error correctly. The harness around it never gave the correction anywhere to land.

The Agent That Could Not Say Wait

· 10 min read
Tian Pan
Software Engineer

Pick any production agent built in the last two years and inventory the things it can actually do on a given turn. The list is short: emit a tool call, return a final answer, or ask the user a clarifying question. That is the entire action vocabulary. Notice what is missing. There is no verb for "I would like more time before deciding." There is no verb for "I am uncertain enough that I want to pause and reconsider without committing." There is no verb for "I want to dwell on this for a moment before I do anything." The agent literally cannot say wait. The grammar does not contain the word.

This is not a polish problem. It is a structural one. The moment the agent's only outputs are actions, every internal state has to be expressed through an action. Hesitation becomes a redundant tool call. Doubt becomes a confident commitment. The team that designed only the action verbs has shipped an agent whose only language is doing, and then they wonder why it never seems to think.

The Planner That Treated Every Tool as O(1)

· 9 min read
Tian Pan
Software Engineer

Your planner emits five tool calls. On paper, it reads like a clean solution: lookup_user, search_documents, call_external_api, spawn_sub_agent, request_human_approval. The trace looks elegant, the logic is sound, the agent will arrive at the right answer. In production, those five steps take 12 milliseconds, 800 milliseconds, 4 seconds, 2 minutes, and 6 hours respectively. The planner never noticed that its five-step plan spans nine orders of magnitude in cost.

![](https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Planner%20That%20Treated%20Every%20Tool%20as%20O(1%29)

This is not a hallucination. The model picked the right tools. It picked them in a sensible order. What it could not do — what the tool schema gave it no way to do — was reason about the fact that the last step in its plan is qualitatively different from the first one. To the planner, a tool is a tool. Every node in the plan graph has weight one.

The Pointer Your Agent Mistook for a Value: Reference vs Value in Tool Outputs

· 11 min read
Tian Pan
Software Engineer

A search tool returns ten document IDs. An asset tool returns an S3 presigned URL. A database tool returns a row handle. A file tool returns a path. Each of those returns is, formally, a pointer — a small string that names a value the agent does not yet possess. The model's downstream behavior depends entirely on whether it knows that and dereferences before reasoning, or whether it treats the pointer as if it were already the thing.

The failure mode is invisible from the trace. The tool call succeeded. The return is well-formed. The model emitted plausible-looking output. Nothing in the log says "the agent reasoned about a filename and called it a document." The pointer-vs-value confusion sits underneath the visible behavior, in a layer your tool schema never named.

Streamed Tokens Are a Promise You Can't Take Back

· 9 min read
Tian Pan
Software Engineer

The model has streamed seventy percent of a confident-sounding answer to the user's screen. Then the tool call it was about to make returns an error, or no rows, or a 429. You now get to pick between two losses: let the model finish gracefully by inventing the rest, or stop mid-sentence with no clean way to walk it back. Neither is a recovery — both are damage.

This is the part of streaming UX that nobody priced when they turned the feature on. Streaming was framed as a perceived-latency win: time-to-first-token is the metric, the user starts reading sooner, the app feels alive. What the framing leaves out is that every token you stream is a commitment. You have published a draft of an answer that you do not yet know is correct, and the back half of your system has not yet finished running. When it finishes and disagrees, your UI has no native way to retract what it already showed.

The Tool You Added For One Agent Is Now In Every Agent's Hand

· 10 min read
Tian Pan
Software Engineer

Six months ago, somebody on the customer-support team wired a send_email tool for their agent. It worked. The platform team noticed it in the shared tool registry, gave a thumbs-up emoji on the PR, and moved on. This week, a security engineer ran an audit and discovered that send_email is in the action surface of the meeting-notes summarizer, the data-quality bot, an analytics assistant nobody officially owns, and a half-built prototype that hasn't been touched since January. None of these agents need to send email. None of them have ever been reviewed for whether they should be allowed to. The PRD for the meeting-notes summarizer is two sentences long and the words "outbound communication" do not appear in it.

This is the default state of every shared tool registry I have ever audited. The act of registering a tool — pushing a JSON schema and a handler into a central catalog — is treated as a developer convenience, like adding a utility function to a shared library. But once the registry is sourced into every agent's prompt, registering a tool is not a library change. It is a deployment to every agent in the company simultaneously, with no review of whether each of them should have received it.

The Agent That Narrated a Number It Should Have Computed

· 10 min read
Tian Pan
Software Engineer

Ask your agent for last quarter's churn rate and it answers 4.2% in one clean sentence. The number is plausible. The prose around it is confident. The dashboard, when someone finally checks, says 6.8%. The agent never queried anything — it produced a churn-shaped token sequence because, to a language model, narrating a number and computing one look identical on the way out.

This is the quiet failure mode that survives every demo. A hallucinated tool name throws an error you can catch. A malformed argument fails a schema check. But a fabricated figure, delivered in fluent English, passes through your entire pipeline looking exactly like a real one. There is no exception, no log line, no red text. The only signal that something went wrong is a human who happens to know the right answer — and the whole point of the agent was that no human had to.

The Composability Tax: Why Adding Tools Makes Your Planner Worse

· 9 min read
Tian Pan
Software Engineer

The team starts with five tools and a planner that hits the right one 95% of the time on production traffic. Eighteen months later they have fifty-one, the planner is sitting at 26%, and the simple cases the original five handled cleanly — book a meeting, look up a customer, file a ticket — now sometimes route to the wrong tool because there are three plausible-sounding lookalikes in the catalog. Nobody decided to make the planner worse. Every tool addition was individually defensible. The cumulative bill is the composability tax, and it is paid by every product whose tool catalog grows without a retirement discipline.

The tax is a curve, not a cliff. The Berkeley Function Calling Leaderboard measured it directly: on calendar scheduling, accuracy fell from 43% with four tools to 2% with fifty-one across multiple domains. On customer-support style tasks, GPT-4o dropped from 58% (single domain, nine tools) to 26% (seven domains, fifty-one tools). Llama-3.3-70B went from 21% to 0% over the same expansion. The shape repeats across models and task types: every additional tool moves the planner down the curve, and the marginal damage gets worse as the catalog gets larger because new entries are increasingly indistinguishable from incumbents.

Latency-Aware Tool Selection: When 'Good Enough Now' Beats 'Best Available Later'

· 10 min read
Tian Pan
Software Engineer

The tool description in your agent's system prompt is a six-month-old eval artifact. It says search_pricing returns "fresh inventory data with structured pricing" and the planner believes it, because nothing in the prompt has updated since the day the description was tuned. The actual search_pricing endpoint has been sitting at p95 of 11 seconds for the last forty minutes because the upstream vendor is rate-limiting your account, and the cheaper search_cache tool — which the prompt describes as "may be slightly stale" — would return the same answer in 200ms. The planner picks search_pricing anyway, because the description still reads like it did during eval, and the planner has no signal about what either tool costs to call right now.

This is the structural failure of static tool descriptions. The planner is making routing decisions on a snapshot of a world that has moved on. Tool selection isn't really a capability question — most production agents have two or three tools that overlap heavily in what they can answer — it's a cost-of-waiting question, and the cost of waiting is the thing your prompt template doesn't see.

The MCP Capability Disclosure Tax: When Every Connected Server Bills Your Context Window

· 11 min read
Tian Pan
Software Engineer

Connect a single GitHub MCP server to your agent and you've already spent twelve to forty thousand tokens before the user types a word. Connect a filesystem server, a calendar, a database, an internal CRM, and a third-party tool catalog, and a heavy desktop configuration has been measured at sixty-six thousand tokens of pure tool disclosure — nearly a third of Claude Sonnet's 200K window, paid every single planning turn. The agent hasn't done anything yet. The user hasn't asked anything yet. The bill is already running.

This is the disclosure tax, and it is the most underpriced line item in agentic systems shipping right now. Teams add MCP servers the way teams once added microservices — each integration looks like a free composition primitive, the procurement story writes itself ("more tools = more capability"), and the unit economics dashboard never surfaces the per-server cost because the cost lives inside a token bucket nobody attributes back to the connector. The result is an agent that gets slower, dumber, and more expensive every time someone adds another integration, and a team that explains the regression by re-tuning prompts and chasing the model vendor for a new version.