44 posts tagged with "tool-use"

The Downstream API That Kept Writing After the User Cancelled the Conversation

· 10 min read
Tian Pan
Software Engineer

The user hits stop. The browser closes the SSE connection. Your AI SDK fires onAbort. The agent runtime sees the signal, stops requesting more tokens from the model, and tears down its loop. From inside your codebase, the cancellation looks crisp. Every subsystem you can see is doing the right thing.

Meanwhile, two seconds earlier, the model emitted a tool call. The runtime dispatched it. The tool's execute function opened a TCP connection to a third-party API and posted a payload. That HTTP request is still in flight, the third party's server is still processing it, and the third party has no way of knowing that the conversation it is serving no longer exists. The write commits. The user's mental model says they escaped the action by hitting stop. The downstream system's database says otherwise.
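A minimal sketch of the gap described above, assuming a hypothetical `ToolRunner` that sits between the agent loop and the tools. The class and the tool names are illustrative and do not come from any SDK. It shows that a cancel can block future dispatches but can only report, not recall, the writes already sent.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class ToolRunner:
    """Hypothetical runner that tracks writes dispatched before a cancel."""
    cancelled: threading.Event = field(default_factory=threading.Event)
    in_flight: list = field(default_factory=list)

    def dispatch(self, tool_name: str, send) -> str:
        # Refuse to start new side effects once the user has cancelled.
        if self.cancelled.is_set():
            return "skipped"
        # Record the call before sending: past this point the downstream
        # system may commit the write regardless of what happens locally.
        self.in_flight.append(tool_name)
        send()
        return "sent"

    def cancel(self) -> list:
        # Cancelling stops future dispatches. It cannot recall requests that
        # were already sent, so return them for reconciliation or display.
        self.cancelled.set()
        return list(self.in_flight)


runner = ToolRunner()
sent = []
runner.dispatch("create_ticket", lambda: sent.append("create_ticket"))
pending = runner.cancel()
after_cancel = runner.dispatch("send_email", lambda: sent.append("send_email"))
```

Here `pending` names the write the user did not escape, which is the information the stop button would otherwise hide.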

The Revoked Tool Your Agent Kept Calling Because the Registry Cache Was an Hour Stale

· 11 min read
Tian Pan
Software Engineer

A user opens the integrations page, finds the Stripe connector they installed last month, clicks Remove, and closes the tab. They believe they have just rescinded an authority. What they have actually done is delete a row in a database that the agent currently talking to them will not read again for another forty-three minutes. In the interval, the agent will try to call that Stripe tool, the registry's authorization layer will correctly say no, the agent's harness will treat the denial as a transient downstream blip and retry three times, and the user's own Stripe audit log will record three unauthorized access attempts arriving from a vendor they thought they had just severed.

The user's escalation will read, almost verbatim: your platform kept trying to access my Stripe after I removed it. That is exactly what happened, and the root cause sits one layer deeper than the bug report ever reaches. The tool registry was the source of truth for what the agent was allowed to do. The agent did not read the source of truth. It read a cache.
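One way to picture the stale-cache path is the sketch below. It assumes a hypothetical `CachedRegistry` with a TTL and a `ToolDenied` exception raised by the authorization layer; neither name comes from a real library. The point it illustrates is that a denial is treated as an authorization fact: it invalidates the cache and is not retried.

```python
import time


class ToolDenied(Exception):
    """Raised by the (hypothetical) registry when a tool is not authorized."""


class CachedRegistry:
    def __init__(self, fetch_allowed, ttl_seconds: float, clock=time.monotonic):
        self._fetch_allowed = fetch_allowed  # callable returning tool names
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached = None
        self._fetched_at = 0.0

    def allowed(self) -> set:
        expired = (
            self._cached is None
            or self._clock() - self._fetched_at > self._ttl
        )
        if expired:
            self._cached = set(self._fetch_allowed())
            self._fetched_at = self._clock()
        return self._cached

    def invalidate(self) -> None:
        self._cached = None


def call_tool(registry: CachedRegistry, name: str, invoke):
    if name not in registry.allowed():
        raise ToolDenied(name)
    try:
        return invoke()
    except ToolDenied:
        # A denial is not a transient fault: drop the stale cache so the
        # next read hits the source of truth, and do not retry.
        registry.invalidate()
        raise
```

With a one-hour TTL, the first denied call still reaches the downstream system once, but it is the last one: the next `allowed()` call re-reads the registry.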

The Agent's I-Don't-Know Rate That Fell After You Added More Tools

· 9 min read
Tian Pan
Software Engineer

You added the search tool, then the calendar tool, then the CRM tool, then four database wrappers and a calculator. The dashboard moved the way you wanted: task-completion ticked up, latency held, the "I don't know" rate dropped from 14% to 4%. Looks like a capability win. It is not. The planner did not learn more; it learned less abstention. Every question now looks answerable because there is always some tool that pattern-matches the query well enough to call. The 10 percentage points of "I don't know" you removed did not turn into correct answers — they turned into confident wrong ones, distributed across the long tail where nobody is grading carefully.

This is the false-competence trap of tool surface expansion. It is the most common way a team ships a regression while celebrating an improvement. The eval rubric measures whether the agent attempted the task and produced a plausible-shaped answer; it does not measure whether the agent should have refused. Abstention is not free, but it is the cheapest correct behavior available, and you stop being able to see it the moment your tool palette gets large enough that something always fires.
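A rubric that can see abstention needs a label the excerpt implies but most eval sets lack: whether the task was answerable at all. The sketch below assumes such a label exists (`answerable` is a hypothetical field) and splits outcomes four ways, so a falling "I don't know" rate can be read against a rising confident-wrong count.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Case:
    answerable: bool  # hypothetical label: could the available tools answer it?
    abstained: bool   # did the agent say "I don't know"?
    correct: bool     # was the answer right (ignored when abstained)?


def score(cases) -> Counter:
    counts = Counter()
    for c in cases:
        if c.abstained:
            key = "missed_answer" if c.answerable else "correct_abstention"
        elif c.answerable and c.correct:
            key = "correct_answer"
        else:
            # Answered when it should have refused, or answered wrongly.
            key = "confident_wrong"
        counts[key] += 1
    return counts


results = score([
    Case(answerable=True, abstained=False, correct=True),
    Case(answerable=False, abstained=False, correct=False),
    Case(answerable=False, abstained=True, correct=False),
    Case(answerable=True, abstained=True, correct=False),
])
```

Tracking `confident_wrong` alongside the abstention rate is what distinguishes a capability gain from the trap described above.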

The Streaming Abort That Left the Side Effect Billable

· 11 min read
Tian Pan
Software Engineer

A user is watching your agent stream a response. Two hundred milliseconds in, they hit stop. The UI clears the bubble, the spinner disappears, and the product behaves as if the request never happened. It did happen. The agent already called send_invoice_email. The vendor's mail relay returned 250 OK. The customer received a draft invoice the user never approved. Your billing meter charged the user for the tokens that streamed before the abort. It cannot bill back the email.

This is the failure mode every team with streaming tool use ships at least once, and most teams never even detect. The stream layer reports cancelled. The tool layer reports succeeded. Your customer-facing log picks one of them based on whichever subsystem flushes last, and the two halves of the same request now disagree about whether it occurred.
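Instead of letting whichever subsystem flushes last decide, the two reports can be combined into one outcome. This is a sketch under the assumption that the harness knows which tools have side effects; the `Outcome` names are illustrative.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETED = "completed"
    CANCELLED_CLEAN = "cancelled_no_side_effects"
    CANCELLED_WITH_SIDE_EFFECTS = "cancelled_with_side_effects"


def reconcile(stream_cancelled: bool, side_effect_results: dict) -> Outcome:
    """Derive one request outcome from both layers.

    side_effect_results maps the name of each side-effecting tool that was
    called to True if the call succeeded.
    """
    if not stream_cancelled:
        return Outcome.COMPLETED
    if any(side_effect_results.values()):
        return Outcome.CANCELLED_WITH_SIDE_EFFECTS
    return Outcome.CANCELLED_CLEAN


outcome = reconcile(True, {"send_invoice_email": True})
```

The third state is the one a two-valued log cannot express: the user cancelled, and something happened anyway.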

The Account Number Your LLM Could Not Actually Copy

· 10 min read
Tian Pan
Software Engineer

A support agent reads a customer ticket, pulls up the account, summarizes the recent activity, and issues a refund. The refund lands in the wrong account. Not a fabricated account — a real one, one digit off. The model wrote acct_7H9j2 when the customer's record was acct_7H9j3. The trace looks clean: a search call returned the right record, a summarize call produced the right summary, a refund call ran without error. Every step succeeded. The wrong customer got the money.

This is not a hallucination in the sense the postmortem will use. The model did not invent a customer. It changed one character of an existing identifier, and that is a different failure mode, one your eval suite probably never caught, because the synthetic identifiers in your test fixtures were unique by construction. Two account numbers in the same context, sharing a long prefix, and the language model (a token predictor that was never trained to copy random strings with fidelity) picked the wrong one.

The lesson is structural, not behavioral. The model does not have an attention mechanism that special-cases identifiers. To the model, acct_7H9j2 is a sequence of subword tokens whose continuation probability shifts with every other token in the window. If a near-twin identifier appears in the same prompt, the model is one bad sample away from a quiet substitution that the harness will happily execute.
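One possible structural mitigation, offered as an assumption and not as the post's prescribed fix, is to keep the model from retyping identifiers at all. In the sketch below a hypothetical `HandleTable` gives each record a short handle; the model passes the handle, and the harness resolves the real identifier.

```python
class HandleTable:
    """Maps short handles to identifiers so the model never retypes them."""

    def __init__(self):
        self._by_handle = {}

    def register(self, identifier: str) -> str:
        handle = f"ref_{len(self._by_handle) + 1}"
        self._by_handle[handle] = identifier
        return handle

    def resolve(self, handle: str) -> str:
        # Unknown handles fail loudly instead of being executed.
        if handle not in self._by_handle:
            raise KeyError(f"unknown handle: {handle}")
        return self._by_handle[handle]


table = HandleTable()
handle = table.register("acct_7H9j3")
account = table.resolve(handle)
```

This does not stop the model from choosing the wrong record, but it removes the character-level copy step, and a mistyped raw identifier such as `acct_7H9j2` is rejected because it is not a registered handle.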

The Agent That Burned an Hour Re-Trying the Same 400 Error

· 10 min read
Tian Pan
Software Engineer

An agent calls a tool. The tool returns 400 Bad Request with a perfectly structured error body: {"error": "missing required field", "field": "email"}. The agent's reasoning chain quotes the error back word for word, says "I need to include the email field," and then emits the next tool call with the exact same payload as before. The loop's retry policy, written years ago for a world of flaky HTTPS connections, treats the 400 as transient and fires it again. Three times. Eight. Fifteen. An hour later, the agent has burned through context, money, and a rate-limit budget without ever sending a request the tool could accept.

The failure looks like a model problem. It is not. The model read the error correctly. The harness around it never gave the correction anywhere to land.
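The retry policy described above can be contrasted with one that separates transient failures from request errors. The status set and the step names below are illustrative choices, not a standard.

```python
# Statuses where resending the same payload can plausibly succeed.
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def next_step(status: int, attempt: int, max_attempts: int = 3) -> str:
    """Decide what the loop does after a tool response (illustrative)."""
    if 200 <= status < 300:
        return "done"
    if status in RETRYABLE_STATUS:
        return "retry_same_payload" if attempt < max_attempts else "give_up"
    # Any other error means the request itself is wrong, so resending it
    # cannot help. Hand the error body back to the model and require a
    # changed payload before the next call.
    return "return_error_to_model"


step_for_400 = next_step(400, attempt=1)
```

Under this policy the `{"error": "missing required field", "field": "email"}` response goes back to the model once, instead of being replayed by the harness.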

The Agent That Could Not Say Wait

· 10 min read
Tian Pan
Software Engineer

Pick any production agent built in the last two years and inventory the things it can actually do on a given turn. The list is short: emit a tool call, return a final answer, or ask the user a clarifying question. That is the entire action vocabulary. Notice what is missing. There is no verb for "I would like more time before deciding." There is no verb for "I am uncertain enough that I want to pause and reconsider without committing." There is no verb for "I want to dwell on this for a moment before I do anything." The agent literally cannot say wait. The grammar does not contain the word.

This is not a polish problem. It is a structural one. The moment the agent's only outputs are actions, every internal state has to be expressed through an action. Hesitation becomes a redundant tool call. Doubt becomes a confident commitment. The team that designed only the action verbs has shipped an agent whose only language is doing, and then they wonder why it never seems to think.

The Planner That Treated Every Tool as O(1)

· 9 min read
Tian Pan
Software Engineer

Your planner emits five tool calls. On paper, it reads like a clean solution: lookup_user, search_documents, call_external_api, spawn_sub_agent, request_human_approval. The trace looks elegant, the logic is sound, the agent will arrive at the right answer. In production, those five steps take 12 milliseconds, 800 milliseconds, 4 seconds, 2 minutes, and 6 hours respectively. The planner never noticed that its five-step plan spans more than six orders of magnitude in latency.

This is not a hallucination. The model picked the right tools. It picked them in a sensible order. What it could not do — what the tool schema gave it no way to do — was reason about the fact that the last step in its plan is qualitatively different from the first one. To the planner, a tool is a tool. Every node in the plan graph has weight one.
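A sketch of what a cost-aware schema might carry, using the five latencies quoted above. The `typical_seconds` field is a hypothetical annotation and is not part of any tool-calling standard.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    typical_seconds: float  # hypothetical schema field


PLAN = [
    ToolSpec("lookup_user", 0.012),
    ToolSpec("search_documents", 0.8),
    ToolSpec("call_external_api", 4.0),
    ToolSpec("spawn_sub_agent", 120.0),
    ToolSpec("request_human_approval", 6 * 3600.0),
]


def latency_span_orders(plan) -> float:
    """Orders of magnitude between the fastest and slowest step."""
    times = [t.typical_seconds for t in plan]
    return math.log10(max(times) / min(times))


def total_seconds(plan) -> float:
    return sum(t.typical_seconds for t in plan)


span = latency_span_orders(PLAN)
total = total_seconds(PLAN)
```

With a number attached to each node, a planner or a plan validator can at least notice that one step dominates the total.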

The Pointer Your Agent Mistook for a Value: Reference vs Value in Tool Outputs

· 11 min read
Tian Pan
Software Engineer

A search tool returns ten document IDs. An asset tool returns an S3 presigned URL. A database tool returns a row handle. A file tool returns a path. Each of those returns is, formally, a pointer — a small string that names a value the agent does not yet possess. The model's downstream behavior depends entirely on whether it knows that and dereferences before reasoning, or whether it treats the pointer as if it were already the thing.

The failure mode is invisible from the trace. The tool call succeeded. The return is well-formed. The model emitted plausible-looking output. Nothing in the log says "the agent reasoned about a filename and called it a document." The pointer-vs-value confusion sits underneath the visible behavior, in a layer your tool schema never named.
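One way to name the missing layer is to make tool outputs declare whether they are a reference or a value. The `Ref` and `Value` types and the `materialize` helper below are hypothetical, a sketch of the distinction and not an existing schema.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Ref:
    kind: str    # e.g. "document_id", "path", "url"
    target: str


@dataclass(frozen=True)
class Value:
    content: str


ToolOutput = Union[Ref, Value]


def materialize(output: ToolOutput, fetch: Callable[[Ref], str]) -> Value:
    """Dereference pointers before the content is handed to the model."""
    if isinstance(output, Ref):
        return Value(fetch(output))
    return output


# A stand-in fetcher; a real one would read the file, row, or URL.
doc = materialize(Ref("path", "/reports/q3.txt"), lambda ref: "Q3 report body")
```

Because the type is explicit, a harness can refuse to pass a bare `Ref` into a reasoning step that expects content.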

Streamed Tokens Are a Promise You Can't Take Back

· 9 min read
Tian Pan
Software Engineer

The model has streamed seventy percent of a confident-sounding answer to the user's screen. Then the tool call it was about to make returns an error, or no rows, or a 429. You now get to pick between two losses: let the model finish gracefully by inventing the rest, or stop mid-sentence with no clean way to walk it back. Neither is a recovery — both are damage.

This is the part of streaming UX that nobody priced when they turned the feature on. Streaming was framed as a perceived-latency win: time-to-first-token is the metric, the user starts reading sooner, the app feels alive. What the framing leaves out is that every token you stream is a commitment. You have published a draft of an answer that you do not yet know is correct, and the back half of your system has not yet finished running. When it finishes and disagrees, your UI has no native way to retract what it already showed.

The Tool You Added For One Agent Is Now In Every Agent's Hand

· 10 min read
Tian Pan
Software Engineer

Six months ago, somebody on the customer-support team wired a send_email tool for their agent. It worked. The platform team noticed it in the shared tool registry, gave a thumbs-up emoji on the PR, and moved on. This week, a security engineer ran an audit and discovered that send_email is in the action surface of the meeting-notes summarizer, the data-quality bot, an analytics assistant nobody officially owns, and a half-built prototype that hasn't been touched since January. None of these agents need to send email. None of them have ever been reviewed for whether they should be allowed to. The PRD for the meeting-notes summarizer is two sentences long and the words "outbound communication" do not appear in it.

This is the default state of every shared tool registry I have ever audited. The act of registering a tool — pushing a JSON schema and a handler into a central catalog — is treated as a developer convenience, like adding a utility function to a shared library. But once the registry is sourced into every agent's prompt, registering a tool is not a library change. It is a deployment to every agent in the company simultaneously, with no review of whether each of them should have received it.
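The difference between registering a tool and granting it can be sketched with two separate tables. The agent names and the `GRANTS` mapping are illustrative assumptions.

```python
# Everything any team has registered.
REGISTRY = {"send_email", "search_documents", "summarize"}

# Hypothetical per-agent grants, reviewed separately from registration.
GRANTS = {
    "support_agent": {"send_email", "search_documents"},
    "meeting_notes_summarizer": {"summarize"},
}


def tools_for(agent: str) -> set:
    """Expose only tools that are both registered and explicitly granted."""
    return REGISTRY & GRANTS.get(agent, set())


summarizer_tools = tools_for("meeting_notes_summarizer")
```

An agent with no entry in `GRANTS` receives nothing, so adding a tool to the registry no longer changes any agent's action surface by default.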

The Agent That Narrated a Number It Should Have Computed

· 10 min read
Tian Pan
Software Engineer

Ask your agent for last quarter's churn rate and it answers 4.2% in one clean sentence. The number is plausible. The prose around it is confident. The dashboard, when someone finally checks, says 6.8%. The agent never queried anything — it produced a churn-shaped token sequence because, to a language model, narrating a number and computing one look identical on the way out.

This is the quiet failure mode that survives every demo. A hallucinated tool name throws an error you can catch. A malformed argument fails a schema check. But a fabricated figure, delivered in fluent English, passes through your entire pipeline looking exactly like a real one. There is no exception, no log line, no red text. The only signal that something went wrong is a human who happens to know the right answer — and the whole point of the agent was that no human had to.
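A crude provenance check, as a sketch only: flag any number in the final answer that appears in no tool output from the same turn. It matches digits literally, so it would miss reformatted or derived figures, and the function name is hypothetical.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(answer: str, tool_outputs: list) -> list:
    """Return numbers in the answer that appear in no tool output."""
    seen = set()
    for out in tool_outputs:
        seen.update(NUMBER.findall(out))
    return [n for n in NUMBER.findall(answer) if n not in seen]


narrated = unsupported_numbers("Churn was 4.2% last quarter.", [])
computed = unsupported_numbers("Churn was 6.8%.", ['{"churn_rate": 6.8}'])
```

A non-empty result is the missing log line: a figure reached the user without any query behind it.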