17 posts tagged with "tool-calling"

The Async Tool Call Your Agent Fired and Forgot

· 10 min read
Tian Pan
Software Engineer

The clearest sign that an agent's tool-call abstraction is broken is a trace that shows the step marked done while the downstream system shows nothing happened. The model called a tool, received a job ID back, treated the job ID as the answer, and moved on. Three minutes later the actual work either succeeded with nobody listening or failed with the error landing in a log nobody reads. The user sees a confident summary; the operations queue sees a stranded task.

This is the failure mode the function-calling abstraction quietly enables. JSON schemas describe parameters and return types, but they do not distinguish between "this tool returns a result" and "this tool returns a receipt for an operation whose result you will need to ask about later." The model treats both the same way, because to the planner they look the same — a successful tool call with a non-error payload.
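One way to make the distinction visible is to give receipts their own type, so the runtime cannot hand a job ID to the planner as if it were the answer. The sketch below is illustrative only: `Final`, `Receipt`, `resolve`, and the `check_job` status lookup are hypothetical names, and the status dictionary shape is an assumption rather than any real API.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Final:
    """The tool's actual result; safe to hand to the planner."""
    value: dict


@dataclass
class Receipt:
    """An acknowledgement only: the work is still running elsewhere."""
    job_id: str


ToolOutcome = Union[Final, Receipt]


def resolve(outcome: ToolOutcome,
            check_job: Callable[[str], dict],
            max_polls: int = 5) -> dict:
    """Return a final value, polling when the tool handed back a receipt.

    check_job is a hypothetical status lookup assumed to return
    {"state": "running"}, {"state": "failed", "error": ...} or
    {"state": "done", "result": {...}}.
    """
    if isinstance(outcome, Final):
        return outcome.value
    for _ in range(max_polls):
        status = check_job(outcome.job_id)
        if status["state"] == "done":
            return status["result"]
        if status["state"] == "failed":
            raise RuntimeError(
                f"job {outcome.job_id} failed: {status.get('error')}")
        # A real runtime would sleep or yield between polls here.
    raise TimeoutError(
        f"job {outcome.job_id} still pending after {max_polls} polls")
```

With this shape, the step is only marked done once `resolve` returns; a receipt that never completes surfaces as a `TimeoutError` instead of a confident summary.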

The Hallucinated Tool Argument That Passed Schema Validation

· 9 min read
Tian Pan
Software Engineer

The agent calls fetch_order with order_id: "ORD-739241". The schema accepts it — three letters, a dash, six digits, matches the pattern exactly. The tool returns 404. The agent hedges, generates "ORD-739242", calls again, gets another 404, generates "ORD-739243". Your dashboard records three successful tool invocations and three clean schema validations. The customer waits. Somewhere in the trace, every layer of your safety stack is reporting green while the model invents identifiers at full speed.

The team's belief is that the schema caught it. The schema caught what it could catch: shape. It checked that the argument was a string, that it matched a regex, that the required field was present. The schema cannot check that ORD-739241 corresponds to a real order in your database, because the schema does not know your database exists. That gap — between syntactic plausibility and semantic correctness — is where most production tool-calling bugs live, and the failure is so quiet that the only signal is a customer's confusion.
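The gap can be made concrete with a check that reports shape failures and existence failures separately. This is a minimal sketch under stated assumptions: the regex mirrors the "three letters, a dash, six digits" pattern described above, `known_ids` stands in for a real database lookup, and `GuessGuard` is a hypothetical helper that stops the agent after repeated unknown IDs.

```python
import re

# Shape check only: three uppercase letters, a dash, six digits.
ORDER_ID_PATTERN = re.compile(r"[A-Z]{3}-\d{6}")


def validate_order_id(order_id, known_ids):
    """Return (ok, reason), separating shape errors from existence errors.

    known_ids is a stand-in for a real lookup against the order store.
    """
    if not ORDER_ID_PATTERN.fullmatch(order_id):
        return False, "malformed"
    if order_id not in known_ids:
        return False, "unknown_id"
    return True, "ok"


class GuessGuard:
    """Counts consecutive unknown-ID results for one conversation."""

    def __init__(self, limit=2):
        self.limit = limit
        self.misses = 0

    def record(self, reason):
        """Return True once the agent should stop guessing and ask the user."""
        self.misses = self.misses + 1 if reason == "unknown_id" else 0
        return self.misses >= self.limit
```

Reporting `unknown_id` as its own outcome gives the dashboard something other than "schema validation passed" to count, and the guard turns the third invented identifier into a question for the customer.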

The Filler Tool Call: When Agents Perform Diligence Instead of Doing Work

· 9 min read
Tian Pan
Software Engineer

Open the trace of any production agent and look at the tool calls that ran between the user's question and the first useful action. You will find a get_user_profile that returned a name nobody used, a check_status that came back green and was never referenced, a list_recent_orders whose result was summarized as "ok" and dropped on the floor. None of these calls changed the answer. All of them cost real money, real latency, and a real line in the trace. Your agent has learned to look diligent — and looking diligent is now your single largest source of waste.

This is the filler tool call: an action the agent emits not because it needs the result, but because the surrounding pattern of "thinking out loud, then acting" has been rewarded enough times during training that the model now performs thoroughness as a side effect of answering anything. It is the LLM equivalent of a junior analyst opening five tabs they never read so the senior across the room sees activity. The difference is that the junior gets bored. The agent never does.

MCP Server Sprawl: The Unbounded Tool Surface Nobody Owns

· 9 min read
Tian Pan
Software Engineer

The Model Context Protocol did exactly what it set out to do: it made giving an agent a new capability almost free. Wiring in a calendar server, a database server, an internal company server, or one of the 30,000-tool catalogs that vendors now publish is a config change, not a project. That frictionlessness is the feature. It is also the problem.

Because adding a tool is cheap, every team adds tools. The data team wires in a warehouse server. The support team adds a ticketing server. Someone connects a filesystem server for a one-off task and never removes it. None of these decisions is wrong. But there is no decision that owns their sum — the aggregate tool surface your agent now carries on every single request. The tool list has become a dependency graph with a real carrying cost, and in most organizations it is the one dependency graph nobody is responsible for.

The result is sprawl: a tool catalog that grows monotonically, gets reviewed by no one, costs more every quarter, and quietly makes the agent worse. This is the unowned surface, and it deserves the same scrutiny you already give your API surface and your npm tree.

The Degradation Signals Your Agent Never Receives

· 9 min read
Tian Pan
Software Engineer

When a downstream API starts to wobble, a human operator finds out a dozen ways before anything actually breaks. The status page flips to yellow. A changelog email lands in the inbox. A warning banner appears in the provider's dashboard. The on-call channel lights up with a 429 someone spotted in the logs. A teammate posts "anyone else seeing slow writes?" None of these are responses to a request. They are the ambient operational signal that surrounds the API, and a human absorbs it almost passively.

An agent calling the same API receives exactly one thing: the response to the request it just made. Status code, headers, body. That is the entire channel. It has no inbox, no dashboard, no Slack, no peripheral vision. It cannot notice that the last ten calls each took twice as long as the ten before. It cannot read the status page, because nobody handed it the URL and it has no standing instruction to look. When the dependency degrades, the agent is the last party in the system to find out — and it usually finds out by failing.

This asymmetry is not a model capability problem. A smarter model does not fix it. The agent is blind to operational signals because the plumbing never delivers them, and most agent stacks ship without anyone noticing the plumbing is missing.
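As a rough illustration of what that plumbing might look like, the sketch below compares the most recent window of call latencies with the window before it and attaches a note to the tool result when calls have slowed. `LatencyWatch`, the `operational_note` field, and the thresholds are all assumptions made for this example, not part of any agent framework.

```python
from collections import deque
from statistics import mean


class LatencyWatch:
    """Tracks call latencies and flags a slowdown between two windows."""

    def __init__(self, window=10, slowdown_factor=2.0):
        self.window = window
        self.slowdown_factor = slowdown_factor
        # Keep exactly two windows: the older one and the recent one.
        self.samples = deque(maxlen=2 * window)

    def record(self, seconds):
        self.samples.append(seconds)

    def degraded(self):
        """True when the recent window is slower than the older one by the factor."""
        if len(self.samples) < 2 * self.window:
            return False  # not enough history to compare
        samples = list(self.samples)
        older, recent = samples[:self.window], samples[self.window:]
        return mean(recent) >= self.slowdown_factor * mean(older)

    def annotate(self, tool_result):
        """Return the tool result, adding a note the model can read if degraded."""
        if self.degraded():
            note = (f"Last {self.window} calls were at least "
                    f"{self.slowdown_factor}x slower than the previous {self.window}.")
            return {**tool_result, "operational_note": note}
        return tool_result
```

The point of the design is that the signal travels through the one channel the agent actually has, the response to its own request.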

You Can't Email a Changelog to a Model: Why API Deprecation Breaks When the Caller Is an LLM

· 10 min read
Tian Pan
Software Engineer

API deprecation is a communication protocol that assumes the receiver can read. You publish a changelog, send an email to registered developers, add a Deprecation header, give six months of notice, and trust that a human on the other end will see the warning, file a ticket, and migrate before the sunset date. That entire workflow quietly stopped working the moment your most active caller became a language model.

An LLM does not subscribe to your developer newsletter. It does not have a Slack channel where someone pastes your migration guide. It rediscovers your API on every single call — from a tool description it was handed, a documentation page that may be eighteen months stale, or a memory of how your API looked in its training data. There is no persistent client you can version, notify, or page. Each request is a fresh negotiation with an entity that has no memory of your last announcement and no obligation to read your next one.

This is not a hypothetical. As agents become the dominant consumers of internal and external APIs, the deprecation playbook every backend team has used for fifteen years is failing in a specific, diagnosable way — and most teams discover it only when a "deprecated for six months" endpoint is still serving an agent in production with no path to make it stop.

MCP Tool Deprecation: Why the Model Still Calls the Old Name

· 9 min read
Tian Pan
Software Engineer

You renamed get_user_email to lookup_contact six weeks ago. The new name shipped, the old handler was removed, the changelog noted it, and your eval set passed. Then last Tuesday a customer support engineer pinged you: an agent had returned an error on roughly three percent of its tool calls during the previous week — tool_not_found: get_user_email. The renamed-away name. The one nothing in the live system advertises anymore.

The prior is sticky. The model your agent is talking to was trained on a corpus where get_user_email was overwhelmingly the canonical way to ask "what is this person's email." Even when the tools array you pass at inference time lists only lookup_contact, the model occasionally — under certain context conditions, especially long traces or recovery-after-error states — falls back to the name it remembers. A hard cutover doesn't eliminate the long tail; it just turns soft failures into hard ones.
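One possible mitigation, offered here as an assumption rather than as the approach this post goes on to recommend, is to keep the retired name as a counted alias instead of removing it outright. `ToolRouter` and its fields are hypothetical; the tool names come from the example above.

```python
from collections import Counter


class ToolRouter:
    """Dispatches tool calls, resolving retired names through an alias map."""

    def __init__(self, handlers, aliases):
        self.handlers = handlers      # live tool name -> callable
        self.aliases = aliases        # retired name -> live tool name
        self.alias_hits = Counter()   # how often each retired name is still called

    def call(self, name, **kwargs):
        if name in self.aliases:
            self.alias_hits[name] += 1
            name = self.aliases[name]
        if name not in self.handlers:
            raise KeyError(f"tool_not_found: {name}")
        return self.handlers[name](**kwargs)


# Usage: the old name keeps working while its hit count shows the long tail.
router = ToolRouter(
    handlers={"lookup_contact": lambda user_id: {"email": f"{user_id}@example.com"}},
    aliases={"get_user_email": "lookup_contact"},
)
```

The alias counter gives a measurable signal for when the long tail has actually died out, which a changelog entry cannot.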

The Tool Schema Evolution Trap: When One Optional Parameter Changed Your Planner's Prior

· 10 min read
Tian Pan
Software Engineer

A new optional parameter goes into a tool description on a Tuesday. The change is small — six lines in the diff, no breaking signature change, no callers updated, no eval cases touched. The PR description says "adds support for an optional language filter to the existing search tool." Two reviewers approve. It ships.

A week later, the cost dashboard shows that the search tool is being called eighteen percent more often than the prior baseline. Latency on the affected agent has crept up by roughly the same proportion. Nobody can point to a single failing eval. The new parameter, when used, behaves correctly. The new parameter, when not used, doesn't matter. And yet the planner has clearly changed its mind about when to reach for this tool — and the eval suite, which grades tool correctness, has nothing to say about a shift in tool frequency.
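A frequency check of this kind is simple to sketch. The example below assumes traces are available as lists of tool names, one list per agent run; `call_rates`, `frequency_drift`, and the 10% tolerance are illustrative choices, not an established metric.

```python
from collections import Counter


def call_rates(traces):
    """Average calls per run for each tool; traces is a list of tool-name lists."""
    if not traces:
        return {}
    counts = Counter(name for trace in traces for name in trace)
    return {name: n / len(traces) for name, n in counts.items()}


def frequency_drift(baseline, candidate, tolerance=0.10):
    """Return {tool: relative change} for tools whose call rate moved past tolerance."""
    base, cand = call_rates(baseline), call_rates(candidate)
    drift = {}
    for name in set(base) | set(cand):
        b, c = base.get(name, 0.0), cand.get(name, 0.0)
        if b == 0.0:
            if c > 0.0:
                drift[name] = float("inf")  # tool newly in use
            continue
        change = (c - b) / b
        if abs(change) > tolerance:
            drift[name] = change
    return drift
```

Run against the same prompt set before and after a tool-description change, a check like this would flag an eighteen percent increase even when every individual call is graded correct.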

When Tools Lie: The False-Success Failure Mode Your Agent Trusts By Default

· 10 min read
Tian Pan
Software Engineer

The agent confidently tells the user, "I've sent the confirmation email and credited the refund to your account." The trace is clean: two tool calls, both returned {"success": true}, the model produced a polished summary, the conversation closed in 3.2 seconds. A week later the customer escalates because the email never arrived and the refund never posted. The audit trail is a sea of green checkmarks. Nothing failed — except the actual job.

This is the failure mode that has no name in most agent stacks: tools that lie. Not lie in the malicious sense — they return the response their contract specifies. The lie is structural. The HTTP layer says "200 OK" because the request was accepted, not because the operation completed. The mail provider says success: true because the message entered the outbound queue, not because it left the building. The database write returned without error because it landed on a replica that never propagated. The model, trained to be helpful and trained on examples where green means done, weaves these signals into a confident summary and moves on.
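The difference between "accepted" and "completed" can be carried in the tool result itself. The following is a minimal sketch, assuming each tool can be paired with a hypothetical `verify` callback that independently checks the outcome (for example, reading back a delivery status or the posted refund); none of these names come from a real library.

```python
from typing import Callable


def call_with_verification(tool: Callable[[], dict],
                           verify: Callable[[dict], bool]) -> dict:
    """Run a tool, then check its effect instead of trusting success: true.

    accepted  -> the tool's own response claimed success
    confirmed -> an independent check observed the effect
    """
    response = tool()
    accepted = bool(response.get("success"))
    # Only bother verifying when the tool claims to have succeeded.
    confirmed = accepted and bool(verify(response))
    return {"accepted": accepted, "confirmed": confirmed, "response": response}
```

An agent given this result can say "the email was queued but delivery is not yet confirmed" rather than "I've sent the confirmation email", which is the honest reading of a green checkmark from a queue.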

The Dependency Bomb in Your Tool Catalog: When Adding One Tool Breaks Five Agents

· 8 min read
Tian Pan
Software Engineer

A team I know shipped a new lookup_customer_v2 tool to their support agent's catalog on a Tuesday. The tool was scoped narrowly, well-tested in isolation, and approved by review. By Thursday, an unrelated workflow — refund processing — was failing on roughly four percent of cases that used to succeed. The refund tool hadn't changed. The refund prompt hadn't changed. The model hadn't changed. What changed was that the planner was now picking lookup_customer_v2 for refund-eligibility queries that had previously routed cleanly to get_account_status, because the new tool's description happened to contain the word "eligibility" and ranked higher under whatever similarity heuristic the model uses internally.

This is the dependency bomb. Teams treat the tool registry as additive — "we're just adding one thing, what could go wrong" — but the planner doesn't see your registry as a list of independent capabilities. It sees a probability distribution over choices, and every entry redistributes the mass. Adding a tool can quietly subtract behavior somewhere else, and your eval suite will probably miss it because nobody wrote a regression test that says "the agent should still pick the old tool for this case."
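The missing regression test can be sketched as a list of pinned query-to-tool expectations that is replayed whenever the catalog changes. In this example `choose_tool` is a placeholder for whatever invokes the real planner and returns the selected tool name; the function and case format are assumptions for illustration.

```python
def routing_regressions(pinned_cases, choose_tool):
    """Replay pinned (query, expected_tool) pairs and report any that moved.

    choose_tool is a stand-in for a call into the real planner that
    returns the name of the tool it selected for the query.
    """
    failures = []
    for query, expected in pinned_cases:
        actual = choose_tool(query)
        if actual != expected:
            failures.append((query, expected, actual))
    return failures


# Pinned expectation taken from the incident described above.
PINNED = [
    ("is this order eligible for a refund?", "get_account_status"),
]
```

A non-empty result after adding a tool is the signal that the new entry has taken probability mass from an existing route, even though nothing about the old tool changed.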

Function Calling vs Code Generation for Agent Actions: The Tradeoffs Nobody Benchmarks

· 10 min read
Tian Pan
Software Engineer

An agent running in production once received the instruction "clean up the test data" and executed a DROP TABLE command against a production database. The tool call succeeded. The audit log showed a perfectly structured JSON payload. The agent had done exactly what it was asked — just not what anyone meant. This isn't a story about prompt injection. It's a story about an architectural choice: the team had given their agent the ability to generate and execute arbitrary code, and they had underestimated what that actually means at runtime.

The choice between function calling and code generation as the action layer for AI agents is one of the most consequential decisions in agent architecture, and almost nobody benchmarks it directly. Papers measure accuracy on task completion; they rarely measure the failure modes that matter in production — silent semantic errors, irreversible side effects, security exposure surface, and debugging cost when something goes wrong.

Graph Reasoning Gaps in LLMs: Scaffolding Relational Tasks That Fool Sequence-Trained Models

· 9 min read
Tian Pan
Software Engineer

A common mistake in AI system design is asking a language model to reason over a graph as if it were reading a document. The model will generate a confident, fluent answer. The answer will be wrong in a way that looks right — it will name real nodes, reference plausible paths, and describe relationships that almost exist. Then you discover your org-chart traversal hallucinates skip-level managers, your dependency resolution misses cycles in graphs over ten nodes, and your three-hop knowledge graph query has a 60% error rate at step two.

This is not a prompt quality problem. It is an architecture problem, and you can diagnose it before writing a single prompt.
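To illustrate the architectural point with one of the failures listed above: cycle detection in a dependency graph is a deterministic traversal that ordinary code handles exactly, so it can be computed outside the model and handed over as a fact. The depth-first sketch below is a generic textbook approach, not necessarily the scaffolding this post goes on to describe, and the adjacency-dict input format is an assumption.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes (first == last), or None.

    graph maps each node to an iterable of the nodes it depends on.
    Recursive depth-first search; fine for small graphs.
    """
    state = {}   # node -> "visiting" while on the current path, then "done"
    path = []    # nodes on the current DFS path, in order

    def visit(node):
        state[node] = "visiting"
        path.append(node)
        for nxt in graph.get(node, ()):
            if state.get(nxt) == "visiting":
                # Back edge: the cycle is the path from nxt to here, closed.
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        state[node] = "done"
        return None

    for node in graph:
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None
```

Exposed as a tool, a function like this lets the model explain or act on a cycle without having to discover it by reading an edge list as prose.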