Skip to main content

11 posts tagged with "tool-calling"

View all tags

MCP Tool Deprecation: Why the Model Still Calls the Old Name

· 9 min read
Tian Pan
Software Engineer

You renamed get_user_email to lookup_contact six weeks ago. The new name shipped, the old handler was removed, the changelog noted it, and your eval set passed. Then last Tuesday a customer support engineer pinged you: an agent had returned an error on roughly three percent of its tool calls during the previous week — tool_not_found: get_user_email. The renamed-away name. The one nothing in the live system advertises anymore.

The prior is sticky. The model your agent is talking to was trained on a corpus where get_user_email was overwhelmingly the canonical way to ask "what is this person's email." Even when the tools array you pass at inference time lists only lookup_contact, the model occasionally — under certain context conditions, especially long traces or recovery-after-error states — falls back to the name it remembers. A hard cutover doesn't eliminate the long tail; it just turns soft failures into hard ones.

The Tool Schema Evolution Trap: When One Optional Parameter Changed Your Planner's Prior

· 10 min read
Tian Pan
Software Engineer

A new optional parameter goes into a tool description on a Tuesday. The change is small — six lines in the diff, no breaking signature change, no callers updated, no eval cases touched. The PR description says "adds support for an optional language filter to the existing search tool." Two reviewers approve. It ships.

A week later, the cost dashboard shows that the search tool is being called eighteen percent more often than the prior baseline. Latency on the affected agent has crept up by roughly the same proportion. Nobody can point to a single failing eval. The new parameter, when used, behaves correctly. The new parameter, when not used, doesn't matter. And yet the planner has clearly changed its mind about when to reach for this tool — and the eval suite, which grades tool correctness, has nothing to say about a shift in tool frequency.

When Tools Lie: The False-Success Failure Mode Your Agent Trusts By Default

· 10 min read
Tian Pan
Software Engineer

The agent confidently tells the user, "I've sent the confirmation email and credited the refund to your account." The trace is clean: two tool calls, both returned {"success": true}, the model produced a polished summary, the conversation closed in 3.2 seconds. A week later the customer escalates because the email never arrived and the refund never posted. The audit trail is a sea of green checkmarks. Nothing failed — except the actual job.

This is the failure mode that has no name in most agent stacks: tools that lie. Not lie in the malicious sense — they return the response their contract specifies. The lie is structural. The HTTP layer says "200 OK" because the request was accepted, not because the operation completed. The mail provider says success: true because the message entered the outbound queue, not because it left the building. The database write returned without error because it landed on a replica that never propagated. The model, trained to be helpful and trained on examples where green means done, weaves these signals into a confident summary and moves on.

The Dependency Bomb in Your Tool Catalog: When Adding One Tool Breaks Five Agents

· 8 min read
Tian Pan
Software Engineer

A team I know shipped a new lookup_customer_v2 tool to their support agent's catalog on a Tuesday. The tool was scoped narrowly, well-tested in isolation, and approved by review. By Thursday, an unrelated workflow — refund processing — was failing on roughly four percent of cases that used to succeed. The refund tool hadn't changed. The refund prompt hadn't changed. The model hadn't changed. What changed was that the planner was now picking lookup_customer_v2 for refund-eligibility queries that had previously routed cleanly to get_account_status, because the new tool's description happened to contain the word "eligibility" and ranked higher under whatever similarity heuristic the model uses internally.

This is the dependency bomb. Teams treat the tool registry as additive — "we're just adding one thing, what could go wrong" — but the planner doesn't see your registry as a list of independent capabilities. It sees a probability distribution over choices, and every entry redistributes the mass. Adding a tool can quietly subtract behavior somewhere else, and your eval suite will probably miss it because nobody wrote a regression test that says "the agent should still pick the old tool for this case."

Function Calling vs Code Generation for Agent Actions: The Tradeoffs Nobody Benchmarks

· 10 min read
Tian Pan
Software Engineer

An agent running in production once received the instruction "clean up the test data" and executed a DROP TABLE command against a production database. The tool call succeeded. The audit log showed a perfectly structured JSON payload. The agent had done exactly what it was asked — just not what anyone meant. This isn't a story about prompt injection. It's a story about an architectural choice: the team had given their agent the ability to generate and execute arbitrary code, and they had underestimated what that actually means at runtime.

The choice between function calling and code generation as the action layer for AI agents is one of the most consequential decisions in agent architecture, and almost nobody benchmarks it directly. Papers measure accuracy on task completion; they rarely measure the failure modes that matter in production — silent semantic errors, irreversible side effects, security exposure surface, and debugging cost when something goes wrong.

Graph Reasoning Gaps in LLMs: Scaffolding Relational Tasks That Fool Sequence-Trained Models

· 9 min read
Tian Pan
Software Engineer

A common mistake in AI system design is asking a language model to reason over a graph as if it were reading a document. The model will generate a confident, fluent answer. The answer will be wrong in a way that looks right — it will name real nodes, reference plausible paths, and describe relationships that almost exist. Then you discover your org-chart traversal hallucinates skip-level managers, your dependency resolution misses cycles in graphs over ten nodes, and your three-hop knowledge graph query has a 60% error rate at step two.

This is not a prompt quality problem. It is an architecture problem, and you can diagnose it before writing a single prompt.

Contract Tests for LLM Tool Surfaces: When the Vendor Changes a Field and Your Agent Silently Adapts

· 11 min read
Tian Pan
Software Engineer

A vendor flipped "items" to "results" in a tool response last Tuesday. The agent didn't crash. It re-planned around the new shape, returned a confident-looking answer that was missing two-thirds of the rows, and the on-call engineer found out three days later when a customer asked why their export was short. No exception fired. No alert tripped. The eval suite, which runs against a frozen fixture from before the vendor change, was green the whole time.

This is the failure mode that contract testing was invented to catch in microservices a decade ago, and the one that almost no agent stack has any equivalent for today. HTTP services have Pact, schemathesis, and oasdiff to sit between consumer and provider and refuse to let breaking changes ship. The tools you hand to your agent — REST endpoints, internal RPCs, vendor SDKs, MCP servers — have nothing comparable. The model absorbs the change, adapts gracefully, and produces a degraded answer that looks correct.

The Acknowledgment-Action Gap: Your Agent's 'Got It' Is Not a Commitment

· 11 min read
Tian Pan
Software Engineer

An agent tells a customer: "Got it — I've submitted your refund request. You should see it in 5–7 business days." The customer closes the chat. No refund was ever submitted. There is no ticket, no API call, no row in the refunds table. Just a paragraph of polite, confident English, followed by a successful session termination.

This is the acknowledgment-action gap, and it is the single most expensive class of bug in production agent systems. The gap exists because the fluent prose that makes instruction-tuned models feel competent is a different output channel than the structured tool calls that actually change the world — and most teams wire their business logic to the wrong one.

Everyone who ships an agent eventually learns this the hard way. The model produces a polished confirmation that reads like a commitment, the downstream system interprets it as a commitment, and weeks later a support ticket arrives asking where the refund went. The embarrassing part is not that the model lied. The embarrassing part is that the system was designed to trust what it said.

Tool Hallucination Rate: The Probe Suite Your Agent Team Isn't Running

· 9 min read
Tian Pan
Software Engineer

Ask an agent team what their tool-call success rate is and you will get an answer. Ask them what their tool-hallucination rate is and the room goes quiet. Most teams do not track it, and the ones who do usually only count the catastrophic version — a function name that does not exist in the catalog — while the quieter, more expensive variants travel through production unmetered.

A hallucinated tool call is not only when the model invents delete_orphaned_users(older_than="30d") and your dispatcher throws ToolNotFoundError. That is the easy case. The harder case is when the fabricated call shadows into an adjacent real tool through fuzzy matching, or when the tool name is correct but the agent invents an argument your schema happily accepts because you marked it optional. Both of those pass your "did a tool call succeed" dashboard. Neither is what the user asked for.

Phantom Tool Calls: When AI Agents Invoke Tools That Don't Exist

· 8 min read
Tian Pan
Software Engineer

Your agent passes every unit test, handles the happy path beautifully, and then one Tuesday afternoon it tries to call get_user_preferences_v2 — a function that has never existed in your codebase. The call looks syntactically perfect. The parameters are reasonable. The only problem: your agent fabricated the entire thing.

This is the phantom tool call — a hallucination that doesn't manifest as wrong text but as a wrong action. Unlike a hallucinated fact that a human might catch during review, a phantom tool call hits your runtime, throws a cryptic ToolNotFoundError, and derails a multi-step workflow that was otherwise running fine.

The Tool Explosion Problem: Why Your Agent Breaks at 30 Tools

· 9 min read
Tian Pan
Software Engineer

Every agent demo starts with three tools. A web search, a calculator, maybe a code executor. The agent nails it every time. So you ship it, and your team starts adding integrations — Slack, Jira, GitHub, email, database queries, internal APIs. Six months later, your agent has 150 tools and picks the wrong one 40% of the time.

This is the tool explosion problem, and it's one of the least discussed failure modes in production agent systems. The degradation isn't linear — it's a cliff. An agent that's 95% accurate with 5 tools can drop below 30% accuracy when you hand it 100, even if the model and prompts haven't changed at all.