
33 posts tagged with "mcp"


The Async Tool Call Your Agent Fired and Forgot

· 10 min read
Tian Pan
Software Engineer

The clearest sign that an agent's tool-call abstraction is broken is when the trace shows the step marked done and the downstream system shows nothing happened. The model called a tool, received a job ID back, treated the job ID as the answer, and moved on. Three minutes later the actual work either succeeded with nobody listening or failed with the error landing in a log nobody reads. The user sees a confident summary; the operations queue sees a stranded task.

This is the failure mode the function-calling abstraction quietly enables. JSON schemas describe parameters and return types, but they do not distinguish between "this tool returns a result" and "this tool returns a receipt for an operation whose result you will need to ask about later." The model treats both the same way, because to the planner they look the same — a successful tool call with a non-error payload.
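One way to make the distinction visible to the planner is to give the two kinds of return different types. The following is a minimal sketch, not any framework's API: `Result`, `Receipt`, `resolve`, and the shape of the status dictionary returned by `poll` are all hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Union


@dataclass(frozen=True)
class Result:
    """The tool finished; `value` is the answer."""
    value: Any


@dataclass(frozen=True)
class Receipt:
    """The tool only accepted the work; `job_id` must be polled for the outcome."""
    job_id: str


ToolOutput = Union[Result, Receipt]


def resolve(
    output: ToolOutput,
    poll: Callable[[str], dict],
    max_polls: int = 10,
    delay_s: float = 0.0,
) -> Any:
    """Return a final value, polling when the tool handed back only a receipt.

    `poll` is an assumed status lookup returning a dict with a "state" key
    ("pending", "succeeded", or "failed") plus "value" or "error".
    """
    if isinstance(output, Result):
        return output.value
    for _ in range(max_polls):
        status = poll(output.job_id)
        if status["state"] == "succeeded":
            return status["value"]
        if status["state"] == "failed":
            raise RuntimeError(f"job {output.job_id} failed: {status.get('error')}")
        time.sleep(delay_s)
    raise TimeoutError(f"job {output.job_id} still pending after {max_polls} polls")
```

With this shape, a step cannot be marked done on a `Receipt` alone: the caller either gets a value or gets an exception that surfaces the failure or the timeout.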

The OAuth Scope Your Agent Acquired Across Chained Tool Calls

· 10 min read
Tian Pan
Software Engineer

A user clicks "Authorize" on your agent's consent screen once. By the time the session ends, that agent has chained through eleven tool calls, negotiated three step-up authorizations, and now holds the union of scopes across every tool it touched. The user remembers granting one thing. Your audit log shows read-write access to half their account. The OAuth standard says everything is working as designed, and that is exactly the problem.

The classical OAuth consent model was built for a world where one app talks to one API. Agents shattered that assumption two years ago, and deployed practice has not caught up, even where the spec has. The result is a category of silent privilege escalation that no one decides to ship; it accretes, one tool registration at a time, while your security review keeps inspecting the front door.
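The accretion can be made measurable by comparing what the user consented to against the union of scopes the session ends up holding. This is a minimal sketch under assumed inputs: the scope strings and the `(tool_name, scopes)` call records are hypothetical, not taken from any particular OAuth provider.

```python
from typing import Iterable, Set, Tuple

ToolCall = Tuple[str, Set[str]]  # (tool name, scopes the call acquired)


def effective_scopes(calls: Iterable[ToolCall]) -> Set[str]:
    """Union of every scope acquired across a chain of tool calls."""
    held: Set[str] = set()
    for _tool, scopes in calls:
        held |= set(scopes)
    return held


def scope_drift(consented: Set[str], calls: Iterable[ToolCall]) -> Set[str]:
    """Scopes the session holds that the user never explicitly granted."""
    return effective_scopes(calls) - set(consented)


# Hypothetical session: one consent, two tools, one step-up.
drift = scope_drift(
    {"calendar.read"},
    [("calendar", {"calendar.read"}), ("mail", {"mail.read", "mail.send"})],
)
```

A non-empty `drift` at session end is the gap between what the user remembers granting and what the audit log shows.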

The Prompt Injection That Survived Your Sanitizer Because the Agent Read It Through a Tool

· 11 min read
Tian Pan
Software Engineer

A team I talked to last month had a clean prompt-injection story. Their gateway ran every user message through a classifier. Anything that scored above a threshold got bounced with a polite error. They benchmarked it against a public adversarial set, hit a 99.4% block rate, and shipped. Two weeks later, a customer-success ticket revealed that the agent had quietly drafted, approved, and sent an email instructing an internal billing tool to refund a stranger's invoice to a new account. The malicious instruction had never touched the user input. It came in through a Confluence page the agent fetched when the user asked, perfectly innocently, "what does our refund policy say?"

That is the failure mode no input sanitizer catches, and it is now the dominant prompt-injection vector in production agents. The classifier you trained on user prompts never saw the payload, because the payload arrived through a different door. By the time the bytes hit the model, the agent had already labeled them as "context I retrieved to help the user," not "untrusted text from a stranger on the internet." The model treats both with the same compliance instinct, because the model has no concept of trust at all.

The Tool Version Bump Your Agent Quietly Adapted To

· 10 min read
Tian Pan
Software Engineer

A downstream search service ships v2.3.2 on a Tuesday afternoon. The release notes mention a renamed status field, a new nullable confidence value, and a reordered array in the result envelope. Nothing in the CHANGELOG is marked breaking. The provider's own client libraries absorb the change in a point release. Your team's HTTP integrations would have logged a deserialization error inside an hour. Your agent — the one routing customer questions through that search tool — does not. It keeps answering. The questions still resolve. The dashboards stay green.

Six weeks later, someone notices that "out of stock" replies have crept up from two percent of queries to eleven. The root cause is the v2.3.2 bump. The rename changed the status string from in_stock to available, and the agent, being a flexible reasoner over text rather than a schema-strict client, interpreted the absence of the old token as "not available," then turned that finding into helpful, confident, wrong customer messages. The contract regression was absorbed on the consumer side, where no test suite was watching.

This is the failure mode that conventional API hygiene was never designed to catch. Strict clients break loudly. Agents break quietly. And the longer you treat your agent like a normal HTTP consumer, the longer this class of bug hides inside metrics that look fine.
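One way to get the loud failure back is to validate tool output against the contract before the model ever sees it. A minimal sketch, assuming a hypothetical result item with a `status` key and an assumed set of previously known values; neither is taken from a real search API.

```python
# Assumed contract for illustration: the status values the integration was built against.
KNOWN_STATUSES = {"in_stock", "out_of_stock"}


class ContractError(ValueError):
    """Raised when a tool result no longer matches the expected contract."""


def validate_search_item(item: dict) -> dict:
    """Fail loudly on an unknown or missing status instead of letting the model guess."""
    status = item.get("status")
    if status not in KNOWN_STATUSES:
        raise ContractError(f"unexpected status {status!r} in search result")
    return item
```

Under this check, a result carrying `available` raises `ContractError` on the first call after the bump, rather than surfacing weeks later as a drift in reply wording.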

The Pointer Your Agent Mistook for a Value: Reference vs Value in Tool Outputs

· 11 min read
Tian Pan
Software Engineer

A search tool returns ten document IDs. An asset tool returns an S3 presigned URL. A database tool returns a row handle. A file tool returns a path. Each of those returns is, formally, a pointer — a small string that names a value the agent does not yet possess. The model's downstream behavior depends entirely on whether it knows that and dereferences before reasoning, or whether it treats the pointer as if it were already the thing.

The failure mode is invisible from the trace. The tool call succeeded. The return is well-formed. The model emitted plausible-looking output. Nothing in the log says "the agent reasoned about a filename and called it a document." The pointer-vs-value confusion sits underneath the visible behavior, in a layer your tool schema never named.

The Sandbox Your Agent Didn't Notice Was Real

· 10 min read
Tian Pan
Software Engineer

A team I know has a textbook staging setup. Read-only replicas of the production database. A mock Stripe account that pretends to charge cards. Synthetic users with fake email addresses on a domain nobody owns. The agent is asked to walk through an "account delinquent" escalation flow in staging, end to end, as part of a release rehearsal. The trace looks clean. The agent does what it is supposed to do.

Three minutes later, a real customer — a paying one, who churned six months ago and was still in a dormant export the developer had used to seed a test fixture — replies to a politely-worded payment-overdue email. The "send_email" tool, registered next to a dozen other tools that all terminate in mocks, was wired to the production Mailgun key. The developer who set it up two sprints earlier had been iterating fast on email templates and the sandbox tier capped them at five emails an hour, which broke the inner loop, so they swapped in the real key "just for the afternoon" and forgot. Nobody re-checked. The agent had no way to know.

The Tool You Added For One Agent Is Now In Every Agent's Hand

· 10 min read
Tian Pan
Software Engineer

Six months ago, somebody on the customer-support team wired a send_email tool for their agent. It worked. The platform team noticed it in the shared tool registry, gave a thumbs-up emoji on the PR, and moved on. This week, a security engineer ran an audit and discovered that send_email is in the action surface of the meeting-notes summarizer, the data-quality bot, an analytics assistant nobody officially owns, and a half-built prototype that hasn't been touched since January. None of these agents need to send email. None of them have ever been reviewed for whether they should be allowed to. The PRD for the meeting-notes summarizer is two sentences long and the words "outbound communication" do not appear in it.

This is the default state of every shared tool registry I have ever audited. The act of registering a tool — pushing a JSON schema and a handler into a central catalog — is treated as a developer convenience, like adding a utility function to a shared library. But once the registry is sourced into every agent's prompt, registering a tool is not a library change. It is a deployment to every agent in the company simultaneously, with no review of whether each of them should have received it.
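The alternative to sourcing the whole registry is an explicit per-agent allowlist, so that registering a tool grants it to no one by default. The sketch below uses hypothetical tool and agent names and placeholder handlers; it illustrates the shape, not a specific registry product.

```python
from typing import Callable, Dict, Set

# Hypothetical shared registry: tool name -> handler.
REGISTRY: Dict[str, Callable[..., str]] = {
    "send_email": lambda to, body: f"sent to {to}",
    "search_docs": lambda query: f"results for {query}",
}

# Hypothetical per-agent grants, each one reviewed separately.
AGENT_TOOLS: Dict[str, Set[str]] = {
    "support_agent": {"send_email", "search_docs"},
    "meeting_notes_summarizer": {"search_docs"},
}


def tools_for(agent: str) -> Dict[str, Callable[..., str]]:
    """Return only the tools explicitly granted to `agent`; unknown agents get none."""
    allowed = AGENT_TOOLS.get(agent, set())
    missing = allowed - REGISTRY.keys()
    if missing:
        raise KeyError(f"{agent} is granted unregistered tools: {sorted(missing)}")
    return {name: REGISTRY[name] for name in sorted(allowed)}
```

The design choice is the default: an agent absent from `AGENT_TOOLS` receives an empty tool set, so a new registration changes no agent's action surface until someone adds a grant.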

MCP Server Sprawl: The Unbounded Tool Surface Nobody Owns

· 9 min read
Tian Pan
Software Engineer

The Model Context Protocol did exactly what it set out to do: it made giving an agent a new capability almost free. Wiring in a calendar server, a database server, an internal company server, or one of the 30,000-tool catalogs that vendors now publish is a config change, not a project. That frictionlessness is the feature. It is also the problem.

Because adding a tool is cheap, every team adds tools. The data team wires in a warehouse server. The support team adds a ticketing server. Someone connects a filesystem server for a one-off task and never removes it. None of these decisions is wrong. But there is no decision that owns their sum — the aggregate tool surface your agent now carries on every single request. The tool list has become a dependency graph with a real carrying cost, and in most organizations it is the one dependency graph nobody is responsible for.

The result is sprawl: a tool catalog that grows monotonically, gets reviewed by no one, costs more every quarter, and quietly makes the agent worse. This is the unowned surface, and it deserves the same scrutiny you already give your API surface and your npm tree.

Your Tool Descriptions Are an Instruction Channel the Model Obeys

· 8 min read
Tian Pan
Software Engineer

When a security team reviews a new tool integration, they read the code. They check what the function does, what it touches, what scopes it needs, whether it logs secrets. They almost never read the one sentence that decides whether the model calls it at all — the tool's description. That sentence is not documentation. It is an instruction the model treats as authoritative, and in most agent stacks nobody reviews it.

A tool description is written for the model to read. The model uses it to decide when the tool is relevant, what arguments to pass, and how to interpret what comes back. That makes the description a control channel into the model's behavior. And the moment a tool arrives from a third-party registry, a Model Context Protocol (MCP) server you don't operate, or a plugin a teammate installed last week, that control channel is authored by someone you never agreed to trust.

This is the gap. Input sanitization inspects what users type. Code review inspects what functions execute. The tool description sits between them — it is configuration that behaves like input — and it falls through both nets.
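One way to pull descriptions into review is to pin a hash of each approved description and refuse to expose any tool whose text is new or has changed. This is a sketch under assumptions: `fingerprint`, `unreviewed_tools`, and the pin store are hypothetical, and only `hashlib` from the standard library is relied on.

```python
import hashlib
from typing import Dict, List


def fingerprint(description: str) -> str:
    """Stable SHA-256 hash of a tool description as the model would read it."""
    return hashlib.sha256(description.encode("utf-8")).hexdigest()


def unreviewed_tools(tools: Dict[str, str], pins: Dict[str, str]) -> List[str]:
    """Names of tools whose description is unpinned or differs from the reviewed text.

    `tools` maps tool name -> current description; `pins` maps tool name -> the
    fingerprint recorded when a human last reviewed that description.
    """
    return sorted(
        name for name, description in tools.items()
        if pins.get(name) != fingerprint(description)
    )
```

Any name returned here would be held back from the model's tool list until someone reads the sentence and records a new pin.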

MCP Tool Deprecation: Why the Model Still Calls the Old Name

· 9 min read
Tian Pan
Software Engineer

You renamed get_user_email to lookup_contact six weeks ago. The new name shipped, the old handler was removed, the changelog noted it, and your eval set passed. Then last Tuesday a customer support engineer pinged you: an agent had returned an error on roughly three percent of its tool calls during the previous week — tool_not_found: get_user_email. The renamed-away name. The one nothing in the live system advertises anymore.

The prior is sticky. The model your agent is talking to was trained on a corpus where get_user_email was overwhelmingly the canonical way to ask "what is this person's email." Even when the tools array you pass at inference time lists only lookup_contact, the model occasionally — under certain context conditions, especially long traces or recovery-after-error states — falls back to the name it remembers. A hard cutover doesn't eliminate the long tail; it just turns soft failures into hard ones.
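A softer cutover keeps the old name routable for a while and counts how often the model still reaches for it. The sketch below assumes a hypothetical in-process dispatcher; `ALIASES`, `alias_hits`, and `dispatch` are illustrative names, and only the two tool names come from the scenario above.

```python
from collections import Counter
from typing import Any, Callable, Dict

# Deprecated name -> current name.
ALIASES: Dict[str, str] = {"get_user_email": "lookup_contact"}

# How often each deprecated name was still called; useful for deciding when to drop the alias.
alias_hits: Counter = Counter()


def dispatch(name: str, args: Dict[str, Any], handlers: Dict[str, Callable[..., Any]]) -> Any:
    """Route a tool call, transparently redirecting deprecated names and recording the hit."""
    if name in ALIASES:
        alias_hits[name] += 1
        name = ALIASES[name]
    if name not in handlers:
        raise KeyError(f"tool_not_found: {name}")
    return handlers[name](**args)
```

The counter turns the long tail into a number: the alias can be removed once `alias_hits` stays flat over a period the team considers long enough.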

The MCP Capability Disclosure Tax: When Every Connected Server Bills Your Context Window

· 11 min read
Tian Pan
Software Engineer

Connect a single GitHub MCP server to your agent and you've already spent twelve to forty thousand tokens before the user types a word. Connect a filesystem server, a calendar, a database, an internal CRM, and a third-party tool catalog, and a heavy desktop configuration has been measured at sixty-six thousand tokens of pure tool disclosure — nearly a third of Claude Sonnet's 200K window, paid every single planning turn. The agent hasn't done anything yet. The user hasn't asked anything yet. The bill is already running.

This is the disclosure tax, and it is the most underpriced line item in agentic systems shipping right now. Teams add MCP servers the way teams once added microservices — each integration looks like a free composition primitive, the procurement story writes itself ("more tools = more capability"), and the unit economics dashboard never surfaces the per-server cost because the cost lives inside a token bucket nobody attributes back to the connector. The result is an agent that gets slower, dumber, and more expensive every time someone adds another integration, and a team that explains the regression by re-tuning prompts and chasing the model vendor for a new version.
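The arithmetic behind the tax is simple enough to put on a dashboard. A minimal sketch using the figures quoted above (66,000 disclosure tokens, a 200K window); the ten-turn session is an assumed number for illustration, and the function names are hypothetical.

```python
def disclosure_share(disclosure_tokens: int, window_tokens: int) -> float:
    """Fraction of the context window consumed by tool disclosure alone."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return disclosure_tokens / window_tokens


def session_disclosure_tokens(disclosure_tokens: int, planning_turns: int) -> int:
    """Total disclosure tokens paid when the full tool list is resent every planning turn."""
    return disclosure_tokens * planning_turns


share = disclosure_share(66_000, 200_000)        # 0.33 of the window
per_session = session_disclosure_tokens(66_000, 10)  # assumed 10 planning turns
```

Attributing `disclosure_tokens` per connected server, rather than per agent, is what lets the cost show up against the connector that caused it.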

The MCP Cold Start Tax: How Tool-Server Overhead Compounds by Agent Step 7

· 11 min read
Tian Pan
Software Engineer

A 200-millisecond tool call looks like noise on a flame graph. Stack seven of them in an agent loop and the noise becomes the signal — the model finishes thinking in 800ms but the user waits 4.5 seconds because every tool invocation re-pays a startup cost the first call already absorbed. The cruel part is that this cost doesn't show up in any single trace as anomalous. It shows up as the difference between a snappy demo and a sluggish production agent, and most teams blame the model.

The Model Context Protocol has become the default integration surface for agent tooling, which means it has also become the default place where latency goes to die. MCP's design — JSON-RPC over stdio or streamable HTTP, capability negotiation, dynamic tool discovery — is correct for a protocol that has to bridge arbitrary clients and servers. But the per-call cost structure it implies is hostile to the access pattern that agents actually have, which is not "one tool call per session" but "seven tool calls per turn for forty turns per session."

This post is about that mismatch: where the cold start tax actually lives, why it compounds rather than amortizes in long-running agents, and the warm-pool discipline that turns a multi-second penalty into a sub-100ms one.
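The core of a warm pool is that the connection and negotiation cost is paid once per server and reused on every later call. The following is a minimal sketch, not an MCP SDK API: `WarmPool` is a hypothetical class and `connect` is an assumed factory that performs whatever startup the server needs and returns a reusable session object.

```python
from typing import Any, Callable, Dict


class WarmPool:
    """Keeps one live session per tool server so startup cost is paid once."""

    def __init__(self, connect: Callable[[str], Any]) -> None:
        self._connect = connect          # assumed: does process spawn / handshake / discovery
        self._sessions: Dict[str, Any] = {}

    def get(self, server: str) -> Any:
        """Return the warm session for `server`, connecting only on first use."""
        if server not in self._sessions:
            self._sessions[server] = self._connect(server)
        return self._sessions[server]

    def evict(self, server: str) -> None:
        """Drop a session (for example after an error) so the next call reconnects."""
        self._sessions.pop(server, None)
```

Seven calls to `pool.get("search")` in one turn trigger a single `connect`; a production version would also need health checks, idle timeouts, and locking for concurrent callers, which this sketch leaves out.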