The MCP Cold Start Tax: How Tool-Server Overhead Compounds by Agent Step 7

· 11 min read
Tian Pan
Software Engineer

A 200-millisecond tool call looks like noise on a flame graph. Stack seven of them in an agent loop and the noise becomes the signal — the model finishes thinking in 800ms but the user waits 4.5 seconds because every tool invocation re-pays a startup cost the first call already absorbed. The cruel part is that this cost doesn't show up in any single trace as anomalous. It shows up as the difference between a snappy demo and a sluggish production agent, and most teams blame the model.

The Model Context Protocol has become the default integration surface for agent tooling, which means it has also become the default place where latency goes to die. MCP's design — JSON-RPC over stdio or streamable HTTP, capability negotiation, dynamic tool discovery — is correct for a protocol that has to bridge arbitrary clients and servers. But the per-call cost structure it implies is hostile to the access pattern that agents actually have, which is not "one tool call per session" but "seven tool calls per turn for forty turns per session."

This post is about that mismatch: where the cold start tax actually lives, why it compounds rather than amortizes in long-running agents, and the warm-pool discipline that turns a multi-second penalty into a sub-100ms one.

The Three Stacked Costs of a "Cold" Tool Call

When an agent decides to call a remote MCP tool for the first time in a session, three handshakes happen back-to-back, and each one looks reasonable in isolation.

The first is the transport handshake. If the server is on streamable HTTP, that's TCP setup, then TLS — and TLS without session resumption costs a full round-trip plus key derivation. On a cross-region call that's 80–150ms before a single byte of useful payload moves. If the auth flow refreshes an OAuth token on the way through, add another round-trip to the identity provider.
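As a rough sanity check, the transport floor can be modeled as round-trips. The function below is a back-of-envelope sketch, not a measurement: the one-RTT TLS 1.3 assumption and the identity-provider-at-similar-RTT assumption are simplifications.

```python
def transport_floor_ms(rtt_ms: float, tls_resumed: bool = False,
                       oauth_refresh: bool = False) -> float:
    """Rough lower bound on transport setup before any MCP payload moves.

    Illustrative model: one RTT for TCP setup, one more for a full
    TLS 1.3 handshake (skipped on session resumption), and one RTT to
    the identity provider if an OAuth token refresh is on the path.
    """
    cost = rtt_ms                  # TCP SYN / SYN-ACK
    if not tls_resumed:
        cost += rtt_ms             # TLS 1.3 full handshake: 1-RTT
    if oauth_refresh:
        cost += rtt_ms             # token refresh round-trip
    return cost

# A cross-region call at ~75ms RTT with a token refresh on the path:
print(transport_floor_ms(75, oauth_refresh=True))  # 225.0
```

At a 40-75ms cross-region RTT, TCP plus a full TLS handshake alone lands in the 80-150ms band quoted above, before key derivation or any application byte.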

The second is the MCP capability negotiation. The client sends initialize, the server responds with its supported protocol version and capabilities, the client sends initialized, and only then is the session usable for tool calls. For a healthy server this is sub-100ms, but it's mandatory and serial — the client cannot pipeline a tools/call ahead of the negotiation completing.
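The ordering constraint is easy to see in a toy state machine. McpHandshake below is illustrative only, not the real SDK; it just encodes the rule that tools/call is rejected until the initialize exchange completes.

```python
class McpHandshake:
    """Toy model of MCP session state: tools/call is only legal after
    the initialize -> initialized exchange. Illustrative, not the SDK."""

    def __init__(self):
        self.state = "new"

    def send(self, method: str):
        if method == "initialize" and self.state == "new":
            self.state = "initializing"       # server replies with version + capabilities
            return "capabilities"
        if method == "notifications/initialized" and self.state == "initializing":
            self.state = "ready"
            return None                       # notification, no response expected
        if method == "tools/call" and self.state == "ready":
            return "result"
        raise RuntimeError(f"{method} not allowed in state {self.state}")

s = McpHandshake()
s.send("initialize")
s.send("notifications/initialized")
print(s.send("tools/call"))  # result
```

The serial part is the point: there is no state in which the client can legally put tools/call on the wire before the second round-trip finishes.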

The third is the tool discovery and schema fetch. tools/list returns the entire catalog of tools the server exposes, with full JSON Schema for each one. For a server publishing forty tools with rich schemas, the response can be tens of kilobytes and the server may be doing real work to assemble it — querying a downstream registry, computing schema fragments, materializing examples. Frameworks like the OpenAI Agents SDK call list_tools() on every agent run by default, which means this cost recurs unless the developer explicitly opts into cache_tools_list.
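The mitigation is memoization. The sketch below shows the generic pattern behind options like cache_tools_list; ToolListCache, its fetch callable, and the TTL policy are hypothetical names for illustration, not any SDK's actual API.

```python
import time

class ToolListCache:
    """Memoize tools/list per server with a TTL, so repeated agent runs
    reuse the schema catalog instead of re-fetching it. Sketch only."""

    def __init__(self, fetch, ttl_s: float = 300.0, now=time.monotonic):
        self.fetch = fetch      # callable: server_id -> list of tool schemas
        self.ttl_s = ttl_s
        self.now = now
        self._cache = {}        # server_id -> (fetched_at, tools)

    def tools(self, server_id: str):
        hit = self._cache.get(server_id)
        if hit and self.now() - hit[0] < self.ttl_s:
            return hit[1]       # warm: no network, no schema assembly
        tools = self.fetch(server_id)
        self._cache[server_id] = (self.now(), tools)
        return tools

calls = []
cache = ToolListCache(lambda s: calls.append(s) or [{"name": "search"}])
cache.tools("search-mcp"); cache.tools("search-mcp")
print(len(calls))  # 1 -- the second run hit the cache
```

The TTL matters because MCP servers can change their tool catalog at runtime; an unbounded cache trades the latency tax for a staleness bug.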

Add the three together and a fresh session into a remote MCP server commonly costs 400–900ms before the first useful tool result lands. A team that benchmarks "tool latency" by running a single tool call from a fresh client is measuring this entire stack and attributing it to the tool. A team that measures the second call gets a number that looks ten times faster, and concludes — wrongly — that the protocol is fast.
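A fairer benchmark times the cold and warm paths separately. The harness below is a sketch: connect and call are caller-supplied stand-ins for real session setup and a real tools/call, so the numbers attribute handshake cost to the handshake rather than to the tool.

```python
import time

def cold_vs_warm(connect, call, n_warm: int = 5):
    """Time the first (cold) call separately from subsequent (warm) calls.

    connect: () -> session, paying transport + negotiation + discovery.
    call: (session) -> result, one tool invocation.
    Returns (cold_ms, mean_warm_ms).
    """
    t0 = time.perf_counter()
    session = connect()                  # the full cold-start stack
    call(session)                        # first tool call on the session
    cold_ms = (time.perf_counter() - t0) * 1000
    warm = []
    for _ in range(n_warm):
        t0 = time.perf_counter()
        call(session)
        warm.append((time.perf_counter() - t0) * 1000)
    return cold_ms, sum(warm) / len(warm)
```

Reporting both numbers, rather than either alone, is what surfaces the tenfold gap described above instead of hiding it on one side or the other.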

Why Step 7 Is Where It Hurts

The cold start cost is paid once per connection, not once per call, so naive intuition says it amortizes away in any session that does more than one tool call. That intuition is wrong for three reasons that compound in real agent workloads.

The first is that agent sessions touch many servers, not one. A research agent might pull from a search MCP, a calendar MCP, a vector store MCP, a code execution MCP, and a filesystem MCP within a single user turn. Each of those is a separate connection with its own cold start. The session amortizes within a server but pays full freight every time it crosses to a new one, and modern agent stacks routinely federate across five to ten servers.
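The per-turn arithmetic is worth making explicit. The sketch below uses the 400-900ms range from the previous section as an illustrative constant (650ms as a midpoint); the numbers are for shape, not precision.

```python
def turn_cold_cost_ms(servers_touched, warm_pool, cold_ms=650.0):
    """Cold-start tax for one agent turn: each server not already in the
    warm pool pays the full handshake stack. Illustrative midpoint cost."""
    new = [s for s in servers_touched if s not in warm_pool]
    warm_pool.update(new)
    return len(new) * cold_ms

pool = set()
print(turn_cold_cost_ms(["search", "calendar", "vector"], pool))  # 1950.0
print(turn_cold_cost_ms(["search", "code-exec"], pool))           # 650.0
```

The asymmetry is the lesson: the first turn of a five-server session can carry more than three seconds of pure handshake, and every later turn that touches a new server re-pays a slice of it.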

The second is that frameworks tear down connections aggressively. Many client implementations open an MCP session per agent run rather than per process, on the theory that sessions are cheap. They're cheap once you've paid for them; it's the eager teardown that makes them expensive. A new run, a new tool call, a fresh handshake. If your harness spawns a subprocess for each task, every subprocess pays the full tax on its first call into every server it talks to.
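The fix is to key sessions to the process, not the run. SessionPool below is a minimal sketch of that discipline: the connect factory is an assumption, and liveness checks, eviction, and locking are deliberately omitted.

```python
class SessionPool:
    """Process-lifetime pool of MCP sessions, keyed by server id, so
    agent runs reuse connections instead of handshaking per run.
    Sketch only: no health checks, eviction, or thread safety."""

    def __init__(self, connect):
        self.connect = connect        # callable: server_id -> session
        self._sessions = {}

    def get(self, server_id: str):
        if server_id not in self._sessions:
            # Pay the cold-start tax exactly once per server, per process.
            self._sessions[server_id] = self.connect(server_id)
        return self._sessions[server_id]

    def close_all(self):
        for s in self._sessions.values():
            s.close()
        self._sessions.clear()

opened = []
class FakeSession:
    def close(self): pass
pool = SessionPool(lambda sid: opened.append(sid) or FakeSession())
pool.get("search-mcp"); pool.get("search-mcp"); pool.get("calendar-mcp")
print(opened)  # ['search-mcp', 'calendar-mcp']
```

A real pool also needs a liveness probe before handing a session out, because a transport the server silently closed is worse than a cold one: the client pays a timeout before it pays the handshake.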

The third, and the one teams underestimate most, is token-budget-driven schema reloading. As agents accumulate context across long sessions, frameworks compact or evict messages, and tool schemas — which can be enormous — get evicted with them. The next call then needs the schema back in context, which forces a re-fetch on a notionally warm connection. The connection is warm at the transport layer but cold at the protocol layer, and the latency difference is invisible to a transport-only health check.

By step seven of a multi-server agent loop, you've typically paid two to four cold-start taxes that you didn't anticipate when you sized the latency budget. The model finishes its turn in under a second; the user feels a five-second delay; the trace blames "tool execution time" because the handshake is folded into the tool-call span and nothing breaks the negotiation overhead out separately.

The Latency Budget Most Teams Are Not Tracking
