What Nobody Tells You About Running MCP in Production

April 8, 2026 · 10 min read

Software Engineer

The Model Context Protocol sells itself as a USB-C port for AI — plug any tool into any model and watch them talk. In practice, the first day feels like that. The second day you hit a scaling bug. By the third day you're reading CVEs about tool poisoning attacks you didn't know existed.

MCP is a genuinely useful standard. Introduced in late 2024 and quickly adopted across the industry, it has solved real integration friction between LLMs and external systems. But the gap between "got a demo working" and "running reliably under load with real users" is larger than most teams expect. Here's what that gap actually looks like.

The Transport Decision You'll Regret Getting Wrong

The first architectural decision in any MCP deployment — transport mechanism — is also the one teams most frequently revisit. MCP supports three transports: STDIO, HTTP+SSE, and Streamable HTTP.

STDIO is for local tooling. If you're building something that runs as a subprocess on a developer's laptop, it's fine. The moment you want multi-user remote access, you need HTTP.

HTTP+SSE was the original remote transport. It's now deprecated. The problem is inherent to SSE: it requires a persistent, long-lived connection per session. This makes serverless deployment economically painful — platforms like Lambda or Cloud Run that scale to zero keep warm connections alive and accumulate costs. Worse, stateful SSE connections break when you put multiple server instances behind a load balancer unless you implement sticky routing or exernal session storage.

Streamable HTTP is the right choice for production. It allows stateless operation across server instances, works correctly behind standard load balancers, and eliminates the connection-per-session constraint. If you're starting today, skip SSE entirely and design for Streamable HTTP from day one. Migrating later means updating every client that connects to your server — and clients don't always update quickly.

The practical test before you commit to your transport layer: can your server survive a restart mid-session without breaking active agent workflows? If the answer is no, you haven't finished the infrastructure work.

Tool Design Is a Prompting Problem

Here's an insight that takes most teams a few iterations to internalize: the quality of your MCP tool implementation is only half tool code. The other half is the schema you expose to the model.

LLMs decide which tool to call — and how to call it — based almost entirely on your tool names and descriptions. When two tools have similar names or overlapping responsibilities, models get confused and pick the wrong one. When parameter descriptions are vague, models pass invalid values and trigger errors. When a tool returns a 3,000-token JSON blob, the agent has to spend context budget parsing it before it can do anything useful.

Specific guidance that holds up in practice:

Fewer, richer tools beat many thin ones. A tool called search_documents that supports filtering, sorting, and pagination outperforms three separate tools called search_documents, filter_documents, and paginate_results. Composition at the model level works poorly when it requires reasoning about which sequence of tools to chain.

Write descriptions as if the model has never seen your system. The description is documentation for the LLM, not for a human developer. Include what the tool is for, what it is not for, what constraints apply, and what callers should expect in the response. "Searches documents" is not a description. "Searches indexed documents by keyword or semantic query. Returns up to 20 results ranked by relevance. Use this when the user asks to find, look up, or locate information." is closer.

Keep responses small and structured. If your tool might return large payloads, add pagination and make the agent call again for the next page. Summarizing with a secondary LLM call before returning is worth the latency when the alternative is blowing through context limits on a single tool response.

Validate schemas against the model you're actually using. Providers often claim support for JSON Schema constraints — enum, minimum, pattern — but don't enforce them. Test your schema with your actual model and verify that invalid inputs produce the expected rejections rather than silently passing through.

Security: The Attack Surface You Built Without Meaning To

The MCP security threat model is different from a conventional API's. The attack surface isn't just your server code — it's the LLM's behavior when it processes tool metadata.

Tool poisoning is the class of attack that blindsided most teams when it went public. The mechanism: an attacker embeds malicious instructions inside a tool's description field. The instructions are invisible to the user reviewing the tool list but are processed verbatim by the language model. A poisoned tool description can instruct the model to exfiltrate data, execute additional calls, or bypass application-level safety checks — all without the user seeing anything unusual.

In benchmark testing across major LLM agents, tool poisoning attack success rates were alarming. More capable models were often more vulnerable because the attack exploits instruction-following capability directly.

Rug pull attacks extend this further. MCP tools can update their own definitions after a user has approved them. An attacker who controls a third-party MCP server can deploy a benign-looking tool, wait for installations to accumulate, then silently update the description to include malicious instructions. The user approved the original version; they have no visibility into what the tool says about itself today.

CVE-2025-6514 demonstrated what full-compromise looks like: a malicious MCP server exploited a command injection flaw to execute arbitrary code on connected clients. Researchers separately demonstrated exfiltrating a user's entire chat history by combining a poisoned tool with a legitimate messaging integration.

What actually mitigates this:

Pin third-party MCP server versions and treat any update as requiring re-review. Don't auto-update.
Run each MCP server in the minimum privilege context it needs. Database servers should be read-only by default. Filesystem servers should be scoped to specific directories. Never expose raw shell execution without an explicit, audited sandbox.
Separate credentials from the agent. API keys and tokens should live in your gateway layer and be injected server-side at execution time, not placed in the agent's context where they can be leaked.
Log every tool invocation with the caller identity, parameters, and response. When something goes wrong, you need the audit trail.

The failure mode teams hit most often is deploying a collection of community MCP servers without reading the code. Any MCP server runs with your application's credentials. Treat installation of a third-party MCP server the same way you'd treat adding an arbitrary npm package that has your database password.

The Gateway Pattern and Why It Matters at Scale

Early MCP deployments tend to wire tool servers directly to the LLM. This works for single-user or low-volume scenarios. It fails in multi-tenant applications where you need per-user access control, rate limiting, and a coherent audit log.

The production pattern that scales is a centralized MCP gateway — a single service that brokers all tool access. The gateway handles authentication, authorization, rate limiting, tool registration and discovery, and centralized logging. Individual tool servers talk to the gateway; the LLM talks to the gateway; users never interact with tool servers directly.

This architecture has a few concrete benefits. First, it eliminates the multitenancy data-leakage problem: the gateway enforces that user A can only call tools with user A's credentials and can only access user A's data. Second, it centralizes the observability problem: instead of instrumenting each tool server separately, you get call graphs, latency histograms, and error rates from one place. Third, it lets you add or remove tool servers without changing the LLM's configuration.

Cold start latency is a real concern if you put serverless functions behind the gateway. Lambda cold starts run around 5 seconds, which is long enough to break the perceived responsiveness of an agent. For interactive use cases, keep at least the gateway itself warm.

Session State Is More Complicated Than It Looks

MCP has a session model, but most tutorials don't explain what happens when sessions need to survive server restarts or route across multiple instances.

In STDIO mode, a session lives for the duration of the subprocess. Clean and simple. In remote HTTP mode, sessions need somewhere to live that isn't the server's in-process memory — which means Redis or a database if you're running more than one instance.

The specific scenarios that will break your stateless assumptions: a load balancer routes the second turn of a conversation to a different server instance than the first; a deployment restarts in the middle of a long-running agent workflow; a user's browser closes and reopens during a multi-step operation.

The session management requirements for MCP aren't novel — they're the same requirements as any stateful web application. The mistake is assuming you don't have them because MCP looks like a simple RPC protocol.

Cache expensive operations at the session level — authorization checks, expensive queries, pagination cursors — but scope every cache entry to the individual user. Global caches without user scoping are multitenancy vulnerabilities waiting to trigger.

What Operational Readiness Actually Looks Like

The benchmark for MCP operational readiness isn't passing a health check. It's the following questions:

Observability: Can you tell, for any tool invocation in the last 30 days, who called it, with what parameters, what it returned, and how long it took? If not, you're operating blind.

Incident response: When an agent takes an unexpected action through a tool — creates a record it shouldn't, sends a message to the wrong recipient — do you have enough audit data to reconstruct exactly what happened? Can you roll back the action if it's reversible?

Version control: Do you track what version of each third-party MCP server is deployed? Do you have a process for reviewing changes before applying updates?

Failure modes: What does the agent do when a tool server is down? Does it retry indefinitely, time out gracefully, or fail open in a way that could cause unintended behavior?

Teams that answer "yes" to these questions have earned the right to run MCP in production. Teams that can't answer them have built a demo that happens to be serving real users.

The Standard Is Young; Build for Change

MCP is still evolving. The transport layer went from SSE to Streamable HTTP inside a year. Authentication is getting a significant redesign to support enterprise SSO flows. Horizontal scaling patterns are still settling.

This is normal for a protocol that went from initial release to widespread adoption faster than most open standards. What it means practically: don't build deep coupling to implementation details that might change. Keep tool servers thin and focused on business logic. Put the infrastructure concerns — auth, logging, routing, rate limiting — in the gateway layer where they can be updated without touching individual tool implementations.

The teams that are thriving with MCP are the ones who treated it the same way they'd treat any external-facing API: designed for failure, scoped permissions tightly, and built observability before they needed it. The teams struggling are the ones who moved fast on the integration and slow on the operations. That's not a new story, but MCP makes it unusually easy to skip the infrastructure work when the initial prototype feels so smooth.

Build the gateway. Pin your dependencies. Log everything. The protocol will keep improving; your production system needs to be reliable while that happens.

References:

Let's stay in touch and Follow me for more thoughts and updates

Twitter LinkedIn Telegram Discord 小红书

What Nobody Tells You About Running MCP in Production

The Transport Decision You'll Regret Getting Wrong

Tool Design Is a Prompting Problem

Security: The Attack Surface You Built Without Meaning To

The Gateway Pattern and Why It Matters at Scale

Session State Is More Complicated Than It Looks

What Operational Readiness Actually Looks Like

The Standard Is Young; Build for Change

Recommended Reading

About Tian Pan

The Transport Decision You'll Regret Getting Wrong​

Tool Design Is a Prompting Problem​

Security: The Attack Surface You Built Without Meaning To​

The Gateway Pattern and Why It Matters at Scale​

Session State Is More Complicated Than It Looks​

What Operational Readiness Actually Looks Like​

The Standard Is Young; Build for Change​

Recommended Reading

About Tian Pan

The Transport Decision You'll Regret Getting Wrong

Tool Design Is a Prompting Problem

Security: The Attack Surface You Built Without Meaning To

The Gateway Pattern and Why It Matters at Scale

Session State Is More Complicated Than It Looks

What Operational Readiness Actually Looks Like

The Standard Is Young; Build for Change