
The Internal LLM Gateway Is the New Service Mesh

10 min read
Tian Pan
Software Engineer

Walk into any company with fifty engineers writing LLM code in production and you will find seven gateway-shaped artifacts. The recommendations team built one to route between OpenAI and Anthropic. The support-bot team wrote one to attach their prompt registry. The platform team has a half-finished proxy that handles auth but not rate limiting. The growth team has a Lambda that does PII redaction on its way out. The data-science team is calling the vendor SDK directly and nobody has told them to stop. There is no shared gateway. There are seven shared problems, each solved poorly in isolation, and a CFO who is about to ask why the AI bill grew 40% quarter over quarter with no clear owner for any of it.

This is the same architectural beat the industry hit with microservices in 2016 and 2017. A thousand external dependencies, the same shared concerns at every team — auth, retries, observability, policy — and a choice between solving them once or rediscovering them everywhere. The answer then was the service mesh. The answer now is the internal LLM gateway, and most companies are still in the rediscovering-everywhere phase.

What every team is independently building

The list is remarkably consistent across companies. Once LLM usage moves past the demo stage, every team rediscovers the same set of cross-cutting concerns:

  • Auth to provider APIs — whose key, whose budget, whose audit trail. The shared OPENAI_API_KEY env var that started as a hackathon convenience now charges seven business units to one ledger entry.
  • Routing across providers and models — per-feature, per-tenant, per-cost-tier, with fallback during outages. Nothing in the vendor SDK helps here; the routing logic ends up in application code, copy-pasted (a sketch of exactly this duplicated logic follows the list).
  • Rate limiting — per-user, per-feature, per-tenant. Provider quotas are global to the API key and do not match any unit your business cares about.
  • Prompt registry — where the canonical version of a system prompt lives, who can change it, how it rolls back. In the absence of a registry, the prompt lives in source code, in a feature flag, and in a Notion page that all disagree.
  • Structured-output normalization — JSON mode means different things at different providers. The fallback path that looked drop-in compatible in the runbook activates during a primary outage and 12% of downstream parsers start throwing.
  • Request and response logging — for forensics, evals, and the inevitable incident review six weeks from now.
  • PII redaction at the egress boundary — the surface where customer data crosses into a third-party model. Compliance considers this a control point. Without a gateway, it is a control surface that does not exist.
  • Cost attribution — back to the team or feature that issued the call. Without it, you have one bill and no one to send it to.

Every team that hits this list independently builds a half-baked subset of these capabilities into its own service. The result is the seven-artifact problem: the company has paid for the gateway pattern seven times, owns it zero times, and has shared none of the work.
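
To make the duplication concrete, here is roughly what the routing bullet looks like once it lands in application code. This is a minimal sketch with a hypothetical provider list and a placeholder `call_provider` transport, not any team's actual implementation:

```python
# A sketch of the fallback routing every team ends up writing for itself.
# The provider names and the call_provider transport are placeholders.
import time

PROVIDERS = ["openai", "anthropic"]  # hypothetical preference order


class ProviderError(Exception):
    """Stands in for whatever error taxonomy each team invents."""


def call_provider(provider: str, prompt: str) -> str:
    """Placeholder for the vendor SDK call each team wraps differently."""
    raise NotImplementedError


def complete_with_fallback(prompt: str, retries_per_provider: int = 2) -> str:
    last_err: Exception | None = None
    for provider in PROVIDERS:
        for attempt in range(retries_per_provider):
            try:
                return call_provider(provider, prompt)
            except ProviderError as err:
                last_err = err
                time.sleep(2 ** attempt)  # ad-hoc backoff, also copy-pasted
    raise RuntimeError("all providers exhausted") from last_err
```

Multiply this by seven teams, each with a slightly different backoff constant and a slightly different notion of which errors are retryable, and the seven-artifact problem stops being abstract.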

The pattern that's emerging

The shape that consistently wins is a dedicated LLM gateway sitting in front of every external provider call, owned by a platform team, with a stable internal API that downstream services call instead of the vendor SDK. Modern enterprise gateways now bundle semantic routing, token-aware rate limiting, virtual key management, semantic caching, circuit breakers, and per-team observability into the same control plane. The gateway is the egress point, the policy point, the metering point, and — increasingly — the eval and guardrail integration point.

The vocabulary borrowed from networking is not accidental. The gateway plays the role Envoy played in 2017: a reusable piece of infrastructure that absorbs the cross-cutting concerns nobody wants to write twice, exposes a stable contract upstream, and lets the application teams stop caring about the network. Substitute "model provider" for "service" and the architectural pressures rhyme exactly.

The implementation choices have stabilized too. Most production gateways are HTTP proxies with an OpenAI-compatible API surface, a YAML or database-backed routing config, an adapter layer that translates the canonical request into vendor-specific calls, and a logging tap that writes to a forensic store separate from the standard observability pipeline. The good ones add only single-digit milliseconds of overhead per request. The bad ones double tail latency and become the reason engineers route around them.
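
The adapter layer is where the OpenAI-compatible surface gets earned. A minimal sketch, assuming a canonical OpenAI-shaped chat request; the class and field names are illustrative, though the system-prompt translation it performs reflects a real difference between the OpenAI and Anthropic APIs:

```python
# Sketch of the adapter layer: one canonical request shape in,
# vendor-specific payloads out. Class and field names are illustrative.
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class CanonicalRequest:
    model: str                       # logical model name from the routing config
    messages: list[dict[str, str]]   # OpenAI-style chat messages
    params: dict[str, Any] = field(default_factory=dict)


class VendorAdapter(Protocol):
    def to_vendor_payload(self, req: CanonicalRequest) -> dict[str, Any]: ...


class AnthropicAdapter:
    def to_vendor_payload(self, req: CanonicalRequest) -> dict[str, Any]:
        # Anthropic's Messages API takes the system prompt as a top-level
        # field and requires max_tokens -- exactly the kind of difference
        # the adapter layer exists to absorb.
        system = [m["content"] for m in req.messages if m["role"] == "system"]
        chat = [m for m in req.messages if m["role"] != "system"]
        return {
            "model": req.model,
            "system": "\n".join(system),
            "messages": chat,
            "max_tokens": req.params.get("max_tokens", 1024),
        }
```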

The centralize-vs-edge decision matrix

The hardest design question is not whether to build a gateway. It is what belongs in the gateway and what belongs in the calling service. Get this boundary wrong and you produce either a chokepoint that kills experimentation or a sprawl that defeats the purpose of having a gateway at all.

The split that holds up under load is roughly this: anything that is a governance property must centralize, and anything that is a product property should remain configurable at the edge.

Governance properties — the ones that must live in the gateway (a sketch of this path in code follows the list):

  • Auth and credential management. No application service should hold a vendor API key. The gateway authenticates the caller using internal credentials, then attaches the vendor key on egress.
  • Audit logging. Every external call leaves a record in the same forensic store, with the same retention policy, with the same access controls. Distributed audit is no audit.
  • PII redaction. The egress boundary is the only place where redaction can be enforced rather than recommended. A redaction library shipped to fifty teams is a soft suggestion.
  • Cost attribution. The gateway is where token counts can be tagged with the originating team, feature, and tenant. After the request leaves, the data needed for attribution is gone.
  • Compliance enforcement. Region pinning, retention, vendor allowlists — the rules legal cares about must be enforced in one place or they are not enforced.
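
Here is a minimal sketch of how these controls attach in sequence. Every name and stub is illustrative; the point is that the caller's identity, the vendor key, the redaction pass, and the attribution record all live in the same code path:

```python
# Sketch of the governance path through the gateway. All names are
# illustrative; what matters is where each control attaches.
import time
import uuid

VENDOR_KEYS = {"openai": "sk-gateway-held"}  # held by the gateway, never by callers
FORENSIC_LOG: list[dict] = []                # stands in for the real forensic store


def authorize(internal_token: str) -> dict:
    """Resolve an internal credential to a team/feature/tenant identity (stub)."""
    return {"team": "recs", "feature": "rerank", "tenant": "acme"}


def redact_pii(payload: dict) -> dict:
    """Runs at the only place it can be enforced rather than recommended (stub)."""
    return payload


def send(vendor: str, payload: dict, headers: dict) -> dict:
    """Placeholder for the actual HTTP call out to the vendor."""
    return {"usage": {"total_tokens": 0}}


def egress_call(internal_token: str, vendor: str, payload: dict) -> dict:
    caller = authorize(internal_token)  # caller auth against internal credentials
    payload = redact_pii(payload)
    headers = {"Authorization": f"Bearer {VENDOR_KEYS[vendor]}"}  # vendor key attached here
    response = send(vendor, payload, headers)
    FORENSIC_LOG.append({               # one store, one retention policy
        "request_id": str(uuid.uuid4()),
        "ts": time.time(),
        **caller,                       # attribution captured while the data still exists
        "vendor": vendor,
        "tokens": response.get("usage", {}),
    })
    return response
```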

Product properties — the ones that should live at the edge (a sketch of the edge-owned config follows the list):

  • Prompt content and version selection. Centralizing this turns the gateway into a product chokepoint. Teams should pick their own prompt versions and ship their own changes.
  • Routing strategy per feature. The platform team should not decide whether the recommendations service uses GPT-4o or Claude. The platform team should provide the capability to switch.
  • Caching policies tuned to the workload. A semantic cache that works for FAQ traffic destroys correctness in a code-review surface.
  • Sampling parameters. Temperature, top-p, max tokens belong with the team that owns the eval suite for that surface.
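
One way to picture the edge-owned side is a per-feature config that the calling team edits and ships without a platform review. The feature names, models, and values below are invented for illustration:

```python
# Sketch of the edge-owned config: prompt version, route preference,
# caching, and sampling all stay with the product team. Values invented.
FEATURE_CONFIG = {
    "recommendations.rerank": {
        "prompt_version": "rerank-v14",        # the team picks its own prompt version
        "route": ["gpt-4o", "claude-sonnet"],  # preference order; the gateway executes it
        "cache": None,                         # semantic caching off for a correctness-sensitive surface
        "params": {"temperature": 0.2, "max_tokens": 512},
    },
    "support.faq": {
        "prompt_version": "faq-v3",
        "route": ["gpt-4o-mini"],
        "cache": {"kind": "semantic", "ttl_s": 3600},  # fine for FAQ traffic
        "params": {"temperature": 0.7, "max_tokens": 256},
    },
}
```

The gateway executes the `route` preference order and wraps the governance controls around every call; everything inside this config stays under the product team's control.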

Teams that try to centralize prompts inside the gateway end up with a Conway's-law museum: every prompt change requires a platform-team review, the platform team becomes the bottleneck, the calling teams clone the gateway to escape it, and the gateway loses the audit property that justified its existence.

The political fight nobody wins cleanly

Ownership of the LLM gateway is a turf war disguised as an architecture question. Three teams have legitimate claims and one of them usually wins for the wrong reason.

Security wants the gateway because it is the egress boundary for sensitive data. PII redaction, vendor allowlisting, audit logging — these are the same controls that justified buying a CASB a decade ago, applied to a new egress surface. Security writes a strong document and is correct about what controls must exist.

Infrastructure wants the gateway because it is a reliability surface. Circuit breakers, retries, fallback routing, capacity headroom — these are the same controls infra teams already operate for every other external dependency. Infra writes a strong document and is correct about how the gateway must behave under load.

AI platform wants the gateway because it is where prompt versioning, model selection, and eval integration live. Without this team, the gateway becomes a dumb proxy that misses the half of the problem that matters most for product velocity.

The fight is settled badly more often than not — by whichever team writes the first design doc, or by whichever VP escalates first. The healthy outcome is co-ownership with explicit boundaries: security owns the policy contract, infra owns the runtime, AI platform owns the routing and prompt-integration layer. The unhealthy outcome is one team owning the whole thing, the other two being unhappy about it, and a forked second gateway appearing within nine months.

The failure mode where engineers route around the gateway

A gateway that competes with the direct path on quality loses. This is the most common way the project fails.

The pattern is mechanical: the gateway adds 200ms of latency because it does synchronous PII scanning on every request, or it rejects 0.5% of valid calls under load because the rate limiter has the wrong abstraction, or it lacks support for the new model the product team wants to ship next week. The frustrated team finds out they can still import the vendor SDK and bypass the gateway. They do. The gateway's traffic share starts dropping. The platform team responds by writing a stronger policy document.

This does not work. Voluntary adoption never finishes. The migration path from "every team uses the SDK" to "every team uses the gateway" must be policy-enforced at the network egress layer — vendor API endpoints blocked at the egress firewall except via the gateway, vendor API keys rotated to gateway-only credentials, and the SDK installation path either deprecated or wrapped in a thin shim that points at the gateway. The shape of this enforcement is the same as service-mesh adoption a decade ago: you cannot ask; you have to make the paved road the only road.
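
The shim can be nearly trivial when the gateway exposes the OpenAI-compatible surface described above, because the official SDK already supports pointing at a different base URL. A sketch, with a hypothetical internal hostname and token variable:

```python
# Sketch of the thin shim: the familiar SDK interface stays, but its only
# reachable endpoint is the gateway. The hostname and env var are hypothetical.
import os

from openai import OpenAI  # base_url redirection is a supported SDK feature


def gateway_client() -> OpenAI:
    return OpenAI(
        base_url="https://llm-gateway.internal/v1",  # OpenAI-compatible gateway surface
        api_key=os.environ["LLM_GATEWAY_TOKEN"],     # internal credential, not a vendor key
    )
```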

But the enforcement only works if the paved road is good. A gateway that adds 50ms of overhead, supports every model the product teams need within 48 hours of vendor release, and has better observability than the direct SDK path will be adopted because resisting it costs more than using it. A gateway that adds 200ms, lags on model support by months, and ships less observability than the SDK already does will be routed around no matter how many memos the security team writes.

The architectural realization

The reason this pattern keeps emerging independently is the same reason service mesh emerged independently in 2016 and 2017: the cross-cutting concerns of a thousand-dependency system are not a property of any one application. They are a property of the network the applications share. Solving them in each application means solving them poorly, inconsistently, and seven times.

The companies that recognize this early build the gateway as platform infrastructure with co-ownership across security, infra, and AI platform, treat policy enforcement as an egress-firewall problem rather than a documentation problem, and keep the centralize-vs-edge boundary disciplined enough that product teams keep their velocity.

The companies that recognize it late spend a year on a "consolidation initiative" where seven teams negotiate which features of which existing gateway-shaped artifact survive, while the AI bill keeps growing and the next compliance audit lists "no centralized control over LLM egress" as a finding.

The choice is not whether to build the gateway. The choice is whether to build it once, on purpose, before the audit — or to discover it seven times, accidentally, after.
