The Internal LLM Gateway Is the New Service Mesh
Walk into any company with fifty engineers writing LLM code in production and you will find seven gateway-shaped artifacts. The recommendations team built one to route between OpenAI and Anthropic. The support-bot team wrote one to attach their prompt registry. The platform team has a half-finished proxy that handles auth but not rate limiting. The growth team has a Lambda that does PII redaction on its way out. The data-science team is calling the vendor SDK directly and nobody has told them to stop. There is no shared gateway. There are seven shared problems, each solved poorly in isolation, and a CFO who is about to ask why the AI bill grew 40% quarter over quarter with no clear owner for any of it.
This is the same architectural beat the industry hit with microservices in 2016 and 2017. A thousand external dependencies, the same shared concerns at every team — auth, retries, observability, policy — and a choice between solving them once or rediscovering them everywhere. The answer then was the service mesh. The answer now is the internal LLM gateway, and most companies are still in the rediscovering-everywhere phase.
What every team is independently building
The list is remarkably consistent across companies. Once LLM usage moves past the demo stage, every team rediscovers the same set of cross-cutting concerns:
- Auth to provider APIs — whose key, whose budget, whose audit trail. The shared
OPENAI_API_KEYenv var that started as a hackathon convenience now charges seven business units to one ledger entry. - Routing across providers and models — per-feature, per-tenant, per-cost-tier, with fallback during outages. Nothing in the vendor SDK helps here; the routing logic ends up in application code, copy-pasted.
- Rate limiting — per-user, per-feature, per-tenant. Provider quotas are global to the API key and do not match any unit your business cares about.
- Prompt registry — where the canonical version of a system prompt lives, who can change it, how it rolls back. In the absence of a registry, the prompt lives in source code, in a feature flag, and in a Notion page that all disagree.
- Structured-output normalization — JSON mode means different things at different providers. The fallback path that looked drop-in compatible in the runbook activates during a primary outage and 12% of downstream parsers start throwing.
- Request and response logging — for forensics, evals, and the inevitable incident review six weeks from now.
- PII redaction at the egress boundary — the surface where customer data crosses into a third-party model. Compliance considers this a control point. Without a gateway, it is a control surface that does not exist.
- Cost attribution — back to the team or feature that issued the call. Without it, you have one bill and no one to send it to.
Every team that hits this list independently builds a half-baked subset of these capabilities into their own service. The result is the seven-artifact problem: the company has paid for the gateway pattern seven times, owns it zero times, and has shared none of the work.
The pattern that's emerging
The shape that consistently wins is a dedicated LLM gateway sitting in front of every external provider call, owned by a platform team, with a stable internal API that downstream services call instead of the vendor SDK. Modern enterprise gateways now bundle semantic routing, token-aware rate limiting, virtual key management, semantic caching, circuit breakers, and per-team observability into the same control plane. The gateway is the egress point, the policy point, the metering point, and — increasingly — the eval and guardrail integration point.
The vocabulary borrowed from networking is not accidental. The gateway plays the role Envoy played in 2017: a reusable piece of infrastructure that absorbs the cross-cutting concerns nobody wants to write twice, exposes a stable contract upstream, and lets the application teams stop caring about the network. Substitute "model provider" for "service" and the architectural pressures rhyme exactly.
The implementation choices have stabilized too. Most production gateways are HTTP proxies with an OpenAI-compatible API surface, a YAML or database-backed routing config, an adapter layer that translates the canonical request into vendor-specific calls, and a logging tap that writes to a forensic store separate from the standard observability pipeline. The good ones add only single-digit microseconds of overhead per request. The bad ones double tail latency and become the reason engineers route around them.
The centralize-vs-edge decision matrix
The hardest design question is not whether to build a gateway. It is what belongs in the gateway and what belongs in the calling service. Get this boundary wrong and you produce either a chokepoint that kills experimentation or a sprawl that defeats the purpose of having a gateway at all.
The split that holds up under load is roughly this: anything that is a governance property must centralize, and anything that is a product property should remain configurable at the edge.
Governance properties — the ones that must live in the gateway:
- Auth and credential management. No application service should hold a vendor API key. The gateway authenticates the caller using internal credentials, then attaches the vendor key on egress.
- Audit logging. Every external call leaves a record in the same forensic store, with the same retention policy, with the same access controls. Distributed audit is no audit.
- PII redaction. The egress boundary is the only place where redaction can be enforced rather than recommended. A redaction library shipped to fifty teams is a soft suggestion.
- Cost attribution. The gateway is where token counts can be tagged with the originating team, feature, and tenant. After the request leaves, the data needed for attribution is gone.
- https://www.truefoundry.com/blog/llm-gateway-on-premise-infrastructure
- https://www.truefoundry.com/blog/llm-gateway
- https://www.getmaxim.ai/articles/top-5-llm-gateways-in-2026-for-enterprise-grade-reliability-and-scale/
- https://www.getmaxim.ai/articles/top-5-enterprise-llm-gateways-in-2026/
- https://appscale.blog/en/blog/enterprise-llm-gateway-architecture-routing-rate-limiting-observability-2026
- https://jimmysong.io/blog/ai-gateway-in-depth/
- https://www.solo.io/topics/ai-connectivity/llm-traffic-governance-gateway-strategies-for-secure-ai
- https://collinwilkins.com/articles/llm-gateway-architecture
- https://medium.com/@bijit211987/llm-traffic-control-gateway-or-router-or-proxy-4f8c93ddf67b
- https://www.merge.dev/blog/llm-gateway
- https://portkey.ai/blog/what-is-an-llm-gateway/
- https://istio.io/latest/docs/overview/dataplane-modes/
- https://www.gravitee.io/blog/how-to-prevent-pii-leaks-in-ai-systems-automated-data-redaction-for-llm-prompt
- https://mlflow.org/ai-gateway
- https://www.databricks.com/blog/ai-gateway-governance-layer-agentic-ai
- https://www.zenml.io/llmops-database/building-a-multi-provider-genai-gateway-for-enterprise-scale-llm-access
