Build vs Buy for the AI Gateway: The Decision That Locks in Your Next 18 Months
The build-vs-buy decision for an AI gateway is almost never made on a framework. It is made on instinct in week one by an engineer who likes the problem, and then revisited in month nine by a director who is tired of the bill. Neither moment is when the decision should actually be made, and neither party is evaluating the choice on the axes that matter eighteen months from now.
The seductive thing about the build path is that month one is cheap. A two-hundred-line proxy in front of OpenAI, a switch statement that routes "claude" requests to Anthropic, a retry loop, and the team has shipped what looks like a gateway. Month nine, that proxy is twelve thousand lines of half-finished retry logic, prompt caching with broken invalidation, cost attribution that nobody trusts, fallback routing that triggered the wrong way during the last incident, an observability schema that diverged from the rest of the stack, and per-tenant rate limiting bolted on after the first enterprise customer asked. Every feature is a worse copy of something the buy path would have shipped on day one. The engineer who wrote the original two hundred lines has left.
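That month-one proxy fits in a screenful. Here is a minimal sketch of what it usually looks like; the provider table, endpoints, and environment-variable names are illustrative stand-ins, not a recommendation:

```python
import json
import os
import time
import urllib.error
import urllib.request

# Hypothetical provider table: model-name prefix -> (endpoint, key env var).
PROVIDERS = {
    "claude": ("https://api.anthropic.com/v1/messages", "ANTHROPIC_API_KEY"),
    "gpt": ("https://api.openai.com/v1/chat/completions", "OPENAI_API_KEY"),
}

def pick_provider(model: str) -> tuple[str, str]:
    """The 'switch statement': prefix-match the model name to a provider."""
    for prefix, target in PROVIDERS.items():
        if model.startswith(prefix):
            return target
    raise ValueError(f"no provider for model {model!r}")

def complete(model: str, messages: list[dict], retries: int = 3) -> dict:
    """The 'retry loop': no streaming, no cancellation, no budget --
    every gap the month-nine version ends up paying for."""
    url, key_env = pick_provider(model)
    body = json.dumps({"model": model, "messages": messages}).encode()
    req = urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ.get(key_env, '')}",
            "Content-Type": "application/json",
        },
    )
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            if err.code < 500 or attempt == retries - 1:
                raise  # client errors and the final attempt are not retried
            time.sleep(2 ** attempt)  # fixed exponential backoff, no jitter
```

Nothing here is wrong, exactly. It is just the first one percent of the feature list, frozen at the moment it looked finished.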
The seductive thing about the buy path is that month one is uncomfortable. Per-call markup looks visible on the invoice in a way the equivalent engineering hours never will. The "but we want to own this" instinct kicks in, especially in a culture that treats infrastructure ownership as a virtue. Month nine, that team's engineering hours go to the AI features the company is paid for, not to plumbing that every AI team in the industry independently rebuilds.
The framing of "build vs buy" itself hides the real question. The gateway is not one decision. It is roughly a dozen feature decisions bundled into a category, and the right answer is almost always partial buy with surgical build. Knowing which surfaces to keep and which to delegate is what separates the teams that finish their AI product from the teams that finish their AI infrastructure.
The Honest Cost Model for the Build Path
The bill for an internal gateway is not the infrastructure line item. It is the engineering attention that goes into the gateway instead of the product.
A self-hosted LiteLLM-style proxy at production scale runs roughly $200 to $500 per month in infrastructure, plus $100 to $400 in database and storage for logs and metrics, plus $200 to $800 in observability tooling. That is the cheap part. The expensive part is the twenty to thirty percent of a senior engineer's time, which at fully loaded compensation translates to $3,000 to $6,000 a month before any features get written. Production deployments commonly land in the $2,000 to $3,500 per month total cost-of-ownership range once you stop excluding the parts that hurt to count.
That is the steady-state cost. The build cost is separate and front-loaded. The first version of every feature an internal gateway ships has been shipped a hundred times by other teams: retry with exponential backoff, structured logging of token counts, a circuit breaker around a flaky provider, a key vault for provider credentials, a cost attribution table indexed by tenant and route. Each of these is a one-week project the first time and a four-month project the second time, when the team discovers that the first cut did not handle streaming responses, did not handle tool-calling, did not handle cancellation, did not handle the new model's slightly different error envelope.
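Take retry as the concrete case. The second cut, the one that survives contact with production, looks something like the sketch below; the function name and defaults are hypothetical, and the point is what the week-one loop was missing:

```python
import random
import time

def with_retries(fn, *, max_attempts=5, base_delay=0.5, max_delay=8.0,
                 deadline_s=30.0, retryable=(TimeoutError, ConnectionError),
                 sleep=time.sleep, clock=time.monotonic):
    """Retry fn() with capped exponential backoff and full jitter.

    The week-one version is the for-loop. The month-four version is the
    deadline, the jitter, and the allowlist of retryable errors -- and it
    still cannot replay a half-consumed streaming response, which needs
    request-level idempotency rather than a retry wrapper at all.
    """
    start = clock()
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1 or clock() - start > deadline_s:
                raise  # out of attempts or out of time budget
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter avoids thundering herds
```

The `sleep` and `clock` parameters are injected so the thing is testable, which is itself one of the lessons the four-month version teaches.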
The honest TCO question is not "what does the proxy cost to run". It is "how many engineer-quarters of attention have we spent on the gateway that we did not spend on the product the company sells". The teams that estimate this and like the answer almost always have a specific reason: regulated data that cannot leave their VPC, a custom routing policy tied to a proprietary capability, or extreme cost sensitivity at a scale where the markup on a managed gateway becomes meaningful in absolute dollars. The teams that have not estimated it are usually paying the bill anyway, just in a budget line where it is invisible.
What the Buy Path Actually Includes
The reason a managed gateway looks expensive on the invoice is that the invoice is itemizing things the build path tends to forget. A representative feature surface from a production-oriented vendor today covers: a unified API across 100+ providers, structured request and response logging with PII redaction, OpenTelemetry-compatible traces, prompt template versioning with rollback, semantic caching with embedding-based lookup, automatic fallback routing on provider errors or latency thresholds, virtual keys for per-tenant rate limits and budgets, hierarchical spend caps that hard-block at the limit instead of just alerting, guardrails for content moderation and schema enforcement, and an audit log that is acceptable to a SOC 2 auditor.
Pick any single feature on that list and the build estimate is "a week, maybe two". Pick the bundle and the honest estimate is closer to two engineer-years to get to feature parity, and another half-year per year to keep up with model providers that ship breaking changes every quarter.
The cost-of-ownership crossover point shifts with traffic volume. Under roughly $2,000 a month of LLM spend, a managed platform's free tier or per-call markup is materially cheaper than the engineer-hours of running anything self-hosted. Between $2,000 and $10,000 a month, all paths are viable and the decision becomes about feature surface rather than dollar arithmetic. Above $10,000 a month of LLM spend, self-hosting starts to win on raw dollar terms, but only if the team's DevOps capacity is real and only if the alternative is a markup-priced vendor rather than a flat-fee one.
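The crossover arithmetic above can be reduced to a few lines. The thresholds and the assumed 5% per-call markup and ~$3,000/month self-hosting ownership cost are this article's rough estimates, not universal constants:

```python
def gateway_posture(monthly_llm_spend_usd: float,
                    monthly_selfhost_ops_usd: float = 3_000.0,
                    markup_rate: float = 0.05) -> str:
    """Rough decision band from the crossover numbers in the text.

    Assumptions (all illustrative): ~5% per-call vendor markup, and
    self-hosting ownership cost of roughly $3k/month in engineer time.
    """
    if monthly_llm_spend_usd < 2_000:
        return "buy: managed free tier beats any self-hosted ops cost"
    if monthly_llm_spend_usd <= 10_000:
        return "either: decide on feature surface, not dollar arithmetic"
    # Above $10k, a percentage markup can start to exceed flat ops cost.
    if markup_rate * monthly_llm_spend_usd > monthly_selfhost_ops_usd:
        return "build/self-host: markup now exceeds ownership cost"
    return "buy: prefer a flat-fee vendor over a markup-priced one"
```

Running it on a few spend levels mostly confirms the prose: the dollar answer only flips when spend is an order of magnitude past the free-tier band, and only against a markup-priced vendor.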
The trap is treating the crossover as the whole question. Cost is one axis. Differentiation, compliance, exit cost, and operational maturity are the others, and they usually dominate.
The Differentiation Question Most Teams Skip
Before either path is chosen, one question has to land: which gateway features are commodity infrastructure that every AI team needs, and which are load-bearing for our specific product?
The commodity list is long and not very interesting. Provider abstraction, retry with backoff, request logging, basic observability, cost tracking by model and team, semantic caching, content guardrails, virtual key management. No customer chooses a product over a competitor because its retry logic is better. These are features where the right answer is almost always to buy: the vendor has shipped them, hardened them in production at companies larger than yours, and the team that builds them internally is doing infrastructure tourism.
The product-specific list is shorter and worth fighting over. A proprietary routing policy that uses signals only your application has access to: that is differentiation. A custom safety filter trained on your domain's failure modes: that is differentiation. Eval-on-traffic that runs against your team's specific quality bar before a response goes back to the user: that is differentiation. Multi-tenant cost attribution that ties into your custom billing schema and that finance has to reconcile against the books: that is differentiation. These are features where buying generic versions either does not work or actively gets in the way.
The right architectural posture is to buy the commodity surface and build a thin opinionated layer on top of it for the differentiated pieces. The fork in the road is not "use Portkey or write our own". It is "use the vendor for everything that does not differentiate us and own a 1,500-line adapter for the parts that do". Teams that do this finish their product. Teams that draw the line one feature further in either direction either pay a markup on a custom feature they could have owned, or rebuild a commodity feature they should have rented.
Exit Cost Is the Real Vendor Lock-in Question
Vendor lock-in is the boogeyman the build camp invokes and the buy camp dismisses. Both are usually wrong about what the lock-in actually costs.
The data point worth taking seriously: migration costs for enterprise AI infrastructure projects average around $315,000, and recent surveys find 81% of enterprise leaders concerned about AI vendor dependency, with only 6% claiming they could switch providers without disruption. Lock-in is real. It is also not unique to the buy path. A two-engineer-year internal gateway is also a lock-in: rewriting the proxy because the original architect left and the next architect wants a different framework is just as expensive as migrating off a vendor.
The right question is exit cost, not lock-in in the abstract. If the team had to swap from Vendor A to Vendor B, or from a managed gateway to a self-hosted one, in eighteen months, would the answer be "a quarter of disciplined work" or "a six-month rewrite"? If it is a rewrite, the lock-in is structural and worth pricing. If it is a quarter, the lock-in is acceptable and the cost of the alternative path probably exceeds it.
The architectural pattern that keeps exit cost bounded is a stable internal interface that expresses your team's AI capability needs independently of the wire underneath. This is not a thirty-line wrapper that re-exports the vendor's SDK with a different import path. It is an opinionated capability layer where your codebase calls verbs like "summarize", "classify", "extract structured data", and "answer with citations" rather than provider-specific endpoints. The gateway-the-product is what implements those verbs today. If the product changes, only the implementation moves; the calling sites do not. Teams that do this have a quarter-long exit. Teams that wire vendor specifics directly into application code have a rewrite, regardless of whether the gateway is built or bought.
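A capability layer like that can be sketched in a few lines. The interface, the stand-in gateway client, and its `chat` method are all hypothetical; the point is the shape, not the specific verbs:

```python
from typing import Protocol

class AICapabilities(Protocol):
    """The stable internal interface: verbs the product needs,
    expressed independently of any provider or gateway wire format."""
    def summarize(self, text: str, *, max_words: int = 100) -> str: ...
    def classify(self, text: str, labels: list[str]) -> str: ...

class GatewayBackedCapabilities:
    """Today's implementation, wired to whatever gateway is in use.
    The client object and its chat() API are stand-ins, not a real SDK."""
    def __init__(self, client):
        self._client = client

    def summarize(self, text, *, max_words=100):
        return self._client.chat(
            f"Summarize in at most {max_words} words:\n{text}")

    def classify(self, text, labels):
        answer = self._client.chat(
            f"Answer with exactly one of {labels}:\n{text}")
        return answer if answer in labels else labels[0]  # defensive default

# Application code depends only on the verbs. Swapping vendors, or going
# self-hosted, re-implements the class above -- not the calling sites.
def triage_ticket(ai: AICapabilities, ticket_text: str) -> str:
    return ai.classify(ticket_text, ["bug", "billing", "feature-request"])
```

The discipline that matters is the import graph: only the implementation module is allowed to know the vendor exists. That is what makes the exit a quarter instead of a rewrite.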
Decision Hygiene: Make the Choice Visible
The failure mode this post is really about is that the gateway decision is rarely revisited on its own terms. The build decision is made in week one and inherits whatever rationale was true then. The buy decision is made in month nine under pressure from a budget line that suddenly grew. Neither moment forces the actual axes onto a single page.
A workable discipline is short and unglamorous: an annual build-vs-buy decision record, written like an architectural decision record, that names the differentiation axis, the cost axis, the compliance axis, and the exit-cost axis explicitly. It lists the features the team is currently building and the features the team is currently buying. It names the assumptions under which the current split is correct, and it commits to revisit the record the next year. The vendor landscape moves, the team's capacity moves, and the AI features that mattered last year are commodity this year. The record is what keeps the decision honest as those parameters change.
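One possible shape for that record, in the ADR style the paragraph describes. Every figure, module name, and date below is a placeholder for illustration:

```markdown
# ADR-NNN: AI gateway build-vs-buy split (annual review)
Status: accepted · Revisit by: next year's planning cycle

## Differentiation
Building: routing policy using our own session signals; domain safety filter.
Buying: provider abstraction, caching, retries, logging, virtual keys.

## Cost
LLM spend ~$X/mo; vendor fee vs. estimated fraction-of-FTE to self-host.

## Compliance
Audit-log and data-residency requirements, and which party attests to them.

## Exit cost
Only the capability-layer module imports the vendor SDK.
Estimated exit: one quarter. Re-test this estimate at next review.

## Assumptions that would change the split
Spend threshold crossed; vendor ships the feature we build; platform headcount.
```

The value is not the template. It is that the four axes are on one page, signed by the three functions that own them, with a date on which the argument has to be had again.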
The other half of the discipline is the question of who decides. The engineer who likes the problem should not decide alone, and the director who hates the bill should not decide alone. The decision is engineering plus product plus finance, because the axes that matter belong to different functions: cost is finance, differentiation is product, exit cost is engineering, compliance is legal. Any single-function gateway decision is going to be wrong on at least one of the dimensions that matters.
The AI gateway is infrastructure with the blast radius of a database and the differentiation potential of a load balancer. The team that builds without naming what is load-bearing spends a year rebuilding what they could have rented. The team that buys without naming what is strategic discovers the vendor does not ship the thing they actually needed. The teams that do this well sit in between, and they got there because somebody bothered to put the decision on paper.
Sources
- https://www.pkgpulse.com/guides/portkey-vs-litellm-vs-openrouter-llm-gateway-2026
- https://www.truefoundry.com/blog/litellm-pricing-guide
- https://relayplane.com/blog/llm-gateway-comparison-2026
- https://www.braintrust.dev/articles/best-llm-gateways-observability-2026
- https://www.getmaxim.ai/articles/best-ai-gateways-in-2026-a-production-ready-comparison/
- https://www.truefoundry.com/blog/vendor-lock-in-prevention
- https://stepto.net/blog/ai-vendor-lock-in-infrastructure-risk-2026
- https://www.kai-waehner.de/blog/2026/04/06/enterprise-agentic-ai-landscape-2026-trust-flexibility-and-vendor-lock-in/
- https://jimmysong.io/blog/ai-gateway-in-depth/
