The Vendor-Portability Tax: Why 'We Can Swap Models' Is a Quarterly Cost Line, Not a Checkbox
Every team I have audited in the last six months claims to be vendor-agnostic. None of them are. The system prompt that scored highest on the eval suite did so because it leaned into a single vendor's tokenizer behavior, JSON-mode contract, refusal cadence, and stop-sequence handling — and the team that wrote it could not name which of those biases were doing the work. When the CFO asks why the cheaper model on the procurement deck cannot just be dropped in, the honest answer is two engineer-quarters of prompt re-tuning and a complete re-baseline of every eval. That is not a checkbox. It is a quarterly cost line.
The mental model that keeps biting teams is treating vendor portability as a one-time architecture decision. You add an abstraction layer, you write a `model:` field in your config, you congratulate yourself, and you move on. Then a year later the vendor raises prices, ships a deprecation notice, or has a bad week of refusals on a category you care about, and you discover that the abstraction was a thin wrapper around a prompt that only works on one model. The portability you bought was syntactic. The portability you needed was behavioral, and behavioral portability decays the moment you stop paying for it.
This post is about the actual price tag of "we can swap." Not the architecture diagrams that pretend it is free, but the disciplines you have to fund every quarter so that the claim stays true: a tokenizer-agnostic prompt style guide, an eval suite that runs cross-vendor with delta tracking, a routing layer where the prompt is a function of the target model, and a regression budget that treats parity as a maintained property rather than a one-time port.
The Hidden Single-Vendor Tuning in Your "Portable" Prompts
Audit the top ten prompts in your repo against three other providers and you will find specifics that look generic but are not. XML-tagged scaffolding (`<format>`, `<context>`) that one model follows precisely while another over-explains around it. Markdown headings and numbered delimiters another model leans on for instruction-following. Implicit refusal phrasings tuned to one vendor's policy surface so that the prompt sidesteps a specific rejection class without ever naming it. Stop-sequence assumptions that align with one tokenizer's whitespace handling. Few-shot examples whose ordering was discovered by trial and error against a single model's attention pattern.
None of these read as vendor-specific in a code review. They read as "the version that scored highest on evals." That is exactly the problem. The eval suite ranks prompts on the model you tuned against, so a prompt that exploits one vendor's biases will always win against a portable prompt on the same eval — even though the portable prompt would beat it on a cross-vendor average. The selection process is the lock-in. You do not have a portability problem because someone made bad architecture decisions; you have a portability problem because your eval loop has been silently selecting for it.
The fix is not to ban vendor-specific prompting. It is to name it. Maintain a tokenizer-agnostic prompt style guide that lists known model-specific tics — XML tags, markdown structure preferences, refusal-avoidance phrasings, stop-sequence patterns — and flags any prompt that uses them as a "tuned" variant. When a tuned variant exists, the portable variant must also exist and must be evaluated alongside it. The cost delta between the two is your real lock-in number, and it should be a metric the platform team reports quarterly.
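The style-guide check itself can be small. Here is a minimal sketch, assuming a hypothetical per-provider tic registry — the pattern names and regexes are illustrative stand-ins, not a real ruleset:

```python
import re

# Illustrative tic patterns from a hypothetical style guide; the real list
# would be maintained per provider and updated as behavior drifts.
VENDOR_TICS = {
    "xml_scaffolding": re.compile(r"</?(?:format|context|thinking)>"),
    "markdown_heading_structure": re.compile(r"^#{1,3}\s", re.MULTILINE),
    "stop_sequence_whitespace": re.compile(r"\n\n###"),
}

def audit_prompt(prompt_text: str) -> list[str]:
    """Return the vendor-specific tics found in a prompt.

    A non-empty result means the prompt must be registered as a 'tuned'
    variant, and a portable sibling must exist and be evaluated alongside it.
    """
    return [name for name, pattern in VENDOR_TICS.items()
            if pattern.search(prompt_text)]

def lockin_delta(tuned_score: float, portable_score: float) -> float:
    """The eval-score gap between the tuned and portable variants.

    This is the per-prompt lock-in number the platform team reports quarterly.
    """
    return tuned_score - portable_score
```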
Cross-Vendor Eval Is a Subscription, Not a Project
Most teams run their eval suite against one model. A few sophisticated teams run it against two. Almost nobody runs it against the providers on their fallback list, on a continuous schedule, with score deltas tracked over time. That last sentence is what cross-vendor parity actually requires, and it costs real money — both in inference spend and in engineer-hours to keep the harness honest.
The shape of the discipline is straightforward. Pick the providers you would consider switching to within twelve months. Run your full eval suite against each of them on a weekly cadence. Track the delta per task and per metric, not just the aggregate. Alert when any vendor's delta to your primary moves more than a configured threshold — that is a behavioral changelog signal, and it usually arrives weeks before the vendor announces it. The eval suite becomes a continuous market signal about portability cost, not a one-time port report.
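The delta-tracking core of that harness is not much code. A minimal sketch, assuming weekly per-task scores are already collected from each provider; the data shapes and the 0.05 threshold are illustrative:

```python
from dataclasses import dataclass

@dataclass
class EvalRun:
    provider: str
    scores: dict[str, float]  # task name -> metric score for this run

def deltas_to_primary(primary: EvalRun, secondary: EvalRun) -> dict[str, float]:
    """Per-task score gap between the primary provider and one fallback."""
    return {task: primary.scores[task] - secondary.scores[task]
            for task in primary.scores if task in secondary.scores}

def drift_alerts(previous: dict[str, float], current: dict[str, float],
                 threshold: float = 0.05) -> list[str]:
    """Tasks whose delta moved more than the threshold week-over-week.

    A firing alert is the behavioral-changelog signal: one provider's
    behavior moved relative to the other before anyone announced anything.
    """
    return [task for task in current
            if task in previous and abs(current[task] - previous[task]) > threshold]
```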
The expense is real. If your primary eval suite costs X per run and you add three secondary providers on a weekly cadence, you are paying roughly 4X per week in eval inference alone, plus engineer time to maintain provider-specific adapters when each vendor's API drifts. Teams that try to do this on a project basis end up rebaselining everything from scratch every six months because the eval data is stale and the harness has rotted. The teams that treat it as a subscription pay continuously and always have a current portability number on hand.
There is a leverage payoff that justifies the burn. The moment you can walk into a renewal conversation with current eval scores against three competing providers, the negotiating posture changes. Vendor sales teams know which customers can actually move and which are bluffing. The eval suite is the tell. Without it, your discount ceiling is whatever the vendor's standard renewal motion offers. With it, structured negotiations have been documented to deliver discounts in the 25–42% range on multi-year commitments. That is not a soft benefit. It is a direct return on the eval-as-subscription line item.
The Prompt Is a Function of the Model, Not a Constant
The architecture that lets you swap models without rewriting application code is a routing layer where the prompt itself is a function of the target model. Most teams build the inverse: a single prompt template, a `model:` parameter, and the assumption that the prompt should be identical regardless of which model receives it. That assumption is what makes the swap break.
The shape of the right abstraction is a prompt-builder function that takes the task plus the target model identifier and returns a model-tuned prompt. The same input task produces a Claude-shaped prompt with XML tags and tone scaffolding for one model, a markdown-heavy structured prompt with explicit delimiter rules for another, and a tightly-constrained schema-first prompt for a third. The application code never sees the difference. The router does.
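A minimal sketch of that dispatch, with the model-identifier prefixes and template choices as illustrative stand-ins; the point is that the router, not the application, owns the formatting:

```python
def build_prompt(task: str, context: str, target_model: str) -> str:
    """Return a model-tuned prompt for the same logical task.

    The application passes (task, context); only the router knows which
    formatting conventions the target model responds to.
    """
    if target_model.startswith("claude"):
        # XML scaffolding this model family follows closely.
        return f"<task>{task}</task>\n<context>{context}</context>"
    if target_model.startswith("gpt"):
        # Markdown structure with explicit delimiter rules.
        return (f"## Task\n{task}\n\n## Context\n{context}\n\n"
                "Respond with only the answer.")
    if target_model.startswith("gemini"):
        # Tightly-constrained, schema-first framing.
        return f"Task: {task}\nContext: {context}\nReturn a single JSON object."
    raise ValueError(f"no prompt builder registered for {target_model}")
```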
This sounds like more code, and it is. The trade is that you trap the vendor-specificity in one place where it is auditable, versioned, and testable, instead of letting it leak into every prompt across the codebase. When a new model launches, you write a new prompt-builder branch, run it against the eval suite, and deploy. When a vendor changes refusal behavior, you patch one branch and the rest of the system is unaffected. When the contract negotiation needs leverage, you can produce a credible "we have working prompts for three providers, here are the current eval scores" report in an afternoon, not a quarter.
The same principle applies to structured outputs, which are the most common silent breakage point in any swap. JSON mode is not a standard. OpenAI ships server-side schema enforcement with parse failure rates below 0.1% and structured refusal fields when content policy fires. Anthropic ships tool-use blocks that return structured data at the end of the stream rather than incrementally, with 313–346 tokens of system prompt overhead per tools-included request. Gemini ships a merged JSON-mode-plus-schema feature with parse failures below 0.3% and lower token overhead. The "drop-in replacement" your gateway claims to provide is, in practice, three different contracts wearing the same name. The router has to flatten them into a single internal contract before downstream consumers see the response.
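A minimal sketch of that flattening step, under simplified assumptions: the field paths below approximate each provider's documented response shape but should be verified against current API docs before anything depends on them:

```python
import json
from dataclasses import dataclass
from typing import Any

@dataclass
class StructuredResult:
    """The single internal contract downstream consumers see."""
    data: dict[str, Any] | None
    refused: bool

def normalize(provider: str, raw: dict[str, Any]) -> StructuredResult:
    """Flatten three provider-specific response shapes into one contract."""
    if provider == "openai":
        # Schema-enforced message content; a refusal field means policy fired.
        message = raw["choices"][0]["message"]
        if message.get("refusal"):
            return StructuredResult(data=None, refused=True)
        return StructuredResult(data=json.loads(message["content"]), refused=False)
    if provider == "anthropic":
        # Structured data arrives as a tool-use block in the content list,
        # at the end of the stream rather than incrementally.
        for block in raw.get("content", []):
            if block.get("type") == "tool_use":
                return StructuredResult(data=block["input"], refused=False)
        return StructuredResult(data=None, refused=True)
    if provider == "gemini":
        # JSON mode returns the object as serialized text in the first part.
        text = raw["candidates"][0]["content"]["parts"][0]["text"]
        return StructuredResult(data=json.loads(text), refused=False)
    raise ValueError(f"no adapter for provider {provider}")
```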
The Quarterly Burn Rate
Putting numbers on it makes the trade-offs concrete. For a team running one production AI surface against a primary provider with two fallbacks on the eval bench, the continuous engineering investment to keep portability real looks roughly like this per quarter:
- Cross-vendor eval inference: a few hundred to a few thousand dollars, depending on suite size and run cadence.
- Engineer time on adapter maintenance, refusal-style tracking, and tokenizer-tic auditing: half an engineer-week at steady state, more in any quarter where a vendor ships a major update.
- Routing-layer maintenance for the prompt-as-function-of-model abstraction: minor at steady state, multi-week when a new model is added to the fallback list.
- Documentation of the prompt style guide and each provider's quirks: a few hours to reflect drift.
That is the steady-state burn. Now look at the cost when portability has not been funded and a forced migration arrives. The published rule of thumb across migration playbooks is that a vendor-locked production prompt takes one to two engineer-quarters per surface to re-tune, plus a full eval re-baseline, plus a customer-comms drafting cycle when behavior changes leak. A team running five surfaces hits this five times per provider sunset. At a fully-loaded engineer cost of $80–150K per quarter, the unfunded version of this discipline is a six- to seven-figure surprise that lands on whatever quarter the vendor chose, not whatever quarter the roadmap had room for.
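A back-of-envelope version of that comparison, with every figure an illustrative midpoint of the ranges above rather than a measured number:

```python
# Steady-state portability burn per quarter (illustrative midpoint figures).
eval_inference = 2_000                            # cross-vendor eval spend, USD
engineer_week_cost = 120_000 / 13                 # fully-loaded quarter / ~13 weeks
adapter_maintenance = 0.5 * engineer_week_cost    # half an engineer-week
steady_state_quarterly = eval_inference + adapter_maintenance

# Deferred burn: forced migration arriving with no funded portability.
surfaces = 5
quarters_per_surface = 1.5        # midpoint of the 1-2 engineer-quarter rule of thumb
engineer_quarter_cost = 120_000   # midpoint of the $80-150K fully-loaded range
deferred_migration = surfaces * quarters_per_surface * engineer_quarter_cost

print(f"steady-state: ~${steady_state_quarterly:,.0f}/quarter")   # ~$6,615
print(f"deferred:     ~${deferred_migration:,.0f}, on the vendor's schedule")  # ~$900,000
```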
The arithmetic is not subtle. The continuous burn is small and predictable. The deferred burn is large and lands on the vendor's calendar. Teams that treat portability as a property they bought once will pay the deferred burn at least once per provider relationship, and the cost of the deferred version compounds because each forced migration disrupts the roadmap and pushes other work later.
What This Means For Contracts And Architecture
Two practical consequences fall out of treating portability as a continuous investment rather than a one-time port.
The first is that contract terms should be rewritten to assume continuous portability is part of the operational discipline. Negotiate explicit exit rights and portable artifact deliverables — your prompt libraries, your eval datasets, your fine-tuning data, your retrieval indexes, your performance logs. The Morgan Lewis framework on AI exit rights treats these as four distinct asset categories that need explicit ownership clauses. If the vendor controls any of them, your portability investment is partially neutralized regardless of how good your eval suite is. The contract is part of the portability stack, not a separate workstream.
The second is that the platform team that owns the routing layer also has to own the portability metric. Splitting the model registry across the platform team and the prompts across the product teams produces the worst possible outcome: nobody owns whether the swap actually works. A single owner of the cross-vendor parity number — reported quarterly to the same audience that sees infrastructure-spend reports — is what keeps the discipline funded. Without that owner, the line item is the first thing cut in any cost-pressure quarter, and the next forced migration becomes a fire drill.
The architectural realization is that vendor portability is not a property of your stack. It is a maintained capability with a quarterly burn rate, sustained by continuous engineering investment in eval coverage, routing abstraction, prompt style discipline, and contract terms. Teams that fund it pay a small, predictable amount and retain the option to switch. Teams that do not fund it pay a large, unpredictable amount on whatever schedule their vendor chooses. The choice is not whether to pay. The choice is whether to pay continuously and keep the leverage, or pay in lump sums and surrender the leverage.
The next time someone in a planning meeting says "we can switch models any time," ask them for the cross-vendor eval delta from the last quarterly report. If the answer is a blank stare, the team does not have portability. It has hope.
Sources
- https://customgpt.ai/how-to-avoid-llm-vendor-lock-in/
- https://evolink.ai/blog/llm-tco-hidden-costs-guide
- https://www.saastr.com/the-4-levels-of-prompt-portability-why-some-ai-agents-will-hold-retention-and-most-wont/
- https://www.morganlewis.com/blogs/sourcingatmorganlewis/2026/02/building-exit-rights-and-portability-into-ai-deals
- https://www.promptfoo.dev/docs/guides/gpt-vs-claude-vs-gemini/
- https://medium.com/@rosgluk/structured-output-comparison-across-popular-llm-providers-openai-gemini-anthropic-mistral-and-1a5d42fa612a
- https://www.buildmvpfast.com/blog/structured-output-llm-json-mode-function-calling-production-guide-2026
- https://sureprompts.com/blog/advanced-prompt-engineering-2026-claude-gpt5-gemini
- https://www.saastr.com/the-wave-of-ai-agent-churn-to-come-prompts-are-portable/
- https://blog.qasource.com/software-development-and-qa-tips/what-is-the-real-cost-of-ongoing-ai-model-maintenance-and-management
