LLM Vendor Lock-In Is a Spectrum, Not a Binary

10 min read
Tian Pan
Software Engineer

A team builds a production feature on GPT-4. Months later, they decide to evaluate Claude for cost reasons. They spend two weeks "migrating"—but the core API swap takes an afternoon. The remaining ten days go toward fixing broken system prompts, re-testing refusal edge cases, debugging JSON parsers that choke on unexpected prose, and re-tuning tool-calling schemas that behave differently across providers. Migration estimates that assumed a simple connector swap balloon into a multi-layer rebuild.

This is the LLM vendor lock-in problem in practice. And the teams that get burned aren't the ones who chose the wrong provider—they're the ones who didn't recognize that lock-in exists on multiple axes, each with a different risk profile.

Lock-In Has Six Layers, and They Don't All Matter Equally

Most engineers think of vendor lock-in as an API format problem. It's much more than that.

Layer 1: API call format. The OpenAI-compatible API has become a de facto standard—80%+ of new providers implement it. Changing the base URL and model name is a one-line change. This layer has the lowest switching cost and is largely a solved problem. If you build against the OpenAI SDK shape and your provider supports it, you're mostly fine.
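To make the "one-line change" concrete, here is a minimal sketch of what an OpenAI-compatible swap looks like when the endpoint and model name live in configuration rather than code. The provider names and URLs besides OpenAI's are illustrative placeholders, not real endpoints:

```python
# Sketch: with an OpenAI-compatible provider, switching is a config change.
# "other" and its URL are placeholders for any OpenAI-compatible vendor.
PROVIDERS = {
    "openai": {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"},
    "other": {"base_url": "https://api.example-provider.com/v1", "model": "example-model"},
}

def client_config(provider: str) -> dict:
    """Return the two values an OpenAI-SDK-style client needs.

    Application code constructs its client from this dict
    (e.g. OpenAI(base_url=cfg["base_url"])) and never hardcodes either value.
    """
    cfg = PROVIDERS[provider]
    return {"base_url": cfg["base_url"], "model": cfg["model"]}
```

Everything downstream of `client_config` stays identical across providers; only these two strings change.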

Layer 2: Prompt vocabulary and structure. This is where engineers first get surprised. Claude has been fine-tuned to pay close attention to XML tag structures. GPT models prefer markdown-formatted prompts with sections and emphasis. A system prompt optimized for one model family often produces degraded output on another—not catastrophically wrong, but subtly off in ways that take time to diagnose. The instruction style also diverges: instruction-oriented models benefit from explicit, structured prompts; reasoning-oriented models perform better with sparse, high-level goals and strong verification steps.

Layer 3: Tool-calling schemas. OpenAI uses JSON Schema for function definitions. Claude uses content blocks where tool_use appears separately from text. Google Gemini has its own format. Tool selection accuracy varies: GPT-4o achieves 97–99%, Claude Sonnet runs 96–99%, Gemini sits at 95–98%. These differences are tolerable in isolation, but when you're running thousands of agentic tool calls per day, a 2% divergence in accuracy is a real production difference.
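The format gap is mechanical enough to bridge with a small translation function. This sketch converts an OpenAI-style function definition into Anthropic's tool shape; the field names follow each provider's public docs at the time of writing, so verify them against the current API references before depending on this:

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Translate an OpenAI-style function definition into Anthropic's
    tool format. Field names are from the public docs; verify against
    the current API references before relying on this in production."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # Anthropic names the JSON Schema field "input_schema";
        # OpenAI nests it under "parameters".
        "input_schema": fn["parameters"],
    }

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

anthropic_tool = openai_tool_to_anthropic(weather_tool)
```

Translating the schema is the easy part; the accuracy differences in which tool the model actually picks cannot be papered over this way.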

Layer 4: Output length norms and formatting quirks. Claude occasionally prepends prose before a JSON block—"Here's the data you requested:"—which silently breaks parsers that expect raw JSON. GPT-4o is more consistent about following output format instructions exactly. These behaviors are undocumented and change between model versions. Teams discover them in production, not in evaluation.
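A defensive parser absorbs much of this class of breakage. The sketch below, stdlib-only, handles raw JSON, fenced JSON, and JSON preceded by prose; it deliberately does not handle braces inside string values in the fallback path, which is a known limitation of this simple approach:

```python
import json
import re

def extract_json(text: str):
    """Parse model output that may wrap JSON in prose or code fences.
    Returns the first JSON object found, or raises ValueError."""
    # Fast path: the output is already raw JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Strip markdown code fences if present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    # Fall back to the first balanced {...} span.
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start : i + 1])
    raise ValueError("unbalanced JSON object")
```

A parser like this turns "Here's the data you requested: {...}" from a production incident into a non-event, regardless of which model produced it.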

Layer 5: Refusal patterns and safety alignment. Different models have different thresholds. Tightly safety-aligned models generate very low fulfillment rates on unsafe prompts—but also produce higher over-refusal rates on innocuous ones. If your application touches borderline domains (medical, legal, adult content), the refusal profile of your model is part of your product. Switching providers can change that profile in ways that are invisible until they surface as customer complaints.

Layer 6: Embeddings, fine-tuning, and stored state. This is the hardest layer to escape. Chat histories stored as embeddings with one provider's model are incompatible with another's embedding space. Models fine-tuned on proprietary platforms often can't be exported. If your application uses retrieval or memory, your data is entangled with your provider's representation layer.

Which Dependencies Are Acceptable—and Which Are Debt From Day One

The right mental model isn't "avoid all lock-in." It's "know which lock-in you're taking on deliberately."

Acceptable lock-in is any dependency where:

  • The feature provides genuine, measurable advantage not available elsewhere
  • Migration cost is bounded and you've estimated it honestly
  • The feature is stable and the provider has committed to its longevity
  • You've built an abstraction layer that limits the blast radius downstream

Using Claude's extended thinking for complex reasoning pipelines is acceptable lock-in. The capability is genuinely differentiated, it's available across multiple deployment targets (direct API, Bedrock, Vertex AI), and you can scope its use to specific pipeline stages without it bleeding into the rest of your codebase. Similarly, using GPT-4o's vision capabilities for document understanding tasks where it demonstrably outperforms alternatives is a reasonable engineering tradeoff.

Technical debt from day one looks like:

  • Application code that calls provider APIs directly without an abstraction layer, with no plan to add one
  • Prompt strings littered with provider-specific formatting that's never been tested against alternatives
  • Fine-tuning on a proprietary platform without exporting the weights or validating that the improvement generalizes to parameter-efficient techniques (LoRA, adapters) that could be applied to other base models
  • Storing embeddings and chat histories in provider-managed vector stores without an export path
  • Relying on output behaviors (specific JSON formatting, verbosity levels, reasoning traces) that are neither documented nor tested for consistency across versions

The cost of this debt isn't just the eventual migration. It's the ongoing tax: you can't do competitive evaluation of new models without rebuilding your eval harness, you can't respond to provider pricing changes, and you can't fall back to an alternative when a provider has an outage.

The Abstraction Architecture That Actually Works

The industry has converged on a few practical approaches to containment.

Use an AI gateway at the edge. Tools like LiteLLM and Portkey sit in front of your application code and translate between providers via a unified interface. Netflix, Lemonade, and Rocket Money use LiteLLM in production to get day-zero access to new models without rewriting application logic. When you want to switch from one provider to another, you change a configuration value, not application code. The gateway also gives you centralized observability, cost attribution, and fallback routing when a provider degrades.
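As a sketch of what "a configuration value, not application code" means in practice, here is the shape of a LiteLLM proxy config with a primary model and a fallback. The field names follow LiteLLM's proxy configuration format at the time of writing, but check the current LiteLLM docs before copying this:

```yaml
model_list:
  - model_name: chat-default            # alias the application code uses
    litellm_params:
      model: openai/gpt-4o              # actual provider/model behind the alias
      api_key: os.environ/OPENAI_API_KEY
  - model_name: chat-fallback
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks:
    - chat-default: ["chat-fallback"]   # route here when the primary degrades
```

Application code only ever asks for `chat-default`; which vendor answers is the gateway's concern.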

Gartner predicts that by 2028, 70% of organizations building multi-LLM applications will route through an AI gateway—up from under 5% in 2024. The trajectory is clear: direct provider coupling is being recognized as the same kind of technical risk as direct database coupling was in the 2000s.

Abstract your prompt layer. Don't embed provider-specific formatting in business logic. Instead, maintain a prompt builder that accepts model target as a parameter and applies the appropriate structural conventions—XML tags for Claude, markdown sections for GPT, minimal structure for reasoning models. This adds a small layer of indirection but makes cross-model evaluation tractable.
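A minimal version of such a prompt builder might look like this. The formatting conventions encoded here (XML for Claude, markdown for GPT, sparse structure for reasoning models) are illustrative, not exhaustive:

```python
from enum import Enum

class ModelFamily(Enum):
    CLAUDE = "claude"        # fine-tuned to attend to XML tag structure
    GPT = "gpt"              # prefers markdown sections and emphasis
    REASONING = "reasoning"  # prefers sparse, high-level goals

def build_prompt(task: str, context: str, family: ModelFamily) -> str:
    """Apply model-family formatting conventions to the same logical prompt.
    Business logic passes in task and context; it never sees the formatting."""
    if family is ModelFamily.CLAUDE:
        return f"<task>{task}</task>\n<context>{context}</context>"
    if family is ModelFamily.GPT:
        return f"## Task\n{task}\n\n## Context\n{context}"
    # Reasoning models: state the goal, skip the scaffolding.
    return f"{task}\n\nRelevant context: {context}"
```

With this in place, a cross-model eval run is a loop over `ModelFamily` rather than a rewrite of every prompt string.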

Use MCP for tool integration. Model Context Protocol, originally released in November 2024 and now adopted by both OpenAI and Google, standardizes how models connect to external tools and data sources. Build your tool integration once against the MCP spec and it works across Claude, GPT, and Gemini without per-provider adapter code. This won't eliminate all tool-calling differences, but it significantly reduces the per-provider surface area.

Isolate your embedding layer. If you use retrieval or memory, keep your embedding pipeline behind an interface that can swap out the underlying model. Don't pass provider embedding objects directly into application code. When you need to re-embed because you switched providers, the blast radius should be one module, not a refactor across your entire retrieval path.
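One way to enforce that boundary is a small interface plus a rule that stored vectors carry the ID of the model that produced them, so a provider switch forces a re-embed instead of silently mixing incompatible embedding spaces. The `StubEmbedder` below is a stand-in for a real provider-backed implementation:

```python
from typing import Protocol

class Embedder(Protocol):
    model_id: str
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class StubEmbedder:
    """Stand-in for a provider-backed embedder (OpenAI, Cohere, a local
    model). Real implementations call the provider behind this interface."""
    model_id = "stub-v1"

    def embed(self, texts: list[str]) -> list[list[float]]:
        # Deterministic toy vectors; a real embedder returns model output.
        return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]

def index_documents(embedder: Embedder, docs: list[str]) -> dict:
    """Tag stored vectors with the model that produced them, so a
    provider switch triggers re-embedding instead of silent mixing."""
    return {
        "model_id": embedder.model_id,
        "vectors": embedder.embed(docs),
    }
```

Swapping providers then means writing one new `Embedder` implementation and re-running the indexing job, not refactoring the retrieval path.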

The Switching Cost Math

Enterprise teams that treat model switching as a plug-and-play operation routinely discover it's not. The hidden costs are predictable:

Migration consumes 20–50% of the original development time. During evaluation and migration, teams often run overlapping contracts with both vendors, paying double for a period. Minimum annual commitments at enterprise tiers run from $50,000 to $200,000—so a migration that saves $30,000/year in inference costs can take three years to break even if handled badly.
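The break-even arithmetic is worth running explicitly before committing to a switch. The $90,000 figure below is a back-of-envelope reading of the numbers above (the all-in cost implied by a three-year payback on $30,000/year of savings), not a quoted price:

```python
def breakeven_years(migration_cost: float, annual_savings: float) -> float:
    """Years until a migration pays for itself, ignoring discounting."""
    return migration_cost / annual_savings

# Illustrative: ~$90k of engineering time, overlapping contracts, and
# commitments against $30k/year of inference savings.
years = breakeven_years(90_000, 30_000)  # 3.0
```

If that number exceeds your planning horizon, or a vendor's minimum commitment alone dwarfs the savings, the migration is a strategic move, not a cost optimization.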

The tokenization difference alone creates budget surprises. Anthropic's tokenizer is more verbose than OpenAI's, so the same prompt consumes more tokens on Claude than on GPT. Token budgets that worked for one provider don't carry over, and caches built around one tokenization scheme need re-tuning.

More subtly, quality regressions appear in places you didn't test. You test the happy path. The model you're migrating to is subtly different on long-tail inputs—different reasoning patterns, different verbosity on ambiguous instructions, different handling of malformed tool arguments. The gap between "passes evals" and "production-ready" is longer with a model switch than engineers expect.

The teams that migrate cheaply are the ones who built with migration in mind before they needed it. They maintained model-specific prompt variants, used gateways for routing, kept embeddings portable, and ran regular cross-model eval runs so they already knew where the differences were.

When to Embrace Lock-In Deliberately

None of this means you should avoid provider-specific capabilities. It means you should use them with your eyes open.

If Claude's extended thinking is the right tool for a complex reasoning pipeline, use it—but scope the lock-in. Don't let the extended-thinking prompt format bleed into your general-purpose chat code. Keep provider-specific features isolated in modules that have explicit interfaces, clear deprecation paths, and documented alternatives.

If GPT-4o's vision handling is measurably better for your document understanding use case, use it. But maintain the abstraction layer upstream and downstream, so the vision component can be swapped without rebuilding the surrounding pipeline.

Meta's approach to its open-source model strategy is instructive: users stay because the ecosystem is genuinely useful, not because they're trapped. Build your application to satisfy the same constraint. If a user (or your own engineering team) wants to swap out the underlying model, the swap should be a configuration change at the gateway layer, not a refactor that touches ten files and breaks three parsers.

The Axes That Actually Matter

If you're building a new LLM application today, these are the portability decisions worth making deliberately:

  • Route through a gateway. One line to add LiteLLM or Portkey; months of refactoring to remove direct coupling later.
  • Test prompts across providers from day one. Run your eval suite against at least two model families. Lock-in becomes visible before it's embedded.
  • Own your embedding representations. Keep an export path. Never depend on a provider-managed vector store without one.
  • Scope provider-specific features. Use extended thinking, vision, or reasoning modes—but confine them to bounded modules with explicit interfaces.
  • Track output format behaviors. Log cases where model output deviates from expected format. These surface as parser bugs in production.

The goal isn't to be model-agnostic—that's an expensive abstraction that often reduces quality by targeting the lowest common denominator. The goal is to be model-portable: able to make deliberate provider decisions, swap components when circumstances change, and never get forced into a migration by a vendor's pricing decision or outage.

Lock-in exists on a spectrum. The teams that stay portable are the ones who decided, early, which part of that spectrum they were willing to occupy—and built accordingly.
