The Eval Harness, Not the Prompt, Is Your Real Provider Lock-In
Every "we'll just swap providers if needed" plan in the deck has a budget line for prompt rewrites. None of them has a line for the eval suite. That is the bug. The prompts are the visible coupling — the part you wrote, the part you can grep for, the part a junior engineer can rewrite in an afternoon. The eval harness is the invisible coupling, and it is the one that will eat a quarter of your roadmap when you actually try to migrate.
The pattern shows up the moment leverage matters. Your contract is up. A competitor releases a model that benchmarks better on your domain. Pricing on output tokens shifts under you. You go to run the candidate model through your eval suite to make the call, and within a day you discover that you cannot trust any score the harness produces, because the harness itself was written against the incumbent. You are not comparing models. You are comparing one model against a measurement instrument that was calibrated to the other one.
This is not a theoretical risk. Tokenizers from OpenAI, Anthropic, and Google produce count discrepancies of 10–20% or more for the same block of text; a prompt that measures 140 tokens under one tokenizer can register as 180 under another. Tool-call schemas diverge — OpenAI declares tools in a tools array with type: "function" and a nested parameters schema, Anthropic declares them flat with input_schema and returns calls as tool_use content blocks, and the OpenAI compatibility shim explicitly does not enforce strict mode against Claude. Refusal phrasing differs by model family. Reasoning trace formats differ. Pricing math differs. Every one of these differences sits inside your eval harness as a soft assumption, and every soft assumption is a place the comparison silently breaks.
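To make the divergence concrete, here is roughly how the same tool is declared for each API; field names follow the public OpenAI and Anthropic tool-use docs, and the tool itself is a made-up example:

```typescript
// The same "lookup_order" tool, declared twice. OpenAI nests the JSON Schema
// under function.parameters; Anthropic puts it directly in input_schema.
const openAITool = {
  type: "function",
  function: {
    name: "lookup_order",
    description: "Fetch an order by id",
    parameters: {
      type: "object",
      properties: { order_id: { type: "string" } },
      required: ["order_id"],
    },
  },
};

const anthropicTool = {
  name: "lookup_order",
  description: "Fetch an order by id",
  input_schema: {
    type: "object",
    properties: { order_id: { type: "string" } },
    required: ["order_id"],
  },
};
```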
The Soft Couplings Hiding in Your Harness
Walk through a representative production eval suite and count the provider-shaped knobs. You will find more than you expect.
Tokenizer-aware length budgets. Your fixtures cap inputs at 8,000 tokens "to leave room for the response." That cap was computed against one tokenizer. Run the same fixtures through a different one and a meaningful slice of them now overflow the context window in the new model but not the old, or fit comfortably in the new model after being artificially truncated for the old. Either way the test no longer measures what it claimed to measure.
Tool-call schema assumptions baked into fixtures. Your golden traces expect tool calls to appear in a tool_calls array on the message. Anthropic returns them as tool_use content blocks inside the message's content array. The compatibility shims paper over the surface, but your fixtures assert against parsed structure, your judges read tool arguments out of specific paths, and your replay tooling expects ordering guarantees one provider gives and another does not.
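The response side diverges the same way. A sketch of the two shapes a fixture might be asserting against, abridged to the fields that matter here:

```typescript
// OpenAI-style: tool calls live in message.tool_calls, and arguments
// arrive as a JSON string that still has to be parsed.
const openAIMessage = {
  role: "assistant",
  content: null,
  tool_calls: [
    {
      id: "call_abc123",
      type: "function",
      function: { name: "lookup_order", arguments: '{"order_id":"A-17"}' },
    },
  ],
};

// Anthropic-style: tool calls are tool_use blocks inside the content array,
// and input is already a parsed object.
const anthropicMessage = {
  role: "assistant",
  content: [
    { type: "text", text: "Let me look that up." },
    {
      type: "tool_use",
      id: "toolu_abc123",
      name: "lookup_order",
      input: { order_id: "A-17" },
    },
  ],
};
```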
Refusal-phrase regexes. Somewhere in your safety eval is a regex that matches "I can't help with that" or "I'm not able to" or your incumbent model's specific hedging fingerprint. The next model phrases its refusals differently. Your safety pass rate drops 6 points and you spend two weeks figuring out whether the model got less safe or your detector went blind.
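A hypothetical detector of this kind, fingerprinted to one family's phrasing, looks harmless right up until the phrasing changes:

```typescript
// Matches one model family's hedging fingerprint, not "refusal" in general.
// A candidate that declines with "I won't be able to assist with that request"
// sails straight past it, and the safety metric quietly goes blind.
const refusalPattern =
  /\b(I can't help with that|I'm not able to|I cannot assist with)\b/i;

const isRefusal = (text: string): boolean => refusalPattern.test(text);
```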
Judge prompts written against a specific reasoning trace format. If your LLM-as-judge expects to see a chain-of-thought before the verdict, and the candidate model emits its reasoning in a different shape — a single tagged block versus interleaved scratchpad versus structured JSON with a reasoning field — your judge starts giving the candidate worse scores for the same underlying answer quality. You have not measured the candidate. You have measured how well the candidate flatters the judge prompt's expectations.
Pricing math hard-coded to one provider's billing units. Your cost regression test asserts "per-call cost stays under $0.012." That number was derived from one provider's per-token rates and a typical input/output ratio. The candidate has different rates, possibly a different unit (characters vs. tokens, batched vs. single), and probably a different cache discount structure. Your cost gate fires, or fails to fire, for reasons unrelated to the candidate's actual economics.
Each of these is a five-line patch in isolation. The damage is that there are dozens of them, scattered across fixtures, judges, regexes, dashboards, and the spreadsheets your finance partner uses to greenlight the migration. None of them was written to be portable, because at write-time portability had no concrete owner and no urgency.
Why You Can't Trust the First Comparison
The pernicious part is that you do not know any of this until you actually try. The harness will happily run the candidate model end-to-end and produce numbers. The numbers will look credible. They will be wrong in directions you cannot predict from the outside. Some metrics will be inflated by an artifact of how the candidate's tokenizer carves up your fixtures. Others will be deflated because your judge is grading on a curve fit to the incumbent's verbosity.
The honest first response is "I can't trust this comparison until I re-validate the harness." That investigation is itself a months-long project — you have to rebuild fixtures with provider-neutral length budgets, normalize tool-call representations, recalibrate refusal detectors against samples from both models, rewrite judge prompts to grade observable behavior rather than format, and re-derive the cost math under both pricing schedules. Until that is done, every score the suite produces is contaminated.
This is the moment when the migration plan slips by a quarter. The team that thought they were doing prompt rewrites discovers they are doing harness archaeology. The exec who said "we have leverage because we can switch" has leverage on paper but not in calendar time, which is the only kind of leverage a renegotiation actually respects.
The Portability Discipline
The fix is not glamorous. It is a set of design choices made early, when no one is forcing you to make them, that pay back the first time you actually want to comparison-shop.
Provider-agnostic eval primitives. Your eval framework should consume a normalized response object — text, tool calls, finish reason, token counts, latency — produced by an adapter layer that knows about provider quirks. The eval logic itself never sees provider-specific fields. If your judge code branches on whether response.tool_calls exists vs. response.content[].type === "tool_use", you have not abstracted; you have leaked.
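A minimal sketch of that boundary, with illustrative names rather than any particular framework's API:

```typescript
// Minimal request/response contract. Adapters own every provider quirk;
// eval logic never sees a provider-specific field.
interface NormalizedRequest {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  tools?: { name: string; description: string; schema: object }[];
}

interface NormalizedToolCall {
  name: string;
  arguments: Record<string, unknown>; // always a parsed object, never a JSON string
  id?: string;
  position?: number; // set only when the provider guarantees ordering
}

interface NormalizedResponse {
  text: string;
  toolCalls: NormalizedToolCall[];
  finishReason: "stop" | "tool_call" | "length" | "refusal" | "other";
  inputTokens: number; // as reported by the provider, not re-tokenized
  outputTokens: number;
  latencyMs: number;
}

// One adapter per provider; the harness only ever imports these interfaces.
interface ProviderAdapter {
  name: string;
  complete(request: NormalizedRequest): Promise<NormalizedResponse>;
}
```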
Abstracted tokenizer interface. Length budgets in fixtures should be expressed in a unit the harness can resolve per provider — "fits in 70% of the candidate's context window," not "8,000 tokens." Fixtures that need exact-token control should declare that they need it and be explicitly marked as provider-pinned, so they are not silently used in cross-provider comparisons.
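One way to express that, continuing the illustrative sketch above and assuming the adapter layer can surface a per-model token counter:

```typescript
// A fixture declares its budget relative to whatever model it runs against,
// instead of hard-coding a count measured under one tokenizer.
interface TokenCounter {
  countTokens(text: string): number; // provider- or model-specific counting
  contextWindow: number;             // max tokens for the model under test
}

interface Fixture {
  input: string;
  // e.g. 0.7 means "must fit in 70% of the candidate's context window".
  maxContextFraction?: number;
  // Fixtures that genuinely need exact token control say so explicitly
  // and are excluded from cross-provider comparisons.
  providerPinned?: boolean;
}

function fitsBudget(fixture: Fixture, counter: TokenCounter): boolean {
  const fraction = fixture.maxContextFraction ?? 0.7;
  return counter.countTokens(fixture.input) <= counter.contextWindow * fraction;
}
```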
Normalized tool-call representation. Pick one canonical shape — name, arguments-as-object, optional id, optional ordering position — and translate every provider's response into it before the harness sees it. Translate back when calling the provider. The fixtures, judges, and replay tooling all operate on the canonical form.
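Continuing the same sketch, the translation into the canonical shape lives entirely in the adapter; the harness never touches the raw forms shown earlier:

```typescript
type OpenAIToolCall = {
  id: string;
  function: { name: string; arguments: string };
};

type AnthropicContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: Record<string, unknown> };

// OpenAI raw -> canonical: the arguments arrive as a JSON string and must be parsed.
function fromOpenAI(toolCalls: OpenAIToolCall[] = []): NormalizedToolCall[] {
  return toolCalls.map((tc, i) => ({
    name: tc.function.name,
    arguments: JSON.parse(tc.function.arguments || "{}"),
    id: tc.id,
    position: i,
  }));
}

// Anthropic raw -> canonical: filter tool_use blocks out of the content array.
function fromAnthropic(content: AnthropicContentBlock[] = []): NormalizedToolCall[] {
  return content
    .filter((b): b is Extract<AnthropicContentBlock, { type: "tool_use" }> => b.type === "tool_use")
    .map((b, i) => ({ name: b.name, arguments: b.input, id: b.id, position: i }));
}
```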
Judges that score on observable behavior, not format. A judge that says "did the response correctly identify the user's intent and gather the right slot values" is portable. A judge that says "did the response include a <reasoning> block followed by a verdict in <answer> tags" is not — it is grading the provider's output convention, not the model's competence. Push every format-shaped assertion out of judges and into deterministic post-processors.
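One way to enforce that split is to strip format deterministically before anything reaches the judge, and to keep the judge prompt pointed at behavior. The tag names and rubric below are illustrative:

```typescript
// Deterministic post-processor: remove reasoning scaffolding in whatever
// convention the provider uses, before anything reaches the judge.
function extractAnswer(raw: string): string {
  return raw
    .replace(/<reasoning>[\s\S]*?<\/reasoning>/gi, "") // drop tagged reasoning blocks
    .replace(/<answer>([\s\S]*?)<\/answer>/gi, "$1")   // unwrap tagged verdicts
    .trim();
}

// The judge prompt asks about observable behavior only, never about format.
const judgePrompt = (answer: string, expectedSlots: string[]) => `
Given the assistant's reply below, answer two questions:
1. Did it correctly identify the user's intent?
2. Did it gather these slot values: ${expectedSlots.join(", ")}?
Reply with JSON: {"intent_correct": boolean, "slots_gathered": string[]}.

Assistant reply:
${answer}
`;
```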
Cost math expressed as a function of the provider's rate card, not constants. Your cost gates should compute "this run cost X under provider P's rates as of date D." The threshold the gate compares against should be expressed in the same currency-normalized way. Then a comparison run produces "candidate is 18% cheaper at equivalent quality" rather than "candidate failed cost gate that was set against incumbent rates."
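A sketch of what that looks like, with illustrative field names and no real rate figures:

```typescript
// Rate card as data, dated, one per provider. Costs are computed, never pasted.
interface RateCard {
  provider: string;
  asOf: string;                   // e.g. "2026-01-15"
  inputPerMillionTokens: number;  // USD
  outputPerMillionTokens: number; // USD
  cachedInputPerMillionTokens?: number;
}

interface RunUsage {
  inputTokens: number;
  cachedInputTokens?: number;
  outputTokens: number;
}

function runCostUSD(usage: RunUsage, rates: RateCard): number {
  const cached = usage.cachedInputTokens ?? 0;
  const cachedRate = rates.cachedInputPerMillionTokens ?? rates.inputPerMillionTokens;
  return (
    ((usage.inputTokens - cached) * rates.inputPerMillionTokens +
      cached * cachedRate +
      usage.outputTokens * rates.outputPerMillionTokens) / 1_000_000
  );
}

// A comparison run then reports relative economics instead of tripping a gate
// that was set against the incumbent's rates:
// runCostUSD(candidateUsage, candidateRates) / runCostUSD(incumbentUsage, incumbentRates)
```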
None of this is hard individually. All of it is real engineering work that no roadmap rewards until the day the renegotiation starts.
The Org Failure Mode
There is a second failure mode that is structural, not technical, and it is the one that actually kills most portability efforts.
The team that owns evals is not the team that owns model selection. Evals usually live with the platform or applied science group whose roadmap is dominated by "improve coverage" and "reduce false positives." Model selection lives with a product or strategy group whose roadmap is dominated by feature delivery on the current model. Neither group has portability as a primary objective. Both groups will agree it is important when asked, and both will deprioritize it next quarter when something concrete comes due.
The portability work — the abstractions, the adapter layer, the judge rewrites — is always somebody else's quarter. By the time the migration is on fire, the team that needs the harness to be portable has no time to make it portable, and the team that has the cycles to make it portable has nothing forcing them to.
The fix is to give portability an owner with a deliverable. Not "consider portability when you write evals." A specific deliverable: by date D, the harness can produce comparable scores for two named providers on the same fixture set without a code change between runs. Then someone is on the hook. Without that, "we'll swap if needed" remains a slide, not a capability.
The Tax That Pays Back
The portability discipline costs real engineering time up front and produces no visible artifact for the user. You cannot demo it. You cannot point to a feature it shipped. The only thing it produces is the option to walk into a renegotiation with a credible alternative — and that option only has value the day you exercise it.
This is exactly the kind of investment that gets cut in any quarter where roadmap pressure is high, which is every quarter. The way to keep it alive is to frame it correctly. It is not "future-proofing the eval suite." It is "preserving optionality on the largest line item in your AI cost structure." Treat it like a hedge, not like cleanup. Hedges have an annual budget. Cleanup gets bumped.
The teams that survived the 2025–2026 round of price renegotiations with leverage intact were not the teams with the cleanest prompts. They were the teams whose harnesses produced trustworthy comparisons within a week of a candidate model dropping. The prompts were always going to be rewritten. The harness was the one with the asymmetric payoff, and almost no one was funding it before they needed it.
If your migration plan today still budgets prompt rewrites and assumes the eval suite "just runs against any model," you are writing yourself out of the leverage you are claiming to have. The lock-in is not the prompts. It never was.
