
The Model-Preference Fork: Why Your Prompt Library Has Three Versions and No One Is Tracking the Drift

11 min read
Tian Pan
Software Engineer

Open the prompt library of any team that has been shipping LLM features for more than a year and you will find the same thing: three slightly different versions of every prompt. One was tuned by the engineer who likes Sonnet for its instruction-following. One was rewritten by the engineer who switched to Haiku for the latency budget. One belongs to the prototype that only ever worked on Opus and never got migrated. Each version has a slightly different system message, a different way of describing the tool catalog, a different formatting nudge — and nobody is tracking how they drift.

This is not a hygiene problem. It is a coordination tax that compounds at every model upgrade, and it is silently breaking the relationship between your eval suite and your production traffic. The library is supposed to be a shared resource. In practice, every feature ships with whichever variant the author last tested, the eval suite runs against the variant the eval-author preferred, and the routing layer chooses among them based on cost rather than on which variant was actually validated against the live eval.

The team that doesn't notice is the team that's already paying.

How the fork happens

The fork is not a single decision. It is a slow accretion of small, locally rational choices that nobody sees in aggregate.

The first vector is developer preference. Engineers who use a model heavily develop a feel for it — the way it handles negation, how it responds to "DO NOT" versus "avoid," whether it follows numbered lists or runs them together, how it interprets a JSON schema with optional fields. When that engineer authors a new prompt, they author it for the model they have an intuition about. A teammate who prefers a different model picks up the prompt later, finds that it underperforms on their model of choice, and edits it locally rather than upstreaming a model-portable rewrite. Two months later, the prompt has two versions in two different files, and the divergence is permanent.

The second vector is prototype debt. Somebody built a feature in a hackweek using whatever model gave the fastest path to a demo. The demo worked. The feature shipped behind a flag. The planned migration to the production model never happened, because the prototype prompt had been tuned to that model's quirks and rewriting it would risk regressing the demo. So the prototype model stays in production for the lifetime of the feature, and the prompt accumulates inline comments like "// only works on opus, do not change" — warnings that read as binding to humans but mean nothing to the routing layer.

The third vector is eval-author bias. The person who wrote the eval suite chose a default model to run it against. Every new prompt added to the suite gets validated on that model. The eval scores reflect performance on the eval-author's preferred model — which is not necessarily the model that handles the bulk of production traffic, especially after a cost-driven routing change that nobody told the eval-author about. The eval is green; production behavior diverges from the eval; nobody can explain why because the model column is missing from the analysis.

None of these vectors looks pathological from the inside. Each is a sensible local move. The pathology is in the aggregate: a library that appears to be one shared resource is in fact a parallel monorepo of model-specific forks, and the only artifact that captures the divergence is a Slack thread from six months ago that nobody can find.

The eval blind spot

The danger is not that the variants exist. It is that the eval suite cannot tell you which variant is actually serving production traffic, and the routing layer cannot tell you which variant was actually evaluated.

Most production LLM systems route on cost or latency. The routing logic looks reasonable: send simple requests to a cheap model, send complex ones to a powerful one, fall back when the primary model is rate-limited. The router does not know that prompt variant A was tuned for the cheap model and prompt variant B was tuned for the powerful one. It picks a model based on request shape and pulls whatever variant the gateway has registered as "the prompt for this intent." If the registered variant was tuned for a different model than the one the router just picked, the request runs on a mismatched pair — a model that has never been validated against this exact prompt, and a prompt that has never been evaluated against this exact model.
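
To make the mismatch concrete, here is a minimal sketch in Python. The registry, intent name, and routing rule are all hypothetical; the point is only that nothing in these data structures links a prompt variant to the model it was tuned for.

```python
PROMPT_REGISTRY = {
    # the registered variant was tuned on Sonnet, but nothing records that
    "summarize_ticket": "You are a support summarizer. Return three bullets...",
}

def pick_model(request: dict) -> str:
    # cost/latency routing: short inputs go to the cheap model
    return "haiku" if len(request["text"]) < 2_000 else "sonnet"

def handle(request: dict) -> tuple[str, str]:
    model = pick_model(request)                  # chosen from request shape
    prompt = PROMPT_REGISTRY[request["intent"]]  # chosen with no model awareness
    # nothing here checks that this (prompt, model) pair was ever evaluated together
    return model, prompt
```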

The eval suite cannot catch this because eval suites are typically structured around prompt-and-input pairs, not prompt-model-and-input triples. You evaluate prompt_v3 against the test set, and the suite reports a single accuracy number. The fact that the eval ran on Sonnet and production runs on Haiku is a footnote in the configuration that nobody checks. When a regression appears in production, the team's first instinct is to debug the prompt — but the prompt is fine. The prompt-model pairing is the regression.
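
A small sketch of that shape, with a hypothetical harness: results are keyed by prompt version alone, and the model sits in a default that never reaches the report.

```python
EVAL_DEFAULT_MODEL = "sonnet"   # set once by the eval author, rarely revisited

def score_case(prompt_version: str, model_id: str, case: dict) -> float:
    # stand-in for a real model call plus grader
    return 1.0

def run_eval(prompt_version: str, dataset: list[dict]) -> dict:
    scores = [score_case(prompt_version, EVAL_DEFAULT_MODEL, c) for c in dataset]
    # the model dimension is dropped from the artifact the team actually reads
    return {"prompt": prompt_version, "accuracy": sum(scores) / len(scores)}
```

If production routes the same prompt to Haiku, the green number this report produces says nothing about the (prompt, haiku) pair that serves traffic.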

This is why "prompt drift" — the term most teams use for the phenomenon — is misleading. The prompts have not drifted. The prompts are exactly what they were when they were merged. What has drifted is the implicit contract between the prompt and the model that was supposed to render it. The contract was never written down because nobody believed they were depending on a specific model. They were.

The coordination tax compounds at every model upgrade

The fork is annoying in steady state. It becomes expensive every time a model is upgraded, deprecated, or migrated.

When a provider releases a new model, every variant in the library has to be re-evaluated against it. If the library has three forks per intent, that's a 3x evaluation cost, and the comparison is harder because the variants don't share a baseline. You cannot meaningfully compare "variant A on the new model" against "variant B on the new model" because the variants were authored under different assumptions; the comparison conflates two changes (model + prompt) instead of isolating one.

When a provider deprecates a model, the team has to migrate every prompt that was tuned for that model. Migration is not a search-and-replace. It is a re-tuning exercise — the prompt's prefix, its formatting nudges, its tool descriptions all have to be rewritten for the replacement model. If the deprecation window is 90 days and the team has fifty model-specific prompt forks, the migration becomes a full-quarter project that nobody planned for. The deprecation announcement was scheduled; the migration tax was not.

When the cost-routing layer is tuned to send more traffic to a cheaper model, the team that owns the routing layer assumes the prompts will follow. The team that owns the prompts assumes the router will use the validated variants. Neither team realizes that "validated" was implicit, that the variants are model-specific, that flipping the routing weights silently moves traffic onto unvalidated prompt-model pairs. The cost-savings dashboard goes up; the quality dashboard slowly goes down; the post-mortem six weeks later discovers the disconnect.

Every one of these events is predictable. Every one of these events catches teams unprepared. The reason is the same: the fork is invisible until something forces a global recalibration, and by then it is too late to reorganize.

The discipline that has to land

The fix is not "use one model." Routing across models is a legitimate optimization, and forcing every feature onto one model surrenders real cost and latency wins. The fix is to make the model dimension explicit in the prompt library and in the eval suite, so that the divergence stops being invisible.

One canonical prompt per intent, with model-conditional sections instead of three forks. The prompt is a single artifact whose body has explicit branches: a section that runs only when the model is Sonnet-class, a different section for Haiku-class, a shared core that applies everywhere. The branches are small and named. When somebody wants to add a model-specific tweak, they edit the conditional section, not a separate file. The diff against the canonical prompt makes the model-specific delta legible to reviewers, instead of burying it in a forked file that no one cross-references.
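
A minimal sketch of what that canonical artifact can look like, assuming an in-code registry rather than any particular templating tool; the section names and model classes are illustrative.

```python
# one canonical prompt: a shared core plus small, named, model-conditional deltas
SHARED_CORE = """You are a support summarizer.
Summarize the ticket in three bullet points.
Never invent order numbers."""

MODEL_SECTIONS = {
    # the reviewable part of the diff lives here, not in a forked file
    "sonnet-class": "Return the bullets as a JSON array under the key `summary`.",
    "haiku-class":  "Respond with ONLY a JSON array of three strings. No prose.",
}

def render_prompt(model_class: str) -> str:
    delta = MODEL_SECTIONS.get(model_class, "")
    return f"{SHARED_CORE}\n\n{delta}".strip()
```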

A CI job that fails if a prompt variant ships without a matching eval slice that ran against that exact model. The eval suite has to be aware of model identity, not just prompt identity. The configuration is a triple — (prompt_version, model_id, dataset_slice) — and the CI gate refuses to ship the prompt if any (prompt_version, model_id) combination the router can produce lacks a passing eval. Tools like promptfoo and the modern prompt-evaluation platforms support side-by-side multi-model evaluation natively; the discipline is using that capability as a gate, not as an exploration tool.
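
A sketch of the gate itself, independent of any particular eval tool. The routing_table and eval_results shapes are assumptions; the invariant is the part that matters: every pair the router can produce must appear in the set of passing eval runs.

```python
def missing_eval_pairs(
    routing_table: dict[str, list[str]],      # prompt_version -> models it may be served on
    eval_results: set[tuple[str, str, str]],  # passing (prompt_version, model_id, dataset_slice)
) -> list[tuple[str, str]]:
    passed_pairs = {(p, m) for (p, m, _slice) in eval_results}
    return [
        (prompt, model)
        for prompt, models in routing_table.items()
        for model in models
        if (prompt, model) not in passed_pairs
    ]

if __name__ == "__main__":
    gaps = missing_eval_pairs(
        {"summarize_ticket_v3": ["sonnet", "haiku"]},
        {("summarize_ticket_v3", "sonnet", "golden_set_v2")},
    )
    if gaps:
        raise SystemExit(f"unvalidated prompt-model pairs: {gaps}")  # fail the build
```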

A model-portability score per prompt. For each prompt, run the eval against every model the routing layer might dispatch to and report the variance. A prompt that scores 92% on Sonnet, 91% on Haiku, and 90% on Opus is portable. A prompt that scores 92% on Sonnet, 67% on Haiku, and 78% on Opus is model-specific by accident, and the team needs to know that explicitly so they can either (a) add the missing model-conditional sections or (b) constrain the routing layer to only dispatch this prompt to the model it actually works on. The portability score turns "implicit model dependency" into a visible attribute that the rest of the system can react to.
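
One minimal way to compute that score is the spread between the best and worst model; the 5-point threshold below is an assumption, and the example numbers are the ones from the paragraph above.

```python
def portability(scores_by_model: dict[str, float], max_spread: float = 0.05) -> dict:
    best, worst = max(scores_by_model.values()), min(scores_by_model.values())
    return {
        "spread": round(best - worst, 2),
        "portable": (best - worst) <= max_spread,
        "weakest_model": min(scores_by_model, key=scores_by_model.get),
    }

print(portability({"sonnet": 0.92, "haiku": 0.91, "opus": 0.90}))  # portable
print(portability({"sonnet": 0.92, "haiku": 0.67, "opus": 0.78}))  # model-specific by accident
```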

A quarterly fork-detection audit. Run a similarity scan across the prompt library and surface near-duplicates that the team didn't realize had drifted. The scan is cheap — embedding similarity on the prompt bodies, with a threshold tuned to flag prompts that share more than 80% of their content but aren't formally linked. The output is a list of probable accidental forks, ranked by usage. The team picks the top ten each quarter and either consolidates them into canonical prompts with model-conditional sections, or formally documents why the divergence is intentional. Either is fine. The pathology is the silent middle case where the divergence is undocumented.
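
A sketch of the scan. The paragraph above suggests embedding similarity; the stand-in below uses a stdlib character-level ratio so it runs without a model, with 0.8 as the flag threshold. In practice you would join the output against usage counts before ranking.

```python
from difflib import SequenceMatcher
from itertools import combinations

def probable_forks(prompts: dict[str, str], threshold: float = 0.8) -> list[tuple[str, str, float]]:
    """Flag prompt pairs whose bodies overlap enough to look like accidental forks."""
    pairs = []
    for (name_a, body_a), (name_b, body_b) in combinations(prompts.items(), 2):
        ratio = SequenceMatcher(None, body_a, body_b).ratio()
        if ratio >= threshold:
            pairs.append((name_a, name_b, round(ratio, 2)))
    return sorted(pairs, key=lambda p: -p[2])
```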

A routing-aware deprecation playbook. When a model is being deprecated, the team that owns the routing layer cannot decommission the model until every prompt that was validated against that model has been re-validated against its replacement. The eval triple — (prompt, model, dataset) — gives them a precise list of what needs to be re-validated. Without that triple, deprecation is a fishing expedition; with it, deprecation is a checklist.
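
With the triple recorded, the checklist is a set difference, sketched below under the same assumed shapes as the CI gate above.

```python
def deprecation_checklist(
    eval_results: set[tuple[str, str, str]],  # (prompt_version, model_id, dataset_slice)
    deprecated_model: str,
    replacement_model: str,
) -> list[tuple[str, str]]:
    """Every (prompt, dataset) validated on the retiring model but not yet on its replacement."""
    validated_on_old = {(p, d) for (p, m, d) in eval_results if m == deprecated_model}
    already_on_new = {(p, d) for (p, m, d) in eval_results if m == replacement_model}
    return sorted(validated_on_old - already_on_new)
```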

A library is a contract, not a folder

The mental model that makes all of this click is to stop thinking of the prompt library as a folder of text files. A prompt library is a multi-author monorepo for a stochastic system, and every other discipline that monorepos require — single source of truth per intent, explicit interface contracts, CI gating, deprecation policy, drift detection — applies to it. The fact that the artifacts are natural-language strings rather than typed code does not exempt the library from the discipline. It makes the discipline harder, because the contracts that strings encode are implicit, and the consumers of those contracts (the models) are not deterministic.

The team that ships its first LLM feature on a single model rarely sees the fork forming. The team that ships its tenth feature across three models almost certainly has the fork. The intermediate stage is the dangerous one — the team is still treating the library as a shared folder, the model dimension is still treated as a routing-layer concern, and the eval suite is still treating prompts as if they had no model-specific behavior. By the time the symptoms appear in production, the divergence is months deep and the cleanup is expensive.

The cheapest place to add the discipline is at the second feature. The next-cheapest place is right now. Every quarter you wait, the library accretes another layer of accidental forks, and the model upgrade six months from now becomes a little more expensive than the last one.

A shared prompt library that doesn't enforce a single-source-of-truth invariant isn't a library. It's a coordination debt with a folder name.
