The Eval Migration Tax: Why a Prompt Schema Change Wrecks 800 Test Cases
Every AI team I've watched ship a "small" output schema change has lived through the same week. Someone renames a field in the system prompt — say, summary becomes tldr, or the tool catalog gains a required confidence parameter — and the next CI run lights up red across 800 eval cases that have nothing to do with the change. The prompt diff is fifteen lines. The eval diff is a four-day migration project nobody scoped, owned, or budgeted.
This is the eval migration tax. It is the maintenance cost no roadmap accounts for, paid in delayed releases that get blamed on "flaky tests" rather than the architectural choice that actually caused them. Most teams pay it for years before they recognize the pattern, because each individual incident looks like ordinary churn. The compounding only becomes visible when you tally the engineering hours spent migrating evals across a quarter and realize they exceed the hours spent improving the model behavior the evals were supposed to measure.
The reframe that fixes this is borrowed from a problem the database community solved thirty years ago: when your data has structure that evolves, you cannot treat the data as static fixtures. You need migrations, versioning, and tooling that makes schema change cheap. Eval suites are versioned data. The teams that ship AI features fastest in 2026 are the ones who figured this out first, and the teams stuck in eval-migration debt are the ones still treating their golden sets like checked-in JSON blobs that will live forever in their original shape.
How the Tax Accumulates
The eval suite usually starts small and clean. A prompt engineer writes thirty examples that capture the behaviors that matter, encodes the expected output as a literal JSON string, and wires up a CI check that compares actual to expected with strict equality. It works. The team ships. Confidence rises.
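A minimal sketch of that day-one harness — the file path and the call_model wrapper are placeholders, not anything your stack necessarily has:

import json

def call_model(user_input: str) -> str:
    # Stand-in for whatever wrapper sends the input to the model
    # and returns its raw string output.
    raise NotImplementedError

def load_cases(path: str = "evals/golden.json") -> list[dict]:
    with open(path) as f:
        return json.load(f)

def test_golden_cases():
    for case in load_cases():
        actual = call_model(case["input"])
        # Byte-for-byte equality: any change to field names, key order,
        # or phrasing in the output schema fails every case at once.
        assert actual == case["expected_output"], case["input"]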
Then the suite grows. Production incidents add cases. Stakeholders request coverage for their corner of the workflow. The number of test cases doubles every quarter. By month nine you have 800 cases, and roughly none of them have been re-examined since the day they were added. They are accreted, not curated.
Now consider what happens when the system prompt's output schema changes. Maybe you added a new required field for downstream parsing. Maybe an LLM upgrade interprets your tool calling schema more strictly than the previous model did and you have to tighten parameter names. Maybe a product requirement renamed a category. The change to the prompt is small. The change to the expected outputs of every test case touches all 800 records.
Three failure modes emerge:
- Brittle assertions tied to literal output strings. The eval doesn't care whether the model still reasons correctly; it cares that byte-for-byte the new output equals the old fixture. Renaming a field invalidates the entire fixture.
- Tool-call shape coupling. Eval cases assert that the model called search_users(filter="active") with that exact parameter name. Add a limit parameter and every assertion that checked the old shape breaks, even though the model's intent is unchanged.
- Implicit temporal coupling. A case from eight months ago expected a response that referenced "Q3 2025." The eval still asserts that string today. The prompt has nothing wrong with it; the case is stale.
When all three layers fail at once, the team faces a choice: spend a week migrating the eval suite by hand, or skip the schema change. Both are expensive. The team usually picks the cheaper-looking option in the moment — skip the change — and the prompt accumulates technical debt that compounds the next time someone tries to touch it.
The Architectural Mistake
The mistake at the bottom of all of this is treating eval suites as static fixtures instead of versioned data. A static fixture is a frozen artifact: the value is the value, and any change to the surrounding system that breaks the value is a problem to be solved by rewriting the value. A versioned dataset is the opposite: the data has a schema, the schema has a version, and changes are mediated by migrations that operate on the data programmatically.
The database community recognized this pattern decades ago. A production database with a million rows cannot be migrated by hand-editing rows when the schema changes. Tools like Flyway and Alembic exist because every meaningful schema change requires a script that knows how to transform the data — and those scripts become first-class artifacts in the repository, reviewed, tested, and version-controlled alongside the schema itself.
Eval suites have the same change-amplification problem. One schema change touches every record. Without migration tooling, every change becomes an O(N) hand-edit project, where N is your test case count. With migration tooling, the human cost drops to O(1): one script, written once, that transforms the whole dataset in a single pass.
The architectural realization is that the prompt schema and the eval dataset are coupled the same way an application's ORM models and its database are coupled. You don't manage that coupling by promising never to change the models. You manage it by making schema change a normal, instrumented operation.
Decoupling Inputs from Output Assertions
The first concrete practice is to keep eval inputs and eval assertions in separate layers. The input is what you send to the model: the user's query, the conversation history, the tool catalog, the system prompt. The assertion is what you check about the response.
Most home-grown eval suites collapse these into a single record:
{
"input": "What's my balance?",
"expected_output": "{\"answer\": \"Your balance is $42.\", \"confidence\": 0.9}"
}
This is convenient on day one and catastrophic by month nine. The input describes a behavior that should remain stable across schema changes — "the user asked for their balance, the model should answer with the balance." The expected output describes one specific encoding of that behavior at one specific point in time. When you couple them, you forfeit the ability to change the encoding without rewriting the behavior description.
The fix is to make assertions semantic rather than literal. Instead of asserting that the response equals a JSON blob, assert that the response contains a balance field whose value is "$42" or that the model called the get_balance tool with the user's account ID. Frameworks like Promptfoo and DeepEval support this with composable assertion types — JSON path checks, contains-substring, structural matching, LLM-as-judge — that survive schema changes the literal-equality approach cannot.
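What that looks like in code, as a Python sketch rather than any particular framework's API — assume the response arrives as a JSON string and tool calls as a list of name/arguments dicts; the helper names and the account ID are illustrative:

import json

def assert_answers_with_balance(response_text: str) -> None:
    # Structural check: parse the response, then look only at the field
    # the behavior is about. Key order, extra fields, and renames elsewhere
    # in the schema no longer matter.
    payload = json.loads(response_text)
    assert "$42" in payload["answer"], payload

def assert_called_get_balance(tool_calls: list[dict]) -> None:
    # Match on the tool name and the one argument that carries intent;
    # tolerate any extra parameters the schema grows later.
    matches = [c for c in tool_calls if c["name"] == "get_balance"]
    assert matches, tool_calls
    assert matches[0]["arguments"].get("account_id") == "acct-123"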
The harder discipline is writing assertions that describe intent rather than form. "The model should refuse this request" is more durable than "the model should output {\"refused\": true}." Refusal can be expressed as a flag, a natural-language sentence, or a tool call to escalate_to_human — the intent is the same and only the form changes. Anchor your assertion to the intent.
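One way to anchor that refusal assertion to intent, sketched under assumptions about the response shape — the flag name, tool name, and phrase list are illustrative, and many teams would back the free-text case with an LLM-as-judge check:

def is_refusal(response: dict) -> bool:
    # All three forms count as the same intent: an explicit flag,
    # an escalation tool call, or refusal language in the answer text.
    if response.get("refused") is True:
        return True
    if any(call.get("name") == "escalate_to_human"
           for call in response.get("tool_calls", [])):
        return True
    text = (response.get("answer") or "").lower()
    return any(phrase in text for phrase in ("can't help", "cannot help", "unable to assist"))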
Schema-Versioned Eval Records
The second practice is to record schema version metadata on every eval record. Each test case carries a schema_version field (and ideally a prompt_version and model_version) that names the schema it was authored against.
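One possible record shape, sketched as a Python TypedDict — the fields beyond schema_version are whatever your suite actually needs:

from typing import TypedDict

class EvalRecord(TypedDict):
    id: str
    schema_version: int      # output schema this case was authored against
    prompt_version: str      # e.g. a git tag or semver for the system prompt
    model_version: str       # model the expected behavior was observed on
    input: dict              # user query, history, tool catalog reference
    assertions: list[dict]   # semantic checks, not literal expected strings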
When the schema changes, you bump the version. Old records continue to live in the dataset under their old schema version — a CI job runs them against the corresponding old prompt to detect regressions, or quietly retires them when the underlying behavior is no longer supported. New records are authored against the new schema version. A migration script operates on the boundary, transforming old records to the new shape where the transformation is mechanical.
This sounds like overhead until the first time you do a model migration. When you swap the underlying model and a hundred eval cases break, you need to know whether they broke because the new model is worse or because the old assertions were tied to incidental quirks of the old model's output. Schema-versioned records make that diagnosis tractable: you can replay each version of the suite against each version of the prompt and see which behaviors are stable and which were accidents.
A practical implementation: every eval record is a row in a versioned table or a JSON document with a schema_version field. The schema for each version is checked into the repository as a JSON Schema or TypeScript type. CI rejects any new record whose shape doesn't match the latest schema. The migration tool — and there should be exactly one — knows how to transform records from version N to version N+1 by reading both schemas.
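A sketch of how that single migration tool could hang together — plain functions registered one per version boundary; the decorator and the LATEST_SCHEMA_VERSION constant are illustrative, not any specific library's API:

from typing import Callable

MIGRATIONS: dict[int, Callable[[dict], dict]] = {}   # from_version -> transform
LATEST_SCHEMA_VERSION = 3                             # illustrative

def migration(from_version: int):
    def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        assert from_version not in MIGRATIONS, "exactly one migration per boundary"
        MIGRATIONS[from_version] = fn
        return fn
    return register

def upgrade_to_latest(record: dict) -> dict:
    # CI runs this over every record, then validates the result against the
    # latest checked-in schema before the suite executes.
    while record["schema_version"] < LATEST_SCHEMA_VERSION:
        record = MIGRATIONS[record["schema_version"]](record)
    return record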
This is, structurally, exactly what Alembic and Flyway do for relational data. The pattern transfers cleanly.
Migration Scripts as First-Class Artifacts
The third practice follows from the second: when the schema changes, the engineer making the change writes a migration script. Not a one-off find/replace. Not a manual editing pass. A real script, checked into the repository, code-reviewed alongside the prompt change, and run as part of the schema change PR.
The shape of an eval migration is usually simple. Rename summary to tldr: a function that walks every record and renames the field. Add a required confidence parameter to a tool call: a function that walks every tool-use case and adds a default value, or marks the case as needing manual review. Drop a category that no longer exists: a function that finds and removes records referencing the old category, or rewrites them to a sensible substitute.
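Sketches of the first two of those migrations, assuming the assertion-based record shape from earlier — the field names, assertion type strings, and default value are all illustrative, and in practice each function would be registered with whatever runner your suite uses:

def rename_summary_to_tldr(records: list[dict]) -> list[dict]:
    # Schema v1 -> v2: the output field "summary" is now called "tldr".
    for record in records:
        for assertion in record["assertions"]:
            if assertion.get("field") == "summary":
                assertion["field"] = "tldr"
        record["schema_version"] = 2
    return records

def add_required_confidence(records: list[dict]) -> list[dict]:
    # Schema v2 -> v3: tool calls gain a required confidence argument.
    # A default keeps the suite runnable, but mark the case for human review
    # rather than pretending the default is a verified expectation.
    for record in records:
        for assertion in record["assertions"]:
            if assertion.get("type") == "tool_called":
                assertion.setdefault("arguments", {}).setdefault("confidence", 0.5)
                record["needs_review"] = True
        record["schema_version"] = 3
    return records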
What makes this transformative is that the migration code becomes a permanent part of the prompt's history. Six months later, when a new engineer wonders why a particular field is named tldr, they can git log the prompt and find the migration script that renamed it from summary, with a commit message explaining why. The schema isn't documented in a separate wiki page that goes stale; it's documented in the executable record of how the schema got to its current shape.
A useful test for whether your team has internalized this practice: ask whoever shipped the last prompt schema change whether they wrote a migration script. If the answer is "no, I just updated the test fixtures by hand," you do not yet have a migration discipline. You have a manual-labor discipline that happens to work right now because your dataset is small enough that hand-editing is tractable.
Eval Ownership Lives With the Prompt
The fourth practice is organizational rather than technical. Eval ownership must live with whoever owns the prompt — not in a separate QA backlog, not on a downstream test team, not on a "platform" group that reviews PRs from twenty product teams.
The reason is simple: every prompt change implies an eval change, and any handoff between the prompt owner and the eval owner adds latency and drops information. If the prompt engineer ships a schema change and someone else has to migrate the evals, the schema change is half a release and the migration is the other half. They will be done out of order, by people with different mental models, and the bug surface lives in the gap between them.
The functional alternative is co-location. The same engineer (or pair) who writes the prompt also writes the eval. The same PR that changes the prompt schema changes the eval schema. The same review that scrutinizes the prompt scrutinizes the migration script. CI runs the new evals against the new prompt before merge. There is no handoff because there is no separation.
This is uncomfortable for organizations that grew up with a clean dev/QA split. But the dev/QA split assumed software whose behavior is deterministic enough that QA can write tests in a separate workstream and trust them to remain valid as the software evolves. LLM-driven systems do not satisfy that assumption. The expected behavior shifts continuously with the prompt, the model, and the tool catalog. Tests that aren't co-located with the artifact they describe go stale almost immediately.
What This Looks Like in Practice
A team that has internalized these practices has eval suites that grow at the same pace as their prompt complexity, not faster. Schema changes ship as small diffs that include a prompt change, an eval schema change, and a migration script — reviewable in a single PR. Old eval records continue to exist and run against old prompt versions, providing regression coverage that survives schema evolution. Model migrations become a few-day exercise rather than a multi-week archaeological dig through fixtures nobody understands.
A team that has not internalized them has a CI suite that has been mostly red for three weeks because nobody has had time to migrate the evals after the last prompt change. The team learns to ignore CI. Real regressions slip through the noise. Eventually someone proposes rewriting the eval suite from scratch — and the cycle restarts.
The eval migration tax is the price of treating versioned data like static fixtures. Pay it now, in tooling and discipline, or pay it forever, in delayed releases nobody knows how to attribute.
- https://hamel.dev/blog/posts/evals/
- https://newsletter.pragmaticengineer.com/p/evals
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://docs.langchain.com/langsmith/evaluation-concepts
- https://www.confident-ai.com/blog/llm-testing-in-2024-top-methods-and-strategies
- https://www.promptfoo.dev/docs/configuration/expected-outputs/
- https://safjan.com/the-real-cost-of-model-migration-what-swapping-llms-actually-requires/
- https://collinwilkins.com/articles/structured-output
- https://www.braintrust.dev/articles/llm-evaluation-guide
- https://callsphere.ai/blog/upgrading-llm-models-production-gpt35-gpt4-gpt5-migration
- https://www.datafold.com/blog/modern-data-migration-framework/
- https://www.red-gate.com/hub/university/learning-pathways/database-devops-learning-pathway/database-versioning-source-control/level-0/the-what-why-and-how-of-database-versioning-with-flyway/
