The Specification Translation Tax: When Spec, Prompt, and Eval Drift Apart
A PM writes a feature spec in English. An engineer translates it into a system prompt with idiomatic LLM patterns — chain-of-thought scaffolding, output format coercion, a few hedge clauses to cover failure modes the spec never mentioned. An eval author opens the same spec, re-reads it cold, and writes JSON test cases against their interpretation. Three weeks later, all three artifacts disagree, and nobody can tell whether a regression is a prompt bug, a spec-implementation gap, or an eval that was wrong from day one.
This is the specification translation tax. Traditional software has it too — the gap between PRD and code, between code and tests — but compilers and type systems narrow it. AI features have no such backstop. The prompt is documentation that the system actually reads. The eval is a contract that nobody signed. The spec is a description of intent that nobody enforces. Each is a translation of the same intent into a different medium, and without bidirectional consistency, behavior leaks in through whichever artifact is easiest to edit.
Three artifacts, three editors, no canonical source
The structural problem is ownership. Prompts sit at the intersection of product intent, legal interpretation, and technical execution — no existing role owns them naturally. The PM owns the spec because they wrote it. The engineer owns the prompt because they ship it. The eval author — sometimes a third person, sometimes the same engineer wearing a different hat — owns the test suite because they built the harness. None of them owns the agreement between the three.
In a healthy traditional system, the spec describes intent, the code implements it, and the tests verify the implementation. The flow is one-directional: spec → code → test. When tests fail, you look at the code first because the spec is rarely a meaningful place to start debugging behavior. Reviewers catch spec/code drift in PRs by reading the diff against the linked ticket.
In an LLM feature, the relationship is triangular and circular. The prompt encodes behavior the spec didn't specify (because hedge clauses emerged during prompt engineering). The eval verifies behavior the prompt doesn't actually promise (because the eval author projected their reading of the spec onto the test cases). The spec describes intent that neither the prompt nor the eval reflects (because the spec was never updated when the prompt and eval evolved). All three are living documents. None of them is canonical. There is no compiler error when they disagree — there is only a future regression that no one can attribute.
A team I talked to recently described this exact failure: an eval started failing after a prompt edit, and the post-mortem took three days because the team had to reconstruct the intended behavior from scratch. The spec was 18 months old and the prompt had absorbed a dozen tribal patches. The fix wasn't a code change. It was a meeting.
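To make the triangle concrete, here is a hypothetical set of artifacts for one feature. Every name, string, and assertion below is invented for illustration: the prompt carries scaffolding and a hedge clause the spec never mentions, and the eval asserts a length and a detail that neither the spec nor the prompt actually promises.

```python
# One feature, three artifacts, three authors. All hypothetical.

SPEC = "Summarize the support ticket in 2-3 sentences for the on-call engineer."

SYSTEM_PROMPT = (
    "Think step by step, then summarize the ticket.\n"    # scaffolding the spec never asked for
    "Respond in plain text only, at most 3 sentences.\n"  # format coercion the spec never mentions
    "If the ticket is empty, reply exactly: NO CONTENT."  # hedge clause invented during prompt work
)

EVAL_CASE = {
    "input": "fixtures/ticket_4812.txt",
    # The eval author's reading: stricter on length than the spec ("<= 2" vs. "2-3"),
    # and it demands a detail ("mentions the affected service") that nobody promised.
    "assert": "response has <= 2 sentences and mentions the affected service",
}
```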
How the drift accumulates
Prompts drift faster than specs because they're cheaper to edit. An engineer notices the model returning markdown when the downstream parser expects plain text — they add "respond in plain text only" to the system prompt. The change ships. The spec doesn't mention output format because the original PM didn't think to. The eval doesn't catch the change because the test fixtures were built when the model still returned markdown and the eval normalizes formatting before comparison. Six months later, a new engineer sees the "respond in plain text only" line and removes it, thinking it's a redundant instruction. The downstream parser breaks in production. Whose fault?
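A minimal sketch of how that masking works, assuming a string-matching eval with a normalization step (all names and fixtures here are hypothetical):

```python
import re

def normalize(text: str) -> str:
    """Strip markdown markers before comparing, so a format change never fails the eval."""
    text = re.sub(r"[*_`#]+", "", text)                   # emphasis, code, heading markers
    text = re.sub(r"^\s*[-+]\s+", "", text, flags=re.M)   # bullet markers
    return " ".join(text.split()).lower()

def passes(response: str, expected: str) -> bool:
    # Passes whether the model returns markdown or plain text, so removing
    # "respond in plain text only" from the prompt changes nothing here,
    # even though the downstream parser breaks on markdown.
    return normalize(expected) in normalize(response)

# Both the old (markdown) and new (plain text) behavior pass the same case:
assert passes("**Restarted** the `auth` service at 02:14.", "restarted the auth service at 02:14.")
assert passes("Restarted the auth service at 02:14.", "restarted the auth service at 02:14.")
```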
Evals drift in the opposite direction — they're harder to edit, so they tend to encode an older understanding of correctness. When a behavior shifts (intentionally), the eval is updated to match the new behavior, but the assertion is often loosened rather than rewritten — "the response should mention the user's name" becomes "the response should be relevant to the user," because the eval author found a fast way to make the failing test pass. The contract is silently weakened. The spec, the prompt, and the eval now describe three different things, and the eval describes the most permissive one.
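A sketch of that silent weakening, with a stub standing in for an LLM-as-judge call (everything here is hypothetical):

```python
def judge_relevance(response: str, user: dict) -> float:
    """Stand-in for an LLM-as-judge call with a lenient rubric: almost any
    coherent response scores well, personalized or not."""
    return 0.9 if response.strip() else 0.0

# v1: the assertion the spec implied. Specific, falsifiable, fails loudly.
def check_v1(response: str, user: dict) -> bool:
    return user["name"] in response

# v2: the "fix" after a behavior change, i.e. the fastest way to turn the
# red test green. Personalization can now vanish without any eval noticing.
def check_v2(response: str, user: dict) -> bool:
    return judge_relevance(response, user) > 0.5

user = {"name": "Priya"}
assert not check_v1("Here's your account summary.", user)  # v1 catches the regression
assert check_v2("Here's your account summary.", user)      # v2 waves it through
```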
Specs drift slowest because nobody reads them after launch. The spec is the artifact written before the behavior existed, and in most teams' working model its job ends the moment the feature ships. The PM moves on to the next feature. The engineer never reopens it. The eval author refers to it once, while building the test fixtures, and never again. The spec calcifies into legacy documentation, and the canonical understanding of what the feature should do migrates into tribal knowledge. That knowledge usually lives in the engineer who shipped the feature, and that engineer will leave the team within 18 months.
The Gulf of Specification — the gap between what we want the LLM to do and what our prompts actually instruct it to do — exists at every step of this chain, and each translation widens it. Underspecified prompts force the model to guess intent, which leads to inconsistent outputs, which leads to evals that try to encode whatever the model happens to do, which becomes the new de facto spec. The system optimizes itself toward whichever artifact is easiest to query, and that's almost never the document the PM wrote.
What "spec as source of truth" actually requires
Spec-driven development as a movement gets the diagnosis right: the spec must be the source of truth, not the code. But moving the source of truth doesn't help unless the other artifacts are generated from or checked against the canonical one. A spec that lives in a separate document, edited at a different cadence, with no enforced consistency to the prompt and eval, is just another translation that drifts.
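What "checked against" can mean in practice, at its most minimal: a CI step that refuses to pass while the prompt and eval are pinned to a stale version of the spec. The file names and the `spec-version:` pin convention below are assumptions for the sketch, not an existing tool:

```python
# Sketch of a CI check: prompt and eval files pin the spec version (a content
# hash) they were written against; the build fails whenever the spec changes,
# until both artifacts are re-reviewed and re-pinned.
import hashlib
import pathlib
import re
import sys

spec_hash = hashlib.sha256(pathlib.Path("spec.md").read_bytes()).hexdigest()[:12]

for artifact in ("system_prompt.txt", "evals/cases.json"):
    text = pathlib.Path(artifact).read_text()
    pinned = re.search(r"spec-version:\s*([0-9a-f]{12})", text)
    if not pinned or pinned.group(1) != spec_hash:
        sys.exit(f"{artifact} is pinned to a different spec version "
                 f"(expected {spec_hash}); re-review it against spec.md and re-pin.")

print("spec, prompt, and eval all reference the same spec version")
```

The check proves nothing about semantic agreement, but it converts silent drift into a loud, attributable failure: nobody can edit the spec, the prompt, or the eval without the other two artifacts demanding a re-review.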
