Prompt Deprecation Contracts: Why a Wording Cleanup Is a Breaking Change
A four-word edit on a system prompt — "respond using clean JSON" replacing "output strictly valid JSON" — once produced no eval movement, shipped on a Thursday, and was rolled back at 4am Friday after structured-output error rates went from 0.3% to 11%. The prompt did not get worse. It got different, and the parsers downstream of it had been pinned, without anyone noticing, to the literal phrase "strictly valid."
This is the failure mode that most prompt-engineering teams have not yet built tooling for: the prompt was treated as text the author owned, when it was in fact a contract with consumers the author never met. Some of those consumers are other prompts that quote the original verbatim. Some are tool descriptions whose JSON schema fields anchor on a particular adjective. Some are evals whose rubrics ask the judge to check for "the strictly valid format." And some are parsers — the most brittle category — whose regexes were calibrated to the exact preamble the model used to emit.
A "small wording cleanup" silently breaks parsers, shifts judge calibration, and invalidates weeks of eval runs. None of these failures show up on the PR. All of them show up on the dashboard a week later as drift.
Prompts Have Consumers, and the Consumers Are Pinned
When you publish a REST endpoint, the consumers are loud. They write integration tests, they show up in your error logs when you change a field, and if you rename userID to user_id they file a Jira ticket inside an hour. When you publish a prompt, the consumers are silent — and there are more of them than you think.
Walk through a typical AI feature and count:
- Other prompts that quote you. A retrieval-aware system prompt that says "follow the format the planner uses" depends on the planner's format being stable; a chat-mode prompt that pastes in the agent's instructions verbatim depends on the agent's wording being stable. These quotations are usually invisible — copy-pasted at integration time, not linked.
- Tool descriptions. A function-calling tool's natural-language description ("call this when the user wants to look up an order the way the assistant was instructed to phrase it") inherits assumptions from the system prompt. Edit the system prompt's vocabulary and the tool description is now describing a behavior the model is no longer producing.
- Evals and judge rubrics. An LLM-as-judge rubric that scores "did the model respond in a strictly valid JSON envelope" has been calibrated against thousands of past outputs. Change the system prompt to say "clean parseable JSON" and the model emits the same content, but the judge — anchored on the literal phrase "strictly valid" — starts marking it ambiguous. Calibration drift is a known failure of judge-based evaluation, and prompt edits are one of its most reliable triggers.
- Parsers. Regex extractors, JSON-shape validators, and chain-of-thought extractors all depend on stable preambles. "Sure, here's the response:" being replaced with "Here you go:" is a one-character diff for the author and a parse failure for the consumer.
- Chained agents. When agent A's output is agent B's input, the second agent's prompt has been tuned to expect a particular shape from the first. A "polish pass" on agent A becomes a regression for agent B that nobody catches because the eval is per-agent.
Each of these consumers is a contract you shipped without realizing it. The wording is the API. The wording is what the next system parses, scores, or quotes.
Why Eval-Green Is Not a Safe-to-Ship Signal
The intuition most teams ship with — "if the eval suite is green, the prompt change is safe" — is wrong in a way that's hard to see until you've eaten the incident.
The eval grades the output. The consumer pins the wording. Those are different surfaces.
A prompt change can leave the output distribution effectively unchanged on the eval set (so the score doesn't move) while shifting the surface form just enough that a downstream parser stops matching. The judge sees an answer that's still correct; the parser sees text it can no longer extract a field from. The eval reports green. Production reports a 4xx spike from the consuming service.
This is the same shape as a backwards-incompatible API change that doesn't break the API's behavior — only its interface. A REST team would never ship a rename of created_at to creationTime and call it a patch release. A prompt team ships the equivalent every week and calls it a "small cleanup."
The fix is not bigger evals. The fix is recognizing that prompts have two contracts — a behavioral contract the eval can measure, and a wording contract the eval cannot. Both have to be versioned.
What a Prompt Deprecation Contract Looks Like
Borrowing the API discipline directly:
A prompt-version registry with semver-style classification. Every prompt gets a version. Edits get classified at PR time:
- https://www.getmaxim.ai/articles/prompt-versioning-and-its-best-practices-2025/
- https://agenta.ai/blog/prompt-versioning-guide
- https://blog.promptlayer.com/5-best-tools-for-prompt-versioning/
- https://www.braintrust.dev/articles/what-is-prompt-versioning
- https://launchdarkly.com/blog/prompt-versioning-and-management/
- https://deepchecks.com/llm-production-challenges-prompt-update-incidents/
- https://dev.to/novaelvaris/prompt-contracts-treat-prompts-like-apis-inputs-outputs-errors-1i7k
- https://understandingdata.com/posts/prompt-contracts-specification-before-code/
- https://www.langchain.com/articles/llm-as-a-judge
- https://deepchecks.com/llm-judge-calibration-automated-issues/
- https://mlflow.org/prompt-registry
- https://testrigor.com/blog/why-devops-needs-a-promptops-layer/
