
Prompt Deprecation Contracts: Why a Wording Cleanup Is a Breaking Change

9 min read
Tian Pan
Software Engineer

A four-word edit to a system prompt — "respond using clean JSON" replacing "output strictly valid JSON" — produced no eval movement, shipped on a Thursday, and was rolled back at 4am Friday after structured-output error rates went from 0.3% to 11%. The prompt did not get worse. It got different, and the parsers downstream of it had been pinned, without anyone noticing, to the literal phrase "strictly valid."

This is the failure mode that most prompt-engineering teams have not yet built tooling for: the prompt was treated as text the author owned, when it was in fact a contract with consumers the author never met. Some of those consumers are other prompts that quote the original verbatim. Some are tool descriptions whose JSON schema fields anchor on a particular adjective. Some are evals whose rubrics ask the judge to check for "the strictly valid format." And some are parsers — the most brittle category — whose regexes were calibrated to the exact preamble the model used to emit.

A "small wording cleanup" silently breaks parsers, shifts judge calibration, and invalidates weeks of eval runs. None of these failures show up on the PR. All of them show up on the dashboard a week later as drift.

Prompts Have Consumers, and the Consumers Are Pinned

When you publish a REST endpoint, the consumers are loud. They write integration tests, they show up in your error logs when you change a field, and if you rename userID to user_id they file a Jira ticket inside an hour. When you publish a prompt, the consumers are silent — and there are more of them than you think.

Walk through a typical AI feature and count:

  • Other prompts that quote you. A retrieval-aware system prompt that says "follow the format the planner uses" depends on the planner's format being stable; a chat-mode prompt that pastes in the agent's instructions verbatim depends on the agent's wording being stable. These quotations are usually invisible — copy-pasted at integration time, not linked.
  • Tool descriptions. A function-calling tool's natural-language description ("call this when the user wants to look up an order the way the assistant was instructed to phrase it") inherits assumptions from the system prompt. Edit the system prompt's vocabulary and the tool description is now describing a behavior the model is no longer producing.
  • Evals and judge rubrics. An LLM-as-judge rubric that scores "did the model respond in a strictly valid JSON envelope" has been calibrated against thousands of past outputs. Change the system prompt to say "clean parseable JSON" and the model emits the same content, but the judge — anchored on the literal phrase "strictly valid" — starts marking it ambiguous. Calibration drift is a known failure of judge-based evaluation, and prompt edits are one of its most reliable triggers.
  • Parsers. Regex extractors, JSON-shape validators, and chain-of-thought extractors all depend on stable preambles. "Sure, here's the response:" being replaced with "Here you go:" is a one-line diff for the author and a parse failure for the consumer, as the sketch after this list shows.
  • Chained agents. When agent A's output is agent B's input, the second agent's prompt has been tuned to expect a particular shape from the first. A "polish pass" on agent A becomes a regression for agent B that nobody catches because the eval is per-agent.
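
A minimal sketch of the parser failure, assuming a Python extractor; the regex, the preamble, and the payload are illustrative, not from any real codebase:

```python
import re

# Hypothetical extractor pinned to the old preamble; the pattern and the
# payload shape are illustrative.
RESPONSE_RE = re.compile(r"Sure, here's the response:\s*(\{.*\})", re.DOTALL)

def extract_json(model_output: str) -> str | None:
    match = RESPONSE_RE.search(model_output)
    return match.group(1) if match else None

# Before the cleanup, the model's preamble matches:
extract_json('Sure, here\'s the response: {"order_id": 42}')  # '{"order_id": 42}'

# After the prompt edit the model says "Here you go:" instead:
extract_json('Here you go: {"order_id": 42}')  # None -- a silent parse failure
```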

Each of these consumers is a contract you shipped without realizing it. The wording is the API. The wording is what the next system parses, scores, or quotes.

Why Eval-Green Is Not a Safe-to-Ship Signal

The intuition most teams ship with — "if the eval suite is green, the prompt change is safe" — is wrong in a way that's hard to see until you've eaten the incident.

The eval grades the output. The consumer pins the wording. Those are different surfaces.

A prompt change can leave the output distribution effectively unchanged on the eval set (so the score doesn't move) while shifting the surface form just enough that a downstream parser stops matching. The judge sees an answer that's still correct; the parser sees text it can no longer extract a field from. The eval reports green. Production reports an error spike in the consuming service.

This is the same shape as a backwards-incompatible API change that doesn't break the API's behavior — only its interface. A REST team would never ship a rename of created_at to creationTime and call it a patch release. A prompt team ships the equivalent every week and calls it a "small cleanup."

The fix is not bigger evals. The fix is recognizing that prompts have two contracts — a behavioral contract the eval can measure, and a wording contract the eval cannot. Both have to be versioned.

What a Prompt Deprecation Contract Looks Like

Borrowing the API discipline directly:

A prompt-version registry with semver-style classification. Every prompt gets a version. Edits get classified at PR time:

  • Patch — typo fixes, comment-only changes, whitespace. Cannot change observable wording.
  • Minor — additive sections, new constraints, new examples that don't replace existing ones. Cannot remove or rename phrases other artifacts may quote.
  • Major — anything that removes, renames, or restructures wording that has been quoted, parsed, judged, or chained against. Requires a deprecation entry.

The classification belongs in the PR template, the same way a database migration declares whether it's expand or contract.
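
A minimal sketch of what a registry entry might look like, assuming a Python-based toolchain; every identifier and value here is illustrative:

```python
from dataclasses import dataclass, field
from enum import Enum

class ChangeClass(Enum):
    PATCH = "patch"  # typos, comments, whitespace; observable wording unchanged
    MINOR = "minor"  # additive only; no quoted phrase removed or renamed
    MAJOR = "major"  # removes or renames pinned wording; needs a deprecation entry

@dataclass
class PromptVersion:
    prompt_id: str                  # e.g. "order_assistant.system"
    version: str                    # semver string, e.g. "4.0.0"
    change_class: ChangeClass
    deprecations: list[str] = field(default_factory=list)  # phrases being retired

# Declared in the PR, the way a migration declares expand vs. contract:
change = PromptVersion(
    prompt_id="order_assistant.system",
    version="4.0.0",
    change_class=ChangeClass.MAJOR,
    deprecations=["output strictly valid JSON"],
)
```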

A consumer index for every prompt. A list — checked into the same repo as the prompt — of every artifact pinned to a phrase in it. Tool descriptions that quote the system prompt. Judge rubrics that anchor on specific adjectives. Parsers whose regex includes a literal preamble. Other prompts that copy a section. The consumer index is the blast radius the author needs to see before merging. The index is maintained by adding a backlink at the consumer side ("this regex depends on prompt X v3.2 saying 'strictly valid'") and reverse-indexed by tooling.
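
What the reverse-indexed view might look like, assuming the phrase-to-consumer mapping is kept as plain data; every path and phrase is illustrative:

```python
# A hypothetical consumer-index entry for one prompt, built from the backlinks
# declared at each consumer site.
CONSUMER_INDEX = {
    "order_assistant.system": {
        "strictly valid": [
            "parsers/order_extractor.py",    # regex includes the literal phrase
            "evals/judge_rubric.md",         # rubric anchors on the adjective
            "prompts/chat_mode.system.txt",  # quotes the section verbatim
        ],
    },
}
```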

A deprecation window for major changes. When you remove or rename wording that consumers depend on, the old wording stays alongside the new for a defined cycle — long enough for every consumer to migrate. The model is given both phrasings, the consumers are notified, and the old wording is retired only after the consumer index reports zero remaining pins. This is the same pattern as keeping created_at and creationTime both populated for a quarter while clients update.
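
The retirement check is small enough to sketch; how the window end and the pin list are stored is an assumption:

```python
from datetime import date

# Old wording may be retired only when the window has elapsed AND the
# consumer index reports zero remaining pins. Both conditions are required.
def can_retire(phrase: str, window_ends: date, pins: dict[str, list[str]]) -> bool:
    return date.today() >= window_ends and not pins.get(phrase)
```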

A CI gate that fails on undeclared breaking changes. If a PR removes a phrase that appears in the consumer index without bumping the major version and adding a deprecation entry, the PR fails. Not warns — fails. This is the load-bearing piece, because the rest of the discipline is voluntary and humans will skip voluntary discipline under shipping pressure.
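
A sketch of the gate, assuming the removed-phrase set and the pinned-phrase map are computed upstream from the prompt diff and the consumer index; the names are hypothetical:

```python
import sys

def gate(removed: set[str], pinned: dict[str, list[str]], deprecated: set[str]) -> int:
    undeclared = [(p, c) for p, c in pinned.items()
                  if p in removed and p not in deprecated]
    for phrase, consumers in undeclared:
        print(f"BREAKING: removed phrase {phrase!r} is still pinned by {consumers}")
    return 1 if undeclared else 0  # nonzero exit fails the PR: fails, not warns

if __name__ == "__main__":
    pinned = {"strictly valid": ["parsers/order_extractor.py", "evals/judge_rubric.md"]}
    sys.exit(gate({"strictly valid"}, pinned, deprecated=set()))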

None of this is exotic. It is, almost line-for-line, how a mature backend team treats a public REST endpoint.

The Hardest Part Is the Consumer Index

The registry and the CI gate are mechanical. The consumer index is the part that takes organizational work, because consumers add themselves opportunistically and authors usually have no incentive to register.

Three patterns that make it tractable:

Backlinks at the consumer site, not the prompt site. A judge rubric that depends on "strictly valid JSON" gets a header comment: # depends on system_prompt v3.x phrase "strictly valid". A regex extractor gets the same. The consumer carries the dependency declaration, because the consumer is the one that breaks. Tooling then walks the repo and reverse-indexes these declarations into the prompt's consumer view.
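
A sketch of that reverse-indexing pass; the comment convention the regex encodes is an assumption, not a standard:

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches backlink comments like:
#   # depends on system_prompt v3.x phrase "strictly valid"
BACKLINK_RE = re.compile(r'depends on (\S+) v[\w.]+ phrase "([^"]+)"')

def reverse_index(repo_root: str) -> dict[tuple[str, str], list[str]]:
    """Walk the repo and group consumer files by (prompt, pinned phrase)."""
    index: dict[tuple[str, str], list[str]] = defaultdict(list)
    for path in Path(repo_root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binaries and unreadable files
        for prompt, phrase in BACKLINK_RE.findall(text):
            index[(prompt, phrase)].append(str(path))
    return dict(index)
```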

Phrase-level fingerprinting. Hash the load-bearing phrases of a prompt and detect when a hash changes between versions. The CI gate can then ask: "phrase X changed; the consumer index lists three artifacts pinned to phrase X; is the deprecation entry present?" This is the same idea as a content-hash on a public asset — you don't have to enumerate consumers to detect a breaking change, you just have to detect that the contract surface moved.
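
A sketch, assuming load-bearing phrases are keyed by stable slot names so an edit shows up as a moved hash rather than a missing key; slot names and phrases are illustrative:

```python
import hashlib

def phrase_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

# The slot survives an edit; the hash does not.
v3 = {"output_format": phrase_hash("output strictly valid JSON"),
      "tool_gate": phrase_hash("call the lookup tool first")}
v4 = {"output_format": phrase_hash("respond using clean JSON"),
      "tool_gate": phrase_hash("call the lookup tool first")}

moved = [slot for slot in v3 if v3[slot] != v4.get(slot)]  # ["output_format"]
# The gate can now ask: does the consumer index list artifacts pinned to this
# slot, and is a deprecation entry present?
```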

A no-quote rule for new consumers. New consumers that want to depend on a prompt's wording must do so through a named, versioned excerpt — not a copy-paste. The excerpt becomes the API. The prompt author can change everything around the excerpt freely; the excerpt itself follows the deprecation rules. This is structurally identical to forcing API consumers to depend on a typed client rather than reverse-engineering the JSON shape.
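
One way the excerpt mechanism might look, with a hypothetical registry keyed by prompt, excerpt name, and version:

```python
# Consumers depend on a named, versioned excerpt rather than a copy-paste.
# The keys and wording are illustrative.
EXCERPTS = {
    ("order_assistant.system", "json_envelope", "v1"): "output strictly valid JSON",
}

def quote(prompt_id: str, name: str, version: str) -> str:
    """Resolve a named excerpt; a missing key fails loudly at build time."""
    return EXCERPTS[(prompt_id, name, version)]

# A judge rubric composes against the excerpt instead of pasting the phrase:
rubric = ("Check that the response follows the instruction to "
          f"{quote('order_assistant.system', 'json_envelope', 'v1')}.")
```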

The teams that get this right are the ones that treat the consumer index as a first-class repo artifact, reviewed in PRs the same way a CODEOWNERS file or an OpenAPI spec would be.

What Changes When You Treat Prompts This Way

Two things, mostly.

The first is that "small wording cleanups" stop being free. They get costed against the deprecation cycle the way any other breaking change does. Most of them don't survive that costing — and that's correct, because most of them weren't worth a parser outage either. The cleanups that are worth it move forward with a deprecation window and clean migrations, and the team learns the difference between cosmetic edits and contractual edits.

The second is that the prompt repo starts to look like a normal API repo. There's a registry, there's a versioning convention, there's a CI gate, there's a consumer index, there's a deprecation policy. The same engineers who would never ship a /v1/users rename without a migration plan stop shipping a system-prompt rename without one.

The architectural realization underneath all of this is that prompts have all the contract-stability problems of public APIs without any of the tooling. They're more brittle than typed APIs, because the consumers are pinned to natural-language phrases instead of structured field names. They're more diffuse, because the consumers are spread across prompts, tools, evals, parsers, and chained agents. And they're more invisible, because nothing in the standard prompt-engineering workflow surfaces the consumer side.

Until the team that ships prompts treats them with the same engineering discipline they apply to a /v1/users endpoint, the wording cleanup that ships on Thursday will keep paging the on-call engineer at 4am Friday — and the postmortem will keep concluding, falsely, that the model was the unstable part.
