
Contract Tests for Prompts: Stop One Team's Edit From Breaking Another Team's Agent

9 min read
Tian Pan
Software Engineer

A platform team rewords the intent classifier prompt to "better handle compound questions." One sentence changes. Their own eval suite goes green — compound-question accuracy improves 6 points. They merge at 3pm. By 5pm, three downstream agent teams are paging: the routing agent is sending refund requests to the shipping queue, the summarizer agent is truncating at a different boundary, and the ticket-tagger has started emitting a category that no schema recognizes. None of those downstream teams were in the review. Nobody was on call for "the intent prompt."

This is not a hypothetical. It is what happens when a prompt becomes a shared dependency without becoming a shared API. A prompt change that improves one team's metric can silently invalidate the assumptions another team built on top. And unlike a breaking API change, there is no deserialization error, no schema mismatch, no 500 — the downstream just starts making subtly worse decisions.

Traditional API engineering solved this decades ago with contract tests. The consumer publishes the shape of what it expects; the provider is obligated to keep that shape working. Pact, consumer-driven contracts, shared schemas — this is release-engineering orthodoxy for HTTP services. Prompts deserve the same discipline, and most organizations still treat them like sticky notes passed between teams.

Why Prompts Behave Like Unversioned APIs

A prompt has every property of an API contract. It has inputs (the variables interpolated into the template), outputs (the model's response, which downstream code parses), side effects (tool calls, routing decisions, retrieval queries), and behavioral invariants (tone, format, refusal conditions). Downstream consumers build mental models against all four. When any of them shift, consumers break.

The difference is that HTTP APIs fail loudly. Change user_id to userId in a JSON response and the consumer throws a deserialization error in the first integration test. Change "respond concisely" to "respond in a friendly tone" in a prompt and every downstream test still passes — the response is still valid JSON, still parseable, still within token limits. The break surfaces weeks later as a drift in some user-facing metric that nobody traces back to that PR.

Analysis of production LLM deployments has repeatedly pointed at prompt updates as the leading cause of unexpected behavior — ahead of model version changes and infrastructure failures. That ranking only makes sense once you accept that prompts are the least-tested, least-versioned, most-shared artifact in the stack.

The organizational failure mode is straightforward. Team A owns the prompt. Team A has a suite that tests what Team A cares about. Team B's agent calls Team A's prompt and parses the output, but Team B's tests mock the prompt or use fixtures from three months ago. Team A ships a "minor wording tweak." The test matrix never intersects. Production is the integration test.

Consumer-Driven Contracts, Translated to Prompts

Pact's core insight is that the consumer — not the provider — defines the contract. The downstream service writes a test that says, "when I call you with X, I expect a response matching Y." The provider is then obligated to run that test against their code before releasing. Contract drift gets caught at the producer's CI, not at the consumer's 3am incident.

For prompts, the equivalent is downstream-consumer eval pins: golden fixtures authored by each consuming team, checked into a shared location, and run automatically when any upstream prompt changes. The agent team that routes based on intent classification writes a fixture file that encodes their routing assumptions — "this input must classify as refund, not billing," "the confidence score must be parseable as a float," "the response must not exceed 50 tokens." When the platform team edits the classifier prompt, CI runs every downstream pin and refuses the merge if any fail.
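
As a sketch, a consumer-owned pin file for that routing team might look like the following. The file layout, field names, and the classifier's JSON output shape are illustrative assumptions, not a fixed format:

```python
# consumers/routing_agent.py -- hypothetical consumer-owned pin file.
# The routing team, not the platform team, authors and maintains this.
import json

PINS = [
    {
        # "This input must classify as refund, not billing."
        "input": "I was charged twice and want my money back",
        "expected_intent": "refund",
        "max_tokens": 50,  # rough budget; counted as whitespace tokens here
    },
]

def check_pin(pin, response_text):
    """Assert one routing-team assumption against a classifier response."""
    parsed = json.loads(response_text)             # output must stay valid JSON
    assert parsed["intent"] == pin["expected_intent"], "routing inversion"
    float(parsed["confidence"])                    # confidence must parse as a float
    assert len(response_text.split()) <= pin["max_tokens"], "response too long"
```

Each assertion encodes one downstream dependency: the routing decision, the parseable confidence score, the length bound. A wording tweak that breaks any of them now fails at the author's CI instead of in production.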

The critical word is "authored by." Platform teams writing their own downstream tests defeats the purpose — they will test what they care about, which is what caused the problem in the first place. Consumer-driven means consumer-owned. The agent team knows that a missing confidence score breaks their bucket sort; the platform team does not, and should not be expected to.

A practical setup looks like this. A shared repository or registry holds prompt definitions keyed by name and version. Each consumer contributes a test file under a consumers/ directory, named after their service. The prompt author's CI loads every consumer file, executes the pinned inputs against the proposed prompt, and validates against each consumer's assertions. A failure tells the author exactly which downstream service will break and why — no paging, no archaeology.
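
Assuming one Python pin file per consumer under consumers/, each exposing PINS and check_pin as above, the prompt author's CI step can be sketched in a few lines (names and layout are assumptions):

```python
# ci_run_consumer_pins.py -- sketch of the prompt author's CI step.
# Loads every consumer file, runs the pinned inputs against the proposed
# prompt, and reports exactly which downstream service breaks and why.
import importlib.util
import pathlib

def load_consumer_modules(consumers_dir="consumers"):
    for path in sorted(pathlib.Path(consumers_dir).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        mod = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(mod)
        yield path.stem, mod  # file name doubles as the service name

def run_all_pins(call_prompt, consumers_dir="consumers"):
    """call_prompt: input string -> model response for the proposed prompt."""
    failures = []
    for service, mod in load_consumer_modules(consumers_dir):
        for pin in mod.PINS:
            try:
                mod.check_pin(pin, call_prompt(pin["input"]))
            except AssertionError as err:
                failures.append((service, pin["input"], str(err)))
    return failures  # non-empty -> block the merge, name the broken consumer
```

The failure tuple carries the consumer's service name, so a red build reads "routing_agent breaks on this input" rather than an anonymous eval score dropping.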

The Three Gates That Actually Catch Regressions

Full-stack eval is expensive and slow. You do not want to run a thousand-row evaluation on every PR, and you definitely do not want to block a typo fix on a 40-minute test suite. A tiered approach, borrowed from promptfoo and prompt-contracts-style tooling, works in practice:

  • Structural gates (milliseconds, no model calls): validate that the template renders with the expected variables, that the output schema is preserved, that required tool definitions still compile. These catch renamed parameters, deleted placeholders, and malformed JSON schemas before anything hits a model.
  • Fixture gates (seconds, limited model calls): run the consumer-owned pins against a small, fast model or replay cached responses. Promptproof-style "deterministic contract checks" with recorded fixtures and zero live calls in CI are the fastest path here. These catch the 80% of breakages that are behavioral but obvious: format drift, missing fields, routing inversions.
  • Semantic gates (minutes, full eval): run the representative golden dataset with the real production model, compute embedding similarity against reference outputs, and flag anything beyond the drift threshold. Semantic diff tools surface meaning-level shifts that character-level git diff cannot — a rewording that changes intent without changing length. These run on main-branch merges or nightly, not every PR.
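
The first tier is cheap enough to show in full. A structural gate for template rendering can be as small as a placeholder diff — this sketch assumes Python str.format-style templates; swap in the parser for whatever templating engine you actually use:

```python
# structural_gate.py -- millisecond-level gate, zero model calls.
# Catches renamed parameters and deleted placeholders before anything
# hits a model. Assumes str.format-style {variable} templates.
import string

def structural_gate(template, expected_vars):
    """Return a list of structural violations; an empty list means pass."""
    found = {
        field
        for _, field, _, _ in string.Formatter().parse(template)
        if field
    }
    errors = []
    missing = expected_vars - found
    extra = found - expected_vars
    if missing:
        errors.append(f"deleted placeholders: {sorted(missing)}")
    if extra:
        errors.append(f"unknown placeholders: {sorted(extra)}")
    return errors
```

An edit that drops {history} from the classifier template fails this check in the linter stage, long before any fixture or semantic gate spends a model call on it.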

The layering matters. Structural gates should run on every push and fail in seconds. Fixture gates should run on every PR against the prompt path. Semantic gates are expensive enough to be scheduled rather than blocking — they exist to catch the subtle cases that fixture gates miss, not to gate merges.

Versioning Discipline: Semver for Prompts, Pinning for Consumers

Contract tests only work if consumers can choose when to adopt breaking changes. That requires versioning, and prompts need the same semver discipline code gets.

A workable convention: major version bump when the input schema changes (new required variable, renamed field) or the output contract changes (new format, different field names, removed tool). Minor bump for backward-compatible additions — an extra optional instruction, a new few-shot example, a refined refusal clause. Patch for wording-level edits that cannot plausibly change behavior: typos, grammar, whitespace.

Consumers pin to a major version. They get minor and patch updates automatically; a major bump requires them to run their own eval and update the pin explicitly. The prompt registry maintains at least two versions in parallel during migration windows — typically 30 to 90 days depending on blast radius. This is identical to how good HTTP API teams handle deprecation, and for the same reasons.
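
The pinning rule itself reduces to a few lines. This assumes a hypothetical registry API where available versions are plain semver strings and consumers record only a major line:

```python
# resolve_pin.py -- sketch of major-version pinning against a registry.
# Consumers within a pinned major line pick up minor/patch updates
# automatically; a new major line is never adopted implicitly.

def resolve(versions, pinned_major):
    """Pick the newest available version within the pinned major line."""
    candidates = [
        tuple(int(part) for part in v.split("."))
        for v in versions
        if int(v.split(".")[0]) == pinned_major
    ]
    if not candidates:
        raise LookupError(f"no version in major line {pinned_major}")
    return ".".join(str(n) for n in max(candidates))
```

A consumer pinned to major 2 silently moves from 2.1.0 to 2.2.1, but never to 3.0.0 — adopting the breaking change stays an explicit, evaluated decision.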

The discipline that breaks down in practice is the "patch" category. Teams convince themselves that "just fixing the wording" cannot possibly affect behavior — until the embedding similarity score for the production dataset drops 12%. The rule of thumb worth enforcing: if the change modifies any instruction or example, it is at least a minor version, regardless of how small the diff looks. Only comments, whitespace, and documentation strings are safe patch-level edits.

A shared prompt registry — with provenance metadata, version pins, and compatibility labels — is what makes all of this operable. Teams reference prompts by name@version rather than by copy-pasted string. Registry tooling is now common: MLflow, promptlayer, braintrust, agenta, and homegrown systems built on a git repository plus a lookup API all work. The choice of tool matters less than the invariant: no prompt in production has an unknown version or unknown consumers.
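
A minimal in-memory sketch of that invariant — the shape here is illustrative, and real registries such as MLflow or promptlayer carry much richer metadata:

```python
# prompt_registry.py -- toy registry enforcing the core invariant:
# no production prompt with an unknown version or unknown consumers.
from dataclasses import dataclass, field

@dataclass
class PromptVersion:
    name: str
    version: str  # semver string, e.g. "2.2.1"
    template: str
    consumers: list = field(default_factory=list)  # known downstream services

class Registry:
    def __init__(self):
        self._store = {}

    def publish(self, pv):
        self._store[f"{pv.name}@{pv.version}"] = pv

    def get(self, ref):
        """Look up a prompt by name@version; refuse orphans."""
        pv = self._store[ref]
        if not pv.consumers:
            raise ValueError(f"{ref} has no registered consumers")
        return pv
```

Consumers call get("intent-classifier@2.2.1") instead of interpolating a copy-pasted string, which is the whole point: the lookup is where version pins and consumer registration become enforceable.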

The Migration Path for Teams Already in the Mess

Most teams reading this already have prompts in production with no contracts, no versioning, and no visibility into who consumes what. The useful sequence is:

  1. Inventory the consumers. For each prompt used in production, find every call site — including prompts that are embedded in code, constructed dynamically, or passed as configuration. The output of this exercise is usually more alarming than expected; a single "intent classifier" often has five or six implicit consumers.
  2. Freeze the current version. Pin every consumer to a specific prompt version before anyone edits anything. This buys time to build the contract layer without adding new risk.
  3. Author one consumer pin per team. Each consuming team writes one fixture file covering their highest-risk assumption. Not a comprehensive suite — just the single "if this breaks I get paged" case. The goal is coverage breadth, not depth.
  4. Wire the gates into prompt CI. Structural gates first, because they are free. Fixture gates second. Semantic gates last, because they require an eval harness and a golden dataset you may not have yet.
  5. Adopt the versioning convention. Retroactively apply semver to existing prompts based on their edit history. Most will land at v1.x; a handful with wildly incompatible historical versions get multiple major releases documented.
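
Step 1 can start as crudely as a source scan. The helper below is a hypothetical starting point — a plain text match will miss dynamically constructed prompts, which is exactly why the inventory needs a human pass afterward:

```python
# inventory.py -- step 1 sketch: surface call sites of a shared prompt name.
# A text scan misses dynamically built references, but it finds the
# implicit consumers that nobody remembered existed.
import pathlib

def find_call_sites(root, prompt_name):
    """Return (file, line_number) pairs where the prompt name appears."""
    hits = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if prompt_name in line:
                hits.append((str(path), lineno))
    return hits
```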

None of these steps requires a full eval platform or a dedicated prompts team. What they require is the cultural shift from "prompts are text" to "prompts are interfaces." The engineering follows.

Prompts Need the Same Rigor Your Public API Gets

The question worth asking the next time somebody proposes a "quick wording tweak" in a shared prompt: would you make the equivalent change to a public REST endpoint without a version bump, a deprecation notice, and a consumer-visible migration plan? The instinct for HTTP APIs is reflexively cautious. The instinct for prompts is still reflexively casual — and every incident caused by an unreviewed prompt edit is evidence that the casual instinct is wrong for the blast radius.

Contract tests for prompts are not exotic. They are the same consumer-driven pattern HTTP APIs have used for a decade, ported to a new artifact. The tooling already exists. The registries already exist. What is usually missing is the organizational decision that prompts — shared, production prompts — are owned interfaces, not editable scratch space. Make that decision, and the engineering becomes straightforward. Skip it, and you will keep shipping the same outage with a different wording.
