Skip to main content

Prompt Contract Testing: How Teams Building Different Agents Coordinate Without Breaking Each Other

· 10 min read
Tian Pan
Software Engineer

When two microservices diverge in their API assumptions, your integration tests catch it before production does. When two agents diverge in their prompt assumptions, you find out when a customer gets contradictory answers—or when a cascading failure takes down the entire pipeline. Multi-agent AI systems fail at rates of 41–87% in production. More than a third of those failures aren't model quality problems; they're coordination breakdowns: one agent changed how it formats output, another still expects the old schema, and nobody has a test for that.

The underlying problem is that agents communicate through implicit contracts. A research agent agrees—informally, in someone's mental model—to return results as a JSON object with a sources array. The orchestrating agent depends on that shape. Nobody writes this down. Nobody tests it. Six weeks later the research agent's prompt is refined to return a ranked list instead, and the orchestrator silently drops half its inputs.

Prompt contract testing applies the ideas behind consumer-driven contract testing—familiar from tools like Pact in microservices—to the interfaces between LLM agents. The core idea is simple: downstream consumers define what they need, upstream providers commit to delivering it, and automated checks enforce the agreement before anything ships.

Why Multi-Agent Coordination Fails at Scale

A 2025 failure taxonomy study analyzed over 1,600 execution traces across production multi-agent systems and found three categories of failure. Specification problems—ambiguous role definitions and unclear task boundaries—account for 41.8% of failures. Coordination failures, where agents interpret shared context differently, cause another 36.9%. The remaining 21.3% are verification gaps: the system works in isolation but breaks at the seams.

The reliability math is brutal. If five agents each operate at 95% individual success rates, the system-level reliability compounds to about 77%. That sounds acceptable until you're running thousands of pipelines per day. And 95% is optimistic—agents interacting with other agents, not humans, often perform worse because they lose the error-correction benefit of a patient human in the loop.

The failures aren't random. They cluster at agent boundaries: the point where one agent's output becomes another agent's input. This is exactly where software systems failed before API contracts and schema validation became standard practice. The solution is the same—you have to make the interface explicit and test against it—but the implementation looks different when the interface is a natural language prompt rather than a typed function signature.

What a Prompt Contract Actually Is

A prompt contract is a machine-readable specification of what one agent will produce and what another agent requires. Think of it as an OpenAPI spec for inter-agent communication.

A minimal contract for a research agent's output might specify:

  • Output must be a JSON object
  • Must contain a sources array of at least one item
  • Each source must have url (string), relevance_score (float 0–1), and excerpt (string, max 500 chars)
  • Must include a summary string under 200 words
  • Must not include raw HTML tags in any string field

The consuming agent—the orchestrator—defines this contract based on what it actually needs to function. The producing agent—the researcher—is responsible for satisfying it. This is the consumer-driven inversion: the consumer is authoritative about requirements, not the producer.

When the research agent's prompt changes, automated tests run its output against the contract before the change deploys. If the new prompt produces a ranked_results array instead of a sources array, the contract check fails. The team that owns the research agent learns about the breaking change before the team that owns the orchestrator does.

The JSON Schema Layer

The practical implementation usually uses JSON Schema as the enforcement mechanism for structured output contracts. JSON Schema is declarative, language-agnostic, and understood by most LLM APIs' structured output features.

Tools like Promptfoo, LangSmith, and Arize Phoenix can all evaluate agent outputs against schemas automatically. But the schema itself isn't where the hard thinking happens. The hard thinking is defining the conformance test suite: a set of canonical inputs paired with expected output properties that a correct implementation should satisfy.

A conformance test for a customer support agent might specify that given a refund request mentioning an order number, the agent must:

  1. Extract the order number into a structured field
  2. Set action_required to true
  3. Not promise a specific refund timeline without a lookup
  4. Not contradict a preceding compliance check result in the same context

These aren't unit tests in the traditional sense—you can't assert exact string equality against an LLM output. They're behavioral contracts: properties that must hold regardless of the specific wording the model chooses. Evaluation frameworks use a mix of JSON Schema validation for structure and LLM-as-judge scoring for semantic properties.

Consumer-Driven Flow in Practice

The workflow mirrors how Pact works in microservices, adapted for the nondeterminism of LLMs.

The consumer team—the agent that calls the other—writes the contract first. This includes example inputs they'll send and properties they require in responses. They commit this to a shared contracts repository or a broker service like PactFlow.

The provider team—the agent being called—runs their agent against the consumer's test cases as part of their CI pipeline. If the provider's current prompt satisfies all the consumer's contracts, the pipeline passes. If a prompt change breaks a contract, the pipeline fails before deployment.

This inverts the typical conversation. Instead of the provider saying "here's what we output, build around it," the consumer says "here's what we need, make sure you provide it." Provider teams can make any internal changes they want—refine prompt phrasing, switch models, add chain-of-thought steps—as long as the observable contract remains satisfied.

The Model Context Protocol (MCP), standardized in late 2024 and now adopted across major AI providers, formalized this pattern at the tool layer. MCP defines how agents expose tools to each other, with typed parameters defined once in a schema that all consumers reference. A tool definition in MCP is a machine-readable contract: name, description, input schema, and expected output format. Any agent implementing that tool must satisfy that schema.

Versioning Prompt Changes Like API Changes

Unversioned prompts are the equivalent of unversioned REST APIs: any change is potentially breaking, and there's no mechanism to signal that. Teams that get this right treat prompt changes with the same discipline as API changes.

Semantic versioning maps well: a patch changes phrasing without affecting output structure; a minor adds optional output fields that consumers can ignore; a major changes required output structure or removes fields. Any major version bump triggers a contract renegotiation process—the provider team notifies all known consumers, consumers update their contracts and tests, and the old version continues serving traffic until all consumers have migrated.

Tools like LangSmith and Lilypad now support automatic prompt versioning, capturing not just the prompt text but the linked evaluation results and the model configuration. Every version has an associated test suite result, so you can see not just what changed but whether the change broke anything.

Rainbow deployments handle the operational challenge of updating long-running pipelines. When an agent update deploys, both old and new versions run simultaneously. Traffic shifts gradually—10% to the new version, then 25%, then 50%, then 100%—with rollback available at each step. Agents mid-process when the update deploys continue on the old version without interruption. This pattern is especially important for research or data processing pipelines where individual runs span minutes or hours.

The Security Angle Nobody Budgets For

Prompt injection across agent boundaries deserves its own place in any contract testing strategy. OWASP's 2025 Top 10 for LLM applications ranks prompt injection first, and 73% of production LLM applications are currently vulnerable.

In multi-agent systems, the attack surface is larger than in single-agent deployments. A malicious actor doesn't need to compromise the orchestrator directly—they can inject instructions into data that a subagent processes, which then propagates through the pipeline. One agent's tool output becomes another agent's context, and a sufficiently crafted tool output can redirect what the downstream agent does.

The contract testing equivalent here is an adversarial test suite: known injection patterns that providers must handle without breaking their behavioral contracts. Promptfoo includes a vulnerability scanner that generates injection, jailbreak, and PII-leakage test cases automatically. These should run on every prompt update alongside the functional contract tests.

Building the Coordination Infrastructure

Teams doing this well share a few structural choices.

They maintain a contracts repository separate from any individual agent's codebase. Contracts are owned by consumers but visible to providers. Changes to contracts go through a review process, not just a unilateral edit.

They run contract tests in CI on both sides: when a consumer changes their contract, provider pipelines re-run against the new expectations. When a provider updates their prompt, all known consumer contracts re-run.

They instrument coordination specifically. Aggregate agent success rates hide where failures occur. Logging each inter-agent call with structured fields—calling agent, target agent, contract version, validation result—lets you identify which boundaries are breaking most often.

They build explicit fallback handling into contracts. A contract should specify not just what a successful response looks like, but what happens on failure: what structure the error response takes, whether retries are permitted, what the consuming agent does when the provider returns a contract-invalid response.

The MAST failure taxonomy finding that 41.8% of failures are specification problems—not execution problems—points to the most important investment: writing specs before writing prompts. Teams that define the contract first, then implement the agent to satisfy it, consistently outperform teams that implement first and document informally afterward. A 2025 study found that interactive test-driven workflows for underspecified agent tasks improved performance by 74% on ambiguous inputs. The discipline of specifying expected behavior before implementation carries the same benefit for multi-agent coordination as it does for traditional TDD.

Where This Is Heading

Contract testing for prompts is still early. The tooling is maturing but not consolidated—teams stitch together JSON Schema validation, eval frameworks, versioning tools, and deployment infrastructure from separate products. A coherent stack comparable to what Pact provides for REST APIs doesn't exist yet.

The regulatory tailwind is real. Federal agencies began requiring model cards and evaluation artifacts in early 2026. Enterprise procurement increasingly asks for evidence of systematic evaluation. That external pressure is pushing teams to formalize what they previously left implicit—and prompt contracts are the natural artifact to produce.

MCP's industry-wide adoption means the tool layer now has a standardized contract format. The next logical extension is the same standardization for prompt interfaces between agents: a versioned, machine-readable spec that captures behavioral expectations, not just parameter schemas. Some teams are already building internal tooling that looks like this. It's the natural endpoint of applying software engineering discipline to agent coordination.

The underlying insight isn't new—it predates LLMs by decades. Distributed systems fail at their boundaries. The way you make boundaries reliable is by making them explicit, versioning them, and testing them continuously. The nondeterminism of LLMs adds complexity to the testing layer, but it doesn't change the architectural principle.

References:Let's stay in touch and Follow me for more thoughts and updates