Contract Tests for LLM Tool Surfaces: When the Vendor Changes a Field and Your Agent Silently Adapts
A vendor flipped "items" to "results" in a tool response last Tuesday. The agent didn't crash. It re-planned around the new shape, returned a confident-looking answer that was missing two-thirds of the rows, and the on-call engineer found out three days later when a customer asked why their export was short. No exception fired. No alert tripped. The eval suite, which runs against a frozen fixture from before the vendor change, was green the whole time.
This is the failure mode that contract testing was invented to catch in microservices a decade ago, and the one that almost no agent stack has any equivalent for today. HTTP services have Pact, schemathesis, and oasdiff to sit between consumer and provider and refuse to let breaking changes ship. The tools you hand to your agent — REST endpoints, internal RPCs, vendor SDKs, MCP servers — have nothing comparable. The model absorbs the change, adapts gracefully, and produces a degraded answer that looks correct.
The architectural insight worth internalizing before going any further: the tool's API is part of your agent's prompt. Every breaking change downstream is a silent prompt change upstream. If the procurement tool starts returning a string where it used to return an integer, you didn't just break a parser — you mutated the latent context the planner uses to choose its next step. And unlike a prompt change you'd ship through your prompt-versioning pipeline, this one ships without a PR, without a review, without a rollback button.
Why Exceptions Don't Save You
In a deterministic service, a breaking field rename causes a JSON parse error or a KeyError, your handler catches it, you page someone, you roll back. The error budget burns and the system tells you. Agents are designed to do the opposite: when a tool returns something unexpected, the planner reasons over the new shape and tries to recover. That recovery is the feature people pay for. It's also exactly what hides the contract violation.
Three concrete patterns I keep seeing in production traces:
- Field rename: `items` → `results`. The tool used to return `{"items": [...]}`. Now it returns `{"results": [...]}`. The agent reads the response, notices `items` is missing, and decides "this query returned nothing." It moves on. Downstream summary says "no matching records found." There never were no matching records — there are 412.
- Type widening: an integer field starts arriving as a string. The agent stringifies it again into the next tool call and the receiving tool either coerces it back (silent) or treats it as text and returns nothing useful.
- Null where an empty array used to live: `"tags": []` becomes `"tags": null`. The planner's next step was `for each tag in tags`. With `null`, the iteration is skipped, and the entire branch of work the user asked for doesn't happen.
None of these throw. None of these page anyone. All of them are bugs that ship to your users at LLM speed.
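Here's the first pattern as a minimal sketch, with invented field names and a stubbed vendor response; the defensive `.get()` that keeps the agent from crashing is exactly what swallows the rename:

```python
# The vendor's new response shape after the rename (was {"items": [...]}).
vendor_response = {"results": [{"id": 1, "region": "EMEA"}]}  # 412 rows in production

def parse_records(response: dict) -> list[dict]:
    # Defensive parsing keeps the agent from crashing, and it also hides the rename:
    # the rows now live under "results", so this returns [] instead of raising.
    return response.get("items", [])

rows = parse_records(vendor_response)
if not rows:
    # The planner concludes "this query returned nothing" and moves on.
    summary = "No matching records found."
```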
What Contract Testing Means for Tools
The microservices version of contract testing has two flavors and you want both for tools.
Consumer-driven contracts (the Pact lineage) make the consumer write down what it expects from the provider. The contract is generated as a byproduct of a consumer test, then handed to the provider, who runs it against their actual service in CI. If the provider's response shape doesn't match what the consumer encoded, the provider's build fails. This catches a vendor breaking your agent before the vendor's release reaches production.
Schema-based contracts (the OpenAPI / schemathesis lineage) compare a versioned spec to live behavior. You hold the vendor's OpenAPI document, you record traffic, you assert the traffic conforms to the spec. Tools like oasdiff also let you compare today's spec to yesterday's, classify each diff as breaking, non-breaking, or deprecation, and fail the build on a high-risk delta.
For agent tools, you want the consumer-driven part because only your agent knows which fields it actually depends on. A tool with twenty response fields doesn't have twenty equally-load-bearing fields — your agent's planner reads four, the formatter reads two, and the rest are dead weight. The contract you want to enforce is the four-field subset, not the full schema. If the vendor wants to add fields, fine. If they want to rename one of your four, you want a red build, not a degraded answer.
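A hand-rolled sketch of that subset contract, using JSON Schema rather than Pact proper; the field names and the `live_response` fixture are stand-ins for whatever your agent actually reads:

```python
from jsonschema import validate  # pip install jsonschema

# Consumer-side contract: only the fields this agent actually reads.
SEARCH_TOOL_CONTRACT = {
    "type": "object",
    "required": ["items", "total_count"],
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["id", "region"],
                "properties": {
                    "id": {"type": "integer"},
                    "region": {"type": "string"},
                },
            },
        },
        "total_count": {"type": "integer"},
    },
    # No "additionalProperties": false on purpose: additive vendor changes
    # pass; renaming or retyping one of our fields fails the build.
}

def test_live_response_matches_consumer_contract(live_response: dict) -> None:
    # live_response is a recorded or staged vendor call (pytest fixture assumed);
    # jsonschema raises ValidationError on any violation, which fails the build.
    validate(instance=live_response, schema=SEARCH_TOOL_CONTRACT)
```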
The schema-based part is what you bolt on to the vendor's published spec when they have one. Pact and OpenAPI are not competitors here. They answer different questions: Pact answers "does this provider satisfy this specific consumer's expectations," OpenAPI conformance answers "does this provider's runtime match its own documentation." For an agent that calls ten tools across four vendors, you'll end up using both.
Evals Tell You the Model Behaved; Contracts Tell You It Could Have
The most common pushback I hear from teams who already have a reasonable eval suite: "we run end-to-end evals against a curated set of conversations, isn't that already contract testing?"
It is not. Evals tell you the model produced an acceptable output on inputs you've already seen. Contracts tell you the model could still produce an acceptable output on inputs it hasn't seen yet, because the tool layer is still delivering the shapes the model depends on. The distinction is coverage of the input space.
A frozen eval set runs against frozen tool responses — usually mocked, sometimes recorded from a fixture that's months old. The eval is green when the recorded response is what the model expects. The eval is also green when the live response has drifted away from the recording, because the eval doesn't talk to the live tool. So a vendor can ship a breaking change to production at noon, your nightly eval at 3am can be all green, and your users can be hitting the regression in real time.
The discipline that fixes this is two-layered:
- Contracts run against a representative set of tool inputs the vendor has staged or that your contract recording asserts must continue to work. They do not need the model in the loop. They do not need a $0.50 generation per case. They run cheap and fast on every PR and at intervals against vendor staging.
- Evals run against the model with mocked or recorded tool responses, and exercise the parts of the input space the model has to reason over. They are expensive but bounded.
Contracts catch the silent prompt mutation. Evals catch the model regression. You need both, and they cover non-overlapping failure surfaces.
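One way to keep the two layers separate in CI is a pair of pytest markers, sketched below; the marker names and fixtures are ours, not a standard, and the contract test reuses the subset schema from the earlier sketch:

```python
# Two markers, two schedules:
#   pytest -m contract  -> every PR, and nightly against vendor staging
#   pytest -m eval      -> nightly, with the model in the loop
import pytest
from jsonschema import validate

@pytest.mark.contract
def test_search_tool_shape(live_response):
    # No model, no generation cost: just the pinned response-shape contract.
    validate(instance=live_response, schema=SEARCH_TOOL_CONTRACT)

@pytest.mark.eval
def test_agent_answers_recorded_conversation(recorded_conversation, run_agent):
    # Model in the loop, recorded tool responses, graded however your harness grades.
    answer = run_agent(recorded_conversation)
    assert "412" in answer
```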
Schema Fixtures, Adversarial Mutators, and the Tests You Actually Write
The tactical playbook borrows directly from microservices and adds a layer for LLMs.
Versioned schema fixtures, per tool. Every tool your agent calls has a directory of fixture files keyed by tool name and version. The fixture asserts the response shape your agent depends on, not the full vendor schema. When the vendor publishes a new spec, your CI diffs the fixture against it (oasdiff does this for OpenAPI) and either auto-promotes (additive change), opens a PR (deprecation), or fails the build (breaking change).
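A sketch of what the layout and the check might look like; the directory convention, tool name, and `live_response` fixture are invented for illustration:

```python
# contracts/
#   records_search/
#     v2.json   <- the response subset this agent release depends on
#     v1.json   <- kept around for the rollback window
import json
from pathlib import Path

from jsonschema import validate

CONTRACT_DIR = Path("contracts")
PINNED = {"records_search": "v2"}  # bumped only through a reviewed PR

def load_contract(tool_name: str) -> dict:
    version = PINNED[tool_name]
    return json.loads((CONTRACT_DIR / tool_name / f"{version}.json").read_text())

def test_records_search_live_matches_pinned_contract(live_response: dict) -> None:
    # live_response comes from a recorded or staged vendor call (fixture assumed).
    validate(instance=live_response, schema=load_contract("records_search"))
```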
Golden-trace replay against new tool responses. Take a frozen agent run from production — the trajectory of tool calls and responses that produced a known-good answer. When the vendor's response shape might have changed, replay the trajectory with the new live response substituted in and assert the agent still produces the equivalent final answer. This is the same idea as snapshot testing, ported to the agent tool boundary, and several observability platforms have started shipping it as a primitive in the last year.
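A sketch of the replay harness; the trace format, `call_live_tool`, and `run_agent` are placeholders for your own agent runner and trace store, and the equivalence check is wherever you plug in the comparison your evals already use:

```python
import json
from pathlib import Path

def answers_equivalent(a: str, b: str) -> bool:
    # Placeholder: swap in the fuzzy or rubric-based comparison you already trust.
    return a.strip().lower() == b.strip().lower()

def replay_golden_trace(trace_path: Path, call_live_tool, run_agent) -> None:
    """Re-run a known-good trajectory with today's live tool responses swapped in."""
    trace = json.loads(trace_path.read_text())

    # Substitute the live response for the recorded one at every tool-call step.
    for step in trace["steps"]:
        if step["type"] == "tool_call":
            step["response"] = call_live_tool(step["tool"], step["arguments"])

    new_answer = run_agent(trace["input"], tool_responses=trace["steps"])

    assert answers_equivalent(new_answer, trace["final_answer"]), (
        "golden trace diverged after substituting the live tool response"
    )
```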
Adversarial mutators between fixture and live. You generate near-misses of the live response on purpose: drop a field, inject null where the schema says the field is non-nullable, swap an integer for a string, return an empty array instead of a populated one, return more rows than the agent's context window can hold. Run the agent against the mutated response. Assert it does the right thing — which is usually one of: produce a structured error, ask for clarification, or use a fallback path. Frameworks like ToolFuzz and the various LLM-fuzzing harnesses do exactly this; the trick is wiring them into CI rather than running them once at design time.
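A sketch of the mutator layer; `run_agent_step` stands in for however you invoke your planner on a single tool result, and the outcome kinds are assumptions about your agent's error surface:

```python
import copy
import pytest

GOOD_RESPONSE = {"items": [{"id": 1, "region": "EMEA"}], "total_count": 1}

def drop_field(resp: dict, field: str) -> dict:
    mutated = copy.deepcopy(resp)
    mutated.pop(field, None)
    return mutated

def null_field(resp: dict, field: str) -> dict:
    mutated = copy.deepcopy(resp)
    mutated[field] = None
    return mutated

def retype_field(resp: dict, field: str) -> dict:
    mutated = copy.deepcopy(resp)
    mutated[field] = str(mutated[field])  # integer quietly becomes a string
    return mutated

MUTANTS = [
    drop_field(GOOD_RESPONSE, "items"),
    null_field(GOOD_RESPONSE, "items"),
    retype_field(GOOD_RESPONSE, "total_count"),
    {**GOOD_RESPONSE, "items": []},  # empty where the agent expects rows
]

@pytest.mark.parametrize("mutant", MUTANTS)
def test_agent_degrades_loudly(mutant, run_agent_step):
    outcome = run_agent_step(tool_response=mutant)
    # Acceptable: structured error, clarification request, or fallback path.
    # Not acceptable: a confident answer built on the mutated shape.
    assert outcome.kind in {"structured_error", "clarification", "fallback"}
```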
The CI pipeline that runs against vendor staging. The single best place to catch a breaking change is before it ships. Most B2B vendors with non-trivial APIs publish staging or sandbox environments and version their releases. A nightly contract job that points at vendor staging, replays your golden traces, and posts a diff to a Slack channel turns a four-day silent regression into a same-day "their next release breaks our procurement tool, file a ticket." This is not theoretical — it's how every serious microservices shop has worked for ten years. We just haven't ported the discipline to the agent stack yet.
Where Teams Get This Wrong
A few failure patterns from the trenches.
Blaming the model for hallucinating a field name. Your team sees the agent reference a total_count field that the live tool doesn't return, and someone files a hallucination ticket against the model. Then someone digs into the trace and realizes total_count was returned by the tool last week and got renamed to count in a vendor patch release that nobody read. The model isn't hallucinating; it's working from a system prompt that was implicitly extended by months of consistent tool responses. The fix isn't a better model — it's a contract that would have failed the vendor's release.
Asserting on the wrong layer. Teams that do think to add tool-response validation often write assertions at the parsing layer: "the response was JSON and had a results key." That catches catastrophic breakage but not the interesting cases. The contract you want is closer to "if the request is query=X, the response has at least one row whose region field equals Y," because that's the level the agent actually reasons over.
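The difference in code, with an invented `call_tool` fixture and query; the first test is where most teams stop, the second is the level the agent actually leans on:

```python
def test_parsing_layer(live_response):
    # Catches catastrophic breakage only.
    assert isinstance(live_response, dict)
    assert "results" in live_response

def test_semantic_layer(call_tool):
    # The contract the agent reasons over: a staged query for a known region
    # must still come back with rows the planner can branch on.
    response = call_tool("records.search", {"query": "region:EMEA"})
    rows = response["results"]
    assert rows, "staged query is expected to return at least one row"
    assert any(row.get("region") == "EMEA" for row in rows)
```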
Contracts as documentation, not gates. The contract sits in a repo, no CI runs it, the schema drifts, and three engineers separately rediscover it during incident reviews. If the contract doesn't fail a build, it's not a contract — it's a folk tale.
Pretending tools are deterministic. Some tools — search, recommendations, anything ranked — are non-deterministic by design. Treating their response as a fixed schema you can pin a contract to is correct; treating their response content as a fixed value is not. Separate the schema contract (always enforce) from the content assertion (sample-based, fuzzy match, or skip).
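One way to split the two for a ranked tool, sketched below; the field names, the anchor document, and the top-10 window are placeholders to tune, not recommendations:

```python
from jsonschema import validate

RANKED_SEARCH_SCHEMA = {
    "type": "object",
    "required": ["results"],
    "properties": {
        "results": {
            "type": "array",
            "items": {"type": "object", "required": ["id", "score", "title"]},
        }
    },
}

def test_schema_contract(live_response):
    # Always enforced: the shape is a contract even when the ranking is not.
    validate(instance=live_response, schema=RANKED_SEARCH_SCHEMA)

def test_content_assertion(live_response):
    # Sample-based and fuzzy: a known-good document should still surface near
    # the top, but the exact ordering is allowed to move.
    top_titles = [r["title"].lower() for r in live_response["results"][:10]]
    assert any("quarterly export" in title for title in top_titles)
```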
The Architectural Bet Worth Making
If you internalize one thing from this, make it this: every tool response is a silent prompt mutation. The mental model that says "tool calls are I/O" is correct at the bytes layer and wrong at the cognition layer. At the cognition layer, the schema your tool returns is fused into the planner's working context every step it runs. A field rename is a prompt edit nobody approved.
Contract testing in microservices took off because the cost of a contract violation was a 500 error a customer noticed in seconds. In the agent era, the cost is a quietly degraded answer that ships at machine speed and gets caught days later if it gets caught at all. The economic incentive to invest in tool contracts is therefore higher in agents than in microservices, even though the discipline is less mature.
A reasonable place to start this week: pick the one tool your agent depends on most, write down the four to six fields it actually reads from the response, generate a fixture, wire a CI job that replays a golden trace against the live tool nightly, and add three adversarial mutators that drop, null, and type-shift the most load-bearing field. You will catch a regression before your users do. After that, repeat for the next tool. The platform you end up with looks a lot like Pact for agents, and somebody is going to build the open-source version of it before the year is out.
- https://docs.pact.io/
- https://docs.pact.io/blog/2025/05/28/pact-open-source-update-may-2025
- https://www.speakeasy.com/blog/pact-vs-openapi
- https://pactflow.io/blog/contract-testing-using-json-schemas-and-open-api-part-3/
- https://www.oasdiff.com/
- https://github.com/oasdiff/oasdiff
- https://github.com/eth-sri/ToolFuzz
- https://aws.amazon.com/blogs/machine-learning/toolsimulator-scalable-tool-testing-for-ai-agents/
- https://langwatch.ai/scenario/testing-guides/mocks/
- https://medium.com/@meryemmsakinn/end-vibe-driven-development-testing-ai-agents-in-ci-pipelines-promptfoo-golden-traces-b9b222b23d72
- https://www.sakurasky.com/blog/missing-primitives-for-trustworthy-ai-part-8/
- https://latitude.so/blog/why-ai-agents-break-in-production
- https://medium.com/data-science-collective/why-ai-agents-keep-failing-in-production-cdd335b22219
- https://nimblebrain.ai/why-ai-fails/agent-governance/agent-failure-modes/
- https://dev.to/aws/how-to-stop-ai-agents-from-hallucinating-silently-with-multi-agent-validation-3f7e
