API Contracts for Non-Deterministic Services: Versioning When Output Shape Is Stochastic
Your content moderation service returns {"severity": "MEDIUM", "confidence": 0.85}. The downstream billing system parses severity as an enum with values ["LOW", "MEDIUM", "HIGH"]. A model update causes the service to occasionally return "Medium" in title case. No deployment happened. No schema changed. The integration breaks in production, and nobody catches it for six days because the HTTP status codes are all 200.
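In code, the consumer-side parse that breaks looks roughly like this (the `Severity` enum and `parse_verdict` helper are illustrative, not from any particular billing system):

```python
from enum import Enum
import json

class Severity(Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"

def parse_verdict(raw: str) -> Severity:
    payload = json.loads(raw)
    # Strict lookup by value: any casing the enum doesn't list raises.
    return Severity(payload["severity"])

parse_verdict('{"severity": "MEDIUM", "confidence": 0.85}')  # fine for months
parse_verdict('{"severity": "Medium", "confidence": 0.85}')  # ValueError, deep in the billing path
```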
This is the foundational problem with API contracts for LLM-backed services: the surface looks like a REST API, but the behavior underneath is probabilistic. Standard contract tooling assumes determinism. When that assumption breaks, it breaks silently.
Why Traditional Contract Testing Fails
Consumer-Driven Contract Testing (CDCT) — the Pact-style model — works beautifully for deterministic systems. The consumer writes tests that specify expected interactions. The provider verifies it can satisfy those interactions. The contract is a static artifact that either passes or fails.
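Reduced to a sketch (plain Python standing in for a Pact-style consumer test, with an invented `EXPECTED_INTERACTION` fixture rather than the real Pact API):

```python
# A consumer-driven contract reduced to its core: a fixed request paired
# with the exact response the consumer depends on.
EXPECTED_INTERACTION = {
    "request": {"method": "POST", "path": "/moderate", "body": {"text": "some post"}},
    "response": {"status": 200, "body": {"severity": "MEDIUM", "confidence": 0.85}},
}

def provider_satisfies(actual_status: int, actual_body: dict) -> bool:
    # A deterministic provider can promise this equality holds on every run.
    expected = EXPECTED_INTERACTION["response"]
    return actual_status == expected["status"] and actual_body == expected["body"]
```

The byte-for-byte equality on the body is the load-bearing assumption.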
This model collapses for LLM-backed services for three reasons:
Output is probabilistic at the token level. The same prompt at temperature 0 produces the same output most of the time, but not always — floating-point non-determinism, batching effects, and hardware differences across provider regions introduce variance even in "deterministic" configurations. At higher temperatures, the output space is genuinely stochastic.
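You can see this empirically by replaying the identical request and counting distinct completions; `call_model` below is a stand-in for whichever provider client you actually use:

```python
from collections import Counter

def call_model(prompt: str, temperature: float = 0.0) -> str:
    """Stand-in for your provider client (hosted API, local model, etc.)."""
    raise NotImplementedError

def output_variance(prompt: str, runs: int = 50) -> Counter:
    # Replay the identical request and count distinct completions.
    # Even at temperature 0, expect this Counter to sometimes hold more than one key.
    return Counter(call_model(prompt, temperature=0.0) for _ in range(runs))
```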
Behavior changes without schema changes. A model update can shift output tone, verbosity, reliability, or factual accuracy without touching a single field name. Semver was designed for breaking structural changes. It has no vocabulary for "same schema, 8% higher hallucination rate."
The failure mode is silence. When a traditional API breaks a contract, you get a 4xx or 5xx. When an LLM-backed service violates a semantic contract, it returns 200 with plausible-looking JSON that fails downstream. The median time to detect this in production is days, not seconds.
A content moderation system that changes from "MEDIUM" to "medium" to "Medium" across three prompt versions has silently broken its callers twice. Neither break generated an alert.
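Catching it means checking the semantic contract explicitly instead of trusting the 200. A minimal sketch, with `emit_alert` standing in for whatever alerting pipeline you already have:

```python
ALLOWED_SEVERITIES = {"LOW", "MEDIUM", "HIGH"}

def check_moderation_contract(body: dict) -> list[str]:
    """Return a list of contract violations for one response body."""
    violations = []
    severity = body.get("severity")
    if severity not in ALLOWED_SEVERITIES:
        violations.append(f"severity {severity!r} not in {sorted(ALLOWED_SEVERITIES)}")
    confidence = body.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        violations.append(f"confidence {confidence!r} outside [0, 1]")
    return violations

def emit_alert(message: str) -> None:
    """Stand-in for your metrics/alerting pipeline."""
    print(f"CONTRACT VIOLATION: {message}")

# "Medium" is schema-valid JSON but still a contract violation; surface it loudly.
for violation in check_moderation_contract({"severity": "Medium", "confidence": 0.85}):
    emit_alert(violation)
```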
The Four Failure Modes You'll Actually Hit
Understanding where contracts break in practice is more useful than any theoretical framework. Here are the four patterns that recur across production systems:
Schema drift from model updates. When you upgrade from one model version to another, output structure can shift subtly — increased verbosity wraps previously clean JSON in explanatory text, field ordering changes, optional fields that were always present start being omitted. None of this is in a changelog.
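A defensive parser can tolerate the wrapping while still recording that drift happened; a rough sketch (the regex fallback is deliberately crude):

```python
import json
import re

def parse_with_drift_flag(raw: str) -> tuple[dict, bool]:
    """Parse a response that should be bare JSON.

    Returns (payload, drifted), where drifted is True when the model wrapped
    the JSON in explanatory text or a markdown fence.
    """
    try:
        return json.loads(raw), False
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span in the text.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0)), True
```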
Prompt drift cascades. Small prompt changes that seem purely cosmetic — swapping "respond with valid JSON" for "always return parseable JSON" — change the distribution of outputs in ways that matter. Research shows prompt updates are the primary driver of production incidents in LLM systems. The change is in your VCS, but its effect on downstream consumers is invisible until it isn't.
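One way to surface the effect before release is to replay a fixed evaluation set through both prompt versions and measure field-level agreement; the `ModelCall` callable is a placeholder for your own client:

```python
from typing import Callable

ModelCall = Callable[[str, str], dict]  # (prompt, document) -> parsed JSON output

def agreement_rate(call: ModelCall, prompt_a: str, prompt_b: str,
                   eval_set: list[str], field: str) -> float:
    """Fraction of eval documents where two prompt versions agree on one output field."""
    agree = sum(
        call(prompt_a, doc).get(field) == call(prompt_b, doc).get(field)
        for doc in eval_set
    )
    return agree / len(eval_set)
```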
Silent quality degradation. Your API returns well-formed, schema-valid JSON that is wrong. Confidence scores that cluster at 0.9 regardless of actual signal. Entity extraction that silently drops low-confidence entities rather than flagging them. This class of failure doesn't trip any schema validation; it requires semantic monitoring.
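Schema validation can't see this; a distribution check can. A sketch that flags windows where confidence clusters high with almost no spread (the thresholds are arbitrary placeholders to tune against your own traffic):

```python
from statistics import mean, pstdev

def confidence_looks_stuck(recent_scores: list[float],
                           min_spread: float = 0.02,
                           suspicious_mean: float = 0.8) -> bool:
    """Flag windows where confidence clusters high with almost no variance."""
    if len(recent_scores) < 30:
        return False  # not enough signal yet
    return pstdev(recent_scores) < min_spread and mean(recent_scores) > suspicious_mean
```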
Compositional failures in multi-agent pipelines. Agent A and Agent B individually satisfy their contracts. When A's output feeds into B's prompt, the combined behavior violates a contract that neither individual test exercised. Production research identified 73+ distinct contract types across LLM systems, with compositional failures being among the hardest to catch.
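A contract test for the composition has to run the pipeline end to end rather than each agent in isolation. A minimal sketch with stubbed agents; `summarize`, `route`, and the allowed queues are all hypothetical:

```python
def summarize(ticket: str) -> dict:
    """Agent A: stand-in for an LLM call returning {"summary": ..., "tags": [...]}."""
    raise NotImplementedError

def route(summary: dict) -> dict:
    """Agent B: stand-in for an LLM call returning {"queue": ..., "priority": ...}."""
    raise NotImplementedError

def test_composed_contract(tickets: list[str]) -> None:
    # A and B can each pass their own contract tests and still violate this.
    for ticket in tickets:
        routed = route(summarize(ticket))
        assert routed["queue"] in {"billing", "abuse", "general"}
        assert routed["priority"] in {"P1", "P2", "P3"}
```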
Semantic Versioning for Prompt-Driven APIs
Traditional semver maps to structural changes: major for breaking schema changes, minor for additive changes, patch for bug fixes. This works when the contract is the schema.
For LLM-backed services, the contract includes behavioral properties that schema versioning can't capture. A new model that's 8% more likely to refuse certain inputs is a breaking change for some consumers. A prompt tweak that changes output tone from neutral to opinionated is a breaking change for callers that pass the text straight through to end users. Neither shows up in a JSON Schema diff.
A more useful versioning model:
| Trigger | Version bump |
|---|---|
| Breaking structural change (field removed, type changed, optional field made required) | Major |
| New optional output fields, improved accuracy, reduced latency | Minor |
| Typo fixes, formatting adjustments, no semantic impact | Patch |
| Same schema, behavioral change affecting reliability or semantics | Minor at minimum; major if caller behavior needs updating |
The critical addition is the last row. A model upgrade that changes refusal rate, hallucination frequency, or output verbosity should increment the minor version even if the JSON Schema is identical. Callers deserve the signal.
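The table reduces to a small decision function; a sketch, with the flag names invented for illustration:

```python
def next_bump(breaking_schema_change: bool,
              additive_change: bool,
              behavioral_change: bool,
              callers_must_adapt: bool) -> str:
    """Map the change taxonomy above onto a semver bump."""
    if breaking_schema_change:
        return "major"
    if behavioral_change and callers_must_adapt:
        return "major"
    if behavioral_change or additive_change:
        return "minor"
    return "patch"

# A model upgrade with an identical JSON Schema but a higher refusal rate:
next_bump(False, False, behavioral_change=True, callers_must_adapt=False)  # "minor"
```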
Two operational rules follow from this:
Immutability. Once a version is published, it doesn't change. A v1.2.3 endpoint always behaves like v1.2.3 — it doesn't silently change behavior when you upgrade the underlying model. If the model changes, the version changes. This forces you to be explicit about what changed and gives callers a stable target.
Version the full execution context. A prompt version isn't just a string. It's the prompt + model + temperature + retrieval configuration. When any of those change, the output distribution changes. Version them together, track them together.
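Both rules fall out naturally if the versioned artifact is the full execution context plus a content fingerprint. A sketch, with the registry kept in memory for brevity:

```python
import hashlib
import json
from dataclasses import dataclass, asdict, field

@dataclass(frozen=True)
class ExecutionContext:
    prompt: str
    model: str
    temperature: float
    retrieval_config: dict = field(default_factory=dict)

    def fingerprint(self) -> str:
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()[:12]

_registry: dict[str, str] = {}  # version -> context fingerprint

def publish(version: str, ctx: ExecutionContext) -> None:
    """Refuse to republish a version whose execution context changed."""
    fp = ctx.fingerprint()
    if version in _registry and _registry[version] != fp:
        raise ValueError(f"{version} is already published with a different context")
    _registry[version] = fp
```

Publishing v1.2.4 with a new model is cheap; silently mutating v1.2.3 becomes impossible by construction.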
Replacing Exact-Match Contracts with Behavioral Invariants
