
Testing the Untestable: Integration Contracts for LLM-Powered APIs

Tian Pan · Software Engineer · 10 min read

Your test suite passes. The CI is green. You ship the new prompt. Three days later, a user reports that your API is returning JSON with a trailing comma — and your downstream parser has been silently dropping records for 72 hours. You never wrote a test for that because the LLM "always" returned valid JSON in development.

This is the failure mode that ruins LLM-powered products: not catastrophic model collapse, but quiet, intermittent degradation that deterministic test suites are structurally incapable of catching. The root cause isn't laziness — it's that the whole paradigm of "expected == actual" breaks when your system produces non-deterministic natural language.

Fixing this requires rethinking what you're testing and what "passing" even means for an LLM-powered API. The engineers who've figured this out aren't writing smarter equality assertions — they're writing fundamentally different kinds of tests.

Why Temperature=0 Doesn't Save You

The standard advice is to set temperature to 0 for reproducible outputs. This advice is wrong in subtle ways that matter for testing.

A 2025 study on LLM non-determinism found accuracy variations up to 15% across runs with identical settings, and performance gaps reaching 70% between best and worst outcomes. The culprit is floating-point non-associativity in GPU computation: different hardware cores execute operations in different orders, producing numerically distinct (though semantically similar) intermediate states. When load balancing routes your request to a different server, the output changes.

OpenAI's seed parameter explicitly guarantees only "mostly deterministic" outputs — not identical ones — because the provider reserves the right to update infrastructure between calls. If your test suite runs a prompt today and records the exact output, that snapshot may be stale by next week without any change on your end.
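
You can observe this drift directly. Here is a minimal sketch using the official openai Python client (the model name and prompt are placeholders): request the same completion twice with a fixed seed, and compare the system_fingerprint each response carries. When the fingerprint differs between runs, the backend changed, and byte-identical output is off the table.

```python
# Minimal sketch: probing "mostly deterministic" seeded outputs.
# Assumes the official openai Python client (v1+); model and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sample(prompt: str) -> tuple[str, str | None]:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=42,  # best-effort determinism, not a guarantee
    )
    # system_fingerprint identifies the backend configuration that served the
    # request; when it differs across runs, outputs may legitimately differ too.
    return resp.choices[0].message.content, resp.system_fingerprint

out1, fp1 = sample("Summarize Q3 revenue in one sentence.")
out2, fp2 = sample("Summarize Q3 revenue in one sentence.")
if fp1 != fp2:
    print("Backend changed between calls; snapshot comparison is meaningless.")
print(out1 == out2)  # can be False even with identical settings
```

Recording the fingerprint alongside any snapshot at least tells you whether a mismatch is your regression or the provider's infrastructure.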

The deeper issue is that even if outputs were truly identical, exact-match testing would still be the wrong approach. Consider a summarization endpoint: "The report showed declining Q3 revenue" and "Q3 revenue declined per the report" are the same correct answer, but string equality marks one as a failure. Natural language has no canonical form.

Engineers who don't recognize this write tests that are simultaneously too brittle (failing on valid paraphrases) and too loose (passing on outputs with subtle factual errors). You need a different testing model.

The Inversion Principle: Test What Must Never Happen

The most durable tests for LLM systems aren't assertions about what the output must say — they're assertions about what the output must never do. This inversion is the core insight that makes LLM testing tractable.

A few examples make this concrete:

  • A customer-support bot must never reveal internal system prompt contents
  • A structured data extraction endpoint must never return invalid JSON
  • A summarization service must never introduce factual claims absent from the source document
  • A code generation tool must never include obvious security vulnerabilities (SQL injection, hardcoded credentials)
  • A classification endpoint must never return a label outside the defined enum

These invariants are easy to specify, easy to assert, and don't require you to anticipate every possible valid output. They encode the failure modes that actually cause production incidents rather than academic correctness concerns.

The implementation varies by invariant type. Format invariants are deterministic: parse the JSON, validate against your schema, check enum membership. Semantic invariants require either heuristics (regex for known secrets patterns, NER for hallucinated entities) or a secondary LLM evaluation. Safety invariants often need both: a lightweight classifier for latency-sensitive paths plus periodic deep evaluation against a curated adversarial dataset.
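
To make the deterministic layer concrete, here is a hedged sketch of format and leak invariants as plain pytest assertions. The enum, secret patterns, and payload shape are illustrative stand-ins, not a prescription.

```python
# Hedged sketch of deterministic invariant checks; the enum, secret patterns,
# and payload shape are illustrative placeholders.
import json
import re
import pytest

ALLOWED_LABELS = {"billing", "technical", "account"}  # hypothetical enum
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),  # API-key-shaped strings
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]

def check_invariants(raw_output: str) -> dict:
    """Assert what must NEVER happen, then hand back the parsed payload."""
    # Invariant: output must parse as JSON (a trailing comma fails right here).
    payload = json.loads(raw_output)

    # Invariant: classification label must stay inside the defined enum.
    assert payload["label"] in ALLOWED_LABELS, f"label escaped enum: {payload['label']!r}"

    # Invariant: no secret-shaped content leaks into the response.
    for pattern in SECRET_PATTERNS:
        assert not pattern.search(raw_output), "possible secret in output"
    return payload

def test_trailing_comma_is_caught():
    # The exact production failure from the intro: invalid JSON.
    with pytest.raises(json.JSONDecodeError):
        check_invariants('{"label": "billing",}')
```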

When you wire these as CI gates — not just monitoring alerts — you catch regressions before they ship rather than hours after they've affected users.

Contracts at the Interface Layer

For any API that consumes or produces LLM-generated content, you need two layers of specification: a structural contract and a behavioral contract. Most teams only build the first.

The structural contract is your OpenAPI spec or JSON schema. It defines what fields are present, what types they carry, what enum values are valid. For a synchronous LLM API, this layer should be enforced before the response leaves your service — not by hoping the model follows your prompt, but by parsing and validating the output and retrying or failing explicitly if it doesn't conform. If your LLM returns prose when you asked for JSON, that's a recoverable error you handle in code, not a test failure you catch in CI.
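
Here is a minimal sketch of that enforcement layer, assuming pydantic v2 for schema validation and a placeholder call_model function standing in for your actual LLM client:

```python
# Sketch: enforce the structural contract in code, retrying on violation.
# Assumes pydantic v2; call_model is a placeholder for your LLM client.
from pydantic import BaseModel, ValidationError

class ExtractionResult(BaseModel):
    company: str
    quarter: str
    revenue_usd: float

def call_model(prompt: str) -> str:
    raise NotImplementedError  # your LLM client goes here

def extract(prompt: str, max_attempts: int = 3) -> ExtractionResult:
    last_error: Exception | None = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            # Parse and validate in one step; prose and malformed JSON both raise.
            return ExtractionResult.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc  # recoverable: retry instead of shipping garbage
    # Fail explicitly so callers never see a non-conforming payload.
    raise RuntimeError(f"no valid payload after {max_attempts} attempts: {last_error}")
```

The point is that contract violations become control flow: the runtime layer guarantees conformance, and CI only needs to cover the enforcement path itself.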

The behavioral contract is harder. It defines properties that must hold for any valid response: the answer must be grounded in the provided context, the tone must match the defined persona, the length must fall within acceptable bounds for the UX. These properties can't be statically validated — they require evaluation.

The practical pattern is to encode behavioral contracts as a rubric your CI evaluator checks before promoting a new prompt or model version. Tools like PromptFoo and DeepEval let you express these as YAML configuration or Python assertions respectively, then run them against a fixed dataset of representative inputs. A build fails not when the output doesn't match a snapshot, but when it scores below threshold on properties you've explicitly defined.
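
Tool syntax aside, the underlying mechanic is simple enough to sketch in plain Python. In this hedged, tool-agnostic version, judge_score is an invented placeholder for whatever LLM-as-judge or metric call your evaluator uses:

```python
# Tool-agnostic sketch of a behavioral-contract gate for CI.
# judge_score is an invented placeholder for an LLM-as-judge or metric call.

RUBRIC = {
    "groundedness": 0.90,  # answer supported by the provided context
    "tone": 0.80,          # matches the defined persona
}

def judge_score(prop: str, context: str, output: str) -> float:
    """Return a 0..1 score for one property; wire in your evaluator here."""
    raise NotImplementedError

def gate(dataset: list[dict]) -> None:
    """Fail the build when any property's mean score drops below threshold."""
    for prop, threshold in RUBRIC.items():
        scores = [judge_score(prop, ex["context"], ex["output"]) for ex in dataset]
        mean = sum(scores) / len(scores)
        assert mean >= threshold, f"{prop} regressed: {mean:.2f} < {threshold}"
```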

This approach lets you ship model updates and prompt changes with the same confidence you'd have shipping code changes with a regression test suite — but without the brittleness of exact-match assertions.

Property-Based Testing in Practice

Property-based testing (PBT) was developed for deterministic systems — give me a generator for inputs, and I'll find inputs that violate your stated properties. For LLMs, the idea adapts but the execution changes significantly.
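
For instance, here is a hedged sketch using the Hypothesis library. The generators and the call_model stub are invented for illustration; the property asserted is the never-invalid-JSON invariant from earlier, not an exact match:

```python
# Hedged sketch of a property-based test with Hypothesis; the generators and
# the call_model stub are invented for illustration.
import json
from hypothesis import given, settings, strategies as st

def call_model(prompt: str) -> str:
    raise NotImplementedError  # your LLM client goes here

@settings(max_examples=20, deadline=None)  # LLM calls are slow and variable
@given(
    company=st.text(min_size=1, max_size=40),
    revenue=st.floats(min_value=0, max_value=1e12, allow_nan=False),
)
def test_extraction_never_returns_invalid_json(company: str, revenue: float):
    prompt = f"Extract fields from: {company} reported ${revenue:,.0f} in Q3."
    # Property: for ANY input, the output must parse as JSON. No exact match.
    json.loads(call_model(prompt))
```

The generator does the adversarial work here: odd Unicode in company names and extreme revenue values are exactly the inputs you would never think to hand-write.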

The useful properties to test fall into a few categories:
