
Testing the Untestable: Integration Contracts for LLM-Powered APIs

Tian Pan · Software Engineer · 10 min read

Your test suite passes. The CI is green. You ship the new prompt. Three days later, a user reports that your API is returning JSON with a trailing comma — and your downstream parser has been silently dropping records for 72 hours. You never wrote a test for that because the LLM "always" returned valid JSON in development.

This is the failure mode that ruins LLM-powered products: not catastrophic model collapse, but quiet, intermittent degradation that deterministic test suites are structurally incapable of catching. The root cause isn't laziness — it's that the whole paradigm of "expected == actual" breaks when your system produces non-deterministic natural language.

Fixing this requires rethinking what you're testing and what "passing" even means for an LLM-powered API. The engineers who've figured this out aren't writing smarter equality assertions — they're writing fundamentally different kinds of tests.

Why Temperature=0 Doesn't Save You

The standard advice is to set temperature to 0 for reproducible outputs. This advice is wrong in subtle ways that matter for testing.

A 2025 study on LLM non-determinism found accuracy variations up to 15% across runs with identical settings, and performance gaps reaching 70% between best and worst outcomes. The culprit is floating-point non-associativity in GPU computation: different hardware cores execute operations in different orders, producing numerically distinct (though semantically similar) intermediate states. When load balancing routes your request to a different server, the output changes.

OpenAI's seed parameter explicitly guarantees only "mostly deterministic" outputs — not identical ones — because the provider reserves the right to update infrastructure between calls. If your test suite runs a prompt today and records the exact output, that snapshot may be stale by next week without any change on your end.

The deeper issue is that even if outputs were truly identical, exact-match testing would still be the wrong approach. Consider a summarization endpoint: "The report showed declining Q3 revenue" and "Q3 revenue declined per the report" are the same correct answer, but string equality marks one as a failure. Natural language has no canonical form.

Engineers who don't recognize this write tests that are simultaneously too brittle (failing on valid paraphrases) and too loose (passing on outputs with subtle factual errors). You need a different testing model.

The Inversion Principle: Test What Must Never Happen

The most durable tests for LLM systems aren't assertions about what the output must say — they're assertions about what the output must never do. This inversion is the core insight that makes LLM testing tractable.

A few examples make this concrete:

  • A customer-support bot must never reveal internal system prompt contents
  • A structured data extraction endpoint must never return invalid JSON
  • A summarization service must never introduce factual claims absent from the source document
  • A code generation tool must never include obvious security vulnerabilities (SQL injection, hardcoded credentials)
  • A classification endpoint must never return a label outside the defined enum

These invariants are easy to specify, easy to assert, and don't require you to anticipate every possible valid output. They encode the failure modes that actually cause production incidents rather than academic correctness concerns.

The implementation varies by invariant type. Format invariants are deterministic: parse the JSON, validate against your schema, check enum membership. Semantic invariants require either heuristics (regex for known secrets patterns, NER for hallucinated entities) or a secondary LLM evaluation. Safety invariants often need both: a lightweight classifier for latency-sensitive paths plus periodic deep evaluation against a curated adversarial dataset.
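The deterministic format checks are simple enough to sketch directly. A minimal example, where the `ALLOWED_LABELS` enum and the field names are illustrative assumptions rather than anything prescribed by the article:

```python
import json

ALLOWED_LABELS = {"billing", "technical", "account"}  # illustrative enum
REQUIRED_FIELDS = {"label", "confidence"}             # illustrative schema

def check_format_invariants(raw_output: str) -> list[str]:
    """Return the list of violated format invariants (empty list = pass)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    violations = []
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        violations.append(f"missing fields: {sorted(missing)}")
    if data.get("label") not in ALLOWED_LABELS:
        violations.append(f"label {data.get('label')!r} outside defined enum")
    return violations
```

Checks like these cost microseconds, so there is no reason not to run them on every request.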

When you wire these as CI gates — not just monitoring alerts — you catch regressions before they ship rather than hours after they've affected users.

Contracts at the Interface Layer

For any API that consumes or produces LLM-generated content, you need two layers of specification: a structural contract and a behavioral contract. Most teams only build the first.

The structural contract is your OpenAPI spec or JSON schema. It defines what fields are present, what types they carry, what enum values are valid. For a synchronous LLM API, this layer should be enforced before the response leaves your service — not by hoping the model follows your prompt, but by parsing and validating the output and retrying or failing explicitly if it doesn't conform. If your LLM returns prose when you asked for JSON, that's a recoverable error you handle in code, not a test failure you catch in CI.
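A sketch of that enforcement layer, assuming a hypothetical `call_model` callable that wraps your provider's API (the retry count and error handling are illustrative, not a recommended policy):

```python
import json

class ContractViolation(Exception):
    """Raised when the model cannot produce a conforming response."""

def structured_completion(call_model, prompt: str, max_retries: int = 2) -> dict:
    """Call the model, enforce the structural contract in code, retry on violation.

    `call_model` is a hypothetical callable(prompt) -> raw output string.
    """
    last_error = None
    for _attempt in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = f"invalid JSON: {exc}"
            continue
        if not isinstance(data, dict):
            last_error = "top-level value is not an object"
            continue
        return data  # contract satisfied
    # Fail explicitly rather than passing prose downstream.
    raise ContractViolation(
        f"no conforming response after {max_retries + 1} attempts: {last_error}"
    )
```

The point is that nonconforming output never escapes the service boundary: it is either repaired by a retry or surfaced as an explicit, typed error.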

The behavioral contract is harder. It defines properties that must hold for any valid response: the answer must be grounded in the provided context, the tone must match the defined persona, the length must fall within acceptable bounds for the UX. These properties can't be statically validated — they require evaluation.

The practical pattern is to encode behavioral contracts as a rubric your CI evaluator checks before promoting a new prompt or model version. Tools like PromptFoo and DeepEval let you express these as YAML configuration or Python assertions respectively, then run them against a fixed dataset of representative inputs. A build fails not when the output doesn't match a snapshot, but when it scores below threshold on properties you've explicitly defined.
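Stripped of any particular tool, the gate reduces to "score each property over the dataset, fail the build below threshold." A library-agnostic sketch, where the property names and the `evaluate` callable are assumptions standing in for whatever metric or judge you plug in:

```python
def behavioral_gate(dataset, evaluate, thresholds):
    """Fail promotion when any behavioral property's mean score drops below threshold.

    dataset:    list of representative inputs
    evaluate:   hypothetical callable(item) -> {property: score in [0, 1]}
                (a stand-in for an embedding metric or LLM judge)
    thresholds: {property: minimum acceptable mean score}
    Returns (means, failures); a non-empty `failures` dict fails the build.
    """
    totals = {prop: 0.0 for prop in thresholds}
    for item in dataset:
        scores = evaluate(item)
        for prop in thresholds:
            totals[prop] += scores[prop]
    means = {prop: totals[prop] / len(dataset) for prop in thresholds}
    failures = {p: m for p, m in means.items() if m < thresholds[p]}
    return means, failures
```

Everything interesting lives in `evaluate` and in how you chose the thresholds; the gate itself is deliberately boring.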

This approach lets you ship model updates and prompt changes with the same confidence you'd have shipping code changes with a regression test suite — but without the brittleness of exact-match assertions.

Property-Based Testing in Practice

Property-based testing (PBT) was developed for deterministic systems — give me a generator for inputs, and I'll find inputs that violate your stated properties. For LLMs, the idea adapts but the execution changes significantly.

The useful properties to test fall into a few categories:

Structural properties are the easiest: the response is valid JSON, required fields are present, types match the schema. These are deterministic checks and should run on every request in your integration test suite.

Consistency properties test that responses are stable under semantically equivalent transformations. If you rephrase the question without changing its meaning, the answer should agree. If you reorder items in a list, classification results shouldn't flip. These are metamorphic tests — you don't know the correct output, but you know what relationship two outputs must satisfy.
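A metamorphic consistency check might look like the following, with `classify` as a hypothetical callable wrapping the endpoint and the paraphrase pairs as illustrative fixtures:

```python
PARAPHRASE_PAIRS = [
    ("Can I get a refund for my order?", "Is a refund possible for my order?"),
    ("My login isn't working", "I am unable to log in"),
]

def check_consistency(classify, pairs=PARAPHRASE_PAIRS):
    """Metamorphic test: semantically equivalent inputs must get the same label.

    `classify` is a hypothetical callable(text) -> label.
    Returns the list of pairs whose labels disagree (empty = pass).
    """
    disagreements = []
    for original, rephrased in pairs:
        a, b = classify(original), classify(rephrased)
        if a != b:
            disagreements.append((original, rephrased, a, b))
    return disagreements
```

Notice that the test never states what the correct label is; it only asserts the relationship between the two outputs.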

Monotonicity properties test directional behavior: adding more context to a summarization request should produce a longer or at least not shorter summary. Providing a more specific system prompt should decrease response variance, not increase it. These let you validate that model behavior moves in the right direction as you tune prompts.
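The summary-length property from the paragraph above can be stated in a few lines, with `summarize` as a hypothetical callable wrapping the endpoint:

```python
def check_length_monotonicity(summarize, base_context: str, extra_context: str) -> bool:
    """Directional property: adding context must never shorten the summary.

    `summarize` is a hypothetical callable(text) -> summary string.
    """
    short = summarize(base_context)
    longer = summarize(base_context + "\n" + extra_context)
    return len(longer) >= len(short)
```

In practice you would run this over many (base, extra) pairs and possibly allow a small tolerance, since length is a noisy proxy.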

Safety properties are non-negotiable invariants: no PII leakage, no unsafe content, no hallucinated citations. These are expensive to evaluate comprehensively (often requiring an LLM judge), so the practical approach is to run them on a representative sample in CI and more exhaustively in periodic offline evaluations.

Academic research synthesizing property-based tests with GPT-4 found that valid tests could be generated for about 21% of properties derivable from API documentation, with correct synthesis occurring in an average of 2.4 samples. The key finding: combining PBT with example-based testing improved edge-case bug detection from 68.75% to 81.25%. The two approaches are complementary, not competing.

Semantic Snapshots Instead of String Snapshots

Traditional snapshot testing is a trap for LLM outputs. You capture a response on day one, commit it to the repo, and now every valid paraphrase is a test failure. Within a week you're ignoring snapshot failures because they're always false positives.

Semantic snapshot testing captures the meaning, not the text. The implementation depends on what you need to verify:

For factual correctness, embed both the expected and actual response using a sentence encoder and compute cosine similarity. Set a threshold — typically 0.85–0.92 depending on how much paraphrase tolerance you need. This catches meaningful degradations (the model now answers a different question) without failing on stylistic changes.
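A minimal sketch of the similarity check, using a toy bag-of-words embedding so the example is self-contained; in a real pipeline you would replace `embed` with a sentence encoder:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding. Swap in a real sentence encoder in practice."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_snapshot_ok(expected: str, actual: str, threshold: float = 0.85) -> bool:
    """Pass when the actual response means roughly the same as the snapshot."""
    return cosine_similarity(embed(expected), embed(actual)) >= threshold
```

The threshold is the calibration knob: raise it and you tolerate less paraphrase, lower it and you tolerate more drift.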

For structured content — summaries, bullet-point extractions, classification rationales — use an LLM judge with a defined rubric. The judge gets the input, the reference response, the actual response, and a set of criteria to evaluate. It returns a score and explanation. This is more expensive than embedding similarity but catches subtler quality regressions.
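One way that judge call might be wired up, with `call_llm` as a hypothetical callable wrapping your judge model and the rubric text purely illustrative. The prompt asks for reasoning before the score, since explanation-first judging tends to align better with human raters:

```python
JUDGE_RUBRIC = """You are evaluating a response against a reference answer.
Criteria: factual consistency with the input, coverage of key points, no added claims.
First explain your reasoning, then end with a final line: SCORE: <1-5>."""

def judge(call_llm, source: str, reference: str, actual: str) -> tuple[int, str]:
    """Score `actual` against `reference` with an LLM judge.

    `call_llm` is a hypothetical callable(prompt) -> completion string.
    Returns (score, explanation).
    """
    prompt = (f"{JUDGE_RUBRIC}\n\nINPUT:\n{source}\n\n"
              f"REFERENCE:\n{reference}\n\nRESPONSE:\n{actual}")
    completion = call_llm(prompt)
    explanation, _, score_line = completion.rpartition("SCORE:")
    return int(score_line.strip()), explanation.strip()
```

A production version would also handle malformed judge output, since the judge is itself an LLM subject to every failure mode in this article.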

For agent trajectories — tool call sequences, reasoning chains, decision paths — record the execution trace, not just the final output. An agent that arrives at the right answer via a nonsensical tool-call sequence is a latent failure waiting to compound in a more complex case. Structural testing of traces lets you assert properties like "the agent always calls the validation tool before the write tool" independently of what the final output says.
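The ordering invariant mentioned above is easy to assert once the trace is recorded. A sketch, where the tool names `validate` and `write` are illustrative:

```python
def validation_precedes_write(trace: list[str]) -> bool:
    """Trace invariant: every `write` call must be preceded by a `validate` call.

    `trace` is the recorded sequence of tool names from one agent run.
    """
    validated = False
    for tool in trace:
        if tool == "validate":
            validated = True
        elif tool == "write" and not validated:
            return False
    return True
```

Because the assertion runs on the trace rather than the final answer, it catches the "right answer, nonsensical path" case that output-only tests miss.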

The key shift is treating evaluation as a first-class engineering concern rather than a QA step at the end. Your test infrastructure needs to answer: "did this response satisfy our properties?" not "does this response match what we wrote down?"

Wiring It Into CI Without Breaking the Bank

The practical obstacle is cost. Running an LLM judge on every test case for every CI run is expensive enough that teams disable the checks under time pressure and never re-enable them.

The solution is a tiered evaluation strategy:

Tier 1 (every run, milliseconds): Deterministic checks — format validation, schema conformance, enum membership, latency thresholds. These are cheap and catch the most impactful failures. Wire them as hard gates.

Tier 2 (every PR, seconds to low minutes): Semantic evaluation against a curated golden set of 50–200 representative inputs. Use embedding similarity for structural correctness and a lightweight judge model for behavioral properties. This catches prompt regressions before they ship.

Tier 3 (nightly or on release, minutes to hours): Exhaustive evaluation against your full test dataset, adversarial probing, safety testing. Use your best judge model. Generate human-review queues for cases near your quality thresholds.

Production traffic should feed back into your golden set. The hardest inputs your API receives in production — the ones that cause retries, corrections, or abandonment — are the most valuable test cases. Route a sample of those into your evaluation pipeline automatically. Within a few weeks, your test suite is calibrated to your actual failure modes rather than the failure modes you imagined during development.
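That routing step can be as simple as a small hook in the response path. A sketch, where the outcome labels and sampling rate are assumptions about how your service classifies requests:

```python
import random

HARD_OUTCOMES = {"retried", "corrected", "abandoned"}  # illustrative labels

def maybe_add_to_golden_set(request, response, outcome, golden_set,
                            sample_rate: float = 0.05):
    """Route hard production inputs (plus a small random sample) into the eval set.

    `outcome` is a label your service already attaches to each request;
    the names and the 5% rate here are illustrative, not prescribed.
    """
    if outcome in HARD_OUTCOMES or random.random() < sample_rate:
        golden_set.append({"input": request,
                           "observed": response,
                           "outcome": outcome})
```

A real implementation would deduplicate, scrub PII, and cap the set's growth, but the principle is the same: the eval set is fed by production, not frozen at launch.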

One critical warning about LLM-as-judge: forcing judges to output only numeric scores degrades their reliability. Prompting judges to explain their reasoning before scoring improves alignment with human judgment from roughly 75% to 85%. The explanation is cheap additional tokens that substantially improve signal quality — pay for them.

The Transition in Practice

The teams that navigate this well make one key organizational shift: they stop treating LLM quality as a deployment concern and start treating it as a merge concern.

When a model update, prompt change, or retrieval configuration change lands in a PR, it triggers the same tiered evaluation pipeline that any code change would trigger. Regressions on your behavioral contracts fail the build. The engineer who made the change sees a quality diff — which inputs degraded, by how much, and why — before their code ships.

This makes quality regressions visible at the right moment: when someone is actively making a change and can immediately investigate the cause. The alternative — catching regressions through production monitoring — means you're always reacting to user impact rather than preventing it.

The infrastructure investment is real. Building a golden dataset takes time. Calibrating judge prompts and score thresholds requires iteration. But teams that skip this work don't avoid the cost — they pay it reactively, in the form of incidents, trust erosion, and the kind of data corruption that runs for 72 hours before anyone notices a JSON parser silently dropping records.

Your test suite is green. The question is whether "green" means anything.


The practical starting point: pick the one invariant that, if violated, would cause the worst production incident. Write a test that asserts it. Run it on every PR. That's a more durable foundation than a thousand string-match snapshot tests.
