Property-Based Testing for LLM Systems: Invariants That Hold Even When Outputs Don't

· 12 min read
Tian Pan
Software Engineer

A product team at a fintech company shipped an LLM-powered document summarizer. Their eval dataset — 200 hand-curated examples with human ratings — scored 87% quality. In production, the system occasionally returned summaries longer than the original documents when users uploaded short memos. The eval set had no memos under 300 words. The property "output length ≤ input length for summarization tasks" was never tested. Nobody noticed until a customer screenshotted the absurdity and posted it online.

This is the fundamental gap that property-based testing (PBT) fills. Eval datasets measure accuracy on what you thought to test. Property-based tests measure whether your system obeys a contract across the entire space of what could happen.

What Property-Based Testing Is, and Why It's Different

Example-based testing (the dominant paradigm in software engineering) works by asserting: "given input X, output should be Y." You pick inputs deliberately — typical cases, edge cases, boundary values — and write expected outputs by hand or curate them from labeled data. For deterministic code, this works well enough. For LLM systems, it breaks down in two ways: the output space is infinite and natural language, and the function itself is probabilistic.

Property-based testing inverts the approach. Instead of specifying input-output pairs, you specify invariants — structural or semantic properties that must hold for any valid input. The framework then generates hundreds or thousands of inputs automatically and checks whether the property holds across all of them.

The classic example in traditional software: rather than testing sort([3,1,2]) == [1,2,3], you test the property "the output is always non-decreasing AND contains the same elements as the input." This property is true regardless of what list you feed in. Hypothesis (Python), fast-check (JavaScript/TypeScript), and PropEr (Erlang) are the canonical tools for this approach. They generate random inputs, find failures, then automatically shrink failing cases to their minimal form — making bugs easy to diagnose.

The mental shift is significant. Writing good properties requires reasoning about what your system must always guarantee, not just what it usually produces. For LLMs, this turns out to be more tractable than it sounds — because there are several classes of invariants that hold even for systems with non-deterministic outputs.

The Invariants That Actually Hold for LLMs

Length bounds. The most mechanical invariant, but frequently violated in production. A summarizer should produce output shorter than its input (with some tolerance). A tweet generator should stay under 280 characters. A structured extraction prompt asking for "at most five bullet points" should never return six. These properties require no judgment about quality — they're binary, testable, and often broken by long-tail inputs that eval sets don't cover. Testing them across thousands of generated inputs at varying lengths catches truncation bugs, token-limit edge cases, and prompt instruction failures that curated evals miss.
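A minimal sketch of a length-bound property with Hypothesis. `summarize` here is a hypothetical stand-in — stubbed with simple truncation so the property itself runs; in practice it would wrap your real LLM client:

```python
from hypothesis import given, settings, strategies as st

def summarize(text: str) -> str:
    # Stub for an LLM summarizer call; replace with your real client.
    return text[: max(1, len(text) // 2)]

@given(st.text(min_size=1, max_size=2000))
@settings(max_examples=200, deadline=None)
def test_summary_never_longer_than_input(doc: str) -> None:
    summary = summarize(doc)
    # Invariant: output length <= input length. Add a small slack factor
    # here if your task tolerates slightly longer outputs on tiny inputs.
    assert len(summary) <= len(doc)
```

Because Hypothesis drives the input lengths from 1 character up to thousands, this is exactly the test that would have caught the short-memo failure in the opening anecdote.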

Schema conformance. When your LLM is expected to return structured output — JSON, YAML, a specific object shape — schema conformance is an invariant. It must hold across every input, not just the 50 examples in your eval set. Constrained decoding (via Outlines, Guidance, or OpenAI's structured output mode) helps, but schema conformance testing reveals edge cases: deeply nested schemas, large enum fields, cross-field constraints, and prompts that push the model toward verbose hedging that breaks JSON structure. Research on JSONSchemaBench found that different constrained decoding engines fail on different schema patterns — what passes in development may time out or fail in a different deployment environment.
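A schema-conformance property can look like this — `extract` and its required keys are illustrative assumptions, stubbed so the test runs without an API call; the checks themselves (parses as JSON, required keys present, enum values valid) are the ones you would run against real model output:

```python
import json
from hypothesis import given, settings, strategies as st

REQUIRED_KEYS = {"title", "sentiment"}

def extract(text: str) -> str:
    # Stub for an LLM structured-extraction call; replace with your client.
    return json.dumps({"title": text[:40], "sentiment": "neutral"})

@given(st.text(min_size=1, max_size=500))
@settings(max_examples=200, deadline=None)
def test_extraction_is_valid_json_with_required_keys(doc: str) -> None:
    raw = extract(doc)
    parsed = json.loads(raw)  # must parse: no markdown fences, no prose preamble
    assert REQUIRED_KEYS <= parsed.keys()
    assert parsed["sentiment"] in {"positive", "negative", "neutral"}
```

For production schemas, swapping the hand-rolled checks for a full validator (e.g. the `jsonschema` package) keeps the property in sync with the schema you ship.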

Semantic monotonicity. This is a subtler but powerful invariant: systems where more relevant context should produce better (or at least not worse) output can be tested for monotonic degradation. A RAG system retrieving five relevant documents should produce a more accurate answer than one with three. A classification prompt with the full question should outperform a truncated version. You can't assert the exact output, but you can assert directionality — score(more context) ≥ score(less context) — using embedding similarity or an LLM-as-judge evaluator. Violations of this property reveal context integration bugs, retrieval failures, and prompt templates that inadvertently suppress useful signal.
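The directional assertion can be sketched as follows — `score_answer` is a hypothetical evaluator (in practice embedding similarity or an LLM-as-judge), stubbed with a monotone function so the property is runnable:

```python
from hypothesis import given, settings, strategies as st

def score_answer(question: str, context_docs: list[str]) -> float:
    # Stub evaluator score; in practice: embedding similarity or LLM-as-judge.
    return min(1.0, 0.2 * len(context_docs))

@given(
    st.lists(st.text(min_size=1), min_size=1, max_size=5),
    st.integers(min_value=1, max_value=4),
)
@settings(max_examples=100, deadline=None)
def test_more_context_never_scores_worse(docs: list[str], k: int) -> None:
    subset = docs[: min(k, len(docs))]
    # Directional invariant: score(more context) >= score(less context).
    assert score_answer("q", docs) >= score_answer("q", subset)
```

With a real evaluator the comparison should include a tolerance band, since judge scores are themselves noisy.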

Idempotency under rephrasing (metamorphic testing). Perhaps the most powerful property class for catching hallucinations and reasoning failures. The invariant: semantically equivalent inputs should produce semantically equivalent outputs. "What is the capital of France?" and "Name the capital city of France" should both return Paris, not one returning Paris and the other returning Lyon because the model recognized a superficially different phrasing. This is the foundation of metamorphic testing for LLMs — a technique that has recently produced rigorous hallucination detection frameworks. MetaQA (ACM 2025) uses synonym and antonym mutations of LLM responses to verify factual consistency. If "Baseball is popular in Japan" is asserted, the model should also verify the antonym "Baseball is not popular in Japan" as false. Violations surface fact-conflicting hallucinations that consistency-based methods like repeated sampling miss entirely.

Research on Drowzee (OOPSLA 2024) extended this to build a fact-conflicting hallucination detector using metamorphic relations derived from a Wikipedia knowledge base, catching hallucinations that model self-consistency checks reinforced rather than corrected.
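A paraphrase-consistency property along these lines can be sketched with Hypothesis generating rephrasings — `answer` is a hypothetical stub standing in for a real model call, and the prefix/template lists are illustrative:

```python
from hypothesis import given, settings, strategies as st

def answer(question: str) -> str:
    # Stub LLM call; replace with your real client. The stub ignores phrasing.
    return "Paris" if "france" in question.lower() else "unknown"

prefixes = st.sampled_from(["", "Please, ", "Hi! Quick question: ", "Could you tell me "])
templates = st.sampled_from([
    "what is the capital of France?",
    "name the capital city of France",
])

@given(prefixes, templates)
@settings(max_examples=50, deadline=None)
def test_answer_stable_under_rephrasing(prefix: str, template: str) -> None:
    baseline = answer("What is the capital of France?")
    # Metamorphic relation: semantically equivalent prompts, equivalent answers.
    assert answer(prefix + template).strip().lower() == baseline.strip().lower()
```

With free-form outputs, exact string comparison gives way to an embedding-similarity or LLM-as-judge equivalence check, but the metamorphic relation is the same.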

Content safety properties. The invariant: no prompt from your defined input space should elicit a response containing toxic content, PII disclosure, or policy-violating material. This maps naturally to property-based testing — the framework generates diverse inputs (including adversarial mutations) and a safety classifier checks the output. The key insight is that safety properties are boolean: either the output violates a constraint or it doesn't. Property tests can be run overnight against thousands of generated inputs, catching safety failures that a curated red-teaming dataset might miss due to coverage gaps.

Structural invariants in agentic systems. When LLMs are used as orchestrators — calling tools, generating plans, producing structured arguments — there are invariants that go beyond output content. Plans should be acyclic. Tool arguments should satisfy schema validation. Actions should be idempotent when marked as such. A production failure documented in 2026 involved a multi-step agent where truncated tool output caused incomplete file writes; the agent had no property asserting "tool call result completeness before proceeding," and the silent failure cascaded into a 20-hour debugging session.

Tools and Frameworks: What to Reach For

Hypothesis (Python) remains the most mature tool for this work. Its stateful testing capability (RuleBasedStateMachine) lets you model multi-turn LLM interactions as state machines, generating sequences of actions and verifying invariants hold across the full interaction, not just individual turns. The @settings(max_examples=1000) decorator scales up test coverage for CI overnight runs.
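A stateful sketch, assuming a hypothetical `chat_turn` wrapper around your multi-turn LLM client (stubbed here so the machine runs): Hypothesis generates sequences of `send_message` calls and checks the invariant after every step, not just at the end.

```python
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

def chat_turn(history: list, user_msg: str) -> str:
    # Stub for a multi-turn LLM call; replace with your real client.
    return f"echo: {user_msg[:100]}"

class ConversationMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.history = []

    @rule(msg=st.text(min_size=1, max_size=300))
    def send_message(self, msg):
        reply = chat_turn(self.history, msg)
        self.history.append((msg, reply))

    @invariant()
    def replies_are_bounded(self):
        # Holds across the whole generated interaction, not one turn.
        assert all(len(reply) <= 500 for _, reply in self.history)

# pytest/unittest entry point for the generated state-machine test.
TestConversation = ConversationMachine.TestCase
```

Real invariants at this level tend to be things like "conversation state referenced by the model still exists" or "tool-call arguments in turn N validate against the schema declared in turn 0."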

fast-check (TypeScript/JavaScript) is the right tool for LLM applications built on Node.js — common in full-stack AI products. Its arbitrary composability makes it straightforward to generate structured prompts, user personas, or document inputs as typed inputs to LLM wrapper functions.

PropEr (Erlang/Elixir) is less common in AI applications but worth knowing for systems built on BEAM — particularly for testing conversational state machines in Phoenix applications.

For LLM-specific invariants that require semantic evaluation, you'll typically combine a PBT framework with an evaluator: a sentence-transformer model for embedding similarity, a rule-based safety classifier, or an LLM-as-judge. The PBT framework handles input generation and shrinking; the evaluator produces the boolean pass/fail signal that the framework needs.

DeepEval integrates with pytest and provides 14+ evaluation metrics (hallucination, toxicity, answer relevance, schema correctness) as callable functions — which means they can be used as property checkers inside a Hypothesis test. This combination unlocks semantic property testing without building evaluation infrastructure from scratch.

Why This Is Different From Eval Datasets

The distinction matters practically. Eval datasets answer the question "how does my system perform on a representative sample?" Property-based testing answers "does my system ever violate a contract?" These are different questions, and teams that conflate them ship bugs with high eval scores.

Eval datasets are static. A 500-example dataset covers 500 inputs, no matter how carefully curated. Property-based tests explore the input space continuously, generating cases the test author never anticipated. Research applying property-based testing to HumanEval programs is telling: PBT and example-based testing each caught 68.75% of bugs individually, but together caught 81.25%. The bugs each approach caught were different bugs — PBT found performance-related failures and edge cases in structural inputs; EBT found specific boundary conditions requiring precise examples. Neither approach subsumes the other.

Eval datasets require labeled expected outputs. For complex tasks, this requires expensive human annotation or model-as-judge alignment. Property-based tests require only a verifiable invariant — often automatable without human annotation. For many properties (length bounds, schema conformance, content safety), the checker is deterministic and cheap to run at scale.

Eval datasets go stale. When your model changes, your prompt changes, or your input distribution shifts, eval scores become misleading if the dataset doesn't reflect the new distribution. Property tests are distribution-agnostic — they test contracts, not examples, so they remain valid across model versions.

The practical workflow: use eval datasets to assess quality and compare model versions; use property tests to enforce contracts and catch regressions.

Failure Modes and False Confidence

Property tests have failure modes that are worth understanding before betting on them.

Weak properties give false confidence. The property "output is non-empty" will pass for nearly any LLM on nearly any input. This technically counts as a property test passing, but it catches almost nothing useful. Writing useful properties requires domain knowledge — you have to know what your system must guarantee, not just what it tends to produce.

Coverage gaps are invisible. A property test explores a sampled slice of the input space, not all of it. With 1000 generated inputs, rare failure modes below 0.1% occurrence might not appear. Statistical sampling doesn't guarantee coverage the way that manual test case selection sometimes does for known edge cases.

Evaluators introduce their own errors. When your property checker is a semantic similarity function or an LLM-as-judge, false positives and false negatives in the evaluator propagate into the test results. A safety classifier that misses some toxic outputs will give you false confidence. Calibrate your evaluators on known examples before trusting them as property checkers.

Shrinking is hard for LLM tests. One of Hypothesis's most valuable features is automatic shrinking — when it finds a failing input, it finds the smallest input that still fails. For LLM tests where the "input" is a document or conversation, shrinking is less meaningful and the minimal failing case may not be illuminating.

Cost. Running 1000 LLM invocations per test function is expensive. The practical mitigation: run cheap property tests (schema, length, safety classifiers) in CI, and run expensive semantic property tests (LLM-as-judge evaluation) on a nightly schedule or per release.

What Property Testing Catches That Eval Datasets Don't

The research evidence and production experience converge on a consistent picture:

Eval datasets catch quality regressions on known task types. Property tests catch contract violations on the long tail of inputs. The failure modes that property tests specifically surface include:

  • Input-length sensitivity bugs: prompts that fail when inputs are very short, very long, or contain unusual distributions of tokens — inputs that look nothing like curated eval examples.
  • Instruction-following failures under paraphrase: a model that follows "return JSON only" for the exact phrasing in your system prompt but fails when the user adds a polite prefix, changing the effective instruction slightly.
  • Hallucination consistency failures: the MetaQA approach revealed that many hallucinations are inconsistently produced — the model will assert a falsehood in one phrasing but correctly deny it in another. Eval datasets with single prompts never surface this.
  • Safety bypasses under distribution shift: safety classifiers and guardrails evaluated on curated adversarial prompts miss novel jailbreaks that property-based generation stumbles onto.
  • Schema failures under constraint stress: large LLM-facing schemas with many cross-field constraints fail in ways that only appear when testing against hundreds of different input shapes — not the 20 examples in an eval set.

Getting Started Without Rewriting Your Test Suite

The practical entry point is to pick two or three invariants specific to your application and write property tests for those only. Start with the cheapest to verify: output length bounds and schema conformance require no LLM calls to evaluate. Add content safety checks using an off-the-shelf classifier. Then, if your application has clear semantic monotonicity expectations (a retrieval system, a summarizer), add a threshold-based semantic property test.

The key insight from both the research literature and production practice is that property-based testing for LLMs is not a replacement for good engineering judgment about what to test — it's a force multiplier for that judgment. You still need to identify the invariants. The framework then exhaustively explores whether they hold.

The systems that catch the most bugs in production are the ones that treat LLMs as software components with contractual obligations, not as black boxes evaluated only by sampling. Property-based testing is how you write those contracts down and enforce them automatically.

The frameworks mentioned: Hypothesis (Python), fast-check (JavaScript/TypeScript), PropEr (Erlang/Elixir), DeepEval (Python, LLM-specific metrics).
