
Property-Based Testing for LLM Outputs: Finding the Bugs Your Eval Set Never Imagined

· 11 min read
Tian Pan
Software Engineer

Your eval suite says 94% accuracy. Users report the feature is broken for names that aren't "John" or "Alice." Both things are true, and the gap between them has a name: your curated test set encodes only the failure modes you already imagined.

Property-based testing (PBT) was invented in 1999 to expose exactly this class of blind spot in deterministic software. Applied to LLM outputs, it generates tens of thousands of adversarial input variants automatically, probing domain boundaries that hand-written test cases structurally cannot reach. A 2025 OOPSLA study found that each property-based test discovers roughly 50 times as many mutant bugs as the average unit test. A separate study measured that PBT and example-based testing (EBT) fail on different bugs — combining both raised detection rates from 68.75% to 81.25%. That 12.5-point gap is not rounding error; it represents an entire class of failure invisible to either approach alone.

This article is for engineers who already have eval suites and want to find the bugs those suites structurally cannot find.

Why Your Eval Set Has a Blind Spot by Design

When you write a curated eval set, you enumerate scenarios you consider important. You test happy paths, known edge cases, and the specific failures that burned you before. This is valuable — but it is an exercise in testing your imagination, not the full input space.

The structural problem: curated sets encode the inputs you thought of. Real user traffic encodes the inputs you didn't. A customer service bot tested on customer_name: "Alice" may behave differently on customer_name: "张伟" — not because of any intent in the code, but because the training data distribution created an implicit coupling the developer never considered. A classifier tested with the question "What is the refund policy?" may behave differently on the semantically identical "Can I get a refund?" — not because the intent is different, but because the lexical surface is.

Property-based testing inverts the model. Instead of asserting f(input_1) == expected_1, you assert a universal invariant: f(transform(input)) should satisfy property P for all generated input. The framework generates thousands of inputs and tries to falsify your property. When it does, it shrinks the counterexample to the minimal failing case.
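The generate/falsify/shrink loop is easy to sketch without any framework. Below is a toy, stdlib-only version using a deliberately false property ("reversing a list never changes it"); the generator and shrinker are simplified stand-ins for what a real PBT library like Hypothesis does properly:

```python
import random

# Minimal sketch of the PBT loop: generate inputs, try to falsify a
# property, then shrink the counterexample to a minimal failing case.
def prop(xs):
    return list(reversed(xs)) == xs  # false for most lists

def shrink(prop, xs):
    # Greedily drop elements while the property still fails.
    changed = True
    while changed:
        changed = False
        for i in range(len(xs)):
            cand = xs[:i] + xs[i + 1:]
            if not prop(cand):
                xs, changed = cand, True
                break
    return xs

def falsify(prop, tries=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(tries):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 10))]
        if not prop(xs):
            return shrink(prop, xs)
    return None  # property survived all generated inputs

print(falsify(prop))  # minimal failing case: a 2-element list [a, b] with a != b
```

Hypothesis does the same thing with smarter generation strategies and integrated shrinking, but the shape of the loop is identical.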

For deterministic software, this is already powerful. For LLM-backed software, it is arguably more necessary — and more challenging to implement correctly.

What "Properties" Mean for Non-Deterministic Text Outputs

The standard objection is that LLM outputs are non-deterministic strings with no single correct answer. You cannot assert output == expected_string. This is true, and it misses the point.

Properties for LLM outputs are not equality assertions. They are invariants about the relationship between inputs and outputs, or about structural characteristics of outputs.

Structural conformance is the easiest entry point. Output must be valid JSON. Output must contain required fields. Response must match a declared schema. These are binary, deterministic, and catch a high fraction of regressions after model or prompt changes. They require zero LLM calls to evaluate — just a schema validator.
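A structural check of this kind needs nothing beyond the standard library. The field names below are illustrative, not from any particular system:

```python
import json

# Structural property: output parses as JSON and contains required fields
# with the right types. No LLM call needed; runs against recorded outputs.
REQUIRED = {"intent": str, "confidence": (int, float), "reply": str}

def check_structure(raw: str) -> list[str]:
    errors = []
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"not valid JSON: {e}"]
    for field, typ in REQUIRED.items():
        if field not in obj:
            errors.append(f"missing field: {field}")
        elif not isinstance(obj[field], typ):
            errors.append(f"wrong type for {field}")
    return errors

print(check_structure('{"intent": "refund", "confidence": 0.9, "reply": "ok"}'))  # []
print(check_structure('{"intent": "refund"}'))  # two missing-field errors
```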

Semantic invariance is the most powerful class. Define a transformation T of the input such that the output should satisfy a known relationship. Examples:

  • Equivalence under paraphrase: "What is your return policy?" and "How do I return a product?" should yield equivalent answers. Test with cosine similarity ≥ 0.8 between embeddings of both outputs.
  • Demographic invariance: Swapping customer_name: "Michael" for customer_name: "Wei" in a non-gendered scenario should not change the sentiment or outcome of the response.
  • Order invariance: Reordering facts presented in a prompt should not change the conclusion. A model that reaches different answers based on which supporting document it sees first — when all documents say the same thing — is exhibiting order sensitivity that will cause unpredictable production behavior.
  • Addition invariance: Adding irrelevant padding text should not flip a classification.
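A demographic-invariance check from the list above can be sketched as a small harness. `call_model` is a hypothetical stand-in for the real LLM call, stubbed here so the harness runs; the template and name list are illustrative:

```python
# Demographic invariance: swapping the customer name in a non-gendered
# scenario should not change the model's classification.
NAMES = ["Alice", "Maria", "Zhang Wei", "Mohammed", "Priya"]
TEMPLATE = "Customer {name} reports their package arrived damaged. Classify urgency."

def call_model(prompt: str) -> str:
    return "high"  # stub: the real LLM call goes here

def demographic_invariance(template, names):
    outputs = {name: call_model(template.format(name=name)) for name in names}
    return len(set(outputs.values())) == 1, outputs

ok, outputs = demographic_invariance(TEMPLATE, NAMES)
assert ok, f"output varies with customer name: {outputs}"
```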

A 2026 study tested seven models with eight semantic-preserving transformations across reasoning problems. The best model (30B parameters) violated semantic invariance 20.4% of the time. The worst violated it 73% of the time. More strikingly, the 30B model significantly outperformed models with ten to a hundred times more parameters on this axis. Scale is not a reliable proxy for semantic robustness — a finding invisible to capability benchmarks.

Monotonicity is applicable to confidence or quality scores: adding more relevant evidence should not decrease answer quality. If your system provides a confidence score, adding a second corroborating source should not lower it.

Length and format bounds apply when outputs have known structural requirements: response must be fewer than N tokens, list must contain exactly K items, output must begin with a specific prefix. These are fully deterministic properties testable without LLM judges.

Parametric Input Generation: Three Approaches That Work

The mechanical challenge is generating the inputs that will falsify your properties. Three strategies cover most practical scenarios.

Entity substitution is the most accessible starting point. Replace named entities with semantically equivalent alternatives — same role, different referent. For a customer service bot: vary customer_name across ["Alice", "Maria", "Zhang Wei", "Mohammed", "Priya"], vary product across the full SKU catalog, vary issue_type across known request categories. This is a Cartesian product that generates coverage no human would write by hand, and it reliably exposes demographic or product-specific behavioral drift.

In practice, Python's Hypothesis library handles this with st.sampled_from(entity_list). Promptfoo achieves the same through YAML vars declarations. The same inputs run against your structural and semantic invariant properties.
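For a fully exhaustive sweep, the stdlib equivalent of the Hypothesis approach is `itertools.product`. The entity lists and `bot_reply` stub below are illustrative placeholders for a real call:

```python
import itertools
import json

# Exhaustive entity-substitution sweep via the Cartesian product.
# (A Hypothesis version samples from st.sampled_from strategies instead.)
NAMES = ["Alice", "Maria", "Zhang Wei", "Mohammed", "Priya"]
PRODUCTS = ["SKU-1001", "SKU-1002", "SKU-1003"]
ISSUES = ["refund", "exchange", "damage", "late_delivery"]

def bot_reply(name, product, issue):
    # Stub for the real LLM call; returns a structured reply.
    return json.dumps({"customer": name, "issue": issue, "status": "open"})

failures = []
for name, product, issue in itertools.product(NAMES, PRODUCTS, ISSUES):
    obj = json.loads(bot_reply(name, product, issue))
    if obj.get("issue") != issue:  # echo property: reply reflects the request
        failures.append((name, product, issue))

print(f"{5 * 3 * 4} combinations checked, {len(failures)} failures")
```

Sixty combinations is trivial to enumerate; once the product grows into the thousands, sampling via Hypothesis strategies keeps run time bounded.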

Numeric perturbations target threshold sensitivity. LLMs trained on financial, medical, or analytical tasks frequently exhibit behavioral changes at psychologically round numbers that do not correspond to actual domain thresholds. Test amount at $0.99, $1.00, $1.01, $999.99, $1000, $1000.01. Test confidence_threshold at 0.49, 0.50, 0.51. A QA study of an LLM-RAG application found that varying generation temperature and top-p parameters caused sentiment estimation to drop from 0.99 to 0.35 at maximum values — a 64% degradation invisible to fixed-input eval suites.
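A minimal boundary-value generator covers those perturbations; the thresholds below are the illustrative ones from above, not domain constants:

```python
# For each domain threshold, emit values just below, at, and just above it.
def boundary_values(thresholds, eps):
    for t in thresholds:
        yield round(t - eps, 10)
        yield t
        yield round(t + eps, 10)

amounts = list(boundary_values([1.00, 1000.00], 0.01))
confidences = list(boundary_values([0.50], 0.01))
print(amounts)       # 0.99, 1.0, 1.01, 999.99, 1000.0, 1000.01
print(confidences)   # 0.49, 0.5, 0.51
```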

Instruction reordering targets the premise order sensitivity documented in multiple studies. A multi-step prompt — "Summarize the document, then classify the sentiment" — may behave differently from the reordered version "Classify the sentiment, then summarize the document." Research from ICLR 2024 demonstrated that reordering premises in reasoning problems significantly degraded accuracy across tested models. For any system prompt with multiple instructions, reordering should be a standard invariant: output quality should not depend on instruction sequence.
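Generating the reordered variants is a short loop over `itertools.permutations`; the instruction list and joining scheme below are illustrative:

```python
import itertools

# Instruction-reordering sweep: every permutation of the instruction list
# becomes a prompt variant. The property is that outputs agree across all of them.
INSTRUCTIONS = [
    "Summarize the document",
    "Classify the sentiment",
    "List any action items",
]

def build_prompt(instructions):
    return ", then ".join(instructions) + "."

variants = [build_prompt(p) for p in itertools.permutations(INSTRUCTIONS)]
print(len(variants))   # 3! = 6 prompt variants
print(variants[0])     # the original ordering
```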

The Oracle Problem: Why PBT for LLMs Requires Thought

Classic property-based testing benefits from cheap, automatic oracles. sorted(list) is correct if every adjacent pair satisfies a ≤ b and the output is a permutation of the input — both checks are mechanical. There is no equivalent for open-ended LLM output.

This is the primary engineering challenge, not the input generation. You have three oracle options, each with tradeoffs.

Structural oracles are fast, cheap, and deterministic — but incomplete. JSON schema validation tells you the response is well-formed; it says nothing about whether the content is accurate or appropriate. Teams that rely only on structural properties develop false confidence.

Embedding similarity oracles are the practical workaround for semantic invariance testing. Compute cosine similarity between the embeddings of two outputs that should be semantically equivalent. Assert similarity ≥ 0.8. This is stable across model versions (unlike exact string match) and does not require running another LLM call to evaluate. The limitation: cosine similarity is a proxy for semantic equivalence, not a guarantee.
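This oracle can be sketched in plain Python. The embedding step is replaced by hand-made vectors standing in for real embedding-model output:

```python
import math

# Embedding-similarity oracle: cosine similarity between two output
# embeddings, asserted against a 0.8 floor.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantically_equivalent(vec_a, vec_b, floor=0.8):
    return cosine(vec_a, vec_b) >= floor

a = [0.90, 0.10, 0.40]  # stand-in: embedding of one paraphrase's answer
b = [0.85, 0.15, 0.35]  # stand-in: embedding of the other paraphrase's answer
print(semantically_equivalent(a, b))  # True for these near-parallel vectors
```

In production the vectors come from an embedding API call per output; the comparison itself stays this cheap.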

LLM-as-judge oracles are the highest-fidelity option for subjective properties, but they add inference cost and introduce another source of non-determinism. Use them selectively — for high-stakes behavioral properties where structural and similarity checks are insufficient. Semgrep uses GPT-4o as a judge for 14 prompt chains in CI; they cache LLM responses keyed on prompt hash to keep CI deterministic across runs that don't touch the prompt.
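The caching pattern is simple to sketch. This in-memory version, inspired by but not copied from Semgrep's setup, keys verdicts on a SHA-256 of the judge prompt:

```python
import hashlib

# Cache LLM-judge verdicts keyed on a hash of the judge prompt, so CI runs
# that don't touch the prompt replay cached verdicts deterministically.
_cache: dict[str, str] = {}

def judge(prompt: str, live_call) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = live_call(prompt)  # hit the API only on cache miss
    return _cache[key]

calls = []
def fake_judge_api(prompt):
    calls.append(prompt)  # record each live call for demonstration
    return "PASS"

judge("Is this reply polite? <response...>", fake_judge_api)
judge("Is this reply polite? <response...>", fake_judge_api)  # cache hit
print(len(calls))  # 1 live call, not 2
```

A real setup persists the cache (e.g. to disk or CI artifacts) so it survives across pipeline runs; the keying logic is unchanged.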

Researchers have catalogued 191 distinct metamorphic relations for NLP LLM tasks across five categories: equivalence, discrepancy, set equivalence, distance, and set distance. This is a practical library that teams can implement incrementally, starting with the subset relevant to their task type.

Three-Tier CI Integration That Doesn't Break Your Pipeline

The practical challenge is that running 100 input variants per property is 100× the inference cost. The solution is a tiered architecture that matches cost to commit frequency.

Tier 1: Structural properties on every commit. JSON schema conformance, required field presence, length bounds, format assertions. These require zero LLM calls — use recorded/cached responses. Binary pass/fail. Fast enough to block a PR in under a minute.

Tier 2: Statistical property assertions on PR merge. For semantic invariance and parametric input properties, run each property test N times (10–50 is typical) and assert a minimum pass rate — "must pass 8 of 10 runs" rather than binary pass. Use semantic similarity against embeddings rather than exact string match. Gate on this before merging to main, not on every commit.
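The "pass 8 of 10" gate reduces to a tiny helper; the non-deterministic check is stubbed with a fixed pass/fail sequence for illustration:

```python
# Tier 2 statistical gate: run a non-deterministic check N times and
# assert a minimum pass rate instead of binary pass/fail.
def pass_rate_gate(check, n=10, min_passes=8):
    passes = sum(1 for _ in range(n) if check())
    return passes >= min_passes, passes

results = iter([True] * 9 + [False])  # simulate 9 of 10 runs passing
ok, passes = pass_rate_gate(lambda: next(results))
print(ok, passes)  # True 9
```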

Tier 3: Full parametric sweeps on schedule. Comprehensive entity substitution, numeric perturbation, and instruction reordering runs against the full property suite. Run nightly or triggered by model version changes. Failure here blocks model upgrades, not feature deploys.

The threshold design matters. Setting temperature=0 does not make LLMs deterministic — research shows up to 15% accuracy variance across identical runs and best-to-worst gaps reaching 70% on some benchmarks within a single session. This means a property test that fails 3 of 10 runs is not necessarily a flaky test; it may be exposing genuine, reproducible variance. Set your pass rate thresholds based on observed baseline variance, not a round number.

One workflow that works: use Hypothesis with a VCR-style HTTP recorder so that generated inputs run against recorded LLM responses for Tier 1 replays, falling back to live API calls only for Tier 2 and 3 runs. This gives you the input generation power of PBT with the determinism of snapshot testing for the fastest feedback loop.

Where PBT Fails to Give Confidence

Two failure modes warrant explicit acknowledgment.

The completeness ceiling. There is no way to know if your property suite is complete. Research found that even the best LLMs can automatically synthesize correct property tests for only 21% of the properties extractable from API documentation. The bugs invisible to your chosen properties remain invisible. PBT expands your coverage but does not close it.

Compound system behavior. Property testing at the unit level — individual prompt calls — does not surface issues that emerge from multi-step composition. Four LLM calls at 95% individual stability yield roughly 81% system-level stability. The cascade behavior of an agent pipeline requires integration-level properties, not just unit-level ones. This is tractable but requires designing properties at the pipeline level, not just at individual call boundaries.
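The 81% figure is just compounding: assuming independence, per-call stability multiplies across the chain.

```python
# Four sequential calls at 95% individual stability, assumed independent:
per_call = 0.95
calls = 4
system = per_call ** calls  # 0.95^4 = 0.8145...
print(round(system, 2))     # 0.81
```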

Both limitations argue for combining PBT with other testing approaches, not replacing them. The 2025 study finding — that PBT and example-based testing fail on different bugs — is the key insight: they are complements, not substitutes.

What to Start With

The ROI case for beginning small is strong. Pick the three highest-risk prompt calls in your pipeline — the ones handling the widest variety of user input, touching the most sensitive business logic, or with the most observed production variance. Write one structural property (schema conformance) and one semantic invariance property (demographic or paraphrase invariance) for each.

Run them in Tier 2 on every PR merge. This covers six properties, costs roughly 300 additional LLM calls per deploy, and will reliably surface behavioral drift that your current eval set cannot. Add entity substitution sweeps for any prompt that handles user-supplied named entities — this is where demographic invariance failures most commonly appear.

The 50× bug-finding multiplier does not arrive instantly. It accumulates as you add properties and as your parametric coverage expands to match the actual input distribution of your production traffic. Start narrow, instrument aggressively, and let the failure reports guide where to expand next.

Your eval set is a map of the territory you've already explored. Property-based testing is how you find what's beyond the edge.
