We’ve been using GitHub Copilot and Claude for code generation across our team at TechFlow for the past eight months, and I want to share something that completely changed how we think about testing AI-generated code.
The Problem Nobody Talks About
AI-generated code has a surprisingly high defect rate. Various studies and industry reports put it at 50% or higher for non-trivial functions. The insidious part? The code looks correct. It’s well-structured, follows conventions, has reasonable variable names, and handles the obvious cases. Traditional unit tests — the kind we write based on expected inputs and outputs — often pass because we tend to test the same “happy path” that the AI was trained on.
Our team experienced this firsthand. After adopting Copilot, our unit test coverage stayed at 85% (our internal target), but production bugs increased by 40% over a three-month period. We were writing more code faster, our tests were green, and yet our bug tracker was filling up. Something was fundamentally broken in our quality feedback loop.
Enter Property-Based Testing
Property-based testing (PBT) flips the testing paradigm on its head. Instead of writing specific example-based tests like “given input X, expect output Y,” you define properties — invariants that must always hold true for any valid input — and the testing framework generates hundreds or thousands of random inputs to find violations.
For example, instead of testing sort([3, 1, 2]) equals [1, 2, 3], you define properties like:
- The output length must equal the input length
- Every element in the output must exist in the input
- Each element must be less than or equal to the next element
The framework then generates random lists — empty lists, single-element lists, lists with duplicates, lists with negative numbers, lists with MAX_INT values — and checks if your properties hold.
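As a dependency-free sketch of that loop (a real framework like Hypothesis adds smarter generation and automatic shrinking of counterexamples; the function names here are mine, not from any library), the three sort properties can be checked like this:

```python
import random
from collections import Counter

def check_sort_properties(sort_fn, trials=1000):
    """Generate random lists and check the three sort invariants."""
    random.seed(42)  # fixed seed so this sketch is reproducible
    nasty = [0, -1, 1, 2**31 - 1, -2**31]  # boundary values worth over-sampling
    for _ in range(trials):
        n = random.randint(0, 20)  # includes empty and single-element lists
        xs = [random.choice(nasty) if random.random() < 0.3
              else random.randint(-100, 100)  # small range forces duplicates
              for _ in range(n)]
        out = sort_fn(list(xs))
        # Property 1: output length equals input length
        assert len(out) == len(xs), f"length changed for {xs}"
        # Property 2: same elements, same multiplicities
        assert Counter(out) == Counter(xs), f"elements changed for {xs}"
        # Property 3: each element <= the next
        assert all(a <= b for a, b in zip(out, out[1:])), f"not ordered: {out}"
    return True

check_sort_properties(sorted)  # the built-in satisfies all three
```

Note that none of the assertions mention a specific expected output — only relationships that must hold for any input.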
The Tools
The ecosystem is more mature than most developers realize:
- Hypothesis (Python) — The gold standard. Incredibly powerful generators and shrinking.
- fast-check (TypeScript/JavaScript) — Excellent for full-stack TS teams. Integrates with Jest and Vitest.
- QuickCheck (Haskell) — The original. Still influential in how PBT frameworks are designed.
- PropCheck (Elixir) — Great integration with ExUnit, feels native to the language.
- jqwik (Java/Kotlin) — JUnit 5 integration, great for enterprise Java shops.
Why PBT Catches What Unit Tests Miss
AI-generated code excels at the common case because that’s what dominates training data. Stack Overflow answers, GitHub repositories, and documentation examples all focus on typical usage. But production failures happen at the boundaries:
- Empty inputs: AI-generated parsers that crash on empty strings
- Unicode: String manipulation functions that break on emoji or RTL text
- Negative numbers: Math functions that assume positive inputs
- Concurrent access: Race conditions in shared state that only manifest under specific timing
- Boundary values: Off-by-one errors at MAX_INT, zero-length arrays, null values
PBT randomizes exactly these categories. The generators are designed to produce adversarial inputs — the exact kind of edge cases that AI training data under-represents.
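To make that bias concrete, here is a hypothetical stdlib-only sketch of what adversarial generators do (the names gen_int and gen_string are mine; Hypothesis and fast-check ship weighted generators like these, plus shrinking, out of the box):

```python
import random
import sys

# Known-nasty values, biased toward the boundary categories listed above.
NASTY_INTS = [0, -1, 1, 2**31 - 1, -2**31, sys.maxsize]
NASTY_STRINGS = ["", " ", "\x00", "🙂🙂", "مرحبا", "a" * 10_000]  # empty, control char, emoji, RTL, huge

def gen_int():
    """Half the time, return a boundary value instead of a uniform draw."""
    if random.random() < 0.5:
        return random.choice(NASTY_INTS)
    return random.randint(-10**9, 10**9)

def gen_string(max_len=8):
    """Same idea for strings: over-sample empty, Unicode, and oversized inputs."""
    if random.random() < 0.5:
        return random.choice(NASTY_STRINGS)
    return "".join(random.choice("abc ") for _ in range(random.randint(0, max_len)))
```

Plugging generators like these into a property-check loop is what surfaces the failures example-based tests never reach.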
Our Results
After our production bug spike, we added property-based tests for every AI-generated function touching business logic. In the first week, we caught three critical bugs — all boundary condition failures:
- A currency conversion function that returned NaN for zero amounts (dividing by a rate that could be zero)
- A date range filter that included the end date in some time zones but not others
- A pagination function that returned duplicate results when the total count was exactly divisible by the page size
All three had passing unit tests. All three would have reached production without PBT.
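The pagination bug shows why the property, not the example, is what catches this class of failure. Below is a hypothetical reconstruction of that bug class (not our actual code — the off-by-one page count and wrapping offset are my illustration), plus the round-trip property that exposes it:

```python
import random

def paginate(items, page_size):
    # Hypothetical buggy implementation: the page count gets an unconditional
    # +1, and the start offset wraps around with %, so an exactly-divisible
    # total count repeats the first page.
    num_pages = len(items) // page_size + 1
    wrap = max(len(items), 1)
    return [items[(i * page_size) % wrap:][:page_size] for i in range(num_pages)]

def round_trip(items, page_size):
    """Property: flattening all pages must reproduce the input exactly once."""
    return [x for page in paginate(items, page_size) for x in page] == items

def find_counterexample(trials=500):
    """Minimal PBT loop: random sizes, report the first violating input."""
    random.seed(1)
    for _ in range(trials):
        n, ps = random.randint(0, 30), random.randint(1, 10)
        if not round_trip(list(range(n)), ps):
            return n, ps
    return None
```

A unit test like paginate(list(range(7)), 3) passes, because 7 is not divisible by 3; the random loop quickly lands on a divisible pair and fails.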
The Generator + Tester Pattern
Here’s the workflow we’ve settled on:
1. AI generates the implementation — Copilot or Claude writes the function
2. Human defines the properties — The engineer identifies what invariants must hold
3. PBT framework finds the gaps — Hypothesis/fast-check generates inputs and reports violations
4. AI fixes the implementation — Feed the failing property back to the AI for a fix
5. Repeat until all properties hold
This is a powerful loop. The AI is fast at generating code, the human is good at reasoning about correctness properties, and the PBT framework is thorough at finding violations.
Why Adoption Is Still Low
Here’s the honest answer: defining properties is hard. It requires a different kind of thinking than example-based testing. Writing expect(add(2, 3)).toBe(5) is straightforward. Writing “for all integers a and b, add(a, b) should equal add(b, a)” requires you to think abstractly about what the function guarantees.
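In plain Python that abstract statement becomes a short reusable check (a sketch only — Hypothesis's @given decorator expresses the same idea with far better generation and failure reporting):

```python
import random

def add(a, b):
    return a + b

def check_commutative(fn, trials=1000):
    """Property: for all integers a and b, fn(a, b) == fn(b, a)."""
    random.seed(0)
    for _ in range(trials):
        a = random.randint(-10**9, 10**9)
        b = random.randint(-10**9, 10**9)
        if fn(a, b) != fn(b, a):
            return (a, b)  # counterexample found
    return None  # property held for every generated pair

assert check_commutative(add) is None                     # addition commutes
assert check_commutative(lambda a, b: a - b) is not None  # subtraction does not
```

The hard part is not the loop; it is deciding that commutativity is the guarantee worth stating.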
Most developers find this intellectually demanding, especially mid-level engineers who haven’t been trained in formal reasoning. But here’s the twist — AI tools themselves can help. Ask Claude “what properties should this sorting function always satisfy?” and you’ll get a solid starting list. The AI is actually good at reasoning about properties even when it’s bad at implementing them correctly.
My Question to the Community
Is anyone else combining property-based testing with AI code generation? I’m particularly curious about:
- Which PBT frameworks work best with your stack?
- How do you decide which functions deserve PBT vs. traditional unit tests?
- Have you tried using AI to generate the property definitions themselves?
I genuinely believe PBT is the missing piece in the AI-assisted development puzzle. We’re generating code faster than ever, but our quality assurance hasn’t kept pace. PBT is how we close that gap.