I came across some industry research recently that stopped me in my tracks: 67% of engineers say they would trust AI-generated tests, but ONLY with human review. At Anthropic, where we’re at the bleeding edge of AI development, I’m seeing this tension play out every single day.
The data is sobering. AI-generated code introduces 1.7x more total issues than human-written code. Logic and correctness errors appear 1.75x more often. And here’s the kicker - these issues often don’t show up in coverage metrics. You can hit 95% coverage with AI-generated tests and still miss critical edge cases.
The Trust Paradox
We’re in this weird paradox right now. AI can generate tests faster than we can write them manually. The velocity gains are real - our team has seen 3-4x speedups in getting test coverage for new features. But that speed comes with a catch: someone still needs to review those tests, and reviewing takes almost as long as writing them from scratch.
The question I keep wrestling with: What makes an AI-generated test trustworthy?
Is it:
- Coverage metrics? (We know those can be gamed)
- Mutation testing scores? (More reliable but expensive to compute)
- Code review by senior engineers? (Doesn’t scale)
- Production bug escape rates? (Lagging indicator)
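Mutation testing is worth unpacking, because it's the only item on that list that directly measures whether a test *can* fail. A minimal hand-rolled sketch (the function and the mutants are invented for illustration; real tools like StrykerJS automate the mutation step):

```javascript
// Function under test.
function applyDiscount(price, pct) {
  return price * (1 - pct / 100);
}

// Hand-written "mutants": small deliberate bugs. A good test suite
// should fail ("kill") each one; survivors reveal weak assertions.
const mutants = [
  (price, pct) => price * (1 + pct / 100), // flipped sign
  (price, pct) => price * (1 - pct / 10),  // wrong divisor
  (price, pct) => price,                   // dropped logic
];

// Model a test as a predicate over an implementation.
const weakTest   = (fn) => fn(100, 20) !== undefined; // the "toBeDefined" style
const strongTest = (fn) => fn(100, 20) === 80;        // asserts the actual value

// Fraction of mutants the test kills.
function mutationScore(test) {
  const killed = mutants.filter((m) => !test(m)).length;
  return killed / mutants.length;
}

console.log(mutationScore(weakTest));   // 0 -- every mutant survives
console.log(mutationScore(strongTest)); // 1 -- every mutant is killed
```

Both tests give identical line coverage, but their mutation scores are 0 and 1 respectively, which is exactly the gap coverage metrics can't see.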
What We’re Seeing in Practice
On my team, we’ve started treating AI-generated tests differently based on the code path:
- Critical business logic: 100% human review, often rewriting the tests entirely
- Standard CRUD operations: Spot-check review, about 20% of tests
- UI component tests: Light review, mostly checking for obvious gaps
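One way to make that intuition less hand-wavy is to encode the tiers as an explicit policy, so the decision is at least consistent and itself reviewable. A hypothetical sketch (the path patterns, tier names, and sample rates here are mine, not a real tool or our actual config):

```javascript
// Hypothetical review policy: map a test file's path to a review tier.
// Patterns and sample rates are illustrative, not prescriptive.
const policy = [
  { pattern: /billing|auth|payments/, tier: "full-review",  sampleRate: 1.0 },
  { pattern: /\/crud\/|repository/,   tier: "spot-check",   sampleRate: 0.2 },
  { pattern: /\.component\./,         tier: "light-review", sampleRate: 0.05 },
];

function reviewTier(path) {
  const rule = policy.find((r) => r.pattern.test(path));
  return rule ? rule.tier : "spot-check"; // conservative default
}

console.log(reviewTier("src/billing/invoice.test.js"));     // "full-review"
console.log(reviewTier("src/ui/button.component.test.js")); // "light-review"
```

The point isn't the specific regexes; it's that a checked-in policy can be argued about in a PR, while "developer intuition" can't.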
But I’m not convinced this is the right approach. We’re essentially using “developer intuition” to decide what needs review, which feels pretty unscientific for a data team.
The Measurement Problem
The real issue is that test coverage is a vanity metric. It always has been, but AI has made this blindingly obvious. An AI can generate tests that hit every line of code but validate nothing meaningful. I’ve seen tests that literally assert expect(result).toBeDefined() - technically covered, completely useless.
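To make that failure mode concrete, here's a sketch (the function and its bug are invented for illustration): every line is covered, the weak assertion passes, and the bug ships anyway.

```javascript
// A buggy implementation: should sum an array, but seeds the
// accumulator with the first element AND folds it in again.
function sum(xs) {
  return xs.reduce((acc, x) => acc + x, xs[0] ?? 0); // bug: double-counts xs[0]
}

// "Covered" but useless: this is the toBeDefined() pattern,
// and it passes despite the bug.
const result = sum([1, 2, 3]);
console.assert(result !== undefined);

// An assertion on the actual value catches it immediately.
console.log(result); // 7, not the expected 6
```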
What we should be measuring:
- Defect density: Bugs found in production per KLOC
- Test effectiveness: What % of bugs do tests actually catch before production?
- False negative rate: How often do tests pass when they should fail?
- Maintenance burden: How often do tests break when implementation changes?
But honestly, most teams (including ours) aren’t tracking these metrics systematically.
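None of these metrics require fancy tooling to start tracking; a few counters pulled from your bug tracker and CI are enough for a first cut. A rough sketch with made-up numbers (the counts below are invented, not benchmarks):

```javascript
// Illustrative counts -- invented for the example, not real data.
const bugsCaughtByTests = 42;    // defects that failed a test pre-merge
const bugsEscapedToProd = 8;     // defects first found in production
const totalBugs = bugsCaughtByTests + bugsEscapedToProd;
const linesOfCode = 25_000;

// Test effectiveness: share of known defects caught before production.
const effectiveness = bugsCaughtByTests / totalBugs; // 0.84

// Defect density: production bugs per KLOC.
const defectDensity = bugsEscapedToProd / (linesOfCode / 1000); // 0.32

console.log(effectiveness.toFixed(2), defectDensity.toFixed(2));
```

Tracked per code path (critical logic vs. CRUD vs. UI), even crude numbers like these would tell you whether the tiered-review gamble above is actually paying off.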
My Question for You
What’s your threshold for trusting AI-generated tests?
Do you review every single one? Only critical paths? Do you have specific patterns you watch for? Have you found metrics that actually predict whether an AI test is going to catch real bugs?
And maybe more importantly: Are we solving the wrong problem? Should we be focusing less on "can we trust AI tests" and more on "what systemic changes do we need to make testing trustworthy regardless of who, or what, writes it"?
I’d love to hear how other teams are approaching this. Especially curious about folks in regulated industries - I imagine the stakes are even higher when compliance is on the line.