67% trust AI tests only with review - what's your threshold?

I came across some industry research recently that stopped me in my tracks: 67% of engineers say they would trust AI-generated tests, but ONLY with human review. At Anthropic, where we’re at the bleeding edge of AI development, I’m seeing this tension play out every single day.

The data is sobering. AI-generated code introduces 1.7x more total issues than human-written code. Logic and correctness errors appear 1.75x more often. And here’s the kicker - these issues often don’t show up in coverage metrics. You can hit 95% coverage with AI-generated tests and still miss critical edge cases.

The Trust Paradox

We’re in this weird paradox right now. AI can generate tests faster than we can write them manually. The velocity gains are real - our team has seen 3-4x speedups in getting test coverage for new features. But that speed comes with a catch: someone still needs to review those tests, and reviewing takes almost as long as writing them from scratch.

The question I keep wrestling with: What makes an AI-generated test trustworthy?

Is it:

  • Coverage metrics? (We know those can be gamed)
  • Mutation testing scores? (More reliable but expensive to compute)
  • Code review by senior engineers? (Doesn’t scale)
  • Production bug escape rates? (Lagging indicator)
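Mutation testing is worth a quick illustration, since it directly measures what coverage can't. The idea: flip an operator in the code under test and see whether the test suite notices. This is a toy sketch (hypothetical functions, not a real framework like Stryker):

```javascript
// Code under test.
function applyDiscount(price, pct) {
  return price - price * pct;
}

// A "mutant": the same function with one operator flipped.
function applyDiscountMutant(price, pct) {
  return price + price * pct; // '-' mutated to '+'
}

// A coverage-only test: passes for both original and mutant,
// so the mutant "survives" and exposes the test's weakness.
const weakTest = (fn) => fn(100, 0.1) !== undefined;

// A real assertion kills the mutant.
const strongTest = (fn) => fn(100, 0.1) === 90;

const weakKillsMutant = !weakTest(applyDiscountMutant);     // mutant survives
const strongKillsMutant = !strongTest(applyDiscountMutant); // mutant killed
```

The expense comes from running the suite once per mutant - but a surviving mutant is exactly the kind of useless-but-green test that coverage metrics hide.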

What We’re Seeing in Practice

On my team, we’ve started treating AI-generated tests differently based on the code path:

  1. Critical business logic: 100% human review, often rewriting the tests entirely
  2. Standard CRUD operations: Spot-check review, about 20% of tests
  3. UI component tests: Light review, mostly checking for obvious gaps

But I’m not convinced this is the right approach. We’re essentially using “developer intuition” to decide what needs review, which feels pretty unscientific for a data team.

The Measurement Problem

The real issue is that test coverage is a vanity metric. It always has been, but AI has made this blindingly obvious. An AI can generate tests that hit every line of code but validate nothing meaningful. I’ve seen tests that literally assert expect(result).toBeDefined() - technically covered, completely useless.

What we should be measuring:

  • Defect density: Bugs found in production per KLOC
  • Test effectiveness: What % of bugs do tests actually catch before production?
  • False negative rate: How often do tests pass when they should fail?
  • Maintenance burden: How often do tests break when implementation changes?

But honestly, most teams (including ours) aren’t tracking these metrics systematically.
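None of these metrics require fancy tooling - they're arithmetic over counts most teams already have in their bug tracker. A minimal sketch (all numbers illustrative, not real team data):

```javascript
// Bugs found in production per thousand lines of code (KLOC).
function defectDensity(prodBugs, linesOfCode) {
  return prodBugs / (linesOfCode / 1000);
}

// Share of all known bugs that tests caught before production.
function testEffectiveness(bugsCaughtByTests, totalBugs) {
  return bugsCaughtByTests / totalBugs;
}

// Example: 12 production bugs across 40k LOC; tests caught 30 of 42 known bugs.
const density = defectDensity(12, 40000);        // 0.3 bugs per KLOC
const effectiveness = testEffectiveness(30, 42); // ~0.71
```

The hard part isn't the math - it's consistently labeling which bugs were caught by tests versus escaped to production.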

My Question for You

What’s your threshold for trusting AI-generated tests?

Do you review every single one? Only critical paths? Do you have specific patterns you watch for? Have you found metrics that actually predict whether an AI test is going to catch real bugs?

And maybe more importantly: Are we solving the wrong problem? Should we be focusing less on “can we trust AI tests” and more on “what systemic changes do we need to make testing trustworthy regardless of who/what writes it”?

I’d love to hear how other teams are approaching this. Especially curious about folks in regulated industries - I imagine the stakes are even higher when compliance is on the line.

Rachel, this hits incredibly close to home. At our Fortune 500 fintech company, we made review of AI-generated tests mandatory about 8 months ago - not as a best practice, but as a compliance requirement. Let me share what we learned.

The Regulatory Reality

In financial services, we can’t take shortcuts with testing. Our regulators don’t care whether a human or an AI wrote the test - they care whether it actually validates the business logic. When we started using AI test generation tools last year, our compliance team flagged a major issue: we had no audit trail for test coverage decisions.

The AI would generate tests, developers would approve them, and six weeks later we’d find edge cases in production that should have been caught. When auditors asked “why wasn’t this scenario tested?”, we couldn’t answer. The AI didn’t document its reasoning, and neither did the developers who reviewed it.

What We Found Through Mandatory Review

We instituted 100% human review of AI-generated tests for anything touching customer data, financial transactions, or regulatory reporting. Over six months, our senior engineers found issues in about 35% of AI-generated tests:

  • Subtle logic bugs: Tests that passed but validated the wrong thing (e.g., checking that a calculation completed, not that it was correct)
  • Missing edge cases: Obvious to humans but not covered by AI (negative balances, leap years, timezone boundaries)
  • Incorrect assumptions: AI tests that assumed happy-path data structures without validating error handling
  • Compliance gaps: Missing audit logging, data retention checks, or regulatory requirements

The most disturbing category was “tests that would fail in production but pass in test environments” - about 8% of AI tests. These relied on test data that didn’t match production constraints.
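Luis's first category ("checking that a calculation completed, not that it was correct") is easy to sketch with a hypothetical interest calculation - the names and numbers here are illustrative:

```javascript
function monthlyInterest(balance, annualRate) {
  return balance * annualRate / 12;
}

// AI-style test: only checks that the calculation "completed".
// This stays green even if the formula is wrong.
const completedOnly = Number.isFinite(monthlyInterest(1200, 0.06));

// What human review adds: the value itself is checked.
// 1200 * 0.06 / 12 = 6
const correctValue = monthlyInterest(1200, 0.06) === 6;
```

Both assertions pass today - but only the second one would fail if someone later broke the formula, which is the entire point of having the test.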

The Cost-Benefit Question

Here’s the uncomfortable truth: mandatory review eliminates most of the velocity gains from AI test generation. We’re essentially using AI as a “first draft” generator, then rewriting significant portions.

But for regulated industries, we don’t have a choice. The cost of a compliance failure or audit finding is orders of magnitude higher than the cost of thorough test review. We’ve had to accept that AI testing is about consistency and coverage breadth, not speed.

My Question for Others

For those of you in regulated industries (healthcare, finance, critical infrastructure): How are you handling audit and compliance requirements with AI-generated code and tests?

Are your compliance teams comfortable with AI-generated tests? What documentation do they require? Have you found ways to automate compliance validation without sacrificing the speed gains?

And for those not in regulated industries: Are you sure you’re not creating technical debt or risk that’ll surface later? I’m genuinely curious whether our mandatory review approach is overkill, or whether everyone will eventually need this level of scrutiny as AI-generated code becomes ubiquitous.

This is the exact problem my team is wrestling with right now. Rachel, your 3-tier approach mirrors what we tried to implement, but we’re hitting a wall: AI generates tests faster than we can possibly review them.

The Velocity Trap

Here’s our situation at TechFlow: We adopted GitHub Copilot and started using it for test generation about 4 months ago. Within weeks, our test coverage jumped from 60% to 85%. Management loved it. We hit our OKR for “improve code quality.”

But then the reality set in. Our senior engineers were spending 3-4 hours per day just reviewing AI-generated tests. That’s time they weren’t spending on:

  • Architecture design
  • Code reviews of actual features
  • Mentoring junior developers
  • Fixing actual bugs

We had to make a choice: either slow down feature development to review all the tests, or accept some level of risk with AI-generated tests.

Our Tiered Approach (and Its Limits)

We settled on a risk-based review strategy:

Tier 1 - Critical Paths (100% review)

  • Payment processing
  • Authentication/authorization
  • Data persistence for user content
  • API contracts with external services

Tier 2 - Standard Features (20% spot-check)

  • Business logic for non-critical features
  • UI state management
  • Internal APIs
  • Background jobs

Tier 3 - Low Risk (AI-approved automatically)

  • Utility functions
  • Formatters and validators
  • Static content components

The problem? We’re essentially making gut-feel decisions about what’s “critical” vs “standard” vs “low risk.” And those categorizations shift over time. A “low risk” utility function becomes critical when it’s used in payment processing.

The Question I Can’t Answer

Is selective review creating technical debt? I honestly don’t know.

We haven’t seen an increase in production bugs (yet), but that might just mean the AI is good at writing tests for happy paths. We won’t know if we have coverage gaps until we hit an edge case in production.

Luis mentioned 35% of AI tests had issues in mandatory review. If that’s true for us too, then our Tier 2 and Tier 3 tests (80% of total tests) could have hundreds of problematic tests that we’re just… not catching.

What I Wish Existed

What I really want is AI that can evaluate AI-generated tests. Some kind of meta-validation that could:

  • Check if the test actually validates the right assertions
  • Identify missing edge cases
  • Flag tests that are too brittle or too loose
  • Estimate the “risk” of accepting this test without review

Basically, I want AI tooling that helps us make smarter decisions about where to invest human review time, rather than just generating more tests for us to review manually.
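Even without AI-on-AI validation, some of this is scriptable with crude heuristics. Here's one toy cut at a review-priority score - the signals and weights are entirely made up, just to show the shape of the idea:

```javascript
// Score an AI-generated test's source for review priority (higher = review sooner).
function reviewRiskScore(testSource) {
  const assertions = (testSource.match(/expect\(/g) || []).length;
  const mocks = (testSource.match(/jest\.mock|jest\.fn/g) || []).length;
  const edgeHints = /null|undefined|throw|reject|boundary|negative/i.test(testSource);

  let score = 0;
  if (assertions === 0) score += 3; // asserts nothing at all
  if (mocks > 3) score += 2;        // mocking overkill
  if (!edgeHints) score += 1;       // looks happy-path only
  return score;
}

const risky = reviewRiskScore(`test('t', () => { doThing(); });`);
const safer = reviewRiskScore(`test('t', () => { expect(f(null)).toThrow(); });`);
```

A real version would parse the AST rather than grep the source, but even this level of triage beats reviewing in arrival order.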

Has anyone seen tools like this? Or are we still in the “AI generates, humans review everything” phase?

Rachel, Luis, Alex - you’re all describing symptoms of the same problem: we’re treating AI test governance as a binary trust decision instead of building organizational processes around it.

This conversation reminds me of when code review first became mandatory at Google back in the day. Engineers initially pushed back because “it slows us down.” But over time, we built processes, tooling, and cultural norms that made code review an accelerator, not a bottleneck.

We need the same evolution for AI-generated tests.

The Framework We Built

At our EdTech startup, we implemented a 3-tier AI test governance system last quarter. The key insight: trust isn’t binary - it’s contextual and measurable.

Tier 1: Auto-Approve (Simple, deterministic code)

  • Pure functions with clear inputs/outputs
  • Data transformations and formatters
  • Unit tests for utility functions
  • Criteria: Cyclomatic complexity < 5, no external dependencies

Tier 2: Async Review (Standard features)

  • Business logic with moderate complexity
  • Integration tests for internal APIs
  • UI component tests
  • Process: AI generates, engineer reviews within 48 hours, tests run but don’t block deploys

Tier 3: Synchronous Review (Critical/complex)

  • Payment, authentication, data privacy features
  • Tests involving external services or databases
  • Anything that touches student data (we’re EdTech, this is sacred)
  • Process: Mandatory review before merge, senior engineer sign-off required
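The routing logic behind those tiers can be stated in a few lines. This is a sketch under the criteria above - the inputs and thresholds are assumptions, not our actual tooling:

```javascript
// Route a generated test to a review tier based on the code it covers.
function reviewTier({ complexity, externalDeps, touchesSensitiveData }) {
  if (touchesSensitiveData) return 3;            // synchronous review, sign-off required
  if (complexity < 5 && !externalDeps) return 1; // auto-approve
  return 2;                                      // async review within 48 hours
}

const t1 = reviewTier({ complexity: 2, externalDeps: false, touchesSensitiveData: false });
const t2 = reviewTier({ complexity: 8, externalDeps: true, touchesSensitiveData: false });
const t3 = reviewTier({ complexity: 2, externalDeps: false, touchesSensitiveData: true });
```

The value of writing it down as code is that the classification becomes reviewable and auditable itself, instead of living in engineers' heads.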

The Key: Anti-Pattern Detection

Alex asked about AI that evaluates AI tests - we’re partway there. We trained our team to spot AI test anti-patterns:

  1. “Assert existence” tests: expect(result).toBeDefined() - useless
  2. “Mocking overkill”: Tests that mock so much they validate nothing
  3. “Happy path only”: No error cases, edge cases, or boundary conditions
  4. “Implementation-coupled”: Tests that break whenever implementation changes
  5. “Flaky time bombs”: Tests with timing assumptions or random data

We created linting rules and code review checklists specifically for AI-generated tests. Now when engineers review, they’re not starting from scratch - they’re checking for known patterns.
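To give a flavor of what those lint rules look like: here's a toy check for anti-pattern #1, a test whose assertions are all existence checks. A production rule would walk the AST (e.g. as an ESLint plugin); this regex version is just a sketch:

```javascript
// Flag test source whose only assertions are existence checks.
function isExistenceOnlyTest(testSource) {
  const assertions = testSource.match(/expect\([^)]*\)\.\w+/g) || [];
  return assertions.length > 0 &&
    assertions.every((a) => /\.(toBeDefined|toBeTruthy)$/.test(a));
}

const weak = `test('x', () => { expect(result).toBeDefined(); });`;
const ok   = `test('x', () => { expect(total).toBe(42); });`;
// isExistenceOnlyTest(weak) -> true, isExistenceOnlyTest(ok) -> false
```

Cheap checks like this don't replace review - they just stop reviewers from wasting attention on the obvious cases.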

Results After 3 Months

The data is promising:

  • 60% of tests auto-approved (Tier 1), saving ~15 hours/week of review time
  • 30% async review (Tier 2), averaged 8 minutes per test
  • 10% synchronous review (Tier 3), averaged 25 minutes per test
  • Zero increase in production bugs related to test coverage gaps
  • Actually caught 2 security issues in Tier 3 review that AI tests missed

The Real Issue: Training Engineers

Here’s what nobody talks about: most engineers don’t know how to review AI-generated tests effectively.

They treat it like code review - looking for style issues and obvious bugs. But AI-generated tests require a different mindset:

  • Is this test validating behavior or implementation?
  • What edge cases is this missing?
  • Will this test catch regressions or just break on refactors?
  • Does this test add value or just coverage points?

We spent 2 weeks training our engineers on AI test review. Best investment we made.

To Luis’s Compliance Question

For those in regulated industries: our Tier 3 approach might work for you. You’re essentially saying “critical systems require human judgment” while allowing AI to accelerate non-critical paths.

The key is documenting the decision framework. When auditors ask “why wasn’t this tested?”, you can show:

  1. The test tier classification
  2. The review process that was followed
  3. The engineer who signed off

That’s an audit trail. Better than “an AI wrote it and we assumed it was fine.”

My Challenge to the Community

Can we create an open-source “AI Test Review Checklist”? A community-driven set of patterns to watch for, organized by language, framework, and test type?

I’m tired of every team reinventing this wheel. Let’s build shared knowledge.

Coming from the security side, I need to add a warning to this conversation: AI-generated tests are particularly dangerous for security validation.

At my fintech, we build fraud detection systems. I’ve reviewed hundreds of AI-generated tests over the past year, and the pattern is consistent: AI tests miss security edge cases almost systematically.

What AI Tests Miss in Security Context

Edge case blindness: AI generates tests for normal user behavior. It doesn’t think adversarially. In fraud detection, that’s fatal.

Example: AI wrote tests for our transaction monitoring that validated “user sends money to valid account.” What it didn’t test: user sends 100 micro-transactions to probe rate limits, user exploits race conditions in balance checks, user manipulates timestamp to bypass fraud windows.

False sense of security: The worst part is the coverage metrics looked great. 90%+ coverage on our fraud detection module. But we were testing the happy path of fraud detection, not the actual fraud scenarios.

The Coverage Metric Lie

Rachel’s point about coverage being a vanity metric is 10x more true for security. I can write one test that hits every line of security code:

test('user can process transaction', () => {
  const result = processTxn(validUser, validAmount, validAccount);
  expect(result.success).toBe(true);
});

That hits 100% code coverage. It validates nothing about security.

What we actually need:

  • Boundary condition tests (max int, negative values, zero, null)
  • Permission/authorization tests (can user X access resource Y?)
  • Injection attack tests (SQL, XSS, command injection)
  • Rate limiting and DoS resilience tests
  • Data leakage tests (can user see data they shouldn’t?)

AI doesn’t generate these unless you explicitly prompt for security testing. And even then, it often misses the subtle attack vectors.
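To make the boundary-condition point concrete, here's what the adversarial checks look like against a hypothetical transfer validator (the function and its rules are illustrative, not our fraud system):

```javascript
function validateTransfer(amount, balance) {
  if (!Number.isFinite(amount)) return { ok: false, reason: 'invalid' };
  if (amount <= 0) return { ok: false, reason: 'non-positive' };
  if (amount > balance) return { ok: false, reason: 'insufficient' };
  return { ok: true };
}

// The boundary cases AI-generated tests tend to skip:
const zero = validateTransfer(0, 100).ok;      // must be false
const negative = validateTransfer(-50, 100).ok; // must be false - would otherwise credit the attacker
const huge = validateTransfer(Number.MAX_SAFE_INTEGER + 1, 100).ok; // must be false
const nan = validateTransfer(NaN, 100).ok;      // must be false - NaN slips past naive comparisons
```

Note the `Number.isFinite` guard has to come first: `NaN > balance` and `NaN <= 0` are both false, so without it a NaN amount would sail through to the happy path.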

Our Policy: Zero Trust for Security Tests

At my company, we do not accept AI-generated tests for any security-critical code. Period.

Critical paths require:

  1. Human-written security tests by security engineers
  2. Adversarial thinking (red team mindset)
  3. Review by multiple security engineers
  4. Penetration testing validation

AI can assist in generating test scaffolding, but the security assertions must come from humans who understand attack vectors.

Warning to Non-Regulated Industries

Luis asked if mandatory review is overkill for non-regulated industries. From a security perspective: no, it’s not overkill.

A security vulnerability can destroy a company, regulated or not. The Equifax breach ultimately cost the company well over a billion dollars. The SolarWinds hack compromised thousands of organizations. These weren’t confined to regulated financial services - they affected everyone.

If you’re building software that handles user data, financial transactions, or sensitive information, you cannot rely on AI-generated security tests without rigorous human review.

Coverage metrics will lie to you. Your tests will pass. And you’ll ship vulnerabilities to production.