Our test coverage went to 95% with AI. Our bugs went up 40%. We're measuring the wrong things

Three months ago, my EdTech startup hit a major milestone: 95% test coverage across our codebase. We celebrated. We put it in our all-hands slides. Our board was impressed.

Last month, I had to explain to that same board why our production bug rate increased by 40% over the same period.

We achieved our coverage goals. We shipped more features. And our quality got worse. Something is fundamentally broken about how we measure testing in the AI era.

How We Got Here

We adopted AI test generation aggressively:

  • Engineers used Copilot to generate unit tests
  • We set up automated PR checks requiring 80%+ coverage
  • Teams raced to hit coverage targets
  • Every sprint review showed green coverage metrics going up

Leadership loved it. Engineering loved it. Our customers… not so much.

What We Didn’t Measure

While we obsessed over coverage percentage, we completely missed:

Bug Escape Rate: The percentage of bugs that make it to production. This went UP by 40%.

Test Effectiveness: Of the bugs found in production, how many could have been caught by our tests? We never measured this.

False Negative Rate: How often tests pass when they should fail. We had no idea.

Test Maintenance Burden: Hours spent fixing broken tests during refactoring. This TRIPLED.

Mean Time to Detection: How quickly tests catch regressions. This actually got WORSE.

The Coverage Metric Lie

Here’s what happened: AI optimized for our stated goal (coverage percentage), not our actual goal (bug prevention).

We had tests like this:

test('user service loads', () => {
  const service = new UserService();
  expect(service).toBeDefined();
});

test('user service has methods', () => {
  const service = new UserService();
  expect(typeof service.getUser).toBe('function');
});

These tests hit lines of code. They made coverage go up. They caught zero bugs.

But worse than useless tests were tests that created false confidence:

test('user update validates email', async () => {
  const result = await updateUser({ email: 'valid@example.com' });
  expect(result.success).toBe(true);
});

Coverage: 100% of the updateUser function.
Actual validation: None. What about invalid emails? What about SQL injection? What about race conditions?
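For contrast, here is a sketch of what a meaningful version of that test could look like. `validateEmail` is a hypothetical stand-in for whatever validation `updateUser` should perform, and plain assertions replace Jest so the sketch stands alone:

```javascript
// Hypothetical validator standing in for updateUser's email check.
function validateEmail(email) {
  if (typeof email !== 'string') return false;
  // Deliberately simple pattern; real validation would be stricter.
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}

// Happy path: the only case the original test covered.
console.assert(validateEmail('valid@example.com') === true);

// Negative and edge cases: the ones that actually catch bugs.
console.assert(validateEmail('not-an-email') === false);
console.assert(validateEmail('user@nodot') === false);
console.assert(validateEmail('') === false);
console.assert(validateEmail(null) === false);
console.assert(validateEmail("a'; DROP TABLE users;--") === false);
```

Same function under test, same coverage, but now the rejection paths are exercised too.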

The Bugs That Escaped

Our production bugs fell into patterns:

  1. Edge cases not tested: Null values, empty arrays, boundary conditions
  2. Integration failures: Unit tests passed, integration failed
  3. Race conditions: Tests assumed sequential execution
  4. Security gaps: Tests validated happy path, not attack vectors
  5. Performance regressions: Tests checked correctness, not speed

The kicker? All of these could have been prevented by better tests. But AI-generated tests don’t think about edge cases or security unless explicitly prompted.

What We Should Have Measured

After this painful lesson, we’re shifting our metrics:

1. Mutation Testing Score

  • Introduce bugs into code, see if tests catch them
  • Current reality: 60% of our “high coverage” code has 0% mutation score
  • Target: 80% mutation score on critical paths

2. Defect Density

  • Bugs found per 1000 lines of code
  • Track separately for AI-generated vs human-written tests
  • Currently: 2.3x higher for AI-test-only modules

3. Test Effectiveness Rate

  • % of production bugs that could have been caught by existing tests
  • Requires post-mortems on every bug
  • Current baseline: 65% of bugs should have been caught

4. Test Quality Gates

  • Tests must include negative cases
  • Tests must validate error handling
  • Tests must cover boundary conditions
  • Automated linting to enforce these patterns

5. Time to Detect Regressions

  • How quickly do tests catch when features break?
  • Some of our tests passed even when features were completely broken
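Mutation testing in miniature: the idea behind metric 1 can be shown in a few lines. We hand-roll a single mutant (flipping `+` to `-`) rather than using a real tool; a coverage-only test lets the mutant survive, while an assertion on actual behavior kills it:

```javascript
// The function under test, and a hand-written "mutant" with a seeded bug.
const add = (a, b) => a + b;
const mutantAdd = (a, b) => a - b; // mutation: + flipped to -

// A coverage-style test: executes the line, asserts almost nothing.
const coverageOnlyTest = (fn) => fn(2, 2) !== undefined;

// A behavioral test: asserts the actual result.
const behavioralTest = (fn) => fn(2, 3) === 5;

// A mutant is "killed" when the test fails against it.
const killed = (test) => test(mutantAdd) === false;

console.log('coverage-only test kills mutant:', killed(coverageOnlyTest)); // false: mutant survives
console.log('behavioral test kills mutant:', killed(behavioralTest));      // true: mutant killed
```

A real tool generates thousands of mutants like this automatically; the surviving ones point at exactly the assertions your suite is missing.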

The Hard Conversation

I had to tell my team: We’re removing coverage percentage from our goals.

Instead:

  • Each feature must have tests for 3 positive cases, 3 negative cases, and 3 edge cases
  • All critical paths require mutation testing
  • Production bugs trigger mandatory test gap analysis
  • Test quality is now part of code review, not just test existence

Unsurprisingly, our coverage percentage dropped. We deleted hundreds of useless tests.

And our bug rate started decreasing.

My Question for the Community

What metrics actually predict quality when AI is writing your tests?

Coverage is clearly broken. What should we measure instead? What’s worked for you?

And has anyone successfully implemented mutation testing at scale? We’re finding it slow (adds 10x to CI time) and noisy (lots of false positives).

I’m also curious: Are we the only ones who got burned by chasing coverage metrics? Or is this a common pattern that nobody talks about?

Because if AI test generation is going to be the norm, we need new metrics that actually correlate with quality. The old playbook doesn’t work anymore.

Keisha, this is exactly the conversation we need to have. Coverage has always been a vanity metric, but AI has made this painfully obvious. Let me share some statistical rigor on what actually predicts quality.

Coverage Was Never The Goal

In statistics, we distinguish between leading indicators and lagging indicators. Coverage is neither; it's a completeness metric that correlates only weakly with quality.

Research shows:

  • Coverage above 80% has diminishing returns
  • No correlation between coverage and bug density beyond baseline
  • High coverage with poor test quality is worse than low coverage with good tests (false confidence)

Your 95% coverage with 40% more bugs is a textbook example of Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”

Metrics That Actually Predict Quality

Based on industry research and our work at Anthropic, here are metrics that correlate with code quality:

1. Defect Density (bugs per KLOC)

  • Target: Under 1.0 for production code
  • Track by module and over time
  • Separate AI-generated code from human-written
  • Current research shows AI code has 1.7x higher defect density

2. Bug Escape Rate

  • % of bugs found in production vs caught in testing
  • Target: Under 10% for critical systems
  • Requires discipline in bug tracking and categorization
  • This is a LAGGING indicator but it’s honest

3. Mutation Testing Score

  • Introduce defects, see if tests catch them
  • Target: 80%+ for critical paths
  • Tools: Stryker (JS), PITest (Java), mutmut (Python)
  • Yes, it’s slow, but it’s the best quality signal we have

4. Test Flakiness Rate

  • % of test runs that produce inconsistent results
  • Target: Under 1%
  • Flaky tests indicate poor test quality and brittleness
  • AI-generated tests tend to be more flaky (timing, mocking issues)

5. Code Churn in Tests

  • How often tests change when implementation changes
  • Target: Under 10% churn rate
  • High churn indicates brittle tests coupled to implementation
  • Measure separately for test code vs production code
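The first two metrics above reduce to simple ratios. A sketch of the arithmetic (the inputs are illustrative, not anyone's real numbers):

```javascript
// Defect density: bugs per thousand lines of code (KLOC).
const defectDensity = (bugs, linesOfCode) => bugs / (linesOfCode / 1000);

// Bug escape rate: share of all bugs that reached production.
const bugEscapeRate = (prodBugs, testBugs) => prodBugs / (prodBugs + testBugs);

// Illustrative numbers only.
console.log(defectDensity(12, 24000)); // 0.5 bugs per KLOC, under the 1.0 target
console.log(bugEscapeRate(8, 92));     // 0.08, under the 10% target for critical systems
```

The hard part is not the formula but the bookkeeping: both numbers are only as honest as your bug tracking and categorization.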

The Measurement Framework

Here’s what I recommend:

Stop measuring: Coverage percentage as a goal
Start measuring: Coverage as a baseline (should be >70% but don’t optimize for it)

Stop measuring: Number of tests
Start measuring: Test effectiveness (% of bugs caught before production)

Stop measuring: Test pass/fail rates
Start measuring: Mean time to detect regressions

Add measurement: Mutation score for critical paths

On Mutation Testing At Scale

You mentioned mutation testing is slow and noisy. We faced the same issues. Here’s what worked:

1. Selective Mutation Testing

  • Don’t run on every commit
  • Run on critical modules only
  • Run full mutation suite nightly, not in CI
  • Focus on high-risk code paths

2. Mutation Operator Tuning

  • Not all mutations are valuable
  • Disable noisy operators (string literal mutations, magic number mutations)
  • Focus on logic mutations (conditionals, arithmetic, return values)

3. Incremental Mutation

  • Only mutate changed lines
  • Tools like Stryker support incremental mode
  • Reduces runtime from hours to minutes

4. Mutation Coverage as a Gate

  • Require 80%+ mutation score for new code in critical paths
  • Grandfather legacy code
  • Use as a PR check, not a merge blocker (too slow)

The Real Question: What’s The Cost of Poor Quality?

Your bug rate went up 40%. What did that actually cost?

  • Customer support tickets?
  • Engineering time to fix?
  • Customer churn?
  • Reputation damage?

If you can quantify that cost, you can justify the investment in better metrics and slower (but higher quality) testing practices.

My hypothesis: The cost of that 40% bug increase far exceeds the cost of proper mutation testing infrastructure.

To Your Team’s Resistance

Removing coverage from goals is brave. Expect pushback. Here’s the data to support it:

Google’s research (2014): Coverage beyond 80% shows no correlation with reduced defects
Microsoft’s research (2018): Test quality matters more than test quantity
Recent AI code quality research (2026): AI-generated code with high coverage but low mutation score has 2.1x defect density

Show your team the alternative: measure what matters, not what’s easy to measure.

Keisha, your story is a warning we all need to hear. At our Fortune 500 fintech, we almost made the same mistake. Let me share how compliance requirements actually saved us from ourselves.

The Compliance Forcing Function

In financial services, we can’t just measure coverage. We have to demonstrate that our tests actually validate regulatory requirements. That forced us to think differently about test quality from day one.

Our auditors ask questions like:

  • “How do you verify that interest calculations are correct?”
  • “How do you test that PII is properly encrypted?”
  • “What prevents unauthorized account access?”

Answering “we have 95% test coverage” gets you laughed out of the audit. They want to see tests that explicitly validate regulatory requirements.

This compliance burden turned out to be a blessing. We couldn’t game metrics even if we wanted to.

Beyond Coverage: Compliance Test Gates

Here’s what we implemented (and I think non-regulated industries should copy):

1. Requirement Traceability

  • Every test must trace to a requirement
  • Every requirement must have tests
  • Gap analysis tooling shows uncovered requirements
  • AI can generate tests, but humans must map to requirements

2. Test Categories with Different Quality Bars

  • Functional tests: Verify features work (can be AI-generated)
  • Compliance tests: Verify regulatory requirements (human-written and reviewed)
  • Security tests: Verify attack resistance (security team only)
  • Performance tests: Verify SLAs (separate infrastructure)

Each category has different quality gates. Compliance tests require 100% mutation score.

3. Manual Test Audits

  • 10% random sample of tests audited quarterly
  • Check: Does this test actually validate what it claims?
  • Found issues: 18% of tests had misleading names or assertions
  • AI-generated tests failed audits 3x more often

4. Negative Test Requirements

  • Every positive test must have a negative counterpart
  • If you test “valid email accepted”, must also test “invalid email rejected”
  • Automated linting checks for this pattern
  • Shockingly, AI rarely generates negative tests unless prompted
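The pairing rule in item 4 can be approximated with a naming-convention check. A sketch, assuming tests follow an "accepts ... / rejects ..." naming convention; a real linter would inspect assertions and the AST rather than grep test names:

```javascript
// Flag "accepts" tests that have no "rejects" counterpart in the suite.
// Substring matching is deliberately loose; this feeds a review, not a gate.
function missingNegativeCounterparts(testNames) {
  const hasRejects = (subject) =>
    testNames.some((n) => n.includes('rejects') && n.includes(subject));
  return testNames
    .filter((n) => n.includes('accepts'))
    .map((n) => ({ name: n, subject: n.split('accepts ')[1] }))
    .filter(({ subject }) => !hasRejects(subject))
    .map(({ name }) => name);
}

const names = [
  'accepts valid email',
  'rejects invalid email',
  'accepts valid phone number', // no negative counterpart: flagged
];
console.log(missingNegativeCounterparts(names)); // ['accepts valid phone number']
```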

The Metrics We Actually Track

Your new metrics are spot-on. Here’s what we’ve added:

Test Effectiveness Index (TEI)

  • Formula: (Bugs caught in testing) / (Total bugs found)
  • Target: >85% for production releases
  • Tracks quality of test suite, not quantity
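The TEI formula as code, with illustrative inputs:

```javascript
// Test Effectiveness Index: bugs caught in testing over total bugs found.
function testEffectivenessIndex(caughtInTesting, foundInProduction) {
  const total = caughtInTesting + foundInProduction;
  return total === 0 ? 1 : caughtInTesting / total;
}

console.log(testEffectivenessIndex(88, 12)); // 0.88, above the 0.85 release target
```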

Critical Path Coverage

  • Coverage % but ONLY for critical business flows
  • Ignore utility code, focus on money movement
  • Target: 100% on critical paths, don’t care about overall %

Audit Finding Rate

  • Issues found in audits per 1000 tests
  • Separate human vs AI tests
  • AI tests have 2.8x higher audit finding rate

Test Debt Ratio

  • (Tests that break during refactoring) / (Total tests)
  • High ratio indicates brittle tests
  • Our AI-generated tests have 40% higher debt ratio

On Mutation Testing Performance

Rachel mentioned tuning. We went further: risk-based mutation testing.

Only run mutation testing on:

  1. Code that handles money
  2. Code that handles PII
  3. Code that makes access control decisions
  4. Code flagged as “high complexity” (cyclomatic complexity >10)

This covers ~15% of our codebase but ~80% of our risk. Mutation testing runs in 20 minutes instead of 6 hours.

For the other 85%? We accept that coverage + code review is “good enough.”
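The selection rule above can be expressed as a simple predicate. The path conventions and the complexity field are assumptions standing in for whatever module metadata your build actually produces:

```javascript
// Decide whether a file earns (slow) mutation testing, per the four criteria above.
function needsMutationTesting(file) {
  // Money, PII, and access control paths; adjust to your repo layout.
  const highRiskPath = /\/(payments|billing|pii|auth|accounts)\//.test(file.path);
  const highComplexity = file.cyclomaticComplexity > 10;
  return highRiskPath || highComplexity;
}

console.log(needsMutationTesting({ path: 'src/payments/transfer.ts', cyclomaticComplexity: 4 })); // true
console.log(needsMutationTesting({ path: 'src/utils/format.ts', cyclomaticComplexity: 3 }));      // false
```

Wire a filter like this into the mutation-test job's file list and the expensive run stays proportional to risk, not codebase size.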

The Cultural Challenge

Hardest part: convincing engineers that deleting tests is a good thing.

We had teams with 3000+ tests, 92% coverage, and terrible quality. I told them: “Delete every test that doesn’t validate meaningful behavior.”

They deleted 1200 tests. Coverage dropped to 78%. Bug rate decreased by 22%.

Turns out, those deleted tests were creating more harm than good:

  • Maintenance burden (broke on every refactor)
  • False confidence (high coverage, low quality)
  • Slow CI (more tests = longer feedback loops)

To Your Question

You asked if others got burned by chasing coverage. Yes. Everyone in regulated industries has a coverage horror story.

The pattern is universal:

  1. Set coverage target
  2. Engineers game the metric
  3. Quality suffers
  4. Expensive production incident
  5. Re-evaluate metrics

Some teams learn this lesson through compliance audits (like us). Others through production outages (like you). Either way, the lesson is expensive.

Advice for Non-Regulated Teams

You don’t have auditors forcing quality. That’s both freedom and danger.

My suggestion: Act like you’re regulated. Pick your critical paths and treat them as if the SEC is watching. For everything else, be pragmatic.

And please, remove coverage percentage from your dashboards. It’s doing more harm than good.

This thread is both validating and terrifying. Validating because I’ve suspected our coverage obsession was wrong. Terrifying because we might be heading down the same path.

Our Current Reality

At TechFlow, we have:

  • 87% test coverage (and climbing with AI help)
  • Coverage gate in CI (won’t merge below 80%)
  • Weekly dashboards showing coverage trends
  • Engineers incentivized to increase coverage

But nobody measures if those tests actually catch bugs. And now I’m realizing we’re building a house of cards.

The Practical Question: How Do I Fix This?

Rachel and Luis, your frameworks are great but also overwhelming. I’m a senior engineer, not a VP. I can’t overhaul our entire testing strategy.

What can I do at the team level?

Here’s what I’m thinking:

1. Start Tracking Bug Sources

  • Every bug we find: Could our tests have caught this?
  • Tag bugs: “missing test”, “bad test”, “test bypassed”, “infrastructure”
  • Build data to show tests aren’t working

2. Mutation Testing on My Feature

  • I own the payment module
  • Start running Stryker on just that code
  • Show the team what “good tests” look like
  • Use as example to advocate for better metrics

3. Implement “3x3 Rule” in Code Review

  • Require 3 positive, 3 negative, 3 edge case tests for new features
  • Make this a review checklist item
  • Don’t care about coverage %, care about meaningful tests

4. Delete Useless Tests

  • Audit my team’s test suite
  • Find tests that assert .toBeDefined() or similar nonsense
  • Delete them and document the coverage drop
  • Show that lower coverage can mean better quality
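Step 4's audit can start with a crude scan. A sketch that flags test sources whose only assertions are existence checks; it's a heuristic to feed a manual review, not something that should auto-delete anything:

```javascript
// Count trivial vs. behavioral assertions in a test file's source text.
const TRIVIAL = [/\.toBeDefined\(\)/g, /\.toBeTruthy\(\)/g];
const BEHAVIORAL = [/\.toBe\(/g, /\.toEqual\(/g, /\.toThrow\(/g];

function auditTestSource(source) {
  const count = (patterns) =>
    patterns.reduce((n, re) => n + (source.match(re) || []).length, 0);
  const trivial = count(TRIVIAL);
  const behavioral = count(BEHAVIORAL);
  // Suspicious: the file asserts existence and nothing else.
  return { trivial, behavioral, suspicious: trivial > 0 && behavioral === 0 };
}

const uselessTest = `
  test('user service loads', () => {
    expect(new UserService()).toBeDefined();
  });`;
console.log(auditTestSource(uselessTest)); // { trivial: 1, behavioral: 0, suspicious: true }
```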

Questions for the VPs

Keisha, Luis: how do I sell this to leadership who love their green coverage dashboards?

I can’t just say “coverage is a vanity metric.” They’ll ask “then what should we measure?” and I don’t have good answers yet.

Also: How do I avoid bikeshedding about what counts as a “meaningful test”?

I can already hear the debates:

  • “But that edge case is unlikely, why test it?”
  • “Integration tests are slow, can’t we just mock everything?”
  • “This test does validate something, just not the happy path”

Without clear guidance, we’ll spend hours arguing about test philosophy instead of actually improving quality.

The Mutation Testing Challenge

I tried running Stryker on our payment module. Results:

  • 85% line coverage
  • 42% mutation score
  • CI time went from 3 minutes to 47 minutes

I can’t merge that into our pipeline. It would block every PR for 45 minutes.

Rachel mentioned selective mutation testing - can you share more details on tooling and configuration?
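While waiting on details, here is a sketch of the Stryker configuration I'd start from for selective, incremental runs. The option names come from recent Stryker versions; check your version's docs, and treat the paths as placeholders:

```javascript
// stryker.config.mjs (sketch). Scopes mutation to one high-risk module and
// enables incremental mode so only mutants affected by changed code re-run.
export default {
  mutate: ['src/payments/**/*.js'], // the payment module only, not the whole repo
  testRunner: 'jest',
  incremental: true,                // reuse results from the previous run
  concurrency: 4,                   // parallel test-runner processes
  thresholds: { high: 80, low: 60, break: null }, // report quality, don't fail the build yet
  reporters: ['clear-text', 'progress'],
};
```

Scoping plus incremental mode is what turns an hours-long full run into something closer to a nightly or per-module job.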

The Real Fear

Here’s what keeps me up at night: What if we’ve already shipped critical bugs that our “comprehensive” test suite missed?

We have 2000+ tests. 87% coverage. All green. And after reading this thread, I have zero confidence they actually protect us.

How do I audit an existing test suite for quality? Where do I even start?