Our test coverage went to 95% with AI. Our bugs went up 40%. We're measuring the wrong things

Three months ago, my EdTech startup hit a major milestone: 95% test coverage across our codebase. We celebrated. We put it in our all-hands slides. Our board was impressed.

Last month, I had to explain to that same board why our production bug rate increased by 40% over the same period.

We achieved our coverage goals. We shipped more features. And our quality got worse. Something is fundamentally broken about how we measure testing in the AI era.

How We Got Here

We adopted AI test generation aggressively:

  • Engineers used Copilot to generate unit tests
  • We set up automated PR checks requiring 80%+ coverage
  • Teams raced to hit coverage targets
  • Every sprint review showed green coverage metrics going up

Leadership loved it. Engineering loved it. Our customers… not so much.

What We Didn’t Measure

While we obsessed over coverage percentage, we completely missed:

Bug Escape Rate: The percentage of bugs that make it to production. This went UP by 40%.

Test Effectiveness: Of the bugs found in production, how many could have been caught by our tests? We never measured this.

False Negative Rate: How often tests pass when they should fail. We had no idea.

Test Maintenance Burden: Hours spent fixing broken tests during refactoring. This TRIPLED.

Mean Time to Detection: How quickly tests catch regressions. This actually got WORSE.

The Coverage Metric Lie

Here’s what happened: AI optimized for our stated goal (coverage percentage), not our actual goal (bug prevention).

We had tests like this:

test('user service loads', () => {
  const service = new UserService();
  expect(service).toBeDefined();
});

test('user service has methods', () => {
  const service = new UserService();
  expect(typeof service.getUser).toBe('function');
});

These tests hit lines of code. They made coverage go up. They caught zero bugs.

But worse than useless tests were tests that created false confidence:

test('user update validates email', async () => {
  const result = await updateUser({ email: 'valid@example.com' });
  expect(result.success).toBe(true);
});

Coverage: 100% of the updateUser function.
Actual validation: None. What about invalid emails? What about SQL injection? What about race conditions?
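For contrast, here is a sketch of what a meaningful version of that test could look like. `validateEmail` is a hypothetical stand-in for whatever validation `updateUser` should perform, and plain assertions replace Jest so the sketch stands alone:

```javascript
// Hypothetical validator standing in for updateUser's email check.
function validateEmail(email) {
  if (typeof email !== 'string') return false;
  // Deliberately simple pattern; real validation would be stricter.
  return /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email);
}

// Happy path: the only case the original test covered.
console.assert(validateEmail('valid@example.com') === true);

// Negative and edge cases: the ones that actually catch bugs.
console.assert(validateEmail('not-an-email') === false);
console.assert(validateEmail('user@nodot') === false);
console.assert(validateEmail('') === false);
console.assert(validateEmail(null) === false);
console.assert(validateEmail("a'; DROP TABLE users;--") === false);
```

Same function under test, same coverage, but now the rejection paths are exercised too.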

The Bugs That Escaped

Our production bugs fell into patterns:

  1. Edge cases not tested: Null values, empty arrays, boundary conditions
  2. Integration failures: Unit tests passed, integration failed
  3. Race conditions: Tests assumed sequential execution
  4. Security gaps: Tests validated happy path, not attack vectors
  5. Performance regressions: Tests checked correctness, not speed

The kicker? All of these could have been prevented by better tests. But AI-generated tests don’t think about edge cases or security unless explicitly prompted.

What We Should Have Measured

After this painful lesson, we’re shifting our metrics:

1. Mutation Testing Score

  • Introduce bugs into code, see if tests catch them
  • Current reality: 60% of our “high coverage” code has 0% mutation score
  • Target: 80% mutation score on critical paths

2. Defect Density

  • Bugs found per 1000 lines of code
  • Track separately for AI-generated vs human-written tests
  • Currently: 2.3x higher for AI-test-only modules

3. Test Effectiveness Rate

  • % of production bugs that could have been caught by existing tests
  • Requires post-mortems on every bug
  • Current baseline: 65% of bugs should have been caught

4. Test Quality Gates

  • Tests must include negative cases
  • Tests must validate error handling
  • Tests must cover boundary conditions
  • Automated linting to enforce these patterns

5. Time to Detect Regressions

  • How quickly do tests catch when features break?
  • Some of our tests passed even when features were completely broken
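Mutation testing in miniature: the idea behind metric 1 can be shown in a few lines. We hand-roll a single mutant (flipping `+` to `-`) rather than using a real tool; a coverage-only test lets the mutant survive, while an assertion on actual behavior kills it:

```javascript
// The function under test, and a hand-written "mutant" with a seeded bug.
const add = (a, b) => a + b;
const mutantAdd = (a, b) => a - b; // mutation: + flipped to -

// A coverage-style test: executes the line, asserts almost nothing.
const coverageOnlyTest = (fn) => fn(2, 2) !== undefined;

// A behavioral test: asserts the actual result.
const behavioralTest = (fn) => fn(2, 3) === 5;

// A mutant is "killed" when the test fails against it.
const killed = (test) => test(mutantAdd) === false;

console.log('coverage-only test kills mutant:', killed(coverageOnlyTest)); // false: mutant survives
console.log('behavioral test kills mutant:', killed(behavioralTest));      // true: mutant killed
```

A real tool generates thousands of mutants like this automatically; the surviving ones point at exactly the assertions your suite is missing.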

The Hard Conversation

I had to tell my team: We’re removing coverage percentage from our goals.

Instead:

  • Each feature must have tests for 3 positive cases, 3 negative cases, and 3 edge cases
  • All critical paths require mutation testing
  • Production bugs trigger mandatory test gap analysis
  • Test quality is now part of code review, not just test existence

Unsurprisingly, our coverage percentage dropped. We deleted hundreds of useless tests.

And our bug rate started decreasing.

My Question for the Community

What metrics actually predict quality when AI is writing your tests?

Coverage is clearly broken. What should we measure instead? What’s worked for you?

And has anyone successfully implemented mutation testing at scale? We’re finding it slow (adds 10x to CI time) and noisy (lots of false positives).

I’m also curious: Are we the only ones who got burned by chasing coverage metrics? Or is this a common pattern that nobody talks about?

Because if AI test generation is going to be the norm, we need new metrics that actually correlate with quality. The old playbook doesn’t work anymore.

Keisha, this is exactly the conversation we need to have. Coverage has always been a vanity metric, but AI has made this painfully obvious. Let me share some statistical rigor on what actually predicts quality.

Coverage Was Never The Goal

In statistics, we distinguish between leading indicators and lagging indicators. Coverage is neither; it's a completeness metric that correlates only weakly with quality.

Research shows:

  • Coverage above 80% has diminishing returns
  • No correlation between coverage and bug density beyond baseline
  • High coverage with poor test quality is worse than low coverage with good tests (false confidence)

Your 95% coverage with 40% more bugs is a textbook example of Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”

Metrics That Actually Predict Quality

Based on industry research and our work at Anthropic, here are metrics that correlate with code quality:

1. Defect Density (bugs per KLOC)

  • Target: Under 1.0 for production code
  • Track by module and over time
  • Separate AI-generated code from human-written
  • Current research shows AI code has 1.7x higher defect density

2. Bug Escape Rate

  • % of bugs found in production vs caught in testing
  • Target: Under 10% for critical systems
  • Requires discipline in bug tracking and categorization
  • This is a LAGGING indicator but it’s honest

3. Mutation Testing Score

  • Introduce defects, see if tests catch them
  • Target: 80%+ for critical paths
  • Tools: Stryker (JS), PITest (Java), mutmut (Python)
  • Yes, it’s slow, but it’s the best quality signal we have

4. Test Flakiness Rate

  • % of test runs that produce inconsistent results
  • Target: Under 1%
  • Flaky tests indicate poor test quality and brittleness
  • AI-generated tests tend to be more flaky (timing, mocking issues)

5. Code Churn in Tests

  • How often tests change when implementation changes
  • Target: Under 10% churn rate
  • High churn indicates brittle tests coupled to implementation
  • Measure separately for test code vs production code
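The first two metrics above reduce to simple ratios. A sketch of the arithmetic (the inputs are illustrative, not anyone's real numbers):

```javascript
// Defect density: bugs per thousand lines of code (KLOC).
const defectDensity = (bugs, linesOfCode) => bugs / (linesOfCode / 1000);

// Bug escape rate: share of all bugs that reached production.
const bugEscapeRate = (prodBugs, testBugs) => prodBugs / (prodBugs + testBugs);

// Illustrative numbers only.
console.log(defectDensity(12, 24000)); // 0.5 bugs per KLOC, under the 1.0 target
console.log(bugEscapeRate(8, 92));     // 0.08, under the 10% target for critical systems
```

The hard part is not the formula but the bookkeeping: both numbers are only as honest as your bug tracking and categorization.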

The Measurement Framework

Here’s what I recommend:

Stop measuring: Coverage percentage as a goal
Start measuring: Coverage as a baseline (should be >70% but don’t optimize for it)

Stop measuring: Number of tests
Start measuring: Test effectiveness (% of bugs caught before production)

Stop measuring: Test pass/fail rates
Start measuring: Mean time to detect regressions

Add measurement: Mutation score for critical paths

On Mutation Testing At Scale

You mentioned mutation testing is slow and noisy. We faced the same issues. Here’s what worked:

1. Selective Mutation Testing

  • Don’t run on every commit
  • Run on critical modules only
  • Run full mutation suite nightly, not in CI
  • Focus on high-risk code paths

2. Mutation Operator Tuning

  • Not all mutations are valuable
  • Disable noisy operators (string literal mutations, magic number mutations)
  • Focus on logic mutations (conditionals, arithmetic, return values)

3. Incremental Mutation

  • Only mutate changed lines
  • Tools like Stryker support incremental mode
  • Reduces runtime from hours to minutes

4. Mutation Coverage as a Gate

  • Require 80%+ mutation score for new code in critical paths
  • Grandfather legacy code
  • Use as a PR check, not a merge blocker (too slow)

The Real Question: What’s The Cost of Poor Quality?

Your bug rate went up 40%. What did that actually cost?

  • Customer support tickets?
  • Engineering time to fix?
  • Customer churn?
  • Reputation damage?

If you can quantify that cost, you can justify the investment in better metrics and slower (but higher quality) testing practices.

My hypothesis: The cost of that 40% bug increase far exceeds the cost of proper mutation testing infrastructure.

To Your Team’s Resistance

Removing coverage from goals is brave. Expect pushback. Here’s the data to support it:

Google’s research (2014): Coverage beyond 80% shows no correlation with reduced defects
Microsoft’s research (2018): Test quality matters more than test quantity
Recent AI code quality research (2026): AI-generated code with high coverage but low mutation score has 2.1x defect density

Show your team the alternative: measure what matters, not what’s easy to measure.

Keisha, your story is a warning we all need to hear. At our Fortune 500 fintech, we almost made the same mistake. Let me share how compliance requirements actually saved us from ourselves.

The Compliance Forcing Function

In financial services, we can’t just measure coverage. We have to demonstrate that our tests actually validate regulatory requirements. That forced us to think differently about test quality from day one.

Our auditors ask questions like:

  • “How do you verify that interest calculations are correct?”
  • “How do you test that PII is properly encrypted?”
  • “What prevents unauthorized account access?”

Answering “we have 95% test coverage” gets you laughed out of the audit. They want to see tests that explicitly validate regulatory requirements.

This compliance burden turned out to be a blessing. We couldn’t game metrics even if we wanted to.

Beyond Coverage: Compliance Test Gates

Here’s what we implemented (and I think non-regulated industries should copy):

1. Requirement Traceability

  • Every test must trace to a requirement
  • Every requirement must have tests
  • Gap analysis tooling shows uncovered requirements
  • AI can generate tests, but humans must map to requirements

2. Test Categories with Different Quality Bars

  • Functional tests: Verify features work (can be AI-generated)
  • Compliance tests: Verify regulatory requirements (human-written and reviewed)
  • Security tests: Verify attack resistance (security team only)
  • Performance tests: Verify SLAs (separate infrastructure)

Each category has different quality gates. Compliance tests require 100% mutation score.

3. Manual Test Audits

  • 10% random sample of tests audited quarterly
  • Check: Does this test actually validate what it claims?
  • Found issues: 18% of tests had misleading names or assertions
  • AI-generated tests failed audits 3x more often

4. Negative Test Requirements

  • Every positive test must have a negative counterpart
  • If you test “valid email accepted”, must also test “invalid email rejected”
  • Automated linting checks for this pattern
  • Shockingly, AI rarely generates negative tests unless prompted
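The pairing rule in item 4 can be approximated with a naming-convention check. A sketch, assuming tests follow an "accepts ... / rejects ..." naming convention; a real linter would inspect assertions and the AST rather than grep test names:

```javascript
// Flag "accepts" tests that have no "rejects" counterpart in the suite.
// Substring matching is deliberately loose; this feeds a review, not a gate.
function missingNegativeCounterparts(testNames) {
  const hasRejects = (subject) =>
    testNames.some((n) => n.includes('rejects') && n.includes(subject));
  return testNames
    .filter((n) => n.includes('accepts'))
    .map((n) => ({ name: n, subject: n.split('accepts ')[1] }))
    .filter(({ subject }) => !hasRejects(subject))
    .map(({ name }) => name);
}

const names = [
  'accepts valid email',
  'rejects invalid email',
  'accepts valid phone number', // no negative counterpart: flagged
];
console.log(missingNegativeCounterparts(names)); // ['accepts valid phone number']
```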

The Metrics We Actually Track

Your new metrics are spot-on. Here’s what we’ve added:

Test Effectiveness Index (TEI)

  • Formula: (Bugs caught in testing) / (Total bugs found)
  • Target: >85% for production releases
  • Tracks quality of test suite, not quantity
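The TEI formula as code, with illustrative inputs:

```javascript
// Test Effectiveness Index: bugs caught in testing over total bugs found.
function testEffectivenessIndex(caughtInTesting, foundInProduction) {
  const total = caughtInTesting + foundInProduction;
  return total === 0 ? 1 : caughtInTesting / total;
}

console.log(testEffectivenessIndex(88, 12)); // 0.88, above the 0.85 release target
```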

Critical Path Coverage

  • Coverage % but ONLY for critical business flows
  • Ignore utility code, focus on money movement
  • Target: 100% on critical paths, don’t care about overall %

Audit Finding Rate

  • Issues found in audits per 1000 tests
  • Separate human vs AI tests
  • AI tests have 2.8x higher audit finding rate

Test Debt Ratio

  • (Tests that break during refactoring) / (Total tests)
  • High ratio indicates brittle tests
  • Our AI-generated tests have 40% higher debt ratio

On Mutation Testing Performance

Rachel mentioned tuning. We went further: risk-based mutation testing.

Only run mutation testing on:

  1. Code that handles money
  2. Code that handles PII
  3. Code that makes access control decisions
  4. Code flagged as “high complexity” (cyclomatic complexity >10)

This covers ~15% of our codebase but ~80% of our risk. Mutation testing runs in 20 minutes instead of 6 hours.

For the other 85%? We accept that coverage + code review is “good enough.”
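The selection rule above can be expressed as a simple predicate. The path conventions and the complexity field are assumptions standing in for whatever module metadata your build actually produces:

```javascript
// Decide whether a file earns (slow) mutation testing, per the four criteria above.
function needsMutationTesting(file) {
  // Money, PII, and access control paths; adjust to your repo layout.
  const highRiskPath = /\/(payments|billing|pii|auth|accounts)\//.test(file.path);
  const highComplexity = file.cyclomaticComplexity > 10;
  return highRiskPath || highComplexity;
}

console.log(needsMutationTesting({ path: 'src/payments/transfer.ts', cyclomaticComplexity: 4 })); // true
console.log(needsMutationTesting({ path: 'src/utils/format.ts', cyclomaticComplexity: 3 }));      // false
```

Wire a filter like this into the mutation-test job's file list and the expensive run stays proportional to risk, not codebase size.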

The Cultural Challenge

Hardest part: convincing engineers that deleting tests is a good thing.

We had teams with 3000+ tests, 92% coverage, and terrible quality. I told them: “Delete every test that doesn’t validate meaningful behavior.”

They deleted 1200 tests. Coverage dropped to 78%. Bug rate decreased by 22%.

Turns out, those deleted tests were creating more harm than good:

  • Maintenance burden (broke on every refactor)
  • False confidence (high coverage, low quality)
  • Slow CI (more tests = longer feedback loops)

To Your Question

You asked if others got burned by chasing coverage. Yes. Everyone in regulated industries has a coverage horror story.

The pattern is universal:

  1. Set coverage target
  2. Engineers game the metric
  3. Quality suffers
  4. Expensive production incident
  5. Re-evaluate metrics

Some teams learn this lesson through compliance audits (like us). Others through production outages (like you). Either way, the lesson is expensive.

Advice for Non-Regulated Teams

You don’t have auditors forcing quality. That’s both freedom and danger.

My suggestion: Act like you’re regulated. Pick your critical paths and treat them as if the SEC is watching. For everything else, be pragmatic.

And please, remove coverage percentage from your dashboards. It’s doing more harm than good.

This thread is both validating and terrifying. Validating because I’ve suspected our coverage obsession was wrong. Terrifying because we might be heading down the same path.

Our Current Reality

At TechFlow, we have:

  • 87% test coverage (and climbing with AI help)
  • Coverage gate in CI (won’t merge below 80%)
  • Weekly dashboards showing coverage trends
  • Engineers incentivized to increase coverage

But nobody measures if those tests actually catch bugs. And now I’m realizing we’re building a house of cards.

The Practical Question: How Do I Fix This?

Rachel and Luis, your frameworks are great but also overwhelming. I’m a senior engineer, not a VP. I can’t overhaul our entire testing strategy.

What can I do at the team level?

Here’s what I’m thinking:

1. Start Tracking Bug Sources

  • Every bug we find: Could our tests have caught this?
  • Tag bugs: “missing test”, “bad test”, “test bypassed”, “infrastructure”
  • Build data to show tests aren’t working

2. Mutation Testing on My Feature

  • I own the payment module
  • Start running Stryker on just that code
  • Show the team what “good tests” look like
  • Use as example to advocate for better metrics

3. Implement “3x3 Rule” in Code Review

  • Require 3 positive, 3 negative, 3 edge case tests for new features
  • Make this a review checklist item
  • Don’t care about coverage %, care about meaningful tests

4. Delete Useless Tests

  • Audit my team’s test suite
  • Find tests that assert .toBeDefined() or similar nonsense
  • Delete them and document the coverage drop
  • Show that lower coverage can mean better quality
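Step 4's audit can start with a crude scan. A sketch that flags test sources whose only assertions are existence checks; it's a heuristic to feed a manual review, not something that should auto-delete anything:

```javascript
// Count trivial vs. behavioral assertions in a test file's source text.
const TRIVIAL = [/\.toBeDefined\(\)/g, /\.toBeTruthy\(\)/g];
const BEHAVIORAL = [/\.toBe\(/g, /\.toEqual\(/g, /\.toThrow\(/g];

function auditTestSource(source) {
  const count = (patterns) =>
    patterns.reduce((n, re) => n + (source.match(re) || []).length, 0);
  const trivial = count(TRIVIAL);
  const behavioral = count(BEHAVIORAL);
  // Suspicious: the file asserts existence and nothing else.
  return { trivial, behavioral, suspicious: trivial > 0 && behavioral === 0 };
}

const uselessTest = `
  test('user service loads', () => {
    expect(new UserService()).toBeDefined();
  });`;
console.log(auditTestSource(uselessTest)); // { trivial: 1, behavioral: 0, suspicious: true }
```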

Questions for the VPs

Keisha, Luis: how do I sell this to leadership who love their green coverage dashboards?

I can’t just say “coverage is a vanity metric.” They’ll ask “then what should we measure?” and I don’t have good answers yet.

Also: How do I avoid bikeshedding about what counts as a “meaningful test”?

I can already hear the debates:

  • “But that edge case is unlikely, why test it?”
  • “Integration tests are slow, can’t we just mock everything?”
  • “This test does validate something, just not the happy path”

Without clear guidance, we’ll spend hours arguing about test philosophy instead of actually improving quality.

The Mutation Testing Challenge

I tried running Stryker on our payment module. Results:

  • 85% line coverage
  • 42% mutation score
  • CI time went from 3 minutes to 47 minutes

I can’t merge that into our pipeline. It would block every PR for 45 minutes.

Rachel mentioned selective mutation testing - can you share more details on tooling and configuration?
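While waiting on details, here is a sketch of the Stryker configuration I'd start from for selective, incremental runs. The option names come from recent Stryker versions; check your version's docs, and treat the paths as placeholders:

```javascript
// stryker.config.mjs (sketch). Scopes mutation to one high-risk module and
// enables incremental mode so only mutants affected by changed code re-run.
export default {
  mutate: ['src/payments/**/*.js'], // the payment module only, not the whole repo
  testRunner: 'jest',
  incremental: true,                // reuse results from the previous run
  concurrency: 4,                   // parallel test-runner processes
  thresholds: { high: 80, low: 60, break: null }, // report quality, don't fail the build yet
  reporters: ['clear-text', 'progress'],
};
```

Scoping plus incremental mode is what turns an hours-long full run into something closer to a nightly or per-module job.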

The Real Fear

Here’s what keeps me up at night: What if we’ve already shipped critical bugs that our “comprehensive” test suite missed?

We have 2000+ tests. 87% coverage. All green. And after reading this thread, I have zero confidence they actually protect us.

How do I audit an existing test suite for quality? Where do I even start?