Our Test Suite Became the Bottleneck: Moving from "Run All Tests" to Intelligent Test Selection

I want to share something we learned the hard way at my financial services company: when AI multiplies your code output, your testing strategy needs to change just as fundamentally.

The Wake-Up Call

Three months ago, we rolled out GitHub Copilot to our 40-person engineering team. Within weeks, our code velocity nearly tripled: PRs per week jumped from 45 to 127. We were thrilled… until our CI/CD pipeline ground to a halt.

Our comprehensive test suite—4 hours of integration, unit, and compliance tests—was designed for our previous throughput. When we suddenly had 3x more PRs running through the same pipeline, the math became brutal:

  • Before AI: ~50 PRs/week × 4 hours ≈ 200 hours of total test time
  • After AI: ~150 PRs/week × 4 hours ≈ 600 hours of total test time

Our test infrastructure couldn’t scale. Queue times ballooned from 30 minutes to 4-5 hours. Engineers were context-switching constantly, waiting for tests that were running against changes they’d made hours earlier.

Why “Run Everything Every Time” Breaks Down

The traditional approach—running the full test suite on every change—is a luxury we could afford when code volume was manageable. But AI fundamentally changed the economics.

Financial services adds another layer of complexity: we have extensive compliance testing, regulatory validation, and security scans. These aren’t optional, but they’re also incredibly time-consuming.

Our Three-Phase Solution

After researching approaches from Google’s TAP (Test Automation Platform), GitHub’s test impact analysis, and CloudBees Smart Tests, we implemented a three-phase transformation:

Phase 1: Test Analytics (Weeks 1-2)

We started by understanding our test landscape:

  • Which tests failed most frequently? (Flaky test identification)
  • Which tests took longest? (Performance bottlenecks)
  • Which code changes triggered which test failures? (Impact mapping)

We discovered that 12% of our tests were flaky—passing/failing inconsistently. These weren’t just time-wasters; they were eroding trust in our entire test suite.
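
If you want to run the same analysis, the core of it is embarrassingly simple. Below is a minimal sketch of the flaky-test detection, assuming your CI can export per-run results as JSON; the file name, record shape, and field names are illustrative, not our actual tooling.

```python
# flaky_report.py -- minimal sketch of the Phase 1 flaky-test analysis.
# Assumes CI results exported as JSON records shaped like:
#   {"test": "tests/payments/test_rounding.py::test_half_even", "sha": "abc123", "passed": true}
import json
from collections import defaultdict

def flaky_tests(runs):
    """A test counts as flaky if it both passed and failed on the same commit."""
    outcomes = defaultdict(set)                 # (test, sha) -> {True, False}
    for r in runs:
        outcomes[(r["test"], r["sha"])].add(r["passed"])

    per_test = defaultdict(lambda: {"flaky": 0, "commits": 0})
    for (test, _sha), results in outcomes.items():
        per_test[test]["commits"] += 1
        if len(results) == 2:                   # both pass and fail seen on one commit
            per_test[test]["flaky"] += 1

    rates = [(stats["flaky"] / stats["commits"], test)
             for test, stats in per_test.items() if stats["flaky"]]
    return sorted(rates, reverse=True)

if __name__ == "__main__":
    with open("ci_results.json") as f:
        runs = json.load(f)
    for rate, test in flaky_tests(runs)[:20]:   # worst offenders first
        print(f"{rate:6.1%}  {test}")
```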

Phase 2: Risk-Based Testing (Weeks 3-5)

We categorized every PR into risk tiers:

Tier 1 (Low Risk): Documentation, configuration, test files

  • Test coverage: Linting, basic smoke tests only
  • Runtime: ~10 minutes

Tier 2 (Medium Risk): Feature work within established patterns

  • Test coverage: Relevant unit tests + integration tests for affected services
  • Runtime: ~45 minutes

Tier 3 (High Risk): New patterns, performance-critical code, security changes

  • Test coverage: Full unit + integration + subset of compliance tests
  • Runtime: ~2 hours

Tier 4 (Critical): Cross-service changes, data model changes, API contracts

  • Test coverage: Full suite including all compliance and security scans
  • Runtime: ~4 hours (unchanged)

The key insight: not all code changes carry equal risk. A typo fix in documentation doesn’t need the same validation as a payment processing change.
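
Mechanically, the tier assignment is just pattern matching on the paths a PR touches, done in an early CI step. Here's a stripped-down sketch; the path patterns are illustrative examples, and our real ruleset is more granular and lives in a reviewed config file.

```python
# classify_pr.py -- illustrative sketch of risk-tier assignment from changed paths.
# The patterns are examples only; a real ruleset belongs in reviewed config.
import fnmatch

TIER_RULES = [
    # (tier, patterns) -- checked from highest risk down; first match wins per file
    (4, ["*migrations*", "*api/contracts*", "proto/*"]),
    (3, ["services/payments/*", "*auth*", "*security*"]),
    (1, ["docs/*", "*.md", "*tests/*", "*.yaml"]),
]
DEFAULT_TIER = 2  # ordinary feature work within established patterns

def classify(changed_files):
    """Return the highest risk tier triggered by any file in the PR."""
    tier = 0
    for path in changed_files:
        matched = DEFAULT_TIER
        for t, patterns in TIER_RULES:
            if any(fnmatch.fnmatch(path, p) for p in patterns):
                matched = t
                break
        tier = max(tier, matched)
    return tier or DEFAULT_TIER

if __name__ == "__main__":
    print(classify(["docs/runbook.md"]))                       # -> 1
    print(classify(["services/payments/rounding.py"]))         # -> 3
    print(classify(["proto/ledger.proto", "docs/notes.md"]))   # -> 4
```

The important property is that the riskiest file wins: a PR that touches one doc and one API contract still gets the Tier 4 treatment.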

Phase 3: Parallel Infrastructure Investment (Weeks 6-8)

We 4x’d our CI runner capacity and implemented intelligent test distribution. Tests that could run in parallel were grouped and distributed across runners.
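
The "intelligent distribution" part is less exotic than it sounds: conceptually it's greedy bin-packing of tests onto runners using their historical durations. A toy version, with made-up numbers, looks like this.

```python
# shard_tests.py -- toy sketch of duration-balanced test sharding across CI runners.
# Durations would come from CI history; the numbers below are made up.
import heapq

def shard(test_durations, num_runners):
    """Assign the longest tests first, each to the currently lightest runner."""
    heap = [(0.0, i, []) for i in range(num_runners)]   # (total_seconds, runner_id, tests)
    heapq.heapify(heap)
    for test, seconds in sorted(test_durations.items(), key=lambda kv: -kv[1]):
        total, rid, tests = heapq.heappop(heap)
        tests.append(test)
        heapq.heappush(heap, (total + seconds, rid, tests))
    return sorted(heap, key=lambda s: s[1])              # order by runner id

if __name__ == "__main__":
    durations = {"test_ledger": 540, "test_auth": 300, "test_api": 240,
                 "test_ui": 120, "test_utils": 30}
    for total, rid, tests in shard(durations, 2):
        print(f"runner {rid}: {total:>4.0f}s  {tests}")
```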

The Results (3 Months In)

  • Average test time: Down from 4 hours to 45 minutes (roughly an 81% reduction)
  • P95 test time: 2 hours (down from 6+ hours in queue)
  • Confidence level: Maintained 80% catch rate for bugs
  • False negatives: 3 bugs reached production that the full suite would have caught (an acceptable trade-off for us)
  • Developer satisfaction: Up significantly—fast feedback enables flow state

The Trade-Offs We Made (Being Honest)

This approach isn’t perfect:

  1. Bugs slip through: We’ve had 3 production incidents in 3 months that would have been caught by tests we didn’t run. Each was minor, but it stung.

  2. Cultural shift required: Senior engineers initially resisted “not running all tests.” It felt like we were lowering quality standards. We had to reframe it: we’re optimizing for overall system quality, not individual PR perfection.

  3. Ongoing maintenance: Test categorization isn’t automatic. When new tests are added, they need risk classification. We assign this to the engineer who wrote the test.

  4. Rollback strategy needed: We monitor production much more closely now and have one-click rollback for any deployment. Fast rollback is our safety net.

The Bigger Picture

Here’s what I’ve learned: AI doesn’t just speed up coding—it exposes every other bottleneck in your delivery pipeline.

Testing was our constraint. For other teams, it might be code review, deployment processes, or QA capacity. But the principle is the same: if you’re investing in AI coding tools without examining your entire delivery system, you’re setting yourself up for disappointment.

The financial services context taught me something important about mentorship too. My junior engineers—many of them first-generation college grads like me—were producing more code but getting frustrated by slow feedback loops. Optimizing our pipeline wasn’t just a technical problem; it was about respecting their work and keeping them engaged.

Questions for the Community

  1. Has anyone implemented ML-based test selection? We’re exploring using historical data to predict which tests are most likely to catch bugs for specific code changes.

  2. How do you handle compliance testing in regulated industries? Our risk-based approach works, but I’m curious how others balance speed with regulatory requirements.

  3. What’s your rollback strategy? When you’re not running comprehensive tests, fast incident response becomes critical.

Testing can’t be an afterthought when AI triples your code volume. It needs to be a first-class engineering investment: infrastructure, tooling, and culture.

What’s your testing bottleneck story? :microscope:

Luis, this is fascinating—I hadn’t thought about testing as THE constraint, but it makes so much sense now that you’ve laid it out.

Your risk-based tiering resonates with how we think about design review too. We were drowning in requests to review every Figma file, every component variant. We recently moved to a similar tiered approach:

  • Low-risk: Pattern library components → automated accessibility checks only
  • High-risk: New user flows, checkout experiences → full design review + usability testing

My burning question: How did you get buy-in for “not running all tests”?

I can already hear the objections from our engineering team: “But what if we miss something?” “This feels risky.” The cultural shift you mentioned—that’s the hard part, right?

From a design perspective, this sounds like it needs excellent UX for engineers. How do they know what’s being tested? Is there a dashboard that shows:

  • Which risk tier their PR fell into?
  • Which tests ran vs. skipped?
  • Confidence level for this specific change?

Making the “invisible” testing process visible seems critical for trust.

Also, I’m dealing with a specific pain point: our accessibility tests are SLOW (45 minutes for full suite). They’re blocking design system updates. Is there a framework you’d recommend for categorizing test risk levels? Could we apply your tier system to a11y testing?

The parallel you drew to pipeline infrastructure investment is spot-on. We’ve been asking for design system tooling budget for months, and leadership keeps saying “just use Figma faster.” This gives me language to reframe the conversation: stop optimizing the wrong constraint and invest in the actual bottleneck. :bullseye:

Luis, this is the conversation I’ve been trying to have with our board for 6 months. You’ve articulated it perfectly with data.

The Strategic Perspective:

Your Copilot investment story mirrors ours exactly. We made the same bet on AI coding tools, saw immediate developer productivity gains, then watched our deployment frequency flatline because testing infrastructure couldn’t scale.

I’m bringing this to our next exec team meeting. The cost analysis is compelling:

Investment we made:

  • GitHub Copilot for 120 engineers: $600K/year

Investment we SHOULD have made simultaneously:

  • Test infrastructure scale-up: $400K (4x CI runners + cloud costs)
  • Testing platform team: $750K/year (3 dedicated engineers)
  • Test observability tooling: $150K setup + $50K/year
  • Total: $1.35M first year, then $1.2M/year

But here’s the ROI math that convinced our CFO:

  • 120 engineers × $180K fully-loaded cost × 25% time saved waiting for tests = $5.4M/year value created
  • Payback period: 3 months

The key lesson: test architecture is now a competitive advantage, not just a quality gate.

Companies that figure out intelligent testing will ship 3-4x faster than those still running full suites on every change. In our market, that’s the difference between leading and following.

Team Structure Question:

Who owns test optimization in your org? We’re debating whether it should be:

  1. Product engineering teams (but they’re already stretched)
  2. Infrastructure team (but they don’t understand test business logic)
  3. New dedicated “Testing Platform” team (expensive to staff)
  4. DevEx team (our current approach, but testing is only one of their priorities)

I’m leaning toward option 3—a dedicated team—because testing at scale is complex enough to warrant specialization. But I’m curious how you staffed this.

Also: how do you measure success? We’re tracking:

  • Mean time to test feedback (MTTF)
  • Test reliability score (1 - flake rate)
  • Test coverage vs. bug escape rate
  • Developer satisfaction with testing experience

Are there other metrics that matter? :bar_chart:
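
For what it's worth, the first two fall straight out of raw CI events. A rough sketch of how we compute them; the record shapes and field names are illustrative, not a real API.

```python
# test_metrics.py -- rough sketch of two of the metrics above.
# pipeline_runs: [{"pr": 1234, "queued_at": datetime, "finished_at": datetime}, ...]
# test_runs:     [{"test": "...", "sha": "...", "passed": bool}, ...]
from statistics import mean

def mean_time_to_feedback(pipeline_runs):
    """Average minutes from a PR entering the queue to its tests completing."""
    return mean((r["finished_at"] - r["queued_at"]).total_seconds() / 60
                for r in pipeline_runs)

def reliability_score(test_runs):
    """1 - flake rate, where a flake is a pass-and-fail on the same (test, commit)."""
    outcomes = {}
    for r in test_runs:
        outcomes.setdefault((r["test"], r["sha"]), set()).add(r["passed"])
    flaky = sum(1 for results in outcomes.values() if len(results) == 2)
    return 1 - flaky / len(outcomes) if outcomes else 1.0
```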

Luis, I’m printing this out for my team. The people impact is what hits hardest for me.

The Human Cost of Slow Tests:

When developers wait 4+ hours for test results, here’s what actually happens:

  1. They start new work (context switch)
  2. Tests fail hours later
  3. They have to reload mental context
  4. Fix takes longer because they’re not “in flow”
  5. Repeat cycle

This isn’t just productivity loss—it’s a morale killer. I’ve seen great engineers leave because “the tools slow me down.”

My Controversial Take:

Developers hate flaky tests MORE than slow tests.

We survey our team quarterly, and the #1 complaint isn’t speed—it’s trust. When tests fail randomly, engineers stop believing the results. Then they start shipping anyway, defeating the entire purpose of testing.

Your “12% flaky test” discovery? That’s the real villain. I bet fixing flaky tests gave you more psychological safety than any speed improvement.

Our Investment Priorities:

  1. Fix flaky tests FIRST (morale + trust)
  2. Speed up critical path tests (productivity)
  3. Implement intelligent selection (scalability)

We created a “test trust score” metric: % of engineers who trust test results enough to NOT manually verify before shipping. Ours was 34% before we started fixing flaky tests. Now it’s 78%.

Question about failure handling:

How do you handle the inevitable bug that escaped because a test wasn’t run?

We had one incident where a Tier 2 PR introduced a subtle payment rounding error that only our full compliance suite would have caught. Cost us $4K in customer refunds and trust damage.

Your response to that incident sets the cultural tone:

  • Option A: Blame the engineer, tighten testing requirements (fear-based)
  • Option B: Update risk categorization, learn systemically (growth-based)

I’m guessing you chose B, but I’d love to hear how you actually handled it. Because that’s what determines whether your team will stay innovative or become risk-averse.

The mentorship angle you mentioned—first-gen engineers frustrated by slow feedback—that resonates deeply. Fast feedback loops aren’t just about productivity; they’re about respecting people’s time and enabling learning. :100:

Luis, product leader question: how does this affect our release confidence?

I’ll be honest—when engineering started talking about “intelligent test selection” and “risk-based testing,” my immediate reaction was skepticism. It sounded like “we’re going to test less and hope for the best.”

But after reading your approach, I get it. You’re being smarter about testing, not lazier.

My concerns (as the person who talks to angry customers):

  1. Customer trust: If we’re running fewer tests, how do we explain bugs that slip through? “Our intelligent test selection missed it” doesn’t play well with enterprise customers.

  2. False negative rate: What’s acceptable? You mentioned 3 production bugs in 3 months. For context:

    • SaaS tolerance: Maybe 1-2 minor bugs/month = acceptable
    • Financial services: Even 1 data integrity bug = potential regulatory issue
    • Healthcare: Near-zero tolerance

How did you determine 3 bugs in 3 months was acceptable? What’s the severity distribution?

  3. Risk communication: When a Tier 2 PR ships with partial testing, does product know? Should we adjust launch timing or feature flagging strategy based on test coverage?

What I Think Product Should Contribute:

Maybe product should be involved in risk categorization of tests. We understand customer impact better than anyone:

  • Payment flow test failure → Customer can’t pay → Critical blocker
  • Admin UI test failure → Internal tool glitch → Ship and fix later
  • Reporting test failure → Numbers wrong → Regulatory risk + customer trust issue

Product can help prioritize which tests are truly “must run every time” vs. “nice to have.”

Feature Flagging as Safety Net:

Your point about rollback strategy made me think—what if we combined intelligent testing with better feature flagging?

  • Tier 1-2 PRs: Ship behind feature flag, monitor closely for 24 hours
  • Tier 3-4 PRs: Full testing + gradual rollout

This way, partial test coverage is backstopped by controlled deployment. We can learn from production safely.
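
To make the proposal concrete, the mapping I have in mind is roughly the sketch below. The tier numbers come from Luis’s post; the flag and rollout values are my strawman, not an agreed policy.

```python
# rollout_policy.py -- strawman mapping from test tier to deployment strategy.
# Tier numbers follow Luis's post; the flag and rollout values are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class RolloutPlan:
    behind_flag: bool         # ship dark behind a feature flag?
    initial_traffic_pct: int  # first slice of traffic that sees the change
    monitor_hours: int        # how long we watch before widening the rollout

POLICY = {
    1: RolloutPlan(behind_flag=True,  initial_traffic_pct=100, monitor_hours=24),
    2: RolloutPlan(behind_flag=True,  initial_traffic_pct=100, monitor_hours=24),
    3: RolloutPlan(behind_flag=False, initial_traffic_pct=10,  monitor_hours=24),
    4: RolloutPlan(behind_flag=False, initial_traffic_pct=5,   monitor_hours=48),
}

def plan_for(tier: int) -> RolloutPlan:
    """Look up the release strategy for a PR's risk tier."""
    return POLICY[tier]
```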

The Question I’m Taking to Engineering:

Should our release process adapt based on test tier? Or is that adding too much complexity?

Thanks for the detailed write-up—this helps me understand why “AI made us faster at coding but slower at shipping” better than any previous conversation. :bullseye: