I want to share something we learned the hard way at my financial services company: when AI doubles your code output, your testing strategy needs to fundamentally change.
The Wake-Up Call
Three months ago, we rolled out GitHub Copilot to our 40-person engineering team. Within weeks, our code velocity nearly tripled—PRs per week jumped from 45 to 127. We were thrilled… until our CI/CD pipeline ground to a halt.
Our comprehensive test suite—4 hours of integration, unit, and compliance tests—was designed for our previous throughput. When we suddenly had roughly 3x more PRs running through the same pipeline, the math became brutal (rounding to 50 and 150 PRs/week):
- Before AI: 50 PRs/week × 4 hours = 200 hours of total test time
- After AI: 150 PRs/week × 4 hours = 600 hours of total test time
Our test infrastructure couldn’t scale. Queue times ballooned from 30 minutes to 4-5 hours. Engineers were context-switching constantly, waiting for tests that were running against changes they’d made hours earlier.
Why “Run Everything Every Time” Breaks Down
The traditional approach—running the full test suite on every change—is a luxury we could afford when code volume was manageable. But AI fundamentally changed the economics.
Financial services adds another layer of complexity: we have extensive compliance testing, regulatory validation, and security scans. These aren’t optional, but they’re also incredibly time-consuming.
Our Three-Phase Solution
After researching approaches from Google’s TAP (Test Automation Platform), GitHub’s test impact analysis, and CloudBees Smart Tests, we implemented a three-phase transformation:
Phase 1: Test Analytics (Weeks 1-2)
We started by understanding our test landscape:
- Which tests failed most frequently? (Flaky test identification)
- Which tests took longest? (Performance bottlenecks)
- Which code changes triggered which test failures? (Impact mapping)
We discovered that 12% of our tests were flaky—passing/failing inconsistently. These weren’t just time-wasters; they were eroding trust in our entire test suite.
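If it helps anyone doing a similar audit, here is a minimal sketch of the kind of script that surfaces flaky tests from CI history. The input format (a JSON export of per-run results with `test`, `commit`, and `passed` fields) is an assumption for the example, not a real CI API; adapt it to whatever your CI system actually exports.

```python
# flaky_report.py - illustrative sketch: flag tests that both pass and fail
# on the same commit, a common flakiness signal. The input schema is assumed.
import json
from collections import defaultdict

def flaky_tests(runs):
    # outcomes[test][commit] -> set of observed pass/fail results for that test+commit
    outcomes = defaultdict(lambda: defaultdict(set))
    for r in runs:
        outcomes[r["test"]][r["commit"]].add(r["passed"])

    report = {}
    for test, by_commit in outcomes.items():
        mixed = sum(1 for results in by_commit.values() if len(results) > 1)
        if mixed:
            # share of commits where the same test produced both outcomes
            report[test] = mixed / len(by_commit)
    return report

if __name__ == "__main__":
    with open("test_runs.json") as f:   # hypothetical export of historical runs
        runs = json.load(f)
    for test, rate in sorted(flaky_tests(runs).items(), key=lambda kv: -kv[1]):
        print(f"{rate:6.1%}  {test}")
```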
Phase 2: Risk-Based Testing (Weeks 3-5)
We categorized every PR into risk tiers:
Tier 1 (Low Risk): Documentation, configuration, test files
- Test coverage: Linting, basic smoke tests only
- Runtime: ~10 minutes
Tier 2 (Medium Risk): Feature work within established patterns
- Test coverage: Relevant unit tests + integration tests for affected services
- Runtime: ~45 minutes
Tier 3 (High Risk): New patterns, performance-critical code, security changes
- Test coverage: Full unit + integration + subset of compliance tests
- Runtime: ~2 hours
Tier 4 (Critical): Cross-service changes, data model changes, API contracts
- Test coverage: Full suite including all compliance and security scans
- Runtime: ~4 hours (unchanged)
The key insight: not all code changes carry equal risk. A typo fix in documentation doesn’t need the same validation as a payment processing change.
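To make the tiering concrete, here is a rough sketch of a path-based classifier. The glob patterns and the `classify_pr` helper are made-up examples for illustration, not our production rules; a real version would also weigh diff size and which service owns the files.

```python
# classify_pr.py - illustrative sketch of path-based risk tiering.
# Patterns are assumptions for the example, not actual production rules.
from fnmatch import fnmatch

# Checked from highest risk down; first match wins for each file.
TIER_RULES = [
    (4, ["*migrations*", "*api/contracts*", "shared/*"]),   # critical
    (3, ["*payments*", "*auth*", "*security*"]),            # high risk
    (1, ["docs/*", "*.md", "*.cfg", "tests/*"]),            # low risk
]

def classify_pr(changed_files):
    """Return the PR's tier: the highest tier implied by any changed file."""
    highest = 0
    for path in changed_files:
        tier_for_file = 2  # default: medium-risk feature work
        for tier, patterns in TIER_RULES:
            if any(fnmatch(path, pat) for pat in patterns):
                tier_for_file = tier
                break
        highest = max(highest, tier_for_file)
    return highest or 2  # empty change sets fall back to the default tier

if __name__ == "__main__":
    print(classify_pr(["docs/onboarding.md"]))                        # -> tier 1
    print(classify_pr(["services/payments/limits.py", "README.md"]))  # -> tier 3
```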
Phase 3: Parallel Infrastructure Investment (Weeks 6-8)
We 4x’d our CI runner capacity and implemented intelligent test distribution. Tests that could run in parallel were grouped and distributed across runners.
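The distribution piece is conceptually simple: sort tests by how long they historically take, then greedily assign each one to the currently least-loaded runner. A sketch is below; the test names and durations are invented, and in practice the timing data comes from the analytics we built in Phase 1.

```python
# shard_tests.py - sketch of duration-balanced test sharding
# (greedy longest-job-first). Test names and durations are made up.
import heapq

def shard(test_durations, num_runners):
    """Split tests across runners so each runner's total runtime is roughly even."""
    heap = [(0.0, i) for i in range(num_runners)]  # (seconds assigned, runner index)
    heapq.heapify(heap)
    shards = [[] for _ in range(num_runners)]

    # Place the longest tests first; each goes to the least-loaded runner so far.
    for test, seconds in sorted(test_durations.items(), key=lambda kv: -kv[1]):
        load, idx = heapq.heappop(heap)
        shards[idx].append(test)
        heapq.heappush(heap, (load + seconds, idx))
    return shards

if __name__ == "__main__":
    durations = {"test_ledger": 420.0, "test_kyc": 310.0, "test_api": 95.0, "test_ui": 60.0}
    for i, tests in enumerate(shard(durations, 2)):
        print(f"runner {i}: {tests}")
```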
The Results (3 Months In)
- Average test time: Down from 4 hours to 45 minutes (~81% reduction)
- P95 test time: 2 hours (down from 6+ hours including queue time)
- Confidence level: Maintained 80% catch rate for bugs
- False negatives: 3 bugs in production that would have been caught by full suite (acceptable trade-off for us)
- Developer satisfaction: Up significantly—fast feedback enables flow state
The Trade-Offs We Made (Being Honest)
This approach isn’t perfect:
- Bugs slip through: We’ve had 3 production incidents in 3 months that would have been caught by tests we didn’t run. Each was minor, but it stung.
- Cultural shift required: Senior engineers initially resisted “not running all tests.” It felt like we were lowering quality standards. We had to reframe it: we’re optimizing for overall system quality, not individual PR perfection.
- Ongoing maintenance: Test categorization isn’t automatic. When new tests are added, they need risk classification. We assign this to the engineer who wrote the test (see the sketch after this list).
- Rollback strategy needed: We monitor production much more closely now and have one-click rollback for any deployment. Fast rollback is our safety net.
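On the categorization point above: one lightweight mechanism, assuming a pytest-based suite (a sketch under that assumption, not necessarily our exact setup), is plain markers. The test author records the tier when writing the test, and the pipeline selects markers based on the PR’s tier.

```python
# Sketch assuming a pytest-based suite. Register the markers (e.g. in
# pyproject.toml under [tool.pytest.ini_options] markers = [...]) to avoid
# "unknown marker" warnings. Test bodies here are placeholders.
import pytest

@pytest.mark.tier1
def test_health_endpoint_returns_200():
    assert True  # placeholder: a real smoke test would hit the health check

@pytest.mark.tier3
def test_transaction_limits_enforced():
    assert True  # placeholder: compliance-sensitive logic gets a higher tier
```

The pipeline then runs something like `pytest -m tier1` for a Tier 1 PR, or `pytest -m "tier1 or tier2"` for Tier 2, which is standard pytest marker selection.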
The Bigger Picture
Here’s what I’ve learned: AI doesn’t just speed up coding—it exposes every other bottleneck in your delivery pipeline.
Testing was our constraint. For other teams, it might be code review, deployment processes, or QA capacity. But the principle is the same: if you’re investing in AI coding tools without examining your entire delivery system, you’re setting yourself up for disappointment.
The financial services context taught me something important about mentorship too. My junior engineers—many of them first-generation college grads like me—were producing more code but getting frustrated by slow feedback loops. Optimizing our pipeline wasn’t just a technical problem; it was about respecting their work and keeping them engaged.
Questions for the Community
- Has anyone implemented ML-based test selection? We’re exploring using historical data to predict which tests are most likely to catch bugs for specific code changes.
- How do you handle compliance testing in regulated industries? Our risk-based approach works, but I’m curious how others balance speed with regulatory requirements.
- What’s your rollback strategy? When you’re not running comprehensive tests, fast incident response becomes critical.
Testing can’t be an afterthought when AI multiplies your code volume. It needs to be a first-class engineering investment—infrastructure, tooling, and culture.
What’s your testing bottleneck story?