Three months ago, I proposed something unusual to our VP Engineering: let’s run an experiment where we track AI-generated code and human-written code as completely separate streams through our entire delivery pipeline.
Not just in development. Through QA. Through security review. Through compliance. All the way to production.
She thought I was overthinking it. I thought we needed data. We were both right.
The Experiment Design
We instrumented our Git workflow to tag commits as either ai-assisted or human-primary based on developer self-reporting (yes, honor system—more on that later). Then we tracked these two streams through every stage of our delivery cycle for 90 days.
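If you want to replicate the tagging, a minimal sketch is a commit-message trailer plus a commit-msg hook that refuses untagged commits. The Code-Source trailer name and the hook below are illustrative, not a standard and not our exact tooling:

```python
#!/usr/bin/env python3
# .git/hooks/commit-msg -- minimal sketch of the self-reporting tag check.
# The "Code-Source" trailer name and its two values are illustrative choices.
import re
import sys

ALLOWED = {"ai-assisted", "human-primary"}

def main() -> int:
    msg_file = sys.argv[1]  # Git passes the path to the commit message file
    with open(msg_file, encoding="utf-8") as f:
        message = f.read()

    match = re.search(r"^Code-Source:\s*(\S+)\s*$", message, re.MULTILINE)
    if not match or match.group(1) not in ALLOWED:
        sys.stderr.write(
            "Commit rejected: add a trailer line\n"
            "    Code-Source: ai-assisted | human-primary\n"
        )
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

It's still the honor system; a hook like this just makes the honor system mandatory, and a CI step can bucket commits by the trailer later.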
Team size: 45 engineers across 6 teams
Codebase: Java microservices + React frontend
AI tooling: GitHub Copilot + ChatGPT Enterprise
Hypothesis: AI code would be faster end-to-end because it’s faster to write.
Result: We were measuring the wrong thing entirely.
What We Actually Discovered
Finding 1: The Speed Gain Is Real But Asymmetric
AI-assisted code moved through initial development 40% faster (median commit-to-PR time: 3.2 hours vs 5.4 hours).
But here’s where it got interesting:
| Stage | AI Code | Human Code | Delta (AI vs. Human) |
|---|---|---|---|
| Initial development | 3.2 hrs | 5.4 hrs | -40% |
| Code review | 8.1 hrs | 6.2 hrs | +31% |
| QA testing | 4.5 hrs | 3.8 hrs | +18% |
| Security review | 9.2 hrs | 6.1 hrs | +51% |
| Deployment | 2.1 hrs | 2.0 hrs | +5% |
| Total cycle time | 27.1 hrs | 23.5 hrs | +15% |
AI code was slower end-to-end, despite being much faster to write.
Finding 2: Change Failure Rate Tells a Complex Story
AI-generated code had a 30% higher change failure rate in the first 48 hours post-deployment.
But when we dug into the failure modes:
- Logic errors: 12% higher (AI optimized locally, missed global constraints)
- Integration issues: 45% higher (AI didn’t understand cross-service dependencies)
- Performance problems: 8% higher (AI chose straightforward but inefficient patterns)
- Security vulnerabilities: 52% higher (AI used deprecated libs or insecure patterns)
The type of failure mattered more than the rate of failure.
Finding 3: The Real Bottleneck Shifted
Before AI: Coding speed was our constraint
After AI: Review and integration became our constraint
Code review took longer because reviewers were validating intent rather than just correctness. “This code works, but is this what we actually want to build?” became the dominant question.
Security review took longer because AI code had larger surface area—more dependencies, more API calls, more potential attack vectors per feature.
What We Changed Based On This Data
1. Differential Review Gates
We created different automated review processes based on code source:
AI-assisted code gets:
- Automated security scanning (always)
- Dependency risk analysis (always)
- Integration test coverage threshold: 85% (vs 70% for human code)
- Automatic assignment to senior reviewer if touching sensitive systems
Human-primary code gets:
- Standard review gates
- Focus on architectural alignment
Is this creating a two-tier system? Maybe. But the data says AI and human code have different risk profiles. Treating them identically was costing us.
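To make the two-tier idea concrete, here is roughly the shape of the gate-selection step, written as a standalone sketch rather than our actual CI config. The path prefixes, gate names, and function names are illustrative; the thresholds are the ones listed above:

```python
# Sketch of differential review-gate selection; in practice this runs as a CI
# step that reads the PR's Code-Source trailer and its changed file paths.
# Gate names and path prefixes here are illustrative, not a specific tool's API.
from dataclasses import dataclass, field

SENSITIVE_PREFIXES = ("services/payments/", "services/auth/")

@dataclass
class ReviewPlan:
    gates: list = field(default_factory=list)
    min_integration_coverage: float = 0.70
    require_senior_reviewer: bool = False

def plan_review(code_source: str, changed_paths: list) -> ReviewPlan:
    touches_sensitive = any(
        p.startswith(SENSITIVE_PREFIXES) for p in changed_paths
    )
    if code_source == "ai-assisted":
        return ReviewPlan(
            gates=["security-scan", "dependency-risk-analysis"],
            min_integration_coverage=0.85,
            require_senior_reviewer=touches_sensitive,
        )
    # human-primary: standard gates; architectural alignment stays a human review focus
    return ReviewPlan(gates=["standard-review"])
```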
2. Context Injection Requirements
We require engineers to include a CONTEXT.md file with AI-assisted PRs explaining:
- What you asked the AI to do
- What constraints you gave it
- What you verified manually
- What you’re uncertain about
This simple change reduced code review time by 35% because reviewers had the context to evaluate intent, not just correctness.
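If you want to enforce the file mechanically, a small PR check is enough. Here is a stripped-down sketch; the section headings mirror the list above, and everything else (paths, function names) is illustrative:

```python
# Sketch of a PR check that enforces CONTEXT.md on ai-assisted PRs.
# Section headings mirror our template; paths and names are illustrative.
from pathlib import Path
import sys

REQUIRED_SECTIONS = (
    "## What I asked the AI to do",
    "## What constraints I gave it",
    "## What I verified manually",
    "## What I'm uncertain about",
)

def check_context_file(pr_root: str) -> list:
    path = Path(pr_root) / "CONTEXT.md"
    if not path.exists():
        return ["CONTEXT.md is missing"]
    text = path.read_text(encoding="utf-8")
    return [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in text]

if __name__ == "__main__":
    problems = check_context_file(sys.argv[1] if len(sys.argv) > 1 else ".")
    for p in problems:
        print(p, file=sys.stderr)
    sys.exit(1 if problems else 0)
```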
3. Metrics That Actually Mattered
We stopped tracking:
- Lines of code written
- PR velocity
- Individual developer speed
We started tracking:
- Review time per line of code (by source type)
- Test coverage percentage (by source type)
- Rollback rate within 48 hours (by source type)
- Time from PR merge to production (by source type)
- MTTR for failures (by source type)
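Most of these roll up from the same tagged commit and deploy data. As a sketch, the per-source rollup is little more than a groupby, assuming a table of merged PRs with the (illustrative) columns named below; MTTR comes from the incident tracker rather than this table:

```python
# Sketch of the per-source metrics rollup, assuming a DataFrame of merged PRs
# with these illustrative columns already joined from Git, CI, and deploy events.
import pandas as pd

def rollup_by_source(prs: pd.DataFrame) -> pd.DataFrame:
    """prs columns: code_source, review_hours, lines_changed, coverage_pct,
    merged_at, deployed_at, rolled_back_48h (bool)."""
    prs = prs.copy()
    prs["review_hours_per_line"] = prs["review_hours"] / prs["lines_changed"]
    prs["merge_to_prod_hours"] = (
        (prs["deployed_at"] - prs["merged_at"]).dt.total_seconds() / 3600
    )
    return prs.groupby("code_source").agg(
        review_hours_per_line=("review_hours_per_line", "median"),
        coverage_pct=("coverage_pct", "median"),
        rollback_rate_48h=("rolled_back_48h", "mean"),
        merge_to_prod_hours=("merge_to_prod_hours", "median"),
    )
```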
Results After Adjustment
After implementing these changes and running for another 60 days:
- Total cycle time: AI code now 12% faster than human code (reversed the initial finding)
- Change failure rate: Reduced to 8% higher for AI code (down from 30%)
- MTTR improved 25% overall because we were catching issues earlier in the pipeline
- Security vulnerabilities in AI code: Reduced to 15% higher (down from 52%)
The Uncomfortable Questions This Raises
1. The Honor System Problem
We relied on developers to self-tag their commits as ai-assisted or human-primary. I'd estimate the tags are 75% accurate at best. Some engineers tag carelessly. Some don't want to admit how much AI they're using. Some genuinely don't know where to draw the line.
Does anyone have a better way to track AI attribution at the commit level?
2. The Technical Debt We’re Not Measuring
AI code that passes review and ships successfully might still be creating long-term technical debt we won’t discover for 6-12 months. Our experiment was 90 days. That’s not long enough to measure second-order effects.
How do you measure technical debt creation rate by code source? I don’t have a good answer yet.
3. The Team Dynamic Implications
We noticed that senior engineers started gravitating toward “AI code review specialist” as an informal role. They got good at spotting AI patterns and understanding failure modes.
Is this a good specialization? Or are we creating a future where only senior engineers can effectively work with AI?
4. The Attribution Problem
When AI-assisted code fails, who’s accountable? The engineer who prompted it? The reviewer who approved it? The AI vendor? Our company policy says “the engineer,” but that feels increasingly strained.
What I’d Do Differently
If I were starting this experiment today:
- Track context quality, not just code quality - Measure how well engineers communicate intent to AI
- Longer timeframe - 90 days isn’t enough to understand technical debt creation
- More granular tagging - Percentage AI-assisted (0-100%) rather than binary
- Business impact correlation - Link code source to customer value created (not just code shipped)
What Have Others Tried?
I’m sharing this because I think we’re all fumbling through this transition together. These metrics are better than what we had before, but I’m not confident they’re the right metrics.
Has anyone else run similar experiments? What did you learn? What metrics ended up mattering that you didn’t expect?
Because here’s the truth: we’re going to make major architectural and organizational decisions based on these metrics. We’d better get them right.