We Tracked AI vs Human Code Separately for 90 Days - Here's What We Learned About the Metrics That Actually Matter

Three months ago, I proposed something unusual to our VP Engineering: let’s run an experiment where we track AI-generated code and human-written code as completely separate streams through our entire delivery pipeline.

Not just in development. Through QA. Through security review. Through compliance. All the way to production.

She thought I was overthinking it. I thought we needed data. We were both right.

The Experiment Design

We instrumented our Git workflow to tag commits as either ai-assisted or human-primary based on developer self-reporting (yes, honor system—more on that later). Then we tracked these two streams through every stage of our delivery cycle for 90 days.
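
If you want to replicate the tagging, the mechanics can be as simple as a self-reported commit trailer plus a small script that splits the two streams at analysis time. The sketch below is illustrative only: the `AI-Assisted` trailer name and the bucketing rules are assumptions, not exactly what we ran.

```typescript
// bucket-commits.ts -- illustrative sketch: split commits into ai-assisted vs
// human-primary streams based on a self-reported "AI-Assisted: true" trailer.
// The trailer name and bucketing rules here are assumptions, not a standard.
import { execFileSync } from "node:child_process";

type Source = "ai-assisted" | "human-primary";

function commitsBySource(since: string): Record<Source, string[]> {
  // %H = commit hash; the trailers placeholder prints the trailer value if present.
  const log = execFileSync(
    "git",
    ["log", `--since=${since}`, "--pretty=format:%H|%(trailers:key=AI-Assisted,valueonly)"],
    { encoding: "utf8" },
  );

  const buckets: Record<Source, string[]> = { "ai-assisted": [], "human-primary": [] };
  for (const line of log.split("\n").filter(Boolean)) {
    const [hash, trailer] = line.split("|");
    // Anything not explicitly tagged "true" falls into the human-primary stream.
    const source: Source =
      trailer?.trim().toLowerCase() === "true" ? "ai-assisted" : "human-primary";
    buckets[source].push(hash);
  }
  return buckets;
}

console.log(commitsBySource("90 days ago"));
```

The self-reporting problem I get into later doesn't go away with tooling like this; the trailer just makes whatever engineers do report queryable.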

Team size: 45 engineers across 6 teams
Codebase: Java microservices + React frontend
AI tooling: GitHub Copilot + ChatGPT Enterprise

Hypothesis: AI code would be faster end-to-end because it’s faster to write.

Result: We were measuring the wrong thing entirely.

What We Actually Discovered

Finding 1: The Speed Gain Is Real But Asymmetric

AI-assisted code moved through initial development 40% faster (median commit-to-PR time: 3.2 hours vs 5.4 hours).

But here’s where it got interesting:

| Stage | AI Code | Human Code | Delta |
| --- | --- | --- | --- |
| Initial development | 3.2 hrs | 5.4 hrs | -40% ✅ |
| Code review | 8.1 hrs | 6.2 hrs | +31% ⚠️ |
| QA testing | 4.5 hrs | 3.8 hrs | +18% ⚠️ |
| Security review | 9.2 hrs | 6.1 hrs | +51% 🚨 |
| Deployment | 2.1 hrs | 2.0 hrs | +5% |
| Total cycle time | 27.1 hrs | 23.5 hrs | +15% 🤔 |

AI code was slower end-to-end, despite being much faster to write.

Finding 2: Change Failure Rate Tells a Complex Story

AI-generated code had a 30% higher change failure rate in the first 48 hours post-deployment.

But when we dug into the failure modes:

  • Logic errors: 12% higher (AI optimized locally, missed global constraints)
  • Integration issues: 45% higher (AI didn’t understand cross-service dependencies)
  • Performance problems: 8% higher (AI chose straightforward but inefficient patterns)
  • Security vulnerabilities: 52% higher (AI used deprecated libs or insecure patterns)

The type of failure mattered more than the rate of failure.

Finding 3: The Real Bottleneck Shifted

Before AI: Coding speed was our constraint
After AI: Review and integration became our constraint

Code review took longer because reviewers were validating intent rather than just correctness. “This code works, but is this what we actually want to build?” became the dominant question.

Security review took longer because AI code had larger surface area—more dependencies, more API calls, more potential attack vectors per feature.

What We Changed Based On This Data

1. Differential Review Gates

We created different automated review processes based on code source:

AI-assisted code gets:

  • Automated security scanning (always)
  • Dependency risk analysis (always)
  • Integration test coverage threshold: 85% (vs 70% for human code)
  • Automatic assignment to senior reviewer if touching sensitive systems

Human-primary code gets:

  • Standard review gates
  • Focus on architectural alignment

Is this creating a two-tier system? Maybe. But the data says AI and human code have different risk profiles. Treating them identically was costing us.
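
For anyone who wants to see the mechanics, here's roughly what a source-aware gate looks like as a CI step. Everything specific in it is an assumption for illustration: the `ai-assisted` PR label, the `payments/` sensitive-path check, and the nyc-style `coverage/coverage-summary.json` file are stand-ins for whatever your pipeline actually exposes.

```typescript
// review-gates.ts -- illustrative sketch of source-aware review gates in CI.
// Assumptions: the PR carries an "ai-assisted" label, integration coverage is
// written to coverage/coverage-summary.json (nyc-style), and sensitive paths
// are matched with a simple substring check.
import { readFileSync } from "node:fs";

interface GateConfig {
  coverageThreshold: number;    // minimum integration test coverage, in percent
  requireSecurityScan: boolean;
  requireSeniorReviewer: boolean;
}

function gatesFor(isAiAssisted: boolean, touchesSensitivePaths: boolean): GateConfig {
  if (isAiAssisted) {
    return {
      coverageThreshold: 85,                        // stricter bar for AI-assisted code
      requireSecurityScan: true,                    // always scan AI-assisted code
      requireSeniorReviewer: touchesSensitivePaths, // escalate on sensitive systems
    };
  }
  return { coverageThreshold: 70, requireSecurityScan: false, requireSeniorReviewer: false };
}

// Example CI usage: fail the job if coverage is below the source-specific bar.
const isAiAssisted = process.env.PR_LABELS?.includes("ai-assisted") ?? false;
const touchesSensitive = (process.env.CHANGED_FILES ?? "").includes("payments/");
const gates = gatesFor(isAiAssisted, touchesSensitive);

const summary = JSON.parse(readFileSync("coverage/coverage-summary.json", "utf8"));
if (summary.total.lines.pct < gates.coverageThreshold) {
  console.error(`Coverage ${summary.total.lines.pct}% is below the ${gates.coverageThreshold}% gate`);
  process.exit(1);
}
```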

2. Context Injection Requirements

We require engineers to include a CONTEXT.md file with AI-assisted PRs explaining:

  • What you asked the AI to do
  • What constraints you gave it
  • What you verified manually
  • What you’re uncertain about

This simple change reduced code review time by 35% because reviewers had the context to evaluate intent, not just correctness.
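
Keeping this honest is easy to automate: a PR check can refuse to pass until the file exists and touches each of the four points. A minimal sketch follows; it assumes the four prompts above appear verbatim as headings in CONTEXT.md, which is a convention we made up, not a requirement.

```typescript
// check-context.ts -- illustrative sketch: fail an AI-assisted PR if CONTEXT.md
// is missing or skips any of the four required sections. The exact heading
// strings are an assumed convention, not a fixed standard.
import { existsSync, readFileSync } from "node:fs";

const REQUIRED_SECTIONS = [
  "What you asked the AI to do",
  "What constraints you gave it",
  "What you verified manually",
  "What you're uncertain about",
];

if (!existsSync("CONTEXT.md")) {
  console.error("AI-assisted PR is missing CONTEXT.md");
  process.exit(1);
}

const body = readFileSync("CONTEXT.md", "utf8");
const missing = REQUIRED_SECTIONS.filter((section) => !body.includes(section));
if (missing.length > 0) {
  console.error(`CONTEXT.md is missing sections: ${missing.join(", ")}`);
  process.exit(1);
}
```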

3. Metrics That Actually Mattered

We stopped tracking:

  • Lines of code written
  • PR velocity
  • Individual developer speed

We started tracking:

  • Review time per line of code (by source type)
  • Test coverage percentage (by source type)
  • Rollback rate within 48 hours (by source type)
  • Time from PR merge to production (by source type)
  • MTTR for failures (by source type)
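
None of these require fancy tooling once deploys carry the same source tag as the commits. As one example, here's a sketch of the 48-hour rollback rate computation; the deploy-record shape is an assumption about what your deploy log exposes.

```typescript
// rollback-rate.ts -- illustrative sketch: 48-hour rollback rate by code source.
// Assumes a deploy log where each entry records its source tag and, if it was
// rolled back, when. The field names are assumptions for illustration.
type Source = "ai-assisted" | "human-primary";

interface Deploy {
  source: Source;
  deployedAt: Date;
  rolledBackAt?: Date;
}

function rollbackRateWithin48h(deploys: Deploy[]): Record<Source, number> {
  const WINDOW_MS = 48 * 60 * 60 * 1000;
  const totals: Record<Source, { deploys: number; rollbacks: number }> = {
    "ai-assisted": { deploys: 0, rollbacks: 0 },
    "human-primary": { deploys: 0, rollbacks: 0 },
  };

  for (const d of deploys) {
    totals[d.source].deploys += 1;
    const rolledBackInWindow =
      d.rolledBackAt !== undefined &&
      d.rolledBackAt.getTime() - d.deployedAt.getTime() <= WINDOW_MS;
    if (rolledBackInWindow) totals[d.source].rollbacks += 1;
  }

  const rate = (t: { deploys: number; rollbacks: number }) =>
    t.deploys === 0 ? 0 : t.rollbacks / t.deploys;

  return {
    "ai-assisted": rate(totals["ai-assisted"]),
    "human-primary": rate(totals["human-primary"]),
  };
}
```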

Results After Adjustment

After implementing these changes and running for another 60 days:

  • Total cycle time: AI code now 12% faster than human code (reversed the initial finding)
  • Change failure rate: Reduced to 8% higher for AI code (down from 30%)
  • MTTR improved 25% overall because we were catching issues earlier in the pipeline
  • Security vulnerabilities in AI code: Reduced to 15% higher (down from 52%)

The Uncomfortable Questions This Raises

1. The Honor System Problem

We relied on developers to self-tag their commits as AI-assisted vs human-primary. I’d estimate this is 75% accurate at best. Some engineers don’t tag accurately. Some don’t want to admit how much AI they’re using. Some genuinely don’t know where to draw the line.

Does anyone have a better way to track AI attribution at the commit level?

2. The Technical Debt We’re Not Measuring

AI code that passes review and ships successfully might still be creating long-term technical debt we won’t discover for 6-12 months. Our experiment was 90 days. That’s not long enough to measure second-order effects.

How do you measure technical debt creation rate by code source? I don’t have a good answer yet.

3. The Team Dynamic Implications

We noticed that senior engineers started gravitating toward “AI code review specialist” as an informal role. They got good at spotting AI patterns and understanding failure modes.

Is this a good specialization? Or are we creating a future where only senior engineers can effectively work with AI?

4. The Attribution Problem

When AI-assisted code fails, who’s accountable? The engineer who prompted it? The reviewer who approved it? The AI vendor? Our company policy says “the engineer,” but that feels increasingly strained.

What I’d Do Differently

If I were starting this experiment today:

  1. Track context quality, not just code quality - Measure how well engineers communicate intent to AI
  2. Longer timeframe - 90 days isn’t enough to understand technical debt creation
  3. More granular tagging - Percentage AI-assisted (0-100%) rather than binary
  4. Business impact correlation - Link code source to customer value created (not just code shipped)

What Have Others Tried?

I’m sharing this because I think we’re all fumbling through this transition together. These metrics are better than what we had before, but I’m not confident they’re the right metrics.

Has anyone else run similar experiments? What did you learn? What metrics ended up mattering that you didn’t expect?

Because here’s the truth: we’re going to make major architectural and organizational decisions based on these metrics. We’d better get them right.


Replies:

Luis, this is exactly the kind of experimental rigor we need right now. I love that you actually tracked AI vs human code separately for 90 days - that’s the level of measurement discipline that’s missing in most organizations.

Your finding about the 30% higher change failure rate for AI code resonates deeply. We saw something similar when we started monitoring AI contributions: the code shipped faster, but our rollback rate spiked. The bottleneck shift you identified - from coding to review and integration - is the critical insight here.

What’s particularly interesting is your solution: automated review gates specifically designed for AI code. We took a different approach and invested in AI-powered review tools (GitHub Copilot for code review, specifically), which helped us match the velocity. But I wonder if we’re just adding more AI to solve problems created by AI.

Your 25% MTTR improvement after adjusting the delivery system is impressive. That suggests the issue isn’t AI code quality per se, but rather a mismatch between our traditional review processes and AI-generated output.

Here’s my question: How did you get organizational buy-in for a 90-day experiment that essentially treated AI code as “different”? I imagine there was pushback from engineers who didn’t want their AI-assisted work flagged or scrutinized differently.

Also, did you track developer satisfaction during this period? I’m curious if engineers felt the separate review process was helpful or if it created friction in the workflow.

The measurement framework needs to evolve with the tooling - that’s the meta-lesson here. We can’t use metrics designed for human-written code to evaluate hybrid human-AI systems. Thanks for running this experiment and sharing the learnings.

Luis, this experiment is gold. The data you’re sharing here answers questions I’ve been struggling with for months.

The organizational design implications of your findings are what jump out at me. You wrote: “Different review processes” for AI code. That’s not just a tooling change - that’s a structural change to how engineering teams operate.

We actually went down a similar path but took it further. After seeing that AI code had different characteristics (faster to write, different failure modes), we created a dedicated “AI Integration Engineer” role on each team. Their job is to:

  1. Review AI-generated code specifically for common AI patterns (over-abstraction, unnecessary complexity, edge case blindness)
  2. Act as the bridge between engineers using AI and traditional code review processes
  3. Build team-specific guidelines for when to use AI vs when to write from scratch

The results have been interesting. Teams with dedicated AI integration focus saw their change failure rate for AI code drop from ~35% to ~18% within 6 weeks. But here’s the uncomfortable part: We’ve essentially created a two-tiered system where AI code gets different treatment than human code.

This raises questions about equity and team dynamics:

  • Are we signaling that AI-assisted engineers need more oversight?
  • Does this create a perception gap between “real” engineers and “AI-assisted” engineers?
  • How do we prevent this from becoming a permanent class distinction?

Your approach of automated gates might actually be better from an organizational health perspective. It treats the AI tool as the variable, not the engineer using it.

Question for you: Did you see differences in how senior vs junior engineers responded to having their AI code reviewed differently? I’m wondering if experience level affects how engineers leverage AI and whether that shows up in your metrics.

Thanks for running this experiment with such transparency. The 90-day window is perfect - long enough to see patterns, short enough to course-correct if needed.

Luis, I’m so glad you ran this experiment because it validates something we’ve been seeing from the design-engineering workflow side.

Your finding that the bottleneck shifted from coding to review is exactly what happened when our eng team started using AI tools more heavily. Suddenly designers became the constraint - we couldn’t review UI implementations fast enough to match the new velocity.

But here’s what’s interesting from a design perspective: AI-generated code needs design review earlier in the cycle, not later.

Traditional workflow:

  • Engineer writes feature → Designer reviews implementation → Iterate if needed

AI-accelerated workflow that broke:

  • Engineer uses AI to generate feature in 40% less time → Designer gets pinged for review → Design finds 5 major UX issues → Engineer has to refactor → Velocity gains disappear

What works better:

  • Designer provides component-level guidance upfront → Engineer uses AI with design system constraints → AI generates code that’s pre-validated → Design review is faster

We ended up implementing what we call “design linting” for AI-generated UI components. It’s a set of automated checks that catch common AI mistakes:

  • Missing ARIA labels (AI forgets accessibility constantly)
  • Inconsistent spacing/sizing vs our design tokens
  • Over-nested component structures
  • Missing responsive breakpoints
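
To give a flavor of what one of these checks looks like, here's a stripped-down sketch of the design-token spacing rule; the token scale and the data shape are illustrative, not our actual config. The ARIA checks mostly lean on existing tooling such as eslint-plugin-jsx-a11y rather than anything custom.

```typescript
// design-lint.ts -- illustrative sketch of one "design lint" check: flag spacing
// values that are off the design-token scale. The token values and the way style
// usages are collected are assumptions; in practice this runs over parsed styles.
const SPACING_TOKENS_PX = new Set([0, 4, 8, 12, 16, 24, 32, 48, 64]); // hypothetical scale

interface StyleUsage {
  file: string;
  property: string; // e.g. "margin-top"
  valuePx: number;
}

function offScaleSpacing(usages: StyleUsage[]): StyleUsage[] {
  const spacingProps = /^(margin|padding|gap)/;
  return usages.filter(
    (u) => spacingProps.test(u.property) && !SPACING_TOKENS_PX.has(u.valuePx),
  );
}

// Example: the kind of value AI tools often produce ("padding: 13px") gets flagged.
console.log(offScaleSpacing([
  { file: "Card.tsx", property: "padding", valuePx: 13 },
  { file: "Card.tsx", property: "margin-top", valuePx: 16 },
]));
```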

Question for you: Did you track accessibility metrics separately for AI vs human code? We found that AI-generated UI had 3x more accessibility violations than human-written code, even when the human was less experienced.

The other thing I’m curious about: Did your experiment measure the quality of the review feedback? Like, were reviewers catching different types of issues in AI code vs human code? That might inform what kind of specialized review process AI code actually needs.

This is such valuable research. Would love to see more engineering orgs run these kinds of controlled experiments instead of just adopting AI tools and hoping for the best.

Luis, great experiment. As someone who has to justify engineering investments to the board, this is exactly the kind of data-driven analysis I wish more eng leaders would run.

But I have to push back on one thing: 40% faster initial development is impressive from an engineering metrics perspective, but what was the actual business impact after 90 days?

Here’s what I mean. You tracked:

  • Change failure rate (up 30% for AI code initially)
  • Initial development time (down 40% for AI code)
  • MTTR (improved 25% after process adjustments)

Those are all engineering metrics. But from a product/business lens, what I’d want to know is:

  1. Time from idea to customer value: Did the faster coding actually translate to faster feature delivery to customers?
  2. Feature quality: Did the features built with AI assistance perform better/worse in terms of user adoption, satisfaction, or business KPIs?
  3. Opportunity cost: Could the team ship more experiments/tests because of the velocity gains? Did any of those experiments unlock new revenue or prevent churn?

I’ve seen situations where engineering velocity goes up 40% but product velocity goes up 0% because the bottlenecks are elsewhere (design, product spec quality, go-to-market readiness, etc.).

Your finding about the bottleneck shifting from coding to review is critical. It suggests that optimizing coding speed without adjusting the full delivery system is like making one part of an assembly line 40% faster while the other parts stay the same.

Here’s my framework for thinking about this:

  • Efficiency metrics (what you tracked): Cycle time, failure rate, MTTR
  • Effectiveness metrics: Features shipped per sprint, A/B tests run, bugs found in production
  • Impact metrics: Customer adoption, revenue influenced, time-to-market vs competitors

You nailed the efficiency layer. I’m curious if you also measured effectiveness and impact, or if the 90-day window was too short to see those signals.

Also: What was the ROI calculation you presented to leadership after the experiment? Did the 25% MTTR improvement and adjusted processes justify continued AI tool investment?