Are we trading one cognitive load for another? The AI context-switching paradox in DevEx

Over the past six months, I’ve been leading my engineering teams at a Fortune 500 financial services company through AI adoption. The feedback from developers has been overwhelmingly positive—they consistently report feeling more productive, spending less time on boilerplate, and shipping features faster. But when I look at our team velocity metrics and delivery timelines, the story isn’t quite as clear. We’re seeing modest improvements, maybe 10-15%, but nothing close to the 20-50% productivity gains the research promises.

This disconnect led me down a research rabbit hole, and what I discovered challenges how we think about developer experience in the AI era.

The Three Pillars of Developer Experience

Recent research from DX, published in ACM Queue, identifies three core dimensions that determine how developers experience their work: feedback loops, cognitive load, and flow state. These aren’t abstract concepts; they’re practical areas we can observe, measure, and improve. For organizations that track developer experience, each one-point improvement in the score correlates with 13 minutes of developer time saved per week.

But here’s what caught my attention: AI tools impact all three dimensions, and not always in the ways we expect.

The Cognitive Load Trade-off

We adopted AI coding assistants expecting them to reduce cognitive load. And in some ways, they do—handling boilerplate code, generating test scaffolding, and answering “what does this error mean?” questions. Developers who report high understanding of their code feel 42% more productive than those who don’t.

But AI introduces a new pattern that researchers at JetBrains and UC Irvine call “stealth friction.” When developers use AI assistants, they engage in what I now recognize as a constant cycle:

  1. Write a prompt (context switch from code to instruction)
  2. Wait for generation (mental model interrupted)
  3. Review the output (switch from creating to evaluating)
  4. Debug and integrate (switch back to hands-on coding)

The fascinating—and concerning—part? In their study, 74% of developers didn’t notice they were context switching more frequently. The switching doesn’t feel like switching, but the cognitive cost accumulates.

What We’re Seeing on the Ground

In our financial services engineering teams, I’ve observed this play out in ways that surprised me. Our senior engineers, the ones who adopt new tools most enthusiastically, started reporting something unexpected: the minutes they saved generating boilerplate were often wiped out by the time spent reviewing, debugging, or completely rewriting AI-generated code.

One of my tech leads put it perfectly: “I’m no longer in the code—I’m managing the assistant.” That shift from creator to manager represents a fundamental change in cognitive load distribution.

The Productivity Paradox

This matches what the broader industry is experiencing. Over 75% of developers now use AI coding assistants, yet organizational productivity gains haven’t kept pace. Companies report seeing individual task completion speed up, but delivery velocity and business outcomes remain relatively flat.

The research suggests why: AI amplifies whatever organizational state already exists. In teams with strong processes, clear architecture, and fast feedback loops, AI acts as a force multiplier. In teams struggling with slow code reviews, unclear requirements, or brittle test suites, AI highlights and magnifies those existing problems.

The Feedback Loop Question

This brings me to what I think is the critical question: Should we optimize our feedback loops before layering AI on top of them?

When builds take hours instead of minutes, feedback loops are already broken. Adding AI that generates more code faster doesn’t fix the slow build—it just means developers write more code that they’ll wait longer to test. We’re optimizing the wrong part of the system.

Same with code review. If PR turnaround time is your bottleneck (research suggests same-day reviews are ideal), generating code faster with AI just creates a bigger backlog for reviewers.
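
A back-of-the-envelope way to see why faster code generation produces only modest delivery gains is Amdahl's law: when only part of the lead time gets faster, the untouched remainder caps the overall speedup. The sketch below is illustrative only. The 30% coding share and the 2x coding speedup are assumed numbers, not measurements from my teams.

```python
def overall_speedup(coding_fraction: float, coding_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only part of the work gets faster."""
    if not 0.0 <= coding_fraction <= 1.0:
        raise ValueError("coding_fraction must be between 0 and 1")
    if coding_speedup <= 0:
        raise ValueError("coding_speedup must be positive")
    untouched = 1.0 - coding_fraction              # builds, review, QA, waiting
    accelerated = coding_fraction / coding_speedup  # the part AI makes faster
    return 1.0 / (untouched + accelerated)


# Assumed for illustration: coding is 30% of lead time and AI doubles coding speed.
gain = overall_speedup(coding_fraction=0.30, coding_speedup=2.0)
print(f"End-to-end speedup: {gain:.2f}x")
```

Under those assumptions the end-to-end gain is roughly 18%, even though the coding step itself got twice as fast. Shrinking the untouched part (builds, review) moves the number far more than a faster assistant does.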

What Should We Measure?

I’m increasingly convinced that we’re measuring the wrong things. Velocity—lines of code, PRs merged, commits per developer—tells us about activity, not value. But cognitive load reduction and flow state preservation? Those are harder to quantify but ultimately more important.

The DX research framework suggests pairing quantitative telemetry (build times, deployment frequency, PR cycle time) with qualitative just-in-time surveys triggered by workflow events. Ask developers right after they submit a PR: “How clear were the requirements? How much context switching did you experience? Did you feel you had time for deep work?”
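
As a minimal sketch of what "triggered by workflow events" could look like, the snippet below creates a short survey whenever a PR is submitted. The three questions come from the paragraph above; the event shape (`type`, `author`, `pr_id`) and the `Survey` class are hypothetical and not tied to any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Questions from the post; everything else here is an assumed, illustrative shape.
PR_SURVEY_QUESTIONS = [
    "How clear were the requirements?",
    "How much context switching did you experience?",
    "Did you feel you had time for deep work?",
]


@dataclass
class Survey:
    developer: str
    trigger: str
    questions: list
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def on_workflow_event(event: dict) -> Optional[Survey]:
    """Return a survey for PR submissions; ignore every other event type."""
    if event.get("type") != "pr_submitted":
        return None
    return Survey(
        developer=event["author"],
        trigger=f"pr:{event['pr_id']}",
        questions=list(PR_SURVEY_QUESTIONS),
    )


survey = on_workflow_event({"type": "pr_submitted", "author": "dev-a", "pr_id": 101})
print(survey.trigger, len(survey.questions))
```

Keeping the trigger identifier on the survey makes it possible to join responses back to telemetry such as PR cycle time later on.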

Looking for Perspectives

I’m curious how other engineering leaders and teams are navigating this:

  • What are you measuring to understand AI’s real impact—velocity metrics or something else?
  • Have you noticed the “stealth friction” of AI context switching on your teams?
  • How do you balance the speed of AI generation against the cognitive cost of review and integration?
  • Are you optimizing feedback loops first, or introducing AI first and fixing processes later?

In my experience leading diverse engineering teams across multiple time zones, the best solutions come from diverse perspectives. What’s working—or not working—for you?



This hits home in a way I didn’t expect.

When we started building our design system at Confluence, I was thrilled about AI tools for component generation. Need a button variant? AI scaffolds it in seconds. Need a card component? Done. The speed felt intoxicating—until we started reviewing the output.

Here’s what I learned from my failed startup (and yes, I’m still processing those lessons): speed without understanding creates technical debt. And AI-generated components are the poster child for this.

The Review Burden Is Real

My design systems team spent weeks cleaning up AI-generated components because the AI didn’t understand our design token architecture. It would hardcode colors instead of using tokens. It would create responsive breakpoints that didn’t match our spacing scale. Every shortcut in generation became debt in maintenance.

But the bigger issue? Accessibility. AI-generated components consistently missed ARIA labels, semantic HTML, and keyboard navigation patterns. We caught it in review, but think about all the teams that don’t have dedicated accessibility expertise. They’re shipping AI-generated code that works visually but fails for screen reader users.

The “Time to Production-Ready” Metric

Luis, your point about measuring the wrong things resonates. In design, we learned this the hard way at my startup. We optimized for “time to first design mockup” when we should have optimized for “time to validated, buildable, shippable design.”

With AI, I think we’re making the same mistake with code. Time to first draft is meaningless if the draft requires hours of debugging, accessibility fixes, and architectural refactoring to become production-ready.

Our team’s current approach:

  • Use AI for scaffolding only - the basic structure
  • Human review for integration - does this fit our architecture?
  • Human decision for patterns - is this the right abstraction?

It’s slower than pure AI generation, but faster than writing everything from scratch. More importantly, the final code is understandable and maintainable.

The Context Switch You Didn’t Mention

There’s another cognitive load issue I’ve noticed: the temptation to accept mediocre solutions because “AI suggested it.”

When you write code from scratch, you iterate naturally. But when AI gives you something that works (even if it’s not great), there’s psychological pressure to accept it and move on. You already invested the mental energy in prompt engineering and review—why start over?

That acceptance of “good enough” AI output is its own form of cognitive load. It’s easier in the moment, but it accumulates as tech debt that creates cognitive load for every future developer who touches that code.

What if the real question isn’t “How do we optimize AI workflows?” but “How do we teach engineers to treat AI suggestions with the same healthy skepticism they’d apply to Stack Overflow answers?”

Luis, your observation about the 10-15% gains versus the promised 20-50% is exactly what we’re seeing across the industry, and I think the explanation reveals something crucial about organizational readiness.

The AI Productivity Paradox Is Real

At our mid-stage SaaS company, we’ve been tracking AI adoption meticulously since Q3 2025. The data tells a clear story:

  • 75%+ developer adoption of AI coding assistants
  • Individual task completion speed increased by 20-35%
  • Organizational delivery velocity increased by… 8-12%

Where did the gains go?

AI as Organizational Diagnostic

After digging into this with our engineering teams in Seattle and Austin and with our remote engineers, we discovered something unexpected: AI amplifies whatever organizational state already exists.

In teams with:

  • ✅ Strong code review processes
  • ✅ Clear architectural patterns
  • ✅ Fast CI/CD pipelines
  • ✅ Well-documented systems

AI acted as a genuine force multiplier. Developers wrote code faster and the organization shipped faster.

In teams struggling with:

  • ❌ Slow PR review cycles (3-5 day turnaround)
  • ❌ Unclear requirements and frequent pivots
  • ❌ Brittle test suites that broke constantly
  • ❌ Legacy systems without documentation

AI made things worse. Developers generated code faster, which meant:

  • More PRs backing up in review queues
  • More integration failures from AI code that didn’t understand legacy constraints
  • More context switching as developers waited on blocked PRs and started new work
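
A toy model of the first bullet, assuming a fixed review capacity: once PRs arrive faster than reviewers can clear them, the backlog grows every day. The rates below are made up for illustration and are not our actual numbers.

```python
def review_backlog(prs_opened_per_day: float, prs_reviewed_per_day: float, days: int) -> list:
    """Backlog size at the end of each day for a fixed-capacity review queue."""
    backlog = 0.0
    history = []
    for _ in range(days):
        # The backlog can never go negative: idle review capacity is not banked.
        backlog = max(0.0, backlog + prs_opened_per_day - prs_reviewed_per_day)
        history.append(backlog)
    return history


# Hypothetical team whose reviewers clear 10 PRs per day.
before_ai = review_backlog(prs_opened_per_day=9, prs_reviewed_per_day=10, days=10)
with_ai = review_backlog(prs_opened_per_day=12, prs_reviewed_per_day=10, days=10)
print(before_ai[-1], with_ai[-1])
```

In this sketch the queue stays empty before AI, but with just two extra PRs a day it reaches 20 waiting PRs after ten working days, with no change in reviewer behavior.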

Using AI Rollout as a Diagnostic Tool

Here’s what we did that changed our approach: We treated AI adoption as a diagnostic for organizational health.

When a team reported that AI “wasn’t helping” or “created more work,” we didn’t blame the tool. We investigated:

  1. What’s your PR review turnaround time?
  2. How often do builds break?
  3. Do developers understand the architecture of the code they’re modifying?
  4. Are requirements clear before coding starts?

In every case where AI “failed,” we found underlying process problems that AI exposed.

The Implementation Sequence That Worked

Based on this insight, we changed our rollout strategy:

Phase 1: Fix Feedback Loops (2-3 months)

  • Target: Same-day code review turnaround
  • Reduce build times from 45 minutes to under 10 minutes
  • Implement just-in-time developer surveys after PR submissions
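
To make a target like same-day review turnaround checkable, you need the timestamps for when each PR was opened and when it got its first review. Here is a minimal sketch with made-up records; pulling the real timestamps from your Git host is left out, and the sample data is purely illustrative.

```python
from datetime import datetime
from statistics import median


def review_turnaround_hours(opened_at: datetime, first_review_at: datetime) -> float:
    """Hours between opening a PR and its first review."""
    return (first_review_at - opened_at).total_seconds() / 3600


# Made-up sample: (opened, first review)
prs = [
    (datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 6, 15, 0)),   # 6 h, same day
    (datetime(2025, 1, 6, 11, 0), datetime(2025, 1, 8, 11, 0)),  # 48 h, two days later
    (datetime(2025, 1, 7, 10, 0), datetime(2025, 1, 7, 14, 0)),  # 4 h, same day
]

hours = [review_turnaround_hours(opened, reviewed) for opened, reviewed in prs]
same_day_share = sum(opened.date() == reviewed.date() for opened, reviewed in prs) / len(prs)
print(f"median turnaround: {median(hours):.1f} h, same-day share: {same_day_share:.0%}")
```

Tracking the same-day share alongside the median is a deliberate choice here: a healthy median can hide a long tail of PRs that sit for days.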

Phase 2: Reduce Cognitive Load (1-2 months)

  • Improve documentation for critical systems
  • Clarify decision rights (who can approve what)
  • Establish architectural patterns and guardrails

Phase 3: Enable Flow State (ongoing)

  • Protect deep work time blocks
  • Reduce meeting fragmentation
  • Create clear sprint boundaries

Phase 4: Reintroduce AI (with better results)

  • Now AI enhanced already-strong processes
  • Developers had the context to evaluate AI suggestions
  • Review cycles handled increased throughput

The CFO Question

The business side asks: “Why invest in process improvement before AI? Isn’t AI supposed to make us faster despite bad processes?”

My answer: AI is a multiplier, not an adder.

If you multiply a broken process by AI speed, you get a faster broken process. Fix the process first, then multiply it.

The math:

  • Broken process × 2× AI speed = 2× broken output
  • Fixed process × 2× AI speed = 2× quality output

To Answer Your Questions Directly

What are you measuring?

We measure both:

  • Velocity: Deployment frequency, lead time, PR cycle time
  • Developer experience: Cognitive load surveys, flow state self-reports, “would you recommend this team to a friend?”

But we weight DX metrics higher because they’re leading indicators. Velocity follows good DX, not the other way around.

Are you optimizing feedback loops first or introducing AI first?

Feedback loops first, always. We learned this the hard way by doing it backward initially.

The teams that succeed with AI are the teams that didn’t need AI to be productive—they were already high-functioning. AI just made them faster.

The teams that struggle with AI are revealing organizational problems that need fixing regardless of what tools we give them.

This conversation is surfacing something I’ve been wrestling with as we scale our EdTech engineering org from 25 to 80+ engineers: the cognitive load issue isn’t just about the tool—it’s about when and how we use it.

Michelle, your phased approach resonates, but I want to add a people dimension that I think gets overlooked in these discussions.

The Mentorship Gap Nobody’s Talking About

When I look at our junior engineers using AI assistants, I see a pattern that concerns me:

Before AI:

  • Junior dev hits a problem → asks senior engineer
  • Senior explains the why, not just the how
  • Knowledge transfers, patterns propagate, junior grows

With AI:

  • Junior dev hits a problem → asks AI
  • AI provides code that works
  • Junior moves on without understanding
  • Pattern comprehension never happens

Our senior engineers started noticing this 4-5 months into AI adoption. One of my engineering managers (formerly at Google, knows mentorship) put it bluntly: “The juniors are getting answers, but they’re not learning to think.”

The Context Switching Asymmetry

But here’s where it gets interesting: The type of cognitive load differs by experience level.

Senior engineers using AI experience what Luis described:

  • The prompt → wait → review → debug cycle
  • The shift from “in the code” to “managing the assistant”
  • The burden of validating AI suggestions against architectural knowledge

Junior engineers experience something different:

  • Relief from not having to interrupt senior engineers
  • Speed of getting unblocked
  • But also: No development of pattern recognition skills

And honestly? The junior engineers don’t realize what they’re missing. They feel productive because they’re shipping code. But they’re not building the mental models that make you a senior engineer.

Our Experiment: AI-Free Time Blocks

Based on this insight, we tried something unconventional at our EdTech startup:

AI-enabled time (afternoons):

  • Integration work, boilerplate generation, test writing
  • Use AI for scaffolding and repetitive tasks
  • Fast iteration on known patterns

AI-free time (mornings):

  • Deep architecture work
  • Pair programming sessions
  • Design discussions and code reviews

The results surprised us:

  1. Flow state improved: Developers reported better focus during morning deep work
  2. Knowledge transfer increased: Pair programming naturally happened more during AI-free time
  3. AI usage became more intentional: Instead of constant assistant mode, developers saved AI for specific use cases
  4. Junior skill development recovered: Juniors started asking seniors “why” questions again

Treating AI Like Meetings

Maya, your question about teaching skepticism is spot-on, but I’d frame it differently:

What if we treat AI tools the way we treat meetings—something to schedule, not a constant interruption?

You wouldn’t let your calendar fill with back-to-back meetings all day because you know it destroys flow state and deep work. Why do we let AI assistants interrupt our thought process constantly?

The research Luis cited talks about context switching costs that 74% of developers don’t notice. That’s exactly what happens with always-on meeting culture too—people don’t notice the cost until you force a meeting-free day and they realize how much they accomplished.

The Diversity and Inclusion Angle

I have to add this because it’s central to my work: AI productivity tools risk amplifying existing disparities.

Who benefits most from AI code generation?

  • Developers with strong foundational knowledge who can evaluate output
  • People from strong CS programs who learned patterns deeply
  • Engineers with mentors who taught them why, not just what

Who struggles?

  • Career changers and bootcamp grads still building fundamentals
  • Underrepresented engineers who already face more scrutiny in code review
  • Juniors from non-traditional backgrounds learning on the job

If AI becomes the default knowledge source instead of human mentorship, we’re building a two-tier system: those who can critically evaluate AI suggestions and those who blindly accept them.

What We’re Measuring Now

To answer Luis’s questions directly:

What are you measuring?

We added two DX metrics specifically for AI:

  1. “AI assistance quality”: 1-5 scale survey after each PR - “Did AI help or create extra work?”
  2. “Learning and growth”: Monthly survey - “Are you learning new patterns or just shipping code?”

The second one is crucial. Productivity without growth is a ticking time bomb.

How do you balance speed vs cognitive cost?

Time-box AI usage. Sounds simple, but it works. Developers report that knowing they have “AI time” later makes them more focused during “deep work time.”

The Real Question

Michelle’s framework about AI as a multiplier is exactly right. But I’d add:

AI multiplies your current trajectory.

If you’re on a path toward strong engineering practices and continuous learning, AI accelerates you.

If you’re on a path toward technical debt and skill stagnation, AI accelerates that too.

Which path are we multiplying?

Coming from the product side, this conversation is fascinating because I keep seeing parallels to mistakes we made in early-stage product development—and I think engineering teams are making similar mistakes with AI metrics.

The Vanity Metrics Problem

In product, we learned to distinguish between:

  • Vanity metrics: Sign-ups, page views, downloads
  • Business metrics: Activation, retention, revenue

Vanity metrics feel good—they go up! But they don’t predict business success.

Reading this thread, I see engineering teams measuring:

  • Vanity metrics: Lines of code, PRs merged, commits per developer
  • Business metrics: Features delivered, customer value shipped, time from idea to validated learning

Luis’s observation about 10-15% team velocity gains despite individual 20-50% speed increases? That’s the vanity metric gap.

The Customer Development Parallel

When I was at Airbnb, we had a PM who kept shipping features users requested without validating whether those features solved real problems. Usage metrics looked great… until we realized users weren’t retaining.

With AI coding, I’m seeing the same pattern:

What we’re measuring:

  • “How fast can we write code?”
  • “How many PRs did we merge?”
  • “Did developers report feeling productive?”

What we should be measuring:

  • “How fast did we deliver customer value?”
  • “How many features reached production and worked correctly?”
  • “Did customers notice any improvement?”

The Integration Testing Tax

Here’s what I observed leading product for our Series B fintech startup:

One quarter, our engineering team shipped 60% more commits. I was thrilled! But when I looked at feature delivery to customers, we only shipped 15% more features.

Why the gap?

The engineers explained: AI-generated code required significantly more integration testing and bug fixes. Fast code generation created a backlog in QA, customer acceptance testing, and edge case discovery.
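
The size of that gap is easier to see as a ratio. This quick calculation uses only the two percentages above; treating a commit as a unit of effort is a simplifying assumption on my part.

```python
commit_growth = 0.60   # 60% more commits that quarter
feature_growth = 0.15  # 15% more features delivered to customers

# How much feature output each commit produced, relative to the previous quarter.
features_per_commit = (1 + feature_growth) / (1 + commit_growth)

# The inverse view: how many commits each feature needed, relative to before.
commits_per_feature = (1 + commit_growth) / (1 + feature_growth)

print(f"features per commit vs. baseline: {features_per_commit:.2f}")
print(f"commits per feature vs. baseline: {commits_per_feature:.2f}")
```

Each commit carried about 28% less feature output than before, or put the other way, each shipped feature took about 39% more commits.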

From a product perspective, AI optimized the wrong part of the value stream. It’s like optimizing checkout page load time when your real problem is an unclear product value proposition.

Jobs to Be Done Framework for AI Tools

In product thinking, we ask: “What job is the customer hiring this product to do?”

Apply that to AI coding assistants:

Job we think developers are hiring AI to do:
“Write code faster”

Job developers are actually hiring AI to do:
Could be any of these:

  • “Get unblocked when I’m stuck”
  • “Generate boilerplate so I can focus on business logic”
  • “Learn patterns in an unfamiliar codebase”
  • “Avoid asking senior engineers obvious questions”
  • “Meet velocity targets my manager set”

Each of those jobs requires different product features and different success metrics.

If we’re measuring “lines of code generated” but developers hired the tool to “learn patterns,” we’re optimizing the wrong thing.

The Time-to-Value Metric

Michelle’s point about fixing feedback loops first is exactly right from a product lens.

In product, we learned: Measure time from idea to customer feedback, not time from idea to code written.

AI coding assistants optimize “time from idea to code written.” But that’s not where value is created.

Value is created when:

  1. Code reaches production
  2. Customers use the feature
  3. We validate it solved their problem
  4. We learn what to build next

If AI speeds up writing the code but creates bottlenecks in steps 1-4, you’ve optimized the wrong constraint.

The Question I’d Ask Your Teams

Keisha’s point about junior engineers not learning patterns hits home because we see this in product too—PMs using data dashboards without understanding statistical significance.

Here’s what I’d ask your engineering teams:

“If the AI disappeared tomorrow, would your code quality go up or down?”

If the answer is “up”: the team is leaning on AI output that lowers quality, so you have a dependency problem.
If the answer is “down”: AI is genuinely helping.
If the answer is “same, just slower”: AI is optimizing the wrong thing.

The Real Metric: Customer Impact Per Engineer

At our startup, I’m pushing for a different metric:

Not: “How many lines of code per engineer?”
But: “How much customer value delivered per engineer per sprint?”

This forces the conversation away from activity metrics toward outcome metrics.

When we measure this way, AI’s impact becomes clearer:

  • Did we ship more features that customers actually used?
  • Did we reduce time from customer request to deployed solution?
  • Did we improve product quality (fewer bugs, better UX)?
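
"Customer value" needs an operational definition before it can be tracked. One possible stand-in, sketched below, counts only the shipped features that customers actually used and divides by team size. The `ShippedFeature` shape, the `active_customers` field, and the adoption threshold are all hypothetical; your own definition of value may look quite different.

```python
from dataclasses import dataclass


@dataclass
class ShippedFeature:
    name: str
    active_customers: int  # customers who used the feature (hypothetical telemetry field)


def adopted_features_per_engineer(features, engineers: int, min_active_customers: int = 1) -> float:
    """Features that customers actually used this sprint, divided by team size."""
    if engineers <= 0:
        raise ValueError("engineers must be positive")
    adopted = [f for f in features if f.active_customers >= min_active_customers]
    return len(adopted) / engineers


# Made-up sprint: three features shipped, one never used by any customer.
sprint = [
    ShippedFeature("bulk-export", active_customers=40),
    ShippedFeature("dark-mode", active_customers=0),
    ShippedFeature("saved-filters", active_customers=12),
]
print(adopted_features_per_engineer(sprint, engineers=8))
```

The unused feature contributes nothing to the score, which is the point: shipping it counted as activity, not as an outcome.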

To Luis’s Original Question

Are we trading one cognitive load for another?

From a product perspective: Yes, and we’re not measuring the trade-off correctly.

We’re measuring the cognitive load reduction (less time writing boilerplate) but not measuring the cognitive load increase (context switching, review burden, integration debugging).

In product terms, we’re measuring benefits but not costs. That’s how you end up with a beautiful feature that tanks retention because you didn’t measure the onboarding friction it created.


Maya’s framing is perfect: Are we optimizing for “time to first draft” when we should optimize for “time to production-ready”?

I’d add: Are we optimizing for “features coded” when we should optimize for “customer problems solved”?

The best product teams learned to measure outcomes, not outputs. Maybe the best engineering teams need to make the same shift.