DORA, SPACE, DX Core 4 – Are We Measuring Engineering Productivity or Just Activity?

Over the past four years, we’ve watched engineering productivity measurement evolve from DORA alone to DORA + SPACE, and now, in 2026, to the unified DX Core 4 framework. Each iteration promises better insight into how our teams actually work. Each framework gets more sophisticated, with more dimensions, more metrics, more surveys.

And yet, I’m starting to wonder: are we measuring the right things?

The Framework Evolution

At my company—a Fortune 500 financial services firm—we’ve implemented all three frameworks over the years:

  • DORA gave us deployment frequency, lead time, change failure rate, and time to restore. Clean, actionable, focused on delivery and stability (a minimal calculation sketch follows this list).
  • SPACE added satisfaction, performance, activity, communication, and efficiency. More holistic, but also more complex to instrument.
  • DX Core 4 now unifies these into four dimensions: Speed, Effectiveness, Quality, and Impact. The “Impact” dimension is particularly compelling—measuring the percentage of time spent on new capabilities vs. maintenance. That’s a business outcome, not just an engineering activity metric.
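
For concreteness, here’s a minimal sketch of how the four DORA metrics can be derived from deployment and incident records. The event shapes (Deploy, Incident) and field names below are my own illustration, not the schema of any particular tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median

@dataclass
class Deploy:
    committed_at: datetime   # first commit behind the change
    deployed_at: datetime
    caused_failure: bool     # did this change degrade production?

@dataclass
class Incident:
    started_at: datetime
    restored_at: datetime

def dora_metrics(deploys: list[Deploy], incidents: list[Incident], window_days: int = 30) -> dict:
    """The four DORA keys over a trailing window (illustrative event shapes)."""
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    restore_times = [i.restored_at - i.started_at for i in incidents]
    return {
        "deploys_per_week": len(deploys) / (window_days / 7),
        "median_lead_time": median(lead_times) if lead_times else timedelta(0),
        "change_failure_rate": sum(d.caused_failure for d in deploys) / max(len(deploys), 1),
        "median_time_to_restore": median(restore_times) if restore_times else timedelta(0),
    }
```

The arithmetic is trivial; the appeal is that all four numbers come from systems of record rather than surveys.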

But here’s the paradox: the more sophisticated our frameworks become, the less clarity we seem to have about actual productivity.

The AI Productivity Mirage

This year, the data is impossible to ignore:

  • 41% of all code written in 2025 was AI-generated
  • 76% of developers now use AI coding assistants
  • Individual coding speed is up—dramatically, for simple tasks

But at the organizational level:

  • Code churn increased by 41%
  • Delivery stability dropped 7.2%
  • Security vulnerabilities in AI-assisted code are up 23.7%
  • Only 33% of developers trust AI results

So developers are writing code faster, but we’re shipping less reliably, with more technical debt and more vulnerabilities. The productivity gains at the individual level aren’t translating to organizational throughput.

Outputs vs. Outcomes

I think the core issue is this: we’re still measuring outputs (commits, story points, lines of code) when we should be measuring outcomes (customer value delivered, system health, sustainable pace).

The moment you tell developers they’ll be evaluated on cycle time, you’ll get optimized cycle time—and everything else (code quality, documentation, mentoring, thoughtful design) will suffer. Goodhart’s Law in action.

Even the DX Core 4’s “Impact” dimension, which I genuinely like, is hard to implement in practice. How do you categorize work as “new capability” vs. “maintenance” when most real-world changes involve both? How do you avoid creating perverse incentives where teams game the classification to look more innovative?
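
One partial mitigation I’ve been toying with, though not a full answer: let each change carry a fractional split between new capability and maintenance instead of a binary label, so mixed work doesn’t force a false choice. A minimal sketch, assuming hypothetical self-reported weights:

```python
# Sketch: fractional allocation instead of a binary "new" vs. "maintenance" label.
# The weights are self-reported at ticket close-out; field names are hypothetical.
changes = [
    {"hours": 16, "new_capability": 0.7},  # feature work that also refactored old code
    {"hours": 8,  "new_capability": 0.0},  # pure bug fix
    {"hours": 12, "new_capability": 0.5},  # half new endpoint, half migration cleanup
]

total_hours = sum(c["hours"] for c in changes)
new_hours = sum(c["hours"] * c["new_capability"] for c in changes)
print(f"Impact (time on new capabilities): {new_hours / total_hours:.0%}")  # -> 48%
```

It doesn’t remove the gaming risk, but it at least stops punishing honest answers about mixed work.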

What I’m Struggling With

DORA metrics work because they tie delivery speed directly to system stability. You can’t game both simultaneously—if you optimize for deployment frequency while ignoring quality, your change failure rate will expose you.

But with AI changing how we write code, are even DORA metrics still sufficient? When 41% of code is AI-generated and code churn is up 41%, traditional velocity metrics feel disconnected from actual value delivery.

I don’t have answers here—just questions:

  • What metrics actually predict success at your organizations?
  • How are you adapting measurement to account for AI-assisted development?
  • Have you found ways to tie engineering metrics to business outcomes without creating perverse incentives?
  • Is the DX Core 4 framework overkill, or does the additional complexity actually yield better insight?

I’d love to hear how other engineering leaders are thinking about this. Because right now, I’m worried we’re optimizing for measurement theater instead of actual productivity.


This hits so close to home. We just went through this exact cycle at our EdTech startup—implemented velocity tracking, watched the team optimize for story points closed, and then realized we’d created a system where people avoided complex, high-impact work in favor of easy wins that moved the numbers.

The outputs vs. outcomes distinction you’re drawing is critical. But I’d add: there’s also a human cost when metrics become misaligned with actual value.

The Velocity Optimization Trap

Last quarter, one of my senior engineers told me in our 1:1 that he felt pressured to take on smaller tickets because he knew larger architectural work would hurt his “productivity score.” This is someone who’d previously proposed and led a database migration that cut our query times by 60%. But the metrics didn’t capture that impact—they just showed him as “slower” for two sprints.

That’s when I realized: we’d accidentally created a system that punished exactly the kind of work that made the business successful.

AI Isn’t Just a Coding Problem

Your point about AI-generated code—41% of output but delivery stability down 7.2%—resonates deeply. But I think the issue is broader than just code quality.

At the organizational level, speed gains at the coding stage don’t matter if the bottleneck is somewhere else. In our case:

  • Product requirements still take the same time to clarify
  • Design reviews haven’t gotten faster
  • QA cycles haven’t shortened
  • Cross-team dependencies still block deploys

AI lets developers write code faster, but it doesn’t help them understand what to build, or navigate organizational complexity, or reduce coordination overhead. The system-level bottlenecks remain.

What We’re Trying Now

We’ve kept DORA metrics as our baseline—deployment frequency, lead time, change failure rate, time to restore. Those still work because, as you said, you can’t game speed and stability simultaneously.

But we’ve added qualitative team health surveys every month. Not just “are you satisfied?” but specific questions about:

  • Psychological safety to take on risky, high-impact work
  • Clarity on what “good” looks like beyond velocity
  • Alignment between individual work and team outcomes

The combination gives us leading indicators (team health) and lagging indicators (DORA metrics). When team health drops, we see DORA metrics degrade 4-6 weeks later.
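
One way to sanity-check that leading/lagging relationship is a simple lagged correlation. The monthly series below are invented for illustration; the real inputs would be your survey scores and a monthly DORA roll-up:

```python
import numpy as np

# Invented monthly series: team-health survey score (1-5) and change failure rate (%).
health = np.array([4.2, 4.1, 3.6, 3.4, 3.9, 4.0, 4.1, 3.5, 3.3, 3.8])
cfr    = np.array([3.1, 3.0, 3.2, 4.8, 5.5, 4.0, 3.2, 3.1, 4.9, 5.6])

for lag in range(3):  # does health at month t predict CFR at month t + lag?
    h = health[:-lag] if lag else health
    r = np.corrcoef(h, cfr[lag:])[0, 1]
    print(f"lag={lag} months: r={r:+.2f}")
# A negative r that peaks in magnitude at lag 1-2 means the survey is behaving
# as a leading indicator, consistent with the 4-6 week pattern we see.
```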

The Manager Behavior Question

Here’s what I’m wrestling with: manager behavior matters more than framework choice.

If managers treat metrics as performance scorecards instead of diagnostic tools, no framework will save you. DX Core 4 can be misused just as easily as story points.

The question I keep asking myself: How do you prevent metrics from becoming performance theater?

I don’t have a complete answer yet. But I think it starts with executive buy-in that metrics are for learning, not ranking. And it requires constant, explicit conversation about what we’re optimizing for—and what we’re willing to sacrifice to get there.

I’m coming at this from a design systems perspective, and honestly, this resonates hard. Metrics that ignore creative work, design thinking, and user research in the delivery pipeline are missing half the story.

My failed startup taught me the difference between vanity metrics and real progress the hard way. We had beautiful dashboards showing “features shipped” and “commits per week” while completely missing that users weren’t adopting what we built. We were measuring activity, not impact.

The Designer’s Take on Productivity Frameworks

Here’s what I find fascinating about DX Core 4’s “Effectiveness” dimension—the Developer Experience Index measured through surveys about flow state and cognitive load.

That actually matters.

I’ve worked with engineering teams where the culture was “ship fast, break things, fix later.” Developers were stressed, context-switching constantly, and dealing with fragmented tooling. The velocity looked good on paper. But the cognitive load was crushing, and burnout was real.

Compare that to teams with thoughtful developer experience: clear documentation, consistent patterns, time to think. They shipped sustainably. The metrics didn’t always show it immediately, but the long-term outcomes were dramatically better.

AI Tools = Faster Output ≠ Better Outcomes

This applies to design work too. I can use AI to generate interface variations 10x faster than I could by hand. But if I’m generating the wrong variations because I haven’t done proper user research, I’m just failing faster.

Speed without direction is just motion.

The parallel to your code churn observation is spot-on. Copy-paste patterns are up 48%? That’s like designers duplicating components without thinking through the system. You end up with 12 slightly different button styles instead of one flexible component. Technical debt accumulates whether it’s code or design.

What About Non-Engineering Work?

One thing that bugs me about DORA, SPACE, and even DX Core 4: they’re all engineering-centric frameworks.

But in most product development, engineering is only one part of the pipeline:

  • Product strategy and roadmap decisions
  • User research and validation
  • Design and prototyping
  • Cross-functional alignment
  • Go-to-market coordination

If you only measure engineering velocity, you miss the bottlenecks in discovery, design, and decision-making. You optimize the wrong part of the system.

Developer Experience vs. Product Experience

Here’s the question I keep asking: Are we measuring developer satisfaction or developer effectiveness?

Surveys can tell you if people feel productive. But feelings don’t always correlate with outcomes. I’ve seen teams that loved their tools and processes but weren’t shipping meaningful customer value.

DX Core 4 tries to address this with the “Impact” dimension—percentage of time on new capabilities. That’s directionally right. But as Luis pointed out, categorizing work as “new” vs. “maintenance” is ambiguous. Most real work is both.

And here’s the bigger question: What if new capabilities don’t solve customer problems?

We could spend 80% of our time building new features that users ignore, and the “Impact” metric would look great while the business fails.

Integrating Design and Product Metrics

What I’d love to see: frameworks that integrate engineering, design, and product metrics into a unified view of delivery effectiveness.

Something like:

  • Discovery health: Are we validating problems before building solutions?
  • Design-engineering collaboration: How smoothly does designed work flow into implementation?
  • Customer impact: Are the things we ship actually used and valued?
  • System sustainability: Can we maintain velocity without accumulating debt?

Most orgs measure these in silos. Engineering has DORA. Design has NPS or design system adoption. Product has feature usage. But they’re all part of one system.

My Hot Take

Framework proliferation is a symptom of a deeper problem: we’re trying to measure productivity without defining success.

DORA → SPACE → DX Core 4. Each framework gets more sophisticated, but they all assume we know what “good” looks like. And I’m not sure we do.

Maybe instead of adding more dimensions to measure, we should spend more time asking: What outcome are we actually trying to achieve? And is the thing we’re measuring actually predictive of that outcome?

From a product leadership perspective, this entire conversation is exactly what’s been missing from most engineering metrics discussions I’ve been in. Thank you for surfacing it.

Here’s my frustration: engineering metrics almost never connect to business outcomes in a meaningful way.

The Disconnect Between Fast Deploys and Customer Value

At my current company (B2B fintech), we had a quarter where engineering hit all their DORA targets:

  • Deployment frequency: ✅ 12 deploys/week (up from 8)
  • Lead time: ✅ 2.3 days (down from 4.1)
  • Change failure rate: ✅ 3.2% (within target)

The engineering team presented these metrics at the board meeting as evidence of high performance. And technically, they were right—by DORA standards, they were performing well.

But here’s what the board saw in the same quarter:

  • Customer acquisition: Flat
  • Activation rate: Down 8%
  • Churn: Up 2.1%
  • NPS: Dropped 6 points

We were shipping faster, but we weren’t solving customer problems. The features we deployed quickly weren’t the features customers needed. Engineering velocity was up, but business velocity was stalled.

That’s when I realized: velocity is only valuable if you’re pointed in the right direction.

Why I Love (and Fear) DX Core 4’s “Impact” Dimension

The “Impact” dimension in DX Core 4—measuring percentage of time spent on new capabilities vs. maintenance—is the first framework metric I’ve seen that attempts to bridge engineering activity and business outcomes.

But here’s the problem Luis identified: how do you categorize work as “new capability” vs. “maintenance”?

And more importantly: what if the new capabilities don’t move the business metrics?

We could spend 80% of our time building new features that:

  • Don’t solve validated customer problems
  • Don’t improve activation, retention, or revenue
  • Don’t support our go-to-market strategy

The “Impact” metric would look great. The business would still fail.

The AI Productivity Skepticism

The AI productivity data you cited—41% of code AI-generated, but delivery stability down 7.2%—mirrors what we’re seeing in product outcomes.

Bain’s research shows AI productivity gains are “unremarkable” when measured independently, while vendors claim 20-55% speedups. That’s a massive gap in reported effectiveness.

My theory: vendors measure individual task completion speed. Bain measures organizational throughput.

And organizational throughput is what matters for business outcomes. If developers write code 40% faster but:

  • Product requirements are still unclear
  • Design handoffs are still broken
  • QA cycles haven’t shortened
  • Cross-functional alignment still takes weeks

…then the business doesn’t actually ship 40% faster. The bottleneck just moved.

Framework Fatigue

Here’s what drives me crazy: every year, there’s a new framework, but the underlying problems stay the same.

DORA (2014) → SPACE (2021) → DX Core 4 (2024) → ??? (2027)

Each iteration adds more dimensions, more surveys, more complexity. But none of them solve the fundamental challenge: tying engineering work to customer value and business outcomes.

Maya’s right—we’re measuring productivity without defining success.

What Product Leaders Actually Need

If I could design a productivity measurement framework from scratch, here’s what I’d want (a rough instrumentation sketch follows the list):

1. Business outcome alignment

  • % of engineering time on initiatives tied to top 3 business goals
  • Correlation between deploys and customer-facing metrics (activation, retention, NPS)
  • Time from customer problem identification to solution deployment

2. Validated discovery

  • % of roadmap items validated with customer research before build
  • Feature adoption rate (are customers using what we ship?)
  • Time to validate or invalidate hypotheses

3. Cross-functional flow

  • Cycle time from idea → research → design → build → launch → measure
  • Handoff efficiency between product, design, and engineering
  • Rework rate (how often do we have to redo work due to unclear requirements?)

4. Sustainable pace

  • DORA metrics as baseline (you can’t sustain speed without stability)
  • Team health and burnout indicators
  • Technical debt trends
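
None of this exists as an off-the-shelf framework, so treat the following as a thought experiment: the four categories written down as a metrics catalog. Every source name is a placeholder for whatever systems you actually run, not a real integration:

```python
# Hypothetical metrics catalog: category -> (metric, where it might come from).
CATALOG = {
    "business_outcome_alignment": [
        ("pct_time_on_top3_goals", "time tracking joined with OKR tags"),
        ("problem_to_deploy_time", "ticket tracker timestamps"),
    ],
    "validated_discovery": [
        ("pct_roadmap_validated_pre_build", "research repository"),
        ("feature_adoption_rate", "product analytics"),
    ],
    "cross_functional_flow": [
        ("idea_to_launch_cycle_time", "ticket tracker timestamps"),
        ("rework_rate", "tickets reopened or re-scoped after handoff"),
    ],
    "sustainable_pace": [
        ("dora_four_keys", "CI/CD and incident tooling"),
        ("team_health_index", "monthly survey"),
    ],
}

for category, metrics in CATALOG.items():
    print(category)
    for name, source in metrics:
        print(f"  {name:35s} <- {source}")
```

Writing it down this way makes the gap obvious: half of these sources live outside engineering’s tooling, which is exactly why no engineering-only framework can cover them.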

Engineering frameworks like DORA and DX Core 4 measure the “build” phase. But product development is:

  • Discover (validate problems)
  • Design (prototype solutions)
  • Build (implement)
  • Measure (validate outcomes)

If you only optimize “build,” you miss 75% of the value creation process.

The Question I Keep Asking

How do we tie engineering velocity to revenue, retention, and customer satisfaction metrics without creating perverse incentives?

Because right now, most companies I know measure these in complete isolation:

  • Engineering: DORA, velocity, uptime
  • Product: Feature adoption, NPS, retention
  • Finance: Revenue, CAC, LTV

They’re all part of one system. But we measure them like they’re independent variables.

That’s the real problem. Not the frameworks themselves—but the fact that we’ve siloed measurement to the point where nobody can see the whole picture.

This is one of the best discussions I’ve seen on engineering productivity measurement. The cross-functional perspectives here—engineering leadership, product, design—are exactly what’s needed to move past framework theology toward actual organizational effectiveness.

Let me add the executive/CTO perspective, because this conversation touches on something I wrestle with constantly: how do you make measurement strategic instead of just operational?

Frameworks Are Tools, Not Solutions

I’ve seen DORA work brilliantly at Microsoft—tight feedback loops, strong engineering culture, clear alignment between velocity and business impact.

I’ve also seen DORA fail spectacularly at startups—teams optimizing for deployment frequency while shipping features nobody wanted, or letting speed metrics override sound architectural decisions.

The difference wasn’t the framework. It was whether measurement served strategy or replaced it.

When frameworks become the goal instead of the diagnostic tool, you get what Keisha called “performance theater”—metrics that look good but don’t predict success.

The Real Issue: Measurement Without Strategy

Luis asked: “Are we measuring the wrong outcomes?”

My answer: We’re measuring outcomes without defining the strategy those outcomes should serve.

DORA metrics are excellent for answering: “Are we delivering software reliably?”

SPACE adds: “Are our developers satisfied and effective?”

DX Core 4 attempts: “Are we spending time on high-impact work?”

But none of them answer: “Is what we’re building aligned with our business strategy and customer needs?”

That’s not a criticism of the frameworks—that’s fundamentally not what they’re designed to measure. But organizations treat them as if they can.

AI and the Quality Crisis Ahead

The data Luis cited is deeply concerning from a CTO perspective:

  • 41% of code is AI-generated
  • Security vulnerabilities up 23.7%
  • Code churn up 41%
  • Delivery stability down 7.2%

Here’s what keeps me up at night: we’re trading short-term velocity for long-term technical debt and security risk.

When I talk to other CTOs, we’re seeing the same pattern:

  • Developers love AI coding assistants (who wouldn’t? They’re faster)
  • Code review quality is declining (reviewers trust AI output too much)
  • Security issues are being introduced at higher rates
  • Technical debt is accumulating faster because AI code tends toward copy-paste patterns rather than thoughtful abstraction

And traditional velocity metrics don’t capture this degradation until it’s already a crisis.

My Approach: Layered Measurement

At my current company, we use a three-layer measurement approach:

Layer 1: DORA as baseline hygiene

  • Deployment frequency
  • Lead time for changes
  • Change failure rate
  • Time to restore service

These tell us: “Can we deliver software reliably?” If these numbers are bad, nothing else matters.

Layer 2: SPACE for diagnostic insight

  • When DORA metrics degrade, SPACE helps us understand why
  • Is it satisfaction? Communication? Efficiency?
  • We don’t optimize for SPACE metrics—we use them to diagnose bottlenecks

Layer 3: Custom strategic metrics

  • % of engineering time aligned to top 3 company OKRs
  • Customer-facing metric improvement per deploy (NPS, activation, retention)
  • Technical debt ratio (time spent on new features vs. paying down debt)
  • Security posture trend (vulnerability discovery rate, time to remediation)

This layered approach prevents us from optimizing any single dimension at the expense of the whole system.
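
As one concrete example from Layer 3, here’s a sketch of how a “% of engineering time aligned to top 3 company OKRs” number could fall out of ticket data. The label scheme and effort field are assumptions for illustration, not a real tracker API:

```python
# Sketch: share of engineering effort carrying a top-3 OKR label.
# Ticket shape and label names are hypothetical.
TOP_OKRS = {"OKR-activation", "OKR-retention", "OKR-enterprise"}

tickets = [
    {"effort_days": 5, "labels": {"OKR-activation"}},
    {"effort_days": 3, "labels": {"tech-debt"}},
    {"effort_days": 8, "labels": {"OKR-retention", "security"}},
    {"effort_days": 4, "labels": set()},  # untagged work counts as unaligned
]

total = sum(t["effort_days"] for t in tickets)
aligned = sum(t["effort_days"] for t in tickets if t["labels"] & TOP_OKRS)
print(f"OKR-aligned engineering time: {aligned / total:.0%}")  # -> 65%
```

A number like this is only as honest as the labeling, which is why it belongs in the diagnostic layer rather than on anyone’s scorecard.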

The Leadership Challenge Nobody Talks About

Here’s the uncomfortable truth: metrics fail because leadership fails to use them correctly.

When engineering metrics are used as performance scorecards—to rank teams, to justify headcount cuts, to punish “low performers”—you destroy psychological safety and create gaming behavior.

When metrics are used as learning tools—to understand system health, to identify bottlenecks, to allocate resources—they work.

The framework doesn’t matter as much as the culture around measurement.

David’s point about siloed measurement is critical. When I present to our board, I show:

  • Engineering: DORA + custom strategic metrics
  • Product: Feature adoption, customer satisfaction
  • Business: Revenue, retention, growth

But the real insight comes from showing the correlations:

  • When engineering lead time decreased, did customer activation improve?
  • When we reduced technical debt ratio, did feature velocity increase?
  • When deployment frequency increased, did revenue grow?

If those correlations are weak or negative, we’re optimizing the wrong things.
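
The check itself is simple to sketch. The quarterly roll-ups below are invented, and with this few data points it’s a smell test rather than proof of causation:

```python
import numpy as np

# Invented quarterly roll-ups: median lead time (days) and activation rate (%).
lead_time_days = np.array([4.1, 3.5, 2.9, 2.3, 2.1, 2.0])
activation_pct = np.array([31.0, 33.0, 36.0, 35.0, 38.0, 39.0])

r = np.corrcoef(lead_time_days, activation_pct)[0, 1]
print(f"lead time vs. activation: r={r:+.2f}")
# A strongly negative r is at least consistent with "shipping faster helps
# activation"; a weak or positive r says the speedup isn't reaching customers.
```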

My Answer to Luis’s Question

Are we measuring the wrong outcomes?

Yes—if we’re measuring engineering productivity in isolation from business and customer outcomes.

No—if we’re using engineering metrics as one input into a broader understanding of organizational effectiveness.

The problem isn’t DORA or SPACE or DX Core 4. The problem is treating any framework as sufficient by itself.

What we need is measurement that:

  1. Ties to strategy: Engineering work serves business goals
  2. Captures system health: Speed, quality, sustainability, security
  3. Enables learning: Diagnostics, not rankings
  4. Connects to outcomes: Customer value, not just activity

And most importantly: We need leadership that uses metrics to understand the system, not to control people.

Because the moment metrics become control mechanisms, they stop measuring reality and start measuring compliance.


This has been an incredibly valuable conversation. I’d love to hear from others:

  • How are you handling the AI code quality challenge?
  • What metrics have you found that actually predict business success?
  • How do you prevent measurement from becoming performance theater?