"Measure outcomes, not activity" - but how do you actually do this without surveillance tools?

I keep hearing “measure outcomes, not activity” as the golden rule for managing remote teams. But here’s the thing: after scaling our EdTech engineering team from 25 to 80+ engineers over the past two years—most of that growth happening fully remote—I’m convinced this principle is correct but incomplete.

Everyone nods along when you say it in leadership meetings. Then someone asks “so what are we actually measuring?” and the room goes quiet.

The False Choice Nobody Talks About

The conversation usually devolves into one of two extremes:

  1. Surveillance camp: Install monitoring tools, track keystrokes, measure “productivity scores”
  2. Trust camp: Hire great people, trust them completely, no metrics needed

Both are wrong. The first destroys morale and drives away your best talent. The second leaves you flying blind when something’s actually broken.

What I’ve Learned Actually Works

After a lot of trial and error (and some painful quarters), here’s what’s working for our team:

1. Define Outcomes, Not Activities

This sounds obvious until you try it. Instead of “was online 8 hours,” we align on:

  • Clear ownership: Who owns this outcome?
  • Realistic timelines: When is success expected?
  • Quality standards: What does “done” actually mean?

For our engineering teams, this translates to:

  • Sprint velocity (story points completed per sprint)
  • Code review metrics (PR turnaround time, review quality)
  • Bug rates (defects per 1000 lines of code)
  • Feature delivery (actual vs committed)
  • Code quality metrics (test coverage, maintainability)
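
If it helps to make these concrete, here is a minimal sketch of how two of them roll up from raw data, assuming you can export PR open/merge timestamps from your Git host and a defect count from your tracker. The record shape and numbers below are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median

# Hypothetical record shape -- real data would come from your Git host's API.
@dataclass
class PullRequest:
    opened_at: datetime
    merged_at: datetime

def median_pr_turnaround_hours(prs: list[PullRequest]) -> float:
    """Median open-to-merge time in hours (a bottleneck signal, not a leaderboard)."""
    return median((pr.merged_at - pr.opened_at).total_seconds() / 3600 for pr in prs)

def defects_per_kloc(defect_count: int, lines_of_code: int) -> float:
    """Defects per 1,000 lines of code; we only care about the trend over time."""
    return defect_count / (lines_of_code / 1000)

# Made-up example data:
prs = [
    PullRequest(datetime(2024, 3, 1, 9), datetime(2024, 3, 2, 15)),
    PullRequest(datetime(2024, 3, 3, 10), datetime(2024, 3, 3, 18)),
]
print(median_pr_turnaround_hours(prs))  # -> 19.0
print(defects_per_kloc(12, 48_000))     # -> 0.25
```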

2. Check-ins Replace Check-ups

We do weekly 1:1s (or biweekly for more senior folks) to discuss:

  • Progress toward outcomes
  • Blockers and obstacles
  • Priority alignment

This isn’t surveillance—it’s support. When a senior engineer has been stuck on the same PR for 3 days, I want to know so we can help, not punish.

3. Transparency Over Surveillance

Our team dashboards show the same metrics I see. No secret “productivity scores.” If we’re tracking it, everyone knows about it and understands why.

Research backs this up: Stanford’s well-known remote-work study found remote workers roughly 13% more productive than their in-office peers. The difference? Outcome-focused management with transparent expectations.

The Leadership Mindset Shift

Here’s what I had to personally work through: if you can’t trust a developer without surveillance, you hired the wrong person.

The most successful approach I’ve seen is what some call “authentic leadership”—leading through trust, transparency, and outcomes rather than presence and surveillance.

But I’m still learning. What’s working for other engineering leaders here?

Specific questions I’m wrestling with:

  • How do you define “outcome” for exploratory or research work?
  • What’s your balance between quantitative metrics and qualitative judgment?
  • How do you avoid “outcome theater”—where teams game metrics instead of delivering real value?

Looking forward to learning from this community’s experience.

Keisha, this resonates deeply with my experience leading distributed teams across three time zones at our financial services company. But I want to add some nuance, especially for those of us in regulated industries.

The Compliance Layer

In financial services, we can’t just say “trust everyone.” We have audit requirements, SOX compliance, and security frameworks that demand certain visibility. But here’s the thing: none of that requires surveillance.

What we’ve learned:

  • Audit trails ≠ activity monitoring: We need to know who deployed what and when (compliance), not how many hours they worked (surveillance)
  • Access logs for security: We track system access for breach detection, not productivity scoring
  • Code review for quality: Every PR requires review—this is both a quality gate and an audit requirement, not a trust issue
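
To make the audit-trail vs. activity-monitoring distinction concrete, here is roughly the shape of a deployment audit event; the field names are illustrative and not tied to any particular tool.

```python
# Illustrative deployment audit event (hypothetical field names).
deploy_audit_event = {
    "actor": "j.alvarez",                 # who made the change (change management / SOX)
    "action": "deploy",                   # what happened
    "service": "payments-api",
    "version": "2024.03.15-rc2",
    "timestamp": "2024-03-15T14:32:07Z",  # when it happened
    "approver": "c.nguyen",               # separation of duties
    "ticket": "CHG-4821",                 # traceability to an approved change
}

# Deliberately absent: hours online, keystrokes, idle time, meetings attended.
# None of that is needed for the audit trail; it measures presence, not change control.
```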

Metrics That Actually Matter

After 18 years in this industry (Intel, Adobe, now here), I’ve settled on these core metrics:

Code Quality Metrics:

  • PR turnaround time (not to punish slowness, but to identify bottlenecks)
  • Code review quality (are we catching issues? Learning from each other?)
  • Defects per 1000 lines of code (trend over time, not individual blame)

Delivery Metrics:

  • Sprint commitment vs. actual (helps us get better at estimation)
  • Feature delivery cycle time (from commit to production)
  • Rollback rate (are we shipping stable code?)

Team Health Indicators:

  • Blocker resolution time (how fast do we unblock people?)
  • Cross-timezone collaboration lag (are our async processes working?)
  • On-call incident response (for critical systems)
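
And for one of the team-health numbers, a minimal sketch of blocker resolution time, assuming you can pull raised/resolved timestamps from your tracker (the log below is made up):

```python
from datetime import datetime, timedelta

# Hypothetical blocker log: (raised_at, resolved_at) pairs.
blockers = [
    (datetime(2024, 3, 4, 16, 0), datetime(2024, 3, 5, 10, 0)),
    (datetime(2024, 3, 6, 9, 0), datetime(2024, 3, 6, 11, 30)),
]

def mean_resolution_time(blockers) -> timedelta:
    """Average time a blocker stays open -- a team-level signal, not an individual one."""
    total = sum((resolved - raised for raised, resolved in blockers), timedelta())
    return total / len(blockers)

print(mean_resolution_time(blockers))  # -> 10:15:00
```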

The Weekly 1:1 Is Critical

I completely agree with your point about check-ins vs check-ups. My weekly 1:1s with directs aren’t status reports—they’re:

  1. Blocker removal sessions: What’s in your way?
  2. Growth conversations: What are you learning?
  3. Alignment checks: Are we working on the right things?

When I moved from Adobe to this role, I inherited a team that had quarterly check-ins and daily standup reports. We flipped it to weekly 1:1s and async standups, and the difference was immediate. People felt supported instead of monitored.

The Balance

Here’s what I’m still working on: balancing autonomy with accountability in a way that doesn’t feel like micromanagement.

For senior engineers, I tend toward “tell me what you’re delivering this quarter, how you’ll measure success, and let’s check in every two weeks.” For earlier-career folks, it’s more collaborative—we co-create the outcomes and check in weekly.

Your question about exploratory work is spot-on. We’ve started framing research work with:

  • Learning goals: What question are we answering?
  • Decision points: What will we decide based on this research?
  • Timebox: How long should this exploration take?

This helps teams feel like they have permission to explore without the anxiety of “am I being productive enough?”

Looking forward to hearing what others have learned here.

Both of you are hitting on something critical, but I want to challenge the framing a bit.

The Binary Trap

We keep talking about “outcomes vs. activity” as if they’re opposites. They’re not. They’re different levels of abstraction in the same measurement stack.

The reality: You need visibility at multiple levels depending on context:

  • Strategic level: Business outcomes (revenue, retention, market share)
  • Tactical level: Team outcomes (features shipped, quality metrics, velocity)
  • Operational level: Individual activities (when debugging a broken deployment pipeline)

The problem isn’t measuring activity. It’s measuring ONLY activity and calling it productivity.

The Organizational Change Management Challenge

Here’s what I learned leading our cloud migration: saying “we measure outcomes now” doesn’t make it true.

We had to:

1. Define outcomes at every level

  • Business: 30% cost reduction in infrastructure spend by Q3
  • Engineering: Zero-downtime migration for all tier-1 services
  • Team: Each squad migrates 5 services per sprint with <2% rollback rate

2. Build transparent dashboards

  • Everyone sees the same metrics
  • No “manager view” vs “IC view”
  • Real-time access to progress

3. Run regular retrospectives

  • What outcomes did we hit?
  • What outcomes did we miss and why?
  • Which metrics are misleading us?

4. Kill metrics that create bad behavior

  • We stopped tracking “lines of code” when we noticed people inflating PRs
  • We stopped tracking “story points velocity” when teams started gaming estimations
  • We stopped tracking “hours logged” because… obviously

Beware “Outcome Washing”

This is the thing that keeps me up at night: companies claiming to measure outcomes while actually measuring activity with a fresh coat of paint.

I see this constantly:

  • “Outcome: 95% utilization” (that’s an activity metric masquerading as an outcome)
  • “Outcome: 100% sprint commitment met” (also activity, also gameable)
  • “Outcome: 40 hours logged per week” (not even trying)

Real outcomes tie to business value:

  • Features that increased user engagement by X%
  • Services that reduced support tickets by Y%
  • Systems that improved reliability from Z nines to Z+1 nines
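
The nines framing is worth spelling out, because the arithmetic is what makes it legible to the business:

```python
# Allowed downtime per year for a given availability target (ignoring leap years).
def annual_downtime_hours(availability_pct: float) -> float:
    return (1 - availability_pct / 100) * 365 * 24

print(annual_downtime_hours(99.9))   # three nines -> 8.76 hours/year
print(annual_downtime_hours(99.99))  # four nines  -> ~0.88 hours (~53 minutes/year)
```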

Luis’s Compliance Point Deserves More Attention

Luis, your distinction between audit trails and surveillance is brilliant. In our SaaS business, we have SOC 2 compliance requirements. Here’s how we frame it:

Compliance requirements:

  • Change management: Who changed what in production?
  • Access control: Who has access to customer data?
  • Security monitoring: Are there suspicious access patterns?

NOT compliance requirements:

  • Was someone “active” for 8 hours?
  • How many keyboard strokes did they make?
  • Did they attend enough meetings?

The first set protects the business and customers. The second set destroys trust.

The Cultural Foundation

None of this works without the right culture. If your organization doesn’t trust people, no amount of “outcome-based management” will fix it. You’ll just get surveillance with better PR.

The foundation has to be:

  1. Hire adults you trust
  2. Give them clear outcomes and constraints
  3. Get out of their way
  4. Hold them accountable to results
  5. Support them when they’re blocked

When we say “outcome-based management,” we’re really saying “treat your team like professionals with agency, not children who need constant monitoring.”

But—and this is important—don’t confuse trust with abdication. Regular check-ins, clear expectations, transparent metrics… that’s not surveillance. That’s good management.

This conversation is fascinating from a product lens, and Michelle’s “outcome washing” point hits close to home.

Product-Engineering Alignment on Outcomes

Here’s where this gets messy: engineering outcomes and business outcomes often aren’t the same thing.

I learned this the hard way during a product pivot last year. Our engineering team was absolutely crushing their metrics:

  • 95% sprint commitment met ✓
  • Low defect rate ✓
  • High code coverage ✓
  • Excellent PR turnaround time ✓

But we were building the wrong features. Fast, high-quality, well-tested features that users didn’t want.

Our problem: engineering was measuring shipping velocity, product was measuring user value, and they weren’t aligned.

What We Changed

Now we tie engineering outcomes directly to product outcomes:

Before:

  • Engineering OKR: “Ship 5 features per quarter”
  • Product OKR: “Increase activation rate by 15%”

These look aligned but they’re not. Engineering could ship 5 features that have zero impact on activation.

After:

  • Shared OKR: “Increase activation rate by 15% through checkout flow improvements”
  • Engineering measures: Feature delivery + impact metrics (A/B test results, activation cohorts)
  • Product measures: User research validation + activation metrics

The key shift: engineering doesn’t declare victory when code ships, but when the outcome is achieved.
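
To show the shape of an “impact metric” on the engineering side, here is a sketch with made-up numbers for the checkout experiment; a real analysis would also cover significance and sample sizing, which this skips.

```python
# Hypothetical A/B results for the new checkout flow.
control = {"users": 12_400, "activated": 5_580}    # old flow
treatment = {"users": 12_150, "activated": 6_075}  # new flow

def activation_rate(cohort: dict) -> float:
    return cohort["activated"] / cohort["users"]

lift_pp = (activation_rate(treatment) - activation_rate(control)) * 100
print(f"{activation_rate(control):.1%} -> {activation_rate(treatment):.1%} "
      f"({lift_pp:+.1f} pp)")
# -> 45.0% -> 50.0% (+5.0 pp)
```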

The Sprint Velocity Trap

Michelle, your point about gaming story points resonates. I’ve seen teams:

  • Inflate estimates to look more productive
  • Split stories artificially to hit “more completed items”
  • Focus on easy wins vs high-value work

My controversial take: maybe sprint velocity is the wrong metric entirely.

What if instead we measured:

  • Hypothesis validation rate: How many of our feature bets created the expected impact?
  • Time to learning: How fast do we validate assumptions?
  • Impact per unit of engineering effort: Are we building the right things?

This forces product and engineering to agree on:

  1. What we’re trying to achieve (business outcome)
  2. How we’ll measure success (leading indicators)
  3. When we’ll evaluate (learning cadence)

The “Jobs to Be Done” Framework Applied to Teams

I’ve been experimenting with applying JTBD thinking to team performance:

Traditional question: “How many story points did the team complete?”

JTBD question: “When our team ships a feature, what job is the business hiring us to do?”

Possible jobs:

  • Validate a hypothesis (exploration/learning)
  • Drive a specific metric (growth/engagement)
  • Reduce technical debt (sustainability)
  • Improve reliability (trust/quality)

Each “job” has different success metrics. Measuring exploratory work the same way as metric-driven features creates dysfunction.

My Question for Engineering Leaders Here

How do you balance sprint velocity (shipping predictably) against product-market fit discovery (learning fast)?

In B2B SaaS, we’re constantly iterating based on customer feedback. Sometimes the outcome is “we learned this approach doesn’t work, pivot.”

That’s a successful outcome from a product perspective, but how do you frame it as a successful outcome for engineering teams without it feeling like wasted effort?

Looking forward to your perspectives on this.

David, your question about velocity vs. learning hits on something I’ve been thinking about a lot. Let me share what’s worked for us.

Reframing “Failed” Experiments as Successful Outcomes

You asked: “How do you frame ‘we learned this doesn’t work’ as a successful outcome without it feeling like wasted effort?”

Here’s what we do:

We track learning as a first-class metric.

In our quarterly OKRs, we explicitly allocate capacity for exploration:

  • 70% committed delivery (features we’re confident about)
  • 20% validated experiments (hypotheses we’re testing)
  • 10% technical investment (debt reduction, infrastructure)

For that 20% experimental bucket, success criteria are:

  1. Hypothesis clearly stated (“We believe X will drive Y because Z”)
  2. Test designed with falsifiability (“If we see [metric] move [direction] by [amount], hypothesis confirmed”)
  3. Decision made within timebox (“Two-week spike, then ship/kill/pivot decision”)
  4. Learnings documented and shared (what we learned, what we’d do differently)

When we kill a project after learning it won’t work, that’s a win. We celebrate it in retros. Why? Because the alternative is building the wrong thing for 6 months.
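
One lightweight way to hold ourselves to those four criteria is a per-experiment record along these lines (a sketch with invented field names, nothing standard):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Experiment:
    hypothesis: str         # "We believe X will drive Y because Z"
    success_signal: str     # falsifiable: metric, direction, threshold
    decision_due: date      # end of the timebox
    decision: str = "open"  # ship | kill | pivot | open
    learnings: str = ""     # documented and shared regardless of outcome

experiments = [
    Experiment(
        hypothesis="Inline SSO setup will raise enterprise onboarding completion",
        success_signal="Onboarding completion +5pp in the pilot cohort within 2 weeks",
        decision_due=date(2024, 4, 12),
        decision="kill",
        learnings="Admins want SSO configured before inviting their team, not during",
    ),
]

# "Killed on time with documented learnings" counts as a success here.
decided = [e for e in experiments if e.decision != "open"]
print(f"{len(decided)}/{len(experiments)} experiments reached a ship/kill/pivot decision")
```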

Quarterly OKRs That Bridge Business and Engineering

Your point about shared OKRs resonates. Here’s how we structure ours:

Company OKR: Increase enterprise customer activation from 45% to 60%

Product OKR: Improve onboarding completion rate from 30% to 50%

Engineering OKR:

  • Ship new onboarding flow with <5% error rate
  • Achieve <3 second page load time for onboarding steps
  • Impact metric: Contribute 10+ percentage points to activation improvement

The last one is key: we don’t get credit for shipping fast code if it doesn’t move the needle.

The Dual Scorecard Approach

We actually run two parallel scorecards:

Scorecard 1: Engineering Health

  • Code quality (test coverage, defect rates)
  • Delivery predictability (sprint commitment accuracy)
  • System reliability (uptime, incident response)
  • Team satisfaction (1:1 feedback, retention)

Scorecard 2: Business Impact

  • Features shipped that moved target metrics
  • Hypotheses validated (successful or failed, both count)
  • Customer-facing incidents reduced
  • Revenue-generating features delivered

Both matter. Great engineering health with zero business impact = building the wrong things well. High business impact with terrible engineering health = unsustainable shortcuts.

We review both quarterly with leadership. This prevents the “but we shipped everything!” argument when business metrics didn’t move.

Speed AND Learning: The Two-Track System

For your velocity vs. learning tension, we run two parallel tracks:

Track 1: Committed Delivery (70% capacity)

  • Clear requirements, low uncertainty
  • Traditional sprint planning and velocity tracking
  • Predictable delivery for known outcomes

Track 2: Discovery Sprints (20% capacity)

  • Time-boxed experiments (1-2 weeks)
  • Success = validated learning, not shipped code
  • Product and engineering co-own the hypothesis

Both tracks report progress differently:

  • Track 1: “Did we ship what we committed?”
  • Track 2: “Did we learn what we needed to decide?”

This explicitly makes room for exploration without teams feeling like they “failed” when an experiment doesn’t pan out.

Making This Work in Practice

The hardest part isn’t the framework—it’s the cultural shift.

Early on, when we killed a project after two weeks of exploration, the engineer who led it felt demoralized. In the retro, I had to explicitly say: “You succeeded. You saved us from building the wrong thing for three months. That’s a huge win.”

Now we celebrate “fast failures” the same way we celebrate successful launches. Both advance the business.


David, to your specific question about B2B SaaS iteration: the key is making learning velocity as important as shipping velocity.

If product-market fit discovery requires 10 small experiments to find 1 winner, then running those 10 experiments in a quarter is a success—even if 9 “failed.”

The metric becomes: time from hypothesis to validated decision. Fast feedback loops beat slow certainty.