CFOs Don't Fund 'Psychological Safety'—So We Started Measuring It Through Code Review Data

I had the most frustrating budget meeting of my career three months ago. Our platform team needed two additional headcount to support our growing engineering org. I came prepared with developer experience survey results, testimonials from team leads, and research showing that psychological safety has “outsized influence” on productivity.

The CFO listened politely, then asked: “What’s the ROI?”

I talked about retention, velocity, innovation. He asked for numbers. I showed survey scores. He said surveys are subjective. I referenced the research. He said “show me in our data that this investment will deliver measurable business outcomes, and I’ll approve it.”

I left that meeting convinced he just didn’t get it. Turns out, I was the one who didn’t get it.

The Breakthrough: Culture Shows Up in Interaction Patterns

Our VP of Engineering pulled me aside after that meeting and asked a question that changed my approach: “If psychological safety is real, shouldn’t it show up in how people work together?”

That sent me down a rabbit hole. We started analyzing our development collaboration patterns, particularly code reviews. Not just the mechanics (time to merge, comment count), but the character of interactions:

What we tracked (a rough sketch of how these can be pulled from review data follows the list):

  • Repeat collaboration rate: How often do the same people review each other’s code? Healthy teams have consistent pairs; siloed teams don’t.
  • Constructive feedback rate: Comments that lead to meaningful changes vs. comments that just block or rubber-stamp
  • Cross-team review participation: Are teams learning from each other or staying in bubbles?
  • Question-asking rate: Are junior developers engaging with seniors? Are they asking questions without fear?
  • Review discussion depth: Superficial “LGTM” vs. substantive technical conversations
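
For concreteness, here is a simplified sketch of how metrics like these can be computed from pull-request review data. The `reviews` table and its columns (`author`, `reviewer`, `pr_id`, `led_to_change`, `is_question`, `author_team`, `reviewer_team`) are illustrative, not our actual schema:

```python
import pandas as pd

# One row per review comment; columns are illustrative, not a real schema.
reviews = pd.read_csv("review_comments.csv")

# Repeat collaboration rate: share of author/reviewer pairs that have
# worked together on more than one pull request.
pair_prs = reviews.groupby(["author", "reviewer"])["pr_id"].nunique()
repeat_collab_rate = (pair_prs > 1).mean()

# Constructive feedback rate: comments that led to a code change.
constructive_rate = reviews["led_to_change"].mean()

# Cross-team review participation: reviewer from a different team.
cross_team_rate = (reviews["author_team"] != reviews["reviewer_team"]).mean()

# Question-asking rate: comments phrased as questions.
question_rate = reviews["is_question"].mean()

# Review discussion depth: average comments per pull request.
discussion_depth = reviews.groupby("pr_id").size().mean()

print(f"repeat collaboration: {repeat_collab_rate:.2f}")
print(f"constructive feedback: {constructive_rate:.2f}")
print(f"cross-team reviews: {cross_team_rate:.2f}")
```

In practice, labeling the comments (which are questions, which led to changes) is the hard part; the sketch treats those labels as upstream inputs.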

The Correlation That Convinced the CFO

We ran this analysis across our 12 product teams. Then we correlated these “interaction health” metrics with actual business outcomes (the gist of the analysis is sketched after the list):

  • Velocity: Teams in the top quartile of interaction health shipped 40% more features per sprint
  • Quality: Same teams had 25% fewer bugs making it to production
  • Innovation: Measured by “new architectural patterns introduced”; 3x higher in high-interaction teams
  • Retention: 18-month retention was 89% for high-interaction teams vs. 67% for low-interaction teams
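
The analysis itself was nothing fancy. A sketch of the gist, with hypothetical column names (and one caveat worth stating up front: with only 12 teams, the confidence intervals are wide):

```python
import pandas as pd
from scipy.stats import pearsonr

# One row per team: interaction health plus outcome metrics.
# Column names are hypothetical.
teams = pd.read_csv("team_quarterly_metrics.csv")

for outcome in ["features_per_sprint", "prod_bugs", "retention_18mo"]:
    r, p = pearsonr(teams["interaction_health"], teams[outcome])
    print(f"{outcome}: r={r:.2f}, p={p:.3f}")

# Quartile comparison, as in the numbers above.
teams["quartile"] = pd.qcut(teams["interaction_health"], 4,
                            labels=["Q1", "Q2", "Q3", "Q4"])
print(teams.groupby("quartile", observed=True)["features_per_sprint"].mean())
```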

The pattern was clear: psychological safety wasn’t some soft cultural nice-to-have. It showed up in measurable ways in how people collaborated, and that collaboration quality drove business outcomes.

Making the Business Case

I went back to the CFO with a different presentation:

"Our platform team investment will improve interaction health metrics by providing better development workflows, faster feedback loops, and reduced friction in collaboration. Based on our data, improving interaction health from the 50th to 75th percentile correlates with 25% more features shipped and 15% reduction in bugs.

For a 60-person engineering org, that’s roughly 15 additional features per quarter and 30 fewer production incidents. Valued conservatively at $X per feature and $Y cost per incident, the ROI is 3.2x in the first year."

He approved the budget in 10 minutes.
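
The arithmetic behind that pitch is simple enough to sanity-check in a few lines. The dollar figures below are placeholders (the real $X and $Y stay internal), so this won’t reproduce the 3.2x exactly:

```python
# Back-of-envelope ROI check. Dollar figures are placeholders,
# not the actual values from the pitch.
extra_features_per_quarter = 15
fewer_incidents_per_quarter = 30
value_per_feature = 20_000      # placeholder for $X
cost_per_incident = 5_000       # placeholder for $Y
platform_investment = 400_000   # e.g., two engineers, fully loaded

annual_benefit = 4 * (extra_features_per_quarter * value_per_feature
                      + fewer_incidents_per_quarter * cost_per_incident)
print(f"first-year ROI: {annual_benefit / platform_investment:.1f}x")
```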

Platform Teams as Culture Infrastructure

This reframed how I think about platform engineering. We’re not just building technical infrastructure—we’re building culture infrastructure.

Good developer platforms don’t just make deploys faster. They make collaboration easier. They reduce friction in code review. They create visibility across teams. They enable question-asking and knowledge sharing.

All of that shows up in interaction patterns. And interaction patterns correlate with business outcomes.

When we talk about developer experience, we often separate “culture work” from “platform work.” But they’re the same thing. The platform IS cultural infrastructure. The metrics we should track aren’t just uptime and build speed—they’re collaboration quality, knowledge sharing, and psychological safety.

And yes, you can measure all of that.

The Metrics We Now Track

Our platform team reports these quarterly:

  1. Interaction Health Score (a composite of the metrics I mentioned; sketched in code below)
  2. Cross-team collaboration rate (proxy for knowledge sharing)
  3. Knowledge transfer velocity (new patterns adopted across teams)
  4. Review engagement quality (depth of technical discussions)

Paired with traditional metrics:

  • Feature velocity
  • Bug escape rate
  • Production incidents
  • Developer satisfaction scores

The combination tells a complete story: technical performance AND cultural health, both driving business outcomes.
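
For those asking how the composite is built: conceptually, each input is standardized across teams and averaged. The inputs and equal weights below are illustrative, not our exact formula:

```python
import pandas as pd

# One row per team per quarter; metric columns are illustrative.
df = pd.read_csv("team_metrics.csv")
inputs = ["repeat_collab_rate", "constructive_rate",
          "cross_team_rate", "question_rate", "discussion_depth"]

# Standardize each metric across teams so units don't dominate,
# then take an equal-weight mean as the composite score.
z = (df[inputs] - df[inputs].mean()) / df[inputs].std()
df["interaction_health"] = z.mean(axis=1)

print(df[["team", "interaction_health"]].sort_values("interaction_health"))
```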

Questions I’m Still Wrestling With

Privacy and ethics: When you measure social interactions, you create the potential for a surveillance culture. How do we track patterns without making people anxious about being watched?

Causation vs. correlation: We see correlations between interaction health and outcomes, but can we prove platform investments drive interaction health? Or are both caused by some third factor (team quality, leadership, org design)?

Granularity: Team-level metrics? Org-level? Individual-level (which seems problematic)?

I don’t have perfect answers, but I know this: CFOs fund what you can measure. If we believe culture matters, we need to find ways to make it visible in data—not to replace qualitative understanding, but to complement it with quantitative evidence.

Anyone else measuring cultural factors through technical collaboration data? What’s worked? What’s backfired?

This is exactly the kind of thinking that bridges the gap between “culture is important” rhetoric and actual organizational investment. But I have some hard questions about causation.

The Attribution Problem

You showed correlation between interaction health and business outcomes. That’s valuable. But when you go back to the CFO and say “platform investment will improve interaction health,” how did you isolate the platform variable?

What if high-performing teams naturally have better interaction patterns AND naturally adopt platform tools more effectively? You might be measuring team quality, not platform impact.

To prove causation, you’d need something like (the third is sketched after the list):

  • A/B testing: Teams with early platform access vs. control group
  • Staged rollout: Measure interaction health before/after platform adoption for same teams
  • Regression analysis: Control for confounding variables (team size, tenure, domain complexity)
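
The regression is the cheapest of those to try if team-level data already exists. A minimal sketch with statsmodels, using hypothetical column names (with only 12 teams it’s badly underpowered, but it at least forces the confounders to be named):

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per team; column names are hypothetical.
df = pd.read_csv("team_metrics.csv")

# Does platform adoption predict interaction health once we control
# for team size, average tenure, and domain complexity?
model = smf.ols(
    "interaction_health ~ platform_adoption + team_size"
    " + avg_tenure + domain_complexity",
    data=df,
).fit()
print(model.summary())
```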

Did you do any of that? Or is the CFO buying a correlation-based ROI story?

I ask because I’ve seen this pattern before: we find correlations, build a compelling narrative, get funding, then struggle to prove actual impact when the investment doesn’t move the needle as predicted.

The Product Manager Lens: Leading vs. Lagging Indicators

Your interaction health metrics are interesting because they might be leading indicators of product success, not just lagging indicators.

If cross-team collaboration increases before velocity improves, that’s predictive. You can tell the CFO “we’re seeing early signs that this investment will pay off” before the outcome metrics move.

But if interaction health and velocity move together, they might both be caused by something else (new leadership, market pressure, team motivation), and you’re just observing correlation.

Have you analyzed the time-series data to see which metrics lead and which lag?
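
One cheap way to check, assuming monthly team- or org-level series exist: correlate interaction health against velocity at a range of lags and see where the correlation peaks.

```python
import pandas as pd

# Monthly org-level series; column names are hypothetical.
ts = (pd.read_csv("monthly_metrics.csv", parse_dates=["month"])
        .set_index("month"))

# Positive lag: interaction health this month vs. velocity k months later.
for lag in range(-3, 4):
    r = ts["interaction_health"].corr(ts["velocity"].shift(-lag))
    print(f"lag {lag:+d} months: r={r:.2f}")

# If correlations peak at positive lags, interaction health leads velocity,
# which is the predictive story; a peak at lag 0 is just co-movement.
```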

What I’d Want to See

If I were the CFO (thankfully I’m not), here’s what would convince me (the dosage check is sketched after the list):

  1. Before/after analysis: Teams that adopted the platform vs. teams that haven’t (yet)
  2. Dosage response: Teams that use the platform heavily vs. lightly—do interaction health metrics scale with adoption?
  3. Natural experiments: Did anything disrupt the pattern? New platform feature launch, org reorg, etc.
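
The dosage check is the easiest of the three if platform adoption is already instrumented. A sketch, where `platform_adoption` (say, deploys through the platform per engineer per week) is an assumed measure:

```python
import pandas as pd
from scipy.stats import spearmanr

# One row per team; column names are hypothetical.
df = pd.read_csv("team_metrics.csv")

# Rank correlation: does interaction health rise with adoption intensity?
rho, p = spearmanr(df["platform_adoption"], df["interaction_health"])
print(f"dosage response: rho={rho:.2f}, p={p:.3f}")

# Bucketed view: light / medium / heavy adopters.
df["adoption_tier"] = pd.qcut(df["platform_adoption"], 3,
                              labels=["light", "medium", "heavy"])
print(df.groupby("adoption_tier", observed=True)["interaction_health"].mean())
```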

Without those, you’ve got a compelling correlation, but you haven’t proven that platform investment drives interaction health.

That said: sometimes you don’t need perfect causation to justify investment. If the correlation is strong enough and the narrative is coherent, that’s often sufficient for budget approval. Just be careful about promising specific outcomes you can’t control.

Michelle, this is fascinating. I want to implement something similar, but I’m genuinely concerned about creating a surveillance culture. How do you handle that tension?

The Measurement Anxiety Problem

In financial services, we’re heavily regulated, which means we already track a lot of engineer activity (audit trails, compliance checks, code lineage). Developers are pretty sensitive to “being watched.”

If I start tracking “question-asking rate” and “review discussion depth,” I worry that:

  1. Performance anxiety: People start performing for the metrics rather than collaborating naturally
  2. Privacy concerns: Even aggregated data can feel invasive when it’s about social interactions
  3. Gaming: If interaction health becomes a KPI, people will game it (ask fake questions, create artificial discussions)

How did you roll this out without creating that dynamic?

Transparent vs. Opaque Metrics

I’ve been wrestling with whether these metrics should be:

Transparent (everyone knows they’re being measured):

  • Pro: No secret surveillance, aligns with psychological safety principles
  • Con: Changes behavior, potentially creates performative collaboration

Opaque (measured but not disclosed):

  • Pro: Natural behavior patterns, harder to game
  • Con: Feels like surveillance, undermines the trust you’re trying to measure

At Twilio, we made code review metrics transparent but individual-level data private. Teams could see aggregate patterns but not individual performance. That helped somewhat, but people still knew they were part of the data set.
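
Mechanically, that was roughly the shape of it: roll individual data up to the team level and suppress any group too small to anonymize. The threshold and column names here are illustrative:

```python
import pandas as pd

MIN_GROUP_SIZE = 5  # suppress teams too small to anonymize; cutoff is a policy choice

# One row per person per quarter; column names are illustrative.
individual = pd.read_csv("individual_interaction_metrics.csv")

team_agg = individual.groupby("team").agg(
    n=("person_id", "nunique"),
    question_rate=("question_rate", "mean"),
    discussion_depth=("discussion_depth", "mean"),
)

# Only publish teams large enough that aggregates don't expose individuals.
publishable = team_agg[team_agg["n"] >= MIN_GROUP_SIZE].drop(columns="n")
print(publishable)
```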

What’s your approach? Do team leads see individual-level interaction patterns, or only aggregates?

The Irony

The thing that strikes me: you’re trying to measure psychological safety through interaction patterns, but the act of measurement might undermine psychological safety if people feel surveilled.

It’s like the observer effect in physics—measuring the system changes the system.

I’m not saying don’t measure. But I think the “how” of measurement is as important as the “what.” You need to design the measurement system so that it reinforces the culture you’re trying to measure rather than undermining it.

Would love to hear how you navigated that.

I love this reframe: platform teams as culture infrastructure. That’s exactly how I think about design systems too, and it’s why I think platform work is fundamentally product work, not just ops work.

Interaction Design for Culture

When we built our design system, the technical artifacts (components, tokens, documentation) were only half the work. The other half was designing FOR collaboration:

  • Office hours: Regular sessions where designers could ask questions and discuss patterns
  • Contribution guides: Not just “how to submit a PR” but “how to propose a new component” with clear decision criteria
  • Usage examples: Real implementations from real teams, showing not just what’s possible but what’s recommended
  • Discussion channels: Spaces where designers could debate approaches and learn from each other

All of that is culture infrastructure. It shapes interaction patterns. And those interaction patterns determine whether the design system gets adopted or ignored.

The best design systems aren’t the ones with the most comprehensive component libraries—they’re the ones that create the healthiest collaboration patterns.

The Metrics Parallel

Your interaction health metrics are basically measuring what product designers call “engagement quality.” Not just “are people using the thing” but “are they using it in ways that create value and build community?”

For design systems, we track:

  • Cross-team component reuse (like your cross-team review participation)
  • Contribution rate (like your question-asking rate)
  • Discussion depth (same as yours for code reviews)
  • Pattern propagation (like your knowledge transfer velocity)

It’s the same model applied to a different domain. Platform-as-culture-infrastructure.

The Part I’m Skeptical About

Here’s where I push back a bit: you’re assuming that better platforms directly improve interaction health. But what if causation runs the other way?

What if teams with healthy culture are just better at using platforms? They’d have high interaction health AND high platform adoption, but the platform didn’t cause the culture—the culture enabled effective platform use.

I’ve seen design systems fail in toxic cultures even when the design system itself was excellent. And I’ve seen mediocre design systems thrive in healthy cultures because people collaborated around them.

Culture might be the independent variable, not the dependent one. Your platform investment might only work if the cultural foundation already exists.

That doesn’t mean don’t invest in platforms. It means: invest in platforms AND culture-building simultaneously. Don’t assume one will fix the other.

Michelle, I want to highlight something important about your approach: you’re measuring culture at the right scale.

The Granularity Problem

A lot of organizations try to measure culture through individual performance metrics. That’s where you get into surveillance problems, gaming, and anxiety.

But culture isn’t individual—it’s collective. It emerges from interaction patterns, not individual behaviors.

By focusing on team-level and org-level metrics (cross-team collaboration, repeat pairing, knowledge sharing), you’re measuring the right things at the right scale.

Where This Breaks Down

The challenge comes when someone asks: “Which teams have healthy culture and which don’t?”

Now you’re using aggregate metrics to make comparative judgments. And that creates competition dynamics that might undermine the collaboration you’re trying to measure.

If Team A sees that Team B has higher interaction health scores, does that:

  • Motivate Team A to improve their collaboration patterns? (Good outcome)
  • Make Team A defensive and resentful? (Bad outcome)
  • Create pressure to game the metrics? (Terrible outcome)

I’ve seen all three happen with engineering metrics. The culture around the metrics matters as much as the metrics themselves.

What Makes This Work at Scale

The reason I think your approach can work: you’re using these metrics to justify platform investment, not to evaluate individual teams or people.

That’s the right framing. The metrics exist to help leadership understand where to invest, not to rank teams or create competition.

But you need to be really careful that the metrics don’t evolve into performance evaluation tools. Once people believe their team is being judged by interaction health scores, the behavior changes and the metrics become less reliable.

The Inflection Point

Here’s my question: at what scale do these metrics become necessary vs. when do they create unnecessary overhead?

At 25 engineers, you don’t need metrics to see collaboration patterns. You can literally observe them.

At 200 engineers, patterns are invisible without metrics. You NEED quantitative data to understand what’s happening.

Somewhere in between is an inflection point where the value of measurement exceeds the cost (in overhead, potential anxiety, risk of gaming).

Where do you think that inflection point is? And how do you avoid measuring too early (creating unnecessary bureaucracy) or too late (missing important signals)?