Are We Really Doing Blameless Postmortems? The Gap Between Policy and Practice

Three years ago, I sat in a conference room watching what was supposed to be our first “blameless” postmortem. On paper, we had all the right policies. Our incident response documentation literally said “blameless culture” in bold. But as the meeting progressed, I watched our VP of Engineering ask increasingly pointed questions: “Why didn’t you check that before deploying?” “Shouldn’t you have known that would cause problems?”

The engineer who caused the incident - a brilliant senior developer - became quieter and quieter. By the end, everyone knew whose fault it was, even though no one said it explicitly. That engineer left the company six months later.

The Blameless Policy-Practice Gap

Since then, I’ve led engineering teams at three different companies. I’ve seen this pattern repeat: organizations adopt “blameless postmortem” policies, create templates, maybe even get training. But the actual culture? Still blame-focused.

Here’s what I’ve learned about why this gap exists:

1. Language Undermines Blamelessness

We say “blameless” but our language betrays us. Terms like “root cause” implicitly point to a singular failure point - usually a person. When we ask “Who deployed this?” instead of “What deployment process allowed this?”, we’re centering blame on individuals rather than systems.

At my current company, we’ve shifted to “contributing factors” instead of “root causes.” It sounds subtle, but it changes how people think. Instead of finding the one thing that broke, we explore the multiple system conditions that enabled the failure.

2. Leadership Behavior Sets The Tone

The most important factor isn’t your policy document - it’s how leadership behaves during and after incidents. I’ve seen CTOs who preach blamelessness but then ask “How did this get through code review?” in a tone that clearly assigns fault.

Leaders need to model the behavior. When I’m in a postmortem now, I deliberately share times I’ve made similar mistakes. I talk about architectural decisions I made that caused incidents. It signals: we’re all learning here, including me.

3. Career Impact Creates Fear

Here’s the uncomfortable truth: even in “blameless” cultures, being associated with a major incident impacts your reputation. It might not show up in your performance review, but engineers remember. Peers make comments. You become “the person who took down production.”

This fear prevents honest sharing. Engineers minimize their role, focus on external factors, or, in the worst case, hide information. The postmortem might be blameless, but the social dynamics aren’t.

4. Metrics Can Create Perverse Incentives

Some companies track incident metrics in ways that inadvertently encourage blame. “Incidents per team” or “time to resolution by person” might seem like good accountability measures, but they make incidents feel like individual failures rather than learning opportunities.

I’ve started tracking different metrics: “System improvements per postmortem” and “Repeat incident rate.” These focus on learning outcomes, not individual performance.
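As a rough sketch of how those two metrics could be computed from postmortem records: the record shape, the field names, and the idea of tagging each incident with a `failure_mode` label are all assumptions made for illustration, not a description of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """One reviewed incident. All field names are illustrative assumptions."""
    incident_id: str
    failure_mode: str  # hypothetical label, e.g. "migration-table-lock"
    system_improvements: list[str] = field(default_factory=list)


def improvements_per_postmortem(postmortems: list[Postmortem]) -> float:
    """Average number of system improvements recorded per postmortem."""
    if not postmortems:
        return 0.0
    total = sum(len(p.system_improvements) for p in postmortems)
    return total / len(postmortems)


def repeat_incident_rate(postmortems: list[Postmortem]) -> float:
    """Fraction of incidents whose failure mode already appeared earlier in the list."""
    if not postmortems:
        return 0.0
    seen: set[str] = set()
    repeats = 0
    for p in postmortems:
        if p.failure_mode in seen:
            repeats += 1
        seen.add(p.failure_mode)
    return repeats / len(postmortems)
```

Note that neither function takes a person or a team as input, so the numbers cannot be sliced by individual without changing the data model.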

Making Blamelessness Real

So what actually works? Here’s what I’ve seen succeed:

  • Remove “who” from templates entirely. Our postmortem template doesn’t have a field for “person responsible.” It has “system conditions that enabled this failure.”

  • Celebrate vulnerability. We have a monthly “lessons learned” session where people share their mistakes. When senior engineers and leaders participate authentically, it normalizes failure as learning.

  • Separate incident response from performance reviews. Our engineers know that being involved in an incident - even causing one - is explicitly excluded from performance evaluation. This is written policy, communicated clearly.

  • Focus on containing mistakes rather than preventing them. Instead of asking “How do we prevent people from making this mistake?”, we ask “How do we make it impossible for this mistake to cause an outage?” That means better guardrails, better testing, better observability.

Questions For The Community

I’m curious how others have tackled this:

  • How do you ensure blamelessness is real, not just rhetoric?
  • What signals indicate your incident culture is actually psychologically safe?
  • Have you seen organizations successfully shift from blame to learning? What enabled that change?

The gap between policy and practice is real. But I believe we can close it - it just takes intention, leadership modeling, and system-level thinking about culture, not just technology.

What’s your experience with this? Are your “blameless” postmortems actually blameless?

Luis, this resonates deeply. The leadership vulnerability piece you mentioned is absolutely critical - and honestly, it’s where I’ve seen the biggest impact.

When I was at Google as an engineering manager, I watched a VP do something that completely changed how I think about incident culture. We had a major outage that affected millions of users. During the postmortem review, this VP - who had been at Google for 15+ years - stood up and said: “This reminds me of an architectural decision I made in 2019 that caused an even bigger outage. Here’s what I learned from that mistake, and here’s how it applies to what we’re discussing today.”

The room went silent. Then the conversation completely shifted. People started sharing their own experiences, their own mistakes. The psychological safety in that room went from zero to 100.

Leaders Must Go First

Since becoming VP of Engineering at my current company, I’ve made it a practice to share my failure stories first. Not in a self-deprecating way, but authentically. Here are some specific things I do:

  • In every major postmortem review, I share a relevant mistake from my own career
  • I talk about the incident I caused as a senior engineer that taught me about database connection pooling
  • I openly discuss architectural decisions I made as Director that we had to unwind
  • I frame these as “here’s what this taught me” rather than “look at this dumb thing I did”

Removing “Who” From The Conversation

You mentioned removing “who” from templates - we’ve done this too, and I want to share a specific tactical approach that worked for us:

Our old postmortem template had sections like:

  • “Timeline of events” (which naturally included who did what)
  • “Root cause analysis”
  • “Action items and owners”

Our new template has:

  • “System behavior during incident” (describes what happened from the system’s perspective)
  • “Contributing factors and system conditions”
  • “Opportunities for improvement” (a deliberate reframing of “action items”)
  • “System changes” (no owners listed - we discuss ownership separately, outside the postmortem)

The language shift was subtle but powerful. When you describe “system behavior” instead of “timeline,” you naturally focus on what the system did, not what people did.

The Performance Review Separation

This is huge. At my current company, we have an explicit policy: “Incident involvement is excluded from performance reviews.” It’s in our employee handbook. We communicate it in onboarding. We remind people of it after major incidents.

But here’s what makes it real: when I write performance reviews for my directs, I literally do not look at incident reports. I actively avoid including anything about incidents they were involved in. Instead, I look at what they learned, what they taught others, and how they improved systems.

The first time I had to explain to a manager “No, you cannot include this incident in their review, even as a positive example of how they handled it” - that’s when people realized we meant it.

The Career Impact Problem

You hit on something really hard here: the social dynamics. Even with all these policies, people remember. I don’t have a great solution for this yet, but here’s what we’re experimenting with:

  • Rotating the incident commander (IC) role so everyone has been “the person who was IC during an incident”
  • Celebrating incident response publicly (not just incident prevention)
  • Monthly “incident learnings” presentations where people share what they learned from incidents they were involved in
  • Making sure senior engineers and leadership are visible in incident response

The goal is to normalize being involved in incidents. If everyone has been incident commander, if everyone has been paged, if everyone has made a mistake that caused an issue - then no single person carries the stigma.

What I’m Still Figuring Out

What I struggle with: how do you handle genuine negligence? Like, someone knowingly skips a critical step because they’re in a hurry. Our blameless culture is real, but does it cover everything?

I lean toward: even negligence is usually a system failure (why was there pressure to hurry? why wasn’t the step enforced by automation?), but I’d love to hear others’ thoughts on this edge case.
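To make the "enforced by automation" idea concrete, here is a minimal sketch of a deploy gate that refuses to proceed while a required step is missing. The step names and function names are hypothetical; a real pipeline would define its own and record completion from tooling rather than from a hand-filled checklist.

```python
# Hypothetical step names; a real pipeline would define its own.
REQUIRED_STEPS = ("tests_passed", "migration_dry_run", "rollback_plan_attached")


def missing_steps(completed: set[str]) -> list[str]:
    """Return required steps not yet recorded as completed, in declared order."""
    return [step for step in REQUIRED_STEPS if step not in completed]


def can_deploy(completed: set[str]) -> bool:
    """A deploy is allowed only when no required step is missing."""
    return not missing_steps(completed)
```

With a gate like this, skipping a step in a hurry stops being a choice an individual can make quietly, which moves the question from "why did they skip it?" to "why was the gate missing or bypassable?"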

Luis, your point about metrics is spot on. We track “system improvements per incident” too. It’s become one of our key engineering metrics, and it completely changes the conversation from “who broke it” to “what did we learn.”

This thread is fascinating and I appreciate the thoughtful perspectives. I want to add a dimension that I think complicates the blameless discussion: security incidents.

I’m a security engineer working in identity and fraud prevention. In my world, incidents aren’t just “oops, we had some downtime.” They’re “oops, user data was exposed” or “oops, we got breached.” The stakes are fundamentally different, and that changes the calculus around blamelessness.

When Blameless Meets Security

I absolutely believe in psychological safety and learning from incidents. But I also believe in accountability, especially when security policies exist explicitly to prevent certain behaviors.

Example: We have a clear policy about not storing API keys in code. It’s in our security training, our onboarding docs, our PR checklists. If an engineer commits an API key to a public repo and we get compromised - is that still “blameless”?

My honest answer: it’s complicated.

On one hand, the system should prevent this (pre-commit hooks, secret scanning, code review). If an engineer could commit a key, our guardrails failed. That’s a system problem.

On the other hand, there was a known policy and the engineer chose to bypass it (or ignored it). That’s not the same as a novel failure mode or an unclear situation.
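For the guardrail side of that argument, here is a minimal sketch of the kind of check a pre-commit hook or CI job might run over changed files. The two patterns are illustrative assumptions only (one for the well-known `AKIA` prefix shape of AWS access key IDs, one for generic quoted key assignments); dedicated secret scanners use much larger rule sets, and `find_secrets` is a made-up name.

```python
import re

# Illustrative patterns only; real scanners ship far larger, tuned rule sets.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "generic_api_key": re.compile(
        r"(?i)\b(api[_-]?key|secret|token)\b\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]"
    ),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a pattern."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings
```

A hook built on this would block the commit whenever `find_secrets` returns anything. A check like this catches accidental commits; it does not settle the question of deliberate circumvention, which is where the accountability discussion starts.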

The Line Between Blameless and Accountability-Free

Keisha, you asked about genuine negligence and I think this is at the heart of it. I’ve seen two failure modes:

  1. Blame culture disguised as accountability: Every mistake becomes “you should have known better,” which kills psychological safety
  2. Blamelessness used to avoid accountability: “It’s blameless so we can’t talk about individual responsibility at all,” which can enable repeated careless behavior

The question I’m wrestling with: Where’s the line between blameless learning and accountability for willful policy violations?

What We’re Trying

At my company, we’ve separated incident response into two tracks:

Track 1: System Learning (Always Blameless)

  • What happened?
  • What system conditions enabled it?
  • What can we improve?
  • How do we prevent recurrence?

This is always psychologically safe. Always learning-focused. Always system-oriented.

Track 2: Policy Review (Sometimes Individual)

  • Were existing policies followed?
  • If not, was the policy unclear or unrealistic?
  • Do we need to change the policy or its enforcement?
  • Was there willful circumvention that needs addressing?

Track 2 happens separately, usually involves HR or management if needed, and is about policy compliance, not the technical incident itself.

The Goal: Learn Systems, Enforce Policies

The idea is: we always learn from the incident as a system failure. But separately, we also ask if policies were followed. Most of the time, policy violations reveal that the policy was unrealistic or the tooling made it hard to comply - which is system feedback.

But occasionally, there’s genuine negligence that needs addressing outside the postmortem process.

Why This Matters for Security

In security, we can’t adopt a purely system-oriented view because adversaries exploit both systems AND human behavior. Yes, we should have better systems. But we also need engineers to be vigilant about security practices.

I worry that an overly rigid “blameless” approach might communicate that security policies are optional. That’s dangerous in a threat environment where attackers actively target engineers.

Questions I Have

  • Do others differentiate between different types of incidents (availability vs security vs data loss)?
  • How do you balance psychological safety with security accountability?
  • Has anyone seen a framework that handles this nuance well?

I want to be clear: I’m 100% in favor of blameless culture for learning. I just think security introduces complications that pure system thinking doesn’t fully address. Would love to hear if others have figured this out better than we have.

Reading this thread is kind of therapeutic honestly. I was the engineer Luis described (not literally, but basically). I caused a production incident last year and even though we supposedly had a “blameless” culture, the experience nearly made me quit tech.

What Actually Happened

I was working on a feature that required updating our database schema. I tested it in staging, got it code reviewed, followed the deployment checklist. But I missed something subtle: the migration script had a performance issue that only showed up at production scale. When it ran, it locked a critical table for 8 minutes during peak traffic. We lost about K in revenue.

The postmortem was officially “blameless.” No one directly blamed me. But here’s what happened:

  • My manager asked me to write the postmortem (which felt like being asked to write your own confession)
  • During the meeting, people kept asking me questions like “Why didn’t you catch this?”
  • The action items were all assigned to me personally
  • For weeks after, when I’d suggest ideas, someone would joke “make sure it doesn’t take down production this time”

Nobody meant to be cruel. But the social cost was real.

The Stigma is Silent

Luis mentioned this: “even in blameless cultures, being associated with a major incident impacts your reputation.” This is so true and so hard to talk about.

I started noticing:

  • My code reviews got more scrutiny than others’
  • I wasn’t included in a high-visibility project I would have been perfect for
  • In team meetings, my technical opinions carried less weight
  • I felt like I had to “prove” myself all over again

The worst part? I couldn’t talk about any of this because officially, we were blameless. Complaining would sound like I was rejecting the culture or being overly sensitive.

What Would Have Actually Helped

Looking back, here’s what would have made a difference:

  1. Someone else writing the postmortem: When you write your own postmortem, you’re forced to analyze your mistakes in front of everyone. Have a neutral party write it, with input from the person involved but without making them the author.

  2. Explicit “incident recovery” process: Like, your manager saying “Hey, I know incidents can affect how people see you. Let’s intentionally get you leading some wins to reset that dynamic.” Acknowledge the social reality instead of pretending it doesn’t exist.

  3. Rotating incident exposure: Make sure everyone on the team has been incident commander, has been paged, has led the response to a production issue. When it’s everyone’s experience, it’s nobody’s stigma.

  4. Leadership sharing their L’s first: Keisha mentioned this and YES. When my director later shared a story about an incident he caused as a senior engineer, it helped. But it came months later. Do it proactively, not reactively.

The System Perspective

Here’s what I’ve learned though: that incident was absolutely a system failure, not a personal one.

Why could I deploy a migration script without testing it at scale? Why didn’t we have staging environments that matched production load? Why wasn’t there automatic rollback for migrations? Why did a single table lock take down the whole app?

If I could do it, anyone could do it. I just happened to be the person who found the gap in our system.

We’ve since fixed most of these issues:

  • Migration scripts now run in dry-run mode first
  • We have automated load testing
  • The database deployment process includes rollback plans
  • Better connection pooling so one slow query doesn’t cascade

Those improvements helped the team way more than any individual accountability would have.
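As an illustration of the dry-run idea from the list above, here is a minimal sketch that applies a migration inside a transaction, times it, and always rolls back. It uses SQLite only so the example is self-contained; the function name and the `max_seconds` threshold are assumptions, and not every database can roll back schema changes (MySQL, for example, implicitly commits DDL). A dry run is also only as informative as the data volume it runs against.

```python
import sqlite3
import time


def dry_run_migration(conn: sqlite3.Connection,
                      statements: list[str],
                      max_seconds: float) -> tuple[bool, float]:
    """Run the statements in a transaction, time them, then always roll back.

    Assumes `conn` was opened with isolation_level=None so that the explicit
    BEGIN/ROLLBACK below are the only transaction control in play.
    """
    conn.execute("BEGIN")
    start = time.perf_counter()
    try:
        for statement in statements:
            conn.execute(statement)
        elapsed = time.perf_counter() - start
    finally:
        # Dry run: nothing is ever committed, even on success.
        conn.execute("ROLLBACK")
    return elapsed <= max_seconds, elapsed
```

The returned flag could gate the real deployment: if the migration takes longer than the allowed budget on a production-sized copy, it does not ship in that form.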

The Culture Gap

But the gap Luis described is real. We SAY blameless but we FEEL blame. Until we address the social and career dynamics - not just the policy and process - blamelessness is only half real.

I ended up changing teams (same company, different product area). Fresh start, new people who didn’t know my history. That shouldn’t have been necessary, but it was the only way to escape the invisible stigma.

Anyway, this turned into more of a vent than I intended. But this thread is making me realize how much that experience affected me. Thanks for starting this conversation, Luis.

Alex, thank you for sharing that. Your experience really highlights how much the structure and design of our incident processes matter - not just the stated policy.

I’m a design systems lead, so I think about everything through the lens of systems, templates, and user experience. Reading this thread made me realize: our postmortem templates are UX artifacts that shape behavior. And most of them have terrible UX for blamelessness.

The Template Tells The Truth

Luis, you mentioned your template doesn’t have a “person responsible” field. When we redesigned ours, I went even further. Here’s what I changed:

Old template had:

  • “Timeline” (inevitably lists who did what when)
  • “Root cause” (points to a singular failure)
  • “Person who discovered the issue”
  • “Person who resolved the issue”
  • “Action items with owners”

Every single one of those fields is person-oriented. Even when you’re trying to be blameless, the template structure pushes you toward naming individuals.

New template has:

  • “What surprised us?” (shifts from blame to learning)
  • “What did we learn about our system?” (explicitly system-focused)
  • “What made this possible?” (plural, systemic)
  • “How might we reduce the likelihood of similar surprises?” (future-oriented, possibility-focused)
  • “System changes to implement” (no owners listed in the postmortem itself)

The language is intentionally:

  • Question-based (invites exploration, not judgment)
  • Plural (“we” and “us” and “our”)
  • Curiosity-focused (“surprised,” “learned,” “what made this possible”)
  • Forward-looking (“how might we”)

Visual Design Matters

This might sound weird, but: the visual layout of your postmortem template affects blame dynamics.

Our old template was a linear document with sections flowing down the page. It created a narrative arc that naturally led from “what happened” → “who did it” → “what should they have done differently.”

Our new template uses a different visual structure:

  • Central question at the top: “What did our system teach us?”
  • Three parallel columns: “System behavior” | “What we learned” | “What we’ll change”
  • No hierarchy suggesting causation or responsibility
  • Visual emphasis on insights and improvements, not timeline

It sounds like a small thing, but people literally interact with postmortems differently when the visual structure emphasizes learning over causation.

The Role of Language Patterns

You know how in design we talk about “microcopy” - the tiny words that shape user behavior? Postmortem templates are full of microcopy that signals blame or learning.

Instead of “What caused the incident?” → “What conditions enabled this behavior?”
Instead of “How do we prevent this?” → “How do we make our system more resilient to this?”
Instead of “Root cause” → “Contributing factors”
Instead of “Owner” → “Area of focus”

Each shift is subtle. Together, they create a completely different psychological frame.
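The swaps above could even be checked mechanically. Here is a minimal sketch of a template "linter" that flags the blame-oriented phrases and suggests the reframed wording. The phrase pairs come from the list above; the function name and the matching rules (case-insensitive, anchored at a word start) are assumptions.

```python
import re

# Phrase pairs from the list above; everything else here is an assumption.
REFRAMES = {
    "What caused the incident?": "What conditions enabled this behavior?",
    "How do we prevent this?": "How do we make our system more resilient to this?",
    "Root cause": "Contributing factors",
    "Owner": "Area of focus",
}


def suggest_reframes(template_text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, phrase_found, suggested_replacement) per match."""
    suggestions = []
    for lineno, line in enumerate(template_text.splitlines(), start=1):
        for phrase, replacement in REFRAMES.items():
            # Anchor at a word start so "owners" matches but "homeowner" does not.
            if re.search(rf"\b{re.escape(phrase)}", line, flags=re.IGNORECASE):
                suggestions.append((lineno, phrase, replacement))
    return suggestions
```

Run against an old-style template, this flags section names like “Root cause analysis” and “Action items and owners” while leaving neutral headings alone. It only catches wording; whether the reframed sections are used in a blameless way is still up to the people in the room.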

My Startup Failure Taught Me This

I ran a failed B2B SaaS startup for three years. When we did postmortems on feature launches or customer churn, I made the mistake of using blame-oriented language even when I didn’t mean to.

“Why didn’t we catch this in user testing?” (implies someone should have)
“Who was responsible for this launch?” (attaches identity to failure)
“What went wrong?” (negative framing)

After the company died (which I learned a TON from btw), I reflected on how the language I used as founder shaped the team’s psychology. We weren’t psychologically safe because every retrospective reinforced individual responsibility rather than collective learning.

Practical Design Tips

If you’re redesigning postmortem templates:

  1. Remove all “who” fields - Luis and Keisha mentioned this, and it’s critical
  2. Use questions instead of fill-in-the-blank - questions invite exploration
  3. Lead with learning - make “what we learned” the first section, not the timeline
  4. Emphasize plurality - “we,” “our system,” “the team” rather than singular
  5. Future-orient action items - “how we’ll improve” not “what we should have done”

Templates Are Culture Artifacts

Your postmortem template isn’t just documentation - it’s a cultural artifact that communicates values. If your template is blame-structured, your culture will be blame-oriented, no matter what your policy document says.

Design your templates as carefully as you design your products. They shape how people think, feel, and behave during the most psychologically charged moments of engineering work.

Alex, I hope wherever you are now has better templates and better culture. Your insights here are valuable and you shouldn’t have had to change teams to escape stigma.