The Best Product Insights Come From Production Incidents - Here's Why We Should Embrace Them

Controversial take: Production incidents are some of the most valuable learning opportunities for product teams - not just engineering teams.

I know that sounds counterintuitive. Incidents are expensive, stressful, and harm customers. We should definitely try to prevent them.

But when they DO happen? They’re gold mines of product insight that we often ignore.

Context: VP of Product Perspective

I’m VP of Product at a Series B fintech startup. I’ve been in product for 12+ years at Google, Airbnb, and now here. And one pattern I’ve noticed across every company:

Engineering treats incidents as engineering problems. Product barely pays attention. We’re missing massive learning opportunities.

The Incident That Changed My Perspective

Six months ago, we had a payment processing incident. Our payment service was down for 90 minutes during a weekday afternoon. Classic infrastructure problem, right?

Engineering wrote a great postmortem: database connection pool exhaustion, root cause identified, preventive measures implemented. Case closed.

But I dug deeper and found something fascinating:

The incident revealed that customers were using our product in ways we never intended.

What caused the connection pool exhaustion? A spike in a specific API endpoint that we thought was rarely used. Turns out, one of our enterprise customers had built an integration that was hitting that endpoint hundreds of times per minute.

We had no idea customers needed this level of API usage. It never came up in user research. But the incident revealed it.

That integration became a product feature. We built proper support for high-frequency API access, turned it into a paid tier, and it’s now generating $200K+ ARR.

What Incidents Reveal About Product

Since that realization, I’ve started reviewing every major incident through a product lens. Here’s what incidents teach us:

1. Real Usage Patterns (Not Assumed Ones)

User research tells us what customers THINK they do. Incidents reveal what customers ACTUALLY do.

Examples from our incidents:

  • Incident showed customers exporting 50K+ row reports (we designed for 1K)
  • Incident revealed customers using our app at 2am (we assumed business hours only)
  • Incident exposed that customers were screen-scraping our web app because we lacked a proper API

Each of these became product opportunities.

2. Which Features Actually Matter

When a service goes down, you see what customers complain about. And often, it’s surprising.

We had an incident that took down our analytics dashboard (internal tool). I expected minimal customer impact. Wrong. Support got flooded with complaints.

Turns out, customers used this “internal” dashboard daily for reporting to their executives. It wasn’t a nice-to-have; it was critical to their workflow.

We completely reprioritized analytics work based on this learning.

3. System Architecture Constraints on Product

Incidents reveal where our technical architecture limits product possibilities.

Example: We had repeated incidents related to a monolithic service that powered multiple features. During incidents, unrelated features would fail together.

This taught product: we need to architect for feature independence. It influenced our multi-year platform strategy.

4. Gaps in Product Documentation

Many incidents are caused by customers misunderstanding how to use the product.

When a customer misconfigures something and causes an incident, that’s not their fault - it’s our documentation and UX failing them.

We’ve improved error prevention, clearer docs, and better onboarding based on incident patterns.

5. Integration and Partner Dependencies

Incidents expose which third-party integrations customers actually rely on (vs which ones are optional).

We learned that our Salesforce integration was far more critical than our HubSpot integration, based on customer impact during their respective outages.

This informed partnership prioritization and reliability investment.

Cross-Functional Incident Reviews

After this realization, I changed how we handle major incidents:

Engineering postmortem (technical):

  • What broke, why, how to prevent recurrence
  • System improvements, monitoring, architecture

Product review (business):

  • What does this tell us about customer behavior?
  • What product gaps did this reveal?
  • What opportunities emerged?
  • Should this change our roadmap?

Both are valuable. Both should happen.

Example: API Rate Limiting Incident

Engineering perspective:

  • We hit rate limits on a third-party API
  • Root cause: Didn’t anticipate usage growth
  • Fix: Implement caching, negotiate higher limits
  • Prevent: Better usage monitoring
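The caching fix above can be sketched as a small TTL cache in front of the third-party call. This is an illustrative sketch, not our actual implementation; the names and the 60-second window are assumptions:

```python
import time

class TTLCache:
    """Cache third-party API responses briefly to stay under rate limits."""
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get_or_fetch(self, key, fetch_fn):
        now = time.time()
        entry = self._store.get(key)
        if entry and entry[0] > now:
            return entry[1]            # cache hit: no upstream call
        value = fetch_fn()             # cache miss: one upstream call
        self._store[key] = (now + self.ttl, value)
        return value

calls = []
def fetch():  # stand-in for the real third-party request
    calls.append(1)
    return {"status": "ok"}

cache = TTLCache(ttl_seconds=60)
cache.get_or_fetch("user:42", fetch)
cache.get_or_fetch("user:42", fetch)  # served from cache
print(len(calls))  # → 1: only one upstream call was made
```

Even a short TTL like this can collapse hundreds of identical requests per minute into one, which is exactly the load pattern that triggered the incident.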

Product perspective:

  • Customers are using this feature 10x more than we predicted
  • This feature is more valuable than we thought
  • We should build more capabilities around it
  • Opportunity: Premium tier with higher limits

Same incident. Different, complementary learnings.

Making Incidents a Cross-Functional Learning Moment

Here’s what we changed:

1. Product attends major incident postmortems (P0/P1)

  • Not as observers, but as active participants
  • We ask: “What does this tell us about customers?”

2. Monthly incident themes review

  • Engineering + Product + Design + Customer Success
  • Look for patterns across incidents
  • Ask: What should we build/change/improve?

3. Incident insights feed into roadmap planning

  • Quarterly, we review incident-driven insights
  • Some of our best features came from incident learnings

4. Customer success joins postmortems

  • They have context on customer reactions
  • They understand business impact beyond metrics

Real Examples of Product Changes from Incidents

Let me share concrete examples:

Incident: Database slowdown during data exports
→ Product learned: Customers export way more data than we designed for
→ Action: Built async export system, made it a product feature
→ Result: Enterprise customers love it, became a differentiator

Incident: Auth service timeout during login spike
→ Product learned: Customers have “all hands” meetings where everyone logs in simultaneously
→ Action: Improved session management, added “remember me” feature
→ Result: Better user experience, fewer support tickets

Incident: Search service overload
→ Product learned: Customers use search as their primary navigation (not the menu we designed)
→ Action: Redesigned navigation around search-first paradigm
→ Result: User engagement up 15%

Incident: API quota exceeded
→ Product learned: Customers want programmatic access way more than we thought
→ Action: Built proper API product with tiered pricing
→ Result: New revenue stream, $500K+ ARR
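The tiered API pricing in that last example implies per-tier quotas. A minimal sketch of the quota check, with hypothetical tier names and limits:

```python
# Hypothetical per-tier request quotas (requests per minute).
TIER_LIMITS = {"free": 60, "pro": 600, "enterprise": 6000}

def allow_request(tier, requests_this_minute):
    """Return True if the caller is still under its tier's quota."""
    limit = TIER_LIMITS.get(tier, TIER_LIMITS["free"])  # unknown tiers get the free quota
    return requests_this_minute < limit

print(allow_request("free", 59))        # → True
print(allow_request("free", 60))        # → False: free quota exhausted
print(allow_request("enterprise", 60))  # → True: higher paid limit
```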

The Mental Shift

The key shift is reframing incidents:

From: “Something broke, let’s fix it”
To: “We learned something unexpected about our customers and systems”

Every incident is expensive user research. You’re paying for it (in downtime, customer impact, engineering time). You might as well extract the product insights.

The Surprising Cultural Benefit

One thing I didn’t expect: when product participates in incident response, engineering feels less blamed.

Why? Because it shifts the conversation from “engineering broke something” to “we all learned something about our product and customers.”

The shared ownership of learning reduces the implicit blame. Product saying “this incident helped us understand our customers better” reframes it from failure to insight.

Questions for Product Leaders

  • Do you review incident postmortems?
  • Do you know what the last 5 incidents revealed about customer behavior?
  • Are incident insights feeding into your roadmap?
  • Is customer success involved in incident reviews?

If not, you’re leaving product intelligence on the table.

Questions for Engineering Leaders

  • Do you invite product to incident postmortems?
  • Do you frame incidents as learning opportunities beyond technical fixes?
  • Do you track business/product insights from incidents?

Cross-functional incident learning benefits everyone.

The Competitive Advantage

Here’s the strategic angle: our competitors have incidents too. But they treat them as purely engineering problems.

We treat them as organizational learning opportunities. We extract product insights, customer behavior patterns, and business intelligence.

Over time, this compounds. We understand our customers better. We build better products. We make smarter roadmap decisions.

Incidents aren’t just problems to solve. They’re data to learn from.

Call to Action

If you’re a product leader: attend your next major incident postmortem. Don’t just skim the doc later - actually participate in the discussion.

Ask:

  • What does this tell us about our customers?
  • What assumptions did this challenge?
  • What opportunities does this reveal?
  • Should this change our roadmap?

I guarantee you’ll learn something valuable.

And if you’re an engineering leader: invite product to your next postmortem. Frame it as “help us understand the business context and customer impact.”

The best product companies learn from every signal. Incidents are loud, expensive signals. Let’s not waste them.

What do you think? Do other product teams do this? Am I missing something?

David, this is EXACTLY the kind of cross-functional thinking I want to see more of. And as VP of Engineering, I can tell you: having product engaged in incident response makes my job easier, not harder.

Why Engineering Welcomes Product Involvement

When I first became VP, I was worried about involving non-engineering teams in incident postmortems. Would it feel like oversight? Would engineers get defensive?

Turns out, the opposite happened. Engineers appreciate when product cares about incidents beyond “when will it be fixed?”

The Alignment Benefit

Here’s what changed when we started including product in major incident reviews:

1. Shared Understanding of Trade-offs

Example: We had an incident caused by a performance optimization we’d deprioritized for months. When it finally bit us, product was in the postmortem.

Product: “Why didn’t we do this optimization earlier?”
Engineering: “We raised it 3 times. It kept getting deprioritized for feature work.”

That conversation shifted how we balance feature velocity vs technical health. Product now understands why some infrastructure work can’t wait.

2. Better Prioritization of Reliability Work

When product sees incidents through an engineering lens (“this is the 4th incident from this system weakness”), they understand why refactoring that system is urgent.

We used to fight about infrastructure work. Now we collaborate on prioritizing it.

3. Incident-Driven Roadmap Insights

Your examples about API usage and export functionality - yes! We’ve had similar experiences.

One incident revealed customers were using our “admin” interface in production workflows. We thought it was just for setup. Turns out, daily operations.

That insight reshaped our product roadmap for the next year. We never would have discovered it through user research.

The Translation Challenge

One thing I’ve learned: engineering and product speak different languages about incidents.

Engineering talks about:

  • Root cause analysis
  • Technical debt
  • System architecture
  • Reliability metrics

Product talks about:

  • Customer impact
  • Use cases
  • Feature gaps
  • Business value

Both are important. The key is translating between them.

How We Structure Cross-Functional Postmortems

Our format for P0/P1 incidents:

Part 1: Technical Deep Dive (30 min)

  • What happened technically
  • Engineering audience
  • Product/others listen and ask clarifying questions

Part 2: Business Impact & Insights (20 min)

  • What was customer impact?
  • What did we learn about usage patterns?
  • Product leads this discussion
  • Engineering provides context

Part 3: Action Items (10 min)

  • Technical improvements (engineering)
  • Product opportunities (product)
  • Process improvements (both)

This structure ensures both perspectives get airtime.

The Monthly Incident Themes Review

You mentioned this and I want to emphasize it: this is the highest-leverage meeting we run.

Once a month:

  • Engineering + Product + Design + Customer Success
  • Review all P1/P2 incidents from the past month
  • Look for patterns and themes
  • Identify opportunities

Some of our best product decisions came from this meeting:

  • Noticed pattern of auth issues → rebuilt authentication UX
  • Noticed pattern of data export failures → built better export system
  • Noticed pattern of integration errors → created integration testing framework

The ROI of Product Participation

From an engineering perspective, having product involved:

  • Reduces blame culture: Incidents become learning moments, not failures
  • Improves prioritization: Product understands why reliability work matters
  • Increases action item completion: Product-backed improvements get prioritized
  • Builds empathy: Product sees the complexity engineering deals with
  • Surfaces opportunities: Product finds business value in technical insights

Real Example: Payment Incident

We had a payment processing incident similar to yours. Database timeout during high load.

Engineering postmortem:

  • Connection pool exhaustion
  • Need better database scaling
  • Implement connection pooling improvements

Product insights (from participating in postmortem):

  • High load was from a single enterprise customer doing batch processing
  • They were using a workflow we didn’t design for
  • Opportunity: Build proper batch processing API

We built the batch API. That customer is now paying 40% more for the feature. And we’ve sold it to 3 other enterprise customers.

ROI of product attending that one postmortem: $300K+ ARR.

The Challenge: Scaling This Across Teams

At 80+ engineers, we can’t have product attend every postmortem. So we’ve tiered it:

  • P0: Product VP (me) + relevant PM attend
  • P1: Relevant PM attends
  • P2+: Engineering only, but product gets summary

And we have that monthly themes review to catch patterns.

The Cultural Shift

David, you mentioned the cultural benefit: product participation reduces engineering blame.

This is HUGE and I want to expand on it:

When incidents are engineering-only, there’s implicit blame: “Engineering broke something, again.”

When product is in the room, the framing shifts: “We’re all learning about our product and customers together.”

Engineering feels supported, not judged. Product feels informed, not excluded. Customer success feels heard, not ignored.

It’s the same incident, but the organizational response is healthier.

Recommendation for VPs of Engineering

If you’re not including product in incident reviews, try it for 3 months:

  1. Invite product to P0/P1 postmortems
  2. Structure the meeting to include business impact discussion
  3. Track product insights that emerge
  4. Measure: Do these insights lead to roadmap changes?

I bet you’ll see value within the first month.

For VPs of Product

If engineering isn’t inviting you to postmortems, ask to attend:

Frame it as: “I want to understand customer impact and usage patterns better. Can I join major incident reviews?”

Don’t frame it as oversight or blame. Frame it as learning.

Engineers will appreciate your curiosity.

The Competitive Advantage

Your point about competitive advantage is spot on. Most companies treat incidents as purely technical problems.

Companies that extract business intelligence from incidents move faster, build better products, and understand customers more deeply.

It’s a compounding advantage. Every incident teaches you something your competitors aren’t learning.

Final Thought

The best engineering-product partnerships I’ve seen all share this trait: they treat incidents as organizational learning moments, not engineering failures.

David, thank you for writing this. I’m going to share it with my product counterpart and suggest we double down on cross-functional incident learning.

This is the kind of collaboration that makes great products.

Love this cross-functional perspective! As someone who sits between design and product, I want to add: design should ALSO be in these conversations.

Design’s Role in Incident Learning

When I first started attending incident postmortems (at my current company), it was honestly by accident. I was in a meeting that ran long, and the postmortem started in the same room. I stayed.

Best accident ever.

What Design Learns from Incidents

Incidents reveal design failures that no amount of user testing will catch:

1. Error States We Didn’t Design

User testing shows you the happy path. Incidents show you the 47 different ways things can break.

Example from one incident I attended:

  • API timeout during checkout
  • Our design: Generic “Something went wrong” error
  • What actually happened: User tried 5 times (creating 5 duplicate orders)
  • What we learned: Our error message didn’t tell users if their order went through

We redesigned error states to be:

  • Specific (“Payment processing timed out”)
  • Actionable (“Please wait 2 minutes and check your email for confirmation before trying again”)
  • Informative (“If you see a charge, your order went through”)

This came directly from an incident review.

2. UX That Contributes to Incidents

Sometimes the design itself causes the incident.

Example: We had repeated incidents of customers misconfiguring a complex settings page. Each misconfiguration caused downstream errors.

Sitting in the postmortem, I realized: the settings page was confusing. Unclear labels, no validation, no helpful examples.

We redesigned it. Incidents from that source dropped to near-zero.

The UI could have prevented the incidents. But we didn’t know until we attended the postmortem.

3. What Users Actually Do vs. What We Think They Do

David’s point about real usage patterns - design needs this too.

We design based on personas and user research. But incidents reveal actual behavior in the wild.

Example: We designed a feature assuming users would do X, then Y, then Z. An incident revealed users were doing Z first (which broke our assumptions).

We redesigned the feature flow based on actual behavior, not assumed behavior.

4. System Constraints That Affect UX

Incidents teach design about technical constraints.

I used to propose features without understanding system limitations. Then I started attending postmortems and learned:

  • Why certain features are slow (database design)
  • Why some integrations are fragile (third-party reliability)
  • Why we can’t do real-time for everything (cost + complexity)

Now I design WITH constraints in mind, not against them.

The “Incident-Driven Design” Process

After attending a few postmortems, I formalized a process:

For every P0/P1 incident:

  1. Attend the postmortem (or read the doc thoroughly)
  2. Identify UX failures:
    • What did users experience?
    • What error messages did they see?
    • What actions did they try?
    • What was confusing?
  3. Propose design improvements:
    • Better error messages
    • Clearer flows
    • Validation to prevent misconfigurations
    • Graceful degradation during failures
  4. Prototype and test:
    • Design the improvement
    • Test with users
    • Ship it

Monthly “Incident UX Review”

Similar to David’s monthly themes review, I do a monthly UX audit of incidents:

  • Review all incidents from the past month
  • Look for UX patterns (similar error experiences, common user confusions)
  • Prioritize design improvements
  • Add to design roadmap

This has become one of my most valuable sources of design work.

Real Examples

Let me share concrete design improvements that came from incidents:

Incident: Upload timeout for large files
→ Design learned: Users don’t know what “large” means, so they try and fail
→ Design fix: Show file size limit upfront, add progress indicator, allow resume
→ Result: Support tickets for uploads dropped 60%

Incident: Configuration error caused data loss
→ Design learned: Settings page had no guardrails
→ Design fix: Added confirmation dialogs for destructive actions, preview before save, validation
→ Result: Zero configuration-related incidents since

Incident: Search overload from one customer
→ Design learned: Customer was using search as primary navigation (not the menu)
→ Design fix: Redesigned IA to make search more prominent, added search shortcuts
→ Result: Search usage up (intentional), but per-user search volume down (more efficient)

The Cross-Functional Triangle

David talks about engineering-product collaboration. I think it’s actually a triangle:

  • Engineering understands what broke and why (technical)
  • Product understands customer impact and opportunities (business)
  • Design understands user experience and prevention (UX)

All three perspectives make incident learning richer.

My Failed Startup Experience

I ran a startup that failed (long story), and one thing I learned: we didn’t pay attention to user behavior during errors.

We obsessed over happy path UX. We A/B tested button colors. We did elaborate user research.

But when things broke? We had terrible error handling, confusing messages, and users just churned.

If I’d attended our incident postmortems (we barely wrote them), I would have learned what users actually struggled with.

Recommendation for Design Leaders

If you’re a design lead and you’re not attending incident reviews:

  1. Ask to attend your next P0/P1 postmortem
  2. Focus on the user experience during the incident
  3. Propose design improvements
  4. Track whether those improvements get built

You’ll find a goldmine of UX work that actually matters to users.

For Engineering and Product

If design isn’t in your incident reviews, invite them:

Ask: “What was the user experience during this incident? How could design prevent or mitigate this?”

Design will have insights you haven’t considered.

The Learning Culture

What I love about this thread: everyone agrees incidents are learning opportunities, not just problems.

Engineering learns about systems.
Product learns about customers.
Design learns about UX failures.

Everyone benefits.

The Holistic View

David, your post is spot-on. But I’d expand it: The best product insights come from cross-functional incident reviews that include engineering, product, design, and customer success.

Each discipline brings a different lens. Together, you get a holistic understanding of what the incident reveals about your product, customers, and systems.

That’s how you turn expensive failures into competitive advantages.

This thread is making me rethink how we run incident reviews. David, you’re absolutely right that product insights from incidents are underutilized.

The Director’s Perspective: Making This Work at Scale

I lead 40+ engineers across multiple teams. Including product in every incident review isn’t scalable. But including them strategically is.

Our Tiered Approach to Cross-Functional Reviews

We’ve structured incident reviews to balance cross-functional learning with operational efficiency:

P0 (Critical):

  • Full postmortem with extended team
  • Attendees: Engineering leadership, relevant PM, design lead, customer success, sometimes CEO
  • Focus: In-depth learning across all dimensions (technical, product, UX, business)
  • Monthly frequency: 0-2 incidents

P1 (High Impact):

  • Standard postmortem with key stakeholders
  • Attendees: Engineering team, relevant PM, customer success rep
  • Focus: Technical depth + business impact
  • Monthly frequency: 2-4 incidents

P2 (Moderate):

  • Engineering-focused postmortem
  • Product gets summary + specific questions if needed
  • Focus: Technical improvements
  • Monthly frequency: 5-8 incidents

Monthly Themes Review:

  • Engineering + Product + Design + CS leadership
  • Review patterns across ALL incidents (P0-P2)
  • Focus: Organizational learnings and opportunities
  • This is where we catch patterns that individual reviews miss
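The tiering above amounts to a routing rule: severity in, audience out. A sketch of that rule (role names are illustrative):

```python
def review_attendees(severity):
    """Map incident severity to the cross-functional review audience described above."""
    tiers = {
        "P0": ["eng leadership", "relevant PM", "design lead", "customer success"],
        "P1": ["engineering team", "relevant PM", "customer success rep"],
    }
    # P2 and below: engineering-only review; product receives a summary.
    return tiers.get(severity, ["engineering team"])

print(review_attendees("P1"))
```

Encoding the rule, even informally, removes the per-incident debate about who should be in the room.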

Why This Works

This approach:

  • Respects everyone’s time (product doesn’t attend every minor incident)
  • Ensures critical incidents get full cross-functional attention
  • Creates a regular pattern-detection mechanism
  • Scales as we grow

The Translation Layer

One thing I’ve added: after every P1/P0 incident, engineering creates a 1-page “Business Impact Summary” for product.

It includes:

  • What broke (in plain language)
  • Customer impact (users affected, duration, severity)
  • What we learned about usage patterns
  • Product opportunities identified
  • What we’re fixing
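The summary fields above can be captured as a simple structured record, so every summary has the same shape; field names here are assumptions, not an existing template:

```python
from dataclasses import dataclass, field

@dataclass
class BusinessImpactSummary:
    """One-page summary engineering hands to product after a P0/P1."""
    what_broke: str                 # plain-language description
    users_affected: int
    duration_minutes: int
    severity: str
    usage_patterns_learned: list = field(default_factory=list)
    product_opportunities: list = field(default_factory=list)
    fixes_in_flight: list = field(default_factory=list)

summary = BusinessImpactSummary(
    what_broke="Payments were down; checkout failed for affected users",
    users_affected=1200, duration_minutes=90, severity="P0",
    product_opportunities=["batch processing API"],
)
print(summary.severity)  # → P0
```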

This ensures product gets insights even if they can’t attend every postmortem.

Real Example: The Quarterly Business Review

Every quarter, I present “Incident Learnings” to the exec team. It includes:

Technical metrics:

  • Incident frequency trends
  • MTTR improvements
  • Repeat incident reduction

Business insights:

  • Product opportunities discovered through incidents
  • Customer usage patterns revealed
  • Revenue protected through reliability improvements
  • ROI of infrastructure investments

This presentation has fundamentally changed how leadership views reliability work.

Example slide:
"Q4 Incidents Revealed:

  • Enterprise customers need batch processing (→ $300K new ARR)
  • Mobile users access the app during commutes (→ offline mode priority)
  • Customers export 10x more data than designed (→ async export feature)"

Incidents aren’t just costs - they’re business intelligence.

The Culture Change

Getting product involved changed our engineering culture:

Before: Incidents were failures. Engineers felt defensive.
After: Incidents are learning moments. Engineers feel curious.

When product says “this incident helped us understand our customers,” it reframes the narrative from blame to learning.

The Challenge: Busy Product Teams

I’ve heard PMs say: “I don’t have time to attend incident reviews.”

My response: “You have time for user research, right? Incidents are expensive user research you’ve already paid for.”

That usually gets their attention.

Making It Practical

To make product participation work:

  1. Be selective: Don’t invite them to every incident
  2. Be respectful of time: Structured meetings, clear agenda, 45 minutes max
  3. Be clear on value: “We need your perspective on customer impact and opportunities”
  4. Follow through: When product identifies opportunities, prioritize them
  5. Share results: Show how incident insights led to successful features

The Monthly Themes Review Structure

This meeting is the highest ROI hour we spend. Here’s how we run it:

Agenda (60 minutes):

1. Incident Overview (10 min):

  • Quick summary: How many incidents, what types, which teams
  • Trend: Are we improving or degrading?

2. Pattern Analysis (20 min):

  • What patterns emerged across incidents?
  • Similar root causes, customer impacts, or usage patterns
  • Engineering presents analysis, everyone discusses

3. Product Opportunities (15 min):

  • What did incidents reveal about customers?
  • Product leads discussion
  • Identify 2-3 opportunities to explore

4. Process Improvements (10 min):

  • What process changes would prevent categories of incidents?
  • Cross-functional perspective

5. Action Items (5 min):

  • What will we do based on learnings?
  • Who owns what?

Metrics We Track

To demonstrate value of cross-functional incident learning:

  • Product features shipped from incident insights: Track how many features originated from incident learnings
  • Revenue from incident-driven features: Quantify business value
  • Process improvements implemented: Cross-functional improvements made
  • Repeat incident reduction: Are we learning effectively?

After one year:

  • 7 features shipped from incident insights
  • $600K+ ARR attributable to incident-driven features
  • 25 process improvements implemented
  • 60% reduction in repeat incidents

This data convinced skeptics that cross-functional incident learning is worth the investment.

The Organizational Learning System

What David is describing is really an organizational learning system:

  1. Capture: Document incidents thoroughly
  2. Analyze: Look for technical and business insights
  3. Share: Cross-functional review and discussion
  4. Act: Implement improvements (technical and product)
  5. Measure: Track whether learnings lead to better outcomes

Most companies only do steps 1-2. Adding steps 3-5 is where the real value is.

Recommendation for Directors

If you’re an engineering director:

  1. Implement tiered incident reviews (not everyone needs to attend everything)
  2. Create a monthly cross-functional themes review
  3. Track product opportunities that emerge from incidents
  4. Quantify the business value of incident learning
  5. Present this to leadership regularly

Show that reliability work isn’t just cost - it’s competitive advantage.

For Product Leaders

If you’re a product director:

  1. Attend a few major incident postmortems (see what you learn)
  2. Ask your engineering counterpart to summarize incident insights monthly
  3. Track which roadmap items came from incident learnings
  4. Celebrate when incident insights lead to successful features

Make incident learning a visible part of product process.

The Strategic View

David’s post reframes incidents from “engineering problems” to “organizational learning opportunities.”

This is the mindset shift that separates good companies from great ones.

Great companies learn faster. They extract insights from every signal. They turn failures into competitive advantages.

Incidents are loud, expensive signals. Companies that learn from them comprehensively - across engineering, product, design, and business - move faster than companies that treat them as pure technical problems.

Final Thought

This thread (and the others in this discussion) represent a mature view of incident management:

  • Blameless culture (Luis’s original thread)
  • Psychological safety (Keisha’s thread)
  • Appropriate effort (Alex Infrastructure’s thread on postmortem tiers)
  • Proactive learning (Alex Dev’s thread on game days)
  • Cross-functional insights (David’s thread here)

Together, these practices create an organizational learning system that compounds over time.

That’s how you build a culture of excellence.

As someone who builds the infrastructure and tooling, I want to add: the technical systems we build should SUPPORT this cross-functional learning, not just capture technical data.

Building for Cross-Functional Incident Learning

Most incident tracking systems are built by engineers, for engineers. They capture:

  • Technical details
  • System metrics
  • Root cause
  • Action items

But they don’t capture:

  • Customer impact in business terms
  • Product insights
  • UX failures
  • Revenue/business implications

We need better tools.

The Incident Insights Dashboard

After reading David’s perspective, I built a dashboard that bridges technical and business data:

For each incident, we now track:

Technical (engineering):

  • System affected
  • Root cause category
  • Time to detect, time to resolve
  • Repeat vs novel incident

Business (product):

  • Customer impact (users affected, duration)
  • Revenue impact (estimated)
  • Support tickets generated
  • Customer tier affected (enterprise vs SMB)

Insights (cross-functional):

  • Usage patterns discovered
  • Product opportunities identified
  • UX failures noted
  • Process gaps revealed
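The three sections above suggest a per-incident record shape. A sketch of that schema, with illustrative field names and sample values (this is not the API of any existing incident tool):

```python
# Illustrative schema for one incident record in the dashboard.
incident = {
    "technical": {
        "system": "payments-db",
        "root_cause": "connection pool exhaustion",
        "minutes_to_detect": 12,
        "minutes_to_resolve": 90,
        "repeat": False,
    },
    "business": {
        "users_affected": 1200,
        "estimated_revenue_impact_usd": 15000,
        "support_tickets": 85,
        "customer_tier": "enterprise",
    },
    "insights": {
        "usage_patterns": ["high-frequency API polling by one customer"],
        "product_opportunities": ["batch processing API"],
        "ux_failures": [],
        "process_gaps": [],
    },
}

def product_opportunity_identified(record):
    """True when the insights section lists at least one product opportunity."""
    return bool(record["insights"]["product_opportunities"])

print(product_opportunity_identified(incident))  # → True
```

Keeping the business and insights sections alongside the technical ones is what lets non-engineers query the same data.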

The Visualization

The dashboard shows:

  • Incident frequency over time (standard)
  • Customer impact trends (which customer segments are affected most)
  • Revenue protected by reliability improvements
  • Product features shipped from incident insights
  • ROI of infrastructure investments

This makes the business value of reliability visible to non-engineers.

Example: Customer Segmentation View

One view shows: “Which customer segments experience the most incidents?”

We discovered enterprise customers were disproportionately affected (3x more incidents than SMB customers). Why? They use more advanced features that receive less test coverage.

This led to:

  • Dedicated enterprise testing environment
  • Enterprise-specific SLAs
  • Tiered infrastructure (enterprise customers on more robust systems)

Product and engineering made this decision together based on incident data.

Incident-to-Feature Tracking

We now tag incidents with: “Product opportunity identified: Yes/No”

If yes, we create a linked Jira ticket for product to evaluate.

We track:

  • How many incidents generate product opportunities
  • How many opportunities become features
  • Revenue from incident-driven features

This quantifies the value of product participation in incident reviews.
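Those three metrics fall out of the incident records directly. A rough sketch of the rollup, assuming each incident carries an illustrative `opportunity` flag, a `shipped_feature` flag, and a `feature_revenue` figure:

```python
def feature_pipeline_stats(incidents):
    """Summarize how often incidents become opportunities and features.

    `incidents` is a list of dicts with illustrative keys:
      {"opportunity": bool, "shipped_feature": bool, "feature_revenue": float}
    """
    total = len(incidents)
    with_opportunity = [i for i in incidents if i["opportunity"]]
    shipped = [i for i in with_opportunity if i["shipped_feature"]]
    return {
        # Share of incidents that generated a product opportunity
        "opportunity_rate": len(with_opportunity) / total if total else 0.0,
        # Share of opportunities that actually became features
        "conversion_rate": (len(shipped) / len(with_opportunity)
                            if with_opportunity else 0.0),
        # Revenue attributed to incident-driven features
        "feature_revenue": sum(i["feature_revenue"] for i in shipped),
    }
```

Even a crude rollup like this is enough to answer leadership's question: "is product time in incident reviews worth it?"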

The Automation Angle

Some insights can be automated:

Pattern Detection:

  • Automatically flag when similar incidents occur multiple times
  • Surface: “This is the 4th database timeout this month”
  • Suggest: “Pattern detected - recommend full postmortem”
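The core of that pattern detector is just a count over root-cause categories. A minimal sketch, assuming each incident dict has an illustrative `category` key:

```python
from collections import Counter

def detect_repeat_patterns(incidents, threshold=3):
    """Flag root-cause categories that recur within a window of incidents.

    `incidents` is a list of dicts with an illustrative "category" key;
    returns surfaced messages like the dashboard would show.
    """
    counts = Counter(i["category"] for i in incidents)
    alerts = []
    for category, n in counts.items():
        if n >= threshold:
            alerts.append(
                f"Pattern detected: {n} '{category}' incidents this period "
                f"- recommend full postmortem"
            )
    return alerts
```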

Customer Impact Scoring:

  • Automatically calculate business impact based on users affected, duration, customer tier
  • Prioritize incidents by business impact, not just technical severity
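One way to express that scoring: multiply users affected by duration, weighted by customer tier. The weights below are illustrative, not a standard formula; tune them to your own revenue mix:

```python
def business_impact_score(users_affected, duration_minutes, customer_tier):
    """Score an incident by business impact rather than technical severity.

    Tier multipliers are illustrative assumptions, not standard values.
    """
    tier_weight = {"enterprise": 3.0, "smb": 1.0}.get(customer_tier, 1.0)
    return users_affected * duration_minutes * tier_weight
```

Note how the tier weight can flip priorities: a short enterprise incident can outrank a longer, larger SMB one.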

Usage Anomaly Detection:

  • Flag when incident traffic patterns reveal unexpected usage
  • Example: “This API endpoint received 100x normal traffic during the incident”
  • Prompt: “Is this expected usage? Should product investigate?”
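The anomaly check is a ratio of incident-window traffic to a per-endpoint baseline. A sketch, with all endpoint names and the 10x factor as illustrative assumptions:

```python
def flag_traffic_anomalies(endpoint_traffic, baselines, factor=10.0):
    """Compare per-endpoint traffic during an incident to its baseline.

    Returns prompts for endpoints whose incident traffic is at least
    `factor` times the baseline. Endpoints with no baseline are skipped.
    """
    prompts = []
    for endpoint, observed in endpoint_traffic.items():
        baseline = baselines.get(endpoint, 0)
        if baseline and observed / baseline >= factor:
            multiple = observed / baseline
            prompts.append(
                f"{endpoint} received {multiple:.0f}x normal traffic during "
                f"the incident. Is this expected usage? Should product "
                f"investigate?"
            )
    return prompts
```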

The Notification System

When a P0/P1 incident occurs:

  • Engineering gets technical alerts (standard)
  • Product gets business impact summary (users affected, estimated revenue impact)
  • Customer success gets customer communication template
  • Leadership gets executive summary

Different stakeholders need different information. Our tools should provide that.
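That routing is essentially one incident rendered through four audience templates. A minimal sketch; the incident keys and message wording are illustrative, not our actual templates:

```python
def build_notifications(incident):
    """Render one incident into per-audience messages.

    `incident` is a dict with illustrative keys (severity, system,
    technical_summary, users_affected, revenue_impact, start, end,
    duration_minutes).
    """
    return {
        # Engineering: technical alert (standard)
        "engineering": (f"[{incident['severity']}] {incident['system']}: "
                        f"{incident['technical_summary']}"),
        # Product: business impact summary
        "product": (f"{incident['users_affected']} users affected, "
                    f"est. revenue impact ${incident['revenue_impact']:,.0f}"),
        # Customer success: communication template to adapt
        "customer_success": (f"Template: We experienced an issue with "
                             f"{incident['system']} from {incident['start']} "
                             f"to {incident['end']}."),
        # Leadership: executive summary
        "leadership": (f"{incident['severity']} incident, "
                       f"{incident['duration_minutes']} min, "
                       f"est. ${incident['revenue_impact']:,.0f} impact."),
    }
```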

The Monthly Report

We auto-generate a monthly “Incident Intelligence” report:

For engineering:

  • Technical trends
  • Repeat incidents
  • System reliability scores

For product:

  • Customer impact patterns
  • Usage insights discovered
  • Product opportunities identified

For leadership:

  • Reliability trends
  • Business impact
  • ROI of reliability investments

Same data, different perspectives.

Real Example: The Export System Insight

Remember David’s example about discovering customers export way more data than expected?

Our dashboard surfaced this automatically:

  • Incident: Database slowdown during data exports
  • Automated insight: “Export API traffic 10x higher than baseline”
  • Product opportunity flag: Created Jira ticket for PM
  • Result: Built async export system, now a paid feature

The tooling helped surface the insight without anyone manually connecting the dots.

Integration with Business Tools

Our incident system integrates with:

  • Jira: Automatically create product opportunity tickets
  • Looker: Pull revenue impact data for business analysis
  • Salesforce: Link incidents to customer accounts
  • Zendesk: Track support tickets related to incidents

This creates a unified view across technical and business systems.

The API for Cross-Functional Access

We built an API so non-engineers can query incident data:

  • Product can ask: “What incidents affected feature X in the last quarter?”
  • Customer success can ask: “What incidents affected customer Y?”
  • Finance can ask: “What was total revenue impact of incidents this quarter?”

Democratizing access to incident data enables cross-functional learning.
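The first and third of those queries are simple filters over the incident records. A sketch of what the API answers under the hood, with all dict keys illustrative (dates as ISO strings so plain string comparison works):

```python
def incidents_for_feature(incidents, feature, since):
    """Answer: 'What incidents affected feature X since a given date?'

    `incidents` is a list of dicts with illustrative "features" (list),
    "date" (ISO string), and "revenue_impact" keys.
    """
    return [
        i for i in incidents
        if feature in i["features"] and i["date"] >= since
    ]

def revenue_impact_total(incidents, since):
    """Answer: 'What was the total revenue impact of incidents since a date?'"""
    return sum(i["revenue_impact"] for i in incidents if i["date"] >= since)
```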

Cost Tracking

We track the cost of incidents in business terms:

  • Engineering time spent (hours × average hourly rate)
  • Revenue lost (estimated based on downtime and affected users)
  • Support cost (tickets generated × cost per ticket)
  • Customer churn risk (quantified based on historical data)
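The cost side is straightforward arithmetic over those four components. A sketch; every rate here (hourly rate, cost per ticket, churn-risk figure) is an input you'd estimate for your own org, not a standard number:

```python
def incident_cost(engineering_hours, hourly_rate, revenue_lost,
                  tickets, cost_per_ticket, churn_risk):
    """Sum the business cost of one incident from the tracked components.

    All rates are org-specific estimates supplied by the caller.
    """
    return (engineering_hours * hourly_rate   # engineering time
            + revenue_lost                    # downtime revenue impact
            + tickets * cost_per_ticket       # support load
            + churn_risk)                     # quantified churn exposure
```

For example, 20 engineering hours at $150/hr plus $5,000 lost revenue, 40 tickets at $25 each, and $2,000 of churn risk totals $11,000 for a single incident.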

Then we track the value of improvements:

  • Revenue protected by reliability improvements
  • Revenue generated by incident-driven features
  • Time saved by process improvements

This shows the ROI of reliability work in language leadership understands.

The Feedback Loop

When product ships a feature based on an incident insight:

  • We update the original incident record: “Led to feature X”
  • We track feature success: Revenue, adoption, customer satisfaction
  • We calculate ROI: Cost of incident → Revenue from feature
  • We share success stories: “This incident led to $300K feature”

This creates a positive feedback loop: incidents → insights → features → revenue → more investment in reliability.

Recommendation for Infrastructure Engineers

If you’re building incident management systems:

  1. Design for cross-functional use: Not just engineering data
  2. Capture business impact: Users, revenue, customer segments
  3. Track insights and opportunities: What did we learn beyond technical?
  4. Automate pattern detection: Surface insights automatically
  5. Integrate with business tools: Connect technical and business data
  6. Visualize for different audiences: Engineering, product, leadership
  7. Quantify ROI: Show business value of reliability work

The Meta Point

David’s post is about changing how we THINK about incidents. I’m adding: we should also change how we TOOL for incidents.

The systems we build shape how organizations learn. If our tools only capture technical data, organizations will only learn technical lessons.

If our tools capture business impact, usage patterns, product opportunities, and cross-functional insights - organizations will learn holistically.

Tools enable culture. Build tools that support cross-functional incident learning, and you’ll get better organizational learning.

Great discussion, everyone. This thread has given me ideas for dashboard improvements I’m going to build.