Every Service Observable By Default: Platform Teams Now Own the Entire Monitoring Stack?

At my last executive review, the board asked: “How do you know if a service is healthy?” My answer used to be: “Developers instrument their code and set up dashboards.” Now it’s: “Platform provides observability by default.”

That shift represents a fundamental change in who owns the monitoring stack—and it’s raising new questions about scope, cost, and customization.

The Old Model: Developer-Owned Observability

Five years ago, observability was the developer’s job:

  • Add logging statements to code
  • Choose a metrics library (Prometheus client, StatsD, whatever)
  • Set up Grafana dashboards manually
  • Configure alerting rules in PagerDuty
  • Hope you got it right before production issues hit

Problems with this approach:

  1. Inconsistency: Every team used different tools and formats
  2. Blind spots: Services shipped without monitoring because “we’ll add it later”
  3. Debugging nightmares: Correlating logs/metrics/traces across 47 different formats
  4. Incident response chaos: No standard dashboards, every outage started with “where are the logs?”

The New Model: Platform-Provided Observability

Now platform teams are moving to: “Every service is observable by default, with zero developer effort.”

What this looks like in practice:

Automatic instrumentation:

  • Platform injects OpenTelemetry agents into containers automatically
  • Logs, metrics, traces collected without code changes
  • Standardized format across all services
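
For the curious, here is a minimal Python sketch of what the injected agent effectively does at startup. The collector endpoint and service name are placeholders; in practice the platform wires this up through the OpenTelemetry Operator or a launcher such as opentelemetry-instrument, so developers never see it:

  # A minimal sketch of what auto-injected instrumentation sets up at startup.
  # The collector endpoint and service name are placeholders, not real values.
  from opentelemetry import trace
  from opentelemetry.sdk.resources import Resource
  from opentelemetry.sdk.trace import TracerProvider
  from opentelemetry.sdk.trace.export import BatchSpanProcessor
  from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

  resource = Resource.create({"service.name": "checkout-service"})  # stamped per service
  provider = TracerProvider(resource=resource)
  provider.add_span_processor(
      BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
  )
  trace.set_tracer_provider(provider)  # instrumented libraries now emit spans automatically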

Standardized dashboards:

  • Every service gets: Golden Signals dashboard (latency, traffic, errors, saturation)
  • Auto-generated SLO tracking
  • Dependency maps showing service relationships
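
The “auto-generated SLO tracking” is less magical than it sounds; the dashboards sit on simple error-budget math. A sketch with illustrative numbers (the 99.9% target and the observed error rate are made up):

  # Error-budget math behind an auto-generated SLO dashboard.
  # The target and the observed error rate are illustrative, not real figures.
  slo_target = 0.999                  # 99.9% availability over a 30-day window
  window_hours = 30 * 24
  error_budget = 1.0 - slo_target     # fraction of requests allowed to fail: 0.1%
  observed_error_rate = 0.0005        # from the golden-signals error metric

  burn_rate = observed_error_rate / error_budget   # 0.5 here: burning at half speed
  hours_to_exhaustion = window_hours / burn_rate   # 1440h: budget outlives the window
  print(f"burn rate {burn_rate:.2f}, budget exhausted in {hours_to_exhaustion:.0f}h")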

Integrated alerting:

  • Default alerts for critical metrics (error rate spike, latency P99 regression)
  • Routes to team-specific Slack channels
  • Integration with incident management (PagerDuty, Opsgenie)
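
A default alert can be as boring as a ratio check. A sketch of the kind of rule every service might inherit (the 5% threshold and 5-minute window are hypothetical defaults, not a recommendation):

  # Sketch of a default error-rate alert; the threshold and window are hypothetical.
  def error_rate_breached(errors_5m: int, requests_5m: int,
                          threshold: float = 0.05) -> bool:
      """Fire when more than `threshold` of the last 5 minutes of requests failed."""
      if requests_5m == 0:
          return False  # no traffic, nothing to page about
      return errors_5m / requests_5m > threshold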

Unified view:

  • One place to see logs, metrics, traces for any service
  • Correlation across observability signals
  • Faster incident response because tooling is consistent

The Benefits Are Obvious

Developers love it:

  • Ship code without thinking about instrumentation
  • Observability “just works”
  • Faster debugging because tooling is familiar

Ops teams love it:

  • No more “where are your metrics?” during incidents
  • Consistent telemetry across the organization
  • Easier on-call rotations because dashboards are standardized

Leadership loves it:

  • Better visibility into system health
  • Faster incident resolution
  • Compliance and audit trail automatically maintained

The Challenges Nobody Talks About

But here’s where it gets complicated:

Challenge 1: Platform Team as Bottleneck

Scenario: Product team needs custom metric for business logic (“number of premium subscriptions active”).

Old model: Developer adds counter, builds dashboard, done.

New model: Request goes to platform team → evaluated for standardization → maybe added to platform capabilities → 2-week turnaround.

Developer frustration: “I could’ve built this in 30 minutes, now I’m blocked waiting for platform approval.”

How much customization should the platform expose, and how much should it standardize?

Challenge 2: Cost Explosion

Observability tools are EXPENSIVE at scale:

  • Datadog: $15-31 per host per month, more for APM and logs
  • New Relic: Similar pricing, costs climb with data volume
  • Elastic: Self-hosted reduces fees but increases operational burden

When you instrument EVERYTHING by default, data volume skyrockets.

Our observability bill doubled in 6 months after rolling out platform-wide automatic instrumentation. Finance is now asking: “Do we really need DEBUG logs from every service?”

The platform team is caught between comprehensive observability and budget constraints.

Challenge 3: The Build vs Buy Decision

Option 1: Build observability platform in-house

Pros:

  • Full customization
  • No per-host licensing fees
  • Data stays in-house (compliance, security)

Cons:

  • Massive engineering effort (Prometheus, Grafana, Loki, Tempo, alerting, etc.)
  • Operational burden maintaining the stack
  • Scaling challenges as data volume grows

Option 2: Buy managed observability (Datadog, New Relic, Honeycomb)

Pros:

  • Turnkey solution
  • Vendor handles scaling, reliability
  • Rich feature set out of the box

Cons:

  • Expensive at scale
  • Vendor lock-in
  • Data leaves your infrastructure

We started with build and moved to buy because the operational burden of running our own observability stack outweighed the cost of paying for Datadog. But that six-figure annual bill makes the CFO unhappy.

Challenge 4: Standardization vs Flexibility

Platform-provided observability requires standardization. But not all services fit the mold:

  • Batch jobs: Don’t fit request/response golden signals model
  • Data pipelines: Need data quality metrics, not HTTP latency
  • ML inference services: Custom metrics for model performance

Do we:

  1. Force everything into standardized golden signals?
  2. Build escape hatches for custom observability?
  3. Support multiple observability patterns within the platform?

Each approach has trade-offs between consistency and flexibility.

What I’m Wrestling With

For platform teams providing observability:

  1. Customization boundaries: Where do you draw the line between standardized platform observability and team-specific custom metrics/dashboards?

  2. Cost management: How do you balance comprehensive observability with budget reality? Sampling? Data retention policies? Tiered offerings?

  3. Build vs buy: At what scale does in-house observability become more cost-effective than managed solutions? Or is vendor-managed always worth it?

  4. Operational ownership: Does platform team own the observability stack’s uptime and performance? What happens when Datadog has an outage and you can’t see your systems?

For engineering leaders: Is platform-owned observability the right model? Or should teams retain observability as their responsibility, with platform providing standards/tooling?

These aren’t rhetorical questions—I’m genuinely trying to figure out the right approach as our scale grows.


Michelle, this is a textbook case for applying product thinking to internal platforms. Observability isn’t infrastructure—it’s a product feature. And like any product, you need to measure adoption and value.

Treat Observability as a Product, Not Infrastructure

Your bottleneck problem (“2-week turnaround for custom metrics”) screams product-market-fit failure.

Wrong framing: “How do we standardize all observability needs?”
Right framing: “What observability capabilities deliver the most value to the most teams?”

The 80/20 Rule for Platform Capabilities

Platform should provide the 80% use case out of the box:

Tier 1 - Automatic (covers 80% of needs):

  • Golden signals (latency, traffic, errors, saturation)
  • Standard SLO dashboards
  • Default alerting on critical metrics
  • Log aggregation and search

Zero customization, zero developer effort, just works.

Tier 2 - Self-service customization (covers 15% of needs):

  • Custom metrics using standard instrumentation libraries
  • Dashboard templates developers can clone and modify
  • Alert rule builder with sensible defaults

Light customization, no platform team dependency.

Tier 3 - Bespoke (covers 5% of needs):

  • Unique observability patterns for ML, batch jobs, data pipelines
  • Requires platform team consultation
  • Evaluated case-by-case for potential standardization

Heavy customization, platform team involved.

Your Custom Metric Example

“Number of premium subscriptions active” is a business metric, not an infrastructure metric.

Platform’s job: Provide the mechanism for developers to expose custom business metrics (e.g., the OpenTelemetry SDK, a standard metrics endpoint)

Developer’s job: Instrument the business logic

Platform doesn’t need to approve every custom metric. Just provide the plumbing.
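
When the plumbing is OpenTelemetry, the developer side really is a few lines. A sketch of what instrumenting that subscription metric could look like (the meter name, metric name, and plan attribute are invented for illustration):

  # Sketch: a developer-owned business metric on platform-provided plumbing.
  # The meter name, metric name, and "plan" attribute are invented for illustration.
  from opentelemetry import metrics

  meter = metrics.get_meter("billing")
  active_premium = meter.create_up_down_counter(
      "premium_subscriptions_active",
      unit="{subscription}",
      description="Currently active premium subscriptions",
  )

  def on_subscription_started(plan: str) -> None:
      active_premium.add(1, {"plan": plan})

  def on_subscription_cancelled(plan: str) -> None:
      active_premium.add(-1, {"plan": plan})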

Cost Control Through Product Tiers

Your Datadog bill doubled because you’re treating all data equally. Product thinking suggests tiered observability based on value.

Production services: Full observability (all logs, metrics, traces, long retention)
Staging services: Reduced observability (sampled traces, shorter retention)
Development services: Minimal observability (errors only, 7-day retention)

Developers can opt into higher tiers for specific services by justifying the cost. Make cost visible in the developer experience.

The Real Metric: Developer Satisfaction

Here’s what I’d measure:

  1. Time to first meaningful dashboard for a new service (should be < 5 minutes)
  2. Adoption rate of platform-provided observability vs custom solutions
  3. Mean time to detect incidents before/after platform observability
  4. Developer satisfaction score for observability tools

If developers are satisfied and adopting platform observability, you’re succeeding. If they’re building shadow observability systems because platform doesn’t meet needs, you’re failing.

The build vs buy question should be driven by: Does this capability differentiate our platform experience? Or is it a commodity we should outsource?

Datadog/New Relic/Honeycomb are commodities. Your INTEGRATION of observability into developer workflow is the differentiator.

Michelle, we went through exactly this journey. Started with build, moved to buy, and learned expensive lessons about operational burden vs licensing costs.

Our Build vs Buy Journey

2022: Built in-house observability stack

Stack:

  • Prometheus for metrics
  • Grafana for visualization
  • Loki for logs
  • Tempo for traces
  • Alertmanager for alerting

Cost: tens of thousands of dollars a year in infrastructure, plus 2 FTE maintaining it

Why we thought it made sense: “We’re saving hundreds of thousands on licensing!”

2024: Reality hit

  • Prometheus retention challenges at scale → needed Thanos, more complexity
  • Grafana performance issues with 500+ dashboards
  • Loki ingestion bottlenecks during log spikes
  • Tempo storage costs exploding with trace volume
  • 2 FTE became 4 FTE just to keep the stack running

True cost: infrastructure spend plus four fully-loaded engineer salaries, adding up to more per year than the licensing fees we thought we were avoiding

And still missing features Datadog had out of the box (anomaly detection, APM, synthetic monitoring, etc.)

2025: Migrated to Datadog

Initial sticker shock: a six-figure annual licensing bill
Reality: Freed up 4 engineers to work on platform capabilities
Actual savings: licensing came in well below the fully-loaded cost of the DIY stack

Plus intangibles:

  • Incident resolution faster (better tooling)
  • Developer satisfaction higher (familiar tool)
  • Compliance easier (vendor handles SOC2 for observability stack)

The Break-Even Point

Here’s my rough math on when to build vs buy:

Build makes sense when:

  • Unique observability requirements vendor tools don’t support
  • Data compliance requires on-prem or specific regions
  • Scale is SO massive that per-host pricing becomes prohibitive (e.g., you’re running 10,000+ hosts)

Buy makes sense when:

  • You’re under 1,000 hosts
  • Engineering time is more valuable than licensing fees
  • Vendor features (APM, anomaly detection, ML insights) would take years to build

We were at 300 hosts. Building was hubris, not economics.

Your Cost Doubling Problem

A bill that doubles after you instrument everything suggests one thing: You’re sending too much data.

Cost optimization strategies we used:

  1. Sampling: 100% of errors, 10% of successful requests for traces
  2. Retention tiers: 7 days full resolution, 30 days downsampled, 1 year aggregates
  3. Log level discipline: WARN/ERROR to centralized logging, DEBUG stays local
  4. Metric cardinality limits: Block high-cardinality tags that explode costs

Reduced our bill 40% without meaningful observability loss.
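
Strategy 1 is the easiest to show. A simplified version of the sampling decision, assuming you can see whether a trace contains an error before deciding (in practice this usually lives in the OpenTelemetry Collector’s tail sampler, not in application code):

  # Sketch of strategy 1: keep all error traces, sample 10% of successful ones.
  # Real setups typically make this decision in the collector's tail sampler.
  import random

  def keep_trace(has_error: bool, success_sample_rate: float = 0.10) -> bool:
      if has_error:
          return True  # 100% of errors are kept
      return random.random() < success_sample_rate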

The Ownership Question

You asked: Does platform team own observability stack uptime?

Yes, if you built it. If Prometheus is down, you’re on the hook. That’s operational burden.

Shared, if you bought it. If Datadog has an outage, you escalate to the vendor but also maintain fallback dashboards. Less operational burden, more vendor relationship management.

We sleep better knowing Datadog has SREs ensuring their uptime. Our platform team focuses on integration, not operating yet another distributed system.

The UX perspective on observability dashboards is something I wish more platform teams thought about. Michelle, your standardization vs flexibility tension is a DESIGN problem, not just a technical one.

Observability Dashboards Need Design Thinking

I see platform teams build observability dashboards like this:

  • 47 metrics crammed on one screen
  • No hierarchy, no visual priority
  • Technical jargon everywhere (“P99 latency,” “request throughput,” “saturation”)
  • Developers need to be SRE experts to interpret

This is like giving every developer a circuit diagram when they just want to know “is my service healthy?”

The Design Systems Parallel (Again)

In design systems, we learned: Progressive disclosure of complexity.

Simple dashboard for everyone:

  • Traffic light indicators (green = healthy, yellow = degraded, red = critical)
  • One-sentence status (“Service is healthy, 0 errors in past hour”)
  • Click for details if needed

Detailed dashboard for on-call:

  • All the metrics, properly organized
  • Clear visual hierarchy (most important metrics prominent)
  • Contextual help text explaining what metrics mean

Expert dashboard for platform team:

  • Raw Prometheus queries
  • Full customization
  • Assume deep technical knowledge

One service, three dashboard tiers. Most developers never need tier 3.
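
The tier-1 view can literally be a threshold function feeding a traffic light. A sketch with invented thresholds (real ones would come from each service’s SLOs):

  # Sketch of the tier-1 "traffic light"; thresholds are invented for illustration.
  def service_status(error_rate: float, p99_latency_ms: float) -> str:
      if error_rate > 0.05 or p99_latency_ms > 2000:
          return "critical"  # red: page someone
      if error_rate > 0.01 or p99_latency_ms > 1000:
          return "degraded"  # yellow: look when convenient
      return "healthy"       # green: one sentence, no charts needed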

Your Customization Bottleneck

Developer wants custom metric for “premium subscriptions active.”

Current process: Submit request → platform team evaluates → 2 weeks
Better process: Self-service metric instrumentation with design patterns

Provide:

  • OpenTelemetry SDK as standard library
  • Dashboard templates developers can clone
  • Design patterns for common metric types (counters, gauges, histograms)

Platform team curates patterns, developers customize within patterns.

Like design tokens: You don’t need designer approval to use brand colors, because colors are standardized. Similarly, developers shouldn’t need platform approval for standard metric patterns.
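
In code, the curated patterns can be as small as three documented recipes developers copy. A sketch using OpenTelemetry (the service and metric names are illustrative):

  # Sketch of curated metric patterns; service and metric names are illustrative.
  from opentelemetry import metrics

  meter = metrics.get_meter("order-service")

  orders_total = meter.create_counter(         # pattern 1: counter, only goes up
      "orders_total", description="Orders placed")
  queue_depth = meter.create_up_down_counter(  # pattern 2: gauge-like, up and down
      "queue_depth", description="Jobs waiting in the work queue")
  request_latency = meter.create_histogram(    # pattern 3: histogram, distributions
      "request_latency_ms", unit="ms", description="Request latency")

  orders_total.add(1, {"tier": "premium"})
  queue_depth.add(-1)
  request_latency.record(42.0, {"route": "/checkout"})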

The Aesthetics of Observability

This might sound superficial, but: Good-looking dashboards get used. Ugly dashboards get ignored.

Our incident postmortems revealed: Dashboards existed for services, but on-call engineers weren’t using them because they were overwhelming and ugly.

We redesigned with:

  • Clean visual hierarchy
  • Color coding that means something (not random colors)
  • Annotations explaining what normal looks like
  • Mobile-friendly views (because incidents happen at 2am)

Dashboard usage went from 40% to 85% of on-call incidents. Same data, better design.

Standardization That Doesn’t Feel Restrictive

Michelle asked: How much customization vs standardization?

Design systems answer: Standardize the system, customize within the system.

Standardize:

  • Metric collection mechanism (OpenTelemetry)
  • Data format and storage
  • Dashboard design patterns and visual language
  • Alerting integration

Customize:

  • Which metrics to track
  • Dashboard layout and widgets
  • Alert thresholds
  • SLO definitions

Feels flexible to developers, maintains consistency for platform team.

The key: Make customization EASY within the standardized system, so developers don’t feel constrained.

Michelle, your challenge resonates deeply. We went from 3 different observability tools to 1 unified platform standard, and cut costs 40% while IMPROVING developer experience. Let me share the scaling lessons.

The Multi-Tool Chaos We Had

Before platform standardization:

  • Team A used Datadog (because they came from a startup that used it)
  • Team B used New Relic (because that’s what the acquired company used)
  • Team C used ELK stack (because “we built it ourselves”)

Results:

  • 3 different vendor bills totaling a six-figure sum per year
  • Incident response nightmare (which tool has logs for this service?)
  • Zero ability to correlate across teams
  • Platform team supporting 3 different integrations

The Consolidation Decision

We picked Datadog as single observability platform. Controversial at the time.

Team C (ELK users) complained: “We already built this! Why pay for Datadog?”

My answer: “You built log aggregation. Datadog provides APM, distributed tracing, synthetic monitoring, anomaly detection, and 500+ integrations. And we won’t have to maintain it.”

The Migration Strategy

Phase 1: Platform provides both

  • Datadog integration standardized in deployment pipeline
  • Legacy tools still available for existing services
  • No forced migration, just make Datadog easier for new services

Phase 2: Incentivize migration

  • New platform features only work with Datadog
  • Cost showback: Teams using legacy tools see their observability costs
  • Platform team prioritizes Datadog support over legacy tools

Phase 3: Deprecation

  • 6-month sunset timeline for legacy tools
  • Platform team assists migration for stragglers
  • Legacy tools turned off

Result: 90% adoption in 8 months, 100% in 12 months.

Cost Optimization at Scale

Your cost doubled because you instrumented everything at full fidelity. We learned: Observability should be tiered based on service criticality.

Tier 1 - Critical services (customer-facing, revenue-impacting):

  • Full APM, distributed tracing
  • 100% trace sampling
  • 30-day retention
  • Synthetic monitoring

Tier 2 - Internal services (important but not customer-facing):

  • Standard metrics and logs
  • 10% trace sampling
  • 14-day retention

Tier 3 - Development/staging:

  • Logs and basic metrics
  • 1% trace sampling
  • 7-day retention

Automatic tier assignment based on deployment target (production, staging, dev).
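
Tier assignment can be a plain lookup the deploy pipeline applies. A sketch using the numbers above (the config keys and tier names are invented for illustration):

  # Sketch: the deploy pipeline stamps each service with a telemetry tier.
  # Sampling rates and retention mirror the tiers above; the keys are invented.
  TELEMETRY_TIERS = {
      "critical": {"trace_sample_rate": 1.00, "retention_days": 30, "apm": True},
      "internal": {"trace_sample_rate": 0.10, "retention_days": 14, "apm": False},
      "dev":      {"trace_sample_rate": 0.01, "retention_days": 7,  "apm": False},
  }

  def tier_for(deploy_target: str, customer_facing: bool) -> dict:
      if deploy_target != "production":
          return TELEMETRY_TIERS["dev"]
      return TELEMETRY_TIERS["critical" if customer_facing else "internal"]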

Cost reduction: 40%. Critical services still fully observable, non-critical services have appropriate telemetry.

The Ownership Question

Platform team owns:

  • Observability PLATFORM (Datadog integration, instrumentation standards)
  • Default dashboards and alerts
  • Cost optimization and compliance

Development teams own:

  • Custom dashboards for their services
  • Alert tuning for their SLOs
  • Responding to incidents their services trigger

Clear ownership boundaries prevent confusion during incidents.

Build vs Buy: The Pragmatic Answer

Build when:

  • Observability requirements are genuinely unique to your domain
  • You have 10+ engineers to dedicate to maintaining observability stack
  • Data sovereignty requires on-prem

Buy when:

  • You’re under 10,000 hosts
  • Engineering time is better spent on platform differentiation
  • Vendor provides features you’d never build in-house

We bought. Our platform team’s job is integrating best-of-breed tools into seamless developer experience, not reimplementing Datadog from scratch.

The six figures we spend on Datadog each year are cheaper than the seven figures in engineering salaries we’d need to build equivalent capabilities.

Your cost concern is valid, but compare it to the alternative: building observability in-house requires a dedicated team and ongoing maintenance, and you’ll still have gaps compared to vendors who do nothing but observability.

The platform team’s value is integration and developer experience, not reinventing monitoring wheels.