At my last executive review, the board asked: “How do you know if a service is healthy?” My answer used to be: “Developers instrument their code and set up dashboards.” Now it’s: “Platform provides observability by default.”
That shift represents a fundamental change in who owns the monitoring stack—and it’s raising new questions about scope, cost, and customization.
The Old Model: Developer-Owned Observability
Five years ago, observability was the developer’s job:
- Add logging statements to code
- Choose a metrics library (Prometheus client, StatsD, whatever)
- Set up Grafana dashboards manually
- Configure alerting rules in PagerDuty
- Hope you got it right before production issues hit
Problems with this approach:
- Inconsistency: Every team used different tools and formats
- Blind spots: Services shipped without monitoring because “we’ll add it later”
- Debugging nightmares: Correlating logs/metrics/traces across 47 different formats
- Incident response chaos: No standard dashboards, every outage started with “where are the logs?”
The New Model: Platform-Provided Observability
Now platform teams are moving to: “Every service is observable by default, with zero developer effort.”
What this looks like in practice:
Automatic instrumentation:
- Platform injects OpenTelemetry agents into containers automatically
- Logs, metrics, traces collected without code changes
- Standardized format across all services
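To make that concrete, here is roughly what the injected agent wires up at process start, sketched with the OpenTelemetry Python SDK. The service name and collector endpoint are placeholders; in Kubernetes this kind of injection is usually handled by the OpenTelemetry Operator, so application code never changes.

```python
# A minimal sketch of what the injected OpenTelemetry agent configures at startup.
# The application itself contains none of this code.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Resource attributes identify the service consistently in the shared backend.
resource = Resource.create({
    "service.name": "checkout",              # placeholder service name
    "deployment.environment": "prod",
})

provider = TracerProvider(resource=resource)
# Ship spans to the platform's collector endpoint (hypothetical address).
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="otel-collector.platform:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)
```

For Python services the same effect can be had by launching the process under the opentelemetry-instrument wrapper, which patches common libraries at startup.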
Standardized dashboards:
- Golden Signals dashboards (latency, traffic, errors, saturation) for every service
- Auto-generated SLO tracking
- Dependency maps showing service relationships
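The SLO tracking piece is less magic than it sounds: given a target and a request count, the error budget is simple arithmetic. A minimal sketch, with illustrative numbers rather than ours:

```python
def error_budget_status(total_requests: int, failed_requests: int, slo_target: float = 0.999):
    """How much of the error budget a service has burned in the current window."""
    if total_requests == 0:
        return {"availability": 1.0, "budget_burned": 0.0, "budget_remaining": 1.0}
    allowed_failures = (1 - slo_target) * total_requests   # the error budget, in requests
    burned = failed_requests / allowed_failures
    return {
        "availability": 1 - failed_requests / total_requests,
        "budget_burned": burned,                 # 1.0 means the budget is exhausted
        "budget_remaining": max(0.0, 1 - burned),
    }

# 2M requests with 1,200 failures against a 99.9% target: the budget is 2,000
# failures, so the service has burned 60% of it.
print(error_budget_status(2_000_000, 1_200))
```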
Integrated alerting:
- Default alerts for critical metrics (error rate spike, latency P99 regression)
- Routes to team-specific Slack channels
- Integration with incident management (PagerDuty, Opsgenie)
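The default alerting layer is mostly thresholds plus ownership metadata. The real thing lives in alerting rules and routing config rather than application code, but the logic looks roughly like this sketch (thresholds and channel names are hypothetical):

```python
# Ownership metadata the platform already tracks for every service (names hypothetical).
SERVICE_CHANNELS = {"checkout": "#payments-oncall", "search": "#discovery-oncall"}

def default_alerts(service: str, error_rate: float, p99_ms: float,
                   max_error_rate: float = 0.02, max_p99_ms: float = 500.0):
    """Evaluate the platform's baseline alerts; return (channel, message) pairs to deliver."""
    channel = SERVICE_CHANNELS.get(service, "#platform-oncall")
    alerts = []
    if error_rate > max_error_rate:
        alerts.append((channel, f"{service}: error rate {error_rate:.1%} over {max_error_rate:.0%} baseline"))
    if p99_ms > max_p99_ms:
        alerts.append((channel, f"{service}: p99 latency {p99_ms:.0f} ms over {max_p99_ms:.0f} ms baseline"))
    return alerts

# An error-rate spike on checkout routes straight to the payments on-call channel.
print(default_alerts("checkout", error_rate=0.05, p99_ms=220.0))
```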
Unified view:
- One place to see logs, metrics, traces for any service
- Correlation across observability signals
- Faster incident response because tooling is consistent
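Correlation is the part that is genuinely hard to retrofit. With OpenTelemetry's logging instrumentation, the active trace and span IDs ride along on every log record, so logs and traces join on a single key. A sketch, assuming the SDK from the earlier snippet is already configured:

```python
import logging
from opentelemetry import trace
from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Inject trace/span IDs into every log record so logs and traces
# can be joined on trace ID in the unified backend.
LoggingInstrumentor().instrument(set_logging_format=True)

logger = logging.getLogger("checkout")
tracer = trace.get_tracer("checkout")

with tracer.start_as_current_span("charge-card"):
    # This record now carries otelTraceID / otelSpanID fields for correlation.
    logger.warning("card declined, retrying with backup processor")
```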
The Benefits Are Obvious
Developers love it:
- Ship code without thinking about instrumentation
- Observability “just works”
- Faster debugging because tooling is familiar
Ops teams love it:
- No more “where are your metrics?” during incidents
- Consistent telemetry across the organization
- Easier on-call rotations because dashboards are standardized
Leadership loves it:
- Better visibility into system health
- Faster incident resolution
- Compliance and audit trail automatically maintained
The Challenges Nobody Talks About
But here’s where it gets complicated:
Challenge 1: Platform Team as Bottleneck
Scenario: Product team needs custom metric for business logic (“number of premium subscriptions active”).
Old model: Developer adds counter, builds dashboard, done.
New model: Request goes to platform team → evaluated for standardization → maybe added to platform capabilities → 2-week turnaround.
Developer frustration: “I could’ve built this in 30 minutes, now I’m blocked waiting for platform approval.”
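And they're not wrong about the 30 minutes; in the old model the whole thing was a few lines with the Prometheus client (the metric name here is illustrative):

```python
from prometheus_client import Gauge, start_http_server

# The business metric the product team asked for.
ACTIVE_PREMIUM_SUBSCRIPTIONS = Gauge(
    "active_premium_subscriptions",
    "Number of premium subscriptions currently active",
)

def on_subscription_created():
    ACTIVE_PREMIUM_SUBSCRIPTIONS.inc()

def on_subscription_cancelled():
    ACTIVE_PREMIUM_SUBSCRIPTIONS.dec()

# Expose /metrics for scraping; the dashboard panel is one query away.
start_http_server(9100)
```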
How much customization should the platform expose versus standardize?
Challenge 2: Cost Explosion
Observability tools are EXPENSIVE at scale:
- Datadog: $15-31 per host per month, more for APM and logs
- New Relic: Similar pricing, costs climb with data volume
- Elastic: Self-hosted reduces fees but increases operational burden
When you instrument EVERYTHING by default, data volume skyrockets.
Our observability bill doubled in 6 months after rolling out platform-wide automatic instrumentation. Finance is now asking: “Do we really need DEBUG logs from every service?”
The platform team is caught between comprehensive observability and budget constraints.
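The levers we're experimenting with are mostly sampling and retention rather than turning telemetry off. Head-based trace sampling, for example, is a one-line change in the injected agent's configuration; the 10% ratio below is just an example:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of root traces (children follow their parent's decision) instead of 100%.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))

# The same thing via standard environment variables, which is how an injected
# agent is usually tuned without touching code:
#   OTEL_TRACES_SAMPLER=parentbased_traceidratio
#   OTEL_TRACES_SAMPLER_ARG=0.10
```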
Challenge 3: The Build vs Buy Decision
Option 1: Build observability platform in-house
Pros:
- Full customization
- No per-host licensing fees
- Data stays in-house (compliance, security)
Cons:
- Massive engineering effort (Prometheus, Grafana, Loki, Tempo, alerting, etc.)
- Operational burden maintaining the stack
- Scaling challenges as data volume grows
Option 2: Buy managed observability (Datadog, New Relic, Honeycomb)
Pros:
- Turnkey solution
- Vendor handles scaling, reliability
- Rich feature set out of the box
Cons:
- Expensive at scale
- Vendor lock-in
- Data leaves your infrastructure
We started with build and moved to buy because the operational burden of running our own observability stack became higher than paying for Datadog. But a bill in the hundreds of thousands of dollars per year makes the CFO unhappy.
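One back-of-envelope way to frame the decision: compare the fully loaded cost of the engineers who would run the stack against the per-host licensing you would avoid. The numbers below are placeholders, not our actual figures:

```python
def breakeven_hosts(engineers: float, loaded_cost_per_engineer: float,
                    price_per_host_month: float) -> float:
    """Host count at which annual SaaS licensing equals the cost of running it yourself."""
    return (engineers * loaded_cost_per_engineer) / (price_per_host_month * 12)

# Placeholder inputs: 2.5 engineers at $200K loaded cost vs. $31/host/month of
# managed APM -> about 1,350 hosts before build breaks even, ignoring storage,
# egress, and the opportunity cost of those engineers.
print(round(breakeven_hosts(2.5, 200_000, 31)))
```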
Challenge 4: Standardization vs Flexibility
Platform-provided observability requires standardization. But not all services fit the mold:
- Batch jobs: Don’t fit request/response golden signals model
- Data pipelines: Need data quality metrics, not HTTP latency
- ML inference services: Custom metrics for model performance
Do we:
- Force everything into standardized golden signals?
- Build escape hatches for custom observability?
- Support multiple observability patterns within the platform?
Each approach has trade-offs between consistency and flexibility.
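If we go the escape-hatch route, the saving grace is that custom signals can still flow through the same pipeline. A sketch of what a batch job might emit through the OpenTelemetry metrics API (metric names are illustrative):

```python
from opentelemetry import metrics

# An "escape hatch" for workloads that don't fit the request/response mold:
# a batch job reporting its own domain metrics through the same OTel pipeline.
meter = metrics.get_meter("nightly-etl")

rows_processed = meter.create_counter(
    "etl.rows.processed", unit="rows", description="Rows successfully processed per run")
rows_rejected = meter.create_counter(
    "etl.rows.rejected", unit="rows", description="Rows dropped by data-quality checks")
run_duration = meter.create_histogram(
    "etl.run.duration", unit="s", description="End-to-end duration of a pipeline run")

def record_run(processed: int, rejected: int, seconds: float) -> None:
    rows_processed.add(processed)
    rows_rejected.add(rejected)
    run_duration.record(seconds)
```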
What I’m Wrestling With
For platform teams providing observability:
- Customization boundaries: Where do you draw the line between standardized platform observability and team-specific custom metrics/dashboards?
- Cost management: How do you balance comprehensive observability with budget reality? Sampling? Data retention policies? Tiered offerings?
- Build vs buy: At what scale does in-house observability become more cost-effective than managed solutions? Or is vendor-managed always worth it?
- Operational ownership: Does the platform team own the observability stack's uptime and performance? What happens when Datadog has an outage and you can't see your systems?
For engineering leaders: Is platform-owned observability the right model? Or should teams retain observability as their responsibility, with the platform providing standards and tooling?
These aren’t rhetorical questions—I’m genuinely trying to figure out the right approach as our scale grows.