When Coinbase’s $65 million Datadog bill went viral on Hacker News, it struck a nerve. But here’s the thing: that bill isn’t an outlier.
Mid-sized companies routinely spend $50,000-$150,000 per year on Datadog. Enterprise deployments easily exceed $1 million annually once APM, logs, and RUM are included.
And for many organizations, observability has become the second-highest cost after infrastructure itself.
Something has to give.
The Pricing Breakdown That’s Breaking Budgets
Datadog’s pricing model has several features that combine to create unpredictable bills:
- High-water mark billing - You’re billed on your 99th-percentile usage, not your average. That one incident response where you scaled up? You’re paying for it all month.
- Dual-cost log management - You pay once to ingest logs, then pay again (at a higher rate) to index them. Want searchable logs? Double the cost.
- Custom metrics tax - Custom metrics bill at premium rates per unique combination of metric name and tag values, so OpenTelemetry’s rich tagging model gets expensive on Datadog fast.
- Per-host + per-feature stacking - Infrastructure monitoring, APM, Synthetics, RUM - each adds to the bill along a different pricing dimension.
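To make the high-water-mark point concrete, here is a minimal sketch of p99-based billing versus average-based billing. The per-host rate, host counts, and an 8-hour incident window are all hypothetical, and the exact percentile mechanics vary by SKU:

```python
# Sketch: why high-water-mark (p99) billing diverges from average-based billing.
# All numbers here are hypothetical, for illustration only.

# Hourly host counts for a 30-day month: a steady 100 hosts, except an
# 8-hour incident where autoscaling took us to 300 hosts.
hours = [100] * (30 * 24)
hours[40:48] = [300] * 8  # the incident

avg_hosts = sum(hours) / len(hours)
p99_hosts = sorted(hours)[int(len(hours) * 0.99)]  # 99th-percentile hourly usage

RATE_PER_HOST = 23.0  # hypothetical monthly per-host rate (USD)
avg_bill = avg_hosts * RATE_PER_HOST
p99_bill = p99_hosts * RATE_PER_HOST

print(f"average hosts: {avg_hosts:.1f} -> ${avg_bill:,.0f}/mo")
print(f"p99 hosts:     {p99_hosts}   -> ${p99_bill:,.0f}/mo")
```

Under these assumptions the incident roughly triples the monthly bill even though the extra capacity ran for eight hours.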
Concrete Cost Comparisons
Teams running identical workloads have documented these comparisons:
| Solution | Monthly Cost | Savings vs Datadog |
|---|---|---|
| Datadog | $22,303 | - |
| Grafana Cloud | $1,855 | 11x cheaper |
| Coroot (on-prem) | $142-162 | ~140x cheaper |
| OpenObserve | $90 | 98% savings |
These aren’t promotional numbers - they’re real comparisons from engineering teams.
Why OpenTelemetry Changes Everything
The game-changer is OpenTelemetry becoming the instrumentation standard. When your application emits OTel data, you can send it anywhere:
- Today: Datadog
- Tomorrow: SigNoz, OpenObserve, Grafana, or whatever works better
Your instrumentation investment is protected. The switching cost drops from “re-instrument everything” to “change the exporter configuration.”
If you’re still using Datadog’s proprietary agents, you’re building vendor lock-in into your infrastructure.
What’s Actually Happening
Teams I talk to are following similar patterns:
- Standardize on OpenTelemetry - Replace Datadog agents with OTel collectors
- Pilot alternatives - Run a parallel backend for non-critical services
- Validate parity - Ensure dashboards and alerts can be recreated
- Migrate gradually - Move service by service, not big bang
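The pilot step above is mostly collector configuration. A sketch of a dual-export pipeline, assuming the OpenTelemetry Collector contrib distribution (which includes the `datadog` exporter); the pilot backend endpoint and API-key variable are placeholders:

```yaml
# Dual-export pilot: keep Datadog while trialing an OTLP-compatible backend.
# Assumes the OTel Collector *contrib* build; endpoints are placeholders.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}
  otlphttp:
    endpoint: http://pilot-backend.internal:4318   # hypothetical pilot backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [datadog, otlphttp]   # fan out to both during the pilot
```

Once parity is validated, cutting over is a matter of removing one exporter from the pipeline.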
The migration is real, and it’s accelerating.
The Uncomfortable Question
If you’re spending significant budget on Datadog, you need to ask: what would it take to switch? And what’s the cost of waiting another year?
What’s your observability spend looking like?
The high-water mark billing nearly killed us last year.
We had a production incident that required spinning up additional capacity for about 4 hours. Incident resolved, capacity scaled back down. Business as usual.
Then the Datadog bill came.
We got charged for the peak capacity for the entire month. Four hours of incident response turned into a ~$15K surprise on the bill.
The conversation with finance was not pleasant.
What we’ve learned since:
- Always scope before scaling - Before spinning up capacity, consider the observability cost impact. Yes, this is absurd. Yes, we do it anyway.
- Tag cardinality is your enemy - Every unique combination of tags is a separate timeseries. Kubernetes labels? Pod names? Request IDs? They all multiply your metrics bill.
- Log sampling is mandatory - We moved to 10% log sampling on high-volume services. Not because we wanted less data, but because we couldn’t afford 100%.
- The calculator lies - Datadog’s pricing calculator gives you one number. Reality gives you another. Plan for 40-60% above the estimate.
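The cardinality multiplication is easy to underestimate. A quick sketch with made-up but Kubernetes-typical label sets shows how one metric fans out:

```python
# Sketch: how tag (label) combinations multiply into distinct timeseries.
# The label sets are hypothetical but typical for a Kubernetes deployment.
from itertools import product

labels = {
    "service":  [f"svc-{i}" for i in range(20)],
    "pod":      [f"pod-{i}" for i in range(50)],    # pod names churn constantly
    "status":   ["2xx", "4xx", "5xx"],
    "endpoint": [f"/api/v1/route{i}" for i in range(30)],
}

# One metric emitted with all four labels: every unique combination of
# values is a separate timeseries for billing purposes.
series = len(list(product(*labels.values())))
print(f"one metric x these labels = {series:,} timeseries")  # 20*50*3*30 = 90,000
```

Add a request ID or a free-form user label and the count becomes effectively unbounded, which is why dropping high-cardinality tags is usually the first cost lever to pull.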
The OTel transition:
We’re now 6 months into standardizing on OpenTelemetry. The instrumentation work was significant, but we can finally see a path to alternatives.
For anyone starting this journey: the OTel collector is your friend. Centralize there first, then you can redirect anywhere.
The security implications of switching observability vendors are significant but manageable with the right approach.
Data Residency Concerns
With Datadog, your telemetry data lives in their cloud infrastructure. Moving to self-hosted alternatives like Coroot or OpenObserve gives you complete control over where sensitive operational data resides. For organizations in regulated industries, this can actually improve your compliance posture.
Vendor Risk Assessment
Any migration requires evaluating:
- Data retention policies and deletion guarantees
- SOC 2 Type II compliance status
- Incident response and breach notification procedures
- API security and authentication mechanisms
The OpenTelemetry Security Advantage
One underappreciated benefit: OTel instrumentation runs in your environment, giving you complete control over what data leaves your network. With proprietary agents, you’re trusting the vendor’s data collection code.
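As one illustration of that control, the collector’s `attributes` processor can drop sensitive fields before anything leaves the network. This is a fragment, not a full config (it assumes the `otlp` receiver and `otlphttp` exporter are defined elsewhere), and the attribute keys are examples:

```yaml
# Scrub sensitive attributes in the collector, before data leaves the network.
# Fragment only; attribute keys are examples - tailor them to your telemetry.
processors:
  attributes/scrub:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: user.email
        action: delete

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/scrub]
      exporters: [otlphttp]
```

With a proprietary agent, the equivalent scrubbing happens inside vendor code you can’t inspect.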
Migration Security Checklist
- Audit current data flows and retention
- Verify new platform meets compliance requirements
- Plan credential rotation strategy
- Test data export and deletion capabilities
- Document chain of custody during migration
The 98% cost savings Michelle mentioned are compelling, but make sure your security team is involved from day one of any migration planning.
From a data infrastructure perspective, the Datadog cost problem is fundamentally about data volume economics.
The Data Volume Reality
Modern ML pipelines generate massive telemetry. A single model training run can produce gigabytes of metrics and logs. When you’re paying Datadog’s per-GB ingestion rates plus indexing fees, the math becomes prohibitive fast.
Why ClickHouse Changes Everything
Both SigNoz and OpenObserve use ClickHouse for storage, and the compression ratios are genuinely impressive - 10-140x depending on data patterns. For observability data, which is highly repetitive, this translates directly into cost savings.
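You can get a feel for why repetitive telemetry compresses so well with a crude stdlib experiment; this is zlib over an invented, fully repeated log line, not ClickHouse’s actual columnar codecs, so treat the ratio as an upper-bound illustration:

```python
# Crude illustration: repetitive telemetry compresses extremely well.
# zlib over identical lines, NOT ClickHouse's columnar codecs; the log
# line is made up, and real data with varying fields compresses less.
import zlib

line = ('2024-05-01T12:00:00Z INFO svc=checkout pod=checkout-7d9f '
        'route=/api/cart status=200 latency_ms=12\n')
raw = (line * 10_000).encode()

compressed = zlib.compress(raw, level=9)
ratio = len(raw) / len(compressed)
print(f"{len(raw):,} bytes -> {len(compressed):,} bytes ({ratio:.0f}x)")
```

Columnar stores do better still on real telemetry because timestamps, statuses, and labels land in their own sorted columns.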
Our ML Team’s Experience
We had to make painful choices about what to instrument:
- Full traces only for production inference, not training
- Sampling at 1% for high-volume endpoints
- Custom metrics limited to top 50 most critical
These compromises undermined our ability to debug model performance issues. With 98% cost savings, we could actually instrument everything.
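The 1% sampling mentioned above is typically done head-based and deterministically, keyed on the trace ID so every span of a trace gets the same keep/drop decision. A sketch of the idea (not any vendor’s actual sampler; the trace-ID format is invented):

```python
# Sketch: deterministic head sampling at ~1%, keyed on the trace ID so all
# spans of a trace agree. Not any vendor's actual sampler implementation.
import hashlib

def keep_trace(trace_id: str, sample_percent: float = 1.0) -> bool:
    # Hash the trace ID into one of 10,000 buckets; keep the lowest ones.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sample_percent * 100  # 1.0% -> buckets 0..99

kept = sum(keep_trace(f"trace-{i}") for i in range(100_000))
print(f"kept {kept} of 100,000 traces (~{kept / 1000:.2f}%)")
```

Because the decision is a pure function of the trace ID, it is reproducible across services without any coordination.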
The Hidden Cost of Data Science Tooling
Datadog’s APM for Python is solid, but their ML-specific features lag behind. Moving to open-source means we can integrate directly with MLflow, Weights & Biases, and our existing Prometheus metrics.
Michelle’s point about OpenTelemetry is crucial for data teams - it means we can emit traces from Spark jobs, Airflow DAGs, and model serving without vendor-specific instrumentation.