"We Got Billed for Peak Usage, Not Average" - Understanding Datadog's Hidden Pricing Model

Last quarter, our Datadog bill jumped 340% during a product launch. We were prepared for higher costs - but not that much higher. Here’s what we learned about Datadog’s billing model that nobody explains upfront.

The High-Water Mark Problem

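Datadog bills infrastructure hosts on the 99th percentile of hourly host counts over the billing month: the busiest 1% of hours are thrown out, and the next-highest hourly count sets the bill. That means:

  • Keep 50 extra servers running for more than about 7 hours in a month? You’re billed for those servers all month. Only spikes short enough to fit inside the discarded 1% escape.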
  • Spin up test instances for load testing? They count toward your peak.
  • Auto-scale during an incident? Your bill reflects the crisis, not the resolution.

Real Numbers from Our Launch

Period        Hosts       What We Expected   What We Got Billed
Normal        24          $2,400/mo          -
Launch week   85 (peak)   ~$4,000/mo         $8,500/mo
Post-launch   30          Back to normal     Still $8,500

The 99th percentile billing meant our 3-day scaling event defined our entire month’s cost.
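To make the arithmetic concrete, here is a rough sketch of how a 99th-percentile host count behaves for a month like ours. It is a simplification, not Datadog's actual metering code: the function names are made up, the month is assumed to be 30 days with a flat 24 hosts outside a 72-hour launch window at 85 hosts, and the $100 per host per month rate is only inferred from the table above (24 hosts at $2,400).

```python
import math

def billable_hosts(hourly_counts, discard_fraction=0.01):
    """Highest hourly host count left after dropping the top 1% of hours."""
    ordered = sorted(hourly_counts, reverse=True)
    discard = math.floor(len(ordered) * discard_fraction)
    return ordered[discard]

def average_hosts(hourly_counts):
    """Plain average, shown only for comparison with the percentile."""
    return sum(hourly_counts) / len(hourly_counts)

HOURS_IN_MONTH = 30 * 24        # assumed 30-day month
LAUNCH_HOURS = 3 * 24           # the 3-day scaling event
RATE_PER_HOST = 100             # inferred from the table, not a quoted price

hourly = [85] * LAUNCH_HOURS + [24] * (HOURS_IN_MONTH - LAUNCH_HOURS)

peak_based = billable_hosts(hourly)
avg_based = average_hosts(hourly)

print(peak_based, peak_based * RATE_PER_HOST)               # 85 8500
print(round(avg_based, 1), round(avg_based * RATE_PER_HOST))  # 30.1 3010
```

Under these assumptions the 72 launch hours are far more than the 7 hours the percentile discards, so the 85-host level sets the bill even though the month's average is about 30 hosts.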

The Custom Metrics Multiplier

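It gets worse with custom metrics. We had 150 custom metrics at baseline. During launch, our instrumentation for new features pushed us to 400+. Custom metrics above the plan’s allotment are billed as overage. They are metered on the month’s average hourly count rather than the 99th percentile, so a short spike is diluted but a sustained increase is not.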

Why This Punishes Good Engineering

The irony: Datadog’s billing model punishes exactly the behaviors you want:

  • Elastic scaling = billing spikes
  • Thorough instrumentation = metric cost explosion
  • Incident response (more hosts for debugging) = higher bills

What We’re Doing About It

  1. Migrated staging environments to self-hosted OpenObserve
  2. Implemented strict custom metric budgets per team
  3. Evaluating SigNoz for production (running parallel for 2 months)
  4. Standardized on OpenTelemetry so we can switch without re-instrumenting

Has anyone else been surprised by the high-water mark billing? How are you managing observability costs during scaling events?

Luis, this is exactly the budget planning nightmare I deal with every quarter.

The CFO Conversation Nobody Wants to Have

Try explaining to finance that your observability costs are “unpredictable by design.” The 99th percentile model means:

  • Annual budgets are essentially guesses
  • Any growth initiative comes with hidden observability costs
  • Incident response becomes a billable event

The Procurement Problem

We negotiated an enterprise agreement with Datadog two years ago. Here’s what we learned:

  1. Commit traps: They want annual commitments, but your usage is unpredictable
  2. SKU complexity: 15+ line items make apples-to-apples comparison impossible
  3. True-up clauses: Exceed your commitment, pay list price for overage

What I Now Require for Observability Vendors

  • Usage-based pricing with monthly caps
  • Clear cost attribution to teams/services
  • No penalty for scaling events
  • Self-serve cost controls that engineers can implement

The Strategic Shift

We’re now treating observability as critical infrastructure, not a managed service. That means:

  • Platform team owns the stack
  • Spend moves from SaaS OpEx to CapEx on our own infrastructure
  • Predictable costs tied to our infrastructure, not vendor pricing models

The 340% spike you experienced would have triggered an executive review here. We can’t run a business on unpredictable infrastructure costs.

The product launch scenario Luis described is painfully familiar.

The Launch Day Dilemma

When we launch new features, we need more visibility, not less:

  • Real-time user behavior analytics
  • Error rate monitoring at granular levels
  • Performance metrics for new code paths
  • A/B test instrumentation

But Datadog’s pricing model creates a perverse incentive: extra observability is most expensive exactly when you need it most, so the cheapest move during a launch is to instrument less.

What This Costs Product Teams

Last quarter, we had to choose between:

  1. Full instrumentation for a new checkout flow ($15K additional/month)
  2. Minimal metrics and hope nothing breaks

We chose option 2. Guess what happened? A payment edge case caused 3% of transactions to fail silently for 4 days before we noticed. The revenue impact was $180K.

The Hidden Tax on Innovation

Every product initiative now includes an “observability budget” line item. This slows down our experimentation velocity because:

  • Small experiments need cost justification
  • Quick MVPs skip instrumentation entirely
  • We’re flying blind on features we should be learning from

What I Want From Observability

  • Fixed cost per product team, not per metric
  • Unlimited experimentation without billing anxiety
  • Cost scales with value delivered, not data volume

The 98% savings from alternatives Michelle mentioned in the other thread would completely change how we approach product observability.

As the person who actually implements the instrumentation, the cost constraints create real friction in my daily work.

The “Is This Metric Worth It?” Tax

Every time I want to add observability, I have to think:

  • Is this custom metric going to blow our budget?
  • Should I use a tag or a separate metric? (Each tag-value combination is counted as its own custom metric, so a tag only saves money when its cardinality stays low)
  • Can I sample this at 10% and still get useful data?

This cognitive overhead slows down development. Instrumentation should be a reflex, not a cost-benefit analysis.
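For the sampling question, a quick back-of-the-envelope check can at least replace guessing. The sketch below estimates how many events survive sampling and the chance of seeing any at all. The function names and the example numbers (1,000 requests, a 0.5% failure rate, 10% sampling) are hypothetical, and it assumes requests fail and are sampled independently.

```python
def expected_sampled_events(requests, affected_fraction, sample_rate):
    """Average number of affected requests that survive sampling."""
    return requests * affected_fraction * sample_rate

def chance_of_seeing_any(requests, affected_fraction, sample_rate):
    """Probability that at least one affected request is sampled."""
    per_request = affected_fraction * sample_rate
    return 1 - (1 - per_request) ** requests

# Hypothetical low-traffic endpoint: 1,000 requests in the window we care about.
requests = 1_000
affected_fraction = 0.005   # 0.5% of requests hit the problem
sample_rate = 0.10          # keep 10% of events

print(expected_sampled_events(requests, affected_fraction, sample_rate))
print(chance_of_seeing_any(requests, affected_fraction, sample_rate))  # roughly 0.39
```

With these made-up numbers you would expect about half a sampled event and see nothing most of the time, which is the kind of answer I would rather not need a spreadsheet for.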

Real Examples From This Week

  1. Cache hit rate monitoring - Wanted to track per-key hit rates. Finance said no because we’re at our custom metric limit.

  2. API latency by customer tier - Had to remove the customer_tier tag because every distinct tag-value combination is counted as a separate custom metric, and the extra tag multiplied our count.

  3. Error sampling - We sample errors at 1% to save costs. Last week we missed a bug affecting 0.5% of requests for 3 days.
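The first two examples come down to cardinality: Datadog counts each unique combination of metric name and tag values as a custom metric. A rough way to estimate the worst case is to multiply the number of values per tag. The tag names and counts below are invented for illustration, and the result is an upper bound, since only combinations that actually report data are counted.

```python
from math import prod

def max_series(tag_cardinalities):
    """Upper bound on distinct tag-value combinations for one metric name."""
    return prod(tag_cardinalities.values())

# Hypothetical tags on an API latency metric: tag name -> number of distinct values.
latency_tags = {"endpoint": 40, "region": 3, "customer_tier": 4}

# Same metric with the customer_tier tag removed.
without_tier = {k: v for k, v in latency_tags.items() if k != "customer_tier"}

print(max_series(latency_tags))   # 480
print(max_series(without_tier))   # 120
```

The same multiplication explains why per-key cache hit rates are so expensive: a tag for the cache key has as many values as there are keys.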

What Good Developer Experience Looks Like

I’ve been playing with SigNoz in a side project:

  • Add metrics without checking a budget spreadsheet
  • High-cardinality labels? No problem
  • Full traces for debugging, not sampled
  • Actually enjoyable to instrument code

Luis, when you mentioned standardizing on OTel - that’s the key. My instrumentation code works with any backend. When we switch (not if), it’s just a config change.