The Observability Tax: When Monitoring Your AI Costs More Than Running It

· 8 min read
Tian Pan
Software Engineer

Your team ships an AI-powered customer support bot. It works. Users are happy. Then the monthly bill arrives, and you discover that the infrastructure watching your LLM calls costs more than the LLM calls themselves.

This isn't a hypothetical. Teams are reporting that adding AI workload monitoring to their existing Datadog or New Relic setup increases their observability bill by 40–200%. Meanwhile, inference costs keep dropping — GPT-4-class performance now runs at $0.40 per million tokens, down from $20 in late 2022. The monitoring stack hasn't gotten that memo.

The result is an inversion that would be funny if it weren't expensive: you're paying more to watch your AI think than to make it think.

Why AI Telemetry Explodes Your Monitoring Bill

Traditional observability was designed for a world where telemetry scaled linearly with traffic. A REST API handling 10,000 requests per day generates roughly 10,000 spans, some logs, and a handful of metrics. Predictable. Budgetable.

AI workloads shatter that assumption. A single LLM inference request doesn't produce one span — it produces a cascade.

Consider a customer support bot handling 10,000 conversations daily:

  • 50,000 user messages per day
  • 200,000 LLM invocations (roughly 4 calls per message for retrieval, classification, generation, and validation)
  • 1 million spans daily (about 5 spans per invocation)
  • 4 million metric data points (20 custom metrics per invocation: token counts, latency percentiles, cost attribution, model version, temperature settings)
  • 400 MB of logs per day

That's 10–50x more telemetry than an equivalent traditional service. And every one of those data points costs money on platforms that price by volume — per GB of logs, per million spans, per custom metric.

The problem compounds with agent architectures. An agent that calls tools, spawns sub-agents, and executes multi-step plans can produce deeply branching trace trees. Each branch multiplies your telemetry volume. A 10-step agent workflow with 3 tool calls per step doesn't generate 10 spans — it generates hundreds.
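The multiplication above is easy to sanity-check with arithmetic. A minimal sketch, using the same illustrative multipliers as the bullet list (they are assumptions, not vendor defaults):

```python
# Back-of-the-envelope telemetry estimate for the support bot above.
MESSAGES_PER_DAY = 50_000
LLM_CALLS_PER_MESSAGE = 4   # retrieval, classification, generation, validation
SPANS_PER_CALL = 5
METRICS_PER_CALL = 20       # token counts, latency, cost tags, model version...

llm_calls = MESSAGES_PER_DAY * LLM_CALLS_PER_MESSAGE
spans = llm_calls * SPANS_PER_CALL
metrics = llm_calls * METRICS_PER_CALL
print(f"{llm_calls=:,} {spans=:,} {metrics=:,}")
# 200,000 calls -> 1,000,000 spans, 4,000,000 metric data points

# Agent branching: a 10-step plan, 3 tool calls per step, each tool call
# emitting ~5 spans, plus one span per step.
agent_spans = 10 * (1 + 3 * 5)
print(agent_spans)  # 160 spans for a single agent run
```

Every line of that output is a billable unit on volume-priced platforms, which is the whole problem.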

The Four Layers That Stack Up

Most teams don't arrive at an expensive monitoring stack on purpose. They add layers incrementally, each one justified in isolation:

Layer 1: Basic tracing and logging. You need to know what your LLM calls look like in production. Reasonable. You instrument with OpenTelemetry, ship traces to your APM platform. Cost: moderate.

Layer 2: Cost attribution. Finance wants to know which features and customers consume the most tokens. You add per-request token tracking, per-user cost rollups, model-version tagging. Every request now carries 15–20 additional metadata fields. Cost: multiplied.

Layer 3: Quality evaluation. You add LLM-as-judge evaluations to catch hallucinations, relevance drift, and tone violations. Even at 5% sampling, you're now making additional LLM calls to evaluate your LLM calls. Cost: doubled (at minimum).

Layer 4: Human review queues. For high-stakes outputs, you pipe flagged responses to human reviewers. The tooling to manage these queues, track reviewer throughput, and compute inter-rater reliability adds another monitoring surface. Cost: now you're monitoring the monitors.

Each layer addresses a real need. But stacked together, they create an observability bill that grows superlinearly with traffic — exactly the opposite of what happens to your inference costs.

The Diminishing Returns Curve

Here's what makes this especially painful: the value of each additional monitoring layer follows a sharply diminishing curve, while the cost follows a linear (or worse) one.

Layer 1 gives you 70% of the signal you need. You can see what's happening, catch outages, debug failures. Layer 2 adds perhaps 15% — you now understand cost drivers and can optimize. Layer 3 adds another 10% — you catch quality issues before users report them. Layer 4 adds the final 5% — you have ground-truth labels for your hardest cases.

But cost doesn't follow this curve. Layer 3 alone — running evaluations on production traffic — can cost as much as Layers 1 and 2 combined if you're using a capable model as your judge. Layer 4 introduces human labor costs that dwarf everything else.

The teams that get into trouble are the ones that treat monitoring as a binary: either you have it or you don't. They skip straight to full instrumentation across all four layers because "observability is important" without asking whether the marginal signal justifies the marginal cost.

What Right-Sizing Actually Looks Like

The fix isn't to stop monitoring. It's to be as intentional about your monitoring architecture as you are about your inference architecture. Here are the strategies that work.

Tiered sampling instead of full capture

Not every request needs full telemetry. A tiered approach gives you statistical coverage without the volume:

  • Head sampling at 10–20% for normal requests. This gives you solid statistical coverage of the happy path.
  • Tail sampling at 100% for errors, high-latency outliers, and flagged outputs. You never want to miss a failure.
  • Adaptive sampling that increases capture rate when anomalies are detected and decreases it during steady state.

Teams that implement tiered sampling typically reduce telemetry volume by 50–70% while preserving their ability to debug every incident that matters.
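The three tiers above can be expressed as a single sampling decision. A minimal sketch — the latency threshold and rates are assumptions, and in production the tail-sampling half would live in a collector (e.g. OpenTelemetry's tail-sampling processor), since you don't know a request errored until it finishes:

```python
import random

def should_capture(is_error: bool, latency_ms: float, flagged: bool,
                   head_rate: float = 0.15, anomaly_mode: bool = False) -> bool:
    """Tiered sampling decision (illustrative, not a drop-in library).

    Tail tier: always keep errors, slow outliers, and flagged outputs.
    Adaptive tier: raise the head rate while anomalies are being detected.
    Head tier: keep a fixed fraction of ordinary traffic.
    """
    if is_error or flagged or latency_ms > 5_000:  # thresholds are assumptions
        return True
    rate = min(1.0, head_rate * 4) if anomaly_mode else head_rate
    return random.random() < rate
```

The key property: the decision to drop a span is made per request, so errors are never lost, while the happy path shrinks to whatever head rate your statistics require.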

Evaluation sampling, not evaluation-on-every-request

Running LLM-as-judge on 100% of production traffic is the single most expensive observability decision teams make. For most applications, evaluating 5% of traffic asynchronously provides equivalent quality signal. Reserve 100% evaluation for:

  • The first two weeks after any prompt change
  • Segments with known quality issues
  • Requests that trigger existing heuristic guardrails

The async part matters too. Synchronous evaluation adds latency to every request and makes your monitoring a dependency of your inference path — if the eval model goes down, your product goes down.
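Those rules fit in a few lines. A sketch of the rate selection plus the async hand-off — the field names (`guardrail_flagged`, `segment`) and the plain-list queue are stand-ins for whatever your request schema and message queue actually look like:

```python
import random
from datetime import datetime, timedelta

def eval_rate(request: dict, prompt_changed_at: datetime,
              base_rate: float = 0.05) -> float:
    """Pick an evaluation sampling rate per the tiers above (illustrative)."""
    # 100% evaluation for two weeks after any prompt change
    if datetime.utcnow() - prompt_changed_at < timedelta(days=14):
        return 1.0
    # 100% for guardrail-flagged requests or known-problem segments
    if request.get("guardrail_flagged") or request.get("segment") in {"refunds"}:
        return 1.0
    return base_rate

def enqueue_eval_maybe(request: dict, queue: list, prompt_changed_at: datetime):
    # Fire-and-forget: push onto an async queue so the judge model is never
    # in the inference path. If evals stall, the product keeps serving.
    if random.random() < eval_rate(request, prompt_changed_at):
        queue.append(request)
```

Because the enqueue is fire-and-forget, an outage in the eval pipeline degrades your quality signal, not your product.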

Aggressive retention policies

Most observability spend goes toward storing data that nobody queries. Industry data suggests roughly 70% of stored logs are never accessed. Set retention policies that match actual usage:

  • Detailed traces: 7–14 days
  • Aggregated metrics: 90 days
  • Evaluation results: 30 days
  • Cost attribution data: 12 months (finance needs this)
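Expressed as a policy table, the tiers above are easy to enforce and to review. A sketch — the key names are illustrative, and real platforms apply this via index lifecycle or retention settings rather than application code:

```python
from datetime import timedelta

# Retention tiers from the list above, as a single reviewable table.
RETENTION = {
    "traces_detailed":    timedelta(days=14),
    "metrics_aggregated": timedelta(days=90),
    "eval_results":       timedelta(days=30),
    "cost_attribution":   timedelta(days=365),  # finance needs this
}

def is_expired(record_age: timedelta, kind: str) -> bool:
    """Would a record of this age and kind be past its retention window?"""
    return record_age > RETENTION[kind]
```

Keeping the table in one place also makes the next audit trivial: any data class not in the table has no reason to be stored at all.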

Consolidate before you optimize

Tool sprawl is a silent cost multiplier. A common pattern: Datadog for APM, Langfuse for LLM tracing, a custom dashboard for cost tracking, PagerDuty for alerting, and a spreadsheet for evaluation results. Each tool charges per-seat, and the overlap between them is enormous.

At moderate production scale (50 million spans per month, 5 users), the pricing spread across LLM observability tools is staggering: from $129/month to $5,170/month for the same workload, depending on your vendor. Self-hosting with open-source tools (Grafana, Prometheus, Jaeger) can reduce costs by 90–97%, though you're trading dollars for engineering time.

The Decision Framework

Before adding any monitoring layer, run this checklist:

  • What decision does this data enable? If you can't name a specific action you'd take based on the data, you don't need it.
  • What's the cost per insight? Divide the monthly cost of the monitoring layer by the number of actionable findings it produces. If you're paying $3,000/month for an evaluation pipeline that catches 2 real issues, that's $1,500 per caught issue. Is that worth it relative to the cost of letting those issues reach users?
  • Can sampling get you there? Almost always yes. The question is what sampling rate gives you statistical significance for the metric you care about.
  • Is this a launch concern or a steady-state concern? Many monitoring layers are critical during the first weeks after a change but wasteful in steady state. Build in automatic step-down policies.
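The cost-per-insight check from the list is a one-liner worth actually running against each layer's numbers:

```python
def cost_per_insight(monthly_cost: float, actionable_findings: int) -> float:
    """Dollars per actionable finding for one monitoring layer.

    A layer that produces zero actionable findings has infinite cost per
    insight -- a strong hint to sample it down or turn it off.
    """
    if actionable_findings == 0:
        return float("inf")
    return monthly_cost / actionable_findings

print(cost_per_insight(3_000, 2))  # 1500.0 per caught issue
```

Whether $1,500 per caught issue is cheap or expensive depends entirely on what an uncaught issue costs you — which is exactly the comparison the checklist forces.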

The Infrastructure Mirror Problem

There's a deeper structural issue at play. AI inference infrastructure has benefited from massive competition and optimization — custom silicon, quantization, speculative decoding, aggressive caching. Inference cost per token has dropped roughly 50x in three years.

Observability infrastructure has seen no such pressure. The major platforms still price by data volume using models designed when telemetry was scarce. When a single AI request generates 50x more telemetry than a traditional API call, you're paying 50x more to monitor it — even though the request itself costs less.

This gap will eventually close. OpenTelemetry is establishing open standards that reduce vendor lock-in. AI-native observability tools are emerging with pricing models that understand token-based workloads. Self-hosting is becoming more practical as open-source alternatives mature.

But waiting for the market to fix this could mean spending more on monitoring than inference for the next 12–18 months. The teams that right-size now — sampling intelligently, retaining aggressively, consolidating tools — will have a meaningful cost advantage over those that monitor everything because they can.

Build Your Monitoring Budget Like Your Inference Budget

The mental shift is simple: treat monitoring as an engineering system with its own cost-performance tradeoffs, not as an insurance policy you buy and forget.

You wouldn't run every inference request through your most expensive model. You route simple queries to small models and reserve the big ones for hard cases. Apply the same logic to observability: simple requests get sampled telemetry, anomalous requests get full capture, and evaluation is reserved for the cases where quality uncertainty is highest.

The teams that get this right spend 5–10% of their inference budget on monitoring and catch 90% of the issues. The teams that don't get it right spend 200% and catch 95%. That extra 5% of coverage costs 20x more.

Whether it's worth it depends on your risk profile — but at least make it a conscious choice, not an accident of incremental tooling decisions.
