Your APM Is Quietly Dropping LLM Telemetry, and the Bug Lives in the Gap
There is a broken prompt in your system right now that affects roughly three percent of traffic, and your dashboards do not know it exists. The p99 latency chart is green. The error rate is flat. The model-call success metric is at four nines. The only place the failure shows up is in a customer support ticket the platform team cannot reproduce, and by the time the ticket reaches a debugging session, the trace has been sampled away.
This is not a monitoring gap. It is a category mistake. The APM you are running was designed for a world in which dimensions are bounded sets (endpoint, status_code, region, service) and the cost of an additional label is at most a few new time series. LLM workloads do not fit that shape at all. The interesting dimensions are the user's prompt, the retrieved context IDs, the tool-call sequence, the model revision, the prompt template version, the tenant, the locale, the eval bucket the request fell into. Several of those are high-cardinality on their own, the rest multiply against them, and almost any subset is enough to detonate the metrics store the moment you turn it into a metric label.
The team's response, every time, is the same. Aggregate away the offending fields. Drop the per-tenant breakdown to keep the cost graph from doubling. Strip prompt_hash from the metric so the time-series count stops climbing. Each of those decisions is locally rational and globally catastrophic, because the thing that just got removed is the only dimension on which the failure is visible. The aggregate metric blends a population in which the easy queries always passed and the hard ones always failed; the broken prompt hides inside the "always failed" tail, and the tail is exactly what got smoothed away.
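The masking effect is easy to reproduce with a toy calculation. The sketch below uses made-up traffic (the tenant names, template versions, and counts are all hypothetical, not taken from any real system) and compares the aggregate success rate with the per-segment one.

```python
from collections import defaultdict

# Hypothetical traffic: (tenant, template_version, requests, failures).
# Every name and count here is an illustrative assumption.
TRAFFIC = [
    ("tenant-a", "v13", 60_000, 30),
    ("tenant-b", "v13", 39_000, 20),
    ("tenant-c", "v14", 1_000, 950),  # the one broken tenant/template pair
]


def success_rate(requests: int, failures: int) -> float:
    """Fraction of requests that did not fail."""
    return 1.0 - failures / requests


def aggregate_rate(rows) -> float:
    """Success rate with every segment collapsed into one number."""
    total = sum(row[2] for row in rows)
    failed = sum(row[3] for row in rows)
    return success_rate(total, failed)


def per_segment_rates(rows) -> dict:
    """Success rate keyed by (tenant, template_version)."""
    totals = defaultdict(lambda: [0, 0])
    for tenant, template, requests, failures in rows:
        segment = totals[(tenant, template)]
        segment[0] += requests
        segment[1] += failures
    return {key: success_rate(*counts) for key, counts in totals.items()}


if __name__ == "__main__":
    print(f"aggregate: {aggregate_rate(TRAFFIC):.4f}")
    for segment, rate in per_segment_rates(TRAFFIC).items():
        print(segment, f"{rate:.4f}")
```

With these assumed numbers the aggregate sits at 99 percent while the tenant-c / v14 segment succeeds only 5 percent of the time. Drop the tenant and template dimensions and the second number can no longer be computed at all.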
Why LLM Telemetry Has The Cardinality Profile Of Product Analytics
A typical service emits telemetry that looks like a low-rank matrix. There are a few hundred endpoints, a handful of status codes, a few regions, a few service versions in flight at once. Multiply them out and you get tens of thousands of unique label combinations. Prometheus, Datadog, and the rest of the metrics stack were designed exactly for this shape, and they are extremely good at it. The cost-per-cardinality curve is gentle as long as each dimension stays in the low hundreds of unique values at most.
LLM telemetry does not look like that. Every prompt is a unique string. The retrieval layer returns a different combination of document IDs for almost every request. The tool-call trace is a tree whose shape depends on what the model decided to do. The prompt template is on its fourteenth revision in the eval suite this week. The model itself has three revisions live at once because canary, control, and shadow eval are all running simultaneously. The tenant set has thousands of customers. The locale set has dozens. Multiply all of those and you do not get tens of thousands of series. You get something that is, for practical purposes, unbounded — closer to clickstream analytics than to service metrics.
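The multiplication can be made concrete. The sketch below computes the upper bound on distinct series for a single metric as the product of per-label cardinalities. The specific counts are assumptions chosen to match the rough magnitudes in the text ("a few hundred endpoints", "thousands of customers"), and the function name max_series is invented for illustration.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on distinct time series for one metric.

    Each unique combination of label values is its own series, so the
    worst case is the product of the per-label cardinalities.
    """
    return prod(label_cardinalities.values())


# Assumed cardinalities for a conventional service.
SERVICE_LABELS = {
    "endpoint": 300,
    "status_code": 5,
    "region": 4,
    "service_version": 3,
}

# Assumed cardinalities for an LLM workload, using only the bounded labels.
# Prompt text, retrieved doc IDs, and tool-call shape are left out because
# they have no meaningful upper bound at all.
LLM_LABELS = {
    "prompt_template_version": 14,
    "model_revision": 3,
    "tenant": 2_000,
    "locale": 30,
}

if __name__ == "__main__":
    print("service:", max_series(SERVICE_LABELS))
    print("llm:", max_series(LLM_LABELS))
```

Under these assumptions the service tops out at 18,000 series and the LLM workload at 2,520,000, before a single per-request field is added. This is a ceiling, not a measurement: a real store only materializes the combinations that actually occur.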
A typical retrieval-augmented pipeline — vector lookup, rerank, model call, post-processing — emits ten to fifty times the telemetry volume of an equivalent stateless API call, and the telemetry is dominated by high-cardinality fields. None of the cardinality is "extra." Each field is the dimension some engineer has to slice on to debug a specific class of failure. The retrieval team needs to know which doc IDs were in context. The prompt team needs to know which template version was active. The model team needs to know which revision answered. The product team needs to know which tenant complained. Stripping any of those is not an optimization — it is the controlled destruction of the signal someone needs.
The Quiet Collapse Of The Metrics Layer
The way this fails in practice is rarely an outage. It is a slow drift in which the dashboards stay green and the operators slowly stop trusting them, without quite saying so out loud. Three things happen, usually in order.
First, the metrics-store budget gets exceeded, and the metrics pipeline starts dropping series. On most platforms this happens silently. Where the platform bills for the overage instead, the failure shows up as cost: Datadog's custom-metrics charges scale with the number of distinct tag-value combinations, so adding a single tag with a thousand unique values can multiply the metric count into the tens of thousands and the bill into the thousands of dollars per metric per month. A naive prompt.template_id tag with two hundred templates and a tenant_id tag with five hundred tenants creates up to 200 × 500 = 100,000 series for one metric. The finance team notices the bill before the ops team notices the dropped data.
Second, the team responds by aggregating. The high-cardinality fields move from labels to drop-rules, from drop-rules to "we'll keep them in logs," from logs to "we'll keep them in traces sampled at one percent." Each step preserves the global metric and destroys the per-segment one. The shape of the data on the dashboard remains the same; the questions it can answer get narrower with every iteration.
Third, the long tail becomes invisible. The aggregate quality metric does not move when a single template breaks for a single tenant on a single locale, because that failure is a fraction of a percent of the total request volume. The break only surfaces when a customer escalates, and the escalation lands on a team whose tooling has been silently configured to not show them what just happened. The platform team starts saying "we cannot reproduce" because they genuinely cannot — the sampling layer threw away the trace before they got there.
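The "we cannot reproduce" outcome follows directly from the sampling arithmetic. The sketch below assumes uniform head sampling in which each trace is kept independently with a fixed probability; that is a simplification, and a sampler that makes its decision after seeing the whole trace would behave differently. The function names are invented for illustration.

```python
import math


def prob_any_retained(failing_traces: int, sample_rate: float) -> float:
    """Probability that at least one failing trace survives sampling.

    Assumes each trace is kept independently with probability sample_rate.
    """
    return 1.0 - (1.0 - sample_rate) ** failing_traces


def traces_needed(sample_rate: float, confidence: float) -> int:
    """Smallest number of failing traces needed so that at least one is
    retained with the given confidence, under the same assumption."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - sample_rate))


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, f"{prob_any_retained(n, 0.01):.3f}")
    print("needed for 95%:", traces_needed(0.01, 0.95))
```

At a one percent sample rate, a failure that has occurred ten times has under a ten percent chance of leaving even one trace behind, and it has to recur roughly three hundred times before the odds of having one retained trace reach 95 percent. A single tenant on a single template may never generate that much volume before the customer escalates.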
Why Tail-Based Sampling Is Necessary And Not Sufficient
