Your APM Is Quietly Dropping LLM Telemetry, and the Bug Lives in the Gap
There is a broken prompt in your system right now that affects roughly three percent of traffic, and your dashboards do not know it exists. The p99 latency chart is green. The error rate is flat. The model-call success metric is at four nines. The only place the failure shows up is in a customer support ticket the platform team cannot reproduce, and by the time the ticket reaches a debugging session, the trace has been sampled away.
This is not a monitoring gap. It is a category mistake. The APM you are running was designed for a world in which dimensions are bounded sets — endpoint, status_code, region, service — and adding a label multiplies the series count by a small, known factor. LLM workloads do not fit that shape at all. The interesting dimensions are the user's prompt, the retrieved context IDs, the tool-call sequence, the model revision, the prompt template version, the tenant, the locale, the eval bucket the request fell into. Every one of those is high-cardinality, and any subset of them is enough to detonate the metrics store the moment you tag a span with it.
The team's response, every time, is the same. Aggregate away the offending fields. Drop the per-tenant breakdown to keep the cost graph from doubling. Strip prompt_hash from the metric so the time-series count stops climbing. Each of those decisions is locally rational and globally catastrophic, because the thing that just got removed is the only dimension on which the failure is visible. The aggregate p99 averages over a population in which the easy queries always passed and the hard ones always failed; the broken prompt is hiding inside the "always failed" tail, and the tail is exactly what got smoothed.
Why LLM Telemetry Has The Cardinality Profile Of Product Analytics
A typical service emits telemetry that looks like a low-rank matrix. There are a few hundred endpoints, a handful of status codes, a few regions, a few service versions in flight at once. Multiply them out and you get tens of thousands of unique label combinations. Prometheus, Datadog, and the rest of the metrics stack were designed exactly for this shape, and they are extremely good at it. The cost-per-cardinality curve stays gentle as long as each dimension holds at most a few hundred unique values.
LLM telemetry does not look like that. Every prompt is a unique string. The retrieval layer returns a different combination of document IDs for almost every request. The tool-call trace is a tree whose shape depends on what the model decided to do. The prompt template is on its fourteenth revision in the eval suite this week. The model itself has three revisions live at once because canary, control, and shadow eval are all running simultaneously. The tenant set has thousands of customers. The locale set has dozens. Multiply all of those and you do not get tens of thousands of series. You get something that is, for practical purposes, unbounded — closer to clickstream analytics than to service metrics.
A typical retrieval-augmented pipeline — vector lookup, rerank, model call, post-processing — emits ten to fifty times the telemetry volume of an equivalent stateless API call, and the telemetry is dominated by high-cardinality fields. None of the cardinality is "extra." Each field is the dimension some engineer has to slice on to debug a specific class of failure. The retrieval team needs to know which doc IDs were in context. The prompt team needs to know which template version was active. The model team needs to know which revision answered. The product team needs to know which tenant complained. Stripping any of those is not an optimization — it is the controlled destruction of the signal someone needs.
The Quiet Collapse Of The Metrics Layer
The way this fails in practice is rarely an outage. It is a slow drift in which the dashboards stay green and the operators quietly stop trusting them, without quite saying so out loud. Three things happen, usually in order.
First, the metrics-store budget gets exceeded, and the metrics backend starts dropping or limiting series. On most platforms this happens silently. Datadog's custom-metrics overage charges scale linearly with cardinality, and adding a single tag with a thousand unique values can multiply the metric count into the tens of thousands and the bill into the thousands of dollars per metric per month. A naive prompt.template_id tag with two hundred templates and a tenant_id tag with five hundred tenants creates a hundred thousand series for one metric. The finance team notices the bill before the ops team notices the dropped data.
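To make that multiplication concrete, here is a back-of-the-envelope sketch; the tag counts mirror the example above, and the per-series price is an assumption for illustration, not a quote from any vendor's price sheet.

```python
# Back-of-the-envelope estimate of how tag cardinality fans out into time
# series and overage cost. Every number here is an illustrative assumption.
from math import prod

def series_count(tag_cardinalities: dict[str, int]) -> int:
    """One metric name fans out into the product of its tag cardinalities."""
    return prod(tag_cardinalities.values())

tags = {
    "prompt.template_id": 200,   # assumed number of live template versions
    "tenant_id": 500,            # assumed number of tenants
}

n_series = series_count(tags)    # 200 * 500 = 100,000 series for one metric
price_per_series = 0.05          # assumed $ per custom series per month
print(f"{n_series:,} series -> ~${n_series * price_per_series:,.0f}/month")
```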
Second, the team responds by aggregating. The high-cardinality fields move from labels to drop-rules, from drop-rules to "we'll keep them in logs," from logs to "we'll keep them in traces sampled at one percent." Each step preserves the global metric and destroys the per-segment one. The shape of the data on the dashboard remains the same; the questions it can answer get narrower with every iteration.
Third, the long tail becomes invisible. The aggregate quality metric does not move when a single template breaks for a single tenant on a single locale, because that failure is a fraction of a percent of the total request volume. The break only surfaces when a customer escalates, and the escalation lands on a team whose tooling has been silently configured to not show them what just happened. The platform team starts saying "we cannot reproduce" because they genuinely cannot — the sampling layer threw away the trace before they got there.
Why Tail-Based Sampling Is Necessary And Not Sufficient
The first instinct is to reach for tail-based sampling. The OpenTelemetry collector's tail-sampling processor lets you wait until a trace is complete and then decide whether to keep it based on what happened — slow requests, errors, specific tenants, specific tool-call patterns. This is genuinely the right architectural shape. Head-based sampling decides at trace start, before the team knows whether the trace is interesting; tail-based sampling decides after the fact and lets the system bias toward the long tail that matters.
But tail-based sampling has its own cardinality bill. The collector has to hold every trace in memory until the sampling decision is made, which means that at a thousand traces per second with a sixty-second decision window the collector is buffering sixty thousand traces simultaneously, each with multiple spans and many attributes. Memory pressure scales with the attribute payload per span, and that payload is large on an LLM trace because the GenAI semantic conventions can carry prompt text, response text, tool-call structure, and token usage metadata on the span. The result is that the sampling layer that was supposed to save you cardinality has a memory budget of its own, one that scales with exactly the cardinality it was meant to absorb — and overflowing it means dropping traces before they ever get sampled, which is the worst of both worlds.
The right move is to layer two sampling decisions, not one. A head-based sampler runs at request entry and keeps a thin baseline — say, one in a thousand traces — unconditionally, so you have a uniform sample for headline metrics. A tail-based sampler runs at the collector and selects the rest based on importance signals: errors, slow tail, specific tenants, eval-bucket failures, anomaly hits. The two streams go to different stores with different retention. The baseline stream powers dashboards; the importance stream powers debugging.
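A minimal sketch of the two layers, assuming a finished trace arrives as a list of span dicts. The attribute names, tenant values, and thresholds are placeholders, and in a real deployment the tail layer would live in the collector's tail-sampling policies rather than in application code.

```python
BASELINE_MOD = 1_000                   # head-based: keep one trace in a thousand
IMPORTANT_TENANTS = {"tenant_4471"}    # hypothetical always-keep tenants
SLOW_MS = 5_000                        # hypothetical slow-tail threshold

def head_sample(trace_id: int) -> bool:
    """Decided at request entry, before anything is known about the trace."""
    return trace_id % BASELINE_MOD == 0

def tail_sample(spans: list[dict]) -> bool:
    """Decided at the collector, once the whole trace has been buffered."""
    return any(
        span.get("status") == "error"
        or span.get("duration_ms", 0) > SLOW_MS
        or span.get("app.tenant_id") in IMPORTANT_TENANTS
        or span.get("app.eval.bucket") == "failed"
        for span in spans
    )

def destination(trace_id: int, spans: list[dict]) -> str | None:
    if head_sample(trace_id):
        return "baseline-store"     # thin uniform sample: dashboards, headline metrics
    if tail_sample(spans):
        return "importance-store"   # biased sample: debugging, post-mortems
    return None                     # dropped
```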
The Tiered Architecture That Actually Survives Production
The pattern that scales — and the one most teams converge on after one or two incidents — has three tiers, and the cardinality budget is allocated separately for each.
The metrics tier holds bounded-cardinality dimensions only: model class, prompt-template version (rolled up into the hundreds, not the millions), tenant tier (free, paid, enterprise), locale, request status. Cardinality stays well under a hundred thousand series total. This is what dashboards graph and alerts fire on. It will not, by design, surface a single broken prompt for a single tenant. That is not a bug — it is what keeps the bill bounded and the graphs queryable.
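As a sketch of what "bounded-cardinality dimensions only" means in practice, here is an assumed mapping from raw request attributes onto a fixed label set; the field names and rollup rules are illustrative choices, not a prescribed schema.

```python
def bounded_labels(req: dict) -> dict:
    """Collapse high-cardinality request attributes into a fixed, bounded label set."""
    return {
        "model_class": req["model"].split("-")[0],                 # "gpt-4o" -> "gpt"
        "template_major": req["template_version"].split(".")[0],   # "v37.2"  -> "v37"
        "tenant_tier": req.get("tenant_tier", "free"),             # free / paid / enterprise
        "locale": req.get("locale", "unknown"),
        "status": "error" if req.get("error") else "ok",
    }

# The raw tenant_id, prompt text, and retrieved doc IDs never reach the metrics
# store; they stay on the trace, where sampling and columnar storage absorb them.
```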
The trace tier holds full-fidelity wide events for a sampled subset, where the sample is biased toward importance. Every span carries the full high-cardinality payload — prompt template ID, retrieved doc IDs, tool-call sequence, eval bucket, tenant — and the trace store is a columnar engine (ClickHouse-class) that handles high-cardinality joins natively. Compression of fifteen to fifty times on this kind of data is normal in columnar stores; the same data in a traditional metrics store is unaffordable at any sample rate above a few percent. This is what debugging sessions query. The metrics tier does the alerting; the trace tier does the post-mortem.
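For a sense of what "the full high-cardinality payload" looks like on a single span, here is one assumed wide event. The gen_ai.* attribute names come from the GenAI semantic conventions linked below; every app.* key is a hypothetical custom attribute, not part of any standard.

```python
# One wide event (span attributes) in the trace tier. gen_ai.* keys follow the
# OpenTelemetry GenAI semantic conventions; app.* keys are hypothetical custom
# attributes a team might add for its own debugging dimensions.
wide_event = {
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.usage.input_tokens": 1834,
    "gen_ai.usage.output_tokens": 212,
    "app.prompt.template_id": "checkout-help-v37.2",     # hypothetical
    "app.retrieval.doc_ids": ["doc_8812", "doc_0031"],   # hypothetical
    "app.tool_calls": ["lookup_order", "quote_refund"],  # hypothetical
    "app.tenant_id": "tenant_4471",                      # hypothetical
    "app.tenant_tier": "enterprise",                     # hypothetical
    "app.eval.bucket": "failed",                         # hypothetical
    "app.locale": "de-DE",                               # hypothetical
    "duration_ms": 6_120,
    "status": "error",
}
```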
The events tier — sometimes folded into traces, sometimes separated — is the per-prompt-version rollup that bridges the two. It exists because the question "did anything change when we rolled out template v37?" is one the metrics tier cannot answer (template ID is not a metric label) and the trace tier cannot answer cheaply (you would have to scan every sampled trace across two time windows). The rollup pre-aggregates trace-level events into a manageable cardinality — by template version, by tenant tier, by eval bucket — and is what powers the prompt-comparison dashboards the prompt-engineering team actually uses.
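A sketch of that pre-aggregation, reusing the assumed attribute names from the wide-event example above; the grouping keys carry the idea, and the storage target is deliberately left out.

```python
from collections import defaultdict

def rollup(events: list[dict]) -> dict:
    """Pre-aggregate sampled wide events by (template version, tenant tier, eval bucket)."""
    agg = defaultdict(lambda: {"requests": 0, "errors": 0, "input_tokens": 0})
    for event in events:
        key = (
            event.get("app.prompt.template_id", "unknown"),  # hypothetical attribute names
            event.get("app.tenant_tier", "unknown"),
            event.get("app.eval.bucket", "none"),
        )
        bucket = agg[key]
        bucket["requests"] += 1
        bucket["errors"] += int(event.get("status") == "error")
        bucket["input_tokens"] += event.get("gen_ai.usage.input_tokens", 0)
    return dict(agg)

# "Did anything change when we rolled out template v37?" becomes a comparison of
# two such rollups over two time windows, instead of a scan of raw traces.
```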
The split between billing-grade and ops-grade telemetry is a separate axis that cuts across all three tiers. Billing-grade telemetry — token counts, model calls, paid-tier requests — has to be unsampled, because every record drives revenue recognition or cost attribution. Ops-grade telemetry can be sampled aggressively, because losing a representative subset of traces does not change the operational picture. Conflating the two is what produces the worst kind of drift: a sampling change that nobody flagged in review silently breaks the cost dashboard six weeks later, and the finance team finds out before the platform team does.
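One way to make the split explicit in code, again under the assumed field names from the earlier examples: the billing record is extracted before any sampling decision applies, so a sampler change cannot silently touch it.

```python
BILLING_FIELDS = (
    "app.tenant_id",               # hypothetical custom attribute
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
)

def split_record(event: dict, sampler_keep: bool) -> tuple[dict, dict | None]:
    """Billing-grade telemetry bypasses every sampler; ops-grade telemetry does not."""
    billing = {k: event[k] for k in BILLING_FIELDS if k in event}  # always emitted, unsampled
    ops = event if sampler_keep else None                          # full payload, sampled
    return billing, ops
```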
The Architectural Realization
The thing that makes this hard is not technical. The columnar trace stores exist. The OpenTelemetry GenAI semantic conventions stabilized in 2026 and Datadog now supports them natively. The tail-sampling processor is a well-understood piece of infrastructure. The components are all available.
What is missing, on most teams, is the recognition that LLM workloads are not a service-metrics problem with extra fields. They are a product-analytics problem wearing a service-metrics uniform. The right reference point for an LLM telemetry system is closer to the click-tracking pipeline than to the host-metrics pipeline. The cardinality profile, the query patterns, the long-tail debugging needs, and the importance-based sampling all map cleanly to user-event analytics, and very poorly to the bounded-dimension metrics stack the team already has running.
Teams that get this right early end up running their LLM telemetry on a different stack from their service telemetry, and they stop apologizing for it. Teams that do not get it right end up buying a bigger metrics-store tier every six months, dropping more dimensions every quarter, and accumulating an unspoken backlog of escalations they cannot reproduce. The bill goes up; the signal goes down; the incident-review meeting starts ending with "we will improve telemetry" as an action item that never quite ships, because the telemetry stack the team is improving is structurally the wrong shape for the workload.
The fix is not a single dashboard or a single tool. It is the architectural decision to treat the cardinality profile of the workload as a first-class design input — and to build the telemetry stack around the questions the team will actually need to answer at three in the morning, not the questions the metrics store happens to be cheap to answer. The team that does that ships an observable AI feature. The team that does not is paying a metrics vendor to drop the signal at exactly the moment the signal matters.
- https://opentelemetry.io/docs/specs/semconv/gen-ai/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
- https://www.honeycomb.io/blog/llms-demand-observability-driven-development
- https://www.honeycomb.io/use-cases/ai-llm-observability
- https://docs.honeycomb.io/get-started/basics/observability/concepts/events-metrics-logs
- https://docs.datadoghq.com/account_management/billing/custom_metrics/
- https://docs.datadoghq.com/llm_observability/
- https://www.datadoghq.com/blog/llm-otel-semantic-convention/
- https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md
- https://opentelemetry.io/docs/concepts/sampling/
- https://clickhouse.com/resources/engineering/high-cardinality-slow-observability-challenge
- https://grafana.com/blog/2022/10/20/how-to-manage-high-cardinality-metrics-in-prometheus-and-kubernetes/
- https://oneuptime.com/blog/post/2026-04-01-ai-workload-observability-cost-crisis/view
- https://isburmistrov.substack.com/p/all-you-need-is-wide-events-not-metrics
