Your APM Is Quietly Dropping LLM Telemetry, and the Bug Lives in the Gap
There is a broken prompt in your system right now that affects roughly three percent of traffic, and your dashboards do not know it exists. The p99 latency chart is green. The error rate is flat. The model-call success metric is at four nines. The only place the failure shows up is in a customer support ticket the platform team cannot reproduce, and by the time the ticket reaches a debugging session, the trace has been sampled away.
This is not a monitoring gap. It is a category mistake. The APM you are running was designed for a world in which dimensions are bounded sets (endpoint, status_code, region, service) and the cost of an additional label is at most a few new time series. LLM workloads do not fit that shape at all. The interesting dimensions are the user's prompt, the retrieved context IDs, the tool-call sequence, the model revision, the prompt template version, the tenant, the locale, and the eval bucket the request fell into. Every one of those is high-cardinality, and promoting even one of them to a metric label, or tagging it onto spans that feed derived metrics, is enough to detonate the metrics store.
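
To make the shape of the problem concrete, here is a back-of-the-envelope sketch. Every per-dimension count below is invented for illustration; what is not invented is the arithmetic, because a metrics store provisions one time series per distinct combination of label values, so the worst case is the product of the cardinalities.

```python
from math import prod

# Classic APM dimensions: small, bounded sets.
# (All counts here are illustrative assumptions, not measurements.)
classic = {
    "endpoint": 40,
    "status_code": 8,
    "region": 6,
    "service": 25,
}

# LLM-workload dimensions: each one open-ended in practice.
# (Again, invented counts; real values vary by deployment.)
llm = {
    "prompt_template_version": 50,
    "model_revision": 12,
    "tenant": 5_000,
    "locale": 30,
    "eval_bucket": 20,
}

# One time series per distinct label combination, so the
# worst case is the product of the per-label cardinalities.
classic_series = prod(classic.values())          # 48,000
with_tenant = classic_series * llm["tenant"]     # 240,000,000
with_all = classic_series * prod(llm.values())   # 86,400,000,000,000

print(f"classic labels only: {classic_series:>20,}")
print(f"classic + tenant:    {with_tenant:>20,}")
print(f"classic + all LLM:   {with_all:>20,}")
```

Note what is missing from the table: the user's prompt and the retrieved context IDs are left out entirely, because their distinct-value counts grow with traffic and any product that includes them is effectively unbounded. The product is a ceiling rather than a forecast, since only combinations that actually occur create series, but a single label like tenant still multiplies every existing series, and a few million active series is already expensive territory for a typical TSDB.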
