
What Your APM Dashboard Won't Tell You: LLM Observability in Production

10 min read
Tian Pan
Software Engineer

Your Datadog dashboard shows 99.4% uptime, sub-500ms P95 latency, and a 0.1% error rate. Everything is green. Meanwhile, your support queue is filling with users complaining the AI gave them completely wrong answers. You have no idea why, because every request returned HTTP 200.

This is the fundamental difference between traditional observability and what you actually need for LLM systems. A language model can fail in ways that leave no trace in standard APM tooling: hallucinating facts, retrieving documents from the wrong product version, ignoring the system prompt after a code change modified it, or silently degrading on a specific query type after a model update. All of these look fine on your latency graph.

Building observable LLM systems requires a different mental model — one that starts with what you need to understand, not just what you need to monitor.

The Silence of Semantic Failures

Traditional software fails loudly. Exceptions propagate up the stack, timeouts trip circuit breakers, and HTTP 5xx rates spike. The system tells you it is broken. LLMs fail quietly. A retrieval-augmented generation pipeline that fetches documents from the wrong index returns a confident, fluent response with a 200 OK and perfectly acceptable latency. Nothing in your APM tooling will flag it.

There are several categories of silent failure unique to LLMs:

Quality degradation without errors. Provider model updates (even patch versions) change model behavior without touching the API contract. A feature that worked on last month's checkpoint can regress on a new one, affecting only a specific distribution of queries. You won't know until users tell you.

Prompt drift. System prompts silently change when engineers touch unrelated code. Without tracking what prompt was actually used for each request, you can't trace a behavioral change to its cause.

Wrong-tool selection in agents. An agent can successfully invoke a tool — returning 200, logging the call — and still have picked the wrong one. The infrastructure looks healthy while the output is wrong.

Compounding retrieval errors. In RAG pipelines, a degraded embedding model or misconfigured vector index returns lower-relevance documents. The LLM generates a coherent response from them. Error rate: 0%.

Traditional APM answers the question "is it working?" LLM observability must answer "why did it behave that way?"

What a Trace Actually Looks Like

The telemetry unit hierarchy for LLMs mirrors standard distributed tracing but with additional layers:

  • Session — a multi-turn conversation grouping multiple user requests
  • Trace — a single end-to-end request through the system
  • Span — a logical unit of work within a trace
  • Generation — the specific LLM inference call

For a typical RAG pipeline, a single user request generates a tree that looks something like this:

Trace: user-request (root)
├── Span: retrieve-documents
├── Span: construct-prompt
├── Generation: chat gpt-4o
│   └── Span: execute_tool web_search
└── Span: guardrail-check

Each generation span should capture the model name, input token count, output token count, temperature, finish reason, and latency. The retrieval span should record which documents were returned and their relevance scores. The tool call span should record the tool name, arguments, and whether it succeeded.
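As a sketch, the fields above can be modeled as three record types. The field names here are illustrative shorthand, not the canonical OTel attribute names:

```python
from dataclasses import dataclass

# Illustrative record types for the three span kinds described above.
# Field names are shorthand for sketching, not OTel attribute names.

@dataclass
class GenerationSpan:
    model: str
    input_tokens: int
    output_tokens: int
    temperature: float
    finish_reason: str
    latency_ms: float

@dataclass
class RetrievalSpan:
    query: str
    document_ids: list
    relevance_scores: list

@dataclass
class ToolCallSpan:
    tool_name: str
    arguments: dict
    succeeded: bool
```

Keeping these as typed records (rather than ad hoc log lines) is what makes later aggregation by model, feature, or tool possible.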

The OpenTelemetry GenAI Semantic Conventions, developed by the OTel GenAI SIG and still evolving, define the canonical attribute names. A few key ones:

  • gen_ai.operation.name — type of operation (chat, embeddings, retrieve, execute_tool)
  • gen_ai.provider.name — provider identifier (openai, anthropic, etc.)
  • gen_ai.request.model — model name
  • gen_ai.usage.input_tokens / gen_ai.usage.output_tokens — token counts
  • gen_ai.usage.cache_read.input_tokens — tokens served from provider cache

Content attributes — full prompts, messages, retrieved documents — are opt-in and disabled by default. Enable them only when privacy regulations and data residency requirements permit. They're essential for debugging but create compliance exposure if captured indiscriminately.

The benefit of using OTel conventions is portability. Instrument once, export to any compatible backend: Langfuse, Arize, MLflow, or Datadog. The schema doesn't change.
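Sticking to the canonical names is what buys that portability. A minimal sketch, using a plain dict to stand in for a real OTel span (a production setup would set these via the opentelemetry SDK):

```python
def generation_attributes(provider, model, input_tokens, output_tokens,
                          operation="chat"):
    """Build a span attribute dict using OTel GenAI attribute names."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.provider.name": provider,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }

attrs = generation_attributes("openai", "gpt-4o", 1200, 250)
```

Any backend that understands the convention can group, sum, and filter on these keys without custom mapping.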

Three Metric Dimensions You Can't Skip

Latency: More Than One Number

LLM latency has three distinct components that your P95 number collapses into one.

Time to First Token (TTFT) is perceived responsiveness — the delay before the user sees anything. For interactive applications, TTFT P95 below 800ms is a reasonable starting SLO, with 200ms for highly responsive chat interfaces. TTFT is dominated by queue time and the prefill phase; it's sensitive to prompt length and concurrent load.

Time Per Output Token (TPOT) is the decode-phase throughput: how fast tokens stream after the first one. This determines whether streaming feels smooth or stutters. TPOT varies widely with hardware and serving stack (an H100 running vLLM behaves very differently from an older deployment), so benchmark it on your own setup under realistic concurrency rather than relying on published numbers.

End-to-end latency is the full pipeline time including retrieval, tool calls, and post-processing. For complex agent pipelines this is the number users actually experience and the one you should set SLOs against.

Measure all three. A spike in TTFT and a spike in E2E can have completely different root causes.
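All three can be captured on the consumer side of a streaming response. A stdlib-only sketch, with a simulated generator standing in for a real streaming API call (consumer-side timing includes network jitter, which is usually what you want anyway):

```python
import time

def measure_stream(token_iter):
    """Consume a token stream and compute TTFT, mean TPOT, and E2E latency."""
    start = time.monotonic()
    ttft = None
    prev = start
    gaps = []
    tokens = []
    for tok in token_iter:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start          # time to first token
        else:
            gaps.append(now - prev)     # inter-token gap (decode phase)
        prev = now
        tokens.append(tok)
    e2e = time.monotonic() - start
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e, "tokens": tokens}

# Simulated stream standing in for a real streaming response.
def fake_stream():
    for tok in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield tok
```

Emit all three numbers per request so a TTFT spike (queueing, prefill) can be distinguished from an E2E spike (retrieval, tool calls).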

Cost: Per Feature, Not Per Month

The most dangerous cost mistake is treating LLM spend as a single line item. When costs spike, you need to know which feature, team, or usage pattern caused it — not just that the bill went up.

Token counting at the span level enables this. If every LLM call records gen_ai.usage.input_tokens and gen_ai.usage.output_tokens, you can sum by any dimension: feature flag, user tier, agent type, or model. The real-world impact is significant — one documented case found $47K/month in spending reduced to $28K after cost attribution surfaced that a full-page web scrape was being triggered on short queries, inflating token counts invisibly.

Caching multiplies the leverage of cost visibility. Provider-level prompt caching can cut costs by 50-90% on cached tokens. Semantic caching (deduplicating near-identical queries at the application layer) can eliminate a large share of traffic outright; one industry estimate puts the fraction of semantically similar enterprise queries at around 31%. Neither optimization is visible without token-level tracing.

Cost metrics the spec doesn't capture yet — actual dollar amounts — need to be derived externally by multiplying token counts by per-model pricing. Keep this logic in one place.
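A sketch of that single place: a pricing table plus one aggregation helper. The per-million-token prices below are placeholders; use your provider's current rate card:

```python
from collections import defaultdict

# Placeholder per-million-token prices; real values come from your
# provider's published rate card and change over time.
PRICE_PER_MTOK = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def span_cost(model, input_tokens, output_tokens):
    """Derive dollar cost for one generation span from its token counts."""
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def cost_by(spans, dimension):
    """Aggregate span costs by any attribute: feature, user tier, agent..."""
    totals = defaultdict(float)
    for s in spans:
        totals[s[dimension]] += span_cost(
            s["model"], s["input_tokens"], s["output_tokens"])
    return dict(totals)
```

Because every span already carries its tokens and its tags, the same function answers "cost per feature", "cost per tier", or "cost per agent" with no new instrumentation.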

Quality: The Metric That Requires a New Pipeline

Quality is the category where LLM observability diverges completely from traditional APM. You can't measure it with counters or timers. You need to run evaluations.

The practical approach is online evaluation: asynchronously sample 5-10% of production traffic and score it using an LLM-as-judge or custom Python logic. This runs after the request completes, adding zero latency to the user path.

Key quality signals to track:

  • Groundedness — for RAG, did the response use the retrieved context or fabricate?
  • Relevance — did the retrieved documents actually match the query?
  • Hallucination rate — are factual claims verifiable?
  • Safety pass rate — rate of clean responses on PII, toxicity, and jailbreak dimensions
  • Task completion rate — for agents, did the agent complete the user's actual intent?

Even a simple rubric (1-5 score, LLM-judged on a 5% sample) will catch quality degradation that is invisible on infrastructure metrics. Tools like LangSmith and Braintrust have built production sampling pipelines around this pattern.
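The sampling pattern itself is small. A sketch with a stub judge (a real implementation would send a rubric prompt to a judge model and parse the score; the word-overlap heuristic here is only a placeholder):

```python
import random

SAMPLE_RATE = 0.05  # score 5% of production traffic, off the request path

def judge_groundedness(response, context):
    """Stub for an LLM-as-judge call. Placeholder heuristic: reward
    responses that reuse words from the retrieved context."""
    overlap = set(response.lower().split()) & set(context.lower().split())
    return min(5, 1 + len(overlap))

def maybe_evaluate(trace, rng=random.random):
    """Decide whether to score this trace; run asynchronously after the
    request completes so it adds zero user-facing latency."""
    if rng() >= SAMPLE_RATE:
        return None
    return {
        "trace_id": trace["trace_id"],
        "groundedness": judge_groundedness(trace["response"], trace["context"]),
    }
```

In production this would run from a queue consumer, writing scores back onto the trace so low-scoring requests can be filtered and reviewed.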

Multi-Agent Tracing: Harder Than It Looks

Single-turn LLM calls are relatively easy to instrument. Multi-agent pipelines introduce challenges that require explicit design.

The core problem: a complex agent might execute 15+ LLM calls across multiple models for a single user request. Without hierarchical span linking, you can't know which call introduced a quality issue, which sub-agent consumed 80% of the cost, or whether a slow request was bottlenecked in retrieval or inference.

The fix is maintaining parent span relationships across agent boundaries. Every sub-agent call should record its parent span ID so you can reconstruct the full execution tree after the fact. This requires passing trace context explicitly if your agents communicate asynchronously or across process boundaries.
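The propagation itself reduces to two rules: a sub-agent inherits the trace_id and records its caller's span_id as parent. A sketch with plain dict spans (field names are illustrative):

```python
import uuid

def new_span(trace_id=None, parent_span_id=None, name=""):
    """Create a span record that preserves the parent link, so the full
    agent execution tree can be rebuilt from flat span storage."""
    return {
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex,
        "parent_span_id": parent_span_id,
        "name": name,
    }

def call_sub_agent(parent, name):
    """When handing off to a sub-agent, even across a queue or RPC,
    pass the trace_id and the caller's span_id explicitly."""
    return new_span(trace_id=parent["trace_id"],
                    parent_span_id=parent["span_id"], name=name)
```

The key discipline is that the two IDs travel inside the message payload whenever agents communicate asynchronously; nothing implicit (thread locals, process context) survives that boundary.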

Agent-specific metrics to monitor:

  • Tool call success rate per agent — a drop here often precedes quality failures
  • Chain depth — how many agent-to-agent hops a request traverses; infinite loops show up as depth anomalies before they show up as timeouts
  • Cost per completed goal — the relevant unit for agents, since 10 LLM calls that complete a task are more efficient than 3 calls that don't
  • Consecutive tool failures — alert on 3+ failures in 5 minutes; this often indicates a broken external dependency before the infrastructure monitoring sees it

One production case: two agents were calling each other in a loop. Infrastructure metrics showed only a slow request. Span depth monitoring caught it within 3 minutes by alerting when chain depth exceeded 8.
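A depth guard like the one in that case is a short walk up the parent links. A sketch, assuming spans stored flat and keyed by span_id as in the propagation scheme above:

```python
def check_depth(span, spans_by_id, max_depth=8):
    """Walk parent links to compute chain depth; flag past a threshold.
    Catches agent-to-agent loops before they surface as timeouts."""
    depth = 0
    current = span
    while current.get("parent_span_id"):
        current = spans_by_id[current["parent_span_id"]]
        depth += 1
        if depth > max_depth:
            return depth, True  # alert: probable loop or runaway chain
    return depth, False
```

Running this check as spans arrive, rather than at query time, is what turns depth from a forensic detail into a live alert.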

The Observability Tax

Adding AI monitoring to existing APM platforms is not free. RAG pipelines generate 10-50x more telemetry per request than traditional API calls — every request involves retrieval, prompt construction, inference, tool calls, and post-processing, each generating spans and metrics. Volume-based pricing on tools like Datadog and Splunk results in 40-200% bill increases when teams add LLM workloads without adjusting their instrumentation strategy.

Mitigation options:

  • Sampling. Not every trace needs to be stored. Sample 5-10% of routine requests; keep 100% of errors and low-quality-score traces. MLflow's tracing package supports configurable sampling ratios.
  • Purpose-built tooling. Tools like Langfuse (self-hostable, MIT-licensed), Helicone (proxy-based, minimal setup), and MLflow's lightweight mlflow-tracing package are designed around LLM telemetry patterns and avoid the linear-volume pricing trap.
  • Selective content capture. The opt-in content attributes (full prompts, retrieved documents) are the most expensive to store. Enable them for debug environments and sampled production traffic, not every request.
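The sampling policy in the first bullet fits in one function. A sketch, assuming each trace record carries an error flag and (when evaluated) a 1-5 quality score:

```python
import random

def should_keep(trace, sample_rate=0.10, quality_floor=3, rng=random.random):
    """Retention policy: always keep failures and low-quality traces,
    sample the routine rest at sample_rate."""
    if trace.get("error"):
        return True
    score = trace.get("quality_score")
    if score is not None and score < quality_floor:
        return True
    return rng() < sample_rate
```

The asymmetry is the point: the traces you will actually debug are the rare bad ones, so those are retained at 100% while healthy traffic pays the storage bill at 10%.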

A Practical Starting Point

You don't need to implement everything at once. A phased approach reduces risk and surfaces value incrementally:

Week 1: Assign trace_id to every request. Log input/output token counts per LLM call. Track E2E latency and cost by feature. This alone surfaces most cost surprises.
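The week-one baseline can be one structured log line per LLM call. A sketch (the `sink` parameter defaults to stdout here; swap in your log pipeline):

```python
import json
import time
import uuid

def log_llm_call(feature, model, input_tokens, output_tokens,
                 latency_ms, trace_id=None, sink=print):
    """Emit one structured record per LLM call, tagged with trace_id and
    feature so cost and latency can be grouped by any dimension later."""
    record = {
        "trace_id": trace_id or uuid.uuid4().hex,
        "ts": time.time(),
        "feature": feature,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
    }
    sink(json.dumps(record))
    return record
```

Nothing here requires new infrastructure; any log aggregator you already run can sum tokens by feature from these lines.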

Month 1: Add span hierarchy for retrieval, inference, and tool calls using OTel GenAI conventions. Build a reliability dashboard: TTFT P95, error rate, token usage, and cost per request. Set budget alerts at 80% of monthly targets.

Month 2: Add online evaluations on 5-10% of production traffic. Track groundedness and relevance for RAG pipelines. Add guardrail spans for safety checks. Create annotation queues for human review of low-scoring traces.

Ongoing: Version your prompts. Capture problematic production traces as regression datasets. Run quality evals as part of model update deployment gates. Without this last step, provider model updates become invisible risk.
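The deployment gate can start as a single comparison: run the eval suite on the regression dataset with the candidate model, then block if the mean score regresses past a tolerance. A sketch with an assumed tolerance of 0.05:

```python
def gate_model_update(eval_scores, baseline_scores, max_regression=0.05):
    """Deployment gate: fail if mean eval score on the regression dataset
    drops more than max_regression below the current model's baseline."""
    new_mean = sum(eval_scores) / len(eval_scores)
    old_mean = sum(baseline_scores) / len(baseline_scores)
    passed = new_mean >= old_mean - max_regression
    return {"baseline": old_mean, "candidate": new_mean, "passed": passed}
```

Wired into CI, this turns a silent provider model update into an explicit pass/fail decision before it reaches users.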

The goal is not comprehensive telemetry — it's actionable telemetry. An alert on hallucination rate crossing a threshold is worth more than a hundred latency graphs. A production trace showing which documents were retrieved and what score they had is worth more than the system prompt alone when debugging a wrong answer.

Your APM dashboard will stay green while your users complain. Build the pipeline that explains why.
