LLM Observability in Production: Tracing What You Can't Predict
Your monitoring stack tells you everything about request rates, CPU, and database latency. It tells you almost nothing about whether your LLM just hallucinated a refund policy, why a customer-facing agent looped through three tool calls to answer a simple question, or which feature in your product is quietly burning $800 a day in tokens.
Traditional observability was built around deterministic systems. LLMs are structurally different — same input, different output, every time. The failure mode isn't a 500 error or a timeout; it's a confident, plausible-sounding answer that happens to be wrong. The cost isn't steady and predictable; it spikes when a single misconfigured prompt hits a traffic wave. Debugging isn't "find the exception in the stack trace"; it's "reconstruct why the agent chose this tool path at 2 AM on Tuesday."
This is the problem LLM observability solves — and the discipline has matured significantly over the past 18 months.
Why Traditional Tracing Falls Short
Distributed tracing works on a core assumption: given the same code and the same input, execution follows the same path. You instrument the code, the spans tell you what happened, and you can reproduce the behavior in a test environment to debug it.
LLMs break this assumption at every level. The model is a black box that samples from a probability distribution. Temperature, nucleus sampling, and model updates all mean that even with identical inputs, the output varies. In agentic systems, that variance compounds: if the LLM picks a different tool on step one, it encounters different intermediate results, which shifts its decision on step two, and so on. By step five, the trace looks nothing like the trace from yesterday's "identical" request.
The implications are significant:
- Errors surface in evaluation, not exceptions. A hallucination doesn't throw a TypeError. It returns HTTP 200 with a confident-sounding lie. Detection requires downstream evaluation — a separate model grading the output, a user feedback signal, or a rule-based checker — not a try/catch block.
- Latency is bimodal and input-dependent. A 50-token prompt and a 10,000-token prompt hit the same endpoint but have wildly different latency profiles. Using mean latency as your SLO is nearly meaningless; you need to track p95/p99 segmented by input length bucket.
- Cost is a runtime variable, not a fixed overhead. How much a request costs depends on how many tokens it generates, which varies with the model, the prompt, and the task. An agent with tool calls can balloon from a 500-token request into a 15,000-token multi-step chain.
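The latency point lends itself to a concrete sketch. Here is a minimal p95/p99 calculation segmented by input-length bucket; the bucket edges and the nearest-rank percentile are illustrative choices, not a standard:

```python
import math

def percentile(values, q):
    # Nearest-rank percentile: no interpolation, adequate for dashboards.
    ranked = sorted(values)
    return ranked[max(0, math.ceil(q / 100 * len(ranked)) - 1)]

def latency_by_input_bucket(samples, edges=(100, 1_000, 10_000)):
    # samples: iterable of (input_tokens, latency_seconds) pairs.
    # Bucket by prompt length so short and long requests aren't averaged together.
    buckets = {}
    for tokens, latency in samples:
        key = next((f"<={e}" for e in edges if tokens <= e), f">{edges[-1]}")
        buckets.setdefault(key, []).append(latency)
    return {k: {"p95": percentile(v, 95), "p99": percentile(v, 99)}
            for k, v in buckets.items()}
```

A single unsegmented p99 over this traffic would hide the fact that short prompts are fast and long prompts are slow for entirely different reasons.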
The OpenTelemetry Vocabulary for LLMs
The OpenTelemetry community has been developing gen_ai.* semantic conventions that give engineers a vendor-neutral vocabulary for LLM telemetry. These conventions define standard span types and attributes across the full agentic stack.
The core span operations you'll instrument:
| Operation | Span Name | What It Covers |
|---|---|---|
| LLM inference | chat {model} | The actual model API call |
| Embedding | embeddings {model} | Vector generation |
| Retrieval | retrieval {data_source_id} | RAG fetch from vector store |
| Tool execution | execute_tool {tool_name} | Any tool invocation |
| Agent invocation | invoke_agent {agent_name} | Calling a sub-agent |
The key standard attributes on each span: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.temperature, gen_ai.response.finish_reason. For cost tracking, you also want gen_ai.usage.cache_read_input_tokens and gen_ai.usage.cache_creation_input_tokens — cached tokens are typically 80–90% cheaper, so cache effectiveness is a first-class performance metric.
One important design decision: prompts and completions should go into span events, not span attributes. Observability backends have payload size limits that make large completions impractical as attributes. An opt-in event containing the full message content keeps your spans lean and lets you gate content logging on a per-environment basis.
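Putting the attribute/event split together, here is a minimal sketch using plain dicts in place of real OTel span objects. The attribute names follow the gen_ai.* conventions; the event name and the CAPTURE_MESSAGE_CONTENT flag are illustrative stand-ins for whatever opt-in mechanism your instrumentation provides:

```python
import os

def build_chat_span(model, temperature, usage, finish_reason, messages):
    # Lean, always-on attributes (gen_ai.* semantic convention names).
    attributes = {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.request.temperature": temperature,
        "gen_ai.usage.input_tokens": usage["input_tokens"],
        "gen_ai.usage.output_tokens": usage["output_tokens"],
        "gen_ai.response.finish_reason": finish_reason,
    }
    events = []
    if os.environ.get("CAPTURE_MESSAGE_CONTENT") == "true":
        # Full prompt/completion content lives in an opt-in event, not an
        # attribute, so span payloads stay small when capture is off.
        events.append({"name": "gen_ai.content", "body": messages})
    return attributes, events
```

Gating content capture on an environment variable lets you log full messages in staging while keeping production spans lean.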
The gen_ai.* conventions are still in Development status rather than Stable, but they're widely implemented — Traceloop, OpenLIT, LangSmith, and Langfuse all support them. The fragmentation between tools has decreased substantially as the spec has solidified.
Tracing Agentic Chains
Single LLM calls are straightforward to instrument. The harder problem is distributed tracing across multi-step agentic workflows where a single user request spawns sub-agents, tool calls, and retrieval operations across different services.
The W3C Trace Context standard (traceparent header) is the propagation mechanism. Every span knows its parent, maintaining causal chains across service and model call boundaries. A fully-traced agentic request looks like this in your trace visualization:
Trace: user-request-id
└── invoke_agent orchestrator
├── chat claude-3-5-sonnet ← first LLM call
├── execute_tool web_search ← tool invocation
│ └── HTTP GET search-api
├── retrieval docs-vectordb ← RAG fetch
└── invoke_agent specialist-agent ← sub-agent spawn
└── chat gpt-4o ← sub-agent's LLM call
This trace tells you: total duration, which step contributed the most latency, what the sub-agent was asked, and how many tokens each model consumed. Without this structure, you have a collection of disconnected spans and no way to correlate the fact that the hallucination in the final response came from bad retrieval results three steps earlier.
Where propagation breaks down: HTTP calls propagate trace context automatically through OpenTelemetry's HTTP instrumentation. Model Context Protocol (MCP) servers and custom RPC mechanisms often don't — you have to manually extract the traceparent header on the server side and create child spans explicitly. This is a common gap in agent instrumentation.
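The manual extraction step is mechanical. Here is a hedged sketch that parses the W3C traceparent header by hand; in a real service you would use OpenTelemetry's propagator API rather than string-splitting:

```python
def parse_traceparent(header):
    # W3C Trace Context: "{version}-{trace_id}-{parent_span_id}-{trace_flags}",
    # e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".
    version, trace_id, parent_span_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_span_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        # The low bit of trace-flags is the "sampled" flag.
        "sampled": bool(int(flags, 16) & 0x01),
    }
```

Your MCP or RPC server reads this header from the incoming request metadata, then creates its spans with the extracted trace_id and parent_span_id so they join the caller's trace.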
Session continuity: For multi-turn conversations, trace context needs to span requests. This usually means propagating a session_id or conversation_id as a baggage item alongside traceparent. Tools like Langfuse model this explicitly with a Session → Trace → Span hierarchy that lets you analyze conversation-level failure rates, not just individual request failure rates.
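Propagating the session identifier comes down to serializing a W3C Baggage header alongside traceparent. A minimal sketch; the entry names session_id and conversation_id are conventions of your own system, not part of the spec:

```python
from urllib.parse import quote

def baggage_header(entries):
    # W3C Baggage format: comma-separated key=value pairs, with values
    # percent-encoded so spaces and commas survive transport.
    return ",".join(f"{key}={quote(str(value))}" for key, value in entries.items())
```

For example, baggage_header({"session_id": "sess-42", "conversation_id": "conv-7"}) yields a header every downstream service can read, so each span in the chain can be tagged with the conversation it belongs to.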
Metrics That Actually Matter
There are three tiers of metrics worth distinguishing, based on how you respond to each.
Tier 1 — Operational (alert on these):
- Token throughput (input + output tokens/second, per model)
- Error rate segmented by type: 4xx (prompt policy violations), 5xx (provider errors), timeouts
- p95/p99 latency — broken down by TTFT (Time to First Token) and TPOT (Time Per Output Token) for streaming interfaces
- Hourly spend rate and cost per user
Tier 2 — Quality (monitor for trends):
- Cache hit rate: cache_read_input_tokens / input_tokens — a sudden drop signals a prompt change invalidating your cache prefixes
- Tool call success rate and retry rate in agentic workflows
- Hallucination rate from your evaluation pipeline
- Session-level task completion rate (requires evaluation, not just tracing)
Tier 3 — Optimization signals:
- Output-to-input token ratio — unusually high values indicate prompts generating verbose responses that could be constrained
- Model routing distribution — what percentage of traffic is going to your most expensive model tier
- Retrieval relevance scores in RAG pipelines
TTFT and TPOT deserve special attention for streaming applications. TTFT is dominated by prefill time — how long it takes to process the input prompt — and is what users experience as perceived responsiveness. TPOT is the decode phase: how fast tokens arrive once generation starts. These have different optimization levers (shorter prompts improve TTFT; KV cache hits improve both), and conflating them into a single "LLM latency" metric hides the distinction.
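Both metrics fall out of the token-arrival timestamps you already have when streaming. A minimal sketch, assuming you record a wall-clock time for each received token:

```python
def streaming_latency(request_start, token_arrival_times):
    # TTFT: wait until the first token arrives; dominated by prefill.
    # TPOT: mean gap between subsequent tokens; the decode rate.
    ttft = token_arrival_times[0] - request_start
    n = len(token_arrival_times)
    tpot = ((token_arrival_times[-1] - token_arrival_times[0]) / (n - 1)
            if n > 1 else 0.0)
    return ttft, tpot
```

Recording these two numbers per span, instead of one end-to-end duration, is what lets you tell a prefill regression (longer prompts, cache misses) apart from a decode regression (provider throughput).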
Cost Attribution at the Feature Level
Token costs that are invisible until the monthly bill arrives are a predictable consequence of not wiring cost attribution into your traces from day one. The pattern that scales:
- Tag traces at the top level. Attach user_id, feature_name, environment, and experiment_id to every trace at creation. OpenTelemetry's baggage propagation carries these through the full call chain automatically, so every child span inherits the metadata.
- Calculate cost at ingestion, not query time. Cost isn't in the OTel spec because pricing changes. The standard approach: compute cost = (input_tokens × price_per_million) + (output_tokens × price_per_million) at the time you ingest the span, and store it as a custom attribute. Use separate rates for cache hits vs. standard input.
- Alert on rate, not total. Daily spend alerts are too slow. Alert on cost per hour rolling over a threshold, segmented by feature. The failure mode you're guarding against is a single feature's traffic spike consuming the budget for the entire product.
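The ingestion-time calculation is a few lines. A sketch with hypothetical per-million-token rates (real prices change, so keep them in configuration, not code), including the discounted rate for cached input tokens:

```python
# Hypothetical rates in USD per million tokens; not any provider's real pricing.
PRICES = {
    "example-model": {"input": 3.00, "cached_input": 0.30, "output": 15.00},
}

def span_cost_usd(model, input_tokens, cache_read_tokens, output_tokens):
    # Cached input tokens are billed at the discounted rate; the remaining
    # input tokens and all output tokens at the standard rates.
    p = PRICES[model]
    uncached_input = input_tokens - cache_read_tokens
    return (uncached_input * p["input"]
            + cache_read_tokens * p["cached_input"]
            + output_tokens * p["output"]) / 1_000_000
```

Store the result as a custom attribute on the span at ingestion, and the hourly-rate alert becomes a simple sum over recent spans grouped by feature_name.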
Proxy-based tools like Helicone handle this by routing all LLM API calls through a logging proxy — zero code change required, immediate visibility into per-user and per-feature costs. SDK-based tools like Langfuse require instrumentation but give you more flexibility in how you structure the cost hierarchy.
Debugging When You Can't Reproduce
The standard debugging playbook — reproduce it locally, add logging, step through — breaks entirely for non-deterministic LLM behavior. The equivalent for LLM systems is building enough signal into every trace that you can reconstruct what the model experienced, even if you can't replay it exactly.
The minimum viable reproducibility record for each inference span:
- Full prompt (system message + user messages)
- Model version (not just model name — gpt-4o-2024-11-20 vs gpt-4o-2024-08-06 is a meaningful distinction)
- Temperature, top_p, and seed if set
- The gen_ai.response.id returned by the provider
These six fields let you reconstruct the exact distribution the model sampled from. You won't get identical outputs, but you'll get outputs from the same distribution — and often that's enough to reproduce a class of failures.
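As a concrete shape for that record, here is a sketch that collects the six fields from a generic request/response pair. The field names are illustrative, and in practice the prompt itself would go into a span event rather than an attribute:

```python
def reproducibility_record(request, response):
    # The six fields that pin down the distribution the model sampled from.
    return {
        "prompt": request["messages"],
        "model_version": response["model"],   # versioned id, not an alias
        "temperature": request.get("temperature"),
        "top_p": request.get("top_p"),
        "seed": request.get("seed"),
        "response_id": response["id"],        # provider's gen_ai.response.id
    }
```

Reading the versioned model id from the response rather than the request matters: the request may say gpt-4o, but only the response tells you which dated snapshot actually served it.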
For agent path debugging, log tool selection decisions as span events with the tool arguments before execution. When you compare a "bad" trace against a "good" trace from the same input, the first divergence in tool selection is usually where the root cause lives.
Tail-based sampling is critical here. Don't apply a uniform 10% sampling rate — you'll lose the exact traces you most need to investigate. Keep 100% of traces that have errors, high cost, long duration, or low evaluation scores. Sample aggressively on happy-path, low-stakes traffic.
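The keep-or-sample decision can be expressed as a small predicate evaluated once the trace is complete. A sketch with illustrative thresholds; tune them to your own traffic:

```python
import random

def keep_trace(trace, baseline_rate=0.10):
    # Tail-based sampling decision, made after the trace has finished.
    # Always keep the traces you'll need to investigate; sample the rest.
    if (trace.get("error")
            or trace.get("cost_usd", 0.0) > 0.50
            or trace.get("duration_s", 0.0) > 30.0
            or trace.get("eval_score", 1.0) < 0.5):
        return True
    return random.random() < baseline_rate  # happy-path, low-stakes traffic
```

In production this logic usually lives in a collector's tail-sampling stage rather than application code, but the decision structure is the same.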
Choosing Your Tooling
The landscape has organized into a few distinct categories with different tradeoffs:
Proxy-based (zero code change): Helicone routes all LLM calls through a logging layer. Drop-in, immediate visibility, best for teams that want baseline cost and latency monitoring with minimal setup.
SDK-based with evaluation: Langfuse, LangSmith, BrainTrust, and Arize Phoenix all require SDK instrumentation but give you richer evaluation workflows, prompt versioning, and quality dashboards alongside tracing. Langfuse and Phoenix are self-hostable for teams with data residency requirements.
Infrastructure-native: If you're already running Datadog or Honeycomb, their LLM observability extensions are worth considering. Datadog has native hallucination detection that flags contradictions and unsupported claims per span. Honeycomb's high-cardinality querying makes ad-hoc trace analysis particularly powerful.
OTel-based auto-instrumentation: OpenLIT and Traceloop's OpenLLMetry give you OpenTelemetry-native instrumentation across popular LLM frameworks with minimal code changes, then let you route the telemetry to whatever backend you already use.
The decision usually comes down to whether you want a purpose-built LLM observability platform (Langfuse, Arize, LangSmith) or whether you want to extend your existing observability stack with LLM-specific conventions (Datadog, Honeycomb + OTel).
Observability as a Feedback Loop
The teams getting the most value from LLM observability aren't using it purely as an ops tool. They're using trace data to inform prompt engineering decisions, model routing rules, and caching strategies.
When you can see that a specific prompt template has a cache hit rate of 12% while a similar one is at 64%, you know where to invest in prompt restructuring. When you can see that 40% of your p99 latency comes from one retrieval step, you know where to optimize your vector index. When your evaluation pipeline flags a cluster of hallucinations all originating from the same data source, you know which knowledge base needs updating.
The infrastructure is largely solved. The open problems are at the session level — aggregating quality signals across multi-turn conversations rather than individual spans — and at the cost attribution level, where most teams are still tracking per-user costs rather than per-feature costs. Those are the areas where the next generation of tooling is focusing, and where the investment in good instrumentation pays off most directly.
