LLM Observability in Production: The Four Silent Failures Engineers Miss
Most teams shipping LLM applications to production have a logging setup they mistake for observability. They store prompts and responses in a database, track token counts in a spreadsheet, and set up latency alerts in Datadog. Then a user reports the chatbot gave wrong answers for two days, and nobody can tell you why — because none of the data collected tells you whether the model was actually right.
Traditional monitoring answers "is the system up and how fast is it?" LLM observability answers a harder question: "is the system doing what it's supposed to do, and when did it stop?" That distinction matters enormously when your system's behavior is probabilistic, context-dependent, and often wrong in ways that don't trigger any alert.
The Four Silent Failures
Before building observability infrastructure, it helps to know what you're actually trying to catch. LLM systems fail in four ways that standard monitoring misses entirely:
Confident errors. The model returns a wrong answer with no indication of uncertainty. There's no exception, no 4xx status code, no elevated error rate in your metrics. The response looks identical to a correct one. A customer service bot cites a return policy that hasn't existed for six months and sounds completely authoritative while doing it. Without evaluation running on production traffic, this never appears in any dashboard.
Silent drift. Performance degrades gradually as the world changes around your system. The model's training data gets stale. Your product descriptions update but the context in your RAG pipeline doesn't. A prompt that worked well six months ago starts producing off-target responses because the context it was written for has shifted. You notice when users complain, not before.
Unbounded costs. Token counts compound in ways that aren't obvious at design time. Retry logic on failures doubles or triples your spend without any functional improvement. Context windows fill up as conversation history grows. A single poorly-scoped agent loop makes 40 tool calls instead of 4. None of this shows up as an error — just an invoice.
Opaque reasoning. When an agent takes the wrong action or a chain produces a bad output, you need to trace exactly which step introduced the error. With standard logging, you have inputs and outputs but not the intermediate state: which documents were retrieved, how the reranker scored them, whether the tool call returned what was expected, how the model interpreted the result. Debugging is archaeology.
What Observability Actually Requires
The five pillars of LLM observability map directly onto the failure modes above:
Reliability covers what traditional monitoring handles well — latency percentiles, provider error rates, rate limit recovery, availability. This is table stakes.
Quality is where most teams have a gap. You need factual accuracy and grounding success rates measured continuously on production traffic, not just during offline evaluation. The only way to know your model is still correct is to evaluate it.
Safety means tracking jailbreak attempts, PII leaks in responses, and toxic content — particularly important if your system handles sensitive domains or serves a broad user base.
Cost requires per-request accounting, not just monthly totals. You need to know which request types are expensive, where retry storms are occurring, and whether caching is working.
Governance is the audit trail: complete traceability of every decision the system made and why, reproducible enough to answer questions from legal, compliance, or an angry customer.
Building all five simultaneously isn't realistic. The practical implementation roadmap runs in phases: start with basic logging and trace IDs, standardize telemetry and build dashboards, add guardrails and evaluation, then layer in multi-model routing and agent tracing, and finally automate governance.
Distributed Tracing as the Foundation
For any system more complex than a single LLM call, distributed tracing is the structural foundation. OpenTelemetry has become the standard instrumentation layer — it separates data collection from data storage, preventing vendor lock-in while letting you route spans to whatever backend makes sense.
The mental model: a trace represents the complete lifecycle of a request. Spans represent individual operations within that request. A user question to a RAG application might generate a trace with spans for query embedding, document retrieval, reranking, prompt assembly, LLM call, and response parsing. Each span carries its own timing, inputs, outputs, and attributes.
The minimum viable span for an LLM call should capture:
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("llm.model", "claude-3-5-sonnet")
    span.set_attribute("llm.prompt_tokens", prompt_token_count)
    span.set_attribute("llm.temperature", 0.7)
    # Capture prompt/response as events, not attributes —
    # large payloads can break span exporters if stored as attributes
    span.add_event("prompt", {"content": prompt_text})
    response = call_llm(prompt_text)
    span.add_event("response", {"content": response.text})
    span.set_attribute("llm.completion_tokens", response.usage.completion_tokens)
    span.set_attribute("llm.latency_ms", response.latency_ms)
```
One important implementation detail: prompts and responses should be captured as events on a span rather than span attributes. Span attributes in most backends have size constraints; a 4K-token prompt stored as an attribute will silently truncate or crash your exporter. Events handle large payloads correctly.
For frameworks like LangChain or LlamaIndex, auto-instrumentation libraries handle this automatically:
```python
from openinference.instrumentation.langchain import LangChainInstrumentor

LangChainInstrumentor().instrument()
# All LangChain calls now emit spans automatically
```
Auto-instrumentation covers the common cases but tends to miss application-level context — user IDs, session IDs, feature flags, A/B test variants. These need to be injected manually through baggage or span attributes. Without them, traces become hard to correlate with actual user experiences.
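OpenTelemetry baggage is the usual mechanism for carrying this context, but the underlying idea is simple request-scoped state. A minimal sketch of the pattern using stdlib contextvars — the function names and attribute keys here are illustrative, not part of any library:

```python
from contextvars import ContextVar

# Request-scoped metadata, set once at the edge of the service and
# readable from any span-producing code deeper in the call stack —
# the same role OpenTelemetry baggage plays.
_request_context: ContextVar[dict] = ContextVar("request_context", default={})

def set_request_context(user_id: str, session_id: str, ab_variant: str) -> None:
    _request_context.set({
        "app.user_id": user_id,
        "app.session_id": session_id,
        "app.ab_variant": ab_variant,
    })

def correlation_attributes() -> dict:
    # Merge into every span's attributes so each trace carries the IDs
    # needed to correlate it with a real user session and experiment arm.
    return dict(_request_context.get())
```

With this in place, the span-creation helpers only need to call `correlation_attributes()` and attach the result; no function in between has to thread IDs through its signature.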
The RAG Debugging Workflow
RAG systems fail in at least five distinct places: the query embedding, retrieval (wrong documents returned), reranking (right documents scored poorly), context preparation (truncation, ordering), and the LLM's interpretation of retrieved context. Without traces, a bad answer is just a bad answer. With traces, you can see exactly where the failure originated.
The debugging sequence when a RAG query returns a bad answer:
1. Pull the trace for the specific request
2. Check the retrieval span — what documents were returned? What were their similarity scores? Were the right documents in the index at query time?
3. If retrieval was correct, check context assembly — was the relevant passage truncated? Was it in a position the model tends to ignore?
4. If context was intact, check the LLM span — did the model use the retrieved context or ignore it and hallucinate? Grounding evaluators that run inline can flag this automatically
This workflow only works if your spans capture intermediate state — the retrieved document IDs and scores, the assembled context string, the final prompt before it hit the model. Logging only the inputs and outputs of the overall pipeline makes steps 2 and 3 impossible.
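Once spans carry that state, the triage sequence itself can be mechanized. A sketch over plain-dict span records — the field names and the 0.5 score threshold are assumptions about your own schema, not any tracing backend's:

```python
def triage_rag_trace(trace: dict) -> str:
    """Walk a RAG trace through the debugging sequence and return the
    stage where the failure most likely originated."""
    retrieval = trace["retrieval"]        # doc IDs + similarity scores
    context = trace["context_assembly"]   # truncation flag + final context
    llm = trace["llm"]                    # inline grounding verdict

    # Did retrieval surface plausible documents at all?
    if not retrieval["doc_ids"] or max(retrieval["scores"]) < 0.5:
        return "retrieval"
    # Did the relevant passage survive context assembly intact?
    if context["truncated"] or retrieval["top_doc_text"] not in context["final_context"]:
        return "context_assembly"
    # Did the model actually ground its answer in the context?
    if not llm["grounded"]:
        return "llm"
    return "no_obvious_failure"
```

This is deliberately crude — a real version would use your evaluator's scores rather than substring checks — but even this level of automation turns "a bad answer" into "a bad answer with a named suspect."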
Agent Observability Is Different
Agents introduce observability challenges that single-turn systems don't have. A multi-step agent isn't a request/response — it's a process with state, branching, and loops. Standard trace hierarchies don't map cleanly onto this.
The metrics that matter for agents differ from single-turn metrics:
- Planning efficiency: How many reasoning steps does the agent take to complete a goal? Is this number growing over time as prompts drift?
- Tool execution quality: Which tools are being called? Are retries increasing? Are tools returning errors that the agent silently ignores?
- Goal completion rate: Does the agent actually finish the task? At what rate does it give up, loop, or produce a non-answer?
- Cost per completed goal: The real unit economics question. Not cost per request, but cost per successfully completed task.
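These aggregates are straightforward to compute from per-run trace summaries. A sketch, assuming each run has been reduced to a small record (the field names are illustrative, not from any specific platform):

```python
def agent_metrics(runs: list) -> dict:
    """Aggregate normalized agent metrics from per-run summaries.
    Each run is a dict: {"completed": bool, "steps": int, "cost_usd": float}."""
    total = len(runs)
    completed = [r for r in runs if r["completed"]]
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "goal_completion_rate": len(completed) / total,
        "mean_steps_per_run": sum(r["steps"] for r in runs) / total,
        # Unit economics: spend on *all* runs divided by goals achieved —
        # failed runs still cost money, so they belong in the numerator.
        "cost_per_completed_goal": (
            total_cost / len(completed) if completed else float("inf")
        ),
    }
```

Tracking these as time series — rather than per-run — is what surfaces the slow drifts: mean steps creeping upward, completion rate sagging after a prompt change.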
The hardest part of agent observability is that the same trace can look very different across runs for identical inputs, because LLM outputs are non-deterministic. Comparing two agent traces requires normalized metrics, not raw span comparisons.
Tool call logging is particularly important and frequently omitted. When an agent calls a function, you want the function name, arguments (redacted for PII), return value, latency, and whether a retry occurred. A common failure mode: an agent makes a tool call, gets an error, retries four times, exhausts its context window, and returns a vague failure response. Without tool call spans, the trace just shows "agent failed."
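A sketch of the record worth capturing per tool call, with naive PII redaction on the arguments — the field names and the email-only redaction rule are illustrative, and a production redactor would cover far more than email addresses:

```python
import re
from dataclasses import dataclass

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(value: str) -> str:
    # Minimal example: mask email addresses before arguments are logged.
    return EMAIL.sub("<redacted:email>", value)

@dataclass
class ToolCallRecord:
    name: str
    arguments: dict
    result: str = ""
    latency_ms: float = 0.0
    retries: int = 0
    error: str = ""

    def to_span_attributes(self) -> dict:
        return {
            "tool.name": self.name,
            "tool.arguments": {k: redact(str(v)) for k, v in self.arguments.items()},
            "tool.latency_ms": self.latency_ms,
            "tool.retries": self.retries,  # makes retry storms visible in the trace
            "tool.error": self.error,
        }
```

With `tool.retries` on every span, the "retried four times and exhausted the context window" failure mode becomes a one-line query instead of a mystery.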
Evaluation Must Run on Production Traffic
Offline evaluation — running your test suite against a fixed set of examples before deployment — tells you whether your model meets a bar at a point in time. It does not tell you whether it continues to meet that bar after deployment, as production data distribution shifts, as you update prompts, as the underlying model gets updated.
Production evaluation closes that gap. Running a subset of real requests through an evaluator (LLM-as-judge, embedding similarity, rule-based checks) continuously gives you a quality signal that degrades when something breaks. Platforms like Langfuse, Braintrust, and Arize all support attaching evaluators to production traces.
The practical constraint is cost. Running an LLM-as-judge evaluator on 100% of production traffic doubles your inference spend. The standard approach is sampling — evaluate 5-10% of requests, stratified by request type, user cohort, or recent prompt changes. Evaluate 100% when a deployment changes the system prompt or model version.
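One way to implement that sampling policy is a deterministic hash of the request ID, so the same request always gets the same decision and results are reproducible. The rates and field names in this sketch are examples, not a standard:

```python
import hashlib

def should_evaluate(request_id: str, request_type: str,
                    prompt_recently_changed: bool) -> bool:
    """Decide whether to run the (expensive) evaluator on this request."""
    if prompt_recently_changed:
        return True  # evaluate 100% right after a prompt/model change
    # Stratified base rates: costlier scrutiny for riskier request types.
    rates = {"rag_answer": 0.10, "smalltalk": 0.05}
    rate = rates.get(request_type, 0.05)
    # Deterministic hash -> stable, reproducible sampling decisions.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000
```

Hash-based sampling beats `random()` here: when an evaluator flags a bad response, rerunning the pipeline on the same request ID reproduces the same sampling decision.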
The metric to watch isn't a single quality score — it's the trend. A quality score that's stable at 0.78 is better than one that was 0.85 last week and is now 0.72. Alerts should trigger on slope, not just threshold.
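Slope-based alerting is a few lines of least-squares over a rolling window of evaluator scores. The window size and threshold in this sketch are illustrative and need tuning against your own score variance:

```python
def quality_slope(scores: list) -> float:
    """Least-squares slope of quality scores per observation index."""
    n = len(scores)
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

def should_alert(scores: list, slope_threshold: float = -0.01) -> bool:
    # Alert on a sustained downward trend, even if the absolute score
    # is still above any fixed quality threshold.
    return quality_slope(scores) < slope_threshold
```

A fixed-threshold alert at 0.75 would stay silent all the way down from 0.85 to 0.76; the slope alert fires while the decline is still in progress.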
Where to Start
The common mistake is trying to instrument everything at once before any of it is useful. Start with trace IDs threaded through every request so you can pull the complete context for any user-reported issue. That alone is more valuable than most dashboards.
Then add cost tracking at the request level. You'll immediately find which request types are expensive and whether any are unexpectedly so.
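Per-request cost accounting is mostly bookkeeping: multiply token counts by per-model prices and attach the result to the trace. The prices in this sketch are placeholders — real rates change, so load them from configuration:

```python
# Placeholder prices (USD per 1M tokens) — real rates change; load from config.
PRICES = {
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
}

def request_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p = PRICES[model]
    return (prompt_tokens * p["input"] + completion_tokens * p["output"]) / 1_000_000

def tag_cost(span_attrs: dict) -> dict:
    """Attach cost to span attributes so every trace carries its own spend."""
    span_attrs["llm.cost_usd"] = request_cost_usd(
        span_attrs["llm.model"],
        span_attrs["llm.prompt_tokens"],
        span_attrs["llm.completion_tokens"],
    )
    return span_attrs
```

Once cost lives on the span, "which request types are expensive" is a group-by over your traces rather than a monthly invoice forensics exercise.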
Then add one quality evaluator on production traffic — even a simple rule-based one. Does the response contain citations when it should? Is it within the expected length range? Did it refuse a request it shouldn't have?
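Checks like these are cheap enough to run inline on every request. A sketch — the specific rules, the citation pattern, and the thresholds are examples, not a standard:

```python
import re

def rule_check(response: str, expects_citations: bool, max_chars: int = 4000) -> list:
    """Cheap rule-based quality checks; returns the list of failed rules."""
    failures = []
    # Responses that should cite sources must contain at least one [n] marker.
    if expects_citations and not re.search(r"\[\d+\]", response):
        failures.append("missing_citations")
    # Empty or runaway-length responses are almost always wrong.
    if not (1 <= len(response) <= max_chars):
        failures.append("length_out_of_range")
    # Refusal phrasing where the request type expected a substantive answer.
    if re.search(r"\b(I can't|I cannot|I'm unable)\b", response, re.IGNORECASE):
        failures.append("unexpected_refusal")
    return failures
```

None of these rules detect hallucination — that's what LLM-as-judge evaluators are for — but they catch the blunt failures immediately and at zero inference cost.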
Everything else — hallucination detection, agent tracing, governance automation — layers on top of that foundation. Teams that try to build the full five-pillar system before shipping anything tend to ship nothing.
The minimum viable LLM observability stack is: trace IDs, token cost per request, and one quality check on production traffic. Everything else is optimization.
The reason most LLM applications fail quietly is that their builders applied traditional software observability intuitions to a system that fails in non-traditional ways. An LLM doesn't throw exceptions when it's wrong. It doesn't return a 500 when it drifts. The failure mode is a response that looks correct but isn't — and the only way to catch it is to build evaluation into your production loop from the start.
