
Why Your Existing Observability Stack Won't Save You When AI Agents Break

11 min read
Tian Pan
Software Engineer

Your Datadog dashboard shows zero errors. Latency is nominal. All services return HTTP 200. Meanwhile, your AI agent just booked a meeting in the wrong timezone, hallucinated a customer's order history, and burned $4 in tokens doing it.

This is what makes agent observability genuinely hard: the metrics you already have tell you almost nothing about whether agents are actually working.

Traditional distributed tracing was built on a set of assumptions about how software fails. LLM agents violate all of them, and the gap between "my infrastructure is healthy" and "my agent did the right thing" is where most debugging pain lives.

Why Existing Tracing Breaks Down for Agents

Distributed tracing for microservices assumes deterministic, stateless request paths. A request enters at the edge, follows a predictable call graph, and exits with a known set of outcomes. You can set baseline latency thresholds, define error codes, and alert on deviations because the paths are enumerable.

Agents break every one of these assumptions.

The first break is non-determinism: the same user input can trigger different tool call sequences, different retrieval results, and different reasoning chains on every invocation. There is no canonical path to diff against. An agent retrieving context from memory might take three steps on Monday and seven on Friday — both correct, both at HTTP 200.

The second break is that agents have two distinct failure modes where traditional systems have one. Infrastructure failures — timeouts, API errors, connection resets — behave like normal distributed system failures and your existing tooling handles them fine. But agents also have cognitive failures: hallucinations, bad tool selection, misinterpreted retrieval results, faulty multi-step planning. These never surface as errors. The agent confidently returns a response with a 2xx status code, your dashboards stay green, and the output is wrong.

Infrastructure metrics lie by omission. An agent stuck in a reasoning loop, retrying the same tool call with slightly different arguments, burning through your token budget while making no progress, looks indistinguishable from a healthy agent on a latency chart.

In multi-agent systems, things compound. Failures manifest at handoff boundaries — when partial context, stale memory, or ambiguous intermediate results pass between agents. No single service owns the failure. It's an emergent property of the interaction.

The core reframe: agent tracing is about debugging decisions, not request routing. The observable artifact is the sequence of choices the agent makes, not the shape of its call graph.

What an Agent Trace Actually Looks Like

The useful primitive is the execution tree: a hierarchical structure where each node is a named, timed operation. In an agent context, that means LLM API calls, tool invocations, vector DB queries, memory reads, and agent-to-agent handoffs — all with parent-child relationships that reflect how each step caused the next.

A canonical trace for a research agent might look like: the root span covers the full task (say, 1.8 seconds). Under it, an embedding call followed by a vector search retrieves context. Then a first LLM call decides to use two tools — a web search and a calculator — each with their own child spans. A second LLM call synthesizes everything into a final answer.
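The tree described above can be sketched directly as a data structure. This is a minimal pure-Python illustration of the execution-tree primitive, independent of any tracing SDK; the `Span` class and span names are illustrative, not a real library API:

```python
# Minimal sketch of an execution tree: each node is a named, timed operation
# with parent-child links. In practice these would be OpenTelemetry spans;
# the Span class and all names here are illustrative.
import time
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    started_at: float = field(default_factory=time.monotonic)
    duration_ms: float = 0.0
    children: list = field(default_factory=list)

    def child(self, name: str) -> "Span":
        node = Span(name)
        self.children.append(node)
        return node

# The research-agent trace from the text, built by hand:
root = Span("agent.task")                    # full task
root.child("embeddings.create")              # embed the query
root.child("vector_db.search")               # retrieve context
plan = root.child("llm.call.plan")           # first LLM call decides on tools
plan.child("tool.web_search")
plan.child("tool.calculator")
root.child("llm.call.synthesize")            # second LLM call, final answer

def walk(span, depth=0):
    yield depth, span.name
    for c in span.children:
        yield from walk(c, depth + 1)

for depth, name in walk(root):               # prints the tree, indented by depth
    print("  " * depth + name)
```

The parent-child relationships are the payload here: they record which step caused which, which is exactly what a flat latency histogram throws away.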

This structure tells you something no latency histogram can: which step added latency, whether the agent used the tools it should have, and how long it spent reasoning vs. fetching. When the trace is wrong — say, the retrieval span returned a document from six months ago — you can see exactly where reality diverged from intent.

The OTel community has converged on a gen_ai.* attribute namespace for LLM spans (still experimental as of early 2026, with possible schema changes ahead). The key attributes are the provider and model, the operation type, token counts broken out as input and output, and finish reason. Span events — user messages, assistant responses, tool results — are attached to spans rather than stored as attributes, since content can be large and you often want to control whether it's exported at all. The OTel recommendation is to capture message content only in controlled environments, with an explicit opt-in flag.
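Concretely, the attribute set for a single chat-completion span might look like the following. All values are illustrative, and the environment-variable name for the content opt-in is an assumption modeled on OTel's GenAI instrumentation conventions; the schema is experimental and may change:

```python
# Sketch of the experimental gen_ai.* attributes for one LLM span, shown as a
# plain dict (a real integration would call span.set_attributes(...)).
# Values, model name, and the opt-in env var are illustrative assumptions.
import os

CAPTURE_CONTENT = (
    os.getenv("OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT", "false")
    == "true"
)

span_attributes = {
    "gen_ai.system": "openai",                    # provider
    "gen_ai.request.model": "gpt-4o-mini",        # requested model
    "gen_ai.operation.name": "chat",              # operation type
    "gen_ai.usage.input_tokens": 812,             # prompt tokens
    "gen_ai.usage.output_tokens": 214,            # completion tokens
    "gen_ai.response.finish_reasons": ["stop"],
}

# Message content goes on span *events*, not attributes, and only when
# explicitly opted in -- content can be large and may be sensitive.
span_events = []
if CAPTURE_CONTENT:
    span_events.append(
        ("gen_ai.user.message", {"content": "What's on my calendar today?"})
    )
```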

The hierarchy gets more interesting in multi-agent systems. A single user request might produce a trace spanning a request manager service, an agent orchestrator, an LLM inference router, and one or more MCP tool servers. Each hop is its own set of spans. The key engineering challenge is context propagation: every outbound call needs to carry the parent trace context (via the W3C traceparent header) so spans across services can be assembled into a single tree. Major LLM frameworks like LangGraph, LlamaIndex, and the OpenAI Agents SDK now emit OTel-compatible traces by default, but MCP servers and custom tool implementations typically require manual instrumentation.
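The `traceparent` header itself is simple: a version, a 32-hex-char trace ID shared by every span in the tree, a 16-hex-char parent span ID, and trace flags. A hand-rolled sketch (a real SDK's propagator would do this for you; the IDs here are randomly generated for illustration):

```python
# Sketch of W3C trace context propagation: every outbound call carries a
# `traceparent` header so downstream spans join the same tree.
# Format: version-traceid-spanid-flags. IDs are generated, not from a real SDK.
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    trace_id = trace_id or secrets.token_hex(16)   # 32 hex chars
    span_id = span_id or secrets.token_hex(8)      # 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id,
            "parent_span_id": span_id,
            "sampled": flags == "01"}

# Orchestrator -> MCP tool server: same trace_id, fresh span_id per hop.
header = make_traceparent()
ctx = parse_traceparent(header)
child_header = make_traceparent(trace_id=ctx["trace_id"])
assert parse_traceparent(child_header)["trace_id"] == ctx["trace_id"]
```

When an MCP server or custom tool drops this header on an outbound call, its spans become orphans and the tree fragments — which is why manual instrumentation of those hops matters.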

What to Actually Measure

Given that you're debugging decisions rather than request routing, the signal categories that matter are different from traditional services.

Task completion quality is the metric that matters most and is hardest to collect. Not "did the HTTP request succeed" but "did the agent accomplish the user's intent." This requires either human review, LLM-as-judge scoring, or a task-specific success criterion baked into your evaluation pipeline. Tool call success rate is a proxy, but it's noisy — a tool can return data that the agent proceeds to misinterpret.

Latency, normalized by output tokens, is the right way to think about speed. A ten-second response is fine if the model produced 2,000 output tokens; it's a signal of something wrong if it produced 50. Static latency thresholds are misleading for LLM workloads. Time-to-first-token is a separate and often more important metric for streaming applications, since it determines perceived responsiveness.
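The normalization is a one-liner, but it changes which runs look anomalous. A sketch with illustrative numbers:

```python
# Sketch: normalizing latency by output tokens instead of alerting on a
# static threshold. Numbers are illustrative.
def latency_per_token_ms(total_latency_s: float, output_tokens: int) -> float:
    return (total_latency_s * 1000.0) / max(output_tokens, 1)

# A 10 s response producing 2,000 tokens: 5 ms/token -- healthy decode speed.
healthy = latency_per_token_ms(10.0, 2000)
# A 10 s response producing 50 tokens: 200 ms/token -- something is wrong.
suspect = latency_per_token_ms(10.0, 50)
assert healthy == 5.0 and suspect == 200.0
```

On a raw latency chart, those two requests are identical; per-token, one is forty times slower than the other.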

Token cost per successful task is more meaningful than cost per request. A task that takes three retries due to bad tool routing costs three times as much as it should but may still show up as one billable outcome. If you're only tracking cost at the request level, you're missing the amplification effect of agent loops.
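The amplification is easy to see in a toy example. Prices, task names, and the record shape here are all illustrative:

```python
# Sketch: cost per *successful task* vs cost per request. A task that takes
# three attempts due to bad tool routing shows its true cost only in the
# per-task view. Costs and fields are illustrative.
requests = [
    {"task": "book_meeting", "cost_usd": 0.04, "success": False},  # retry 1
    {"task": "book_meeting", "cost_usd": 0.04, "success": False},  # retry 2
    {"task": "book_meeting", "cost_usd": 0.04, "success": True},   # finally
]

cost_per_request = sum(r["cost_usd"] for r in requests) / len(requests)
successes = sum(1 for r in requests if r["success"])
cost_per_successful_task = sum(r["cost_usd"] for r in requests) / successes

assert round(cost_per_request, 2) == 0.04           # looks fine on a dashboard
assert round(cost_per_successful_task, 2) == 0.12   # 3x amplification
```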

Step count per task is a useful anomaly signal. If your agent normally completes a task in four steps and suddenly averages eight, something regressed — in the instructions, the tools, the retrieval, or the model. Step count distributions give you a canary that pure latency and error metrics miss.
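A minimal version of that canary is just a baseline comparison. The counts and the 1.5x threshold below are illustrative, not a recommendation:

```python
# Sketch: step count as a regression canary. Compare the recent average
# against a rolling baseline; data and threshold are illustrative.
import statistics

baseline_steps = [4, 4, 5, 3, 4, 4, 5, 4]   # per-task step counts, last week
recent_steps = [8, 7, 9, 8]                  # per-task step counts, today

baseline_mean = statistics.mean(baseline_steps)
recent_mean = statistics.mean(recent_steps)

# Alert if recent tasks take, say, 1.5x the baseline number of steps.
regressed = recent_mean > 1.5 * baseline_mean
assert regressed  # 8.0 vs a 4.1 baseline: something changed upstream
```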

Memory and retrieval quality often goes unmeasured but is frequently where failures originate. Whether retrieved context actually appears in the agent's reasoning, whether memory reads are returning stale content, and whether the agent's final answer is grounded in what was retrieved — these are distinct failure modes that benefit from specific instrumentation.

The Nine Ways Agents Actually Fail

Microsoft Research recently published a systematic failure taxonomy from studying 115 annotated failed agent trajectories across three benchmarks. The categories are worth internalizing because they map to different debugging strategies:

  1. Plan adherence failure — the agent deviated from a plan that was actually correct
  2. Invented information — hallucinated facts, tool outputs, or context not present in the trace
  3. Invalid invocation — tool called with wrong parameters or in the wrong state
  4. Misinterpreted tool output — tool returned correct data; agent read it wrong
  5. Intent-plan misalignment — the plan was syntactically valid but didn't accomplish the user's goal
  6. Under-specified intent — ambiguous input caused compounding errors downstream
  7. Intent not supported — agent attempted tasks outside its capability set
  8. Guardrail trigger — policy enforcement interrupted execution (may be correct or a false positive)
  9. System failure — infrastructure failure during execution

Categories 2 through 5 are invisible to infrastructure monitoring. They require looking at what the agent reasoned, not what the infrastructure did. Microsoft's AgentRX framework, which addresses automated failure localization for these cases, showed a 23.6% improvement in failure localization over baseline prompting approaches.

Understanding which category a failure falls into also tells you where to focus: categories 1 and 4 often point to prompt engineering or tool description problems; category 2 points to insufficient grounding or retrieval quality; category 3 points to tool schema or capability boundary issues; category 8 may indicate guardrails that are too aggressive or need refinement.
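One practical step is encoding the taxonomy so failure annotations on traces become queryable. A sketch, with the category-to-focus mapping paraphrasing the guidance above (the enum itself is my construction, not part of the Microsoft taxonomy's published artifacts):

```python
# Sketch: the nine-category failure taxonomy as an enum, so trace annotations
# can be filtered and aggregated. The cognitive subset (2-5) is the group
# invisible to infrastructure monitoring.
from enum import Enum

class FailureCategory(Enum):
    PLAN_ADHERENCE = 1              # deviated from a correct plan
    INVENTED_INFORMATION = 2        # hallucinated facts or tool outputs
    INVALID_INVOCATION = 3          # wrong parameters or wrong state
    MISINTERPRETED_TOOL_OUTPUT = 4  # correct data, read wrong
    INTENT_PLAN_MISALIGNMENT = 5    # valid plan, wrong goal
    UNDER_SPECIFIED_INTENT = 6      # ambiguous input compounded downstream
    INTENT_NOT_SUPPORTED = 7        # outside the capability set
    GUARDRAIL_TRIGGER = 8           # policy interrupted execution
    SYSTEM_FAILURE = 9              # infrastructure failure

COGNITIVE = {
    FailureCategory.INVENTED_INFORMATION,
    FailureCategory.INVALID_INVOCATION,
    FailureCategory.MISINTERPRETED_TOOL_OUTPUT,
    FailureCategory.INTENT_PLAN_MISALIGNMENT,
}
```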

Picking Tooling That Fits Your Context

The LLM observability market has expanded rapidly and the trade-offs are genuinely different depending on your situation.

For teams that want full data ownership and no vendor lock-in, Langfuse (MIT-licensed, self-hostable) and Arize Phoenix (open-source, runs locally) are the natural options. Phoenix is particularly useful for development — it runs in a Jupyter notebook with no external dependencies and is oriented around RAG and retrieval debugging. Langfuse has a stronger production feature set for multi-turn session tracking.

For teams already using LangChain or LangGraph, LangSmith is the path of least resistance with the best native integration, though it carries LangChain ecosystem lock-in and its seat-based pricing can become expensive at scale.

Braintrust stands out for production tracing of complex agents — it has timeline replay (a visual waterfall of when each span started and ended, across multiple steps) and an evaluation-to-trace loop that lets you convert production failures into eval datasets. This is the key workflow for making non-deterministic failures reproducible.

If you're already operating at scale on Datadog or a similar APM platform, integrating LLM spans into your existing infrastructure via OTel is often the right move. The gen_ai.* metrics emit directly to standard counters and histograms that Prometheus/Grafana or Datadog can ingest. This won't give you LLM-specific features like prompt/completion replay, but it gives you cost and latency data in the same place as everything else.

The one feature worth prioritizing regardless of platform: timeline replay. Being able to see the full execution tree with timing, not just logs or summaries, is what separates genuinely useful debugging from guesswork on multi-step agents.

Debugging a Production Failure: The Workflow

Consider a concrete scenario: your agent booked a meeting at the wrong time. Latency was normal, no errors, task marked complete.

The debugging workflow starts by pulling the full trace for that session — not just the final turn, but the complete execution tree across all steps. The final turn may look correct: the calendar tool was called with a plausible time slot, the tool returned a success response. But the trace from three steps earlier shows that the memory retrieval span returned a calendar entry with a stale timezone preference from a prior session. The agent read the current request correctly, but the context it was reasoning against was wrong.

Without the connected execution tree, you're hypothesizing at every layer. With it, you can see exactly which span introduced the bad data, what the agent inferred from it, and at what point the decision compounded into the wrong action.

The second step for non-deterministic failures is converting the production trace into a reproducible test case. Take the complete trace — all inputs, context states, intermediate results — and replay it in staging with the same inputs. This is how you make "it failed in production but I can't reproduce it locally" a solvable problem rather than a chronic debugging tax.
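The freeze-and-replay idea can be sketched in a few lines. The field names and the `agent_fn` signature below are hypothetical; the point is that every non-deterministic input gets pinned so the only variable on replay is the agent's reasoning:

```python
# Sketch: freezing a production trace into a replayable test case.
# All field names and the agent_fn interface are hypothetical.
import json

def freeze_trace(trace: dict) -> str:
    """Serialize everything the agent saw, so the failing run can be replayed."""
    return json.dumps({
        "user_input": trace["user_input"],
        "memory_snapshot": trace["memory_snapshot"],      # incl. stale state
        "retrieval_results": trace["retrieval_results"],
        "tool_outputs": trace["tool_outputs"],
    }, sort_keys=True)

def replay(frozen: str, agent_fn):
    case = json.loads(frozen)
    # Retrieval and tool calls are stubbed with the recorded outputs, so the
    # only thing that varies between runs is the agent's reasoning.
    return agent_fn(case["user_input"],
                    context=case["memory_snapshot"],
                    retrieved=case["retrieval_results"],
                    tools=case["tool_outputs"])
```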

The third step is sampling strategy. Capturing every trace in production is expensive and often unnecessary. The practical approach is low-rate sampling (5-10%) for successful traces, 100% for errors, and 100% for high-token-count requests that might indicate runaway loops. Token rate exceeding 2x baseline is a useful alert threshold for detecting agent loops before they run to completion.
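That tiered policy is small enough to write down. The token baseline and sample rate here are illustrative stand-ins for whatever your workload's numbers actually are:

```python
# Sketch of the tiered sampling policy described above: keep all error traces
# and all high-token traces, sample the rest at a low rate.
# Baseline and rate are illustrative.
import random

BASELINE_TOKENS = 1500        # typical tokens per task (assumed)
SUCCESS_SAMPLE_RATE = 0.10    # 5-10% of successful traces

def should_keep(trace: dict) -> bool:
    if trace["error"]:
        return True                                   # 100% of errors
    if trace["total_tokens"] > 2 * BASELINE_TOKENS:
        return True                                   # possible runaway loop
    return random.random() < SUCCESS_SAMPLE_RATE      # low-rate success sample

assert should_keep({"error": True, "total_tokens": 100})
assert should_keep({"error": False, "total_tokens": 4000})
```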

What This Means for Production Agent Systems

The engineers building agent systems right now are discovering that standard DevOps practices need extending, not replacing. The infrastructure layer still matters — you need to know when your LLM provider is down, when tools are returning errors, when database queries are slow. But that layer is now necessary but no longer sufficient.

The additional layer is decision observability: understanding what your agent chose to do, why it made that choice, what context it was reasoning against, and whether the outcome matched the intent. This requires different data (execution trees vs. latency histograms), different tooling (trace replay vs. dashboards), and a different debugging posture (examining reasoning chains, not call graphs).

The teams making this work in production have typically converged on a few practices: mandatory trace IDs on every agent run, explicit context propagation across agent boundaries, step count monitoring as a regression canary, and a workflow for converting production traces into eval datasets. None of these are technically complex. The hard part is treating them as engineering requirements from the start rather than retrofitting them after the first incident that takes three days to debug.


Observability for agent systems isn't yet a solved problem — the OTel gen_ai.* conventions are still experimental, the tooling landscape is immature, and the failure taxonomies are still being developed. But the shape of what needs to be instrumented is clear. The teams that invest in execution-level tracing now will have a significant debugging advantage over those who are still reading latency dashboards when their agent quietly makes the wrong decision.
