
Distributed Tracing for Agent Pipelines: Why Your APM Tool Is Flying Blind

· 9 min read
Tian Pan
Software Engineer

Your Datadog dashboard is green. Your Jaeger traces look clean. Your P99 latency is within SLA. And your agent pipeline is silently burning $4,000 a day on retry loops that never surface an error.

Traditional APM tools were designed for microservices — deterministic paths, bounded payloads, predictable fan-out. Agent pipelines break every one of those assumptions. The execution path isn't known until runtime. Tool call depth varies wildly. A single "request" might spawn dozens of LLM calls across minutes. And when something goes wrong, the failure mode is usually not an exception — it's a silent retry cascade that inflates cost and latency while returning plausible-looking output.

The result is a generation of engineering teams flying blind, trusting dashboards that measure the wrong things.

Why Standard Trace/Span Models Break

Classical distributed tracing works because services have stable contracts. A POST to /api/orders always hits the same three services in the same order. You pre-define the schema, set up alerts on deviations, and call it a day.

Agent pipelines don't work this way. A single user query triggers a plan-act-observe loop where the number of iterations, the set of tools invoked, and the branching decisions are all determined at runtime by the model itself. You cannot instrument a static call graph because there is no static call graph.

Several specific properties of agent workloads snap the assumptions that spans are built on:

Non-deterministic control flow. The same input produces different execution traces on different runs. A tool call that succeeds on Tuesday may time out on Wednesday, triggering three retries and a fallback branch that never existed in your pre-production tests.

Unbounded payload size. In multi-turn workflows, context accumulates. A 10-step agent with a 4,000-token system prompt and 500-token average tool outputs re-sends its growing context on every call: across the workflow that is more than 60,000 cumulative input tokens, 40,000 of them from the repeated system prompt alone, before generating any output. Traditional spans assume bounded payloads.
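The carryover arithmetic is easy to underestimate. A minimal sketch of the cumulative cost, using the step count and token sizes above (and ignoring assistant-message carryover, which only makes the number larger):

```python
# Illustrative: cumulative input tokens for an agent that re-sends its
# full context on every step. Numbers match the example in the text.
def cumulative_input_tokens(steps: int, system_prompt: int, tool_output: int) -> int:
    total = 0
    context = system_prompt  # every call starts with the system prompt
    for _ in range(steps):
        total += context       # this call pays for everything accumulated so far
        context += tool_output # the tool result is appended for the next call
    return total

print(cumulative_input_tokens(10, 4_000, 500))  # 62500 input tokens, zero output
```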

Nested retries with no error propagation. This is the most dangerous failure mode. A tool returns a transient error. The agent retries identically. Same error, same retry, three more times. No exception reaches the caller. The user sees a 45-second delay instead of 15 seconds, and the LLM bill increases threefold — but your error rate dashboard stays at zero.

Loop detection blind spots. When an agent enters a loop — calling the same tool repeatedly with the same arguments — standard traces show it as unexplained latency growth. There's no built-in concept of "this span is the Nth retry of the same operation" in classical tracing.

What Purpose-Built Agent Observability Actually Needs

The missing primitives fall into four categories.

Token budget tracking as a first-class metric. Token counts are to agents what memory usage is to traditional services — a finite resource whose exhaustion causes silent degradation. The right observability stack tracks input tokens, output tokens, cached tokens, and cost per trace in real time. More importantly, it tracks cost per successful outcome, not cost per request. A pipeline that retries three times before succeeding has a very different cost-per-success profile than one that gets it right on the first attempt, even if both show up as a single "request" in a naive dashboard.
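The gap between the two metrics is simple to compute once traces carry cost and outcome. A sketch over hypothetical trace records (the schema is an assumption, not a vendor format):

```python
# Hypothetical trace records: each dict is one user-visible request.
traces = [
    {"cost_usd": 0.02, "success": True},
    {"cost_usd": 0.09, "success": True},   # succeeded, but after retries: 3x cost
    {"cost_usd": 0.06, "success": False},  # burned tokens, produced no outcome
]

def cost_per_request(traces):
    return sum(t["cost_usd"] for t in traces) / len(traces)

def cost_per_success(traces):
    successes = sum(1 for t in traces if t["success"])
    return sum(t["cost_usd"] for t in traces) / successes

# cost_per_request looks modest; cost_per_success exposes the retry waste
print(round(cost_per_request(traces), 4), round(cost_per_success(traces), 4))
```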

Tool call latency as a typed span. Every tool invocation should be captured as a structured span with: input arguments, output data (or error), timing, and a retry count. This lets you distinguish between "the LLM is slow" and "the external API called by the tool is slow" — a distinction that matters enormously when you're trying to optimize. It also lets you correlate tool error rates with downstream cost spikes.
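A typed tool-call span can be as small as one record per invocation. A minimal sketch (the field names are illustrative, not a specific vendor schema):

```python
from dataclasses import dataclass
from typing import Any, Optional

# One structured span per tool invocation: inputs, output or error,
# timing, and an explicit retry count.
@dataclass
class ToolCallSpan:
    tool_name: str
    arguments: dict
    started_at: float
    duration_ms: float
    retry_count: int = 0          # explicit, never inferred from duplicate spans
    output: Optional[Any] = None
    error: Optional[str] = None

    @property
    def succeeded(self) -> bool:
        return self.error is None

span = ToolCallSpan("search_orders", {"user_id": 42}, started_at=0.0,
                    duration_ms=1850.0, retry_count=2, error="timeout")
```

With spans shaped like this, "the LLM is slow" versus "the tool's upstream API is slow" becomes a query over `duration_ms` grouped by span type rather than guesswork.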

Model version as a high-cardinality dimension. Most production agent pipelines run multiple models simultaneously — a cheap model for routing and classification, a more capable one for reasoning, possibly a third for synthesis. Every latency metric, error rate, and cost calculation needs to be sliced by model version. When a model upgrade ships and P99 latency doubles, you need to know immediately whether that degradation is in the routing layer or the reasoning layer.

Retry count and loop counter as explicit fields. These should not be inferred from duplicate spans. They should be surfaced as first-class numeric fields in every trace, so you can query directly: "Show me all traces where retry_count > 2." That query is the single most useful thing you can run when investigating a cost spike.
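With `retry_count` as a first-class field, that investigation query is a one-line filter. A sketch over in-memory trace records (the schema is assumed):

```python
# Hypothetical per-trace records with an explicit retry_count field.
traces = [
    {"trace_id": "a1", "retry_count": 0},
    {"trace_id": "b2", "retry_count": 3},
    {"trace_id": "c3", "retry_count": 5},
]

# "Show me all traces where retry_count > 2" -- the cost-spike triage query.
suspicious = [t for t in traces if t["retry_count"] > 2]
print([t["trace_id"] for t in suspicious])
```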

The Three Metrics That Tell You an Agent Is Degrading

Teams that run agents in production long enough converge on the same three early-warning signals. All three become visible before users notice anything.

Token inflation rate. Track the median and P95 input token count per trace over time. Token inflation is almost always the first observable symptom of prompt drift, context accumulation bugs, or unintended retry loops. If your median input tokens grow 20% week-over-week without a corresponding feature change, something is broken. This is measurable days before quality degradation reaches users.
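The week-over-week check is mechanical once per-trace token counts exist. A minimal sketch using the 20% threshold from the text (sample values are illustrative):

```python
import statistics

def token_inflation_alert(last_week, this_week, threshold: float = 0.20) -> bool:
    """Fire if median input tokens per trace grew more than `threshold`
    week-over-week. In production, also track P95 the same way."""
    prev = statistics.median(last_week)
    curr = statistics.median(this_week)
    return (curr - prev) / prev > threshold

# Medians 9,000 -> 12,000: +33% week-over-week, well past the 20% threshold.
print(token_inflation_alert([8_000, 9_000, 10_000], [11_000, 12_000, 13_000]))
```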

Tool error rate spike, not overall error rate. Agent pipelines absorb tool errors internally — that's the point of the retry logic. But the tool error rate (errors before retries, not after) is a leading indicator of downstream API instability. When a dependency starts flaking, tool error rate spikes immediately. Cost and latency follow. User-visible failures arrive last. Monitoring tool error rate pre-retry gives you a window to intervene.
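Counting errors pre-retry means instrumenting every attempt, not just the final outcome. A minimal sketch (the tracker shape is an assumption):

```python
from collections import Counter

# Record every attempt and every error BEFORE retry logic absorbs them.
class ToolErrorTracker:
    def __init__(self):
        self.attempts = Counter()
        self.errors = Counter()

    def record(self, tool: str, ok: bool):
        self.attempts[tool] += 1
        if not ok:
            self.errors[tool] += 1

    def error_rate(self, tool: str) -> float:
        return self.errors[tool] / self.attempts[tool]

t = ToolErrorTracker()
for ok in (True, False, True, False, False):  # a dependency starting to flake
    t.record("inventory_api", ok)
# 3 errors in 5 attempts: visible here long before any user-facing failure
print(t.error_rate("inventory_api"))
```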

Cost-per-success divergence from cost-per-request. If cost-per-request is flat but cost-per-success is rising, your retry and fallback logic is burning tokens on failures that are quietly resolved before surfacing. This gap is the clearest possible signal of a silent retry cascade. A 10% increase in cost-per-success often corresponds to a retry rate that has quietly tripled.

The Tooling Landscape

Purpose-built observability platforms have emerged to address exactly these gaps. These are the most widely deployed as of 2026:

LangSmith integrates tightly with LangChain and LangGraph, providing automatic trace capture for LangChain-based workflows. The integration is nearly zero-config if you're already on that stack, but it's proprietary and harder to use outside the LangChain ecosystem.

Langfuse is open-source and SDK-based, with strong support for complex workflows. It's the default choice for teams that want self-hosting and aren't committed to a specific framework.

Arize Phoenix balances monitoring and evaluation with an OpenTelemetry-native foundation. Better for teams with data science involvement who want to run evals alongside production monitoring.

OpenLLMetry (by Traceloop) takes a different approach: it's an open-source instrumentation SDK that emits standard OpenTelemetry spans, letting you route to any OTel-compatible backend. It instruments OpenAI, Anthropic, LangChain, and major vector databases automatically.

Weights & Biases Weave records structured execution traces that preserve parent-child relationships across agent runs, with native support for OTel export.

The choice between them matters less than the decision to instrument early. Every team that waits until post-launch to add agent observability reports the same experience: they're debugging blind during their most critical growth period.

The OpenTelemetry GenAI Standard: What's Here and What's Missing

OpenTelemetry's GenAI semantic conventions provide a vendor-neutral foundation for agent tracing. The standardized attributes cover model name and version, token counts (input/output/cached), tool call inputs and outputs, finish reason, and provider metadata. Datadog, Splunk, and other major APM vendors now natively ingest OTel GenAI spans, which means you can instrument once and export anywhere.

What's still experimental or absent: retry and loop patterns have no standard representation. Cost attribution across nested calls lacks a converged semantic. Multi-agent coordination — where one agent spawns another — has no standard span type for the handoff. Silent failure detection has no standardized approach at all.

This means that even if you adopt OTel GenAI conventions today, you'll need to add custom span attributes for the metrics that matter most: retry counts, loop iterations, per-trace cost, and cost-per-success. The standard gives you a foundation; the agent-specific instrumentation is still your responsibility.
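In practice that looks like standard `gen_ai.*` attributes sitting next to custom, namespaced ones on the same span. A sketch of the attribute map (the semconv names reflect recent drafts of an evolving standard; the `myorg.*` names and the model string are illustrative assumptions):

```python
# Span attributes for one LLM call: standard OTel GenAI semconv names
# plus custom attributes for the gaps the standard does not yet cover.
attributes = {
    # Standard GenAI semantic conventions
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-sonnet-4",   # illustrative model name
    "gen_ai.usage.input_tokens": 8_500,
    "gen_ai.usage.output_tokens": 640,
    # Custom attributes -- no converged standard equivalent yet
    "myorg.agent.retry_count": 2,
    "myorg.agent.loop_iteration": 1,
    "myorg.agent.cost_usd": 0.031,
}
```

Keeping the custom fields in their own namespace means they survive unchanged if (when) the semconv later standardizes equivalents you can migrate to.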

What to Instrument First

If you're starting from zero observability on an agent pipeline, the priority order matters. Add these in sequence:

First, add correlation IDs to every message, tool call, and decision branch. Without correlation, debugging a multi-step failure across logs is nearly impossible. This is the one thing that costs almost nothing and pays off immediately.
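The mechanism is nothing more than a UUID minted at the entry point and threaded through every emitted event. A minimal sketch (the event structure is illustrative):

```python
import uuid

def new_trace_context() -> dict:
    """Mint one correlation ID per user request, at the entry point."""
    return {"correlation_id": str(uuid.uuid4())}

def emit_event(ctx: dict, kind: str, **fields) -> dict:
    """Every log line, span, and decision record inherits the same ID."""
    return {"correlation_id": ctx["correlation_id"], "kind": kind, **fields}

ctx = new_trace_context()
plan = emit_event(ctx, "plan", step=1)
call = emit_event(ctx, "tool_call", tool="search", step=2)
assert plan["correlation_id"] == call["correlation_id"]  # joinable across logs
```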

Second, emit a structured span for every tool invocation with retry count as an explicit field. Not inferred — explicit. This single addition makes retry cascade detection possible.
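One way to guarantee the field is always explicit is to emit it from the retry wrapper itself. A sketch (the emit callback and retry policy are illustrative, not a prescribed design):

```python
import time

def call_with_retries(fn, args: dict, max_retries: int = 3, emit=print):
    """Invoke a tool, emitting one span record per attempt with an
    explicit retry_count -- never inferred from duplicate spans."""
    for attempt in range(max_retries + 1):
        start = time.monotonic()
        try:
            result = fn(**args)
            emit({"tool": fn.__name__, "retry_count": attempt, "ok": True,
                  "duration_ms": (time.monotonic() - start) * 1000})
            return result
        except Exception as exc:
            emit({"tool": fn.__name__, "retry_count": attempt, "ok": False,
                  "duration_ms": (time.monotonic() - start) * 1000,
                  "error": str(exc)})
    raise RuntimeError(f"{fn.__name__} failed after {max_retries} retries")
```

Because every attempt emits its own record, a retry cascade shows up as a run of spans with climbing `retry_count` values instead of as unexplained latency.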

Third, track token counts per LLM call and per trace. Calculate cost in real time. Set an alert if cost-per-trace exceeds a threshold for more than five consecutive minutes.

Fourth, track model version as a dimension on every metric. This becomes essential the first time you do a model upgrade and need to understand what changed.

Fifth, add loop detection. The simplest version: if the same tool is called with the same arguments more than N times in a single trace, emit a warning event and consider terminating the run. Silent loops are the most expensive failure mode and the easiest one to prevent.
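The simplest version described above fits in a few lines. A sketch, with the repeat threshold N as an illustrative parameter:

```python
from collections import Counter
import json

class LoopDetector:
    """Flag when the same tool is called with identical arguments
    more than `max_repeats` times within a single trace."""
    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def check(self, tool: str, args: dict) -> bool:
        # Canonicalize args so {"a":1,"b":2} and {"b":2,"a":1} match.
        key = (tool, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        return self.seen[key] > self.max_repeats

det = LoopDetector(max_repeats=3)
for _ in range(3):
    assert not det.check("get_weather", {"city": "Paris"})
assert det.check("get_weather", {"city": "Paris"})  # 4th identical call: loop
```

On a positive result, emit a warning event on the trace and decide per-pipeline whether to terminate the run or let it continue under a hard budget.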

The Operational Reality

Industry analysts estimate that over forty percent of agentic AI projects will fail before reaching production, and cost escalation is among the primary drivers. The teams that ship and sustain production agents share a common property: they treat observability as a feature, not an afterthought. They instrument before they scale. They alert on tool error rates, not just user-visible errors. They track cost-per-success alongside cost-per-request. And they treat token inflation as the canary in the coal mine it actually is.

Your APM tool was built for a world where execution paths are known at deploy time. Agent pipelines live in a different world, and the dashboards that served you well for microservices will not tell you when your agent is degrading. Build for that reality now, before a silent retry loop turns a Tuesday afternoon into an unexpected invoice.
