
Distributed Tracing for Agent Pipelines: Why Your APM Tool Is Flying Blind

· 9 min read
Tian Pan
Software Engineer

Your Datadog dashboard is green. Your Jaeger traces look clean. Your P99 latency is within SLA. And your agent pipeline is silently burning $4,000 a day on retry loops that never surface an error.

Traditional APM tools were designed for microservices — deterministic paths, bounded payloads, predictable fan-out. Agent pipelines break every one of those assumptions. The execution path isn't known until runtime. Tool call depth varies wildly. A single "request" might spawn dozens of LLM calls across minutes. And when something goes wrong, the failure mode is usually not an exception — it's a silent retry cascade that inflates cost and latency while returning plausible-looking output.

The result is a generation of engineering teams flying blind, trusting dashboards that measure the wrong things.

Why Standard Trace/Span Models Break

Classical distributed tracing works because services have stable contracts. A POST to /api/orders always hits the same three services in the same order. You pre-define the schema, set up alerts on deviations, and call it a day.

Agent pipelines don't work this way. A single user query triggers a plan-act-observe loop where the number of iterations, the set of tools invoked, and the branching decisions are all determined at runtime by the model itself. You cannot instrument a static call graph because there is no static call graph.

Several specific properties of agent workloads snap the assumptions that spans are built on:

Non-deterministic control flow. The same input produces different execution traces on different runs. A tool call that succeeds on Tuesday may time out on Wednesday, triggering three retries and a fallback branch that never existed in your pre-production tests.

Unbounded payload size. In multi-turn workflows, context accumulates. A 10-step agent with a 4,000-token system prompt and 500-token average tool outputs re-sends its growing context on every call: across the workflow that is more than 60,000 cumulative input tokens, 40,000 of them from the repeated system prompt alone, before generating any output. Traditional spans assume bounded payloads.
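The carryover arithmetic is easy to underestimate. A minimal sketch of the cumulative cost, using the step count and token sizes above (and ignoring assistant-message carryover, which only makes the number larger):

```python
# Illustrative: cumulative input tokens for an agent that re-sends its
# full context on every step. Numbers match the example in the text.
def cumulative_input_tokens(steps: int, system_prompt: int, tool_output: int) -> int:
    total = 0
    context = system_prompt  # every call starts with the system prompt
    for _ in range(steps):
        total += context       # this call pays for everything accumulated so far
        context += tool_output # the tool result is appended for the next call
    return total

print(cumulative_input_tokens(10, 4_000, 500))  # 62500 input tokens, zero output
```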

Nested retries with no error propagation. This is the most dangerous failure mode. A tool returns a transient error. The agent retries identically. Same error, same retry, three more times. No exception reaches the caller. The user sees a 45-second delay instead of 15 seconds, and the LLM bill increases threefold — but your error rate dashboard stays at zero.

Loop detection blind spots. When an agent enters a loop — calling the same tool repeatedly with the same arguments — standard traces show it as unexplained latency growth. There's no built-in concept of "this span is the Nth retry of the same operation" in classical tracing.

What Purpose-Built Agent Observability Actually Needs

The missing primitives fall into four categories.

Token budget tracking as a first-class metric. Token counts are to agents what memory usage is to traditional services — a finite resource whose exhaustion causes silent degradation. The right observability stack tracks input tokens, output tokens, cached tokens, and cost per trace in real time. More importantly, it tracks cost per successful outcome, not cost per request. A pipeline that retries three times before succeeding has a very different cost-per-success profile than one that gets it right on the first attempt, even if both show up as a single "request" in a naive dashboard.
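The gap between the two metrics is simple to compute once traces carry cost and outcome. A sketch over hypothetical trace records (the schema is an assumption, not a vendor format):

```python
# Hypothetical trace records: each dict is one user-visible request.
traces = [
    {"cost_usd": 0.02, "success": True},
    {"cost_usd": 0.09, "success": True},   # succeeded, but after retries: 3x cost
    {"cost_usd": 0.06, "success": False},  # burned tokens, produced no outcome
]

def cost_per_request(traces):
    return sum(t["cost_usd"] for t in traces) / len(traces)

def cost_per_success(traces):
    successes = sum(1 for t in traces if t["success"])
    return sum(t["cost_usd"] for t in traces) / successes

# cost_per_request looks modest; cost_per_success exposes the retry waste
print(round(cost_per_request(traces), 4), round(cost_per_success(traces), 4))
```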

Tool call latency as a typed span. Every tool invocation should be captured as a structured span with: input arguments, output data (or error), timing, and a retry count. This lets you distinguish between "the LLM is slow" and "the external API called by the tool is slow" — a distinction that matters enormously when you're trying to optimize. It also lets you correlate tool error rates with downstream cost spikes.
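A typed tool-call span can be as small as one record per invocation. A minimal sketch (the field names are illustrative, not a specific vendor schema):

```python
from dataclasses import dataclass
from typing import Any, Optional

# One structured span per tool invocation: inputs, output or error,
# timing, and an explicit retry count.
@dataclass
class ToolCallSpan:
    tool_name: str
    arguments: dict
    started_at: float
    duration_ms: float
    retry_count: int = 0          # explicit, never inferred from duplicate spans
    output: Optional[Any] = None
    error: Optional[str] = None

    @property
    def succeeded(self) -> bool:
        return self.error is None

span = ToolCallSpan("search_orders", {"user_id": 42}, started_at=0.0,
                    duration_ms=1850.0, retry_count=2, error="timeout")
```

With spans shaped like this, "the LLM is slow" versus "the tool's upstream API is slow" becomes a query over `duration_ms` grouped by span type rather than guesswork.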

Model version as a high-cardinality dimension. Most production agent pipelines run multiple models simultaneously — a cheap model for routing and classification, a more capable one for reasoning, possibly a third for synthesis. Every latency metric, error rate, and cost calculation needs to be sliced by model version. When a model upgrade ships and P99 latency doubles, you need to know immediately whether that degradation is in the routing layer or the reasoning layer.

Retry count and loop counter as explicit fields. These should not be inferred from duplicate spans. They should be surfaced as first-class numeric fields in every trace, so you can query directly: "Show me all traces where retry_count > 2." That query is the single most useful thing you can run when investigating a cost spike.
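With `retry_count` as a first-class field, that investigation query is a one-line filter. A sketch over in-memory trace records (the schema is assumed):

```python
# Hypothetical per-trace records with an explicit retry_count field.
traces = [
    {"trace_id": "a1", "retry_count": 0},
    {"trace_id": "b2", "retry_count": 3},
    {"trace_id": "c3", "retry_count": 5},
]

# "Show me all traces where retry_count > 2" -- the cost-spike triage query.
suspicious = [t for t in traces if t["retry_count"] > 2]
print([t["trace_id"] for t in suspicious])
```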

The Three Metrics That Tell You an Agent Is Degrading

Teams that run agents in production long enough converge on the same three early-warning signals. All three become visible before users notice anything.

Token inflation rate. Track the median and P95 input token count per trace over time. Token inflation is almost always the first observable symptom of prompt drift, context accumulation bugs, or unintended retry loops. If your median input tokens grow 20% week-over-week without a corresponding feature change, something is broken. This is measurable days before quality degradation reaches users.
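The week-over-week check is mechanical once per-trace token counts exist. A minimal sketch using the 20% threshold from the text (sample values are illustrative):

```python
import statistics

def token_inflation_alert(last_week, this_week, threshold: float = 0.20) -> bool:
    """Fire if median input tokens per trace grew more than `threshold`
    week-over-week. In production, also track P95 the same way."""
    prev = statistics.median(last_week)
    curr = statistics.median(this_week)
    return (curr - prev) / prev > threshold

# Medians 9,000 -> 12,000: +33% week-over-week, well past the 20% threshold.
print(token_inflation_alert([8_000, 9_000, 10_000], [11_000, 12_000, 13_000]))
```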

Tool error rate spike, not overall error rate. Agent pipelines absorb tool errors internally — that's the point of the retry logic. But the tool error rate (errors before retries, not after) is a leading indicator of downstream API instability. When a dependency starts flaking, tool error rate spikes immediately. Cost and latency follow. User-visible failures arrive last. Monitoring tool error rate pre-retry gives you a window to intervene.
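Counting errors pre-retry means instrumenting every attempt, not just the final outcome. A minimal sketch (the tracker shape is an assumption):

```python
from collections import Counter

# Record every attempt and every error BEFORE retry logic absorbs them.
class ToolErrorTracker:
    def __init__(self):
        self.attempts = Counter()
        self.errors = Counter()

    def record(self, tool: str, ok: bool):
        self.attempts[tool] += 1
        if not ok:
            self.errors[tool] += 1

    def error_rate(self, tool: str) -> float:
        return self.errors[tool] / self.attempts[tool]

t = ToolErrorTracker()
for ok in (True, False, True, False, False):  # a dependency starting to flake
    t.record("inventory_api", ok)
# 3 errors in 5 attempts: visible here long before any user-facing failure
print(t.error_rate("inventory_api"))
```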

Cost-per-success divergence from cost-per-request. If cost-per-request is flat but cost-per-success is rising, your retry and fallback logic is burning tokens on failures that are quietly resolved before surfacing. This gap is the clearest possible signal of a silent retry cascade. A 10% increase in cost-per-success often corresponds to a retry rate that has quietly tripled.

The Tooling Landscape

Purpose-built observability platforms have emerged to address exactly these gaps. These are the most widely deployed as of 2026:

LangSmith integrates tightly with LangChain and LangGraph, providing automatic trace capture for LangChain-based workflows. The integration is nearly zero-config if you're already on that stack, but it's proprietary and harder to use outside the LangChain ecosystem.

Langfuse is open-source and SDK-based, with strong support for complex workflows. It's the default choice for teams that want self-hosting and aren't committed to a specific framework.

Arize Phoenix balances monitoring and evaluation with an OpenTelemetry-native foundation. Better for teams with data science involvement who want to run evals alongside production monitoring.

OpenLLMetry (by Traceloop) takes a different approach: it's an open-source instrumentation SDK that emits standard OpenTelemetry spans, letting you route to any OTel-compatible backend. It instruments OpenAI, Anthropic, LangChain, and major vector databases automatically.

Weights & Biases Weave records structured execution traces that preserve parent-child relationships across agent runs, with native support for OTel export.

The choice between them matters less than the decision to instrument early. Every team that waits until post-launch to add agent observability reports the same experience: they're debugging blind during their most critical growth period.

The OpenTelemetry GenAI Standard: What's Here and What's Missing

OpenTelemetry's GenAI semantic conventions provide a vendor-neutral foundation for agent tracing. The standardized attributes cover model name and version, token counts (input/output/cached), tool call inputs and outputs, finish reason, and provider metadata. Datadog, Splunk, and other major APM vendors now natively ingest OTel GenAI spans, which means you can instrument once and export anywhere.

What's still experimental or absent: retry and loop patterns have no standard representation. Cost attribution across nested calls lacks a converged semantic. Multi-agent coordination — where one agent spawns another — has no standard span type for the handoff. Silent failure detection has no standardized approach at all.

This means that even if you adopt OTel GenAI conventions today, you'll need to add custom span attributes for the metrics that matter most: retry counts, loop iterations, per-trace cost, and cost-per-success. The standard gives you a foundation; the agent-specific instrumentation is still your responsibility.
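In practice that looks like standard `gen_ai.*` attributes sitting next to custom, namespaced ones on the same span. A sketch of the attribute map (the semconv names reflect recent drafts of an evolving standard; the `myorg.*` names and the model string are illustrative assumptions):

```python
# Span attributes for one LLM call: standard OTel GenAI semconv names
# plus custom attributes for the gaps the standard does not yet cover.
attributes = {
    # Standard GenAI semantic conventions
    "gen_ai.system": "anthropic",
    "gen_ai.request.model": "claude-sonnet-4",   # illustrative model name
    "gen_ai.usage.input_tokens": 8_500,
    "gen_ai.usage.output_tokens": 640,
    # Custom attributes -- no converged standard equivalent yet
    "myorg.agent.retry_count": 2,
    "myorg.agent.loop_iteration": 1,
    "myorg.agent.cost_usd": 0.031,
}
```

Keeping the custom fields in their own namespace means they survive unchanged if (when) the semconv later standardizes equivalents you can migrate to.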

What to Instrument First

If you're starting from zero observability on an agent pipeline, the priority order matters. Add these in sequence:

First, add correlation IDs to every message, tool call, and decision branch. Without correlation, debugging a multi-step failure across logs is nearly impossible. This is the one thing that costs almost nothing and pays off immediately.
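The mechanism is nothing more than a UUID minted at the entry point and threaded through every emitted event. A minimal sketch (the event structure is illustrative):

```python
import uuid

def new_trace_context() -> dict:
    """Mint one correlation ID per user request, at the entry point."""
    return {"correlation_id": str(uuid.uuid4())}

def emit_event(ctx: dict, kind: str, **fields) -> dict:
    """Every log line, span, and decision record inherits the same ID."""
    return {"correlation_id": ctx["correlation_id"], "kind": kind, **fields}

ctx = new_trace_context()
plan = emit_event(ctx, "plan", step=1)
call = emit_event(ctx, "tool_call", tool="search", step=2)
assert plan["correlation_id"] == call["correlation_id"]  # joinable across logs
```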

Second, emit a structured span for every tool invocation with retry count as an explicit field. Not inferred — explicit. This single addition makes retry cascade detection possible.
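One way to guarantee the field is always explicit is to emit it from the retry wrapper itself. A sketch (the emit callback and retry policy are illustrative, not a prescribed design):

```python
import time

def call_with_retries(fn, args: dict, max_retries: int = 3, emit=print):
    """Invoke a tool, emitting one span record per attempt with an
    explicit retry_count -- never inferred from duplicate spans."""
    for attempt in range(max_retries + 1):
        start = time.monotonic()
        try:
            result = fn(**args)
            emit({"tool": fn.__name__, "retry_count": attempt, "ok": True,
                  "duration_ms": (time.monotonic() - start) * 1000})
            return result
        except Exception as exc:
            emit({"tool": fn.__name__, "retry_count": attempt, "ok": False,
                  "duration_ms": (time.monotonic() - start) * 1000,
                  "error": str(exc)})
    raise RuntimeError(f"{fn.__name__} failed after {max_retries} retries")
```

Because every attempt emits its own record, a retry cascade shows up as a run of spans with climbing `retry_count` values instead of as unexplained latency.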

Third, track token counts per LLM call and per trace. Calculate cost in real time. Set an alert if cost-per-trace exceeds a threshold for more than five consecutive minutes.

Fourth, track model version as a dimension on every metric. This becomes essential the first time you do a model upgrade and need to understand what changed.

Fifth, add loop detection. The simplest version: if the same tool is called with the same arguments more than N times in a single trace, emit a warning event and consider terminating the run. Silent loops are the most expensive failure mode and the easiest one to prevent.
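The simplest version described above fits in a few lines. A sketch, with the repeat threshold N as an illustrative parameter:

```python
from collections import Counter
import json

class LoopDetector:
    """Flag when the same tool is called with identical arguments
    more than `max_repeats` times within a single trace."""
    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def check(self, tool: str, args: dict) -> bool:
        # Canonicalize args so {"a":1,"b":2} and {"b":2,"a":1} match.
        key = (tool, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        return self.seen[key] > self.max_repeats

det = LoopDetector(max_repeats=3)
for _ in range(3):
    assert not det.check("get_weather", {"city": "Paris"})
assert det.check("get_weather", {"city": "Paris"})  # 4th identical call: loop
```

On a positive result, emit a warning event on the trace and decide per-pipeline whether to terminate the run or let it continue under a hard budget.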

The Operational Reality

Industry analysts estimate that over forty percent of agentic AI projects will fail before reaching production, and cost escalation is among the primary drivers. The teams that ship and sustain production agents share a common property: they treat observability as a feature, not an afterthought. They instrument before they scale. They alert on tool error rates, not just user-visible errors. They track cost-per-success alongside cost-per-request. And they treat token inflation as the canary in the coal mine it actually is.

Your APM tool was built for a world where execution paths are known at deploy time. Agent pipelines live in a different world, and the dashboards that served you well for microservices will not tell you when your agent is degrading. Build for that reality now, before a silent retry loop turns a Tuesday afternoon into an unexpected invoice.
