
Reading the Agent Stack Trace: Triangulating Failures Across Model, Tool, and Harness

10 min read
Tian Pan
Software Engineer

A user reports that the agent gave a wrong answer. You open the trace. The model's reasoning looks fine. The tool calls all returned 200 OK. The harness logs show no retries, no truncation, no anomalies. And yet the answer is wrong. So you spend the next two hours stitching together three separate log streams in three different formats with three different clocks, and you eventually discover that a tool quietly returned {"result": null} for one specific query shape, the model rationalized the null into a plausible-sounding fact, and the harness happily forwarded the hallucination to the user. None of the three layers logged anything alarming on its own. The failure lived in the joints.

This is the dominant failure pattern in production agent systems, and most teams are debugging it with single-layer tools. The model team blames the tool. The tool team blames the model. The platform team blames the harness. Everyone is partially right, because an agent failure is almost never a single-component bug — it is a misalignment between three components that each operate on different mental models of "a step." Until your tracing infrastructure reflects that reality, you will keep paying for the same incident in different costumes.

The single-layer mental model is the root cause of most "we can't reproduce this" agent bugs. A traditional stack trace works because every frame shares a clock, a process, and a function-call abstraction. An agent trace doesn't get that for free. The model lives in a remote API with its own latency, sampling state, and cache. The tools live in a heterogeneous fleet of services with their own retries and timeouts. The harness sits between them with its own state machine, context-window arithmetic, and policy logic. Joining these into a single coherent trace is a distributed-systems observability problem that most teams haven't recognized as one.

Three layers, three flavors of lying

Each layer fails in its own characteristic way, and each disguises its failure as somebody else's.

The model lies through hallucinated tool results — when the user's question maps cleanly to a tool call, the model often issues the call correctly and waits for the response, but when the question is ambiguous or the tool result is malformed, the model will invent a plausible-sounding answer and present it as if it had executed a tool. The visible artifact is a confident, well-formatted response. The trace shows no tool span at all, or a tool span whose result the model contradicts in its final answer. If you are not specifically looking for "the model claims X, but X never appears in any tool response we logged," you will not find this. It looks identical to a tool that returned bad data.

The tool lies through schema drift — the tool's contract says it returns {"items": [...]} but a recent deploy changed it to {"results": [...]}, and the model, which was prompted with the old schema, parses the new response as empty. The tool returns 200 OK and the agent's downstream behavior is "no items found, sorry." The tool team's dashboards are green. The model team sees a model that handled an empty response gracefully. The actual fault — a breaking schema change shipped without consumer coordination — is invisible from any single layer's point of view.
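To make the mechanism concrete, here is a toy version of that silent empty parse. The helper and payloads are illustrative, not taken from any real tool:

```python
# A toy version of the silent empty parse. The helper and payloads are
# illustrative, not taken from any real tool.
def parse_search_response(payload: dict) -> list:
    # Written against the old contract: {"items": [...]}. After the deploy
    # renamed the field to "results", .get() silently yields [] and the
    # agent's answer becomes "no items found, sorry".
    return payload.get("items", [])

old_response = {"items": [{"id": 1, "title": "Q3 report"}]}
new_response = {"results": [{"id": 1, "title": "Q3 report"}]}  # post-deploy shape

assert parse_search_response(old_response)        # finds the item
assert parse_search_response(new_response) == []  # 200 OK, green dashboards, empty agent
```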

The harness lies through silent state mutations — context truncation that drops a critical instruction, retries that re-enter a prompt with subtly different state, sampling overrides applied per-tenant that nobody documented, stop-sequence collisions that cut a response mid-thought. The harness's logs say "request completed normally." The model's logs say "I generated tokens until I saw a stop sequence." Neither one volunteers that the stop sequence was inside a JSON value the user typed. You only find out when you manually replay the request and notice that the prompt the model actually received is shorter than the prompt your code thinks it sent.

What a unified trace actually needs

The fix is not "more logs." Most teams already produce too many. The fix is a single, time-ordered, structurally typed view that interleaves events from all three layers and treats them as nodes in one graph rather than rows in three tables.

The minimum viable shape of that view captures, for every agent run, an interleaved sequence of model turns, tool invocations, and harness events on a shared monotonic timeline. Each model turn carries the full prompt as the model actually saw it (including retrieved context, system prompt, conversation history, and any per-request modifications), the sampling state that produced the output (temperature, top-p, seed if available, model snapshot), and the raw output before any harness post-processing. Each tool invocation carries the exact arguments the model emitted, the exact response the tool returned, and the latency. Each harness event captures retries, context truncations, stop-sequence hits, sampling overrides, and any policy decisions that altered the request or response.
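Concretely, that minimum can be as small as three event types sharing one timeline. Here is a sketch in Python; the class names and fields are mine, not an existing schema:

```python
from dataclasses import dataclass, field
from typing import Any, Literal

# Illustrative event types for a unified agent trace. The names and fields
# are a sketch of the minimum described above, not an existing schema.

@dataclass
class ModelTurn:
    ts: float                      # shared monotonic timestamp
    prompt: str                    # the prompt as the model actually received it
    sampling: dict[str, Any]       # temperature, top_p, seed, model snapshot
    raw_output: str                # output before any harness post-processing

@dataclass
class ToolInvocation:
    ts: float
    tool_name: str
    arguments: dict[str, Any]      # exact arguments the model emitted
    response: Any                  # exact response the tool returned
    latency_ms: float

@dataclass
class HarnessEvent:
    ts: float
    kind: Literal["retry", "truncation", "stop_sequence", "sampling_override", "policy"]
    detail: dict[str, Any]         # e.g. dropped token count, override values

@dataclass
class AgentRunTrace:
    run_id: str
    events: list[ModelTurn | ToolInvocation | HarnessEvent] = field(default_factory=list)
```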

The OpenTelemetry GenAI semantic conventions, which took shape over 2024–2025, give this view a vocabulary. Spans for invoke_agent, chat, execute_tool, and embeddings provide cross-vendor interoperability, so a trace captured in one platform replays meaningfully in another. The conventions are still marked experimental as of early 2026, but adoption across LangSmith, Langfuse, Arize Phoenix, Datadog, Honeycomb, and Helicone has been fast enough that "OTel GenAI" is now the de facto baseline. If you are building tracing today and not emitting OTel GenAI spans, you are locking yourself into a vendor-specific format that next year's debugger refactor will have to undo.
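For the harness layer, emitting those spans is a thin wrapper around the standard OpenTelemetry SDK. A minimal sketch, with the caveat that the gen_ai.* attribute keys follow the draft conventions as I read them, and the one non-gen_ai attribute is purely illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-harness")

# The gen_ai.* attribute keys and span-name pattern below follow the draft
# OTel GenAI semantic conventions as I read them; verify against the current
# spec before depending on exact strings. The clients are passed in so the
# sketch stays self-contained.

def traced_tool_call(tool_client, tool_name: str, arguments: dict):
    with tracer.start_as_current_span(f"execute_tool {tool_name}") as span:
        span.set_attribute("gen_ai.operation.name", "execute_tool")
        span.set_attribute("gen_ai.tool.name", tool_name)
        result = tool_client(tool_name, arguments)          # your tool client
        # Non-standard attribute, purely illustrative: keep a small preview
        # of the raw response so cross-layer joins have something to match on.
        span.set_attribute("tool.response.preview", str(result)[:200])
        return result

def traced_model_turn(model_client, model: str, prompt: str, temperature: float):
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.temperature", temperature)
        return model_client(model, prompt, temperature)     # your model client
```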

A cause-hypothesis panel beats a wall of logs

Even with a unified trace, the on-call engineer at 2 a.m. is still scanning a 4,000-token JSON blob trying to spot the anomaly. The next meaningful step is automated cause attribution: a panel that, given a flagged failure, proposes which layer is most likely responsible based on signature patterns.

Some patterns are mechanical and worth automating first. If the model's final answer references an entity that does not appear in any logged tool response, the hypothesis is "model hallucination." If a tool span returned 200 OK but with a payload whose shape diverges from the schema the model was given, the hypothesis is "tool contract drift." If the harness recorded a context truncation event whose dropped tokens contained content the model later asks about, the hypothesis is "context window underflow." If retries fired with different sampling states, the hypothesis is "non-deterministic harness behavior." None of these is conclusive on its own — but each cuts the engineer's search space from "the entire trace" to "this specific span and its neighbors."
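A first-cut hypothesis panel along these lines needs no model in the loop. A rough sketch against the trace types sketched earlier, with stubbed entity extraction and illustrative heuristics:

```python
def extract_entities(text: str) -> list[str]:
    # Stub: a real panel would use NER, ID-pattern matching, or both.
    return [tok.strip(".,") for tok in text.split() if tok.istitle() or tok.isdigit()]

def propose_hypotheses(trace: AgentRunTrace, final_answer: str,
                       expected_schemas: dict[str, set[str]]) -> list[str]:
    """Map mechanical trace signatures to layer-level cause hypotheses.
    Heuristics only; each string is a starting point, not a verdict."""
    hypotheses = []
    tool_calls = [e for e in trace.events if isinstance(e, ToolInvocation)]
    harness_events = [e for e in trace.events if isinstance(e, HarnessEvent)]

    # 1. Answer references entities absent from every tool response -> model hallucination.
    tool_text = " ".join(str(t.response) for t in tool_calls)
    unseen = [e for e in extract_entities(final_answer) if e not in tool_text]
    if unseen:
        hypotheses.append(f"model hallucination: {unseen} never appear in any tool response")

    # 2. 200 OK but payload keys diverge from the schema the model was prompted with
    #    -> tool contract drift.
    for call in tool_calls:
        expected = expected_schemas.get(call.tool_name, set())
        if expected and isinstance(call.response, dict) and not expected & call.response.keys():
            hypotheses.append(f"tool contract drift: {call.tool_name} returned keys "
                              f"{set(call.response)}, prompt promised {expected}")

    # 3. A truncation event exists -> check it against what the model later asked about.
    if any(e.kind == "truncation" for e in harness_events):
        hypotheses.append("context window underflow: truncation event present; "
                          "diff dropped tokens against the model's later questions")

    # 4. Retries fired with different sampling states -> non-deterministic harness behavior.
    retry_samplings = {str(e.detail.get("sampling")) for e in harness_events if e.kind == "retry"}
    if len(retry_samplings) > 1:
        hypotheses.append("non-deterministic harness: retries used different sampling states")

    return hypotheses
```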

The TraceCoder line of research from 2025 demonstrated that a multi-agent debugging loop — one agent injects diagnostic probes, one performs causal trace diagnosis, one validates fixes — can yield up to a 34% relative improvement in repair quality on benchmark tasks, which suggests that automated cause-hypothesis generation is not a nice-to-have but a leverage point. Even a heuristic-based hypothesis panel that doesn't use a model at all would beat the manual workflow most teams have today.

The reproducibility envelope is the artifact you actually need

The hardest agent bugs are the ones that don't reproduce. The user submits a prompt, the agent fails, the user resubmits the same prompt, and the agent succeeds. The team chalks it up to non-determinism and moves on. This is almost always wrong. Agents are non-deterministic in their output, but their behavior is determined by an envelope of state that is, in principle, fully capturable: model version and sampling config, the full retrieved context with retrieval scores, the tool catalog as it was registered at run time, the harness's runtime configuration including any per-tenant overrides, and any external state the tools read.

The discipline that separates "we caught the bug" from "we couldn't repro" is capturing this envelope on every run, not just on flagged failures. Logging the envelope only when something goes wrong is a recipe for missing the bugs whose anomaly signal is "this run that looked fine actually produced subtly worse output." Logging it on every run is expensive, but the cost is bounded — most envelope fields are small structured data, and you can hash retrieved-context blobs to deduplicate at the storage layer.
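A sketch of what per-run capture can look like, with retrieved-context blobs hashed so the storage layer keeps one copy of each. The field names and the blob_store interface are assumptions, not a real API:

```python
import hashlib
import json

def capture_envelope(run_id: str, *, model_version: str, sampling: dict,
                     retrieved_docs: list[dict], tool_catalog: dict,
                     harness_config: dict, blob_store) -> dict:
    """Record the state envelope for one agent run. Field names are illustrative;
    blob_store stands in for any content-addressed store with put_if_absent()."""
    doc_refs = []
    for doc in retrieved_docs:
        blob = json.dumps(doc["content"], sort_keys=True).encode()
        digest = hashlib.sha256(blob).hexdigest()
        blob_store.put_if_absent(digest, blob)   # dedup: identical context stored once
        doc_refs.append({"sha256": digest, "score": doc["score"], "source": doc["source"]})

    return {
        "run_id": run_id,
        "model_version": model_version,
        "sampling": sampling,                    # temperature, top_p, seed, ...
        "retrieved_context": doc_refs,           # content hashes + retrieval scores
        "tool_catalog": tool_catalog,            # tools as registered at run time
        "harness_config": harness_config,        # including per-tenant overrides
    }
```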

A reproducibility envelope also enables structural diffing across runs: when a regression slips in, you replay the failing input against the previous prompt version, the diff highlights the exact envelope element that changed (a single retrieved document that vanished from the index, a sampling parameter that drifted, a tool whose contract shifted), and the postmortem moves from speculation to forensics. Without the envelope, you are diffing 4,000-token text outputs and squinting.
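With flat envelopes like the sketch above, the structural diff is a dictionary comparison rather than a text comparison:

```python
def diff_envelopes(baseline: dict, failing: dict) -> dict[str, tuple]:
    """Return the envelope fields that differ between a known-good and a failing run.
    Flat comparison over the illustrative envelope above; nested fields diff as whole values."""
    changed = {}
    for key in sorted(set(baseline) | set(failing)):
        if baseline.get(key) != failing.get(key):
            changed[key] = (baseline.get(key), failing.get(key))
    return changed

# Example output when a retrieved document vanished and a sampling parameter drifted:
# {"retrieved_context": ([doc_a, doc_b, doc_c], [doc_a, doc_c]),
#  "sampling": ({"temperature": 0.2}, {"temperature": 0.7})}
```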

Make the cross-layer mental model explicit, or pay for the blame Olympics

The architectural realization that closes the loop: an agent is a distributed system across three components — the model, the tools, the harness — and observability for distributed systems is a solved problem in principle, but the agent ecosystem is still acting like it isn't. The cost of pretending otherwise shows up in incident reviews where the model team and the tool team and the platform team each present logs from their layer, none of those logs contradict the others, and no one can explain what happened. That is the blame Olympics, and it is what your tracing tooling is implicitly designed to produce when it captures three layers separately and never joins them.

The teams that have moved past this share a few habits. They emit OTel GenAI spans from every layer so traces are joinable by default. They capture the reproducibility envelope on every run, not just failures. They run a structural-diff workflow against a known-good baseline whenever a regression is suspected, before anyone speculates about cause. They invest in a cause-hypothesis panel — even a crude one — so the on-call engineer is not parsing JSON by eye. And they treat their tracing infrastructure as a first-class product surface that improves quarter-over-quarter, not a checkbox marked at launch.

The forward-looking move for engineering leaders is to stop treating agent observability as a feature of one of the three layers. The model layer's tools won't catch tool-contract drift; the tool layer's tools won't catch hallucinated tool results; the harness layer's tools won't catch sampling-state regressions. The joint observability has to live somewhere, and if you don't decide where, the answer becomes "in the head of whichever engineer is on-call when the next ambiguous failure lands." That is not a strategy. That is a tax on your most senior engineers, paid in incident time, until you build the debugger your stack actually needs.
