
2 posts tagged with "agent-observability"


Reading the Agent Stack Trace: Triangulating Failures Across Model, Tool, and Harness

· 10 min read
Tian Pan
Software Engineer

A user reports that the agent gave a wrong answer. You open the trace. The model's reasoning looks fine. The tool calls all returned 200 OK. The harness logs show no retries, no truncation, no anomalies. And yet the answer is wrong. So you spend the next two hours stitching together three separate log streams in three different formats with three different clocks, and you eventually discover that a tool quietly returned {"result": null} for one specific query shape, the model rationalized the null into a plausible-sounding fact, and the harness happily forwarded the hallucination to the user. None of the three layers logged anything alarming on its own. The failure lived in the joints.
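One way to make that kind of joint failure visible is to have the harness check what a tool response means, not just its status code. What follows is a minimal sketch under stated assumptions, not a description of any particular harness: the function names `is_semantically_empty` and `check_tool_response` are hypothetical, and it assumes tool responses are JSON bodies where a null or empty value means "no data."

```python
import json
from typing import Any


def is_semantically_empty(payload: Any) -> bool:
    """Return True when a decoded tool payload carries no usable data.

    Hypothetical helper: treats None, empty strings/lists/dicts, and dicts
    whose values are all empty as "no data". Zero and False count as data.
    """
    if payload is None:
        return True
    if isinstance(payload, (str, list, dict)) and len(payload) == 0:
        return True
    if isinstance(payload, dict):
        return all(is_semantically_empty(v) for v in payload.values())
    return False


def check_tool_response(status_code: int, body: str) -> dict:
    """Classify a tool response before it is handed back to the model.

    Hypothetical harness hook: a 200 with an empty payload is reported as a
    failure so it shows up in the trace instead of passing silently.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return {"ok": False, "reason": "unparseable_body"}
    if status_code != 200:
        return {"ok": False, "reason": f"http_{status_code}"}
    if is_semantically_empty(payload):
        return {"ok": False, "reason": "empty_result"}
    return {"ok": True, "reason": None}


if __name__ == "__main__":
    # The response shape from the incident above: transport says OK, data says nothing.
    print(check_tool_response(200, '{"result": null}'))
```

With a check like this, the `{"result": null}` response would be recorded as `empty_result` at the harness boundary, which gives the person reading the trace something to find before the model has a chance to rationalize it.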

This is the dominant failure pattern in production agent systems, and most teams are debugging it with single-layer tools. The model team blames the tool. The tool team blames the model. The platform team blames the harness. Everyone is partially right, because an agent failure is almost never a single-component bug: it is a misalignment between three components that each operate on different mental models of "a step." Until your tracing infrastructure reflects that reality, you will keep paying for the same incident in different costumes.
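To make the idea of a shared notion of "a step" concrete, here is a rough sketch of joining the three log streams on a common key rather than on timestamps, since the three clocks cannot be trusted to agree. Everything in it is an assumption for illustration: the `Event` record, the `trace_id` and `step` fields, and the helper names are hypothetical, and it presumes each layer can be made to stamp its records with the same step counter.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

# Hypothetical layer names; a real system may have more.
LAYERS = ("harness", "model", "tool")


@dataclass(frozen=True)
class Event:
    """One log record from any layer, tagged with a shared trace and step id."""
    layer: str
    trace_id: str
    step: int
    message: str


def join_by_step(events: Iterable[Event], trace_id: str) -> List[Tuple[int, Dict[str, List[str]]]]:
    """Group one trace's events by step, with a slot for every layer.

    Ordering uses the shared step counter, not wall-clock time.
    """
    steps: Dict[int, Dict[str, List[str]]] = defaultdict(dict)
    for event in events:
        if event.trace_id == trace_id:
            steps[event.step].setdefault(event.layer, []).append(event.message)
    return [
        (step, {layer: steps[step].get(layer, []) for layer in LAYERS})
        for step in sorted(steps)
    ]


def missing_layers(joined: List[Tuple[int, Dict[str, List[str]]]]) -> List[Tuple[int, List[str]]]:
    """List the steps where at least one layer logged nothing."""
    gaps = []
    for step, by_layer in joined:
        silent = [layer for layer, messages in by_layer.items() if not messages]
        if silent:
            gaps.append((step, silent))
    return gaps


if __name__ == "__main__":
    events = [
        Event("harness", "t1", 1, "dispatch tool call"),
        Event("tool", "t1", 1, "200 OK, result=null"),
        Event("model", "t1", 1, "used tool output in answer"),
        Event("harness", "t1", 2, "forward answer to user"),
        Event("model", "t1", 2, "final answer"),
    ]
    for step, by_layer in join_by_step(events, "t1"):
        print(step, by_layer)
```

Reading the joined view one step at a time puts the tool's output, the model's use of it, and the harness's decision side by side, which is where a joint failure tends to become visible. `missing_layers` flags steps where a layer was silent; some of those gaps are expected (a final-answer step has no tool call), so it is a prompt for a closer look, not an alarm.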

Why Your Existing Observability Stack Won't Save You When AI Agents Break

· 11 min read
Tian Pan
Software Engineer

Your Datadog dashboard shows zero errors. Latency is nominal. All services return HTTP 200. Meanwhile, your AI agent just booked a meeting in the wrong timezone, hallucinated a customer's order history, and burned $4 in tokens doing it.

This is what makes agent observability genuinely hard: the metrics you already have tell you almost nothing about whether agents are actually working.

Traditional distributed tracing was built on a set of assumptions about how software fails. LLM agents violate all of them, and the gap between "my infrastructure is healthy" and "my agent did the right thing" is where most debugging pain lives.