Skip to main content

25 posts tagged with "debugging"

View all tags

The Agent Debugger Has No Breakpoints: Why Trace-First Workflows Replace Step-Through

· 10 min read
Tian Pan
Software Engineer

The first time you try to debug an agent the way you'd debug a service, you discover that the muscle memory has nothing to grip. You set a hypothetical breakpoint — there's no IDE pane to put it in, but you imagine one — at the step where the planner picked the wrong tool. You rerun with the same input. The planner picks the right tool this time. You rerun again. It picks a third tool you've never seen before. The bug is real, your colleague reproduced it twice this morning, and the debugger you've used for fifteen years is suddenly a museum piece.

The mental model that breaks here isn't "use a debugger." It's the much deeper assumption underneath: that a program, given the same inputs, produces the same execution. Every affordance in a modern debugger — breakpoints, step-over, watch expressions, conditional breaks, hot reload — is built on top of that determinism. You pause execution because pausing is meaningful. You step forward because the next step is knowable. You inspect a variable because its value is a fact, not a draw from a distribution.

The Rerun Antipattern: Why Rolling Again Doesn't Find Bugs

· 10 min read
Tian Pan
Software Engineer

The first thing most engineers do when an AI feature misbehaves is click "run" again. The model is stochastic, the thinking goes, so maybe this run was just unlucky. When the second attempt produces something that looks reasonable, the ticket gets closed. The team moves on. The actual bug — a stale tool response, a retrieval miss, a system-prompt conflict that fires only on inputs containing a specific token — sits in production, intact, waiting for the next user to trip it.

This is the rerun antipattern, and it is the most expensive debugging habit AI teams have inherited from the chatbot era. It feels rigorous because the model genuinely is non-deterministic. It looks like a variance probe. But almost no one writes down a hypothesis before they reroll, no one decides in advance how many runs would constitute evidence, and no one accounts for the tokens. What's happening is closer to slot-machine debugging: you pull the lever until the lights stop flashing red, and you walk away convinced the machine is fine.

Multi-Axis Agent Bisection: When the Regression Lives in the Interaction

· 11 min read
Tian Pan
Software Engineer

Quality regressed overnight. The on-call engineer pulls up the dashboard, traces a few bad sessions, and starts the obvious bisection: the model provider rotated to a new snapshot at 02:00 UTC, so revert to the pinned older alias. Eval suite still red. Roll back yesterday's prompt change. Still red. Pin the retrieval index back to last week's version. Still red. Each owning team rolls back their own axis in isolation and reports "not us." Three hours in, nobody owns the diagnosis because nobody owns the interaction surface where the regression actually lives — the new model interpreting the new tool description in a way the old model never would have.

This is the failure mode single-axis tooling can't solve. git bisect works because the search space is one-dimensional: a linear sequence of commits. An agent doesn't have one timeline. It has four or five timelines running in parallel — model snapshot, system prompt, tool catalog, retrieval index, sampling config — each with its own owner, its own deploy cadence, and its own "rollback" button that returns just its axis to a known state. The regression you're chasing is often a two-factor interaction, and bisecting along any single axis returns false negatives because the bug only fires on the cross-product cell where the new model meets the new tool description.

Agent State Diff: Why Eyeballing Two Traces Doesn't Scale

· 9 min read
Tian Pan
Software Engineer

A regression slips into production. The team picks the failing input, replays it against last week's prompt, and gets a different output. Now they have to figure out why — and the answer is buried in three megabytes of differing text, divergent tool-call sequences, and shuffled retrieved chunks that no human can productively diff. So they paste both transcripts into a side-by-side viewer, scroll for twenty minutes, conclude "the model just felt different today," and ship a hotfix that doesn't address the root cause because they never found it.

This is the agent state diff problem, and it is the first place where general-purpose engineering tooling stops working for agentic systems. A traditional regression bisect runs against deterministic code: the same input produces the same output, and git bisect walks history until you find the commit that broke it. Agent runs aren't deterministic, the inputs aren't a single string, and the "history" is a multi-axis envelope — model snapshot, sampling config, retrieved context, tool catalog, harness flags — any of which can independently change behavior.

Reading the Agent Stack Trace: Triangulating Failures Across Model, Tool, and Harness

· 10 min read
Tian Pan
Software Engineer

A user reports that the agent gave a wrong answer. You open the trace. The model's reasoning looks fine. The tool calls all returned 200 OK. The harness logs show no retries, no truncation, no anomalies. And yet the answer is wrong. So you spend the next two hours stitching together three separate log streams in three different formats with three different clocks, and you eventually discover that a tool quietly returned {"result": null} for one specific query shape, the model rationalized the null into a plausible-sounding fact, and the harness happily forwarded the hallucination to the user. None of the three layers logged anything alarming on its own. The failure lived in the joints.

This is the dominant failure pattern in production agent systems, and most teams are debugging it with single-layer tools. The model team blames the tool. The tool team blames the model. The platform team blames the harness. Everyone is partially right, because an agent failure is almost never a single-component bug — it is a misalignment between three components that each operate on different mental models of "a step." Until your tracing infrastructure reflects that reality, you will keep paying for the same incident in different costumes.

LLM Self-Debugging: When the Explanation Is the Signal vs. When It's the Lie

· 8 min read
Tian Pan
Software Engineer

When your LLM agent fails, the most tempting thing in the world is to ask it why. It will answer fluently, specifically, and with what feels like self-awareness. It might say: "I misunderstood the user's intent and retrieved documents about X when I should have targeted Y." That sounds exactly like a root cause. You write it down, open the prompt editor, and spend forty minutes chasing the wrong problem.

This is the central trap of LLM self-debugging. The model's explanation and the model's actual failure mechanism are two different things. Sometimes they overlap. Often they don't. Knowing which situation you're in before you act on the explanation is the discipline that separates fast debugging from expensive detours.

AI-Native Logging: Capture Decisions, Not Just I/O

· 10 min read
Tian Pan
Software Engineer

A customer support agent was generating hallucinated troubleshooting steps for 12% of tickets. The HTTP logs showed 200 OK across the board. Latency was normal. Error rates were flat. The system looked healthy by every conventional metric — and it was quietly fabricating answers at scale.

When engineers finally instrumented the decision layer, the root cause emerged in minutes: similarity scores for retrieved chunks were all below 0.4, confidence in the context was 0.28, and yet the model's stated output confidence read 0.91. A massive mismatch — invisible in traditional logs, obvious in a trace that captured the decision state.

This is the fundamental problem with applying conventional logging to LLM systems. I/O logs tell you your system ran. AI-native logging tells you whether it reasoned correctly.

The Agentic Debugger's Trap: When Your Agent Patches Faster Than You Can Diagnose

· 10 min read
Tian Pan
Software Engineer

A staff engineer I worked with last quarter caught a bug that had already been "fixed" three times in the previous six weeks. Three different engineers. Three different files. Three green CI runs. Three accepted agent-generated patches. Each patch made the failing test pass and the user-reported error disappear. Each one moved the bug somewhere else, where it waited until a different surface area triggered it again. The fourth time it surfaced, the data corruption it caused had been silently compounding for forty days.

The bug was a single off-by-one in a pagination cursor. The agent had been right that the symptom would go away. It had been wrong about why. And the engineers — competent, senior, well-intentioned — had each accepted a passing patch before they understood the failure mechanism.

This is the agentic debugger's trap: your agent can produce a fix faster than you can build the mental model needed to evaluate whether the fix is correct. Patch velocity outruns diagnosis. The bug count drops, the CI dashboard goes green, and you ship a codebase whose failure modes you no longer understand.

The 'Try a Bigger Model' Reflex Is a Refactor Smell

· 10 min read
Tian Pan
Software Engineer

A regression lands in standup: the support agent answered three customer questions wrong overnight. Someone says, "let's try Opus on this route and see if it fixes it." Forty minutes later the eval pass rate ticks back up, the team closes the ticket, and the inference bill quietly tripled on that path. Six weeks later the same shape of regression appears on a different route, and the same fix is applied. Your team has just trained a Pavlovian reflex: quality regression → escalate compute. The bigger model is the most expensive debugging tool in your stack, and you're now reaching for it first.

The trouble isn't that bigger models don't help. They do — sometimes a lot. The trouble is that bigger models are a strictly dominant masking strategy. When the prompt has a conflicting instruction, the retrieval is returning stale chunks, the tool description is being misread, or the eval set doesn't cover the failing distribution, a more capable model will round the corner of the failure without fixing any of those things. The next regression has the same root cause, the bill has compounded, and the underlying system is more brittle, not less, because the slack created by the upgrade kept anyone from looking under the hood.

Tool Reentrancy Is the Bug Class Your Function-Calling Layer Doesn't Know Exists

· 11 min read
Tian Pan
Software Engineer

The agent took four hundred milliseconds to answer a simple question, then crashed with a recursion-limit error. The trace showed twenty-five tool calls. Reading the trace top-to-bottom, an engineer would conclude the agent was confused — calling the same handful of tools in slightly different orders, never converging. That conclusion is wrong. The agent wasn't confused. It was stuck in a cycle: tool A invoked the model, the model picked tool B, tool B's implementation invoked the model again to format its output, and the formatter chose tool A. The trace UI rendered four nested calls as four sibling calls in a flat list, and the cycle was invisible to the only human who could have caught it.

This is tool reentrancy, and it's a bug class your function-calling layer almost certainly doesn't model. Concurrency-safe code has decades of primitives for it: reentrant mutexes that count nested acquisitions by the same thread, recursion limits at the language level, stack inspection APIs, and a cultural understanding that any function which calls back into the runtime needs a clear contract about what re-entry is allowed. Tool-calling layers default to fire-and-forget. There is no call stack the runtime can inspect, no cycle detector before dispatch, no reentrancy attribute on the tool definition, and the trace UI is shaped like a log, not a graph. The result is that every tool catalog past about a dozen entries silently becomes a recursion the framework can't see.

Pattern-Matching Failures: When Your LLM Solves the Wrong Problem Fluently

· 11 min read
Tian Pan
Software Engineer

A user pastes a long, complicated bug report into your AI assistant. It looks like a classic null-pointer question, with the same phrasing and code layout as thousands of Stack Overflow posts. The model responds confidently, cites the usual fix, and sounds authoritative. The user thanks it. The bug is still there. The report was actually about a race condition; the null-pointer framing was incidental to how the user described the symptom.

This is the single hardest bug class to catch in a production LLM system. The model did not refuse. It did not hedge. It did not hallucinate a fake API. It solved the wrong problem, fluently, and everyone downstream — the user, your eval pipeline, your guardrails — saw a plausible on-topic answer and moved on. I call these pattern-matching failures: the model latched onto surface features of the query and produced a confident answer to something adjacent to what was actually asked.

Data Lineage for AI Systems: Tracking the Path from Source to Response

· 10 min read
Tian Pan
Software Engineer

A user files a support ticket: "Your AI assistant told me the contract renewal deadline was March 15th. It was February 28th. We missed it." You pull up the logs. The response was generated. The model didn't error. Every metric is green. But you have no idea which document it retrieved, what the model read, or whether the date came from the context or was hallucinated entirely.

This is the data lineage gap. And it's not a monitoring problem — it's an architecture problem baked in from the start.