
Agent Latency Budgets Are Trees, Not Lines — You Have Been Debugging the Wrong Axis

12 min read
Tian Pan
Software Engineer

A user reports "the assistant felt slow this morning." The on-call engineer pulls up the flame graph, sorts tool calls by duration descending, finds the slowest one — a 2.1-second vector search — optimizes it down to 900ms, ships the fix, and marks the incident resolved. A week later the same complaint arrives. The vector search is still 900ms. But the end-to-end latency on that query type has actually gotten worse. Nothing in the flame graph explains why.

This is what happens when an engineer debugs a tree on the line axis. Agent latency is not a waterfall of sequential steps — it is a nested tree of planning calls, tool subtrees, parallel fan-outs, retries, and recursive sub-agents. When the budget is structural but the tooling treats it as linear, local optimizations miss the actual violation, which lives in how time is distributed across branches, not how long any single call takes. You can make every leaf faster and still ship a p99 that is getting worse.

The symptom is usually that the team keeps shipping wins on flame-graph optimizations and the user-facing latency dashboard stays flat or drifts up. That is the signature of a tree problem that is being debugged as a line problem. If you have hit that pattern more than once, the observability and the budgeting model both need to change — not just the slow tool.

The Line Mental Model Does Not Survive First Contact With An Agent Loop

When engineers reason about request latency, most of us still carry a line-shaped mental model from request-response APIs: user sends query → gateway → service A → service B → database → back. Latency is the sum along the critical path. You flame-graph it, you find the fat span, you fix the fat span.

Agent systems look superficially similar — there is still a "request" and a "response" at the edges — but the middle is a different shape. A single user turn typically triggers:

  • One or more planning LLM calls to decide what to do.
  • A batch of tool calls the planner emits, some of which the SDK can run in parallel.
  • Per-tool sub-execution that itself fans out (a retrieval tool that calls an embedding API, a reranker, and a vector store).
  • Sub-agents that are their own nested loops with their own planning rounds and tool subtrees.
  • Retries on transient failures, which multiply across tiers.
  • A synthesis LLM call at the end that stitches everything together.

The Claude Agent SDK documents this explicitly: tools requested in a single turn can run concurrently when they are marked read-only, and sequential otherwise. The orchestrator-worker pattern is now standard — a lead agent spawns workers for parallel exploration. Fan-out is the default, not the exception.

The critical path through that structure is not a line. It is the longest root-to-leaf path through a tree whose shape is partially determined by the planner at runtime. The total wall-clock latency is governed by how much slack sits on each sibling branch relative to the deepest one. Optimizing a leaf on a short branch moves the total by zero; optimizing a leaf on the critical branch moves it by exactly the amount you shaved.
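
A minimal sketch of that walk, assuming spans are already assembled into a tree. The `Span` shape, the durations, and the `parallel` flag are all illustrative, not any SDK's API:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    self_ms: float = 0.0
    children: list["Span"] = field(default_factory=list)
    parallel: bool = False            # do the children run concurrently?

def wall_clock(span: Span) -> float:
    """Wall-clock time of a subtree: parallel siblings are gated by the
    slowest one; sequential children add up."""
    if not span.children:
        return span.self_ms
    child_times = [wall_clock(c) for c in span.children]
    gated = max(child_times) if span.parallel else sum(child_times)
    return span.self_ms + gated

def critical_path(span: Span) -> list[str]:
    """The root-to-leaf walk that governs latency: at each node, descend
    into the child whose subtree takes longest."""
    if not span.children:
        return [span.name]
    slowest = max(span.children, key=wall_clock)
    return [span.name] + critical_path(slowest)

# Hypothetical turn: an 800ms planner, then a parallel fan-out of two tools.
retrieval = Span("retrieval", children=[
    Span("embedding", 600), Span("vector_search", 700), Span("rerank", 800)])
turn = Span("turn", 800, children=[
    Span("fan_out", parallel=True, children=[retrieval, Span("crm_lookup", 1500)])])
# wall_clock(turn) == 2900; critical_path runs through retrieval -> rerank.
# The fattest single span (crm_lookup, 1500ms) is not on the critical path.
```

Note how the sorted-by-duration view would put `crm_lookup` at the top, while the actual gate is the retrieval subtree whose individual spans are all shorter.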

The reason teams mis-attribute is tooling. Most tracing UIs default to a flat list of spans sorted by duration. That view hides depth, hides siblings, hides slack, and makes the "biggest bar" feel causal when it often is not on the critical path at all.

Decompose The Budget The Way The System Actually Executes

The fix starts with budgeting. Give every node in the tree an explicit SLO, not just the root. A concrete decomposition for a customer-support agent with a 6-second user-perceived budget might look like this:

  • Root: 6000ms total
    • Planner call 1: 800ms
    • Tool fan-out (parallel, governed by the slowest sibling): 3000ms
      • Retrieval subtree: 2500ms
        • Embedding: 150ms
        • Vector search: 400ms
        • Rerank: 600ms
        • Tool wrapper overhead + network: 350ms
        • Reserve (slack): 1000ms
      • CRM lookup: 1500ms
      • Inventory check: 800ms
    • Planner call 2 (synthesis): 1500ms
    • Framework overhead, serialization, streaming setup: 700ms
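
One way to make the decomposition machine-checkable is to encode it as data and validate that every node's children fit inside their parent's allocation — a sketch, with node names taken from the example above:

```python
from dataclasses import dataclass, field

@dataclass
class BudgetNode:
    name: str
    budget_ms: int
    parallel: bool = False            # children race rather than run in sequence
    children: list["BudgetNode"] = field(default_factory=list)

def validate(node: BudgetNode) -> list[str]:
    """Flag nodes whose children cannot fit inside the parent's allocation.
    Parallel children are governed by the slowest sibling; sequential
    children must sum to at most the parent's budget."""
    problems = []
    if node.children:
        if node.parallel:
            worst = max(c.budget_ms for c in node.children)
        else:
            worst = sum(c.budget_ms for c in node.children)
        if worst > node.budget_ms:
            problems.append(
                f"{node.name}: children need {worst}ms, budget is {node.budget_ms}ms")
    for c in node.children:
        problems.extend(validate(c))
    return problems

# The decomposition from the article:
retrieval = BudgetNode("retrieval", 2500, children=[
    BudgetNode("embedding", 150), BudgetNode("vector_search", 400),
    BudgetNode("rerank", 600), BudgetNode("wrapper_overhead", 350),
    BudgetNode("reserve", 1000)])
root = BudgetNode("root", 6000, children=[
    BudgetNode("planner_1", 800),
    BudgetNode("fan_out", 3000, parallel=True, children=[
        retrieval, BudgetNode("crm_lookup", 1500), BudgetNode("inventory", 800)]),
    BudgetNode("planner_2", 1500),
    BudgetNode("framework_overhead", 700)])
```

Run `validate` at PR review time, the same way you would run a schema check: a new tool that does not fit its parent's reserve fails the build instead of silently inflating the root.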

Two things come out of this exercise that no flame graph will tell you:

First, the retrieval subtree and the planner synthesis step are structurally on the critical path every time, because they are sequential and deep. The CRM lookup and inventory check sit on short branches — a 400ms improvement there moves total latency by zero as long as retrieval is slower. Every team has a version of the "we optimized the 800ms CRM call" story where the root p95 did not move.

Second, each node has sibling slack — the gap between its own duration and the max across its siblings. Sibling slack is the single most important quantity in tree-shaped latency work, and it is almost never on any dashboard. If the CRM call takes 1500ms but its slowest sibling takes 2500ms, you have 1000ms of free margin on that branch. Burn it on a second retry, on a hedged request, on a better but slower model — anything that improves reliability or quality without moving the critical path. Conversely, if your critical-path subtree is exceeding its budget, you need to attack that specific branch, and the flame graph will usually be pointing at a different one that looks scarier because the bar is longer but sits on slack.
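
Sibling slack is cheap to compute once fan-outs are recorded — a sketch over the fan-out from the decomposition above:

```python
def sibling_slack(durations_ms: dict[str, float]) -> dict[str, float]:
    """For each sibling in a parallel fan-out: how much longer it could run
    without moving the parent's wall clock at all."""
    gate = max(durations_ms.values())
    return {name: gate - d for name, d in durations_ms.items()}

# The parallel fan-out from the decomposition above:
slack = sibling_slack({"retrieval": 2500, "crm_lookup": 1500, "inventory": 800})
# retrieval gates the branch (0ms slack); crm_lookup has 1000ms of free margin
```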

Per-node SLOs also change the meaning of an alert. "Retrieval subtree exceeded its 2500ms budget" is actionable — it tells you which subsystem to page. "p95 is up" without branch attribution paints a target on every team at once and usually ends in a week-long war room.

Deadline Propagation Is The Contract That Makes The Tree Executable

Budgets without enforcement are documentation. The enforcement mechanism that distributed systems have used for a decade is deadline propagation — and it is criminally underused in agent stacks.

The gRPC community formalized the pattern: when a client issues an RPC with a deadline, the server inherits a budget equal to the remaining time, subtracts its own work, and passes the reduced deadline to any downstream calls. If any node sees the deadline has already expired, it fails fast instead of wasting resources. The practice is so established that gRPC implementations in Java and Go propagate deadlines automatically; userver, Microsoft's gRPC guide, and countless production systems all treat it as baseline hygiene.

Agent frameworks have mostly not adopted this. A typical setup gives each tool call a fixed per-call timeout — say, 10 seconds — regardless of how much of the user's total budget has already been consumed. If the planner burned 4 seconds thinking and then fires three sequential tool calls with 10-second timeouts, you have implicitly authorized 34 seconds of wall clock against a 6-second user SLO. The retry layer makes this worse — the tool's retry budget is in tool-local time, not in request-tree time, so a single slow tool can push the request past its deadline while every individual component is "within spec."

The minimum contract an agent loop needs:

  • Every user request opens a budget token with an absolute deadline (not a duration — a wall-clock timestamp).
  • Every tool call, LLM call, and sub-agent receives a context carrying remaining budget.
  • Each node subtracts its own estimated cost before forwarding the deadline to children.
  • Any node observing deadline-exceeded aborts immediately without burning model tokens or tool quota.
  • The retry policy is bounded by remaining budget, not by a per-call retry count.
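
A minimal sketch of that contract in Python — the `Deadline` class and `call_with_retries` helper are illustrative, not from any framework:

```python
import time

class Deadline:
    """Budget token: an absolute deadline, forwarded down the request tree.
    `now` is injectable so the demo below can use a fake clock."""
    def __init__(self, budget_s: float, now=time.monotonic):
        self._now = now
        self.expires_at = now() + budget_s

    def remaining(self) -> float:
        return self.expires_at - self._now()

    def child(self, reserve_s: float) -> "Deadline":
        """Forward to a child, holding back reserve_s for our own work
        (synthesis, serialization) after the child returns."""
        d = Deadline(0.0, self._now)
        d.expires_at = self.expires_at - reserve_s
        return d

    def check(self) -> None:
        if self.remaining() <= 0:
            raise TimeoutError("deadline already exceeded; failing fast")

def call_with_retries(tool, deadline: Deadline, max_attempts: int = 3):
    """Retries bounded by remaining tree budget, not a tool-local count."""
    last = None
    for _ in range(max_attempts):
        deadline.check()              # abort before burning tokens or quota
        try:
            return tool(timeout_s=deadline.remaining())
        except TimeoutError as e:
            last = e
    raise last or TimeoutError("retry budget exhausted")

# Demo with a fake clock: a 6s user budget, 1.5s reserved for synthesis.
clock = [0.0]
root = Deadline(6.0, now=lambda: clock[0])
tool_deadline = root.child(reserve_s=1.5)        # tools see 4.5s, not 6s
remaining_at_start = tool_deadline.remaining()   # 4.5
clock[0] = 5.0                                   # planner + tools consumed 5s
expired = tool_deadline.remaining() <= 0         # True: fail fast, no retry
```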

Once this is in place, "deadline exceeded" becomes a first-class signal instead of a mysterious timeout. You can observe which subtree consumed the budget, which is the actual data you need to act.

Observability Must Surface Slack, Not Just Duration

If the budget is a tree, the dashboard has to be one too. The observability pattern that matters is not "slowest span" but "critical path ownership over time" and "per-subtree slack distribution."

Four views that change how teams debug:

Per-subtree slack histogram. For each named subtree (retrieval, CRM lookup, inventory, synthesis), plot the distribution of subtree_duration / subtree_budget. A healthy subtree sits around 0.5–0.8. A subtree consistently at 0.95+ is the one to optimize. A subtree at 0.2 is giving you headroom you could spend on quality. Most teams discover after building this view that their intuition about "which subsystem is slow" was wrong by a full branch.
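
The ratio behind that view is trivial to compute once per-subtree durations are extracted from traces — a sketch with hypothetical samples:

```python
def budget_utilization(samples_ms: list[float], budget_ms: float) -> dict[str, float]:
    """Summarize the distribution of subtree_duration / subtree_budget.
    Healthy subtrees sit around 0.5-0.8; consistently 0.95+ is the one
    to optimize; 0.2 is headroom you could spend on quality."""
    ratios = sorted(d / budget_ms for d in samples_ms)
    def pct(p: float) -> float:
        return ratios[min(int(p * len(ratios)), len(ratios) - 1)]
    return {
        "p50": pct(0.50),
        "p95": pct(0.95),
        "frac_over_budget": sum(r > 1.0 for r in ratios) / len(ratios),
    }

# Hypothetical retrieval-subtree durations against its 2500ms allocation:
summary = budget_utilization([1250, 1500, 1750, 2000, 2600], budget_ms=2500)
```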

Critical path attribution by query class. Not every request takes the same path through the tree. Group by query class (say, account-question vs. product-question vs. refund-request) and show which subtree owns the critical path for each class. "Retrieval is the bottleneck" is a lie when it is only true for 40% of traffic.

Deadline-exceeded decomposition. When a request exceeds budget, attribute the excess to a specific node: "retrieval went 600ms over its 2500ms allocation" is a bug report. "The request took 7s" is not.

Sibling-branch racing diagram. When fan-out happens, show each sibling's completion time relative to the slowest sibling that gated the parent. This is the single clearest visualization of slack and immediately reveals which branches are candidates for more retries, hedged requests, or a quality upgrade.

None of these views require custom telemetry — OpenTelemetry spans with parent-child relationships carry enough information. But the default trace UI does not render them. Either extend your observability platform's dashboards or export spans to a tool that can.

Early Cancel And Sibling Slack Are The Two Biggest Levers

With budgets, propagation, and tree-aware observability in place, two optimizations become trivially available.

Early cancellation of doomed branches. If the planner fires a retrieval call and a CRM lookup in parallel, and the retrieval fails hard at 100ms, there is no reason to let the CRM lookup run for another 1400ms — the synthesis step needs both. Most agent frameworks do not propagate cancellation through the tool subtree when a sibling fails, so the system burns budget and quota on work whose output will be discarded. Wire cancellation tokens through the parallel fan-out. This alone can cut p95 meaningfully for query classes where tool errors correlate.

Spend sibling slack on quality or reliability, not speed. The reflex when you discover a branch has 1000ms of slack is to shrink it. That is wrong in agent systems. Use the slack. Hedge the call against a backup provider. Run a second retrieval variant and take the better result. Upgrade to a more expensive reranker. Add a light verification step. Slack that sits unused is quality you are not shipping. The critical-path branch gets the speed work; the short branches get the quality work.
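
A hedged request is a few lines once you know the slack — a sketch; the provider coroutines and the timings are hypothetical:

```python
import asyncio

async def hedged(primary, backup, hedge_after_s: float):
    """Spend sibling slack on reliability: if the primary call has not
    returned within hedge_after_s, race a backup; first result wins."""
    p = asyncio.ensure_future(primary)
    done, _ = await asyncio.wait({p}, timeout=hedge_after_s)
    if done:
        return p.result()                    # primary was fast enough
    b = asyncio.ensure_future(backup)
    done, pending = await asyncio.wait({p, b}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                        # loser is discarded
    return done.pop().result()

# Demo: a slow provider hedged against a fast backup after 100ms.
async def slow_provider():
    await asyncio.sleep(2.0)
    return "primary"

async def backup_provider():
    await asyncio.sleep(0.05)
    return "backup"

winner = asyncio.run(hedged(slow_provider(), backup_provider(), hedge_after_s=0.1))
```

Pick `hedge_after_s` from the measured sibling slack: hedging inside the slack window costs nothing on the critical path.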

There is a research thread around speculative tool execution — PASTE and its successors — that goes further, executing likely-next tool calls while the planner is still generating its rationale. Pattern-aware speculation has been shown to reduce average task completion time by nearly 50% on agent workloads with stable control flows. That is a more aggressive version of the same principle: if you have slack, spend it on work the planner is about to ask for anyway.

The Organizational Failure Mode Is A Shared Budget Nobody Owns

Tree-shaped budgets expose an organizational question that line-shaped thinking hid: who owns each node's SLO? The platform team usually owns the root, the infra team owns the model calls, tool teams own their tool nodes, and the agent team owns the planning and orchestration. When the retrieval subtree blows its budget, whose page is it? When the CRM lookup has 1500ms of slack, who decides whether to spend it?

Teams that stay healthy under this decomposition do three things:

  • Make each subtree's budget a first-class SLI with a named owner and a review cadence. Do not let "the agent is slow" be a budgetless indictment of whoever replies first.
  • Treat budget-vs-actual as a regular capacity-planning artifact, not an incident-only artifact. A subtree trending from 60% to 85% of its budget over six weeks is a planning signal, not a firefight.
  • Review budget changes like you review schema changes. A new tool gets a budget allocation at PR review time, carved explicitly out of a parent's reserve — not added to the request's natural total.

The debugging practice flips too. The question stops being "what is the slowest span" and becomes "which subtree is over its budget, and which of its children caused it?" That is a question tree-aware dashboards can actually answer.

The Axis Was Always The Problem

Flame graphs will keep drawing agent traces as a sequence of bars, and the temptation to sort-by-duration and optimize the top one will keep feeling productive. But the bar you are optimizing usually sits on a branch that is not the critical path, or on a branch with slack you could have spent better, or on a subtree that is violating a budget that was never explicit in the first place.

The fix is not a new tool. It is a mental model change — budgets are tree-shaped, deadlines propagate, slack is an asset, and the critical path is a root-to-leaf walk, not a sorted list. Once your dashboard and your agent loop agree on that shape, the "slow morning" postmortems stop being mysterious. You can point at the branch that blew its allocation, the sibling slack it left on the table, and the planner decision that made the tree deeper than you budgeted for. That is a debug story you can actually close.
