
The Latency Budget Your Orchestrator Spent on Its Own Planning Step

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter ran a week-long instrumentation pass on a customer-support agent that had, on paper, a perfectly reasonable median latency. P50 was inside SLO, P95 was uncomfortable but explainable, and the tool-call traces looked healthy. Then someone bucketed the spans by type and the room got quiet. The agent was spending roughly 58% of its wall-clock per run inside spans labeled "plan," "reflect," "decide-next-step," and "self-check." Tool execution — the database lookups, the CRM writes, the auth checks — accounted for under 30%. The thing the agent was being measured on did less than the thing nobody was measuring.

That ratio is not a fluke. It is the natural state of any plan-act-observe loop that you do not actively police. The orchestrator is paid in latency for thinking and paid in latency for acting, and the thinking step is almost always cheaper to add than the acting step, so it grows unchecked. By the time you notice, "decide what to do next" has become its own line item — bigger than most of the line items you originally built the agent to serve.

The healthier framing is that planner overhead is a first-class SLO target, not a free coordination cost. Once you measure it on the same axis as tool latency, you start making different architectural calls — about model size, replanning frequency, plan caching, and whether your "agent" actually needs to be an agent at every step.

Why Planning Inflates Faster Than Anyone Expects

Three forces compound to make planner spans the silent majority of an agent run.

The first is that every planning turn re-prefills the entire conversation. Each step adds the prior tool result, the prior thought, and the prior action, so context grows monotonically across the loop. Prefill scales roughly linearly with context length, and at a typical 50ms-per-1k-tokens prefill rate, a ten-step ReAct loop that starts at 1k tokens and grows by 800 tokens per step processes about 46k prompt tokens in total. That is roughly 2.3 seconds of pure prefill overhead before a single output token is generated. The cost is easy to miss because it is folded into each call's time to first token instead of being reported as its own line item.
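The prefill arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes the full context is re-sent and re-processed on every step (no prefix caching) and that prefill time is strictly linear in tokens; the function name and parameters are illustrative.

```python
def cumulative_prefill_ms(steps, start_tokens, growth_per_step, ms_per_1k_tokens):
    """Total prefill time for a loop that re-sends its whole context each step."""
    # Context at step i is start_tokens + i * growth_per_step (i = 0 .. steps - 1).
    total_tokens = sum(start_tokens + i * growth_per_step for i in range(steps))
    return total_tokens, total_tokens * ms_per_1k_tokens / 1000


tokens, prefill_ms = cumulative_prefill_ms(
    steps=10, start_tokens=1_000, growth_per_step=800, ms_per_1k_tokens=50
)
print(tokens, prefill_ms)  # 46000 2300.0
```

The total grows quadratically with step count, so doubling the number of steps more than doubles the prefill bill.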

The second force is that hierarchical or multi-agent orchestration multiplies planner calls per user request. A three-level hierarchy with two-second LLM calls at each tier costs you a minimum of six seconds of coordination before any worker even starts executing. A four-agent pipeline routinely accumulates around 950ms of pure coordination overhead against roughly 500ms of actual work. The orchestration layer is doing exactly what it was built to do; nobody noticed it was doing it in serial.

The third force is the most insidious: planning is the part of the agent that is easiest to make "better" by adding more. More reflection turns. A self-critique pass. A "verify the plan" step. Each addition feels like a quality improvement in isolation, and each one extends the latency budget the user actually experiences. Worse, these steps tend to be introduced by the team that owns prompts, not the team that owns SLOs, so they ship without anyone modeling the wall-clock impact.

The combined effect: planner overhead grows on its own, in directions that look like good engineering, with no natural pressure pushing back.

Treating Planner Overhead as an SLO Line Item

The first move is to stop measuring agent latency as a single scalar. End-to-end P95 tells you nothing about where the time went, and "where" is the only useful signal for an agent system. Decompose the trace into typed spans — plan, tool-call, retrieval, reflect, finalize — and report each one against its own budget.

A useful budget shape for a customer-facing agent looks something like this. Tool-call latency gets 40% of the total budget, because that is where the user-visible work happens. Retrieval gets 15%. Planning gets 25%. Reflection and self-checking get 10% combined. The remaining 10% is your overhead margin for serialization, network, and gateway hops. When any single bucket exceeds its share, you know exactly which prompt, model, or hop to investigate. When a bucket or the total breaches even though no individual span inside it is slow, you have an aggregation problem: too many steps, not too slow a step.
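A minimal sketch of that check, assuming spans have already been exported as (type, duration) pairs. The bucket names, the 4000 ms SLO, and the sample run are hypothetical; only the percentage shares come from the paragraph above.

```python
from collections import defaultdict

# Budget shares from the text; bucket names are illustrative.
BUDGET_SHARE = {
    "tool_call": 0.40,
    "retrieval": 0.15,
    "plan": 0.25,
    "reflect": 0.10,   # reflection and self-checking combined
    "overhead": 0.10,  # serialization, network, gateway hops
}


def bucket_totals(spans):
    """Sum span durations by type. `spans` is a list of (span_type, duration_ms)."""
    totals = defaultdict(float)
    for span_type, duration_ms in spans:
        totals[span_type] += duration_ms
    return dict(totals)


def over_budget(spans, slo_ms):
    """Return {bucket: (actual_ms, allowed_ms)} for each bucket past its share."""
    return {
        bucket: (ms, BUDGET_SHARE[bucket] * slo_ms)
        for bucket, ms in bucket_totals(spans).items()
        if ms > BUDGET_SHARE[bucket] * slo_ms
    }


# Hypothetical run: no single plan span looks alarming, but together they dominate.
run = [
    ("plan", 900.0), ("tool_call", 600.0), ("plan", 800.0),
    ("retrieval", 280.0), ("tool_call", 560.0), ("plan", 620.0),
    ("reflect", 160.0), ("overhead", 80.0),
]
print(over_budget(run, slo_ms=4000))
```

In this sample the run lands exactly on its 4000 ms SLO, yet the plan bucket holds 2320 ms against an allowance of 1000 ms. An end-to-end number alone would not have flagged it.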

Most teams discover, the first time they cut this view, that their planner bucket is consuming 50–70% of the budget. That is the moment the conversation changes from "the agent is slow" to "we are paying premium-model latency to decide things we could have decided at compile time."

The corollary worth internalizing: a planner span that exceeds its budget is a defect, not a slow path. Treat it the way you treat a slow database query — page on it, profile it, optimize it, or remove it.

Where the Time Actually Goes Inside a Planner Span

Once you commit to the SLO-line-item framing, the next question is what to cut. Three categories of planner cost are usually present, and they have different remediations.

The largest category is re-deciding things that did not change. The agent re-derives, on every turn, that the user wants order status, that the user is authenticated, and that the tool to use is get_order. None of those facts moved. A planner that re-deduces them on every step is paying full reasoning cost to recover state it already had three turns ago. Plan caching directly attacks this — the planner emits a reusable plan keyed on the request shape, and subsequent similar requests skip the planning span entirely. Recent benchmarks on agentic plan caching report roughly 27% latency reduction on average for agent serving, with cache generation costing about 1% of the run.
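As a rough illustration of the idea (not the method used in the benchmarks mentioned above), a plan cache can be as simple as memoizing the planner on a request-shape key. The key fields, `PlanCache`, and `stub_planner` are all hypothetical; a real system would need its own notion of request shape and cache invalidation.

```python
from typing import Callable, Dict, List, Tuple

RequestShape = Tuple[str, bool]  # (intent, is_authenticated): a hypothetical cache key


class PlanCache:
    """Memoize plans by request shape so repeat shapes skip the planner call."""

    def __init__(self, planner: Callable[[str, bool], List[str]]) -> None:
        self._planner = planner
        self._plans: Dict[RequestShape, List[str]] = {}
        self.planner_calls = 0

    def get_plan(self, intent: str, is_authenticated: bool) -> List[str]:
        key = (intent, is_authenticated)
        if key not in self._plans:
            # Cache miss: pay for one planning call and remember the result.
            self.planner_calls += 1
            self._plans[key] = self._planner(intent, is_authenticated)
        return list(self._plans[key])  # copy so callers cannot mutate the cache


def stub_planner(intent: str, is_authenticated: bool) -> List[str]:
    """Stand-in for an LLM planning call."""
    steps = [] if is_authenticated else ["check_auth"]
    if intent == "order_status":
        steps.append("get_order")
    return steps


cache = PlanCache(stub_planner)
first = cache.get_plan("order_status", True)
second = cache.get_plan("order_status", True)  # served from cache, no planner call
```

After both lookups `cache.planner_calls` is 1: the second request of the same shape never opens a planning span.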

The second category is planning at the wrong granularity. Step-by-step planners ask the model "what should I do next?" at every iteration. Full-horizon planners ask "what is the full sequence of steps?" once, then execute the plan and replan only on failure. For well-shaped tasks, the full-horizon variant matches the step-by-step variant in accuracy while using dramatically fewer planner tokens. The right granularity is task-dependent — exploratory tasks need fine-grained planning, well-defined tasks emphatically do not — but most production agents default to fine-grained because that is what the framework example showed.
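One way to picture the full-horizon variant is the loop below, which plans once and goes back to the planner only when a step fails. The `plan_fn` and `execute_fn` callables, the step names, and the replan limit are assumptions made for the sketch.

```python
def run_full_horizon(plan_fn, execute_fn, task, max_replans=2):
    """Plan once, execute every step, and call the planner again only on failure.

    plan_fn(task, failed_step) -> list of step names (failed_step is None at first)
    execute_fn(step) -> True on success, False on failure
    Returns (completed_steps, planner_calls).
    """
    planner_calls = 1
    plan = plan_fn(task, None)
    completed = []
    replans = 0
    i = 0
    while i < len(plan):
        step = plan[i]
        if execute_fn(step):
            completed.append(step)
            i += 1
            continue
        if replans == max_replans:
            raise RuntimeError(f"step {step!r} still failing after {max_replans} replans")
        replans += 1
        planner_calls += 1
        plan = plan_fn(task, step)  # the new plan covers only the remaining work
        i = 0
    return completed, planner_calls


def stub_plan(task, failed_step):
    """Stand-in planner: a three-step plan, or a recovery plan after a failure."""
    if failed_step is None:
        return ["lookup", "refund", "notify"]
    return ["refund_retry", "notify"]


def stub_execute(step):
    """Stand-in executor in which only the 'refund' step fails."""
    return step != "refund"


completed, calls = run_full_horizon(stub_plan, stub_execute, "refund order")
```

Here three steps complete with two planner calls, one of them a recovery replan. A step-by-step loop would have asked the planner before every step.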

The third category is planning with the wrong model. The large planner model is gold-plated for tasks where a small model would have done. A common refactor is to let the large model produce the plan once, then delegate the per-step execution and the simple "which tool now" decisions to a smaller, faster model. This is the core insight behind plan-and-execute architectures: the expensive reasoning happens at the top of the loop, the cheap dispatching happens inside the loop. The latency win is structural, not just numeric — you stop calling the slow model on every iteration.
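A sketch of that split, with the two models represented as plain callables so the call counts are visible. `large_plan`, `small_dispatch`, and the tool names are hypothetical stand-ins, not any framework's API.

```python
def plan_and_execute(task, large_plan, small_dispatch, tools):
    """Run one slow-model planning call, then one fast-model dispatch per step.

    large_plan(task) -> list of step descriptions
    small_dispatch(step) -> name of the tool to run for that step
    tools: dict mapping tool name -> callable(step) -> result
    Returns (results, calls) where calls counts invocations per model tier.
    """
    calls = {"large": 1, "small": 0}
    results = []
    for step in large_plan(task):
        tool_name = small_dispatch(step)
        calls["small"] += 1
        results.append(tools[tool_name](step))
    return results, calls


def stub_large_plan(task):
    """Stand-in for the expensive planning model."""
    return ["find order 42", "email customer"]


def stub_small_dispatch(step):
    """Stand-in for the cheap 'which tool now' model."""
    return "get_order" if "order" in step else "send_email"


stub_tools = {
    "get_order": lambda step: f"order:{step}",
    "send_email": lambda step: f"email:{step}",
}

results, calls = plan_and_execute(
    "tell the customer about order 42", stub_large_plan, stub_small_dispatch, stub_tools
)
```

The large model is called once regardless of plan length; only the small-model count scales with the number of steps.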

When the Agent Should Stop Being an Agent

The harder question, the one that often gets dodged: does this code path need a planner at all?

Some workflows in your agent are deterministic in disguise. The user asks a known-shape question, the system retrieves a known-shape document, the model writes a known-shape response. If you trace a hundred runs and the planner picks the same three tools in the same order 95% of the time, you have a compiled workflow that is being executed as if it were a freeform planning problem. Replacing it with a static graph — the same nodes, the same edges, no per-step LLM decision — usually deletes the planner span entirely on that path. The latency improvement is dramatic, not because the planner was slow, but because the planner stopped existing.
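That trace check is straightforward to sketch. The code assumes each trace has been reduced to the ordered list of tool names the planner chose; the 0.95 threshold mirrors the figure above and is a starting point rather than a rule.

```python
from collections import Counter


def dominant_tool_sequence(traces):
    """Return (most common tool sequence, share of runs that used it)."""
    counts = Counter(tuple(trace) for trace in traces)
    sequence, hits = counts.most_common(1)[0]
    return list(sequence), hits / len(traces)


def is_compile_candidate(traces, threshold=0.95):
    """True when one tool sequence covers at least `threshold` of the runs."""
    _, share = dominant_tool_sequence(traces)
    return share >= threshold


# Hypothetical sample: 95 runs follow the same path, 5 deviate.
traces = [["check_auth", "get_order", "format_reply"]] * 95
traces += [["check_auth", "search_kb", "format_reply"]] * 5

sequence, share = dominant_tool_sequence(traces)
```

For this sample `sequence` is the three-tool path and `share` is 0.95, so `is_compile_candidate(traces)` returns `True` and the path is worth trying as a static graph.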

The mental model that helps here is the one from compilers. A planning agent is interpreting your business logic at runtime; a compiled workflow has resolved the logic at design time. Interpretation is the right choice when the input distribution is genuinely open-ended. It is the wrong choice when you are paying interpretation cost for a workload that turned out to be regular.

A pragmatic split many teams converge on: keep the planner for the long tail and the genuinely novel intents, and ship compiled graphs for the head of the distribution. Most agent traffic concentrates on a small number of intents, so even a partial compilation captures most of the latency win. You can A/B the two paths on the same request type and watch the planner-span SLO move.

What Changes When Planner Overhead Becomes Visible

Making planner spans first-class has second-order effects on how the team builds.

Prompt changes get reviewed against a latency budget, not only against quality. Adding a reflection turn used to be a quiet improvement; once the reflection span has a number next to it, the team weighs it against the SLO and often finds that the quality gain is not worth the latency tax. Adding a "verify the plan" step gets the same treatment. The prompt team and the SRE team start talking to each other, because the prompt team now owns a metric the SRE team paged on at 3am.

Model selection becomes a per-span decision rather than a per-agent decision. The planner can run on a large, slower model; the executor can run on a cheap, small model; the reflector can be cut entirely on hot paths. You stop picking "the model for the agent" and start picking models for spans, which is closer to how anyone would design a non-AI system anyway.

Replanning frequency becomes a tunable parameter. Many agents replan after every tool result by default. If the result was a deterministic success — the database returned a row, the auth call returned 200 — there is nothing to replan. A guard that suppresses replanning on expected-shape results often drops planner-span count by half with no quality impact.
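Such a guard might look like the following. The result format (a `status` field plus a `data` dict) and the field names are assumptions for illustration; the point is only that the decision is a cheap structural check, not a model call.

```python
def should_replan(result, expected_fields):
    """Replan only when a tool result failed or lacks the fields the plan expected."""
    if result.get("status") != "ok":
        return True
    data = result.get("data") or {}
    # Expected-shape success: every field the plan relies on is present.
    return not set(expected_fields).issubset(data)


expected = ["order_id", "state"]
ok_result = {"status": "ok", "data": {"order_id": 42, "state": "shipped"}}
failed_result = {"status": "error", "data": None}
partial_result = {"status": "ok", "data": {"order_id": 42}}
```

With these samples, `should_replan(ok_result, expected)` is `False`, while the failed and partial results both return `True` and fall through to the planner.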

Finally, the team stops shipping new agents into production with the architecture set to "freeform plan-act-observe with a large model at every node." That default exists because it is the default in the tutorials, not because anyone benchmarked it. Once you have the SLO view, you make the choice deliberately or you do not make it at all.

The Takeaway

The instrumentation pass is the cheap part. Bucket your agent's spans into plan, act, retrieve, reflect, finalize. Put a budget on each. Page when any bucket breaches. Most teams running this exercise for the first time discover that the slowest part of their agent is not the database, not the model provider, and not the network — it is the orchestrator deciding what to do next, over and over, with the full context, on the largest model available.

Once planner overhead is a number on a dashboard instead of a folk belief, the optimization moves become obvious. Cache plans. Coarsen planning granularity. Demote per-step decisions to smaller models. Compile the head of the distribution. Suppress unnecessary replans. None of these are exotic techniques. They are simply unavailable to a team that cannot see, in milliseconds, what their agent is thinking about.

The bug is not that agents plan. The bug is that planning was billed as coordination overhead — free, invisible, not on the dashboard — when it had quietly become the workload.
