Latency-Aware Tool Selection: When 'Good Enough Now' Beats 'Best Available Later'
The tool description in your agent's system prompt is a six-month-old eval artifact. It says search_pricing returns "fresh inventory data with structured pricing" and the planner believes it, because nothing in the prompt has updated since the day the description was tuned. The actual search_pricing endpoint has been sitting at p95 of 11 seconds for the last forty minutes because the upstream vendor is rate-limiting your account, and the cheaper search_cache tool — which the prompt describes as "may be slightly stale" — would return the same answer in 200ms. The planner picks search_pricing anyway, because the description still reads like it did during eval, and the planner has no signal about what either tool costs to call right now.
This is the structural failure of static tool descriptions. The planner is making routing decisions on a snapshot of a world that has moved on. Tool selection isn't really a capability question — most production agents have two or three tools that overlap heavily in what they can answer — it's a cost-of-waiting question, and the cost of waiting is the thing your prompt template doesn't see.
The Static Description Problem
A typical tool description in 2026 looks like the one in Building Effective Agents and similar primers: a name, a short paragraph of what it does, an argument schema, and maybe a note about when to prefer it over a sibling tool. That description gets locked in when you ship, and the eval suite that proves the agent picks the right tool also gets locked in at the same time. Both are tuned against a world where the tool's behavior matches the description.
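For concreteness, here is what such a frozen spec tends to look like, sketched as a Python dict. The tool name echoes the example above; every field name and phrase is illustrative, not taken from any particular framework:

```python
# A hypothetical static tool spec, frozen at ship time. The wording below is
# illustrative only -- it mirrors the search_pricing example, not a real API.
SEARCH_PRICING_SPEC = {
    "name": "search_pricing",
    "description": (
        "Returns fresh inventory data with structured pricing. "
        "Prefer this over search_cache when the user asks about current prices."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "sku": {"type": "string", "description": "Product SKU to look up."},
            "region": {"type": "string", "description": "Two-letter region code."},
        },
        "required": ["sku"],
    },
}
```

Nothing in that blob can change between deploys, which is exactly the problem.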
The world drifts. The 11-second p95 on search_pricing is not in the description. The 4% error rate spike on lookup_address after the geocoder vendor migrated regions is not in the description. The cache hit rate dropping from 92% to 71% after a schema change in the upstream system is not in the description. The planner reads what was true at eval time, picks the tool the eval rewarded, and pays the runtime cost the eval never measured.
Tool selection bugs in production almost never look like "the agent picked the wrong tool." They look like "the agent picked a reasonable tool that happens to be having a bad day." The eval suite would still pass on the prompt, because the eval mocks the tool with the description's stated behavior. The user experience is a 14-second turn that should have been a 400ms one.
This is the same instruction-decay pattern that breaks long-running agents, just compressed into a single turn. The description is correct on average over its lifetime, and wrong on the specific call you made at 14:23 today.
Cost of Waiting Is a Routing Signal, Not a Property of the Tool
The mental model worth adopting: tool cost is a time-varying signal, not a static attribute. The planner needs the value the signal has now, not the value the description was tuned against six months ago.
The signals that matter in practice:
- Current p95 latency over the last 1–5 minutes. Not the historical average — the recent percentile, because that's what the next call will pay.
- Recent error rate, scoped to the same window. A 30% error rate over the last minute is a routing signal even when the historical error rate is 0.4%.
- Circuit-breaker state. If the tool's circuit is half-open or open, the planner should know — calling it isn't just slow, it might be deliberately blocked.
- Quota / budget remaining. Useful when one tool has rate-limited capacity and a sibling tool doesn't. The Budget Tracker pattern from recent research surfaces the remaining budget directly in the agent's reasoning loop and consistently improves accuracy across models.
- Cache freshness, if relevant. A cached tool with a 30-second-old answer is often strictly better than a fresh tool with a 9-second latency.
These are not exotic signals. Your platform already collects every one of them — they live in the metrics backend that powers the latency dashboard the oncall watches. The work is not generating the signal. The work is plumbing it into the place where the planner makes the routing decision, which is the prompt.
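A minimal sketch of that plumbing, assuming your metrics backend can answer per-tool questions like "p95 over the last minute." The ToolHealth fields and the annotation format are assumptions for illustration, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class ToolHealth:
    """Live routing signals for one tool over a short trailing window."""
    p95_ms: float          # recent p95 latency, e.g. over the last 60s
    error_rate: float      # fraction of failed calls in the same window
    circuit: str           # "closed", "half_open", or "open"
    budget_remaining: int  # calls left in the current quota window

def annotate_description(static_description: str, health: ToolHealth) -> str:
    """Append live signals to the static tool description so the planner
    routes on what the tool costs to call now, not at eval time.

    How `health` gets populated is left to whatever metrics backend you run;
    the fields above are assumptions about what it can answer."""
    notes = [
        f"current p95 latency: {health.p95_ms:.0f}ms (last 60s)",
        f"recent error rate: {health.error_rate:.1%}",
        f"circuit breaker: {health.circuit}",
        f"quota remaining: {health.budget_remaining} calls",
    ]
    if health.circuit != "closed":
        notes.append("calls may be rejected; prefer a sibling tool if one can answer")
    return static_description + "\n[live status] " + "; ".join(notes)

# Usage: the stale-on-paper cache tool, annotated with what it costs right now.
cache_desc = annotate_description(
    "Cached pricing lookup; results may be slightly stale.",
    ToolHealth(p95_ms=200, error_rate=0.002, circuit="closed", budget_remaining=5000),
)
```

The point of the sketch is where the string ends up: appended to the tool description the planner actually reads, so the routing decision and the live signal share an address.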
When "Good Enough Now" Beats "Best Available Later"
Three concrete failure modes that show up once you start measuring this.
Streaming chat with a tail-bound SLO. The agent has a 6-second budget for the first chunk and a 12-second budget for the full response. A "best" tool that returns the perfect answer in 9 seconds violates the SLO; a "decent" tool that returns the 95%-as-good answer in 1.5 seconds doesn't. The planner that's optimizing for tool correctness is optimizing for the wrong objective. The user sees timeouts, not better answers.
Voice agents. The barge-in window is roughly 800ms. Any tool with a p95 over 600ms is functionally unavailable for the voice path even when it's the most accurate option in the catalog. The planner needs to know that the tool the prompt says to prefer is actually disqualified by physics in this modality.
Synchronous flows behind a transaction. A checkout-time fraud check needs a result before the payment intent expires. If the primary fraud-scoring tool is slow today, the system needs the fallback heuristic right now, not the canonical scorer eventually. The cost of waiting for the "best" tool is a dropped sale, which the eval suite never priced.
The unifying property of these cases is that the user-visible cost of waiting is non-linear. Below the deadline, latency matters slightly. Above the deadline, latency matters infinitely — the answer becomes worthless even if it would have been perfect. Static tool descriptions can't model this because they don't see the deadline and they don't see the live tail.
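One way to encode that non-linearity is to treat the remaining deadline as a hard filter and only then trade off quality against latency. The sketch below does exactly that; the tool names, quality scores, and the 0.1 penalty weight are made up for illustration, standing in for whatever your catalog and tuning actually produce:

```python
from typing import Optional

def pick_tool(candidates: dict[str, dict], deadline_ms: float) -> Optional[str]:
    """Deadline-aware routing: below the deadline, latency is a mild penalty;
    above it, the answer is worthless, so the tool is disqualified outright.

    `candidates` maps tool name -> {"quality": 0..1, "p95_ms": float,
    "error_rate": float}; the schema is an assumption for this sketch."""
    best_name, best_score = None, float("-inf")
    for name, c in candidates.items():
        if c["p95_ms"] >= deadline_ms:      # would blow the budget: infinite cost
            continue
        # Discount quality by the chance the call fails, then apply a small
        # linear penalty for the fraction of the budget the call consumes.
        expected_quality = c["quality"] * (1.0 - c["error_rate"])
        latency_penalty = 0.1 * (c["p95_ms"] / deadline_ms)
        score = expected_quality - latency_penalty
        if score > best_score:
            best_name, best_score = name, score
    return best_name  # None means: fall back to a heuristic or fail fast

# With 1.5 seconds left in the turn, the canonical scorer is disqualified and
# the decent-but-fast fallback wins. Numbers are illustrative only.
choice = pick_tool(
    {
        "fraud_score_full": {"quality": 1.00, "p95_ms": 9000, "error_rate": 0.01},
        "fraud_heuristic":  {"quality": 0.95, "p95_ms": 300,  "error_rate": 0.002},
    },
    deadline_ms=1500,
)
assert choice == "fraud_heuristic"
```

Feed the same function the live p95 from the previous sketch instead of a hard-coded number and the routing decision starts tracking the world instead of the eval snapshot.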
Further reading
- https://www.anthropic.com/engineering/writing-tools-for-agents
- https://vllm-semantic-router.com/blog/semantic-tool-selection/
- https://arxiv.org/html/2603.13426
- https://arxiv.org/html/2604.21816
- https://arxiv.org/html/2511.17006v1
- https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/
- https://dev.to/adamo_software/tool-use-api-design-for-llms-5-patterns-that-prevent-agent-loops-and-silent-failures-f29
- https://www.useparagon.com/learn/optimizing-tool-performance-and-scalability-for-your-ai-agent/
- https://arxiv.org/html/2508.11291
