The Latency Budget Your Agent Loop Stole from the Search Box
The launch metrics looked clean. Answer quality up, citation rate up, the eval suite green. The team that replaced the old keyword search with an agent-backed retriever shipped, took the win, and moved on. Six weeks later somebody noticed the weekly active number on that surface had drifted down twelve percent and nobody could find the regression. There was no regression. The agent worked. The users left because the box that used to answer in two hundred milliseconds now took four seconds, and nothing in the launch retro had a budget for that.
This is the latency-budget transfer problem, and almost nobody draws the org chart that catches it. A search box is not just a function call. It is a thirty-year contract with the user's nervous system: type, see results, scan, click. The 200-millisecond response is not a performance metric on a dashboard somewhere; it is the reason the user's attention is still on the screen when the results arrive. When the team underneath the box replaces a keyword index with an agent loop, the function-call surface looks identical, but the latency of the new call lives in a completely different regime. The latency budget moved from the team that owned the index to the team that owns the agent, and from the team that owns the agent to the user, and the only one who showed up to the meeting was the user.
The search box is a thirty-year muscle memory, not a UI element
Jakob Nielsen's response-time thresholds — 0.1 second feels instant, 1 second keeps the user's flow intact, 10 seconds is the outer edge before attention breaks — are older than most engineers reading this post. They are not a style guide. They are a description of what the human stack does when feedback arrives at different latencies. Inside 100 milliseconds the user feels they are directly manipulating the page. Past one second the user notices the machine is working. Past ten seconds the user starts another task.
A search box has lived at or near the first threshold, and comfortably inside the second, for the entire career of every user you ship to. Google trained the global expectation; every search box on the open web inherited it. When your team's older keyword search returned in 180ms, the user did not register the wait; they registered the answer. The perceived wait was effectively zero, which is why the team running the index measured their p99 in milliseconds and the product team did not have to think about it at all.
An agent loop does not live in that bucket. A single agent turn with retrieval, planning, and tool use typically lands somewhere between three and eight seconds for the full response. Time to first token, the fastest signal a streaming surface can give, is in the 200ms-to-1.5s range depending on provider and load. Even the cheap part of the agent loop can sit at or past the threshold where the user notices the machine is working. The expensive part is deep in the range where the user starts thinking about doing something else.
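To make the comparison concrete, here is a minimal sketch that maps a latency onto the three thresholds quoted above. The names `Bucket` and `response_bucket` are illustrative, and treating a latency exactly at a threshold as still inside the faster bucket is an assumption, not something the thresholds specify.

```python
from enum import Enum


class Bucket(Enum):
    INSTANT = "feels like direct manipulation"
    FLOW = "flow stays intact"
    NOTICED = "user notices the machine working"
    LOST = "user starts another task"


def response_bucket(latency_s: float) -> Bucket:
    """Classify a latency in seconds against the 0.1 s / 1 s / 10 s limits.

    Assumption: a value exactly on a limit counts as the faster bucket.
    """
    if latency_s <= 0.1:
        return Bucket.INSTANT
    if latency_s <= 1.0:
        return Bucket.FLOW
    if latency_s <= 10.0:
        return Bucket.NOTICED
    return Bucket.LOST


if __name__ == "__main__":
    # Figures taken from the ranges quoted in the text above.
    for label, latency in [
        ("fast time to first token", 0.2),
        ("slow time to first token", 1.5),
        ("full agent response", 4.0),
    ]:
        print(f"{label}: {latency}s -> {response_bucket(latency).value}")
```

The point of the exercise is that the full response never lands in either of the two fast buckets, and the first token only sometimes does.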
The product surface did not change. The user's mental model did not change. The latency regime moved by an order of magnitude. The question is not whether the new answer is better. The question is whether the user is still on the page when it arrives.
The launch retro that did not invite the team holding the budget
The team that owned the 200ms SLA on the keyword index was not at the launch decision for the agent. The decision read like an algorithm swap, not a perf change. The accuracy eval ran green, the cost model penciled out, and the rollout plan covered fallback behavior and quota. Latency appeared as a single number in the deck — the p50 of the new pipeline — and nobody asked the question the index team would have asked: what is the user-facing budget here, who owns it, and what is the threshold past which we count the launch as a loss?
This is the structural failure. The latency-budget owner changes when the implementation changes, but the org rarely catches the handoff. The infrastructure team's SLA was on a call that no longer exists. The product team's SLO was implicitly the infrastructure team's SLA — nobody had written it down, because for thirty years nobody had needed to. When the implementation gets ripped out, the SLO disappears with it, and the team that ships the new implementation inherits a budget they did not know they owned.
The fix is not to roll the launch back to the index. The fix is to put the latency budget in writing — as a product SLO, not an infrastructure one — before the launch. A useful version of that SLO looks less like "p99 under 300ms" and more like a tiered budget: time to first useful token must land inside the user's prior expectation; time to a complete answer can live in the longer bucket, but only if the surface tells the user what is happening in between. The owner of that SLO is the product team, not the inference team. The inference team owns a different SLO: the wire-level numbers that feed into it.
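One way to put a tiered budget in writing is as a small, checkable record. The sketch below is only an illustration: the class names, the `owner` value, and the 1.0 s and 8.0 s thresholds are placeholder assumptions, not recommended numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TieredLatencySLO:
    owner: str                    # the product team that owns the budget
    first_useful_token_s: float   # tier 1: must match the user's prior expectation
    complete_answer_s: float      # tier 2: allowed only with visible progress


@dataclass(frozen=True)
class RequestTiming:
    first_useful_token_s: float
    complete_answer_s: float
    progress_shown: bool          # did the surface say what was happening?


def slo_violations(slo: TieredLatencySLO, t: RequestTiming) -> list[str]:
    """Return a list of human-readable violations for one request."""
    problems = []
    if t.first_useful_token_s > slo.first_useful_token_s:
        problems.append("first useful token over budget")
    if t.complete_answer_s > slo.complete_answer_s:
        problems.append("complete answer over budget")
    # The longer tier is conditional: an answer that outlasts the first tier
    # needs the surface to show progress in the meantime.
    if t.complete_answer_s > slo.first_useful_token_s and not t.progress_shown:
        problems.append("long answer without progress indication")
    return problems


if __name__ == "__main__":
    # Placeholder thresholds; pick real ones from your own user data.
    slo = TieredLatencySLO(owner="search-product",
                           first_useful_token_s=1.0,
                           complete_answer_s=8.0)
    print(slo_violations(slo, RequestTiming(0.6, 4.0, progress_shown=True)))
    print(slo_violations(slo, RequestTiming(1.4, 4.0, progress_shown=False)))
```

The `owner` field matters as much as the thresholds: writing it down is what makes the budget something a named team can be asked about.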
If you can name the person on your team who owns the user-facing latency budget for the agent surface, you have done the hard work. If you cannot, the budget is sitting unowned, which is the same as having no budget.
First-useful-token is the only number the user actually feels
Streaming made latency a moving target instead of a single number, and most teams measure the wrong one. Time to first token — the wire metric — is what the model provider exposes and what most observability stacks chart. Time to first useful token — the moment something appears on the screen that the user can read, scan, or act on — is what governs whether the user is still there at the end.
The two are not the same. A streaming response that opens with "Sure, I can help you with that. Let me think about this question..." pushes the user's first useful token out by several hundred milliseconds. A response that opens with the answer's first concrete claim, the first relevant citation, or a one-line summary of the plan lands inside the bucket where the user is still leaning in. The wire metric stayed the same. The UX metric moved by a full attention threshold.
This is a design problem, not a perf problem. The fix lives in the prompt, the streaming format, and the surface that renders the stream. Instruct the model to lead with the load-bearing part of the response. Render structured chunks (a bullet, a card, an outline) as soon as they parse, instead of holding the response until the closing paragraph arrives. Strip the polite opener if the surface has its own affordance for "I am working on this." Every token spent on preamble is a token spent against the user's attention budget.
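On the rendering side, a rough sketch of "render as soon as it parses" is a line buffer over the text stream that emits each complete line immediately and drops a leading polite opener. The `PREAMBLE` pattern and the function name are assumptions for illustration; a real surface would tune the pattern, and the prompt remains the first place to fix this.

```python
import re
from typing import Iterable, Iterator

# Hypothetical opener pattern; adjust to what your model actually emits.
PREAMBLE = re.compile(
    r"^(sure|certainly|of course|great question|let me)\b", re.IGNORECASE
)


def renderable_lines(chunks: Iterable[str],
                     strip_preamble: bool = True) -> Iterator[str]:
    """Yield each complete line as soon as its newline arrives.

    Leading lines that look like a polite opener are dropped; once real
    content has been seen, nothing further is filtered.
    """
    buffer = ""
    seen_content = False
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.strip()
            if not line:
                continue
            if strip_preamble and not seen_content and PREAMBLE.match(line):
                continue
            seen_content = True
            yield line
    # Flush whatever is left when the stream ends without a final newline.
    tail = buffer.strip()
    if tail and not (strip_preamble and not seen_content and PREAMBLE.match(tail)):
        yield tail


if __name__ == "__main__":
    stream = ["Sure, I can help", " with that.\n- Rate limits re",
              "set hourly\n- See doc"]
    for line in renderable_lines(stream):
        print(line)
```

Because the generator yields per line rather than per response, the first bullet reaches the screen while the rest of the answer is still being generated.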
The honest version of this metric is observable. Instrument time to first useful token (TTFUT) alongside time to first token (TTFT), define "useful" as the first chunk a user could act on without seeing the rest, and watch the histogram. The gap between TTFT and TTFUT is the slack your prompt is spending on niceties. Most teams have not measured it because the streaming API does not give it to them; you have to define it at the application layer.
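A minimal sketch of that application-layer instrumentation follows. It assumes the stream arrives as an iterable of text chunks and that the application supplies its own `is_useful` predicate; both names, and the bullet-based definition of "useful" in the example, are hypothetical. The clock is injectable so the measurement can be tested without real delays.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    ttft_s: Optional[float] = None    # first non-empty chunk on the wire
    ttfut_s: Optional[float] = None   # first moment the text so far is "useful"

    @property
    def preamble_gap_s(self) -> Optional[float]:
        """Slack between the wire metric and the metric the user feels."""
        if self.ttft_s is None or self.ttfut_s is None:
            return None
        return self.ttfut_s - self.ttft_s


def measure_stream(chunks: Iterable[str],
                   is_useful: Callable[[str], bool],
                   clock: Callable[[], float] = time.monotonic) -> StreamTiming:
    """Consume a text stream and record TTFT and TTFUT in seconds.

    `is_useful` receives all text seen so far and decides whether the user
    could act on it. This sketch only measures; a real surface would
    forward each chunk to the renderer inside the same loop.
    """
    start = clock()
    timing = StreamTiming()
    seen = ""
    for chunk in chunks:
        now = clock()
        if timing.ttft_s is None and chunk:
            timing.ttft_s = now - start
        seen += chunk
        if timing.ttfut_s is None and is_useful(seen):
            timing.ttfut_s = now - start
    return timing


def has_bullet(text: str) -> bool:
    # Example definition of "useful": at least one bullet line has started.
    return any(line.startswith("- ") for line in text.splitlines())


if __name__ == "__main__":
    stream = ["Sure, let me look.\n", "- Limits reset hourly\n", "- See docs\n"]
    print(measure_stream(stream, has_bullet))
```

Feeding `ttft_s`, `ttfut_s`, and `preamble_gap_s` into the same histogram tooling you already use for TTFT is enough to see how much of the budget the opener is consuming.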
- https://www.nngroup.com/articles/response-times-3-important-limits/
- https://www.codeant.ai/blogs/ai-first-token-latency
- https://research.aimultiple.com/llm-latency-benchmark/
- https://www.digitalapplied.com/blog/ai-model-latency-benchmarks-2026-ttft-throughput
- https://thefrontkit.com/blogs/what-is-streaming-ui-in-ai-applications
- https://www.parloa.com/knowledge-hub/agentic-ai-latency/
- https://redis.io/blog/how-to-improve-llm-ux-speed-latency-and-caching/
- https://fuselabcreative.com/ui-design-for-ai-agents/
