
33 posts tagged with "latency"


Latency Budgets for Multi-Step Agents: Why P50 Lies and P99 Is What Users Feel

· 10 min read
Tian Pan
Software Engineer

The dashboard said the agent was fast. P50 sat at 1.2 seconds, the team had a meeting to celebrate, and then the abandonment rate kept climbing. Nobody was looking at the graph the user actually lives on.

This is the reliable failure mode of multi-step agents in production: the median is the metric you can hit, the tail is the metric your users feel, and the gap between the two grows non-linearly with every sub-call you bolt onto the pipeline. A four-step agent where each step is "fast at the median" routinely produces a P99 that is six or eight times worse than any single step. Users do not experience the median. They experience the worst step in their particular trip.
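
A quick way to feel the compounding is to simulate it. The distribution below is invented, a heavy-tailed step that looks fine at the median, but the shape of the result is the point: the four-step trip's median stays reasonable while its tail blows out.

```python
import random

random.seed(0)

def step_latency_s():
    # Hypothetical heavy-tailed step: median ~0.30 s, with a long right tail.
    return random.lognormvariate(-1.2, 1.0)

def pct(xs, q):
    xs = sorted(xs)
    return xs[int(len(xs) * q)]

steps = [step_latency_s() for _ in range(100_000)]
trips = [sum(step_latency_s() for _ in range(4)) for _ in range(100_000)]

print(f"one step   : p50 = {pct(steps, 0.50):.2f} s, p99 = {pct(steps, 0.99):.2f} s")
print(f"four steps : p50 = {pct(trips, 0.50):.2f} s, p99 = {pct(trips, 0.99):.2f} s")
```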

If your team optimizes the wrong percentile, you will ship a system that benchmarks well, demos beautifully, and bleeds users in the long tail you never instrumented.

The Refusal Latency Tax: Why Layered Guardrails Eat Your p95 Budget

· 10 min read
Tian Pan
Software Engineer

A team I talked to recently built what they called a "defense in depth" pipeline for their AI assistant. An input classifier checked for prompt injection. A jailbreak filter scanned for adversarial patterns. The model generated a response. An output moderation pass scanned the result. A refusal detector checked whether the model had punted, and if so, a reformulation step re-asked the question with a softer framing. The eval suite said the prompt produced answers in 1.4 seconds. Real users were waiting 3.8 seconds at the median and over 9 seconds at the p95.

Every safety layer is a round trip. Every round trip has a network hop, a queue time, a model load, and a decode. When you stack them serially in front of and behind the generative call, the latency budget you priced your product on dissolves — and almost no one accounted for it during design review. Worse: the slowest, most expensive path through your pipeline is the one that triggers on safety-adjacent prompts, which is exactly the long tail your safety story exists to handle. You are silently subsidizing that tail from the average user's bill.
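
As a back-of-the-envelope sketch of where the budget goes, model each layer as a blocking round trip. The stage names and timings below are invented for illustration; note that only the generative call shows up in an eval that measures the prompt in isolation.

```python
import time

def round_trip(ms):
    """Stand-in for one guardrail call: network hop + queue + model + decode."""
    time.sleep(ms / 1000)

# Hypothetical per-stage latencies (ms); only "generation" appears in the eval number.
SERIAL_PATH = [
    ("injection_classifier", 180),
    ("jailbreak_filter",     150),
    ("generation",          1400),
    ("output_moderation",    220),
    ("refusal_detector",     120),
    ("reformulation",       1400),  # worst case: refusal detected, re-ask with softer framing
]

t0 = time.perf_counter()
for name, ms in SERIAL_PATH:
    round_trip(ms)
total_s = time.perf_counter() - t0
print(f"user-visible: {total_s:.1f} s vs 1.4 s for the generative call alone")
```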

The Same Prompt at 3 PM and 3 AM Is Not the Same Prompt: Diurnal Drift in LLM Evaluation

· 12 min read
Tian Pan
Software Engineer

The eval suite runs at 2 AM. Traffic is low. The cache is cold but the queues are empty. The provider's continuous batcher has spare slots and will service every request near its TTFT floor. The latency distribution is tight, the judge scores are stable, and the dashboard turns green. The team ships.

Six hours later, at 8 AM Pacific, the same prompts hit production during US morning peak. p95 latency is 2.4x what the eval reported. A non-trivial fraction of requests get a 529 from one provider and a fallback to a smaller routing tier from another. Streaming pacing is choppier. The judge — re-run on a sample of production traces that night — gives a half-point lower median score than the same judge gave the same prompts at 2 AM. Nothing changed in the codebase. Nothing changed in the prompt. The wall clock changed.

The architectural realization that has to land is this: an LLM call is not a pure function of its input tokens. It's a stochastic distributed system call where the input includes the wall clock, the load on the provider's cluster, the state of the prompt cache, the size of the current decode batch, and the routing decision the provider's load balancer made under the conditions that prevailed in the millisecond your request arrived. The team that runs evals at 2 AM is calibrating an instrument on conditions its users never experience.
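
One cheap countermeasure is to make the wall clock part of every eval record, so the 2 AM run and the 8 AM run stop being interchangeable. A minimal sketch, assuming your harness exposes a per-case callable that returns a dict with a score:

```python
import time
from datetime import datetime, timezone

def run_case_with_context(case_id, run_one_case):
    started = datetime.now(timezone.utc)
    t0 = time.perf_counter()
    result = run_one_case(case_id)          # your existing eval call (assumed to return a dict)
    latency_s = time.perf_counter() - t0
    return {
        "case_id": case_id,
        "utc_hour": started.hour,           # bucket later by hour of day
        "weekday": started.weekday(),
        "latency_s": round(latency_s, 3),
        "score": result.get("score"),
    }
```

Grouping records by utc_hour and comparing p95 latency and median judge score across buckets is usually enough to reveal whether a green dashboard was a time-of-day artifact.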

Where the 30 Seconds Went: Latency Attribution Inside an Agent Step Your APM Can't See

· 11 min read
Tian Pan
Software Engineer

The dashboard says agent.run = 28s at p95. Users say the feature feels broken. The on-call engineer opens the trace, sees a single fat bar with no children worth investigating, and starts guessing. By the time someone has rebuilt enough mental model to know whether the bottleneck is the model, the retriever, or a tool call that nobody added a span to, the incident has aged into a backlog ticket and the user has given up.

This is the failure mode at the heart of agent operations in 2026: classical APM treats an agent step as a black box, and "agent latency" is not a metric — it is the sum of seven metrics that decompose the wall-clock time differently depending on what the agent decided to do that turn. A team that doesn't expose those seven numbers ships a feature whose slowness everyone can feel and nobody can fix.
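
A minimal sketch of what exposing the breakdown can look like; the category names here are illustrative stand-ins, not the article's actual seven metrics.

```python
import time
from contextlib import contextmanager

class StepTimer:
    def __init__(self):
        self.parts = {}

    @contextmanager
    def span(self, name):
        t0 = time.perf_counter()
        try:
            yield
        finally:
            self.parts[name] = self.parts.get(name, 0.0) + (time.perf_counter() - t0)

timer = StepTimer()

with timer.span("retrieval"):
    time.sleep(0.12)            # stand-in for the retriever call
with timer.span("model_prefill_and_decode"):
    time.sleep(0.80)            # stand-in for the LLM call
with timer.span("tool_call"):
    time.sleep(0.30)            # the span nobody added

total = sum(timer.parts.values())
for name, secs in sorted(timer.parts.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {secs * 1000:7.0f} ms  ({secs / total:4.0%})")
```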

The Eval-Rig Latency Lie: Why Your p95 Doubles in Production

· 10 min read
Tian Pan
Software Engineer

The eval team puts a number on the deck: "p95 latency is 1.2s." The launch ships. A week later, oncall posts a graph: production p95 is 4.8s and climbing through the dinner-time peak. Engineers spend the next five days arguing about whether something regressed, instrumenting model versions, opening tickets with the provider — and eventually discover that nothing changed except where the number was measured. The eval rig was reporting the latency of a quiet machine running serial calls against a warm cache. Production is a different system. The p95 was never wrong; it was answering a different question.

This is the eval-rig latency lie. It is not about bad benchmarks — most teams use reasonable tools and report the numbers honestly. It is about the gap between "the latency of the model" and "the latency a user experiences," and the fact that the rig you build for development almost always measures the first while implying the second. Once you internalize this, latency SLOs derived from a benchmark stop looking like product commitments and start looking like claims about a private testing environment that nobody else can reproduce.
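
Here is a toy queueing model, with every number invented, of why the same endpoint produces two different p95s: the rig sends one request at a time to an idle machine, while production sends bursts that wait for capacity.

```python
import asyncio
import random

CAPACITY = 4            # concurrent requests the endpoint can really serve
SERVICE_S = 0.05        # scaled-down per-request service time (think ~1 s)

async def call(server: asyncio.Semaphore) -> float:
    t0 = asyncio.get_running_loop().time()
    async with server:                       # queueing happens here, not in the model
        await asyncio.sleep(SERVICE_S * random.uniform(0.9, 1.3))
    return asyncio.get_running_loop().time() - t0

async def p95(concurrency: int, n: int = 128) -> float:
    server = asyncio.Semaphore(CAPACITY)
    latencies = []
    for _ in range(n // concurrency):
        latencies += await asyncio.gather(*(call(server) for _ in range(concurrency)))
    return sorted(latencies)[int(len(latencies) * 0.95)]

async def main():
    serial = await p95(1)
    prod = await p95(32)
    print(f"eval rig, serial:          p95 = {serial * 1000:.0f} ms")
    print(f"production, 32 concurrent: p95 = {prod * 1000:.0f} ms")

asyncio.run(main())
```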

Inter-Token Jitter: The Streaming UX Failure Your p95 Dashboards Can't See

· 11 min read
Tian Pan
Software Engineer

Your latency dashboard is green. Time-to-first-token is under the 800ms target on p95. Total completion time is under the four-second budget on p99. Then a senior PM forwards a support thread: "the assistant froze for like three seconds in the middle of an answer," "it stuttered and then dumped a whole paragraph," "I thought it crashed." Three users uninstalled this week with the same complaint. Nobody on the team can reproduce it on their laptop, and every metric you log says the system is healthy.

The metric that would explain the bug is the one you're not measuring: the distribution of gaps between consecutive tokens. A clean p95 total time can hide a stream where 8% of responses contain a 2.5-second pause halfway through, and to a user watching characters appear in real time, that pause reads as a broken system — not a slow one. Your dashboard is measuring the movie's runtime; your user is watching the movie.
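
Measuring the missing metric is cheap: wrap the stream and record the gap between consecutive tokens. A minimal sketch, assuming you can interpose on whatever iterator feeds your UI.

```python
import time

def measure_gaps(stream):
    """Pass tokens through untouched while recording inter-token gaps."""
    gaps_ms, last = [], None
    for token in stream:
        now = time.perf_counter()
        if last is not None:
            gaps_ms.append((now - last) * 1000)
        last = now
        yield token

    if gaps_ms:
        gaps_ms.sort()
        p95 = gaps_ms[int(len(gaps_ms) * 0.95)]
        # Emit alongside TTFT and total time; a 2500 ms worst gap is the "it froze" report.
        print(f"inter-token gaps: p95 = {p95:.0f} ms, max = {gaps_ms[-1]:.0f} ms")

def fake_stream():
    for i, tok in enumerate("a short demo stream of tokens".split()):
        time.sleep(1.5 if i == 3 else 0.03)   # one mid-stream stall
        yield tok

for tok in measure_gaps(fake_stream()):
    pass
```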

Why Your Voice Agent Feels Rude: Turn-Taking Is a Latency Budget You Never Wrote Down

· 11 min read
Tian Pan
Software Engineer

The first time you ship a voice agent, you'll get the same complaint in two forms: "It interrupted me" and "It feels rude." Both are the same bug. The agent isn't impolite; it's running on a latency budget you never wrote down. The chat-style instinct that says "respond when complete" produces a system that, in voice, feels like talking to someone who keeps stepping on your sentences and going silent at all the wrong moments.

Conversational turn-taking in humans happens in a window of roughly 100 to 300 milliseconds, and it does so across every language ever measured. A median 200ms inter-speaker gap isn't an aspiration; it's the baseline humans calibrate against. Anything slower reads as confusion, anything faster reads as interruption, and a voice agent that doesn't model the rhythm explicitly is going to land in one bucket or the other on every turn.

The fix isn't a faster model. It's accepting that voice AI is a soft real-time system whose budget is set by human conversational physics, and writing the budget down before you ship.
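
In its smallest form, writing the budget down is a table of numbers and an assertion. Every figure below is illustrative; the real value of the exercise is discovering which stages have to overlap rather than run in sequence.

```python
TARGET_GAP_MS = 300          # upper end of the human turn-taking window

BUDGET_MS = {
    "endpoint_detection": 120,   # deciding the user has actually finished speaking
    "asr_finalization":    80,
    "llm_first_token":     60,
    "tts_first_audio":     40,
}

spent = sum(BUDGET_MS.values())
assert spent <= TARGET_GAP_MS, (
    f"budget overspent: {spent} ms > {TARGET_GAP_MS} ms; "
    "something has to be overlapped or cut before shipping"
)
print(f"{spent} ms of {TARGET_GAP_MS} ms budget allocated")
```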

GPU Starvation: How One Tenant's Reasoning Prompt Stalls Your Shared Inference Endpoint

· 9 min read
Tian Pan
Software Engineer

Your dashboard says the GPU is healthy. Utilization hovers around 80%, throughput in tokens-per-second looks fine, cold starts are rare, and the model is the one you asked for. Yet your pager is going off because p99 latency has tripled, a handful of users are timing out, and support tickets all describe the same thing: "the app froze for twenty seconds, then came back." You pull a trace and find an unrelated customer's 28,000-token reasoning request sitting in the same batch as every stalled call. One tenant's deep-think prompt just ate everyone else's turn.

This is head-of-line blocking, and it is the failure mode that ruins shared LLM inference the moment reasoning models enter the traffic mix. The pattern is not new — storage systems and network stacks have fought it for decades — but it takes a specific shape on GPUs because of how continuous batching and KV-cache pinning work. Most teams design for average load and discover too late that "shared inference is cheaper" stops being true the instant request sizes stop being similar.
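
A toy calculation, with invented throughput numbers, of just the prefill side of the problem: when a neighbor's huge prompt is prefilled unchunked inside the shared batch, everyone else's next token waits behind it.

```python
PREFILL_TOK_PER_S = 8_000     # hypothetical prefill throughput of the shared GPU
DECODE_STEP_MS = 30           # one decode iteration for the whole batch

def stall_from_prefill_s(prompt_tokens: int) -> float:
    """Seconds that co-batched requests wait while this prompt is prefilled."""
    return prompt_tokens / PREFILL_TOK_PER_S

for prompt in (300, 4_000, 28_000):
    gap_ms = DECODE_STEP_MS + stall_from_prefill_s(prompt) * 1000
    print(f"{prompt:>6}-token neighbor joins -> next-token gap ~{gap_ms:,.0f} ms")
```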

Your P99 Is Following a Stranger's Traffic: The Noisy-Neighbor Tax in Hosted LLM Inference

· 10 min read
Tian Pan
Software Engineer

Your dashboards are clean. The deployment from yesterday rolled back cleanly. The model version is pinned. The prompt didn't change. But your TTFT p99 just doubled, your customer success channel is on fire, and the only honest answer you can give is "it's the provider." That answer feels small — like a shrug — and it usually leads to a follow-up question that nobody on your team can answer: prove it.

This is the part of hosted LLM inference that the marketing pages do not discuss. When you call a frontier model API, you are sharing a GPU, a PCIe fabric, a continuous batch, and a KV-cache budget with workloads you cannot see. Your p99 is a function of their bursts. The economics of large-scale inference depend on multiplexing tenants tightly enough that hardware utilization stays north of 60-70%, which means your tail latency is structurally coupled to the largest, jankiest, lumpiest tenant on the same shard. You are not buying capacity; you are buying a slice of a queue that someone else is also standing in.

Inference Is Faster Than Your Database Now

· 10 min read
Tian Pan
Software Engineer

Open any 2024-era AI feature's trace and the model call is the whale. Eight hundred milliseconds of generation surrounded by a thin crust of retrieval, auth, and a database lookup rounding to nothing. Every architecture decision that year — the caching, the prefetching, the streaming UX — was designed around hiding that whale.

Now pull the same trace for the same feature running on a 2026 inference stack. The whale is a dolphin. A cached prefill returns the first token in 180ms. Decode streams at 120 tokens per second. The model is no longer the slow node. Your own infrastructure is, and most of it hasn't noticed.

This reordering is the most important performance shift of the year, and it's the one teams keep under-reacting to. The p99 floor on an AI request is now set by the feature store call, the auth middleware, and the Postgres lookup that was always that slow; nobody cared while the model was taking nine-tenths of the budget.

Your LLM Span Is Lying: What APM Tools Don't Show About Inference Latency

· 8 min read
Tian Pan
Software Engineer

Your LLM call took 2,340 ms. Your APM span says so. That number is the most expensive lie in your observability stack, because four completely different failure modes all render as the same opaque purple bar. A prefill surge on a long prompt. A cold KV-cache on a tenant you haven't hit in an hour. A noisy neighbor in the provider's continuous batch. A silent routing change that parked your traffic in a different region. Same span. Same duration. Same p99 alert. Four different post-mortems.

The distributed-tracing discipline that worked for microservices — one span per network hop, a duration, a few tags — does not survive contact with hosted inference. An LLM call is not one thing. It's a pipeline of phases with radically different scaling characteristics, running on shared hardware whose behavior depends on who else is in the queue. Treating that as a single opaque span is how you end up spending three days debugging "the model got slow" when the model didn't move at all.
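
A sketch of the alternative, using the OpenTelemetry Python API; the attribute names are my own illustrative choices, and the split is limited to what a client can observe, but separating TTFT from decode time already distinguishes a queue-or-prefill problem from a decode problem.

```python
import time

from opentelemetry import trace

tracer = trace.get_tracer("llm.client")

def call_llm(prompt: str, stream) -> str:
    """`stream` is assumed to be a provider streaming iterator yielding text chunks."""
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("llm.prompt_tokens_est", len(prompt) // 4)  # rough estimate
        t0 = time.perf_counter()
        chunks, first_at = [], None
        for chunk in stream:
            if first_at is None:
                first_at = time.perf_counter()
                span.set_attribute("llm.ttft_ms", int((first_at - t0) * 1000))
            chunks.append(chunk)
        done = time.perf_counter()
        span.set_attribute("llm.decode_ms", int((done - (first_at or done)) * 1000))
        span.set_attribute("llm.output_chunks", len(chunks))
        return "".join(chunks)
```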

Time-to-First-Token Is the Latency SLO You Aren't Instrumenting

· 11 min read
Tian Pan
Software Engineer

Pull the last week of production traces and look at your latency dashboard. You almost certainly have p50 and p99 on total request latency. You probably have token throughput. You may even have a tokens-per-second chart, because a provider benchmark talked you into it. What you almost certainly do not have is a per-model, per-route, per-tenant histogram of time to first token — the single number that governs how fast your product feels.

This is not a small oversight. For any streaming interface — chat, code completion, agent sidebars, voice — perceived speed is set by how long the user stares at a blinking cursor before anything appears. Once the first token lands, the user is reading; subsequent tokens compete with their reading speed, not with their patience. Total latency matters for throughput planning and budget. TTFT matters for whether the product feels alive.

The gap between these two numbers is widening. Reasoning models can produce identical total latency to their non-reasoning siblings while pushing TTFT from 400 ms to 30 seconds. A routing change that "keeps latency flat" can silently turn a snappy assistant into a hanging window. If you are not graphing TTFT, you are shipping UX regressions you cannot see.
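
A minimal sketch of the missing instrumentation, assuming you can wrap whatever streaming iterator your client uses; the bucket boundaries and the (model, route) key are illustrative, and a per-tenant dimension hangs off the same key.

```python
import time
from collections import defaultdict

BUCKETS_MS = (100, 200, 400, 800, 1600, 3200, 6400, 12800, float("inf"))
ttft_hist = defaultdict(lambda: [0] * len(BUCKETS_MS))   # key: (model, route)

def record_ttft(model: str, route: str, ttft_ms: float) -> None:
    for i, upper in enumerate(BUCKETS_MS):
        if ttft_ms <= upper:
            ttft_hist[(model, route)][i] += 1
            break

def stream_with_ttft(model: str, route: str, stream):
    """Wrap any streaming iterator; yields chunks through, records TTFT once."""
    t0 = time.perf_counter()
    first = True
    for chunk in stream:
        if first:
            record_ttft(model, route, (time.perf_counter() - t0) * 1000)
            first = False
        yield chunk
```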