
33 posts tagged with "latency"

Voice Agents Are Not Chat Agents With a Microphone: The Half-Duplex Tax

· 10 min read
Tian Pan
Software Engineer

A voice agent that scores perfectly on every transcript-level benchmark can still feel subtly wrong on a real call. The words are right. The reasoning is right. The latency number on your dashboard reads 520ms end-to-end, which was the target. And yet the person on the other end keeps stumbling, talking over the agent, restarting their sentences, hanging up early. The team ships a better model, the numbers improve, the feeling does not.

The reason has almost nothing to do with what the model says and almost everything to do with when it says it. Voice is not text with audio attached. Human conversation runs on a tight half-duplex protocol with barge-in, backchannel, and overlapping speech, and the timing budgets are measured in milliseconds. Most voice agent problems, once you get past the first week of hallucination fixes, are turn-negotiation problems. And turn negotiation is architectural — you cannot prompt your way out of it.
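
As a rough sketch of what that turn negotiation looks like in code, assuming a hypothetical VAD stream and cancellable TTS playback (the names below are made up for illustration):

```python
import asyncio

# Minimal barge-in sketch: speak the agent's reply, but yield the floor the
# moment voice activity is detected on the caller's side. Both interfaces
# below are hypothetical stand-ins, not any particular SDK.

async def vad_detects_speech() -> None:
    """Stand-in for a VAD stream; resolves when the caller starts talking."""
    await asyncio.sleep(0.8)  # pretend the caller barges in after 800ms

async def play_tts(text: str) -> None:
    """Stand-in for streaming TTS playback of the agent's reply."""
    await asyncio.sleep(2.0)  # pretend the reply takes 2s to speak

async def speak_with_barge_in(reply: str) -> bool:
    playback = asyncio.create_task(play_tts(reply))
    barge_in = asyncio.create_task(vad_detects_speech())
    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    if barge_in in done:
        playback.cancel()  # stop talking within the turn-taking budget
        return False       # the caller took the turn back
    barge_in.cancel()
    return True

if __name__ == "__main__":
    finished = asyncio.run(speak_with_barge_in("Your order ships Tuesday."))
    print("agent finished its turn:", finished)
```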

Sequential Tool Call Waterfalls: The Hidden Latency Tax in Agent Loops

· 10 min read
Tian Pan
Software Engineer

If you've profiled an AI agent that felt inexplicably slow, chances are you found a waterfall. The agent called tool A, waited, then called tool B, waited, then called tool C — even though B and C had no dependency on A's result. You just paid 3× the latency for 1× the work.

This pattern is not an edge case. It's the default behavior of virtually every agent framework. The model returns multiple tool calls in a single response, and the execution loop runs them one at a time, in order. Fixing it isn't complicated, but first you need a reliable way to identify which calls are actually independent.
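
A minimal sketch of the fix once you know the calls really are independent; the tool runner and the tool names below are placeholders, not any particular framework's API:

```python
import asyncio

# Sketch: run the independent tool calls from a single model response
# concurrently instead of awaiting them one at a time. `call_tool` is a
# placeholder for whatever your agent loop actually dispatches to.

async def call_tool(name: str, args: dict) -> str:
    await asyncio.sleep(0.3)  # stand-in for a ~300ms tool call
    return f"{name}: ok"

async def run_tool_calls(tool_calls: list[dict]) -> list[str]:
    # Waterfall version, ~0.3s per call:
    #   return [await call_tool(c["name"], c["args"]) for c in tool_calls]
    # Parallel version, ~0.3s total, valid only if the calls are independent:
    return await asyncio.gather(
        *(call_tool(c["name"], c["args"]) for c in tool_calls)
    )

if __name__ == "__main__":
    calls = [{"name": n, "args": {}}
             for n in ("lookup_order", "get_profile", "check_inventory")]
    print(asyncio.run(run_tool_calls(calls)))
```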

Deadline Propagation in Agent Chains: What Happens to Your p95 SLO at Hop Three

· 10 min read
Tian Pan
Software Engineer

Most engineers building multi-step agent pipelines discover the same problem about two weeks after their first production incident: they set a 5-second timeout on their API gateway, their agent pipeline has four hops, and the system behaves as though there is no timeout at all. The agent at hop three doesn't know the upstream caller gave up three seconds ago. It keeps running, keeps calling tools, keeps generating tokens — and the user is already gone.

This isn't a configuration mistake. It's a structural problem. Latency constraints don't propagate across agent boundaries by default, and none of the major orchestration frameworks make deadline propagation easy. The result is a class of failures that looks like latency problems but is actually a context propagation problem.
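
A small sketch of what explicit deadline propagation can look like, with the hop names, budget, and timings invented for illustration:

```python
import time

# Sketch of explicit deadline propagation: the gateway stamps an absolute
# deadline, every hop recomputes the remaining budget, and a hop with no
# budget left fails fast instead of doing work nobody will read.

def remaining(deadline: float) -> float:
    return deadline - time.monotonic()

def run_hop(name: str, deadline: float, work_seconds: float) -> str:
    budget = remaining(deadline)
    if budget <= 0:
        raise TimeoutError(f"{name}: upstream deadline already passed")
    # Clamp this hop's own timeout to what is left, not to a fixed per-hop value.
    time.sleep(min(work_seconds, budget))  # stand-in for model and tool work
    return f"{name} finished with {remaining(deadline):.2f}s left"

if __name__ == "__main__":
    deadline = time.monotonic() + 5.0  # the 5-second budget set at the gateway
    try:
        for hop in ("plan", "retrieve", "call_tools", "synthesize"):
            print(run_hop(hop, deadline, work_seconds=1.8))
    except TimeoutError as err:
        print(err)  # the late hop stops immediately instead of running unwatched
```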

The Agent Loading State Problem: Designing for the 45-Second UX Abyss

· 11 min read
Tian Pan
Software Engineer

There is a hole in your product between second ten and second forty-five where nothing you designed still works. Users abandon a silent UI around the ten-second mark — Jakob Nielsen pinned that threshold back in the nineties, and modern eye-tracking studies have not moved it by more than a second or two. Agent work, meanwhile, routinely takes thirty to one hundred twenty seconds. Multi-step planning, retrieval, a couple of tool calls, maybe a reflection pass before the final write — the latency budget is not a budget anymore, it is a crater.

Most teams discover this the first time they ship an agent feature and watch session recordings. Users hammer the submit button. They paste the query into a second tab. They close the window and retry from scratch, convinced it is broken. The feature works; the waiting does not. The gap between "spinner appeared" and "answer arrived" is the most neglected surface in AI product design, and it is the one that decides whether users perceive your agent as intelligent or stuck.
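
One sketch of designing for that gap: stream a status event for every agent step instead of leaving a terminal spinner. The step names and timings below are invented for illustration:

```python
import asyncio
from typing import AsyncIterator

# Sketch: emit a status event per agent step so the UI always has something
# truthful to show while the final answer is still tens of seconds away.

async def run_agent_with_progress(query: str) -> AsyncIterator[dict]:
    steps = [("planning", 3), ("retrieving documents", 8),
             ("calling tools", 20), ("drafting the answer", 12)]
    for label, seconds in steps:
        yield {"type": "status", "step": label}
        await asyncio.sleep(seconds / 10)  # compressed stand-in for the real work
    yield {"type": "result", "answer": f"final answer for: {query}"}

async def main() -> None:
    async for event in run_agent_with_progress("summarize last quarter's incidents"):
        print(event)  # in practice, pushed to the client over SSE or a WebSocket

if __name__ == "__main__":
    asyncio.run(main())
```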

Hot-Path vs. Cold-Path AI: The Architectural Decision That Decides Your p99

· 10 min read
Tian Pan
Software Engineer

Every AI feature you ship makes an architectural choice before it makes a product one: does this model call live inside the user's request, or does it run somewhere the user isn't waiting for it? The choice is usually made by whoever writes the first prototype, never revisited, and silently determines your p99 latency for the rest of the feature's life. When the post-mortem asks why a shipping dashboard became unusable at 10 a.m. every Monday, the answer is almost always that something which should have been cold-path got welded into the hot path — and a model that is fine at p50 becomes catastrophic at p99 when traffic fans out.

The hot-path / cold-path distinction is older than LLMs. CQRS, streaming architectures, lambda architectures — they all draw the same line between "must respond now" and "can arrive eventually." What's different about AI workloads is that the cost of crossing the line in the wrong direction is an order of magnitude higher than it used to be. A synchronous database query that takes 50 ms turning into 200 ms is a regression. A synchronous LLM call that takes 1.2 s at p50 turning into 11 s at p99 is a business decision you didn't know you made.
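
A toy sketch of the split, with the cache, the queue, and the fake model call all standing in for real infrastructure:

```python
import queue
import threading
import time

# Sketch: the request handler only reads precomputed results (hot path), and
# the model call runs in a background worker the user never waits on (cold path).

cache: dict[str, str] = {}
work_queue: "queue.Queue[str]" = queue.Queue()

def call_model(shipment_id: str) -> str:
    time.sleep(1.2)  # stand-in for a 1.2s-at-p50 LLM call
    return f"summary for {shipment_id}"

def worker() -> None:  # cold path: eventually consistent, invisible to p99
    while True:
        shipment_id = work_queue.get()
        cache[shipment_id] = call_model(shipment_id)
        work_queue.task_done()

def handle_request(shipment_id: str) -> str:  # hot path: must respond now
    if shipment_id in cache:
        return cache[shipment_id]
    work_queue.put(shipment_id)  # schedule the model call, never await it here
    return "summary pending"

if __name__ == "__main__":
    threading.Thread(target=worker, daemon=True).start()
    print(handle_request("SHP-1042"))  # "summary pending", returned immediately
    work_queue.join()
    print(handle_request("SHP-1042"))  # served from cache once the cold path catches up
```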

The max_tokens Knob Nobody Tunes: Output Truncation as a Cost Lever

· 11 min read
Tian Pan
Software Engineer

Look at the max_tokens parameter on every LLM call in your codebase. If you're like most teams, it's either unset, set to the model's maximum, or set to some round number like 4096 that someone picked six months ago and nobody has touched since. It's the one budget knob in your API request that's staring you in the face, and it's silently paying for slack you never use.

Output tokens cost roughly four times what input tokens cost on the median commercial model, and as much as eight times on the expensive end. The economics of the generation step are completely lopsided: every unused token of headroom you leave in max_tokens is a token you might pay for, and every token you generate extends your p50 latency linearly because decoding is sequential. Yet most production systems treat this parameter as a safety valve — set it high, forget it, move on.
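
A back-of-the-envelope sketch of tuning the knob from your own request logs instead of guessing; the percentile, margin, and sample values below are illustrative:

```python
import math

# Sketch: derive max_tokens from what the endpoint actually generates instead
# of a round number. `observed_output_tokens` would come from your request logs.

def percentile(values: list[int], pct: float) -> int:
    ordered = sorted(values)
    idx = min(len(ordered) - 1, math.ceil(pct * len(ordered)) - 1)
    return ordered[idx]

def tuned_max_tokens(observed_output_tokens: list[int],
                     pct: float = 0.99, margin: float = 1.2) -> int:
    # Cover ~99% of real completions, with some headroom, instead of paying
    # for 4096 tokens of slack on every call.
    return int(percentile(observed_output_tokens, pct) * margin)

if __name__ == "__main__":
    observed = [180, 210, 230, 190, 260, 240, 205, 300, 220, 215]
    print(tuned_max_tokens(observed))  # 360 instead of 4096
```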

Designing AI Safety Layers That Don't Kill Your Latency

· 9 min read
Tian Pan
Software Engineer

Most teams reach for guardrails the same way they reach for logging: bolt it on, assume it's cheap, move on. It isn't cheap. A content moderation check takes 10–50ms. Add PII detection, another 20–80ms. Throw in output schema validation and a toxicity classifier and you're looking at 200–400ms of overhead stacked serially before a single token reaches the user. Combine that with a 500ms model response and your "fast" AI feature now feels sluggish.

The instinct to blame the LLM is wrong. The guardrails are the bottleneck. And the fix isn't to remove safety — it's to stop treating safety checks as an undifferentiated pile and start treating them as an architecture problem.
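
A minimal sketch of the difference, assuming the three checks are independent of one another; the checks and their latencies are stand-ins:

```python
import asyncio

# Sketch: run guardrail checks that do not depend on each other concurrently
# instead of stacking them serially.

async def moderation(text: str) -> bool:
    await asyncio.sleep(0.05)  # stand-in for a ~50ms moderation call
    return True

async def pii_scan(text: str) -> bool:
    await asyncio.sleep(0.08)  # stand-in for a ~80ms PII detector
    return True

async def toxicity(text: str) -> bool:
    await asyncio.sleep(0.10)  # stand-in for a ~100ms toxicity classifier
    return True

async def guard(text: str) -> bool:
    # Serial: roughly 230ms of overhead. Concurrent: roughly 100ms, the
    # latency of the slowest single check.
    results = await asyncio.gather(moderation(text), pii_scan(text), toxicity(text))
    return all(results)

if __name__ == "__main__":
    print(asyncio.run(guard("model output goes here")))
```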

Voice AI in Production: Engineering the 300ms Latency Budget

· 10 min read
Tian Pan
Software Engineer

Most teams building voice AI discover the latency problem the same way: in production, with real users. The demo feels fine. The prototype sounds impressive. Then someone uses it on an actual phone call and says it feels robotic — not because the voice sounds bad, but because there's a slight pause before every response that makes the whole interaction feel like talking to someone with a bad satellite connection.

That pause is almost always between 600ms and 1.5 seconds. The target is under 300ms. The gap between those two numbers explains everything about how voice AI systems are actually built.
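
A rough, illustrative accounting of where that pause tends to go; every number below is an assumption made for the arithmetic, not a measurement from any particular stack:

```python
# Compare a naive turn (wait for each stage to finish) with a streaming turn
# (stages overlapped with the user's speech and each other). Numbers are
# assumptions for illustration only.

naive_pipeline_ms = {
    "endpointing (wait for silence)": 500,
    "ASR final transcript": 200,
    "LLM time to first token": 400,
    "TTS time to first audio": 250,
}

streaming_pipeline_ms = {
    "incremental ASR (overlapped with speech)": 0,
    "LLM time to first token": 200,
    "TTS time to first audio chunk": 80,
}

for name, stages in (("naive", naive_pipeline_ms), ("streaming", streaming_pipeline_ms)):
    print(f"{name}: {sum(stages.values())}ms against a 300ms target")
```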

Speculative Execution in AI Pipelines: Cutting Latency by Betting on the Future

· 11 min read
Tian Pan
Software Engineer

Most LLM pipelines are embarrassingly sequential by accident. An agent calls a weather API, waits 300ms, calls a calendar API, waits another 300ms, calls a traffic API, waits again — then finally synthesizes an answer. That 900ms of total latency could have been 300ms if those three calls had run in parallel. Nobody designed the system to be sequential; it just fell out naturally from writing async calls one after another.

Speculative execution is the umbrella term for a family of techniques that cut perceived latency by doing work before you know you need it — running parallel hypotheses, pre-fetching likely next steps, and generating multiple candidate outputs simultaneously. These techniques borrow directly from CPU design, where processors have speculatively executed future instructions since the 1990s. Applied to AI pipelines, the same instinct — commit to likely outcomes, cancel the losers, accept the occasional waste — can produce dramatic speedups. But the coordination overhead can also swallow the gains whole if you're not careful about when to apply them.
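
A small sketch of the speculate-and-cancel flavor, with every function below a made-up placeholder:

```python
import asyncio

# Sketch: start prefetching the result the agent will probably need while the
# routing decision is still in flight, and cancel the prefetch if the bet loses.

async def route_query(query: str) -> str:
    await asyncio.sleep(0.4)  # stand-in for a router / planner decision
    return "calendar"         # the tool the query actually needs

async def prefetch_calendar(query: str) -> str:
    await asyncio.sleep(0.3)  # stand-in for the likely next tool call
    return "calendar events for tomorrow"

async def answer(query: str) -> str:
    speculative = asyncio.create_task(prefetch_calendar(query))  # bet on the common case
    chosen = await route_query(query)
    if chosen == "calendar":
        return await speculative  # bet won: the 300ms already happened in parallel
    speculative.cancel()          # bet lost: accept the wasted work
    return f"run the {chosen} tool instead"

if __name__ == "__main__":
    print(asyncio.run(answer("what's on my schedule tomorrow?")))
```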