
42 posts tagged with "latency"


The CDN Edge Cache Your AI Feature Could Not Use Because the Response Varies Per User

· 9 min read
Tian Pan
Software Engineer

The product team set the SLO for the new AI summarizer at 200ms TTFB because that is what the rest of the product hits at p50. Nobody on the call asked where the 200ms came from. It came from a decade of static assets and JSON responses served out of a CDN edge cache with an 85% hit rate, where most requests never reached origin and the ones that did were small. The summarizer is per-user, generated fresh each call, and travels edge → origin → model provider on every request. The SLO was structurally unmeetable on day one. The team discovered this in week six, after the dashboard had been red the whole time.

This is a recurring pattern in AI feature launches. The latency bar an organization built on top of one set of physics gets inherited by a feature with completely different physics, and the gap between the inherited target and the achievable floor becomes a months-long mitigation project instead of a Day 0 design constraint. The numbers do not care that the SLO was negotiated with a customer in good faith.

The Fine-Tune Cold Start Your Provider Bills as Idle Time

· 11 min read
Tian Pan
Software Engineer

Your fine-tuned variant serves a few hundred requests per minute on a steady weekday, and the p99 latency dashboard is mostly flat. Then, at 03:14 local time on a Tuesday, p99 spikes from 800ms to 4.6 seconds for a single request, then settles back. The next night, it happens again, roughly the same shape, roughly the same hour. You file a ticket against the provider asking about the spike. The response is correct and unhelpful: their dashboard shows nothing anomalous on their side, no rate limits, no incidents, your token usage at the moment of the spike was unremarkable. The 4.6 seconds happened. The bill does not reflect it.

That gap — between a latency event a user clearly experiences and a bill that registers nothing — is the shape of the fine-tune cold start tax. It is not a bug in your code. It is not a regression on the provider's side. It is the seam where two billing models meet: the provider charges you for active inference time on the adapter, and the cost of loading the adapter into a serving slot is hidden inside the provider's infrastructure layer, where it shows up as your latency but their cost. If your traffic shape ever falls below the provider's keep-warm threshold, you pay for the round trip in p99 every time it climbs back.

The Reranker You Added That Slowed Recall More Than It Improved Precision

· 11 min read
Tian Pan
Software Engineer

The offline eval was unambiguous. After bolting a cross-encoder on top of the top-50 from vector search, nDCG@5 went up four points. The team shipped it on a Tuesday. By Thursday, p99 retrieval latency had crossed the SLO by 700 milliseconds, and customer success was forwarding screenshots of empty results pages that the old pipeline would have populated. The graph that mattered — user-perceived answer quality — was down. The reranker was a regression that the team had branded as an improvement, and the eval rubric was the thing that hid the regression in plain sight.

This is one of the most common failure modes in production retrieval, and it is rarely described as what it actually is: an evaluation bug. The reranker did what it was advertised to do. It reordered the top-50 with finer-grained precision. The problem is that the metric used to justify it — offline nDCG, computed at infinite budget, against the full reranked list — describes a world the production system does not live in. In production, the answer that ships is not the best-scored reranked list. It is whatever the system can return before the request deadline. And once you write the metric that way, the reranker's contribution is no longer a four-point lift. It is a curve.
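As a minimal sketch of what "writing the metric that way" could look like: score each query on the list that would actually ship under a given deadline, then sweep the deadline to get the curve. The field names, the latencies, and the fallback policy (serve the un-reranked vector order when the reranker misses the deadline) are assumptions for illustration, not the pipeline described above.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(relevances, k):
    """nDCG@k; returns 0.0 when the list holds no relevant items."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

def ndcg_at_deadline(queries, deadline_ms, k=5):
    """Mean nDCG@k of the list that ships under a request deadline.

    Each query is a dict with hypothetical fields:
      base_rels     - relevance labels in vector-search order
      reranked_rels - the same labels in reranked order
      rerank_ms     - end-to-end latency when the reranker runs
    Assumed policy: if the reranker misses the deadline, the
    vector-search order is served instead.
    """
    scores = []
    for q in queries:
        on_time = q["rerank_ms"] <= deadline_ms
        shipped = q["reranked_rels"] if on_time else q["base_rels"]
        scores.append(ndcg(shipped, k))
    return sum(scores) / len(scores)
```

Evaluating `ndcg_at_deadline` at several budgets (say 100 ms, 500 ms, 1000 ms) gives one point per budget. An "infinite budget" offline score is just the rightmost point of that curve.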

The Voice Agent SLO Defined in Time-to-First-Audio Your Provider Measured in Time-to-First-Token

· 10 min read
Tian Pan
Software Engineer

The product spec says the user hears a response within 600 ms of finishing their sentence. The LLM provider's dashboard reports time-to-first-token at 280 ms. You are comfortably inside SLO on every chart you check. The user still complains the agent is laggy, and when you finally sit on a call yourself, there is a noticeable pause before the voice comes back, somewhere north of 600 ms, every time. The dashboard is not lying. It is measuring a number that does not include the TTS pipeline, the audio transport, or the jitter buffer on the receiving end. The 350 ms gap between the first token streamed and the first audio frame is real; it just is not on the LLM team's chart.

The bug is not in the model. The bug is in the SLO. It was defined at the wrong layer of the stack. The provider's egress is not the user's ear, and any latency contract that pretends otherwise will look healthy in production while the product feels broken.

Where You Defined 'First Token' Decided Whether Your Latency SLO Was Real

· 9 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a reasoning-tier upgrade on a Tuesday and started getting support tickets on Wednesday. Users were saying the assistant felt "broken," "frozen," "hung." The on-call engineer pulled up the latency dashboard and found nothing unusual. p99 first-token latency was 612 ms — comfortably under the 800 ms SLO that the team had spent a quarter establishing. The dashboard was green. The phone was ringing.

The bug turned out to be a single instrumentation decision made fourteen months earlier, before reasoning models existed in production. The metric labeled "first token" measured the timestamp on the first chunk emitted by the provider. After the upgrade, the first chunk was a reasoning token — invisible to the user, never rendered, but counted as "first" by the SLO. The model was emitting four to seven seconds of internal thoughts before the first user-visible character streamed. Every chart stayed green. Every user waited in the dark.

This is not a story about a bad metric. The metric was correct for the model it was designed against. It is a story about what happens when the boundary you instrumented stops being the boundary your users feel — and how dangerously easy it is to ship that drift without noticing.
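A minimal sketch of the distinction, assuming each streamed chunk can be tagged as reasoning or user-visible text. The `Chunk` shape and its `kind` values are hypothetical; real provider SDKs expose reasoning content in their own formats.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

@dataclass
class Chunk:
    kind: str    # hypothetical tag: "reasoning" or "text"
    t_ms: float  # arrival time in ms since the request was sent

def first_token_latencies(
    chunks: Iterable[Chunk],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (first_chunk_ms, first_visible_ms) for one streamed response.

    first_chunk_ms is what a provider-boundary metric records;
    first_visible_ms is when the user first sees a character.
    Either value is None if no such chunk arrived.
    """
    first_chunk = None
    first_visible = None
    for c in chunks:
        if first_chunk is None:
            first_chunk = c.t_ms
        if c.kind == "text":
            first_visible = c.t_ms
            break
    return first_chunk, first_visible

# Illustrative arrival times, not measurements from the incident above.
stream = [Chunk("reasoning", 612), Chunk("reasoning", 3000), Chunk("text", 5200)]
print(first_token_latencies(stream))  # -> (612, 5200)
```

Recording both numbers side by side makes the drift visible: before a reasoning model they coincide, and afterwards they separate by however long the model thinks.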

The Latency Budget Your Agent Loop Stole from the Search Box

· 12 min read
Tian Pan
Software Engineer

The launch metrics looked clean. Answer quality up, citation rate up, the eval suite green. The team that replaced the old keyword search with an agent-backed retriever shipped, took the win, and moved on. Six weeks later somebody noticed the weekly active number on that surface had drifted down twelve percent and nobody could find the regression. There was no regression. The agent worked. The users left because the box that used to answer in two hundred milliseconds now took four seconds, and nothing in the launch retro had a budget for that.

This is the latency-budget transfer problem, and almost nobody draws the org chart that catches it. A search box is not just a function call. It is a thirty-year contract with the user's nervous system: type, see results, scan, click. The 200-millisecond response is not a performance metric on a dashboard somewhere — it is the reason the user's attention is still on the screen when the results arrive. When the team underneath the box replaces a keyword index with an agent loop, the function-call surface looks identical and the SLA on the new call lives in a completely different regime. The latency budget moved from the team that owned the index to the team that owns the agent, and from the team that owns the agent to the user, and the only one who showed up to the meeting was the user.

The Trace That Stops at the Provider Boundary

· 11 min read
Tian Pan
Software Engineer

You did the tracing work. Retrieval has a span. Tool calls have spans. The orchestration loop has a span. A trace ID rides through every internal hop on W3C traceparent headers, just like the SRE playbook says. Then the request hits messages.create, the SDK records a single span called llm.call, and the next 2.8 seconds of your pipeline turn into a black rectangle on the flame graph with no internal structure. The 800 milliseconds before the first token shows up: opaque. The 2 seconds of decode after that: opaque. The share of the wall clock that was network, queue wait, prefill, or per-token decode: unknowable from your trace.

When a customer reports "the assistant felt slow today," your dashboard can confirm the slowness. It cannot localize it. The most expensive minute of your pipeline — measured in dollars, in p95, in user-visible lag — lives inside a vendor's data center, and the contract you accepted when you signed up gives you almost no visibility into it. You are on call for a black box.

Warm Pools and Cold Truths: The Hidden Latency Floor of Serverless LLM Inference

· 9 min read
Tian Pan
Software Engineer

Autoscaling your GPU inference to zero looks like obvious cost discipline. The GPU is the most expensive line item on the bill, traffic is bursty, and the idle hours are pure waste. So you turn on scale-to-zero, watch the cloud invoice drop, and congratulate yourself.

Then a user shows up after a quiet stretch, and their first request takes sixty seconds to return a single token. Production deployments running serverless LLM inference routinely report cold starts exceeding 40 seconds before the first token appears, against roughly 30 milliseconds per token once the model is warm. That is a gap of more than a thousand-fold between the cold path and the warm path (40 s / 30 ms ≈ 1,300), and it is entirely a function of how idle your traffic happens to be.

This is the trade nobody puts on the slide. Scale-to-zero does not eliminate cost; it converts a steady dollar cost into a spiky latency cost, and then hides that latency cost in the p99 tail where the dashboard rarely looks.
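To put a rough number on "how idle your traffic happens to be", here is a back-of-the-envelope sketch. It assumes requests arrive as a Poisson process and that the platform scales to zero after a fixed idle timeout; both are simplifying assumptions, and the function name and parameters are illustrative. Under those assumptions a request lands on the cold path when the gap since the previous request exceeds the timeout, which happens with probability exp(-rate × timeout).

```python
import math

def cold_start_fraction(requests_per_hour: float, idle_timeout_s: float) -> float:
    """Fraction of requests that hit a scaled-to-zero replica.

    Assumes Poisson arrivals, so inter-arrival gaps are exponential and
    P(gap > timeout) = exp(-rate * timeout). Ignores the duration of the
    cold start itself and any provider-side keep-warm behavior.
    """
    rate_per_s = requests_per_hour / 3600.0
    return math.exp(-rate_per_s * idle_timeout_s)

# Example: 12 requests/hour with a 5-minute idle timeout.
print(round(cold_start_fraction(12, 300), 3))  # -> 0.368
```

At that traffic level roughly a third of requests would pay the cold path, far more than a p99 view suggests. The fraction falls off exponentially as traffic or the idle timeout grows.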

The AI Feature With Two Latencies: You Measure One, Your Users Feel the Other

· 9 min read
Tian Pan
Software Engineer

A traditional HTTP request has one latency that matters: the time from request to response. The p95 of that number is the contract. SRE watches it, the SLO is written against it, and when it regresses someone gets paged. One number, one dashboard, one truth.

A streaming AI feature broke that model the moment the response became a stream, and most teams haven't noticed. There are now two latencies, and they diverge. Time-to-first-token is how long the user stares at a spinner before anything happens. Time-to-completion is how long until the answer is fully written. They are shaped by different forces, fixed by different levers, and felt by the user at completely different emotional weights — and almost every team instruments only the second one, because that's the number the HTTP framework hands them for free.
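A minimal sketch of instrumenting both numbers, assuming the response is consumed as a Python iterable of text pieces. The wrapper name, the callback signature, and the injectable clock are choices made for illustration, not a particular framework's API.

```python
import time
from typing import Callable, Iterable, Iterator, Optional

def timed_stream(
    stream: Iterable[str],
    on_done: Callable[[Optional[float], float], None],
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Yield pieces unchanged, then report (ttft_s, total_s) via on_done.

    The clock starts when the consumer begins iterating, because a
    generator body does not run until the first next() call.
    ttft_s is None if the stream produced nothing.
    """
    start = clock()
    ttft = None
    try:
        for piece in stream:
            if ttft is None:
                ttft = clock() - start  # time-to-first-token
            yield piece
    finally:
        on_done(ttft, clock() - start)  # time-to-completion
```

Wrapping the stream at the point where pieces are handed to the client keeps both measurements on the same clock, so the two latencies can be charted and alerted on separately.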

Latency-Aware Tool Selection: When 'Good Enough Now' Beats 'Best Available Later'

· 10 min read
Tian Pan
Software Engineer

The tool description in your agent's system prompt is a six-month-old eval artifact. It says search_pricing returns "fresh inventory data with structured pricing" and the planner believes it, because nothing in the prompt has updated since the day the description was tuned. The actual search_pricing endpoint has been sitting at p95 of 11 seconds for the last forty minutes because the upstream vendor is rate-limiting your account, and the cheaper search_cache tool — which the prompt describes as "may be slightly stale" — would return the same answer in 200ms. The planner picks search_pricing anyway, because the description still reads like it did during eval, and the planner has no signal about what either tool costs to call right now.

This is the structural failure of static tool descriptions. The planner is making routing decisions on a snapshot of a world that has moved on. Tool selection isn't really a capability question — most production agents have two or three tools that overlap heavily in what they can answer — it's a cost-of-waiting question, and the cost of waiting is the thing your prompt template doesn't see.
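One way to give the routing decision a live signal, as a hedged sketch: keep a rolling window of observed latencies per tool and gate the preferred tool on its recent p95. The class, the window size, and the budget-based policy are assumptions for illustration. The tool names come from the example above, but this is not presented as the post's own mechanism.

```python
from collections import deque
from typing import Deque, Dict, List

class LatencyTracker:
    """Rolling window of observed call latencies per tool."""

    def __init__(self, window: int = 50) -> None:
        self._window = window
        self._samples: Dict[str, Deque[float]] = {}

    def record(self, tool: str, latency_ms: float) -> None:
        self._samples.setdefault(tool, deque(maxlen=self._window)).append(latency_ms)

    def p95(self, tool: str) -> float:
        samples = sorted(self._samples.get(tool, ()))
        if not samples:
            return 0.0  # optimistic assumption: unobserved tools get tried
        idx = min(len(samples) - 1, int(0.95 * len(samples)))
        return samples[idx]

def pick_tool(candidates: List[str], tracker: LatencyTracker, budget_ms: float) -> str:
    """Pick the first tool whose recent p95 fits the budget.

    candidates is ordered by preference (best expected answer first).
    If nothing fits, fall back to the tool with the lowest recent p95.
    """
    for tool in candidates:
        if tracker.p95(tool) <= budget_ms:
            return tool
    return min(candidates, key=tracker.p95)
```

With `search_pricing` recently observed at 11 seconds and `search_cache` at 200 ms, a 2-second budget routes to `search_cache`, and routing returns to `search_pricing` once its window recovers. The same p95 figures could instead be injected into the tool descriptions so the planner sees them directly.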

The Latency Budget Negotiation: How to Tell Product That 'Real-Time' Costs Capability

· 11 min read
Tian Pan
Software Engineer

A product manager walks into a planning meeting with a one-line requirement: "responses under two seconds, like ChatGPT." The agent under discussion makes six tool calls, hits two retrieval indexes, runs a reasoning model with a thinking budget, and validates its output with a second-pass critic. End-to-end p50 is currently nine seconds. The engineering team has three options: say yes and quietly degrade the agent into something worse, say no and watch the PM go shopping for a vendor whose demo video promises the moon, or do the thing nobody teaches in onboarding — open a structured negotiation where every second of latency is convertible to a capability the agent gives up.

Most teams pick option one. The agent ships at two seconds, accuracy drops twelve points, the launch is called a success because the headline latency number was met, and three months later the team is fighting a quality regression that nobody can attribute to a single change because the regression was the launch itself. The latency target was never priced. It was inherited from a product spec that treated speed as free.

The MCP Cold Start Tax: How Tool-Server Overhead Compounds by Agent Step 7

· 11 min read
Tian Pan
Software Engineer

A 200-millisecond tool call looks like noise on a flame graph. Stack seven of them in an agent loop and the noise becomes the signal — the model finishes thinking in 800ms but the user waits 4.5 seconds because every tool invocation re-pays a startup cost the first call already absorbed. The cruel part is that this cost doesn't show up in any single trace as anomalous. It shows up as the difference between a snappy demo and a sluggish production agent, and most teams blame the model.

The Model Context Protocol has become the default integration surface for agent tooling, which means it has also become the default place where latency goes to die. MCP's design — JSON-RPC over stdio or streamable HTTP, capability negotiation, dynamic tool discovery — is correct for a protocol that has to bridge arbitrary clients and servers. But the per-call cost structure it implies is hostile to the access pattern that agents actually have, which is not "one tool call per session" but "seven tool calls per turn for forty turns per session."

This post is about that mismatch: where the cold start tax actually lives, why it compounds rather than amortizes in long-running agents, and the warm-pool discipline that turns a multi-second penalty into a sub-100ms one.