
41 posts tagged with "performance"


The Tokens-Per-Second SLO Your Provider Met By Chunking Smaller

· 11 min read
Tian Pan
Software Engineer

Your provider's status page is green. The tokens-per-second dashboard shows the same flat line it always has. The SLA report says you are well within the contracted rate. And yet the support queue is filling up with users describing the chat output as "twitchy," "stuttery," "worse than last week." Nothing in your monitoring agrees with them, because nothing in your monitoring is measuring what they are actually looking at.

This is the failure mode that nobody noticed the provider ship. They did not break the rate. They renegotiated the unit. The same number of tokens is arriving per second, but the tokens now arrive as a stream of single-token chunks instead of the four-token chunks the renderer was tuned for. Average throughput is intact. Perceptual quality is destroyed. The SLO held because the SLO was written against the wire, and the wire is the part of the system the provider owns.
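A minimal sketch of why the dashboard and the users disagree. It assumes you can log each chunk's arrival time in milliseconds and its token count; the function name `stream_stats` and the two synthetic streams are illustrative, not anything a provider exposes.

```python
def stream_stats(chunks, start_ms=0):
    """chunks: list of (arrival_ms, tokens_in_chunk), sorted by arrival time."""
    total_tokens = sum(tokens for _, tokens in chunks)
    duration_ms = chunks[-1][0] - start_ms
    return {
        # The number the SLO is written against.
        "tokens_per_second": total_tokens * 1000 / duration_ms,
        # The numbers the renderer actually feels.
        "chunks_per_second": len(chunks) * 1000 / duration_ms,
        "mean_chunk_tokens": total_tokens / len(chunks),
    }

# Synthetic streams (assumed values): 40 tokens delivered over one second.
four_token_chunks = [(100 * i, 4) for i in range(1, 11)]   # one chunk every 100 ms
one_token_chunks = [(25 * i, 1) for i in range(1, 41)]     # one chunk every 25 ms

before = stream_stats(four_token_chunks)
after = stream_stats(one_token_chunks)
```

Both streams report the same tokens per second, while the chunk rate quadruples and the mean chunk size drops from four tokens to one. Tracking the last two alongside the contracted rate is one way to make this kind of change visible.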

The Slow Turn That Wasn't Yours: KV Cache Eviction Mid-Conversation

· 10 min read
Tian Pan
Software Engineer

A conversation has been moving along on a single Claude session for forty minutes. Eleven turns, each averaging 800ms time-to-first-token, each cheap because the 28,000-token prefix is hitting the prompt cache. Turn twelve arrives and TTFT is 3.4 seconds. The transcript hasn't changed shape. The model didn't switch. The network is fine. Cached input tokens drop from 27,800 to 0. Turn twelve's prefill bill is paid in full, from the first token.

You go looking for the cause in your traces and find nothing that names it. There is no event in your logs labeled "another tenant's burst evicted you." The only honest reading of the spike is that some other customer's prompt, somewhere on the same GPU pool, made the scheduler decide your warm prefix was the cheapest thing to drop. You cannot replay the turn. You cannot prove the eviction. The cache state at that moment was a function of strangers' traffic, and that traffic is not in your trace because it was never yours to see.
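You cannot prove the eviction, but you can at least record the symptom. The sketch below assumes each turn's usage record has already been flattened into a dict with a cached-input-token count and a TTFT; the key names and the `find_cache_drops` helper are hypothetical, so map them onto whatever your provider's usage object reports.

```python
def find_cache_drops(turns, drop_ratio=0.5):
    """Return turn numbers where cached input tokens collapsed versus the prior turn.

    turns: list of dicts with 'turn', 'cached_input_tokens', and 'ttft_ms' keys
    (hypothetical field names for this sketch).
    """
    drops = []
    for prev, cur in zip(turns, turns[1:]):
        was_warm = prev["cached_input_tokens"] > 0
        collapsed = cur["cached_input_tokens"] < drop_ratio * prev["cached_input_tokens"]
        if was_warm and collapsed:
            drops.append(cur["turn"])
    return drops

# Values taken from the scenario above.
session = [
    {"turn": 10, "cached_input_tokens": 27_800, "ttft_ms": 800},
    {"turn": 11, "cached_input_tokens": 27_800, "ttft_ms": 800},
    {"turn": 12, "cached_input_tokens": 0, "ttft_ms": 3400},
]
suspect_turns = find_cache_drops(session)
```

This flags turn twelve. It says nothing about why the prefix was dropped, which is the point: the cause lives in traffic you cannot see, so the most you can log is the moment the cache stopped answering.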

The MCP Cold Start Tax: How Tool-Server Overhead Compounds by Agent Step 7

· 11 min read
Tian Pan
Software Engineer

A 200-millisecond tool call looks like noise on a flame graph. Stack seven of them in an agent loop and the noise becomes the signal — the model finishes thinking in 800ms but the user waits 4.5 seconds because every tool invocation re-pays a startup cost the first call already absorbed. The cruel part is that this cost doesn't show up in any single trace as anomalous. It shows up as the difference between a snappy demo and a sluggish production agent, and most teams blame the model.

The Model Context Protocol has become the default integration surface for agent tooling, which means it has also become the default place where latency goes to die. MCP's design — JSON-RPC over stdio or streamable HTTP, capability negotiation, dynamic tool discovery — is correct for a protocol that has to bridge arbitrary clients and servers. But the per-call cost structure it implies is hostile to the access pattern that agents actually have, which is not "one tool call per session" but "seven tool calls per turn for forty turns per session."

This post is about that mismatch: where the cold start tax actually lives, why it compounds rather than amortizes in long-running agents, and the warm-pool discipline that turns a multi-second penalty into a sub-100ms one.
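As a rough illustration of the warm-pool idea, here is a sketch that keeps one live session per tool server and reuses it across calls. It is not an MCP client: `WarmPool` and `fake_connect` are hypothetical names, and the connect function stands in for whatever process spawn and capability negotiation your real client performs.

```python
class WarmPool:
    """Keep one already-initialized session per server and hand it back on reuse."""

    def __init__(self, connect):
        self._connect = connect   # expensive: spawn the server, negotiate, discover tools
        self._sessions = {}
        self.cold_starts = 0

    def get(self, server):
        if server not in self._sessions:
            self._sessions[server] = self._connect(server)
            self.cold_starts += 1
        return self._sessions[server]


def fake_connect(server):
    # Hypothetical stand-in for starting a tool server and completing its handshake.
    return {"server": server, "ready": True}


pool = WarmPool(fake_connect)
for turn in range(40):          # forty turns per session
    for call in range(7):       # seven tool calls per turn
        session = pool.get("search-tools")

cold_starts_without_pool = 40 * 7   # every call pays the startup cost
```

With the pool, the startup cost is paid once for the session instead of 280 times. A production version would also need health checks, idle eviction, and reconnect handling, all of which this sketch leaves out.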

LLM Tail Latency: Why Your P99 Is a Disaster When P50 Looks Fine

· 10 min read
Tian Pan
Software Engineer

Your LLM API returns a median (P50) latency of 800 milliseconds. Your dashboard is green. Your SLAs say "under two seconds." Then a user files a support ticket: "it just spins for thirty seconds and then gives up." You check the logs and see a P99 of 28 seconds.

That gap — a 35x ratio between median and tail latency — is not a fluke. It is a structural property of how LLMs work, and it will not go away by tuning your timeouts.

Profiling LLM Pipelines: The Bottlenecks That Aren't Inference

· 8 min read
Tian Pan
Software Engineer

Your team just spent three weeks optimizing inference. You swapped to a quantized model, tuned your batching policy, squeezed 12% off time-to-first-token, and shipped it. Then you looked at the actual user-facing latency and it barely moved.

This is the inference trap. It's the most common profiling failure mode in LLM-powered applications, and it happens because engineers measure what's easy to measure — GPU utilization, inference throughput, tokens per second — rather than what's actually slow. In a typical RAG pipeline, inference accounts for around 80% of latency when you include everything that touches the GPU. But that remaining 20% is often distributed across six or seven stages that nobody is tracing. Each one seems small in isolation, but together they dominate the optimization opportunity.

The Context Length Arms Race: Why Filling the Window Is the Wrong Goal

· 7 min read
Tian Pan
Software Engineer

Every six months, a model ships with a bigger context window. GPT-4.1 hit 1 million tokens. Gemini 2.5 followed at 2 million. Llama 4 is now advertising 10 million. The implicit promise is: dump everything in, stop worrying about what to include, let the model figure it out.

That promise does not hold up in production. A 2024 study evaluating 18 leading LLMs found that every single model showed performance degradation as input length increased. Not some models — every model. The context window is a ceiling, not a floor, and the teams that treat it as a floor are discovering that the hard way.

End-to-End Latency Is Not P99 of Your LLM Call: The Multipliers Nobody Measures in Agentic Systems

· 9 min read
Tian Pan
Software Engineer

Your LLM API call completes in 500ms at P99. Your users are waiting 12 seconds. Both numbers are accurate, and neither is lying to you — they're just measuring completely different things. The gap between them is where most agentic systems silently bleed performance, and most teams never instrument it.

The problem is structural: P99 LLM latency is a single-call metric applied to a multi-step execution model. A ReAct agent making five sequential tool calls, retrying a hallucinated function, assembling a growing context, and generating a 300-token reasoning chain is not one LLM call. It's a distributed workflow where the LLM is just one node, and every other node has its own latency tax.

The Parallelism Trap in Agentic Pipelines: When Fan-Out Makes Latency Worse

· 8 min read
Tian Pan
Software Engineer

Your agent pipeline is slow, so you split the work across five parallel sub-agents. The p50 drops. You ship it. Three days later, an on-call page fires: a batch of user requests is timing out. You dig in and find that p99 has climbed from 4 seconds to 22 seconds. Nothing in the individual agents changed. The timeout was caused by the orchestration layer waiting for the slowest of the five, which ran into a retrieval hiccup that happens on only 1% of calls. If each of the five paths carries that 1% risk independently, roughly 5% of requests now hit it on at least one branch.

This is the parallelism trap: a pattern that looks like an obvious speedup but restructures your latency distribution in ways that hurt real users more than the p50 improvement helps them. Across production benchmarks, single agents match or outperform multi-agent pipelines on 64% of evaluated tasks. When parallel fan-out wins, it wins cleanly — but only for a specific class of problems. The mistake is treating fan-out as the default.
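The arithmetic behind the trap is short enough to write down. The sketch assumes each branch hits its slow path independently with the same probability, which real pipelines only approximate; `p_any_slow` is an illustrative helper name.

```python
def p_any_slow(p_slow, branches):
    """Probability that at least one of `branches` independent calls hits the slow path."""
    return 1.0 - (1.0 - p_slow) ** branches

single = p_any_slow(0.01, 1)   # one agent: the hiccup stays beyond the 99th percentile
fanout = p_any_slow(0.01, 5)   # five parallel agents, orchestrator waits for all of them
```

With five branches the chance of waiting on at least one slow path rises to about 4.9%. An event that used to sit at the edge of the p99 now lands well inside it, which is how a fan-out can improve the median and wreck the tail at the same time.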

The Preprocessing Bottleneck That Kills AI Pipeline Throughput

· 10 min read
Tian Pan
Software Engineer

A team builds a RAG-backed feature, measures end-to-end latency, finds it unacceptably slow, and immediately starts optimizing the model call. They try a smaller model, batch requests, tune temperature and token limits. After two sprints of work, latency drops by 15%. The feature is still too slow. What they never measured: the 600ms they're spending chunking text and generating embeddings before the LLM ever receives a prompt.

This pattern is common enough that it has a name in distributed systems: optimizing the wrong component. In AI pipelines, the LLM call is visible and easy to measure. Everything before it is invisible until you explicitly instrument it — and that's exactly where throughput dies.
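Instrumenting the stages before the model call does not need a tracing stack to get started. The sketch below is one minimal way to do it with a context manager; the stage names, the 200-character chunker, and the placeholder "embedding" step are assumptions made up for the example.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage, sink):
    """Accumulate wall-clock seconds spent in `stage` into the `sink` dict."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[stage] = sink.get(stage, 0.0) + (time.perf_counter() - start)

timings = {}
text = "lorem " * 1000  # 6,000 characters of stand-in document text

with timed("chunk", timings):
    chunks = [text[i:i + 200] for i in range(0, len(text), 200)]

with timed("embed", timings):
    # Placeholder for a real embedding call; here it just measures each chunk.
    vectors = [len(chunk) for chunk in chunks]

slowest_stage = max(timings, key=timings.get)
```

Once every pre-LLM stage writes into the same `timings` dict, the comparison against the model call becomes a lookup instead of a guess.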

Latency Budgets for Multi-Step Agents: Why P50 Lies and P99 Is What Users Feel

· 10 min read
Tian Pan
Software Engineer

The dashboard said the agent was fast. P50 sat at 1.2 seconds, the team had a meeting to celebrate, and then the abandonment rate kept climbing. Nobody was looking at the graph the user actually lives on.

This is the reliable failure mode of multi-step agents in production: the median is the metric you can hit, the tail is the metric your users feel, and the gap between the two grows non-linearly with every sub-call you bolt onto the pipeline. A four-step agent where each step is "fast at the median" routinely produces a P99 that is six or eight times worse than any single step. Users do not experience the median. They experience the worst step in their particular trip.

If your team optimizes the wrong percentile, you will ship a system that benchmarks well, demos beautifully, and bleeds users in the long tail you never instrumented.
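A small simulation makes the widening gap concrete. It assumes every step's latency is an independent draw from the same lognormal distribution with a 300ms median; the distribution, its parameters, and the helper names are all assumptions for illustration, so the exact numbers mean nothing beyond the direction they move in.

```python
import math
import random

def percentile(values, q):
    """Nearest-rank style percentile on a list of numbers."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

def simulate(steps, trials=20_000, median_ms=300.0, sigma=0.8, seed=0):
    """Return (p50, p99) of end-to-end latency for `steps` sequential calls."""
    rng = random.Random(seed)
    mu = math.log(median_ms)  # a lognormal's median is exp(mu)
    totals = [
        sum(rng.lognormvariate(mu, sigma) for _ in range(steps))
        for _ in range(trials)
    ]
    return percentile(totals, 0.50), percentile(totals, 0.99)

p50_one, p99_one = simulate(steps=1)
p50_four, p99_four = simulate(steps=4)
gap_one = p99_one - p50_one
gap_four = p99_four - p50_four
```

Under these assumptions the distance between median and tail is larger for the four-step pipeline than for a single step. Swapping in latency samples from your own traces in place of the lognormal draw is the more useful version of this exercise.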

Your Vector Store Has Hot Keys: Why ANN Indexes Lie About Production Cost

· 10 min read
Tian Pan
Software Engineer

The vector index your team picked was benchmarked on a workload that doesn't exist in production. Every public ANN benchmark — VIBE, ann-benchmarks, the comparison table on the database vendor's landing page — runs queries sampled uniformly from the corpus, so every neighbor lookup costs roughly the same and every shard sees roughly equal load. Real retrieval traffic does not look like that. It looks Zipfian: a small fraction of queries (today's news, the trending product, the recurring support intent, the few hundred questions a customer support team gets all day) hits a small fraction of embeddings a hundred times more often than the median. The benchmark says HNSW recall is 0.97 at 50ms p99. Production says one shard is melting and the rest are bored.

The mismatch is not a tuning problem. It's that vector retrieval inherits the access-skew profile of every other database workload, and the indexes the field has standardized on were not designed with that profile in mind. The cache layer your KV store gets for free — the OS page cache warming up the rows you read most often, the LRU on a hot key — does not exist for ANN, because the graph is walked in graph order, not access order. The hot embeddings stay cold in memory because the search algorithm's traversal pattern looks random to the page cache, and your "popular" cluster lives on a single shard whose CPU runs hot while the rest of the fleet idles.
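To see how far skew moves shard load away from what a uniform benchmark implies, the sketch below compares expected load shares analytically. It assumes 1,000 embeddings, four shards, a Zipf exponent of 1, and a placement where the most popular items sit contiguously on the first shard; all of these are illustrative choices, not measurements.

```python
def shard_shares(weights, num_shards):
    """Expected fraction of queries landing on each shard.

    weights[i] is the relative query frequency of item i; items are placed
    contiguously, and len(weights) is assumed to divide evenly by num_shards.
    """
    per_shard = len(weights) // num_shards
    total = sum(weights)
    return [
        sum(weights[i * per_shard:(i + 1) * per_shard]) / total
        for i in range(num_shards)
    ]

items = 1000
uniform = [1.0] * items                                # what benchmarks sample
zipf = [1.0 / rank for rank in range(1, items + 1)]    # assumed production skew

uniform_shares = shard_shares(uniform, 4)
zipf_shares = shard_shares(zipf, 4)
```

Under uniform sampling each shard takes a quarter of the traffic. Under the assumed Zipf weights the first shard takes roughly four fifths of it, which is the "one shard is melting and the rest are bored" picture in numbers.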

Agent Latency Budgets Are Trees, Not Lines — You Have Been Debugging the Wrong Axis

· 12 min read
Tian Pan
Software Engineer

A user reports "the assistant felt slow this morning." The on-call engineer pulls up the flame graph, sorts tool calls by duration descending, finds the slowest one — a 2.1-second vector search — optimizes it down to 900ms, ships the fix, and marks the incident resolved. A week later the same complaint arrives. The vector search is still 900ms. But the end-to-end latency on that query type has actually gotten worse. Nothing in the flame graph explains why.

This is what happens when an engineer debugs a tree on the line axis. Agent latency is not a waterfall of sequential steps — it is a nested tree of planning calls, tool subtrees, parallel fan-outs, retries, and recursive sub-agents. When the budget is structural but the tooling treats it as linear, local optimizations miss the actual violation, which lives in how time is distributed across branches, not how long any single call takes. You can make every leaf faster and still ship a p99 that is getting worse.
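One way to see the tree axis is to compute end-to-end time from the structure itself. The sketch below models a turn as nested spans where sequential children add and parallel children take the maximum; the `Span` type, the span names, and every duration other than the 2.1-second and 900ms vector search are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    own_ms: int = 0
    parallel: bool = False              # True: children run concurrently
    children: list = field(default_factory=list)

def total_ms(span):
    """Own time plus children: max over parallel children, sum over sequential ones."""
    if not span.children:
        return span.own_ms
    child_times = [total_ms(child) for child in span.children]
    combined = max(child_times) if span.parallel else sum(child_times)
    return span.own_ms + combined

def build_turn(vector_search_ms):
    return Span("turn", children=[
        Span("plan", 400),
        Span("fan_out", parallel=True, children=[
            Span("vector_search", vector_search_ms),
            Span("sub_agent", children=[Span("llm", 1500), Span("tool", 1000)]),
        ]),
        Span("answer", 600),
    ])

before_fix = total_ms(build_turn(2100))
after_fix = total_ms(build_turn(900))
```

Both versions come out at 3,500ms. The vector search was the slowest single call, but it sat next to a 2,500ms sub-agent branch in the same fan-out, so cutting it changed nothing the user could feel. Sorting calls by duration cannot show that; walking the tree can.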