39 posts tagged with "performance"

The MCP Cold Start Tax: How Tool-Server Overhead Compounds by Agent Step 7

· 11 min read
Tian Pan
Software Engineer

A 200-millisecond tool call looks like noise on a flame graph. Stack seven of them in an agent loop and the noise becomes the signal — the model finishes thinking in 800ms but the user waits 4.5 seconds because every tool invocation re-pays a startup cost the first call already absorbed. The cruel part is that this cost doesn't show up in any single trace as anomalous. It shows up as the difference between a snappy demo and a sluggish production agent, and most teams blame the model.

The Model Context Protocol has become the default integration surface for agent tooling, which means it has also become the default place where latency goes to die. MCP's design — JSON-RPC over stdio or streamable HTTP, capability negotiation, dynamic tool discovery — is correct for a protocol that has to bridge arbitrary clients and servers. But the per-call cost structure it implies is hostile to the access pattern that agents actually have, which is not "one tool call per session" but "seven tool calls per turn for forty turns per session."

This post is about that mismatch: where the cold start tax actually lives, why it compounds rather than amortizes in long-running agents, and the warm-pool discipline that turns a multi-second penalty into a sub-100ms one.
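
To make the warm-pool idea concrete, here is a minimal Python sketch. It is not taken from any MCP SDK: `WarmPool` and `fake_start` are hypothetical names, and `fake_start` stands in for whatever actually spawns the tool server and runs capability negotiation. The only point it illustrates is that the startup cost is paid on the first call and reused afterwards.

```python
from typing import Callable, Dict, Generic, TypeVar

S = TypeVar("S")


class WarmPool(Generic[S]):
    """Keeps one started session per server name so startup is paid once."""

    def __init__(self, start: Callable[[str], S]) -> None:
        self._start = start                # hypothetical "spawn + handshake" function
        self._sessions: Dict[str, S] = {}  # server name -> live session
        self.cold_starts = 0               # how many times startup was actually paid

    def get(self, server: str) -> S:
        session = self._sessions.get(server)
        if session is None:
            # Cold path: start the server and remember the session.
            session = self._start(server)
            self._sessions[server] = session
            self.cold_starts += 1
        return session


def fake_start(server: str) -> dict:
    # Stand-in for process spawn and capability negotiation (assumption).
    return {"server": server, "ready": True}


pool = WarmPool(fake_start)
for _ in range(7):  # seven tool calls in one agent turn
    pool.get("search")

print(pool.cold_starts)
```

A real pool would also need health checks, eviction, and concurrency control, all of which this sketch leaves out.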

LLM Tail Latency: Why Your P99 Is a Disaster When P50 Looks Fine

· 10 min read
Tian Pan
Software Engineer

Your LLM API returns a median (P50) latency of 800 milliseconds. Your dashboard is green. Your SLAs say "under two seconds." Then a user files a support ticket: "it just spins for thirty seconds and then gives up." You check the logs and see a P99 of 28 seconds.

That gap — a 35x ratio between median and tail latency — is not a fluke. It is a structural property of how LLMs work, and it will not go away by tuning your timeouts.
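
A small sketch of how a median can look healthy while the tail does not. The sample below is synthetic, chosen only to reproduce the 800 ms and 28 s figures above, and `percentile` is a hypothetical helper that uses the nearest-rank method.

```python
from typing import List


def percentile(samples: List[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(samples)
    # Integer ceiling of q * n / 100, avoiding float rounding surprises.
    rank = max(1, (q * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Synthetic latencies in seconds: 98 fast requests, 2 very slow ones.
latencies = [0.8] * 98 + [28.0] * 2

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
ratio = p99 / p50

print(f"P50={p50}s P99={p99}s ratio={ratio:.0f}x")
```

Two slow requests out of a hundred leave the median untouched and still set the P99.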

Profiling LLM Pipelines: The Bottlenecks That Aren't Inference

· 8 min read
Tian Pan
Software Engineer

Your team just spent three weeks optimizing inference. You swapped to a quantized model, tuned your batching policy, squeezed 12% off time-to-first-token, and shipped it. Then you looked at the actual user-facing latency and it barely moved.

This is the inference trap. It's the most common profiling failure mode in LLM-powered applications, and it happens because engineers measure what's easy to measure — GPU utilization, inference throughput, tokens per second — rather than what's actually slow. In a typical RAG pipeline, inference accounts for around 80% of latency when you include everything that touches the GPU. But that remaining 20% is often distributed across six or seven stages that nobody is tracing. Each one seems small in isolation, but together they dominate the optimization opportunity.

The Context Length Arms Race: Why Filling the Window Is the Wrong Goal

· 7 min read
Tian Pan
Software Engineer

Every six months, a model ships with a bigger context window. GPT-4.1 hit 1 million tokens. Gemini 2.5 followed at 2 million. Llama 4 is now advertising 10 million. The implicit promise is: dump everything in, stop worrying about what to include, let the model figure it out.

That promise does not hold up in production. A 2024 study evaluating 18 leading LLMs found that every single model showed performance degradation as input length increased. Not some models — every model. The context window is a ceiling, not a floor, and the teams that treat it as a floor are discovering that the hard way.

End-to-End Latency Is Not P99 of Your LLM Call: The Multipliers Nobody Measures in Agentic Systems

· 9 min read
Tian Pan
Software Engineer

Your LLM API call completes in 500ms at P99. Your users are waiting 12 seconds. Both numbers are accurate, and neither is lying to you — they're just measuring completely different things. The gap between them is where most agentic systems silently bleed performance, and most teams never instrument it.

The problem is structural: P99 LLM latency is a single-call metric applied to a multi-step execution model. A ReAct agent making five sequential tool calls, retrying a hallucinated function, assembling a growing context, and generating a 300-token reasoning chain is not one LLM call. It's a distributed workflow where the LLM is just one node, and every other node has its own latency tax.

The Parallelism Trap in Agentic Pipelines: When Fan-Out Makes Latency Worse

· 8 min read
Tian Pan
Software Engineer

Your agent pipeline is slow, so you split the work across five parallel sub-agents. The p50 drops. You ship it. Three days later, an on-call page fires: a batch of user requests is timing out. You dig in and find that p99 has climbed from 4 seconds to 22 seconds. Nothing in the individual agents changed. The timeouts come from the orchestration layer waiting for the slowest of the five branches. Each branch hits a retrieval hiccup only 1% of the time, but a request that touches all five paths has roughly a 5% chance of hitting it at least once.

This is the parallelism trap: a pattern that looks like an obvious speedup but restructures your latency distribution in ways that hurt real users more than the p50 improvement helps them. Across production benchmarks, single agents match or outperform multi-agent pipelines on 64% of evaluated tasks. When parallel fan-out wins, it wins cleanly — but only for a specific class of problems. The mistake is treating fan-out as the default.
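
The arithmetic behind the trap fits in a few lines. This sketch assumes the branches fail independently, which real systems may not satisfy, and `p_any_slow` is a hypothetical helper name.

```python
def p_any_slow(p_slow: float, branches: int) -> float:
    """Probability that at least one of `branches` independent branches is slow."""
    return 1.0 - (1.0 - p_slow) ** branches


single = p_any_slow(0.01, 1)   # one agent, 1% slow path
fan_out = p_any_slow(0.01, 5)  # orchestrator waits on all five branches

print(f"1 branch: {single:.4f}, 5 branches: {fan_out:.4f}")
```

Because the orchestrator waits for the slowest branch, a 1% event per branch becomes close to a 5% event per request.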

The Preprocessing Bottleneck That Kills AI Pipeline Throughput

· 10 min read
Tian Pan
Software Engineer

A team builds a RAG-backed feature, measures end-to-end latency, finds it unacceptably slow, and immediately starts optimizing the model call. They try a smaller model, batch requests, tune temperature and token limits. After two sprints of work, latency drops by 15%. The feature is still too slow. What they never measured: the 600ms they're spending chunking text and generating embeddings before the LLM ever receives a prompt.

This pattern is common enough that it has a name in distributed systems: optimizing the wrong component. In AI pipelines, the LLM call is visible and easy to measure. Everything before it is invisible until you explicitly instrument it — and that's exactly where throughput dies.
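
One way to make the pre-LLM stages visible is to time each one explicitly. The sketch below uses only the standard library; `stage`, `chunk_text`, and `fake_embed` are hypothetical names, and `fake_embed` is a placeholder for a real embedding call.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List

timings: Dict[str, float] = {}  # stage name -> accumulated seconds


@contextmanager
def stage(name: str) -> Iterator[None]:
    """Accumulate wall-clock time spent inside the `with` block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)


def chunk_text(text: str, size: int) -> List[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]


def fake_embed(chunks: List[str]) -> List[List[float]]:
    # Placeholder for a real embedding call (assumption).
    return [[float(len(c))] for c in chunks]


document = "lorem ipsum " * 200

with stage("chunk"):
    chunks = chunk_text(document, 200)
with stage("embed"):
    vectors = fake_embed(chunks)

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.3f} ms")
```

Wrapping every stage this way, including the LLM call itself, gives a per-stage breakdown instead of a single end-to-end number.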

Latency Budgets for Multi-Step Agents: Why P50 Lies and P99 Is What Users Feel

· 10 min read
Tian Pan
Software Engineer

The dashboard said the agent was fast. P50 sat at 1.2 seconds, the team had a meeting to celebrate, and then the abandonment rate kept climbing. Nobody was looking at the graph the user actually lives on.

This is the reliable failure mode of multi-step agents in production: the median is the metric you can hit, the tail is the metric your users feel, and the gap between the two grows non-linearly with every sub-call you bolt onto the pipeline. A four-step agent where each step is "fast at the median" routinely produces a P99 that is six to eight times worse than any single step. Users do not experience the median. They experience the worst step in their particular trip.

If your team optimizes the wrong percentile, you will ship a system that benchmarks well, demos beautifully, and bleeds users in the long tail you never instrumented.

Your Vector Store Has Hot Keys: Why ANN Indexes Lie About Production Cost

· 10 min read
Tian Pan
Software Engineer

The vector index your team picked was benchmarked on a workload that doesn't exist in production. Every public ANN benchmark — VIBE, ann-benchmarks, the comparison table on the database vendor's landing page — runs queries sampled uniformly from the corpus, so every neighbor lookup costs roughly the same and every shard sees roughly equal load. Real retrieval traffic does not look like that. It looks Zipfian: a small fraction of queries (today's news, the trending product, the recurring support intent, the few hundred questions a customer support team gets all day) hits a small fraction of embeddings a hundred times more often than the median. The benchmark says HNSW recall is 0.97 at 50ms p99. Production says one shard is melting and the rest are bored.

The mismatch is not a tuning problem. It's that vector retrieval inherits the access-skew profile of every other database workload, and the indexes the field has standardized on were not designed with that profile in mind. The cache layer your KV store gets for free — the OS page cache warming up the rows you read most often, the LRU on a hot key — does not exist for ANN, because the graph is walked in graph order, not access order. The hot embeddings stay cold in memory because the search algorithm's traversal pattern looks random to the page cache, and your "popular" cluster lives on a single shard whose CPU runs hot while the rest of the fleet idles.
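
To get a feel for how concentrated Zipfian traffic is, the sketch below computes the share of lookups that land on the most popular keys. The exponent of 1.0 and the corpus size are assumptions picked for illustration, not measurements, and `zipf_top_share` is a hypothetical helper.

```python
def zipf_top_share(n_keys: int, top_k: int, exponent: float = 1.0) -> float:
    """Fraction of lookups hitting the top_k most popular of n_keys keys."""
    total = 0.0
    top = 0.0
    for rank in range(1, n_keys + 1):
        weight = 1.0 / rank ** exponent  # Zipf: popularity falls off with rank
        total += weight
        if rank <= top_k:
            top += weight
    return top / total


share = zipf_top_share(100_000, 1_000)  # top 1% of a 100k-key corpus
print(f"top 1% of keys receive {share:.0%} of lookups")
```

Under these assumptions the top 1% of keys absorb about 62% of lookups, which is the kind of skew a uniformly sampled benchmark never exercises.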

Agent Latency Budgets Are Trees, Not Lines — You Have Been Debugging the Wrong Axis

· 12 min read
Tian Pan
Software Engineer

A user reports "the assistant felt slow this morning." The on-call engineer pulls up the flame graph, sorts tool calls by duration descending, finds the slowest one — a 2.1-second vector search — optimizes it down to 900ms, ships the fix, and marks the incident resolved. A week later the same complaint arrives. The vector search is still 900ms. But the end-to-end latency on that query type has actually gotten worse. Nothing in the flame graph explains why.

This is what happens when an engineer debugs a tree on the line axis. Agent latency is not a waterfall of sequential steps — it is a nested tree of planning calls, tool subtrees, parallel fan-outs, retries, and recursive sub-agents. When the budget is structural but the tooling treats it as linear, local optimizations miss the actual violation, which lives in how time is distributed across branches, not how long any single call takes. You can make every leaf faster and still ship a p99 that is getting worse.
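
A minimal sketch of reading latency as a tree rather than a list. The `Span` shape and all durations are made up for illustration. Sequential children add up, parallel children cost as much as the slowest one, and retries nest under the call that triggered them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    name: str
    self_ms: int                       # time spent in this node itself
    children: List["Span"] = field(default_factory=list)
    parallel: bool = False             # True if children run concurrently


def wall_ms(span: Span) -> int:
    """Wall-clock time of a subtree: sum sequential children, max parallel ones."""
    child_times = [wall_ms(child) for child in span.children]
    if not child_times:
        return span.self_ms
    nested = max(child_times) if span.parallel else sum(child_times)
    return span.self_ms + nested


# Illustrative trace (all numbers are hypothetical).
trace = Span("plan", 300, [
    Span("vector_search", 900),
    Span("fan_out", 50, [
        Span("sub_agent_a", 400),
        Span("sub_agent_b", 700, [Span("retry", 500)]),
    ], parallel=True),
])

total = wall_ms(trace)
print(total)
```

Sorted by individual duration, `vector_search` is the slowest call at 900 ms, yet the `fan_out` subtree costs 1250 ms because of the retry nested under one branch.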

Inference Is Faster Than Your Database Now

· 10 min read
Tian Pan
Software Engineer

Open any 2024-era AI feature's trace and the model call is the whale. Eight hundred milliseconds of generation surrounded by a thin crust of retrieval, auth, and a database lookup rounding to nothing. Every architecture decision that year — the caching, the prefetching, the streaming UX — was designed around hiding that whale.

Now pull the same trace for the same feature running on a 2026 inference stack. The whale is a dolphin. A cached prefill returns the first token in 180ms. Decode streams at 120 tokens per second. The model is no longer the slow node. Your own infrastructure is, and most of it hasn't noticed.

This reordering is the most important performance shift of the year, and it's the one teams keep under-reacting to. The p99 floor on an AI request is now set by the feature store call, the auth middleware, and the Postgres lookup that was always that slow. Nobody cared while the model was taking nine-tenths of the budget.
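
A back-of-the-envelope sketch of the model's share of a request, using the 180 ms first-token and 120 tokens-per-second figures above. The 120-token response length is an assumption, and `response_seconds` is a hypothetical helper.

```python
def response_seconds(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Time to the last token: time-to-first-token plus steady-state decode."""
    return ttft_s + output_tokens / tokens_per_s


model_s = response_seconds(0.180, 120, 120.0)  # assumed 120-token answer
print(f"model time: {model_s:.2f}s")
```

With the user already reading after 180 ms, any synchronous lookup that runs before the model call adds directly to the time before the first token appears.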

The Latency Perception Gap: Why a 3-Second Stream Feels Faster Than a 1-Second Batch

· 11 min read
Tian Pan
Software Engineer

Your users don't have a stopwatch. They have feelings. And those feelings diverge from wall-clock reality in ways that matter enormously for how you build AI interfaces. A response that appears character-by-character over three seconds will consistently feel faster to users than a response that materializes all at once after one second — even though the batch system is objectively faster. This isn't irrational or a bug in human cognition. It's a well-documented perceptual phenomenon, and if you're building AI products without accounting for it, you're optimizing for the wrong metric.

This post breaks down the psychology behind latency perception, the metrics that actually predict user satisfaction, the frontend patterns that exploit these perceptual quirks, and when streaming adds more complexity than it's worth.