43 posts tagged with "performance"

The CI Host Whose CPU Governor Decided Your Agent Benchmark's Outcome

· 9 min read
Tian Pan
Software Engineer

A team I worked with spent three days hunting a 22% latency regression in their agent loop. They blamed a new tool router. They blamed a switched model version. They blamed the JSON schema validator they had quietly upgraded the week before. They eventually found the culprit two layers below their code: a runner image had rolled forward, the new image defaulted the cpufreq governor to schedutil instead of performance, and the burstiness of an agent's tool-call loop made schedutil's ramp-up latency visible in p95. The model was fine. The agent was fine. The kernel changed its mind about how to clock the CPU between micro-bursts of work, and the entire benchmark moved.

This is the failure mode most agent teams never see, because they never look. Your CI benchmark numbers are not measurements of the model or the agent. They are measurements of a stack that happens to include a model, a network, a shared VM, a hypervisor scheduler, a cache hierarchy with unknown neighbors, and — most quietly — a frequency-scaling policy that gets to decide whether a given millisecond of compute runs at 1.0 GHz or 3.6 GHz.
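
One cheap guard is to record the governor at the start of every benchmark run. The sketch below reads the standard Linux cpufreq sysfs files; the function names and the choice of `performance` as the expected value are assumptions for illustration, and on many shared VMs the cpufreq directory is not exposed at all, in which case the result is simply empty.

```python
from pathlib import Path


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for every CPU that exposes a cpufreq policy."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(sysfs_root).glob(pattern)):
        cpu_name = path.parent.parent.name  # e.g. "cpu0"
        governors[cpu_name] = path.read_text().strip()
    return governors


def mismatched_governors(expected="performance", sysfs_root="/sys/devices/system/cpu"):
    """Return only the CPUs whose governor differs from the expected one."""
    return {
        cpu: governor
        for cpu, governor in read_governors(sysfs_root).items()
        if governor != expected
    }
```

Logging the output of `read_governors()` next to each benchmark result would make a governor change show up as a diff in metadata rather than as a mystery in p95.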

The Latency Budget Your Orchestrator Spent on Its Own Planning Step

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter ran a week-long instrumentation pass on a customer-support agent that had, on paper, a perfectly reasonable median latency. P50 was inside SLO, P95 was uncomfortable but explainable, and the tool-call traces looked healthy. Then someone bucketed the spans by type and the room got quiet. The agent was spending roughly 58% of its wall-clock per run inside spans labeled "plan," "reflect," "decide-next-step," and "self-check." Tool execution — the database lookups, the CRM writes, the auth checks — accounted for under 30%. The thing the agent was being measured on did less than the thing nobody was measuring.

That ratio is not a fluke. It is the natural state of any plan-act-observe loop that you do not actively police. The orchestrator pays in latency for thinking and pays in latency for acting, and a thinking step is almost always cheaper to add than an acting step, so thinking grows unchecked. By the time you notice, "decide what to do next" has become its own line item, bigger than most of the line items you originally built the agent to serve.
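
The bucketing step is small enough to sketch. The version below assumes spans arrive as `(name, duration_ms)` pairs and do not overlap in time; the four planning names come from the traces described above, while the function name and the `"other"` bucket are made up for the example.

```python
from collections import defaultdict

# Span names treated as "thinking"; everything else lands in "other".
PLANNING_SPANS = frozenset({"plan", "reflect", "decide-next-step", "self-check"})


def latency_share(spans, planning_names=PLANNING_SPANS):
    """Return each bucket's share of total span time.

    spans: iterable of (name, duration_ms) pairs, assumed non-overlapping.
    """
    totals = defaultdict(float)
    for name, duration_ms in spans:
        bucket = "planning" if name in planning_names else "other"
        totals[bucket] += duration_ms
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {bucket: total / grand_total for bucket, total in totals.items()}
```

If spans nest or run concurrently, summing durations double-counts time, so a real trace would need to be flattened to exclusive time first.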

The Tokens-Per-Second SLO Your Provider Met By Chunking Smaller

· 11 min read
Tian Pan
Software Engineer

Your provider's status page is green. The tokens-per-second dashboard shows the same flat line it always has. The SLA report says you are well within the contracted rate. And yet the support queue is filling up with users describing the chat output as "twitchy," "stuttery," "worse than last week." Nothing in your monitoring agrees with them, because nothing in your monitoring is measuring what they are actually looking at.

This is the failure mode nobody noticed the provider shipping. They did not break the rate. They renegotiated the unit. The same number of tokens arrives each second, but as a stream of single-token chunks instead of the four-token chunks the renderer was tuned for. Average throughput is intact. Perceptual quality is destroyed. The SLO held because the SLO was written against the wire, and the wire is the part of the system the provider owns.
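
To see how a rate-only metric misses this, here is a minimal sketch that computes tokens per second alongside two chunk-level numbers. The input shape, a list of `(arrival_time_ms, token_count)` pairs, and all names are assumptions for illustration, not any provider's API.

```python
from statistics import mean


def stream_stats(chunks):
    """Summarize a token stream.

    chunks: list of (arrival_time_ms, token_count) pairs in arrival order.
    The rate counts tokens that arrived after the first chunk, divided by
    the time elapsed since the first chunk.
    """
    if len(chunks) < 2:
        raise ValueError("need at least two chunks to compute a rate")
    times = [t for t, _ in chunks]
    sizes = [n for _, n in chunks]
    duration_s = (times[-1] - times[0]) / 1000
    if duration_s <= 0:
        raise ValueError("chunk timestamps must increase")
    return {
        "tokens_per_second": sum(sizes[1:]) / duration_s,
        "mean_tokens_per_chunk": mean(sizes),
        "chunks_per_second": (len(chunks) - 1) / duration_s,
    }
```

Fed four-token chunks every 100 ms and then single-token chunks every 25 ms, the first number is identical and the other two are not, which is roughly the gap between the dashboard and the support queue.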

The Slow Turn That Wasn't Yours: KV Cache Eviction Mid-Conversation

· 10 min read
Tian Pan
Software Engineer

A conversation has been moving along on a single Claude session for forty minutes. Eleven turns, each averaging 800ms time-to-first-token, each cheap because the 28,000-token prefix is hitting the prompt cache. Turn twelve arrives and TTFT is 3.4 seconds. The transcript hasn't changed shape. The model didn't switch. The network is fine. Cached input tokens drop from 27,800 to 0. The next turn's prefill bill is paid in full, from the first token.

You go looking for the cause in your traces and find nothing that names it. There is no event in your logs labeled "another tenant's burst evicted you." The only honest reading of the spike is that some other customer's prompt, somewhere on the same GPU pool, made the scheduler decide your warm prefix was the cheapest thing to drop. You cannot replay the turn. You cannot prove the eviction. The cache state at that moment was a function of strangers' traffic, and that traffic is not in your trace because it was never yours to see.
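
What you can do is timestamp the symptom. The sketch below assumes you log a per-turn record with a cached-token count taken from the provider's usage data; the field names `turn` and `cached_input_tokens` and both thresholds are hypothetical. It flags turns where the cached count collapses relative to the previous turn, which shows when the cache was lost but not why.

```python
def find_cache_drops(turns, min_prev_cached=1000, drop_ratio=0.5):
    """Return turn numbers where cached input tokens collapsed.

    turns: list of dicts with "turn" and "cached_input_tokens" keys, in order.
    A turn is flagged when the previous turn had a warm cache (at least
    min_prev_cached tokens) and this turn's cached count fell below
    drop_ratio times the previous count.
    """
    flagged = []
    for prev, cur in zip(turns, turns[1:]):
        prev_cached = prev["cached_input_tokens"]
        cur_cached = cur["cached_input_tokens"]
        if prev_cached >= min_prev_cached and cur_cached < prev_cached * drop_ratio:
            flagged.append(cur["turn"])
    return flagged
```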

The MCP Cold Start Tax: How Tool-Server Overhead Compounds by Agent Step 7

· 11 min read
Tian Pan
Software Engineer

A 200-millisecond tool call looks like noise on a flame graph. Stack seven of them in an agent loop and the noise becomes the signal — the model finishes thinking in 800ms but the user waits 4.5 seconds because every tool invocation re-pays a startup cost the first call already absorbed. The cruel part is that this cost doesn't show up in any single trace as anomalous. It shows up as the difference between a snappy demo and a sluggish production agent, and most teams blame the model.

The Model Context Protocol has become the default integration surface for agent tooling, which means it has also become the default place where latency goes to die. MCP's design — JSON-RPC over stdio or streamable HTTP, capability negotiation, dynamic tool discovery — is correct for a protocol that has to bridge arbitrary clients and servers. But the per-call cost structure it implies is hostile to the access pattern that agents actually have, which is not "one tool call per session" but "seven tool calls per turn for forty turns per session."

This post is about that mismatch: where the cold start tax actually lives, why it compounds rather than amortizes in long-running agents, and the warm-pool discipline that turns a multi-second penalty into a sub-100ms one.

LLM Tail Latency: Why Your P99 Is a Disaster When P50 Looks Fine

· 10 min read
Tian Pan
Software Engineer

Your LLM API returns a median (P50) latency of 800 milliseconds. Your dashboard is green. Your SLAs say "under two seconds." Then a user files a support ticket: "it just spins for thirty seconds and then gives up." You check the logs and see a P99 of 28 seconds.

That gap — a 35x ratio between median and tail latency — is not a fluke. It is a structural property of how LLMs work, and it will not go away by tuning your timeouts.
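
For reference, the ratio is easy to compute from raw request latencies. This is a minimal sketch using the nearest-rank percentile definition; the function names are illustrative, and monitoring systems often interpolate instead, so exact values can differ slightly from your dashboard.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_ratio(samples):
    """Ratio of P99 to P50 latency."""
    return percentile(samples, 99) / percentile(samples, 50)
```

With 98 requests at 0.8 seconds and 2 at 28 seconds, `tail_ratio` comes out at about 35, matching the gap above.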

Profiling LLM Pipelines: The Bottlenecks That Aren't Inference

· 8 min read
Tian Pan
Software Engineer

Your team just spent three weeks optimizing inference. You swapped to a quantized model, tuned your batching policy, shaved 12% off time-to-first-token, and shipped it. Then you looked at the actual user-facing latency and it had barely moved.

This is the inference trap. It's the most common profiling failure mode in LLM-powered applications, and it happens because engineers measure what's easy to measure — GPU utilization, inference throughput, tokens per second — rather than what's actually slow. In a typical RAG pipeline, inference accounts for around 80% of latency when you include everything that touches the GPU. But that remaining 20% is often distributed across six or seven stages that nobody is tracing. Each one seems small in isolation, but together they dominate the optimization opportunity.

The Context Length Arms Race: Why Filling the Window Is the Wrong Goal

· 7 min read
Tian Pan
Software Engineer

Every six months, a model ships with a bigger context window. GPT-4.1 hit 1 million tokens. Gemini 2.5 followed at 2 million. Llama 4 is now advertising 10 million. The implicit promise is: dump everything in, stop worrying about what to include, let the model figure it out.

That promise does not hold up in production. A 2024 study evaluating 18 leading LLMs found that every single model showed performance degradation as input length increased. Not some models — every model. The context window is a ceiling, not a floor, and the teams that treat it as a floor are discovering that the hard way.

End-to-End Latency Is Not P99 of Your LLM Call: The Multipliers Nobody Measures in Agentic Systems

· 9 min read
Tian Pan
Software Engineer

Your LLM API call completes in 500ms at P99. Your users are waiting 12 seconds. Both numbers are accurate, and neither is lying to you — they're just measuring completely different things. The gap between them is where most agentic systems silently bleed performance, and most teams never instrument it.

The problem is structural: P99 LLM latency is a single-call metric applied to a multi-step execution model. A ReAct agent making five sequential tool calls, retrying a hallucinated function, assembling a growing context, and generating a 300-token reasoning chain is not one LLM call. It's a distributed workflow where the LLM is just one node, and every other node has its own latency tax.

The Parallelism Trap in Agentic Pipelines: When Fan-Out Makes Latency Worse

· 8 min read
Tian Pan
Software Engineer

Your agent pipeline is slow, so you split the work across five parallel sub-agents. The p50 drops. You ship it. Three days later, an on-call page fires: a batch of user requests is timing out. You dig in and find that p99 has climbed from 4 seconds to 22 seconds. Nothing in the individual agents changed. The timeout was caused by the orchestration layer waiting for the slowest of the five, which ran into a retrieval hiccup that only happens 1% of the time — but now it happens to any request that touches all five paths.

This is the parallelism trap: a pattern that looks like an obvious speedup but restructures your latency distribution in ways that hurt real users more than the p50 improvement helps them. Across production benchmarks, single agents match or outperform multi-agent pipelines on 64% of evaluated tasks. When parallel fan-out wins, it wins cleanly — but only for a specific class of problems. The mistake is treating fan-out as the default.
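
The arithmetic behind the tail shift is short. Assuming each branch hits its slow path independently with the same probability (an assumption; correlated failures behave differently), the chance that a fan-out request waits on at least one slow branch is:

```python
def p_any_slow(p_slow_per_branch, branches):
    """Probability that at least one of `branches` independent branches is slow."""
    return 1 - (1 - p_slow_per_branch) ** branches


# A 1% hiccup per branch across five branches: roughly 0.049, or about 1 in 20.
five_way = p_any_slow(0.01, 5)
```

A hiccup that used to sit beyond the 99th percentile of a single path now lands on about 1 request in 20, well inside the p99 of the fan-out.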

The Preprocessing Bottleneck That Kills AI Pipeline Throughput

· 10 min read
Tian Pan
Software Engineer

A team builds a RAG-backed feature, measures end-to-end latency, finds it unacceptably slow, and immediately starts optimizing the model call. They try a smaller model, batch requests, tune temperature and token limits. After two sprints of work, latency drops by 15%. The feature is still too slow. What they never measured: the 600ms they're spending chunking text and generating embeddings before the LLM ever receives a prompt.

This pattern is common enough that it has a name in distributed systems: optimizing the wrong component. In AI pipelines, the LLM call is visible and easy to measure. Everything before it is invisible until you explicitly instrument it — and that's exactly where throughput dies.
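
Explicit instrumentation does not need a tracing stack to get started. Here is a minimal sketch of a per-stage timer; the class and stage names are hypothetical, and a production setup would more likely emit spans to whatever tracing system is already in place.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.durations_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised.
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms

    def slowest(self):
        """Name of the stage with the largest accumulated time."""
        return max(self.durations_ms, key=self.durations_ms.get)
```

Wrapping each step, for example `with timer.stage("chunking"):` and `with timer.stage("embedding"):` ahead of `with timer.stage("llm_call"):`, makes the pre-LLM stages show up in the same report as the model call.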

Latency Budgets for Multi-Step Agents: Why P50 Lies and P99 Is What Users Feel

· 10 min read
Tian Pan
Software Engineer

The dashboard said the agent was fast. P50 sat at 1.2 seconds, the team had a meeting to celebrate, and then the abandonment rate kept climbing. Nobody was looking at the graph the user actually lives on.

This is the reliable failure mode of multi-step agents in production: the median is the metric you can hit, the tail is the metric your users feel, and the gap between the two grows non-linearly with every sub-call you bolt onto the pipeline. A four-step agent where each step is "fast at the median" routinely produces a P99 that is six or eight times worse than any single step. Users do not experience the median. They experience the worst step in their particular trip.

If your team optimizes the wrong percentile, you will ship a system that benchmarks well, demos beautifully, and bleeds users in the long tail you never instrumented.
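
A quick simulation shows the shape of the problem. The sketch below assumes every step draws its latency independently from the same lognormal distribution with a median near 0.3 seconds; that distribution and its parameters are invented for illustration, so the specific numbers mean nothing, only the comparison between a single step and the whole pipeline.

```python
import random
from statistics import quantiles


def simulate(steps=4, runs=20_000, seed=0):
    """Return (p50, p99) for one step and for a sequential pipeline of steps."""
    rng = random.Random(seed)

    def step_latency():
        # Assumed distribution: lognormal, median about 0.3 s, heavy right tail.
        return rng.lognormvariate(-1.2, 0.8)

    single = [step_latency() for _ in range(runs)]
    pipeline = [sum(step_latency() for _ in range(steps)) for _ in range(runs)]

    def p50_p99(samples):
        cuts = quantiles(samples, n=100)  # 99 cut points: cuts[49] is p50, cuts[98] is p99
        return cuts[49], cuts[98]

    return {"single_step": p50_p99(single), "pipeline": p50_p99(pipeline)}
```

Comparing `simulate()["pipeline"]` against `simulate()["single_step"]` with latencies sampled from your own traces, rather than from a made-up distribution, is the version of this worth putting on a dashboard.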