Skip to main content

41 posts tagged with "performance"

View all tags

Speculative Decoding in Production: Free Tokens and Hidden Traps

· 9 min read
Tian Pan
Software Engineer

Most LLM inference bottlenecks come down to one uncomfortable fact: the GPU is waiting on memory bandwidth, not compute. Each token generated requires loading the entire model's weights from HBM, and that transfer dominates runtime. Speculative decoding was designed to exploit this gap — but the gains depend on conditions your benchmark almost certainly didn't test.

Teams that ship speculative decoding into production often see it underperform lab numbers by 40–60%. Not because the technique is flawed, but because the workload characteristics differ in ways that matter: larger batch sizes, shorter outputs, stricter output constraints. Understanding when speculative decoding actually helps — and when it silently hurts — is the prerequisite for deploying it responsibly.

Tokens Are a Finite Resource: A Budget Allocation Framework for Complex Agents

· 10 min read
Tian Pan
Software Engineer

The frontier models now advertise context windows of 200K, 1M, even 2M tokens. Engineering teams treat this as a solved problem and move on. The number is large, surely we'll never hit it.

Then, six hours into an autonomous research task, the agent starts hallucinating file paths it edited three hours ago. A coding agent confidently opens a function it deleted in turn four. A document analysis pipeline begins contradicting conclusions it drew from the same document earlier in the session. These are not model failures. They are context budget failures — predictable, measurable, and almost entirely preventable if you treat the context window as the scarce compute resource it actually is.

The Intent Classification Layer Most Agent Routers Skip

· 11 min read
Tian Pan
Software Engineer

When you hand your agent a list of 50 tools and let the LLM decide which one to call, accuracy hovers around 94%. Reasonable. Ship it. But when that list grows to 200 tools—which happens faster than anyone expects—accuracy drops to 64%. At 417 tools it hits 20%. At 741 tools it falls to 13.6%, which is statistically indistinguishable from random guessing.

The fix is a pattern that most teams skip: an intent classification layer that runs before tool dispatch. Not instead of the LLM—before it. The classifier narrows the tool namespace so that the LLM only sees the tools relevant to the user's actual intent. The LLM's reasoning stays intact; it just operates on a curated, relevant subset rather than an ever-expanding haystack.

This post explains why teams skip it, what the cost looks like when they do, and how to build the layer properly—including the feedback loop that makes it compound over time.

TTFT Is the Only Latency Metric Your Users Actually Feel

· 9 min read
Tian Pan
Software Engineer

Your model generates a 500-word response in 8 seconds. A competing model generates the same response in 12 seconds. Intuitively, yours should feel faster. But if your first token arrives at 2.5 seconds and theirs arrives at 400 milliseconds, your users will describe your product as slow — regardless of total generation time. This is the central paradox of LLM latency: the metric your infrastructure team optimizes for (end-to-end generation time, tokens per second) is not the metric your users experience. Time-to-first-token is.

TTFT is not a detail. It is the primary signal users use to judge whether your AI feature is responsive. Getting it wrong means building fast systems that feel slow.

Write Amplification in Agentic Systems: Why One Tool Call Hits Six Databases

· 10 min read
Tian Pan
Software Engineer

When an agent decides to remember something — "the user prefers email over Slack" — it looks like a single write. In practice, it is six writes: a new embedding in the vector store, a row in the relational database, an entry in the session cache, a record in the event log, an entry in the audit trail, and an update to the context store. Each one happens because a different part of the system has a legitimate need for the data, and each one introduces a new failure surface.

This is write amplification at the infrastructure layer, and it's one of the quieter operational crises in production agent deployments. It does not cause dramatic failures. It causes partial failures: the user's preference is searchable semantically but the relational query returns stale data; the audit log shows an action that never fully completed; the cache is warm but the context store wasn't updated, so the next session starts without the learned pattern.

Understanding why this happens — and what to do about it — requires borrowing from database internals rather than the agent framework documentation.

Latency Budgets for AI Features: How to Set and Hit p95 SLOs When Your Core Component Is Stochastic

· 11 min read
Tian Pan
Software Engineer

Your system averages 400ms end-to-end. Your p95 is 4.2 seconds. Your p99 is 11 seconds. You committed to a "sub-second" experience in the product spec. Every metric in your dashboard looks fine until someone asks what happened to 5% of users — and suddenly the average you've been celebrating is the thing burying you.

This is the latency budget problem for AI features, and it's categorically different from what you've solved before. When your core component is a database query or a microservice call, p95 latency is roughly predictable and amenable to standard SRE techniques. When your core component is an LLM, the distribution of response times is heavy-tailed, input-dependent, and partially driven by conditions you don't control. You need a different mental model before you can set an honest SLO — let alone hit it.

Why Your Database Melts When AI Features Ship: LLM-Aware Connection Pool Design

· 9 min read
Tian Pan
Software Engineer

Your connection pool was fine until someone shipped the AI feature. Login works, dashboards load, CRUD operations hum along at single-digit millisecond latencies. Then the team deploys a RAG-powered search, an agent-driven workflow, or an LLM-backed summarization endpoint — and within hours, your core product starts timing out. The database didn't get slower. Your pool just got eaten alive by a workload it was never designed to handle.

This is the LLM connection pool problem, and it's hitting teams across the industry as AI features move from prototype to production. The fix isn't "just add more connections." In fact, that usually makes things worse.

The Cold Start Tax on Serverless AI Agents

· 11 min read
Tian Pan
Software Engineer

A standard Lambda function with a thin Python handler cold-starts in about 250ms. Your AI agent, calling the same runtime with a few SDK imports added, cold-starts in 8–12 seconds. Add local model inference and you're at 40–120 seconds. The first user to hit a scaled-down deployment waits the length of a TV commercial before the agent responds. That gap — not latency per inference token, not throughput, but the initial startup cost — is where most serverless AI deployments quietly fail their users.

The problem isn't unique to serverless, but serverless makes it visible. When you run agents on always-on infrastructure, you pay for idle capacity and cold starts never happen. When you embrace scale-to-zero to cut costs, every period of low traffic becomes a trap waiting for the next request.

The N+1 Query Problem Has Infected Your AI Agent

· 10 min read
Tian Pan
Software Engineer

Your AI agent just made twelve API calls to answer a question that needed two. You didn't notice because there's no EXPLAIN ANALYZE for tool calls, no ORM profiler flagging the issue, and the agent got the right answer anyway — just two seconds late and three times over-budget on tokens.

This is the N+1 query problem, and it has quietly migrated from your database layer into your agent's tool call layer. The bad news: the failure mode is identical to what poisoned web applications in the 2010s. The good news: the solutions from that era port almost directly.

The Context Stuffing Antipattern: Why More Context Makes LLMs Worse

· 9 min read
Tian Pan
Software Engineer

When 1M-token context windows shipped, many teams took it as permission to stop thinking about context design. The reasoning was intuitive: if the model can see everything, just give it everything. Dump the document. Pass the full conversation history. Forward every tool output to the next agent call. Let the model sort it out.

This is the context stuffing antipattern, and it produces a characteristic failure mode: systems that work fine in early demos, then hit a reliability ceiling in production that no amount of prompt tweaking seems to fix. Accuracy degrades on questions that should be straightforward. Answers become hedged and non-committal. Agents start hallucinating joins between documents that aren't related. The model "saw" all the right information — it just couldn't find it.

Continuous Batching: The Single Biggest GPU Utilization Unlock for LLM Serving

· 11 min read
Tian Pan
Software Engineer

Most LLM serving infrastructure failures in production aren't model failures—they're scheduling failures. Teams stand up a capable model, load test it, and discover they're burning expensive GPU time at 35% utilization while users wait. The culprit is almost always static batching: a default inherited from conventional deep learning that fundamentally doesn't fit how language models generate text.

Continuous batching—also called iteration-level scheduling or in-flight batching—is the mechanism that fixes this. It's not a tuning knob; it's an architectural change to how the serving loop runs. The difference between a system using it and one that isn't can be 4–8x in throughput for the same hardware.

Semantic Caching for LLM Applications: What the Benchmarks Don't Tell You

· 8 min read
Tian Pan
Software Engineer

Every vendor selling an LLM gateway will show you a slide with "95% cache hit rate." What that slide won't show you is the fine print: that number refers to match accuracy when a hit is found, not how often a hit is found in the first place. Real production systems see 20–45% hit rates — and that gap between marketing and reality is where most teams get burned.

Semantic caching is a genuinely useful technique. But deploying it without understanding its failure modes is how you end up returning wrong answers to users with high confidence, wondering why your support queue doubled.