39 posts tagged with "performance"

Sequential Tool Call Waterfalls: The Hidden Latency Tax in Agent Loops

· 10 min read
Tian Pan
Software Engineer

If you've profiled an AI agent that felt inexplicably slow, chances are you found a waterfall. The agent called tool A, waited, then called tool B, waited, then called tool C — even though B and C had no dependency on A's result. You just paid 3× the latency for 1× the work.

This pattern is not an edge case. It's the default behavior of virtually every agent framework. The model returns multiple tool calls in a single response, and the execution loop runs them one at a time, in order. Fixing it isn't complicated, but first you need a reliable way to identify which calls are actually independent.
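To make the waterfall concrete, here is a minimal sketch in Python using `asyncio`. The `fake_tool` coroutine and the 0.1-second delays are hypothetical stand-ins for real tool calls, and the sketch assumes all three calls are already known to be independent. It is an illustration of the idea, not how any particular agent framework implements its loop.

```python
import asyncio
import time


async def fake_tool(name: str, delay: float) -> str:
    # Hypothetical stand-in for a real tool call (HTTP request, DB query, ...).
    await asyncio.sleep(delay)
    return f"{name}: done"


async def run_sequential(calls):
    # The waterfall: each call waits for the previous one to finish.
    return [await fake_tool(name, delay) for name, delay in calls]


async def run_concurrent(calls):
    # Independent calls start together; gather returns results in input order.
    return await asyncio.gather(*(fake_tool(name, delay) for name, delay in calls))


def timed(coro_fn, calls):
    # Run one strategy to completion and report its wall-clock time.
    start = time.perf_counter()
    results = asyncio.run(coro_fn(calls))
    return list(results), time.perf_counter() - start


CALLS = [("A", 0.1), ("B", 0.1), ("C", 0.1)]
seq_results, seq_time = timed(run_sequential, CALLS)
par_results, par_time = timed(run_concurrent, CALLS)
print(f"sequential: {seq_time:.2f}s, concurrent: {par_time:.2f}s")
```

Both strategies return the same results in the same order; only the wall-clock time differs. With real tools, the hard part is deciding which calls are safe to run together, which this sketch takes as given.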

Burst Capacity Planning for AI Inference: When Black Friday Meets Your KV Cache

· 11 min read
Tian Pan
Software Engineer

Your Black Friday traffic spike arrives. Conventional API services respond by spinning up more containers. Within 60 seconds, you have three times the capacity. The autoscaler does what it always does, and you sleep through the night.

Run an LLM behind that same autoscaler, and you get a different outcome. The new GPU instances come online after four minutes of model weight loading. By then, your request queues are full, your existing GPUs are thrashing under memory pressure from half-completed generations, and users are staring at spinners. Adding more compute didn't help — the bottleneck isn't where you assumed it was.

AI inference workloads violate most of the assumptions that make reactive autoscaling work for conventional services. Understanding why is the prerequisite to building systems that survive traffic spikes.

Capacity Planning for AI Workloads: Why the Math Breaks When Tokens Are Your Resource

· 11 min read
Tian Pan
Software Engineer

Your GPU dashboard is lying to you. At 60% utilization, your inference cluster looks healthy. Users are experiencing 8-second time-to-first-token. The on-call engineer checks memory — also fine. Compute — fine. And yet the queue is growing and latency is spiking. This is what happens when you apply traditional capacity planning to LLM workloads: the metrics you trust point to the wrong places, and the actual bottleneck stays invisible until users start complaining.

The root problem is that LLMs consume a fundamentally different kind of resource. CPU services trade compute and memory. LLM services trade tokens — and tokens don't behave like requests.

The Inference Optimization Trap: Why Making One Model Faster Can Slow Down Your System

· 9 min read
Tian Pan
Software Engineer

You swap your expensive LLM for a faster, cheaper distilled model. Latency goes up. Costs increase. Quality degrades. You roll back, confused, having just spent three weeks on optimization work that made everything worse.

This isn't a hypothetical. It's one of the most common failure modes in production AI systems, and it stems from a seductive but wrong mental model: that optimizing a component optimizes the system.

What Your Inference Provider Is Hiding From You: KV Cache, Batching, and the Latency Floor

· 11 min read
Tian Pan
Software Engineer

You're running an LLM-powered application and your p99 latency is 4 seconds. You've tuned your prompts, reduced output length, and switched to streaming. The number barely moves. The problem is not your code — it's physics and queuing theory operating inside a black box you don't own.

Every inference provider makes dozens of architectural decisions that determine your application's performance ceiling before your first API call. KV cache eviction policy, continuous batching schedules, chunked prefill chunk size — none of this is in the docs, none of it is configurable by you, and all of it shapes the latency and cost curve you're stuck with.

This post explains what's happening inside inference infrastructure, why it creates an unavoidable latency floor, and the handful of things you can do about it.

On-Device LLM Inference in Production: When Edge Models Are Right and What They Actually Cost

· 10 min read
Tian Pan
Software Engineer

Most teams decide to use on-device LLM inference the same way they decide to rewrite their database: impulsively, in response to a problem that a cheaper solution could have solved. The pitch is always compelling—no network round-trips, full privacy, zero inference costs—and the initial prototype validates it. Then six months post-ship, the model silently starts returning worse outputs, a new OS update breaks quantization compatibility, and your users on budget Android phones are running a version you can't push an update to.

This guide is about making that decision with eyes open. On-device inference is genuinely the right call in specific situations, but the cost structure is different from what teams expect, and the production failure modes are almost entirely unlike cloud LLM deployment.

When Code Beats the Model: A Decision Framework for Replacing LLM Calls with Deterministic Logic

· 8 min read
Tian Pan
Software Engineer

Most AI engineering teams have the same story. They start with a hard problem that genuinely needs an LLM. Then, once the LLM infrastructure is in place, every new problem starts looking like a nail for the same hammer. Six months later, they're calling GPT-4o to check whether an email address contains an "@" symbol — and they're paying for it.

The "just use the model" reflex is now the dominant driver of unnecessary complexity, inflated costs, and fragile production systems in AI applications. It's not that engineers are careless. It's that LLMs are genuinely impressive, the tooling has lowered the barrier to using them, and once you've built an LLM pipeline, adding another call feels trivially cheap. It isn't.

Database Connection Pools Are the Hidden Bottleneck in Your AI Pipeline

· 9 min read
Tian Pan
Software Engineer

Your AI feature ships. Response times look reasonable in staging. A week later, production starts throwing mysterious p99 spikes — latency jumps from 800ms to 8 seconds under moderate load, with no GPU pressure, no model errors, and no obvious cause. You add more replicas. It doesn't help. You profile the model server. It's fine. You add caching. Still no improvement.

Eventually someone checks the database connection pool. It's been sitting at 95% utilization since day three.

This is the most common category of AI production incident that nobody talks about, because connection pool exhaustion looks like model slowness. The symptoms appear in the wrong layer — you see high latency on LLM calls, not on database queries — so the diagnosis takes days while users experience degraded responses.
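A toy simulation can show why the symptom lands in the wrong layer. The sketch below is not a real database driver: `TimedPool` is a hypothetical pool built on `asyncio.Semaphore`, and the pool sizes, request count, and query time are made-up numbers. It records how long each request waits for a connection before its query even starts.

```python
import asyncio
import time


class TimedPool:
    """Toy connection pool (hypothetical) that records how long callers wait."""

    def __init__(self, size: int):
        self._slots = asyncio.Semaphore(size)
        self.wait_times = []

    async def run_query(self, query_seconds: float) -> None:
        start = time.perf_counter()
        async with self._slots:
            # Time spent here is pool wait, not query time.
            self.wait_times.append(time.perf_counter() - start)
            await asyncio.sleep(query_seconds)  # simulated query


async def simulate(pool_size: int, requests: int, query_seconds: float) -> float:
    pool = TimedPool(pool_size)
    await asyncio.gather(*(pool.run_query(query_seconds) for _ in range(requests)))
    return max(pool.wait_times)


worst_wait_small = asyncio.run(simulate(pool_size=2, requests=8, query_seconds=0.05))
worst_wait_large = asyncio.run(simulate(pool_size=8, requests=8, query_seconds=0.05))
print(f"worst wait, pool of 2: {worst_wait_small * 1000:.0f} ms")
print(f"worst wait, pool of 8: {worst_wait_large * 1000:.0f} ms")
```

Every query takes the same simulated time in both runs, so a dashboard that only tracks query duration would show nothing wrong. The extra latency is all pool wait, which is why measuring it separately shortens the diagnosis.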

The Edge Inference Decision Framework: When to Run AI Models Locally Instead of in the Cloud

· 12 min read
Tian Pan
Software Engineer

Most teams make the cloud-vs-edge decision by gut instinct: cloud is easier, so they default to cloud. Then a HIPAA audit hits, or the latency SLO slips by 400ms, or the monthly invoice arrives. Only then do they ask whether some of that inference should have been local all along.

The answer is almost never "all cloud" or "all edge." The teams running production AI at scale have settled on a tiered architecture: an on-device or on-premise model handles the majority of requests, and a cloud frontier model catches what the smaller model can't. Getting that routing right is an engineering decision, not an intuition.

This is the decision framework for making it rigorously.
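As a rough illustration of the tiered architecture, the following Python sketch routes each request to a local model first and escalates to the cloud only when the local answer is not confident enough. Everything here is an assumption made for the example: the `TieredRouter` class, the confidence score returned by the local model, the 0.8 threshold, and the stub models. Confidence-based escalation is one possible routing rule, not necessarily the one the post goes on to recommend.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TieredRouter:
    # local_model returns (answer, confidence in [0, 1]); both models are hypothetical.
    local_model: Callable[[str], Tuple[str, float]]
    cloud_model: Callable[[str], str]
    min_confidence: float = 0.8  # assumed threshold; tune against real traffic

    def answer(self, prompt: str) -> Tuple[str, str]:
        text, confidence = self.local_model(prompt)
        if confidence >= self.min_confidence:
            return "local", text
        # Escalate only what the smaller model could not handle.
        return "cloud", self.cloud_model(prompt)


def stub_local(prompt: str) -> Tuple[str, float]:
    # Stub: pretend the small model is only confident on short prompts.
    confidence = 0.95 if len(prompt.split()) <= 5 else 0.4
    return f"local answer to: {prompt}", confidence


def stub_cloud(prompt: str) -> str:
    return f"cloud answer to: {prompt}"


router = TieredRouter(local_model=stub_local, cloud_model=stub_cloud)
easy = router.answer("What is 2 + 2?")
hard = router.answer("Summarize the differences between these three contracts in detail")
print(easy[0], hard[0])
```

Returning the tier alongside the answer makes it easy to log what fraction of traffic stays local, which is the number the routing decision ultimately has to be judged on.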

The Compression Decision: Quantization, Distillation, and On-Device Inference for Latency-Critical AI Features

· 10 min read
Tian Pan
Software Engineer

Model routing is the first optimization most teams reach for. Route simple queries to a small cheap model, complex ones to a large capable model. It works well for managing cost and throughput. What it cannot fix is the wall you hit when the physics of cloud inference collide with a latency requirement of 100ms or less. A network round-trip from a mid-tier data center already consumes 30–80ms before a single token is generated. At that point, routing is irrelevant — you need to either run the model closer to the user or run a substantially smaller model. Both paths require compression decisions that most teams approach without a framework.

This is a guide for making those decisions. The three techniques — quantization, knowledge distillation, and on-device deployment — solve overlapping problems but have very different cost structures, quality profiles, and operational consequences.

Speculative Decoding in Production: Free Tokens and Hidden Traps

· 9 min read
Tian Pan
Software Engineer

Most LLM inference bottlenecks come down to one uncomfortable fact: the GPU is waiting on memory bandwidth, not compute. Each token generated requires loading the entire model's weights from HBM, and that transfer dominates runtime. Speculative decoding was designed to exploit this gap — but the gains depend on conditions your benchmark almost certainly didn't test.
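The memory-bandwidth bound can be estimated with back-of-the-envelope arithmetic. The Python sketch below assumes batch size 1 and that every generated token requires reading all model weights once; it ignores KV cache reads, kernel overhead, and batching. The 7B-parameter model, 2 bytes per parameter, and 2 TB/s of bandwidth are illustrative numbers, not measurements of any specific GPU.

```python
def decode_time_per_token_ms(params: float, bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    # Lower bound on per-token decode time if weight loading dominates.
    weight_bytes = params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s * 1000.0


# Assumed example values: 7e9 parameters, 16-bit weights, 2e12 bytes/s bandwidth.
ms_per_token = decode_time_per_token_ms(7e9, 2, 2e12)
tokens_per_s = 1000.0 / ms_per_token
print(f"{ms_per_token:.1f} ms per token, about {tokens_per_s:.0f} tokens/s at best")
```

Under these assumptions the GPU's arithmetic units sit mostly idle while weights stream in, which is the gap speculative decoding tries to fill by verifying several drafted tokens per weight load.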

Teams that ship speculative decoding into production often see it underperform lab numbers by 40–60%. Not because the technique is flawed, but because the workload characteristics differ in ways that matter: larger batch sizes, shorter outputs, stricter output constraints. Understanding when speculative decoding actually helps — and when it silently hurts — is the prerequisite for deploying it responsibly.

Tokens Are a Finite Resource: A Budget Allocation Framework for Complex Agents

· 10 min read
Tian Pan
Software Engineer

Frontier models now advertise context windows of 200K, 1M, even 2M tokens. Engineering teams treat this as a solved problem and move on. The number is large; surely we'll never hit it.

Then, six hours into an autonomous research task, the agent starts hallucinating file paths it edited three hours ago. A coding agent confidently opens a function it deleted in turn four. A document analysis pipeline begins contradicting conclusions it drew from the same document earlier in the session. These are not model failures. They are context budget failures — predictable, measurable, and almost entirely preventable if you treat the context window as the scarce compute resource it actually is.