Skip to main content

18 posts tagged with "llm-inference"

View all tags

The Quantization Quality Cliff: When int4 Passes the Median Eval and Fails on the Long Tail

· 11 min read
Tian Pan
Software Engineer

A team swaps an fp16 model for an int4 quantization to halve serving cost. The eval suite scores within a point of the original on the curated test set. The rollout ships under the rationale "indistinguishable on the benchmark." Six weeks later, support is fielding catastrophic-failure quotes from regulated customers — code that compiles to nonsense, low-resource-language responses that drift into another script, multi-hop arithmetic that confidently returns numbers off by an order of magnitude. The benchmark didn't lie. It just measured the median, and quantization is not a uniform tax on the median. It is a non-uniform tax on the tail.

This is the quantization quality cliff: the moment your eval suite, your rollout discipline, and your cost-savings narrative all simultaneously fail because the metric you used to approve the swap had no signal on the capabilities you destroyed. Recent benchmarks make the magnitude concrete. On long-context tasks, 8-bit quantization preserves accuracy with roughly a 0.8% drop, while 4-bit methods lose up to 59% on the same workload — a regression invisible to any test set that doesn't oversample tail inputs. Median moved one point. Tail moved fifteen, or thirty, or fifty.

Capacity Math for Agent Loops: Why Your Provisioned Throughput Is Half of What You Think

· 11 min read
Tian Pan
Software Engineer

A team I worked with launched what they called a "modest" feature: an internal research assistant for a few hundred analysts. Their capacity model said one user request equals one model call, so they sized provisioned throughput against peak user QPS with the standard 30 percent burst headroom. On launch day they hit 429s within an hour, traffic that should have used 40 percent of their reserved capacity saturated 100 percent, and the postmortem revealed a number nobody had multiplied in: the average request triggered 11 model calls, not one.

This is the most common capacity miss I see in agent rollouts. The math is not subtle and the failure mode is not exotic. The team asked the wrong unit question — they planned in user requests when the meter ticks in model calls — and the reservation they paid real money for evaporated under a load they would have called light if it had been a chat product.

GPU Capacity Planning When Demand Is a Cliff, Not a Curve

· 10 min read
Tian Pan
Software Engineer

The first time an agent platform falls over, the postmortem usually contains a sentence that reads something like: "We had eight weeks of headroom on Friday. By Monday afternoon, we were at 140% of provisioned capacity." Nobody is lying. The capacity model was correct, applied to a workload it was never designed for. Classical capacity planning assumes demand grows along a smooth curve where weekly seasonality is the dominant signal and the worst case is a Black Friday you can plan against six months out. Agent workloads break that assumption hard.

The shape of agent demand is not a curve. It is a cliff. Three things produce the cliff and they compound. A single enterprise customer onboarding can shift baseline by 10x overnight on a contractual notice you've already signed. An agent loop can amplify a tiny increase in user activity into a fanout-multiplied surge that hits inference 30x harder than the user-facing graph suggests. A single product change — enabling tool use, lengthening context, switching to a larger model — can move per-task token consumption by an order of magnitude with no change in user count.

If your capacity planning is in QPS and your headroom budget is "75% utilization is healthy," you are not planning. You are gambling that none of those three cliffs lands on the same week.

Inter-Token Jitter: The Streaming UX Failure Your p95 Dashboards Can't See

· 11 min read
Tian Pan
Software Engineer

Your latency dashboard is green. Time-to-first-token is under the 800ms target on p95. Total completion time is under the four-second budget on p99. Then a senior PM forwards a support thread: "the assistant froze for like three seconds in the middle of an answer," "it stuttered and then dumped a whole paragraph," "I thought it crashed." Three users uninstalled this week with the same complaint. Nobody on the team can reproduce it on their laptop, and every metric you log says the system is healthy.

The metric that would explain the bug is the one you're not measuring: the distribution of gaps between consecutive tokens. A clean p95 total time can hide a stream where 8% of responses contain a 2.5-second pause halfway through, and to a user watching characters appear in real time, that pause reads as a broken system — not a slow one. Your dashboard is measuring the movie's runtime; your user is watching the movie.

Speculative Decoding Is a Streaming Protocol Decision, Not an Inference Optimization

· 12 min read
Tian Pan
Software Engineer

The "identical output" guarantee that ships with every speculative decoding paper is a guarantee about token distributions, not about what your user sees. Read the proofs carefully and you find a clean mathematical equivalence: the rejection-sampling acceptance criterion is designed so that the output distribution after speculation is exactly the distribution the target model would have produced on its own. That guarantee binds the bytes that leave the inference engine. It says nothing about the bytes that arrived on the user's screen five hundred milliseconds ago and have to be taken back.

If you stream draft tokens to the client the moment the small model emits them, you are running an A/B test on your own users every time the verifier rejects a suffix. Half a paragraph rewrites itself. A function name changes after the IDE has already syntax-highlighted it. A TTS voice has already pronounced "the answer is likely no" before the verifier swaps in "the answer is yes, with caveats." The math says the final distribution is the same as the slow path. The user's experience says they watched the model change its mind in public.

This is the part of speculative decoding that doesn't make it into the speedup numbers. It is also the part that turns "free 3× throughput" into a half-quarter of streaming-protocol work that nobody scoped.

GPU Starvation: How One Tenant's Reasoning Prompt Stalls Your Shared Inference Endpoint

· 9 min read
Tian Pan
Software Engineer

Your dashboard says the GPU is healthy. Utilization hovers around 80%, throughput in tokens-per-second looks fine, cold starts are rare, and the model is the one you asked for. Yet your pager is going off because p99 latency has tripled, a handful of users are timing out, and support tickets all describe the same thing: "the app froze for twenty seconds, then came back." You pull a trace and find an unrelated customer's 28,000-token reasoning request sitting in the same batch as every stalled call. One tenant's deep-think prompt just ate everyone else's turn.

This is head-of-line blocking, and it is the failure mode that ruins shared LLM inference the moment reasoning models enter the traffic mix. The pattern is not new — storage systems and network stacks have fought it for decades — but it takes a specific shape on GPUs because of how continuous batching and KV-cache pinning work. Most teams design for average load and discover too late that "shared inference is cheaper" stops being true the instant request sizes stop being similar.

Speculative Decoding in Production: Free Tokens and Hidden Traps

· 9 min read
Tian Pan
Software Engineer

Most LLM inference bottlenecks come down to one uncomfortable fact: the GPU is waiting on memory bandwidth, not compute. Each token generated requires loading the entire model's weights from HBM, and that transfer dominates runtime. Speculative decoding was designed to exploit this gap — but the gains depend on conditions your benchmark almost certainly didn't test.

Teams that ship speculative decoding into production often see it underperform lab numbers by 40–60%. Not because the technique is flawed, but because the workload characteristics differ in ways that matter: larger batch sizes, shorter outputs, stricter output constraints. Understanding when speculative decoding actually helps — and when it silently hurts — is the prerequisite for deploying it responsibly.

Edge LLM Inference: When Latency, Privacy, or Cost Force You Off the Cloud

· 9 min read
Tian Pan
Software Engineer

A fine-tuned 7B parameter model running on a single RTX 4090 can outperform GPT-4 on domain-specific tasks while costing you nothing per token after the initial hardware investment. That is not a theoretical claim — Diabetica-7B, a diabetes-focused model, hit 87.2% accuracy on clinical queries, beating both GPT-4 and Claude 3.5 on the same benchmark. The catch? Getting there requires understanding exactly when edge inference makes sense and when it is an expensive distraction.

Most teams default to cloud APIs because they are easy — make an HTTP call, get tokens back. But that simplicity has costs that scale in ways engineers do not anticipate until it is too late, and those costs are not always measured in dollars.

Beam Search for Code Agents: Why Greedy Generation Is a Reliability Trap

· 11 min read
Tian Pan
Software Engineer

A code agent that passes 90% of HumanEval is not a reliable code agent. It's a code agent that performs well on problems designed to be solvable in a single pass. Give it a competitive programming problem with strict constraints, or a multi-file refactor with subtle interdependencies, and watch the pass rate crater to 20–30%. The model isn't failing because it lacks knowledge. It's failing because greedy, single-pass generation commits to the first plausible-looking token sequence and never looks back.

The fix isn't a better model. It's a better generation strategy. Recent research has established that applying tree exploration to code generation — branching across multiple candidate solutions, scoring partial programs, and pruning unpromising paths — improves pass rates by 30–130% on hard problems, with no change to the underlying model weights.

Hybrid Cloud-Edge LLM Architecture: Routing Inference Where It Actually Belongs

· 9 min read
Tian Pan
Software Engineer

Most teams pick a side: run everything in the cloud, or compress a model to fit on-device. Both choices leave money and performance on the table. The teams getting the best results in 2025-2026 are doing neither — they're building hybrid architectures that route each inference request to the right tier based on complexity, latency budget, and data sensitivity.

The core insight is simple but underappreciated: 70-80% of production queries don't need a frontier model. They need a fast answer from a small model that sits close to the user. The remaining 20-30% genuinely benefit from a cloud-hosted heavyweight. The engineering challenge is building the routing layer that makes this split invisible.

Hybrid Cloud-Edge LLM Inference: The Routing Layer That Determines Your Cost, Latency, and Privacy Profile

· 10 min read
Tian Pan
Software Engineer

Most teams pick a side: run everything in the cloud, or push everything to the edge. Both are wrong for the majority of production workloads. The interesting engineering happens in the routing layer between them — the component that decides, per-request, whether a query deserves a 70B frontier model on an H100 or a 3B quantized model running on local silicon.

This routing decision isn't just about latency. It's a three-variable optimization across cost, privacy, and capability — and the optimal split changes based on your traffic patterns, regulatory environment, and what "good enough" means for each query type. Teams that get the routing right cut inference costs 60–80% while improving p95 latency. Teams that get it wrong either overspend on cloud GPUs for trivial queries or ship degraded answers from edge models that can't handle the complexity.

Hybrid Cloud-Edge LLM Inference: The Latency-Privacy-Cost Triangle That Determines Where Your Model Runs

· 11 min read
Tian Pan
Software Engineer

Most teams run every LLM call through a cloud API. It's the path of least resistance: no hardware to manage, no models to optimize, and the latest frontier capabilities are one HTTP request away. But as AI moves deeper into production — processing sensitive documents, powering real-time interactions, running on mobile devices — the assumption that cloud is always the right answer starts to crack.

The cracks show up in three places simultaneously. Latency: a 200ms network round-trip that's invisible in a chatbot becomes unacceptable in voice AI or real-time code completion. Privacy: data that leaves the device creates compliance surface area that legal teams increasingly won't sign off on. Cost: at high request volumes with low utilization variance, you're paying a significant premium for infrastructure you could own.