13 posts tagged with "llm-inference"

GPU Starvation: How One Tenant's Reasoning Prompt Stalls Your Shared Inference Endpoint

· 9 min read
Tian Pan
Software Engineer

Your dashboard says the GPU is healthy. Utilization hovers around 80%, throughput in tokens per second looks fine, cold starts are rare, and the model is the one you asked for. Yet your pager is going off because p99 latency has tripled, a handful of users are timing out, and support tickets all describe the same thing: "the app froze for twenty seconds, then came back." You pull a trace and find an unrelated customer's 28,000-token reasoning request sitting in the same batch as every stalled call. One tenant's deep-think prompt just ate everyone else's turn.

This is head-of-line blocking, and it is the failure mode that ruins shared LLM inference the moment reasoning models enter the traffic mix. The pattern is not new — storage systems and network stacks have fought it for decades — but it takes a specific shape on GPUs because of how continuous batching and KV-cache pinning work. Most teams design for average load and discover too late that "shared inference is cheaper" stops being true the instant request sizes stop being similar.
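To make the mechanics concrete, here is a toy simulation of that batching behavior. It is not any real serving engine: it reserves each request's full KV footprint up front and decodes one token per running request per step, and every name and number is invented for illustration.

```python
from dataclasses import dataclass

# Toy model of a continuous-batching GPU with a fixed KV-cache budget.
# Purely illustrative: real engines page the KV cache and grow it per
# token; this sketch reserves each request's full length up front to
# keep the head-of-line effect visible.

KV_BUDGET_TOKENS = 28_500   # total KV-cache capacity, in tokens

@dataclass
class Request:
    name: str
    total_tokens: int        # prompt + expected output length
    done: int = 0

def simulate(requests):
    running, waiting, finish = [], list(requests), {}
    step = 0
    while running or waiting:
        # Admit waiting requests while the KV-cache budget allows.
        used = sum(r.total_tokens for r in running)
        while waiting and used + waiting[0].total_tokens <= KV_BUDGET_TOKENS:
            used += waiting[0].total_tokens
            running.append(waiting.pop(0))
        # One decode iteration: every running request emits one token.
        step += 1
        for r in running:
            r.done += 1
        for r in [r for r in running if r.done >= r.total_tokens]:
            running.remove(r)
            finish[r.name] = step
    return finish

# One tenant's 28k-token reasoning request, then ten short calls.
reqs = [Request("deep-think", 28_000)] + [Request(f"short-{i}", 200) for i in range(10)]
times = simulate(reqs)
print(times["short-0"], times["short-9"], times["deep-think"])
# short-0 finishes at step 200; short-9 queues for KV space pinned by
# the 28k request and finishes ~5x later. That gap is the HOL blocking.
```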

Speculative Decoding in Production: Free Tokens and Hidden Traps

· 9 min read
Tian Pan
Software Engineer

Most LLM inference bottlenecks come down to one uncomfortable fact: the GPU is waiting on memory bandwidth, not compute. Each token generated requires loading the entire model's weights from HBM, and that transfer dominates runtime. Speculative decoding was designed to exploit this gap — but the gains depend on conditions your benchmark almost certainly didn't test.

Teams that ship speculative decoding into production often see it underperform lab numbers by 40–60%. Not because the technique is flawed, but because the workload characteristics differ in ways that matter: larger batch sizes, shorter outputs, stricter output constraints. Understanding when speculative decoding actually helps — and when it silently hurts — is the prerequisite for deploying it responsibly.
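For orientation, here is a minimal sketch of the draft-then-verify loop with stand-in draft_model and target_model functions. It simplifies the real algorithm, which verifies all draft tokens in one batched target pass and uses rejection sampling to preserve the target distribution, but it shows why the acceptance rate is the whole game.

```python
import random

# Illustrative speculative decoding loop with stand-in models. The
# greedy comparison and per-position target calls here exist only to
# make the control flow visible.

def draft_model(prefix):
    # Stand-in cheap drafter: deterministic pseudo-random next token.
    return (sum(prefix) * 31 + len(prefix)) % 100

def target_model(prefix):
    # Stand-in expensive model that agrees with the draft ~70% of the
    # time: the "acceptance rate" that governs the whole speedup.
    rng = random.Random(hash(tuple(prefix)))
    d = draft_model(prefix)
    return d if rng.random() < 0.7 else (d + 1) % 100

def speculative_step(prefix, k=4):
    # 1. Draft k tokens autoregressively with the cheap model.
    ctx, draft = list(prefix), []
    for _ in range(k):
        draft.append(draft_model(ctx))
        ctx.append(draft[-1])
    # 2. Verify the draft against the target model, left to right.
    ctx, accepted = list(prefix), []
    for t in draft:
        if target_model(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    # 3. The target's own token at the first mismatch (or after a full
    #    accept) guarantees at least one real token per verification.
    accepted.append(target_model(ctx))
    return accepted

prefix, target_passes = [1], 0
while len(prefix) < 64:
    prefix += speculative_step(prefix)
    target_passes += 1   # one (batched) verification pass per step
print(f"{len(prefix)} tokens from {target_passes} target passes")
```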

Edge LLM Inference: When Latency, Privacy, or Cost Force You Off the Cloud

· 9 min read
Tian Pan
Software Engineer

A fine-tuned 7B-parameter model running on a single RTX 4090 can outperform GPT-4 on domain-specific tasks while costing you nothing per token after the initial hardware investment. That is not a theoretical claim — Diabetica-7B, a diabetes-focused model, hit 87.2% accuracy on clinical queries, beating both GPT-4 and Claude 3.5 on the same benchmark. The catch? Getting there requires understanding exactly when edge inference makes sense and when it is an expensive distraction.

Most teams default to cloud APIs because they are easy — make an HTTP call, get tokens back. But that simplicity has costs that scale in ways engineers do not anticipate until it is too late, and those costs are not always measured in dollars.

Beam Search for Code Agents: Why Greedy Generation Is a Reliability Trap

· 11 min read
Tian Pan
Software Engineer

A code agent that passes 90% of HumanEval is not a reliable code agent. It's a code agent that performs well on problems designed to be solvable in a single pass. Give it a competitive programming problem with strict constraints, or a multi-file refactor with subtle interdependencies, and watch the pass rate crater to 20–30%. The model isn't failing because it lacks knowledge. It's failing because greedy, single-pass generation commits to the first plausible-looking token sequence and never looks back.

The fix isn't a better model. It's a better generation strategy. Recent research has established that applying tree exploration to code generation — branching across multiple candidate solutions, scoring partial programs, and pruning unpromising paths — improves pass rates by 30–130% on hard problems, with no change to the underlying model weights.
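A minimal skeleton of that branch-score-prune loop might look like the following; expand() and score() are hypothetical stand-ins for sampling continuations from the model and for running tests or a value model on partial programs.

```python
import heapq

# Skeleton of beam search over partial programs. In a real code agent,
# expand() would sample k continuations from the LLM and score() might
# run unit tests, a linter, or a learned value model on partial code.

BEAM_WIDTH = 3
MAX_STEPS = 5

def expand(partial: str) -> list[str]:
    # Hypothetical: propose candidate next lines for the program.
    return [partial + f"line{len(partial) % 7 + i}\n" for i in range(4)]

def score(partial: str) -> float:
    # Hypothetical: higher is better (e.g., fraction of tests passing).
    return -abs(len(partial) % 13 - 6)

def beam_search(seed: str) -> str:
    beam = [seed]
    for _ in range(MAX_STEPS):
        # Branch: expand every candidate still in the beam.
        candidates = [c for p in beam for c in expand(p)]
        # Prune: keep only the top-scoring partial programs.
        beam = heapq.nlargest(BEAM_WIDTH, candidates, key=score)
    # Commit to the best surviving candidate, not the first plausible one.
    return max(beam, key=score)

print(beam_search("def solve():\n"))
```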

Hybrid Cloud-Edge LLM Architecture: Routing Inference Where It Actually Belongs

· 9 min read
Tian Pan
Software Engineer

Most teams pick a side: run everything in the cloud, or compress a model to fit on-device. Both choices leave money and performance on the table. The teams getting the best results in 2025–2026 are doing neither — they're building hybrid architectures that route each inference request to the right tier based on complexity, latency budget, and data sensitivity.

The core insight is simple but underappreciated: 70–80% of production queries don't need a frontier model. They need a fast answer from a small model that sits close to the user. The remaining 20–30% genuinely benefit from a cloud-hosted heavyweight. The engineering challenge is building the routing layer that makes this split invisible.
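As a sketch of what that routing layer can look like, here is a toy rule-based router; the tiers, heuristics, and thresholds are all invented for illustration, and production systems often use a small classifier model instead of rules.

```python
from dataclasses import dataclass

# Toy request router for a hybrid edge/cloud deployment. Every
# heuristic and threshold here is a made-up placeholder.

@dataclass
class Query:
    text: str
    latency_budget_ms: int
    contains_pii: bool

def route(q: Query) -> str:
    if q.contains_pii:
        return "edge"            # sensitive data never leaves the device
    if q.latency_budget_ms < 100:
        return "edge"            # network RTT alone would blow the budget
    # Crude complexity proxy: long, multi-part queries go to the cloud.
    if len(q.text.split()) > 200 or "step by step" in q.text.lower():
        return "cloud"
    return "edge"                # default: the easy majority stays local

print(route(Query("translate 'hello'", latency_budget_ms=50, contains_pii=False)))
```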

Hybrid Cloud-Edge LLM Inference: The Routing Layer That Determines Your Cost, Latency, and Privacy Profile

· 10 min read
Tian Pan
Software Engineer

Most teams pick a side: run everything in the cloud, or push everything to the edge. Both are wrong for the majority of production workloads. The interesting engineering happens in the routing layer between them — the component that decides, per-request, whether a query deserves a 70B frontier model on an H100 or a 3B quantized model running on local silicon.

This routing decision isn't just about latency. It's a three-variable optimization across cost, privacy, and capability — and the optimal split changes based on your traffic patterns, regulatory environment, and what "good enough" means for each query type. Teams that get the routing right cut inference costs 60–80% while improving p95 latency. Teams that get it wrong either overspend on cloud GPUs for trivial queries or ship degraded answers from edge models that can't handle the complexity.
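One way to picture that three-variable optimization is as explicit per-tier scoring rather than rules. The sketch below is illustrative only: every tier name, price, and weight is a made-up placeholder, and the point is the shape of the decision, not the numbers.

```python
# Routing as multi-objective scoring: rate each tier on cost, privacy,
# and expected quality, then pick the argmax under hard constraints.

TIERS = {
    # tier: (cost per 1k tokens in $, privacy score 0-1, capability 0-1)
    "edge-3b":   (0.000, 1.0, 0.55),
    "cloud-70b": (0.004, 0.4, 0.95),
}

def route(est_difficulty: float, privacy_weight: float, budget_per_1k: float) -> str:
    best, best_score = None, float("-inf")
    for tier, (cost, privacy, capability) in TIERS.items():
        if cost > budget_per_1k:
            continue  # hard constraint: over budget
        # Penalize capability shortfall against the query's difficulty.
        quality = min(capability / max(est_difficulty, 1e-6), 1.0)
        score = quality + privacy_weight * privacy - 50 * cost
        if score > best_score:
            best, best_score = tier, score
    return best

# A hard query with a modest privacy weight lands on the cloud tier.
print(route(est_difficulty=0.9, privacy_weight=0.2, budget_per_1k=0.01))
```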

Hybrid Cloud-Edge LLM Inference: The Latency-Privacy-Cost Triangle That Determines Where Your Model Runs

· 11 min read
Tian Pan
Software Engineer

Most teams run every LLM call through a cloud API. It's the path of least resistance: no hardware to manage, no models to optimize, and the latest frontier capabilities are one HTTP request away. But as AI moves deeper into production — processing sensitive documents, powering real-time interactions, running on mobile devices — the assumption that cloud is always the right answer starts to crack.

The cracks show up in three places simultaneously. Latency: a 200ms network round-trip that's invisible in a chatbot becomes unacceptable in voice AI or real-time code completion. Privacy: data that leaves the device creates compliance surface area that legal teams increasingly won't sign off on. Cost: at high request volumes with low utilization variance, you're paying a significant premium for infrastructure you could own.

Hybrid Cloud-Edge LLM Inference: When On-Device Models Beat the Cloud

· 11 min read
Tian Pan
Software Engineer

Every token your LLM generates in the cloud costs money, adds latency, and sends user data across a network boundary. Every token generated on-device avoids all three—but caps out at what a phone or laptop GPU can handle. The interesting engineering happens at the boundary: deciding which queries deserve the cloud's frontier capabilities and which are better served by a 3B-parameter model running locally in under 20 milliseconds.

The hybrid cloud-edge inference pattern isn't theoretical. Apple Intelligence routes between on-device models and Private Cloud Compute. Google's Gemini Nano runs directly on Pixel and Samsung devices while escalating complex requests to cloud Gemini. These aren't demos—they're shipping at billion-device scale. And the underlying architecture is now accessible to any team willing to think carefully about the latency-privacy-cost triangle.

LLM Queuing Theory: Why Your Load Balancer Thinks in Requests While Your GPU Thinks in Tokens

· 11 min read
Tian Pan
Software Engineer

Your load balancer distributes requests evenly across your GPU fleet. Each instance gets roughly the same number of concurrent requests. Everything looks balanced. Yet one instance is crawling at 40 tokens per second while another hums along at 200. The dashboard shows equal request counts, but your users are experiencing wildly different latencies.

The problem is fundamental: traditional load balancing operates at the request level, but LLM inference costs scale with tokens. A single request asking for a 4,000-token essay consumes 50x more GPU time than a request generating an 80-token classification. Treating them as equivalent units is like a highway toll booth counting vehicles without distinguishing motorcycles from 18-wheelers.

This mismatch between request-level thinking and token-level reality is where classical queuing theory meets its most interesting modern challenge.
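A least-outstanding-tokens balancer is one token-level answer. The sketch below contrasts it with round-robin on the essay-versus-classification mix above; the class and numbers are illustrative, and it assumes output lengths can be estimated up front.

```python
import itertools

# Round-robin counts requests; a token-aware balancer tracks
# outstanding tokens per instance and routes to the least loaded.

class TokenAwareBalancer:
    def __init__(self, n_instances: int):
        self.outstanding = [0] * n_instances  # tokens in flight per GPU

    def pick(self, est_tokens: int) -> int:
        # Send the request to the instance with the least token debt.
        i = min(range(len(self.outstanding)), key=self.outstanding.__getitem__)
        self.outstanding[i] += est_tokens
        return i

    def complete(self, instance: int, est_tokens: int) -> None:
        self.outstanding[instance] -= est_tokens

# One 4,000-token essay followed by ten 80-token classifications.
jobs = [4000] + [80] * 10
rr = itertools.cycle(range(2))
rr_load, ta = [0, 0], TokenAwareBalancer(2)
for j in jobs:
    rr_load[next(rr)] += j
    ta.pick(j)
print("round-robin tokens per GPU:", rr_load)      # [4400, 400]
print("token-aware tokens per GPU:", ta.outstanding)  # [4000, 800]
# Round-robin parks half the short jobs behind the essay on GPU 0;
# the token-aware balancer keeps all of them on the idle GPU.
```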

MoE Models in Production: The Serving Quirks Dense-Model Benchmarks Hide

· 10 min read
Tian Pan
Software Engineer

Benchmarks told you Mixtral 8x7B costs half as much as a 46B dense model to run. What they didn't tell you is that it needs roughly 8.6× more GPU memory than an equivalent dense model, responds with wildly different latency depending on which token hit which expert, and falls apart at medium batch sizes in ways that take days to diagnose. Mixture-of-Experts architectures have become the backbone of nearly every frontier model — DeepSeek-V3, Llama 4, Gemini 1.5, Grok, Mistral Large — but the serving assumptions that work for dense models break in subtle, expensive ways for MoE.

If you're planning to self-host or route traffic to any of these models, here's what dense-model intuition gets wrong.
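As a first taste of where the intuition breaks, here is the back-of-envelope weight-memory math, assuming Mixtral-8x7B-like shapes (roughly 46.7B total parameters, 12.9B active per token); exact figures depend on the config, and this ignores KV cache and activations entirely.

```python
# Back-of-envelope MoE weight-memory math with approximate,
# Mixtral-8x7B-like parameter counts.

GB = 1024**3
BYTES_FP16 = 2

total_params  = 46.7e9   # every expert must be resident in GPU memory
active_params = 12.9e9   # parameters a single token actually touches

print(f"resident weights: {total_params * BYTES_FP16 / GB:.0f} GB")   # ~87 GB
print(f"active per token: {active_params * BYTES_FP16 / GB:.0f} GB")  # ~24 GB
# You provision memory for all ~47B parameters while each token's
# FLOPs touch only ~13B: the root of most MoE serving surprises.
```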

Self-Hosted LLMs in Production: The GPU Memory Math Nobody Tells You

· 10 min read
Tian Pan
Software Engineer

Most engineers who decide to self-host an LLM start with the same calculation: the model is 70B parameters, FP16 is 2 bytes per parameter, so that's 140 GB. They check that two A100-80GB GPUs fit 160 GB, feel satisfied, and order the hardware. Then they hit production and discover they've already run out of memory before serving a single real user.

The model weights are only part of the story. The piece that surprises almost every team is the KV cache — and understanding it changes every decision you make, from quantization choice to serving framework to how many GPUs you actually need.
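Here is the worked arithmetic, assuming a Llama-2-70B-style configuration (80 layers, grouped-query attention with 8 KV heads, head dimension 128, FP16 cache); swap in your own model's constants, because the shape of the math is what matters.

```python
# Worked KV-cache arithmetic for an assumed Llama-2-70B-style config.

GB = 1024**3
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2

# K and V each store (kv_heads * head_dim) values per layer per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(f"{kv_bytes_per_token / 1024:.0f} KiB per token")   # 320 KiB

context_len, concurrent = 4096, 50
kv_total = kv_bytes_per_token * context_len * concurrent
print(f"{kv_total / GB:.0f} GB of KV cache")              # ~62 GB

# 140 GB of weights + ~62 GB of KV cache already exceed the 160 GB of
# two A100-80GBs, before activations, CUDA context, or fragmentation.
```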

Continuous Batching: The Single Biggest GPU Utilization Unlock for LLM Serving

· 11 min read
Tian Pan
Software Engineer

Most LLM serving infrastructure failures in production aren't model failures—they're scheduling failures. Teams stand up a capable model, load test it, and discover they're burning expensive GPU time at 35% utilization while users wait. The culprit is almost always static batching: a default inherited from conventional deep learning that fundamentally doesn't fit how language models generate text.

Continuous batching—also called iteration-level scheduling or in-flight batching—is the mechanism that fixes this. It's not a tuning knob; it's an architectural change to how the serving loop runs. The difference between a system using it and one that isn't can be 4–8x in throughput for the same hardware.
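A minimal sketch of that serving loop, with illustrative names and sizes rather than any real engine's API, shows the architectural difference: requests are admitted and retired every decode iteration instead of at batch boundaries.

```python
from collections import deque

# Minimal iteration-level scheduling loop. Static batching would hold
# each batch open until its longest member finished; here a freed slot
# is refilled on the very next decode step.

MAX_BATCH = 4

def serve(incoming: deque) -> int:
    batch, step = [], 0
    while True:
        # Retire finished requests immediately, not at batch boundaries.
        batch = [r for r in batch if r["remaining"] > 0]
        # Backfill freed slots from the queue every iteration.
        while incoming and len(batch) < MAX_BATCH:
            batch.append(incoming.popleft())
        if not batch:
            return step
        # One fused forward pass decodes one token for every request.
        for r in batch:
            r["remaining"] -= 1
        step += 1

long_ = [{"remaining": 100}]
short = [{"remaining": 10} for _ in range(6)]
print(serve(deque(long_ + short)))   # 100 steps
# Static batching on the same mix would take ~110 steps, since the
# second wave of short requests would wait out the 100-token request.
```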