
24 posts tagged with "inference"


LoRA Adapter Composition in Production: Running Multiple Fine-Tuned Skills Without Model Wars

· 9 min read
Tian Pan
Software Engineer

The promise sounds clean: fine-tune lightweight LoRA adapters for each specialized skill — one for professional tone, one for JSON formatting, one for medical terminology, one for safety guardrails — then combine them at serving time. Teams ship this design; it works fine in development, then falls apart in production when two adapters start fighting over the same weight regions and output quality collapses to something indistinguishable from the base model with no adapters applied. Not slightly worse. Completely untuned.

This post is about what happens when you compose adapters in practice, why naive merging fails so reliably, and what strategies actually work at production scale.
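To make the failure mode concrete, here is a minimal numpy sketch of what naive composition does mathematically: each adapter contributes a low-rank delta, and merging simply sums those deltas into the same base weight matrix. The dimensions, ranks, and scaling are illustrative, not taken from the post.

```python
import numpy as np

# Naive LoRA composition: each adapter i contributes a low-rank delta B_i @ A_i,
# and "merging" sums those deltas into the same base weights. Sizes illustrative.
d, r = 512, 8                                  # hidden size, adapter rank
rng = np.random.default_rng(0)
W_base = rng.standard_normal((d, d)) * 0.02    # frozen base weight matrix

def lora_delta(rng, d, r, alpha=16.0):
    """One adapter's low-rank update, scaled by alpha / r as in LoRA."""
    A = rng.standard_normal((r, d)) * 0.01
    B = rng.standard_normal((d, r)) * 0.01
    return (alpha / r) * (B @ A)

delta_tone = lora_delta(rng, d, r)   # e.g. "professional tone" adapter
delta_json = lora_delta(rng, d, r)   # e.g. "JSON formatting" adapter

# Naive merge: both deltas land on the same weight regions of the same matrix.
W_merged = W_base + delta_tone + delta_json

# A rough interference proxy: directional overlap between the two updates.
cos = np.sum(delta_tone * delta_json) / (
    np.linalg.norm(delta_tone) * np.linalg.norm(delta_json)
)
print(f"cosine similarity between adapter deltas: {cos:.3f}")
```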

The On-Device LLM Problem Nobody Talks About: Model Update Propagation

· 12 min read
Tian Pan
Software Engineer

Most engineers who build on-device LLM features spend their time solving the problems that are easy to see: quantization, latency, memory limits. The model fits on the phone, inference is fast enough, and the demo looks great. Then they ship to millions of devices and discover a harder problem that nobody warned them about: you now have millions of independent compute nodes running different versions of your AI model, and you have no reliable way to know which one any given user is running.

Cloud inference is boring in the best way. You update the model, redeploy the server, and within minutes the entire user base is running the new version. On-device inference breaks this assumption entirely. A user who last opened your app three months ago is still running the model that was current then — and there's no clean way to force an update, no server-side rollback, and no simple way to detect the mismatch without adding instrumentation you probably didn't build from the start.

This version fragmentation is the central operational challenge of on-device AI, and it has consequences that reach far beyond a slow rollout. It creates silent capability drift, complicates incident response, and turns your "AI feature" into a heterogeneous fleet of independently behaving systems that you're responsible for but can't directly control.
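A small sketch of the instrumentation the post argues you need from day one: attach the on-device model's identity to every inference event so the backend can see which versions are actually live in the fleet. The field names and the transport are hypothetical placeholders.

```python
import hashlib
import json
import time

def model_fingerprint(model_path: str) -> str:
    """Hash the model file so a 'version 1.3' claim can be verified byte-for-byte."""
    h = hashlib.sha256()
    with open(model_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:16]

def inference_event(model_path: str, declared_version: str, latency_ms: float) -> str:
    return json.dumps({
        "ts": time.time(),
        "model_version": declared_version,                      # what the app thinks it shipped
        "model_sha256_prefix": model_fingerprint(model_path),   # what is actually on disk
        "latency_ms": latency_ms,
    })

# send_event(inference_event("model.gguf", "1.3.0", 42.0))  # hypothetical transport
```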

Browser-Native LLM Inference: The WebGPU Engineering You Didn't Know You Needed

· 10 min read
Tian Pan
Software Engineer

Most AI features are architected the same way: user input travels to an API, a cloud GPU processes it, and a response travels back. That round trip is so normalized that engineers rarely question it. But it carries a hidden tax: 200–800ms of network latency on every interaction, an API key that must live somewhere accessible (and therefore vulnerable), and a hard dependency on uptime you don't control.

Browser-native LLM inference via WebGPU breaks all three of those assumptions. The model runs on the user's GPU, inside a browser sandbox, with no network round-trip. This isn't a future capability — as of late 2025, WebGPU ships by default across Chrome, Firefox, Edge, and Safari, covering roughly 82.7% of global browser traffic. The engineering question has shifted from "can we do this?" to "when does it beat the cloud, and how do we route intelligently between the two?"

The Compression Decision: Quantization, Distillation, and On-Device Inference for Latency-Critical AI Features

· 10 min read
Tian Pan
Software Engineer

Model routing is the first optimization most teams reach for. Route simple queries to a small cheap model, complex ones to a large capable model. It works well for managing cost and throughput. What it cannot fix is the wall you hit when the physics of cloud inference collide with a latency requirement of 100ms or less. A network round-trip from a mid-tier data center already consumes 30–80ms before a single token is generated. At that point, routing is irrelevant — you need to either run the model closer to the user or run a substantially smaller model. Both paths require compression decisions that most teams approach without a framework.

This is a guide for making those decisions. The three techniques — quantization, knowledge distillation, and on-device deployment — solve overlapping problems but have very different cost structures, quality profiles, and operational consequences.
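The latency wall is easy to see with back-of-envelope arithmetic using the numbers above (30–80ms of network round-trip against a 100ms target). The queueing overhead below is an illustrative assumption.

```python
# Back-of-envelope latency budget for a 100ms interaction.
TARGET_MS = 100.0

def remaining_budget(network_rtt_ms: float, queueing_ms: float = 10.0) -> float:
    """Time left for actual model compute after unavoidable cloud overheads."""
    return TARGET_MS - network_rtt_ms - queueing_ms

for rtt in (30.0, 50.0, 80.0):
    left = remaining_budget(rtt)
    verdict = "cloud might fit" if left > 30 else "needs on-device or a much smaller model"
    print(f"RTT {rtt:>4.0f}ms -> {left:>5.1f}ms left for inference: {verdict}")
```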

On-Device LLM Inference: When to Move AI Off the Cloud

· 11 min read
Tian Pan
Software Engineer

Most teams discover that running AI inference in the cloud has sharp edges only after they've already hit them: a HIPAA audit that traces back to PHI crossing API boundaries, latency numbers in staging that look fine until a user on a spotty connection reports "it just spins," or a per-inference API bill that looked reasonable at 10,000 requests per day and catastrophic at 10 million. On-device inference is often the right answer — but the reasons teams reach for it, and the problems they hit when they do, are rarely the same ones that show up in blog post comparisons.

This is a practical guide to the decision: when local execution beats cloud APIs, which small models actually deliver, and what the deployment lifecycle looks like once the benchmark demo is over.
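The cost cliff in that excerpt is just multiplication, but it's worth doing explicitly. The per-request API price below is an assumption made up for illustration.

```python
# Illustrative arithmetic behind "reasonable at 10,000 requests/day,
# catastrophic at 10 million." The per-request price is an assumption.
PRICE_PER_REQUEST = 0.002   # assumed $/request for a hosted API

for daily_requests in (10_000, 1_000_000, 10_000_000):
    monthly_cost = daily_requests * PRICE_PER_REQUEST * 30
    print(f"{daily_requests:>10,} req/day -> ${monthly_cost:>10,.0f}/month")
# On-device inference replaces this curve with a roughly fixed engineering and
# distribution cost, which is why the comparison flips as volume grows.
```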

Grammar-Constrained Generation: The Output Reliability Technique Most Teams Skip

· 10 min read
Tian Pan
Software Engineer

Most teams that need structured LLM output follow the same playbook: write a prompt that says "respond only with valid JSON," parse the response, run Pydantic validation, and if it fails, retry with the error message appended. This works often enough to ship. It also fails in production at exactly the worst moments — under load, on edge-case inputs, and with cheaper models that don't follow instructions as reliably as GPT-4.

Grammar-constrained generation is a fundamentally different approach. Instead of asking the model nicely and checking afterward, it makes structurally invalid outputs mathematically impossible. The model cannot drop a closing brace, emit a non-existent enum value, or skip a required field — because the tokens that would produce those mistakes are filtered out before sampling. Not unlikely. Impossible.

Most teams skip it. They shouldn't.
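Here is a toy sketch of the core move, not a production grammar engine: before each sampling step, set the logits of every token the grammar disallows to negative infinity, so invalid structure has zero probability. The vocabulary and "grammar" below are deliberately tiny assumptions.

```python
import numpy as np

VOCAB = ['{', '}', '"status"', ':', '"ok"', '"error"', 'banana']

def allowed_tokens(generated: list[str]) -> set[int]:
    """A hard-coded stand-in for a real grammar state machine over VOCAB."""
    if not generated:
        return {VOCAB.index('{')}
    if generated[-1] == '{':
        return {VOCAB.index('"status"')}
    if generated[-1] == '"status"':
        return {VOCAB.index(':')}
    if generated[-1] == ':':
        return {VOCAB.index('"ok"'), VOCAB.index('"error"')}
    return {VOCAB.index('}')}

def constrained_step(logits: np.ndarray, generated: list[str]) -> str:
    # Mask out every token the grammar disallows, then decode over what remains.
    mask = np.full_like(logits, -np.inf)
    for idx in allowed_tokens(generated):
        mask[idx] = 0.0
    return VOCAB[int(np.argmax(logits + mask))]

rng = np.random.default_rng(1)
out: list[str] = []
for _ in range(5):
    out.append(constrained_step(rng.standard_normal(len(VOCAB)), out))
print(''.join(out))   # always structurally valid, e.g. {"status":"ok"}
```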

GPU Scheduling for Mixed LLM Workloads: The Bin-Packing Problem Nobody Solves Well

· 10 min read
Tian Pan
Software Engineer

Most GPU clusters running LLM inference are wasting between 30% and 50% of their available compute. Not because engineers are careless, but because the scheduling problem is genuinely hard—and the tools most teams reach for first were never designed for it.

The standard approach is to stand up Kubernetes, request whole GPUs per pod, and let the scheduler figure it out. This works fine for training jobs. For inference across a heterogeneous set of models, it quietly destroys utilization. A cluster running three different 7B models with sporadic traffic will find each GPU busy less than 15% of the time, while remaining fully "allocated" and refusing to schedule new work.

The root cause is a mismatch between how Kubernetes thinks about GPUs and what LLM inference actually requires.
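As a point of contrast with whole-GPU allocation, here is a minimal sketch of packing model memory footprints onto GPUs with first-fit-decreasing. The footprints and card size are illustrative assumptions, and real packing also has to account for KV-cache growth and interference, which this ignores.

```python
GPU_MEMORY_GB = 80.0   # e.g. an A100/H100-class card

models = {                    # model -> approximate resident footprint in GB (illustrative)
    "chat-7b-int8": 14.0,
    "code-7b-int8": 14.0,
    "embed-7b-int8": 14.0,
    "rerank-3b": 8.0,
}

def first_fit_decreasing(models: dict[str, float], gpu_gb: float) -> list[list[str]]:
    """Pack largest models first into the first GPU with enough free memory."""
    gpus: list[list[str]] = []
    free: list[float] = []
    for name, size in sorted(models.items(), key=lambda kv: -kv[1]):
        for i, f in enumerate(free):
            if size <= f:
                gpus[i].append(name)
                free[i] -= size
                break
        else:
            gpus.append([name])
            free.append(gpu_gb - size)
    return gpus

packing = first_fit_decreasing(models, GPU_MEMORY_GB)
print(f"{len(packing)} GPU(s) instead of {len(models)}:")
for i, names in enumerate(packing):
    print(f"  gpu{i}: {names}")
```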

The Non-Determinism Tax: Building Reliable Pipelines on Probabilistic Infrastructure

· 9 min read
Tian Pan
Software Engineer

Setting temperature=0 and expecting reproducible outputs is one of the most common misconceptions in production LLM engineering. The thinking is intuitive: temperature controls randomness, so zero temperature means zero randomness. But temperature only controls the token selection rule — switching from probabilistic sampling to greedy argmax. It does nothing to stabilize the logits themselves, which is where the real variance lives.

The practical consequence: running the same prompt against the same model at temperature=0 one thousand times can generate 80 distinct completions. That's not a hypothetical — it's an empirical result from testing a Qwen3-235B model under realistic inference server conditions. Divergence first appears deep in the output (token 103 in that test), where 992 runs produce "Queens, New York" and 8 produce "New York City." Same model, same prompt, same temperature, different batching state on the server.
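A toy numeric illustration of the mechanism: greedy decoding removes sampling randomness, but any tiny perturbation of the logits — for example from batch-dependent floating-point reduction order on the server — can flip the argmax when two candidates are nearly tied. The values below are made up to show the effect.

```python
import numpy as np

# Two near-tied candidate tokens; the third is a distractor. Values illustrative.
logits = np.array([4.1230, 4.1229, 1.0])

greedy_run_a = int(np.argmax(logits))                                 # one batching state
greedy_run_b = int(np.argmax(logits + np.array([0.0, 2e-4, 0.0])))    # tiny numeric drift

print(greedy_run_a, greedy_run_b)   # 0 vs 1: same prompt, same temperature, different token
```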

Knowledge Distillation Economics: When Compressing a Frontier Model Actually Pays Off

· 11 min read
Tian Pan
Software Engineer

Most teams that reach for knowledge distillation do it for the wrong reasons and at the wrong time. They see a 70B model blowing their inference budget, read that distillation can produce a 7B student that's "just as good," and start immediately. Six weeks later they have a distilled model that scores well on their validation set, ships to production, and begins producing confident nonsense at scale. The validation set was drawn from the same distribution as the teacher's synthetic training data. Real traffic was not.

Distillation is an optimization tool, not a capability upgrade. The economics only work under specific conditions — and the failure modes are subtle enough that teams often don't detect them until users do.
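One way to make "the economics only work under specific conditions" concrete is a back-of-envelope amortization. Every dollar figure below is an assumption invented for illustration; the point is only that the one-time cost has to be recouped by sustained serving volume.

```python
ONE_TIME_COST = 60_000.0        # assumed: synthetic data generation, training runs, eval, eng time
COST_70B_PER_1K_REQ = 2.00      # assumed serving cost of the teacher-sized model
COST_7B_PER_1K_REQ = 0.25       # assumed serving cost of the distilled student

savings_per_1k = COST_70B_PER_1K_REQ - COST_7B_PER_1K_REQ
breakeven_requests = ONE_TIME_COST / savings_per_1k * 1_000
print(f"break-even at ~{breakeven_requests:,.0f} requests")

for daily in (50_000, 500_000, 5_000_000):
    days = breakeven_requests / daily
    print(f"at {daily:>9,} req/day: ~{days:,.0f} days to recoup the distillation cost")
```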

When Thinking Models Actually Help: A Production Decision Framework for Inference-Time Compute

· 10 min read
Tian Pan
Software Engineer

There is a study where researchers asked a reasoning model to compare two numbers: 0.9 and 0.11. The model took 42 seconds to answer. The math took a millisecond. The model spent the remaining 41.9 seconds thinking — badly. It re-examined its answer, doubted itself, reconsidered, and arrived at the correct conclusion it had already reached in its first three tokens.

This is the overthinking problem, and it is not a corner case. It is what happens when you apply inference-time compute indiscriminately to tasks that don't need it.

The emergence of reasoning models — o1, o3, DeepSeek R1, Claude with extended thinking — represents a genuine capability leap for hard problems. It also introduces a new class of production mistakes: deploying expensive, slow deliberation where fast, cheap generation was perfectly adequate. Getting this decision right is increasingly central to building AI systems that actually work.

LLM Latency Decomposition: Why TTFT and Throughput Are Different Problems

· 11 min read
Tian Pan
Software Engineer

Most engineers building on LLMs treat latency as a single dial. They tune something — a batch size, a quantization level, an instance type — observe whether "it got faster," and call it done. This works until you hit production and discover that your p50 TTFT looks fine while your p99 is over 3 seconds, or that the optimization that doubled your throughput somehow made individual users feel the system got slower.

TTFT and throughput are not two ends of the same slider. They are governed by fundamentally different physics, degraded by different bottlenecks, and fixed by different techniques. Treating them as interchangeable is the root cause of most LLM inference incidents I've seen in production.
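A small sketch of measuring the two metrics separately from one streaming request. The `stream_tokens` callable is a hypothetical stand-in for whatever streaming client you use; the point is that a single request yields two numbers that are bottlenecked by different things.

```python
import time

def measure(stream_tokens, prompt: str) -> dict:
    """Split one streaming request into TTFT and decode throughput."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _token in stream_tokens(prompt):
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now          # everything before this is queueing + prefill
        n_tokens += 1
    end = time.perf_counter()
    decode_seconds = (end - first_token_at) if first_token_at else 0.0
    return {
        "ttft_ms": (first_token_at - start) * 1000 if first_token_at else None,
        "decode_tok_per_s": (n_tokens - 1) / decode_seconds if decode_seconds > 0 else None,
    }

# Track p50/p99 of ttft_ms and decode_tok_per_s as separate SLOs; optimizing one
# (e.g. bigger batches for throughput) can quietly degrade the other.
```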

LLM Latency in Production: What Actually Moves the Needle

· 10 min read
Tian Pan
Software Engineer

Most LLM latency advice falls into one of two failure modes: it focuses on the wrong metric, or it recommends optimizations that are too hardware-specific to apply unless you're running your own inference cluster. If you're building on top of a hosted API or a managed inference provider, a lot of that advice is noise.

This post focuses on what actually moves the needle — techniques that apply whether you control the stack or not, grounded in production data rather than benchmark lab conditions.