33 posts tagged with "inference"

The Compression Decision: Quantization, Distillation, and On-Device Inference for Latency-Critical AI Features

April 17, 2026 · 10 min read

Software Engineer

Model routing is the first optimization most teams reach for. Route simple queries to a small cheap model, complex ones to a large capable model. It works well for managing cost and throughput. What it cannot fix is the wall you hit when the physics of cloud inference collide with a latency requirement of 100ms or less. A network round-trip from a mid-tier data center already consumes 30–80ms before a single token is generated. At that point, routing is irrelevant — you need to either run the model closer to the user or run a substantially smaller model. Both paths require compression decisions that most teams approach without a framework.

This is a guide for making those decisions. The three techniques — quantization, knowledge distillation, and on-device deployment — solve overlapping problems but have very different cost structures, quality profiles, and operational consequences.

On-Device LLM Inference: When to Move AI Off the Cloud

April 17, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams discover that running AI inference in the cloud has sharp edges only after they've already hit them: a HIPAA audit that traces back to PHI crossing API boundaries, latency numbers in staging that look fine until a user on a spotty connection reports "it just spins," or a per-inference API bill that looked reasonable at 10,000 requests per day and catastrophic at 10 million. On-device inference is often the right answer — but the reasons teams reach for it, and the problems they hit when they do, are rarely the same ones that show up in blog post comparisons.

This is a practical guide to the decision: when local execution beats cloud APIs, which small models actually deliver, and what the deployment lifecycle looks like once the benchmark demo is over.

Grammar-Constrained Generation: The Output Reliability Technique Most Teams Skip

April 16, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams that need structured LLM output follow the same playbook: write a prompt that says "respond only with valid JSON," parse the response, run Pydantic validation, and if it fails, retry with the error message appended. This works often enough to ship. It also fails in production at exactly the worst moments — under load, on edge-case inputs, and with cheaper models that don't follow instructions as reliably as GPT-4.

Grammar-constrained generation is a fundamentally different approach. Instead of asking the model nicely and checking afterward, it makes structurally invalid outputs mathematically impossible. The model cannot emit a missing brace, a non-existent enum value, or a required field it forgot — because those tokens are filtered out before sampling. Not unlikely. Impossible.

Most teams skip it. They shouldn't.

GPU Scheduling for Mixed LLM Workloads: The Bin-Packing Problem Nobody Solves Well

April 14, 2026 · 10 min read

Tian Pan

Software Engineer

Most GPU clusters running LLM inference are wasting between 30% and 50% of their available compute. Not because engineers are careless, but because the scheduling problem is genuinely hard—and the tools most teams reach for first were never designed for it.

The standard approach is to stand up Kubernetes, request whole GPUs per pod, and let the scheduler figure it out. This works fine for training jobs. For inference across a heterogeneous set of models, it quietly destroys utilization. A cluster running three different 7B models with sporadic traffic will find each GPU busy less than 15% of the time, while remaining fully "allocated" and refusing to schedule new work.

The root cause is a mismatch between how Kubernetes thinks about GPUs and what LLM inference actually requires.

The Non-Determinism Tax: Building Reliable Pipelines on Probabilistic Infrastructure

April 10, 2026 · 9 min read

Tian Pan

Software Engineer

Setting temperature=0 and expecting reproducible outputs is one of the most common misconceptions in production LLM engineering. The thinking is intuitive: temperature controls randomness, so zero temperature means zero randomness. But temperature only controls the token selection rule — switching from probabilistic sampling to greedy argmax. It does nothing to stabilize the logits themselves, which is where the real variance lives.

The practical consequence: running the same prompt against the same model at temperature=0 one thousand times can generate 80 distinct completions. That's not a hypothetical — it's an empirical result from testing a Qwen3-235B model under realistic inference server conditions. Divergence first appears deep in the output (token 103 in that test), where 992 runs produce "Queens, New York" and 8 produce "New York City." Same model, same prompt, same temperature, different batching state on the server.

Knowledge Distillation Economics: When Compressing a Frontier Model Actually Pays Off

April 9, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams that reach for knowledge distillation do it for the wrong reasons and at the wrong time. They see a 70B model blowing their inference budget, read that distillation can produce a 7B student that's "just as good," and start immediately. Six weeks later they have a distilled model that scores well on their validation set, ships to production, and begins producing confident nonsense at scale. The validation set was drawn from the same distribution as the teacher's synthetic training data. Real traffic was not.

Distillation is an optimization tool, not a capability upgrade. The economics only work under specific conditions — and the failure modes are subtle enough that teams often don't detect them until users do.

When Thinking Models Actually Help: A Production Decision Framework for Inference-Time Compute

April 9, 2026 · 10 min read

Tian Pan

Software Engineer

There is a study where researchers asked a reasoning model to compare two numbers: 0.9 and 0.11. One model took 42 seconds to answer. The math took a millisecond. The model spent the remaining 41.9 seconds thinking — badly. It re-examined its answer, doubted itself, reconsidered, and arrived at the correct conclusion it had already reached in its first three tokens.

This is the overthinking problem, and it is not a corner case. It is what happens when you apply inference-time compute indiscriminately to tasks that don't need it.

The emergence of reasoning models — o1, o3, DeepSeek R1, Claude with extended thinking — represents a genuine capability leap for hard problems. It also introduces a new class of production mistakes: deploying expensive, slow deliberation where fast, cheap generation was perfectly adequate. Getting this decision right is increasingly central to building AI systems that actually work.

LLM Latency Decomposition: Why TTFT and Throughput Are Different Problems

March 10, 2026 · 11 min read

Tian Pan

Software Engineer

Most engineers building on LLMs treat latency as a single dial. They tune something — a batch size, a quantization level, an instance type — observe whether "it got faster," and call it done. This works until you hit production and discover that your p50 TTFT looks fine while your p99 is over 3 seconds, or that the optimization that doubled your throughput somehow made individual users feel the system got slower.

TTFT and throughput are not two ends of the same slider. They are caused by fundamentally different physics, degraded by different bottlenecks, and fixed by different techniques. Treating them as interchangeable is the root cause of most LLM inference incidents I've seen in production.

LLM Latency in Production: What Actually Moves the Needle

October 30, 2025 · 10 min read

Tian Pan

Software Engineer

Most LLM latency advice falls into one of two failure modes: it focuses on the wrong metric, or it recommends optimizations that are too hardware-specific to apply unless you're running your own inference cluster. If you're building on top of a hosted API or a managed inference provider, a lot of that advice is noise.

This post focuses on what actually moves the needle — techniques that apply whether you control the stack or not, grounded in production data rather than benchmark lab conditions.

About Tian Pan