
17 posts tagged with "inference"


The Cancellation Tax: Your Inference Bill After the User Hits Stop

· 9 min read
Tian Pan
Software Engineer

Your stop button is a lie. When a user clicks it, your UI stops rendering tokens; your provider, in most configurations, keeps generating them. The bytes never reach a browser, but they reach your invoice. The gap between what the user saw and what you paid for is the cancellation tax, and it is the single most under-reported line item on AI cost dashboards.

The reason the tax exists is structural. Autoregressive inference is a GPU-bound pipeline: by the time your client closes the TCP connection, the model has already been scheduled, KV-cached, and is emitting tokens at 30–200 per second. Most serving stacks do not check for client liveness between tokens. They finish the job, log the usage, and bill you. The client saw ten tokens; the log recorded eight hundred. Langfuse, Datadog, and every other observability platform will faithfully report the eight hundred, because that's what the provider's usage block reported.
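
A minimal sketch of the bounded version of the problem, assuming the openai Python SDK against an OpenAI-compatible endpoint; `user_hit_stop` is a hypothetical stand-in for your UI's stop signal:

```python
# Stream tokens, cut the connection on "stop", and bound the worst case with
# max_tokens. Assumes the openai Python SDK and an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI()

def user_hit_stop(tokens_seen: int) -> bool:
    # Hypothetical stand-in for the UI's stop button.
    return tokens_seen >= 10

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Write a long essay."}],
    stream=True,
    max_tokens=256,  # caps the provider's worst-case overrun after you hang up
)

tokens_seen = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        tokens_seen += 1  # roughly one token per streamed chunk
    if user_hit_stop(tokens_seen):
        stream.close()  # drops the HTTP connection; server-side generation may continue
        break

# The invoice can still reflect up to max_tokens, not tokens_seen.
# That difference is the cancellation tax.
print(f"client saw ~{tokens_seen} tokens; billed completion may be up to 256")
```

Capping `max_tokens` doesn't eliminate the tax, but it turns an unbounded overrun into a bounded one.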

Your LLM Span Is Lying: What APM Tools Don't Show About Inference Latency

· 8 min read
Tian Pan
Software Engineer

Your LLM call took 2,340 ms. Your APM span says so. That number is the most expensive lie in your observability stack, because four completely different failure modes all render as the same opaque purple bar. A prefill surge on a long prompt. A cold KV-cache on a tenant you haven't hit in an hour. A noisy neighbor in the provider's continuous batch. A silent routing change that parked your traffic in a different region. Same span. Same duration. Same p99 alert. Four different post-mortems.

The distributed-tracing discipline that worked for microservices — one span per network hop, a duration, a few tags — does not survive contact with hosted inference. An LLM call is not one thing. It's a pipeline of phases with radically different scaling characteristics, running on shared hardware whose behavior depends on who else is in the queue. Treating that as a single opaque span is how you end up spending three days debugging "the model got slow" when the model didn't move at all.
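
A minimal sketch of the alternative, assuming the OpenTelemetry Python API and an OpenAI-compatible streaming client; the `llm.ttft_ms` and `llm.decode_ms` attribute names are our own convention, not a standard:

```python
# Split one opaque LLM span into phases: time-to-first-token (queue wait +
# prefill) vs. decode throughput, recorded as separate span attributes.
import time

from openai import OpenAI
from opentelemetry import trace

tracer = trace.get_tracer("llm.client")
client = OpenAI()

with tracer.start_as_current_span("llm.chat") as span:
    start = time.perf_counter()
    first_token_at = None
    tokens = 0

    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Summarize queuing theory."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            tokens += 1
    end = time.perf_counter()

    if first_token_at is not None:
        ttft_ms = (first_token_at - start) * 1000  # queue wait + prefill
        decode_ms = (end - first_token_at) * 1000  # pure decode
        span.set_attribute("llm.ttft_ms", ttft_ms)
        span.set_attribute("llm.decode_ms", decode_ms)
        span.set_attribute("llm.decode_tokens_per_s",
                           tokens / max(decode_ms / 1000, 1e-9))
```

With those two attributes, a prefill surge and a decode slowdown stop rendering as the same purple bar.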

The Model Bill Is 30% of Your Inference Cost

· 8 min read
Tian Pan
Software Engineer

A finance lead at a mid-sized AI company told me last quarter they had "optimized their LLM spend" by switching their agent backbone from Sonnet to Haiku. The token bill dropped 22%. The total inference cost per resolved ticket went down 4%. When we pulled the full decomposition, the model line item was roughly a third of the per-request cost. Retrieval, reranking, observability, retry amplification, and the human-in-the-loop review queue ate the rest — and none of those got cheaper when they swapped models.

This is the most common accounting error I see in AI teams right now. Token cost is the line item on the invoice you pay every month, so it becomes the number everyone optimizes. But for any non-trivial production system — RAG, agents, anything with tool use or evaluation gates — model inference is often 30–50% of the real unit economics. The rest sits in places your engineering dashboard doesn't surface and your finance team doesn't categorize as "AI spend."
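
A sketch of the decomposition; every number below is illustrative rather than the company's actual figures, but the arithmetic is the point:

```python
# Per-request unit economics: swapping models only moves one line item.
from dataclasses import dataclass

@dataclass
class PerRequestCost:
    model_tokens: float        # the provider invoice everyone stares at
    retrieval: float           # vector DB queries, embeddings
    reranking: float
    observability: float       # tracing, log storage
    retry_amplification: float
    human_review: float        # amortized HITL queue cost

    def total(self) -> float:
        return (self.model_tokens + self.retrieval + self.reranking
                + self.observability + self.retry_amplification + self.human_review)

before = PerRequestCost(0.030, 0.012, 0.008, 0.006, 0.015, 0.020)
after = PerRequestCost(0.030 * 0.78, 0.012, 0.008, 0.006, 0.015, 0.020)  # tokens 22% cheaper

print(f"model share of total: {before.model_tokens / before.total():.0%}")  # ~33%
print(f"total saving: {1 - after.total() / before.total():.1%}")            # ~7%, not 22%
```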

The Inference Optimization Trap: Why Making One Model Faster Can Slow Down Your System

· 9 min read
Tian Pan
Software Engineer

You swap your expensive LLM for a faster, cheaper distilled model. Latency goes up. Costs increase. Quality degrades. You roll back, confused, having just spent three weeks on optimization work that made everything worse.

This isn't a hypothetical. It's one of the most common failure modes in production AI systems, and it stems from a seductive but wrong mental model: that optimizing a component optimizes the system.
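
A back-of-envelope sketch of one mechanism behind the trap, retry amplification; all numbers are illustrative:

```python
# With retry-until-success, expected end-to-end cost and latency are the
# per-call numbers divided by the success rate (geometric number of attempts).
def expected_cost(cost_per_call: float, success_rate: float) -> float:
    return cost_per_call / success_rate

def expected_latency(latency_s: float, success_rate: float) -> float:
    return latency_s / success_rate  # failed attempts also burn wall-clock time

big = {"cost": 0.010, "latency": 2.0, "success": 0.97}
small = {"cost": 0.004, "latency": 0.8, "success": 0.35}  # cheaper, faster per call

for name, m in (("big model", big), ("distilled model", small)):
    print(name,
          f"cost/success={expected_cost(m['cost'], m['success']):.4f}",
          f"latency/success={expected_latency(m['latency'], m['success']):.2f}s")
# big model:       cost/success=0.0103  latency/success=2.06s
# distilled model: cost/success=0.0114  latency/success=2.29s
# Cheaper and faster per call, yet costlier and slower per successful request.
```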

What Your Inference Provider Is Hiding From You: KV Cache, Batching, and the Latency Floor

· 11 min read
Tian Pan
Software Engineer

You're running an LLM-powered application and your p99 latency is 4 seconds. You've tuned your prompts, reduced output length, and switched to streaming. The number barely moves. The problem is not your code — it's physics and queuing theory operating inside a black box you don't own.

Every inference provider makes dozens of architectural decisions that determine your application's performance ceiling before your first API call. KV cache eviction policy, continuous batching schedules, chunked prefill chunk size — none of this is in the docs, none of it is configurable by you, and all of it shapes the latency and cost curve you're stuck with.

This post explains what's actually happening inside inference infrastructure, why it creates an unavoidable latency floor, and the handful of things you can actually do about it.
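
As a preview, here is a back-of-envelope model of that floor. The phase structure (queue wait, then compute-bound prefill, then bandwidth-bound decode) is real; every number below is illustrative, and real providers expose none of these knobs:

```python
def latency_floor_ms(prompt_tokens: int, output_tokens: int,
                     prefill_tps: float, decode_tps: float,
                     queue_wait_ms: float) -> float:
    prefill_ms = prompt_tokens / prefill_tps * 1000  # compute-bound, batch-friendly
    decode_ms = output_tokens / decode_tps * 1000    # memory-bandwidth-bound
    return queue_wait_ms + prefill_ms + decode_ms

# A 6k-token RAG prompt, a 200-token answer, plausible shared-fleet numbers:
floor = latency_floor_ms(prompt_tokens=6000, output_tokens=200,
                         prefill_tps=8000, decode_tps=60, queue_wait_ms=150)
print(f"{floor:.0f} ms")  # ~4,230 ms: a 4-second p99 with no code of yours involved
```

Note where the time goes: decode dominates, and nothing you tune client-side changes `decode_tps`.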

LoRA Adapter Composition in Production: Running Multiple Fine-Tuned Skills Without Model Wars

· 9 min read
Tian Pan
Software Engineer

The promise sounds clean: fine-tune lightweight LoRA adapters for each specialized skill — one for professional tone, one for JSON formatting, one for medical terminology, one for safety guardrails — then combine them at serving time. Teams ship this design; it works fine in development, then falls apart in production when two adapters start fighting over the same weight regions and output quality collapses to something indistinguishable from the base model. Not slightly worse. Completely untuned.

This post is about what happens when you compose adapters in practice, why naive merging fails so reliably, and what strategies actually work at production scale.
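
A minimal sketch of the naive weight-sum merge in question, in PyTorch with illustrative shapes and scales:

```python
# Two LoRA adapters applied to the same base weight by summing their deltas.
import torch

torch.manual_seed(0)
d_model, r = 512, 8
W = torch.randn(d_model, d_model) * 0.02  # base weight

def lora_delta(alpha: float = 16.0) -> torch.Tensor:
    # Standard LoRA update: (alpha / r) * B @ A with low-rank A, B.
    A = torch.randn(r, d_model) * 0.1
    B = torch.randn(d_model, r) * 0.1
    return (alpha / r) * (B @ A)

delta_tone = lora_delta()   # "professional tone" adapter (illustrative)
delta_json = lora_delta()   # "JSON formatting" adapter (illustrative)

W_merged = W + delta_tone + delta_json  # naive composition

# Where the two deltas act on overlapping subspaces they interfere: the merged
# update is not "tone + json", and neither skill's direction survives intact.
overlap = torch.nn.functional.cosine_similarity(
    delta_tone.flatten(), delta_json.flatten(), dim=0)
drift = (W_merged - W).norm() / W.norm()
print(f"adapter overlap (cosine): {overlap:.3f}, relative weight drift: {drift:.2f}")
```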

The On-Device LLM Problem Nobody Talks About: Model Update Propagation

· 12 min read
Tian Pan
Software Engineer

Most engineers who build on-device LLM features spend their time solving the problems that are easy to see: quantization, latency, memory limits. The model fits on the phone, inference is fast enough, and the demo looks great. Then they ship to millions of devices and discover a harder problem that nobody warned them about: you now have millions of independent compute nodes running different versions of your AI model, and you have no reliable way to know which one any given user is running.

Cloud inference is boring in the best way. You update the model, redeploy the server, and within minutes the entire user base is running the new version. On-device inference breaks this assumption entirely. A user who last opened your app three months ago is still running the model that was current then — and there's no clean way to force an update, no server-side rollback, and no simple way to detect the mismatch without adding instrumentation you probably didn't build from the start.

This version fragmentation is the central operational challenge of on-device AI, and it has consequences that reach far beyond a slow rollout. It creates silent capability drift, complicates incident response, and turns your "AI feature" into a heterogeneous fleet of independently behaving systems that you're responsible for but can't directly control.
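
A minimal sketch of the instrumentation you wish you had built from the start: each app launch reports which model artifact is actually on disk, and the backend aggregates the fleet's version distribution. The field names here are our own convention, not a standard:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelReport:
    device_id: str
    model_id: str
    model_version: str
    weights_sha256: str  # checksum of the file on disk, not the version you shipped

def fleet_distribution(reports: list[ModelReport]) -> Counter:
    # Keep only the latest report per device (reports assumed ordered by time),
    # then count versions across the fleet.
    latest = {r.device_id: r for r in reports}
    return Counter((r.model_id, r.model_version) for r in latest.values())

reports = [
    ModelReport("dev-a", "summarizer", "2025.03", "9f2c0e"),
    ModelReport("dev-b", "summarizer", "2024.12", "11ab42"),  # three months stale
    ModelReport("dev-c", "summarizer", "2025.03", "9f2c0e"),
]
print(fleet_distribution(reports))
# Counter({('summarizer', '2025.03'): 2, ('summarizer', '2024.12'): 1})
```

The checksum matters: it catches the device whose download was interrupted and is running something you never shipped at all.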

Browser-Native LLM Inference: The WebGPU Engineering You Didn't Know You Needed

· 10 min read
Tian Pan
Software Engineer

Most AI features are architected the same way: user input travels to an API, a cloud GPU processes it, and a response travels back. That round trip is so normalized that engineers rarely question it. But it carries a hidden tax: 200–800ms of network latency on every interaction, an API key that must live somewhere accessible (and therefore vulnerable), and a hard dependency on uptime you don't control.

Browser-native LLM inference via WebGPU breaks all three of those assumptions. The model runs on the user's GPU, inside a browser sandbox, with no network round-trip. This isn't a future capability — as of late 2025, WebGPU ships by default across Chrome, Firefox, Edge, and Safari, covering roughly 82.7% of global browser traffic. The engineering question has shifted from "can we do this?" to "when does it beat the cloud, and how do we route intelligently between the two?"
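
The capability probe itself lives in browser code (`navigator.gpu`), but the routing decision is plain logic. A sketch of one possible heuristic, in Python for brevity; every threshold here is an assumption, not a benchmark:

```python
from dataclasses import dataclass

@dataclass
class ClientProfile:
    has_webgpu: bool           # result of the browser-side navigator.gpu probe
    device_vram_gb: float      # reported or probed client-side
    est_network_rtt_ms: float  # measured against your API endpoint

def route(profile: ClientProfile, prompt_tokens: int, latency_budget_ms: float) -> str:
    # Local wins when the device can hold the model and the network tax alone
    # would blow the budget; otherwise fall back to the cloud.
    if not profile.has_webgpu or profile.device_vram_gb < 4.0:
        return "cloud"
    if profile.est_network_rtt_ms > 0.5 * latency_budget_ms:
        return "local"
    if prompt_tokens > 4000:  # long prefill favors datacenter GPUs
        return "cloud"
    return "local"

print(route(ClientProfile(True, 8.0, 350.0), prompt_tokens=600, latency_budget_ms=400))
# -> "local": a 350 ms round trip alone nearly consumes a 400 ms budget
```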

The Compression Decision: Quantization, Distillation, and On-Device Inference for Latency-Critical AI Features

· 10 min read
Tian Pan
Software Engineer

Model routing is the first optimization most teams reach for. Route simple queries to a small cheap model, complex ones to a large capable model. It works well for managing cost and throughput. What it cannot fix is the wall you hit when the physics of cloud inference collide with a latency requirement of 100ms or less. A network round-trip from a mid-tier data center already consumes 30–80ms before a single token is generated. At that point, routing is irrelevant — you need to either run the model closer to the user or run a substantially smaller model. Both paths require compression decisions that most teams approach without a framework.

This is a guide for making those decisions. The three techniques — quantization, knowledge distillation, and on-device deployment — solve overlapping problems but have very different cost structures, quality profiles, and operational consequences.
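
As a taste of the first lever: a minimal sketch of post-training dynamic int8 quantization in PyTorch. Production LLM stacks usually reach for weight-only 4-bit schemes (GPTQ, AWQ) instead, so treat this as the shape of the trade, not a recipe:

```python
import io

import torch
import torch.nn as nn

# Stand-in for a small transformer MLP block; sizes are illustrative.
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024),
)

# Dynamic quantization: Linear weights become int8, activations are
# quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def checkpoint_mb(m: nn.Module) -> float:
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

print(f"fp32 checkpoint: {checkpoint_mb(model):.1f} MB")      # ~33.6 MB
print(f"int8 checkpoint: {checkpoint_mb(quantized):.1f} MB")  # roughly a quarter of that
```

The size drop is the easy part; the post is about the quality and operational costs that come with it.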

On-Device LLM Inference: When to Move AI Off the Cloud

· 11 min read
Tian Pan
Software Engineer

Most teams discover that running AI inference in the cloud has sharp edges only after they've already hit them: a HIPAA audit that traces back to PHI crossing API boundaries, latency numbers in staging that look fine until a user on a spotty connection reports "it just spins," or a per-inference API bill that looked reasonable at 10,000 requests per day and catastrophic at 10 million. On-device inference is often the right answer — but the reasons teams reach for it, and the problems they hit when they do, are rarely the same ones that show up in blog post comparisons.

This is a practical guide to the decision: when local execution beats cloud APIs, which small models actually deliver, and what the deployment lifecycle looks like once the benchmark demo is over.
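
A sketch of the break-even arithmetic behind that bill; all numbers are illustrative:

```python
# Cloud pricing is per inference; on-device is (mostly) fixed cost with free
# marginal inference. The crossover is what makes 10M req/day catastrophic.
def monthly_cloud_cost(requests_per_day: int, cost_per_request: float) -> float:
    return requests_per_day * 30 * cost_per_request

def monthly_on_device_cost(eng_amortized: float, update_plumbing: float) -> float:
    # Marginal inference is free; you pay in engineering and update distribution.
    return eng_amortized + update_plumbing

for rpd in (10_000, 1_000_000, 10_000_000):
    cloud = monthly_cloud_cost(rpd, cost_per_request=0.002)
    local = monthly_on_device_cost(eng_amortized=40_000, update_plumbing=10_000)
    winner = "on-device" if local < cloud else "cloud"
    print(f"{rpd:>10,} req/day: cloud ${cloud:>10,.0f} vs on-device ${local:,.0f} -> {winner}")
# 10k/day:  cloud $600      -> cloud
# 1M/day:   cloud $60,000   -> on-device
# 10M/day:  cloud $600,000  -> on-device
```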

Grammar-Constrained Generation: The Output Reliability Technique Most Teams Skip

· 10 min read
Tian Pan
Software Engineer

Most teams that need structured LLM output follow the same playbook: write a prompt that says "respond only with valid JSON," parse the response, run Pydantic validation, and if it fails, retry with the error message appended. This works often enough to ship. It also fails in production at exactly the worst moments — under load, on edge-case inputs, and with cheaper models that don't follow instructions as reliably as GPT-4.

Grammar-constrained generation is a fundamentally different approach. Instead of asking the model nicely and checking afterward, it makes structurally invalid outputs mathematically impossible. The model cannot emit a missing brace, a non-existent enum value, or a required field it forgot — because those tokens are filtered out before sampling. Not unlikely. Impossible.

Most teams skip it. They shouldn't.
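
A minimal sketch of what it looks like, assuming llama-cpp-python's GBNF support (the token-masking mechanism described above); the model path and prompt are placeholders:

```python
from llama_cpp import Llama, LlamaGrammar

# Only two outputs can exist: {"status": "approved"} or {"status": "rejected"}.
GBNF = r'''
root   ::= "{" ws "\"status\"" ws ":" ws status ws "}"
status ::= "\"approved\"" | "\"rejected\""
ws     ::= [ \t\n]*
'''

grammar = LlamaGrammar.from_string(GBNF)
llm = Llama(model_path="models/small-model.Q4_K_M.gguf")  # placeholder path

out = llm(
    "Decide whether this expense report should be approved. Reply in JSON.",
    grammar=grammar,
    max_tokens=32,
)
print(out["choices"][0]["text"])  # structurally guaranteed to match the grammar
```

No retry loop, no Pydantic backstop: tokens outside the grammar never get sampled in the first place.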

GPU Scheduling for Mixed LLM Workloads: The Bin-Packing Problem Nobody Solves Well

· 10 min read
Tian Pan
Software Engineer

Most GPU clusters running LLM inference are wasting between 30% and 50% of their available compute. Not because engineers are careless, but because the scheduling problem is genuinely hard — and the tools most teams reach for first were never designed for it.

The standard approach is to stand up Kubernetes, request whole GPUs per pod, and let the scheduler figure it out. This works fine for training jobs. For inference across a heterogeneous set of models, it quietly destroys utilization. A cluster running three different 7B models with sporadic traffic will find each GPU busy less than 15% of the time, while remaining fully "allocated" and refusing to schedule new work.

The root cause is a mismatch between how Kubernetes thinks about GPUs and what LLM inference actually requires.
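
A sketch of what packing on memory footprints, rather than whole cards, buys; first-fit-decreasing bin packing with illustrative model sizes:

```python
# Kubernetes allocates whole GPUs, but inference workloads are really VRAM
# footprints (weights + KV-cache headroom) that can often share a card.
def first_fit_decreasing(models: dict[str, float], gpu_vram_gb: float) -> list[dict[str, float]]:
    gpus: list[dict[str, float]] = []
    for name, need in sorted(models.items(), key=lambda kv: -kv[1]):
        for gpu in gpus:
            if sum(gpu.values()) + need <= gpu_vram_gb:
                gpu[name] = need
                break
        else:
            gpus.append({name: need})  # no existing card fits; open a new one
    return gpus

# Three 7B-class models plus a reranker; footprints are illustrative.
models = {"support-7b": 18.0, "summarize-7b": 18.0, "codegen-7b": 18.0, "rerank-1b": 4.0}

packing = first_fit_decreasing(models, gpu_vram_gb=80.0)  # one 80 GB card
print(f"{len(packing)} GPU(s):", packing)
# Whole-GPU-per-pod scheduling pins these to 4 cards; packed, they fit on 1
# (58 of 80 GB used). The other 3 idle cards are the 30-50% waste.
```

Real schedulers also have to account for KV-cache growth under load and interference between co-located models, which is where the problem stops being clean bin packing.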