639 posts tagged with "llm"

The Inference Optimization Trap: Why Making One Model Faster Can Slow Down Your System

April 19, 2026 · 9 min read

Software Engineer

You swap your expensive LLM for a faster, cheaper distilled model. Latency goes up. Costs increase. Quality degrades. You roll back, confused, having just spent three weeks on optimization work that made everything worse.

This isn't a hypothetical. It's one of the most common failure modes in production AI systems, and it stems from a seductive but wrong mental model: that optimizing a component optimizes the system.

What Your Inference Provider Is Hiding From You: KV Cache, Batching, and the Latency Floor

April 19, 2026 · 11 min read

Tian Pan

Software Engineer

You're running an LLM-powered application and your p99 latency is 4 seconds. You've tuned your prompts, reduced output length, and switched to streaming. The number barely moves. The problem is not your code — it's physics and queuing theory operating inside a black box you don't own.

Every inference provider makes dozens of architectural decisions that determine your application's performance ceiling before your first API call. KV cache eviction policy, continuous batching schedules, chunked prefill chunk size — none of this is in the docs, none of it is configurable by you, and all of it shapes the latency and cost curve you're stuck with.

This post explains what's actually happening inside inference infrastructure, why it creates an unavoidable latency floor, and the handful of things you can actually do about it.

Invisible Model Drift: How Silent Provider Updates Break Production AI

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your prompts worked on Monday. On Wednesday, users start complaining that responses feel off — answers are shorter, the JSON parsing downstream is breaking intermittently, the classifier that had been 94% accurate is now hovering around 79%. You haven't deployed anything. The model you're calling still has the same name in your config. But something changed.

This is invisible model drift: the silent, undocumented behavior changes that LLM providers push without announcement. It is one of the least-discussed operational hazards in AI engineering, and it hits teams that have done everything "right" — with evals, with monitoring, with stable prompt engineering. The model just changed underneath them.

Knowledge Distillation Without Fine-Tuning: Extracting Frontier Model Capabilities Into Cheaper Inference Paths

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

A 770-million-parameter model beating a 540-billion-parameter model at its own task sounds impossible. But that is exactly what distilled T5 models achieved against few-shot PaLM—using only 80% of the training examples, a 700x size reduction, and inference that costs a fraction of a cent per call instead of dollars. The trick wasn't a better architecture or a cleverer training recipe. It was generating labeled data from the big model and training the small one on it.

This is knowledge distillation. And you do not need to fine-tune the teacher to make it work.

The Latent Capability Ceiling: When a Bigger Model Won't Fix Your Problem

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

There is a pattern that plays out on almost every AI project that runs long enough. The team builds a prototype, the demo looks good, but in production the outputs aren't consistent enough. Someone suggests switching to the latest frontier model — GPT-4o instead of GPT-3.5, Claude Opus instead of Sonnet, Gemini Ultra instead of Pro. Sometimes it helps. Eventually it stops helping. The team finds themselves paying 5–10x more per inference, latency has doubled, and the task accuracy is still 78% instead of the 90% they need.

This is the latent capability ceiling: the point at which the raw scale of the language model you're using is no longer the limiting factor. It's a real phenomenon backed by empirical data, and most teams hit it without recognizing it — because the reflex to "use a bigger model" is cheap, fast, and often works early in a project.

The Idempotency Crisis: LLM Agents as Event Stream Consumers

April 19, 2026 · 11 min read

Tian Pan

Software Engineer

Every event streaming system eventually delivers the same message twice. Network hiccups, broker restarts, offset commit failures — at-least-once delivery is not a bug; it's the contract. Traditional consumers handle this gracefully because they're deterministic: process the same event twice, get the same result, write the same record. The second write is a no-op.

LLMs are not deterministic processors. The same prompt with the same input produces different outputs on each run. Even with temperature=0, floating-point arithmetic, batch composition effects, and hardware scheduling variations introduce variance. Research measuring "deterministic" LLM settings found accuracy differences up to 15% across naturally occurring runs, with best-to-worst performance gaps reaching 70%. At-least-once delivery plus a non-deterministic processor does not give you at-most-once behavior. It gives you unpredictable behavior — and that's a crisis waiting to happen in production.

LLM-Powered Data Pipelines: The ETL Tier Nobody Benchmarks

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Most conversations about LLMs in production orbit around chat interfaces, copilots, and autonomous agents. But if you audit where enterprise LLM tokens are actually being consumed, a different picture emerges: a quiet majority of usage is happening inside batch data pipelines — extracting fields from documents, classifying support tickets, normalizing messy vendor records, enriching raw events with semantic labels. Nobody is writing conference talks about this tier. Nobody is benchmarking it seriously either. And that silence is costing teams real money and real accuracy.

This is the ETL tier that practitioners build first, justify last, and monitor least. It is also, for most organizations, the layer where LLM spend has the highest leverage — and the highest potential for invisible failure.

LLM Vendor Lock-In Is a Spectrum, Not a Binary

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

A team builds a production feature on GPT-4. Months later, they decide to evaluate Claude for cost reasons. They spend two weeks "migrating"—but the core API swap takes an afternoon. The remaining ten days go toward fixing broken system prompts, re-testing refusal edge cases, debugging JSON parsers that choke on unexpected prose, and re-tuning tool-calling schemas that behave differently across providers. Migration estimates that assumed a simple connector swap balloon into a multi-layer rebuild.

This is the LLM vendor lock-in problem in practice. And the teams that get burned aren't the ones who chose the wrong provider—they're the ones who didn't recognize that lock-in exists on multiple axes, each with a different risk profile.

Long-Session Context Degradation: How Multi-Turn Conversations Go Stale

April 19, 2026 · 8 min read

Tian Pan

Software Engineer

The first time a user's 80-turn support conversation suddenly started contradicting advice given 60 turns ago, the team blamed a bug. There was no bug. The model was simply lost. Across all major frontier models, multi-turn conversations show an average 39% performance drop compared to single-turn interactions on the same tasks. Most teams never measure this. They assume context windows are roughly as powerful as their token limit suggests, and they build products accordingly.

That assumption is quietly wrong. Long sessions don't just get slower or more expensive — they get unreliable in ways that are nearly impossible to notice until users are already frustrated.

The Mental Model Shift That Separates Good AI Engineers from the Rest

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

The most common pattern among engineers who struggle with AI work isn't a lack of technical knowledge. It's that they keep asking the wrong question. They want to know: "Does this work?" What they should be asking is: "At what rate does this fail, and is that rate acceptable for this use case?"

That single shift — from binary correctness to acceptable failure rates — is the core of what experienced AI engineers think differently about. It sounds simple. It isn't. Everything downstream of it is different: how you debug, how you test, how you deploy, what you monitor, what you build your confidence on. Engineers who haven't made this shift will keep fighting their tools and losing.

Multi-Tenant AI Systems: Isolation, Customization, and Cost Attribution at Scale

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams building SaaS products on top of LLMs discover the multi-tenancy problem the hard way: they ship fast using a single shared prompt config, then watch in horror as one customer's system prompt leaks into another's response, one enterprise client burns through everyone's rate limit, or the monthly AI bill arrives with no way to determine which customer caused 40% of the spend. The failure mode isn't theoretical—a 2025 paper at NDSS demonstrated that prefix caching in vLLM, SGLang, LightLLM, and DeepSpeed could be exploited to reconstruct another tenant's prompt with 99% accuracy using nothing more than timing signals and crafted requests.

Building multi-tenant AI infrastructure is not the same as multi-tenanting a traditional database. The shared components—inference servers, KV caches, embedding pipelines, retrieval indexes—each present distinct isolation challenges. This post covers the four problems you actually have to solve: isolation, customization, cost attribution, and per-tenant quality tracking.

Multi-Modal Agents in Production: What Text-Only Evals Never Catch

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams building AI agents discover the same thing three months into production: their eval suite—carefully designed around text inputs and JSON outputs—tells them nothing useful about what happens when the agent encounters a blurry invoice, a scanned contract, or a screenshot of a UI it has never seen. The text-only eval passes. The user files a ticket.

Multi-modal inputs aren't just another modality to wire up. They introduce a distinct category of failure that requires different architecture decisions, different cost models, and different eval strategies. Teams that treat vision as a drop-in addition to a working text agent consistently underestimate the effort involved.

About Tian Pan