780 posts tagged with "ai-engineering"

The Inference Optimization Trap: Why Making One Model Faster Can Slow Down Your System

April 19, 2026 · 9 min read

Software Engineer

You swap your expensive LLM for a faster, cheaper distilled model. Latency goes up. Costs increase. Quality degrades. You roll back, confused, having just spent three weeks on optimization work that made everything worse.

This isn't a hypothetical. It's one of the most common failure modes in production AI systems, and it stems from a seductive but wrong mental model: that optimizing a component optimizes the system.

Knowledge Distillation for Production: Teaching Small Models to Do Big Model Tasks

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

A healthcare company ran GPT-4 on 10,000 documents per day. Annual bill: $50,000. After fine-tuning a 27B open-source model on frontier outputs, the same workload cost $5,000, a 90% reduction. The smaller model also outperformed the frontier model by 60% on their specific task, because it had been shown thousands of examples of exactly the right behavior.

This is knowledge distillation in its modern form: you pay the frontier model API costs once to generate training data, then run a small specialized model forever. The math works because inference is cheap when you own the weights, and task-specific models beat general-purpose models on narrow tasks given enough examples.

But "collect outputs, fine-tune, ship" is not a complete recipe. Most teams that attempt distillation hit one of three invisible walls: bad synthetic data that teaches the student wrong behaviors, no reliable signal for when the student is actually ready, or silent quality collapse in production that doesn't surface until users complain. This post covers the pipeline decisions that determine whether distillation works.

Knowledge Distillation Without Fine-Tuning: Extracting Frontier Model Capabilities Into Cheaper Inference Paths

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

A 770-million-parameter model beating a 540-billion-parameter model at its own task sounds impossible. But that is exactly what distilled T5 models achieved against few-shot PaLM, using only 80% of the training examples, a model 700x smaller, and inference that costs a fraction of a cent per call instead of dollars. The trick wasn't a better architecture or a cleverer training recipe. It was generating labeled data from the big model and training the small one on it.

This is knowledge distillation. And you do not need to fine-tune the teacher to make it work.

The Idempotency Crisis: LLM Agents as Event Stream Consumers

April 19, 2026 · 11 min read

Tian Pan

Software Engineer

Every event streaming system eventually delivers the same message twice. Network hiccups, broker restarts, offset commit failures — at-least-once delivery is not a bug; it's the contract. Traditional consumers handle this gracefully because they're deterministic: process the same event twice, get the same result, write the same record. The second write is a no-op.

LLMs are not deterministic processors. The same prompt with the same input produces different outputs on each run. Even with temperature=0, floating-point arithmetic, batch composition effects, and hardware scheduling variations introduce variance. Research measuring "deterministic" LLM settings found accuracy differences up to 15% across naturally occurring runs, with best-to-worst performance gaps reaching 70%. At-least-once delivery plus a non-deterministic processor does not give you exactly-once behavior. It gives you unpredictable behavior, and that's a crisis waiting to happen in production.
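One common way to restore idempotency is to key results by event ID and reuse the first stored result on redelivery. The sketch below is a minimal illustration of that idea, not the approach from the full post: `IdempotentConsumer`, `fake_llm`, and the in-memory dictionary are hypothetical stand-ins, and a real consumer would need a durable store with an atomic check-and-set.

```python
import random
from typing import Callable, Dict


class IdempotentConsumer:
    """Stores the first result per event ID so a redelivered event is a no-op."""

    def __init__(self, process: Callable[[str], str]) -> None:
        self._process = process              # non-deterministic step, e.g. an LLM call
        self._results: Dict[str, str] = {}   # in-memory stand-in for a durable store
        self.calls = 0                       # how many times the processor actually ran

    def handle(self, event_id: str, payload: str) -> str:
        # Redelivery: return the stored result instead of re-running the model.
        if event_id in self._results:
            return self._results[event_id]
        self.calls += 1
        result = self._process(payload)
        self._results[event_id] = result
        return result


def fake_llm(payload: str) -> str:
    """Hypothetical stand-in for an LLM: different output on every call."""
    return f"{payload}:{random.random()}"


consumer = IdempotentConsumer(fake_llm)
first = consumer.handle("evt-1", "summarize order 42")
second = consumer.handle("evt-1", "summarize order 42")  # simulated duplicate delivery
```

Here the duplicate delivery returns the stored output, so downstream writes see one consistent result even though the processor itself is not deterministic.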

The Mental Model Shift That Separates Good AI Engineers from the Rest

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

The most common pattern among engineers who struggle with AI work isn't a lack of technical knowledge. It's that they keep asking the wrong question. They want to know: "Does this work?" What they should be asking is: "At what rate does this fail, and is that rate acceptable for this use case?"

That single shift — from binary correctness to acceptable failure rates — is the core of what experienced AI engineers think differently about. It sounds simple. It isn't. Everything downstream of it is different: how you debug, how you test, how you deploy, what you monitor, what you build your confidence on. Engineers who haven't made this shift will keep fighting their tools and losing.
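To make "at what rate does this fail" concrete, one option is to estimate the failure rate from a sample of evaluated outputs and compare the upper bound of a confidence interval against the threshold you can tolerate. The sketch below uses a Wilson score interval; the choice of interval, the function names, and the sample numbers are illustrative assumptions rather than anything prescribed by the post.

```python
import math
from typing import Tuple


def wilson_interval(failures: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a failure rate (z=1.96 is roughly 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = failures / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def is_acceptable(failures: int, trials: int, max_failure_rate: float) -> bool:
    """Accept only if even the pessimistic estimate is within the tolerated rate."""
    _, upper = wilson_interval(failures, trials)
    return upper <= max_failure_rate


# Hypothetical eval run: 8 failures observed in 200 sampled outputs.
low, high = wilson_interval(failures=8, trials=200)
print(f"failure rate is plausibly between {low:.3f} and {high:.3f}")
```

The point of comparing the upper bound rather than the observed rate is that a small sample can look acceptable by luck; the interval makes that uncertainty explicit.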

Model Deprecation Is a Production Incident Waiting to Happen

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

A model you deployed six months ago has a sunset date on the calendar. You probably didn't mark it. Your on-call rotation doesn't know about it. There's no ticket in the backlog. And when the provider finally pulls the plug, you'll get a 404 Model not found error in production at the worst possible time, with no rollback plan ready.

This is the standard story for most engineering teams using hosted LLMs. Model deprecation gets categorized as a vendor concern, not an operational one — right until the moment it becomes an incident.
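One lightweight way to treat deprecation as an operational concern is to keep sunset dates in a registry and check it on a schedule. The sketch below is a minimal illustration under stated assumptions: the model names and dates are invented placeholders, and `models_near_sunset` is a hypothetical helper, not a provider API.

```python
from datetime import date
from typing import Dict, List

# Hypothetical registry: model names and sunset dates are placeholders.
SUNSET_DATES: Dict[str, date] = {
    "example-model-v1": date(2026, 6, 1),
    "example-model-v2": date(2027, 1, 15),
}


def models_near_sunset(today: date, warn_days: int = 90) -> List[str]:
    """Return models whose sunset date falls within warn_days of today (or has passed)."""
    return sorted(
        name
        for name, sunset in SUNSET_DATES.items()
        if (sunset - today).days <= warn_days
    )


# A scheduled job could call this daily and open a ticket for each result.
at_risk = models_near_sunset(date(2026, 4, 19))
```

Running a check like this from CI or a cron job puts the sunset date in front of the on-call rotation well before the provider returns an error.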

Multi-Tenant AI Systems: Isolation, Customization, and Cost Attribution at Scale

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams building SaaS products on top of LLMs discover the multi-tenancy problem the hard way: they ship fast using a single shared prompt config, then watch in horror as one customer's system prompt leaks into another's response, one enterprise client burns through everyone's rate limit, or the monthly AI bill arrives with no way to determine which customer caused 40% of the spend. The failure mode isn't theoretical—a 2025 paper at NDSS demonstrated that prefix caching in vLLM, SGLang, LightLLM, and DeepSpeed could be exploited to reconstruct another tenant's prompt with 99% accuracy using nothing more than timing signals and crafted requests.

Building multi-tenant AI infrastructure is not the same as multi-tenanting a traditional database. The shared components—inference servers, KV caches, embedding pipelines, retrieval indexes—each present distinct isolation challenges. This post covers the four problems you actually have to solve: isolation, customization, cost attribution, and per-tenant quality tracking.
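For the cost attribution problem, a simple starting point is to tag every request with a tenant ID and aggregate token usage into a per-tenant share of spend. The sketch below assumes hypothetical per-token prices and a hypothetical `Usage` record; real billing would use your provider's actual pricing and usage fields.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Usage(NamedTuple):
    """One request's token usage, tagged with the tenant that issued it."""
    tenant_id: str
    input_tokens: int
    output_tokens: int


# Hypothetical prices in dollars per 1,000 tokens.
INPUT_PRICE_PER_1K = 0.5
OUTPUT_PRICE_PER_1K = 1.5


def spend_share(records: Iterable[Usage]) -> Dict[str, float]:
    """Return each tenant's fraction of total spend."""
    cost: Dict[str, float] = defaultdict(float)
    for r in records:
        cost[r.tenant_id] += (
            r.input_tokens / 1000 * INPUT_PRICE_PER_1K
            + r.output_tokens / 1000 * OUTPUT_PRICE_PER_1K
        )
    total = sum(cost.values())
    if total == 0:
        return {}
    return {tenant: c / total for tenant, c in cost.items()}


shares = spend_share([
    Usage("acme", input_tokens=3000, output_tokens=0),
    Usage("globex", input_tokens=1000, output_tokens=0),
])
```

With per-request tagging in place, the question of which customer drove a given share of the bill becomes a lookup rather than a forensic exercise.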

Multi-Modal Agents in Production: What Text-Only Evals Never Catch

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams building AI agents discover the same thing three months into production: their eval suite—carefully designed around text inputs and JSON outputs—tells them nothing useful about what happens when the agent encounters a blurry invoice, a scanned contract, or a screenshot of a UI it has never seen. The text-only eval passes. The user files a ticket.

Multi-modal inputs aren't just another modality to wire up. They introduce a distinct category of failure that requires different architecture decisions, different cost models, and different eval strategies. Teams that treat vision as a drop-in addition to a working text agent consistently underestimate the effort involved.

Multimodal AI in Production: The Gap Between Benchmarks and Reality

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams adopting multimodal AI make the same mistake: they evaluate models on curated benchmark datasets and assume production performance will track. It doesn't. The gap between a vision model acing MMMU and that same model reliably extracting structured data from your invoices at scale is wide enough to sink a product launch. Vision encoders add latency that benchmark leaderboards don't measure. Spatial reasoning fails on the chart types your users actually send. Audio models that score well on clean speech disintegrate under real-world noise. And the task categories where multimodal genuinely outperforms text-only are narrower than vendors suggest.

This post is a field guide to that gap — where it shows up, why it exists, and which deployment patterns hold up under production load.

The 90% Reliability Wall: Why AI Features Plateau and What to Do About It

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

Your AI feature ships at 92% accuracy. The team celebrates. Three months later, progress has flatlined — the error rate stopped falling despite more data, more compute, and two model upgrades. Sound familiar?

This is the 90% reliability wall, and it is not a coincidence. It emerges from three converging forces: the exponential cost of marginal accuracy gains, the difference between errors you can eliminate and errors that are structurally unavoidable, and the compound amplification of failure in production environments that benchmarks never capture. Teams that do not understand which force they are fighting will waste quarters trying to solve problems that are not solvable.
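The compounding effect is easy to see with a little arithmetic. The sketch below assumes a pipeline of steps that each succeed independently with the same probability, which is a simplification introduced here for illustration; real failures are often correlated.

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


# A feature that is 92% accurate per step, chained across several steps.
for n in (1, 3, 5, 10):
    print(f"{n:>2} steps: {end_to_end_success(0.92, n):.1%} end-to-end success")
```

Under this assumption, a 92% step chained five times succeeds end to end only about two times in three, which is why a component-level metric can look healthy while the product does not.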

On-Call for Stochastic Systems: Why Your AI Runbook Needs a Rewrite

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

You get paged at 2 AM. Latency is up, error rates are spiking. You SSH in, pull logs, and—nothing. No stack trace pointing to a bad deploy. No null pointer exception on line 247. Just a stream of model outputs that are subtly, unpredictably wrong in ways that only become obvious when you read 50 of them in a row.

This is what incidents look like in LLM-powered systems. And the traditional alert-triage-fix loop was not built for it.

The standard on-call playbook assumes three things: failures are deterministic (same input, same bad output), root cause is locatable (some code changed, some resource exhausted), and rollback is straightforward (revert the deploy, done). None of these hold for stochastic AI systems. The same prompt produces different outputs. Root cause is usually a probability distribution, not a line of code. And you cannot "roll back" a model that a third-party provider updated silently overnight.

The Orchestration Framework Trap: When LangChain Makes You Slower to Ship

April 19, 2026 · 8 min read

Tian Pan

Software Engineer

At some point in 2024, a pattern started appearing in engineering postmortems across AI teams: "We rewrote it without LangChain and shipping became significantly faster." The teams in these postmortems hadn't made a technical mistake in adopting the framework — they'd made a timing mistake. LangChain was the right tool for the prototype and the wrong tool for month seven.

The same story played out enough times that it has a name now: the orchestration framework trap. You adopt a framework that genuinely accelerates early work, and the productivity gain masks a growing structural debt. By the time the debt is visible, you're deep in internals that were never meant to be touched.
