11 posts tagged with "production-ml"

When Your Evals Disagree: A Signal Hierarchy for the Week the Numbers Contradict Each Other

April 27, 2026 · 12 min read

Software Engineer

It's Tuesday morning, the week after a prompt change shipped to half your traffic. You open four dashboards. The held-out golden set scored by the LLM judge says +8%. The human-rater panel that samples production weekly says no change. The A/B test on downstream conversion says −2%. The thumbs-up rate is flat. Four signals, four verdicts, and a standup in fifteen minutes where someone is going to ask whether you ship the prompt or roll it back.

The temptation is to pick the number that confirms what you already wanted to do — and the team will, because nobody on the call has a written rule for which signal wins. The disagreement isn't a measurement bug. It's the predictable output of a system that bolted four evaluators together without a hierarchy, and the cost of not having one is that every release week becomes a debate about whose number to trust.

Your Model Router Was Trained on Your Eval Set, Not Your Traffic

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A team I talked to last quarter shipped a model router that scored 96% routing accuracy on their offline benchmark and cut average inference cost by 58%. Three weeks in, support tickets started clustering around a specific user segment — enterprise admins running scripted bulk queries through their API. The cheap path was sending those users garbage answers. The router was working exactly as designed. The design was wrong.

That story is the rule, not the exception. The "send small-model what you can, save big-model for what you must" architecture is one of the most reliable cost levers in production LLM systems, with documented savings between 45% and 85% on standard benchmarks. But the savings number that gets quoted on every routing demo assumes a benchmark distribution. Production traffic doesn't have that shape, and the gap between the two is where quality regressions live — concentrated in segments your offline eval was never designed to surface.

The Data Flywheel Trap: Why Your Feedback Loop May Be Spinning in Place

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

Every product leader has heard the pitch: more users generate more data, better data trains better models, better models attract more users. The data flywheel is the moat that compounds. It's why AI incumbents win.

The pitch is not wrong. But the implementation almost always is. In practice, most data flywheels have multiple leakage points — places where the feedback loop appears to be spinning but is actually amplifying bias, reinforcing stale patterns, or optimizing a proxy that diverges from the real objective. The engineers building these systems rarely know which type of leakage they have, because all of them look identical from the outside: engagement goes up, the model keeps improving on the metrics you can measure, and the system slowly becomes less useful in ways that are hard to attribute.

This is the data flywheel trap. Understanding its failure modes is the prerequisite to building one that actually works.

Knowledge Distillation for Production: Teaching Small Models to Do Big Model Tasks

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

A healthcare company ran GPT-4 on 10,000 documents per day. Annual bill: $50,000. After fine-tuning a 27B open-source model on frontier outputs, the same workload cost$ 5,000—a 90% reduction. The smaller model also outperformed the frontier model by 60% on their specific task, because it had been shown thousands of examples of exactly the right behavior.

This is knowledge distillation in its modern form: you pay the frontier model API costs once to generate training data, then run a small specialized model forever. The math works because inference is cheap when you own the weights, and task-specific models beat general-purpose models on narrow tasks given enough examples.

But "collect outputs, fine-tune, ship" is not a complete recipe. Most teams that attempt distillation hit one of three invisible walls: bad synthetic data that teaches the student wrong behaviors, no reliable signal for when the student is actually ready, or silent quality collapse in production that doesn't surface until users complain. This post covers the pipeline decisions that determine whether distillation works.

LoRA Adapter Composition in Production: Running Multiple Fine-Tuned Skills Without Model Wars

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

The promise sounds clean: fine-tune lightweight LoRA adapters for each specialized skill — one for professional tone, one for JSON formatting, one for medical terminology, one for safety guardrails — then combine them at serving time. Teams ship this design, it works fine in development, and then falls apart in production when two adapters start fighting over the same weight regions and the output quality collapses to something indistinguishable from the untrained base model. Not slightly worse. Completely untuned.

This post is about what happens when you compose adapters in practice, why naive merging fails so reliably, and what strategies actually work at production scale.

Why Vision Models Ace Benchmarks but Fail on Your Enterprise PDFs

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

A benchmark result of 97% accuracy on a document understanding dataset looks compelling until you run it against your company's actual invoice archive and realize it's quietly garbling 30% of the line items. The model doesn't throw an error. It doesn't return low confidence. It just produces output that looks plausible and is wrong.

This is the defining failure mode of production document AI: silent corruption. Unlike a crash or an exception, silent corruption propagates. The garbled table cell flows into the downstream aggregation, the aggregation feeds a report, the report drives a decision. By the time you notice, tracing the root cause is archaeology.

The gap between benchmark performance and production performance in document AI is real, persistent, and poorly understood by teams evaluating these models. Understanding why it exists — and how to defend against it — is the engineering problem this post addresses.

Speculative Decoding in Production: Free Tokens and Hidden Traps

April 17, 2026 · 9 min read

Tian Pan

Software Engineer

Most LLM inference bottlenecks come down to one uncomfortable fact: the GPU is waiting on memory bandwidth, not compute. Each token generated requires loading the entire model's weights from HBM, and that transfer dominates runtime. Speculative decoding was designed to exploit this gap — but the gains depend on conditions your benchmark almost certainly didn't test.

Teams that ship speculative decoding into production often see it underperform lab numbers by 40–60%. Not because the technique is flawed, but because the workload characteristics differ in ways that matter: larger batch sizes, shorter outputs, stricter output constraints. Understanding when speculative decoding actually helps — and when it silently hurts — is the prerequisite for deploying it responsibly.

The Embedding Drift Problem: How Your Semantic Search Silently Degrades

April 16, 2026 · 9 min read

Tian Pan

Software Engineer

Your semantic search is probably getting worse right now, and your dashboards are not telling you.

There is no error log. No p99 spike. No failed health check. Queries still return results with high cosine similarity scores. But the relevance is quietly deteriorating, one missed term at a time, as the language your users type diverges from the language your embedding model was trained on.

This is the embedding drift problem. It is insidious precisely because it produces no visible failure signal — only a slow erosion of retrieval quality that users attribute to the product being "not that useful anymore" before they stop using it entirely.

AI Technical Debt: Four Categories That Never Show Up in Your Sprint Retro

April 12, 2026 · 11 min read

Tian Pan

Software Engineer

Your sprint retro covers the usual suspects: flaky tests, that migration someone keeps punting, the API endpoint held together with duct tape. But if you're shipping AI features, the most expensive debt in your codebase is the kind nobody puts on a sticky note.

Traditional technical debt accumulates linearly. You cut a corner, you pay interest on it later, you refactor when the pain gets bad enough. AI technical debt compounds. A prompt that degrades silently produces training signals that pollute your evals, which misguide your next round of prompt changes, which further erodes the quality your users experience. By the time someone notices, three layers of assumptions have rotted underneath you.

Model Merging in Production: Weight Averaging Your Way to a Multi-Task Specialist

April 12, 2026 · 13 min read

Tian Pan

Software Engineer

By early 2024, the top of the Open LLM Leaderboard was dominated almost entirely by models that were never trained — they were merged. Teams were taking two or three fine-tuned variants of Mistral-7B, averaging their weights using a YAML config file, and beating purpose-trained models at a fraction of the compute cost. The technique looks trivially simple from the outside: add some tensors together, divide by two, ship it. The reality is more nuanced, and the failure modes are sharp enough to sink a production deployment if you don't understand what's happening under the hood.

This is a practical guide to model merging for ML engineers who want to use it in production: what the methods actually do mathematically, when they work, when they silently degrade, and how to pick the right tool for a given set of constituent models.

Embedding Models in Production: Selection, Versioning, and the Index Drift Problem

April 9, 2026 · 10 min read

Tian Pan

Software Engineer

Your RAG answered correctly yesterday. Today it contradicts itself. Nothing obvious changed — except your embedding provider quietly shipped a model update and your index is now a Frankenstein of mixed vector spaces.

Embedding models are the unsexy foundation of every retrieval-augmented system, and they fail in ways that are uniquely hard to diagnose. Unlike a prompt change or a model parameter tweak, embedding model problems surface slowly, as silent quality degradation that your evals don't catch until users start complaining. This post covers three things: how to pick the right embedding model for your domain (MTEB scores mislead more than they help), what actually happens when you upgrade a model, and the versioning patterns that let you swap models without rebuilding from scratch.

About Tian Pan