
9 posts tagged with "production-ml"


The Data Flywheel Trap: Why Your Feedback Loop May Be Spinning in Place

· 11 min read
Tian Pan
Software Engineer

Every product leader has heard the pitch: more users generate more data, better data trains better models, better models attract more users. The data flywheel is the moat that compounds. It's why AI incumbents win.

The pitch is not wrong. But the implementation almost always is. In practice, most data flywheels have multiple leakage points — places where the feedback loop appears to be spinning but is actually amplifying bias, reinforcing stale patterns, or optimizing a proxy that diverges from the real objective. The engineers building these systems rarely know which type of leakage they have, because all of them look identical from the outside: engagement goes up, the model keeps improving on the metrics you can measure, and the system slowly becomes less useful in ways that are hard to attribute.

This is the data flywheel trap. Understanding its failure modes is the prerequisite to building one that actually works.

Knowledge Distillation for Production: Teaching Small Models to Do Big Model Tasks

· 9 min read
Tian Pan
Software Engineer

A healthcare company ran GPT-4 on 10,000 documents per day. Annual bill: $50,000. After fine-tuning a 27B open-source model on frontier outputs, the same workload cost $5,000 — a 90% reduction. The smaller model also outperformed the frontier model by 60% on their specific task, because it had been shown thousands of examples of exactly the right behavior.

This is knowledge distillation in its modern form: you pay the frontier model API costs once to generate training data, then run a small specialized model forever. The math works because inference is cheap when you own the weights, and task-specific models beat general-purpose models on narrow tasks given enough examples.
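To make the "pay once, run forever" step concrete, the collection phase can be as small as the loop below. This is a minimal sketch, not the company's actual pipeline: the teacher model name, prompt, and JSONL layout are illustrative assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXTRACTION_PROMPT = "Extract the diagnosis codes from the document below.\n\n{doc}"

def build_distillation_set(documents, out_path="distill_train.jsonl"):
    """Run the frontier (teacher) model once per document and log the
    prompt/completion pair as a fine-tuning example for the student."""
    with open(out_path, "w") as f:
        for doc in documents:
            prompt = EXTRACTION_PROMPT.format(doc=doc)
            resp = client.chat.completions.create(
                model="gpt-4o",   # teacher model; placeholder choice
                messages=[{"role": "user", "content": prompt}],
                temperature=0,    # deterministic teacher outputs are easier to audit
            )
            record = {
                "messages": [
                    {"role": "user", "content": prompt},
                    {"role": "assistant", "content": resp.choices[0].message.content},
                ]
            }
            f.write(json.dumps(record) + "\n")
```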

But "collect outputs, fine-tune, ship" is not a complete recipe. Most teams that attempt distillation hit one of three invisible walls: bad synthetic data that teaches the student wrong behaviors, no reliable signal for when the student is actually ready, or silent quality collapse in production that doesn't surface until users complain. This post covers the pipeline decisions that determine whether distillation works.

LoRA Adapter Composition in Production: Running Multiple Fine-Tuned Skills Without Model Wars

· 9 min read
Tian Pan
Software Engineer

The promise sounds clean: fine-tune lightweight LoRA adapters for each specialized skill — one for professional tone, one for JSON formatting, one for medical terminology, one for safety guardrails — then combine them at serving time. Teams ship this design, it works fine in development, and then it falls apart in production when two adapters start fighting over the same weight regions and output quality collapses to something indistinguishable from the base model with no fine-tuning applied at all. Not slightly worse. Completely untuned.
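To see why naive composition interferes, it helps to look at what it actually does: each adapter contributes a low-rank update to the same weight matrix, and merging them naively just sums those updates. The sketch below is a simplified single-layer illustration; the shapes, scaling, and toy adapters are assumptions, not the post's implementation.

```python
import torch

def naive_merge(base_weight, adapters, scale=1.0):
    """Naively add several LoRA updates (delta_W = B @ A) into one base weight.

    Each adapter is an (A, B) pair with A of shape (r, in_features) and
    B of shape (out_features, r). When two adapters' deltas point in
    conflicting directions over the same weight region, the sum can cancel
    or overshoot -- the "model wars" failure mode described above.
    """
    merged = base_weight.clone()
    for A, B in adapters:
        merged += scale * (B @ A)
    return merged

# Toy example: two rank-8 adapters on a single 4096x4096 projection.
base = torch.randn(4096, 4096)
tone_adapter = (torch.randn(8, 4096) * 0.01, torch.randn(4096, 8) * 0.01)
json_adapter = (torch.randn(8, 4096) * 0.01, torch.randn(4096, 8) * 0.01)
merged = naive_merge(base, [tone_adapter, json_adapter])
```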

This post is about what happens when you compose adapters in practice, why naive merging fails so reliably, and what strategies actually work at production scale.

Why Vision Models Ace Benchmarks but Fail on Your Enterprise PDFs

· 9 min read
Tian Pan
Software Engineer

A benchmark result of 97% accuracy on a document understanding dataset looks compelling until you run the same model against your company's actual invoice archive and realize it's quietly garbling 30% of the line items. The model doesn't throw an error. It doesn't return low confidence. It just produces output that looks plausible and is wrong.

This is the defining failure mode of production document AI: silent corruption. Unlike a crash or an exception, silent corruption propagates. The garbled table cell flows into the downstream aggregation, the aggregation feeds a report, the report drives a decision. By the time you notice, tracing the root cause is archaeology.

The gap between benchmark performance and production performance in document AI is real, persistent, and poorly understood by teams evaluating these models. Understanding why it exists — and how to defend against it — is the engineering problem this post addresses.
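As one illustration of what a defense can look like, cheap internal-consistency checks catch a useful slice of silent corruption before it propagates downstream. The field names and tolerance below are assumptions for the sketch, not the post's schema.

```python
def reconciles(extraction, tolerance=0.01):
    """Cheap structural check on a model-extracted invoice: the line items
    should sum to the stated total. Flags a useful slice of silent garbling
    (dropped rows, digit swaps) without any ground-truth labels."""
    line_total = sum(item["amount"] for item in extraction["line_items"])
    return abs(line_total - extraction["total"]) <= tolerance

doc = {
    "line_items": [{"amount": 120.00}, {"amount": 34.50}],
    "total": 154.50,
}
assert reconciles(doc)
```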

Speculative Decoding in Production: Free Tokens and Hidden Traps

· 9 min read
Tian Pan
Software Engineer

Most LLM inference bottlenecks come down to one uncomfortable fact: the GPU is waiting on memory bandwidth, not compute. Each token generated requires loading the entire model's weights from HBM, and that transfer dominates runtime. Speculative decoding was designed to exploit this gap — but the gains depend on conditions your benchmark almost certainly didn't test.
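The core loop is easy to state: a cheap draft model proposes a few tokens, and the large target model verifies them all in a single forward pass, so the expensive weight load is amortized over several tokens. Below is a deliberately simplified greedy-acceptance sketch; production implementations use probabilistic acceptance to preserve the target model's distribution, and the callables here are stand-ins rather than any framework's API.

```python
import torch

def speculative_step(target_logits, draft_logits, ids, k=4):
    """One greedy speculative-decoding step (simplified sketch).

    draft_logits / target_logits: callables mapping a 1-D LongTensor of token
    ids to logits of shape (len(ids), vocab). The draft model proposes k
    tokens autoregressively; the target model scores the whole extended
    sequence in ONE pass, and we keep the longest prefix where its greedy
    choice agrees with the draft.
    """
    draft = ids.clone()
    for _ in range(k):                       # cheap: k small-model passes
        nxt = draft_logits(draft)[-1].argmax()
        draft = torch.cat([draft, nxt.view(1)])

    logits = target_logits(draft)            # expensive: one big-model pass
    accepted = ids.clone()
    for i in range(len(ids), len(draft)):
        target_choice = logits[i - 1].argmax()
        if target_choice == draft[i]:
            accepted = torch.cat([accepted, draft[i].view(1)])
        else:                                # first disagreement: take the
            accepted = torch.cat([accepted, target_choice.view(1)])
            break                            # target's token and stop
    else:
        # every draft token accepted; append one bonus token from the target
        accepted = torch.cat([accepted, logits[-1].argmax().view(1)])
    return accepted

# Toy usage with stub models (real use: small draft LM + large target LM).
vocab = 100
fake = lambda ids: torch.randn(len(ids), vocab)
out = speculative_step(fake, fake, torch.tensor([1, 2, 3]))
```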

Teams that ship speculative decoding into production often see it underperform lab numbers by 40–60%. Not because the technique is flawed, but because the workload characteristics differ in ways that matter: larger batch sizes, shorter outputs, stricter output constraints. Understanding when speculative decoding actually helps — and when it silently hurts — is the prerequisite for deploying it responsibly.

The Embedding Drift Problem: How Your Semantic Search Silently Degrades

· 9 min read
Tian Pan
Software Engineer

Your semantic search is probably getting worse right now, and your dashboards are not telling you.

There is no error log. No p99 spike. No failed health check. Queries still return results with high cosine similarity scores. But the relevance is quietly deteriorating, one missed term at a time, as the language your users type diverges from the language your embedding model was trained on.

This is the embedding drift problem. It is insidious precisely because it produces no visible failure signal — only a slow erosion of retrieval quality that users attribute to the product being "not that useful anymore" before they stop using it entirely.
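You cannot fix what you do not measure, and the usual dashboards measure nothing here. One low-tech countermeasure is a scheduled canary evaluation: re-run a small hand-labeled query set against the live index and watch the trend. The sketch below assumes unit-normalized vectors and an in-memory index; it is an illustration, not any particular vector database's tooling.

```python
import numpy as np

def canary_recall_at_k(query_vecs, index_vecs, expected_ids, k=5):
    """Re-run a fixed, hand-labeled 'canary' query set and measure recall@k.
    Run daily, a downward trend in this number is the drift signal that raw
    cosine similarity scores will not give you."""
    sims = query_vecs @ index_vecs.T                 # assumes unit-normalized rows
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = [expected in row for expected, row in zip(expected_ids, topk)]
    return float(np.mean(hits))
```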

AI Technical Debt: Four Categories That Never Show Up in Your Sprint Retro

· 11 min read
Tian Pan
Software Engineer

Your sprint retro covers the usual suspects: flaky tests, that migration someone keeps punting, the API endpoint held together with duct tape. But if you're shipping AI features, the most expensive debt in your codebase is the kind nobody puts on a sticky note.

Traditional technical debt accumulates linearly. You cut a corner, you pay interest on it later, you refactor when the pain gets bad enough. AI technical debt compounds. A prompt that degrades silently produces training signals that pollute your evals, which misguide your next round of prompt changes, which further erodes the quality your users experience. By the time someone notices, three layers of assumptions have rotted underneath you.

Model Merging in Production: Weight Averaging Your Way to a Multi-Task Specialist

· 13 min read
Tian Pan
Software Engineer

By early 2024, the top of the Open LLM Leaderboard was dominated almost entirely by models that were never trained — they were merged. Teams were taking two or three fine-tuned variants of Mistral-7B, averaging their weights using a YAML config file, and beating purpose-trained models at a fraction of the compute cost. The technique looks trivially simple from the outside: add some tensors together, divide by two, ship it. The reality is more nuanced, and the failure modes are sharp enough to sink a production deployment if you don't understand what's happening under the hood.
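In its plainest form, the averaging really is that small. The sketch below assumes every checkpoint is a fine-tune of the same base model with an identical state-dict layout (file names are placeholders); that shared-basin assumption is exactly where the sharp failure modes hide.

```python
import torch

def average_merge(state_dicts, weights=None):
    """Uniform (or weighted) parameter-space average of fine-tuned checkpoints.

    This is the 'add some tensors together, divide by two' version: it only
    makes sense when every model is a fine-tune of the SAME base checkpoint,
    so the weights live in a shared loss basin. Merging unrelated models this
    way is where things quietly fall apart.
    """
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# e.g. merged = average_merge([torch.load("mistral_ft_a.pt"), torch.load("mistral_ft_b.pt")])
```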

This is a practical guide to model merging for ML engineers who want to use it in production: what the methods actually do mathematically, when they work, when they silently degrade, and how to pick the right tool for a given set of constituent models.

Embedding Models in Production: Selection, Versioning, and the Index Drift Problem

· 10 min read
Tian Pan
Software Engineer

Your RAG answered correctly yesterday. Today it contradicts itself. Nothing obvious changed — except your embedding provider quietly shipped a model update and your index is now a Frankenstein of mixed vector spaces.

Embedding models are the unsexy foundation of every retrieval-augmented system, and they fail in ways that are uniquely hard to diagnose. Unlike a prompt change or a model parameter tweak, embedding model problems surface slowly, as silent quality degradation that your evals don't catch until users start complaining. This post covers three things: how to pick the right embedding model for your domain (MTEB scores mislead more than they help), what actually happens when you upgrade a model, and the versioning patterns that let you swap models without rebuilding from scratch.
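The core versioning pattern is mundane but saves you from the Frankenstein index: every stored vector carries the id of the model that produced it, and queries never cross versions. A minimal sketch follows; the in-memory list, field names, and model id are stand-ins for whatever vector store you actually run.

```python
from dataclasses import dataclass

EMBED_MODEL_ID = "text-embed-v3-2024-01"   # pin an explicit version; never "latest"

@dataclass
class VectorRecord:
    doc_id: str
    vector: list[float]
    embed_model_id: str                    # stored alongside every vector

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def query_index(index, query_vector, model_id=EMBED_MODEL_ID, k=10):
    """Only compare vectors produced by the same embedding model version.
    During a migration you dual-write under the new model_id and flip this
    filter once the re-embedding backfill is complete."""
    candidates = [r for r in index if r.embed_model_id == model_id]
    candidates.sort(key=lambda r: dot(query_vector, r.vector), reverse=True)
    return candidates[:k]
```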