43 posts tagged with "mlops"

The Centralized AI Platform Trap: Why Shared ML Teams Kill Product Velocity

· 8 min read
Tian Pan
Software Engineer

Most engineering organizations discover the problem the same way: AI demos go well, leadership pushes for broader adoption, and someone decides the right answer is a dedicated team to own "AI infrastructure." The team gets headcount, a roadmap, and a mandate to accelerate AI across the organization.

Eighteen months later, product teams are filing tickets to get their prompts deployed. The platform team is overwhelmed. Features that took days to demo are taking quarters to ship. And the team originally created to speed up AI adoption has become its primary bottleneck.

This is the centralized AI platform trap — and it's surprisingly easy to fall into.

The Feedback Flywheel Stall: Why Most AI Products Stop Improving After Month Three

· 9 min read
Tian Pan
Software Engineer

Every AI product pitch deck has the same slide: more users generate more data, which trains better models, which attract more users. The data flywheel. It sounds like a perpetual motion machine for product quality. And for the first few months, it actually works — accuracy climbs, users are happy, and the metrics all point up and to the right.

Then, somewhere around month three, the curve flattens. The model stops getting meaningfully better. The annotation queue grows but the accuracy needle barely moves. Your team is still collecting data, still retraining, still shipping — but the flywheel has quietly stalled.

This isn't a rare failure mode. Studies show that 40% of companies deploying AI models experience noticeable performance degradation within the first year, and up to 32% of production scoring pipelines encounter distributional shifts within six months. The flywheel doesn't break with a bang. It decays with a whisper.

The Model Migration Playbook: How to Swap Foundation Models Without Breaking Production

· 13 min read
Tian Pan
Software Engineer

Every team that has shipped an LLM-powered product has faced the same moment: a new foundation model drops with better benchmarks, lower costs, or both — and someone asks, "Can we just swap it in?" The answer is always yes in staging and frequently catastrophic in production.

The gap between "runs on the new model" and "behaves correctly on the new model" is where production incidents live. Model migrations fail not because the new model is worse, but because the migration process assumes behavioral equivalence where none exists. Prompt formatting conventions differ between providers. System prompt interpretation varies across model families. Edge cases that the old model handled gracefully — through learned quirks you never documented — surface as regressions that your eval suite wasn't designed to catch.
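The regressions described above can be caught before cutover with a differential eval: run both models on a frozen case set and flag cases where the old model passed but the new one fails. This is a minimal sketch; `old_model`, `new_model`, and `check` are caller-supplied placeholders, not names from the post.

```python
from typing import Callable, List

def migration_diff(cases: List[str],
                   old_model: Callable[[str], str],
                   new_model: Callable[[str], str],
                   check: Callable[[str, str], bool]) -> List[str]:
    # Run both models on the same frozen case set and report behavioral
    # regressions the new model introduces, not absolute quality.
    regressions = []
    for case in cases:
        old_out, new_out = old_model(case), new_model(case)
        if check(case, old_out) and not check(case, new_out):
            # Old model handled this case; new model does not.
            regressions.append(case)
    return regressions
```

The point of diffing against the old model's behavior, rather than scoring the new model in isolation, is that it surfaces exactly the undocumented quirks the excerpt warns about.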

Embedding Models in Production: Selection, Versioning, and the Index Drift Problem

· 10 min read
Tian Pan
Software Engineer

Your RAG answered correctly yesterday. Today it contradicts itself. Nothing obvious changed — except your embedding provider quietly shipped a model update and your index is now a Frankenstein of mixed vector spaces.

Embedding models are the unsexy foundation of every retrieval-augmented system, and they fail in ways that are uniquely hard to diagnose. Unlike a prompt change or a model parameter tweak, embedding model problems surface slowly, as silent quality degradation that your evals don't catch until users start complaining. This post covers three things: how to pick the right embedding model for your domain (MTEB scores mislead more than they help), what actually happens when you upgrade a model, and the versioning patterns that let you swap models without rebuilding from scratch.
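The "mixed vector spaces" failure has a cheap structural guard: pin the embedding model version on every stored vector and filter on it at query time, so a provider-side update can never silently compare vectors from different spaces. A minimal sketch, with a hypothetical model identifier:

```python
from dataclasses import dataclass
from typing import List

EMBED_MODEL = "text-embedder-v2"  # hypothetical current model identifier

@dataclass
class VectorRecord:
    doc_id: str
    vector: List[float]
    model_version: str  # pinned at write time

def upsert(index: List[VectorRecord], doc_id: str, vector: List[float]) -> None:
    # Tag every vector with the model that produced it.
    index.append(VectorRecord(doc_id, vector, EMBED_MODEL))

def query_candidates(index: List[VectorRecord]) -> List[VectorRecord]:
    # Retrieve only vectors from the current model's space; anything stale
    # is excluded and queued for re-embedding instead of compared in place.
    return [r for r in index if r.model_version == EMBED_MODEL]
```

Real vector stores expose this as metadata filtering; the invariant is the same either way: a similarity score is only meaningful between vectors from the same model.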

Releasing AI Features Without Breaking Production: Shadow Mode, Canary Deployments, and A/B Testing for LLMs

· 11 min read
Tian Pan
Software Engineer

A team swaps GPT-4o for a newer model on a Tuesday afternoon. By Thursday, support tickets are up 30%, but nobody can tell why — the new model gives slightly terser responses, refuses some edge-case requests the old one handled, and formats dates differently in a way that breaks a downstream parser. The team reverts. Two sprints of work, gone.

This story plays out constantly. The problem isn't that the new model was worse — it may have been better on most things. The problem is that the team released it with the same process they'd use to ship a bug fix: merge, deploy, watch. That works for code. It fails for LLMs.
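Shadow mode is the simplest of the three release mechanisms: the candidate model sees real traffic but its output is only logged, never served. A minimal sketch, assuming `primary` and `candidate` are caller-supplied callables wrapping the two models:

```python
import difflib
from typing import Callable, List

def shadow_release(prompt: str,
                   primary: Callable[[str], str],
                   candidate: Callable[[str], str],
                   log: List[dict]) -> str:
    # Users always get the primary model's answer; the candidate runs on
    # the same traffic and is only logged for offline comparison.
    served = primary(prompt)
    shadow = candidate(prompt)
    similarity = difflib.SequenceMatcher(None, served, shadow).ratio()
    log.append({"prompt": prompt, "similarity": round(similarity, 3)})
    return served
```

In production the shadow call would run asynchronously so it adds no user-facing latency, and the logged divergences feed the decision to start a canary.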

Prompt Versioning in Production: The Engineering Discipline Teams Learn the Hard Way

· 10 min read
Tian Pan
Software Engineer

You get paged at 2am. Users are reporting garbage output. You SSH in, check logs, stare at traces — everything looks structurally fine. The model is responding. Latency is normal. But something is wrong with the answers. Then the question lands in your incident channel: "Which prompt version is actually running right now?"

If you can't answer that question in under thirty seconds, you have a prompt versioning problem.

Prompts are treated like configuration in most early-stage LLM projects. A product manager edits a string in a .env file, a developer pastes an updated instruction into a hardcoded constant, someone else pastes a slightly different version into a staging Slack channel. Eventually the versions diverge, and nobody has a complete picture of what's running where. The experimentation-phase casualness that got you to launch becomes a liability the moment you have real users.
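One way to make "which prompt version is running?" answerable in seconds is content-addressed versioning: derive the version from a hash of the template text and attach it to every log line and trace. A minimal sketch of that convention (the function names are illustrative, not from the post):

```python
import hashlib

def prompt_version(template: str) -> str:
    # Content-addressed version: the hash changes iff the prompt text
    # changes, so diverging copies in .env files, constants, and Slack
    # pastes are immediately distinguishable in the logs.
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]

def render(template: str, **variables: str) -> tuple:
    # Return the rendered prompt plus the version of the *template*,
    # to be attached to the request's logs and traces.
    return template.format(**variables), prompt_version(template)
```

With this in place, the 2am question reduces to grepping one field, and any edit anywhere produces a visibly different version string.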

Fine-Tuning vs. Prompting: A Decision Framework for Production LLMs

· 8 min read
Tian Pan
Software Engineer

Most teams reach for fine-tuning too early or too late. The ones who fine-tune too early burn weeks on a training pipeline before realizing a better system prompt would have solved the problem. The ones who wait too long run expensive 70B inferences on millions of repetitive tasks while accepting accuracy that a fine-tuned 7B model could have beaten — at a tenth of the cost.

The decision is not about which technique is "better." It's about matching the right tool to your specific constraints: data volume, latency budget, accuracy requirements, and how stable the task definition is. Here's how to think through it.
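The constraints listed above can be arranged into a rough decision heuristic. This sketch is illustrative only; the thresholds and ordering are this example's assumptions, not the post's framework.

```python
def suggest_approach(labeled_examples: int,
                     task_stable: bool,
                     prompt_meets_bar: bool) -> str:
    # Illustrative heuristic; thresholds are assumptions, not a rule.
    if prompt_meets_bar:
        return "prompting"   # the cheapest technique that hits the bar wins
    if not task_stable:
        return "prompting"   # fine-tunes go stale when the task definition shifts
    if labeled_examples < 1000:
        return "prompting"   # likely too little data to fine-tune well
    return "fine-tuning"     # stable task, enough data, prompt ceiling hit
```

The ordering matters more than the numbers: accuracy bar first, task stability second, data volume last.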