Skip to main content

63 posts tagged with "mlops"

View all tags

The Annotation Pipeline Is Production Infrastructure

· 11 min read
Tian Pan
Software Engineer

Most teams treat their annotation pipeline the same way they treat their CI script from 2019: it works, mostly, and nobody wants to touch it. A shared spreadsheet with color-coded rows. A Google Form routing tasks to a Slack channel. Three contractors working asynchronously, comparing notes in a thread.

Then a model ships with degraded quality, an eval regresses in a confusing direction, and the post-mortem eventually surfaces the obvious: the labels were wrong, and no one built anything to detect it.

Annotation is not a data problem. It is a software engineering problem. The teams that treat it that way — with queues, schemas, monitoring, and structured disagreement handling — build AI products that improve over time. The teams that don't are in a cycle of re-labeling they can't quite explain.

Closing the Feedback Loop: How Production AI Systems Actually Improve

· 12 min read
Tian Pan
Software Engineer

Your AI product shipped three months ago. You have dashboards showing latency, error rates, and token costs. You've seen users interact with the system thousands of times. And yet your model is exactly as good — and bad — as the day it deployed.

This is not a data problem. You have more data than you know what to do with. It is an architecture problem. The signals that tell you where your model fails are sitting in application logs, user sessions, and downstream outcome data. They are disconnected from anything that could change the model's behavior.

Most teams treat their LLM as a static artifact and wrap monitoring and evaluation around the outside. The best teams treat production as a training pipeline that never stops.

The Adapter Compatibility Cliff: When Your Fine-Tune Meets the New Base Model

· 11 min read
Tian Pan
Software Engineer

Fine-tuning a language model gives you a competitive edge until the provider updates the base model underneath your adapter. At that point, one of two things happens: your service crashes with a shape mismatch error, or — far more dangerously — it silently starts returning degraded outputs while your monitoring shows nothing unusual. Most teams discover the second scenario only when users start complaining that "the AI got dumber."

This is the adapter compatibility cliff. You trained a LoRA adapter on model version N. The provider shipped version N+1. Your adapter is now running on a foundation it was never designed for, and there is no migration path.

Agent Behavioral Versioning: Why Git Commits Don't Capture What Changed

· 9 min read
Tian Pan
Software Engineer

You shipped an agent last Tuesday. Nothing in your codebase changed. On Thursday, it started refusing tool calls it had handled reliably for weeks. Your git log is clean, your tests pass, and your CI pipeline is green. But the agent is broken — and you have no version to roll back to, because the thing that changed wasn't in your repository.

This is the central paradox of agent versioning: the artifacts you track (code, configs, prompts) are necessary but insufficient to define what your agent actually does. The behavior emerges from the intersection of code, model weights, tool APIs, and runtime context — and any one of those can shift without leaving a trace in your version control system.

AI Feature Decay: The Slow Rot That Metrics Don't Catch

· 9 min read
Tian Pan
Software Engineer

Your AI feature launched to applause. Three months later, users are quietly routing around it. Your dashboards still show green — latency is fine, error rates are flat, uptime is perfect. But satisfaction scores are sliding, support tickets mention "the AI is being weird," and the feature that once handled 70% of inquiries now barely manages 50%.

This is AI feature decay: the gradual degradation of an AI-powered feature not from model changes or code bugs, but from the world shifting underneath it. Unlike traditional software that fails with stack traces, AI features degrade silently. The system runs, the model responds, and the output is delivered — it's just no longer what users need.

The AI Team Topology Problem: Why Your Org Chart Determines Whether AI Ships

· 8 min read
Tian Pan
Software Engineer

Most AI features die in the gap between "works in notebook" and "works in production." Not because the model is bad, but because the team that built the model and the team that owns the product have never sat in the same room. The AI team topology problem — where AI engineers sit in your org chart — is quietly the biggest predictor of whether your AI investments ship or stall.

The numbers bear this out. Only about half of ML projects make it from prototype to production — at less mature organizations, the failure rate reaches 90%. Meanwhile, CircleCI's 2026 State of Software Delivery report found that AI-assisted code generation boosted feature branch throughput by 59%, yet production branch output actually declined 7% for median teams. Code is being written faster than ever. It's just not shipping.

Edge LLM Inference: When Latency, Privacy, or Cost Force You Off the Cloud

· 9 min read
Tian Pan
Software Engineer

A fine-tuned 7B parameter model running on a single RTX 4090 can outperform GPT-4 on domain-specific tasks while costing you nothing per token after the initial hardware investment. That is not a theoretical claim — Diabetica-7B, a diabetes-focused model, hit 87.2% accuracy on clinical queries, beating both GPT-4 and Claude 3.5 on the same benchmark. The catch? Getting there requires understanding exactly when edge inference makes sense and when it is an expensive distraction.

Most teams default to cloud APIs because they are easy — make an HTTP call, get tokens back. But that simplicity has costs that scale in ways engineers do not anticipate until it is too late, and those costs are not always measured in dollars.

Open-Weight Models in Production: When Self-Hosting Actually Beats the API

· 8 min read
Tian Pan
Software Engineer

Every few months, someone on your team forwards a blog post about how Llama or Qwen "matches GPT-4" on some benchmark, followed by the inevitable question: "Why are we paying for API calls when we could just run this ourselves?" The math looks compelling on a napkin. The reality is that most teams who attempt self-hosting end up spending more than they saved, not because the models are bad, but because they underestimated everything that isn't the model.

That said, there are specific situations where self-hosting open-weight models is the clearly correct decision. The trick is knowing which situation you're actually in, rather than the one you wish you were in.

The Centralized AI Platform Trap: Why Shared ML Teams Kill Product Velocity

· 8 min read
Tian Pan
Software Engineer

Most engineering organizations discover the problem the same way: AI demos go well, leadership pushes for broader adoption, and someone decides the right answer is a dedicated team to own "AI infrastructure." The team gets headcount, a roadmap, and a mandate to accelerate AI across the organization.

Eighteen months later, product teams are filing tickets to get their prompts deployed. The platform team is overwhelmed. Features that took days to demo are taking quarters to ship. And the team originally created to speed up AI adoption has become its primary bottleneck.

This is the centralized AI platform trap — and it's surprisingly easy to fall into.

The Feedback Flywheel Stall: Why Most AI Products Stop Improving After Month Three

· 9 min read
Tian Pan
Software Engineer

Every AI product pitch deck has the same slide: more users generate more data, which trains better models, which attract more users. The data flywheel. It sounds like a perpetual motion machine for product quality. And for the first few months, it actually works — accuracy climbs, users are happy, and the metrics all point up and to the right.

Then, somewhere around month three, the curve flattens. The model stops getting meaningfully better. The annotation queue grows but the accuracy needle barely moves. Your team is still collecting data, still retraining, still shipping — but the flywheel has quietly stalled.

This isn't a rare failure mode. Studies show that 40% of companies deploying AI models experience noticeable performance degradation within the first year, and up to 32% of production scoring pipelines encounter distributional shifts within six months. The flywheel doesn't break with a bang. It decays with a whisper.

The Model Migration Playbook: How to Swap Foundation Models Without Breaking Production

· 13 min read
Tian Pan
Software Engineer

Every team that has shipped an LLM-powered product has faced the same moment: a new foundation model drops with better benchmarks, lower costs, or both — and someone asks, "Can we just swap it in?" The answer is always yes in staging and frequently catastrophic in production.

The gap between "runs on the new model" and "behaves correctly on the new model" is where production incidents live. Model migrations fail not because the new model is worse, but because the migration process assumes behavioral equivalence where none exists. Prompt formatting conventions differ between providers. System prompt interpretation varies across model families. Edge cases that the old model handled gracefully — through learned quirks you never documented — surface as regressions that your eval suite wasn't designed to catch.

Embedding Models in Production: Selection, Versioning, and the Index Drift Problem

· 10 min read
Tian Pan
Software Engineer

Your RAG answered correctly yesterday. Today it contradicts itself. Nothing obvious changed — except your embedding provider quietly shipped a model update and your index is now a Frankenstein of mixed vector spaces.

Embedding models are the unsexy foundation of every retrieval-augmented system, and they fail in ways that are uniquely hard to diagnose. Unlike a prompt change or a model parameter tweak, embedding model problems surface slowly, as silent quality degradation that your evals don't catch until users start complaining. This post covers three things: how to pick the right embedding model for your domain (MTEB scores mislead more than they help), what actually happens when you upgrade a model, and the versioning patterns that let you swap models without rebuilding from scratch.