Skip to main content

129 posts tagged with "mlops"

View all tags

Multi-Region LLM Serving: The Cache Locality Problem Nobody Warns You About

· 10 min read
Tian Pan
Software Engineer

When you run a stateless HTTP API across multiple regions, the routing problem is essentially solved. Put a global load balancer in front, distribute requests by geography, and the worst thing that happens is a slightly stale cache entry. Any replica can serve any request with identical results.

LLM inference breaks every one of these assumptions. The moment you add prompt caching — which you will, because the cost difference between a cache hit and a cache miss is roughly 10x — your service becomes stateful in ways that most infrastructure teams don't anticipate until they're staring at degraded latency numbers in their second region.

The Three Hidden Debts Killing Your AI System

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped on time. Users are using it. Everything looks fine — until one quarter later when a support ticket reveals the system has been confidently wrong for weeks, your evaluation suite caught nothing, and the vector index is silently returning stale results. Nothing broke. The system returned 200 OK the whole time.

This is what AI technical debt looks like. Unlike a failing unit test or a stack overflow, it degrades softly and probabilistically. You don't get a crash — you get subtle quality erosion. Three distinct liabilities drive most of this: prompt debt, eval debt, and embedding debt. Each accumulates independently. Each compounds the others. And most engineering teams are carrying all three.

The AI Dependency Footprint: When Every Feature Adds a New Infrastructure Owner

· 9 min read
Tian Pan
Software Engineer

Your team shipped a RAG-powered search feature last quarter. It required a vector database, an embedding model, an annotation pipeline, a chunking service, and an evaluation harness. Each component made sense individually. But six months later, you discover that three of those five components have no clear owner, two are running on engineers' personal cloud accounts, and one was quietly deprecated by its vendor without anyone noticing. The 3am page comes from a component nobody even remembers adding.

This is the AI dependency footprint problem: the compounding accumulation of infrastructure that each AI feature requires, combined with the organizational reality that teams rarely plan ownership for any of it before shipping.

Continuous Fine-Tuning Without Data Contamination: The Production Pipeline

· 11 min read
Tian Pan
Software Engineer

Most teams running continuous fine-tuning discover the contamination problem the same way: their eval metrics keep improving each week, the team celebrates, and then a user reports that the model has "gotten worse." When you investigate, you realize your evaluation benchmark has been quietly leaking into your training data for months. Every metric that looked like capability gain was memorization.

The numbers are worse than intuition suggests. LLaMA 2 had over 16% of MMLU examples contaminated — with 11% severely contaminated (more than 80% token overlap). GPT-2 scored 15 percentage points higher on contaminated benchmarks versus clean ones. These are not edge cases. In a continuous fine-tuning loop, contamination is the default outcome unless you architect explicitly against it.

Fine-Tuning Dataset Provenance: The Audit Question You Can't Answer Six Months Later

· 10 min read
Tian Pan
Software Engineer

Six months after you shipped your fine-tuned model, a regulator asks: "Which training examples came from users who have since revoked consent?" You open a spreadsheet, search a Slack archive, and find yourself reconstructing history from annotation batch emails and a README that hasn't been updated since the first sprint. This is the norm, not the exception. An audit of 44 major instruction fine-tuning datasets found over 70% of their licenses listed as "unspecified," with error rates above 50% in how license categories were actually applied. The provenance problem is structural, and it bites hardest when you can least afford it.

This post is about building a provenance registry for fine-tuning data before you need it — the schema, the audit scenarios that drive its requirements, and the production patterns that make it tractable without becoming a second job.

Model Routing Is a System Design Problem, Not a Config Option

· 11 min read
Tian Pan
Software Engineer

Most teams choose their LLM the way they choose a database engine: once, during architecture review, and never again. You pick GPT-4o or Claude 3.5 Sonnet, bake it into your config, and ship. The choice feels irreversible because changing it requires a redeployment, coordination across services, and regression testing against whatever your evals look like this week.

That framing is a mistake. Your traffic is not homogeneous. A "summarize this document" request and a "debug this cryptic stack trace" request hitting the same endpoint at the same time have radically different capability requirements — but with static model selection, they're indistinguishable from your infrastructure's perspective. You're either over-provisioning one or under-serving the other, and you're doing it on every single request.

Model routing treats LLM selection as a runtime dispatch decision. Every incoming query gets evaluated on signals that predict the right model for that specific request, and the call is dispatched accordingly. The routing layer doesn't exist in your config file — it runs in your request path.

The Annotation Pipeline Is Production Infrastructure

· 11 min read
Tian Pan
Software Engineer

Most teams treat their annotation pipeline the same way they treat their CI script from 2019: it works, mostly, and nobody wants to touch it. A shared spreadsheet with color-coded rows. A Google Form routing tasks to a Slack channel. Three contractors working asynchronously, comparing notes in a thread.

Then a model ships with degraded quality, an eval regresses in a confusing direction, and the post-mortem eventually surfaces the obvious: the labels were wrong, and no one built anything to detect it.

Annotation is not a data problem. It is a software engineering problem. The teams that treat it that way — with queues, schemas, monitoring, and structured disagreement handling — build AI products that improve over time. The teams that don't are in a cycle of re-labeling they can't quite explain.

Closing the Feedback Loop: How Production AI Systems Actually Improve

· 12 min read
Tian Pan
Software Engineer

Your AI product shipped three months ago. You have dashboards showing latency, error rates, and token costs. You've seen users interact with the system thousands of times. And yet your model is exactly as good — and bad — as the day it deployed.

This is not a data problem. You have more data than you know what to do with. It is an architecture problem. The signals that tell you where your model fails are sitting in application logs, user sessions, and downstream outcome data. They are disconnected from anything that could change the model's behavior.

Most teams treat their LLM as a static artifact and wrap monitoring and evaluation around the outside. The best teams treat production as a training pipeline that never stops.

The Adapter Compatibility Cliff: When Your Fine-Tune Meets the New Base Model

· 11 min read
Tian Pan
Software Engineer

Fine-tuning a language model gives you a competitive edge until the provider updates the base model underneath your adapter. At that point, one of two things happens: your service crashes with a shape mismatch error, or — far more dangerously — it silently starts returning degraded outputs while your monitoring shows nothing unusual. Most teams discover the second scenario only when users start complaining that "the AI got dumber."

This is the adapter compatibility cliff. You trained a LoRA adapter on model version N. The provider shipped version N+1. Your adapter is now running on a foundation it was never designed for, and there is no migration path.

Agent Behavioral Versioning: Why Git Commits Don't Capture What Changed

· 9 min read
Tian Pan
Software Engineer

You shipped an agent last Tuesday. Nothing in your codebase changed. On Thursday, it started refusing tool calls it had handled reliably for weeks. Your git log is clean, your tests pass, and your CI pipeline is green. But the agent is broken — and you have no version to roll back to, because the thing that changed wasn't in your repository.

This is the central paradox of agent versioning: the artifacts you track (code, configs, prompts) are necessary but insufficient to define what your agent actually does. The behavior emerges from the intersection of code, model weights, tool APIs, and runtime context — and any one of those can shift without leaving a trace in your version control system.

AI Feature Decay: The Slow Rot That Metrics Don't Catch

· 9 min read
Tian Pan
Software Engineer

Your AI feature launched to applause. Three months later, users are quietly routing around it. Your dashboards still show green — latency is fine, error rates are flat, uptime is perfect. But satisfaction scores are sliding, support tickets mention "the AI is being weird," and the feature that once handled 70% of inquiries now barely manages 50%.

This is AI feature decay: the gradual degradation of an AI-powered feature not from model changes or code bugs, but from the world shifting underneath it. Unlike traditional software that fails with stack traces, AI features degrade silently. The system runs, the model responds, and the output is delivered — it's just no longer what users need.

The AI Team Topology Problem: Why Your Org Chart Determines Whether AI Ships

· 8 min read
Tian Pan
Software Engineer

Most AI features die in the gap between "works in notebook" and "works in production." Not because the model is bad, but because the team that built the model and the team that owns the product have never sat in the same room. The AI team topology problem — where AI engineers sit in your org chart — is quietly the biggest predictor of whether your AI investments ship or stall.

The numbers bear this out. Only about half of ML projects make it from prototype to production — at less mature organizations, the failure rate reaches 90%. Meanwhile, CircleCI's 2026 State of Software Delivery report found that AI-assisted code generation boosted feature branch throughput by 59%, yet production branch output actually declined 7% for median teams. Code is being written faster than ever. It's just not shipping.