Skip to main content

65 posts tagged with "mlops"

View all tags

Embedding Model Rotation Is a Database Migration, Not a Deploy

· 11 min read
Tian Pan
Software Engineer

Somewhere in a staging channel, an engineer writes "bumping the embedder to v3, new model scored +4 on MTEB, merging after the smoke test." Two days later support tickets start trickling in about search results that feel "weirdly off." A week later retrieval precision is down fourteen points, cosine scores have collapsed from 0.85 into the 0.65 range, and nobody can explain why — because the deploy looked identical to the last five model bumps. It wasn't a deploy. It was a database migration wearing a deploy's costume.

Embedding model rotation is the most misfiled change type in AI infrastructure. It lands in your system through the same channels as a prompt tweak or a generation-model pin update — a config file, a PR, a CI check — so it gets the governance of a config change. But under the hood, a new embedder does not produce a better version of your old vectors. It produces vectors that live in a different coordinate system entirely, where cosine similarity across the two manifolds is a category error. The correct mental model is not "rev the dependency." It is "swap the primary key encoding on a fifty-million-row table while serving reads."

The Orphan Adapter Problem: When Your Fine-Tune Outlives Its Base Model

· 12 min read
Tian Pan
Software Engineer

A senior engineer left six months ago. She owned the classifier adapter that routes customer support tickets — a 32-rank LoRA trained on 847 hand-labeled examples, pinned to a base model that hits end-of-life in 43 days. Nobody remembers why those 847 examples were chosen over the 2,000 they started with. The training data sits in an S3 bucket whose lifecycle policy purges objects older than one year. Her laptop was wiped. The fine-tuning notebook has a cell that calls a preprocessing function she imported from her personal dotfiles repo, now private.

This is the orphan adapter — a fine-tune that outlived its maintainers, outlived its data, and is about to outlive the base model it was trained on. It sits in your production stack, routing real user traffic, and nobody left on the team can rebuild it. The deprecation email didn't create this crisis. It just exposed it.

Why AI Feature Flags Are Not Regular Feature Flags

· 11 min read
Tian Pan
Software Engineer

Your canary deployment worked perfectly. Error rates stayed flat. Latency didn't spike. The dashboard showed green across the board. You rolled the new model out to 100% of traffic — and three weeks later your support queue filled up with users complaining that the AI "felt off" and "stopped being helpful."

This is the core problem with applying traditional feature flag mechanics to AI systems. A model can be degraded without being broken. It returns 200s, generates tokens at normal speed, and produces text that passes superficial validation — while simultaneously hallucinating more often, drifting toward terse or evasive answers, or regressing on the subtle reasoning patterns your users actually depend on. The telemetry you've been monitoring for years was never designed to catch this kind of failure.

The AI Feature Lifecycle Decay Problem: How to Catch Degradation Before Users Do

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped clean. The demo impressed, the launch metrics looked great, and the model benchmarked at 88% accuracy on your test set. Then, about three months later, a customer success manager forwards you a screenshot. The AI recommendation made no sense. You pull the logs, run a quick evaluation, and find accuracy has drifted to 71%. No alert fired. No error was thrown. Infrastructure dashboards showed green the whole time.

This pattern is not a freak occurrence. Research across 32 production datasets found that 91% of ML models degrade over time — and most of the degradation is silent. The systems keep running, the code doesn't change, but the predictions get progressively worse as the real world moves on without the model.

The AI Feature Sunset Playbook: How to Retire Underperforming AI Without Burning Trust

· 10 min read
Tian Pan
Software Engineer

Engineering teams have built more AI features in the past three years than in the prior decade. They have retired almost none of them. Deloitte found that 42% of companies abandoned at least one AI initiative in 2025 — up from 17% the year before — with average sunk costs of $7.2 million per abandoned project. Yet the features that stay in production often cause more damage than the ones that get cut: they erode user trust slowly, accumulate technical debt that compounds monthly, and consume engineering capacity that could go toward things that work.

The asymmetry is structural. AI feature launches generate announcements, stakeholder excitement, and team recognition. Retirements are treated as admissions of failure. So bad features accumulate. The fix is not willpower — it is a decision framework that makes retirement a normal, predictable engineering outcome rather than an organizational crisis.

Bias Monitoring Infrastructure for Production AI: Beyond the Pre-Launch Audit

· 10 min read
Tian Pan
Software Engineer

Your model passed its fairness review. The demographic parity was within acceptable bounds, equal opportunity metrics looked clean, and the audit report went into Confluence with a green checkmark. Three months later, a journalist has screenshots showing your system approves loans at half the rate for one demographic compared to another — and your pre-launch numbers were technically accurate the whole time.

This is the bias monitoring gap. Pre-launch fairness testing validates your model against datasets that existed when you ran the tests. Production AI systems don't operate in that static world. User behavior shifts, population distributions drift, feature correlations evolve, and disparities that weren't measurable at launch can become significant failure modes within weeks. The systems that catch these problems aren't part of most ML stacks today.

Canary Deploys for LLM Upgrades: Why Model Rollouts Break Differently Than Code Deployments

· 11 min read
Tian Pan
Software Engineer

Your CI passed. Your evals looked fine. You flipped the traffic switch and moved on. Three days later, a customer files a ticket saying every generated report has stopped including the summary field. You dig through logs and find the new model started reliably producing exec_summary instead — a silent key rename that your JSON schema validation never caught because you forgot to add it to the rollout gates. The root cause was a model upgrade. The detection lag was 72 hours.

This is not a hypothetical. It happens in production at companies that have sophisticated deployment pipelines for their application code but treat LLM version upgrades as essentially free — a config swap, not a deployment. That mental model is wrong, and the failure modes that result from it are distinctly hard to catch.

Contract Testing for AI Pipelines: Schema-Validated Handoffs Between AI Components

· 10 min read
Tian Pan
Software Engineer

Most AI pipeline failures aren't model failures. The model fires fine. The output looks like JSON. The downstream stage breaks silently because a field was renamed, a type changed, or a nested object gained a new required property that the next stage doesn't know how to handle. The pipeline runs to completion and reports success. Somewhere in the data warehouse, numbers are wrong.

This is the contract testing problem for AI pipelines, and it's one of the most underaddressed reliability risks in production AI systems. According to recent infrastructure benchmarks, the average enterprise AI system experiences nearly five pipeline failures per month—each taking over twelve hours to resolve. The dominant cause isn't poor model quality. It's data quality and schema contract violations: 64% of AI risk lives at the schema layer.

The Data Flywheel Trap: Why Your Feedback Loop May Be Spinning in Place

· 11 min read
Tian Pan
Software Engineer

Every product leader has heard the pitch: more users generate more data, better data trains better models, better models attract more users. The data flywheel is the moat that compounds. It's why AI incumbents win.

The pitch is not wrong. But the implementation almost always is. In practice, most data flywheels have multiple leakage points — places where the feedback loop appears to be spinning but is actually amplifying bias, reinforcing stale patterns, or optimizing a proxy that diverges from the real objective. The engineers building these systems rarely know which type of leakage they have, because all of them look identical from the outside: engagement goes up, the model keeps improving on the metrics you can measure, and the system slowly becomes less useful in ways that are hard to attribute.

This is the data flywheel trap. Understanding its failure modes is the prerequisite to building one that actually works.

The Golden Dataset Decay Problem: When Your Eval Set Becomes a Liability

· 9 min read
Tian Pan
Software Engineer

Most teams treat their golden eval set like a constitution — permanent, authoritative, and expensive to touch. They spend weeks curating examples, getting domain experts to label them, and wiring them into CI. Then they move on.

Six months later, the eval suite reports 87% pass rate while users are complaining about broken outputs. The evals haven't regressed — they've decayed. The dataset still measures what mattered in October. It just no longer measures what matters now.

This is the golden dataset decay problem, and it's more common than most teams admit.

What Model Cards Don't Tell You: The Production Gap Between Published Benchmarks and Real Workloads

· 9 min read
Tian Pan
Software Engineer

A model card says 89% accuracy on code generation. Your team gets 28% on the actual codebase. A model card says 100K token context window. Performance craters at 32K under your document workload. A model card passes red-team safety evaluation. A prompt injection exploit ships to your users within 72 hours of launch.

This gap isn't rare. It's the norm. In a 2025 analysis of 1,200 production deployments, 42% of companies abandoned their AI initiatives at the production integration stage — up from 17% the previous year. Most of them had read the model cards carefully.

The problem isn't that model cards lie. It's that they measure something different from what you need to know. Understanding that gap precisely — and building the internal benchmark suite to close it — is what separates teams that ship reliable AI from teams that ship regrets.

The Production Distribution Gap: Why Your Internal Testers Can't Find the Bugs Users Do

· 11 min read
Tian Pan
Software Engineer

Your AI feature passed internal testing with flying colors. Engineers loved it, product managers gave the thumbs up, and the eval suite showed 94% accuracy on the benchmark suite. Then you shipped it, and within two weeks users were hitting failure modes you'd never seen — wrong answers, confused outputs, edge cases that made the model look embarrassingly bad.

This is the production distribution gap. It's not a new problem, but it's dramatically worse for AI systems than for deterministic software. Understanding why — and having a concrete plan to address it — is the difference between an AI feature that quietly erodes user trust and one that improves with use.