Skip to main content

123 posts tagged with "mlops"

View all tags

The Six-Month Cliff: Why Production AI Systems Degrade Without a Single Code Change

· 9 min read
Tian Pan
Software Engineer

Your AI feature shipped green. Latency is fine, error rates are negligible, and the HTTP responses return 200. Six months later, a user complains that the chatbot confidently recommended a product you discontinued three months ago. An engineer digs in and discovers the system has been wrong about a third of what users ask — not because of a bad deploy, not because of a dependency upgrade, but because time passed. You shipped a snapshot into a river.

This isn't a hypothetical. Industry data shows that 91% of production LLMs experience measurable behavioral drift within 90 days of deployment. A customer support chatbot that initially handled 70% of inquiries without escalation can quietly drop to under 50% by month three — while infrastructure dashboards stay green the entire time. The six-month cliff is real, it's silent, and most teams don't have the instrumentation to see it coming.

Subgroup Fairness Testing in Production AI: Why Aggregate Accuracy Lies

· 11 min read
Tian Pan
Software Engineer

When a face recognition system reports 95% accuracy, your first instinct is to ship it. The instinct is wrong. That same system can simultaneously fail darker-skinned women at a 34% error rate while achieving 0.8% on lighter-skinned men — a 40x disparity, fully hidden inside that reassuring aggregate number.

This is the aggregate accuracy illusion, and it destroys production AI features in industries ranging from hiring to healthcare to speech recognition. The pattern is structurally identical to Simpson's Paradox: a model that looks fair in aggregate can discriminate systematically across every meaningful subgroup simultaneously. Aggregate metrics are weighted averages. When some subgroups are smaller or underrepresented in your eval set, their failure rates get diluted by the majority's success.

The fix is not a different accuracy threshold. It is disaggregated evaluation — computing your performance metrics per subgroup, defining disparity SLOs, and monitoring them continuously in production the same way you monitor latency and error rate.

The AI Feature Maintenance Cliff: Why Your AI-Powered Features Age Faster Than You Think

· 9 min read
Tian Pan
Software Engineer

You ship an AI-powered feature, users love it, and then three months later your support inbox fills up with confused complaints. Nothing in your infrastructure changed. The code is identical. But the feature quietly stopped being good.

This is the AI feature maintenance cliff: the moment when accumulated silent degradation becomes a visible failure. Unlike traditional software bugs, which announce themselves with stack traces and failed requests, AI quality erosion returns HTTP 200 with well-formed JSON and completely wrong answers. Your dashboards are green. Your feature is broken.

A cross-institutional study covering 32 datasets across four industries found that 91% of ML models degrade over time without proactive intervention. That's not a tail risk — it's the expected outcome for every AI feature you ship and walk away from.

The AI Feature Retirement Playbook: How to Sunset What Users Barely Adopted

· 11 min read
Tian Pan
Software Engineer

Your team shipped an AI-powered summarization feature six months ago. Adoption plateaued at 8% of users. The model calls cost $4,000 a month. The one engineer who built it has moved to a different team. And now the model provider is raising prices.

Every instinct says: kill it. But killing an AI feature turns out to be significantly harder than killing any other kind of feature — and most teams find this out the hard way, mid-retirement, when the compliance questions start arriving and the power users revolt.

This is the playbook that should exist before you ship the feature, but is most useful right now, when you're staring at usage graphs that point unmistakably toward the exit.

The Three Silent Clocks of AI Technical Debt

· 10 min read
Tian Pan
Software Engineer

Traditional technical debt announces itself. A slow build, a failing test, a lint warning that's been suppressed for six months — all of these are symptoms you can grep for, assign to a ticket, and schedule into a sprint. AI-specific debt is different. It accumulates in silence, in the gaps between deploys, and it degrades your system's behavior before anyone notices that the numbers have moved.

Three debt clocks are ticking in most production AI systems right now. The first is the prompt that made sense when a specific model version was current. The second is the evaluation set that was representative of user behavior when it was assembled, but no longer is. The third is the index of embeddings still powering your retrieval layer, generated from a model that has since been deprecated. Each clock runs independently. All three compound.

The Annotation Economy: Why Every Label Source Has a Hidden Tax

· 9 min read
Tian Pan
Software Engineer

Most teams pick their annotation strategy by comparing unit costs: crowd workers run about 0.08perlabel,LLMgenerationunder0.08 per label, LLM generation under 0.003, human domain experts around $1. Run the spreadsheet, pick the cheapest option that seems "good enough," and ship. This math consistently gets teams into trouble.

The actual decision is not about cost per label in isolation. Every label source carries a hidden quality tax — compounding costs in the form of garbage gradients, misleading eval curves, or months spent debugging production failures that clean labels would have caught at training time. The cheapest source is often the most expensive one when you count the downstream cost of trusting it.

The Feedback Loop You Never Closed: Turning User Behavior into AI Ground Truth

· 10 min read
Tian Pan
Software Engineer

Most teams building AI products spend weeks designing rating widgets, click-to-rate stars, thumbs-up/thumbs-down buttons. Then they look at the data six months later and find a 2% response rate — biased toward outlier experiences, dominated by people with strong opinions, and almost entirely useless for distinguishing a 7/10 output from a 9/10 one.

Meanwhile, every user session is generating a continuous stream of honest, unambiguous behavioral signals. The user who accepts a code suggestion and moves on is satisfied. The user who presses Ctrl+Z immediately is not. The user who rephrases their question four times in a row is telling you something explicit ratings will never capture: the first three responses failed. These signals exist whether you collect them or not. The question is whether you're closing the loop.

Continuous Deployment for AI Models: Your Rollback Signal Is Wrong

· 10 min read
Tian Pan
Software Engineer

Your deployment pipeline is green. Latency is nominal. Error rate: 0.02%. The new model version shipped successfully — or so your dashboard says.

Meanwhile, your customer-facing AI is subtly summarizing documents with less precision, hedging on questions it used to answer directly, and occasionally flattening the structured outputs your downstream pipeline depends on. No alerts fire. No on-call page triggers. The first signal you get is a support ticket, two weeks later.

This is the silent regression problem in AI deployments. Traditional rollback signals — HTTP errors, p99 latency, exception rates — are built for deterministic software. They cannot see behavioral drift. And as teams upgrade language models more frequently, the gap between "infrastructure is healthy" and "AI is working correctly" becomes a place where regressions hide.

The AI Feature Sunset Playbook: Decommissioning Agents Without Breaking Your Users

· 10 min read
Tian Pan
Software Engineer

Most teams discover the same thing at the worst possible time: retiring an AI feature is nothing like deprecating an API. You add a sunset date to the docs, send the usual three-email sequence, flip the flag — and then watch your support queue spike 80% while users loudly explain that the replacement "doesn't work the same way." What they mean is: the old agent's quirks, its specific failure modes, its particular brand of wrong answer, had all become load-bearing. They'd built workflows around behavior they couldn't name until it was gone.

This is the core problem with AI feature deprecation. Deterministic APIs have explicit contracts. If you remove an endpoint, every caller that relied on it gets a 404. The breakage is traceable, finite, and predictable. Probabilistic AI outputs are different — users don't integrate the contract, they integrate the behavioral distribution. Removing a model doesn't just remove a capability; it removes a specific pattern of behavior that users may have spent months adapting to without realizing it.

Embedding Drift: The Silent Degradation Killing Your Long-Lived RAG System

· 10 min read
Tian Pan
Software Engineer

Your RAG system is running fine. Latency is normal. Error rate is zero. But a user asking about "California employment law" keeps getting results about real estate — and your logs show nothing wrong.

This is embedding drift in action: the retrieval failure mode that doesn't throw exceptions, doesn't spike error rates, and doesn't show up in standard observability dashboards. It happens when your vector store accumulates embeddings produced under different conditions — different model versions, different chunking rules, different preprocessing pipelines — and the vectors start pointing in incompatible directions. The system keeps serving requests, but the semantic coordinates are no longer aligned, and retrieval quality erodes quietly over weeks or months.

Eval Set Decay: Why Your Benchmark Becomes Misleading Six Months After You Build It

· 10 min read
Tian Pan
Software Engineer

You spend three weeks curating a high-quality eval set. You write test cases that cover the edge cases your product manager worries about, sample real queries from beta users, and get a clean accuracy number that the team aligns on. Six months later, that number is still in the weekly dashboard. You just shipped a model update that looked great on evals. Users are filing tickets.

The problem isn't that the model regressed. The problem is that your eval set stopped representing reality months ago—and nobody noticed.

This failure mode has a name: eval set decay. It happens to almost every production AI team, and it's almost never caught until the damage is visible in user behavior.

Invisible Model Drift: How Silent Provider Updates Break Production AI

· 10 min read
Tian Pan
Software Engineer

Your prompts worked on Monday. On Wednesday, users start complaining that responses feel off — answers are shorter, the JSON parsing downstream is breaking intermittently, the classifier that had been 94% accurate is now hovering around 79%. You haven't deployed anything. The model you're calling still has the same name in your config. But something changed.

This is invisible model drift: the silent, undocumented behavior changes that LLM providers push without announcement. It is one of the least-discussed operational hazards in AI engineering, and it hits teams that have done everything "right" — with evals, with monitoring, with stable prompt engineering. The model just changed underneath them.