129 posts tagged with "mlops"

The Feedback Loop You Never Closed: Turning User Behavior into AI Ground Truth

April 19, 2026 · 10 min read

Software Engineer

Most teams building AI products spend weeks designing rating widgets, click-to-rate stars, thumbs-up/thumbs-down buttons. Then they look at the data six months later and find a 2% response rate — biased toward outlier experiences, dominated by people with strong opinions, and almost entirely useless for distinguishing a 7/10 output from a 9/10 one.

Meanwhile, every user session is generating a continuous stream of honest, unambiguous behavioral signals. The user who accepts a code suggestion and moves on is satisfied. The user who presses Ctrl+Z immediately is not. The user who rephrases their question four times in a row is telling you something explicit ratings will never capture: the first three responses failed. These signals exist whether you collect them or not. The question is whether you're closing the loop.

Continuous Deployment for AI Models: Your Rollback Signal Is Wrong

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your deployment pipeline is green. Latency is nominal. Error rate: 0.02%. The new model version shipped successfully — or so your dashboard says.

Meanwhile, your customer-facing AI is subtly summarizing documents with less precision, hedging on questions it used to answer directly, and occasionally flattening the structured outputs your downstream pipeline depends on. No alerts fire. No on-call page triggers. The first signal you get is a support ticket, two weeks later.

This is the silent regression problem in AI deployments. Traditional rollback signals — HTTP errors, p99 latency, exception rates — are built for deterministic software. They cannot see behavioral drift. And as teams upgrade language models more frequently, the gap between "infrastructure is healthy" and "AI is working correctly" becomes a place where regressions hide.

The AI Feature Sunset Playbook: Decommissioning Agents Without Breaking Your Users

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams discover the same thing at the worst possible time: retiring an AI feature is nothing like deprecating an API. You add a sunset date to the docs, send the usual three-email sequence, flip the flag — and then watch your support queue spike 80% while users loudly explain that the replacement "doesn't work the same way." What they mean is: the old agent's quirks, its specific failure modes, its particular brand of wrong answer, had all become load-bearing. They'd built workflows around behavior they couldn't name until it was gone.

This is the core problem with AI feature deprecation. Deterministic APIs have explicit contracts. If you remove an endpoint, every caller that relied on it gets a 404. The breakage is traceable, finite, and predictable. Probabilistic AI outputs are different — users don't integrate the contract, they integrate the behavioral distribution. Removing a model doesn't just remove a capability; it removes a specific pattern of behavior that users may have spent months adapting to without realizing it.

Embedding Drift: The Silent Degradation Killing Your Long-Lived RAG System

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your RAG system is running fine. Latency is normal. Error rate is zero. But a user asking about "California employment law" keeps getting results about real estate — and your logs show nothing wrong.

This is embedding drift in action: the retrieval failure mode that doesn't throw exceptions, doesn't spike error rates, and doesn't show up in standard observability dashboards. It happens when your vector store accumulates embeddings produced under different conditions — different model versions, different chunking rules, different preprocessing pipelines — and the vectors start pointing in incompatible directions. The system keeps serving requests, but the semantic coordinates are no longer aligned, and retrieval quality erodes quietly over weeks or months.

Eval Set Decay: Why Your Benchmark Becomes Misleading Six Months After You Build It

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

You spend three weeks curating a high-quality eval set. You write test cases that cover the edge cases your product manager worries about, sample real queries from beta users, and get a clean accuracy number that the team aligns on. Six months later, that number is still in the weekly dashboard. You just shipped a model update that looked great on evals. Users are filing tickets.

The problem isn't that the model regressed. The problem is that your eval set stopped representing reality months ago—and nobody noticed.

This failure mode has a name: eval set decay. It happens to almost every production AI team, and it's almost never caught until the damage is visible in user behavior.

Invisible Model Drift: How Silent Provider Updates Break Production AI

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your prompts worked on Monday. On Wednesday, users start complaining that responses feel off — answers are shorter, the JSON parsing downstream is breaking intermittently, the classifier that had been 94% accurate is now hovering around 79%. You haven't deployed anything. The model you're calling still has the same name in your config. But something changed.

This is invisible model drift: the silent, undocumented behavior changes that LLM providers push without announcement. It is one of the least-discussed operational hazards in AI engineering, and it hits teams that have done everything "right" — with evals, with monitoring, with stable prompt engineering. The model just changed underneath them.

LLM-Powered Data Pipelines: The ETL Tier Nobody Benchmarks

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Most conversations about LLMs in production orbit around chat interfaces, copilots, and autonomous agents. But if you audit where enterprise LLM tokens are actually being consumed, a different picture emerges: a quiet majority of usage is happening inside batch data pipelines — extracting fields from documents, classifying support tickets, normalizing messy vendor records, enriching raw events with semantic labels. Nobody is writing conference talks about this tier. Nobody is benchmarking it seriously either. And that silence is costing teams real money and real accuracy.

This is the ETL tier that practitioners build first, justify last, and monitor least. It is also, for most organizations, the layer where LLM spend has the highest leverage — and the highest potential for invisible failure.

LoRA Adapter Composition in Production: Running Multiple Fine-Tuned Skills Without Model Wars

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

The promise sounds clean: fine-tune lightweight LoRA adapters for each specialized skill — one for professional tone, one for JSON formatting, one for medical terminology, one for safety guardrails — then combine them at serving time. Teams ship this design, it works fine in development, and then falls apart in production when two adapters start fighting over the same weight regions and the output quality collapses to something indistinguishable from the untrained base model. Not slightly worse. Completely untuned.

This post is about what happens when you compose adapters in practice, why naive merging fails so reliably, and what strategies actually work at production scale.

The 90% Reliability Wall: Why AI Features Plateau and What to Do About It

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

Your AI feature ships at 92% accuracy. The team celebrates. Three months later, progress has flatlined — the error rate stopped falling despite more data, more compute, and two model upgrades. Sound familiar?

This is the 90% reliability wall, and it is not a coincidence. It emerges from three converging forces: the exponential cost of marginal accuracy gains, the difference between errors you can eliminate and errors that are structurally unavoidable, and the compound amplification of failure in production environments that benchmarks never capture. Teams that do not understand which force they are fighting will waste quarters trying to solve problems that are not solvable.

Serving AI at the Edge: A Decision Framework for Moving Inference Out of the Cloud

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Most AI inference decisions get made the same way: the model lives in the cloud because that's where you can run it, full stop. But that calculus is changing fast. Flagship smartphones now carry neural engines capable of running 7B-parameter models at interactive speeds. A Snapdragon 8 Elite can generate tokens from a 3B model at around 10 tokens per second — fast enough for conversational use — while a Qualcomm Hexagon NPU hits 690 tokens per second on prefill. The question is no longer "can we run this on device?" but "should we, and when?"

The answer is rarely obvious. Moving inference to the edge introduces real tradeoffs: a quality tax from quantization, a maintenance burden for fleet updates, and hardware fragmentation across device SKUs. But staying in the cloud has its own costs: round-trip latency measured in hundreds of milliseconds, cloud GPU bills that compound at scale, and data sovereignty problems that no SLA can fully solve. This post lays out a practical framework for navigating those tradeoffs.

Who Owns AI Quality? The Cross-Functional Vacuum That Breaks Production Systems

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

When Air Canada's support chatbot promised customers a discount fare for recently bereaved travelers, the policy it described didn't exist. A court later ordered Air Canada to honor the hallucinated refund anyway. When a Chevrolet dealership chatbot negotiated away a 2024 Tahoe for $1, no mechanism stopped it. In both cases, the immediate question was about model quality. The real question — the one that matters operationally — was simpler: who was supposed to catch that?

The answer, in most organizations, is nobody specific. AI quality sits at the intersection of ML engineering, product management, data teams, and operations. Each function has a partial view. None claims full ownership. The result is a vacuum where things that should be caught aren't, and when something breaks, the postmortem produces a list of teams that each assumed someone else was responsible.

The AI On-Call Playbook: Incident Response When the Bug Is a Bad Prediction

April 18, 2026 · 12 min read

Tian Pan

Software Engineer

Your pager fires at 2 AM. The dashboard shows no 5xx errors, no timeout spikes, no unusual latency. Yet customer support is flooded: "the AI is giving weird answers." You open the runbook—and immediately realize it was written for a different kind of system entirely.

This is the defining failure mode of AI incident response in 2026. The system is technically healthy. The bug is behavioral. Traditional runbooks assume discrete failure signals: a stack trace, an error code, a service that won't respond. LLM-based systems break this assumption completely. The output is grammatically correct, delivered at normal latency, and thoroughly wrong. No alarm catches it. The only signal is that something "feels off."

This post is the playbook I wish existed when I first had to respond to a production AI incident.

About Tian Pan