Skip to main content

116 posts tagged with "mlops"

View all tags

The Golden Dataset Decay Problem: When Your Eval Set Becomes a Liability

· 9 min read
Tian Pan
Software Engineer

Most teams treat their golden eval set like a constitution — permanent, authoritative, and expensive to touch. They spend weeks curating examples, getting domain experts to label them, and wiring them into CI. Then they move on.

Six months later, the eval suite reports 87% pass rate while users are complaining about broken outputs. The evals haven't regressed — they've decayed. The dataset still measures what mattered in October. It just no longer measures what matters now.

This is the golden dataset decay problem, and it's more common than most teams admit.

What Model Cards Don't Tell You: The Production Gap Between Published Benchmarks and Real Workloads

· 9 min read
Tian Pan
Software Engineer

A model card says 89% accuracy on code generation. Your team gets 28% on the actual codebase. A model card says 100K token context window. Performance craters at 32K under your document workload. A model card passes red-team safety evaluation. A prompt injection exploit ships to your users within 72 hours of launch.

This gap isn't rare. It's the norm. In a 2025 analysis of 1,200 production deployments, 42% of companies abandoned their AI initiatives at the production integration stage — up from 17% the previous year. Most of them had read the model cards carefully.

The problem isn't that model cards lie. It's that they measure something different from what you need to know. Understanding that gap precisely — and building the internal benchmark suite to close it — is what separates teams that ship reliable AI from teams that ship regrets.

The Production Distribution Gap: Why Your Internal Testers Can't Find the Bugs Users Do

· 11 min read
Tian Pan
Software Engineer

Your AI feature passed internal testing with flying colors. Engineers loved it, product managers gave the thumbs up, and the eval suite showed 94% accuracy on the benchmark suite. Then you shipped it, and within two weeks users were hitting failure modes you'd never seen — wrong answers, confused outputs, edge cases that made the model look embarrassingly bad.

This is the production distribution gap. It's not a new problem, but it's dramatically worse for AI systems than for deterministic software. Understanding why — and having a concrete plan to address it — is the difference between an AI feature that quietly erodes user trust and one that improves with use.

Prompt Versioning Done Right: Treating LLM Instructions as Production Software

· 8 min read
Tian Pan
Software Engineer

Three words. That's all it took.

A team added three words to an existing prompt to improve "conversational flow" — a tweak that seemed harmless in the playground. Within hours, structured-output error rates spiked, a revenue-generating workflow stopped functioning, and engineers were scrambling to reconstruct what the prompt had said before the change. No version history. No rollback. Just a Slack message from someone who remembered it "roughly" and a diff against an obsolete copy in a Google Doc.

This is not a hypothetical. It is a pattern repeated across nearly every organization that ships LLM features at scale. Prompts start as strings in application code, evolve through informal edits, accumulate undocumented micro-adjustments, and eventually reach a state where nobody is confident about what's running in production or why it behaves the way it does.

The fix is not a new tool. It's discipline applied to something teams have been treating as config.

Shadow to Autopilot: A Readiness Framework for AI Feature Autonomy

· 11 min read
Tian Pan
Software Engineer

When a fintech company first deployed an AI transaction approval agent, the product team was convinced the model was ready for autonomy after a week of positive offline evals. They pushed it to co-pilot mode — where the agent suggested approvals and humans could override — and the approval rates looked great. Three weeks later, a pattern surfaced: the model was systematically under-approving transactions from non-English-speaking users in ways that correlated with name patterns, not risk signals. No one had checked segment-level performance before the rollout. The model wasn't a fraud-detection failure. It was a stage-gate failure.

Most teams understand, in principle, that AI features should be rolled out gradually. What they don't have is a concrete engineering framework for what "gradual" actually means: which metrics unlock each stage, what monitoring is required before escalation, and what triggers an automatic rollback. Without these, autonomy escalation becomes an act of organizational optimism rather than a repeatable engineering decision.

The Six-Month Cliff: Why Production AI Systems Degrade Without a Single Code Change

· 9 min read
Tian Pan
Software Engineer

Your AI feature shipped green. Latency is fine, error rates are negligible, and the HTTP responses return 200. Six months later, a user complains that the chatbot confidently recommended a product you discontinued three months ago. An engineer digs in and discovers the system has been wrong about a third of what users ask — not because of a bad deploy, not because of a dependency upgrade, but because time passed. You shipped a snapshot into a river.

This isn't a hypothetical. Industry data shows that 91% of production LLMs experience measurable behavioral drift within 90 days of deployment. A customer support chatbot that initially handled 70% of inquiries without escalation can quietly drop to under 50% by month three — while infrastructure dashboards stay green the entire time. The six-month cliff is real, it's silent, and most teams don't have the instrumentation to see it coming.

Subgroup Fairness Testing in Production AI: Why Aggregate Accuracy Lies

· 11 min read
Tian Pan
Software Engineer

When a face recognition system reports 95% accuracy, your first instinct is to ship it. The instinct is wrong. That same system can simultaneously fail darker-skinned women at a 34% error rate while achieving 0.8% on lighter-skinned men — a 40x disparity, fully hidden inside that reassuring aggregate number.

This is the aggregate accuracy illusion, and it destroys production AI features in industries ranging from hiring to healthcare to speech recognition. The pattern is structurally identical to Simpson's Paradox: a model that looks fair in aggregate can discriminate systematically across every meaningful subgroup simultaneously. Aggregate metrics are weighted averages. When some subgroups are smaller or underrepresented in your eval set, their failure rates get diluted by the majority's success.

The fix is not a different accuracy threshold. It is disaggregated evaluation — computing your performance metrics per subgroup, defining disparity SLOs, and monitoring them continuously in production the same way you monitor latency and error rate.

The AI Feature Maintenance Cliff: Why Your AI-Powered Features Age Faster Than You Think

· 9 min read
Tian Pan
Software Engineer

You ship an AI-powered feature, users love it, and then three months later your support inbox fills up with confused complaints. Nothing in your infrastructure changed. The code is identical. But the feature quietly stopped being good.

This is the AI feature maintenance cliff: the moment when accumulated silent degradation becomes a visible failure. Unlike traditional software bugs, which announce themselves with stack traces and failed requests, AI quality erosion returns HTTP 200 with well-formed JSON and completely wrong answers. Your dashboards are green. Your feature is broken.

A cross-institutional study covering 32 datasets across four industries found that 91% of ML models degrade over time without proactive intervention. That's not a tail risk — it's the expected outcome for every AI feature you ship and walk away from.

The AI Feature Retirement Playbook: How to Sunset What Users Barely Adopted

· 11 min read
Tian Pan
Software Engineer

Your team shipped an AI-powered summarization feature six months ago. Adoption plateaued at 8% of users. The model calls cost $4,000 a month. The one engineer who built it has moved to a different team. And now the model provider is raising prices.

Every instinct says: kill it. But killing an AI feature turns out to be significantly harder than killing any other kind of feature — and most teams find this out the hard way, mid-retirement, when the compliance questions start arriving and the power users revolt.

This is the playbook that should exist before you ship the feature, but is most useful right now, when you're staring at usage graphs that point unmistakably toward the exit.

The Three Silent Clocks of AI Technical Debt

· 10 min read
Tian Pan
Software Engineer

Traditional technical debt announces itself. A slow build, a failing test, a lint warning that's been suppressed for six months — all of these are symptoms you can grep for, assign to a ticket, and schedule into a sprint. AI-specific debt is different. It accumulates in silence, in the gaps between deploys, and it degrades your system's behavior before anyone notices that the numbers have moved.

Three debt clocks are ticking in most production AI systems right now. The first is the prompt that made sense when a specific model version was current. The second is the evaluation set that was representative of user behavior when it was assembled, but no longer is. The third is the index of embeddings still powering your retrieval layer, generated from a model that has since been deprecated. Each clock runs independently. All three compound.

The Annotation Economy: Why Every Label Source Has a Hidden Tax

· 9 min read
Tian Pan
Software Engineer

Most teams pick their annotation strategy by comparing unit costs: crowd workers run about 0.08perlabel,LLMgenerationunder0.08 per label, LLM generation under 0.003, human domain experts around $1. Run the spreadsheet, pick the cheapest option that seems "good enough," and ship. This math consistently gets teams into trouble.

The actual decision is not about cost per label in isolation. Every label source carries a hidden quality tax — compounding costs in the form of garbage gradients, misleading eval curves, or months spent debugging production failures that clean labels would have caught at training time. The cheapest source is often the most expensive one when you count the downstream cost of trusting it.

The Feedback Loop You Never Closed: Turning User Behavior into AI Ground Truth

· 10 min read
Tian Pan
Software Engineer

Most teams building AI products spend weeks designing rating widgets, click-to-rate stars, thumbs-up/thumbs-down buttons. Then they look at the data six months later and find a 2% response rate — biased toward outlier experiences, dominated by people with strong opinions, and almost entirely useless for distinguishing a 7/10 output from a 9/10 one.

Meanwhile, every user session is generating a continuous stream of honest, unambiguous behavioral signals. The user who accepts a code suggestion and moves on is satisfied. The user who presses Ctrl+Z immediately is not. The user who rephrases their question four times in a row is telling you something explicit ratings will never capture: the first three responses failed. These signals exist whether you collect them or not. The question is whether you're closing the loop.