
123 posts tagged with "mlops"


The Retrograde Accuracy Problem: Why AI Features Degrade as Your Product Grows

· 10 min read
Tian Pan
Software Engineer

Your AI feature ships clean. Accuracy on the eval set: 91%. Latency: acceptable. The team is proud. Six months later, users are complaining that the feature feels "dumb," support tickets are climbing, and your aggregate metrics are quietly 8% worse than launch day. Nobody changed the model. The underlying data pipeline is intact. What happened?

This is the retrograde accuracy problem. As your product grows — new features, new user segments, new edge cases, new flows — the input distribution your AI sees in production quietly drifts away from the distribution it was trained on. No model update. No data pipeline failure. The product itself outgrew the model.
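
One rough way to make this drift visible is to compare the mix of inputs at launch against the mix seen today. The sketch below uses the Population Stability Index over a categorical feature; the function name, the `eps` floor, and the flow labels are illustrative assumptions, not something the post prescribes.

```python
import math
from collections import Counter


def population_stability_index(baseline, production, eps=1e-6):
    """PSI between two samples of a categorical feature (e.g. which product flow a request came from)."""
    categories = set(baseline) | set(production)
    base_counts = Counter(baseline)
    prod_counts = Counter(production)
    psi = 0.0
    for cat in categories:
        # Floor each share at eps so a category missing from one sample
        # does not cause a division by zero or log(0).
        p = max(base_counts[cat] / len(baseline), eps)
        q = max(prod_counts[cat] / len(production), eps)
        psi += (q - p) * math.log(q / p)
    return psi


# Hypothetical traffic: a flow that did not exist at launch now carries 30% of requests.
launch_traffic = ["search"] * 80 + ["checkout"] * 20
current_traffic = ["search"] * 50 + ["checkout"] * 20 + ["new_flow"] * 30

drift = population_stability_index(launch_traffic, current_traffic)
```

A commonly quoted rule of thumb treats a PSI above roughly 0.25 as a significant shift, but that threshold is a convention rather than a guarantee, and it says nothing about whether accuracy actually dropped on the new traffic.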

Scheduling Fairness in Multi-Tenant LLM Inference: Why FIFO Is the Wrong Default

· 11 min read
Tian Pan
Software Engineer

Your company runs a shared LLM serving cluster. Two tenants use it: a customer-facing chatbot with a 500ms first-token latency SLO, and a batch document enrichment pipeline that processes thousands of long-context prompts overnight. One morning, the chatbot team pages you at 3am because their P95 TTFT spiked to 12 seconds. Root cause: the batch job started earlier than expected, filled the GPU memory with prefill work, and the chatbot's short requests sat in queue behind a parade of 8,000-token prompts. Your FIFO scheduler gave them equal priority. The chatbot's SLO was violated 4,000 times before you killed the batch job manually.

This failure mode is well understood in theory and still widespread in practice. Most teams deploy vLLM or TGI with the default FIFO scheduler, add workloads over time, and discover the priority inversion only when an incident happens.
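
As a minimal sketch of the alternative, the toy queue below serves requests by tenant priority and falls back to arrival order within a priority class. It is not the vLLM or TGI scheduler API; the class names, priority values, and tenant labels are assumptions made for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    priority: int                            # lower value is served first
    seq: int                                 # arrival order breaks ties (FIFO within a class)
    tenant: str = field(compare=False)
    prompt_tokens: int = field(compare=False)


class PriorityScheduler:
    """Toy admission queue: strict priority across classes, FIFO inside each class."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, tenant, prompt_tokens, priority):
        request = QueuedRequest(priority, next(self._counter), tenant, prompt_tokens)
        heapq.heappush(self._heap, request)

    def next_request(self):
        return heapq.heappop(self._heap) if self._heap else None


scheduler = PriorityScheduler()
# The batch job arrives first with long prompts; the chatbot request arrives last.
for _ in range(3):
    scheduler.submit("batch-enrichment", prompt_tokens=8000, priority=10)
scheduler.submit("chatbot", prompt_tokens=200, priority=0)

served = []
while (req := scheduler.next_request()) is not None:
    served.append(req.tenant)
```

Under FIFO the chatbot request would be served fourth; here it is served first. Strict priority like this can starve the batch tenant under sustained interactive load, so a real deployment would likely pair it with aging or per-tenant quotas.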

Your Eval Harness Is a Museum: How Production Failures Should Write Tomorrow's Tests

· 9 min read
Tian Pan
Software Engineer

Most AI teams build their eval suite once — carefully, thoughtfully, during the sprint before launch. They write cases for the edge scenarios they can imagine, document the expected outputs, get sign-off, and ship. Six months later, the suite still passes. The model has quietly gotten worse on the actual traffic hitting production, but the eval harness was authored before any of that traffic existed. It's still grading the answers to questions the author asked, not the questions users are asking.

That's the museum problem: an eval suite curated at one point in time accumulates relics. It proves the system handles the cases someone anticipated, not the cases that actually break it.

The Staging Environment Lie: Why Pre-Production Fails for AI Systems

· 9 min read
Tian Pan
Software Engineer

Your staging environment passed all its checks. The LLM responded correctly to every test prompt. Latency was good. Quality scores looked fine. You shipped. Then, two days later, production started hallucinating on a class of queries your eval set never covered, your costs spiked 3x because the cache was cold, and a model update your provider pushed silently changed behavior in ways your old test suite couldn't detect. Staging said green. Production said otherwise.

This isn't a testing gap you can close by writing more test cases. Pre-production environments are structurally misleading for AI systems in ways they aren't for traditional software. The failure modes are systematic, and the fix isn't better staging — it's a different architecture.

The Two-Speed Organization: Why AI Teams and Product Teams Run on Incompatible Clocks

· 10 min read
Tian Pan
Software Engineer

Your ML team ran a promising experiment. The model beat the baseline by 8 points on your eval set. Stakeholders are excited. Then it took four months to ship — and by the time the feature launched, the product roadmap had moved on, the team that requested it had a different priority, and half the infra work got redone because the deployment target changed mid-flight. Sound familiar?

This is the clock-mismatch problem: AI teams and product teams run on fundamentally different time scales, and most organizations treat this as a coordination failure when it is actually an architectural one. You cannot fix a structural mismatch with a better standup cadence.

The Agent Portfolio Audit: How to Consolidate 15 Independent Agents Into a Platform Without Killing Team Autonomy

· 9 min read
Tian Pan
Software Engineer

Six months after launching their first AI agent, most engineering organizations discover they have fifteen of them. Not because anyone planned a fleet — because each team solved a real problem and shipped. The customer support team built a triage agent. The data team built a report-generation agent. Platform engineering built a runbook agent. Infrastructure built three more. None of them share auth, logging, tooling, or evaluation methodology. Tokens are bleeding from a dozen provider accounts and nobody can tell you which agent is responsible.

This is the moment that separates engineering organizations that can scale AI from those that can't. The answer is not to slow down agent development — it's to run a portfolio audit before entropy makes consolidation impossible.

The Ethics Review Gate Your AI Shipping Process Is Missing

· 9 min read
Tian Pan
Software Engineer

Most engineering teams treat ethics like they used to treat security: something you address after the feature ships, if someone complains. The parallels are uncomfortable. In 2004, SQL injection was a "we'll fix it later" problem. Today, every serious team has automated injection detection in CI. Ethics reviews in AI are at the same inflection point — and teams that don't build the gate now will learn the hard way why it exists.

The gap is not intent. It's structure. Security reviews have a 20-year head start on standardization: OWASP checklists, CVE scoring, penetration tests, mandatory sign-offs before production. Ethics reviews have none of that ceremony. Most teams have no defined trigger, no checklist, no exit criteria, and no named owner. The result: a healthcare algorithm that reduced identification of Black patients for care by over 50% not because engineers were malicious, but because no one ran disaggregated accuracy numbers before the thing went live. A recruiting model that systematically downranked resumes containing the word "women's" — trained on historical data, shipped without a fairness pass, discovered months into production. These aren't edge cases. They're what happens when ethics is a post-launch checkbox with no teeth.
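
To make "disaggregated accuracy numbers" concrete, here is a small sketch that reports accuracy per group and the largest gap between groups. The record layout, group labels, and data are synthetic assumptions for illustration only; they do not come from either system described above.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, y_true, y_pred) tuples. Returns {group: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(records):
    """Difference between the best-served and worst-served group."""
    per_group = accuracy_by_group(records)
    return max(per_group.values()) - min(per_group.values())


# Synthetic data: aggregate accuracy is 75%, which hides a 90% vs 60% split.
records = (
    [("group_a", 1, 1)] * 9 + [("group_a", 1, 0)] * 1
    + [("group_b", 1, 1)] * 6 + [("group_b", 1, 0)] * 4
)

per_group = accuracy_by_group(records)
gap = max_accuracy_gap(records)
```

A review gate could fail the build when `gap` exceeds an agreed limit. Choosing that limit, and choosing which groups to slice by, is a policy decision this sketch does not settle.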

Training Data Self-Poisoning: When Your AI Feature Corrupts Its Own Ground Truth

· 10 min read
Tian Pan
Software Engineer

Your recommendation model launched three months ago. Click-through rates are up 18%. Watch time is climbing. The dashboard is green. Leadership is happy.

And your model is quietly destroying the data it will use to train its next version.

This is training data self-poisoning: a feedback loop where a deployed AI feature shifts user behavior in ways that corrupt the interaction data the model was originally trained to learn from. The worst part is that your standard engagement metrics will tell you everything is fine — right up until they don't.

The Data Flywheel Assumption: When AI Features Compound and When They Just Accumulate Noise

· 9 min read
Tian Pan
Software Engineer

Every AI pitch deck includes a slide about the data flywheel. The story is appealing: users interact with your AI feature, that interaction generates data, the data trains a better model, the better model attracts more users, and the cycle repeats. Scale long enough and you have an insurmountable competitive moat.

The problem is that most teams shipping AI features don't have a flywheel. They have a log file. A very large, expensive-to-store log file that has never improved their model and never will—because the three preconditions for a real flywheel are missing and nobody has asked whether they're present.

Diffusion Models in Production: The Engineering Stack Nobody Discusses After the Demo

· 10 min read
Tian Pan
Software Engineer

Your image generation feature just went viral. 100,000 requests are coming in daily. The API provider's rate limit technically accommodates it. Latency crawls to 12 seconds at p95. Your NSFW classifier is flagging legitimate medical illustrations. A compliance audit surfaces watermarking obligations under California's AI Transparency Act, signed in September 2024. Support has 50 open tickets from users whose content was silently blocked. By the time you realize you need a real production stack, you've already burned two weeks in crisis mode.

This is the moment "just call the API" fails—not because the API is bad, but because the demo's success exposes every assumption you made about inference latency, content policy, moderation fairness, and regulatory compliance. The engineering work nobody shows you in tutorials lives here.

The Eval Debt Ratchet: How Teams Get Buried Cleaning Up What They Shipped on Vibes

· 10 min read
Tian Pan
Software Engineer

Three months after shipping a document summarization feature, a team at a mid-size company runs a prompt improvement. The new prompt scores better on the five examples they tested manually. They deploy it Friday afternoon. Monday morning, their Slack is full of user reports: summaries are now truncating half the document and presenting the truncated version as complete. The feature looked fine. The change passed review. Nobody noticed because there was no evaluation — no golden test set, no regression baseline, no automated check. The ratchet had been turning silently for months.

This is eval debt in its most recognizable form. The team didn't skip evaluations because they were careless. They skipped them because writing evaluations for AI features is harder than it sounds, the feature shipped fast and looked good, and nobody wanted to slow down a team with momentum. Now they're paying the compound interest.
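
The missing automated check can start very small. The sketch below is one possible shape for a golden-set regression test, assuming each case lists phrases a faithful summary must mention; the case format, the `must_mention` field, and the stub summarizers are hypothetical.

```python
def regression_check(golden_set, summarize, min_pass_rate=1.0):
    """Run `summarize` over every golden case and report cases whose output drops required phrases."""
    failures = []
    for case in golden_set:
        output = summarize(case["document"]).lower()
        missing = [phrase for phrase in case["must_mention"] if phrase.lower() not in output]
        if missing:
            failures.append((case["id"], missing))
    pass_rate = 1 - len(failures) / len(golden_set)
    return pass_rate >= min_pass_rate, failures


golden_set = [
    {
        "id": "doc-1",
        "document": "Revenue rose in Q1. Costs were flat. Layoffs were announced in Q4.",
        "must_mention": ["Revenue", "Layoffs"],
    },
]


def truncating_summarizer(document):
    # Stand-in for a prompt change that silently drops the second half of the input.
    return document[: len(document) // 2]


def identity_summarizer(document):
    # Stand-in for the previous behaviour, which kept the whole document.
    return document


ok_before, _ = regression_check(golden_set, identity_summarizer)
ok_after, failures = regression_check(golden_set, truncating_summarizer)
```

Phrase matching is a crude proxy for summary quality, but even a check this blunt would flag a change that cuts documents in half before it reaches users.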

The Federated AI Team: Why Centralizing AI Expertise Creates the Problems It Was Supposed to Solve

· 10 min read
Tian Pan
Software Engineer

The central AI team was supposed to be the answer. Hire the best ML engineers into a single group, standardize the tooling, establish governance, and let product teams consume AI capabilities without needing to understand them. It's a compelling architecture — clean on an org chart, defensible in a board presentation. In practice, it reliably produces a failure mode that looks exactly like the fragmentation it was created to eliminate.

The central AI team becomes a bottleneck. Product teams queue behind it. The AI it ships feels generic to every domain that needs something specific. The ML engineers who built the platform don't know the product metrics. The product engineers who need help can't debug AI behavior without filing a ticket. A 3-month pilot succeeds; a 9-month security review buries it.

The share of companies that reported abandoning the majority of their AI initiatives in 2025 was more than double the 2024 figure. Many of those failures happened at the transition from proof of concept to production, precisely where an overstretched, disconnected central team shows its seams.