109 posts tagged with "mlops"

Your Eval Harness Is a Museum: How Production Failures Should Write Tomorrow's Tests

· 9 min read
Tian Pan
Software Engineer

Most AI teams build their eval suite once — carefully, thoughtfully, during the sprint before launch. They write cases for the edge scenarios they can imagine, document the expected outputs, get sign-off, and ship. Six months later, the suite still passes. The model has quietly gotten worse on the actual traffic hitting production, but the eval harness was authored before any of that traffic existed. It's still grading the answers to questions the author asked, not the questions users are asking.

That's the museum problem: an eval suite curated at one point in time accumulates relics. It proves the system handles the cases someone anticipated, not the cases that actually break it.
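One way to keep the suite from fossilizing is to fold reviewed production failures back into it as regression cases. The sketch below is a minimal illustration under stated assumptions, not the author's pipeline: `EvalCase`, `prompt_id`, and `merge_failures` are hypothetical names, and it assumes each failure has already been reviewed and paired with a corrected expected output.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str
    source: str  # "authored" (written before launch) or "production"


def prompt_id(prompt: str) -> str:
    """Stable ID for a prompt; collapses case and whitespace differences."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def merge_failures(suite, failures):
    """Return a new suite with reviewed production failures appended.

    `failures` is a list of (prompt, corrected_output) pairs. Prompts that
    the suite already covers (after normalization) are skipped.
    """
    seen = {prompt_id(case.prompt) for case in suite}
    merged = list(suite)
    for prompt, corrected in failures:
        pid = prompt_id(prompt)
        if pid not in seen:
            seen.add(pid)
            merged.append(EvalCase(prompt, corrected, "production"))
    return merged


# Hypothetical data: one authored case, three logged failures (two are duplicates).
suite = [EvalCase("What is the refund policy?", "30 days", "authored")]
failures = [
    ("what is  the REFUND policy?", "30 days"),
    ("Cancel my order #123", "Order cancelled"),
    ("Cancel my order #123", "Order cancelled"),
]
merged = merge_failures(suite, failures)
```

Tagging each case with its `source` makes it possible to track what fraction of the suite comes from real traffic, which is a rough proxy for how much of the museum has been refreshed.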

The Staging Environment Lie: Why Pre-Production Fails for AI Systems

· 9 min read
Tian Pan
Software Engineer

Your staging environment passed all its checks. The LLM responded correctly to every test prompt. Latency was good. Quality scores looked fine. You shipped. Then, two days later, production started hallucinating on a class of queries your eval set never covered, your costs spiked 3x because the cache was cold, and a model update your provider pushed silently changed behavior in ways your old test suite couldn't detect. Staging said green. Production said otherwise.

This isn't a testing gap you can close by writing more test cases. Pre-production environments are structurally misleading for AI systems in ways they aren't for traditional software. The failure modes are systematic, and the fix isn't better staging — it's a different architecture.

The Two-Speed Organization: Why AI Teams and Product Teams Run on Incompatible Clocks

· 10 min read
Tian Pan
Software Engineer

Your ML team ran a promising experiment. The model beat the baseline by 8 points on your eval set. Stakeholders are excited. Then it took four months to ship — and by the time the feature launched, the product roadmap had moved on, the team that requested it had a different priority, and half the infra work got redone because the deployment target changed mid-flight. Sound familiar?

This is the clock-mismatch problem: AI teams and product teams run on fundamentally different time scales, and most organizations treat this as a coordination failure when it is actually an architectural one. You cannot fix a structural mismatch with a better standup cadence.

The Agent Portfolio Audit: How to Consolidate 15 Independent Agents Into a Platform Without Killing Team Autonomy

· 9 min read
Tian Pan
Software Engineer

Six months after launching their first AI agent, most engineering organizations discover they have fifteen of them. Not because anyone planned a fleet — because each team solved a real problem and shipped. The customer support team built a triage agent. The data team built a report-generation agent. Platform engineering built a runbook agent. Infrastructure built three more. None of them share auth, logging, tooling, or evaluation methodology. Tokens are bleeding from a dozen provider accounts and nobody can tell you which agent is responsible.

This is the moment that separates engineering organizations that can scale AI from those that can't. The answer is not to slow down agent development — it's to run a portfolio audit before entropy makes consolidation impossible.

The Ethics Review Gate Your AI Shipping Process Is Missing

· 9 min read
Tian Pan
Software Engineer

Most engineering teams treat ethics like they used to treat security: something you address after the feature ships, if someone complains. The parallels are uncomfortable. In 2004, SQL injection was a "we'll fix it later" problem. Today, every serious team has automated injection detection in CI. Ethics reviews in AI are at the same inflection point — and teams that don't build the gate now will learn the hard way why it exists.

The gap is not intent. It's structure. Security reviews have a 20-year head start on standardization: OWASP checklists, CVE scoring, penetration tests, mandatory sign-offs before production. Ethics reviews have none of that ceremony. Most teams have no defined trigger, no checklist, no exit criteria, and no named owner. The results are predictable. A healthcare algorithm cut the number of Black patients identified for extra care by more than 50%, not because engineers were malicious, but because no one ran disaggregated accuracy numbers before it went live. A recruiting model systematically downranked resumes containing the word "women's": it was trained on historical data, shipped without a fairness pass, and discovered months into production. These aren't edge cases. They're what happens when ethics is a post-launch checkbox with no teeth.
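Running "disaggregated accuracy numbers" can be as simple as splitting the usual accuracy calculation by group and gating on the gap. The following is a rough sketch, not a complete fairness review: the group labels, the toy records, and the `GAP_THRESHOLD` value are all hypothetical, and accuracy gap is only one of several metrics a real gate would examine.

```python
from collections import defaultdict


def disaggregated_accuracy(records):
    """Return accuracy per group from (group, y_true, y_pred) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(per_group):
    """Largest accuracy difference between any two groups."""
    values = list(per_group.values())
    return max(values) - min(values)


# Toy data: aggregate accuracy is 0.75, which hides a large per-group gap.
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 1, 1), ("group_b", 0, 0), ("group_b", 1, 0),
]

per_group = disaggregated_accuracy(records)
gap = max_accuracy_gap(per_group)

GAP_THRESHOLD = 0.10  # hypothetical release-gate threshold, chosen for illustration
gate_passed = gap <= GAP_THRESHOLD
```

In this toy data the overall accuracy looks acceptable while one group sits at 1.0 and the other at 0.5, which is exactly the pattern an aggregate-only check cannot see.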

Training Data Self-Poisoning: When Your AI Feature Corrupts Its Own Ground Truth

· 10 min read
Tian Pan
Software Engineer

Your recommendation model launched three months ago. Click-through rates are up 18%. Watch time is climbing. The dashboard is green. Leadership is happy.

And your model is quietly destroying the data it will use to train its next version.

This is training data self-poisoning: a feedback loop where a deployed AI feature shifts user behavior in ways that corrupt the interaction data the model was originally trained to learn from. The worst part is that your standard engagement metrics will tell you everything is fine — right up until they don't.

The Data Flywheel Assumption: When AI Features Compound and When They Just Accumulate Noise

· 9 min read
Tian Pan
Software Engineer

Every AI pitch deck includes a slide about the data flywheel. The story is appealing: users interact with your AI feature, that interaction generates data, the data trains a better model, the better model attracts more users, and the cycle repeats. Scale long enough and you have an insurmountable competitive moat.

The problem is that most teams shipping AI features don't have a flywheel. They have a log file. A very large, expensive-to-store log file that has never improved their model and never will—because the three preconditions for a real flywheel are missing and nobody has asked whether they're present.

Diffusion Models in Production: The Engineering Stack Nobody Discusses After the Demo

· 10 min read
Tian Pan
Software Engineer

Your image generation feature just went viral. 100,000 requests are coming in daily. The API provider's rate limit technically accommodates it. Latency crawls to 12 seconds at p95. Your NSFW classifier is flagging legitimate medical illustrations. A compliance audit surfaces watermarking and disclosure obligations under California's AI Transparency Act, which was signed in September 2024. Support has 50 open tickets from users whose content was silently blocked. By the time you realize you need a real production stack, you've already burned two weeks in crisis mode.

This is the moment "just call the API" fails—not because the API is bad, but because the demo's success exposes every assumption you made about inference latency, content policy, moderation fairness, and regulatory compliance. The engineering work nobody shows you in tutorials lives here.

The Eval Debt Ratchet: How Teams Get Buried Cleaning Up What They Shipped on Vibes

· 10 min read
Tian Pan
Software Engineer

Three months after shipping a document summarization feature, a team at a mid-size company runs a prompt improvement. The new prompt scores better on the five examples they tested manually. They deploy it Friday afternoon. Monday morning, their Slack is full of user reports: summaries are now truncating half the document and presenting the truncated version as complete. The feature looked fine. The change passed review. Nobody noticed because there was no evaluation — no golden test set, no regression baseline, no automated check. The ratchet had been turning silently for months.

This is eval debt in its most recognizable form. The team didn't skip evaluations because they were careless. They skipped them because writing evaluations for AI features is harder than it sounds, the feature shipped fast and looked good, and nobody wanted to slow down a team with momentum. Now they're paying the compound interest.
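A golden test set with a regression baseline does not have to be elaborate to catch the truncation failure described above. The sketch below assumes a hypothetical golden set of documents paired with phrases a faithful summary must mention, and uses stand-in summarizer functions in place of real prompt calls; phrase coverage is a crude proxy chosen for illustration, not a recommended quality metric.

```python
from typing import Callable

# (document, phrases a faithful summary must mention): a hypothetical format
GoldenCase = tuple[str, list[str]]


def coverage(summary: str, required: list[str]) -> float:
    """Fraction of required phrases that appear in the summary."""
    text = summary.lower()
    hits = sum(1 for phrase in required if phrase.lower() in text)
    return hits / len(required)


def mean_coverage(summarize: Callable[[str], str], golden: list[GoldenCase]) -> float:
    """Average coverage of a summarizer across the golden set."""
    scores = [coverage(summarize(doc), required) for doc, required in golden]
    return sum(scores) / len(scores)


def regression_check(baseline, candidate, golden, tolerance=0.02) -> bool:
    """Pass only if the candidate is no worse than the baseline, within tolerance."""
    return mean_coverage(candidate, golden) >= mean_coverage(baseline, golden) - tolerance


# Stand-ins for the old and new prompts (assumptions, not real model calls).
def baseline_summarizer(doc: str) -> str:
    return doc  # pretend the old prompt covers the whole document


def truncating_summarizer(doc: str) -> str:
    return doc[: len(doc) // 2]  # mimics the "half the document" failure


golden = [
    (
        "Revenue rose 10 percent. Churn fell. Headcount stayed flat. "
        "The board approved the budget.",
        ["revenue", "budget"],
    ),
]

safe_to_deploy = regression_check(baseline_summarizer, truncating_summarizer, golden)
```

Wired into CI, a check like this would have blocked the Friday deploy: the truncating candidate covers only half of the required phrases, so `safe_to_deploy` comes back false.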

The Federated AI Team: Why Centralizing AI Expertise Creates the Problems It Was Supposed to Solve

· 10 min read
Tian Pan
Software Engineer

The central AI team was supposed to be the answer. Hire the best ML engineers into a single group, standardize the tooling, establish governance, and let product teams consume AI capabilities without needing to understand them. It's a compelling architecture — clean on an org chart, defensible in a board presentation. In practice, it reliably produces a failure mode that looks exactly like the fragmentation it was created to eliminate.

The central AI team becomes a bottleneck. Product teams queue behind it. The AI it ships feels generic to every domain that needs something specific. The ML engineers who built the platform don't know the product metrics. The product engineers who need help can't debug AI behavior without filing a ticket. A 3-month pilot succeeds; a 9-month security review buries it.

The share of companies that reported abandoning most of their AI initiatives more than doubled from 2024 to 2025. Many of those failures happened at the transition from proof of concept to production, precisely where an overstretched, disconnected central team shows its seams.

The Frozen Feature Trap: When Your AI Differentiator Becomes a Maintenance Anchor

· 9 min read
Tian Pan
Software Engineer

In 2022, a team spent three months fine-tuning a BERT-based classifier to categorize customer support tickets. It was a genuine win — 94% accuracy where their old rule-based system topped out at 70%. Two years later, the same classifier runs on aging infrastructure, requires a specialist to retrain whenever categories shift, and gets beaten on a fresh benchmark by a zero-shot prompt to a frontier model. Nobody wants to touch it. The engineer who built it left. The current team is afraid that deprecating it will break something. The feature is frozen.

This is the frozen feature trap. It's one of the quieter forms of AI technical debt, and it's accumulating across the industry as teams discover that what looked like a moat was actually a hole they've been shoveling money into.

The Generalization Cliff: How Fine-Tuning Creates Silent Capability Regressions

· 9 min read
Tian Pan
Software Engineer

A team at an enterprise software company fine-tuned a 7B model on customer support tickets. The target metric — resolution accuracy — improved by 12 percentage points. The team shipped it. Three weeks later, the product had a second failure mode nobody expected: the model had quietly lost the ability to handle multi-step questions. Users would ask something slightly outside the support domain and receive a confident but incoherent answer. The model had traded breadth it didn't know it needed for depth it could measure.

This is the generalization cliff: the silent capability degradation that follows narrow fine-tuning. Unlike a crash or a timeout, it produces no error. The model still responds. It just responds worse on tasks adjacent to its training distribution — and those tasks never appeared in the eval suite.