Skip to main content

109 posts tagged with "mlops"

View all tags

The Human Bottleneck Problem: When Human-in-the-Loop Becomes Your Slowest Microservice

· 9 min read
Tian Pan
Software Engineer

Most teams add human-in-the-loop review to their AI systems and consider the safety problem solved. Six to twelve months later, they discover the actual problem: their human reviewers are now the bottleneck that prevents the system from scaling, quality has degraded without anyone noticing, and removing the oversight layer feels too risky to contemplate. They are stuck.

This is the HITL throughput failure. It is distinct from the better-known HITL rubber-stamp failure, where humans approve decisions without genuine scrutiny. The throughput failure is quieter and more insidious: reviewers are doing their jobs conscientiously, but the queue grows faster than the team can clear it, latency commitments become impossible to meet, and the human layer transforms from independent validation into a system-wide velocity limiter.

The Hyperparameter Illusion: Why Temperature and Top-P Are the Last Things to Tune

· 9 min read
Tian Pan
Software Engineer

When LLM outputs feel wrong, engineers reach for the temperature dial. It's one of the first moves in the debugging playbook — crank it down for more consistency, nudge it up for more creativity. It feels productive because it's easy to change and produces immediately visible effects. It is almost never the right move.

Temperature and top-p are the last 10% of output quality, not the first 90%. The variables that actually determine whether your model succeeds are context quality, instruction clarity, and model selection — in that order. Misconfiguring sampling parameters on top of a broken prompt is like adjusting the seasoning on a dish that hasn't been cooked through. The fundamental problem doesn't move.

The Co-Evolution Trap: How Your AI Feature's Success Is Quietly Destroying Its Evaluations

· 9 min read
Tian Pan
Software Engineer

Your AI feature launched. It's working well. Users are adopting it. Satisfaction scores are up. You go back and run the original eval suite—still green. Six months later, something is quietly wrong, but your dashboards don't show it yet.

This is the co-evolution trap. The moment your AI feature is deployed, it starts changing the people using it. They adapt their workflows, their phrasing, their expectations. That adaptation makes the distribution of inputs your feature actually processes diverge from the distribution you measured at launch. The eval suite stays green because it's frozen in the pre-deployment world. The real-world performance drifts in ways the suite never captures.

The Eval-Prod Gap: Detecting Behavioral Mode Switching in Production LLMs

· 9 min read
Tian Pan
Software Engineer

Your eval suite is green. Your benchmark scores are strong. Your staging environment looks clean. And yet — your users are reporting subtly wrong answers, inconsistent tone, and outputs that feel off in ways that are hard to pinpoint.

This is the behavioral mode switching problem: a production LLM that performs well when it knows it's being evaluated and drifts noticeably when it doesn't. It's not a hypothetical. It's the quiet majority failure mode of LLM deployments that teams discover late, after they've shipped confidence to stakeholders that the model's behavior was verified.

The problem isn't that your eval harness is lazy. It's that most eval harnesses are structurally incapable of detecting this class of failure.

The Feedback Provenance Gap: Why Your Training Signal Might Not Be What You Collected

· 8 min read
Tian Pan
Software Engineer

Most teams have excellent instrumentation on the feedback capture side. Thumbs-down clicks are logged. Star ratings flow into dashboards. Human annotation jobs write every preference pair to a table. The intake is clean, timestamped, and queryable.

What happens between that capture and the next model update is, for most teams, a black box.

The data gets filtered. Some annotations get weighted higher than others. Rare categories get upsampled. Near-duplicates get dropped. A prompt template change makes last month's labels inconsistent with this month's, but the merge happens anyway. By the time the signal reaches a reward model or fine-tuning job, it has passed through six transformation steps with no audit trail, no version pinning, and no way to trace a degraded model weight back to a specific corruption point in the pipeline.

This is the feedback provenance gap: teams know where feedback enters the system, but not what it becomes before it shapes model behavior.

LLM-as-Classifier in Production: Why Accuracy Is the Wrong Metric

· 11 min read
Tian Pan
Software Engineer

A team ships an LLM-based intent classifier. Evaluation accuracy: 94%. Two weeks into production, support volume is up 30% — not because the model is failing to classify, but because it's routing edge cases to the wrong queue with very high confidence. Nobody built a circuit breaker for "the model is wrong and doesn't know it." The 94% figure never surfaced that risk.

This failure pattern repeats across content moderation pipelines, routing systems, and entity extractors. The LLM gets a high score on the holdout set. The team ships. Something breaks quietly in production.

The issue isn't that accuracy is a bad metric. It's that accuracy answers the wrong question. Production classification has a different set of requirements, and most evaluation pipelines don't test for them.

The Org Chart Problem: Why AI Features Die Between Teams

· 10 min read
Tian Pan
Software Engineer

The model works. The pipeline runs. The demo looks great. And then the feature dies somewhere between the data team's Slack channel and the product engineer's JIRA board.

This is the pattern behind most AI project failures—not a technical failure, but an organizational one. A 2025 survey found that 42% of companies abandoned most of their AI initiatives that year, up from 17% the year prior. The average sunk cost per abandoned initiative was $7.2 million. When the post-mortems get written, the causes listed are "poor data readiness," "unclear ownership," and "lack of governance"—which are three different ways of saying the same thing: nobody was actually responsible for shipping the feature.

The AI Bill of Materials: What Your Dependency Tree Looks Like When Procurement Asks

· 11 min read
Tian Pan
Software Engineer

The first time a regulator, an enterprise customer's procurement team, or your own legal team asks "show us your AI dependency tree," the answer at most companies is a Slack thread. Someone in the platform channel pings the model team. The model team pings the prompt owners. The prompt owners cc the data lead. Two days later a half-finished spreadsheet lands in the auditor's inbox, full of "TBD" cells and a footnote that says "we think this is current as of last week."

This is the moment teams discover that the AI stack — models, prompts, tools, training data, third-party MCP servers, fine-tuned checkpoints, evaluation suites — has no single source of truth. Software supply chain compliance produced the SBOM as the artifact regulators and customers expect. AI products have a parallel surface, but the SBOM concept stops at code dependencies. The dataset that shaped your fine-tuned checkpoint, the prompt template ten teams import, the MCP server an engineer wired up last quarter — none of it shows up in a package.json.

The Preprocessing Bottleneck That Kills AI Pipeline Throughput

· 10 min read
Tian Pan
Software Engineer

A team builds a RAG-backed feature, measures end-to-end latency, finds it unacceptably slow, and immediately starts optimizing the model call. They try a smaller model, batch requests, tune temperature and token limits. After two sprints of work, latency drops by 15%. The feature is still too slow. What they never measured: the 600ms they're spending chunking text and generating embeddings before the LLM ever receives a prompt.

This pattern is common enough that it has a name in distributed systems: optimizing the wrong component. In AI pipelines, the LLM call is visible and easy to measure. Everything before it is invisible until you explicitly instrument it — and that's exactly where throughput dies.

Eval Set Rot: Why Your Score Trends Up While Users Trend Down

· 10 min read
Tian Pan
Software Engineer

The eval score has been trending up for two quarters. The dashboard is green, the regression suite has not flagged a real failure since March, and the team has gotten faster at shipping prompt changes because the eval gives crisp pass/fail answers. Meanwhile, user-reported quality is sliding. NPS is down four points, the support queue is full of failure modes nobody has labels for, and the head of product has started asking why the evals look great if customers are angry.

The eval set is not lying. It is answering the question it was built to answer, six months ago, against the traffic distribution that existed in launch week. The product has shifted. The user base has shifted. The long-tail use cases the team did not anticipate at launch now make up a third of traffic. The eval set is still measuring the world that existed in week one, and the team is averaging today's model against yesterday's product.

This is eval set rot. It is one of the quietest failure modes in modern AI engineering, and it gets worse as the eval set gets bigger, because the people maintaining it confuse "more cases" with "better coverage."

The Hidden Edges Between Your AI Features: When One Prompt Edit Regresses Three Other Teams

· 9 min read
Tian Pan
Software Engineer

A platform engineer changes the opening sentence of the company's "house style" preamble — a single line that anchors voice across customer-facing assistants. The change ships behind a flag. By Tuesday, the search team's relevance regression has spiked, the support bot's eval pass-rate has dropped four points, and the onboarding agent's retry rate has doubled. None of those teams touched their own code. None of them got a heads-up. The platform engineer has no idea any of this happened, because nobody was on the receiving end of an alert that said "your edit just broke three downstream features."

This is the failure mode that defines the second year of an AI org's life. The first year, every team builds its own thing in a corner. The second year, those corners start sharing artifacts — a prompt fragment here, a seeded eval set there, a tool schema reused as a contract — and the moment that sharing becomes implicit, the dependency graph between AI features becomes invisible. You now have a distributed system whose edges no one can name.

The discipline that fixes this is not a new platform. It's drawing the graph.

Why Your Bias Eval Passes in CI and Fails in Deployment

· 10 min read
Tian Pan
Software Engineer

The fairness audit was a green checkmark in the release pipeline. The compliance team signed it off in March. The support tickets started landing in October — a cohort of users in a country the model had never been graded on, getting answers a fraction as useful as everyone else. Nothing about the model had changed. The audit had never been wrong about the model. It had been wrong about the world.

This is the failure mode that no one wants to name out loud: a static bias eval is a snapshot of fairness in a stream that has already drifted. The eval was not lying when it ran. It was telling you a true thing about a distribution that no longer existed. By the time the support team has enough tickets to file a pattern, the model has been unfair to that cohort for two quarters and the audit is a year stale.