18 posts tagged with "machine-learning"

Bias Monitoring Infrastructure for Production AI: Beyond the Pre-Launch Audit

Tian Pan · Software Engineer · 10 min read

Your model passed its fairness review. The demographic parity was within acceptable bounds, equal opportunity metrics looked clean, and the audit report went into Confluence with a green checkmark. Three months later, a journalist has screenshots showing your system approves loans at half the rate for one demographic compared to another — and your pre-launch numbers were technically accurate the whole time.

This is the bias monitoring gap. Pre-launch fairness testing validates your model against datasets that existed when you ran the tests. Production AI systems don't operate in that static world. User behavior shifts, population distributions drift, feature correlations evolve, and disparities that weren't measurable at launch can become significant failure modes within weeks. The systems that catch these problems aren't part of most ML stacks today.
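
What closes the gap is continuous measurement on live traffic rather than a one-time audit. Here is a minimal sketch of what that can look like, assuming each decision is logged with a group attribute; the window size, gap threshold, and alert path are placeholders you would tune:

```python
from collections import deque

class DemographicParityMonitor:
    """Rolling-window monitor for the demographic parity gap (difference
    between the highest and lowest approval rate across groups) on live
    production decisions."""

    def __init__(self, window_size=10_000, max_gap=0.10):
        self.window = deque(maxlen=window_size)  # (group, approved) pairs
        self.max_gap = max_gap

    def record(self, group, approved):
        self.window.append((group, approved))
        self._check()

    def _rates(self):
        counts, approvals = {}, {}
        for group, approved in self.window:
            counts[group] = counts.get(group, 0) + 1
            approvals[group] = approvals.get(group, 0) + int(approved)
        return {g: approvals[g] / counts[g] for g in counts}

    def _check(self):
        rates = self._rates()
        if len(rates) < 2:
            return
        gap = max(rates.values()) - min(rates.values())
        if gap > self.max_gap:
            # placeholder alert path: page whoever owns the model
            print(f"ALERT: parity gap {gap:.2f} > {self.max_gap}: {rates}")

# Wire it into the decision path:
# monitor = DemographicParityMonitor()
# monitor.record(group=applicant.group, approved=decision)
```

A production version would add significance testing and minimum per-group sample sizes, but even this catches the drift a point-in-time audit structurally cannot.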

The Data Flywheel Trap: Why Your Feedback Loop May Be Spinning in Place

Tian Pan · Software Engineer · 11 min read

Every product leader has heard the pitch: more users generate more data, better data trains better models, better models attract more users. The data flywheel is the moat that compounds. It's why AI incumbents win.

The pitch is not wrong. But the implementation almost always is. In practice, most data flywheels have multiple leakage points — places where the feedback loop appears to be spinning but is actually amplifying bias, reinforcing stale patterns, or optimizing a proxy that diverges from the real objective. The engineers building these systems rarely know which type of leakage they have, because all of them look identical from the outside: engagement goes up, the model keeps improving on the metrics you can measure, and the system slowly becomes less useful in ways that are hard to attribute.

This is the data flywheel trap. Understanding its failure modes is the prerequisite to building one that actually works.
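
To make "spinning in place" concrete, here is a toy simulation (all numbers invented) of one leakage type: a ranker optimizes a clickiness proxy, exposure feedback amplifies it, and the measurable metric climbs while delivered usefulness never moves:

```python
import random

random.seed(0)

# Toy catalog: usefulness (what you care about) and clickiness (the
# proxy you can measure) are independent draws, so clickbait exists.
items = [{"useful": random.random(), "clicky": random.random()}
         for _ in range(1000)]

def run_round(items, k=100):
    shown = sorted(items, key=lambda it: it["clicky"], reverse=True)[:k]
    clicks = sum(it["clicky"] for it in shown)    # the metric that climbs
    utility = sum(it["useful"] for it in shown)   # the metric that matters
    for it in shown:
        it["clicky"] *= 1.05  # exposure feedback: shown items get clickier
    return clicks, utility

for rnd in range(5):
    clicks, utility = run_round(items)
    print(f"round {rnd}: clicks={clicks:6.1f}  utility={utility:5.1f}")
```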

The Precision-Recall Tradeoff Hiding Inside Your AI Safety Filter

Tian Pan · Software Engineer · 10 min read

When teams deploy an AI safety filter, the conversation almost always centers on what it catches. Did it block the jailbreak? Does it flag hate speech? Can it detect prompt injection? These are the right questions for recall. They are almost never paired with the equally important question: what does it block that it shouldn't?

The answer is usually: a lot. And because most teams ship with the vendor's default threshold and never instrument false positives in production, they don't find out until users start complaining—or until they stop complaining, because they stopped using the product.
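
Before trusting a vendor default, it is worth sweeping the threshold on a labeled sample of your own traffic. A minimal sketch, where `scores` and `labels` stand in for your filter's risk scores and your ground truth:

```python
def sweep_thresholds(scores, labels, thresholds):
    """scores: the filter's risk score per request.
    labels: 1 if the request is truly unsafe, else 0."""
    for t in thresholds:
        blocked = [s >= t for s in scores]
        tp = sum(b and l for b, l in zip(blocked, labels))
        fp = sum(b and not l for b, l in zip(blocked, labels))
        fn = sum(not b and l for b, l in zip(blocked, labels))
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 1.0
        print(f"t={t:.2f}  precision={precision:.2f}  recall={recall:.2f}  "
              f"benign requests blocked: {fp}/{labels.count(0)}")

# Toy data; a real sweep needs a labeled sample of *your* traffic.
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.90]
labels = [0,    0,    0,    1,    0,    1,    1,    1]
sweep_thresholds(scores, labels, [0.30, 0.50, 0.70])
```

The last column is the one most dashboards omit: how many benign requests each threshold silently turns away.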

The Long-Tail Coverage Problem: Why Your AI System Fails Where It Matters Most

Tian Pan · Software Engineer · 10 min read

A medical AI deployed to a hospital achieves 97% accuracy in testing. It passes every internal review, gets shipped, and then quietly fails to detect parasitic infections when parasite density drops below 1% of cells — the exact scenario where early intervention matters most. Nobody notices until a physician flags an unusual miss rate on a specific patient population.

This is the long-tail coverage problem. Your aggregate metrics look fine. Your system is broken for the inputs that matter.
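
The fix starts with slice-level reporting instead of a single aggregate number. A sketch with invented counts shaped like the case above:

```python
from collections import defaultdict

def slice_report(results):
    """results: (slice_name, correct) pairs from an eval run."""
    per_slice = defaultdict(lambda: [0, 0])  # slice -> [right, total]
    for name, correct in results:
        per_slice[name][0] += int(correct)
        per_slice[name][1] += 1
    right = sum(r for r, _ in per_slice.values())
    total = sum(n for _, n in per_slice.values())
    print(f"aggregate accuracy: {right / total:.1%}")
    for name, (r, n) in sorted(per_slice.items()):
        print(f"  {name:<22}{r / n:7.1%}  (n={n})")

# The rare slice is invisible in the aggregate and broken on its own.
results = ([("routine sample", True)] * 960 + [("routine sample", False)] * 10
           + [("low parasite density", True)] * 6
           + [("low parasite density", False)] * 24)
slice_report(results)
```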

Why '92% Accurate' Is Almost Always a Lie

Tian Pan · Software Engineer · 8 min read

You launch an AI feature. The model gets 92% accuracy on your holdout set. You present this to the VP of Product, the legal team, and the head of customer success. Everyone nods. The feature ships.

Three months later, a customer segment you didn't specifically test is experiencing a 40% error rate. Legal is asking questions. Customer success is fielding escalations. The VP of Product wants to know why no one flagged this.

The 92% figure was technically correct. It was also nearly useless as a decision-making input — because headline accuracy collapses exactly the information that matters most.
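
The arithmetic is worth seeing once. With hypothetical numbers, a 92% headline coexists comfortably with a 40% error rate on a segment, as long as that segment is small:

```python
# share of traffic, accuracy on that share (all numbers hypothetical)
segments = {
    "mainstream traffic": (0.90, 0.957),
    "untested segment":   (0.10, 0.600),
}
headline = sum(share * acc for share, acc in segments.values())
print(f"headline accuracy: {headline:.0%}")  # -> 92%
for name, (share, acc) in segments.items():
    print(f"  {name}: error rate {1 - acc:.0%} on {share:.0%} of traffic")
```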

The Data Flywheel Is Not Free: Engineering Feedback Loops That Actually Improve Your AI Product

Tian Pan · Software Engineer · 11 min read

There is a pattern that plays out in nearly every AI product team: the team ships an initial model, users start interacting with it, and someone adds a thumbs-up/thumbs-down widget at the bottom of responses. They call it their feedback loop. Three months later, the model has not improved. The team wonders why the flywheel isn't spinning.

The problem isn't execution. It's that explicit ratings are not a feedback loop — they're a survey. Less than 1% of production interactions yield explicit user feedback. The other 99% who never click anything are sending you far richer signals; you're just not collecting them. Building a real feedback loop means instrumenting your system to capture behavioral traces, label them efficiently at scale, and route them back into training and evaluation in a way that compounds over time.
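
A sketch of what that instrumentation can look like; the event names are illustrative rather than a standard schema, and in production the records would go to an event stream rather than stdout:

```python
import json, time

def log_event(session_id, event, **fields):
    record = {"ts": time.time(), "session": session_id,
              "event": event, **fields}
    print(json.dumps(record))  # production: append to your event stream

# Behavioral traces the 99% produce without ever touching a widget:
log_event("s-123", "response_shown", response_id="r-1")
log_event("s-123", "copied_to_clipboard", response_id="r-1")  # strong positive
log_event("s-123", "regenerated", response_id="r-1")          # strong negative
log_event("s-123", "edited_then_submitted", response_id="r-1",
          edit_distance=42)  # partial acceptance; the edit itself is a label
```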

Annotation Workforce Engineering: Your Labelers Are Production Infrastructure

Tian Pan · Software Engineer · 10 min read

Your model is underperforming, so you dig into the training data. Halfway through the audit you find two annotators labeling the same edge case in opposite ways — and both are following the spec, because the spec is ambiguous. You fix the spec, re-label the affected examples, retrain, and recover a few F1 points. Two months later the same thing happens with a different annotator on a different edge case.

This is not a labeling vendor problem. It is not a data quality tool problem. It is an infrastructure problem that you haven't yet treated like one.

Most engineering teams approach annotation the way they approach a conference room booking system: procure the tool, write a spec, hire some contractors, ship the data. That model worked when you needed a one-time labeled dataset. It collapses the moment annotation becomes a continuous activity feeding a live production model — which it is for almost every team that has graduated from prototype to production.
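
One piece of that infrastructure is continuous agreement measurement, so ambiguous specs surface before they reach training data. A minimal sketch using Cohen's kappa on doubly-annotated items:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators on the same items, corrected
    for chance. Low kappa on an edge-case slice is a spec problem
    surfacing as a data problem."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["spam", "ok", "spam", "ok", "spam", "ok"]
b = ["spam", "ok", "ok",   "ok", "spam", "spam"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.33: the spec is ambiguous
```

Run per slice on an ongoing basis, not once at dataset delivery; the kappa that matters is the one on next month's edge cases.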

Annotator Bias in Eval Ground Truth: When Your Labels Are Systematically Steering You Wrong

Tian Pan · Software Engineer · 10 min read

A team spent six months training a sentiment classifier. Accuracy on the holdout set looked solid. They shipped it. Three months later, an audit revealed the model consistently rated product complaints from non-English-native speakers as more negative than identical complaints from native speakers — even when the text said the same thing. The root cause wasn't the model architecture. It wasn't the training procedure. It was the annotation team: twelve native English speakers in one timezone, none of whom noticed that certain phrasings carried different emotional weight in translated text.

The model had learned the annotators' blind spots, not the actual signal.

This is annotator bias in practice. It doesn't announce itself. It shows up as an eval score you trust, a benchmark rank that looks reasonable, a deployed system that behaves strangely on subgroups you didn't test carefully enough. Ground truth corruption is upstream of everything else in your ML pipeline — and it's the problem most teams discover too late.
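
One cheap probe: compare label distributions across subgroups on content matched for meaning. If matched content gets different labels, the bias lives in the annotation process, not the data. A sketch with invented numbers in the shape of the audit above:

```python
from collections import defaultdict

def negativity_by_subgroup(annotations):
    """annotations: (subgroup, negativity_label) pairs for content
    matched on meaning. A gap between the means implicates the
    annotation process, not the underlying text."""
    groups = defaultdict(list)
    for subgroup, score in annotations:
        groups[subgroup].append(score)
    for g, vals in sorted(groups.items()):
        print(f"{g}: mean negativity {sum(vals) / len(vals):.2f} "
              f"(n={len(vals)})")

annotations = ([("native phrasing", 0.55)] * 40
               + [("translated phrasing", 0.72)] * 40)
negativity_by_subgroup(annotations)
```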

The Cold Start Trap in AI Products

Tian Pan · Software Engineer · 12 min read

There's a specific kind of failure that kills AI features before they ever get a chance to prove themselves. It doesn't look like a technical failure — the model architecture is sound, the eval scores are decent, and the feature ships. But adoption is flat, users bounce, and six months later the team quietly deprioritizes the feature. The diagnosis, delivered in a retrospective: "not enough data."

This is the cold start trap. AI features improve with engagement data, but users won't engage until the feature is good enough to be useful. The circular dependency is not a solvable math problem — it's a product design challenge disguised as an engineering problem. And most teams walk into it with the same wrong plan: collect data first, ship ML second.

Why Your Document Extractor Breaks on the Contracts That Matter Most

Tian Pan · Software Engineer · 13 min read

Your invoice parser probably works fine. Feed it a clean, digital PDF from a Fortune 500 vendor — structured rows, consistent column widths, machine-generated text — and it will extract line items with near-perfect accuracy. Then someone uploads a multi-page contract from a regional supplier, a scanned form with handwritten amendments, or a financial statement where the table header lives on page 3 and the rows continue through page 6. The extractor fails silently, returns partial data, or confidently produces structured output that is wrong in ways no downstream validation catches.

This is the central problem with enterprise document intelligence: the documents that break your system are not the edge cases. They are the ones with the highest business value.
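
Silent failure is partly a choice: cheap consistency checks on the extracted structure catch much of the "confidently wrong" output. A sketch with hypothetical field names:

```python
def validate_extraction(doc):
    """Consistency checks on an extracted invoice. Field names are
    hypothetical; the pattern is cross-field arithmetic."""
    problems = []
    items = doc.get("line_items", [])
    if not items:
        problems.append("no line items extracted")
    total = doc.get("total")
    if total is None:
        problems.append("no total extracted")
    else:
        computed = sum(item.get("amount", 0.0) for item in items)
        if abs(computed - total) > 0.01:
            problems.append(f"line items sum to {computed:.2f}, "
                            f"stated total is {total:.2f}")
    return problems

# A partial extraction that *looks* structured but fails the check
# (e.g. the rows on pages 4-6 never made it out of the document):
print(validate_extraction({
    "total": 1200.00,
    "line_items": [{"desc": "widgets", "amount": 500.00}],
}))
```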

LLM-as-Annotator Quality Control: When the Labeler and Student Share Training Data

Tian Pan · Software Engineer · 10 min read

The pipeline looks sensible on paper: you have a target task, no human-labeled examples, and a capable large model available. So you use that model to generate labels, then fine-tune a smaller model on those labels. Ship it, repeat.

The problem nobody talks about enough is what happens when your annotator model and your target model trained on the same internet. Which, increasingly, they all have.
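
One way to see the risk: on a small human-labeled holdout, check whether the student's errors are independent of the annotator model's errors. If the two models erred independently, the overlap of their error sets would be roughly the product of their error rates; a much higher overlap suggests shared blind spots. A sketch:

```python
def error_overlap(annotator_preds, student_preds, human_labels):
    """Compare observed P(both wrong) against the independence
    baseline P(annotator wrong) * P(student wrong)."""
    n = len(human_labels)
    ann = [p != y for p, y in zip(annotator_preds, human_labels)]
    stu = [p != y for p, y in zip(student_preds, human_labels)]
    both = sum(a and s for a, s in zip(ann, stu)) / n
    indep = (sum(ann) / n) * (sum(stu) / n)
    print(f"P(both wrong) = {both:.2f}  vs  {indep:.2f} if independent")

# Toy case: the student reproduces the annotator's exact mistakes.
y   = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
ann = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
stu = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
error_overlap(ann, stu, y)  # 0.20 vs 0.04 if independent
```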

The Pretraining Shadow: The Hidden Constraint Your Fine-Tuning Plan Ignores

Tian Pan · Software Engineer · 9 min read

Your team spent three sprints labeling 50,000 domain-specific examples. You ran LoRA fine-tuning on a frontier model. The eval numbers improved. Then a colleague changed the phrasing of a prompt slightly, and the model reverted to the behavior you thought you'd suppressed. That's not a dataset problem. That's the pretraining shadow.

The core insight that practitioners keep rediscovering: fine-tuning teaches a model how to talk in a new context, but it cannot rewrite what the model fundamentally knows or is inclined to do. The behaviors, biases, and factual priors encoded during pretraining are a gravitational field that fine-tuning orbits but rarely escapes.
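
One way to measure the shadow's size is a paraphrase probe: re-run your eval under rephrased prompts and count how often the suppressed behavior comes back. A minimal sketch, with stand-in stubs for the inference call and the behavior detector:

```python
PARAPHRASES = [
    "Summarize the risks of this plan.",
    "What could go wrong with this plan? Be brief.",
    "Give me a short rundown of this plan's risks.",
]

def reversion_rate(model, shows_old_behavior, prompts=PARAPHRASES):
    """Fraction of paraphrases on which the pretrained behavior
    you fine-tuned against comes back."""
    return sum(shows_old_behavior(model(p)) for p in prompts) / len(prompts)

# Stand-in stubs; swap in your model call and your behavior detector.
def model(prompt):
    return "old-style answer" if "rundown" in prompt else "tuned answer"

def shows_old_behavior(output):
    return output.startswith("old")

print(f"reversion rate: {reversion_rate(model, shows_old_behavior):.0%}")
```

A fine-tune that only learned the surface form of its training prompts scores well on the original phrasing and reverts on paraphrases; the gap between the two is the shadow.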