28 posts tagged with "machine-learning"

The Precision-Recall Tradeoff Hiding Inside Your AI Safety Filter

· 10 min read
Tian Pan
Software Engineer

When teams deploy an AI safety filter, the conversation almost always centers on what it catches. Did it block the jailbreak? Does it flag hate speech? Can it detect prompt injection? These are the right questions for recall. They are almost never paired with the equally important question: what does it block that it shouldn't?

The answer is usually: a lot. And because most teams ship with the vendor's default threshold and never instrument false positives in production, they don't find out until users start complaining—or until they stop complaining, because they stopped using the product.
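
To make the tradeoff concrete, here is a minimal sketch of how precision, recall, and false positive rate shift as a filter's blocking threshold moves. The function name `precision_recall_at`, the scores, and the labels are all hypothetical illustrations, not taken from any particular vendor's filter.

```python
def precision_recall_at(scores, labels, threshold):
    """Return (precision, recall, false_positive_rate) when every input
    with score >= threshold is blocked. labels[i] is True if input i is
    actually harmful."""
    tp = fp = fn = tn = 0
    for score, is_harmful in zip(scores, labels):
        blocked = score >= threshold
        if blocked and is_harmful:
            tp += 1
        elif blocked and not is_harmful:
            fp += 1  # benign input blocked: the cost users feel
        elif not blocked and is_harmful:
            fn += 1  # harmful input allowed through
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return precision, recall, fpr


# Hypothetical filter scores and ground-truth labels (True = harmful).
SCORES = [0.95, 0.80, 0.65, 0.55, 0.40, 0.35, 0.20, 0.10]
LABELS = [True, True, False, True, False, False, False, False]

if __name__ == "__main__":
    for threshold in (0.3, 0.5, 0.7):
        p, r, fpr = precision_recall_at(SCORES, LABELS, threshold)
        print(f"threshold={threshold:.1f} precision={p:.2f} "
              f"recall={r:.2f} false_positive_rate={fpr:.2f}")
```

On this toy data, lowering the threshold from 0.7 to 0.3 raises recall from about 0.67 to 1.0 but pushes the false positive rate from 0 to 0.6. Running the same sweep on labeled production traffic is one way to see what a default threshold actually costs.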

The Long-Tail Coverage Problem: Why Your AI System Fails Where It Matters Most

· 10 min read
Tian Pan
Software Engineer

A medical AI deployed to a hospital achieves 97% accuracy in testing. It passes every internal review, gets shipped, and then quietly fails to detect parasitic infections when parasite density drops below 1% of cells — the exact scenario where early intervention matters most. Nobody notices until a physician flags an unusual miss rate on a specific patient population.

This is the long-tail coverage problem. Your aggregate metrics look fine. Your system is broken for the inputs that matter.

Why '92% Accurate' Is Almost Always a Lie

· 8 min read
Tian Pan
Software Engineer

You launch an AI feature. The model gets 92% accuracy on your holdout set. You present this to the VP of Product, the legal team, and the head of customer success. Everyone nods. The feature ships.

Three months later, a customer segment you didn't specifically test is experiencing a 40% error rate. Legal is asking questions. Customer success is fielding escalations. The VP of Product wants to know why no one flagged this.

The 92% figure was technically correct. It was also nearly useless as a decision-making input — because headline accuracy collapses exactly the information that matters most.
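
A short sketch shows how a headline number can hide a failing segment. The segment names and counts below are hypothetical, chosen only so that the aggregate lands at 92% while one small segment sits at a 40% error rate; `accuracy_by_segment` is an illustrative helper, not an established API.

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """records: iterable of (segment, correct) pairs.
    Returns (overall_accuracy, {segment: accuracy})."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for segment, correct in records:
        totals[segment] += 1
        hits[segment] += int(correct)
    overall = sum(hits.values()) / sum(totals.values())
    per_segment = {seg: hits[seg] / totals[seg] for seg in totals}
    return overall, per_segment


# Hypothetical holdout set: one large segment, one small one.
RECORDS = (
    [("enterprise", True)] * 860 + [("enterprise", False)] * 40
    + [("smb", True)] * 60 + [("smb", False)] * 40
)

if __name__ == "__main__":
    overall, per_segment = accuracy_by_segment(RECORDS)
    print(f"overall accuracy: {overall:.2%}")
    for seg, acc in sorted(per_segment.items()):
        print(f"  {seg}: accuracy={acc:.2%} error_rate={1 - acc:.2%}")
```

Reporting the per-segment breakdown next to the aggregate, even for segments that seem minor, is the simplest guard against this failure.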

The Data Flywheel Is Not Free: Engineering Feedback Loops That Actually Improve Your AI Product

· 11 min read
Tian Pan
Software Engineer

There is a pattern that plays out in nearly every AI product team: the team ships an initial model, users start interacting with it, and someone adds a thumbs-up/thumbs-down widget at the bottom of responses. They call it their feedback loop. Three months later, the model has not improved. The team wonders why the flywheel isn't spinning.

The problem isn't execution. It's that explicit ratings are not a feedback loop — they're a survey. Less than 1% of production interactions yield explicit user feedback. The 99% who never clicked anything are sending you far richer signals; you're just not collecting them. Building a real feedback loop means instrumenting your system to capture behavioral traces, label them efficiently at scale, and route them back into training and evaluation in a way that compounds over time.

Annotation Workforce Engineering: Your Labelers Are Production Infrastructure

· 10 min read
Tian Pan
Software Engineer

Your model is underperforming, so you dig into the training data. Halfway through the audit you find two annotators labeling the same edge case in opposite ways — and both are following the spec, because the spec is ambiguous. You fix the spec, re-label the affected examples, retrain, and recover a few F1 points. Two months later the same thing happens with a different annotator on a different edge case.

This is not a labeling vendor problem. It is not a data quality tool problem. It is an infrastructure problem that you haven't yet treated like one.

Most engineering teams approach annotation the way they approach a conference room booking system: procure the tool, write a spec, hire some contractors, ship the data. That model worked when you needed a one-time labeled dataset. It collapses the moment annotation becomes a continuous activity feeding a live production model — which it is for almost every team that has graduated from prototype to production.
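
One way to treat annotation like infrastructure is to monitor agreement between labelers continuously rather than finding disagreements during an audit. The sketch below computes Cohen's kappa for two annotators using only the standard library. The labels are hypothetical, and the choice of kappa as the monitored metric is an assumption made for illustration, not something the post prescribes.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.
    1.0 is perfect agreement, 0.0 is chance-level agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    classes = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in classes) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same eight examples.
ANNOTATOR_A = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
ANNOTATOR_B = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]

if __name__ == "__main__":
    print(f"kappa = {cohens_kappa(ANNOTATOR_A, ANNOTATOR_B):.2f}")
```

Here the annotators agree on 6 of 8 items, but half of that agreement is expected by chance, so kappa is 0.5. Tracking this per label category over time can surface an ambiguous spec before it reaches the training set.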

Annotator Bias in Eval Ground Truth: When Your Labels Are Systematically Steering You Wrong

· 10 min read
Tian Pan
Software Engineer

A team spent six months training a sentiment classifier. Accuracy on the holdout set looked solid. They shipped it. Three months later, an audit revealed the model consistently rated product complaints from non-English-native speakers as more negative than identical complaints from native speakers — even when the text said the same thing. The root cause wasn't the model architecture. It wasn't the training procedure. It was the annotation team: twelve native English speakers in one timezone, none of whom noticed that certain phrasings carried different emotional weight in translated text.

The model had learned the annotators' blind spots, not the actual signal.

This is annotator bias in practice. It doesn't announce itself. It shows up as an eval score you trust, a benchmark rank that looks reasonable, a deployed system that behaves strangely on subgroups you didn't test carefully enough. Ground truth corruption is upstream of everything else in your ML pipeline — and it's the problem most teams discover too late.

The Cold Start Trap in AI Products

· 12 min read
Tian Pan
Software Engineer

There's a specific kind of failure that kills AI features before they ever get a chance to prove themselves. It doesn't look like a technical failure — the model architecture is sound, the eval scores are decent, and the feature ships. But adoption is flat, users bounce, and six months later the team quietly deprioritizes the feature. The diagnosis, delivered in a retrospective: "not enough data."

This is the cold start trap. AI features improve with engagement data, but users won't engage until the feature is good enough to be useful. The circular dependency is not a solvable math problem — it's a product design challenge disguised as an engineering problem. And most teams walk into it with the same wrong plan: collect data first, ship ML second.

Why Your Document Extractor Breaks on the Contracts That Matter Most

· 13 min read
Tian Pan
Software Engineer

Your invoice parser probably works fine. Feed it a clean, digital PDF from a Fortune 500 vendor — structured rows, consistent column widths, machine-generated text — and it will extract line items with near-perfect accuracy. Then someone uploads a multi-page contract from a regional supplier, a scanned form with handwritten amendments, or a financial statement where the table header lives on page 3 and the rows continue through page 6. The extractor fails silently, returns partial data, or confidently produces structured output that is wrong in ways no downstream validation catches.

This is the central problem with enterprise document intelligence: the documents that break your system are not the edge cases. They are the ones with the highest business value.

LLM-as-Annotator Quality Control: When the Labeler and Student Share Training Data

· 10 min read
Tian Pan
Software Engineer

The pipeline looks sensible on paper: you have a target task, no human-labeled examples, and a capable large model available. So you use that model to generate labels, then fine-tune a smaller model on those labels. Ship it, repeat.

The problem nobody talks about enough is what happens when your annotator model and your target model were trained on the same internet. Increasingly, they all were.

The Pretraining Shadow: The Hidden Constraint Your Fine-Tuning Plan Ignores

· 9 min read
Tian Pan
Software Engineer

Your team spent three sprints labeling 50,000 domain-specific examples. You ran LoRA fine-tuning on a frontier model. The eval numbers improved. Then a colleague changed the phrasing of a prompt slightly, and the model reverted to the behavior you thought you'd suppressed. That's not a dataset problem. That's the pretraining shadow.

The core insight that practitioners keep rediscovering: fine-tuning teaches a model how to talk in a new context, but it cannot rewrite what the model fundamentally knows or is inclined to do. The behaviors, biases, and factual priors encoded during pretraining are a gravitational field that fine-tuning orbits but rarely escapes.

SFT, RLHF, and DPO: The Alignment Method Decision Matrix for Narrow Domain Applications

· 11 min read
Tian Pan
Software Engineer

Most teams that decide to fine-tune a model spend weeks debating which method to use before they've written a single line of training code. The debate rarely surfaces the right question. The real question is not "SFT or DPO?" — it's "what kind of gap am I trying to close?"

Supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO) are not competing answers to the same problem. Each targets a different failure mode. Reaching for RLHF when SFT would have sufficed wastes months. Reaching for SFT when the problem is actually a preference mismatch produces a model that's fluent but wrong in ways that are hard to detect until they surface in production.

This post is a decision framework. It maps each method to the specific problem it solves, explains what signals indicate which method will dominate, and provides a diagnostic methodology for identifying where your actual gap lives before you commit to a training run.

The Metrics Translation Problem: Why Technically Successful AI Projects Lose Funding

· 10 min read
Tian Pan
Software Engineer

Your model achieved 91% accuracy on the held-out test set. Latency is under 200ms at p95. You've cut the error rate by 40% compared to the previous rule-based system. By every technical measure, the project is a success. Six months later, leadership cancels it.

This is not a hypothetical. Eighty percent of AI projects fail to deliver intended business value, and the majority of those failures are not caused by model performance. They are caused by the gap between what engineers measure and what decision-makers understand. The technical team speaks a language that executives cannot evaluate — and in the absence of comprehensible signal, leadership defaults to skepticism.

The metrics translation problem is not a communication soft skill. It is an engineering discipline that most teams treat as optional until the funding review.