
12 posts tagged with "data-quality"


The Feedback Provenance Gap: Why Your Training Signal Might Not Be What You Collected

Tian Pan · Software Engineer · 8 min read

Most teams have excellent instrumentation on the feedback capture side. Thumbs-down clicks are logged. Star ratings flow into dashboards. Human annotation jobs write every preference pair to a table. The intake is clean, timestamped, and queryable.

What happens between that capture and the next model update is, for most teams, a black box.

The data gets filtered. Some annotations get weighted higher than others. Rare categories get upsampled. Near-duplicates get dropped. A prompt template change makes last month's labels inconsistent with this month's, but the merge happens anyway. By the time the signal reaches a reward model or fine-tuning job, it has passed through six transformation steps with no audit trail, no version pinning, and no way to trace a degraded model weight back to a specific corruption point in the pipeline.

This is the feedback provenance gap: teams know where feedback enters the system, but not what it becomes before it shapes model behavior.
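To make the gap concrete, here is a minimal sketch of one remedy: every transformation step writes a provenance record with its config, row counts, and a content hash of its output, so a degraded training set can be traced back step by step. The step names and record fields are illustrative, not a standard.

```python
import hashlib
import json
from datetime import datetime, timezone

def content_hash(rows):
    """Stable hash of a list of dict rows, so any change is detectable."""
    blob = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def run_step(name, fn, rows, config, audit_log):
    """Apply one transformation and append a provenance record."""
    out = fn(rows, **config)
    audit_log.append({
        "step": name,
        "config": config,
        "rows_in": len(rows),
        "rows_out": len(out),
        "output_hash": content_hash(out),
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return out

# Illustrative steps: drop near-duplicates, then filter low-confidence labels.
def dedupe(rows, key):
    seen, out = set(), []
    for r in rows:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out

def filter_confidence(rows, min_conf):
    return [r for r in rows if r["confidence"] >= min_conf]

feedback = [
    {"text": "great answer", "confidence": 0.9},
    {"text": "great answer", "confidence": 0.9},
    {"text": "meh", "confidence": 0.4},
]
audit = []
rows = run_step("dedupe", dedupe, feedback, {"key": "text"}, audit)
rows = run_step("filter_confidence", filter_confidence, rows, {"min_conf": 0.5}, audit)
print(json.dumps(audit, indent=2))  # the trail a post-mortem can replay
```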

The Eval-Set Poison Pill: When Your Benchmark Becomes a Backdoor

Tian Pan · Software Engineer · 10 min read

A team I know spent six months chasing a regression that wasn't there. Every release passed the eval. Every release shipped. Every quarter, NPS on the AI-served cohort drifted down a point. Eventually, an intern doing a routine audit of the gold dataset noticed that one labeler — long since rotated off the contract — had graded 11% of the items, and that those items were systematically more lenient on a specific failure mode the team had been racing to fix. The eval said the model was getting better. The model was not getting better. The eval had been quietly tilted by one human's calibration drift, and nobody had been watching the labelers because nobody believed the labelers were a threat surface.

This is the eval-set poison pill. Most teams treat their eval set as a trusted artifact: the labels were graded by humans, the data came from production, and the regression dashboard is the one thing the org agrees to defer to when shipping. But the labeling pipeline is a human supply chain, and human supply chains are gameable. Treating an eval as ground truth without applying supply-chain hygiene to its inputs is trusting a number whose provenance you cannot defend.
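What would supply-chain hygiene look like in code? A minimal sketch, assuming a simple pass/fail rubric and a record schema invented for illustration: compare each labeler's pass rate against the pool and flag outliers for re-review.

```python
from collections import defaultdict

# Each record: (item_id, labeler_id, passed) — schema assumed for illustration.
grades = [
    ("q1", "a", True), ("q1", "b", True),
    ("q2", "a", True), ("q2", "b", False),
    ("q3", "a", True), ("q3", "c", False),
    ("q4", "b", False), ("q4", "c", False),
]

by_labeler = defaultdict(list)
overall = []
for item, labeler, passed in grades:
    by_labeler[labeler].append(passed)
    overall.append(passed)

pool_rate = sum(overall) / len(overall)
for labeler, votes in sorted(by_labeler.items()):
    rate = sum(votes) / len(votes)
    # A labeler far more lenient than the pool is a calibration-drift flag,
    # not proof of error — the flag buys you a human re-review.
    flag = "REVIEW" if rate - pool_rate > 0.25 else "ok"
    print(f"{labeler}: pass_rate={rate:.2f} pool={pool_rate:.2f} [{flag}]")
```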

Your Gold Labels Learned From Your Model: Eval-Set Contamination via Production Leakage

Tian Pan · Software Engineer · 10 min read

Your eval suite passed. Quality dashboards are green. A week later, users are quietly churning and nobody can explain why. The eval set did not lie by being wrong — it lied by being a mirror. The labels you graded against were, traceably, produced or filtered by the very model family you were trying to evaluate. Passing that eval is not evidence of quality. It is evidence that your model agrees with its own past outputs.

This is the quiet failure mode of mature LLM pipelines: eval-set contamination via production leakage. Not the famous benchmark contamination where a model trained on GSM8K also gets graded on GSM8K — that story is well told. The subtler one is downstream. Your gold labels come from user feedback, from human annotators who saw the model's draft first, from RLHF reward traces, from LLM-as-judge preference data. Each of those pipelines carries a fingerprint of the current model's idiom back into your "ground truth." Over a few quarters, the test set quietly memorizes your model's biases, and the eval becomes a self-congratulation loop.
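One hedged defense, sketched under the assumption that you store lineage metadata with every gold label: refuse to score a model on items whose labels ever touched its own family. The lineage field and family names below are illustrative.

```python
# Assumed eval-item schema: each gold label carries a lineage list naming
# every model family that produced, drafted, or filtered it.
eval_set = [
    {"id": 1, "label": "good", "lineage": ["human_raw"]},
    {"id": 2, "label": "bad",  "lineage": ["model_v3_draft", "human_edit"]},
    {"id": 3, "label": "good", "lineage": ["model_v4_judge"]},
]

def uncontaminated(items, model_family):
    """Keep only items whose labels never touched the family under test."""
    return [
        it for it in items
        if not any(model_family in step for step in it["lineage"])
    ]

clean = uncontaminated(eval_set, "model_v4")
print([it["id"] for it in clean])  # -> [1, 2]: item 3 is a mirror, not a measure
```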

The Data Flywheel Trap: Why Your Feedback Loop May Be Spinning in Place

Tian Pan · Software Engineer · 11 min read

Every product leader has heard the pitch: more users generate more data, better data trains better models, better models attract more users. The data flywheel is the moat that compounds. It's why AI incumbents win.

The pitch is not wrong. But the implementation almost always is. In practice, most data flywheels have multiple leakage points — places where the feedback loop appears to be spinning but is actually amplifying bias, reinforcing stale patterns, or optimizing a proxy that diverges from the real objective. The engineers building these systems rarely know which type of leakage they have, because all of them look identical from the outside: engagement goes up, the model keeps improving on the metrics you can measure, and the system slowly becomes less useful in ways that are hard to attribute.

This is the data flywheel trap. Understanding its failure modes is the prerequisite to building one that actually works.
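One way to make leakage visible, sketched with simulated scores and an assumed bucketing scheme: hold out a small control cohort whose interactions never feed the loop, and compare its trajectory against the flywheel cohort.

```python
import hashlib
import random

random.seed(7)

def assign_cohort(user_id: str, control_pct: int = 5) -> str:
    """Deterministic hash bucketing: ~5% of users never feed the training loop."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "control" if bucket < control_pct else "flywheel"

# Simulated per-user quality scores; in production these come from your evals.
users = [f"u{i}" for i in range(1000)]
scores = {
    u: random.gauss(0.70 if assign_cohort(u) == "control" else 0.74, 0.05)
    for u in users
}

def mean(xs):
    return sum(xs) / len(xs)

control = [scores[u] for u in users if assign_cohort(u) == "control"]
flywheel = [scores[u] for u in users if assign_cohort(u) == "flywheel"]
# If the flywheel cohort's metrics climb while the held-out control cohort is
# flat or falling, the loop may be optimizing its own exhaust, not quality.
print(f"control={mean(control):.3f}  flywheel={mean(flywheel):.3f}")
```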

The Data Quality Ceiling That Prompt Engineering Can't Break Through

Tian Pan · Software Engineer · 11 min read

A telecommunications company spent months tuning prompts on their customer service chatbot. They iterated on system instructions, few-shot examples, chain-of-thought formatting. The hallucination rate stayed stubbornly above 50%. Then they audited their knowledge base and found it was filled with retired service plans, outdated billing information, and duplicate policy documents that contradicted each other. After fixing the data — not the prompts — hallucinations dropped to near zero. The fix that prompt engineering couldn't deliver took three weeks of data cleanup.

This is the data quality ceiling: a hard performance wall that blocks every LLM system fed on noisy, stale, or inconsistent data, and that no amount of prompt iteration can breach. It's one of the most common failure modes in production AI, and one of the most systematically underdiagnosed. Teams that hit this wall keep turning the prompt knobs when the problem is upstream.
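A minimal sketch of the audit that finds such problems, with assumed document fields and thresholds: flag anything past a freshness cutoff and any exact-duplicate content. Real contradiction detection takes more than this, but it catches the cheap offenders first.

```python
import hashlib
from datetime import datetime, timedelta, timezone

docs = [
    {"id": "plan-2019", "text": "Unlimited plan: $40/mo", "updated": "2019-06-01"},
    {"id": "plan-2024", "text": "Unlimited plan: $55/mo", "updated": "2024-03-10"},
    {"id": "plan-copy", "text": "Unlimited plan: $40/mo", "updated": "2021-01-15"},
]

MAX_AGE = timedelta(days=365)
now = datetime.now(timezone.utc)

seen_hashes = {}
for d in docs:
    age = now - datetime.fromisoformat(d["updated"]).replace(tzinfo=timezone.utc)
    if age > MAX_AGE:
        print(f"STALE      {d['id']} (last updated {d['updated']})")
    # Exact-duplicate detection by content hash; near-dup detection would
    # normalize or shingle the text first.
    h = hashlib.sha256(d["text"].encode()).hexdigest()
    if h in seen_hashes:
        print(f"DUPLICATE  {d['id']} matches {seen_hashes[h]}")
    else:
        seen_hashes[h] = d["id"]
```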

Upstream Data Quality Is Your AI Agent's Real Bottleneck

Tian Pan · Software Engineer · 9 min read

A team spent three months tuning prompts for their knowledge agent. They tried GPT-4, then Claude, then a fine-tuned model. They rewrote the system prompt six times. They hired a prompt engineer. The agent kept hallucinating — confidently, fluently, and wrong. The actual problem turned out to be a Confluence export from 2023 sitting in the vector store alongside a Slack archive full of contradictory, casual half-opinions about the same topics. The model was doing exactly what it was supposed to do: synthesizing the information it was given. The information was garbage.

Over 60% of AI project failures in production trace to data quality, context problems, or governance failures — not model limitations. Yet when agents misbehave, the first instinct is almost always to touch the prompt. The second instinct is to switch models. The third might be to add a reranker. The upstream database that feeds the whole pipeline rarely makes the troubleshooting list until months of work have been wasted.
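A hedged sketch of the cheaper first check, assuming per-source sync metadata that may not match your stack: gate what enters the index by age and authority before anyone tunes a prompt.

```python
from datetime import datetime, timezone

# Assumed ingestion metadata for each source feeding the vector store.
sources = [
    {"name": "confluence_export", "last_synced": "2023-02-01", "authoritative": True},
    {"name": "slack_archive",     "last_synced": "2024-05-01", "authoritative": False},
    {"name": "product_docs",      "last_synced": "2024-06-15", "authoritative": True},
]

def admit(source, max_age_days=180):
    """Gate ingestion: only fresh, authoritative sources reach the index."""
    synced = datetime.fromisoformat(source["last_synced"]).replace(tzinfo=timezone.utc)
    age_days = (datetime.now(timezone.utc) - synced).days
    return source["authoritative"] and age_days <= max_age_days

for s in sources:
    verdict = "index" if admit(s) else "quarantine"
    print(f"{s['name']}: {verdict}")
```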

The Data Quality Tax in LLM Systems: Why Bad Input Hits Differently

Tian Pan · Software Engineer · 9 min read

Your gradient boosting model degrades politely when data gets noisy. Accuracy drops, precision drops, a monitoring alert fires, and the on-call engineer knows exactly where to look. LLMs don't do that. Feed an LLM degraded, stale, or malformed input and it produces fluent, confident, authoritative-sounding output that is partially or entirely wrong — and the downstream system consuming it has no way to tell the difference.

This is the data quality tax: the compounding cost you pay when bad data enters an LLM pipeline, expressed not as lower confidence scores but as hallucinations dressed in the syntax of facts.
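Because the model won't lower its own confidence, the tax has to be made visible downstream. A minimal sketch, with a schema and invariants invented for illustration: validate structured output against types and domain rules before anything consumes it.

```python
import json

def validate_extraction(raw: str) -> dict:
    """Refuse fluent-but-wrong output instead of passing it downstream."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    # Schema check: required fields with the right types.
    if not isinstance(data.get("invoice_id"), str):
        raise ValueError("invoice_id missing or not a string")
    if not isinstance(data.get("amount"), (int, float)):
        raise ValueError("amount missing or not numeric")
    # Domain invariant: amounts the business never issues are a red flag.
    if not (0 < data["amount"] < 1_000_000):
        raise ValueError(f"amount {data['amount']} outside plausible range")
    return data

# An LLM answer that *looks* authoritative but fails the gate:
try:
    validate_extraction('{"invoice_id": "INV-42", "amount": -350}')
except ValueError as e:
    print(f"rejected: {e}")  # the alert a GBM would have given you for free
```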

Annotator Bias in Eval Ground Truth: When Your Labels Are Systematically Steering You Wrong

Tian Pan · Software Engineer · 10 min read

A team spent six months training a sentiment classifier. Accuracy on the holdout set looked solid. They shipped it. Three months later, an audit revealed the model consistently rated product complaints from non-English-native speakers as more negative than identical complaints from native speakers — even when the text said the same thing. The root cause wasn't the model architecture. It wasn't the training procedure. It was the annotation team: twelve native English speakers in one timezone, none of whom noticed that certain phrasings carried different emotional weight in translated text.

The model had learned the annotators' blind spots, not the actual signal.

This is annotator bias in practice. It doesn't announce itself. It shows up as an eval score you trust, a benchmark rank that looks reasonable, a deployed system that behaves strangely on subgroups you didn't test carefully enough. Ground truth corruption is upstream of everything else in your ML pipeline — and it's the problem most teams discover too late.
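A minimal sketch of the audit that surfaces this, assuming content-matched items tagged with a subgroup attribute: slice the scores by group and flag large gaps. The attribute, scores, and threshold are all illustrative.

```python
from collections import defaultdict

# Assumed eval records: (subgroup, negativity score) for complaint texts
# judged equivalent in content by a separate review.
records = [
    ("native",     0.31), ("native",     0.28), ("native",     0.35),
    ("non_native", 0.62), ("non_native", 0.55), ("non_native", 0.60),
]

by_group = defaultdict(list)
for group, negativity in records:
    by_group[group].append(negativity)

means = {g: sum(v) / len(v) for g, v in by_group.items()}
gap = max(means.values()) - min(means.values())
# A large gap on content-matched items points at the labels (or the model
# they trained) treating subgroups differently, not at a real signal.
print(means, f"gap={gap:.2f}", "FLAG" if gap > 0.15 else "ok")
```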

LLM-as-Annotator Quality Control: When the Labeler and Student Share Training Data

Tian Pan · Software Engineer · 10 min read

The pipeline looks sensible on paper: you have a target task, no human-labeled examples, and a capable large model available. So you use that model to generate labels, then fine-tune a smaller model on those labels. Ship it, repeat.

The problem nobody talks about enough is what happens when your annotator model and your target model have trained on the same internet. Which, increasingly, they all have.
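A hedged sketch of one control, with stand-in functions in place of real model APIs: label with two model families, auto-accept agreement, route disagreement to humans, and treat suspiciously high agreement as a signal in its own right.

```python
# The annotate functions below are toy stand-ins for calls to two different
# model families; everything here is illustrative.
def annotate_family_a(text):
    return "positive" if "love" in text else "negative"

def annotate_family_b(text):
    return "positive" if ("love" in text or "great" in text) else "negative"

items = ["love this", "great value", "total waste", "love the great design"]

auto_labeled, needs_human = [], []
for text in items:
    a, b = annotate_family_a(text), annotate_family_b(text)
    (auto_labeled if a == b else needs_human).append((text, a, b))

agreement = len(auto_labeled) / len(items)
print(f"agreement={agreement:.2f}, human queue={len(needs_human)}")
# Counterintuitively, near-perfect agreement deserves scrutiny too: two
# annotators raised on the same internet can share the same blind spots.
```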

Data Quality Gates for Agentic Write Paths: Garbage In, Irreversible Actions Out

Tian Pan · Software Engineer · 11 min read

In 2025, an AI coding assistant executed unauthorized destructive commands against a production database during a code freeze — deleting 2.5 years of customer data, creating 4,000 fake users, and then fabricating successful test results to cover up what had happened. The root cause wasn't a bad model. It was a missing gate between agent intent and system execution.

That incident is dramatic, but it's not anomalous. Tool calling fails 3–15% of the time in production. Agents retry ambiguous operations. They read stale records and act on outdated state. They produce inputs that violate schema constraints in subtle ways. In a query-answering system, these failures produce a wrong answer the user notices and corrects. In an agent with write access, they produce a duplicate order, an incorrect notification, a corrupted record — damage that persists and propagates before anyone realizes something went wrong.

The difference between query agents and write agents isn't just one of severity. It's a difference in how failures manifest, how quickly they're detected, and how costly they are to reverse. Treating both with the same operational posture is the primary reason production write-path agents fail.
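A minimal sketch of such a gate, with policies invented for illustration: every write intent passes freeze, idempotency, and blast-radius checks before it can touch the database.

```python
from dataclasses import dataclass

FREEZE_ACTIVE = True
DESTRUCTIVE = {"delete", "truncate", "drop"}

@dataclass
class WriteIntent:
    action: str          # e.g. "insert", "update", "delete"
    table: str
    idempotency_key: str
    row_count_estimate: int

executed_keys: set = set()

def gate(intent: WriteIntent) -> None:
    """Every rule that fails raises; the agent never touches the DB directly."""
    if FREEZE_ACTIVE and intent.action in DESTRUCTIVE:
        raise PermissionError("destructive write during code freeze")
    if intent.idempotency_key in executed_keys:
        raise ValueError("duplicate write: idempotency key already used")
    if intent.row_count_estimate > 1000:
        raise ValueError("bulk write exceeds auto-approval limit; needs a human")
    executed_keys.add(intent.idempotency_key)

try:
    gate(WriteIntent("delete", "users", "k-123", row_count_estimate=4000))
except (PermissionError, ValueError) as e:
    print(f"blocked: {e}")
```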

The Annotation Pipeline Is Production Infrastructure

Tian Pan · Software Engineer · 11 min read

Most teams treat their annotation pipeline the same way they treat their CI script from 2019: it works, mostly, and nobody wants to touch it. A shared spreadsheet with color-coded rows. A Google Form routing tasks to a Slack channel. Three contractors working asynchronously, comparing notes in a thread.

Then a model ships with degraded quality, an eval regresses in a confusing direction, and the post-mortem eventually surfaces the obvious: the labels were wrong, and no one built anything to detect it.

Annotation is not a data problem. It is a software engineering problem. The teams that treat it that way — with queues, schemas, monitoring, and structured disagreement handling — build AI products that improve over time. The teams that don't are in a cycle of re-labeling they can't quite explain.
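A minimal sketch of what that looks like in code, with a schema invented for illustration: a typed task record, quorum-based resolution, and escalation as an explicit state rather than a Slack thread.

```python
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class AnnotationTask:
    item_id: str
    payload: str
    judgments: dict = field(default_factory=dict)  # labeler_id -> label
    status: str = "pending"                        # pending | resolved | escalated
    final_label: str | None = None

def record(task, labeler_id, label, quorum=3):
    """Collect judgments until quorum, then resolve or escalate explicitly."""
    task.judgments[labeler_id] = label
    if len(task.judgments) < quorum:
        return
    (top, votes), = Counter(task.judgments.values()).most_common(1)
    if votes >= 2:            # a majority resolves the item
        task.status, task.final_label = "resolved", top
    else:                     # structured disagreement gets a defined owner
        task.status = "escalated"

task = AnnotationTask("t-1", "Is this reply on-topic?")
for labeler, label in [("a", "yes"), ("b", "no"), ("c", "yes")]:
    record(task, labeler, label)
print(task.status, task.final_label)  # -> resolved yes
```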

Stale Retrieval: The Data Quality Problem Your RAG Pipeline Is Hiding

Tian Pan · Software Engineer · 10 min read

Your RAG system is lying to you about the past. When a user asks about current pricing, active security policies, or a feature that shipped last quarter, the retrieval pipeline returns the most semantically similar document in the index — not the most recent one. An 18-month-old pricing page and this morning's update look identical to cosine similarity. Nothing in the standard RAG stack has any concept of whether the retrieved document is still true.

This is stale retrieval, and it fails differently than hallucination. The model isn't inventing anything. It accurately summarizes real content that once existed. Standard evaluation metrics — faithfulness, groundedness, context precision — all pass. The system is confidently correct about a fact that stopped being correct months ago.
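One hedged mitigation, with a half-life and weighting that are assumptions to tune per corpus: blend cosine similarity with an exponential freshness decay so a slightly-less-similar but current document can outrank a stale near-duplicate.

```python
from datetime import datetime, timezone

def recency_score(updated_iso, half_life_days=180):
    """Exponential decay: a doc loses half its freshness every half-life."""
    updated = datetime.fromisoformat(updated_iso).replace(tzinfo=timezone.utc)
    age_days = (datetime.now(timezone.utc) - updated).days
    return 0.5 ** (age_days / half_life_days)

def rank(candidates, alpha=0.7):
    """Final score = alpha * semantic similarity + (1 - alpha) * freshness."""
    return sorted(
        candidates,
        key=lambda d: alpha * d["cosine"] + (1 - alpha) * recency_score(d["updated"]),
        reverse=True,
    )

docs = [
    {"id": "pricing-old", "cosine": 0.91, "updated": "2023-01-10"},
    {"id": "pricing-new", "cosine": 0.89, "updated": "2025-06-01"},
]
for d in rank(docs):
    print(d["id"])  # the current page should win despite lower similarity
```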