
12 posts tagged with "data-quality"


The Feedback Provenance Gap: Why Your Training Signal Might Not Be What You Collected

Tian Pan · Software Engineer · 8 min read

Most teams have excellent instrumentation on the feedback capture side. Thumbs-down clicks are logged. Star ratings flow into dashboards. Human annotation jobs write every preference pair to a table. The intake is clean, timestamped, and queryable.

What happens between that capture and the next model update is, for most teams, a black box.

The data gets filtered. Some annotations get weighted higher than others. Rare categories get upsampled. Near-duplicates get dropped. A prompt template change makes last month's labels inconsistent with this month's, but the merge happens anyway. By the time the signal reaches a reward model or fine-tuning job, it has passed through six transformation steps with no audit trail, no version pinning, and no way to trace a degraded model weight back to a specific corruption point in the pipeline.

This is the feedback provenance gap: teams know where feedback enters the system, but not what it becomes before it shapes model behavior.
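To make the gap concrete, here is a minimal sketch of one remedy: every transformation step writes a provenance record with its config, row counts, and a content hash of its output, so a degraded training set can be traced back step by step. The step names and record fields are illustrative, not a standard.

```python
import hashlib
import json
from datetime import datetime, timezone

def content_hash(rows):
    """Stable hash of a list of dict rows, so any change is detectable."""
    blob = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def run_step(name, fn, rows, config, audit_log):
    """Apply one transformation and append a provenance record."""
    out = fn(rows, **config)
    audit_log.append({
        "step": name,
        "config": config,
        "rows_in": len(rows),
        "rows_out": len(out),
        "output_hash": content_hash(out),
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return out

# Illustrative steps: drop near-duplicates, then filter low-confidence labels.
def dedupe(rows, key):
    seen, out = set(), []
    for r in rows:
        if r[key] not in seen:
            seen.add(r[key])
            out.append(r)
    return out

def filter_confidence(rows, min_conf):
    return [r for r in rows if r["confidence"] >= min_conf]

feedback = [
    {"text": "great answer", "confidence": 0.9},
    {"text": "great answer", "confidence": 0.9},
    {"text": "meh", "confidence": 0.4},
]
audit = []
rows = run_step("dedupe", dedupe, feedback, {"key": "text"}, audit)
rows = run_step("filter_confidence", filter_confidence, rows, {"min_conf": 0.5}, audit)
print(json.dumps(audit, indent=2))  # the trail a post-mortem can replay
```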

The Eval-Set Poison Pill: When Your Benchmark Becomes a Backdoor

Tian Pan · Software Engineer · 10 min read

A team I know spent six months chasing a regression that wasn't there. Every release passed the eval. Every release shipped. Every quarter, NPS on the AI-served cohort drifted down a point. Eventually, an intern doing a routine audit of the gold dataset noticed that one labeler — long since rotated off the contract — had graded 11% of the items, and that those items were systematically more lenient on a specific failure mode the team had been racing to fix. The eval said the model was getting better. The model was not getting better. The eval had been quietly tilted by one human's calibration drift, and nobody had been watching the labelers because nobody believed the labelers were a threat surface.

This is the eval-set poison pill. Most teams treat their eval set as a trusted artifact: the labels were graded by humans, the data came from production, and the regression dashboard is the one thing the org agrees to defer to when shipping. But the labeling pipeline is a human supply chain, and human supply chains are gameable. Treating an eval as ground truth without applying supply-chain hygiene to its inputs is trusting a number whose provenance you cannot defend.
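What would supply-chain hygiene look like in code? A minimal sketch, assuming a simple pass/fail rubric and a record schema invented for illustration: compare each labeler's pass rate against the pool and flag outliers for re-review.

```python
from collections import defaultdict

# Each record: (item_id, labeler_id, passed) — schema assumed for illustration.
grades = [
    ("q1", "a", True), ("q1", "b", True),
    ("q2", "a", True), ("q2", "b", False),
    ("q3", "a", True), ("q3", "c", False),
    ("q4", "b", False), ("q4", "c", False),
]

by_labeler = defaultdict(list)
overall = []
for item, labeler, passed in grades:
    by_labeler[labeler].append(passed)
    overall.append(passed)

pool_rate = sum(overall) / len(overall)
for labeler, votes in sorted(by_labeler.items()):
    rate = sum(votes) / len(votes)
    # A labeler far more lenient than the pool is a calibration-drift flag,
    # not proof of error — the flag buys you a human re-review.
    flag = "REVIEW" if rate - pool_rate > 0.25 else "ok"
    print(f"{labeler}: pass_rate={rate:.2f} pool={pool_rate:.2f} [{flag}]")
```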

Your Gold Labels Learned From Your Model: Eval-Set Contamination via Production Leakage

Tian Pan · Software Engineer · 10 min read

Your eval suite passed. Quality dashboards are green. A week later, users are quietly churning and nobody can explain why. The eval set did not lie by being wrong — it lied by being a mirror. The labels you graded against were, traceably, produced or filtered by the very model family you were trying to evaluate. Passing that eval is not evidence of quality. It is evidence that your model agrees with its own past outputs.

This is the quiet failure mode of mature LLM pipelines: eval-set contamination via production leakage. Not the famous benchmark contamination where a model trained on GSM8K also gets graded on GSM8K — that story is well told. The subtler one is downstream. Your gold labels come from user feedback, from human annotators who saw the model's draft first, from RLHF reward traces, from LLM-as-judge preference data. Each of those pipelines carries a fingerprint of the current model's idiom back into your "ground truth." Over a few quarters, the test set quietly memorizes your model's biases, and the eval becomes a self-congratulation loop.
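One hedged defense, sketched under the assumption that you store lineage metadata with every gold label: refuse to score a model on items whose labels ever touched its own family. The lineage field and family names below are illustrative.

```python
# Assumed eval-item schema: each gold label carries a lineage list naming
# every model family that produced, drafted, or filtered it.
eval_set = [
    {"id": 1, "label": "good", "lineage": ["human_raw"]},
    {"id": 2, "label": "bad",  "lineage": ["model_v3_draft", "human_edit"]},
    {"id": 3, "label": "good", "lineage": ["model_v4_judge"]},
]

def uncontaminated(items, model_family):
    """Keep only items whose labels never touched the family under test."""
    return [
        it for it in items
        if not any(model_family in step for step in it["lineage"])
    ]

clean = uncontaminated(eval_set, "model_v4")
print([it["id"] for it in clean])  # -> [1, 2]: item 3 is a mirror, not a measure
```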

The Data Flywheel Trap: Why Your Feedback Loop May Be Spinning in Place

Tian Pan · Software Engineer · 11 min read

Every product leader has heard the pitch: more users generate more data, better data trains better models, better models attract more users. The data flywheel is the moat that compounds. It's why AI incumbents win.

The pitch is not wrong. But the implementation almost always is. In practice, most data flywheels have multiple leakage points — places where the feedback loop appears to be spinning but is actually amplifying bias, reinforcing stale patterns, or optimizing a proxy that diverges from the real objective. The engineers building these systems rarely know which type of leakage they have, because all of them look identical from the outside: engagement goes up, the model keeps improving on the metrics you can measure, and the system slowly becomes less useful in ways that are hard to attribute.

This is the data flywheel trap. Understanding its failure modes is the prerequisite to building one that actually works.
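One way to make leakage visible, sketched with simulated scores and an assumed bucketing scheme: hold out a small control cohort whose interactions never feed the loop, and compare its trajectory against the flywheel cohort.

```python
import hashlib
import random

random.seed(7)

def assign_cohort(user_id: str, control_pct: int = 5) -> str:
    """Deterministic hash bucketing: ~5% of users never feed the training loop."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "control" if bucket < control_pct else "flywheel"

# Simulated per-user quality scores; in production these come from your evals.
users = [f"u{i}" for i in range(1000)]
scores = {
    u: random.gauss(0.70 if assign_cohort(u) == "control" else 0.74, 0.05)
    for u in users
}

def mean(xs):
    return sum(xs) / len(xs)

control = [scores[u] for u in users if assign_cohort(u) == "control"]
flywheel = [scores[u] for u in users if assign_cohort(u) == "flywheel"]
# If the flywheel cohort's metrics climb while the held-out control cohort is
# flat or falling, the loop may be optimizing its own exhaust, not quality.
print(f"control={mean(control):.3f}  flywheel={mean(flywheel):.3f}")
```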

The Data Quality Ceiling That Prompt Engineering Can't Break Through

Tian Pan · Software Engineer · 11 min read

A telecommunications company spent months tuning prompts on their customer service chatbot. They iterated on system instructions, few-shot examples, chain-of-thought formatting. The hallucination rate stayed stubbornly above 50%. Then they audited their knowledge base and found it was filled with retired service plans, outdated billing information, and duplicate policy documents that contradicted each other. After fixing the data — not the prompts — hallucinations dropped to near zero. The fix that prompt engineering couldn't deliver took three weeks of data cleanup.

This is the data quality ceiling: a hard performance wall that blocks every LLM system fed on noisy, stale, or inconsistent data, and that no amount of prompt iteration can breach. It's one of the most common failure modes in production AI, and one of the most systematically underdiagnosed. Teams that hit this wall keep turning the prompt knobs when the problem is upstream.
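A minimal sketch of the audit that finds such problems, with assumed document fields and thresholds: flag anything past a freshness cutoff and any exact-duplicate content. Real contradiction detection takes more than this, but it catches the cheap offenders first.

```python
import hashlib
from datetime import datetime, timedelta, timezone

docs = [
    {"id": "plan-2019", "text": "Unlimited plan: $40/mo", "updated": "2019-06-01"},
    {"id": "plan-2024", "text": "Unlimited plan: $55/mo", "updated": "2024-03-10"},
    {"id": "plan-copy", "text": "Unlimited plan: $40/mo", "updated": "2021-01-15"},
]

MAX_AGE = timedelta(days=365)
now = datetime.now(timezone.utc)

seen_hashes = {}
for d in docs:
    age = now - datetime.fromisoformat(d["updated"]).replace(tzinfo=timezone.utc)
    if age > MAX_AGE:
        print(f"STALE      {d['id']} (last updated {d['updated']})")
    # Exact-duplicate detection by content hash; near-dup detection would
    # normalize or shingle the text first.
    h = hashlib.sha256(d["text"].encode()).hexdigest()
    if h in seen_hashes:
        print(f"DUPLICATE  {d['id']} matches {seen_hashes[h]}")
    else:
        seen_hashes[h] = d["id"]
```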

Upstream Data Quality Is Your AI Agent's Real Bottleneck

Tian Pan · Software Engineer · 9 min read

A team spent three months tuning prompts for their knowledge agent. They tried GPT-4, then Claude, then a fine-tuned model. They rewrote the system prompt six times. They hired a prompt engineer. The agent kept hallucinating — confidently, fluently, and wrong. The actual problem turned out to be a Confluence export from 2023 sitting in the vector store alongside a Slack archive full of contradictory, casual half-opinions about the same topics. The model was doing exactly what it was supposed to do: synthesizing the information it was given. The information was garbage.

Over 60% of AI project failures in production trace to data quality, context problems, or governance failures — not model limitations. Yet when agents misbehave, the first instinct is almost always to touch the prompt. The second instinct is to switch models. The third might be to add a reranker. The upstream database that feeds the whole pipeline rarely makes the troubleshooting list until months of work have been wasted.
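A hedged sketch of the cheaper first check, assuming per-source sync metadata that may not match your stack: gate what enters the index by age and authority before anyone tunes a prompt.

```python
from datetime import datetime, timezone

# Assumed ingestion metadata for each source feeding the vector store.
sources = [
    {"name": "confluence_export", "last_synced": "2023-02-01", "authoritative": True},
    {"name": "slack_archive",     "last_synced": "2024-05-01", "authoritative": False},
    {"name": "product_docs",      "last_synced": "2024-06-15", "authoritative": True},
]

def admit(source, max_age_days=180):
    """Gate ingestion: only fresh, authoritative sources reach the index."""
    synced = datetime.fromisoformat(source["last_synced"]).replace(tzinfo=timezone.utc)
    age_days = (datetime.now(timezone.utc) - synced).days
    return source["authoritative"] and age_days <= max_age_days

for s in sources:
    verdict = "index" if admit(s) else "quarantine"
    print(f"{s['name']}: {verdict}")
```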

The Data Quality Tax in LLM Systems: Why Bad Input Hits Differently

Tian Pan · Software Engineer · 9 min read

Your gradient boosting model degrades politely when data gets noisy. Accuracy drops, precision drops, a monitoring alert fires, and the on-call engineer knows exactly where to look. LLMs don't do that. Feed an LLM degraded, stale, or malformed input and it produces fluent, confident, authoritative-sounding output that is partially or entirely wrong — and the downstream system consuming it has no way to tell the difference.

This is the data quality tax: the compounding cost you pay when bad data enters an LLM pipeline, expressed not as lower confidence scores but as hallucinations dressed in the syntax of facts.
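Because the model won't lower its own confidence, the tax has to be made visible downstream. A minimal sketch, with a schema and invariants invented for illustration: validate structured output against types and domain rules before anything consumes it.

```python
import json

def validate_extraction(raw: str) -> dict:
    """Refuse fluent-but-wrong output instead of passing it downstream."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    # Schema check: required fields with the right types.
    if not isinstance(data.get("invoice_id"), str):
        raise ValueError("invoice_id missing or not a string")
    if not isinstance(data.get("amount"), (int, float)):
        raise ValueError("amount missing or not numeric")
    # Domain invariant: amounts the business never issues are a red flag.
    if not (0 < data["amount"] < 1_000_000):
        raise ValueError(f"amount {data['amount']} outside plausible range")
    return data

# An LLM answer that *looks* authoritative but fails the gate:
try:
    validate_extraction('{"invoice_id": "INV-42", "amount": -350}')
except ValueError as e:
    print(f"rejected: {e}")  # the alert a GBM would have given you for free
```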

Annotator Bias in Eval Ground Truth: When Your Labels Are Systematically Steering You Wrong

Tian Pan · Software Engineer · 10 min read

A team spent six months training a sentiment classifier. Accuracy on the holdout set looked solid. They shipped it. Three months later, an audit revealed the model consistently rated product complaints from non-English-native speakers as more negative than identical complaints from native speakers — even when the text said the same thing. The root cause wasn't the model architecture. It wasn't the training procedure. It was the annotation team: twelve native English speakers in one timezone, none of whom noticed that certain phrasings carried different emotional weight in translated text.

The model had learned the annotators' blind spots, not the actual signal.

This is annotator bias in practice. It doesn't announce itself. It shows up as an eval score you trust, a benchmark rank that looks reasonable, a deployed system that behaves strangely on subgroups you didn't test carefully enough. Ground truth corruption is upstream of everything else in your ML pipeline — and it's the problem most teams discover too late.
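A minimal sketch of the audit that surfaces this, assuming content-matched items tagged with a subgroup attribute: slice the scores by group and flag large gaps. The attribute, scores, and threshold are all illustrative.

```python
from collections import defaultdict

# Assumed eval records: (subgroup, negativity score) for complaint texts
# judged equivalent in content by a separate review.
records = [
    ("native",     0.31), ("native",     0.28), ("native",     0.35),
    ("non_native", 0.62), ("non_native", 0.55), ("non_native", 0.60),
]

by_group = defaultdict(list)
for group, negativity in records:
    by_group[group].append(negativity)

means = {g: sum(v) / len(v) for g, v in by_group.items()}
gap = max(means.values()) - min(means.values())
# A large gap on content-matched items points at the labels (or the model
# they trained) treating subgroups differently, not at a real signal.
print(means, f"gap={gap:.2f}", "FLAG" if gap > 0.15 else "ok")
```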

LLM-as-Annotator Quality Control: When the Labeler and Student Share Training Data

Tian Pan · Software Engineer · 10 min read

The pipeline looks sensible on paper: you have a target task, no human-labeled examples, and a capable large model available. So you use that model to generate labels, then fine-tune a smaller model on those labels. Ship it, repeat.

The problem nobody talks about enough is what happens when your annotator model and your target model have trained on the same internet. Which, increasingly, they all have.
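A hedged sketch of one control, with stand-in functions in place of real model APIs: label with two model families, auto-accept agreement, route disagreement to humans, and treat suspiciously high agreement as a signal in its own right.

```python
# The annotate functions below are toy stand-ins for calls to two different
# model families; everything here is illustrative.
def annotate_family_a(text):
    return "positive" if "love" in text else "negative"

def annotate_family_b(text):
    return "positive" if ("love" in text or "great" in text) else "negative"

items = ["love this", "great value", "total waste", "love the great design"]

auto_labeled, needs_human = [], []
for text in items:
    a, b = annotate_family_a(text), annotate_family_b(text)
    (auto_labeled if a == b else needs_human).append((text, a, b))

agreement = len(auto_labeled) / len(items)
print(f"agreement={agreement:.2f}, human queue={len(needs_human)}")
# Counterintuitively, near-perfect agreement deserves scrutiny too: two
# annotators raised on the same internet can share the same blind spots.
```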

Data Quality Gates for Agentic Write Paths: Garbage In, Irreversible Actions Out

Tian Pan · Software Engineer · 11 min read

In 2025, an AI coding assistant executed unauthorized destructive commands against a production database during a code freeze — deleting 2.5 years of customer data, creating 4,000 fake users, and then fabricating successful test results to cover up what had happened. The root cause wasn't a bad model. It was a missing gate between agent intent and system execution.

That incident is dramatic, but it's not anomalous. Tool calling fails 3–15% of the time in production. Agents retry ambiguous operations. They read stale records and act on outdated state. They produce inputs that violate schema constraints in subtle ways. In a query-answering system, these failures produce a wrong answer the user notices and corrects. In an agent with write access, they produce a duplicate order, an incorrect notification, a corrupted record — damage that persists and propagates before anyone realizes something went wrong.

The difference between query agents and write agents isn't just one of severity. It's a difference in how failures manifest, how quickly they're detected, and how costly they are to reverse. Treating both with the same operational posture is the primary reason production write-path agents fail.
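A minimal sketch of such a gate, with policies invented for illustration: every write intent passes freeze, idempotency, and blast-radius checks before it can touch the database.

```python
from dataclasses import dataclass

FREEZE_ACTIVE = True
DESTRUCTIVE = {"delete", "truncate", "drop"}

@dataclass
class WriteIntent:
    action: str          # e.g. "insert", "update", "delete"
    table: str
    idempotency_key: str
    row_count_estimate: int

executed_keys: set = set()

def gate(intent: WriteIntent) -> None:
    """Every rule that fails raises; the agent never touches the DB directly."""
    if FREEZE_ACTIVE and intent.action in DESTRUCTIVE:
        raise PermissionError("destructive write during code freeze")
    if intent.idempotency_key in executed_keys:
        raise ValueError("duplicate write: idempotency key already used")
    if intent.row_count_estimate > 1000:
        raise ValueError("bulk write exceeds auto-approval limit; needs a human")
    executed_keys.add(intent.idempotency_key)

try:
    gate(WriteIntent("delete", "users", "k-123", row_count_estimate=4000))
except (PermissionError, ValueError) as e:
    print(f"blocked: {e}")
```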

The Annotation Pipeline Is Production Infrastructure

Tian Pan · Software Engineer · 11 min read

Most teams treat their annotation pipeline the same way they treat their CI script from 2019: it works, mostly, and nobody wants to touch it. A shared spreadsheet with color-coded rows. A Google Form routing tasks to a Slack channel. Three contractors working asynchronously, comparing notes in a thread.

Then a model ships with degraded quality, an eval regresses in a confusing direction, and the post-mortem eventually surfaces the obvious: the labels were wrong, and no one built anything to detect it.

Annotation is not a data problem. It is a software engineering problem. The teams that treat it that way — with queues, schemas, monitoring, and structured disagreement handling — build AI products that improve over time. The teams that don't are in a cycle of re-labeling they can't quite explain.
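A minimal sketch of what that looks like in code, with a schema invented for illustration: a typed task record, quorum-based resolution, and escalation as an explicit state rather than a Slack thread.

```python
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class AnnotationTask:
    item_id: str
    payload: str
    judgments: dict = field(default_factory=dict)  # labeler_id -> label
    status: str = "pending"                        # pending | resolved | escalated
    final_label: str | None = None

def record(task, labeler_id, label, quorum=3):
    """Collect judgments until quorum, then resolve or escalate explicitly."""
    task.judgments[labeler_id] = label
    if len(task.judgments) < quorum:
        return
    (top, votes), = Counter(task.judgments.values()).most_common(1)
    if votes >= 2:            # a majority resolves the item
        task.status, task.final_label = "resolved", top
    else:                     # structured disagreement gets a defined owner
        task.status = "escalated"

task = AnnotationTask("t-1", "Is this reply on-topic?")
for labeler, label in [("a", "yes"), ("b", "no"), ("c", "yes")]:
    record(task, labeler, label)
print(task.status, task.final_label)  # -> resolved yes
```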

Stale Retrieval: The Data Quality Problem Your RAG Pipeline Is Hiding

Tian Pan · Software Engineer · 10 min read

Your RAG system is lying to you about the past. When a user asks about current pricing, active security policies, or a feature that shipped last quarter, the retrieval pipeline returns the most semantically similar document in the index — not the most recent one. An 18-month-old pricing page and this morning's update look identical to cosine similarity. Nothing in the standard RAG stack has any concept of whether the retrieved document is still true.

This is stale retrieval, and it fails differently than hallucination. The model isn't inventing anything. It accurately summarizes real content that once existed. Standard evaluation metrics — faithfulness, groundedness, context precision — all pass. The system is confidently correct about a fact that stopped being correct months ago.
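One hedged mitigation, with a half-life and weighting that are assumptions to tune per corpus: blend cosine similarity with an exponential freshness decay so a slightly-less-similar but current document can outrank a stale near-duplicate.

```python
from datetime import datetime, timezone

def recency_score(updated_iso, half_life_days=180):
    """Exponential decay: a doc loses half its freshness every half-life."""
    updated = datetime.fromisoformat(updated_iso).replace(tzinfo=timezone.utc)
    age_days = (datetime.now(timezone.utc) - updated).days
    return 0.5 ** (age_days / half_life_days)

def rank(candidates, alpha=0.7):
    """Final score = alpha * semantic similarity + (1 - alpha) * freshness."""
    return sorted(
        candidates,
        key=lambda d: alpha * d["cosine"] + (1 - alpha) * recency_score(d["updated"]),
        reverse=True,
    )

docs = [
    {"id": "pricing-old", "cosine": 0.91, "updated": "2023-01-10"},
    {"id": "pricing-new", "cosine": 0.89, "updated": "2025-06-01"},
]
for d in rank(docs):
    print(d["id"])  # the current page should win despite lower similarity
```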