
2 posts tagged with "data-quality"


The Annotation Pipeline Is Production Infrastructure

· 11 min read
Tian Pan
Software Engineer

Most teams treat their annotation pipeline the same way they treat their CI script from 2019: it works, mostly, and nobody wants to touch it. A shared spreadsheet with color-coded rows. A Google Form routing tasks to a Slack channel. Three contractors working asynchronously, comparing notes in a thread.

Then a model ships with degraded quality, an eval regresses in a confusing direction, and the post-mortem eventually surfaces the obvious: the labels were wrong, and no one built anything to detect it.

Annotation is not a data problem. It is a software engineering problem. The teams that treat it that way — with queues, schemas, monitoring, and structured disagreement handling — build AI products that improve over time. The teams that don't get stuck in a re-labeling cycle they can't quite explain.
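To make "queues, schemas, and structured disagreement handling" concrete, here is a minimal sketch in Python. Every name in it (`LabelTask`, `ALLOWED_LABELS`, the 0.75 review threshold) is an illustrative assumption, not something from the post — the point is that label validation and disagreement detection are ordinary software, not spreadsheet hygiene.

```python
# Hypothetical sketch: a validated annotation task with a disagreement
# check, instead of color-coded spreadsheet rows.
from collections import Counter
from dataclasses import dataclass, field

ALLOWED_LABELS = {"positive", "negative", "neutral"}  # assumed taxonomy

@dataclass
class LabelTask:
    task_id: str
    payload: str
    labels: dict[str, str] = field(default_factory=dict)  # annotator -> label

    def submit(self, annotator: str, label: str) -> None:
        # Schema enforcement at write time: labels outside the taxonomy
        # never reach the training set.
        if label not in ALLOWED_LABELS:
            raise ValueError(f"unknown label {label!r} from {annotator}")
        self.labels[annotator] = label

    def agreement(self) -> float:
        # Fraction of annotators who agree with the majority label.
        if not self.labels:
            return 0.0
        counts = Counter(self.labels.values())
        return counts.most_common(1)[0][1] / len(self.labels)

task = LabelTask("t-1", "Great product, would buy again")
task.submit("ann_a", "positive")
task.submit("ann_b", "positive")
task.submit("ann_c", "negative")

# Structured disagreement handling: low-agreement tasks are routed to
# review instead of silently averaged away.
needs_review = task.agreement() < 0.75
```

A real pipeline would put `LabelTask` behind a queue and persist the agreement numbers as monitoring metrics, so "the labels were wrong" shows up on a dashboard before the post-mortem.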

Stale Retrieval: The Data Quality Problem Your RAG Pipeline Is Hiding

· 10 min read
Tian Pan
Software Engineer

Your RAG system is lying to you about the past. When a user asks about current pricing, active security policies, or a feature that shipped last quarter, the retrieval pipeline returns the most semantically similar document in the index — not the most recent one. An 18-month-old pricing page and this morning's update look identical to cosine similarity. Nothing in the standard RAG stack has any concept of whether the retrieved document is still true.
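One common mitigation is to blend the retriever's similarity score with a time-decay term, so an old-but-similar document can lose to a fresher one. The sketch below is a hedged illustration: the blend weight `alpha`, the 90-day half-life, and the document fields are all assumptions, not part of any standard RAG stack.

```python
# Hypothetical recency-aware reranking: final score mixes semantic
# similarity with an exponential freshness decay.
from datetime import datetime, timezone

HALF_LIFE_DAYS = 90.0  # assumed freshness half-life

def recency_score(updated_at: datetime, now: datetime) -> float:
    # 1.0 for a document updated right now, 0.5 after one half-life.
    age_days = (now - updated_at).total_seconds() / 86400
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def rerank(hits: list[dict], now: datetime, alpha: float = 0.7) -> list[dict]:
    # alpha trades semantic similarity against freshness.
    for h in hits:
        h["final"] = alpha * h["sim"] + (1 - alpha) * recency_score(
            h["updated_at"], now
        )
    return sorted(hits, key=lambda h: h["final"], reverse=True)

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
hits = [
    # The 18-month-old pricing page scores slightly higher on similarity...
    {"id": "pricing-old", "sim": 0.93,
     "updated_at": datetime(2023, 12, 1, tzinfo=timezone.utc)},
    # ...but this morning's update wins once freshness is scored.
    {"id": "pricing-new", "sim": 0.91,
     "updated_at": datetime(2025, 5, 28, tzinfo=timezone.utc)},
]
ranked = rerank(hits, now)
```

This doesn't solve staleness — it only biases toward recency — but it gives the pipeline a concept cosine similarity lacks: that two near-identical documents can differ in whether they are still true.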

This is stale retrieval, and it fails differently from hallucination. The model isn't inventing anything. It accurately summarizes real content that once existed. Standard evaluation metrics — faithfulness, groundedness, context precision — all pass. The system is confidently correct about a fact that stopped being correct months ago.