21 posts tagged with "data-engineering"

Data Versioning for AI: The Dataset-Model Coupling Problem Teams Discover Too Late

· 9 min read
Tian Pan
Software Engineer

Your model's accuracy dropped 8% in production overnight. Nothing in the model code changed. No deployment happened. The eval suite is green. So you spend a week adjusting hyperparameters, tweaking prompts, comparing checkpoint losses — and eventually someone notices that a schema migration landed three days ago in the feature pipeline. A single field that switched from NULL to an empty string. That's it. That's the regression.

This is the most common failure mode in production ML systems, and it has almost nothing to do with model quality. It has everything to do with a structural gap most teams don't close until they've been burned: data versions and model versions are intimately coupled, but they're tracked by different tools and owned by different teams.
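
One way to make that coupling visible is to fingerprint the data and store the result next to the model version. The sketch below is a minimal illustration, not a description of any particular tool: `dataset_fingerprint`, `null_rates`, and the sample rows are hypothetical names and data. It relies on JSON encoding `None` as `null` and an empty string as `""`, so the NULL-to-empty-string change produces a different hash.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Hash rows so that None and "" yield different fingerprints."""
    h = hashlib.sha256()
    for row in rows:
        # json.dumps writes None as null and "" as "", keeping them distinct.
        h.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


def null_rates(rows, fields):
    """Per-field share of None values and of empty strings."""
    total = len(rows) or 1
    return {
        f: {
            "null": sum(r.get(f) is None for r in rows) / total,
            "empty": sum(r.get(f) == "" for r in rows) / total,
        }
        for f in fields
    }


# Hypothetical rows before and after the schema migration.
before = [{"id": 1, "city": None}, {"id": 2, "city": "Oslo"}]
after = [{"id": 1, "city": ""}, {"id": 2, "city": "Oslo"}]

# A record like this could be stored alongside each model version (assumed layout).
record = {
    "model_version": "hypothetical-v1",
    "data_fingerprint": dataset_fingerprint(before),
    "null_rates": null_rates(before, ["city"]),
}
```

Comparing `record["data_fingerprint"]` against the fingerprint of the current feature data would flag the migration even when the model code and eval suite are unchanged.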

Agentic Web Data Extraction at Scale: When Agents Replace Scrapers

· 10 min read
Tian Pan
Software Engineer

The demo takes 20 minutes to build. You paste a URL, an LLM reads the HTML, and structured data comes out the other end. It feels like the future of web extraction has arrived.

Then you run it at 1,000 pages per hour. Costs spiral, blocks accumulate, and extracted fields start drifting in ways that don't look like errors — they look like normal data until your downstream pipeline has silently ingested three weeks of garbage. The "LLM reads the page" pattern is not wrong; it's just priced for prototype throughput.

Agentic web extraction genuinely solves problems that traditional scrapers cannot. But scaling it past proof-of-concept requires understanding a different set of failure modes than most teams expect.

Annotation Workforce Engineering: Your Labelers Are Production Infrastructure

· 10 min read
Tian Pan
Software Engineer

Your model is underperforming, so you dig into the training data. Halfway through the audit you find two annotators labeling the same edge case in opposite ways — and both are following the spec, because the spec is ambiguous. You fix the spec, re-label the affected examples, retrain, and recover a few F1 points. Two months later the same thing happens with a different annotator on a different edge case.

This is not a labeling vendor problem. It is not a data quality tool problem. It is an infrastructure problem that you haven't yet treated like one.

Most engineering teams approach annotation the way they approach a conference room booking system: procure the tool, write a spec, hire some contractors, ship the data. That model worked when you needed a one-time labeled dataset. It collapses the moment annotation becomes a continuous activity feeding a live production model — which it is for almost every team that has graduated from prototype to production.

When LLMs Beat Rule-Based Systems for Data Normalization (And When They Don't)

· 11 min read
Tian Pan
Software Engineer

A team I know spent three months building a rule-based address normalizer. It handled the top twenty formats, used a USPS API for verification, and worked great on the data they'd seen. Then they got a new enterprise customer. The first week of data had addresses embedded in freeform notes fields, postal codes missing country prefixes, and cross-border formats their rules had never seen. The normalizer failed silently on 31% of records. They threw an LLM at it as a quick fix, expecting 80% accuracy. They got 94%. The surprise wasn't that the LLM worked — it was that nothing in their evaluation framework had predicted this.

This is the shape of the problem. Rule-based normalization is predictable, fast, and cheap. It works well when the data distribution stays in-bounds. LLMs handle the long tail — the weird formats, the implicit domain knowledge, the edge cases that rules never enumerate. But LLMs are also expensive, slow, and inconsistent in ways that break production pipelines if you're not careful. The right answer, for almost every team, is a hybrid that uses each approach on the inputs it's actually good at.
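
A hybrid of this kind can be sketched as a router that tries cheap rules first and only falls back to a model when the rules decline. Everything here is an assumption made for illustration: the postal-code rule, the function names, and `fake_llm`, which stands in for a real model call so the example runs offline.

```python
import re
from typing import Callable, Optional, Tuple

# Rule for the in-bounds case: a bare US ZIP or ZIP+4, optionally padded.
US_ZIP = re.compile(r"^\s*(\d{5})(?:-(\d{4}))?\s*$")


def normalize_zip_with_rules(raw: str) -> Optional[str]:
    """Return a normalized ZIP, or None when the input is out of bounds."""
    m = US_ZIP.match(raw)
    if m is None:
        return None
    return m.group(1) if m.group(2) is None else f"{m.group(1)}-{m.group(2)}"


def normalize_zip(raw: str, llm_fallback: Callable[[str], str]) -> Tuple[str, str]:
    """Try rules first; send only the long tail to the fallback."""
    ruled = normalize_zip_with_rules(raw)
    if ruled is not None:
        return ruled, "rules"
    return llm_fallback(raw), "llm"


def fake_llm(raw: str) -> str:
    """Stand-in for a model call (hypothetical); pulls the first 5-digit run."""
    m = re.search(r"\d{5}", raw)
    return m.group(0) if m else ""


clean = normalize_zip(" 94107 ", fake_llm)
messy = normalize_zip("ship to zip 94107, CA", fake_llm)
```

Returning the route (`"rules"` or `"llm"`) with each value makes it possible to track how much traffic reaches the expensive path and to audit the fallback's output separately.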

LLMs as ETL Primitives: AI in the Data Pipeline, Not Just the Product

· 9 min read
Tian Pan
Software Engineer

The typical AI narrative goes like this: you build a product, you add an AI feature, and users get smarter outputs. That framing is correct, but incomplete. The more durable advantage isn't in the product layer at all — it's in the data pipeline running underneath it.

A growing number of engineering teams have quietly swapped out regex rules, custom classifiers, and hand-coded parsers in their ETL pipelines and replaced them with LLM calls. The result: pipelines that handle unstructured input, adapt to schema drift, and classify records across thousands of categories — without retraining a model for every new edge case. Teams running this pattern at scale are building data assets that compound. Teams still treating LLMs purely as product features are not.

Corpus Curation at Scale: Why Your RAG Quality Ceiling Is Your Document Quality Floor

· 10 min read
Tian Pan
Software Engineer

There's a belief embedded in most RAG architectures that goes something like this: if retrieval returns the right chunks, the LLM will produce correct answers. Teams invest heavily in embedding model selection, hybrid retrieval strategies, and reranking pipelines. Then, three months after deploying to production, answer quality quietly degrades — not because the model changed, not because query patterns shifted dramatically, but because the underlying corpus rotted.

Roughly 40% of enterprise RAG implementations fail, and the failure mode that practitioners underestimate most isn't hallucination or poor retrieval recall. It's document quality. In one reported case, an implementation improved search accuracy from 62% to 89% by introducing document quality scoring, with no changes to the embedding model or retrieval algorithm. The corpus was the variable. The corpus was always the variable.

Your Embedding Pipeline Is Critical Infrastructure — Treat It Like Your Primary Database

· 9 min read
Tian Pan
Software Engineer

Most teams treat embedding generation as a one-time ETL job: run a script, populate a vector database, move on. This works fine in a demo. In production, it is a slow-motion disaster. Your vector index is not a static artifact — it is a continuously running pipeline with its own failure modes, staleness guarantees, and operational runbook. And unlike your primary database, when it breaks, nothing throws an exception. Your system keeps returning results. They are just quietly, confidently wrong.

If you are running a retrieval-augmented generation (RAG) system, a semantic search feature, or any product that depends on embeddings, your vector index deserves the same rigor you give your PostgreSQL cluster. Here is why most teams get this wrong, and what production-grade embedding infrastructure actually looks like.

Synthetic Data Pipelines That Don't Collapse: Generating Training Data at Scale

· 8 min read
Tian Pan
Software Engineer

Train a model on its own output, then train the next model on that model's output, and within three generations you've built a progressively dumber machine. This is model collapse — a degenerative process where each successive generation of synthetic training data narrows the distribution until the model forgets the long tail of rare but important patterns. A landmark Nature study confirmed what practitioners had observed anecdotally: even tiny fractions of synthetic contamination (as low as 1 in 1,000 samples) trigger measurable degradation in lexical, syntactic, and semantic diversity.

Yet synthetic data isn't optional. Real-world labeled data is expensive, scarce in specialized domains, and increasingly exhausted at the scale frontier models demand. The teams shipping successful fine-tunes in 2025–2026 aren't avoiding synthetic data — they're engineering their pipelines to generate it without collapsing. The difference between a productive pipeline and a self-poisoning one comes down to diversity preservation, verification loops, and knowing when to stop.

The RAG Freshness Problem: How Stale Embeddings Silently Wreck Retrieval Quality

· 12 min read
Tian Pan
Software Engineer

Your RAG system launched three months ago with impressive retrieval accuracy. Today, it's confidently wrong about a third of what users ask — and nothing in your monitoring caught the change. No errors logged. No latency spikes. The semantic similarity scores look healthy. But the documents being retrieved are outdated, and the model answers with full confidence because the retrieved context looks authoritative.

This is the RAG freshness problem: semantic similarity does not care about time. An embedding of a deprecated API reference scores just as high as a current one. A policy document from last quarter retrieves ahead of the updated version. The system doesn't know and can't tell. Most teams discover their index is weeks or months stale only after a user complaint — and by then, users have already quietly stopped trusting it.
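
To make the point concrete, the sketch below ranks two hypothetical hits first by raw similarity and then by a score that decays with document age. The half-life decay is one possible mitigation chosen for illustration, not something the excerpt prescribes; `freshness_weighted`, the 90-day half-life, and the sample hits are all assumptions.

```python
from datetime import datetime, timezone


def freshness_weighted(similarity, updated_at, now, half_life_days=90.0):
    """Halve the similarity score for every half_life_days of document age."""
    age_days = max((now - updated_at).total_seconds() / 86400.0, 0.0)
    return similarity * 0.5 ** (age_days / half_life_days)


now = datetime(2025, 6, 1, tzinfo=timezone.utc)

# Hypothetical retrieval hits: the deprecated doc scores slightly higher.
hits = [
    {"id": "api-v1-deprecated", "similarity": 0.90,
     "updated_at": datetime(2025, 3, 3, tzinfo=timezone.utc)},
    {"id": "api-v2-current", "similarity": 0.88,
     "updated_at": now},
]

# Pure similarity ranking ignores time entirely.
by_similarity = sorted(hits, key=lambda h: h["similarity"], reverse=True)

# Age-aware ranking demotes the stale document.
by_freshness = sorted(
    hits,
    key=lambda h: freshness_weighted(h["similarity"], h["updated_at"], now),
    reverse=True,
)
```

With these inputs, similarity alone puts the deprecated reference first, while the decayed score puts the current one first. A decay like this only helps if `updated_at` is stored and kept accurate for every indexed document.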