16 posts tagged with "data-engineering"

The Indexing Policy Committee Nobody Convened: RAG Corpus Governance Beyond the One-Time Migration

· 9 min read
Tian Pan
Software Engineer

Two years ago, a team pointed their retrieval index at the wiki, the Zendesk export, and a snapshot of the public docs. Last week, that same index returned a deprecated runbook that told an SRE to restart a service that no longer exists. The runbook had been deprecated for eighteen months. Nobody owned its retirement, so nobody retired it. The agent confidently cited it. The model wasn't wrong; the corpus was.

This is the failure mode that doesn't show up in retrieval evals: the corpus is treated as a one-time engineering decision when it's actually an ongoing governance problem. The team that scoped the initial ingestion is long gone. The legal review that should have flagged the customer-confidential PDFs never happened, because nobody told legal there was a pipeline. The "freshness strategy" is a Slack message from someone who left in Q3. The retrieval index has become a shared inbox for every document anyone ever scraped, and the bar for inclusion has drifted to "whatever was easy to ingest."
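What ownership-aware ingestion can look like, in a minimal Python sketch. The record fields and the `should_serve` policy are illustrative assumptions, not anything from a specific deployment:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class CorpusRecord:
    doc_id: str
    source: str          # e.g. "wiki", "zendesk", "public-docs"
    owner: str           # a team, not a person who might leave in Q3
    ingested_on: date
    review_by: date      # governance: every document carries an expiry

def should_serve(record: CorpusRecord, today: date) -> bool:
    """Exclude documents nobody has re-certified from retrieval."""
    return today <= record.review_by

runbook = CorpusRecord(
    doc_id="runbook-042",
    source="wiki",
    owner="sre-platform",
    ingested_on=date(2023, 1, 10),
    review_by=date(2023, 7, 10),  # six-month re-certification window
)

assert not should_serve(runbook, date(2024, 7, 1))  # stale: filtered out
```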

LLM-Powered Data Migrations: What Actually Works at Scale

· 10 min read
Tian Pan
Software Engineer

The pitch is compelling: feed your legacy records into an LLM, describe the target schema, and let the model figure out the mapping. No hand-written parsers, no months of transformation logic, no domain expert bottlenecks. Teams have run this and gotten to 70–97% accuracy in a fraction of the time it would take traditional ETL. The problem is that the remaining 3–30% of failures don't look like failures. They look like correct data.

That asymmetry—where wrong outputs are structurally valid and plausible—is what makes LLM-powered data migrations genuinely dangerous without the right validation architecture. This post covers what the teams that have done this successfully actually built: when LLMs earn their place in the pipeline, where they silently break, and the validation layer that catches errors traditional tools cannot.
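One layer of that validation architecture can be sketched cheaply: compare aggregate value distributions between source and migrated data, because plausible-but-wrong rows pass per-row schema checks while population statistics shift visibly. The field values below are invented for illustration:

```python
from collections import Counter

def distribution_shift(source_values, migrated_values, tolerance=0.05):
    """Flag categorical fields whose value distribution moved more than
    `tolerance` between source and target. Catches systematic mis-mappings
    that per-row schema validation happily accepts."""
    src, dst = Counter(source_values), Counter(migrated_values)
    n_src, n_dst = sum(src.values()) or 1, sum(dst.values()) or 1
    drifted = {}
    for cat in set(src) | set(dst):
        delta = abs(src[cat] / n_src - dst[cat] / n_dst)
        if delta > tolerance:
            drifted[cat] = round(delta, 3)
    return drifted

# A migration that flipped some "high" severities to "low" passes schema
# checks on every row, but shows up clearly at the population level.
print(distribution_shift(
    ["high"] * 80 + ["low"] * 20,
    ["high"] * 55 + ["low"] * 45,
))  # {'high': 0.25, 'low': 0.25}
```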

LLMs as Data Engineers: The Silent Failures in AI-Driven ETL

· 11 min read
Tian Pan
Software Engineer

Your hand-coded ETL pipeline handles 95% of records correctly. The edge cases — the currency strings with commas, the inconsistently formatted dates, the nonstandard country codes — flow through to your data warehouse and quietly corrupt your dashboards. Nobody notices until a quarterly report looks wrong. You add another special case to the pipeline. The cycle continues.

LLMs can solve this. They infer schemas from raw samples, handle messy edge cases that no engineer anticipated, and transform unstructured documents into structured records at a fraction of the development time. Several teams have shipped this. Some of them have also had LLMs silently transform "$1,200,000" into 1200 instead of 1200000, flip severity scores from "high" to "low" with complete structural validity, and join on the wrong foreign key in ways that passed every schema check.

The problem isn't that LLMs are bad at data engineering. It's that their failure mode is exactly wrong for ETL: high confidence, no error thrown, structurally valid output.
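A sketch of the kind of cross-check that catches the "$1,200,000" failure above: assuming a hypothetical LLM extraction stage upstream, a dumb deterministic parser serves as an independent witness, and disagreement routes the record to review instead of the warehouse.

```python
import re

def parse_amount_deterministic(raw: str) -> float:
    """Rule-based reference parser: strips currency symbols and
    separators. It only needs to be right when it fires at all."""
    cleaned = re.sub(r"[^0-9.]", "", raw)
    return float(cleaned)

def validate_llm_amount(raw: str, llm_value: float) -> bool:
    """Cross-check the LLM's extraction against the reference parser."""
    try:
        reference = parse_amount_deterministic(raw)
    except ValueError:
        return True  # reference parser can't handle it; LLM is on its own
    return abs(llm_value - reference) < 0.01 * max(reference, 1.0)

assert validate_llm_amount("$1,200,000", 1_200_000.0)   # agrees: pass
assert not validate_llm_amount("$1,200,000", 1200.0)    # silent failure caught
```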

The ORM Impedance Mismatch for AI Agents: Why Your Data Layer Is the Real Bottleneck

· 9 min read
Tian Pan
Software Engineer

Most teams building AI agents spend weeks tuning prompts and evals, benchmarking model choices, and tweaking temperature — while their actual bottleneck sits one layer below: the data access layer that was designed for human developers, not agents.

The mismatch isn't subtle. ORMs like Hibernate, SQLAlchemy, and Prisma, combined with REST APIs that return paginated, single-entity responses, produce data access patterns that are exactly wrong for autonomous AI agents. The result is token waste, rate-limit failures, cascading N+1 database queries, and agents that hallucinate simply because they can't afford to load the context they need.

This post is about the structural problem — and what an agent-optimized data layer actually looks like.
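A toy illustration of the two access shapes, with a schema invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 'acme'), (2, 'globex');
    INSERT INTO items VALUES (1, 1, 'A-1'), (2, 1, 'A-2'), (3, 2, 'B-1');
""")

# ORM-shaped access: one query for orders, then one more per order for
# items. An agent iterating like this pays a round trip (and tokens) per
# entity -- the N+1 pattern.
orders = conn.execute("SELECT id, customer FROM orders").fetchall()
for order_id, _ in orders:
    conn.execute("SELECT sku FROM items WHERE order_id = ?",
                 (order_id,)).fetchall()

# Agent-shaped access: one denormalized query returning the whole context.
rows = conn.execute("""
    SELECT o.id, o.customer, i.sku
    FROM orders o JOIN items i ON i.order_id = o.id
""").fetchall()
```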

Text-to-SQL in Production: Why Natural Language Queries Fail at the Schema Boundary

· 9 min read
Tian Pan
Software Engineer

The demo works every time. The LLM translates "show me last quarter's top ten customers by revenue" into pristine SQL, the results pop up instantly, and everyone in the room nods. Then you deploy it against your actual warehouse — 130 tables, 1,400 columns, a decade of organic naming conventions — and the model starts confidently generating queries that return the wrong numbers. No errors. Just wrong answers.

This is the schema boundary problem, and it's why text-to-SQL has the widest gap of any AI capability between benchmark performance and production reality. A model that scores 86% on Spider 1.0 (the canonical academic benchmark) drops to around 6% accuracy on Spider 2.0, which approximates real enterprise schema complexity. Vendors demo on clean, toy schemas. You're deploying on yours.
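One widely used mitigation is to prune the schema before the model ever sees it, so the prompt carries five relevant tables instead of 130. A minimal lexical sketch; real systems tend to use embeddings, and the tables here are invented:

```python
def select_relevant_tables(question: str, schema: dict[str, list[str]],
                           limit: int = 5) -> dict[str, list[str]]:
    """Crude lexical schema pruning: score each table by how many of its
    table/column tokens appear in the question, keep the top `limit`."""
    words = set(question.lower().split())

    def score(table: str, columns: list[str]) -> int:
        tokens = {table.lower(), *(c.lower() for c in columns)}
        return sum(1 for t in tokens if t in words)

    ranked = sorted(schema.items(), key=lambda kv: score(*kv), reverse=True)
    return dict(ranked[:limit])

schema = {
    "customers": ["id", "name", "region"],
    "invoices": ["id", "customer_id", "revenue", "quarter"],
    "audit_log": ["id", "actor", "action"],
}
print(select_relevant_tables(
    "show me last quarter's top ten customers by revenue", schema, limit=2
))  # keeps customers and invoices, drops audit_log
```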

The Embedding Refresh Problem: Running a Vector Store Like a Database Engineer

· 10 min read
Tian Pan
Software Engineer

Your RAG pipeline is returning confident, well-formatted answers. The LLM response looks great. And yet users keep filing tickets saying the system is wrong. The product manager pulls up the document in question — the information changed six weeks ago, but the vector index still reflects the old version. No errors were thrown. No alerts fired. The system was just silently, invisibly wrong.

This is the embedding refresh problem, and it bites most production RAG systems eventually. Analysis across production deployments shows that over 60% of RAG failures trace back to stale or outdated information in the knowledge base — not bad prompts, not retrieval algorithm failures, but a simple mismatch between what's in the vector index and what's true in the source. Most AI engineers discover this the hard way. Most data engineers already know how to prevent it.
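The database-engineer move is change detection: record a content hash at embed time and diff it against the source on a schedule. A minimal sketch with invented documents:

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale(index_meta: dict[str, str], sources: dict[str, str]) -> list[str]:
    """Compare the hash stored at embed time with the current source text.
    Any mismatch means the vector is answering for a document that no
    longer exists in that form."""
    return [
        doc_id for doc_id, embedded_hash in index_meta.items()
        if content_hash(sources.get(doc_id, "")) != embedded_hash
    ]

# Metadata recorded when each chunk was embedded (illustrative).
index_meta = {"pricing-page": content_hash("Plan A costs $10/mo")}
# The source changed six weeks ago; the index never noticed.
sources = {"pricing-page": "Plan A costs $14/mo"}
print(find_stale(index_meta, sources))  # ['pricing-page'] -> re-embed
```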

LLM-Powered Data Pipelines: The ETL Tier Nobody Benchmarks

· 10 min read
Tian Pan
Software Engineer

Most conversations about LLMs in production orbit around chat interfaces, copilots, and autonomous agents. But if you audit where enterprise LLM tokens are actually being consumed, a different picture emerges: a quiet majority of usage is happening inside batch data pipelines — extracting fields from documents, classifying support tickets, normalizing messy vendor records, enriching raw events with semantic labels. Nobody is writing conference talks about this tier. Nobody is benchmarking it seriously either. And that silence is costing teams real money and real accuracy.

This is the ETL tier that practitioners build first, justify last, and monitor least. It is also, for most organizations, the layer where LLM spend has the highest leverage — and the highest potential for invisible failure.

Data Versioning for AI: The Dataset-Model Coupling Problem Teams Discover Too Late

· 9 min read
Tian Pan
Software Engineer

Your model's accuracy dropped 8% in production overnight. Nothing in the model code changed. No deployment happened. The eval suite is green. So you spend a week adjusting hyperparameters, tweaking prompts, comparing checkpoint losses — and eventually someone notices that a schema migration landed three days ago in the feature pipeline. A single field that switched from NULL to an empty string. That's it. That's the regression.

This is the most common failure mode in production ML systems, and it has almost nothing to do with model quality. It has everything to do with a structural gap most teams don't close until they've been burned: data versions and model versions are intimately coupled, but they're tracked by different tools and owned by different teams.
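Closing that gap starts with making the coupling explicit: pin a dataset fingerprint next to the model version, so a change like NULL-to-empty-string is detectable rather than invisible. A minimal sketch, with names invented for illustration:

```python
import hashlib
import json

def dataset_fingerprint(rows: list[dict]) -> str:
    """Deterministic hash over the data, sensitive to schema-level
    changes: NULL and empty string produce different fingerprints, so
    the migration described above can't land silently."""
    canonical = json.dumps(rows, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

model_card = {
    "model_version": "fraud-scorer-2024-05-01",
    "data_version": dataset_fingerprint([{"field": None}]),
}

# After the migration: the same logical pipeline, a different fingerprint.
current = dataset_fingerprint([{"field": ""}])
if current != model_card["data_version"]:
    print("data changed under a pinned model; investigate before retuning")
```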

Agentic Web Data Extraction at Scale: When Agents Replace Scrapers

· 10 min read
Tian Pan
Software Engineer

The demo takes 20 minutes to build. You paste a URL, an LLM reads the HTML, and structured data comes out the other end. It feels like the future of web extraction has arrived.

Then you run it at 1,000 pages per hour. Costs spiral, blocks accumulate, and extracted fields start drifting in ways that don't look like errors — they look like normal data until your downstream pipeline has silently ingested three weeks of garbage. The "LLM reads the page" pattern is not wrong; it's just priced for prototype throughput.

Agentic web extraction genuinely solves problems that traditional scrapers cannot. But scaling it past proof-of-concept requires understanding a different set of failure modes than most teams expect.

Annotation Workforce Engineering: Your Labelers Are Production Infrastructure

· 10 min read
Tian Pan
Software Engineer

Your model is underperforming, so you dig into the training data. Halfway through the audit you find two annotators labeling the same edge case in opposite ways — and both are following the spec, because the spec is ambiguous. You fix the spec, re-label the affected examples, retrain, and recover a few F1 points. Two months later the same thing happens with a different annotator on a different edge case.

This is not a labeling vendor problem. It is not a data quality tool problem. It is an infrastructure problem that you haven't yet treated like one.

Most engineering teams approach annotation the way they approach a conference room booking system: procure the tool, write a spec, hire some contractors, ship the data. That model worked when you needed a one-time labeled dataset. It collapses the moment annotation becomes a continuous activity feeding a live production model — which it is for almost every team that has graduated from prototype to production.
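Infrastructure gets monitored, so continuous annotation should be too. One standard signal is inter-annotator agreement on overlapping assignments; a dropping Cohen's kappa on an edge-case slice is exactly the ambiguous-spec symptom above. A small sketch with invented labels:

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators constant and identical
    return (observed - expected) / (1 - expected)

# Two annotators, same 8 items, disagreeing on the ambiguous edge case.
ann1 = ["spam", "spam", "ham", "ham", "spam", "ham", "spam", "spam"]
ann2 = ["spam", "spam", "ham", "ham", "ham", "spam", "spam", "spam"]
print(f"kappa = {cohens_kappa(ann1, ann2):.2f}")  # alert when this trends down
```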

When LLMs Beat Rule-Based Systems for Data Normalization (And When They Don't)

· 11 min read
Tian Pan
Software Engineer

A team I know spent three months building a rule-based address normalizer. It handled the top twenty formats, used a USPS API for verification, and worked great on the data they'd seen. Then they got a new enterprise customer. The first week of data had addresses embedded in freeform notes fields, postal codes missing country prefixes, and cross-border formats their rules had never seen. The normalizer failed silently on 31% of records. They threw an LLM at it as a quick fix, expecting 80% accuracy. They got 94%. The surprise wasn't that the LLM worked — it was that nothing in their evaluation framework had predicted this.

This is the shape of the problem. Rule-based normalization is predictable, fast, and cheap. It works well when the data distribution stays in-bounds. LLMs handle the long tail — the weird formats, the implicit domain knowledge, the edge cases that rules never enumerate. But LLMs are also expensive, slow, and inconsistent in ways that break production pipelines if you're not careful. The right answer, for almost every team, is a hybrid that uses each approach on the inputs it's actually good at.
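The hybrid, in miniature: rules handle the in-bounds formats deterministically, and only the remainder pays the LLM tax. The regex and the `llm_normalize` fallback below are illustrative stand-ins:

```python
import re

US_ADDRESS = re.compile(
    r"^(?P<street>.+),\s*(?P<city>[\w .]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$"
)

def llm_normalize(raw: str) -> dict:
    # Stand-in for a real LLM call with a constrained output schema.
    raise NotImplementedError

def normalize_address(raw: str) -> tuple[dict, str]:
    """Rules first: cheap, deterministic, and right for the formats they
    cover. Anything the rules can't parse falls through to the LLM."""
    match = US_ADDRESS.match(raw.strip())
    if match:
        return match.groupdict(), "rules"
    return llm_normalize(raw), "llm"

print(normalize_address("1 Main St, Springfield, IL 62701"))
# ({'street': '1 Main St', 'city': 'Springfield',
#   'state': 'IL', 'zip': '62701'}, 'rules')
```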

LLMs as ETL Primitives: AI in the Data Pipeline, Not Just the Product

· 9 min read
Tian Pan
Software Engineer

The typical AI narrative goes like this: you build a product, you add an AI feature, and users get smarter outputs. That framing is correct, but incomplete. The more durable advantage isn't in the product layer at all — it's in the data pipeline running underneath it.

A growing number of engineering teams have quietly replaced the regex rules, custom classifiers, and hand-coded parsers in their ETL pipelines with LLM calls. The result: pipelines that handle unstructured input, adapt to schema drift, and classify records across thousands of categories — without retraining a model for every new edge case. Teams running this pattern at scale are building data assets that compound. Teams still treating LLMs purely as product features are not.
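The primitive itself is small. A sketch of an LLM classification stage with a constrained label set and a hard fallback; `call_llm` is a hypothetical stand-in for whatever client your provider gives you:

```python
import json

CATEGORIES = ["billing", "bug", "feature-request", "account", "other"]

def call_llm(prompt: str) -> str:
    raise NotImplementedError  # wire up your model provider here

def classify_ticket(text: str) -> str:
    """An LLM call as a pipeline primitive: constrained label set,
    JSON-only output, and a hard fallback so bad outputs degrade to
    'other' instead of corrupting downstream tables."""
    prompt = (
        "Classify the support ticket into exactly one of "
        f"{CATEGORIES}. Reply as JSON: {{\"label\": \"...\"}}.\n\n{text}"
    )
    raw = call_llm(prompt)
    try:
        label = json.loads(raw)["label"]
    except (json.JSONDecodeError, KeyError):
        return "other"
    return label if label in CATEGORIES else "other"
```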