
LLMs as ETL Primitives: AI in the Data Pipeline, Not Just the Product

· 9 min read
Tian Pan
Software Engineer

The typical AI narrative goes like this: you build a product, you add an AI feature, and users get smarter outputs. That framing is correct, but incomplete. The more durable advantage isn't in the product layer at all — it's in the data pipeline running underneath it.

A growing number of engineering teams have quietly swapped out regex rules, custom classifiers, and hand-coded parsers in their ETL pipelines and replaced them with LLM calls. The result: pipelines that handle unstructured input, adapt to schema drift, and classify records across thousands of categories — without retraining a model for every new edge case. Teams running this pattern at scale are building data assets that compound. Teams still treating LLMs purely as product features are not.

This post is about the engineering reality of putting LLMs inside your data pipeline: what they replace, where they fail, how to make their outputs reliable, and why the data flywheel this creates is structurally hard to replicate.

What LLMs Replace in a Traditional ETL Pipeline

Traditional ETL pipelines handle four core operations: extract records from a source, transform them (classify, enrich, normalize, deduplicate), validate the output, and load to a destination. Each of these operations has a classical implementation, and each has a point where the classical implementation breaks down.

Classification normally means regex rules or a trained classifier. Rules break when categories change. Classifiers break when you hit categories they weren't trained on. LLMs handle both: you can classify records into taxonomies that didn't exist when the model was trained, zero-shot, using plain-language descriptions. Shopify runs 30 million product classification predictions daily using a vision-language model, covering 10,000+ categories across 26 verticals, with an 85% merchant acceptance rate. They don't retrain a classifier for new categories — they describe them.
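To make the pattern concrete, here is a minimal sketch of zero-shot classification against a described taxonomy. Everything here — the taxonomy, the prompt shape, and `call_llm` (stubbed so the example runs offline) — is illustrative, not any particular provider's API. The key property: adding a category means adding a dictionary entry, not retraining anything.

```python
import json

# Hypothetical taxonomy: plain-language descriptions stand in for training data.
TAXONOMY = {
    "kitchen.cookware": "Pots, pans, and stovetop vessels",
    "kitchen.cutlery": "Knives and cutting tools",
    "outdoor.camping": "Tents, sleeping bags, and trail gear",
}

def build_prompt(record_text: str) -> str:
    categories = "\n".join(f"- {k}: {v}" for k, v in TAXONOMY.items())
    return (
        "Classify the product into exactly one category key below.\n"
        f"{categories}\n\n"
        f"Product: {record_text}\n"
        'Respond as JSON: {"category": "<key>"}'
    )

def call_llm(prompt: str) -> str:
    # Stub: a real pipeline would call a chat-completion API here.
    if "skillet" in prompt:
        return '{"category": "kitchen.cookware"}'
    return '{"category": "outdoor.camping"}'

def classify(record_text: str) -> str:
    raw = call_llm(build_prompt(record_text))
    category = json.loads(raw)["category"]
    if category not in TAXONOMY:  # reject hallucinated labels outright
        raise ValueError(f"unknown category: {category}")
    return category
```

Note the membership check on the way out: even in a sketch, the model's label is validated against the taxonomy rather than trusted.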

Enrichment normally means joining against lookup tables, running regex extractors, or calling specialized APIs per field type. This breaks on free-text fields. LLMs transform arbitrary text into structured metadata: extract entities, assign sentiment, derive normalized values, infer intent. MotherDuck ships a prompt() SQL function that applies LLM enrichment per row with automatic batching — usable directly in a dbt model. You write a SQL query; the LLM fills in the enrichment column.
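MotherDuck's prompt() is a SQL function; as a rough Python analogue, per-row enrichment with batching might look like the sketch below. `call_llm_batch` is a stub (a real implementation would submit a batch inference job), and the sentiment task is just an example field.

```python
def call_llm_batch(prompts):
    # Stub: pretend the model returns one sentiment label per prompt.
    return ["positive" if "love" in p else "negative" for p in prompts]

def enrich(rows, template, batch_size=100):
    """Fill an enrichment column by batching rows through the LLM."""
    enriched = []
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        labels = call_llm_batch([template.format(**row) for row in batch])
        enriched.extend({**row, "sentiment": label}
                        for row, label in zip(batch, labels))
    return enriched
```

The shape mirrors the SQL version: the caller writes a template over existing columns, and the LLM fills in the new one.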

Entity resolution (deduplication) normally uses string edit distance and blocking rules. This breaks on name variants, abbreviations, and cross-language equivalences. LLMs paired with embedding-based approximate nearest-neighbor blocking outperform fine-tuned BERT-based systems by 40–68% on unseen entity types, according to a 2024 ACM study. The pattern: use embeddings for candidate blocking to keep the problem tractable, then use an LLM for pairwise disambiguation.
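The blocking-then-disambiguation pattern can be sketched as below. A real pipeline would use sentence embeddings with an ANN index (e.g. FAISS) for blocking; here a toy character-trigram Jaccard similarity keeps the example self-contained, and `llm_same_entity` stubs the pairwise LLM judgment. The 0.2 threshold is illustrative.

```python
def trigrams(s: str) -> set:
    s = s.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def block_candidates(records, threshold=0.2):
    """Cheap pass: only pairs above the threshold ever reach the LLM."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((records[i], records[j]))
    return pairs

def llm_same_entity(a: str, b: str) -> bool:
    # Stub for the expensive pairwise call ("Do these name the same entity?").
    combined = (a + " " + b).lower()
    return "intl" in combined and "international" in combined

def resolve(records):
    return [(a, b) for a, b in block_candidates(records) if llm_same_entity(a, b)]
```

Blocking keeps the LLM bill linear-ish instead of quadratic: the pairwise call only runs on candidates that survive the cheap filter.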

Unstructured-to-structured transformation — converting PDFs, emails, and tickets into typed records — has no classical equivalent. It used to require custom parsers for every format, or armies of manual labelers. LLMs convert unstructured documents into structured records at speeds that make human review economically viable only as a spot check, not as the primary path.

The Reliability Problem: Putting a Stochastic Model in a Deterministic Pipeline

The hard part isn't getting LLMs to work in a pipeline. It's getting them to fail safely.

Classical ETL has a comfortable property: given the same input, you get the same output. Pipelines are built on this assumption. A temperature=0 LLM call reduces variance but doesn't eliminate it — and model version upgrades, provider-side infrastructure changes, and prompt changes all shift the output distribution without warning. Silent failures are the dangerous case: the LLM returns a structurally valid response with semantically wrong values, and the downstream system ingests corrupt data without any error signal.

The engineering pattern that survives production is a two-layer validation sandwich:

  1. Structural validation: Enforce schema at the decoding level. Native structured output APIs (OpenAI, Anthropic) and constrained decoding eliminate parse errors before you see them. This is not optional — JSON "mode" via prompting alone produces creative variants (trailing commas, markdown code fences wrapping the payload) that break parsers in real pipelines.

  2. Semantic validation: Structural conformance is necessary but not sufficient. A response with {"confidence": "medium"} or {"status": "complete", "result": null} is valid JSON but semantically wrong in ways no type checker catches. The Instructor library (3M+ monthly downloads) handles this by passing Pydantic validation errors back to the LLM as context and retrying automatically. This closes the loop: the model sees what it got wrong and corrects it.
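The two layers plus the retry loop can be sketched in stdlib Python. This is in the spirit of the Instructor pattern, not its actual API: the structural check is a JSON parse with required keys, the semantic check enforces value constraints, and validation errors are fed back to the model on retry. `call_llm` is a stub that returns a bad record first and a corrected one once it "sees" the error.

```python
import json

ALLOWED_STATUS = {"complete", "partial", "failed"}

def validate(raw: str) -> dict:
    rec = json.loads(raw)                      # structural: must parse
    if "status" not in rec or "result" not in rec:
        raise ValueError("missing required keys: status, result")
    if rec["status"] not in ALLOWED_STATUS:    # semantic: enum constraint
        raise ValueError(f"status must be one of {sorted(ALLOWED_STATUS)}")
    if rec["status"] == "complete" and rec["result"] is None:
        raise ValueError("status=complete requires a non-null result")
    return rec

def call_llm(prompt, error=None):
    # Stub: simulates a model that corrects itself when shown the error.
    if error is None:
        return '{"status": "complete", "result": null}'
    return '{"status": "complete", "result": {"category": "hardware"}}'

def extract(prompt, max_retries=2) -> dict:
    error = None
    for _ in range(max_retries + 1):
        raw = call_llm(prompt, error)
        try:
            return validate(raw)
        except (ValueError, json.JSONDecodeError) as e:
            error = str(e)                     # fed back on the next attempt
    raise RuntimeError(f"validation failed after retries: {error}")
```

In production, the `validate` function is where Pydantic models earn their keep; the retry-with-error-context loop is the part that makes the step self-correcting.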

Beyond the validation layer, three operational patterns make LLM pipeline steps behave more like deterministic transformations:

  • Idempotency keys: Record the LLM's output to a durable store before writing downstream. If the pipeline restarts mid-run, you replay from the log rather than re-calling the LLM.
  • Circuit breakers: Hard limits on cost and turn counts per record. Without these, recursive patterns can drive API costs from $127 to $47,000 per week — a real production incident that propagated undetected because the pipeline had no cost ceiling.
  • Pinned model versions: Treat the model version as a dependency, not a floating pointer. Every major provider offers version pinning. Unpinned models can silently shift output distributions when the provider upgrades infrastructure.
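The idempotency-key and circuit-breaker patterns above can be combined in one small wrapper. This is a sketch, not a library: the "durable store" is an in-memory dict standing in for a database or log, costs are tracked in integer cents to avoid float drift, and the budget numbers are illustrative.

```python
class CostCeilingExceeded(Exception):
    pass

class PipelineStep:
    def __init__(self, llm_fn, budget_cents=50_000, cost_per_call_cents=1):
        self.llm_fn = llm_fn
        self.store = {}                  # durable output log, keyed by record id
        self.spent_cents = 0
        self.budget_cents = budget_cents
        self.cost_per_call_cents = cost_per_call_cents

    def process(self, record_id, text):
        if record_id in self.store:      # idempotent replay: no second LLM call
            return self.store[record_id]
        if self.spent_cents + self.cost_per_call_cents > self.budget_cents:
            raise CostCeilingExceeded(f"ceiling of {self.budget_cents} cents hit")
        self.spent_cents += self.cost_per_call_cents
        out = self.llm_fn(text)
        self.store[record_id] = out      # log output before any downstream write
        return out
```

On restart, replayed record ids hit the store and cost nothing; a runaway loop hits the ceiling and fails loudly instead of accumulating a five-figure bill.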

Cost and Latency: When Batch Beats Real-Time

The reflexive concern about LLM-in-the-pipeline is latency. This concern is usually misapplied.

Classification, enrichment, deduplication, and document parsing do not need real-time latency. Nightly enrichment runs, hourly batch jobs, event-driven backfills — none of these have a human waiting on them synchronously. Batch inference APIs give you a 50% cost reduction versus real-time and higher throughput via optimized parallelization, and most jobs complete in under an hour.

Prompt caching compounds these savings. For pipelines where records share a common system prompt or schema definition — which is nearly always — caching the shared prefix reduces input token cost by up to 90%. A team that applied prompt caching to their classification pipeline cut LLM costs by 59% without changing anything else.
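A back-of-envelope model shows how the two discounts compound. All numbers below — token counts, prices, and discount rates — are illustrative assumptions, not provider quotes; plug in your own.

```python
def cost_per_million_records(
    shared_prefix_tokens=2_000,   # system prompt + schema, cached across records
    per_record_tokens=300,
    output_tokens=50,
    input_price_per_mtok=1.00,    # USD per million input tokens (illustrative)
    output_price_per_mtok=4.00,
    batch_discount=0.5,           # batch APIs commonly price at 50% of real-time
    cache_discount=0.9,           # cached prefix tokens at ~10% of list price
):
    n = 1_000_000
    prefix = n * shared_prefix_tokens * input_price_per_mtok / 1e6 * (1 - cache_discount)
    fresh = n * per_record_tokens * input_price_per_mtok / 1e6
    out = n * output_tokens * output_price_per_mtok / 1e6
    return (prefix + fresh + out) * batch_discount
```

With these assumed numbers, a million records costs about $350 instead of the $2,500 the same job would cost at list price with no caching — most of the saving coming from the large shared prefix being nearly free.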

The honest cost metric for a data pipeline is not cost-per-token. It's cost-per-successfully-transformed-record. At that level, LLM-augmented pipelines routinely undercut the total cost of the alternatives when you account for engineering time maintaining rule systems and retraining classifiers.

At extreme scale — DoorDash operates at 10 billion predictions per day — API-based LLMs are economically unviable regardless of optimizations, and the answer is self-hosted fine-tuned models. But most teams are not at that scale, and API-based batch inference is the right default until the numbers force a different decision.

What Breaks: The Pitfalls That Ship to Production

The gap between an LLM pipeline demo and a production LLM pipeline is larger than it appears. The common failure modes:

Silent data corruption is worse than crashes. A pipeline that errors out is immediately visible. A pipeline that silently writes wrong classifications to your analytics database may run for weeks before anyone notices the dashboards are wrong. Confidence gating — discarding or flagging records below a threshold rather than accepting every output — is the first defense.
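Confidence gating is small enough to sketch in a few lines. The 0.8 threshold is an assumption to tune against your own validation data, and the record shape is illustrative.

```python
def gate(records, threshold=0.8):
    """Route low-confidence records to a review queue instead of the warehouse."""
    accepted, review_queue = [], []
    for rec in records:
        (accepted if rec["confidence"] >= threshold else review_queue).append(rec)
    return accepted, review_queue
```

The point is the routing, not the threshold: below-threshold records become a visible queue with a human in the loop, rather than silent writes to analytics tables.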

Context rot in multi-step pipelines. LLM performance degrades measurably past 50,000–150,000 tokens regardless of the model's advertised context length. Long-running agentic enrichment pipelines that accumulate context hit this ceiling. The fix is just-in-time context injection: pass only the information relevant to the current step, not everything accumulated so far.

Prompt fragility. Small prompt changes produce large output distribution shifts. A pipeline that worked in staging breaks in production because someone edited the system prompt to improve one field and inadvertently shifted another. The discipline here is the same as code discipline: treat prompts as versioned artifacts, run evals on every change, and don't ship prompt changes the same day you ship model version upgrades.

The 80/95 problem. Getting to 80% pipeline quality is fast. The remaining 15 points to a production-ready 95% — edge cases, domain-specific failure modes, distributional corner cases — take as long as the first 80%. Teams that benchmark their pipeline at 80% and declare it production-ready discover this the hard way. Build in time for the long tail before committing to delivery dates.

The Data Asset Flywheel

The operational efficiency of LLM-augmented pipelines is real but secondary. The primary strategic value is what those enriched records enable downstream.

Each LLM-enriched record is a labeled example. At 50,000–500,000 high-quality labeled interactions — the threshold varies by domain — a fine-tuned smaller model materially outperforms any generalist API on your specific tasks at a fraction of the cost. Teams enriching records today are generating training data that makes their next-generation models domain-specific and cheap to run. Teams treating LLMs purely as product features generate user interactions but not the structured, labeled dataset needed for this loop.

The feedback loop compounds in one direction. A richer enriched dataset enables more accurate downstream models, better retrieval, and more reliable agent behavior. Each improvement unlocks the next. After 12–24 months, the proprietary enriched dataset and fine-tuned models create switching costs that generalist competitors cannot overcome by plugging in the same frontier API.

This is the structural difference between teams building AI features and teams building AI-powered data assets. The product feature is visible and replicable. The enriched dataset is invisible, proprietary, and accumulated through real operations — which makes it the harder thing to copy.

Where to Start

The practical entry point is picking one classification or enrichment task that currently lives in a fragile rule system or requires manual review and replacing it with a batch LLM call behind a validation layer.

Start with batch, not real-time. Use structured output APIs rather than prompting for JSON. Add Pydantic validation with automatic retry from day one — retrofitting it later is harder than it looks. Pin your model version. Add a confidence gate to route low-confidence records to a review queue rather than silently accepting them.

Instrument the pipeline to track cost-per-record, validation failure rate, and confidence distribution. These three metrics tell you whether the pipeline is healthy and where to invest optimization effort.
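As a sketch, the three metrics can be accumulated in one small tracker. A real pipeline would emit these to a metrics backend rather than hold them in memory; the field names here are illustrative.

```python
from statistics import median

class PipelineMetrics:
    def __init__(self):
        self.records = 0
        self.failures = 0
        self.cost_cents = 0
        self.confidences = []

    def observe(self, cost_cents, valid, confidence):
        self.records += 1
        self.cost_cents += cost_cents
        if valid:
            self.confidences.append(confidence)
        else:
            self.failures += 1

    def report(self):
        n = max(self.records, 1)
        return {
            "cost_per_record_cents": self.cost_cents / n,
            "validation_failure_rate": self.failures / n,
            "median_confidence": median(self.confidences) if self.confidences else None,
        }
```

A rising failure rate flags prompt or model drift, a rising cost-per-record flags runaway retries, and a drifting confidence distribution flags input distribution shift — each pointing at a different fix.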

The migration from rule-based ETL to LLM-augmented ETL is not a one-step project. It's a series of small replacements, each improving data quality and reducing maintenance burden, that gradually convert a brittle rule system into a flexible, self-correcting pipeline. The teams doing this systematically in 2025–2026 are not just cutting costs — they're building the data infrastructure that makes their 2028 models impossible to replicate.

The quiet transformation in data pipelines is worth more attention than it gets.
