
LLMs as ETL Primitives: AI in the Data Pipeline, Not Just the Product

9 min read
Tian Pan
Software Engineer

The typical AI narrative goes like this: you build a product, you add an AI feature, and users get smarter outputs. That framing is correct, but incomplete. The more durable advantage isn't in the product layer at all — it's in the data pipeline running underneath it.

A growing number of engineering teams have quietly swapped out regex rules, custom classifiers, and hand-coded parsers in their ETL pipelines and replaced them with LLM calls. The result: pipelines that handle unstructured input, adapt to schema drift, and classify records across thousands of categories — without retraining a model for every new edge case. Teams running this pattern at scale are building data assets that compound. Teams still treating LLMs purely as product features are not.

This post is about the engineering reality of putting LLMs inside your data pipeline: what they replace, where they fail, how to make their outputs reliable, and why the data flywheel this creates is structurally hard to replicate.

What LLMs Replace in a Traditional ETL Pipeline

Traditional ETL pipelines handle four core operations: extract records from a source, transform them (classify, enrich, normalize, deduplicate), validate the output, and load to a destination. Each of these operations has a classical implementation, and each has a point where the classical implementation breaks down.

Classification normally means regex rules or a trained classifier. Rules break when categories change. Classifiers break when you hit categories they weren't trained on. LLMs handle both: you can classify records into taxonomies that didn't exist when the model was trained, zero-shot, using plain-language descriptions. Shopify runs 30 million product classification predictions daily using a vision-language model, covering 10,000+ categories across 26 verticals, with an 85% merchant acceptance rate. They don't retrain a classifier for new categories — they describe them.
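
To make the pattern concrete, here is a minimal sketch of zero-shot classification as a pipeline step, assuming an OpenAI-style chat client. The taxonomy, model name, and guard logic are illustrative choices, not Shopify's implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical taxonomy: each category is a plain-language description,
# so adding a category is a data change, not a model retrain.
TAXONOMY = {
    "home_espresso": "Espresso machines and grinders intended for home use",
    "commercial_espresso": "Espresso equipment rated for cafe or restaurant volume",
    "coffee_accessories": "Tampers, scales, filters, and other brewing accessories",
}

def classify(record_text: str) -> str:
    """Zero-shot classification of one record against the taxonomy."""
    category_block = "\n".join(f"- {key}: {desc}" for key, desc in TAXONOMY.items())
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        temperature=0,        # reduces variance; does not eliminate it
        messages=[
            {"role": "system",
             "content": "Classify the record into exactly one category. "
                        "Reply with the category key only.\n" + category_block},
            {"role": "user", "content": record_text},
        ],
    )
    label = resp.choices[0].message.content.strip()
    # Reject labels outside the taxonomy before they reach the load step.
    return label if label in TAXONOMY else "unclassified"
```

The guard on the last line matters: an out-of-taxonomy label should go to a dead-letter path for review, not get loaded as if it were a real category.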

Enrichment normally means joining against lookup tables, running regex extractors, or calling specialized APIs per field type. This breaks on free-text fields. LLMs transform arbitrary text into structured metadata: extract entities, assign sentiment, derive normalized values, infer intent. MotherDuck ships a prompt() SQL function that applies LLM enrichment per row with automatic batching — usable directly in a dbt model. You write a SQL query; the LLM fills in the enrichment column.
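
Outside of SQL, the same per-row enrichment is a few lines of Python. This is a hedged sketch of the generic pattern, not MotherDuck's prompt() internals; the model, prompt, and output keys are assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

ENRICH_PROMPT = (
    "Extract metadata from this customer review. Return JSON with keys: "
    "sentiment (positive|neutral|negative), product_mentions (list of strings), "
    "is_complaint (bool)."
)

def enrich(text: str) -> dict:
    """One enrichment call per row; a real pipeline would batch and cache these."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},  # JSON mode; key names are still on us
        messages=[{"role": "system", "content": ENRICH_PROMPT},
                  {"role": "user", "content": text}],
    )
    return json.loads(resp.choices[0].message.content)

# e.g. as a transform step: rows["enrichment"] = rows["review_text"].map(enrich)
```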

Entity resolution (deduplication) normally uses string edit distance and blocking rules. This breaks on name variants, abbreviations, and cross-language equivalences. LLMs paired with embedding-based approximate nearest-neighbor blocking outperform fine-tuned BERT-based systems by 40–68% on unseen entity types, according to a 2024 ACM study. The pattern: use embeddings for candidate blocking to keep the problem tractable, then use an LLM for pairwise disambiguation.
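
Here is a sketch of the block-then-disambiguate pattern, assuming OpenAI embeddings for blocking and a chat model for the pairwise check. The thresholds, model names, and yes/no protocol are illustrative.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(names: list[str]) -> np.ndarray:
    """Embed entity names once, for cheap candidate blocking."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=names)
    vecs = np.array([d.embedding for d in resp.data])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def candidate_pairs(names: list[str], top_k: int = 3, min_sim: float = 0.8):
    """Blocking: only surface each record's nearest neighbors for comparison."""
    vecs = embed(names)
    sims = vecs @ vecs.T
    for i, row in enumerate(sims):
        for j in row.argsort()[::-1][1 : top_k + 1]:  # skip index 0 (self-match)
            if i < j and row[j] >= min_sim:
                yield names[i], names[j]

def same_entity(a: str, b: str) -> bool:
    """LLM pairwise disambiguation on the small candidate set."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{"role": "user",
                   "content": "Do these two strings refer to the same real-world "
                              f"entity? Answer yes or no.\nA: {a}\nB: {b}"}],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

names = ["Acme Corp", "ACME Corporation", "Acme Labs GmbH"]
matches = [(a, b) for a, b in candidate_pairs(names) if same_entity(a, b)]
```

The design point is the division of labor: embeddings keep the comparison count near-linear, and the LLM only sees the handful of candidates where the call is genuinely ambiguous.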

Unstructured-to-structured transformation — converting PDFs, emails, and tickets into typed records — has no classical equivalent; historically it meant custom parsers for every format, or armies of manual labelers. LLMs convert unstructured documents to structured records at speeds that make human review economically viable only for spot-checking, not as the primary path.
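
As a sketch, assuming OCR'd invoice text and an OpenAI-style client; the schema and the 2% review sample rate are illustrative assumptions, not recommendations.

```python
import json
import random
from openai import OpenAI

client = OpenAI()

INVOICE_SCHEMA = (
    "Extract the invoice into JSON with keys: vendor (string), "
    "invoice_date (YYYY-MM-DD), total_amount (number), currency (ISO 4217), "
    "line_items (list of {description, quantity, unit_price})."
)

def parse_invoice(raw_text: str) -> dict:
    """Turn an unstructured document (e.g. OCR'd PDF text) into a typed record."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "system", "content": INVOICE_SCHEMA},
                  {"role": "user", "content": raw_text}],
    )
    record = json.loads(resp.choices[0].message.content)
    # Human review as spot-check, not the primary path: sample a small fraction.
    record["_needs_review"] = random.random() < 0.02
    return record
```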

The Reliability Problem: Putting a Stochastic Model in a Deterministic Pipeline

The hard part isn't getting LLMs to work in a pipeline. It's getting them to fail safely.

Classical ETL has a comfortable property: given the same input, you get the same output. Pipelines are built on this assumption. A temperature=0 LLM call reduces variance but doesn't eliminate it — and model version upgrades, provider-side infrastructure changes, and prompt changes all shift the output distribution without warning. Silent failures are the dangerous case: the LLM returns a structurally valid response with semantically wrong values, and the downstream system ingests corrupt data without any error signal.

The engineering pattern that survives production is a two-layer validation sandwich:

  1. Structural validation: Enforce schema at the decoding level. Native structured output APIs (OpenAI, Anthropic) and constrained decoding eliminate parse errors before you see them. This is not optional — JSON "mode" via prompting alone produces creative variants (trailing commas, markdown code fences wrapping the payload) that break parsers in real pipelines.

  2. Semantic validation: Structural conformance is necessary but not sufficient. A response with {"confidence": "medium"} or {"status": "complete", "result": null} is valid JSON but semantically wrong in ways no type checker catches. The Instructor library (3M+ monthly downloads) handles this by passing Pydantic validation errors back to the LLM as context and retrying automatically. This closes the loop: the model sees what it got wrong and corrects it.
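
Roughly what the two layers look like together with Instructor and Pydantic, assuming the OpenAI backend; the TicketRecord schema, model name, and retry budget are illustrative.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, field_validator

# Patch the client: responses are parsed into the Pydantic model, and validation
# errors are sent back to the LLM as context for an automatic retry.
client = instructor.from_openai(OpenAI())

class TicketRecord(BaseModel):
    category: str
    confidence: float = Field(ge=0.0, le=1.0)  # semantic bound, not just a type

    @field_validator("category")
    @classmethod
    def category_not_empty(cls, v: str) -> str:
        if not v.strip():
            raise ValueError("category must not be empty")
        return v

raw_ticket = "My order #1042 arrived damaged and support has not replied in a week."

record = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=TicketRecord,  # structural + semantic validation in one place
    max_retries=2,                # re-prompt with the validation error on failure
    messages=[{"role": "user", "content": "Classify this ticket: " + raw_ticket}],
)
```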

Beyond the validation layer, three operational patterns make LLM pipeline steps behave more like deterministic transformations:

  • Idempotency keys: Record the LLM's output to a durable store before writing downstream. If the pipeline restarts mid-run, you replay from the log rather than re-calling the LLM (see the sketch after this list).
  • Circuit breakers: Hard limits on cost and turn counts per record. Without these, recursive patterns can drive API costs from $127 to $47,000 per week — a real production incident that propagated undetected because the pipeline had no cost ceiling.
  • Pinned model versions: Treat the model version as a dependency, not a floating pointer. Every major provider offers version pinning. Unpinned models can silently shift output distributions when the provider upgrades infrastructure.
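
A sketch combining the first two patterns, with the model version treated as pinned config, using SQLite as the durable store. The key construction, model snapshot name, and budget value are assumptions for illustration.

```python
import hashlib
import json
import sqlite3

MODEL_VERSION = "gpt-4o-mini-2024-07-18"  # pinned: a dependency, not a floating pointer
PROMPT_VERSION = "classify-v3"            # bump this when the prompt changes
COST_CEILING_USD = 50.0                   # hard per-run budget

db = sqlite3.connect("llm_calls.db")
db.execute("CREATE TABLE IF NOT EXISTS llm_calls (key TEXT PRIMARY KEY, output TEXT)")
run_cost_usd = 0.0

def idempotency_key(record_text: str) -> str:
    """Same record + same prompt + same pinned model -> same key, so restarts replay."""
    payload = f"{MODEL_VERSION}|{PROMPT_VERSION}|{record_text}"
    return hashlib.sha256(payload.encode()).hexdigest()

def call_with_guards(record_text: str, llm_call, cost_per_call_usd: float) -> dict:
    """Wrap an LLM call with a durable output log and a cost circuit breaker."""
    global run_cost_usd
    key = idempotency_key(record_text)
    cached = db.execute("SELECT output FROM llm_calls WHERE key = ?", (key,)).fetchone()
    if cached:
        return json.loads(cached[0])  # replay from the log instead of re-calling
    if run_cost_usd + cost_per_call_usd > COST_CEILING_USD:
        raise RuntimeError("cost circuit breaker tripped")  # fail loudly, not silently
    output = llm_call(record_text)    # the actual call against the pinned model version
    run_cost_usd += cost_per_call_usd
    db.execute("INSERT INTO llm_calls (key, output) VALUES (?, ?)",
               (key, json.dumps(output)))
    db.commit()
    return output
```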

Cost and Latency: When Batch Beats Real-Time

The reflexive concern about LLM-in-the-pipeline is latency. This concern is usually misapplied.
