
LLMs as Data Engineers: The Silent Failures in AI-Driven ETL

Tian Pan · Software Engineer · 11 min read

Your hand-coded ETL pipeline handles 95% of records correctly. The edge cases — the currency strings with commas, the inconsistently formatted dates, the nonstandard country codes — flow through to your data warehouse and quietly corrupt your dashboards. Nobody notices until a quarterly report looks wrong. You add another special case to the pipeline. The cycle continues.

LLMs can solve this. They infer schemas from raw samples, handle messy edge cases that no engineer anticipated, and transform unstructured documents into structured records at a fraction of the development time. Several teams have shipped this. Some of them have also had LLMs silently transform "$1,200,000" into 1200 instead of 1200000, flip severity scores from "high" to "low" with complete structural validity, and join on the wrong foreign key in ways that passed every schema check.

The problem isn't that LLMs are bad at data engineering. It's that their failure mode is exactly wrong for ETL: high confidence, no error thrown, structurally valid output.

What LLMs Actually Do Well in ETL

Before cataloging the failure modes, it's worth being precise about where LLMs genuinely add value over hand-coded pipelines.

Schema inference from messy samples. Given 50 rows of unstructured invoice data, a frontier model can propose a reasonable target schema, identify which fields are likely optional, and flag ambiguous cases for human review. This beats spending three days writing regex patterns that still miss edge cases.

Edge case handling in data transformation. Hand-coded pipelines excel at known patterns. LLMs handle unknown patterns: address formats from countries your pipeline wasn't built for, product descriptions that contain prices, timestamps in seven different formats including some that were human-typed and are malformed. The long tail of variation that breaks rule-based systems is exactly where language models are strong.

Semantic mapping between schemas. Migrating from one schema to another when field names don't match is tedious to automate with rules. "billing_address_line_1" maps to "bill_addr1" maps to "Invoice.BillingAddress.Street" — LLMs handle these mappings with high accuracy when the schemas are provided as context.

Unstructured-to-structured extraction. Financial filings, medical records, contracts, support tickets — converting free-form text into structured records at scale is impractical with hand-coded parsers. LLMs collapse development time from months to days for these use cases.

Tools like Unstract (open source) and Snowflake Cortex have productized these capabilities. The Cortex EXTRACT_ANSWER and COMPLETE functions let you run these transformations directly in SQL pipelines. dbt's integration with MotherDuck exposes a prompt() function that transforms text within dbt models. The engineering ecosystem has moved fast here.

The Failure Modes That Will Burn You

Here is what a data transformation failure looks like in a hand-coded pipeline: an exception is thrown, the job stops, Airflow marks the task red, an alert fires, an engineer investigates.

Here is what a data transformation failure looks like in an LLM-augmented pipeline: the job completes successfully, all records are processed, schema validation passes, row counts match, and three weeks later someone notices that unit prices are off by a factor of 100 in one product category because the model consistently stripped the decimal point from European-formatted numbers.

Researchers studying LLM code generation found that GPT-4 shows an 11.97% invalid response rate on complex extraction tasks, and that GPT-4o-mini frequently reconstructs input objects rather than preserving all original fields — causing silent data loss with no error signal. These aren't catastrophic failures. They're quiet ones that accumulate.

The failure modes fall into three categories:

Type coercion errors with high confidence. Currency strings, percentage values, and locale-specific number formats trip models consistently. "$500,000" may become 500000 (correct), 500 (the comma misread as a decimal separator), or 5000 (locale confusion). The model has no awareness that it chose wrong. Structural validation passes because the output is a valid integer. Semantic validation fails only if you defined the right checks.

Semantic field mapping mistakes. An LLM given two similar schemas may map fields that sound related but aren't. customer_segment might map to account_tier in one context and risk_category in another. Both are valid strings; both pass schema validation; one is wrong. In text-to-SQL specifically, models frequently confuse foreign key relationships — joining on fields that co-occur in documentation but aren't actually the right join key in the schema.

Confident hallucination when context is incomplete. When a model lacks the metadata it needs to make a correct transformation, it doesn't stop and ask. It produces a plausible-looking answer from whatever information is available. A model asked to classify records by a taxonomy it hasn't seen will invent category names that look reasonable. A model filling in missing values will use statistical intuition, not your business rules. The outputs pass every check that doesn't specifically test for the thing the model made up.

Building the Validation Layer That Catches What the Model Gets Wrong

The solution isn't to avoid LLMs in ETL pipelines. It's to treat LLM outputs the same way you treat data from untrusted external sources: validate before trusting.

The validation pattern that works in production is a three-tier cascade:

Tier 1 — Structural validation. Pydantic schemas (or equivalent) enforce field types, required fields, and value ranges. This catches malformed JSON, missing required fields, and obvious type errors. Libraries like Instructor wrap LLM calls with automatic Pydantic validation and retry logic. Implement defensive coercion here: a parser that handles "$500,000", "500000", "500,000.00", and "5e5" and normalizes them to a canonical float, rather than relying on the model to always produce the right format.
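A minimal sketch of that defensive coercion, using Pydantic v2 — the normalizer and the InvoiceRecord schema are illustrative names, not from any particular framework. The point is that ambiguity raises an error rather than producing a silent guess:

```python
import re
from pydantic import BaseModel, field_validator

def normalize_amount(raw) -> float:
    """Coerce messy currency strings to a canonical float.

    Handles '$500,000', '500000', '500,000.00', '5e5', and
    European-style '500.000,00'. Raises ValueError on input it
    cannot parse unambiguously, rather than guessing."""
    if isinstance(raw, (int, float)):
        return float(raw)
    # Strip currency symbols, spaces, and any other non-numeric noise
    s = re.sub(r"[^\d.,eE+-]", "", str(raw).strip())
    if not s:
        raise ValueError(f"unparseable amount: {raw!r}")
    # European format: dot groups thousands, comma is the decimal separator
    if re.fullmatch(r"\d{1,3}(\.\d{3})+,\d+", s):
        s = s.replace(".", "").replace(",", ".")
    else:
        s = s.replace(",", "")  # US-style thousands separators
    return float(s)

class InvoiceRecord(BaseModel):
    invoice_id: str
    total_amount: float

    @field_validator("total_amount", mode="before")
    @classmethod
    def _coerce_amount(cls, v):
        # Normalize before type validation so "$1,200,000" never
        # reaches the model as an opportunity to guess
        return normalize_amount(v)
```

Libraries like Instructor plug a schema like this directly into the LLM call, so a response that fails validation triggers a retry instead of flowing downstream.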

Tier 2 — Semantic validation. This is the layer most teams skip and most teams regret. Semantic validation checks that outputs are meaningful, not just valid. Techniques include:

  • Rule-based contradiction detection: if severity = "low" and financial_impact > 1000000, flag for review
  • Cross-field consistency: end_date must be after start_date; discount_percent must be between 0 and 100
  • Reference data lookups: country codes must exist in your country table; product SKUs must exist in your product catalog
  • LLM-as-judge calls for high-stakes fields: a second model call that evaluates whether the extraction is semantically coherent
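The first three techniques reduce to plain predicate checks over each record. A sketch, with field names taken from the examples above (your own business invariants would replace these rules):

```python
def semantic_checks(rec: dict, valid_countries: set) -> list:
    """Return human-readable flags for a single extracted record;
    an empty list means every semantic rule passed."""
    flags = []
    # Rule-based contradiction detection
    if rec.get("severity") == "low" and rec.get("financial_impact", 0) > 1_000_000:
        flags.append("severity/financial_impact contradiction")
    # Cross-field consistency
    if rec.get("start_date") and rec.get("end_date") and rec["end_date"] <= rec["start_date"]:
        flags.append("end_date not after start_date")
    if not (0 <= rec.get("discount_percent", 0) <= 100):
        flags.append("discount_percent out of range")
    # Reference data lookup
    if rec.get("country_code") not in valid_countries:
        flags.append(f"unknown country_code: {rec.get('country_code')}")
    return flags
```

Flagged records go to a review queue rather than the warehouse; the structurally valid severity flip described earlier is exactly what the first rule catches.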

The PARSE framework, which achieves 64.7% accuracy improvements on complex extraction tasks, implements a systematic three-stage semantic check: verify no required attributes are missing, verify extracted values exist in the source document, and verify business rule compliance.

Tier 3 — Statistical anomaly detection. At the pipeline level, track distributions across runs: row counts, cardinality of categorical fields, value distributions, null rates. Alert when these deviate beyond expected thresholds — Kolmogorov-Smirnov tests for numeric distributions, Jensen-Shannon divergence for categorical ones. Great Expectations (now at v1.0) implements much of this as configuration rather than code, with integrations for Airflow and dbt.

Golden dataset regression testing closes the loop. Maintain a curated set of input/output pairs that represent known-good transformations, including known edge cases. Run these on every pipeline change and model version upgrade. Model upgrades that improve average performance sometimes regress specific edge cases; golden tests catch this before it reaches production.
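The harness itself is small; the value is in curating the cases. A sketch (the golden pairs shown are illustrative, and the extraction function is whatever wraps your LLM call):

```python
# Golden pairs: curated inputs with their known-correct outputs,
# including the edge cases that have burned you before. In practice
# these live in version control next to the pipeline code.
GOLDEN_CASES = [
    {"input": "Total due: $1,200,000", "expected": {"total": 1200000}},
    {"input": "Betrag: 1.200.000,00 EUR", "expected": {"total": 1200000}},  # European format
]

def run_golden_suite(extract_fn):
    """Run every golden case through the extraction function and
    return the list of regressions; empty list means safe to ship."""
    failures = []
    for case in GOLDEN_CASES:
        actual = extract_fn(case["input"])
        if actual != case["expected"]:
            failures.append({"input": case["input"],
                             "expected": case["expected"],
                             "actual": actual})
    return failures
```

Wire this into CI so that a prompt tweak or a model version bump cannot merge while any golden case regresses.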

Sandboxing AI-Generated Transformation Code

Some LLM-augmented ETL patterns generate transformation code (Python, SQL) rather than producing structured data directly. The English SDK for Apache Spark, dbt integrations that generate SQL models, and AI-assisted pandas transformations all fall into this category.

Running AI-generated code in production without isolation is a meaningful security risk. CVE-2024-5565 showed that LLMs in data tools can be prompted to generate malicious code through prompt injection via the data being processed — a data record containing a carefully crafted string that hijacks the LLM's next generation step into producing different code than intended. Benchmark research found an 84% attack success rate for inducing vulnerable code generation across evaluated models.

The defense-in-depth approach:

AST-level restriction. Parse generated code before execution and allow only a whitelist of operations. Python transformations that only call pandas operations don't need filesystem access or network calls. Rejecting code with import os, import subprocess, or open() calls eliminates the most common attack surface.
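A minimal version of this check using Python's ast module — the whitelist and blocklist here are illustrative starting points, not a complete sandbox policy:

```python
import ast

BLOCKED_CALLS = {"open", "exec", "eval", "compile", "__import__"}
ALLOWED_IMPORTS = {"pandas", "numpy", "math", "datetime", "re"}

def check_generated_code(source: str) -> list:
    """Scan generated code before execution; return a list of
    violations. Run the code only if the list is empty."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = ([a.name for a in node.names] if isinstance(node, ast.Import)
                     else [node.module or ""])
            for name in names:
                if name.split(".")[0] not in ALLOWED_IMPORTS:
                    violations.append(f"disallowed import: {name}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                violations.append(f"disallowed call: {node.func.id}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            # Dunder access is a common sandbox-escape vector
            violations.append(f"dunder access: {node.attr}")
    return violations
```

This is a pre-filter, not a sandbox — determined attackers can evade AST checks, which is why container isolation sits behind it in the defense-in-depth stack.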

Container isolation. Run generated code in containers with dropped Linux capabilities, read-only root filesystems, and network isolation. Firecracker microVMs boot in under 125ms and provide hardware-level isolation if your threat model requires it.

Dry-run validation. Before materializing results, run transformations against a sample of production data with writes disabled. dbt's --empty flag materializes with zero rows; similar patterns exist in Spark. Catch schema errors and obvious semantic errors before touching production tables.

Idempotency. All AI-generated transformations should be idempotent: running the same transformation twice on the same data produces the same result, with no side effects. This enables safe retries, backfills, and the rollbacks you'll need when a transformation turns out to have been wrong.
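In practice idempotency means keyed upserts rather than appends. A toy sketch with a dict standing in for a warehouse table (a real pipeline would use MERGE or INSERT ... ON CONFLICT with the same keying logic):

```python
def idempotent_load(target: dict, transformed_rows: list) -> dict:
    """Upsert keyed on a natural key so that re-running the same
    transformation is a no-op, not a duplication."""
    for row in transformed_rows:
        # Overwrite by key, never append: retries and backfills
        # converge to the same final state
        target[row["invoice_id"]] = row
    return target
```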

Integrating This Into Your Pipeline Stack

The practical integration pattern for teams already running dbt or Airflow:

In dbt, LLM-generated SQL models should go through the same test infrastructure as hand-written ones — uniqueness constraints, not-null checks, relationship tests, and custom data tests for semantic invariants. The dbt Semantic Layer can expose your models to natural language query interfaces, but the transformation layer itself needs the same rigor as any other dbt model. Auto-generate documentation and test suggestions using LLMs; manually review before committing.

In Airflow, the Common AI Provider returns typed Entities objects through XCom, which is preferable to raw strings because it enforces schema at the interface boundary. Pair LLM operators with downstream validation tasks that run Great Expectations checks before proceeding. Use OpenLineage integration to trace data quality failures back to specific source records and LLM calls — when something goes wrong in a pipeline that uses LLMs, you need lineage to understand whether the failure is in the model output, the source data, or the validation logic.

Databricks recommends what they call the "30% rule" for AI ETL: automate 30% of decisions with AI while keeping humans in the loop for complex logic and exceptions. This isn't conservatism — it's an empirical observation from production deployments. LLMs handle the high-confidence, well-structured cases well. Edge cases that fall outside the training distribution get escalated to a human review queue rather than silently mishandled.

The Monitoring Stack You'll Need

Adding LLMs to your ETL pipeline creates new observability requirements. The standard metrics (job duration, row counts, error rates) are necessary but not sufficient. You need:

Per-model call latency and token usage. LLM calls are orders of magnitude slower than traditional transformations and have variable latency. Track p50/p95/p99 per pipeline stage. Set timeouts aggressively; model endpoints can hang.

Extraction confidence tracking. If your extraction framework provides confidence scores or structured uncertainty signals, store them. Low-confidence extractions are candidates for human review queues or re-extraction with different prompts.

Validation failure rates by field. Track which fields fail semantic validation most often. Consistently failing fields indicate either bad source data, a bad prompt, or a wrong schema assumption. This is your feedback loop for prompt iteration.
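A sketch of that per-field tracking — the class and threshold are illustrative, and in production the counts would land in your metrics store rather than memory:

```python
from collections import Counter

class FieldFailureTracker:
    """Count semantic-validation failures per field across a run and
    surface the fields that fail most often."""

    def __init__(self):
        self.failures = Counter()
        self.records_seen = 0

    def record(self, failed_fields):
        """Call once per record with the list of fields that failed
        validation (empty list for a clean record)."""
        self.records_seen += 1
        self.failures.update(failed_fields)

    def worst_fields(self, rate_threshold=0.05):
        """Fields whose failure rate exceeds the threshold, worst first."""
        return [(field, count / self.records_seen)
                for field, count in self.failures.most_common()
                if count / self.records_seen > rate_threshold]
```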

Model version tracking. Record which model version produced which output. When you upgrade from one model to another, you need to know which outputs were produced by which model to understand any regressions.

OpenTelemetry traces work well here: each LLM call becomes a span, with the prompt, response, latency, and token count as attributes. Tools like ZenML provide end-to-end lineage that links production errors back to specific source data and model calls. This lineage becomes critical when you're debugging a transformation that worked for months and suddenly started producing wrong results.

Getting the Balance Right

The teams that have shipped AI-augmented ETL pipelines successfully share a common pattern: they treated the LLM as an untrusted external service that produces probabilistic outputs, not as a reliable deterministic function. They built validation layers before they needed them. They ran LLM transformations in shadow mode — alongside existing pipelines — before cutting over. They defined what "wrong" looks like in their domain before relying on the model to be right.

The teams that got burned treated LLM outputs as trusted, tested structural validity and assumed semantic correctness, and discovered the failure mode three weeks later when a downstream consumer noticed the numbers were off.

LLMs are genuinely useful in ETL pipelines. The edge case handling, schema inference, and unstructured-to-structured extraction capabilities are real improvements over pure rule-based approaches. But the failure mode — high confidence, no error signal, structurally valid wrong output — means that the validation, sandboxing, and monitoring infrastructure is not optional. It's the difference between a system that's useful and one that quietly degrades your data warehouse for months before anyone notices.

Build the validation layer first. Add the LLM second.
