LLM-Powered Data Migrations: What Actually Works at Scale

· 10 min read
Tian Pan
Software Engineer

The pitch is compelling: feed your legacy records into an LLM, describe the target schema, and let the model figure out the mapping. No hand-written parsers, no months of transformation logic, no domain expert bottlenecks. Teams have run this and gotten to 70–97% accuracy in a fraction of the time it would take traditional ETL. The problem is that the remaining 3–30% of failures don't look like failures. They look like correct data.

That asymmetry—where wrong outputs are structurally valid and plausible—is what makes LLM-powered data migrations genuinely dangerous without the right validation architecture. This post covers what the teams that have done this successfully actually built: when LLMs earn their place in the pipeline, where they silently break, and the validation layer that catches errors traditional tools cannot.

The Case for LLMs in Data Migration Is Real

Traditional ETL pipelines are deterministic and fast once written, but they require someone to first understand every source format variation, every implicit encoding decision, every field that means different things in different contexts. For migrations of legacy systems that predate any schema discipline—scanned documents, free-text CRM notes, heterogeneous vendor feeds—that understanding can take months.

LLMs compress this dramatically. Where a human analyst has to read 10,000 sample records to enumerate the format variations in a "phone number" field, an LLM can handle them in a single pass. Where writing a parser to normalize addresses across 15 international formats would take weeks, a well-prompted model handles it in an afternoon. Airbnb migrated 3,500 test files from one framework to another and reached 97% completion in six weeks—a task that would have taken far longer through manual rewrites.

The pattern that works is a hybrid: use LLMs for the semantically hard work (extraction from ambiguous inputs, normalization across heterogeneous sources, intent-based field mapping) and keep traditional ETL for the deterministic work (filtering, aggregation, numeric calculations, referential integrity enforcement). LLMs don't replace ETL; they handle the cases ETL can't write rules for.
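As a minimal sketch of that split, the pipeline below delegates one semantically messy field to a model and keeps validation deterministic. The function names and the E.164 phone rule are illustrative assumptions, and `llm` stands in for whatever model client you use.

```python
def normalize_phone_llm(raw: str, llm) -> str:
    """Semantic step: delegate messy format variants to the model."""
    return llm(f"Normalize this phone number to E.164: {raw}")

def validate_record(record: dict) -> list[str]:
    """Deterministic step: rule-based checks traditional ETL can express."""
    errors = []
    if not record.get("id"):
        errors.append("missing id")
    phone = record.get("phone", "")
    if not (phone.startswith("+") and phone[1:].isdigit()):
        errors.append(f"phone not E.164: {phone!r}")
    return errors

def migrate(records, llm):
    """LLM handles extraction; deterministic checks gate what passes through."""
    clean, flagged = [], []
    for r in records:
        out = dict(r, phone=normalize_phone_llm(r.get("phone", ""), llm))
        (clean if not validate_record(out) else flagged).append(out)
    return clean, flagged

# Usage with a fake model client, for illustration only:
fake_llm = lambda prompt: "+14155550123"
clean, flagged = migrate([{"id": "a1", "phone": "(415) 555-0123"}], fake_llm)
```

The point of the shape is that the LLM's output never flows downstream unchecked; every record passes through rules a human wrote.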

The Silent Failure Mode You Need to Know About

Here is where teams get burned. LLM errors in data transformation are not random noise that statistical tests catch. They are domain-specific plausibility errors: outputs that pass format validation, pass type checking, pass row counts, and are wrong.

A model migrating financial records might correctly identify that a field is a "rate" and output a decimal value—but silently confuse an annual rate with a daily rate, or misinterpret how a legacy system encoded signed values. The output is a number. The type is correct. The downstream system ingests it without complaint. Three months later, a report is off.

A healthcare migration might normalize a clinical term correctly for 99.8% of records and then, on records with ambiguous abbreviations, substitute a semantically adjacent but clinically distinct term. Row counts match. Schema validation passes. A statistician running an analysis later draws the wrong conclusion from a distribution that has been subtly shifted.

This failure pattern has a name in the research literature: interpretive overconfidence. Models don't make things up wholesale; they add unsupported characterizations, transform attributed statements into general claims, and apply plausible but incorrect domain-specific interpretations. These errors evade linting, parsing, and structural validation. The only thing that catches them is semantic validation against actual business rules.

Researchers at MIT found that LLMs reason from syntactic patterns rather than genuine semantic understanding—a model can answer correctly about a concept while failing on a structurally identical but semantically distinct query. In data migration, this means a model trained on finance data might handle standard cases correctly and then apply the wrong logic to edge cases it hasn't seen, without any signal that it is doing so.

The Validation Architecture That Actually Works

The teams getting to production safely are running at least three layers of validation, not one.

Structural validation is the baseline: row counts, type checking, null constraints, referential integrity. This is table stakes and catches format errors. It does not catch semantic errors.
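A sketch of that baseline layer, with assumed field names, also shows its limit: a semantically wrong value of the right type sails through.

```python
def structural_checks(source_rows, target_rows, required, types):
    """Baseline layer: row counts, null constraints, type checks."""
    issues = []
    if len(source_rows) != len(target_rows):
        issues.append(f"row count: {len(source_rows)} -> {len(target_rows)}")
    for i, row in enumerate(target_rows):
        for field in required:
            if row.get(field) is None:
                issues.append(f"row {i}: null {field}")
        for field, expected in types.items():
            val = row.get(field)
            if val is not None and not isinstance(val, expected):
                issues.append(f"row {i}: {field} is {type(val).__name__}")
    return issues

# An annual rate misread as a daily rate is still a float:
source = [{"rate": 0.05}]
target = [{"rate": 18.25}]  # wrong meaning, correct type -- zero issues raised
```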

Spot-check sampling with semantic review is the layer most teams underinvest in. The key is that samples must be reviewed against business rules, not just schema conformance. A random sample of 10% of records needs to be evaluated by someone who understands what the data should mean, not just what format it should be in. Airbnb's "sample, tune, and sweep" loop—run all failures, pick 5–10 representative cases, update prompts, validate fixes, sweep again—is a useful model. The iteration is doing semantic work, not just fixing format errors.
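One way to draw that sample, sketched below with stdlib tooling: seed the draw so the review set is reproducible, and stratify by a field likely to hide edge cases so rare variants aren't missed. The stratification key is an assumption you'd pick per migration.

```python
import random

def review_sample(records, stratify_key, frac=0.10, seed=42):
    """Draw a reproducible, stratified sample for semantic review."""
    rng = random.Random(seed)  # seeded: reviewers can re-derive the same set
    by_stratum = {}
    for r in records:
        by_stratum.setdefault(r.get(stratify_key), []).append(r)
    sample = []
    for group in by_stratum.values():
        k = max(1, round(len(group) * frac))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample
```

The sample then goes to a reviewer armed with business rules, not a schema; the code only decides *which* records get human eyes.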

Statistical distribution checks are what catch the subtle population-level shifts that per-record review misses. Before and after migration, the distributions of key fields should match. Value frequencies, numeric ranges, co-occurrence patterns between fields—if the migration is semantically correct, the statistical fingerprint of the data should be preserved (or changed only in ways the migration intentionally changed it). Automated diff tools that cross-validate source and target at the statistical level catch errors that row-by-row sampling would take forever to find.
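A dependency-free sketch of such a check: compare per-field value frequencies before and after and flag fields whose fingerprint shifted beyond a tolerance. Total variation distance is used here to stay stdlib-only; a chi-square or two-sample KS test (e.g. `scipy.stats.ks_2samp`) is the usual upgrade, and the threshold is a tuning assumption.

```python
from collections import Counter

def tv_distance(before, after):
    """Total variation distance between two empirical distributions."""
    p, q = Counter(before), Counter(after)
    n, m = len(before), len(after)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[v] / n - q[v] / m) for v in support)

def drifted_fields(src_rows, tgt_rows, fields, threshold=0.05):
    """Flag fields whose value distribution shifted beyond the tolerance."""
    flagged = {}
    for f in fields:
        d = tv_distance([r.get(f) for r in src_rows],
                        [r.get(f) for r in tgt_rows])
        if d > threshold:
            flagged[f] = round(d, 3)
    return flagged
```

A per-record reviewer sampling 10% could easily miss a 2% category shift; the distribution diff surfaces it in one pass over both datasets.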

Multi-pass consistency checks are increasingly practical as inference costs fall. Run the same transformation twice with differently worded prompts and compare outputs. Disagreement flags records for human review. This doubles inference cost, but in high-stakes migrations—financial data, medical records, legal documents—it is cheaper than a data quality incident.
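The mechanics are simple enough to sketch; `transform` below stands in for your model call, and the toy behavior in the usage example only illustrates how an ambiguous input diverges across prompt wordings.

```python
def consistency_check(records, transform, prompt_a, prompt_b):
    """Run the transform under two prompt wordings; escalate disagreements."""
    agreed, disputed = [], []
    for r in records:
        out_a = transform(prompt_a, r)
        out_b = transform(prompt_b, r)
        target = agreed if out_a == out_b else disputed
        target.append((r, out_a, out_b))
    return agreed, disputed

# Toy stand-in for an LLM: ambiguous codes diverge across prompt versions.
def fake_transform(prompt, record):
    if record["code"] == "amb":
        return "A" if "version-a" in prompt else "B"
    return record["code"].upper()

agreed, disputed = consistency_check(
    [{"code": "ok"}, {"code": "amb"}],
    fake_transform,
    "version-a: map the code",
    "version-b: map the code",
)
```

Only the disputed list goes to humans, which is what keeps the review load proportional to actual ambiguity rather than to dataset size.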

Schema-Based Prompting Over Natural Language

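The idea, as a sketch: instead of describing the target format in free prose, give the model an explicit schema and validate every response against it, so "well-formed" is machine-checkable rather than a matter of prompt interpretation. The field names below and the hand-rolled validator are illustrative assumptions; in practice a library such as `jsonschema` does the checking.

```python
import json

# Illustrative target schema (JSON Schema-style); field names are assumed.
TARGET_SCHEMA = {
    "type": "object",
    "required": ["account_id", "rate_annual_pct"],
    "properties": {
        "account_id": {"type": "string"},
        "rate_annual_pct": {"type": "number", "minimum": 0, "maximum": 100},
    },
}

def validate_against_schema(payload: str, schema: dict) -> list[str]:
    """Tiny hand-rolled validator; jsonschema.validate() is the usual tool."""
    try:
        obj = json.loads(payload)
    except json.JSONDecodeError as e:
        return [f"not JSON: {e}"]
    errors = []
    for field in schema["required"]:
        if field not in obj:
            errors.append(f"missing {field}")
    for field, spec in schema["properties"].items():
        if field not in obj:
            continue
        val = obj[field]
        is_num = isinstance(val, (int, float)) and not isinstance(val, bool)
        if spec["type"] == "number" and not is_num:
            errors.append(f"{field}: expected number")
        elif spec["type"] == "string" and not isinstance(val, str):
            errors.append(f"{field}: expected string")
        if is_num:
            if "minimum" in spec and val < spec["minimum"]:
                errors.append(f"{field}: below minimum")
            if "maximum" in spec and val > spec["maximum"]:
                errors.append(f"{field}: above maximum")
    return errors
```

Note how the range constraint on `rate_annual_pct` is exactly the kind of business rule that catches the annual-versus-daily-rate confusion described earlier, which pure type checking waves through.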