LLM-Powered Data Migrations: What Actually Works at Scale
The pitch is compelling: feed your legacy records into an LLM, describe the target schema, and let the model figure out the mapping. No hand-written parsers, no months of transformation logic, no domain expert bottlenecks. Teams have run this and gotten to 70–97% accuracy in a fraction of the time it would take traditional ETL. The problem is that the remaining 3–30% of failures don't look like failures. They look like correct data.
That asymmetry—where wrong outputs are structurally valid and plausible—is what makes LLM-powered data migrations genuinely dangerous without the right validation architecture. This post covers what the teams that have done this successfully actually built: when LLMs earn their place in the pipeline, where they silently break, and the validation layer that catches errors traditional tools cannot.
The Case for LLMs in Data Migration Is Real
Traditional ETL pipelines are deterministic and fast once written, but they require someone to first understand every source format variation, every implicit encoding decision, every field that means different things in different contexts. For migrations of legacy systems that predate any schema discipline—scanned documents, free-text CRM notes, heterogeneous vendor feeds—that understanding can take months.
LLMs compress this dramatically. Where a human analyst has to read 10,000 sample records to enumerate the format variations in a "phone number" field, an LLM can handle them in a single pass. Where writing a parser to normalize addresses across 15 international formats would take weeks, a well-prompted model handles it in an afternoon. Airbnb migrated 3,500 test files from one framework to another and reached 97% completion in six weeks—a task that would have taken far longer through manual rewrites.
The pattern that works is a hybrid: use LLMs for the semantically hard work (extraction from ambiguous inputs, normalization across heterogeneous sources, intent-based field mapping) and keep traditional ETL for the deterministic work (filtering, aggregation, numeric calculations, referential integrity enforcement). LLMs don't replace ETL; they handle the cases ETL can't write rules for.
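The split can be sketched in a few lines. This is a minimal illustration, not from any of the case studies: the field names and the `llm_normalize_address` stub are invented, and the stub stands in for a real model API call.

```python
from datetime import datetime, timezone

def llm_normalize_address(raw: str) -> str:
    # Stand-in for a model call; a real pipeline would invoke an LLM API
    # here for the semantically hard normalization work.
    return " ".join(raw.split()).title()

def unix_to_iso(ts: int) -> str:
    # Deterministic work stays in plain code: cheap, auditable, reproducible.
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()

def migrate_record(record: dict) -> dict:
    return {
        # Rule-based transform for the unambiguous field.
        "created_at": unix_to_iso(record["created_ts"]),
        # The LLM is reserved for the field ETL can't write rules for.
        "address": llm_normalize_address(record["address_freetext"]),
    }
```

The point of the structure is that each record's deterministic fields never touch the model, so their outputs stay bit-for-bit reproducible.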
The Silent Failure Mode You Need to Know About
Here is where teams get burned. LLM errors in data transformation are not random noise that statistical tests catch. They are domain-specific plausibility errors: outputs that pass format validation, pass type checking, pass row counts, and are wrong.
A model migrating financial records might correctly identify that a field is a "rate" and output a decimal value—but silently confuse an annual rate with a daily rate, or misinterpret how a legacy system encoded signed values. The output is a number. The type is correct. The downstream system ingests it without complaint. Three months later, a report is off.
A healthcare migration might normalize a clinical term correctly for 99.8% of records and then, on records with ambiguous abbreviations, substitute a semantically adjacent but clinically distinct term. Row counts match. Schema validation passes. A statistician running an analysis later draws the wrong conclusion from a distribution that has been subtly shifted.
This failure pattern has a name in the research literature: interpretive overconfidence. Models don't make things up wholesale; they add unsupported characterizations, transform attributed statements into general claims, and apply plausible but incorrect domain-specific interpretations. These errors evade linting, parsing, and structural validation. The only thing that catches them is semantic validation against actual business rules.
Researchers at MIT found that LLMs reason from syntactic patterns rather than genuine semantic understanding—a model can answer correctly about a concept while failing on a structurally identical but semantically distinct query. In data migration, this means a model trained on finance data might handle standard cases correctly and then apply the wrong logic to edge cases it hasn't seen, without any signal that it is doing so.
The Validation Architecture That Actually Works
The teams getting to production safely are running at least three layers of validation, not one.
Structural validation is the baseline: row counts, type checking, null constraints, referential integrity. This is table stakes and catches format errors. It does not catch semantic errors.
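A baseline structural check might look like the following sketch (the schema encoding is simplified; a real pipeline would likely use a schema library rather than a hand-rolled dict of types):

```python
def structural_check(rows: list[dict], schema: dict, source_count: int) -> list[str]:
    """Row counts, required fields, type conformance.
    Catches format errors only; a plausible-but-wrong value sails through."""
    errors = []
    if len(rows) != source_count:
        errors.append(f"row count mismatch: {len(rows)} vs {source_count}")
    for i, row in enumerate(rows):
        for field, expected_type in schema.items():
            if field not in row or row[field] is None:
                errors.append(f"row {i}: missing or null {field}")
            elif not isinstance(row[field], expected_type):
                errors.append(f"row {i}: {field} is not {expected_type.__name__}")
    return errors
```

Note what this cannot do: a daily rate written where an annual rate belongs is a well-typed float, and this layer passes it without complaint.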
Spot-check sampling with semantic review is the layer most teams underinvest in. The key is that samples must be reviewed against business rules, not just schema conformance. A random sample of 10% of records needs to be evaluated by someone who understands what the data should mean, not just what format it should be in. Airbnb's "sample, tune, and sweep" loop—re-run all failures, pick 5–10 representative cases, update prompts, validate fixes, sweep again—is a useful model. The iteration is doing semantic work, not just fixing format errors.
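One small but useful detail is making the review sample reproducible, so a second reviewer (or a later audit) sees the same records. A seeded-sample sketch, with the 10% fraction as an illustrative default:

```python
import random

def sample_for_review(records: list[dict], fraction: float = 0.10,
                      seed: int = 42) -> list[dict]:
    """Draw a reproducible random sample for semantic review against
    business rules, not just schema conformance."""
    rng = random.Random(seed)  # fixed seed: same sample every run
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)
```

In practice teams often stratify this sample toward known-hard categories (ambiguous abbreviations, rare source formats) rather than sampling uniformly.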
Statistical distribution checks are what catch the subtle population-level shifts that per-record review misses. Before and after migration, the distributions of key fields should match. Value frequencies, numeric ranges, co-occurrence patterns between fields—if the migration is semantically correct, the statistical fingerprint of the data should be preserved (or changed only in ways the migration intentionally changed it). Automated diff tools that cross-validate source and target at the statistical level catch errors that row-by-row sampling would take forever to find.
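For categorical fields, one simple distribution check is total variation distance between the value-frequency distributions before and after migration. This is a sketch of the idea, not a complete diffing tool; numeric fields would need range and quantile comparisons on top of it:

```python
from collections import Counter

def distribution_drift(source_vals, target_vals) -> float:
    """Total variation distance between the value-frequency distributions
    of a field before and after migration. 0.0 means identical frequencies;
    1.0 means completely disjoint."""
    s, t = Counter(source_vals), Counter(target_vals)
    ns, nt = sum(s.values()), sum(t.values())
    keys = set(s) | set(t)
    return 0.5 * sum(abs(s[k] / ns - t[k] / nt) for k in keys)
```

A per-field drift score above a pre-agreed threshold flags the field for investigation: this is exactly the population-level shift (the clinically adjacent term substituted on 0.2% of records) that per-record sampling would take forever to find.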
Multi-pass consistency checks are increasingly practical as inference costs fall. Run the same transformation twice with different prompts and compare outputs. Disagreement flags records for human review. This is computationally expensive, but in high-stakes migrations (financial data, medical records, legal documents) it is cheaper than a data quality incident.
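The comparison logic itself is trivial; the value is in the independence of the two passes. A sketch, with the two transforms passed in as callables (in a real pipeline these would be two differently-prompted model calls):

```python
def flag_disagreements(records, transform_a, transform_b):
    """Run two independent transformation passes over the same records.
    Any record where the outputs differ is queued for human review."""
    flagged = []
    for rec in records:
        a, b = transform_a(rec), transform_b(rec)
        if a != b:
            flagged.append({"input": rec, "pass_a": a, "pass_b": b})
    return flagged
```

Agreement between two passes is not proof of correctness (both prompts can share a blind spot), but disagreement is a cheap, high-precision signal of ambiguity worth a human's time.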
Schema-Based Prompting Over Natural Language
The single most impactful implementation choice in LLM-powered data transformation is how you specify the mapping. Teams that describe transformations in natural language ("convert the address fields to standard format") get inconsistent results and brittle pipelines. Teams that provide JSON schemas with explicit type definitions, enum constraints, required fields, and null handling get results that are more predictable, more portable across model providers, and easier to validate.
Schema-based prompting activates the model's training on structured data and code rather than on prose. It removes ambiguity at the definition level rather than trying to recover from it in post-processing. It also integrates directly with schema validation libraries—your validation layer can mechanically check the output against the same schema you gave the model, rather than having to write custom validation for each field.
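The mechanical-validation half of this looks roughly like the sketch below. The schema encoding here is deliberately simplified; in practice you would send a full JSON Schema in the prompt and validate the model's output with a library such as `jsonschema`, and the field names are invented for illustration:

```python
import json

# The same definition goes into the prompt and into the validator,
# so the model and the checker can never drift apart.
CUSTOMER_SCHEMA = {
    "required": ["name", "tier", "signup_date"],
    "types": {"name": str, "tier": str, "signup_date": str},
    "enums": {"tier": ["free", "pro", "enterprise"]},
}

def validate_output(raw_json: str, schema: dict) -> list[str]:
    """Mechanically check model output against the prompt schema."""
    obj = json.loads(raw_json)
    errors = []
    for f in schema["required"]:
        if f not in obj:
            errors.append(f"missing field: {f}")
    for f, t in schema["types"].items():
        if f in obj and not isinstance(obj[f], t):
            errors.append(f"{f}: expected {t.__name__}")
    for f, allowed in schema["enums"].items():
        if f in obj and obj[f] not in allowed:
            errors.append(f"{f}: {obj[f]!r} not in {allowed}")
    return errors
```

The enum check is the quietly important one: it converts a plausibility error ("premium" for a tier that doesn't exist) into a hard, catchable failure at the definition level.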
The tradeoff is token cost. Specifying a full JSON schema is verbose. At migration scale this adds up. The practical answer is to use schema-based prompting for fields that matter most for business logic and accuracy, and natural language for low-stakes descriptive fields where the cost of errors is low.
Handling the Accuracy vs. Auditability Conflict
For regulated industries—finance, healthcare, legal—there is a structural tension between LLM-powered migration and compliance requirements. Traditional ETL is deterministic: the same input always produces the same output, and you can reproduce any transformation years later. LLMs are not. Prompt changes, model updates, and non-zero temperature all mean that re-running a migration pipeline can produce different outputs.
This matters for compliance in two ways. First, audit trails need not only to record what happened but to make it reproducible, and "we used GPT-4o in March 2025" is not sufficient to reproduce results in November 2025 when the model has been updated. Second, regulators increasingly require documentation of the logic that produced data, and "an LLM decided" is not documentation.
The resolution is to treat LLM pipelines as code rather than as services. Version the prompt, the model version, and every configuration parameter as artifacts. For compliance-sensitive migrations, freeze the model version (use a pinned snapshot, not a live endpoint that updates). Document the transformation logic as executable specifications—if the LLM is making a choice that needs to be justifiable, that choice should be encoded in post-processing rules that can be read, reviewed, and reproduced.
Separate the development iteration (where you optimize prompts for accuracy) from the production compliance run (where the configuration is frozen and every input, output, and intermediate state is logged). What ships to compliance-sensitive production should be more like a compiled artifact than a live inference request.
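Treating the pipeline as code can be as simple as freezing every run-relevant input into a content-addressed artifact. A sketch (the snapshot name is illustrative; the point is pinning a dated snapshot rather than a live alias):

```python
import hashlib
import json

def freeze_run_config(prompt: str, model_snapshot: str, params: dict) -> dict:
    """Capture everything needed to audit (and, as far as the provider
    allows, reproduce) a production migration run."""
    config = {
        "model": model_snapshot,   # pinned snapshot, never a live alias
        "prompt": prompt,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "params": params,          # e.g. temperature=0, max_tokens, ...
    }
    # Content hash over the whole config: identical inputs -> identical id,
    # so any silent change to prompt or params is detectable.
    config["config_id"] = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:12]
    return config
```

Logged alongside every input, output, and intermediate state, the `config_id` gives the compliance run the "compiled artifact" property: two runs with the same id are, by construction, the same configuration.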
What the Benchmarks Actually Show
The headline numbers from vendor case studies—102x faster processing, 80% reduction in migration timelines, 70% out-of-the-box accuracy—are real, but they require context to interpret.
The 70% out-of-the-box accuracy figure from multiple implementations means 30% of records need human review or iterative prompt refinement before the migration is production-safe. That is not a criticism; traditional ETL on heterogeneous legacy data often starts lower. But it does mean LLM-powered migration is not a single-pass process. You are signing up for an iteration loop.
The 97% figure Airbnb reached is after six weeks of iteration, including targeted human intervention on the hardest cases. The 3% that remained required engineers who understood the domain. For structured, well-documented source data, you can reach this faster. For complex legacy systems with decades of undocumented encoding decisions, six weeks may be optimistic.
A recent benchmark of LLM agents on end-to-end ELT pipeline generation found a 96% failure rate on complex multi-step tasks. This is not a contradiction—LLMs work well as components in data migration pipelines, especially for specific transformation tasks. They struggle when asked to autonomously generate complete, correct pipelines without human checkpoints. The lesson is to keep humans in the loop on the high-complexity steps, not to automate them out.
The Migration You Should Not Do With LLMs
There are data migrations where LLMs add cost without adding value, and where the risk profile is wrong.
Deterministic transformations—converting a Unix timestamp to ISO 8601, summing two numeric fields, joining records on a key—have no ambiguity for an LLM to resolve. A rule-based transform is faster, cheaper, and easier to audit. Introducing LLM inference here adds latency, cost, and a surface for plausibility errors.
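For contrast with the hybrid cases earlier, here is what one of those transforms looks like as plain code (field names invented for illustration). There is exactly one correct answer, so there is nothing for a model to resolve:

```python
def join_orders_to_customers(orders: list[dict], customers: list[dict]) -> list[dict]:
    # A key-based join is fully deterministic: faster, cheaper, and
    # trivially auditable compared to routing it through model inference.
    name_by_id = {c["id"]: c["name"] for c in customers}
    return [{**o, "customer_name": name_by_id[o["customer_id"]]} for o in orders]
```

The auditability argument is the decisive one: a reviewer can verify these four lines are correct by reading them, which is never true of a prompt.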
Migrations where the error cost is catastrophic and the records cannot be externally validated—some financial audit trails, certain medical records—require a different risk posture. The asymmetric error profile (errors that look correct) means you need either extremely high coverage human review or a strong independent validation source. If you cannot build that, the efficiency gains from LLM-powered migration do not justify the exposure.
Practical Starting Points
For teams beginning this work: start with the fields that are hardest to parse manually and lowest in consequence if wrong. Build your semantic validation layer before you run your first LLM pass—validating after the fact is harder than designing checks into the pipeline from the start. Define go/no-go criteria as quantitative thresholds (error rate below 0.01%, statistical distribution within X% of source) before the migration starts, not after you have invested in a run you want to ship.
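Encoding the go/no-go criteria as an executable gate keeps the decision mechanical once the run is done. A sketch, with threshold values that are illustrative placeholders (the thresholds in the text), not recommendations:

```python
def go_no_go(sampled_error_rate: float, max_field_drift: float,
             error_threshold: float = 1e-4,      # "error rate below 0.01%"
             drift_threshold: float = 0.02) -> bool:
    """Ship decision from pre-agreed quantitative thresholds.
    Defined before the migration runs, so a run you've invested in
    can't argue its way past them afterward."""
    return (sampled_error_rate <= error_threshold
            and max_field_drift <= drift_threshold)
```

The inputs would come from the validation layers described above: the error rate from semantic spot-check review, the drift figure from the per-field distribution checks.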
The schema-based prompting, the hybrid architecture, and the semantic validation layer are not optional quality improvements. They are the difference between a migration that ships and one that ships and quietly corrupts your data for six months before anyone notices.
LLM-powered data migration is genuinely useful for the hard cases where traditional ETL fails or takes too long. The teams making it work are treating it as an engineering problem—with defined success criteria, layered validation, and controlled iteration—not as a prompt-and-ship automation. The technology earns its place in the pipeline when the validation architecture is in place to catch what it gets wrong.
- https://airbnb.tech/infrastructure/accelerating-large-scale-test-migration-with-llms/
- https://www.amazon.science/blog/lightweight-llm-for-converting-text-to-structured-data
- https://arxiv.org/html/2504.04808v2
- https://www.datafold.com/blog/modern-data-migration-framework/
- https://www.griddynamics.com/blog/genai-enterprise-data-migration
- https://opper.ai/blog/schema-based-prompting
- https://www.cloverdx.com/blog/using-llms-in-etl-pipelines-production-scale-best-practices
- https://www.nature.com/articles/s41586-024-07421-0
- https://arxiv.org/html/2511.12288
- https://lakefs.io/blog/llm-compliance/
