When LLMs Beat Rule-Based Systems for Data Normalization (And When They Don't)
A team I know spent three months building a rule-based address normalizer. It handled the top twenty formats, used a USPS API for verification, and worked great on the data they'd seen. Then they got a new enterprise customer. The first week of data had addresses embedded in freeform notes fields, postal codes missing country prefixes, and cross-border formats their rules had never seen. The normalizer failed silently on 31% of records. They threw an LLM at it as a quick fix, expecting 80% accuracy. They got 94%. The surprise wasn't that the LLM worked — it was that nothing in their evaluation framework had predicted this.
This is the shape of the problem. Rule-based normalization is predictable, fast, and cheap. It works well when the data distribution stays in-bounds. LLMs handle the long tail — the weird formats, the implicit domain knowledge, the edge cases that rules never enumerate. But LLMs are also expensive, slow, and inconsistent in ways that break production pipelines if you're not careful. The right answer, for almost every team, is a hybrid that uses each approach on the inputs it's actually good at.
The Performance Reality: What the Benchmarks Show
On structured normalization tasks, large models perform surprisingly well. Benchmarked on product attribute normalization tasks — string wrangling, name expansion, unit conversion, category assignment — GPT-4 achieves around 91% F1 on attribute extraction and normalization overall. String manipulation lands around 95% F1. Name expansion reaches 98% F1. The weakest category is unit of measurement conversion at 84%, which reflects a general LLM weakness with calculation-heavy reasoning.
For entity resolution (detecting whether two records refer to the same real-world entity), production benchmarks show:
- GPT-4.1: ~81% accuracy on entity matching tasks
- Llama-4-maverick: ~80% accuracy
- Smaller/cheaper alternatives (GPT-4o-mini, Llama-4-scout): 62–65% accuracy
The 16–19 percentage point gap between frontier and budget models matters here. What matters less than most teams expect: retrieval depth. Increasing the number of candidate pairs retrieved from 5 to 10 improved accuracy by less than 1 percentage point in multiple benchmarks. This means if you're trying to improve entity resolution accuracy, switching models moves the needle by 16–19 points while tuning your vector database moves it by less than 1 — roughly an order of magnitude more impact.
For comparison, rule-based matchers on clean structured data (same format, controlled vocabulary) typically exceed 99% recall at low false-positive rates — but that performance collapses when data quality degrades or new formats appear.
When LLMs Win
LLMs beat rule-based systems in a predictable set of scenarios:
Unstructured or semi-structured input. Addresses embedded in paragraphs, product descriptions with implicit attributes, legacy code values without documentation. Rules require explicit patterns; LLMs bring language understanding that generalizes across formats.
High format variance with low volume. If you're normalizing 50,000 records a month with a dozen different address formats, writing and maintaining rules for all of them costs more than the LLM calls. At this scale, a single LLM batch run often costs under $50.
Domain knowledge that's implicit. "NYC" → "New York City" is trivial. "KOA" → "Koa wood" in a furniture catalog is not — it requires knowing that wood species abbreviations appear in that product line. LLMs encode this without you having to teach it.
Few-shot adaptability. When new data formats appear — a new supplier's export format, a new market's address conventions — LLMs adapt with 10 examples. Rules require explicit enumeration.
The long tail. Rule-based systems achieve excellent coverage on the 80% of inputs that follow known patterns. LLMs cover the remaining 20% without engineering intervention. That last 20% often represents your highest-risk records: new customers, edge-case products, outlier geographies.
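The few-shot adaptability point is easy to make concrete. A minimal sketch of assembling such a prompt, assuming a chat-style API that accepts a single prompt string; the example addresses and field names here are hypothetical placeholders, not from any real pipeline:

```python
import json

# Sketch: building a few-shot normalization prompt when a new
# supplier format appears. Swap in examples from the new format
# and send the result to whichever chat API you actually use.

FEW_SHOT_EXAMPLES = [
    ("123 main street apt 4, nyc 10001",
     {"street": "123 Main St", "unit": "Apt 4", "city": "New York", "zip": "10001"}),
    ("PO BOX 99, Springfield IL, 62701",
     {"street": "PO Box 99", "unit": None, "city": "Springfield", "zip": "62701"}),
]

def build_prompt(raw_address: str) -> str:
    """Assemble instruction + worked examples + the query to normalize."""
    lines = ["Normalize the address into JSON with keys street, unit, city, zip.", ""]
    for raw, normalized in FEW_SHOT_EXAMPLES:
        lines.append(f"Input: {raw}")
        lines.append(f"Output: {json.dumps(normalized)}")
        lines.append("")
    lines.append(f"Input: {raw_address}")
    lines.append("Output:")
    return "\n".join(lines)
```

Adding support for a new format is appending a couple of tuples to `FEW_SHOT_EXAMPLES`, not writing a new parser.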
When Rule-Based Systems Win
Rules are not obsolete. For a large class of problems, they're the right answer:
Real-time, high-throughput normalization. A single LLM call takes on the order of a second per record, often more. Rule-based systems process thousands per second. Fraud detection, order processing, session-time enrichment — anything that blocks a user transaction — cannot absorb LLM latency.
Precision-critical applications. Credit bureaus, medical record deduplication, and identity matching require false positive rates below 0.1%. LLMs produce inconsistent results on semantically identical queries phrased slightly differently, which breaks the precision guarantees these systems require. "Is John Smith at 123 Main St the same as J. Smith at 123 Main Street?" yields different confidence levels depending on how you phrase the question. That inconsistency is not acceptable in regulated identity matching.
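For contrast, the deterministic version of that street-suffix comparison is a few lines of explicit rules: the same inputs always produce the same answer, and every answer traces back to a specific table entry. The suffix vocabulary below is a tiny illustrative subset, and it deliberately does not attempt name-initial matching:

```python
# Normalize a small, explicit suffix vocabulary, then compare
# exactly. Every decision is reproducible and auditable.

SUFFIXES = {"street": "st", "avenue": "ave", "boulevard": "blvd", "road": "rd"}

def normalize(addr: str) -> str:
    """Lowercase, strip punctuation, and canonicalize known suffixes."""
    tokens = addr.lower().replace(".", "").replace(",", "").split()
    return " ".join(SUFFIXES.get(t, t) for t in tokens)

def same_address(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)
```

`same_address("123 Main St.", "123 Main Street")` returns `True`, and will keep returning `True` no matter how many times you ask.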
Audit trails. "The LLM said so" is not an audit-compliant answer in finance, healthcare, or legal contexts. Rules produce deterministic decisions you can trace and explain. This matters when a data processing decision needs to be reviewed by a human auditor.
Mathematical normalization. Unit conversions, quantity calculations, derived fields. LLMs achieve 84% on unit conversion benchmarks and hallucinate arithmetic. Use deterministic code for math.
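A sketch of what "deterministic code for math" looks like in practice. The conversion factors are exact constants, so the result can never be hallucinated; the factor table here is a small illustrative subset:

```python
# Deterministic unit conversion to a canonical base unit (grams).
# Unknown units fail loudly instead of being guessed at.

TO_GRAMS = {"g": 1.0, "kg": 1000.0, "oz": 28.349523125, "lb": 453.59237}

def to_grams(value: float, unit: str) -> float:
    try:
        return value * TO_GRAMS[unit.lower()]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}")
```

The explicit `ValueError` on unknown units is the point: a rule-based converter refuses rather than improvises, which is exactly the behavior you want in a quantity field.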
The Cost Trap Most Teams Fall Into
The failure mode plays out like this: a team prototypes LLM normalization, it works great on their 10,000-record test set, they approve it for production, and six weeks later the infrastructure bill has tripled.
The math is simple and brutal. A product record with a 100-word description, processed through a prompt that includes schema context and instructions, easily uses 500 tokens. At 1 million records, that's 500 million tokens. At $5 per million tokens, that's $2,500 — just for input tokens. Add output tokens, retry overhead, and multiple API calls for validation, and a naive LLM-based normalization pipeline for a medium-sized e-commerce catalog can cost tens of thousands of dollars per run.
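That arithmetic is worth keeping as a function you can rerun with your own record counts and prices. The $5-per-million figure is just the example rate used above; actual prices vary by provider and model:

```python
# Back-of-envelope input-token cost for an LLM normalization run.

def input_token_cost(records: int, tokens_per_record: int,
                     usd_per_million_tokens: float) -> float:
    """USD cost of input tokens only (output, retries, validation extra)."""
    total_tokens = records * tokens_per_record
    return total_tokens / 1_000_000 * usd_per_million_tokens

cost = input_token_cost(1_000_000, 500, 5.0)  # the worked example: $2,500
```

Plugging in output tokens and retry multipliers on top of this baseline is how the "tens of thousands per run" figure materializes for large catalogs.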
The optimization levers that actually work:
Batch processing. OpenAI batch API charges 50% of real-time pricing. Anthropic batch processing is also discounted at roughly 50% of standard rates. Most normalization work is not latency-sensitive. Queuing records for batch processing halves your API costs immediately with zero engineering change to model selection or prompt design.
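In practice, queuing means writing requests to a JSONL file. The shape below follows OpenAI's documented batch request format (one JSON object per line with `custom_id`, `method`, `url`, and `body`); verify the exact fields against the current API docs before relying on them, and treat the model name and prompt as placeholders:

```python
import json

# Sketch: serializing normalization requests as a batch JSONL payload.
# Each record becomes one request line keyed by a custom_id so results
# can be joined back to source records.

def to_batch_jsonl(records: list[dict], model: str = "gpt-4o-mini") -> str:
    lines = []
    for rec in records:
        request = {
            "custom_id": str(rec["id"]),
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {"role": "user",
                     "content": f"Normalize this address: {rec['address']}"},
                ],
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)
```

The file then gets uploaded to the batch endpoint instead of being sent as real-time calls, which is where the 50% discount comes from.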
Selective querying. Use rules (or a cheap embedding similarity score) to classify records by difficulty. Route high-confidence records directly through rule-based normalization. Route only ambiguous records to the LLM. An uncertainty reduction framework that does this typically reduces LLM API calls by 4–5x while maintaining overall accuracy.
Hybrid filtering. Use cheap embedding retrieval to narrow the candidate space, then apply LLM verification only to ambiguous cases. In entity resolution benchmarks, this hybrid approach matches or exceeds pure LLM performance at a fraction of the cost.
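A minimal sketch of the routing idea behind both of those levers. A real pipeline would use embedding similarity for the cheap score; `difflib`'s string ratio stands in for it here, the accept/reject thresholds are illustrative, and pairs routed to `"llm"` are the only ones that would incur an API call:

```python
from difflib import SequenceMatcher

# Route record pairs by a cheap similarity score: confident matches
# and non-matches are decided by rule, only the ambiguous middle
# band goes to the expensive LLM.

def cheap_score(a: str, b: str) -> float:
    """Stand-in for embedding similarity; returns a 0..1 score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def route(pair: tuple[str, str], accept: float = 0.9, reject: float = 0.5) -> str:
    """Return 'match', 'no-match', or 'llm' (needs expensive review)."""
    score = cheap_score(*pair)
    if score >= accept:
        return "match"
    if score <= reject:
        return "no-match"
    return "llm"
```

With thresholds tuned on a labeled sample, most pairs resolve in the first two branches, which is where the 4–5x reduction in LLM calls comes from.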
