When LLMs Beat Rule-Based Systems for Data Normalization (And When They Don't)
A team I know spent three months building a rule-based address normalizer. It handled the top twenty formats, used a USPS API for verification, and worked great on the data they'd seen. Then they got a new enterprise customer. The first week of data had addresses embedded in freeform notes fields, postal codes missing country prefixes, and cross-border formats their rules had never seen. The normalizer failed silently on 31% of records. They threw an LLM at it as a quick fix, expecting 80% accuracy. They got 94%. The surprise wasn't that the LLM worked — it was that nothing in their evaluation framework had predicted this.
This is the shape of the problem. Rule-based normalization is predictable, fast, and cheap. It works well when the data distribution stays in-bounds. LLMs handle the long tail — the weird formats, the implicit domain knowledge, the edge cases that rules never enumerate. But LLMs are also expensive, slow, and inconsistent in ways that break production pipelines if you're not careful. The right answer, for almost every team, is a hybrid that uses each approach on the inputs it's actually good at.
The Performance Reality: What the Benchmarks Show
On structured normalization tasks, large models perform surprisingly well. Benchmarked on product attribute normalization tasks — string wrangling, name expansion, unit conversion, category assignment — GPT-4 achieves around 91% F1 on attribute extraction and normalization overall. String manipulation lands around 95% F1. Name expansion reaches 98% F1. The weakest category is unit of measurement conversion at 84%, which reflects a general LLM weakness with calculation-heavy reasoning.
For entity resolution (detecting whether two records refer to the same real-world entity), production benchmarks show:
- GPT-4.1: ~81% accuracy on entity matching tasks
- Llama-4-maverick: ~80% accuracy
- Smaller/cheaper alternatives (GPT-4o-mini, Llama-4-scout): 62–65% accuracy
The 16–19 percentage point gap between frontier and budget models matters here. What matters less than most teams expect: retrieval depth. Increasing the number of candidate pairs retrieved from 5 to 10 improved accuracy by less than 1 percentage point in multiple benchmarks. If you're trying to improve entity resolution accuracy, switching models delivers far more impact than tuning retrieval depth in your vector database.
For comparison, rule-based matchers on clean structured data (same format, controlled vocabulary) typically exceed 99% recall at low false-positive rates — but that performance collapses when data quality degrades or new formats appear.
When LLMs Win
LLMs beat rule-based systems in a predictable set of scenarios:
Unstructured or semi-structured input. Addresses embedded in paragraphs, product descriptions with implicit attributes, legacy code values without documentation. Rules require explicit patterns; LLMs bring language understanding that generalizes across formats.
High format variance with low volume. If you're normalizing 50,000 records a month with a dozen different address formats, writing and maintaining rules for all of them costs more than the LLM calls. At this scale, a single LLM batch run often costs under $50.
Domain knowledge that's implicit. "NYC" → "New York City" is trivial. "KOA" → "Koa wood" in a furniture catalog is not — it requires knowing that wood species abbreviations appear in that product line. LLMs encode this without you having to teach it.
Few-shot adaptability. When new data formats appear — a new supplier's export format, a new market's address conventions — LLMs adapt with 10 examples. Rules require explicit enumeration.
The long tail. Rule-based systems achieve excellent coverage on the 80% of inputs that follow known patterns. LLMs cover the remaining 20% without engineering intervention. That last 20% often represents your highest-risk records: new customers, edge-case products, outlier geographies.
When Rule-Based Systems Win
Rules are not obsolete. For a large class of problems, they're the right answer:
Real-time, high-throughput normalization. LLM processing takes seconds per record at best. Rule-based systems process thousands per second. Fraud detection, order processing, session-time enrichment — anything that blocks a user transaction — cannot absorb LLM latency.
Precision-critical applications. Credit bureaus, medical record deduplication, and identity matching require false positive rates below 0.1%. LLMs produce inconsistent results on semantically identical queries phrased slightly differently, which breaks the precision guarantees these systems require. "Is John Smith at 123 Main St the same as J. Smith at 123 Main Street?" yields different confidence levels depending on how you phrase the question. That inconsistency is not acceptable in regulated identity matching.
Audit trails. "The LLM said so" is not an audit-compliant answer in finance, healthcare, or legal contexts. Rules produce deterministic decisions you can trace and explain. This matters when a data processing decision needs to be reviewed by a human auditor.
Mathematical normalization. Unit conversions, quantity calculations, derived fields. LLMs achieve 84% on unit conversion benchmarks and hallucinate arithmetic. Use deterministic code for math.
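The "use deterministic code for math" rule can be sketched in a few lines. The conversion table and `convert` helper below are illustrative, not a specific library's API; the point is that an unsupported unit pair fails loudly instead of being guessed at.

```python
# Deterministic unit conversion via a lookup table.
# Unit pairs and factors are illustrative assumptions.
CONVERSIONS = {
    ("in", "cm"): 2.54,
    ("lb", "kg"): 0.45359237,
    ("oz", "g"): 28.349523125,
}

def convert(value: float, src: str, dst: str) -> float:
    """Convert value from src unit to dst unit, or raise if unsupported."""
    if src == dst:
        return value
    try:
        return value * CONVERSIONS[(src, dst)]
    except KeyError:
        # An unsupported pair is an explicit failure, not a silent guess --
        # unlike an LLM, which will produce a plausible number either way.
        raise ValueError(f"no conversion rule for {src} -> {dst}")
```

A table like this is trivially auditable, which also satisfies the audit-trail requirement above.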
The Cost Trap Most Teams Fall Into
The failure mode plays out like this: a team prototypes LLM normalization, it works great on their 10,000-record test set, they approve it for production, and six weeks later the infrastructure bill has tripled.
The math is simple and brutal. A product record with a 100-word description, processed through a prompt that includes schema context and instructions, easily uses 500 tokens. At 1 million records, that's 500 million tokens. At $5 per million tokens, that's $2,500 — just for input tokens. Add output tokens, retry overhead, and multiple API calls for validation, and a naive LLM-based normalization pipeline for a medium-sized e-commerce catalog can cost tens of thousands of dollars per run.
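The arithmetic above, as a sketch you can plug your own numbers into. The record count, tokens per record, and price are the illustrative figures from the paragraph, not current API quotes.

```python
# Back-of-envelope input-token cost model for an LLM normalization run.
def input_cost(records: int, tokens_per_record: int,
               price_per_million: float) -> float:
    """Dollar cost of input tokens for one full pass over the dataset."""
    total_tokens = records * tokens_per_record
    return total_tokens / 1_000_000 * price_per_million

cost = input_cost(records=1_000_000, tokens_per_record=500,
                  price_per_million=5.0)
print(f"${cost:,.0f}")  # $2,500 -- input tokens only
```

Output tokens, retries, and validation calls multiply this number, which is why the levers below matter.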
The optimization levers that actually work:
Batch processing. OpenAI batch API charges 50% of real-time pricing. Anthropic batch processing is also discounted at roughly 50% of standard rates. Most normalization work is not latency-sensitive. Queuing records for batch processing halves your API costs immediately with zero engineering change to model selection or prompt design.
Selective querying. Use rules (or a cheap embedding similarity score) to classify records by difficulty. Route high-confidence records directly through rule-based normalization. Route only ambiguous records to the LLM. An uncertainty reduction framework that does this typically reduces LLM API calls by 4–5x while maintaining overall accuracy.
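A routing sketch under those assumptions. `rule_confidence` is a hypothetical scorer standing in for whatever cheap signal your pipeline produces (regex match quality, embedding similarity, lookup-table hit rate); the threshold is something you tune on a labeled sample.

```python
from typing import Callable

def route(records: list[dict],
          rule_confidence: Callable[[dict], float],
          threshold: float = 0.9) -> tuple[list[dict], list[dict]]:
    """Split records into (rule_path, llm_path) by a cheap confidence score.

    High-confidence records go through rule-based normalization;
    only the ambiguous remainder is sent to the LLM.
    """
    rule_path, llm_path = [], []
    for rec in records:
        target = rule_path if rule_confidence(rec) >= threshold else llm_path
        target.append(rec)
    return rule_path, llm_path
```

Set the threshold so the rule path stays above your target accuracy on a held-out sample; everything below it is what you pay LLM prices for.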
Hybrid filtering. Use cheap embedding retrieval to narrow the candidate space, then apply LLM verification only to ambiguous cases. In entity resolution benchmarks, this hybrid approach matches or exceeds pure LLM performance at a fraction of the cost.
Validation Before the Record Reaches Production
LLM normalization outputs cannot go directly to your production database without a validation layer. The failure modes are too quiet: a hallucinated category that looks plausible, a resolved entity that silently matched the wrong record, a normalized value that passes schema validation but contains incorrect data.
Consistency checking. Generate two or three independent LLM responses for the same record. When they disagree, the record is uncertain — route it to human review rather than auto-resolving. This approach catches roughly 20% more errors than single-pass generation at 3x the token cost, which is often the right tradeoff.
Schema enforcement. Constrain LLM outputs to known-good value sets wherever possible. If you're normalizing product categories and your taxonomy has 847 nodes, emit a list of valid categories in the prompt and validate the output against it. Reject outputs that don't match. This converts silent errors into explicit failures you can handle.
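Sketched as a validation helper, with an illustrative three-node taxonomy standing in for your real one:

```python
# Schema enforcement: reject LLM outputs outside the known-good value set.
# The taxonomy below is illustrative; load yours from its source of truth.
VALID_CATEGORIES = {"furniture/chairs", "furniture/tables", "decor/lighting"}

def validate_category(llm_output: str) -> str:
    """Normalize casing/whitespace, then require taxonomy membership."""
    category = llm_output.strip().lower()
    if category not in VALID_CATEGORIES:
        # An explicit failure you can handle beats a plausible-looking
        # wrong value sitting silently in the database.
        raise ValueError(f"category not in taxonomy: {category!r}")
    return category
```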
Statistical sampling. For batch normalization runs, pull a random sample of 100–200 records and review them manually before committing the full batch. This catches systematic errors — a model that's consistently mislabeling a product line, a prompt that's stripping required qualifier fields — before they reach production at scale.
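This step needs little more than a reproducible sampler; the sample size and seed below are arbitrary defaults.

```python
import random

def review_sample(batch: list[dict], n: int = 150, seed: int = 42) -> list[dict]:
    """Pull a seeded random sample for manual review before committing
    the full batch -- the same seed yields the same sample, so reviewers
    and pipelines can agree on exactly which records were checked."""
    rng = random.Random(seed)
    return rng.sample(batch, min(n, len(batch)))
```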
Test set discipline. Public benchmarks for entity resolution and data normalization tasks have been in LLM training data since 2023–2024. Your evaluation accuracy on public benchmarks is not a reliable predictor of production accuracy. Build a proprietary test set from your actual production records, rotate it regularly, and treat it as a controlled asset.
The Ground-Truth Problem
Many data normalization tasks have no canonical correct answer. Is "J. Smith, 123 Main St" the same person as "John Smith, 123 Main Street, Apt 4B"? It depends on other signals — date of birth, phone, email, transaction history. The ground truth is probabilistic, not binary.
This creates a measurement problem. If your test set assigns binary labels to inherently uncertain records, your F1 score measures your model's agreement with an arbitrary labeling decision, not its actual accuracy. In practice:
- Accept multi-label ground truth for genuinely ambiguous records
- Evaluate with set-based metrics (Jaccard similarity, F1 across label sets) rather than binary accuracy
- Report confidence intervals on your metrics rather than point estimates
- When using LLMs to generate ground truth labels (a common bootstrapping approach), be aware that this biases evaluation in favor of the same model family — your accuracy estimates will look better than they are
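The Jaccard variant of the set-based metric above can be sketched in a few lines, scoring a predicted label set against a multi-label ground truth:

```python
# Jaccard similarity between predicted and ground-truth label sets.
# Partial agreement with an ambiguous ground truth scores partial credit,
# rather than being marked flatly wrong by a binary metric.
def jaccard(pred: set[str], truth: set[str]) -> float:
    if not pred and not truth:
        return 1.0  # both empty: perfect agreement by convention
    return len(pred & truth) / len(pred | truth)

print(jaccard({"a", "b"}, {"b", "c"}))  # 0.3333333333333333
```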
Feedback Loops That Actually Improve the System
Human corrections are valuable training signal, but most teams collect them without routing them anywhere. The pattern that consistently works:
- High-confidence predictions go directly to production — no human review
- Low-confidence predictions (based on consistency check disagreement, low similarity scores, or prompt-extracted uncertainty) go to a review queue
- Human reviewers correct the low-confidence records
- Corrections accumulate; every month or quarter, run a fine-tuning pass or few-shot update against the corrected set
- A/B test the updated system against the previous version before promoting it
Teams that implemented structured feedback loops reported 15–30% accuracy improvement within three months on their specific normalization task. The key is the routing architecture — without it, human corrections are a one-time fix for a specific record rather than a system-wide improvement.
The Hybrid Architecture in Practice
The architecture that holds up under production loads combines three layers:
1. Rule-based normalization handles the high-confidence, high-throughput majority. Regex for standardized formats, lookup tables for controlled vocabularies, deterministic logic for mathematical operations. This layer processes 60–80% of your records at near-zero cost per record.
2. Embedding-based pre-filtering narrows the candidate space for entity resolution. Cheap vector similarity search identifies records that might match; this is not the final answer, just a shortlist.
3. LLM verification resolves only the ambiguous cases the first two layers flag. This might be 10–20% of records in a typical production corpus. With batch pricing, this is affordable.
Each layer hands off to the next based on confidence, not on record type. The routing logic is where most of the engineering work lives — and it's also where most of the accuracy gains come from once you've picked reasonable components for each layer.
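The hand-off reduces to a short sketch. `try_rules`, `shortlist`, and `llm_verify` are hypothetical stand-ins for the three layers; the only structural commitment is that each layer passes on what it can't confidently resolve.

```python
from typing import Any, Callable

def normalize(record: dict,
              try_rules: Callable[[dict], tuple[Any, bool]],
              shortlist: Callable[[dict], list],
              llm_verify: Callable[[dict, list], Any]) -> Any:
    """Route one record through the three-layer hybrid pipeline."""
    value, confident = try_rules(record)   # layer 1: deterministic rules
    if confident:
        return value
    candidates = shortlist(record)         # layer 2: embedding pre-filter
    return llm_verify(record, candidates)  # layer 3: LLM on the ambiguous rest
```

Everything interesting lives inside the confidence signal `try_rules` returns; the skeleton itself stays this small.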
What to Actually Measure
Before you ship LLM normalization to production, you need three numbers:
Per-layer accuracy. Not aggregate accuracy across all records — per-layer accuracy for the records each layer actually processes. If your rule-based layer handles 70% of records at 99.5% accuracy and your LLM layer handles 30% at 91%, your aggregate accuracy looks like 97%. But the 9% failure rate in the LLM layer may concentrate on your most complex, highest-value records.
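The weighted-average arithmetic from that example, spelled out:

```python
# Aggregate accuracy is the share-weighted average of per-layer accuracy.
# Shares and accuracies are the illustrative figures from the text.
rule_share, rule_acc = 0.70, 0.995
llm_share, llm_acc = 0.30, 0.91

aggregate = rule_share * rule_acc + llm_share * llm_acc
print(f"{aggregate:.4f}")  # 0.9695 -- "looks like 97%"
```

The healthy-looking aggregate is exactly why you report per-layer numbers instead.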
False positive cost vs. false negative cost. For entity resolution, a false positive (merging two distinct entities) often has much higher downstream cost than a false negative (failing to merge duplicates). Tune your threshold toward the cost that matters.
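One way to operationalize that asymmetry is an expected-cost decision rule rather than a fixed similarity threshold. The cost values below are illustrative placeholders for your own downstream estimates.

```python
# Merge only when the expected cost of merging is lower than the
# expected cost of leaving the records separate.
def should_merge(p_match: float,
                 fp_cost: float = 100.0,   # cost of merging distinct entities
                 fn_cost: float = 5.0) -> bool:
    cost_merge = (1 - p_match) * fp_cost  # risk taken by merging
    cost_skip = p_match * fn_cost         # risk taken by not merging
    return cost_merge < cost_skip
```

With a 20:1 cost ratio like this, the rule only merges above roughly 95% match probability, which is the "tune toward the cost that matters" behavior in numbers.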
Processing cost per correct record. Tokens processed, batch vs. real-time split, retry overhead. This number needs to be stable as your data volume grows. If it scales linearly with volume without optimization, you're paying for a mistake you haven't found yet.
The Rule That Governs All of This
Rules are fast and cheap until the world changes. LLMs are flexible and expensive until you constrain them. The teams that built hybrid systems early — routing by confidence, validating outputs, running feedback loops — are not spending time maintaining brittle rule sets or debugging unexplained API bills. They're spending time improving the routing logic, which is where the real leverage lives.
If you're starting from scratch: get your rule-based layer working first, instrument it for confidence, and add LLM processing only for the records it fails on. The LLM layer is not a replacement for rules. It's the layer you use when rules run out.