Your AI Feature Should Lose to a Regex First
A team spends three weeks integrating a foundation model to classify incoming support tickets into routing categories. The model reaches 87% accuracy in testing. They ship it. Six months later, an engineer notices that 70% of tickets contain a product name in the subject line and that a simple lookup table would have handled those with 99% accuracy. The LLM is doing real work on the hard 30% and burning inference budget to approximate a lookup table on the rest.
This is not an unusual story. It happens because teams treat "use an LLM" as the first implementation choice rather than the last. The fix is a required gate: your AI feature must lose to a dumb rule before you are allowed to build the AI version.
What the Baseline Gate Actually Is
The baseline gate is not a suggestion to try simple things. It is a structured checkpoint that sits before any LLM integration, fine-tuning, or RAG investment. The rule is:
Build the simplest deterministic system that could possibly solve the problem. Measure it. Only proceed with an LLM if the LLM demonstrably beats it on metrics that matter to the business.
This sounds obvious. It is almost universally skipped. The reason is that foundation model APIs feel so easy to reach for that skipping the baseline feels like saving time. In practice, you are taking on technical debt at a high interest rate.
Google codified this in their machine learning engineering guidelines more than a decade ago: do not be afraid to launch without machine learning. If you think ML will give you a 100% boost, a heuristic will get you 50% of the way there. The insight has not gotten less true because LLMs exist. It has gotten more urgent, because LLM calls cost real money and deterministic rules cost essentially nothing.
The Taxonomy of Dumb Rules
A useful baseline is not a placeholder. It should be the best deterministic system you can build without a model. Depending on the problem, this takes different forms.
Keyword and regex matching handles classification, routing, and extraction for a large fraction of real-world inputs. Subject line classification, spam detection, intent routing, entity extraction from structured fields—these all have substantial sub-populations where the signal is explicit in the text and a pattern match will find it. The failure mode is that engineers estimate coverage will be 20% when it is actually 70%.
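As a sketch, a regex-based router for the ticket-routing scenario might look like this. The categories and patterns are illustrative, not taken from any real system:

```python
import re

# Hypothetical routing rules; the categories and patterns are
# invented for illustration, not from a real ticket system.
ROUTES = [
    (re.compile(r"\b(invoice|refund|charge[ds]?)\b", re.I), "billing"),
    (re.compile(r"\b(password|login|2fa|locked out)\b", re.I), "auth"),
    (re.compile(r"\b(crash|error|stack trace)\b", re.I), "engineering"),
]

def route(subject):
    """Return a routing category, or None if no pattern matches."""
    for pattern, category in ROUTES:
        if pattern.search(subject):
            return category
    return None  # uncovered input: this is the tail an LLM might earn
```

The `None` return is doing real architectural work: it cleanly separates the covered population from the tail you might hand to a model.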
Lookup tables and exact matching handle recommendation, retrieval, and personalization when the relevant items are known in advance. Surfacing the top-downloaded items, the most recently used contacts, the bestselling products in a category—these approaches perform well on the high-frequency distribution and require no inference budget at all.
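A frequency lookup of this kind can be precomputed offline and served as a dictionary read. A minimal sketch, assuming the event log is a plain list of item identifiers:

```python
from collections import Counter

def build_top_items(events, n=5):
    """Precompute a top-N lookup from an event log. The log format
    (a flat list of item identifiers) is an assumption."""
    return [item for item, _ in Counter(events).most_common(n)]
```

Serving the result requires no inference budget at all, which is the point of the baseline.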
Decision trees and threshold rules handle scoring, anomaly detection, and approval workflows where the decision logic is known and stable. If a transaction amount is above $10,000 and the account is less than 30 days old, flag it. If the user has fewer than three sessions, show the onboarding flow. The logic is transparent, auditable, and runs in microseconds.
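The transaction rule above translates directly into code, which is the point: the whole decision is a few readable, auditable lines.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    amount: float
    account_age_days: int

def flag_for_review(tx):
    # Large amount on a young account gets flagged. The thresholds
    # are the ones quoted in the text, not a recommendation.
    return tx.amount > 10_000 and tx.account_age_days < 30
```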
Frequency and recency heuristics handle ranking in cases where temporal signal predicts relevance. Sorting by last-used, most-viewed, or highest-rated covers a surprising fraction of sorting problems before any personalization model is warranted.
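A recency heuristic is similarly small. This sketch assumes you track a last-used timestamp per item; the mapping format is an assumption:

```python
def rank_by_recency(last_used, limit=10):
    """Most recently used items first. `last_used` maps item -> a
    sortable timestamp (epoch seconds, datetime, ...)."""
    ranked = sorted(last_used, key=last_used.get, reverse=True)
    return ranked[:limit]
```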
The common thread: these systems are cheap to build, take a fraction of the implementation time, and produce outputs that are fully explainable. A baseline that takes 10% of the effort and covers 60–70% of inputs is not a consolation prize. It is the right architecture for 60–70% of your traffic.
The Cost Arithmetic You Are Skipping
LLM inference is not expensive in the abstract. It is expensive at scale, and it is expensive compared to the alternative.
At current pricing, a mid-tier frontier model costs roughly $3–18 per million tokens. A typical classification or extraction call might consume 300–800 tokens. At 10,000 requests per day, that is $9–$144 per day, or roughly $3,000–$50,000 per year—for a single feature. A regex matching the same inputs costs the CPU time to evaluate the pattern, which rounds to zero.
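The arithmetic is worth encoding so it can be rerun as prices and volumes change. A small helper, using the figures above:

```python
def annual_llm_cost(requests_per_day, tokens_per_call, usd_per_million_tokens):
    """Yearly cost of a single LLM-backed feature at steady traffic."""
    daily_tokens = requests_per_day * tokens_per_call
    daily_cost = daily_tokens / 1_000_000 * usd_per_million_tokens
    return daily_cost * 365

# The figures from the text: 10k requests/day, 300-800 tokens/call,
# $3-18 per million tokens.
low = annual_llm_cost(10_000, 300, 3.0)    # $3,285/year
high = annual_llm_cost(10_000, 800, 18.0)  # $52,560/year
```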
The ratio is not 10x. For common pattern-matching tasks, it is closer to 100x or 1,000x once you factor in latency, retry costs, and the operational overhead of a vendor dependency.
This arithmetic changes when the LLM is genuinely necessary. For inputs where the signal is implicit in meaning rather than explicit in syntax—ambiguous intent, multilingual text, unstructured freeform input, multi-step reasoning—a rule-based system cannot do the job and the cost is justified. But that cost is only justified after you have established that the cheap system cannot reach acceptable performance. Until you measure, you are guessing.
A 2020 CMU analysis found that a baseline model typically takes 10% of the development time and achieves 90% of the production-quality results. A deep learning model published in Nature with 13,000 parameters for earthquake aftershock prediction was outperformed by a two-parameter logistic regression. Complexity does not automatically win. The problem is that engineers stop measuring once the LLM passes a threshold, without checking whether a simpler system would have passed the same threshold.
Building the Baseline: A Practical Protocol
The baseline gate needs to be a real artifact, not a verbal disclaimer that "we considered rules and moved on." That means:
Step 1: Enumerate the input distribution. Before building anything, sample 200–500 real inputs from the problem domain. Cluster or manually review them. Identify what fraction has an obvious deterministic signal versus what fraction is genuinely ambiguous. This step alone often shifts the engineering plan significantly.
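If your baseline signals "not covered" by returning None (an assumption of this sketch, carried through the rest of the protocol), the tally in step 1 can be automated over a random sample:

```python
import random

def estimate_coverage(inputs, baseline, sample_size=300, seed=0):
    """Fraction of a random sample of inputs the baseline covers.
    Assumes the baseline returns None for inputs it cannot handle."""
    rng = random.Random(seed)
    sample = rng.sample(inputs, min(sample_size, len(inputs)))
    covered = sum(1 for x in sample if baseline(x) is not None)
    return covered / len(sample)
```

Manual review of the sample is still worth doing; the automated number tells you coverage, not whether the covered cases are the ones that matter.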
Step 2: Build the deterministic system to completion. Do not stub it out or implement only the easy cases. Write the regex. Build the lookup. Code the decision tree. Make it production-quality for the inputs it handles. This usually takes one to two days for a well-scoped problem.
Step 3: Measure coverage and accuracy separately. Coverage is the fraction of inputs the baseline handles. Accuracy is correctness on that covered fraction. A baseline that covers 65% of inputs at 97% accuracy is a strong result—it means the LLM only needs to handle the remaining 35%, dramatically reducing inference costs and failure surface.
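Keeping the two numbers separate is easy to get wrong in an ad hoc script. One way to structure the measurement, again assuming the baseline returns None when it abstains:

```python
def baseline_report(labeled, baseline):
    """labeled: list of (input, gold_label) pairs.
    Returns (coverage, accuracy_on_covered); accuracy is None if
    nothing is covered."""
    covered = [(x, y) for x, y in labeled if baseline(x) is not None]
    coverage = len(covered) / len(labeled)
    if not covered:
        return coverage, None
    correct = sum(1 for x, y in covered if baseline(x) == y)
    return coverage, correct / len(covered)
```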
Step 4: Establish the LLM delta on the uncovered tail. The LLM should be evaluated specifically on the inputs the baseline cannot handle. If the LLM accuracy on this tail is not materially better than a fallback (returning a default, asking the user to clarify, or routing to a human), the LLM is not earning its cost.
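Step 4 amounts to comparing the model against a trivial fallback on the same tail. A sketch, where `llm_predict` is a placeholder for any model call:

```python
def llm_delta_on_tail(tail, llm_predict, fallback_label):
    """Accuracy gain of the model over a constant fallback, measured
    only on (input, gold_label) pairs the baseline could not cover."""
    llm_acc = sum(llm_predict(x) == y for x, y in tail) / len(tail)
    fallback_acc = sum(y == fallback_label for _, y in tail) / len(tail)
    return llm_acc - fallback_acc
```

A delta near zero on the tail is the signal that the model is not earning its cost, however impressive its aggregate accuracy looks.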
Step 5: Make the gate explicit in your team's process. The question "did we beat the rule?" should appear on the design doc, the PR description, or the feature review. Without an explicit checkpoint, the gate will be skipped under schedule pressure.
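The architecture the protocol produces can be summarized in one dispatcher: baseline first, LLM only on the uncovered tail, explicit fallback when the model abstains. The callables here are placeholders, assumed to return None when they cannot answer:

```python
def classify(x, baseline, llm_classify, fallback="needs_human"):
    """Hybrid classifier: deterministic baseline first, LLM on the
    uncovered tail, explicit fallback if the model abstains.
    Returns (label, source) so routing decisions stay auditable."""
    result = baseline(x)
    if result is not None:
        return result, "baseline"
    result = llm_classify(x)
    if result is not None:
        return result, "llm"
    return fallback, "fallback"
```

Returning the source alongside the label makes it cheap to monitor, in production, what fraction of traffic each tier actually handles.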
Where LLMs Actually Win the Comparison
Running this protocol honestly will identify the cases where the LLM genuinely adds value that deterministic systems cannot match.
Ambiguous natural language intent is the canonical example. When a user types "I need help with my account," a lookup table cannot route them. A regex will over-match or under-match. An LLM with context about the user's recent activity can reason about the most likely intent. This is the 30% tail where LLM investment pays off.
Cross-lingual inputs that a rule system would have to handle with explicit per-language patterns are handled uniformly by a multilingual model. The baseline approach breaks down for global products where the deterministic rules multiply without bound.
Compositional extraction from genuinely unstructured text—a legal document, a support email with multiple nested issues, a technical specification where the field boundaries are implicit—requires semantic understanding that pattern matching cannot provide. The complexity is real, and it shows up clearly when you measure the baseline.
Dynamic domains where the relevant entities and relationships change faster than a lookup table can be updated benefit from model generalization. Static rules decay; models can be periodically retrained or prompted with updated context.
The common thread in the cases where LLMs win: the input is ambiguous, multilingual, or structurally complex in a way that explicit rules cannot capture economically. These cases exist. They are a real fraction of real products. They are not 100% of inputs, and treating them as if they were is the core mistake the baseline gate prevents.
Making "Did We Beat the Rule?" a Required Gate
The teams that consistently over-invest in LLMs are not doing so because they are careless. They are doing so because the evaluation process starts with the model. The question asked is "how good is the LLM?" rather than "does the LLM add enough value over the alternative to justify the cost?"
Reversing that question changes everything. It surfaces the 60–70% of cases where a lookup or regex handles the problem. It focuses LLM investment on the tail where it actually moves metrics. It produces a hybrid architecture that is cheaper, faster, more debuggable, and more maintainable than a pure-model approach.
The gate does not need to be bureaucratic. It needs to be non-optional. Before any fine-tuning budget is approved, before any RAG pipeline is designed, before any foundation model API is integrated into a production path, one question must be answered with measurements: what does the simplest deterministic baseline achieve on this problem, and what is the LLM's delta over that baseline on the inputs the baseline cannot handle?
If you have not run that experiment, you do not know whether you need the model. And if you do not know that, every token you spend is a guess.
- https://eugeneyan.com/writing/first-rule-of-ml/
- https://blog.ml.cmu.edu/2020/08/31/3-baselines/
- https://developers.google.com/machine-learning/guides/rules-of-ml
- https://yanirseroussi.com/til/2023/09/21/googles-rules-of-machine-learning-still-apply-in-the-age-of-large-language-models/
- https://www.pecan.ai/blog/rule-based-vs-machine-learning-ai-which-produces-better-results/
- https://arxiv.org/html/2603.15970
