The AI-Everywhere Antipattern: When Adding LLMs Makes Your Pipeline Worse
There is a type of architecture that emerges at almost every company that ships an AI feature and then keeps shipping: a pipeline where every transformation, every routing decision, every classification, every formatting step passes through an LLM call. It usually starts with a legitimate use case. The LLM actually helps with one hard problem. Then the team, having internalized the pattern, reaches for it again. And again. Until the whole system is an LLM-to-LLM chain where a string of words flows in at one end and a different string of words comes out the other, with twelve API calls in between and no determinism anywhere.
This is the AI-everywhere antipattern, and it is now one of the most reliable ways to build a production system that is slow, expensive, and impossible to debug.
The Problem with "If LLMs Can, LLMs Should"
The mental model that leads here is understandable: LLMs are flexible, handle edge cases gracefully, and eliminate the need to write explicit rules. Why maintain a regex-based extractor when an LLM can parse the text? Why write a rule engine when you can describe the rules in natural language and let the model interpret them?
The problem is that "can" and "should" are different questions, and the cost of conflating them compounds across every step in the pipeline.
Consider a real category of task: routing a support ticket to the right team. A naive LLM-everywhere implementation calls the model, sends it the ticket text, and asks it to classify the category. This works. It also costs roughly $3.00 per thousand tickets depending on model choice, adds 200–400ms of latency per request, fails unpredictably when the model changes behavior after a provider update, and produces a different answer for the same ticket on a bad day. A routing function based on keyword matching and a topic lookup table handles 80% of tickets instantly, at effectively zero cost, with guaranteed consistency. The LLM should handle the remaining 20%—the ambiguous cases where rules genuinely fail—not all of them.
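The keyword-and-lookup-table fast path described above can be sketched in a few lines. The team names and keywords here are invented for illustration; a `None` return signals the ambiguous cases that should fall through to the model.

```python
from typing import Optional

# Hypothetical routing table: keyword -> team. In a real system this
# would be maintained alongside the ticket taxonomy.
KEYWORD_ROUTES = {
    "refund": "billing",
    "invoice": "billing",
    "password": "account-security",
    "login": "account-security",
    "crash": "platform",
    "timeout": "platform",
}

def route_ticket(text: str) -> Optional[str]:
    """Return a team name for a clear-cut ticket, or None when no
    rule matches and the ticket should escalate to the LLM."""
    lowered = text.lower()
    for keyword, team in KEYWORD_ROUTES.items():
        if keyword in lowered:
            return team
    return None  # ambiguous: this is the 20% the model should see
```

Every decision this function makes is instant, free, and reproducible, and the table itself doubles as documentation of the routing policy.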
The ratio matters here. If you are paying LLM latency and cost for 100% of requests but only 20% of them actually needed the model, you have introduced 5x overhead with no quality gain for the easy majority.
Where the Antipattern Takes Root
Three forces drive teams toward LLM-everywhere:
Organizational pressure. AI projects attract funding, interest, and attention. Describing a pipeline as "AI-powered at every stage" sounds impressive. Engineers who notice the inefficiency but don't control the narrative find it hard to push back.
The tool-seduction effect. Once your team has LLM API access, every problem starts to look like a language understanding problem. Format conversion? LLM. Date normalization? LLM. Whether this field value is positive, negative, or neutral? LLM. The pattern becomes the default without anyone actively choosing it.
Invisible per-step costs. Individual API calls look cheap in isolation. It is only when you aggregate across request volume and chain length that the bill becomes a budget conversation. Engineers building individual steps often don't see the cumulative cost until it's too late to refactor cheaply.
A Taxonomy of When Not to Use an LLM
The useful question is not "can an LLM do this?" but "does this task require language model reasoning, or does it just require computation?"
Tasks that do not require LLM reasoning:
- Schema-conformant parsing. If you're extracting fields from a structured form, a database record, or a well-defined template, deterministic parsing achieves 99%+ accuracy at microsecond latency. An LLM achieves 80–95% accuracy at 200+ milliseconds. You are paying 200,000x more time and significant money for lower reliability.
- Classification with clear boundaries. Spam vs. not-spam based on sender domain and header patterns. Route assignment based on product area. Support ticket priority based on keywords. These have decision surfaces that can be enumerated. Rule-based classifiers run in microseconds and produce auditable decisions.
- Format conversion. CSV to JSON. ISO date strings to timestamps. Normalizing phone number formats. These are functions. Write them as functions.
- Validation. Whether a field is present, whether it matches a regex, whether a value is within an expected range—these are deterministic checks. Running them through an LLM adds latency, cost, and the possibility that the model "decides" the value is close enough even when it isn't.
- Lookup and retrieval. If the answer is a key-value lookup, a database query, or a simple filter, do that. The model's ability to answer "what is the capital of France" does not mean you should route that query through an LLM.
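Several of the items above—format conversion and validation in particular—reduce to short standard-library functions. The sketch below assumes illustrative field formats (a flat CSV, ISO 8601 dates, loosely formatted phone numbers); the point is that each step is deterministic and testable, not that these exact rules fit every system.

```python
import csv
import io
import json
import re
from datetime import datetime, timezone

def csv_to_json(csv_text: str) -> str:
    """CSV to JSON: a function, not a model call."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)

def iso_to_timestamp(iso_date: str) -> float:
    """ISO 8601 date string to a Unix timestamp. Naive datetimes
    are assumed to be UTC for this sketch."""
    dt = datetime.fromisoformat(iso_date)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()

PHONE_RE = re.compile(r"^\+?[0-9]{7,15}$")

def is_valid_phone(raw: str) -> bool:
    """Deterministic validation: strip separators, match a pattern."""
    digits = re.sub(r"[\s().-]", "", raw)
    return bool(PHONE_RE.match(digits))
```

Each of these runs in microseconds, fails loudly (an exception or a `False`) rather than returning a plausible-looking wrong answer, and can be covered exhaustively by unit tests.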
Tasks that genuinely require LLM reasoning:
- Open-ended text understanding with semantic ambiguity.
- Multi-step reasoning where intermediate conclusions depend on context.
- Synthesis and summarization across heterogeneous sources.
- Generating natural language responses where tone and fluency matter.
- Few-shot classification when you have a handful of examples and no labeled training set.
- Handling edge cases that your rule system explicitly cannot enumerate.
The dividing line is roughly: does solving this task require understanding the meaning of natural language, or does it require executing a function? If the answer is the latter, default to a function.
The Compounding Reliability Problem
LLM-everywhere is not just a cost problem. It is a reliability problem that gets worse as your chain grows.
A single LLM call at a reasonable quality tier might have 85–90% consistency across identical inputs (temperature=0 doesn't fully solve this—floating-point non-determinism, batching effects, and provider-side model updates all contribute variance). When you chain four such steps, expected end-to-end consistency drops to roughly 52–66%. Your pipeline appears to work, but its behavior is a moving target.
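The compounding arithmetic is simple enough to state directly: if each step is consistent with probability p and the steps fail independently (a simplifying assumption—correlated failures can make things better or worse), a chain of n steps is consistent roughly p to the power n of the time.

```python
def chain_consistency(per_step: float, steps: int) -> float:
    """End-to-end consistency of a chain of `steps` LLM calls,
    assuming independent per-step consistency `per_step`."""
    return per_step ** steps

# The 85-90% per-step figures from the text, chained four deep:
print(round(chain_consistency(0.85, 4), 3))  # 0.522
print(round(chain_consistency(0.90, 4), 3))  # 0.656
```

This is where the 52–66% range comes from, and why each additional LLM step is multiplicatively, not additively, expensive in reliability.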
This has concrete debugging consequences. When a multi-step LLM pipeline produces a wrong answer, tracing which step introduced the error requires capturing intermediate outputs at every stage, comparing them across runs, and reasoning about probabilistic failures—none of which traditional debugging tools support well. The error says nothing useful. The logs show that the pipeline completed successfully. The model returned a plausible-looking result. Finding the bad step is forensic work.
Deterministic components break in obvious, traceable ways. LLMs fail subtly, confidently, and inconsistently. Every non-LLM step you can introduce into your pipeline is one more reliable anchor point that narrows the failure surface.
The Pipeline Audit Heuristic
When reviewing an existing pipeline for LLM overuse, the most useful question to ask at each step is: could this decision be replicated exactly by a function with defined inputs and outputs?
If yes, the step is a candidate for replacement. Work through the following checklist:
- What is the failure mode if this step is wrong? If the answer is a silent, hard-to-detect error (the model returned a plausible but wrong JSON field value), the stakes for unpredictability are high and you should prefer deterministic logic.
- Does correctness depend on understanding context that isn't locally available? If the step only needs the text of a single field or a known data structure, it doesn't need a model.
- What fraction of inputs are ambiguous? If 5% of your inputs are genuinely ambiguous and 95% follow clear patterns, the LLM should handle the 5%—not all of it. Build a classifier that routes to the LLM only when the deterministic path fails.
- Is this step in the hot path of a user-facing request? Every LLM call adds 100–400ms to your latency. In a synchronous request chain, three unnecessary calls add half a second.
- Can you audit this decision? Regulatory environments, support workflows, and financial applications often require a clear explanation of why a decision was made. "The model decided" is not an auditable answer. A rule that says "routed to team X because ticket contained keyword Y" is.
The Hybrid Pattern That Actually Works
The most reliable architecture for complex pipelines is a cascade, not a chain.
For any given task, build a deterministic fast path that handles the clear cases cheaply and consistently. Layer an LLM above it that handles only the cases the fast path explicitly doesn't cover—ambiguous input, novel patterns, edge cases that exceed the rule space. Use a confidence threshold or an explicit "I don't know" fallback from the deterministic layer to route to the model.
This pattern:
- Keeps the majority of traffic fast and cheap
- Limits non-determinism to the inputs that actually require it
- Makes failures traceable—if the deterministic path handles 85% of requests and you see a regression, you immediately know whether it's in the rule logic or the LLM
- Allows the fast path to be tested exhaustively with standard unit tests
- Makes the LLM's scope narrow enough that its prompts stay simple and its failure modes are easier to understand
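The cascade described above can be sketched as a two-layer router. Everything here is illustrative: `call_llm` is a hypothetical stand-in for your model client, and the rules and team names are invented. The important shape is that the deterministic layer returns an explicit miss, and only that miss reaches the model—and every decision carries an auditable reason.

```python
from typing import Callable, Optional, Tuple

# Illustrative rule table: (keyword, team) pairs, checked in order.
RULES = [
    ("refund", "billing"),
    ("password", "account-security"),
]

def fast_path(text: str) -> Optional[Tuple[str, str]]:
    """Deterministic layer. Returns (team, auditable_reason),
    or None to signal an explicit fallthrough to the model."""
    lowered = text.lower()
    for keyword, team in RULES:
        if keyword in lowered:
            return team, f"matched keyword {keyword!r}"
    return None

def route(text: str, call_llm: Callable[[str], str]) -> Tuple[str, str]:
    """Cascade: fast path first, LLM only on explicit miss."""
    hit = fast_path(text)
    if hit is not None:
        return hit                         # cheap, consistent, unit-testable
    return call_llm(text), "llm fallback"  # only the ambiguous slice
```

The fast path can be tested exhaustively with ordinary unit tests by passing a stub in place of `call_llm`, and the reason string satisfies the auditability requirement: "routed to team X because ticket contained keyword Y" falls out of the return value.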
The key discipline is actively resisting the temptation to expand the LLM's scope when the fast path is "mostly working." Mostly working is the bar for production deterministic logic. It is not the bar for handing off to a language model.
The Organizational Fix
The technical refactoring is usually straightforward once the decision is made. The harder problem is organizational.
Engineers default to LLMs partly because the question "did you consider a simpler approach?" is less likely to be asked than "why didn't you use AI for this?" Creating space to deliberately choose deterministic approaches requires making the cost and reliability tradeoffs visible. Teams that aggregate per-feature token costs, measure step-by-step latency in their traces, and require explanation for why each LLM call is necessary—rather than why it's justified to skip one—tend to build leaner pipelines.
The useful framing is not "LLMs versus deterministic logic" as a philosophy battle. It's that LLMs are expensive, non-deterministic tools that should be deployed where their specific capabilities are genuinely needed. Everything else should run on the cheapest, most reliable primitive that solves the problem. The pipeline that uses an LLM exactly where it needs to, and nowhere else, is the one that holds up under production load, scales predictably with volume, and stays debuggable when it fails.
Most pipelines are not there yet. Most can get there with an audit, a routing layer, and the discipline to treat "can we use a function instead?" as a first-class architectural question.
