When to Reach for an LLM vs. a Simple Heuristic: A Four-Factor Framework
A logistics company spent $800K and twelve months trying to use AI for route optimization. At the end of the engagement, their routes were marginally better than the heuristics they already had. Leadership rejected the next three AI proposals. A food delivery company faced the same routing problem and solved it in a single night with a set of explicit business rules.
The expensive lesson both teams discovered: route optimization with real-time constraints, driver preferences, and time windows is not an AI problem — it's a combinatorial scheduling problem. The patterns you need to learn aren't hidden in data; they're explicit domain logic that someone in operations already knows.
This plays out across every industry. A 2025 MIT study found 95% of enterprise AI pilots delivered zero measurable impact despite $30–40 billion in combined investment. The dominant failure mode wasn't bad models or insufficient data. It was teams building AI solutions for problems where AI was the wrong tool.
The question "is this an AI problem?" is harder than it looks. It requires evaluating four distinct factors before a team writes a line of training code or fires a single LLM call. Teams that skip this analysis don't just waste engineering time — they damage organizational trust in AI for years afterward.
Factor 1: Signal Quality — Does the Data Contain Learnable Patterns?
The most commonly skipped evaluation is whether the patterns you want to learn actually exist in your data. This sounds obvious, but the failure mode is subtle: your data looks rich and your metrics look promising during development, yet in production the model turns out to be fitting noise.
Signal quality has two components. The first is whether predictive features exist at all. For churn prediction, behavioral signals like session depth, feature adoption rate, and support ticket frequency genuinely correlate with future cancellation. For route optimization with constantly shifting constraints, the historical training data reflects a snapshot of traffic patterns and driver availability that changes daily — there's insufficient stable signal to learn from.
The second component is whether those signals are encoded in the data you actually have access to. A model that predicts customer lifetime value needs rich behavioral history, not just signup metadata. If your data pipeline captures what users do but not why, your model will learn correlations that don't hold when context shifts.
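One quick way to pressure-test both components is to measure how much information your candidate features actually carry about the label. Here's a minimal sketch using scikit-learn; the file and the column names (session_depth, features_adopted, support_tickets, churned) are hypothetical stand-ins for whatever your pipeline captures:
```python
# Sketch: do candidate features carry any signal about the label?
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("customer_snapshots.csv")  # hypothetical export

features = ["session_depth", "features_adopted", "support_tickets"]
X, y = df[features], df["churned"]

# Mutual information is zero for independent variables, so scores near
# zero across the board suggest there is no learnable signal here.
scores = mutual_info_classif(X, y, random_state=0)
for name, score in zip(features, scores):
    print(f"{name:>20}: {score:.3f}")
```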
The practical test: can a domain expert look at a handful of examples and consistently identify the right answer? If yes, and if you can capture the features that let them make that judgment, you have signal. If even the expert can't articulate which features drive the decision, you're probably fitting noise.
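You can make that test quantitative: have two experts label the same sample independently and measure how often they agree, corrected for chance. A toy sketch (the label arrays below are illustrative, not real data):
```python
# Sketch: the domain-expert test made concrete. High agreement suggests
# the task has recoverable signal; chance-level agreement suggests noise.
from sklearn.metrics import cohen_kappa_score

expert_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # labels from expert A
expert_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # labels from expert B

raw_agreement = sum(a == b for a, b in zip(expert_a, expert_b)) / len(expert_a)
kappa = cohen_kappa_score(expert_a, expert_b)  # corrects for chance agreement

print(f"raw agreement: {raw_agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```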
Factor 2: Human Performance Ceiling — What Does "Good" Look Like?
Before building a model, you need a baseline against which to measure it. This is where many teams make a structural error: they compare their model to the current automated system (which is often terrible) rather than to what a skilled human can achieve.
Human performance establishes the ceiling on what's learnable from the task. If expert humans perform at 95% accuracy, you know the signal is strong enough to learn from — and you have a target to work toward. If expert humans only reach 70% consistency, you're dealing with an inherently ambiguous task where even perfect signal extraction won't get you far.
More importantly, the gap between human performance and your current system tells you whether AI is solving the right problem. If humans already achieve 94% accuracy but your automation sits at 60%, you don't have an AI problem. You have a requirements-capture problem: the rules for achieving human-level performance haven't been codified yet. A rule-based system that properly encodes expert judgment will close most of that gap at a fraction of the cost.
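To make that concrete, here's a sketch of what closing the gap with rules can look like: transcribe the expert's decision logic into code, then score it against expert-labeled examples. The triage domain, field names, and thresholds are all hypothetical:
```python
# Sketch: encode expert judgment as explicit rules and measure how much
# of the human-automation gap they close. Thresholds are hypothetical.
def expert_rules(ticket: dict) -> str:
    """A hand-written triage policy transcribed from an operations expert."""
    if ticket["contains_refund_request"] and ticket["amount"] > 500:
        return "escalate"
    if ticket["customer_tenure_days"] < 30:
        return "priority_queue"
    return "standard_queue"

def accuracy(predict, labeled_examples):
    hits = sum(predict(x) == label for x, label in labeled_examples)
    return hits / len(labeled_examples)

labeled_examples = [  # toy stand-ins for real expert-labeled tickets
    ({"contains_refund_request": True, "amount": 900, "customer_tenure_days": 200}, "escalate"),
    ({"contains_refund_request": False, "amount": 0, "customer_tenure_days": 10}, "priority_queue"),
    ({"contains_refund_request": False, "amount": 0, "customer_tenure_days": 400}, "standard_queue"),
]

# If this lands near the experts' own accuracy, you had a
# requirements-capture problem, not an AI problem.
print(f"rule baseline accuracy: {accuracy(expert_rules, labeled_examples):.0%}")
```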
This matters because AI creates leverage when it automates learned pattern recognition at scale — not when it re-derives rules that experts could write down. The latter is an engineering problem, not a machine learning problem.
Factor 3: Data Availability — Do You Have Enough to Learn From?
The data availability criterion goes beyond "do we have data" to three more specific questions.
Volume: supervised learning typically requires thousands of labeled examples before patterns generalize reliably. For many enterprise use cases this is achievable; for niche internal workflows it often isn't. A useful heuristic is that with fewer than a few hundred labeled examples, a well-tuned rule-based system will outperform a trained model and be far cheaper to maintain.
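A learning curve makes this heuristic testable against your own dataset: if the validation score has already plateaued at small training sizes, more labels won't help. A sketch with scikit-learn on synthetic data (swap in your own features and model):
```python
# Sketch: are we volume-limited? If validation accuracy plateaus well
# below the rule baseline, more labeling budget won't save the model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.05, 1.0, 8), cv=5,
)
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:>5} examples -> validation accuracy {score:.3f}")
```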
Representativeness: your training data must reflect the distribution of inputs you'll encounter in production. This fails more often than teams expect. A fraud detection model trained on historical transactions before a product expansion won't have seen the new customer cohort's behavior. A content moderation model trained on English text degrades immediately when users start posting in other languages.
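A cheap way to catch this before launch is to compare each feature's training distribution against a recent production sample, for instance with a two-sample Kolmogorov-Smirnov test. A sketch; the file paths and feature names are hypothetical:
```python
# Sketch: a representativeness check. A tiny p-value means the training
# and production distributions for that feature likely differ.
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_parquet("train_features.parquet")       # hypothetical paths
prod = pd.read_parquet("last_week_production.parquet")

for col in ["transaction_amount", "account_age_days", "daily_sessions"]:
    stat, p_value = ks_2samp(train[col], prod[col])
    flag = "DRIFT?" if p_value < 0.01 else "ok"
    print(f"{col:>20}: KS={stat:.3f}  p={p_value:.4f}  {flag}")
```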
Labeling quality: label noise is often more damaging than label scarcity. If your training data was labeled by a process that itself had systematic errors — and most automated labeling pipelines do — your model will faithfully learn those errors. Auditing label quality is unglamorous work that most teams defer until the model is already underperforming in production.
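A minimal audit doesn't require special tooling: draw a random sample, have a trusted reviewer relabel it blind, and estimate the noise rate before trusting any model trained on those labels. A sketch, where `relabel` stands in for your trusted labeling process:
```python
# Sketch: estimate label noise by blind relabeling a random sample.
import math
import random

def audit_labels(dataset, relabel, sample_size=200, seed=0):
    """dataset: list of (example, original_label); relabel: trusted labeler."""
    sample = random.Random(seed).sample(dataset, sample_size)
    disagreements = sum(relabel(x) != label for x, label in sample)
    rate = disagreements / sample_size
    # Normal-approximation 95% interval on the observed noise rate.
    margin = 1.96 * math.sqrt(rate * (1 - rate) / sample_size)
    return rate, margin

# Usage (training_rows and expert.label are hypothetical):
# rate, margin = audit_labels(training_rows, expert.label)
# print(f"estimated label noise: {rate:.1%} +/- {margin:.1%}")
```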
When any of these dimensions falls short, the path of least resistance looks like more data collection, more labeling budget, or more elaborate preprocessing — all of which delay the project and rarely fix the underlying problem.
Factor 4: Reversibility — How Badly Can This Go Wrong?
