The Good Enough Model Selection Trap: Why Your Team Is Overpaying for AI

8 min read
Tian Pan
Software Engineer

Most teams ship their first AI feature on the best model available, because that's what the demo ran on and nobody had time to think harder about it. Then a second feature ships on the same model. Then a third. Six months later, every call across every feature routes to the frontier tier — and the bill is five to ten times higher than it needs to be.

The uncomfortable truth is that 40–60% of the requests your production system processes don't require frontier-level reasoning at all. They require competent text processing. Competent text processing is dramatically cheaper to buy.

How the Default Gets Set

The pattern is consistent across engineering teams: an engineer builds the prototype with the best available model, because that minimizes the number of variables while exploring a new capability. The prototype ships. Product is happy. Engineering moves on. Nobody goes back to ask whether the model choice was actually load-bearing.

There are organizational forces that make this sticky. The argument for keeping the expensive model is easy to make: "It gives the best results, we don't want to regress, customers are happy." The argument for switching is hard to make: "We think we can cut costs, but we might introduce subtle quality regressions." Asymmetric risk tolerance means the default almost always wins.

The result is that teams end up paying frontier-model prices for tasks like intent classification, format conversion, structured extraction from well-formatted inputs, and summarization of short documents — none of which benefit meaningfully from additional reasoning capacity.

The Task-Complexity Audit

Before routing anything differently, you need to understand what your system is actually doing. Pull a sample of 200–500 recent requests and categorize each one by what kind of reasoning it requires.

Most production systems have three buckets:

Pattern-matching tasks require recognizing structure, extracting fields, classifying intent, and transforming formats. The task has a clear right answer that a capable smaller model can reliably produce. Examples: extracting entities from a structured form, routing a support ticket to the right category, converting JSON from one schema to another, summarizing a fixed-template report.

Compositional tasks require combining information across sources or generating coherent output that satisfies multiple constraints simultaneously. A mid-tier model can handle these reliably when the constraints are explicit and the context is well-structured. Examples: drafting a response using provided information, explaining a code change, synthesizing a short report from retrieved documents.

Reasoning-intensive tasks require sustained logical coherence across many steps, novel problem-solving without obvious patterns to match against, or judgment calls where the space of correct answers isn't well-defined. These are where frontier models earn their premium. Examples: architectural decisions with ambiguous trade-offs, multi-step debugging with incomplete information, generating novel hypotheses from complex evidence.

When teams do this audit honestly, the pattern is usually that pattern-matching tasks represent 40–60% of volume, compositional tasks represent 30–40%, and reasoning-intensive tasks represent less than 20%. Most billing horror stories come from routing pattern-matching traffic to frontier models because nobody drew this distinction.
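The audit above can be sketched as a small script. The bucket names come from the taxonomy in this section; the keyword heuristic and request-type labels are illustrative placeholders, and a first pass like this should always be spot-checked by hand:

```python
from collections import Counter

# Complexity buckets from the task-complexity audit.
PATTERN_MATCHING = "pattern_matching"
COMPOSITIONAL = "compositional"
REASONING_INTENSIVE = "reasoning_intensive"

# Hypothetical first-pass heuristic keyed on logged request types;
# real samples should be labeled manually or at least spot-checked.
def categorize(request_type: str) -> str:
    if request_type in {"extract_entities", "classify_intent",
                        "convert_schema", "summarize_template"}:
        return PATTERN_MATCHING
    if request_type in {"draft_response", "explain_change",
                        "synthesize_report"}:
        return COMPOSITIONAL
    return REASONING_INTENSIVE

def audit(sample: list[str]) -> dict[str, float]:
    """Share of each complexity bucket across a sample of requests."""
    counts = Counter(categorize(r) for r in sample)
    return {bucket: counts[bucket] / len(sample)
            for bucket in (PATTERN_MATCHING, COMPOSITIONAL,
                           REASONING_INTENSIVE)}
```

Feed it the 200–500 sampled requests; the resulting shares tell you roughly how much traffic is a candidate for moving off the frontier tier.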

The Pricing Arithmetic

The cost gap between tiers is not a rounding error. As of 2026, a representative cost ladder looks like:

  • Haiku-tier (Claude Haiku, Gemini Flash, GPT-4o mini): roughly $0.80–$4 per million tokens input/output
  • Sonnet-tier (Claude Sonnet, GPT-4o): roughly $3–$15 per million tokens
  • Opus-tier (Claude Opus, GPT-5, o3): roughly $5–$75 per million tokens

A chatbot processing 1,000 conversations daily costs around $12–$50/month on the efficient tier. The same workload on the frontier tier runs $1,000–$3,000/month. For document processing at similar volumes, frontier-tier costs can run 90x higher than efficient-tier alternatives for work that produces equivalent user outcomes.

A real example from production coding agent infrastructure: applying three-tier routing — Opus for architectural planning, Sonnet for implementation, Haiku for file navigation and simple edits — reduces per-session cost by 51% compared to running everything on Opus. The expensive model still runs; it just runs on the work that actually needs it.
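The blended-cost arithmetic is simple enough to sanity-check in a few lines. The prices and per-session token mix below are illustrative assumptions, not the actual production numbers behind the 51% figure; plug in your own:

```python
# Hypothetical $/million-token prices per tier and a hypothetical
# per-session token mix; substitute your own measured values.
PRICE = {"opus": 75.0, "sonnet": 15.0, "haiku": 4.0}

def session_cost(tokens_by_tier: dict[str, float]) -> float:
    """Cost of one session, given millions of tokens per tier."""
    return sum(PRICE[tier] * millions
               for tier, millions in tokens_by_tier.items())

# Everything on Opus vs. three-tier routing: planning on Opus,
# implementation on Sonnet, navigation/simple edits on Haiku.
all_opus = session_cost({"opus": 1.0})
routed = session_cost({"opus": 0.2, "sonnet": 0.5, "haiku": 0.3})
savings = 1 - routed / all_opus
```

The exact savings depend entirely on the token mix per tier, which is why the audit comes first.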

Research on cascaded LLM systems found that combining routing and escalation achieves 97% of frontier accuracy at 24% of frontier cost. The 3% gap frequently falls within the noise of other system-level variability and sits well below user-perceptible quality differences for most task categories.
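One common way to implement routing-plus-escalation is a confidence-gated cascade: try the cheapest tier first and escalate only when it signals low confidence. A minimal sketch; the model callables and the self-reported confidence score are placeholders for whatever your stack actually provides:

```python
from typing import Callable

# Placeholder for a real model call: prompt -> (answer, confidence).
ModelFn = Callable[[str], tuple[str, float]]

def cascade(prompt: str, tiers: list[ModelFn],
            threshold: float = 0.8) -> str:
    """Walk tiers from cheapest to most capable, escalating while
    the current tier's confidence stays below the threshold."""
    answer = ""
    for tier in tiers:
        answer, confidence = tier(prompt)
        if confidence >= threshold:
            return answer
    # Frontier tier's answer is returned even if low-confidence.
    return answer
```

In practice the gate might be a logprob-based score, a verifier model, or a schema-validation check rather than a single scalar.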

Why "The Demo Looked Good" Is a Bad Calibration Signal

The demo is the worst possible signal for model tier selection, for two reasons.

First, demos are cherry-picked. You remember the response that impressed the stakeholder, not the fifty responses that were adequate. When the frontier model produces a subtly more elegant output than the Sonnet model, that's visible in a side-by-side comparison during development. It's invisible to the user who only sees one response.

Second, demos test showcase scenarios, not production distributions. The frontier model earns its advantage on hard cases — ambiguous inputs, edge cases, requests that require genuine reasoning across long context. These represent a small fraction of production volume. The pattern-matching requests that dominate real traffic don't produce perceptibly different outputs between tiers.

The right calibration signal is user behavior under actual conditions. Revision rate, follow-up request rate, session abandonment, task completion — these measure whether users got what they needed, not whether the output was maximally impressive in a controlled comparison.

How to Run the A/B Test

The practical obstacle to model downgrades is fear of silent quality regression. The solution is not to trust your intuition about what matters — it's to measure.

Route a percentage of traffic on a specific feature to the lower-cost model. Measure task-completion signals, not eval accuracy. Eval accuracy on curated test sets doesn't predict user experience; it predicts eval accuracy. The behavioral signals that actually matter are:

  • Revision rate: Does the user edit or redo the output at higher rates under the cheaper model?
  • Follow-up query rate: Does the user ask clarifying questions or corrections more often?
  • Session depth: Do tasks complete in a similar number of turns, or do sessions drag on longer under the cheaper model?
  • Direct rejection: Does the user discard the response without acting on it?

Run this for two to four weeks on moderate traffic volume before drawing conclusions. Most teams discover that for their pattern-matching tasks, the cheaper model produces statistically equivalent outcomes on every behavioral metric. For their compositional tasks, a mid-tier model performs within noise of the frontier tier. The frontier tier becomes necessary only for the reasoning-intensive subset.
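For rate-style metrics like revision rate or direct rejection, "statistically equivalent" can be checked with a standard two-proportion z-test. A stdlib-only sketch; the example counts below are made up:

```python
from math import sqrt, erf

def two_proportion_p(hits_a: int, n_a: int,
                     hits_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two rates,
    e.g. revision rate under the frontier vs. the cheaper model."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    # Standard normal two-sided tail via the error function.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
```

If every behavioral metric comes back non-significant after two to four weeks of traffic, the cheaper tier is behaviorally equivalent for that feature. (Equivalence testing proper, e.g. TOST, is stricter; a non-significant difference on adequate volume is the pragmatic version.)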

The Organizational Problem

The technical work is straightforward. The harder problem is organizational.

Engineering teams often default to frontier models partly because the discussion of model choice creates friction. "We downgraded the AI" is a difficult message to carry to product and business stakeholders who have been told frontier models are better. Even if the A/B test shows no user-perceptible difference, the internal narrative becomes uncomfortable.

The reframe that tends to work: model selection isn't a quality decision, it's a workload matching decision. You wouldn't use a database cluster designed for OLAP queries to serve simple key-value lookups, regardless of whether it could technically do the job. The same logic applies to LLM tiers. Using frontier models for pattern-matching tasks isn't cautious — it's a misallocation of compute resources that would have been more valuable deployed on the reasoning-intensive tasks that actually needed them.

Framing the routing policy as a form of prioritization — freeing expensive capacity for hard problems by offloading easy problems to efficient capacity — tends to land better than framing it as a cost cut.

Building the Routing Policy

The starting point is static routing: define rules per feature or per task category, not per individual request. Static routing is deterministic, requires no additional latency, and is auditable. It covers the majority of high-value routing decisions without complexity.

  • Default to mid-tier for any feature that hasn't been explicitly analyzed. Mid-tier models are capable enough for most production work and cheap enough that the optimization isn't urgent.
  • Default to efficient tier for features that are structurally pattern-matching: extraction, classification, formatting, simple summarization.
  • Reserve frontier tier for features where you've measured that the task distribution genuinely requires sustained reasoning, or where errors are disproportionately costly.

If your traffic patterns are variable enough to warrant dynamic routing, use task complexity signals — input length, request type, presence of ambiguous constraints — to escalate selectively. The key discipline is requiring evidence before adding a feature to the frontier tier, rather than requiring evidence before removing it.
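A static routing table plus a narrow escalation path can be sketched in a few lines. The feature names, tier labels, and thresholds are hypothetical; the structure is the point — mid-tier by default, efficient tier only for analyzed features, frontier only with evidence:

```python
EFFICIENT, MID, FRONTIER = "haiku", "sonnet", "opus"

# Per-feature rules; hypothetical feature names. Frontier entries
# should exist only where measurement justified them.
ROUTING_TABLE = {
    "intent_classification": EFFICIENT,
    "schema_conversion": EFFICIENT,
    "report_summarization": EFFICIENT,
    "draft_reply": MID,
    "architecture_review": FRONTIER,
}

def route(feature: str, input_tokens: int = 0,
          ambiguous: bool = False) -> str:
    """Mid-tier by default for unanalyzed features; escalate one
    tier on complexity signals (thresholds are illustrative)."""
    tier = ROUTING_TABLE.get(feature, MID)
    if tier != FRONTIER and (ambiguous or input_tokens > 50_000):
        tier = MID if tier == EFFICIENT else FRONTIER
    return tier
```

Because the table is data, the routing policy stays auditable: a diff to this dict is the entire change record for a tier decision.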

Closing the Loop

Model costs compound over time because product surface area grows and traffic grows, but model selection decisions rarely get revisited once features ship. The teams that manage this well build a simple cost attribution layer: per-feature token spend, aggregated weekly. When a feature's spend crosses a threshold, that triggers a routing review.
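The cost attribution layer described above needs little more than an aggregation over spend logs. A sketch assuming records of the form `(day, feature, dollars)`; the $500 weekly threshold is an arbitrary placeholder:

```python
from collections import defaultdict
from datetime import date, timedelta

def weekly_spend(records, week_start: date) -> dict[str, float]:
    """Aggregate per-feature spend for the week starting week_start."""
    week_end = week_start + timedelta(days=7)
    totals: dict[str, float] = defaultdict(float)
    for day, feature, dollars in records:
        if week_start <= day < week_end:
            totals[feature] += dollars
    return dict(totals)

def features_needing_review(totals: dict[str, float],
                            threshold: float = 500.0) -> list[str]:
    """Features whose weekly spend crossed the review threshold."""
    return sorted(f for f, spend in totals.items()
                  if spend >= threshold)
```

Wiring the output into a weekly report or alert is what turns a one-off audit into a standing routing review.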

This isn't about aggressive penny-pinching. It's about ensuring that frontier-tier capacity is reserved for the work where it genuinely moves the needle. Running a document classification feature on Opus because nobody went back to check the original decision is a waste of resources that could fund additional experimentation, additional capacity, or better infrastructure for the tasks that actually need it.

The goal is not the cheapest possible AI system. The goal is a system where every dollar of inference cost is attached to a task that actually requires what it's buying.
