Reasoning Models in Production: When to Use Them and When Not To
Most teams that adopt reasoning models make the same mistake: they start using them everywhere. A new model drops with impressive benchmark numbers, and within a week it's handling customer support, document summarization, and the two genuinely hard problems it was actually built for. Then the infrastructure bill arrives.
Reasoning models — o3, Claude with extended thinking, DeepSeek R1, and their successors — are legitimately different from standard LLMs. They perform an internal chain-of-thought before producing output, spending more compute cycles to search through the problem space. That extra work produces real gains on tasks that require multi-step logic. It also costs 5–10× more per request and adds 10–60 seconds of latency. Neither of those is acceptable as a default.
What Makes Reasoning Models Different
Standard LLMs predict the next token based on the input context. Reasoning models extend this with an intermediate step: before producing the final answer, they generate an internal thinking trace — a sequence of tokens representing deliberation, backtracking, and hypothesis testing. This trace is either hidden (OpenAI's o-series) or surfaced to the developer (Claude's extended thinking, which exposes the thinking block via the API).
The mechanical result is that the model can explore multiple solution paths and discard wrong ones before committing to an answer. On problems where the first reasonable-looking answer is often wrong — complex algorithm analysis, multi-step math, security vulnerability detection — this matters a lot. On problems where the first reasonable-looking answer is usually correct — summarization, format conversion, sentiment classification — it changes almost nothing.
Claude's recent API changes illustrate the direction the field is heading. The older budget_tokens parameter, which required developers to manually set a reasoning token cap, has been deprecated in favor of adaptive thinking (thinking: { type: "adaptive" }). The model now decides whether to reason and how deeply based on the request itself. This reduces the configuration burden but doesn't change the fundamental cost structure: when the model reasons, you pay for thinking tokens at output rates.
The 10–20% Rule
The clearest signal from teams that have successfully deployed reasoning models in production: they apply them to only 10–20% of requests, and those requests share a common set of properties.
Tasks that genuinely benefit from reasoning:
- Verifiable correctness is required. The answer is right or wrong in a way that can be checked — code that compiles and passes tests, a math result that can be verified, a logic puzzle with a definite solution.
- Multi-step dependency chains. The correct answer at step 5 depends on correctly completing steps 1–4. Standard models can lose track of intermediate state; reasoning models maintain it.
- Rare but high-cost failures. A wrong security audit, an incorrect architecture recommendation, a missed race condition in concurrent code — the downstream cost of a failure exceeds the cost of extra compute.
- High latency tolerance. Nobody is waiting for a 30-second response in a chat window. Reasoning models work in async pipelines, batch jobs, or workflows where the user expects to wait.
Tasks that don't benefit:
- Pattern matching at scale. Summarization, classification, and extraction are learned pattern mappings. More compute cycles don't help when the bottleneck is recognition, not reasoning.
- Conversational interfaces. Users expect sub-2-second responses. A 15-second thinking pause destroys the interaction model, regardless of answer quality.
- Creative and subjective outputs. There's no verifiable correct answer for a marketing email or a product description. Extended thinking produces overthinking, not better writing.
- High-volume, low-margin tasks. If you're running a million requests per day and margins are tight, a 10× cost multiplier isn't a tradeoff — it's a shutdown.
Building a Routing Architecture
The practical solution isn't "use reasoning models" or "don't use reasoning models." It's a classification-first architecture that routes each request to the appropriate model tier before any expensive inference happens.
A minimal routing system has three components:
Complexity classifier. A lightweight model (or even a rules-based system) evaluates the incoming request and outputs a complexity signal. Good features: token count, presence of multi-step dependencies in the prompt, whether the task maps to a known category, historical accuracy for similar requests. The classifier must be fast — millisecond latency — so it doesn't become the bottleneck.
Model tiers. Map complexity buckets to model options. A simple three-tier setup: fast/cheap model for simple requests (summarization, lookup, formatting), standard frontier model for moderate complexity, reasoning model for verified-hard problems. The tiers should be calibrated to your actual workload distribution.
Feedback loop. Track accuracy per tier against ground truth or human review. The routing decision is a hypothesis about which model is appropriate; the accuracy data tells you whether the hypothesis holds. Adjust tier thresholds as your data accumulates.
```python
async def route_request(prompt: str, task_type: str) -> str:
    complexity = await classify_complexity(prompt, task_type)
    if complexity == "simple":
        return await call_model("claude-haiku-4-5", prompt)
    elif complexity == "moderate":
        return await call_model("claude-sonnet-4-6", prompt)
    else:
        # Reasoning model for genuinely hard problems
        return await call_model(
            "claude-opus-4-6",
            prompt,
            thinking={"type": "adaptive"},
        )
```
One architectural trap: don't make the router stateless. Requests that look simple in isolation can be complex given conversation history or user context. Pass relevant context to the classifier, or you'll route incorrectly on the cases that matter most.
Cache Reasoning Outputs Aggressively
Reasoning model outputs are expensive to generate and often deterministic enough to cache. A cache hit on a reasoning model request recovers the full cost premium. This changes the economics considerably for workflows with repeated or near-repeated inputs.
Standard caching TTLs are often too short for reasoning outputs. If a code review produces the same architectural recommendation every time the same module is submitted, that output should be cached for hours, not minutes. Reason: the reasoning process itself is expensive, and the answer is unlikely to change unless the code changes.
For applications with predictable input patterns — a code review bot that processes similar functions daily, a compliance checker that evaluates similar clauses weekly — pre-warming the cache with common inputs is worth the upfront cost. You pay once at warm-up, then serve from cache for the duration.
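A content-addressed cache with a long TTL captures both points above: the key is a hash of the normalized input, so a re-submitted module hits the cache, and pre-warming is just calling `put` ahead of demand. A minimal in-memory sketch (the 6-hour default TTL is an illustrative assumption):

```python
import hashlib
import time

class ReasoningCache:
    """In-memory TTL cache keyed by a hash of the normalized input.
    Long TTLs are deliberate: reasoning outputs are expensive to
    regenerate and stable while the underlying input is unchanged."""

    def __init__(self, ttl_seconds: float = 6 * 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (stored_at, value)

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalize whitespace so trivially reformatted inputs still hit.
        normalized = " ".join(prompt.split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, prompt: str):
        entry = self._store.get(self._key(prompt))
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            return None
        return value

    def put(self, prompt: str, result: str) -> None:
        self._store[self._key(prompt)] = (time.monotonic(), result)
```

Production systems would back this with Redis or similar, but the economics are the same: one reasoning-tier call amortized across every subsequent hit.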
What to Monitor
Standard LLM monitoring tracks latency, token counts, and error rates. Reasoning models need additional instrumentation:
- Thinking token distribution. How many thinking tokens are consumed per request, as a histogram. Bimodal distributions often indicate routing failures: you're sending simple requests to the reasoning tier.
- Cost per tier as a percentage of total spend. If the reasoning tier is consuming more than 30–40% of your LLM budget but handling 10% of requests, your routing thresholds are off.
- Accuracy by tier. Do requests routed to the reasoning tier actually benefit from it? If accuracy in the reasoning tier isn't meaningfully higher than the standard tier for the same task types, recalibrate.
- Latency percentiles. P50 and P99 latency by tier and task type. Reasoning model P99s can be very long and should be bounded with explicit timeouts.
One monitoring antipattern: tracking overall accuracy without tier breakdown. It obscures whether the reasoning model is earning its cost premium. You need per-tier accuracy to know if the routing decision is correct.
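The per-tier breakdown above can be captured with a small metrics recorder. A sketch, assuming correctness labels arrive from ground truth or human review and cost is tracked in whatever unit you bill in:

```python
from collections import defaultdict

class TierMetrics:
    """Tracks per-tier accuracy and cost share, so the reasoning tier's
    premium can be compared against its measured accuracy gain."""

    def __init__(self):
        self.requests = defaultdict(int)
        self.correct = defaultdict(int)
        self.cost = defaultdict(float)

    def record(self, tier: str, correct: bool, cost: float) -> None:
        self.requests[tier] += 1
        self.correct[tier] += int(correct)
        self.cost[tier] += cost

    def accuracy(self, tier: str) -> float:
        n = self.requests[tier]
        return self.correct[tier] / n if n else 0.0

    def cost_share(self, tier: str) -> float:
        # Fraction of total spend consumed by this tier.
        total = sum(self.cost.values())
        return self.cost[tier] / total if total else 0.0
```

If `cost_share("reasoning")` drifts well past the 30–40% line while its accuracy edge over the standard tier stays flat, that is the recalibration signal described above.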
The Fallback Cascade
For latency-sensitive applications that occasionally encounter hard problems, a cascade pattern works better than static routing. Start with a fast, cheap model. Evaluate confidence. If confidence is low — measured by output entropy or a lightweight verifier — escalate to the next tier.
```python
async def cascade_request(prompt: str) -> str:
    # Attempt 1: fast model
    result, confidence = await call_with_confidence("claude-haiku-4-5", prompt)
    if confidence > CONFIDENCE_THRESHOLD:
        return result
    # Escalate to standard model
    result, confidence = await call_with_confidence("claude-sonnet-4-6", prompt)
    if confidence > CONFIDENCE_THRESHOLD:
        return result
    # Last resort: reasoning model
    return await call_model(
        "claude-opus-4-6",
        prompt,
        thinking={"type": "adaptive"},
    )
```
The cascade adds latency on escalated requests, but those requests are the ones where the latency cost is already justified by the problem's difficulty. The fast path remains fast.
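The confidence signal inside `call_with_confidence` is left abstract; the entropy-based option mentioned earlier can be sketched from per-token log-probabilities, assuming your provider exposes them. This is a crude proxy, not a calibrated probability of correctness:

```python
import math

def confidence_from_logprobs(token_logprobs: list) -> float:
    """Map mean per-token log-probability to a 0-1 confidence score.
    High mean logprob (low entropy) suggests the model committed to its
    answer; the escalation threshold is workload-specific and must be
    tuned against observed accuracy."""
    if not token_logprobs:
        return 0.0
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    # exp of the mean logprob is the geometric mean token probability.
    return math.exp(mean_logprob)
```

A lightweight verifier model, as the text notes, is the other option: cheaper to calibrate, but it adds a second call to the fast path.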
The Decision in Practice
Before using a reasoning model, answer three questions:
- Is there a verifiable correct answer? If correctness can't be checked, the reasoning doesn't add signal.
- Is 10–60 seconds of latency acceptable for this request? If not, rule out synchronous reasoning model calls.
- Does a wrong answer here cost more than the extra compute? If the failure mode is low-stakes, the standard model is sufficient.
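The three questions can be encoded as a gate, useful as a code-review checklist for any new reasoning-tier integration. Parameter names and the 60-second worst-case bound are illustrative assumptions:

```python
def should_use_reasoning(
    verifiable: bool,          # is there a checkable correct answer?
    latency_budget_s: float,   # how long can this request wait?
    failure_cost: float,       # downstream cost of a wrong answer
    compute_premium: float,    # extra cost of the reasoning tier
) -> bool:
    """Encodes the three-question check. All thresholds are assumptions."""
    if not verifiable:
        return False  # reasoning adds no signal without verifiability
    if latency_budget_s < 60:
        return False  # can't absorb worst-case thinking time
    return failure_cost > compute_premium
```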
Most production requests fail at least one of these checks. The ones that pass all three are exactly the requests where reasoning models justify their cost. That's typically a small fraction of total volume — but it's the fraction that matters most.
The mistake isn't using reasoning models. It's treating them as a drop-in upgrade rather than a specialized tool with real constraints on where it applies.
