
Reasoning Models in Production: When They Help and When They Hurt

9 min read
Tian Pan
Software Engineer

A team building a support triage system switched their classification pipeline from GPT-4o to o3. Accuracy improved by 2%. Costs went up by 900%. The latency jumped from 400ms to 12 seconds. They switched back.

This is the most common story in production AI right now. Reasoning models represent a genuine capability leap — o3 solved 25% of problems on the FrontierMath benchmark when no previous model had exceeded 2%. But that capability comes with a cost and latency profile that makes them wrong for the majority of tasks in the average production system. Knowing the difference is one of the more valuable things an AI engineer can internalize.

What Makes a Reasoning Model Different

The term "reasoning model" has gotten fuzzy, but it describes something specific architecturally. Standard instruct models — GPT-4o, Claude Haiku, Gemini Flash — take your prompt and immediately begin generating the response token by token. The intelligence is baked into the weights; the model does one forward pass.

Reasoning models move chain-of-thought inside the model at inference time. Before generating the visible response, they run extended internal iterations — searching the solution space, testing hypotheses, backtracking. You're paying for inference compute that happens before the first visible token appears.

There are two implementation patterns. Hidden reasoning (o-series) runs the thinking process internally; you only see the final answer, but you get billed for the reasoning tokens. Visible reasoning (DeepSeek R1, Claude extended thinking, Gemini 2.5 thinking) surfaces the chain-of-thought as an inspectable trace — useful for debugging, and for building trust with users who want to see why the model reached a conclusion.
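The two patterns show up directly in the request shape. A sketch of both as plain payload dicts: parameter names (`reasoning_effort`, `thinking.budget_tokens`) follow the OpenAI and Anthropic APIs at the time of writing, and the model names are illustrative, so verify both against current provider docs before relying on them.

```python
# Hidden reasoning (o-series): you pick an effort level, not a token budget.
# The chain-of-thought stays internal; you are billed for it but never see it.
openai_payload = {
    "model": "o3",
    "reasoning_effort": "medium",   # "low" | "medium" | "high"
    "messages": [{"role": "user", "content": "Plan the schema migration."}],
}

# Visible reasoning (Claude extended thinking): you set an explicit token
# budget, and the thinking trace comes back as inspectable content blocks.
anthropic_payload = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 2048,
    "thinking": {"type": "enabled", "budget_tokens": 1024},
    "messages": [{"role": "user", "content": "Plan the schema migration."}],
}
```

The asymmetry matters operationally: with the hidden pattern you tune a coarse effort knob and audit costs after the fact; with the visible pattern you cap spend up front and can log the trace for debugging.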

The key architectural insight, from recent research: base models already know how to reason. Reasoning models don't acquire fundamentally new capabilities through training — they learn when to deploy the reasoning they already have, trained to allocate thinking depth based on problem difficulty. At low effort settings, they behave more like instruct models. At high effort, they burn significant compute exploring the solution space.

Where They Actually Win

The performance gap on genuinely hard multi-step problems is real and large. Some concrete numbers:

On AIME 2024 (competitive math exam), GPT-4o scored 12%. o1 scored 83.3%. o3 scored 96.7%. That's not a marginal improvement — it's the difference between a model that can't do the problem and one that can.

For graduate-level science problems (GPQA Diamond across chemistry, physics, and biology), o3 scored 87.7%. Standard instruct models typically land in the 50-60% range on this benchmark.

For agentic coding on SWE-Bench Verified, o3 achieves 69.1% and Claude's latest models push past 72%. These are tasks that require the model to understand a codebase, identify the root cause of a bug, implement a fix, and write tests — genuinely multi-step, with real state to track.

The pattern here is consistent: reasoning models win on tasks that require searching over a solution space — where there are multiple candidate approaches, where intermediate steps can be wrong in ways that cascade forward, where the correct answer requires hypothesis testing and backtracking. Graduate-level math, competitive programming, multi-step planning with dependencies, ambiguous instructions with multiple valid interpretations.

If your task has this structure, reasoning models are worth the cost. The capability gap is not benchmarking noise.

Where They Don't

The failure mode is subtler than "reasoning models are expensive." They actively hurt on certain task types.

For tasks with single right answers that require pattern recognition rather than search — classification, entity extraction, translation, summarization — the reasoning process doesn't help and can introduce errors. The model continues reasoning after arriving at a correct answer, generating redundant verification steps that sometimes overwrite correct intermediate results. Ask a reasoning model "Which is larger, 0.9 or 0.11?" and it may spend 20+ seconds generating unnecessary verification, occasionally talking itself out of the right answer.

For high-volume pipelines, the latency math becomes untenable. Reasoning model time-to-first-token sits at 2-10 seconds on most providers (o3 measures around 6.5 seconds on independent benchmarks). A pipeline applying a reasoning model to 1 million records at 30 seconds per query, run sequentially, takes roughly a year. That's not a deployment concern — it's a product design constraint. Reasoning models don't belong in batch pipelines unless heavily parallelized and budget-capped.
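The batch arithmetic is worth making concrete. A minimal sketch, using the numbers from the text; the parallelism factor is a hypothetical knob and assumes perfect parallel scaling, which real rate limits won't give you:

```python
def batch_wall_clock_days(records: int, seconds_per_query: float,
                          parallelism: int = 1) -> float:
    """Wall-clock days to process a batch, assuming ideal parallel scaling."""
    return records * seconds_per_query / parallelism / 86_400  # 86,400 s/day

# Sequential: 1M records at 30s each is close to a year of wall-clock time.
sequential_days = batch_wall_clock_days(1_000_000, 30)                    # ~347 days
# Even at 100-way parallelism it is still days of compute, not minutes.
parallel_days = batch_wall_clock_days(1_000_000, 30, parallelism=100)     # ~3.5 days
```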

For user-facing applications with sub-second SLA expectations, even streaming with hidden thinking only partially helps. You can hide the thinking tokens from the response stream, so the text starts flowing sooner — but you still pay for them, and the response won't start until enough thinking has completed.

The Hidden Token Problem

The billing model for reasoning tokens is the part that surprises engineers most. When you send 50 input tokens and receive 100 visible output tokens, you may be billed for 650 tokens total: the extra 500 are hidden reasoning tokens that never appear in your response.
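The billing mechanics can be sketched in a few lines. Reasoning tokens are generally billed at the output rate; the per-million prices below are illustrative placeholders, not anyone's current list prices:

```python
def billed_cost(input_tok: int, visible_tok: int, reasoning_tok: int,
                in_price: float, out_price: float) -> float:
    """Dollar cost of one request. Hidden reasoning tokens are billed at the
    output rate even though they never appear in the response. Prices are per
    1M tokens and purely illustrative."""
    return (input_tok * in_price + (visible_tok + reasoning_tok) * out_price) / 1_000_000

# The example from the text: 50 input, 100 visible, 500 hidden reasoning tokens,
# with hypothetical rates of $2/M input and $8/M output.
cost = billed_cost(50, 100, 500, in_price=2.0, out_price=8.0)
# 500 of the 600 output-rate tokens are reasoning you never saw.
```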

At high effort, reasoning token counts can dwarf the visible response. An independent evaluation found o3 generating 44 million tokens across 7 benchmarks, costing over $2,700 in a single evaluation run. A complex code review that costs $0.02 with a standard model can cost $0.15-$0.20 with a reasoning model at default settings.

The cost structure is non-linear. At low effort settings, o-series reasoning models are competitive with or cheaper than GPT-4o equivalents. At high effort, costs diverge dramatically. The effort/budget parameter is not a quality dial — it's a cost dial. Treating it as "set it high for important tasks" without understanding your cost envelope is how teams end up with 10x billing surprises.

Practical guidance: start at minimum budget (1,024 thinking tokens for Claude's API), benchmark accuracy, then increase only until you hit your accuracy target. Most production tasks plateau well before the maximum budget.
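That escalation procedure is mechanical enough to automate. A sketch, where `evaluate` is a stand-in for your own benchmark harness (run your eval set at a given thinking budget, return accuracy in [0, 1]); the budget ladder is an assumption, not a provider requirement:

```python
def find_min_budget(evaluate, target_accuracy: float,
                    budgets=(1024, 2048, 4096, 8192, 16384)):
    """Walk thinking budgets upward; stop at the first one meeting the target."""
    for budget in budgets:
        if evaluate(budget) >= target_accuracy:
            return budget
    return None  # even the max budget missed the target; rethink the task

# Toy harness: accuracy plateaus at 0.91 from 4096 tokens onward.
fake_eval = lambda b: {1024: 0.82, 2048: 0.88}.get(b, 0.91)
find_min_budget(fake_eval, target_accuracy=0.90)  # → 4096
```

The `None` branch is the important one: if the maximum budget still misses your target, more thinking is probably not the fix.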

The Routing Solution

The team that switched back to GPT-4o for classification was right to do so. But the more sophisticated answer isn't "pick one model" — it's building a routing layer that directs each query to the appropriate model.

RouteLLM-style routing, using a lightweight classifier trained on query characteristics, has demonstrated 85%+ cost reduction on general benchmarks while maintaining 95% of peak-model quality. A hybrid routing system that reserved reasoning models for analytically complex queries achieved 40-46% reduction in total LLM usage, 32-38% latency improvement for simpler queries, and 39% overall cost reduction in production deployments.

The routing classifier itself doesn't need to be a reasoning model — it needs to categorize query complexity quickly. That can be a fine-tuned small model, a rule-based filter, or even a fast instruct model with a structured prompt. The signal is usually apparent from query length, presence of mathematical notation, multi-step phrasing ("first... then... then..."), and domain signals.
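The rule-based tier can be surprisingly small. A sketch using the signals named above; the regexes and thresholds are illustrative and should be tuned against your own traffic, not copied as-is:

```python
import re

# Crude complexity signals: math notation and multi-step phrasing.
MATH_HINTS = re.compile(r"[∑∫√^=<>]|\\frac|\\sum|\d+\s*[*/+-]\s*\d+")
MULTI_STEP = re.compile(r"\b(first|then|after that|finally|step \d)\b", re.IGNORECASE)

def route(query: str) -> str:
    """Rule-based complexity filter, the cheapest tier of a routing layer.
    Requires two independent complexity signals before paying for reasoning."""
    score = 0
    score += len(query) > 400              # long queries tend to be complex
    score += bool(MATH_HINTS.search(query))
    score += bool(MULTI_STEP.search(query))
    return "reasoning" if score >= 2 else "instruct"

route("Classify this ticket: password reset email never arrived")        # → "instruct"
route("First derive the recurrence, then prove 2^n = sum_k C(n,k).")     # → "reasoning"
```

Queries this filter can't decide confidently can fall through to the next tier: a fine-tuned small classifier or a fast instruct model with a structured prompt.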

GPT-5 itself now implements this pattern internally, routing queries between a fast efficient model and a deeper reasoning model based on assessed complexity. When the model provider is building model cascades into the API tier, it's a signal that this is now table stakes for production systems.

The Hybrid Planning Pattern

The most valuable production architecture for complex agentic systems isn't "use reasoning models throughout" — it's using them exclusively as planners.

Reasoning models are good at searching over a solution space to produce a plan: decomposing a goal into subtasks, identifying dependencies, choosing between approaches, writing the orchestration logic. That search happens once, at the beginning. The execution of individual steps — making specific API calls, formatting outputs, running templated queries, applying classification to individual records — is low-ambiguity, repetitive, and well-suited to fast instruct models.

In practice: a reasoning model receives the high-level goal and produces a structured task plan. Fast instruct models execute each step. The reasoning model optionally reviews the final output or handles exceptions. The planner runs once; the executor runs many times. Cost scales with execution frequency, not with thinking depth.
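The loop above can be sketched as a plain function. The two callables are stand-ins for model calls, stubbed here so the shape is visible without any API: the planner runs once, the executor once per step.

```python
def run_hybrid(goal, plan_with_reasoning_model, execute_with_instruct_model):
    """Planner/executor pattern: one expensive planning call, many cheap
    execution calls. Cost scales with len(plan), not with thinking depth."""
    plan = plan_with_reasoning_model(goal)          # expensive search, runs once
    results = []
    for step in plan:
        results.append(execute_with_instruct_model(step))  # cheap, runs N times
    return results

# Stub "models" to show the control flow (no API calls).
planner = lambda goal: [f"{goal}: step {i}" for i in range(3)]
executor = lambda step: f"done({step})"
results = run_hybrid("migrate schema", planner, executor)
```

A production version would add the optional review pass from the text: route any step the executor flags as ambiguous back to the reasoning model instead of retrying blindly.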

This maps to how Claude's adaptive thinking is designed for agentic loops: the model allocates more thinking to ambiguous or high-stakes decision points and less to routine tool calls. You're paying for reasoning where it matters and skipping it where it doesn't.

Prompting Anti-Patterns

Engineers who use reasoning models like they use instruct models run into consistent problems.

Few-shot examples that improve instruct model accuracy often degrade reasoning model accuracy. The model's internal search process is disrupted by example-based constraints. On DeepSeek R1, zero-shot prompting empirically outperforms few-shot. If your instinct is to add examples to improve consistency, you may be trading accuracy for familiarity.

"Think step by step" instructions are redundant. The model is already doing this internally. Adding it costs tokens without benefit.

Elaborate system prompts that guide instruct models can constrain the reasoning search space. Keep system prompts short and direct. Reasoning models produce comprehensive responses by default — explicitly request brevity when you need it.

Reasoning models' knowledge coverage is narrower than that of broad instruct models. They trade general world-knowledge breadth for reasoning depth. Don't rely on general knowledge recall; inject domain context explicitly.

And finally: benchmark scores and your data are different things. Independent evaluations on modified puzzles show reasoning models scoring 20-30 points lower than headline benchmark numbers on adapted problems. Test on your data before committing.

What to Build With

The practical decision framework:

Use a reasoning model when: the problem has multiple candidate approaches, mistakes in intermediate steps cascade forward, correctness is verifiable, and latency tolerance is at least 5-10 seconds. Mathematical calculations, code generation with test suites, multi-step plans with dependencies, problems where you'd naturally think for several minutes before answering.

Use a fast instruct model when: the task is classification, extraction, translation, summarization, or factual lookup; when you're processing more than a few thousand queries per day; when latency SLAs are under 2 seconds; or when the task is one where a human expert would answer immediately without deliberation.

Build a routing layer when: your application handles diverse query types with variable complexity. This is the configuration that gets you near-peak accuracy at instruct-model costs for most queries.
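The three rules above can be written as a literal dispatch. A sketch only: the field names on `task` are hypothetical, and the thresholds (2s SLA, a few thousand queries/day, 5s latency tolerance) are the ones from the text.

```python
SHALLOW_TASKS = {"classification", "extraction", "translation",
                 "summarization", "lookup"}

def choose_model(task: dict) -> str:
    """Decision framework as code. `task` fields are hypothetical:
    diverse_query_mix, kind, daily_volume, latency_sla_s, multi_step."""
    if task.get("diverse_query_mix"):
        return "routing-layer"
    if (task.get("latency_sla_s", 0) < 2
            or task.get("daily_volume", 0) > 5_000
            or task.get("kind") in SHALLOW_TASKS):
        return "instruct"
    if task.get("multi_step") and task.get("latency_sla_s", 0) >= 5:
        return "reasoning"
    return "instruct"  # default to the cheap model; escalate only on evidence

choose_model({"kind": "classification", "daily_volume": 50_000})   # → "instruct"
choose_model({"multi_step": True, "latency_sla_s": 30})            # → "reasoning"
```

Note the default branch: the cheap model wins ties. That bias is deliberate, since the failure mode of under-reasoning is visible in your evals, while the failure mode of over-reasoning shows up only on the invoice.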

The reasoning model ecosystem is moving fast — effort-level tuning, adaptive thinking, and model cascades built into the API tier are all relatively new. The capability gap will likely close on some tasks and widen on others. What won't change is the underlying tradeoff: internal search costs compute and time. Tasks that benefit from search should pay for it. Tasks that don't shouldn't.
