Prompting Reasoning Models Differently: Why Your Existing Patterns Break on o1, o3, and Claude Extended Thinking
Most teams adopting reasoning models do the same thing: they copy their existing system prompt, point it at o1 or Claude Sonnet with extended thinking, and assume the model upgrade will do the rest. Benchmarks improve. Production accuracy stays flat — or drops. The issue isn't the model. It's that the mental model for prompting never changed.
Reasoning models don't work like instruction-following models. The strategies that squeeze performance out of GPT-4o — elaborate system prompts, carefully curated few-shot examples, explicit "think step by step" instructions — were designed for a different inference architecture. Applied to reasoning models, they constrain the exact thing that makes these models valuable.
This post is a practical guide to the differences that matter and the adjustments that actually work.
The Architectural Gap That Explains Everything
Standard LLMs are optimized to predict the next token given the context window. They're very good at matching patterns: if you show them examples of how you want output formatted, they'll format output that way. If you tell them to think step by step, they'll output text that looks like step-by-step thinking.
Reasoning models operate differently. Before generating a response, they run an internal search process — a hidden chain-of-thought that explores and prunes solution paths before committing to an answer. This thinking happens outside the visible output (or in a separate thinking block, depending on the implementation). The final response is not the thinking; it's what the model decided after thinking.
This distinction reshapes what a "good prompt" looks like. For instruction-following models, you're shaping token prediction. For reasoning models, you're scoping a search problem. Instructions that help with the first often interfere with the second.
Why "Think Step by Step" Is Dead Weight
The classic chain-of-thought prompt — "think step by step," "reason through this carefully," "work through each part before answering" — is one of the most well-validated techniques in LLM prompt engineering. It consistently improves accuracy on instruction-following models by forcing the model to surface intermediate reasoning as output tokens, which then inform subsequent token predictions.
On reasoning models, it does nothing useful. The model is already thinking step by step internally. You're asking it to do what it's already doing, at the cost of tokens and potential interference.
OpenAI's official guidance for the o-series models makes this explicit: these models perform best when you trust their inherent reasoning abilities rather than directing their reasoning process. Adding chain-of-thought instructions doesn't improve the internal search — it just adds noise to the context window.
This isn't a subtle performance difference. Teams migrating elaborate CoT-heavy prompts to reasoning models sometimes see accuracy decrease because the model is allocating context and attention to instructions that compete with the reasoning it would have done anyway.
The Few-Shot Reversal
For standard models, few-shot prompting is the highest-ROI technique in the playbook. Providing 4–8 well-chosen examples dramatically improves output format, style, and task accuracy. The research showing this is extensive and largely uncontested.
For reasoning models, the finding flips. A 2025 paper revisiting chain-of-thought prompting found that for strong reasoning models like the Qwen2.5 series, adding traditional CoT exemplars did not improve performance compared to zero-shot. The underlying mechanism: strong models don't learn from examples the way weaker models do. They largely ignore the examples and respond to the task description. The examples constrain the model's exploration space without providing useful signal.
DeepSeek documented the same issue with reasoning language models: standard few-shot CoT demonstrations hurt performance, which is why their own evaluation benchmarks use direct inference without demonstrations. OpenAI's guidance for the o-series is similarly explicit: "Reasoning models often don't need few-shot examples to produce good results."
The practical consequence for practitioners: your default for reasoning models should be zero-shot. Only add examples if you have empirical evidence they help for your specific task — and test that assumption with ablations, not intuition.
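An ablation like this can be run as a simple harness. The sketch below assumes a `call_model` function standing in for your actual model client and a per-item `score` function; the prompt builders are the substantive part — the zero-shot variant is just the task description plus the input.

```python
# Zero-shot vs. few-shot ablation sketch. `call_model` and `score` are
# placeholders for your model client and your task-specific grader.
from typing import Callable

def build_zero_shot(task: str, query: str) -> str:
    """Minimal prompt: task description plus the input, nothing else."""
    return f"{task}\n\nInput:\n{query}"

def build_few_shot(task: str, query: str, examples: list[tuple[str, str]]) -> str:
    """Same task description, with demonstrations prepended."""
    shots = "\n\n".join(f"Input:\n{x}\nOutput:\n{y}" for x, y in examples)
    return f"{task}\n\n{shots}\n\nInput:\n{query}"

def ablate(dataset, call_model: Callable[[str], str], score) -> dict:
    """Run both prompt variants over the same items and compare accuracy."""
    totals = {"zero_shot": 0, "few_shot": 0}
    for item in dataset:
        variants = [
            ("zero_shot", build_zero_shot(item["task"], item["query"])),
            ("few_shot", build_few_shot(item["task"], item["query"], item["examples"])),
        ]
        for name, prompt in variants:
            totals[name] += score(call_model(prompt), item["expected"])
    return {name: correct / len(dataset) for name, correct in totals.items()}
```

If the zero-shot column matches or beats few-shot on your evaluation set, drop the examples.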
Long System Prompts Are a Tax, Not an Asset
With instruction-following models, longer and more detailed system prompts generally produce better results up to a point. Spelling out the expected output format, persona, constraint list, and behavioral guidelines all help. The model is a pattern matcher; the more patterns you provide, the more precisely it matches.
Reasoning models invert this relationship. Extended thinking needs space to explore. A system prompt that enumerates 20 behavioral constraints doesn't teach the model to reason better — it creates 20 competing objectives that the model optimizes against instead of exploring the full solution space. The model satisfies the constraints rather than searching for the best answer.
Claude's guidance for extended thinking specifically notes this: elaborate system prompts tend to constrain the reasoning search space in ways that reduce quality on complex tasks. The sweet spot is a short, goal-focused prompt that states what you need without dictating how to get there.
This is a difficult adjustment for teams with years of accumulated system prompt engineering. The instinct to specify everything precisely is hard-won and usually correct for instruction-following models. On reasoning models, that same instinct produces worse output. Shorter is often better — not because precision doesn't matter, but because reasoning models fill in the gaps differently than pattern-matching models do.
Multi-Objective Instruction Failure
A related failure mode appears when you load a reasoning model with multiple competing objectives in the same prompt. Standard models handle this with reasonable grace because they're balancing objectives during token prediction. Reasoning models handle it differently: the internal search process gets anchored on satisfying the constraint set, and the model finds a locally optimal solution that clears the requirements rather than a globally optimal one.
The symptom looks like a model that's technically compliant but misses the point. Every instruction gets followed, but the actual task quality is lower than it would have been with a simpler prompt that named only the core goal.
The fix is prompt decomposition: if you have multiple objectives, prioritize ruthlessly and state only the most important one or two. Route complex multi-objective tasks to specialized prompts rather than trying to handle everything in one system prompt. This is counterintuitive if you've built your system around a single "god prompt" that handles all cases — but reasoning models reward narrower, deeper prompts over broader, shallower ones.
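Routing can be as simple as a lookup from task type to a narrow, single-goal prompt. This is a sketch; the task types and prompt strings are illustrative, not a recommended taxonomy:

```python
# One narrow prompt per task type, instead of a single "god prompt"
# that enumerates every objective. Prompts below are illustrative.
SPECIALIZED_PROMPTS = {
    "compliance_review": "Identify compliance risks in the text below. Cite the specific clause for each risk.",
    "summarize": "Summarize the text below in three sentences for a non-technical reader.",
    "extract_dates": "List every date mentioned in the text below, one per line, in ISO 8601 format.",
}

def route(task_type: str, text: str) -> str:
    """Select the single specialized prompt for this task; fail loudly on
    unknown task types rather than falling back to a catch-all prompt."""
    try:
        goal = SPECIALIZED_PROMPTS[task_type]
    except KeyError:
        raise ValueError(f"no specialized prompt for task type: {task_type!r}")
    return f"{goal}\n\n{text}"
```

Failing on unknown task types is deliberate: a silent fallback to a broad prompt reintroduces the multi-objective problem the router exists to avoid.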
Priority Inversion Under Overspecification
A subtler version of the multi-objective problem is priority inversion. When you stack many instructions on a reasoning model, the model's internal search doesn't preserve your implicit priority ordering. An instruction at the top of the system prompt doesn't automatically outrank one buried in the middle.
Standard models exhibit predictable priority degradation through attention: instructions buried in the middle of a long context window get less weight than those near the start or end. Reasoning models have a different failure mode: they satisfy constraint sets holistically, and low-priority edge-case instructions can end up dominating the solution because they're the binding constraint.
The practical implication: if you have a hierarchy of concerns (correctness > format > verbosity, for example), make it explicit rather than relying on ordering. And keep the list short enough that implicit priorities don't create inversion traps.
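An explicit hierarchy can be written directly into the prompt. A minimal sketch, with an assumed cap of three priorities to keep the list short enough to avoid the inversion trap described above:

```python
def prompt_with_priorities(goal: str, priorities: list[str], query: str) -> str:
    """Build a prompt that states the priority ordering explicitly instead
    of relying on instruction position. The cap of three is an assumption:
    longer lists recreate the inversion problem."""
    assert len(priorities) <= 3, "keep the priority list short"
    ranked = "\n".join(f"{i}. {p}" for i, p in enumerate(priorities, start=1))
    return (
        f"{goal}\n\n"
        f"If these concerns conflict, resolve them in this order (1 wins):\n"
        f"{ranked}\n\n"
        f"{query}"
    )
```

The phrase "1 wins" does the work that implicit ordering cannot: it tells the model how to break ties rather than leaving conflict resolution to the internal search.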
What Actually Works: A Framework
Based on official guidance and empirical findings, the prompting pattern for reasoning models is roughly:
Start with zero-shot. Don't add examples by default. State the task clearly with explicit success criteria and let the model reason. If accuracy is insufficient, add domain context before adding examples.
State the goal, not the process. Tell the model what you need, not how to think about it. Reasoning models do better with "Analyze whether this contract clause creates any compliance risks under GDPR" than with a step-by-step analysis framework that prescribes the reasoning path.
Keep system prompts minimal. Two to three focused instructions outperform ten diffuse ones. If your system prompt is more than a few hundred words, look for instructions that can be removed rather than refined.
Inject domain context explicitly. Reasoning models often have narrower world-knowledge recall than instruct models — they trade breadth for depth. If your task requires specific technical context, provide it directly rather than relying on the model to recall it.
Provide structure with delimiters, not instructions. Using XML tags or markdown headers to separate distinct sections of your input (instructions, context, the actual query) helps reasoning models parse your intent without constraining their search. Structure the problem; don't narrate the solution approach.
When you do include few-shot examples, be precise. If ablation testing shows examples help your specific task, ensure they exactly match the output format and quality you want. Noisy or inconsistent examples hurt more on reasoning models than on standard ones.
Detecting When You're Fighting the Model
The highest-value diagnostic is a zero-shot ablation. For any reasoning model prompt that isn't hitting your accuracy targets, strip everything down to a minimal zero-shot version of the task and measure accuracy. If zero-shot matches or beats your engineered prompt, the prompt is the problem.
Beyond that, a few signals indicate you're constraining the model's reasoning:
- Thinking token distributions that spike on simple queries: If straightforward tasks are consuming large thinking budgets, the model may be reasoning around your constraints rather than through the problem.
- Technically compliant but semantically wrong outputs: The model satisfies all stated requirements but misses the actual goal — a classic sign of constraint satisfaction winning out over genuine task completion.
- Zero-shot beats few-shot in A/B testing: Direct evidence the examples are constraining exploration.
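The third signal can be made rigorous with a paired sign test over per-item results, rather than eyeballing aggregate accuracy. A sketch assuming each variant produces a 0/1 correctness score per evaluation item:

```python
from math import comb

def sign_test_p(zero_shot: list[int], few_shot: list[int]) -> float:
    """Two-sided sign test on paired per-item correctness (1 = correct).
    Returns the probability of a win/loss split at least this lopsided
    if the two variants were actually equivalent; ties are discarded."""
    wins = sum(1 for z, f in zip(zero_shot, few_shot) if z > f)
    losses = sum(1 for z, f in zip(zero_shot, few_shot) if z < f)
    n = wins + losses
    if n == 0:
        return 1.0  # the variants never disagreed
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value here, with zero-shot on the winning side, is direct evidence that the examples are constraining exploration rather than helping.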
Prompt debugging for reasoning models requires treating the prompt as a search scope, not a behavior specification. The question isn't "does this instruction tell the model what to do?" — it's "does this instruction unnecessarily narrow the reasoning space?"
The Mental Model Shift
The useful reframe is: prompting a standard LLM is like writing a specification for a pattern-matcher. Prompting a reasoning model is like briefing an analyst. You give the analyst the goal, the relevant context, and the format you want back. You don't walk them through their thinking process. The more you prescribe how they should reason, the worse the output gets.
This means less prompting work on structure and more on problem definition. What exactly is the task? What counts as a good answer? What context does the model need that it can't derive? These questions matter more than the mechanics of system prompt design.
The teams that get the most out of reasoning models are the ones who stop trying to engineer the reasoning path and start focusing on framing the problem clearly. Counterintuitive for anyone who's spent years refining elaborate prompts — but that's where the leverage is.
The shift from instruction-following to reasoning models is not just a model upgrade. It requires updating the prompting mental model that took years to build. The core adjustment: remove the scaffolding, state the goal, let the model think. The hardest part is resisting the instinct to add one more instruction.
- https://platform.openai.com/docs/guides/reasoning-best-practices
- https://platform.claude.com/docs/en/build-with-claude/prompt-engineering/extended-thinking-tips
- https://arxiv.org/abs/2506.14641
- https://simonwillison.net/2025/Feb/2/openai-reasoning-models-advice-on-prompting/
- https://arxiv.org/abs/2402.10200
- https://arxiv.org/html/2509.23196v1
- https://galileo.ai/blog/how-to-prompt-o1-best-practices
- https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking
