The Show Your Work UX Trap: When the Reasoning Trace Is Debug Output Wearing a Product Costume
A reasoning model emits a chain-of-thought trace because that is how it computes. A product team renders that trace in the UI because hiding it feels like throwing away tokens the user paid for. Those are two different decisions, and almost nobody on the product side notices they made the second one. The trace becomes a panel, the panel becomes a feature, the feature gets a docs page, and six months later someone in a quarterly review asks why the support queue is full of users arguing with the reasoning instead of the answer.
The trace is debug output. It exists for engineers who need to know why the model picked one tool, hedged on a date, or quietly switched personas mid-paragraph. Pushing it to the end user without a design pass is the AI-product equivalent of leaving console.log calls in production and calling them "transparency." It looks like a feature, it costs almost nothing to render, and it quietly degrades trust in ways that don't show up in any of the dashboards the team built.
Why the trace got promoted from log to product surface
The default architecture of a 2025-era reasoning product is a model that produces two streams: a thinking stream and a final stream. The thinking stream is intended for the team training the model and for the engineers debugging the agent loop. Somewhere in the path from prototype to ship, a designer noticed there was a free panel of "AI is doing something" content already sitting in the response stream, and the product spec grew a "show reasoning" toggle that was on by default. Nobody fought for it because it felt obviously good: transparency, explainability, the user gets to see the work.
That intuition came from the wrong reference class. The closest analog people reached for was a math teacher's "show your work," where the steps are the artifact being graded. In a model output, the steps are not the artifact. The answer is. The trace is closer to a profiler flame graph: occasionally vital for the person tuning the system, mostly noise for the person consuming the output. A flame graph is not a product feature. It is a debugging tool that happens to render.
The shift in industry posture didn't help. After DeepSeek-R1 shipped with its full reasoning visible, OpenAI — which had been hiding its o1 reasoning behind a summarized "thinking" panel — added more reasoning detail to o3-mini in part because users complained they couldn't debug their prompts. That argument is real, but it is an engineering argument made by users wearing prompt-engineer hats. The team that generalized "expose more reasoning" from "developers want to debug their prompts" to "end users want to read prose explanations" performed a category error that is now baked into a generation of consumer AI products.
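To make the "free panel" concrete: on the client, the whole decision usually collapses into one flag over two event kinds. The sketch below is a hypothetical shape, not any particular vendor's API; the event kinds, the `modelStream` stand-in, and the `showReasoning` default are all assumptions for illustration.

```typescript
// A minimal sketch of the two-stream shape described above. The event kinds
// ("thinking" / "answer"), the modelStream stand-in, and the showReasoning
// flag are hypothetical, not any particular vendor's API.
type StreamEvent = { kind: "thinking" | "answer"; text: string };

// Stand-in for the model's streaming response.
async function* modelStream(): AsyncGenerator<StreamEvent> {
  yield { kind: "thinking", text: "This is a sensitive area; the user might want a lawyer." };
  yield { kind: "thinking", text: "On reflection, the clause is standard boilerplate." };
  yield { kind: "answer", text: "Here is a standard indemnification clause: ..." };
}

// The whole "feature" is one flag. Defaulting it to true is the product
// decision the team never noticed it was making.
async function render(showReasoning = true): Promise<void> {
  for await (const event of modelStream()) {
    if (event.kind === "answer") {
      console.log("[answer]  ", event.text);
    } else if (showReasoning) {
      console.log("[thinking]", event.text);
    }
  }
}

render();
```

The rendering cost really is near zero; the costs show up downstream, in the sections below.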
The trace doesn't say what you think it says
The strongest argument against rendering the trace as product surface is that the trace is frequently lying. Anthropic's faithfulness work on Claude 3.7 Sonnet and DeepSeek R1 found that the visible chain-of-thought mentioned the actual hint that drove the model's answer only 25% of the time for Claude and 39% of the time for R1 on a controlled benchmark. On harder reasoning problems, faithfulness dropped further — roughly a 44% relative decline for Claude on GPQA versus easier datasets. On exactly the cases where users would most want a reliable explanation, the explanation is least reliable.
The implications for product are sharper than the academic framing suggests:
- The trace is not a window into the model's actual computation. It is a plausible-sounding monologue the model generates in parallel.
- A user who reads the trace and reasons "the model decided X because of step 3" is forming a mental model the system does not honor.
- When the trace contradicts the final answer — which it does in a non-trivial fraction of long traces — the user is left adjudicating between two outputs from the same model, and the experience is measurably worse than if there had been one output to disagree with.
If the trace were faithful, "show your work" would still be questionable as a default. Because the trace is not faithful, the question of whether to render it stops being a UX preference and starts being a correctness one.
What actually goes wrong in the support queue
Teams that ship always-on traces tend to discover the same failure modes, often after the relevant launch retrospective is closed and the dashboard is showing green:
- Users argue with the reasoning instead of the answer. A user asks for a contract clause and gets a clean answer. The trace, because it is the model thinking out loud, includes "this is a sensitive area, the user might want a lawyer." The user reads the hedge as a hedge, files a ticket about the AI being "weasel-worded," and the team realizes they shipped a product whose dominant complaint vector is the part of the output the team thought was a freebie.
- The trace and the final output disagree, and that disagreement reads as deception. Reasoning models routinely change their minds during the trace, then commit to one branch in the answer. The user sees the discarded branch and concludes the system is hiding something or, worse, that the answer is internally contested. Wrong-but-confident outputs hurt trust. Confident-but-internally-contradictory outputs hurt it faster.
- The hedge becomes the user's first impression. When a trace is rendered above or before the answer, the first prose the user reads is calibrated to internal model uncertainty, not to the audience. "I'm not entirely sure, but..." is appropriate for a reasoning step that is genuinely uncertain. As a product opener, it primes the user to distrust the answer that follows (the answer-first alternative is sketched just after this list).
- Power users learn to scroll past the trace, casual users get scared off. The bimodal usage that emerges — engineers who skim, casual users who close the tab — is the worst possible outcome of a feature whose stated goal was to add transparency.
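For contrast, here is a minimal sketch of answer-first ordering, assuming the same hypothetical event shape as the earlier sketch: the trace is buffered rather than rendered, and only surfaces when someone explicitly asks for it.

```typescript
// Answer-first ordering: buffer the thinking stream instead of rendering it,
// so the first prose the user reads is the answer, not the hedge. Same
// hypothetical event shape as the earlier sketch; not any vendor's API.
type StreamEvent = { kind: "thinking" | "answer"; text: string };

async function renderAnswerFirst(
  stream: AsyncIterable<StreamEvent>,
  onExpandTrace?: (trace: string[]) => void,
): Promise<void> {
  const trace: string[] = [];
  for await (const event of stream) {
    if (event.kind === "thinking") {
      trace.push(event.text); // held back, not shown by default
    } else {
      console.log("[answer]", event.text); // the user's first impression
    }
  }
  // Still available for the engineer, or the user who explicitly asks.
  onExpandTrace?.(trace);
}
```

Feeding the earlier `modelStream` into `renderAnswerFirst` prints only the answer; the discarded branches stay in the buffer until someone opens the panel, which is where debug output belongs.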
Sources
- https://www.anthropic.com/research/reasoning-models-dont-say-think
- https://www.anthropic.com/research/measuring-faithfulness-in-chain-of-thought-reasoning
- https://venturebeat.com/ai/dont-believe-reasoning-models-chains-of-thought-says-anthropic
- https://venturebeat.com/ai/openai-responds-to-deepseek-competition-with-detailed-reasoning-traces-for-o3-mini
- https://bdtechtalks.com/2025/02/12/openai-o3s-chain-of-thought/
- https://arxiv.org/html/2506.23678v1
- https://www.aiuxdesign.guide/patterns/progressive-disclosure
- https://uxplanet.org/progressive-disclosure-in-ai-powered-product-design-978da0aaeb08
- https://pair.withgoogle.com/chapter/explainability-trust/
- https://www.smashingmagazine.com/2025/09/psychology-trust-ai-guide-measuring-designing-user-confidence/
- https://news.ycombinator.com/item?id=42799743
