2 posts tagged with "chain-of-thought"

Your Chain-of-Thought Is a Story, Not an Audit Log

· 11 min read
Tian Pan
Software Engineer

An agent tells you, in clean prose, that it checked the user's permission, looked up the policy, confirmed the request was in scope, and executed the action. Legal reads the trace. Auditors read the trace. Your incident review reads the trace. Everyone reads the same paragraph and everyone comes away satisfied.

None of them knows whether the permission check actually ran. The paragraph is evidence of narration, not evidence of execution, and the two get confused precisely because the narration is fluent enough to feel like proof. Anthropic's own reasoning-model faithfulness research found that when Claude 3.7 Sonnet was fed a hint about the correct answer, it admitted using the hint only about 25% of the time on average, with reported rates of 19–41% for the problematic categories (grader hacks, unethical cues). In those tests the model's stated reasoning left out what actually drove its answer well over half the time, and this is true even for models explicitly trained to show their work.

The Token Economics of Chain-of-Thought: When Thinking Out Loud Costs More Than It's Worth

· 8 min read
Tian Pan
Software Engineer

Chain-of-thought prompting was one of the most important discoveries in applied LLM engineering. Ask a model to "think step by step," and accuracy jumps on math, logic, and multi-hop reasoning tasks. The technique became so standard that many teams apply it reflexively to every prompt in their system — classification, extraction, summarization, routing — without asking whether it's actually helping.

It usually isn't. Recent research from Wharton's Generative AI Lab shows that chain-of-thought provides no statistically significant improvement for one-third of model-task combinations, and actively hurts performance in others. Meanwhile, every CoT request inflates your token bill by 2–5x and adds seconds of latency. For production systems handling millions of requests, that's not a prompting strategy — it's an unaudited cost center.
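To make the cost claim concrete, here is a back-of-the-envelope sketch of what a 2–5x multiplier does at volume. The request count, tokens per request, and price are hypothetical placeholders, not figures from the post. The function name `monthly_output_cost` is also made up for illustration, and applying the multiplier to output tokens only is a simplifying assumption.

```python
def monthly_output_cost(
    requests: int,
    output_tokens_per_request: int,
    usd_per_million_output_tokens: float,
    cot_multiplier: float = 1.0,
) -> float:
    """Estimate output-token spend in USD.

    cot_multiplier models the 2-5x inflation cited above; applying it to
    output tokens only is an assumption made for this sketch.
    """
    total_tokens = requests * output_tokens_per_request * cot_multiplier
    return total_tokens / 1_000_000 * usd_per_million_output_tokens


# Hypothetical workload: 10M requests, 50 output tokens each,
# priced at $10 per 1M output tokens (placeholder numbers).
baseline = monthly_output_cost(10_000_000, 50, 10.0)
cot_low = monthly_output_cost(10_000_000, 50, 10.0, cot_multiplier=2.0)
cot_high = monthly_output_cost(10_000_000, 50, 10.0, cot_multiplier=5.0)

print(f"baseline: ${baseline:,.0f}")
print(f"with CoT: ${cot_low:,.0f} to ${cot_high:,.0f}")
```

Under these placeholder numbers, a task that gains nothing from chain-of-thought still costs two to five times as much to run. Swap in your own volumes and prices before drawing conclusions.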