Reasoning-Model Arbitrage: The Slow Expensive Model Is Cheaper on the Hard Prompts
The cheapest line on the pricing page is rarely the cheapest line on the invoice. A team picks the workhorse model — Sonnet, Haiku, Flash, GPT-mini — because the per-token math is friendly, ships a feature, and watches the cost dashboard report a happy unit-economics story for a quarter. Then the long tail catches up: a slice of requests the workhorse can't quite handle starts retrying, then partially answering, then escalating to a human reviewer, and the per-feature P&L stops resembling the per-call dashboard.
The arbitrage is that, on those hard requests, a reasoning model the team would never default to — Opus, o3, the slow expensive one — frequently lands the answer on the first attempt. The all-in cost of one $0.50 reasoning call beats five $0.05 workhorse calls plus the escalation queue and the engineer who debugs the failure on Monday. The procurement question (which model is cheapest per token?) and the architecture question (which model is cheapest per resolved request?) are different questions, and the team that conflates them is paying the difference.
