The Long-Horizon Evaluation Gap: Why Your Agent Passes Every Benchmark and Still Fails in Production
A model that scores 75% on SWE-Bench Verified falls below 25% on tasks that take a human engineer hours to complete. The same agent that reliably handles single-turn question answering can spiral into incoherent loops, hallucinate tool outputs, and forget its original goal when asked to coordinate a dozen steps toward an open-ended objective. The gap between benchmark number and production behavior isn't noise—it's structural, and understanding it is the difference between shipping something useful and shipping something that looks good in the demo.
This post is about that gap: why it exists, what specific failure modes emerge in long-horizon tasks that never appear in static evals, and what it takes to build an evaluation harness that actually catches them.
