
The Long-Horizon Evaluation Gap: Why Your Agent Passes Every Benchmark and Still Fails in Production

11 min read
Tian Pan
Software Engineer

A model that scores 75% on SWE-Bench Verified falls below 25% on tasks that take a human engineer hours to complete. The same agent that reliably handles single-turn question answering can spiral into incoherent loops, hallucinate tool outputs, and forget its original goal when asked to coordinate a dozen steps toward an open-ended objective. The gap between benchmark number and production behavior isn't noise—it's structural, and understanding it is the difference between shipping something useful and shipping something that looks good in the demo.

This post is about that gap: why it exists, what specific failure modes emerge in long-horizon tasks that never appear in static evals, and what it takes to build an evaluation harness that actually catches them.

The Single-Turn Illusion

Most popular benchmarks measure one thing: given a prompt, does the model produce the right output? MMLU tests knowledge recall. HumanEval tests whether a function body passes unit tests. Even the "agentic" evaluations that gained traction in 2023 and 2024 are largely framed as single episodes—the agent gets a task, takes some actions, and is scored on whether the end state matches a target.

The problem isn't that these benchmarks are wrong. It's that they're testing a fundamentally different behavior than what production agents do.

In production, an agent might need to research a codebase, propose a fix, run tests, interpret failures, revise its approach, handle a tool timeout, and produce a clean diff—across 30 to 60 steps. The probability of success in this regime isn't a fixed property of the model. It's a function of task duration, error recovery behavior, and how the agent manages accumulated context. None of that shows up in pass@1.

The numbers make the gap concrete. Recent reliability research spanning 23,392 episodes across multiple models found that pass@1 drops from 76.3% on short tasks to 52.1% on very-long tasks—a 24-percentage-point decline that's super-linear, meaning it's worse than you'd predict if errors were independent. For software engineering tasks specifically, the Graceful Degradation Score collapses from 0.90 to 0.44 as task length increases. The agent isn't making more mistakes per step—the mistakes are compounding in ways that short-horizon evals cannot detect.

What Long-Horizon Tasks Actually Expose

Several failure modes appear almost exclusively in multi-step, extended interactions. They're not visible in single-turn evals because they require the accumulation of state and context over time.

Compounding errors. The most documented failure: a small mistake early in a trajectory distorts every subsequent reasoning step. If the agent misclassifies a function signature in step 3, it builds an incorrect mental model that poisons steps 10 through 20. Research on coding agents specifically identifies "a combination of minor, compounding issues" as a distinct failure category—not a single catastrophic error, but many small ones that cascade. When each step succeeds 95% of the time, a 20-step chain completes correctly only 36% of the time.
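The compounding arithmetic is easy to reproduce. A minimal sketch of the independence model behind the 36% figure:

```python
def chain_success(step_success: float, n_steps: int) -> float:
    """Probability that every step of an n-step chain succeeds,
    assuming per-step errors are independent."""
    return step_success ** n_steps

# A 95%-reliable step, chained 20 times, completes only ~36% of the time.
print(round(chain_success(0.95, 20), 2))
```

Note that the super-linear decline described above means real agents do worse than this independent-error model predicts, which is exactly what makes compounding a distinct failure category rather than simple arithmetic.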

State drift. Over extended interactions, agents deviate from their original objectives through what researchers call semantic drift. The agent doesn't forget the task in a literal sense—it remains in the context window—but priorities gradually shift and the agent starts optimizing for local coherence rather than the original goal. This is distinct from compounding errors: the reasoning at each step may be locally valid, but the trajectory drifts off-target.

Meltdown behavior. A particularly striking failure mode documented in long-horizon benchmarks is the transition from coherent-but-wrong to incoherent. The agent begins looping on failed tool calls, contradicts its own earlier outputs, or starts hallucinating tool responses rather than invoking tools. Research on τ-bench (tau-bench), a benchmark that simulates multi-turn customer service interactions with dynamic tool use, found that GPT-4o achieves less than 50% task success, and when evaluated for consistency across eight runs, that drops to below 25%—revealing an agent that occasionally gets lucky rather than one that reliably understands the task.

Tool use degradation. Agents frequently begin tasks with correct tool selection and well-formed arguments, then degrade mid-execution. Failures include malformed JSON in tool calls, tool invocations that are semantically correct but contextually wrong (calling a read tool when a write tool was needed), and loss of structure in outputs after long context accumulation. This pattern doesn't appear in evals where the agent takes one or two tool calls—it emerges after a dozen.
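One cheap harness-side mitigation is to validate every tool call before execution, so malformed JSON or missing arguments surface as structured errors rather than silent degradation. A minimal sketch; the registry format and tool names are hypothetical, and real harnesses typically use JSON Schema:

```python
import json

def validate_tool_call(raw: str, allowed_tools: dict) -> tuple[bool, str]:
    """Reject malformed or unknown tool calls before they reach execution.

    allowed_tools maps tool name -> set of required argument names.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"malformed JSON: {e}"
    name = call.get("tool")
    if name not in allowed_tools:
        return False, f"unknown tool: {name!r}"
    missing = allowed_tools[name] - set(call.get("args", {}))
    if missing:
        return False, f"missing args: {sorted(missing)}"
    return True, "ok"
```

Logging the rejection reason per step also gives you the data to see *when* in a trajectory malformed calls start appearing, which is the degradation signature described above.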

Irreversible action risk. Long-horizon tasks are more likely to involve actions that can't be undone: committing code, sending messages, modifying shared state, deleting records. Short-horizon evals rarely expose this because the cost of a wrong action in a three-step eval is trivially recoverable. In a 40-step agentic workflow, an irreversible mistake at step 15 can invalidate every subsequent step regardless of how well the agent recovers.
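In an eval harness, one way to contain this risk is a dispatcher that intercepts irreversible tools under a dry-run flag, so the eval can score the agent's intent without mutating shared state. A sketch with hypothetical tool names:

```python
# Hypothetical set of tools whose effects can't be undone.
IRREVERSIBLE = {"git_push", "send_email", "delete_record"}

def dispatch(tool: str, args: dict, execute, dry_run: bool = True):
    """Route tool calls; irreversible ones are intercepted under dry_run
    and recorded for grading instead of being executed."""
    if tool in IRREVERSIBLE and dry_run:
        return {"status": "intercepted", "tool": tool, "args": args}
    return execute(tool, args)
```

The intercepted records double as eval signal: an agent that attempts an irreversible action at the wrong point in the trajectory fails the trial even though nothing was actually destroyed.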

The Benchmarks That Reveal the Gap

A handful of benchmarks are specifically designed to stress-test long-horizon capability, and their results are consistently humbling.

SWE-Bench Pro tasks agents with software engineering problems that require hours to days of work for a human professional—patches spanning multiple files, architectural changes, and cross-cutting concerns. State-of-the-art models score below 25% pass@1. More pointedly, the same models that exceed 70% on SWE-Bench Verified (which focuses on isolated bug fixes) collapse to sub-25% when the tasks require sustained multi-file reasoning. When human-provided requirements and interface specifications are removed from the task context, GPT-5's score drops from 25.9% to 8.4%—demonstrating how dependent current agents are on well-structured scaffolding.

τ-bench (tau-bench) measures agents performing customer service tasks in retail and airline domains, with tool access and domain policy guidelines. The benchmark uses a pass^k metric: the probability that all k attempts succeed. This is the relevant metric when you need an agent to be consistently right, not occasionally lucky. The data is stark: top function-calling agents succeed on less than 50% of tasks, and pass^8 falls below 25% in the retail domain.

GAIA benchmarks general-purpose assistants on multi-step tasks requiring web browsing, tool use, and multi-modal reasoning. When it launched, humans solved 92% of tasks while GPT-4 with plugins solved 15%. The gap has narrowed but the benchmark exposes a consistent pattern: agents fail not on the reasoning within a single step, but on the coordination across steps.

WebArena tasks agents with completing real web-based workflows across multiple sites. Initial models achieved 14% success on a benchmark where humans score 78%. Even with improvements bringing some agents above 60%, the benchmark consistently surfaces failures that single-turn QA cannot: agents confused by unexpected UI states, forgetting earlier navigation context, and taking valid-seeming actions that violate implicit task constraints.

Rank Inversions and the Benchmark Trap

One finding from long-horizon reliability research deserves particular attention: model rankings reverse when you switch from short-horizon to long-horizon evaluation.

A model that ranks first on short tasks—GLM-4.5 Air at 94.9% pass@1—drops to fourth on very-long tasks at 66.7%. A model that ranks fifth or sixth by short-horizon capability climbs to third or fourth by long-horizon reliability. If you're selecting a model for a production agent based on standard benchmark leaderboards, you may be choosing the wrong model for your actual workload.

The reason for the rank inversion is that some models achieve high short-horizon scores by being aggressively decisive—they commit to answers quickly and confidently. That works well in single-turn settings. In long-horizon tasks, that same aggressiveness leads to early commitment to wrong strategies, with insufficient recovery behavior when things go wrong.

There's a counterintuitive corollary: frontier models that pursue ambitious multi-step strategies show the highest meltdown rates. Models such as DeepSeek V3 achieve the best very-long GDS scores yet exhibit meltdown rates in the 13-19% range. The models that perform best overall are also the most likely to spiral into incoherence when they can't make their complex strategy work. Models selected purely on leaderboard position may be the most dangerous in production if they don't fail gracefully.

Building an Eval Harness That Catches Long-Horizon Failures

The practical question is what to do about this. Short-horizon pass@1 evals are worth keeping—they're fast, cheap, and good at catching regressions on well-understood tasks. But they need to be supplemented with eval infrastructure that actually exercises the failure modes above.

Measure pass^k, not just pass@1. Run the same task multiple times and measure whether the agent consistently succeeds. A pass@1 of 60% with a pass^8 of 10% means you have an agent that gets lucky, not an agent you can deploy. For customer-facing or consequential tasks, you want pass^k to be high—which means evaluating it explicitly.
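pass^k can be estimated from n recorded trials without literally running batches of k. A minimal sketch using the unbiased estimator C(c, k) / C(n, k), where c is the number of successful trials:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass^k (all k of k i.i.d. attempts succeed)
    from n trials with c successes: C(c, k) / C(n, k)."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# An agent that succeeds 6 times out of 8 looks fine at pass@1
# but cannot be trusted to succeed 8 times in a row:
print(pass_hat_k(8, 6, 1))  # 0.75
print(pass_hat_k(8, 6, 8))  # 0.0
```

Running n somewhat larger than k (say 16 trials for pass^8) tightens the estimate at modest extra cost.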

Grade trajectories, not just outcomes. An agent that reaches the right final state via a wrong intermediate path is hiding a latent failure. Record full transcripts including tool calls, intermediate outputs, and reasoning traces. Check whether the trajectory is valid, not just whether the final state matches. This is especially important for tasks where lucky accidents can produce correct outcomes from incorrect reasoning.
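A trajectory grader can make the outcome/path distinction explicit. A minimal sketch, where step_checks is a list of hypothetical per-step validity predicates you define per task:

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    # Each step is a (tool, args, result) tuple from the recorded transcript.
    steps: list = field(default_factory=list)

def grade(traj: Trajectory, final_ok: bool, step_checks) -> dict:
    """Score outcome and trajectory separately; a correct final state
    reached via invalid intermediate steps is flagged as a latent failure."""
    bad_steps = [i for i, s in enumerate(traj.steps)
                 if not all(chk(s) for chk in step_checks)]
    return {
        "outcome_pass": final_ok,
        "trajectory_pass": not bad_steps,
        "latent_failure": final_ok and bool(bad_steps),
        "bad_steps": bad_steps,
    }
```

Tracking latent failures separately matters because they have a different fix: the agent's reasoning needs work even though the scoreboard says it passed.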

Isolate state between eval runs. A common eval infrastructure mistake is allowing state to leak between trials. When trial 3 inherits artifacts from trial 2, outcomes become correlated: a success or failure in one trial propagates forward, so what looks like several independent measurements is really one, and the measured failure rate no longer reflects the true per-trial rate. Production environments don't share state across user sessions; your eval environment shouldn't either.
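The isolation requirement maps naturally onto a fresh-copy fixture: one pristine workspace per trial, discarded afterward. A minimal sketch:

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def fresh_workspace(template: Path):
    """Copy a pristine task fixture into a throwaway directory for each
    trial, so no artifacts can leak between runs."""
    work = Path(tempfile.mkdtemp())
    shutil.copytree(template, work / "task")
    try:
        yield work / "task"
    finally:
        shutil.rmtree(work, ignore_errors=True)
```

For agents that touch external services, the same principle applies with mocked or containerized backends instead of a directory copy.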

Build partial credit into graders. Binary pass/fail misses signal about how close the agent got. A task with five subtasks where the agent completed four is fundamentally different from one where it completed zero. Weighted partial credit scoring reveals progress that binary grading hides, and gives you better signal for debugging which parts of a multi-step task are failing.
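A weighted partial-credit grader is a few lines; the subtask names and weights below are hypothetical placeholders for your own task decomposition:

```python
def weighted_score(results: dict, weights: dict) -> float:
    """Weighted partial credit: fraction of total weight earned
    across the subtasks that passed."""
    total = sum(weights.values())
    earned = sum(weights[name] for name, passed in results.items() if passed)
    return earned / total

# Agent located the bug and wrote a fix but the tests still fail:
score = weighted_score(
    {"locate_bug": True, "write_fix": True, "tests_pass": False},
    {"locate_bug": 1, "write_fix": 2, "tests_pass": 2},
)
print(score)  # 0.6
```

Keeping the per-subtask results alongside the scalar score is what makes the grader useful for debugging: you can see which stage of the multi-step task is the bottleneck across trials.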

Inject failures mid-trajectory. Long-horizon tasks encounter tool timeouts, unexpected API responses, ambiguous states, and user clarifications. An eval that only tests the happy path won't reveal how the agent responds when things go wrong. Add eval cases where tools return errors mid-task, where earlier assumptions become invalid later in the trajectory, and where the agent needs to revise its plan.
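Fault injection can be a thin wrapper around the tool executor that fails on a scheduled step, which the eval then scores for recovery. A minimal sketch; the error payload shape is an assumption, not a standard:

```python
class FaultInjector:
    """Wrap a tool executor and return a scheduled failure mid-trajectory,
    so the eval exercises recovery rather than just the happy path."""

    def __init__(self, execute, fail_at_step: int, error: str = "tool timeout"):
        self.execute = execute
        self.fail_at_step = fail_at_step
        self.error = error
        self.step = 0

    def __call__(self, tool: str, args: dict):
        self.step += 1
        if self.step == self.fail_at_step:
            return {"status": "error", "message": self.error}
        return self.execute(tool, args)
```

Sweeping fail_at_step across the trajectory (early, middle, late) is worth the extra runs: recovery from a step-3 failure and a step-25 failure are different behaviors.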

Add circuit breakers to detect meltdown. In production agents, implement sliding-window monitoring over tool-call sequences. Repeated identical tool calls, rapidly growing context without progress, or tool outputs that contradict earlier successful calls are meltdown precursors. An eval harness should identify when these patterns appear and log them separately from simple task failures—they indicate a qualitatively different class of problem that requires different fixes.
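A sliding-window breaker for one precursor, repeated identical tool calls, fits in a few lines; the window size and repeat limit are arbitrary knobs to tune against your own traces:

```python
from collections import deque

class MeltdownDetector:
    """Flag a meltdown precursor: the same tool call repeated too many
    times within a sliding window of recent calls."""

    def __init__(self, window: int = 8, repeat_limit: int = 3):
        self.recent = deque(maxlen=window)
        self.repeat_limit = repeat_limit

    def observe(self, tool: str, args_key: str) -> bool:
        """Record one call; return True if the agent should be halted."""
        call = (tool, args_key)
        self.recent.append(call)
        return self.recent.count(call) >= self.repeat_limit
```

The same observer can run in production and in the eval harness, which keeps the meltdown definition consistent between what you measure and what you guard against.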

Match eval horizon to deployment horizon. If your production agent handles tasks that take 20-30 steps, your evals need to include tasks at that length. A common mistake is building evals from tasks that are convenient to set up—short, well-defined, with clear success criteria—but then deploying to tasks that are none of those things. The eval suite should cover the task-length distribution of your actual workload.

The Forward Direction

The benchmarks that matter in 2025 and 2026 aren't the ones with the highest numbers. They're the ones with the highest fidelity to production conditions: dynamic environments, multi-step tasks, consistency requirements, and irreversible actions. τ-bench's pass^k metric, SWE-Bench Pro's multi-file complexity requirements, and GAIA's tool-use coordination are harder to score well on specifically because they're harder to game.

The practical upshot for teams building agents is that benchmark leaderboard position is a weak proxy for production readiness. The work is in building evaluation infrastructure that mimics the failure conditions of long-horizon tasks—compounding errors, state drift, tool degradation, and meltdown—before those conditions reveal themselves in production. That infrastructure is harder to build than a single-turn eval, but the gap between benchmark and production performance isn't going to close until the evaluations get harder.

Build evals that run 20 steps. Measure pass^k. Test recovery from mid-task failures. The benchmark that you build from your own traffic will tell you more than any leaderboard, because it will tell you specifically where your agent breaks.
