Skip to main content

The Demo Worked Because You Were Watching: Session Length Is the Eval Dimension Your Suite Forgot

· 10 min read
Tian Pan
Software Engineer

The reliability number in your launch deck came from sessions that looked nothing like the ones your users actually run. The demo was five turns: open, ask, observe a tidy answer, refine once, conclude on a high note. The session your power user ran yesterday was thirty-one turns long, included two tool failures the agent papered over with optimism, and ended when the user gave up and opened a support ticket. Both sessions came out of the same model. The first one shipped a press release. The second one was filed under "edge case."

Session length is a dimension of evaluation, and demo culture systematically underweights it. We measure per-turn accuracy because per-turn accuracy is what fits on a slide, and then we are surprised when per-session success falls off a cliff that we never put on any chart. The cliff is not random and it is not a tail event — it is the predictable consequence of compounding error, attention drift, and committed assumptions that the model will not revisit. The question every team should be asking is not "how good is the model" but "how good is the model at turn twenty-eight, given everything we said at turns one through twenty-seven."

Per-turn accuracy is not a session-level promise

The arithmetic is older than the field. If every step has a 95% chance of being right and the steps are independent, a twenty-step workflow succeeds 36% of the time. A thirty-step workflow succeeds 21%. The independence assumption is generous — in real sessions errors are correlated, because once a model commits to a wrong assumption it tends to defend it for the rest of the conversation rather than discard it. So the actual session-success curve is worse than the multiplicative lower bound, not better.

Teams report per-turn metrics because per-turn data is what their tracing tool aggregates by default. A span succeeded or it did not. A tool call returned a value or it threw. These are easy to count, and they look healthy in dashboards. What they hide is the user-weighted view: not "did each individual span pass," but "did this conversation reach the outcome the user came here for." The session-level number is almost always lower, and the gap widens with turn count. A research team that converted standard single-turn benchmarks into multi-turn conversations by sharding fully specified instructions into atomic pieces and revealing one shard per turn observed an average 39% accuracy drop across every frontier model tested, with most of the loss coming not from reduced raw capability but from a sharp rise in unreliability — the spread between best and worst runs on the same session exceeding 50 points for many models.

That spread is the part executives find hardest to internalize. A 39% drop on the average sounds bad enough to budget for. The reality is that on a given session your model is uniformly random between "ships demo quality" and "fails halfway through," and which of those happens depends on details of the conversation that nobody has a knob to tune.

What actually breaks as the turn counter goes up

Three failure modes show up in long-session traces, and they each operate at a different layer of the stack.

The first is positional attention drift. Models disproportionately attend to information from the first and last turns; citations to the middle of a long session drop below 20%. This is the "lost in the middle" effect, and it is mechanical, not a quirk of any one model — it shows up in needle-in-haystack tests with synthetic distractors, and it shows up in real conversations where the user's actual constraint was mentioned on turn nine and forgotten by turn twenty-three. Multi-needle benchmarks like RULER expose this in a way single-needle tests do not, and most published long-context numbers come from the easy version.

The second is sticky commitment. Once a model picks an interpretation early in a session — the user is asking about Python, the file in question is the auth handler, the tool to use is the search API rather than the database — it tends to layer corrections on top of that assumption rather than re-deciding from scratch. If the early assumption was wrong, every subsequent turn is solving the wrong problem with increasing confidence. The final responses get longer, not shorter; one analysis observed code outputs doubling in length over a session as the model packed in more conditions, exceptions, and patches to defend an interpretation it should have discarded.

The third is stale tool state. Agents that hold onto cached results, partial outputs from a previous tool call, or a half-completed plan from ten turns ago will reason as if that state were current even when the world has moved on. The longer the session, the more state accumulates and the more likely some of it is wrong by the time the agent acts on it. This is rarely caught by per-turn evals because the per-turn evaluator sees the tool call that succeeded, not the staleness of the data it returned.

None of these modes look like a single dramatic failure. They look like the answer drifting slightly off the user's actual question every turn, and then suddenly the conversation is twenty-five turns deep and the agent is confidently doing something the user never asked for.

The long-session trace is the eval that nobody is collecting

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates