The Multi-Turn Session State Collapse Problem
Your per-request error rates look clean. Latency is within SLO. The LLM judge is scoring outputs at 87%. And then a user files a support ticket: "I told the bot my account number three times. It just asked me again." A different user: "It agreed to a refund, then two turns later denied the policy existed."
Single-turn failures are visible. The request comes in, the model hallucinates or refuses, your eval catches it, you fix the prompt. The feedback loop is tight. Multi-turn failures work differently: the session starts fine, degrades gradually turn by turn, and your monitoring never fires because each individual response is technically coherent. The problem is the session as a whole — and almost no team instruments for that.
Research across major frontier models (Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro) shows an average 39% performance drop when moving from single-turn to multi-turn conversations. That number hides the real story: only about 16 points of the drop come from capability loss. The other 23 points are a reliability crisis. The gap between a model's best and worst performance on the same task roughly doubles as conversation length grows. You're not just getting worse outputs; you're getting inconsistent ones.
Why Multi-Turn Sessions Degrade
The degradation isn't random. There are four repeating failure modes that practitioners encounter in production.
Premature commitment happens when the model latches onto early context and anchors its reasoning there. A user says "I'm working on a Python project" in turn one, spends three turns discussing something unrelated, then asks for code help. The model writes Python even when the follow-up question is clearly about a shell script. That early detail became a prior that everything else was filtered through.
Compounding errors cascade from initial misalignments. A 2% misalignment at turn two creates a 40% failure rate by turn ten if the model builds on incorrect assumptions rather than reconsidering them. Models don't spontaneously correct course; once they're on a wrong track, additional information that should contradict the error gets integrated around the mistake rather than replacing it.
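A toy model makes the compounding dynamic concrete. The sketch below assumes an error, once introduced, is never corrected, and that each turn built on corrupted context becomes slightly more error-prone; both the structure and the 2% / 1.2x parameters are illustrative assumptions, not figures from the research above.

```python
# Toy compounding model: an uncorrected early error makes later turns more
# error-prone, so session-level failure grows much faster than the per-turn
# rate suggests. Parameters are illustrative, not measured values.
def session_failure_rate(turns: int, base_error: float, amplification: float) -> float:
    """Probability that a session contains at least one uncorrected error."""
    p_clean = 1.0          # probability the session is still error-free
    p_error = base_error   # per-turn error probability, rising as context corrupts
    for _ in range(turns):
        p_clean *= (1.0 - p_error)
        p_error = min(1.0, p_error * amplification)
    return 1.0 - p_clean

for n in (2, 5, 10):
    print(f"turn {n:>2}: session failure rate ~ {session_failure_rate(n, 0.02, 1.2):.0%}")
```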
Loss-of-middle-turns is a structural artifact of how transformer attention works. Attention weights have primacy and recency bias — what was said first and what was said last get disproportionate weight. The crucial clarification the user gave at turn four, which refined the requirements from turns one through three, quietly gets buried in the attention noise floor. Empirically, relevant information placed in the middle of a long context drops model accuracy by more than 30% compared to the same information placed at the edges.
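You can measure this position sensitivity directly. The probe below is a minimal sketch: it plants one fact at the start, middle, or end of a filler conversation and checks recall. The call_model(prompt) callable, the filler text, and the order number are all placeholders to swap for your own client and domain; repeated trials assume the model samples with nonzero temperature.

```python
# Minimal position-sensitivity probe: the same fact is planted at different
# depths in an otherwise-identical transcript and recall is measured.
# call_model(prompt: str) -> str is a placeholder for your own API client.
FILLER = "User: Tell me something unimportant.\nAssistant: Noted, moving on.\n"
FACT = "User: For the record, my order number is 48213.\n"
QUESTION = "User: What was my order number?\nAssistant:"

def build_transcript(total_turns: int, fact_position: int) -> str:
    turns = [FILLER] * total_turns
    turns.insert(fact_position, FACT)
    return "".join(turns) + QUESTION

def position_probe(call_model, total_turns: int = 40, trials: int = 20) -> dict:
    positions = {"start": 0, "middle": total_turns // 2, "end": total_turns}
    return {
        label: sum("48213" in call_model(build_transcript(total_turns, pos))
                   for _ in range(trials)) / trials
        for label, pos in positions.items()
    }
```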
Assumption inflation occurs in longer responses. The model introduces plausible-sounding but spurious premises mid-session, and those assumptions persist as context for future turns. The session slowly drifts from the user's actual goals toward a hallucinated version of what the conversation is about.
The underlying mechanism is what researchers call context rot. A model with a 200K token context window can exhibit significant degradation at 50K tokens — well before hitting limits. Softmax normalization means every new token you add raises the noise floor. More context doesn't increase signal strength; it dilutes attention from what matters. At 100K tokens, there are roughly 10 billion pairwise attention comparisons per layer. Each piece of relevant information receives a proportionally smaller slice of the attention budget.
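The arithmetic of that dilution is easy to see in isolation. In the sketch below, one relevant token keeps a fixed logit advantage over every distractor, yet its share of the softmax attention budget still collapses as distractors accumulate; the logit gap of 4.0 is an arbitrary assumption chosen for illustration.

```python
import math

# One "relevant" token holds a fixed logit advantage over every distractor.
# Softmax normalization still hands it a shrinking share of attention as
# the distractor count grows.
def attention_on_relevant(num_distractors: int, logit_gap: float = 4.0) -> float:
    relevant = math.exp(logit_gap)
    distractors = num_distractors * math.exp(0.0)
    return relevant / (relevant + distractors)

for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} distractor tokens -> {attention_on_relevant(n):.4f} attention on the relevant token")
```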
What Fails Silently
The specific failures that don't trigger conventional monitoring:
- The chatbot asks for the account number again after the user provided it in turn two. Per-turn quality scores look fine; the context retrieval failure is invisible without session-level tracking (a check for exactly this is sketched after the list).
- A constraint established in turn one ("only recommend options under $500") gets overridden implicitly by turn eight as the conversation shifts scope. No individual response violates the instruction — the session gradually abandoned it.
- Pronoun resolution breaks across turns. The model introduces a new entity in turn five; by turn eight, it's conflated with the entity the user actually meant. About 70% of real conversations contain cross-turn anaphoric references that models handle poorly even when each turn is parsed correctly in isolation.
- Role or persona drift: a system prompt establishes a formal, policy-bound assistant. By turn twelve, the model is speaking casually and making commitments that contradict the original persona. Each response looks fine; the trajectory has drifted.
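The first failure above, re-asking for information the user already supplied, is the easiest to check mechanically once you treat the session as the unit of analysis. A minimal sketch, assuming turns arrive as role/text/turn dictionaries and using toy regexes in place of whatever slot extraction your application already performs:

```python
import re

# Flags turns where the assistant asks for a slot the user already mentioned.
# Slot patterns and the turn format are illustrative assumptions; real tracking
# would verify that a value was actually extracted, not just that the slot
# name appeared in a user turn.
SLOT_PATTERNS = {
    "account_number": re.compile(r"\b(account|acct)\s*(number|no\.?|#)", re.I),
    "order_number": re.compile(r"\border\s*(number|no\.?|#)", re.I),
}

def find_reask_violations(turns: list[dict]) -> list[dict]:
    """turns: [{'role': 'user' | 'assistant', 'text': str, 'turn': int}, ...]"""
    provided: dict[str, int] = {}   # slot -> turn where the user first supplied it
    violations = []
    for turn in turns:
        for slot, pattern in SLOT_PATTERNS.items():
            if not pattern.search(turn["text"]):
                continue
            if turn["role"] == "user":
                provided.setdefault(slot, turn["turn"])
            elif slot in provided and "?" in turn["text"]:
                violations.append({
                    "slot": slot,
                    "provided_at_turn": provided[slot],
                    "reasked_at_turn": turn["turn"],
                })
    return violations
```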
Production agents are particularly exposed. Research shows coding agents spend 60% of their first turn retrieving context, accumulating irrelevant files that stay in the window for the entire task. By 35 minutes of task time, success rates drop for every agent tested. Failure rate approximately quadruples when task duration doubles.
How to Instrument for Session Health
Per-request metrics don't catch this. You need session-level instrumentation that groups spans by session ID and tracks state across turns.
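At its simplest, that means rolling exported per-request spans up into one record per session before computing any quality metric. The sketch below assumes each span carries a session identifier and a token count in its attributes; the attribute names are assumptions here, not any particular vendor's convention.

```python
from collections import defaultdict

# Roll per-request spans up into session records so health can be judged
# per session rather than per turn. "session.id" and "tokens" are assumed
# attribute names; substitute whatever your tracing pipeline emits.
def group_spans_by_session(spans: list[dict]) -> dict[str, dict]:
    sessions: dict[str, dict] = defaultdict(lambda: {"turns": [], "errors": 0})
    for span in spans:
        sid = span["attributes"]["session.id"]
        sessions[sid]["turns"].append(span)
        if span.get("status") == "error":
            sessions[sid]["errors"] += 1
    # Session-level aggregates that no per-request dashboard surfaces.
    for s in sessions.values():
        s["turn_count"] = len(s["turns"])
        s["total_tokens"] = sum(sp["attributes"].get("tokens", 0) for sp in s["turns"])
    return dict(sessions)
```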
Conversation completeness rate measures whether the session fulfilled the user's original intent — not whether each turn was answered. The baseline question: could you state what the user originally wanted at turn one, and did the session deliver it?
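A minimal sketch of that metric, assuming a judge(prompt) callable that returns a YES/NO verdict and sessions stored as ordered lists of role/text turns:

```python
# Conversation completeness rate: fraction of sessions judged to have
# fulfilled the user's turn-one intent. judge(prompt) -> str is a placeholder
# for whatever LLM-as-judge client you already use.
def completeness_rate(sessions: list[list[dict]], judge) -> float:
    fulfilled = 0
    for turns in sessions:
        transcript = "\n".join(f"{t['role']}: {t['text']}" for t in turns)
        original_intent = next(t["text"] for t in turns if t["role"] == "user")
        verdict = judge(
            "Original user intent (turn one):\n"
            f"{original_intent}\n\n"
            "Full session transcript:\n"
            f"{transcript}\n\n"
            "Did the session, taken as a whole, fulfil the original intent? "
            "Answer YES or NO."
        )
        fulfilled += verdict.strip().upper().startswith("YES")
    return fulfilled / len(sessions)
```

The important design choice is the unit of evaluation: the judge sees the turn-one intent alongside the entire transcript, so a session that answered every individual turn but never delivered the original goal still scores zero.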
Sources
- https://arxiv.org/html/2510.07777v1
- https://arxiv.org/html/2505.06120v1
- https://arize.com/glossary/multi-turn-llm-conversation-degradation/
- https://www.getmaxim.ai/articles/how-context-drift-impacts-conversational-coherence-in-ai-systems/
- https://www.getmaxim.ai/blog/from-turn-1-to-turn-10-how-llms-get-lost-in-multi-turn-conversations/
- https://www.morphllm.com/context-rot
- https://mem0.ai/blog/llm-chat-history-summarization-guide-2025
- https://www.llamaindex.ai/blog/improved-long-and-short-term-memory-for-llamaindex-agents
- https://arxiv.org/abs/2310.08560
- https://www.confident-ai.com/blog/multi-turn-llm-evaluation-in-2026
