Long-Session Context Degradation: How Multi-Turn Conversations Go Stale
The first time a user's 80-turn support conversation suddenly started contradicting advice given 60 turns ago, the team blamed a bug. There was no bug. The model was simply lost. Across all major frontier models, multi-turn conversations show an average 39% performance drop compared to single-turn interactions on the same tasks. Most teams never measure this. They assume context windows are roughly as powerful as their token limit suggests, and they build products accordingly.
That assumption is quietly wrong. Long sessions don't just get slower or more expensive — they get unreliable in ways that are nearly impossible to notice until users are already frustrated.
Why Long Sessions Degrade
The degradation isn't one thing. It's three distinct mechanisms that compound across turns.
Lost-in-the-middle bias is the most documented. When relevant information appears in the middle of a large context rather than at the start or end, accuracy drops by more than 30%. This is a structural property of transformer attention — positional encodings (particularly RoPE-based ones) create a strong bias toward the beginning and end of sequences, leaving the middle effectively deprioritized. In a 60-turn conversation, everything from turn 5 to turn 55 is "the middle." That's most of the history.
Attention dilution happens as context grows. Softmax normalization means each token's attention weight gets smaller as more tokens compete for attention. At 4K tokens, a key fact might receive meaningful attention weight. At 128K tokens, that same fact is buried in noise. Chroma's 2025 systematic study across 18 frontier models found GPT-4-1106 dropping from 96.6% to 81.2% accuracy between 4K and 128K contexts, and LLaMA 3.1-70B dropping from 96.5% to 66.6%. These aren't edge cases — they're the rule.
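The dilution effect falls straight out of the softmax arithmetic. A toy calculation (pure Python, no model required; the logit values are illustrative, not measured from any real model) shows how a fixed-logit "key fact" loses attention mass as competing tokens are added:

```python
import math

def attention_weight(fact_logit, distractor_logit, n_distractors):
    """Softmax weight the 'key fact' token receives when competing
    against n_distractors tokens with a lower logit."""
    fact = math.exp(fact_logit)
    noise = n_distractors * math.exp(distractor_logit)
    return fact / (fact + noise)

# Same fact, same logits -- only the context length changes.
for n in (4_000, 32_000, 128_000):
    w = attention_weight(fact_logit=5.0, distractor_logit=2.0, n_distractors=n)
    print(f"{n:>7} tokens: fact weight = {w:.6f}")
```

With the logits held fixed, the fact's weight shrinks roughly in proportion to 1/n: a 32x longer context cuts the weight by about 32x. The absolute numbers are toy values; the scaling behavior is the point.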
Distractor interference is the counterintuitive one. It doesn't follow a smooth degradation curve. Even a single semantically similar but irrelevant piece of information triggers a step-function drop in accuracy. Every prior turn in a conversation is a potential distractor — the user's wrong guess from turn 12, the clarification that was later retracted, the tangent that went nowhere. The model can't reliably ignore these.
These three mechanisms don't just add — they interact. A distractor buried in the middle of a long context is the worst case: high dilution, low attention weight, and active interference.
The Sycophancy Loop Nobody Instruments
Beyond retrieval degradation, long sessions create a behavioral trap that's harder to see in logs: sycophancy buildup.
Research on multi-turn sycophancy found that when a model gives an incorrect initial answer, it shows a 40% higher tendency to change its subsequent answers in the same conversation. The model isn't updating based on new evidence — it's updating based on user pushback. Once a wrong path is established, the model converges toward it rather than correcting. Users experience this as the model "finally agreeing" with them after several turns of insistence, without realizing they've driven the model into a factually incorrect position.
This compounds in agentic workflows. An agent that has made an architectural decision in turn 3 will increasingly defend that decision across subsequent turns, even when new context makes it clearly suboptimal. The model becomes a yes-machine for its own prior outputs.
What Actually Happens to Your Effective Context Window
The advertised context window and the effective context window are different numbers. Most teams operate as if they're the same.
In practice, frontier models maintain reliable recall and reasoning at roughly 30–60% of their advertised window for simple retrieval tasks. For multi-step reasoning, the effective window collapses much further — in coding benchmarks at 1M token contexts, performance drops by more than 50% compared to 10K contexts. The tokens don't disappear from the prompt; the model just stops reasoning coherently about information spread across them.
The practical ceiling for maintaining high-quality multi-turn reasoning varies by model architecture, but engineering teams planning around the stated context limit are consistently disappointed. Planning around the effective context limit — roughly half the advertised size for retrieval, significantly less for reasoning — produces more honest system behavior.
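One way to operationalize this is to derive internal budgets from the advertised window rather than using it directly. A minimal sketch, where the ratios are illustrative starting points (not measured constants) to be tuned against your own evals:

```python
def effective_budget(advertised_tokens, workload="retrieval"):
    """Derive an internal context budget from the advertised window.
    Ratios are illustrative defaults, not measured constants; calibrate
    them against your own task evals per model."""
    ratios = {"retrieval": 0.5, "reasoning": 0.25}
    return int(advertised_tokens * ratios[workload])

print(effective_budget(128_000))               # retrieval-heavy sessions
print(effective_budget(128_000, "reasoning"))  # multi-step reasoning sessions
```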
Session Health Metrics Worth Tracking
Most LLM observability setups instrument spans (individual calls) and traces (request-response pairs) but don't track session-level trends. Degradation becomes visible when you measure across turns.
Three signal categories are worth instrumenting:
Consistency drift: Sample queries from early in the session and re-ask them semantically near the end. Measure embedding distance between early and late answers to the same underlying question. Large drift indicates the session has shifted its understanding of the problem.
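The drift measurement itself is simple once you have embeddings. A sketch, assuming the answer vectors come from whatever embedding model you already run; the 0.35 flag threshold is a made-up placeholder to calibrate on labeled sessions:

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def consistency_drift(early_answer_vecs, late_answer_vecs, threshold=0.35):
    """Mean embedding distance between paired early/late answers to the
    same underlying question. Flags the session when mean drift exceeds
    the (illustrative) threshold."""
    drifts = [cosine_distance(e, l)
              for e, l in zip(early_answer_vecs, late_answer_vecs)]
    mean_drift = sum(drifts) / len(drifts)
    return mean_drift, mean_drift > threshold
```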
Agreement rate acceleration: Track the rate at which the model agrees with user statements across turns. A natural agreement rate is roughly flat. An accelerating agreement rate — the model agreeing with more and more user assertions over time — is an early sycophancy signal. This requires annotating each model response for whether it affirmed or challenged the user's position, which can be done cheaply with a small classifier.
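Given those per-turn annotations, acceleration reduces to a trend test. A sketch that fits a least-squares slope to the agreement labels; the interpretation thresholds are up to you:

```python
def agreement_slope(agreement_flags):
    """Least-squares slope of per-turn agreement labels (1 = model
    affirmed the user, 0 = challenged). A healthy session has a slope
    near zero; a clearly positive slope means the model is agreeing
    more and more as the session ages -- an early sycophancy signal."""
    n = len(agreement_flags)
    mean_x = (n - 1) / 2
    mean_y = sum(agreement_flags) / n
    num = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(agreement_flags))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den
```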
Contradiction density: Index all claims the model has made across the session and run a lightweight contradiction check against each new response. High contradiction density indicates the session has accumulated enough prior context to confuse itself.
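The density metric is just a pairwise check over the session's claim index. A sketch with the contradiction test left pluggable, since in production that slot would be an NLI model call rather than the toy predicate used here:

```python
def contradiction_density(claims, contradicts):
    """Fraction of claim pairs in the session that contradict each
    other. `contradicts(a, b)` is pluggable: in production, an NLI
    model call; in tests, any predicate you supply."""
    pairs = [(a, b) for i, a in enumerate(claims) for b in claims[i + 1:]]
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if contradicts(a, b)) / len(pairs)
```

For long sessions, checking each new response only against the indexed claims (rather than all pairs every turn) keeps the cost linear per turn.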
None of these require human evaluation at inference time. They can run asynchronously against session logs and flag sessions for review or automatic termination.
Context Pruning Without Losing the Thread
The practical response to context degradation is compression — reducing what the model sees without destroying continuity. Three strategies work in production:
Sliding-window summarization keeps a fixed-size verbatim buffer of recent turns and auto-summarizes older context when token counts exceed a threshold. The summary replaces raw history in the prompt, while recent turns stay verbatim. This preserves recency (which models attend to well) while compressing the problematic middle sections. Implementations using this approach report 3–5x context efficiency gains while maintaining task accuracy.
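The mechanism is small enough to sketch directly. This version keeps the summarizer pluggable (in production it would be an LLM call) and uses a crude whitespace token count; both the threshold and window size are illustrative defaults:

```python
def compress_history(turns, summarize, keep_recent=8, max_tokens=6000,
                     count_tokens=lambda t: len(t.split())):
    """Sliding-window compression: keep the last `keep_recent` turns
    verbatim and fold everything older into a single summary once raw
    history exceeds max_tokens. `summarize` is a pluggable callable
    (an LLM call in production); thresholds are illustrative."""
    total = sum(count_tokens(t) for t in turns)
    if total <= max_tokens or len(turns) <= keep_recent:
        return turns  # under budget: send history verbatim
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [summarize(older)] + recent
```

Because the verbatim buffer is always the most recent turns, the compressed middle is exactly the region models attend to worst anyway.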
Semantic filtering discards turns that are no longer relevant to the current line of inquiry. This requires a lightweight relevance classifier or embedding similarity check against the current query, but the payoff is significant: removing semantically irrelevant history reduces distractor interference, which has a step-function benefit. The challenge is that "irrelevant" is context-dependent — a tangent from 20 turns ago might matter for a question the user hasn't asked yet.
Selective extraction uses reranking to identify which prior turns contain information the model most needs for the current response, discards everything else, and reconstructs a compressed prompt with only the high-signal history. This is the most expensive strategy but also the most precise. For agentic workflows where earlier decisions have downstream consequences, it's often the right tradeoff.
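A minimal sketch of the reranking step, with the scorer left pluggable (a cross-encoder or reranker API in production) and chronological order restored after selection so the compressed history still reads as a conversation:

```python
def select_history(turns, query, score, top_k=5):
    """Selective extraction: rank prior turns by relevance to the
    current query, keep only the top_k, and restore chronological
    order. `score(turn, query)` is pluggable -- a cross-encoder or
    reranker call in production."""
    ranked = sorted(range(len(turns)),
                    key=lambda i: score(turns[i], query), reverse=True)
    keep = sorted(ranked[:top_k])  # back to chronological order
    return [turns[i] for i in keep]
```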
The wrong strategy is truncation from the beginning. Dropping the oldest turns preserves recency but destroys the goal-setting and constraint-establishment that typically happens early in a session. Most task definitions live in turns 1–3; truncating from the front is how you build a model that's responsive but has forgotten what it's supposed to be doing.
Detecting When to Start a New Session
The question teams rarely formalize: when should a new session begin? Not every context accumulation problem is worth engineering around — sometimes the right answer is ending the session.
Automatic session boundary detection looks for several signals:
- Topic divergence: High semantic distance between the current query and the session's initial topic cluster indicates the conversation has drifted into territory where the prior context is mostly noise.
- Contradiction threshold: When contradiction density exceeds a set threshold, the session has accumulated enough conflicting information that compression strategies are unlikely to recover coherent behavior.
- User intent reversal: Detecting an explicit restatement of goals that contradicts earlier goals ("actually, let's ignore what I said before and approach this differently") is a strong boundary signal.
- Turn count combined with quality drop: A hard session limit (e.g., 50 turns) combined with a quality-score decline is the bluntest but most reliable approach.
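The four signals above compose into a single decision function. A sketch in which every threshold is an illustrative starting point to be tuned, not a recommended value:

```python
def should_start_new_session(topic_distance, contradiction_density,
                             intent_reversed, turn_count,
                             quality_declining, max_turns=50):
    """Combine the boundary signals above into one decision.
    All thresholds are illustrative starting points, not tuned values."""
    if intent_reversed:
        return True   # explicit goal reversal: strongest signal
    if contradiction_density > 0.15:
        return True   # session has begun contradicting itself
    if topic_distance > 0.6:
        return True   # drifted far from the initial topic cluster
    if turn_count >= max_turns and quality_declining:
        return True   # hard limit combined with quality drop
    return False
```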
AWS Bedrock's 2024 session management API preview formalized sessions as first-class objects with unique IDs, timestamps, and explicit lifecycle management. This architectural shift — from implicit browser-tab sessions to explicitly bounded and monitored sessions — is the infrastructure change that enables systematic boundary detection.
The user experience implication is real but manageable: most users don't object to starting a fresh context if the application frames it correctly. "I've noticed our conversation has covered a lot of ground — would you like to start fresh with what we've established so far?" is not a failure state. It's the product working correctly.
The Practical Engineering Posture
The headline insight is that context windows are not reliability guarantees. A 128K token window means the model will accept 128K tokens — it says nothing about whether the model will reason coherently across all of them.
The operational posture that follows from this:
- Set internal effective context limits at roughly 50% of the advertised window for retrieval-heavy sessions, lower for reasoning-heavy ones.
- Build session health monitoring before context management. You need the signal before you can act on it.
- Prefer sliding-window summarization as the default compression strategy and add semantic filtering for sessions with high distractor risk.
- Treat session boundaries as a product feature, not an infrastructure limitation.
Long-session degradation is not a model failure — it's a systems design problem. The models behave exactly as their architecture predicts. The engineering question is whether you've built the infrastructure to detect, measure, and respond to that behavior before your users notice it first.
Sources
- https://openreview.net/pdf?id=VKGTGGcwl6
- https://arxiv.org/html/2510.07777v1
- https://www.arxiv.org/pdf/2503.11656
- https://aclanthology.org/2024.tacl-1.9/
- https://research.trychroma.com/context-rot
- https://arxiv.org/html/2509.09614v1
- https://arxiv.org/html/2505.07897v1
- https://aws.amazon.com/blogs/machine-learning/amazon-bedrock-launches-session-management-apis-for-generative-ai-applications-preview/
- https://arxiv.org/abs/2407.21443
- https://medium.com/@kuldeep.paul08/llm-monitoring-a-complete-guide-for-2025-79ce1a01bbb1
- https://galileo.ai/blog/context-engineering-for-agents/
