The Context Window Cliff: Application-Level Strategies for Long Conversations

· 10 min read
Tian Pan
Software Engineer

A 90-minute support session. A research assistant that's been browsing documents for an hour. A coding agent that's touched a dozen files. All of these eventually hit the same wall — and when they do, they don't crash loudly. They get dumb.

The model starts forgetting what was decided twenty minutes ago. It contradicts itself. Retrieval results that should be obvious go missing. Users notice something is off but can't articulate why the assistant got worse. This is the context window cliff: not a hard error, but a gradual quality collapse that your monitoring almost certainly doesn't measure.

Expanding the context window doesn't fix this. Models with million-token windows still degrade on content in the middle, and even when they don't, you're paying for 100x more tokens while the model attends to a fraction of them. The solution is application-level context management — deliberate strategies for what stays in the window, what gets summarized, and what lives outside it entirely.

Why the Cliff Is Harder Than It Looks

The naive mental model is that context fills up like a bucket: everything fits until it doesn't, then you get an error. The reality is worse.

Quality degrades well before the hard limit. As context grows, the attention mechanism has to distribute focus across more tokens, and the signal-to-noise ratio drops. Production teams running instrumented evaluations consistently report that model quality starts declining at 60–70% of the rated context window, not at 100%. The hard limit is not your effective limit.

The "lost in the middle" problem compounds this. Research on multi-document question answering found that model performance follows a U-shaped curve based on where information appears in the context: content at the beginning (primacy) and end (recency) gets reliably attended to, while content in the middle gets progressively ignored. For GPT-3.5-Turbo, placing the relevant document in the middle of a long context dropped performance below the no-document baseline — meaning retrieval actively hurt. This vulnerability persists even on models explicitly designed for long-context handling.

Frameworks hide the degradation. Most production frameworks silently truncate or simply drop old messages without telling the application layer. System prompts disappear. Tool outputs from early in the session get cut. The model proceeds as if it has full context, and you have no signal that anything changed. Newer models return explicit errors, but that only catches the hard limit — not the soft degradation that starts much earlier.

Latency grows with context. Every additional token in the input adds processing time. Benchmarks show roughly 7x latency increase at 15,000 words of context compared to short prompts. For interactive applications, this means users in long sessions experience progressively slower responses, often attributed to "network issues" or "the model being busy."

The Summarization Ladder

The most common fix is periodic summarization, but the naive version — "summarize the whole conversation" — doesn't scale well. A better pattern is a tiered summarization ladder.

Rolling buffer summarization maintains a fixed-length buffer of recent messages (typically the last 20–30 turns) alongside a running summary of everything before it. When the buffer fills, the oldest messages are summarized and merged into the running summary. The model always sees fresh context at full fidelity and older history as a compressed summary. The right trigger is 70–80% of context capacity — summarizing before you're forced to means the compression happens at a clean boundary rather than mid-conversation.
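A minimal sketch of the rolling-buffer pattern. The default `summarize` here is a trivial string concatenation standing in for what would be an LLM call in practice; the class and parameter names are illustrative, not any framework's API:

```python
from collections import deque

class RollingBufferMemory:
    """Last `buffer_size` turns verbatim, plus a running summary of everything older."""

    def __init__(self, buffer_size=20, summarize=None):
        self.buffer = deque()
        self.buffer_size = buffer_size
        self.summary = ""
        # Stand-in summarizer: concatenates text. A real one calls a model.
        self.summarize = summarize or (
            lambda old, msgs: (old + " " + " ".join(m["content"] for m in msgs)).strip()
        )

    def add(self, message):
        self.buffer.append(message)
        if len(self.buffer) > self.buffer_size:
            # Fold the oldest half of the buffer into the running summary,
            # so compression happens at a clean boundary, not one turn at a time.
            overflow = [self.buffer.popleft() for _ in range(self.buffer_size // 2)]
            self.summary = self.summarize(self.summary, overflow)

    def assembled_context(self):
        parts = []
        if self.summary:
            parts.append({"role": "system",
                          "content": f"Summary of earlier conversation: {self.summary}"})
        parts.extend(self.buffer)
        return parts
```

In production the trigger would be token count against the 70–80% threshold rather than turn count, but the shape is the same.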

Hierarchical summarization handles long documents rather than conversation history. It chunks a document, summarizes each chunk independently, then combines summaries and re-summarizes. The result is a tree of summaries at different granularities. This is effective for research workflows where an agent reads a 200-page document — but be aware that errors propagate up the tree. A mistake in a chunk summary becomes a mistake in every higher-level summary that includes it.
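The tree-building step can be sketched in a few lines, assuming a `summarize` function that maps a list of texts to one text (an LLM call in practice — here the test stand-in is pure string joining):

```python
def hierarchical_summary(chunks, summarize, fan_in=4):
    """Build a summary tree bottom-up: summarize each chunk, then repeatedly
    summarize groups of `fan_in` summaries until a single root remains.

    Returns (root_summary, levels), where levels[0] holds the leaf summaries.
    Note the failure mode from the text: an error at any level propagates
    into every level above it."""
    level = [summarize([c]) for c in chunks]
    levels = [level]
    while len(level) > 1:
        level = [summarize(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
        levels.append(level)
    return level[0], levels
```

Keeping `levels` around (not just the root) is what makes the tree useful: a research agent can drill down from the root to a mid-level summary to the original chunk.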

Entity-preserving summarization is the most expensive but most reliable variant. Rather than summarizing prose, the system extracts entities (people, systems, decisions, open questions) and maintains a structured record of them separately from the compressed conversation text. When the model needs to reason about a specific entity, it has exact data rather than a potentially garbled summary. This is particularly important for multi-session agents where summarization errors compound across sessions.
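One way to keep that structured record is a simple ledger keyed by entity kind and name; the extraction step that populates it (a model call in practice) is omitted, and the rendering format is an illustrative choice:

```python
class EntityLedger:
    """Structured entity record maintained alongside the compressed history.
    Updates overwrite prior facts, so the ledger reflects current state."""

    def __init__(self):
        self.entities = {}

    def record(self, kind, name, facts):
        # kind: e.g. "person", "system", "decision", "open_question"
        self.entities.setdefault((kind, name), {}).update(facts)

    def render(self, kinds=None):
        """Render the ledger as exact text for injection into the prompt."""
        lines = []
        for (kind, name), facts in sorted(self.entities.items()):
            if kinds and kind not in kinds:
                continue
            attrs = ", ".join(f"{k}={v}" for k, v in sorted(facts.items()))
            lines.append(f"[{kind}] {name}: {attrs}")
        return "\n".join(lines)
```

Because entries are exact key-value data rather than prose, re-summarizing the conversation can never garble them — which is the point for multi-session agents.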

The critical design question is what to preserve at full fidelity versus what to compress. Good heuristics: keep the most recent N turns verbatim (the model needs precise recent context), keep any explicit user instructions or constraints verbatim, and keep the outputs of tool calls verbatim (summaries of structured data introduce errors). Compress conversational filler, intermediate reasoning, and content that has been superseded by later information.

Selective Retention: Not Everything Is Worth Keeping

Summarization assumes all conversation history has some value. Selective retention challenges that assumption. Most conversations contain significant amounts of content that is either redundant, superseded, or irrelevant to the current state of the task.

Relevance-scored eviction assigns a score to each context element based on recency, semantic similarity to the current query, and whether the element was explicitly referenced in a later turn. When context fills, low-scoring elements are evicted first. This is more accurate than simple sliding window approaches but requires maintaining metadata alongside context.
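A toy version of the scoring makes the metadata requirement concrete. The 0.4/0.4/0.2 weights and the decay constant are illustrative assumptions, not tuned values, and word overlap stands in for embedding similarity:

```python
import math
import time

def score(element, query_terms, now=None):
    """Combine recency decay, term overlap with the current query,
    and a bonus for elements referenced in a later turn."""
    now = now or time.time()
    age_hours = max(0.0, (now - element["ts"]) / 3600)
    recency = math.exp(-age_hours / 2)  # decays to ~0.37 after 2 hours
    overlap = len(query_terms & set(element["text"].lower().split())) / max(1, len(query_terms))
    referenced = 1.0 if element.get("referenced_later") else 0.0
    return 0.4 * recency + 0.4 * overlap + 0.2 * referenced

def evict(elements, query, keep):
    """Keep the `keep` highest-scoring elements; the rest are evicted."""
    terms = set(query.lower().split())
    ranked = sorted(elements, key=lambda e: score(e, terms), reverse=True)
    return ranked[:keep]
```

The `ts` and `referenced_later` fields are exactly the metadata a sliding window doesn't need — that bookkeeping is the cost of the accuracy gain.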

Supersession tracking identifies when new information explicitly overrides old. If the user says "actually, ignore that file — use this one instead," the old file reference is worthless in context. Systems that track supersession events can aggressively evict stale information rather than carrying it forward.
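Once supersession pairs have been detected — by a classifier or tool metadata, which is the hard part and is assumed here — the eviction itself is trivial:

```python
def apply_supersessions(history, supersedes):
    """Drop turns whose information was explicitly overridden later.

    `supersedes` maps a turn index to the earlier turn index it replaces,
    e.g. {2: 0} means turn 2 makes turn 0 stale."""
    stale = set(supersedes.values())
    return [turn for i, turn in enumerate(history) if i not in stale]
```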

Conversation phase detection distinguishes between exploration (early turns where many threads are open) and execution (later turns where a clear path is chosen). In execution phase, much of the exploration context is noise. Detecting the transition and compressing the exploration phase is a high-leverage eviction opportunity.

The failure mode to watch for: over-aggressively evicting context that seems redundant but turns out to be a load-bearing constraint. A requirement the user stated fifteen turns ago is exactly the kind of thing that gets silently dropped. Assign constraints and requirements much higher retention scores than ordinary discussion.

Externalization: Move Context Out of the Window

The most scalable approach treats the context window as working memory and externalizes everything that doesn't need to be there.

Retrieval-augmented generation is the well-known version: embed source material, store in a vector database, retrieve only the top-matching chunks at query time. The production detail that matters: retrieval precision at scale requires more than raw embedding similarity. Hybrid search (dense + sparse), re-ranking, and query expansion all improve the relevance of what ends up in context. Three highly relevant chunks almost always outperform ten moderately relevant chunks, both for quality and cost.
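One standard way to combine dense and sparse rankings is reciprocal rank fusion; a minimal version, using the conventional `k=60` constant:

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=3):
    """Fuse multiple ordered result lists (e.g. one from dense embedding
    search, one from sparse BM25) into a single ranking.

    Each document scores 1/(k + rank + 1) per list it appears in; documents
    ranked well by both retrievers float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The `top_n=3` default reflects the point above: a few highly relevant chunks beat many moderately relevant ones.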

Agent memory servers implement a dual-tier pattern: short-term memory (current session, full fidelity) and long-term memory (cross-session, retrieved by semantic similarity). When a session ends, important facts get extracted and written to long-term memory. When a new session starts, relevant long-term memories are injected as context. This is how an agent can "remember" details from a conversation two weeks ago without carrying that history in every subsequent prompt.
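A compressed sketch of the dual-tier pattern, with naive word overlap standing in for semantic retrieval and a caller-supplied `extract` function standing in for model-driven fact extraction:

```python
class MemoryStore:
    """Short-term tier: current-session turns, verbatim.
    Long-term tier: extracted facts, retrieved by similarity at session start."""

    def __init__(self, extract):
        self.session = []
        self.long_term = []
        self.extract = extract  # turns -> list of durable facts

    def add_turn(self, text):
        self.session.append(text)

    def end_session(self):
        # Extraction happens once, at session boundary -- not per turn.
        self.long_term.extend(self.extract(self.session))
        self.session = []

    def recall(self, query, top_n=2):
        """Retrieve long-term memories relevant to a new session's query."""
        terms = set(query.lower().split())
        scored = sorted(self.long_term,
                        key=lambda f: len(terms & set(f.lower().split())),
                        reverse=True)
        return scored[:top_n]
```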

Externalizing tool outputs is underused. When an agent calls a tool that returns a large document or dataset, the naive approach puts the entire output in context. A better pattern: store the raw output externally, put only a structured summary or the specific fields the model needed in context, and make the full output available via a follow-up tool call if the model needs more detail. This pattern — retrieve-to-context on demand — can reduce context size by 60–80% in document-heavy workflows.
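A sketch of the store-and-fetch pattern — the names here are illustrative, not any particular framework's API:

```python
import uuid

class ToolOutputStore:
    """Keep raw tool outputs outside the context window; only a compact
    record enters context, with a follow-up call for on-demand detail."""

    def __init__(self):
        self._blobs = {}

    def put(self, output, summary):
        ref = uuid.uuid4().hex[:8]
        self._blobs[ref] = output
        # This small record is all that goes into the model's context.
        return {"ref": ref, "summary": summary, "size_chars": len(output)}

    def fetch(self, ref, start=0, length=500):
        """Exposed to the model as a tool: page through the raw output."""
        return self._blobs[ref][start:start + length]
```

The model sees "report `ref=...`, 10,000 chars, 3 findings" instead of 10,000 characters, and can call `fetch` only when it actually needs the detail.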

Semantic caching addresses a different dimension: why keep re-computing context that's been seen before? A semantic cache recognizes when a new query is semantically equivalent to a previous one and returns the cached response, bypassing context assembly and inference entirely. At scale, this reduces both latency and cost significantly — and has the side effect of keeping frequently-asked queries from inflating context with redundant retrieval results.
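A toy cache using Jaccard word overlap as a stand-in for embedding similarity; the threshold value is an illustrative assumption that would need tuning against real query traffic:

```python
def _jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

class SemanticCache:
    """Return a cached response when a new query is close enough in
    meaning to a previously answered one."""

    def __init__(self, threshold=0.6):
        self.entries = []  # (query, response) pairs
        self.threshold = threshold

    def get(self, query):
        best = max(self.entries, key=lambda e: _jaccard(query, e[0]), default=None)
        if best and _jaccard(query, best[0]) >= self.threshold:
            return best[1]
        return None  # cache miss: fall through to context assembly + inference

    def put(self, query, response):
        self.entries.append((query, response))
```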

The Token Budget Framework

All of these strategies require a mental model that most engineers don't have when they start: a context window has a budget, and every element competes for a finite allocation.

Treat the context budget as having explicit categories:

  • System prompt: Your instructions, persona, constraints. These must survive intact. Reserve space explicitly; never let conversation history crowd them out.
  • Retrieved context: Chunks from external sources. Cap this at a fraction of the total budget, not an unlimited expansion.
  • Conversation history: Recent turns in full, older turns summarized. Define the boundary explicitly.
  • Working state: Current task description, open questions, decisions made. Keep this tight and up-to-date.
  • Response space: The model needs room to generate. This is often forgotten until output gets cut off.

The most common production failure is having no budget framework at all — just appending turns until something breaks. With an explicit budget, you can make deliberate tradeoffs: in a research workflow, give more budget to retrieved context; in a multi-turn dialogue, give more to conversation history.
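The categories above can be made concrete as an explicit allocation. The fractions below are illustrative defaults for a dialogue-heavy workload, not recommendations:

```python
def allocate_budget(total_tokens, weights, reserved_response=1024):
    """Split a context window into explicit per-category token budgets.

    Response space is reserved off the top (the commonly forgotten category);
    the remaining tokens are divided by fractional weights."""
    usable = total_tokens - reserved_response
    assert usable > 0, "window too small for reserved response space"
    budgets = {cat: int(usable * w) for cat, w in weights.items()}
    budgets["response"] = reserved_response
    return budgets

# Dialogue-heavy split: history gets the largest share.
budget = allocate_budget(
    8192,
    {"system": 0.10, "retrieved": 0.35, "history": 0.40, "working_state": 0.15},
)
```

For a research workflow you would shift weight from `history` to `retrieved`; the point is that the tradeoff is now a number you chose rather than an accident of append order.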

What Monitoring to Add

Most teams measure context-window-related problems indirectly and late. The signals that catch problems early:

Context utilization percentage per request. Track the distribution, not just the mean. P95 and P99 utilization above 75% are the real signal — you're regularly close to the soft quality cliff.

Response quality by context utilization percentile. Use your eval metrics (whatever you already track — user ratings, task success, hallucination rate) and segment by how full the context window was. If quality drops significantly above some utilization threshold, you've identified your soft cliff.

Summarization trigger rate. If summarization is triggering on every request in your longest sessions, the conversation design has outgrown the strategy. You need a more aggressive externalization approach.

System prompt survival rate. Explicitly check whether your system prompt appears in the final assembled context. Silent truncation that removes it is a severe failure mode that standard error monitoring won't catch.
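Both the survival check and a soft-cliff check can run as a pre-flight step before each request. Character count stands in for a real tokenizer here, and the 75% threshold echoes the utilization signal above:

```python
def check_context_invariants(assembled_messages, system_prompt,
                             max_tokens, count_tokens=len):
    """Return a list of problems found in the final assembled context.

    `count_tokens` defaults to character length as a crude stand-in;
    swap in a real tokenizer for production use."""
    problems = []
    # Survival check: the system prompt must appear intact in a system message.
    if not any(m["role"] == "system" and system_prompt in m["content"]
               for m in assembled_messages):
        problems.append("system_prompt_missing")
    # Soft-cliff check: flag requests near the quality-degradation zone.
    total = sum(count_tokens(m["content"]) for m in assembled_messages)
    if total > 0.75 * max_tokens:
        problems.append("soft_cliff_exceeded")
    return problems
```

Emitting these as metrics (rather than hard failures) gives you the early-warning distribution described above.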

What Most Frameworks Get Wrong

The biggest mistake is treating context management as an afterthought — something to bolt on when users start complaining about degraded behavior. By then, you've trained users on a behavior that will regress during the migration.

Design for context pressure from day one. Choose your summarization trigger points, externalization strategy, and budget framework before you ship. Monitor context utilization from the first week. The conversations that strain your context window are almost always the most valuable ones — power users, complex tasks, long sessions — and those users will be the first to notice when quality falls off.

The second mistake is treating the advertised context window as the effective context window. The numbers that matter are where quality starts degrading (typically 60–80% of maximum), how latency scales with context length, and what your specific model does at different positions. These vary across models and you should measure them on your actual use cases, not assume the benchmark numbers apply to you.

Context windows will keep getting larger. The engineering discipline for managing them needs to keep pace — not because the limits won't grow, but because the failure modes don't disappear with more tokens. They just get harder to notice.
