Conversation History Is a Liability Your Prompt Never Admits
Read your product's analytics the next time a user says "the AI got dumber today." Filter to sessions over twenty turns. You will find the same curve every time: early turns score well, middle turns hold up, late turns fall off a cliff. The prompt hasn't changed. The model hasn't changed. What changed is that every one of those late turns is carrying a payload of user typos, false starts, model hedges, corrections that were later reversed, tool outputs nobody re-read, and the fossilized remains of a goal that the user abandoned on turn four. Your prompt template treats this sediment as signal. The model does too. It shouldn't.
Chat history is not free context. It is a liability you are paying to re-send on every turn, and the dirtier it gets, the more it corrupts the answer you are billing the user for. The chat metaphor is the source of the confusion. Chat interfaces habituate users and engineers to treat the transcript as sacred — scrollable, append-only, never reset. That habit is imported wholesale into LLM applications even though it has no physical basis in how models process context. The model is stateless. The transcript is just a string you chose to grow. You can shrink it. You often should.
The per-turn quality attrition curve nobody plots
Context rot is the 2025 term of art for what happens, and the evidence is now hard to argue with. Chroma's systematic study of eighteen frontier models — GPT-4.1, Claude, Gemini, the full roster — found that every single one degrades as input length grows, and models with nominal 200K windows start showing measurable quality drops well before 50K. Microsoft's multi-turn study simulated over 200,000 conversations across fifteen production models and reported a 39% average accuracy drop when the same task was split into a multi-turn exchange instead of a single prompt. Not a degradation on some niche stress test. A drop across general task performance.
The reasons stack. Attention distributes unevenly across a long input: tokens at the beginning and end dominate, the middle sinks into a low-attention trough — the well-documented "lost in the middle" effect. Rotary position embeddings amplify this by design. Each additional turn pushes earlier correct information into that trough and adds new tokens that compete for the same finite attention budget. Worse, the model's own past output carries disproportionate weight in future turns: once it commits to an answer shape on turn three, it leans on that shape at turn twelve even when the user has quietly pivoted. Microsoft's researchers labeled this precisely: premature answer attempts, verbosity escalation, and over-adjustment to the first and last turns at the expense of everything in between.
Plot the per-turn accuracy of a real agent session, not a benchmark. The line is not flat. It bends.
Every turn adds sediment the model cannot tell from signal
A running conversation accumulates four kinds of junk, and none of them look like junk to the transformer reading them.
The first is user-side noise: typos corrected on the next turn, half-sentences abandoned, requests restated because the first phrasing didn't land. The model sees the first attempt and the second attempt with equal prominence. On later turns it may resurface the first, ungrammatical version of the question because that's what the attention pattern picks up.
The second is model-side hedges and retractions. "I think it's X, but I'm not sure" followed three turns later by "actually it's Y" leaves both assertions in context. Ask the model anything tangentially related on turn twenty and it will cheerfully quote the hedge it has already retracted. Google's Gemini team documented a clean version of this while building an agent that played Pokémon: the agent hallucinated possession of an item, the hallucination got written into the "goals" section of its context, and from then on every turn reinforced a false belief. Context poisoning is not a theoretical risk. It is the default behavior of any system that treats its own past output as ground truth.
The third is stale corrections. A user says "call me by my first name" on turn two, then changes their mind on turn eight. Both instructions are still in the transcript, and which one dominates depends on position effects you do not control.
The fourth is tool output no one re-reads. A 4KB JSON blob from a lookup three turns ago is still consuming 1,200 tokens of attention budget on every subsequent turn, and the answer it contained has already been extracted and used. You are paying to re-send garbage indefinitely.
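One cheap mitigation, sketched below assuming OpenAI-style message dicts: stub out tool results once they are more than a few user turns old. The three-turn window here is an arbitrary placeholder, not a recommendation.

```python
# Sketch: stub out the bodies of tool results older than the last few user
# turns so they stop consuming attention budget. Assumes OpenAI-style message
# dicts; the three-turn window is an arbitrary placeholder.
def prune_stale_tool_output(messages: list[dict], keep_last_n_turns: int = 3) -> list[dict]:
    user_indices = [i for i, m in enumerate(messages) if m["role"] == "user"]
    cutoff = user_indices[-keep_last_n_turns] if len(user_indices) >= keep_last_n_turns else 0
    pruned = []
    for i, msg in enumerate(messages):
        if msg["role"] == "tool" and i < cutoff:
            pruned.append({**msg, "content": "[tool result pruned: already consumed in an earlier turn]"})
        else:
            pruned.append(msg)
    return pruned
```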
A stateless model has no way to distinguish any of this from the user's actual current question. You have to do it for it.
Trimming is cheap and lossy. Summarizing is expensive and hallucinogenic. Pick your poison knowingly.
The two industry-default compaction strategies both have failure modes most teams discover in production.
Trimming — the sliding window of the last N messages — is O(1), deterministic, and has zero chance of introducing new hallucinations because it generates no new text. The cost is that any instruction, preference, constraint, or fact the user gave you before the window cutoff is gone, and the model has no way to know it ever existed. The user who said "always reply in metric units" on turn two will get imperial replies on turn forty and will correctly conclude the product is broken. Trimming is cheap to implement and cheap to run, but it silently drops the exact load-bearing details that distinguish a good long session from a bad one.
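Trimming is a few lines of code. A minimal sketch, again assuming OpenAI-style message dicts, and note that these few lines are exactly what silently drops the turn-two metric-units instruction:

```python
# Minimal sliding-window trim: keep the system prompt plus the last N messages.
# Deterministic, cheap, no new text generated, and anything before the cutoff
# (including "always reply in metric units" from turn two) is simply gone.
def trim(messages: list[dict], max_messages: int = 20) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]
```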
Summarization — running an LLM over the older turns and replacing them with a compressed digest — preserves long-range information in principle. In practice, it is a lossy encoder with semantic drift. Research on iterative summarization shows a specific and predictable pathology: "I like mild spicy food" becomes "likes spicy food" after one pass and "loves very spicy food" after the next. Low-frequency, high-importance details — "never call the production database directly," "the customer's ID is 47, not 4772" — tend to vanish by the third compression. And every summarization pass is a fresh hallucination surface: the summarizer invents connections that weren't there, consolidates distinct requests into confused ones, and injects its own priors into what the user supposedly said. You traded trimming's clean forgetting for summarization's creative rewriting.
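The mechanics are just as short to sketch. The `summarize` callable below is a placeholder for whatever LLM call you use; every invocation of it is where the creative rewriting sneaks in.

```python
# Sketch of the rolling-summary pattern: fold older turns into one digest and
# re-digest as more turns age out. Assumes the system prompt is pinned
# elsewhere and is not part of `messages`. `summarize` wraps whatever LLM call
# you use; each pass is a fresh chance for the digest to drift.
from typing import Callable

def compact_by_summary(messages: list[dict], summarize: Callable[[str], str],
                       keep_last: int = 6) -> list[dict]:
    head, tail = messages[:-keep_last], messages[-keep_last:]
    if not head:
        return messages
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in head)
    digest = summarize(transcript)  # lossy: low-frequency, high-importance details go first
    return [{"role": "system", "content": f"Summary of earlier conversation:\n{digest}"}] + tail
```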
Hybrid approaches — pin the system prompt and the last K turns verbatim, summarize only the middle band, and carry a small extracted-facts structure across summarizations — dominate both extremes in practice. But they are not free either. Factory.ai's production write-up describes maintaining a "rolling summary of the information that actually matters" combined with anchored summaries that only get re-summarized when new spans are dropped. The cost is a real engineering effort: you need a schema for extracted facts, a policy for which facts survive compaction, and an eval harness that catches regressions when a compaction run strips out the wrong thing. The payoff is 10x effective session length in the teams that actually do it. Teams that don't — most teams — are running on trim-or-drift autopilot and wondering why turn thirty feels worse than turn three.
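The shape of a hybrid, as a sketch rather than anyone's production implementation (the facts schema, the `summarize` callable, and the keep-last-six policy are all placeholders):

```python
# Sketch of the hybrid: pin the system prompt and a facts dict, keep the last K
# turns verbatim, summarize only the overflow. The facts dict is what protects
# load-bearing details from drifting through repeated summarization passes.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CompactedState:
    pinned_system: str                                    # never compacted
    facts: dict[str, str] = field(default_factory=dict)   # durable, survives every compaction
    middle_summary: str = ""                              # lossy digest of the middle band
    recent: list[dict] = field(default_factory=list)      # last K turns, verbatim

def compact(state: CompactedState, new_messages: list[dict],
            summarize: Callable[[str], str], keep_last: int = 6) -> CompactedState:
    turns = state.recent + new_messages
    overflow, state.recent = turns[:-keep_last], turns[-keep_last:]
    if overflow:
        transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in overflow)
        # Fold the old digest in with the newly dropped span and re-summarize.
        state.middle_summary = summarize(state.middle_summary + "\n" + transcript)
    return state
```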
The compaction signals your system should watch for
Compaction should not fire on a fixed turn count. It should fire on observable conversation-health signals that predict imminent degradation. A few that are cheap to measure and disproportionately useful (three of them are sketched in code after the list):
- Token utilization relative to the model's effective window, not its nominal window. If Chroma's numbers hold, "effective" is often 25–40% of "nominal." Budget accordingly.
- Repetition rate in the model's last K outputs. A rising bigram-overlap score across turns is an early indicator the model is locking onto a repeating template and ignoring new user input.
- User correction frequency. Count how often the last three user turns contain "no," "actually," "I meant," or any phrasing that means "you got it wrong." A spike means the model is misreading the state; compaction + restatement helps reset it.
- Tool-call redundancy. If the agent is calling the same tool with the same arguments it already called ten turns ago, earlier tool output has rotted out of effective attention. Compact and re-surface the relevant facts explicitly.
- Topic drift distance. Embed the last user turn, embed the first user turn, measure distance. Past a threshold, the early turns are probably not the right context anymore — they are ballast.
- Self-contradiction detection. Cheap pass: extract claims from the last assistant turn, check them against the running extracted-facts state, flag mismatches for compaction-time resolution.
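Rough sketches of three of these signals: correction frequency, repetition measured as bigram overlap, and topic drift. The marker list, the thresholds, and the injected `embed` callable are illustrative assumptions, not tuned values.

```python
# Sketches of three conversation-health signals. The marker list and the
# `embed` callable (returns a vector for a string) are illustrative assumptions.
import math
from typing import Callable

CORRECTION_MARKERS = ("no,", "actually", "i meant", "that's wrong", "not what i")

def correction_frequency(user_turns: list[str], window: int = 3) -> float:
    # Fraction of the last few user turns that contain a "you got it wrong" marker.
    recent = [t.lower() for t in user_turns[-window:]]
    hits = sum(any(marker in t for marker in CORRECTION_MARKERS) for t in recent)
    return hits / max(len(recent), 1)

def bigram_overlap(a: str, b: str) -> float:
    # Jaccard overlap of word bigrams; rising values across the model's last
    # few outputs suggest it is locking onto a repeating template.
    def bigrams(s: str) -> set:
        words = s.lower().split()
        return set(zip(words, words[1:]))
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / max(len(x | y), 1)

def topic_drift(first_turn: str, last_turn: str,
                embed: Callable[[str], list[float]]) -> float:
    # Cosine distance between the first and latest user turns; past a threshold,
    # the early turns are probably ballast.
    u, v = embed(first_turn), embed(last_turn)
    denom = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)) or 1e-9
    cos = sum(a * b for a, b in zip(u, v)) / denom
    return 1.0 - cos
```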
Most teams I have seen fire compaction on "we're approaching the context limit," which is the worst possible trigger because it waits until the damage is already done. Fire on health signals. Fire early. Fire aggressively.
The "no reset" taboo is a UX choice, not an engineering constraint
The hardest part of this problem is not technical. It is that "the AI reset our conversation" reads as rude, broken, or forgetful to users who have been trained by every chat product since SMS that transcripts are permanent. So teams avoid visible compaction, let the transcript grow, and eat the quality degradation instead. Then they get bug reports about the AI getting worse and treat those as model-quality issues rather than context-management issues.
The right frame is that your job is to preserve the user's intent and state, not the literal transcript. Show the transcript in the UI — users want to scroll. Do not send the transcript to the model. Send a curated context built from a pinned system prompt, a durable extracted-facts structure, the last handful of turns verbatim, and a summary of the middle band. The user's scroll view and the model's input can be — should be — different artifacts. Treating them as the same thing is the concession to the chat metaphor that is costing you accuracy.
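Concretely, and reusing the CompactedState sketch from earlier: the model input is rebuilt from curated parts on every turn, while the full transcript stays in the UI.

```python
# Sketch: the model input is assembled fresh each turn from the CompactedState
# defined earlier; the scrollable UI transcript is never sent as-is.
def build_model_input(state: CompactedState, current_user_message: str) -> list[dict]:
    facts_block = "\n".join(f"- {k}: {v}" for k, v in state.facts.items())
    context = [
        {"role": "system", "content": state.pinned_system},
        {"role": "system", "content": f"Durable facts for this session:\n{facts_block}"},
    ]
    if state.middle_summary:
        context.append({"role": "system", "content": f"Summary of earlier turns:\n{state.middle_summary}"})
    context += state.recent                                          # last K turns, verbatim
    context.append({"role": "user", "content": current_user_message})
    return context
```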
Two further UX moves help. Make compaction events legible — a small, friendly "I've organized our conversation so far" marker, possibly expandable to show what was retained — so compaction reads as attentiveness rather than forgetting. And give the user an explicit, one-click "start fresh but keep [preferences / project / facts]" action. Half the time, the most useful thing a long-session user can do is start a new session that inherits only the distilled state. Most products make this disproportionately painful, which is why sessions run on until they collapse under their own weight.
What to do on Monday
If you ship a chat product and haven't touched your context handling in six months, three things are worth doing this week. First, instrument per-turn accuracy on whatever eval you already run, and plot it against turn number. If the line is flat you either don't have enough turns in your evals or your evals aren't sensitive. Second, measure how much of every prompt is tool output from three or more turns ago; that's your cheapest win. Third, pick one compaction signal from the list above — user correction frequency is the easiest — and wire a compaction trigger to it. Ship the rolling summary + pinned facts pattern before you ship anything clever.
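For the second item, a crude measurement pass is enough to start. The characters-divided-by-four token count below is a stand-in for a real tokenizer, and the message shape is the same OpenAI-style assumption as above.

```python
# Rough measurement of how much of each prompt is tool output from three or
# more user turns ago. len(text) // 4 is a crude stand-in for a tokenizer.
def stale_tool_output_share(messages: list[dict], age_turns: int = 3) -> float:
    def tokens(text: str) -> int:
        return max(len(text) // 4, 1)
    user_indices = [i for i, m in enumerate(messages) if m["role"] == "user"]
    cutoff = user_indices[-age_turns] if len(user_indices) >= age_turns else 0
    total = sum(tokens(m["content"]) for m in messages)
    stale = sum(tokens(m["content"]) for i, m in enumerate(messages)
                if m["role"] == "tool" and i < cutoff)
    return stale / total if total else 0.0
```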
Conversation history is not memory. It is unstructured append-only log data that you are choosing, every turn, to treat as input to a model that has no way to distinguish its load-bearing parts from its noise. The teams that get long sessions right treat the transcript as a liability to be curated and the model's working context as a contract to be managed. The teams that don't are the ones whose users write "it used to be smarter" in the feedback form.
- https://blog.logrocket.com/llm-context-problem-strategies-2026/
- https://www.morphllm.com/context-rot
- https://www.producttalk.org/context-rot/
- https://www.elastic.co/search-labs/blog/context-poisoning-llm
- https://www.understandingai.org/p/context-rot-the-emerging-challenge
- https://arxiv.org/pdf/2505.06120
- https://www.prompthub.us/blog/why-llms-fail-in-multi-turn-conversations-and-how-to-fix-it
- https://mem0.ai/blog/llm-chat-history-summarization-guide-2025
- https://learn.microsoft.com/en-us/agent-framework/agents/conversations/compaction
- https://factory.ai/news/compressing-context
- https://arxiv.org/abs/2307.03172
- https://jameshoward.us/2024/11/26/context-degradation-syndrome-when-large-language-models-lose-the-plot
- https://dev.to/amitksingh1490/how-we-extended-llm-conversations-by-10x-with-intelligent-context-compaction-4h0a
