
The Summarizer That Paraphrased Away the User's Literal Question

· 8 min read
Tian Pan
Software Engineer

A user asks: "Does this qualify as a 'transfer' under article 28?" Forty turns later, the model gives an answer to a different question. The transcript shows the model answered the question it was given. The user files a complaint that reads like a hallucination report. Both are right. The model never saw the user's question; it saw your summarizer's polite translation of it: "user asked about article 28 applicability."

The word "transfer" was the question. The summarizer threw it away because the summarizer's loss function was tuned to preserve facts, not wording, and the rubric never learned the difference between paraphrasing the topic and paraphrasing the constraint. Topic was preserved. Constraint became fog.

This failure mode is structural, not anecdotal. Any application that compresses long conversations with a model-generated summary has a second model in the critical path — one whose quality contract is usually treated as a token-budget knob rather than as a piece of product logic. That asymmetry is where the bug lives.

Summarization is lossy compression with the wrong loss function

The textbook description of conversational summarization sounds harmless: drop the older turns, generate a brief summary of what happened, prepend the summary to the live context, and the model on turn forty-one effectively sees turn one through a thin layer of paraphrase. The compression ratio is real — a forty-turn dialogue collapses to a few hundred tokens — and the latency penalty is bounded. Most production "long context" strategies are some variant of this.

What the textbook omits is that the summarizer is being asked to optimize a loss function the application never specified. The default summarizer prompt rewards conciseness, factual coverage, and topical fidelity. It does not reward verbatim preservation of named entities, the load-bearing verbs in a question, or the exact phrasing of a constraint. Recent work on long-form summarization has documented a "U-shaped" faithfulness curve: models faithfully summarize the beginning and end of a document while neglecting the middle, and that bias compounds with each pass. By turn forty-one the literal first turn has been through one or more rewrites, each one trained to sound natural and informative, none of them trained to be a contract.

The thing that breaks is not "did the model retrieve turn one." The model retrieved its replacement. The summarizer rewrote a yes/no question that hinged on the legal interpretation of a verb into a topical pointer with no constraint, and the downstream model answered the pointer.

The constraint is the question; the topic is just a tag

A useful mental model: most user turns carry both a topic ("article 28 of the regulation") and a constraint ("does the specific act we just described qualify as a 'transfer'"). Topic is what a search engine would index. Constraint is the load-bearing logic of the request.

Summarizers, especially when they are running on a short instruction like "preserve the key facts of each turn in one sentence," are almost universally biased toward topic. Topic is general, easier to paraphrase, and produces output that looks coherent to a human evaluator. Constraint is specific, often hinges on a single verb or a single qualifier, and is the first thing a paraphrase corrupts. The summarizer is doing its job. Its job is the wrong job.

This shows up in non-legal domains constantly. A medical user asks whether a drug is contraindicated for a specific pre-existing condition; the summary records "user asked about drug interactions." A finance user asks whether a particular structure qualifies as a wash sale; the summary records "user discussed tax treatment of a trade." A support user asks whether the refund covers shipping; the summary records "user asked about the refund policy." In every case the answer to the topical paraphrase is different from the answer to the original question, and the model is forced to answer the paraphrase because nothing else survived.

The thing you cannot summarize: the user's exact words

The pattern that closes the gap is unfashionably simple: the original user query, verbatim, is not subject to summarization. It is treated as a structured slot in the context, alongside the rolling summary, and it stays present for the lifetime of the session.

Some recent work in agent context compression has formalized this as a "surviving vocabulary principle" — when the compressor must paraphrase, it is constrained to reuse the participants' actual phrasing rather than invent its own. A more aggressive variant deletes low-signal content while keeping every surviving token identical to the input, so the compressed transcript is a strict subset rather than a rewrite. Either approach denies the summarizer the freedom to substitute its own words for the user's. That denial is the entire point.
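The "strict subset" property of the deletion-only variant is easy to check mechanically. The sketch below is a minimal illustration, assuming whitespace tokenization; the function names are hypothetical, and a real system would use its model's tokenizer.

```python
def tokenize(text: str) -> list[str]:
    """Whitespace tokenization (an assumption; swap in the real tokenizer)."""
    return text.split()


def is_strict_subset(original: str, compressed: str) -> bool:
    """True if every compressed token appears in the original, in the same order."""
    remaining = iter(tokenize(original))
    # `token in remaining` consumes the iterator, which enforces ordering.
    return all(token in remaining for token in tokenize(compressed))


original = "Does this qualify as a 'transfer' under article 28?"
deleted = "Does this qualify as 'transfer' under article 28?"
rewritten = "user asked about article 28 applicability"

print(is_strict_subset(original, deleted))    # deletion only
print(is_strict_subset(original, rewritten))  # paraphrase
```

A check like this can run as a guard on the compressor's output: a deletion-only compression passes, while any paraphrase that introduces the summarizer's own words fails.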

In a chat application this translates to three concrete behaviors. First, the most recent K turns stay verbatim — non-negotiable. Second, any user turn that can be identified as a primary intent — a question, a request, a constraint statement — is pinned verbatim regardless of age, even when its surrounding turns get compressed. Third, the summarizer's prompt explicitly instructs it that named entities, modal verbs, qualifiers, and quoted phrases must be carried through unchanged or marked as missing rather than paraphrased.
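The first two behaviors can be sketched as a context builder. This is a minimal sketch under stated assumptions: the `Turn` type, the `is_primary_intent` flag (which would be set by some upstream intent classifier), and the bracketed labels are all hypothetical, not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str                        # "user" or "assistant"
    text: str
    is_primary_intent: bool = False  # assumed to come from an intent classifier


def build_context(turns: list[Turn], summary: str, keep_recent: int = 4) -> list[str]:
    """Assemble context: pinned intents, then the rolling summary, then recent turns."""
    split = max(len(turns) - keep_recent, 0)
    older, recent = turns[:split], turns[split:]

    context = []
    # Pinned user turns survive verbatim regardless of age.
    for turn in older:
        if turn.role == "user" and turn.is_primary_intent:
            context.append(f"[pinned user turn, verbatim] {turn.text}")
    # All other older turns are represented only by the summary.
    if older:
        context.append(f"[summary of earlier turns] {summary}")
    # The most recent K turns are never summarized.
    for turn in recent:
        context.append(f"[{turn.role}] {turn.text}")
    return context


turns = [
    Turn("user", "Does this qualify as a 'transfer' under article 28?", True),
    Turn("assistant", "Can you describe the act in more detail?"),
    Turn("user", "Sure, here are the details."),
    Turn("assistant", "Thanks, one more question."),
    Turn("user", "Go ahead."),
]
context = build_context(turns, "user asked about article 28 applicability", keep_recent=2)
print("\n".join(context))
```

Even with a lossy summary in the context, the downstream model still sees the original question word for word.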

The third point matters because no instruction at the application layer reaches the summarizer's behavior unless its prompt and its eval reflect that instruction. "Preserve key facts" and "preserve key facts including the literal phrasing of constraints" are different prompts producing different summaries.
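One way to make that difference concrete is to build the summarizer's prompt from the turn itself. The sketch below is illustrative only: the prompt wording, the modal list, and the double-quote heuristic are assumptions, and a production extractor would be far less crude.

```python
import re

BASE_PROMPT = "Preserve the key facts of each turn in one sentence."
MODALS = {"must", "may", "shall", "should", "can", "cannot"}  # illustrative list


def protected_phrases(turn: str) -> list[str]:
    """Double-quoted phrases plus modal verbs found in the turn."""
    quoted = re.findall(r'"([^"]+)"', turn)
    modals = [w for w in re.findall(r"[a-z]+", turn.lower()) if w in MODALS]
    return quoted + modals


def summarizer_prompt(turn: str) -> str:
    """Extend the base instruction with the phrases that must survive unchanged."""
    phrases = protected_phrases(turn)
    if not phrases:
        return BASE_PROMPT
    listed = ", ".join(f'"{p}"' for p in phrases)
    return (
        f"{BASE_PROMPT} Carry these phrases through unchanged, "
        f"or mark them as missing rather than paraphrasing: {listed}."
    )


turn = 'Does this qualify as a "transfer" under article 28?'
prompt = summarizer_prompt(turn)
print(prompt)
```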

Evaluate the summarizer the way you evaluate the model

The summarizer is not a utility function. It is a model in the critical path of every conversational turn after the sliding window fills. It has its own failure modes, its own drift surface, and its own contract with the application. Most teams have an eval suite for their primary model and no eval suite for the summarizer at all. That asymmetry is how the paraphrase-away-the-question bug ships to production undetected.

A useful eval is constructed from pairs of inputs and answers where the answer changes under common paraphrases. Take a real user question with a load-bearing verb or qualifier, write the answer that follows from the exact phrasing, write a second answer that follows from a topical paraphrase, and measure how often the summarizer produces a representation that the downstream model uses to generate the wrong one. A summarizer that consistently maps "qualify as a transfer" into "applicability of article 28" is failing this eval by construction; if the eval doesn't exist, the failure is invisible.
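A paired eval of this kind can be sketched in a few lines. Everything here is a stand-in: `summarize` and `answer` are callables you would wire to your real summarizer and downstream model, the answers are placeholder strings, and exact string matching substitutes for whatever grader you actually use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PairedCase:
    question: str           # the user's exact wording
    literal_answer: str     # follows from the exact phrasing
    paraphrase_answer: str  # follows from a topical paraphrase


def wrong_answer_rate(
    cases: list[PairedCase],
    summarize: Callable[[str], str],
    answer: Callable[[str], str],
) -> float:
    """Fraction of cases where the downstream answer matches the paraphrase answer."""
    if not cases:
        return 0.0
    wrong = sum(
        1 for case in cases
        if answer(summarize(case.question)) == case.paraphrase_answer
    )
    return wrong / len(cases)


# Stubs standing in for real models.
def lossy_summarizer(question: str) -> str:
    return "user asked about article 28 applicability"


def verbatim_summarizer(question: str) -> str:
    return question


def stub_model(context: str) -> str:
    return "ANSWER_ABOUT_TRANSFER" if "transfer" in context else "ANSWER_ABOUT_APPLICABILITY"


cases = [
    PairedCase(
        question="Does this qualify as a 'transfer' under article 28?",
        literal_answer="ANSWER_ABOUT_TRANSFER",
        paraphrase_answer="ANSWER_ABOUT_APPLICABILITY",
    )
]
print(wrong_answer_rate(cases, lossy_summarizer, stub_model))
print(wrong_answer_rate(cases, verbatim_summarizer, stub_model))
```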

The second metric is entity coverage. For every named entity, qualifier, and quoted phrase in the original transcript, what fraction survives in the summary? Faithfulness work in abstractive summarization has shown that entity coverage correlates strongly with downstream task accuracy on questions whose answer depends on the entity. The summarizer's prompt and the summarizer's eval should both reward entity coverage explicitly.
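As a rough illustration, entity coverage can be approximated with a few regular expressions. The extraction rules below (double-quoted phrases, numbers, mid-sentence capitalized words) and the substring match are crude assumptions; a real pipeline would likely use an NER model and stricter matching. "Acme" is a made-up entity for the example.

```python
import re


def protected_terms(text: str) -> set[str]:
    """Crude extraction: quoted phrases, numbers, and mid-sentence capitalized words."""
    quoted = re.findall(r'"([^"]+)"', text)
    numbers = re.findall(r"\b\d+(?:\.\d+)?\b", text)
    # Require a lowercase letter or punctuation before the space so that
    # sentence-initial words are not mistaken for named entities.
    capitalized = re.findall(r"(?<=[a-z,;:] )[A-Z][A-Za-z]+", text)
    return {term.lower() for term in quoted + numbers + capitalized}


def entity_coverage(transcript: str, summary: str) -> float:
    """Fraction of protected terms from the transcript that appear in the summary."""
    terms = protected_terms(transcript)
    if not terms:
        return 1.0
    summary_lower = summary.lower()
    kept = sum(1 for term in terms if term in summary_lower)  # substring match
    return kept / len(terms)


transcript = 'Can we send the file to Acme under article 28 as a "transfer"?'
lossy = "user asked about article 28 applicability"
faithful = 'user asked whether sending the file to Acme is a "transfer" under article 28'

print(entity_coverage(transcript, lossy))
print(entity_coverage(transcript, faithful))
```

Here the lossy summary keeps only the number, so it scores one term out of three, while the faithful summary keeps all three.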

The third metric is drift across passes. Progressive summarization — summarizing the summary of the summary as the conversation lengthens — accumulates paraphrase error in a way that single-shot summarization does not. Running a fresh summary against the original transcript at each pass costs more tokens; pretending that doesn't matter costs more incidents.
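The difference between the two strategies can be shown with a deliberately artificial summarizer. In this toy sketch, `toy_summarize` applies one paraphrase step per pass from a hard-coded rewrite table; it does not compress anything and exists only to show how error accumulates when summaries are fed back in. All names are hypothetical.

```python
from typing import Callable, Optional

Summarizer = Callable[[str], str]


def progressive_summaries(chunks: list[str], summarize: Summarizer) -> list[str]:
    """Each pass summarizes the previous summary plus the new chunk."""
    summaries, running = [], ""
    for chunk in chunks:
        running = summarize((running + " " + chunk).strip())
        summaries.append(running)
    return summaries


def fresh_summaries(chunks: list[str], summarize: Summarizer) -> list[str]:
    """Each pass summarizes the full original transcript so far."""
    return [summarize(" ".join(chunks[: i + 1])) for i in range(len(chunks))]


def first_pass_losing(term: str, summaries: list[str]) -> Optional[int]:
    """Index of the first pass whose summary no longer contains the term."""
    for index, summary in enumerate(summaries):
        if term not in summary:
            return index
    return None


# Toy paraphrase chain: one rewrite step is applied per pass.
REWRITES = [
    ("qualify as a 'transfer'", "raise a transfer question"),
    ("raise a transfer question", "raise an applicability question"),
]


def toy_summarize(text: str) -> str:
    for old, new in REWRITES:
        if old in text:
            return text.replace(old, new)
    return text


chunks = ["Does this qualify as a 'transfer' under article 28?", "Here is more detail."]
print(first_pass_losing("transfer", progressive_summaries(chunks, toy_summarize)))
print(first_pass_losing("transfer", fresh_summaries(chunks, toy_summarize)))
```

With this toy, the progressive chain loses "transfer" on the second pass, while the fresh summaries never do, because each one is only a single paraphrase step away from the original transcript.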

The summarizer is a product surface, not a token-budget knob

The architectural realization that closes this category of bug is that the summarizer is a model whose behavior the user feels, and it should be governed the way you govern the primary model. That means a versioned prompt under code review, a regression suite that fails on entity loss and constraint paraphrase, a rollout process for prompt or model changes, and an owner whose job description includes the question "is the summarizer making the right tradeoffs for our hardest user cohort?"

It also means recognizing that the choice of compression strategy is a product decision. Pure verbatim retention of the user's primary intent costs tokens forever; pure summarization costs the user's exact question; structured extraction into a memory object — primary intent here, constraints here, named entities here — costs implementation complexity but lets the application control which dimensions survive. None of these are free, and none of them are default. The default is whatever the summarizer's prompt happens to reward, which on most teams is "be concise and sound natural" — exactly the prompt that paraphrases the constraint away.
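The structured-extraction option might look something like the following. This is a sketch of one possible shape, not a prescribed schema: the `ConversationMemory` class, its field names, and the rendered labels are all assumptions. The idea is that only `rolling_summary` is ever handed to the summarizer for rewriting; the other fields are set once by the application.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationMemory:
    primary_intent: str                                  # user's question, verbatim
    constraints: list[str] = field(default_factory=list)
    named_entities: list[str] = field(default_factory=list)
    rolling_summary: str = ""                            # the only negotiable field

    def render(self) -> str:
        """Serialize the memory object for inclusion in the model's context."""
        lines = [f"Primary intent (verbatim): {self.primary_intent}"]
        if self.constraints:
            lines.append("Constraints: " + "; ".join(self.constraints))
        if self.named_entities:
            lines.append("Named entities: " + ", ".join(self.named_entities))
        if self.rolling_summary:
            lines.append("Summary of earlier turns: " + self.rolling_summary)
        return "\n".join(lines)


memory = ConversationMemory(
    primary_intent="Does this qualify as a 'transfer' under article 28?",
    constraints=["answer must address the word 'transfer'"],
    named_entities=["article 28"],
)
memory.rolling_summary = "user asked about article 28 applicability"
rendered = memory.render()
print(rendered)
```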

The user who asked whether their act qualified as a "transfer" never authorized the substitution into "applicability." A summarizer that performs that substitution silently is not optimizing for the user; it is optimizing for its own loss function and forcing the user's question into its preferred shape. The fix is not a smarter summarizer. The fix is to stop asking the summarizer to make a decision that belongs to the application: which words in this conversation are the user's, and which words are negotiable.

Keep the user's words. Summarize around them. Evaluate the summarizer like a model in the critical path because that is what it is. The forty-first turn will answer the question the user actually asked.
