The Summarization Validity Problem: How to Know Your AI Compressed Away What Mattered
Summarization fails silently. Your system doesn't crash, logs don't flag an error, and the generated text looks coherent—but somewhere in the compression, the one fact that mattered for the downstream task got dropped. The RAG pipeline returns a confident answer. The multi-hop reasoner reaches a conclusion. The customer service agent gives advice. All of it grounded in a summary that no longer contains the original constraint, exception, or data point the answer depended on.
This is the summarization validity problem: the gap between a summary that is consistent with its source and a summary that preserves what the downstream task needs. Most teams don't instrument for it. They ship pipelines that validate summaries exist, not summaries that are complete.
Why Summarization Appears Everywhere in Production AI
Before discussing how to measure the problem, it's worth mapping where it lives. Summarization isn't just a "summarize this document" feature. It's load-bearing infrastructure in most production AI systems.
Chat history compression is the most common instance. As conversations grow beyond context limits, systems summarize older segments to free up tokens. At a 10:1 compression ratio, context management works beautifully until the user references a constraint they mentioned twelve turns ago—one that got compressed into irrelevance.
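A minimal sketch of the pattern, assuming a hypothetical `summarize` callable (your LLM call) and a crude character-based token estimate:

```python
# Token-budget chat compression: summarize everything except the most
# recent turns once the history outgrows the context budget.
# `summarize` is a hypothetical stand-in for an LLM summarization call.

MAX_CONTEXT_TOKENS = 8_000   # illustrative budget
KEEP_RECENT_TURNS = 6        # recent turns kept verbatim

def estimate_tokens(text: str) -> int:
    return len(text) // 4    # crude heuristic; use a real tokenizer in production

def compress_history(turns: list[str], summarize) -> list[str]:
    if estimate_tokens("\n".join(turns)) <= MAX_CONTEXT_TOKENS:
        return turns
    old, recent = turns[:-KEEP_RECENT_TURNS], turns[-KEEP_RECENT_TURNS:]
    # The 10:1 compression happens here: a constraint mentioned twelve
    # turns ago survives only if `summarize` decides to keep it.
    summary = summarize("\n".join(old))
    return ["[summary of earlier conversation] " + summary] + recent
```

Nothing in this loop checks whether the summary retained the constraints later turns depend on; that gap is the subject of the rest of this piece.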
RAG document digestion applies summarization to retrieved chunks before passing them to generation. Grounding in retrieved text should lower the pipeline's hallucination rate, but retrieval precision and summarization fidelity compound: if retrieval finds the right document and summarization drops the relevant clause, the grounding advantage evaporates.
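The compounding is multiplicative. With illustrative numbers (assumptions, not measurements from any benchmark):

```python
# Illustrative only: effective grounding compounds across pipeline stages.
p_retrieval = 0.90   # assumed: retriever surfaces the right document
p_retention = 0.80   # assumed: summarizer preserves the relevant clause
p_grounded  = p_retrieval * p_retention
print(f"effective grounding: {p_grounded:.2f}")  # 0.72
```

A 90%-precision retriever feeding an 80%-retention summarizer grounds only 72% of answers, worse than either stage suggests in isolation.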
Long-context distillation handles input documents that exceed even extended context windows. Multi-document tasks, long PDFs, and corpus-scale retrieval all require pre-compression. Even models extended to 128K tokens still apply context management, and research shows that long-context extension measurably degrades short-text performance unless explicit distillation objectives are used.
Multi-step reasoning pipelines apply summarization at intermediate hops. Each step summarizes its findings before passing them forward. The failure compounds: information lost at hop two is unrecoverable at hop three. Systems that preserve full intermediate evidence detect errors at the boundary; systems that summarize at each step produce conclusions that are internally consistent but factually degraded.
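A sketch of the two designs, with `llm` and `summarize` as hypothetical stand-ins for model calls:

```python
# Lossy design: each hop forwards only its own compression of the evidence.
def reason_with_summaries(question: str, hops: list[str], llm, summarize) -> str:
    state = ""
    for doc in hops:
        # Anything dropped here is unrecoverable at later hops.
        state = summarize(state + "\n" + doc)
    return llm(question=question, context=state)

# Lossless design: forward the accumulated evidence, so errors remain
# detectable at the final boundary, at the cost of a larger context.
def reason_with_evidence(question: str, hops: list[str], llm) -> str:
    return llm(question=question, context="\n\n".join(hops))
```

The tradeoff is tokens for auditability: the lossless version costs more context but keeps every intermediate claim checkable against its evidence.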
The Metric That Misses the Point
The standard response to summarization quality concerns is ROUGE. It's fast, it's interpretable, and it has almost nothing to say about whether a summary will serve a downstream task.
ROUGE measures n-gram overlap between generated and reference summaries. A summary that hits 0.65 ROUGE-1 is considered acceptable by the benchmarks that built those thresholds on CNN/DailyMail news highlights. But the researchers who designed those benchmarks warn that tracking progress with ROUGE alone is "questionable"—they just don't have a better standard that runs cheaply enough to replace it.
BERTScore does better. Contextual embeddings catch semantic reformulations that n-gram overlap misses, which helps when an abstractive summary paraphrases content. Average BERTScore on standard benchmarks runs around 0.75 versus ROUGE's 0.65. Still, BERTScore measures whether the summary resembles its reference (or source) in semantic space. It doesn't measure whether the summary can answer the specific questions a downstream task will ask.
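Both metrics are cheap enough to run side by side, which is worth doing just to see where they disagree. A sketch using the `rouge-score` and `bert-score` packages on a paraphrase (the example strings are mine, not from any benchmark):

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The patient must avoid ibuprofen due to a documented allergy."
candidate = "Ibuprofen is contraindicated for this patient because of an allergy."

# ROUGE-1: unigram overlap. The paraphrase shares few exact tokens,
# so the score is low even though the meaning is preserved.
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
rouge1 = scorer.score(reference, candidate)["rouge1"].fmeasure

# BERTScore: contextual embeddings catch the reformulation.
_, _, f1 = bert_score([candidate], [reference], lang="en")

print(f"ROUGE-1 F1:   {rouge1:.2f}")     # low: few shared unigrams
print(f"BERTScore F1: {f1.item():.2f}")  # high: semantics preserved
```

Neither number, high or low, says whether the summary can still answer the questions a downstream task will ask of it.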
SummaC, an NLI-based method that splits source and summary into sentences and scores entailment over the resulting pairs, achieves 74% balanced accuracy on inconsistency detection—state of the art for automated faithfulness checking. But it scores only 58.5% on implicit hallucinations: claims that don't directly contradict the source but aren't supported by it. Omission is harder to detect than contradiction.
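The core mechanism is straightforward to reproduce in simplified form. This sketch scores each summary sentence by its best entailment against any source sentence using `roberta-large-mnli` from Hugging Face; the published SummaC adds granularity options and a learned aggregation, so treat this as the zero-shot idea rather than the real system:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # labels: 0=contradiction, 1=neutral, 2=entailment
tok = AutoTokenizer.from_pretrained(MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def entailment(premise: str, hypothesis: str) -> float:
    inputs = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 2].item()

def faithfulness(source_sents: list[str], summary_sents: list[str]) -> float:
    # Each summary sentence is scored by the source sentence that best
    # entails it; the weakest summary sentence sets the document score.
    per_sentence = [max(entailment(src, s) for src in source_sents)
                    for s in summary_sents]
    return min(per_sentence)
```

Note what this cannot flag: a summary that silently omits the follow-up plan scores perfectly, because every sentence it does contain is entailed. Entailment checking catches contradiction, not omission.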
The common thread: these metrics are designed to measure whether the summary is faithful to the source. None of them directly measure whether the summary preserves information required by a specific downstream task.
Scale Is the Hidden Variable
Benchmarking confidence built on short documents is one of the most reliable sources of production surprises in AI engineering. A model achieving a 0.7% hallucination rate on standard news articles—essentially negligible—reaches 3.3% on enterprise-length documents. That's a 4.7x degradation at scale, verified across diverse domains.
The mechanism behind this isn't mysterious. A 2024 study across six summarization datasets documented a "U-shaped" faithfulness curve: models faithfully handle document beginnings and ends, but systematically neglect mid-document content. Faithfulness at document start runs around 90%. Mid-document faithfulness drops to 65%. This isn't a model-specific bug—it's a structural property of how attention behaves across long sequences.
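You can probe the curve on your own stack by planting a known fact at varying depths and checking whether it survives compression. A sketch, with `summarize` again a hypothetical stand-in and substring containment as a deliberately naive retention check:

```python
def plant_fact(paragraphs: list[str], fact: str, depth: float) -> str:
    """Insert `fact` at a relative position in the document (0.0 = start)."""
    i = int(depth * len(paragraphs))
    return "\n\n".join(paragraphs[:i] + [fact] + paragraphs[i:])

def retention_by_depth(paragraphs, fact, summarize,
                       depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict[float, bool]:
    # Containment is a blunt check; the QA-based verification described
    # later in this piece is the robust version.
    return {d: fact.lower() in summarize(plant_fact(paragraphs, fact, d)).lower()
            for d in depths}
```

If the U-shaped curve holds for your model, retention at depths 0.25 through 0.75 will lag the endpoints.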
If your validation dataset uses 200-500 token documents, you're testing the top and tail of the faithfulness curve. Your production data is usually longer. The coverage gap between lab evaluation and production performance isn't a known unknown that got deferred—it's an unknown unknown that never appeared in evaluation at all.
The practical implication is straightforward: regression test suites must include documents at production scale. Short, medium (1K-5K tokens), long (8K-32K tokens), and multi-document inputs belong in separate evaluation buckets, because hallucination rates are non-linear across them.
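In an evaluation harness, that means reporting rates per bucket rather than one blended number. A minimal sketch; the medium and long boundaries follow the ranges above, while the short and multi-document cutoffs are my assumptions:

```python
from collections import defaultdict

# (lower, upper) bounds in tokens per bucket.
BUCKETS = {
    "short":     (0, 1_000),
    "medium":    (1_000, 5_000),
    "long":      (8_000, 32_000),
    "multi-doc": (32_000, float("inf")),
}

def bucket_of(n_tokens: int) -> str | None:
    for name, (lo, hi) in BUCKETS.items():
        if lo <= n_tokens < hi:
            return name
    return None  # falls between buckets

def per_bucket_rates(records: list[tuple[int, bool]]) -> dict[str, float]:
    """records: (document token count, hallucination detected) pairs."""
    totals, failures = defaultdict(int), defaultdict(int)
    for n_tokens, hallucinated in records:
        bucket = bucket_of(n_tokens)
        if bucket is None:
            continue
        totals[bucket] += 1
        failures[bucket] += hallucinated
    return {b: failures[b] / totals[b] for b in totals}
```

A single blended rate can hide a long-bucket regression behind a majority of short documents; per-bucket reporting makes the non-linearity visible.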
Rethinking Completeness as a Contract
The shift from "validate summaries exist" to "validate summaries preserve task-relevant content" requires changing what you specify, not just what you measure.
The useful abstraction is a completeness contract: a formal specification of what must survive compression. A contract makes information loss visible and testable rather than silent and emergent.
In the medical domain, a completeness contract for discharge note summarization would include: medication name and dosage, contraindications, follow-up instructions, active diagnoses. A study of 450 clinical discharge notes found 191 hallucinated sentences, and 44% of those hallucinations were major—affecting diagnosis or management. The most common failure was omission of follow-up plans in the "Plan" section, which showed a 21% major-hallucination rate. A completeness contract that explicitly lists "follow-up instructions must appear" would catch that before it reaches a patient.
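In code, a completeness contract can be as simple as a mapping from each required field to the question a downstream consumer will ask of it (field names and wording here are illustrative):

```python
# A completeness contract as data: each field that must survive
# compression pairs with the question used to verify it survived.
DISCHARGE_CONTRACT: dict[str, str] = {
    "medication_and_dosage": "What medications and dosages were prescribed?",
    "contraindications":     "What contraindications were noted?",
    "follow_up":             "What follow-up instructions were given?",
    "active_diagnoses":      "What are the patient's active diagnoses?",
}
```

The QA-based check sketched after the next paragraph consumes exactly this structure.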
The most actionable operationalization of a completeness contract is QA-based verification. Generate question-answer pairs from the source document, then measure answerability on the summary. A question the source can answer but the summary cannot marks dropped information. This maps directly to task relevance: if your downstream QA pipeline needs to answer "what dose of medication X was prescribed," that becomes a test case.
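A sketch of that check using an extractive QA model trained to handle unanswerable questions (`deepset/roberta-base-squad2` on Hugging Face); the 0.3 threshold is an illustrative starting point, not a calibrated value:

```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def dropped_fields(source: str, summary: str,
                   contract: dict[str, str], min_score: float = 0.3) -> list[str]:
    """Return contract fields the source can answer but the summary cannot."""
    dropped = []
    for field, question in contract.items():
        on_source  = qa(question=question, context=source)
        on_summary = qa(question=question, context=summary)
        # Answerable from the source but not from the summary means the
        # compression dropped task-relevant content.
        if on_source["score"] >= min_score and on_summary["score"] < min_score:
            dropped.append(field)
    return dropped
```

Run against the discharge contract above, a non-empty return is a hard failure to gate on, not a quality score to average away.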
A concrete example: a RAG completeness contract might specify "summary must support answering the three questions that were used to retrieve this document." If retrieval found a document because it matched query Q, the summary of that document must still be able to answer Q. Retrieval precision doesn't help if summarization destroys the relevant clause.
Sources
- https://www.sei.cmu.edu/blog/evaluating-llms-for-text-summarization-introduction/
- https://www.nature.com/articles/s41598-025-31075-1
- https://arxiv.org/abs/2410.23609
- https://www.vectara.com/blog/introducing-the-next-generation-of-vectaras-hallucination-leaderboard
- https://mem0.ai/blog/llm-chat-history-summarization-guide-2025
- https://arxiv.org/pdf/2410.12837
- https://arxiv.org/html/2502.07365v3
- https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00453/109470/SummaC-Re-Visiting-NLI-based-Models-for
- https://arxiv.org/html/2604.25130
- https://arxiv.org/html/2503.13657v1
- https://www.nature.com/articles/s41746-025-01670-7
- https://arxiv.org/html/2307.02570
- https://arxiv.org/html/2511.15244v1
- https://deepeval.com/
- https://toloka.ai/blog/rag-evaluation-a-technical-guide-to-measuring-retrieval-augmented-generation/
- https://devblogs.microsoft.com/semantic-kernel/managing-chat-history-for-large-language-models-llms/
