
Repeat-Question Detection: The Session-Level Blind Spot Your Per-Turn Eval Cannot See

11 min read
Tian Pan
Software Engineer

A user opens your chat, asks a question, and gets back a response your eval suite would score 4.6 out of 5. Then they ask the same question with different words. Same answer. Same score. They try once more, this time with the kind of hedging language people use when they suspect the machine isn't listening — "what I'm actually trying to do is…" — and then they close the tab. From the model's perspective, three clean Q&A turns. From the dashboard's perspective, an engaged session. From the user's perspective, a product that failed them three times in a row and won't be opened again.

This is the failure mode per-turn evaluation cannot see. Each individual turn looked correct in isolation. The judge gave a thumbs up. The hallucination detector stayed quiet. The relevance score was high. And yet the conversation, as a whole, did not resolve anything — and that's the unit the user was actually evaluating you on.

The mistake is treating each user turn as an independent draw from an evaluation distribution, when in practice consecutive turns from the same user in the same session are deeply correlated. If turn N+1 is semantically equivalent to turn N, you are not seeing a new query — you are seeing the same query, surfaced again because the previous response did not land. Counting that as two engaged turns instead of one unresolved one is how the gap between "model is performing well" and "users are abandoning us" stays invisible for a quarter.

The Shape of a Repeat-Question Session

The pattern is structural enough to name. Same user. Same session. Two or more user turns that map to the same underlying intent. Increasing markers of frustration in the rephrasing — hedges, escalations, explicit references to the prior turn ("like I said before," "no, that's not what I asked"). And a terminal action that is not resolution: closing the tab, asking a different question entirely, opening a support ticket, or — worst case for your retention numbers — never coming back.
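If you want that shape as a concrete record per session, it is small enough to write down. The field names and enum values below are an illustrative schema, not a standard one:

```python
from dataclasses import dataclass
from enum import Enum


class TerminalAction(Enum):
    RESOLVED = "resolved"
    ABANDONED = "abandoned"        # closed the tab mid-thread
    TOPIC_SWITCH = "topic_switch"  # gave up, asked something else entirely
    ESCALATED = "escalated"        # opened a ticket, asked for a human
    CHURNED = "churned"            # never came back


@dataclass
class RepeatQuestionSignal:
    """One flagged session. Hypothetical schema, not a standard one."""
    session_id: str
    user_id: str
    same_intent_turn_indices: list[int]  # user turns mapping to one intent
    frustration_scores: list[float]      # per-turn, often rising in these sessions
    terminal_action: TerminalAction

    @property
    def is_repeat_failure(self) -> bool:
        # Two or more same-intent turns that did not end in resolution.
        return (
            len(self.same_intent_turn_indices) >= 2
            and self.terminal_action is not TerminalAction.RESOLVED
        )
```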

Industry data is starting to put numbers on this. Researchers studying multi-turn LLM behavior have found that models "get lost" in conversations once specifications are spread across turns, and that high rephrase rates (more than two retries) are a reliable indicator of dialogue breakdown. Studies of conversational support systems treat repeated questions as a primary frustration signal, alongside abrupt termination, negative sentiment, and channel-switching to a human agent. The Amazon Alexa team specifically built contextual rephrase detection because the friction signal was strong enough to justify a dedicated model.

What makes this failure mode so easy to miss is that none of those signals show up in single-turn quality scoring. A turn-level LLM-as-judge graded against your rubric has no way to know that the user already asked this same question two minutes ago and got a similar-but-unhelpful answer. The judge sees one input, one output, scores it, moves on. The session arc is invisible to it by construction.

Why the Per-Turn Eval Was Lying By Omission

The per-turn eval is not wrong, exactly. It is answering the question it was asked: was this response, evaluated against this prompt, of high quality? The problem is that the question it was asked is not the question the business needs answered. The business question is: did this conversation, as a whole, resolve the user's actual need?

These are not the same metric. A response can be technically correct, factually accurate, and grammatically polished, and still fail the conversation because it answered the literal phrasing of the question rather than the underlying need behind it. This is why support-industry frameworks have long distinguished between deflection rate (the bot closed the session), containment rate (the user didn't escalate to a human), and resolution rate (the user's actual problem went away). The first two are easy to measure and easy to game. The third is the one that correlates with retention.
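To make the gap between those three numbers concrete, here is a minimal sketch of how they diverge over the same session log. The `SessionOutcome` fields are hypothetical stand-ins for whatever your logging actually records:

```python
from dataclasses import dataclass


@dataclass
class SessionOutcome:
    """Hypothetical per-session log record; field names are stand-ins."""
    bot_closed_session: bool         # bot marked the thread done
    escalated_to_human: bool         # user reached a human agent
    problem_actually_resolved: bool  # ground truth, e.g. from a follow-up survey


def support_rates(sessions: list[SessionOutcome]) -> dict[str, float]:
    if not sessions:
        raise ValueError("no sessions to score")
    n = len(sessions)
    return {
        # Deflection: the bot closed the session, outcome ignored.
        "deflection": sum(s.bot_closed_session for s in sessions) / n,
        # Containment: the user never escalated, outcome ignored.
        "containment": sum(not s.escalated_to_human for s in sessions) / n,
        # Resolution: the user's actual problem went away.
        "resolution": sum(s.problem_actually_resolved for s in sessions) / n,
    }
```

Repeat-question sessions are exactly the case where the first two rates stay high while the third stays flat.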

Per-turn quality is closer to deflection than to resolution. It tells you whether each individual exchange was well-formed, not whether the cumulative effect of those exchanges was useful. A session that contains five high-quality but mutually redundant responses to the same restated question scores well on per-turn quality and zero on resolution.

The corollary that's harder to swallow: a portion of your "high quality" responses are the model's politeness layer dressing up the same non-answer. The judge rewards fluency. The user wanted the thing they asked for.

Building the Detection Pipeline

Detecting the repeat-question shape is mechanically simple, yet almost no one ships it. The minimum viable detector has four moving parts.

Intent clustering across turns. Embed each user turn in a session with a sentence-embedding model — Sentence-BERT-class or a hosted equivalent — and compute pairwise cosine similarity within the rolling window of the last N turns. Flag the session when two or more user turns land in the same intent cluster above a tunable similarity threshold (0.75 to 0.85 is a reasonable starting band for sentence embeddings, and you'll want to tune it against a small hand-labeled sample of your own traffic). Open-intent-discovery techniques have been doing this for years for ticket routing; the novelty is running it within a single session as a quality signal rather than across sessions for taxonomy building.
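A minimal sketch of that detector, assuming the sentence-transformers library. The model name, window size, and threshold are starting-point assumptions to tune against your own traffic:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Model choice, threshold, and window are assumptions -- tune all three
# against a hand-labeled sample of your own sessions.
model = SentenceTransformer("all-MiniLM-L6-v2")
SIMILARITY_THRESHOLD = 0.80  # starting point inside the 0.75-0.85 band
WINDOW = 6                   # rolling window of recent user turns


def repeated_intent_pairs(user_turns: list[str]) -> list[tuple[int, int]]:
    """Return index pairs of user turns whose similarity exceeds the threshold."""
    # normalize_embeddings=True makes the dot product a cosine similarity.
    emb = model.encode(user_turns, normalize_embeddings=True)
    pairs = []
    for j in range(len(user_turns)):
        for i in range(max(0, j - WINDOW), j):
            if float(np.dot(emb[i], emb[j])) >= SIMILARITY_THRESHOLD:
                pairs.append((i, j))
    return pairs
```

A session is flagged the moment `repeated_intent_pairs` returns anything. The threshold matters: too low and paraphrase-heavy users get flagged constantly, too high and you only catch near-verbatim repeats.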

Frustration-signal classification. A small classifier — or even a cheap LLM call with a tight rubric — that scores each user turn on a frustration scale. It picks up hedges, escalation language, repetition markers ("again," "still," "as I said"), and explicit dissatisfaction ("that didn't answer my question"). Recent research on user frustration in task-oriented dialog has shown that even with class imbalance, transformer-based frustration classifiers reach usable F1 scores once you have a few thousand labeled turns. You don't need state-of-the-art accuracy here — you need a signal that's correlated with eventual non-resolution.
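Before you train anything, a marker-based heuristic gets you a correlated baseline. The marker list and weights below are illustrative guesses, not tuned values:

```python
import re

# Illustrative markers and weights -- extend from your own transcripts.
FRUSTRATION_MARKERS: list[tuple[str, float]] = [
    (r"\bagain\b", 0.3),
    (r"\bstill\b", 0.3),
    (r"\bas i (said|asked|mentioned)\b", 0.6),
    (r"\bthat('s| is) not what i asked\b", 0.9),
    (r"\bdidn'?t answer\b", 0.8),
    (r"\bwhat i('m| am) actually trying to\b", 0.5),  # hedging rephrase
]


def frustration_score(turn: str) -> float:
    """Crude per-turn frustration score in [0, 1]."""
    text = turn.lower()
    score = sum(w for pattern, w in FRUSTRATION_MARKERS if re.search(pattern, text))
    return min(score, 1.0)
```

Once you have those few thousand labeled turns, swap this out for a trained classifier; the heuristic's job is only to bootstrap labels and alerts.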
