Re-Ask Rate: The Failure Signal Your Eval Pipeline Never Extracts

· 10 min read
Tian Pan
Software Engineer

Open any production chat transcript long enough and you will find a user who asks the same question three times. The phrasing changes a little each turn — pronouns swap to nouns, a clarifier gets bolted on, the polite hedge falls away by the third try — but the underlying request is identical. They are not asking three questions. They are asking the same question, and the agent is failing to answer it, and the user is hoping that this time the words will land differently.

The transcript-level signal here is so loud it is almost obscene. The user has told you, with their own keystrokes, that the previous response did not help. They did not need to fill out a survey. They did not need to leave a thumbs-down. They told you by typing the question again. And in most production AI stacks, this signal is silently discarded twice: once by an eval pipeline that scores each turn in isolation, and once by a satisfaction survey that only fires at session end. By that point, the user who re-asked three times has usually churned and will never grade anything.

This is the cleanest implicit-failure signal in conversational AI, and almost nobody is harvesting it.

Turn-Level Evaluation Hides Conversational Failure

The dominant eval pattern for chat-style products is turn-level scoring: grade each model response on its own merits — faithfulness, relevance, factuality, safety — and aggregate the per-turn pass rate into a system-level number. The aggregated number is the one that goes in the deck. It is the number that decides whether the next model gets shipped.

The problem with turn-level eval is that it grades the response, not the outcome. A response can be locally fluent, locally on-topic, locally faithful to retrieved context, and still completely miss what the user actually wanted. The turn-level rubric has no slot for "this answer caused the user to repeat themselves." It has no slot for "this answer was technically correct but addressed the wrong reading of the question." The user knows. The rubric does not.

The chatbot evaluation literature has been catching up to this. Recent work explicitly distinguishes turn-level metrics from session-level ones, noting that per-turn relevance can look high while the user's underlying goal goes unmet across the whole conversation. The information retrieval community figured this out a decade earlier with web search: a click is a noisy signal, but a query reformulation immediately after a click is an almost-certain sign that the user did not find what they were looking for. The same dynamic applies inside a chat session. The user's next message is the most honest review the previous response will ever get.
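To make the idea concrete, here is a minimal sketch of how a re-ask rate could be pulled from logged user turns. It is an illustration under stated assumptions, not a method from any particular eval framework: the stopword list, the `REASK_THRESHOLD` value, and the function names are all hypothetical, and token-overlap (Jaccard) similarity stands in for whatever similarity measure (embeddings, an LLM judge) a real pipeline would use.

```python
from __future__ import annotations

import re

# Hypothetical stopword list and threshold; both would need tuning
# against transcripts that humans have labelled as re-asks.
STOPWORDS = {"a", "an", "the", "i", "my", "to", "do", "how", "can", "please",
             "is", "it", "of", "you", "me", "for", "in", "on", "could", "tell"}
REASK_THRESHOLD = 0.5


def content_tokens(text: str) -> set[str]:
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-set overlap in [0, 1]; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def reask_flags(user_turns: list[str], threshold: float = REASK_THRESHOLD) -> list[bool]:
    """flags[i] is True when user turn i+1 looks like a re-ask of user turn i."""
    toks = [content_tokens(t) for t in user_turns]
    return [jaccard(toks[i], toks[i + 1]) >= threshold for i in range(len(toks) - 1)]


def reask_rate(sessions: list[list[str]]) -> float:
    """Fraction of follow-up user turns, across all sessions, flagged as re-asks."""
    flags = [f for session in sessions for f in reask_flags(session)]
    return sum(flags) / len(flags) if flags else 0.0


session = [
    "How do I reset my password?",
    "Please, how can I reset the password for my account?",
    "reset account password",
    "Thanks, what are your opening hours?",
]
print(reask_flags(session))  # [True, True, False]
```

Because the flag is computed from the user's next message rather than from the response itself, it can sit alongside existing turn-level scores without changing them. A lexical measure like this will miss paraphrases that share no words, so treat it as a lower bound on the true re-ask rate.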

CSAT Surveys Sample the Wrong Population

The other place teams expect this signal to surface is end-of-session CSAT. It does not surface there, and the reason is sampling.

End-of-session surveys are answered by the population of users who reached the end of the session. Users who churned after the first bad answer are not in that population. Users who escalated to a human midway are not in that population. The people most qualified to tell you the bot is broken, the ones who left because it was broken, are systematically excluded from the dataset you are using to measure brokenness. The surveys you collect are dominated by users for whom the agent worked well enough to finish, lightly contaminated by users patient enough to vent in a survey, and missing almost all of the actual failure cases.

This is survivorship bias dressed up as voice-of-customer data. The aggregate CSAT looks fine because it is the average of the people the agent did not lose. Meanwhile, the re-ask behavior of the lost users is sitting in your logs unread.
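A small worked example shows how large the gap can get. The numbers below are invented for illustration only: 1,000 sessions, 30% of which fail, with half of the successful users answering the survey but only 5% of the failed ones staying long enough to do so. The sketch also assumes every surveyed successful user rates the session positively and every surveyed failed user rates it negatively.

```python
def csat_gap(ok_sessions: int, failed_sessions: int,
             reach_ok: float, reach_failed: float) -> tuple[float, float]:
    """Return (observed CSAT, actual success rate).

    reach_ok / reach_failed are the (hypothetical) fractions of successful
    and failed sessions that end with a completed survey.
    """
    surveyed_ok = ok_sessions * reach_ok
    surveyed_failed = failed_sessions * reach_failed
    if surveyed_ok + surveyed_failed == 0:
        raise ValueError("no surveys were completed")
    observed = surveyed_ok / (surveyed_ok + surveyed_failed)
    actual = ok_sessions / (ok_sessions + failed_sessions)
    return observed, actual


observed, actual = csat_gap(ok_sessions=700, failed_sessions=300,
                            reach_ok=0.5, reach_failed=0.05)
print(f"observed CSAT: {observed:.1%}, actual success rate: {actual:.1%}")
# observed CSAT: 95.9%, actual success rate: 70.0%
```

Under these assumptions the survey reports roughly 96% satisfaction for a system that fails 30% of its sessions. The exact figures depend entirely on the made-up response rates, but the direction of the bias does not: any gap between the two reach rates inflates the observed score.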

Re-Ask Has Decades of Evidence Behind It

Treating user repetition as a failure signal is not a new idea. It just has not crossed from web search and traditional interactive voice response (IVR) systems into LLM chatbot evals.

In web search, query reformulation is one of the most studied implicit feedback signals there is. Researchers have used it to predict user satisfaction more accurately than click-based metrics alone, on the grounds that a search where the user reformulates is almost by definition a search the original query did not satisfy. The reformulation can be a generalization, a specialization, a spelling correction, or a complete pivot — each pattern carries its own diagnostic weight. Old IVR systems used "did the caller press 0 to reach an agent" as the same kind of signal: an escape valve that translates user frustration into a measurable event.
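The reformulation patterns named above can be roughly separated with simple heuristics. The sketch below is one hypothetical way to do it, not a classifier taken from the search literature: the label names, the extra `repeat` and `rewrite` buckets, and the 0.85 character-similarity cutoff are assumptions made for illustration. It uses only Python's standard `re` and `difflib` modules.

```python
import re
from difflib import SequenceMatcher

SPELLING_RATIO = 0.85  # hypothetical cutoff for "almost the same string"


def tokens(query: str) -> set:
    """Lowercased alphanumeric tokens of a query."""
    return set(re.findall(r"[a-z0-9]+", query.lower()))


def classify_reformulation(prev: str, curr: str) -> str:
    """Label how `curr` relates to the immediately preceding query `prev`."""
    p, c = tokens(prev), tokens(curr)
    if p == c:
        return "repeat"               # same terms, possibly reordered
    if p < c:
        return "specialization"       # terms were only added
    if c < p:
        return "generalization"       # terms were only removed
    if SequenceMatcher(None, prev.lower(), curr.lower()).ratio() >= SPELLING_RATIO:
        return "spelling_correction"  # near-identical characters, different tokens
    if p & c:
        return "rewrite"              # some shared terms, substantially rephrased
    return "pivot"                    # no shared terms at all


print(classify_reformulation("python csv", "python csv unicode error"))
print(classify_reformulation("cheap flights paris june", "flights paris"))
print(classify_reformulation("recieve email settings", "receive email settings"))
print(classify_reformulation("reset password", "opening hours"))
```

In a chat setting, the same function could be applied to consecutive user turns so that each re-ask carries a label, letting a team count specializations separately from pivots instead of collapsing them into one number.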
