The Implicit Feedback Trap: Why Engagement Metrics Lie About AI Quality
A Canadian airline's support chatbot invented a bereavement fare policy that didn't exist. The chatbot was confident, well-formatted, and polite. Passengers believed it. A small-claims tribunal later held the airline liable for the fabricated policy. Meanwhile, the chatbot's satisfaction scores were probably fine.
This is the implicit feedback trap. The signals most teams use to measure AI quality — thumbs-up ratings, click-through rates, satisfaction scores — are not just noisy. They are systematically biased toward measuring the wrong thing. And optimizing for them makes your AI worse.
Confident Tone Is Indistinguishable from Correct
Language models produce responses with the same confident register whether they are accurate or fabricated. A hallucinated court citation reads identically to a real one. A made-up refund policy sounds like a real policy. Users have no surface-level signal to distinguish them.
This creates a fatal flaw in thumbs-up feedback: raters evaluate responses based on whether they believe the answer, not whether it is true. A user who lacks the background knowledge to spot a hallucination will upvote it. A user who receives a technically correct but surprising or hedged answer — "this is uncertain, but the most likely answer is X" — may downvote it as unhelpful.
The result is systematic: confident, fluent, wrong answers collect higher ratings than accurate but uncertain ones.
RLHF Bakes the Bias In
Reinforcement learning from human feedback (RLHF) was supposed to align AI outputs with human preferences. The paradox is that it makes hallucination problems structurally worse.
When RLHF was introduced for instruction-following, researchers found that incorporating reinforcement learning from human ratings increased fabrication compared to supervised fine-tuning alone — even as users preferred the RLHF outputs overall. The model learned that confidently stated content, regardless of accuracy, receives higher reward signals.
This isn't a bug in the RLHF implementation. It's a consequence of training against a proxy metric. Agreement between human labelers on what constitutes a "good" response is typically around 73%. The remaining 27% of disagreements encode real ambiguity: some labelers reward accuracy, some reward confidence, some reward concision. The reward model trains on the aggregate, which tends to favor responses that seem good over responses that are good, particularly in domains where evaluators lack the expertise to distinguish them.
The knowledge mismatch problem compounds this further. When labelers rate responses in domains beyond their expertise — medical, legal, technical — they reliably upvote confident-sounding fabrications and downvote hedged-but-accurate responses. RLHF doesn't know the difference. It trains toward whatever the ratings say.
Goodhart's Law, Applied
The classic formulation: when a measure becomes a target, it ceases to be a good measure. Empirical research has now quantified this effect for reinforcement learning specifically — optimizing against an imperfect proxy reward decreases performance on the true objective past a certain optimization threshold.
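A toy simulation makes the effect concrete. Assume (purely for illustration, not from any cited study) that fabricated answers read as more fluent than accurate ones, and that the proxy reward sees only fluency. Selecting the best of n candidates by that proxy then drives true accuracy down as selection pressure n grows:

```python
import random

random.seed(0)

def sample_candidate():
    # Toy assumption: fabricated answers are wrong but score higher on
    # fluency; accurate answers are hedged and score lower.
    fabricated = random.random() < 0.3
    accuracy = 0.0 if fabricated else 1.0
    fluency = random.gauss(0.8 if fabricated else 0.5, 0.1)
    return accuracy, fluency

def best_of_n(n, trials=2000):
    """Mean true accuracy of the candidate with the highest proxy score."""
    total = 0.0
    for _ in range(trials):
        candidates = [sample_candidate() for _ in range(n)]
        best = max(candidates, key=lambda c: c[1])  # optimize the proxy only
        total += best[0]
    return total / trials

for n in (1, 4, 16, 64):
    print(n, round(best_of_n(n), 3))
```

At n=1 the selected answer is accurate about 70% of the time; by n=64 the max-fluency candidate is almost always a fabrication. More optimization pressure against the proxy, strictly worse true performance.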
The pattern shows up everywhere AI uses behavioral signals as feedback:
- Search and recommendation systems optimizing for clicks surface sensational content over substantive answers. The click is easy to measure; usefulness is not.
- Code assistants achieve 30% acceptance rates while 76% of users report high hallucination rates — users accept suggestions because they look plausible, not because they are correct. A developer who can't immediately spot the bug in a generated function will accept it.
- Customer support AI receives high satisfaction ratings because it sounds confident and responds quickly, even when the resolution it provided was wrong. The user rates the interaction before discovering the error.
What reward hacking looks like in AI isn't a system deliberately gaming its metrics. It's a model that has been trained to produce the style of responses that humans rate highly, which is not the same as the substance of responses that are actually useful.
Why Engagement Metrics Make This Worse
Thumbs-up ratings are a bad signal. Implicit engagement metrics — clicks, time-on-page, session length, follow-up queries — are worse, because they compound the same bias with measurement noise.
Consider what implicit signals actually measure:
- Session length measures how long a user stayed, not whether they got what they needed. A user who got a wrong answer and searched for corrections will have a long session.
- Click-through rate measures whether a user found a result interesting enough to investigate, not whether it was correct.
- Follow-up query rate is ambiguous: it could indicate that the user needed clarification because the first answer was incomplete, or that they were curious to explore further. Both look the same in the data.
Research on recommendation systems has found that direct intent surveys substantially outperform engagement factors — including likes, shares, and comments — for predicting user retention. The correlation between engagement and actual satisfaction breaks down precisely in the cases that matter most: when the AI is confidently wrong.
The aggregate effect is that implicit feedback loops create a ratchet toward confident mediocrity. Each optimization cycle shifts the model slightly more toward the kinds of outputs that generate engagement signals, which is a proxy for the outputs that seem good rather than the outputs that are good.
Signals That Actually Correlate with Quality
The behavioral signals that track actual quality are harder to collect but meaningful:
Correction rate. When a user immediately follows an AI response with an edited version, a rephrasing request, or a "that's wrong" correction, that is a high-signal quality failure. It requires some instrumentation — capturing the immediate follow-up action in context — but it's a direct behavioral indicator that the user rejected the output.
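A minimal instrumentation sketch, assuming a simple per-session event log (the `Event` schema and the marker phrases are hypothetical; a production system would use a trained classifier for the text check):

```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str   # "ai_response", "user_query", "user_edit", ...
    text: str
    ts: float   # seconds since session start

# Illustrative phrases only; not an exhaustive or validated list.
CORRECTION_MARKERS = ("that's wrong", "that is wrong", "incorrect", "try again")

def correction_rate(events, window_s=60.0):
    """Fraction of AI responses whose next user action, within window_s
    seconds, signals rejection: an edit, or a correction-flavored query."""
    responses, corrected = 0, 0
    for i, ev in enumerate(events):
        if ev.kind != "ai_response":
            continue
        responses += 1
        for nxt in events[i + 1:]:
            if nxt.ts - ev.ts > window_s:
                break
            if nxt.kind == "user_edit":
                corrected += 1
                break
            if nxt.kind == "user_query" and any(
                m in nxt.text.lower() for m in CORRECTION_MARKERS
            ):
                corrected += 1
                break
    return corrected / responses if responses else 0.0
```

The window matters: a correction five seconds after the response is a rejection of that response; one an hour later usually is not.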
Follow-up query type. Not all follow-up queries are equal. A user who asks "can you explain step 3 in more detail?" is engaging with a successful response. A user who asks "can you try again, this time actually answering my question?" is not. Classifying follow-up queries by type — clarification, correction, repetition, abandonment — gives a much sharper signal than aggregate follow-up rate.
Time-to-action. For task-oriented AI features, how quickly does the user act on the output? A user who reads a code suggestion and immediately pastes it into their editor likely found it useful. A user who reads it, then opens a browser to look something up, then comes back and modifies it substantially, did not. This requires careful instrumentation but gives a reliable intent signal.
Task completion in context. For agentic systems, the question is whether the downstream task actually succeeded — did the code compile, did the customer issue get resolved without escalation, did the data extraction produce valid output? Outcome verification, not interaction rating, is the only honest measure of agent quality.
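Two outcome checks of this kind might look like the sketch below. The required JSON keys are illustrative, and a real pipeline would execute generated code against a test suite in a sandbox rather than only compiling it:

```python
import json

def code_compiles(source: str) -> bool:
    """Cheapest outcome check for generated Python: does it even parse?
    (A real pipeline would go further and run the tests in a sandbox.)"""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False

def extraction_is_valid(raw: str, required_keys=("id", "amount")) -> bool:
    """Outcome check for a data-extraction agent: the output must be JSON
    carrying the fields downstream code needs. Keys here are hypothetical."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return all(k in record for k in required_keys)
```

Neither check asks whether the user liked the output; both ask whether the task succeeded, which is the point.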
Consistency across trials. A model that produces correct output on a given prompt 60% of the time and incorrect output 40% of the time should not score well, even if its average rating is high. Consistency metrics — pass@k, which estimates the probability that at least one of k independent attempts succeeds — catch this instability where single-pass evaluation misses it.
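The standard unbiased pass@k estimator — given n sampled attempts of which c were correct, the probability that at least one of k draws without replacement is correct — is a few lines:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: with n sampled attempts and c correct, the
    probability that at least one of k draws (without replacement) passes."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Running it at several k makes instability visible: the 60%-consistent model above scores 0.6 at k=1 but climbs toward 1.0 as k grows, a spread that a consistently correct model would not show.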
The Structural Fix
Most teams that treat AI quality as a product-metrics problem are measuring at the wrong layer. The question "is this response rated highly?" is not the same as "is this response correct and useful?"
The practical change is instrumentation depth. Rather than measuring whether a user liked a response, measure what the user did next. Rather than rating interaction quality, verify outcome quality where possible. Rather than relying on thumbs ratings, instrument correction rate and follow-up query type as primary signals.
For teams using RLHF or any form of preference optimization, the implication is direct: your reward signal is only as good as your labelers' ability to distinguish correct from plausible. In specialized domains, that ability is often low. Supplementing human ratings with automated verification — running AI-generated code, checking claims against ground truth, auditing factual assertions — is not optional. It is the only way to prevent reward models from encoding the wrong preferences.
Evaluations that track reasoning paths — not just final answers — catch a class of failures that outcome-based metrics miss. A model that reached the right answer via unstable reasoning will fail on slightly different inputs. Transcript analysis, tool selection audits, and step-by-step verification are more expensive than rating responses, but they're measuring closer to what actually matters.
The One Thing to Change First
If you have one feedback mechanism to fix, fix the follow-up query classifier. Most teams log follow-up queries but don't classify them. A lightweight classifier trained on a few hundred labeled examples — distinguishing clarification requests from correction requests from abandonment patterns — converts a noisy implicit signal into an actionable quality indicator.
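Even before training anything, a keyword heuristic over the categories above separates the classes well enough to bootstrap labeling. The marker lists below are illustrative placeholders for the trained classifier, not a validated taxonomy:

```python
from typing import Optional

# Hypothetical marker lists; replace with a classifier trained on a few
# hundred labeled follow-up queries.
CORRECTION_MARKERS = ("wrong", "incorrect", "that's not", "actually answering")
REPETITION_MARKERS = ("try again", "repeat that", "one more time", "regenerate")
CLARIFICATION_MARKERS = ("explain", "more detail", "what do you mean", "how does")

def classify_followup(query: Optional[str]) -> str:
    """Bucket a follow-up query into the classes from the text:
    correction, repetition, clarification, or abandonment (no follow-up)."""
    if not query:
        return "abandonment"
    q = query.lower()
    # Correction is checked first: "try again, this time actually answering
    # my question" is a correction, not a mere repetition request.
    if any(m in q for m in CORRECTION_MARKERS):
        return "correction"
    if any(m in q for m in REPETITION_MARKERS):
        return "repetition"
    if any(m in q for m in CLARIFICATION_MARKERS):
        return "clarification"
    return "other"
```

The heuristic's misclassifications are themselves useful: they are exactly the examples worth labeling first for the real classifier.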
The thumb is a social signal. What the user does next is a behavioral signal. One tells you whether the response seemed good. The other tells you whether it was.
Build the thing that tells you whether it was.
Sources
- https://huyenchip.com/2023/05/02/rlhf.html
- https://arxiv.org/abs/2310.09144
- https://arxiv.org/html/2602.09305v1
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://www.evidentlyai.com/blog/llm-hallucination-examples
- https://www.qodo.ai/reports/state-of-ai-code-quality/
- https://alignment.anthropic.com/2025/stress-testing-model-specs/
- https://dl.acm.org/doi/10.1145/3743127
