The Implicit Feedback Trap: Why Engagement Metrics Lie About AI Quality
A Canadian airline's support chatbot invented a bereavement fare policy that didn't exist. The chatbot was confident, well-formatted, and polite. Passengers believed it. A court later held the airline liable for the fabricated policy. Meanwhile, the chatbot's satisfaction scores were probably fine.
This is the implicit feedback trap. The signals most teams use to measure AI quality — thumbs-up ratings, click-through rates, satisfaction scores — are not just noisy. They are systematically biased toward measuring the wrong thing. And optimizing for them makes your AI worse.
Confident Tone Is Indistinguishable from Correct
Language models produce responses with the same confident register whether they are accurate or fabricated. A hallucinated court citation reads identically to a real one. A made-up refund policy sounds like a real policy. Users have no surface-level signal to distinguish them.
This creates a fatal flaw in thumbs-up feedback: raters evaluate responses based on whether they believe the answer, not whether it is true. A user who lacks the background knowledge to spot a hallucination will upvote it. A user who receives a technically correct but surprising or hedged answer — "this is uncertain, but the most likely answer is X" — may downvote it as unhelpful.
The result is systematic: confident, fluent, wrong answers collect higher ratings than accurate but uncertain ones.
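You can see the arithmetic in a toy simulation. Everything below is invented for illustration (the probabilities, the 10% share of raters who can verify an answer); the point is only that when most raters cannot check correctness, the rating mostly measures tone.

```python
import random

random.seed(0)

# Hypothetical rater model: the chance of a thumbs-up depends mostly on how
# confident the answer sounds, and only on correctness when the rater happens
# to have the expertise to verify it. All probabilities are made up.
def p_thumbs_up(sounds_confident: bool, is_correct: bool, can_verify: bool) -> float:
    if can_verify:
        return 0.9 if is_correct else 0.1      # informed raters reward accuracy
    return 0.8 if sounds_confident else 0.4    # everyone else rewards confidence

def avg_rating(sounds_confident: bool, is_correct: bool,
               verify_rate: float = 0.1, n: int = 100_000) -> float:
    ups = 0
    for _ in range(n):
        can_verify = random.random() < verify_rate   # few raters can check the facts
        ups += random.random() < p_thumbs_up(sounds_confident, is_correct, can_verify)
    return ups / n

print("confident, wrong:", avg_rating(sounds_confident=True,  is_correct=False))
print("hedged, correct :", avg_rating(sounds_confident=False, is_correct=True))
```

With only 10% of raters able to verify, the confident-but-wrong answer averages roughly 0.73 while the hedged-but-correct one averages roughly 0.45. Push verify_rate toward 1.0 and the ordering flips, which is why the same feedback loop behaves so differently when the users are domain experts.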
RLHF Bakes the Bias In
Reinforcement learning from human feedback (RLHF) was supposed to align AI outputs with human preferences. The paradox is that it makes hallucination problems structurally worse.
When RLHF was introduced for instruction-following, researchers found that incorporating reinforcement learning from human ratings increased fabrication compared to supervised fine-tuning alone — even as users preferred the RLHF outputs overall. The model learned that confidently stated content, regardless of accuracy, receives higher reward signals.
This isn't a bug in the RLHF implementation. It's a consequence of training against a proxy metric. Agreement between human labelers on what constitutes a "good" response is typically around 73%. The disagreements in the remaining 27% encode real ambiguity: some labelers reward accuracy, some reward confidence, some reward concision. The reward model trains on the aggregate, which tends to favor responses that seem good over responses that are good, particularly in domains where evaluators lack the expertise to distinguish them.
The knowledge mismatch problem compounds this further. When labelers rate responses in domains beyond their expertise — medical, legal, technical — they reliably upvote confident-sounding fabrications and downvote hedged-but-accurate responses. RLHF doesn't know the difference. It trains toward whatever the ratings say.
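Here is a minimal sketch of that dynamic, assuming a standard Bradley-Terry-style pairwise loss for the reward model. The two features and the 80/20 split between confidence-driven and accuracy-driven labels are invented, not taken from any of the work cited below.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each response gets two illustrative features: how confident it sounds, and
# whether it is factually correct. The reward model is linear, r(x) = w . x,
# trained with the usual pairwise loss  -log sigmoid(r(chosen) - r(rejected)).
def make_pair():
    a = np.array([rng.random(), float(rng.random() < 0.5)])  # [confidence, correctness]
    b = np.array([rng.random(), float(rng.random() < 0.5)])
    # Invented labeler behavior: 80% of comparisons are decided by which answer
    # sounds more confident, 20% by which one is correct (ties fall back to
    # confidence).
    if rng.random() < 0.8 or a[1] == b[1]:
        a_wins = a[0] > b[0]
    else:
        a_wins = a[1] > b[1]
    return (a, b) if a_wins else (b, a)        # (chosen, rejected)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
lr = 0.1
for _ in range(20_000):
    chosen, rejected = make_pair()
    diff = chosen - rejected
    w += lr * (1.0 - sigmoid(w @ diff)) * diff   # SGD step on the pairwise loss

print("learned reward weights [confidence, correctness]:", w.round(2))
```

The confidence weight comes out larger than the correctness weight, and policy optimization against this reward model then pushes the policy toward whatever the reward model overweights. That is the structural version of "trained to sound right."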
Goodhart's Law, Applied
The classic formulation: when a measure becomes a target, it ceases to be a good measure. Empirical research has now quantified this effect for reinforcement learning specifically — optimizing against an imperfect proxy reward decreases performance on the true objective past a certain optimization threshold.
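Here is a toy best-of-n version of that result. All of the numbers are invented and none of this reproduces the cited paper; the setup just captures the mechanism. The proxy reward tracks true quality for ordinary responses but badly over-scores a rare class of responses, and the harder you select on the proxy, the more often you end up with one of those.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_response():
    """Return one (true_quality, proxy_reward) pair. Illustrative numbers only.

    Most responses are 'honest': the proxy tracks true quality with some noise.
    A rare few are 'reward hacks': low true quality, very high proxy score.
    """
    if rng.random() < 0.02:                       # rare reward-hack response
        return 0.1, 4.0 + rng.normal(0, 0.1)
    true = rng.normal(0.0, 1.0)                   # honest response
    return true, true + rng.normal(0, 0.3)

def best_of_n_true_quality(n: int, trials: int = 3_000) -> float:
    """Average true quality of the proxy-best response among n candidates."""
    total = 0.0
    for _ in range(trials):
        candidates = [sample_response() for _ in range(n)]
        total += max(candidates, key=lambda tq: tq[1])[0]   # select by proxy
    return total / trials

for n in (1, 2, 4, 16, 64, 256):
    print(f"n={n:4d}  avg true quality of proxy-best: {best_of_n_true_quality(n):+.2f}")
```

True quality of the selected response climbs for small n, peaks, and then falls as n grows: mild optimization against the proxy helps, heavy optimization reliably finds exactly the responses the proxy mis-scores.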
The pattern shows up everywhere AI uses behavioral signals as feedback:
- Search and recommendation systems optimizing for clicks surface sensational content over substantive answers. The click is easy to measure; usefulness is not.
- Code assistants achieve 30% acceptance rates while 76% of users report high hallucination rates — users accept suggestions because they look plausible, not because they are correct. A developer who can't immediately spot the bug in a generated function will accept it.
- Customer support AI receives high satisfaction ratings because it sounds confident and responds quickly, even when the resolution it provided was wrong. The user rates the interaction before discovering the error.
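That last point is a measurement-timing problem, and it becomes visible the moment interaction-time ratings are joined with delayed outcomes. A made-up event log, with every value illustrative:

```python
# Hypothetical support log. "rated_helpful" is captured when the chat closes;
# whether the issue was actually resolved is only known later, e.g. from
# whether the ticket gets reopened within a week. All values are invented.
interactions = [
    {"id": 1, "rated_helpful": True,  "reopened_within_7d": False},
    {"id": 2, "rated_helpful": True,  "reopened_within_7d": True},   # sounded right, wasn't
    {"id": 3, "rated_helpful": True,  "reopened_within_7d": True},
    {"id": 4, "rated_helpful": False, "reopened_within_7d": False},  # hedged answer, downvoted
    {"id": 5, "rated_helpful": True,  "reopened_within_7d": False},
]

satisfaction = sum(i["rated_helpful"] for i in interactions) / len(interactions)
resolved = sum(not i["reopened_within_7d"] for i in interactions) / len(interactions)

print(f"satisfaction at chat close: {satisfaction:.0%}")   # 80%
print(f"resolved after 7 days     : {resolved:.0%}")       # 60%
```

A dashboard built on the first number never sees the second, because the rating is logged before the ground truth exists.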
What reward hacking looks like in AI isn't a system deliberately gaming its metrics. It's a model that has been trained to produce the style of responses that humans rate highly, which is not the same as the substance of responses that are actually useful.
Why Engagement Metrics Make This Worse
Thumbs-up ratings are a bad signal. Implicit engagement metrics (clicks, time-on-page, session length, follow-up queries) are worse, because they compound the same bias with additional measurement noise.
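A toy version, with an invented engagement score (real products weight these signals differently, but the failure mode is the same): the session where the assistant failed looks far more "engaged" than the one where it succeeded.

```python
# Two made-up sessions. In A, the assistant answered correctly on the first
# try. In B, it was wrong twice, so the user kept rephrasing the question
# before giving up and searching elsewhere.
session_a = {"turns": 2, "duration_s": 40,  "follow_up_queries": 0}
session_b = {"turns": 7, "duration_s": 310, "follow_up_queries": 4}

def engagement_score(session: dict) -> float:
    """An invented implicit-engagement proxy: more turns, more time on task,
    and more follow-ups all push the score up."""
    return session["turns"] + session["duration_s"] / 60 + 2 * session["follow_up_queries"]

print("session A (correct first answer):", round(engagement_score(session_a), 1))
print("session B (two wrong answers)   :", round(engagement_score(session_b), 1))
```

Longer sessions and more follow-up queries can mean delight or distress; without an explicit notion of task success, the metric cannot tell the difference.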
Sources
- https://huyenchip.com/2023/05/02/rlhf.html
- https://arxiv.org/abs/2310.09144
- https://arxiv.org/html/2602.09305v1
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://www.evidentlyai.com/blog/llm-hallucination-examples
- https://www.qodo.ai/reports/state-of-ai-code-quality/
- https://alignment.anthropic.com/2025/stress-testing-model-specs/
- https://dl.acm.org/doi/10.1145/3743127
