
2 posts tagged with "rlhf"


The Sycophancy Tax: How Agreeable LLMs Silently Break Production AI Systems

9 min read
Tian Pan
Software Engineer

In April 2025, OpenAI pushed an update to GPT-4o that broke something subtle but consequential: the model became significantly more agreeable. Users reported that it validated bad plans, reversed correct positions under the slightest pushback, and prefaced every response with effusive praise for the question. The behavior was so excessive that OpenAI rolled back the update within days, calling it a case where short-term feedback signals had overridden the model's honesty. The incident was widely covered, but most teams missed the key point: the degree was unusual; the direction was not.

Sycophancy — the tendency of RLHF-trained models to prioritize user approval over accuracy — is present in nearly every production LLM deployment. A study evaluating ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro found sycophantic behavior in 58% of cases on average, with persistence rates near 79% regardless of context. This is not a bug in a few edge cases. It is a structural property of how these models were trained, and it shows up in production in ways that are hard to catch with standard evals.

Why Your Thumbs-Down Data Is Lying to You: Selection Bias in Production AI Feedback Loops

9 min read
Tian Pan
Software Engineer

You shipped a thumbs-up/thumbs-down button on your AI feature six months ago. You have thousands of ratings. You built a dashboard. You even fine-tuned on the negative examples. And your product is getting worse in ways your feedback data cannot explain.

The problem isn't that users are wrong about what they dislike. The problem is that the users who click your feedback buttons are a systematically unrepresentative sample of your actual user base — and every decision you make from that data inherits their biases.
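The sampling effect above is easy to see in a toy simulation. This is a hedged sketch with entirely made-up numbers: it assumes sessions have a latent satisfaction score, and that unhappy users are far more likely to click the feedback button than happy ones. Under that assumption, the thumbs-up rate you measure diverges sharply from the satisfaction rate of your actual user base.

```python
import random

random.seed(0)

# Hypothetical data: 10,000 sessions, each with a latent satisfaction
# score drawn uniformly from [0, 1]. A score above 0.5 means the user
# would rate the interaction thumbs-up if asked directly.
sessions = [random.random() for _ in range(10_000)]

def clicks_feedback(score: float) -> bool:
    # Assumed (illustrative) response model: a 5% baseline click rate,
    # plus up to 50% more for dissatisfied users. Unhappy users are
    # several times more likely to bother rating at all.
    p = 0.05 + 0.5 * (1.0 - score)
    return random.random() < p

# Only the self-selected subset ever reaches your dashboard.
rated = [s for s in sessions if clicks_feedback(s)]

measured_up_rate = sum(s > 0.5 for s in rated) / len(rated)
true_up_rate = sum(s > 0.5 for s in sessions) / len(sessions)

print(f"true satisfaction rate:     {true_up_rate:.2f}")
print(f"measured thumbs-up rate:    {measured_up_rate:.2f}")
```

Under these assumed numbers the dashboard reports roughly 30% satisfaction while the real figure is near 50% — and fine-tuning on the "negative examples" from that sample optimizes for the loudest slice of users, not the typical one.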