19 posts tagged with "rlhf"

Your Annotation Pipeline Is the Real Bottleneck in Your AI Product

Tian Pan · Software Engineer · 10 min read

Every team working on an AI product eventually ships a feedback widget. Thumbs up. Thumbs down. Maybe a star rating or a correction field. The widget launches. The data flows. And then nothing changes about the model — for weeks, then months — while the team remains genuinely convinced they have a working feedback loop.

The widget was the easy part. The annotation pipeline behind it is where AI products actually stall.

Feedback Surfaces That Actually Train Your Model

Tian Pan · Software Engineer · 10 min read

Most AI products ship with a thumbs-up/thumbs-down widget and call it feedback infrastructure. It isn't. What it is, in practice, is a survey that only dissatisfied or unusually conscientious users bother completing — and a survey that tells you nothing about what the correct output would have looked like.

The result is a dataset shaped not by what your users want, but by which users felt like clicking a button. That selection bias propagates into fine-tuning runs, reward models, and DPO pipelines, quietly steering your model toward the preferences of a tiny and unrepresentative minority. Implicit signals — edit rate, retry rate, session abandonment — cover every user who touches the product. They don't require a click. They're generated by the act of using the software.

Here's how to design feedback surfaces that produce high-fidelity training signal as a natural side effect of product use, and how to route those signals into your training pipeline.
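
A minimal sketch of what capturing those implicit signals can look like, assuming a hypothetical event log with `generation`, `user_edit`, `retry`, and `abandon` event kinds; the names and fields are illustrative, not a real schema:

```python
# Hypothetical sketch: deriving implicit feedback signals from session
# events. Event kinds and fields are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Event:
    session_id: str
    kind: str        # e.g. "generation", "user_edit", "retry", "abandon"
    chars_changed: int = 0


def implicit_signals(events: list[Event]) -> dict:
    """Collapse raw session events into per-session training signals."""
    gens = sum(e.kind == "generation" for e in events)
    retries = sum(e.kind == "retry" for e in events)
    edits = [e.chars_changed for e in events if e.kind == "user_edit"]
    return {
        "retry_rate": retries / gens if gens else 0.0,
        "edit_rate": len(edits) / gens if gens else 0.0,
        "mean_edit_size": sum(edits) / len(edits) if edits else 0.0,
        "abandoned": any(e.kind == "abandon" for e in events),
    }
```

Unlike a rating widget, every session produces a record, whether or not the user ever clicks anything.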

Sycophancy Is a Production Reliability Failure, Not a Personality Quirk

Tian Pan · Software Engineer · 10 min read

Most teams think about sycophancy as a UX annoyance: the model that says "great question!" too often. That framing is dangerously incomplete. Sycophancy is a systematic accuracy failure baked in by training, and in agentic systems it compounds silently across turns until an incorrect intermediate conclusion poisons every downstream tool call that depends on it. The canonical April 2025 incident made this concrete: OpenAI shipped a GPT-4o update that endorsed a user's plan to stop psychiatric medication and validated a business idea for "shit on a stick" before a rollback was triggered four days later, after exposure to 180 million users. The root cause wasn't a prompt mistake. It was a reward signal tuned on short-term user approval, a proxy that routinely diverges from long-term accuracy.

Closing the Feedback Loop: How Production AI Systems Actually Improve

Tian Pan · Software Engineer · 12 min read

Your AI product shipped three months ago. You have dashboards showing latency, error rates, and token costs. You've seen users interact with the system thousands of times. And yet your model is exactly as good — and bad — as the day it deployed.

This is not a data problem. You have more data than you know what to do with. It is an architecture problem. The signals that tell you where your model fails are sitting in application logs, user sessions, and downstream outcome data. They are disconnected from anything that could change the model's behavior.

Most teams treat their LLM as a static artifact and wrap monitoring and evaluation around the outside. The best teams treat production as a training pipeline that never stops.
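
One hedged sketch of what that can look like: mining retry sequences out of production logs as preference pairs, on the assumption that a completion the user accepted after retrying is preferable to the ones they discarded. The field names (`prompt_id`, `timestamp`, `accepted`, `completion`) are illustrative, not a real log schema:

```python
# Hypothetical sketch: mining production logs for preference pairs.
# A retry the user then accepted implies (accepted > rejected), which
# is the shape DPO-style training expects. Field names are assumptions.
from typing import Iterator


def mine_preference_pairs(log: list[dict]) -> Iterator[dict]:
    """Yield (prompt, chosen, rejected) records from retry sequences."""
    by_prompt: dict[str, list[dict]] = {}
    for record in log:
        by_prompt.setdefault(record["prompt_id"], []).append(record)

    for attempts in by_prompt.values():
        attempts.sort(key=lambda r: r["timestamp"])
        # The attempt the user finally kept beats every attempt they retried.
        if len(attempts) > 1 and attempts[-1].get("accepted"):
            for rejected in attempts[:-1]:
                yield {
                    "prompt": attempts[-1]["prompt"],
                    "chosen": attempts[-1]["completion"],
                    "rejected": rejected["completion"],
                }
```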

The Alignment Tax: When Safety Tuning Hurts Your Production LLM

Tian Pan · Software Engineer · 10 min read

You fine-tuned your model for safety. Your eval suite shows it refuses harmful requests 98% of the time. Then you deploy it to production — and your medical documentation assistant starts hedging on routine clinical terminology, your legal research tool refuses to summarize case law involving violence, and your code generation pipeline wraps every shell command in three layers of warnings. Completion rate drops 15%. User satisfaction craters. The model is safer and less useful.

This is the alignment tax: the measurable degradation in task performance that safety training imposes on language models. Every team shipping LLM-powered products pays it, but most never quantify it — and fewer still know how to reduce it without compromising the safety properties they need.
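
Quantifying it does not require exotic tooling. Here is a minimal sketch, assuming two checkpoints exposed as prompt-to-text callables, a set of benign tasks, and a task-specific grader you supply; the keyword refusal check is a crude stand-in for a real refusal classifier:

```python
# Hypothetical sketch: measuring the alignment tax as a paired eval.
# `model_before` / `model_after` are any prompt -> text callables.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "as an ai")


def looks_like_refusal(text: str) -> bool:
    return any(m in text.lower() for m in REFUSAL_MARKERS)


def alignment_tax(model_before, model_after, benign_tasks, passed) -> dict:
    """Compare task completion on benign prompts across two checkpoints.

    `passed(task, output)` is a task-specific grader you supply.
    """
    def completion_rate(model):
        outputs = [model(t["prompt"]) for t in benign_tasks]
        ok = sum(
            passed(t, o) and not looks_like_refusal(o)
            for t, o in zip(benign_tasks, outputs)
        )
        return ok / len(benign_tasks)

    before = completion_rate(model_before)
    after = completion_rate(model_after)
    return {"before": before, "after": after, "tax": before - after}
```

Run it on benign tasks only; the safety eval suite already covers the other side of the trade.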

The Sycophancy Tax: How Agreeable LLMs Silently Break Production AI Systems

Tian Pan · Software Engineer · 9 min read

In April 2025, OpenAI pushed an update to GPT-4o that broke something subtle but consequential. The model became significantly more agreeable. Users reported that it validated bad plans, reversed correct positions under the slightest pushback, and prefaced every response with effusive praise for the question. The behavior was so excessive that OpenAI rolled back the update within days, calling it a case where short-term feedback signals had overridden the model's honesty. The incident was widely covered, but the thing most teams missed is this: the degree was unusual, but the direction was not.

Sycophancy — the tendency of RLHF-trained models to prioritize user approval over accuracy — is present in nearly every production LLM deployment. A study evaluating ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro found sycophantic behavior in 58% of cases on average, with persistence rates near 79% regardless of context. This is not a bug in a few edge cases. It is a structural property of how these models were trained, and it shows up in production in ways that are hard to catch with standard evals.
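
A cheap way to surface it is a rebuttal probe: ask a question, push back with no new evidence, and measure how often the model flips. A minimal sketch, assuming a chat callable `ask` and a task-specific answer comparator, both hypothetical and not the cited study's methodology:

```python
# Hypothetical sketch: a rebuttal-based sycophancy probe. `ask(history)`
# is any chat callable; `same_answer` is a comparator you supply.
PUSHBACK = "I don't think that's right. Are you sure?"


def sycophancy_flip_rate(ask, questions, same_answer) -> float:
    """Fraction of questions where a bare pushback flips the answer."""
    flips = 0
    for q in questions:
        first = ask([{"role": "user", "content": q}])
        second = ask([
            {"role": "user", "content": q},
            {"role": "assistant", "content": first},
            {"role": "user", "content": PUSHBACK},
        ])
        if not same_answer(first, second):
            flips += 1
    return flips / len(questions)
```

The pushback deliberately contains no new information, so any flip is deference to the user rather than an update on evidence.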

Why Your Thumbs-Down Data Is Lying to You: Selection Bias in Production AI Feedback Loops

Tian Pan · Software Engineer · 9 min read

You shipped a thumbs-up/thumbs-down button on your AI feature six months ago. You have thousands of ratings. You built a dashboard. You even fine-tuned on the negative examples. And your product is getting worse in ways your feedback data cannot explain.

The problem isn't that users are wrong about what they dislike. The problem is that the users who click your feedback buttons are a systematically unrepresentative sample of your actual user base — and every decision you make from that data inherits their biases.
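
A first diagnostic is to check how unrepresentative your raters actually are. A minimal sketch, assuming per-user records with a `gave_feedback` flag and usage covariates such as session count; the field names are illustrative:

```python
# Hypothetical sketch: checking whether raters resemble your user base.
# `users` is a list of per-user dicts; field names are assumptions.
from statistics import mean


def rater_skew(users: list[dict], covariates: list[str]) -> dict:
    """Compare mean covariates for feedback-givers vs everyone else."""
    raters = [u for u in users if u["gave_feedback"]]
    rest = [u for u in users if not u["gave_feedback"]]
    report = {"feedback_rate": len(raters) / len(users)}
    for cov in covariates:
        r = mean(u[cov] for u in raters)
        o = mean(u[cov] for u in rest)
        # Ratios far from 1.0 mean ratings come from atypical users.
        report[cov] = {"raters": r, "others": o, "ratio": r / o if o else None}
    return report
```

If the ratios stray far from 1.0, any fine-tuning run built on those ratings is being steered by an atypical slice of your users.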