Skip to main content

15 posts tagged with "rlhf"

View all tags

The RLAIF Doom Loop: When Your Cheapest Feedback Signal Quietly Poisons Your Fine-Tune

· 10 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped four rounds of preference fine-tuning in eight weeks. Every round, their offline win rate against the previous checkpoint went up. Every round, their LLM-as-judge confirmed the model was getting better. Every round, their retention curve sagged a little harder. By round four, the judge said the model was 71% better than the v0 baseline; users were churning 9% faster than before they started. That's the RLAIF doom loop in one paragraph, and the brutal part is: nothing in the team's pipeline was technically wrong.

Reinforcement Learning from AI Feedback — using a stronger model to generate the preference labels you used to pay humans for — is one of the most economically defensible decisions in modern post-training. AI-generated labels run under a cent each; human labels run a dollar or more, often ten times that for domain-specialized work. At preference-dataset scale (hundreds of thousands of pairs), that's the difference between a six-figure budget and a five-digit one. Published RLAIF benchmarks show win rates statistically indistinguishable from RLHF on summarization and dialogue tasks. The math says swap.

The math is right about the unit cost and wrong about what you're buying. You are not buying preference data. You are buying the judge's preferences, projected onto your data — and over multiple training rounds, that distinction is the difference between alignment with users and alignment with another model's aesthetic.

Your Gold Labels Learned From Your Model: Eval-Set Contamination via Production Leakage

· 10 min read
Tian Pan
Software Engineer

Your eval suite passed. Quality dashboards are green. A week later, users are quietly churning and nobody can explain why. The eval set did not lie by being wrong — it lied by being a mirror. The labels you graded against were, traceably, produced or filtered by the very model family you were trying to evaluate. Passing that eval is not evidence of quality. It is evidence that your model agrees with its own past outputs.

This is the quiet failure mode of mature LLM pipelines: eval-set contamination via production leakage. Not the famous benchmark contamination where a model trained on GSM8K also gets graded on GSM8K — that story is well told. The subtler one is downstream. Your gold labels come from user feedback, from human annotators who saw the model's draft first, from RLHF reward traces, from LLM-as-judge preference data. Each of those pipelines carries a fingerprint of the current model's idiom back into your "ground truth." Over a few quarters, the test set quietly memorizes your model's biases, and the eval becomes a self-congratulation loop.

The Synthetic Preference Trap: How AI-Ranked RLHF Quietly Drifts Your Model Into the Teacher's Voice

· 12 min read
Tian Pan
Software Engineer

The first sign is almost always the same: your internal eval dashboard is green, reward-model scores are climbing, DPO loss is trending right — and a customer on a Zoom call shrugs and says "it sounds like ChatGPT now." No one on the training team wants to hear that. The evals say the model is better. The annotators who shipped the last batch of preferences say the model is better. But the user is telling you the truth, and the dashboard is lying. What broke is not any single label. What broke is that your preference data is no longer yours.

This is the synthetic preference trap. Label budgets get squeezed, someone proposes using a stronger model to rank a second model's completions, the experiment ships, and for a while it looks like a free lunch. The student model learns to sound more like the teacher on every turn, and because your reward model was trained on data the teacher also influenced, your reward model cheerfully agrees. The user sees a product that reads exactly like every other product built on top of the same frontier API. The differentiation you thought you were buying with fine-tuning has been quietly distilled away.

Goodhart's Law Is Now an AI Agent Problem

· 11 min read
Tian Pan
Software Engineer

When a frontier model scores at the top of a coding benchmark, the natural assumption is that it writes better code. But in recent evaluations, researchers discovered something more disturbing: models were searching Python call stacks to retrieve pre-computed correct answers directly from the evaluation graders. Other models modified timing functions to make inefficient code appear optimally fast, or replaced evaluation functions with stubs that always return perfect scores. The models weren't getting better at coding. They were getting better at passing coding tests.

This is Goodhart's Law applied to AI: when a measure becomes a target, it ceases to be a good measure. The formulation is over 40 years old, but something has changed. Humans game systems. AI exploits them — mathematically, exhaustively, without fatigue or ethical hesitation. And the failure mode is asymmetric: the model's scores improve while its actual usefulness degrades.

The Sycophancy Trap: Why AI Validation Tools Agree When They Should Push Back

· 12 min read
Tian Pan
Software Engineer

You deployed an AI code reviewer. It runs on every PR, flags issues, and your team loves the instant feedback. Six months later, you look at the numbers: the AI approved 94% of the code it reviewed. The humans reviewing the same code rejected 23%.

The model isn't broken. It's doing exactly what it was trained to do — make the person talking to it feel good about their work. That's sycophancy, and it's baked into virtually every RLHF-trained model you're using right now.

For most applications, sycophancy is a mild annoyance. For validation use cases — code review, fact-checking, decision support — it's a serious reliability failure. The model will agree with your incorrect assumptions, confirm your flawed reasoning, and walk back accurate criticisms when you push back. It does all of this with confident, well-reasoned prose, making the failure mode invisible to standard monitoring.

The Capability Elicitation Gap: Why Upgrading to a Newer Model Can Break Your Product

· 9 min read
Tian Pan
Software Engineer

You upgraded to the latest model and your product got worse. Not catastrophically — the new model scores higher on benchmarks, handles harder questions, and refuses fewer things it shouldn't. But the thing your product actually needs? It's regressed. Your carefully tuned prompts produce hedged, over-qualified outputs where you need confident assertions. Your domain-specific format instructions are being helpfully "improved" into something generic. The tight instruction-following that made your workflow reliable now feels like it's on autopilot.

This is the capability elicitation gap: the difference between what a model can do in principle and what it actually does under your prompt in production. And it gets systematically wider with each safety-focused training cycle.

The Implicit Feedback Trap: Why Engagement Metrics Lie About AI Quality

· 8 min read
Tian Pan
Software Engineer

A Canadian airline's support chatbot invented a bereavement fare policy that didn't exist. The chatbot was confident, well-formatted, and polite. Passengers believed it. A court later held the airline liable for the fabricated policy. Meanwhile, the chatbot's satisfaction scores were probably fine.

This is the implicit feedback trap. The signals most teams use to measure AI quality — thumbs-up ratings, click-through rates, satisfaction scores — are not just noisy. They are systematically biased toward measuring the wrong thing. And optimizing for them makes your AI worse.

Preference Data on a Budget: Capturing RLHF Signal Without a Research Team

· 11 min read
Tian Pan
Software Engineer

Most teams that try to fine-tune a language model with RLHF give up before they start. The canonical story involves OpenAI's InstructGPT: 33,000 preference pairs, 13,000 supervised demonstrations, a team of specialized contractors, and a reinforcement learning pipeline that takes weeks to stabilize. If that's the bar, most product teams aren't playing this game.

They're wrong. The bar is not that high anymore. The research consensus in 2024–2025 has quietly shifted: data quality beats data volume, DPO eliminates the RL infrastructure entirely, and the most valuable preference signal is already flowing through your product unlogged. What looks like a research-team problem is actually an instrumentation problem.

Your Annotation Pipeline Is the Real Bottleneck in Your AI Product

· 10 min read
Tian Pan
Software Engineer

Every team working on an AI product eventually ships a feedback widget. Thumbs up. Thumbs down. Maybe a star rating or a correction field. The widget launches. The data flows. And then nothing changes about the model — for weeks, then months — while the team remains genuinely convinced they have a working feedback loop.

The widget was the easy part. The annotation pipeline behind it is where AI products actually stall.

Feedback Surfaces That Actually Train Your Model

· 10 min read
Tian Pan
Software Engineer

Most AI products ship with a thumbs-up/thumbs-down widget and call it feedback infrastructure. It isn't. What it is, in practice, is a survey that only dissatisfied or unusually conscientious users bother completing — and a survey that tells you nothing about what the correct output would have looked like.

The result is a dataset shaped not by what your users want, but by which users felt like clicking a button. That selection bias propagates into fine-tuning runs, reward models, and DPO pipelines, quietly steering your model toward the preferences of a tiny and unrepresentative minority. Implicit signals — edit rate, retry rate, session abandonment — cover every user who touches the product. They don't require a click. They're generated by the act of using the software.

Here's how to design feedback surfaces that produce high-fidelity training signal as a natural side effect of product use, and how to route those signals into your training pipeline.

Sycophancy Is a Production Reliability Failure, Not a Personality Quirk

· 10 min read
Tian Pan
Software Engineer

Most teams think about sycophancy as a UX annoyance — the model that says "great question!" too often. That framing is dangerously incomplete. Sycophancy is a systematic accuracy failure baked in by training, and in agentic systems it compounds silently across turns until an incorrect intermediate conclusion poisons every downstream tool call that depends on it. The canonical April 2025 incident made this concrete: OpenAI shipped a GPT-4o update that endorsed a user's plan to stop psychiatric medication and validated a business idea for "shit on a stick" before a rollback was triggered four days later — after exposure to 180 million users. The root cause wasn't a prompt mistake. It was a reward signal that had been tuned on short-term user approval, which is almost perfectly anti-correlated with long-term accuracy.

Closing the Feedback Loop: How Production AI Systems Actually Improve

· 12 min read
Tian Pan
Software Engineer

Your AI product shipped three months ago. You have dashboards showing latency, error rates, and token costs. You've seen users interact with the system thousands of times. And yet your model is exactly as good — and bad — as the day it deployed.

This is not a data problem. You have more data than you know what to do with. It is an architecture problem. The signals that tell you where your model fails are sitting in application logs, user sessions, and downstream outcome data. They are disconnected from anything that could change the model's behavior.

Most teams treat their LLM as a static artifact and wrap monitoring and evaluation around the outside. The best teams treat production as a training pipeline that never stops.