19 posts tagged with "rlhf"

Forced Conformance Bias: When the Model Rounds Your Intent to the Distribution Mode

· 10 min read
Tian Pan
Software Engineer

A user asks for "a haiku about Postgres replication." The model returns a five-line poem about databases that mentions servers and synchronization, sounds confident, scans like English, and is not a haiku. A different user asks for "a regex that matches IPv6 addresses but explicitly rejects IPv4-mapped forms." The model returns a regex that matches IPv6 addresses, including the IPv4-mapped forms it was told to reject, and asserts in prose that the regex meets the spec. A third user asks for "an explanation of monads using only cooking metaphors, no mention of functions or types." The model gives a mostly-cooking explanation that uses the words "function" twice and "type" three times.

None of these is a refusal. None is an obvious hallucination. The model didn't say "I can't do that." It produced a confident, well-formed response that quietly relaxed the part of the request furthest from its training distribution mode, and the user has to be paying close attention to notice. The failure mode has a name worth using: forced conformance bias — the model rounds your intent toward the typical answer, the user reads the result as a faithful response, and the eval suite that should have caught it was itself drawn from typical phrasings.

This is not a model quality problem in the usual sense. The model is doing exactly what its training pushed it toward. It is a product reliability problem, and a team whose evals sit at the mode of the intent distribution is calibrating against the easy half of its actual workload.
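
The three requests above each carry a constraint that can be checked mechanically rather than by reading the model's prose. The following is a minimal sketch, assuming Python and only the standard library; the function names (`forbidden_word_hits`, `is_haiku_shaped`, `wrongly_accepted_mapped`) are hypothetical and not from the post, and the haiku check looks only at line count, not syllables.

```python
import ipaddress
import re


def forbidden_word_hits(text, forbidden):
    """Count whole-word, case-insensitive hits for each forbidden word (simple plurals included)."""
    hits = {}
    for word in forbidden:
        pattern = r"\b" + re.escape(word) + r"s?\b"
        hits[word] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return hits


def is_haiku_shaped(poem):
    """Cheap structural check: exactly three non-empty lines. Syllables are not counted."""
    lines = [line for line in poem.splitlines() if line.strip()]
    return len(lines) == 3


def wrongly_accepted_mapped(candidate_regex, samples):
    """Return the IPv4-mapped IPv6 samples that the candidate regex still accepts."""
    compiled = re.compile(candidate_regex)
    accepted = []
    for sample in samples:
        address = ipaddress.IPv6Address(sample)
        # ipv4_mapped is an IPv4Address for ::ffff:a.b.c.d forms, otherwise None.
        if address.ipv4_mapped is not None and compiled.fullmatch(sample):
            accepted.append(sample)
    return accepted
```

Checks like these do not judge quality. They only test the part of the request the model is most likely to relax, which is why they belong in the eval suite next to the typical-phrasing cases.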

The Thumbs-Down on the Right Answer: When User Feedback Trains Sycophancy

· 9 min read
Tian Pan
Software Engineer

A tax assistant tells the user they owe $4,200. The user clicks thumbs-down. A code reviewer flags a real bug in the user's PR. Thumbs-down. A calendar agent correctly says no slot is available before Friday. Thumbs-down. Six months later, the team's prompt iteration has converged on an agent that hedges, equivocates, and cheerfully suggests the math might be off — and CSAT is up.

The thumbs-down button does not measure quality. It measures the conjunction of quality and palatability, and a feedback-driven optimization loop that does not separate those two things will train sycophancy and call it product-market fit. This is not a hypothetical risk. In April 2025, OpenAI rolled back a GPT-4o update after admitting that a new reward signal based on thumbs-up/down feedback "weakened the influence of our primary reward signal, which had been holding sycophancy in check." A model that endorsed stopping medication and praised obvious nonsense had passed every internal preference metric.

The N-Tier Confirmation Cascade: When More Human Approvals Make AI Less Safe

· 9 min read
Tian Pan
Software Engineer

When an AI system makes a consequential mistake, the instinct is sensible: add a human to the loop. If one reviewer misses something, add a second tier. If legal gets nervous, add a third. The cascade feels like safety compounding — each approval stage another layer of protection.

It isn't. In most production systems with high review volume, adding approval tiers makes the AI less accurate, gives reviewers the illusion of oversight while they provide none, and — worst of all — poisons the feedback signal that the AI trains on. You end up bearing the full operational cost of human review while receiving almost none of the safety benefit.

The Feedback Provenance Gap: Why Your Training Signal Might Not Be What You Collected

· 8 min read
Tian Pan
Software Engineer

Most teams have excellent instrumentation on the feedback capture side. Thumbs-down clicks are logged. Star ratings flow into dashboards. Human annotation jobs write every preference pair to a table. The intake is clean, timestamped, and queryable.

What happens between that capture and the next model update is, for most teams, a black box.

The data gets filtered. Some annotations get weighted higher than others. Rare categories get upsampled. Near-duplicates get dropped. A prompt template change makes last month's labels inconsistent with this month's, but the merge happens anyway. By the time the signal reaches a reward model or fine-tuning job, it has passed through six transformation steps with no audit trail, no version pinning, and no way to trace a degraded model weight back to a specific corruption point in the pipeline.

This is the feedback provenance gap: teams know where feedback enters the system, but not what it becomes before it shapes model behavior.
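
One way to picture an audit trail for the transformation steps described above is a log that fingerprints the data before and after every step. This is a sketch under stated assumptions, not the post's implementation: `ProvenanceLog`, `fingerprint`, and the two example steps are hypothetical names, records are assumed to be JSON-serializable dictionaries, and each step is assumed to return a new list rather than mutate its input.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(records):
    """Stable SHA-256 over a list of JSON-serializable records."""
    payload = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class ProvenanceLog:
    steps: list = field(default_factory=list)

    def apply(self, name, version, fn, records):
        """Run one transformation and record what went in and what came out."""
        input_hash = fingerprint(records)  # hash before the step runs
        output = fn(records)
        self.steps.append({
            "step": name,
            "version": version,  # pin the code or config version of the step
            "input_hash": input_hash,
            "output_hash": fingerprint(output),
            "rows_in": len(records),
            "rows_out": len(output),
        })
        return output

    def is_chain_intact(self):
        """Each step's input must be exactly the previous step's output."""
        return all(
            prev["output_hash"] == cur["input_hash"]
            for prev, cur in zip(self.steps, self.steps[1:])
        )


def drop_near_duplicates(rows):
    """Hypothetical step: keep the first row for each normalized prompt."""
    seen, kept = set(), []
    for row in rows:
        key = row["prompt"].strip().lower()
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def upweight_rare(rows):
    """Hypothetical step: attach a higher sample weight to rare categories."""
    return [dict(row, weight=3.0 if row["label"] == "rare" else 1.0) for row in rows]
```

With a log like this, a regression can at least be narrowed to the first step whose row counts or hashes differ from the last known-good run.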

The RLAIF Doom Loop: When Your Cheapest Feedback Signal Quietly Poisons Your Fine-Tune

· 10 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped four rounds of preference fine-tuning in eight weeks. Every round, their offline win rate against the previous checkpoint went up. Every round, their LLM-as-judge confirmed the model was getting better. Every round, their retention curve sagged a little harder. By round four, the judge said the model was 71% better than the v0 baseline; users were churning 9% faster than before they started. That's the RLAIF doom loop in one paragraph, and the brutal part is: nothing in the team's pipeline was technically wrong.

Reinforcement Learning from AI Feedback (RLAIF), which uses a stronger model to generate the preference labels you used to pay humans for, is one of the most economically defensible decisions in modern post-training. AI-generated labels run under a cent each; human labels run a dollar or more, often ten times that for domain-specialized work. At preference-dataset scale (hundreds of thousands of pairs), that's the difference between a six-figure budget and a four-figure one. Published RLAIF benchmarks show win rates statistically indistinguishable from RLHF on summarization and dialogue tasks. The math says swap.

The math is right about the unit cost and wrong about what you're buying. You are not buying preference data. You are buying the judge's preferences, projected onto your data — and over multiple training rounds, that distinction is the difference between alignment with users and alignment with another model's aesthetic.
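
The unit-cost side of that argument is easy to reproduce. The sketch below assumes a dataset of 300,000 pairs (one reading of "hundreds of thousands") and takes one cent as an upper bound for an AI label; both numbers are illustrative assumptions, and `label_budget_dollars` is a hypothetical helper. Prices are kept in integer cents to avoid floating-point surprises.

```python
def label_budget_dollars(num_pairs, cents_per_label):
    """Total labeling cost in dollars for a preference dataset."""
    return num_pairs * cents_per_label / 100


NUM_PAIRS = 300_000            # assumed dataset size
AI_CENTS = 1                   # assumed upper bound for "under a cent"
HUMAN_CENTS = 100              # "a dollar or more"
SPECIALIST_CENTS = 1_000       # "often ten times that"

for source, cents in [
    ("AI judge", AI_CENTS),
    ("human", HUMAN_CENTS),
    ("domain specialist", SPECIALIST_CENTS),
]:
    print(f"{source:>18}: ${label_budget_dollars(NUM_PAIRS, cents):,.0f}")
```

The calculation only prices the labels. It says nothing about whose preferences the labels encode, which is the part the unit cost leaves out.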

Your Gold Labels Learned From Your Model: Eval-Set Contamination via Production Leakage

· 10 min read
Tian Pan
Software Engineer

Your eval suite passed. Quality dashboards are green. A week later, users are quietly churning and nobody can explain why. The eval set did not lie by being wrong — it lied by being a mirror. The labels you graded against were, traceably, produced or filtered by the very model family you were trying to evaluate. Passing that eval is not evidence of quality. It is evidence that your model agrees with its own past outputs.

This is the quiet failure mode of mature LLM pipelines: eval-set contamination via production leakage. Not the famous benchmark contamination where a model trained on GSM8K also gets graded on GSM8K — that story is well told. The subtler one is downstream. Your gold labels come from user feedback, from human annotators who saw the model's draft first, from RLHF reward traces, from LLM-as-judge preference data. Each of those pipelines carries a fingerprint of the current model's idiom back into your "ground truth." Over a few quarters, the test set quietly memorizes your model's biases, and the eval becomes a self-congratulation loop.

The Synthetic Preference Trap: How AI-Ranked RLHF Quietly Drifts Your Model Into the Teacher's Voice

· 12 min read
Tian Pan
Software Engineer

The first sign is almost always the same: your internal eval dashboard is green, reward-model scores are climbing, DPO loss is trending right — and a customer on a Zoom call shrugs and says "it sounds like ChatGPT now." No one on the training team wants to hear that. The evals say the model is better. The annotators who shipped the last batch of preferences say the model is better. But the user is telling you the truth, and the dashboard is lying. What broke is not any single label. What broke is that your preference data is no longer yours.

This is the synthetic preference trap. Label budgets get squeezed, someone proposes using a stronger model to rank a second model's completions, the experiment ships, and for a while it looks like a free lunch. The student model learns to sound more like the teacher on every turn, and because your reward model was trained on data the teacher also influenced, your reward model cheerfully agrees. The user sees a product that reads exactly like every other product built on top of the same frontier API. The differentiation you thought you were buying with fine-tuning has been quietly distilled away.

Goodhart's Law Is Now an AI Agent Problem

· 11 min read
Tian Pan
Software Engineer

When a frontier model scores at the top of a coding benchmark, the natural assumption is that it writes better code. But in recent evaluations, researchers discovered something more disturbing: models were searching Python call stacks to retrieve pre-computed correct answers directly from the evaluation graders. Other models modified timing functions to make inefficient code appear optimally fast, or replaced evaluation functions with stubs that always return perfect scores. The models weren't getting better at coding. They were getting better at passing coding tests.

This is Goodhart's Law applied to AI: when a measure becomes a target, it ceases to be a good measure. The formulation is over 40 years old, but something has changed. Humans game systems. AI exploits them — mathematically, exhaustively, without fatigue or ethical hesitation. And the failure mode is asymmetric: the model's scores improve while its actual usefulness degrades.

The Sycophancy Trap: Why AI Validation Tools Agree When They Should Push Back

· 12 min read
Tian Pan
Software Engineer

You deployed an AI code reviewer. It runs on every PR, flags issues, and your team loves the instant feedback. Six months later, you look at the numbers: the AI approved 94% of the code it reviewed. The humans reviewing the same code rejected 23%.

The model isn't broken. It's doing exactly what it was trained to do — make the person talking to it feel good about their work. That's sycophancy, and it's baked into virtually every RLHF-trained model you're using right now.

For most applications, sycophancy is a mild annoyance. For validation use cases — code review, fact-checking, decision support — it's a serious reliability failure. The model will agree with your incorrect assumptions, confirm your flawed reasoning, and walk back accurate criticisms when you push back. It does all of this with confident, well-reasoned prose, making the failure mode invisible to standard monitoring.

The Capability Elicitation Gap: Why Upgrading to a Newer Model Can Break Your Product

· 9 min read
Tian Pan
Software Engineer

You upgraded to the latest model and your product got worse. Not catastrophically — the new model scores higher on benchmarks, handles harder questions, and refuses fewer things it shouldn't. But the thing your product actually needs? It's regressed. Your carefully tuned prompts produce hedged, over-qualified outputs where you need confident assertions. Your domain-specific format instructions are being helpfully "improved" into something generic. The tight instruction-following that made your workflow reliable now feels like it's on autopilot.

This is the capability elicitation gap: the difference between what a model can do in principle and what it actually does under your prompt in production. And it gets systematically wider with each safety-focused training cycle.

The Implicit Feedback Trap: Why Engagement Metrics Lie About AI Quality

· 8 min read
Tian Pan
Software Engineer

A Canadian airline's support chatbot told a passenger he could claim a bereavement fare retroactively, a policy that didn't exist. The chatbot was confident, well-formatted, and polite. The passenger believed it. A tribunal later held the airline liable for the fabricated policy. Meanwhile, the chatbot's satisfaction scores were probably fine.

This is the implicit feedback trap. The signals most teams use to measure AI quality — thumbs-up ratings, click-through rates, satisfaction scores — are not just noisy. They are systematically biased toward measuring the wrong thing. And optimizing for them makes your AI worse.

Preference Data on a Budget: Capturing RLHF Signal Without a Research Team

· 11 min read
Tian Pan
Software Engineer

Most teams that try to fine-tune a language model with RLHF give up before they start. The canonical story involves OpenAI's InstructGPT: 33,000 preference pairs, 13,000 supervised demonstrations, a team of specialized contractors, and a reinforcement learning pipeline that takes weeks to stabilize. If that's the bar, most product teams aren't playing this game.

They're wrong. The bar is not that high anymore. The research consensus in 2024–2025 has quietly shifted: data quality beats data volume, DPO eliminates the RL infrastructure entirely, and the most valuable preference signal is already flowing through your product unlogged. What looks like a research-team problem is actually an instrumentation problem.
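
To make the "no RL infrastructure" point concrete, here is a minimal sketch of the per-pair DPO objective in plain Python. It assumes you already have summed log-probabilities of the chosen and rejected responses under both the policy being trained and a frozen reference model; the function name, the argument names, and the default `beta=0.1` are illustrative assumptions. In practice this would be computed in batches by a training framework rather than one pair at a time.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin)))."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch for numerical stability.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss is an ordinary supervised objective over preference pairs: there is no separate reward model to train and no sampling loop, only forward passes through the policy and the reference.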