The RLAIF Doom Loop: When Your Cheapest Feedback Signal Quietly Poisons Your Fine-Tune
A team I talked to last quarter shipped four rounds of preference fine-tuning in eight weeks. Every round, their offline win rate against the previous checkpoint went up. Every round, their LLM-as-judge confirmed the model was getting better. Every round, their retention curve sagged a little harder. By round four, the judge preferred the latest checkpoint over the v0 baseline 71% of the time; users were churning 9% faster than before the team started. That's the RLAIF doom loop in one paragraph, and the brutal part is: nothing in the team's pipeline was technically wrong.
Reinforcement Learning from AI Feedback — using a stronger model to generate the preference labels you would otherwise pay humans for — is one of the most economically defensible decisions in modern post-training. AI-generated labels run under a cent each; human labels run a dollar or more, and often ten times that for domain-specialized work. At preference-dataset scale (hundreds of thousands of pairs), that's the difference between a six-figure budget and a five-figure one. Published RLAIF benchmarks show win rates statistically indistinguishable from RLHF on summarization and dialogue tasks. The math says swap.
The math is right about the unit cost and wrong about what you're buying. You are not buying preference data. You are buying the judge's preferences, projected onto your data — and over multiple training rounds, that distinction is the difference between alignment with users and alignment with another model's aesthetic.
