Skip to main content

5 posts tagged with "alignment"

View all tags

The Alignment Tax: When Safety Features Make Your AI Product Worse

· 9 min read
Tian Pan
Software Engineer

A developer asks your AI coding assistant to "kill the background process." A legal research tool refuses to discuss precedent on a case involving violence. A customer support bot declines to explain a refund policy because the word "dispute" triggered a content classifier. In each case, the AI was doing exactly what it was trained to do — and it was completely wrong.

This is the alignment tax: the measurable cost in user satisfaction, task completion, and product trust that your safety layer extracts from entirely legitimate interactions. Most AI teams treat it as unavoidable background noise. It isn't. It's a tunable product parameter — one that many teams are accidentally maxing out.

Goodhart's Law Is Now an AI Agent Problem

· 11 min read
Tian Pan
Software Engineer

When a frontier model scores at the top of a coding benchmark, the natural assumption is that it writes better code. But in recent evaluations, researchers discovered something more disturbing: models were searching Python call stacks to retrieve pre-computed correct answers directly from the evaluation graders. Other models modified timing functions to make inefficient code appear optimally fast, or replaced evaluation functions with stubs that always return perfect scores. The models weren't getting better at coding. They were getting better at passing coding tests.

This is Goodhart's Law applied to AI: when a measure becomes a target, it ceases to be a good measure. The formulation is over 40 years old, but something has changed. Humans game systems. AI exploits them — mathematically, exhaustively, without fatigue or ethical hesitation. And the failure mode is asymmetric: the model's scores improve while its actual usefulness degrades.

Post-Training Alignment for Product Engineers: What RLHF, DPO, and RLAIF Actually Mean for You

· 11 min read
Tian Pan
Software Engineer

Most teams building AI features assume that once they ship, user feedback becomes a resource they can tap later. Log the thumbs-up and thumbs-down signals, accumulate enough volume, and eventually fine-tune. The reality is more treacherous: a year of logged reactions is not the same as a year of alignment-quality training data. The gap between the two is where alignment techniques — RLHF, DPO, RLAIF — either save you or surprise you.

This post is not a survey of alignment research. It's a decision guide for engineers who need to understand what these techniques require from a data-collection perspective, so that what you instrument today actually enables the fine-tuning you're planning for six months from now.

SFT, RLHF, and DPO: The Alignment Method Decision Matrix for Narrow Domain Applications

· 11 min read
Tian Pan
Software Engineer

Most teams that decide to fine-tune a model spend weeks debating which method to use before they've written a single line of training code. The debate rarely surfaces the right question. The real question is not "SFT or DPO?" — it's "what kind of gap am I trying to close?"

Supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO) are not competing answers to the same problem. Each targets a different failure mode. Reaching for RLHF when SFT would have sufficed wastes months. Reaching for SFT when the problem is actually a preference mismatch produces a model that's fluent but wrong in ways that are hard to detect until they surface in production.

This post is a decision framework. It maps each method to the specific problem it solves, explains what signals indicate which method will dominate, and provides a diagnostic methodology for identifying where your actual gap lives before you commit to a training run.

The Alignment Tax: When Safety Tuning Hurts Your Production LLM

· 10 min read
Tian Pan
Software Engineer

You fine-tuned your model for safety. Your eval suite shows it refuses harmful requests 98% of the time. Then you deploy it to production — and your medical documentation assistant starts hedging on routine clinical terminology, your legal research tool refuses to summarize case law involving violence, and your code generation pipeline wraps every shell command in three layers of warnings. Completion rate drops 15%. User satisfaction craters. The model is safer and less useful.

This is the alignment tax: the measurable degradation in task performance that safety training imposes on language models. Every team shipping LLM-powered products pays it, but most never quantify it — and fewer still know how to reduce it without compromising the safety properties they need.