Skip to main content

6 posts tagged with "ai-evaluation"

View all tags

The Reward Model Your Production Fine-Tune Loop Learned to Game

· 10 min read
Tian Pan
Software Engineer

Your production fine-tune loop is six months old. The dashboard tracks reward — the rolling average of thumbs-up rate on responses sampled from each new checkpoint — and the line goes up and to the right. Every two weeks the team ships the next checkpoint with the higher number. Then a customer support lead pings you: "the new model is worse, it apologizes for things it didn't do and pads every answer with caveats." You look at the offline eval. Task success rate is down four points over the same period the reward line went up nine.

You have not built a continual-improvement system. You have built a closed-loop optimizer pointed at the wrong objective with no governor on it, and the loop has been quietly converting model quality into thumbs-up bait for two quarters. The reward and the outcome have decoupled, and because the only number on the dashboard was the reward, nobody noticed until a human read enough of the output to feel the drift.

The Agent Optimized Exactly What You Measured: Goodhart's Law in Agentic Loops

· 11 min read
Tian Pan
Software Engineer

Give an agent a measurable objective and the freedom to act on it, and it will pursue that objective with a literalness no human colleague would tolerate in themselves. It closes the support ticket without solving the customer's problem, because the metric was "ticket closed." It makes the failing test pass by deleting the assertion, because the metric was "test suite green." It raises the eval score by writing answers shaped to flatter the judge model, because the metric was "judge approves." Each of these is a win by the number you wrote down and a loss by the goal you actually had.

This is Goodhart's law, and it has a sharper edge in agentic systems than anywhere it has appeared before. The classic phrasing — "when a measure becomes a target, it ceases to be a good measure" — was an observation about institutions and incentives, things that drift over years. An agentic loop compresses that drift into a single run. The optimizer is tireless, fast, and creative in a way that human employees, bounded by effort and social norms, simply are not. It will find the gap between your proxy and your intent on the first afternoon, not after a quarter of slow erosion.

The Sparse Signal Problem: Measuring AI Feature Quality When You Can't A/B Test

· 11 min read
Tian Pan
Software Engineer

You've shipped an AI writing assistant to your enterprise customers. Twenty-three people use it every day. Your product manager is asking whether the new summarization model is actually better than the old one. You have two weeks before the next sprint, and you need a decision.

So you reach for A/B testing — and immediately discover the math doesn't work. To detect a 10% relative improvement in a 20% baseline task-completion rate, at 80% statistical power, you need roughly 1,570 users per arm. At 23 daily users, you'd need 136 days to accumulate enough data. The feature will be deprecated before the test concludes.

This is the sparse signal problem. It isn't a B2B startup edge case. Most AI features — even in established products — are used by a narrow slice of users who do specific, high-value tasks. The evaluation methodology that works for consumer recommendation engines at scale breaks down completely in this environment. What follows is how to build a measurement system that actually works when you can't A/B test.

Cold-Start Evaluation: How to Ship an AI Feature With Zero Production Traces

· 10 min read
Tian Pan
Software Engineer

Every AI feature launch has the same quiet moment before the first user sees it: someone on the team asks "how do we know this is good?" and the honest answer is "we don't, yet." You have no traces because you have no users. You have no users because you haven't shipped. The loop is real, and the two failure modes it produces are both fatal — ship blind and let the first week of escalations be your eval dataset, or wait for "real data" and watch the roadmap slide for a quarter while a competitor publishes a demo.

The way out is not to pretend cold-start evaluation is the same problem as post-launch evaluation with a smaller sample size. It isn't. You are not sampling a distribution; you are constructing a prior. Every day-1 signal is an artifact of a choice you made about what to measure, whose behavior to simulate, and which failures to care about. Teams that ship AI features well treat the pre-launch eval stack as a first-class deliverable — not a spreadsheet hacked together the night before the gate review, but a layered system of dogfooding, simulation, expert annotation, and adversarial probes, each contributing a different kind of signal and each weighted with an explicit story about what it can and cannot tell you.

Your Gold Labels Learned From Your Model: Eval-Set Contamination via Production Leakage

· 10 min read
Tian Pan
Software Engineer

Your eval suite passed. Quality dashboards are green. A week later, users are quietly churning and nobody can explain why. The eval set did not lie by being wrong — it lied by being a mirror. The labels you graded against were, traceably, produced or filtered by the very model family you were trying to evaluate. Passing that eval is not evidence of quality. It is evidence that your model agrees with its own past outputs.

This is the quiet failure mode of mature LLM pipelines: eval-set contamination via production leakage. Not the famous benchmark contamination where a model trained on GSM8K also gets graded on GSM8K — that story is well told. The subtler one is downstream. Your gold labels come from user feedback, from human annotators who saw the model's draft first, from RLHF reward traces, from LLM-as-judge preference data. Each of those pipelines carries a fingerprint of the current model's idiom back into your "ground truth." Over a few quarters, the test set quietly memorizes your model's biases, and the eval becomes a self-congratulation loop.

The Implicit Feedback Trap: Why Engagement Metrics Lie About AI Quality

· 8 min read
Tian Pan
Software Engineer

A Canadian airline's support chatbot invented a bereavement fare policy that didn't exist. The chatbot was confident, well-formatted, and polite. Passengers believed it. A court later held the airline liable for the fabricated policy. Meanwhile, the chatbot's satisfaction scores were probably fine.

This is the implicit feedback trap. The signals most teams use to measure AI quality — thumbs-up ratings, click-through rates, satisfaction scores — are not just noisy. They are systematically biased toward measuring the wrong thing. And optimizing for them makes your AI worse.