3 posts tagged with "bias"

The LLM-as-Judge Ensemble That Agreed Because All Judges Were the Same Family

June 3, 2026 · 10 min read

Software Engineer

Your evaluation pipeline runs a three-judge ensemble against every model output. The judges are GPT-4 with a strict rubric, GPT-4 with a permissive rubric, and GPT-4 with a chain-of-thought rubric. They agree on 91% of cases. You report a Krippendorff's alpha of 0.83 for inter-judge agreement to the launch review committee. The number lands in the "substantial agreement" band that every methodology textbook treats as a green light. Three model upgrades ship against that number over six months.

An external auditor swaps one of the three judges for Claude, keeping the same rubric, and the agreement rate on hard cases drops to 64%. The eval score that justified the last three upgrades turns out to depend on which provider family you treat as ground truth. The upgrades were upgrades against GPT-4 family preferences, not against quality, because the judges were siblings of the model being judged.
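One way to catch this earlier is to break agreement down by judge pair instead of reporting a single ensemble number. The sketch below is a minimal illustration, not the auditor's method: the judge names and verdicts are made up, and it uses raw percent agreement rather than Krippendorff's alpha to keep the arithmetic visible.

```python
from itertools import combinations

def pairwise_agreement(labels_by_judge):
    """Return {(judge_a, judge_b): fraction of items where both judges gave the same label}."""
    rates = {}
    for a, b in combinations(sorted(labels_by_judge), 2):
        la, lb = labels_by_judge[a], labels_by_judge[b]
        if not la or len(la) != len(lb):
            raise ValueError(f"{a} and {b} must label the same non-empty item set")
        matches = sum(x == y for x, y in zip(la, lb))
        rates[(a, b)] = matches / len(la)
    return rates

# Hypothetical verdicts on eight hard cases (illustrative data, not from the post).
verdicts = {
    "gpt4_strict":     ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"],
    "gpt4_permissive": ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail"],
    "claude_strict":   ["pass", "fail", "fail", "fail", "pass", "pass", "fail", "fail"],
}

rates = pairwise_agreement(verdicts)
for pair, rate in rates.items():
    print(pair, f"{rate:.3f}")
```

If the same-family pair agrees far more often than either cross-family pair, the headline agreement number is measuring shared family preferences at least as much as output quality.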

The AI A/B Test That Lied: Novelty, Carryover, and Anchoring Bias in LLM Experiments

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

Your AI feature shipped with confidence. The A/B test showed a statistically significant 12% lift in user engagement. The confidence intervals didn't overlap. The sample size was right. The p-value was comfortably under 0.05. Six weeks later, the metric has flatlined back to baseline. Three months in, it's actually below baseline. The experiment told you the feature worked. The experiment lied.

This isn't a bug in your statistical tooling. It's a fundamental mismatch between what standard A/B testing measures and what happens when humans interact with probabilistic AI systems over time. Three specific biases — novelty inflation, anchoring, and carryover — conspire to inflate every AI feature experiment, and the standard remedy of adding a holdout group doesn't fix any of them.
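A quick diagnostic for the novelty part of this story is to compute the lift per week instead of once over the whole test window. The sketch below assumes hypothetical weekly engagement rates and an arbitrary "late lift is under half of early lift" rule of thumb; it only surfaces novelty decay and says nothing about anchoring or carryover.

```python
def weekly_lift(control, treatment):
    """Relative lift per week: (treatment - control) / control."""
    return [(t - c) / c for c, t in zip(control, treatment)]

def novelty_suspected(lifts, decay_threshold=0.5):
    """Flag the test when the mean lift in the second half of the window falls
    below decay_threshold times the mean lift in the first half (assumed heuristic)."""
    half = len(lifts) // 2
    early = sum(lifts[:half]) / half
    late = sum(lifts[half:]) / (len(lifts) - half)
    return early > 0 and late < decay_threshold * early

# Hypothetical weekly engagement rates over a six-week test (illustrative only).
control_engagement = [0.200, 0.200, 0.200, 0.200, 0.200, 0.200]
treatment_engagement = [0.224, 0.218, 0.212, 0.206, 0.202, 0.200]

lifts = weekly_lift(control_engagement, treatment_engagement)
flag = novelty_suspected(lifts)
print([round(x, 3) for x in lifts], flag)
```

A pooled analysis of the same data would still report a positive average lift, which is how a decaying effect can pass a significance test and then vanish.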

The Bias Audit You Keep Skipping: Engineering Demographic Fairness into Your LLM Pipeline

April 16, 2026 · 10 min read

Tian Pan

Software Engineer

A team ships an LLM-powered feature. It clears the safety filter. It passes the accuracy eval. Users complain. Six months later, a researcher runs a 3-million-comparison study and finds the system selected white-associated names 85% of the time and Black-associated names 9% of the time — on identical inputs.

This is not a safety problem. It's a fairness problem, and the two require entirely different engineering responses. Safety filters guard against harm. Fairness checks measure whether your system produces equally good outputs for everyone. A model can satisfy every content policy you have and still assign Black patients a higher mortality risk than equally sick white patients, or generate thinner resumes for women than for men. These disparities are invisible to the guardrail that blocked a slur.

Most teams never build the second check. This post is about why you should and exactly how to do it.
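As a rough idea of what the second check can look like, the sketch below tallies which name group a system selects across trials with otherwise identical inputs and reports the ratio between two groups. The group labels and the trial data are hypothetical, chosen only to mirror the 85% and 9% figures above; a real audit would need many more comparisons and a proper significance test.

```python
from collections import Counter

def selection_rates(selections):
    """Fraction of trials in which each group was selected."""
    if not selections:
        raise ValueError("need at least one trial")
    counts = Counter(selections)
    return {group: n / len(selections) for group, n in counts.items()}

def disparity_ratio(rates, group_a, group_b):
    """Selection rate of group_a divided by that of group_b; 1.0 means parity."""
    return rates[group_a] / rates[group_b]

# Hypothetical outcomes of 100 paired comparisons on identical inputs.
trials = (
    ["white_associated"] * 85
    + ["black_associated"] * 9
    + ["no_selection"] * 6
)

rates = selection_rates(trials)
ratio = disparity_ratio(rates, "black_associated", "white_associated")
print(rates, round(ratio, 3))
```

A content filter never sees this number, because no single output in the batch is objectionable on its own; the disparity only exists in the aggregate.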
