
4 posts tagged with "fairness"


Why Your Bias Eval Passes in CI and Fails in Deployment

· 10 min read
Tian Pan
Software Engineer

The fairness audit was a green checkmark in the release pipeline. The compliance team signed it off in March. The support tickets started landing in October — a cohort of users in a country the model had never been graded on, getting answers a fraction as useful as everyone else's. Nothing about the model had changed. The audit had never been wrong about the model. It had been wrong about the world.

This is the failure mode that no one wants to name out loud: a static bias eval is a snapshot of fairness in a stream that has already drifted. The eval was not lying when it ran. It was telling you a true thing about a distribution that no longer existed. By the time the support team has enough tickets to call it a pattern, the model has been unfair to that cohort for two quarters and the audit is a year stale.

Bias Monitoring Infrastructure for Production AI: Beyond the Pre-Launch Audit

· 10 min read
Tian Pan
Software Engineer

Your model passed its fairness review. Demographic parity was within acceptable bounds, equal opportunity metrics looked clean, and the audit report went into Confluence with a green checkmark. Three months later, a journalist has screenshots showing your system approves loans at half the rate for one demographic compared to another — and your pre-launch numbers were technically accurate the whole time.

This is the bias monitoring gap. Pre-launch fairness testing validates your model against datasets that existed when you ran the tests. Production AI systems don't operate in that static world. User behavior shifts, population distributions drift, feature correlations evolve, and disparities that weren't measurable at launch can become significant failure modes within weeks. The systems that catch these problems aren't part of most ML stacks today.
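
To make that gap concrete, here is a minimal sketch of the kind of check the post argues for: recompute demographic parity over a rolling window of logged production decisions and alert when the gap blows past a budget. It assumes your serving path logs each decision with a coarse demographic bucket; the field names, window size, and 10-point budget are illustrative assumptions, not a standard.

```python
# Rolling demographic-parity monitor over logged production decisions.
# Field names, WINDOW_SIZE, and GAP_BUDGET are illustrative assumptions.
from collections import deque
from dataclasses import dataclass


@dataclass
class Decision:
    group: str       # coarse demographic bucket attached by the logging pipeline
    approved: bool   # the model's decision for this request


def parity_gap(window):
    """Largest difference in approval rate between any two groups in the window."""
    rates = {}
    for g in {d.group for d in window}:
        members = [d for d in window if d.group == g]
        rates[g] = sum(d.approved for d in members) / len(members)
    return max(rates.values()) - min(rates.values()), rates


WINDOW_SIZE = 10_000            # most recent N production decisions
GAP_BUDGET = 0.10               # assumed SLO: approval rates within 10 points
window = deque(maxlen=WINDOW_SIZE)


def record(decision: Decision):
    window.append(decision)
    if len(window) == WINDOW_SIZE:
        gap, rates = parity_gap(window)
        if gap > GAP_BUDGET:
            # In a real system this pages the on-call and keeps running
            # counts instead of rescanning the window on every request.
            print(f"demographic parity gap {gap:.2f} exceeds budget: {rates}")
```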

Subgroup Fairness Testing in Production AI: Why Aggregate Accuracy Lies

· 11 min read
Tian Pan
Software Engineer

When a face recognition system reports 95% accuracy, your first instinct is to ship it. The instinct is wrong. That same system can simultaneously fail darker-skinned women at a 34% error rate while holding a 0.8% error rate on lighter-skinned men — a 40x disparity, fully hidden inside that reassuring aggregate number.

This is the aggregate accuracy illusion, and it destroys production AI features in domains ranging from hiring to healthcare to speech recognition. The pattern is structurally identical to Simpson's Paradox: a model that looks fair in aggregate can discriminate systematically across every meaningful subgroup simultaneously. Aggregate metrics are weighted averages. When some subgroups are smaller or underrepresented in your eval set, their failure rates get diluted by the majority's success.

The fix is not a different accuracy threshold. It is disaggregated evaluation — computing your performance metrics per subgroup, defining disparity SLOs, and monitoring them continuously in production the same way you monitor latency and error rate.
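
As a rough sketch of what disaggregated evaluation means in practice, assuming each eval example carries a subgroup label: compute the error rate per subgroup alongside the aggregate, then gate the release on a disparity SLO. The subgroup names, toy counts, and the 2x SLO below are illustrative, not numbers from the post.

```python
# Disaggregated evaluation sketch: per-subgroup error rates plus a disparity SLO.
# Subgroup names, toy counts, and DISPARITY_SLO are illustrative assumptions.
def error_rate(results):
    return sum(1 for correct in results if not correct) / len(results)


def disaggregate(eval_results):
    """eval_results: dict of subgroup -> list of per-example booleans (correct?)."""
    per_group = {g: error_rate(r) for g, r in eval_results.items()}
    overall = error_rate([x for r in eval_results.values() for x in r])
    worst, best = max(per_group.values()), min(per_group.values())
    disparity = worst / best if best > 0 else float("inf")
    return overall, per_group, disparity


# A toy eval set where a small subgroup's failures vanish into the aggregate.
eval_results = {
    "lighter_skinned_men":  [True] * 992 + [False] * 8,   # 0.8% error, 1000 examples
    "darker_skinned_women": [True] * 66 + [False] * 34,   # 34% error, 100 examples
}
overall, per_group, disparity = disaggregate(eval_results)
pretty = {g: f"{e:.1%}" for g, e in per_group.items()}
print(f"aggregate error {overall:.1%}, per subgroup {pretty}, disparity {disparity:.0f}x")

DISPARITY_SLO = 2.0  # assumed: worst subgroup no more than 2x the best
if disparity > DISPARITY_SLO:
    print("disparity SLO violated; block the release")
```

The aggregate error here lands under 4% while one subgroup fails a third of the time, which is the dilution the paragraph above describes.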

The Bias Audit You Keep Skipping: Engineering Demographic Fairness into Your LLM Pipeline

· 10 min read
Tian Pan
Software Engineer

A team ships an LLM-powered feature. It clears the safety filter. It passes the accuracy eval. Users complain. Six months later, a researcher runs a 3-million-comparison study and finds the system selected white-associated names 85% of the time and Black-associated names 9% of the time — on identical inputs.

This is not a safety problem. It's a fairness problem, and the two require entirely different engineering responses. Safety filters guard against harm. Fairness checks measure whether your system produces equally good outputs for everyone. A model can satisfy every content policy you have and still diagnose Black patients at higher mortality risk than equally sick white patients, or generate thinner resumes for women than for men. These disparities are invisible to the guardrail that blocked a slur.

Most teams never build the second check. This post is about why you should and exactly how to do it.
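
As a preview, here is one minimal sketch of what that second check could look like for a selection-style LLM pipeline, assuming you log which name-associated candidate the model picked in each comparison. The group labels and the four-fifths threshold are assumptions for illustration, not the methodology of the study above.

```python
# Selection-rate fairness check over logged pairwise picks from an LLM pipeline.
# Group labels and the 0.8 (four-fifths) threshold are illustrative assumptions.
from collections import Counter


def selection_rates(picks):
    """picks: list of group labels, one per comparison the model decided."""
    counts = Counter(picks)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def passes_fairness_check(picks, min_ratio=0.8):
    """Flag the pipeline if any group is selected far less often than the top group."""
    rates = selection_rates(picks)
    top = max(rates.values())
    return all(rate / top >= min_ratio for rate in rates.values()), rates


# Toy logged outcomes: which name group the model picked in each comparison.
picks = ["white_assoc"] * 85 + ["black_assoc"] * 9 + ["other"] * 6
ok, rates = passes_fairness_check(picks)
print(f"selection rates {rates}, fairness check {'passed' if ok else 'FAILED'}")
```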