
4 posts tagged with "fairness"


Why Your Bias Eval Passes in CI and Fails in Deployment

· 10 min read
Tian Pan
Software Engineer

The fairness audit was a green checkmark in the release pipeline. The compliance team signed it off in March. The support tickets started landing in October — a cohort of users in a country the model had never been graded on, getting answers a fraction as useful as everyone else's. Nothing about the model had changed. The audit had never been wrong about the model. It had been wrong about the world.

This is the failure mode that no one wants to name out loud: a static bias eval is a snapshot of fairness in a stream that has already drifted. The eval was not lying when it ran. It was telling you a true thing about a distribution that no longer existed. By the time the support team has enough tickets to call it a pattern, the model has been unfair to that cohort for two quarters and the audit is a year stale.

Bias Monitoring Infrastructure for Production AI: Beyond the Pre-Launch Audit

· 10 min read
Tian Pan
Software Engineer

Your model passed its fairness review. Demographic parity was within acceptable bounds, equal opportunity metrics looked clean, and the audit report went into Confluence with a green checkmark. Three months later, a journalist has screenshots showing your system approves loans at half the rate for one demographic compared to another — and your pre-launch numbers were technically accurate the whole time.

This is the bias monitoring gap. Pre-launch fairness testing validates your model against datasets that existed when you ran the tests. Production AI systems don't operate in that static world. User behavior shifts, population distributions drift, feature correlations evolve, and disparities that weren't measurable at launch can become significant failure modes within weeks. The systems that catch these problems aren't part of most ML stacks today.
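
To make that gap concrete, here is a minimal sketch of the kind of check the post argues for: recompute demographic parity over a rolling window of logged production decisions and alert when the gap blows past a budget. It assumes your serving path logs each decision with a coarse demographic bucket; the field names, window size, and 10-point budget are illustrative assumptions, not a standard.

```python
# Rolling demographic-parity monitor over logged production decisions.
# Field names, WINDOW_SIZE, and GAP_BUDGET are illustrative assumptions.
from collections import deque
from dataclasses import dataclass


@dataclass
class Decision:
    group: str       # coarse demographic bucket attached by the logging pipeline
    approved: bool   # the model's decision for this request


def parity_gap(window):
    """Largest difference in approval rate between any two groups in the window."""
    rates = {}
    for g in {d.group for d in window}:
        members = [d for d in window if d.group == g]
        rates[g] = sum(d.approved for d in members) / len(members)
    return max(rates.values()) - min(rates.values()), rates


WINDOW_SIZE = 10_000            # most recent N production decisions
GAP_BUDGET = 0.10               # assumed SLO: approval rates within 10 points
window = deque(maxlen=WINDOW_SIZE)


def record(decision: Decision):
    window.append(decision)
    if len(window) == WINDOW_SIZE:
        gap, rates = parity_gap(window)
        if gap > GAP_BUDGET:
            # In a real system this pages the on-call and keeps running
            # counts instead of rescanning the window on every request.
            print(f"demographic parity gap {gap:.2f} exceeds budget: {rates}")
```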

Subgroup Fairness Testing in Production AI: Why Aggregate Accuracy Lies

· 11 min read
Tian Pan
Software Engineer

When a face recognition system reports 95% accuracy, your first instinct is to ship it. The instinct is wrong. That same system can simultaneously fail darker-skinned women at a 34% error rate while holding a 0.8% error rate on lighter-skinned men — a 40x disparity, fully hidden inside that reassuring aggregate number.

This is the aggregate accuracy illusion, and it destroys production AI features in domains ranging from hiring to healthcare to speech recognition. The pattern is structurally identical to Simpson's Paradox: a model that looks fair in aggregate can discriminate systematically across every meaningful subgroup simultaneously. Aggregate metrics are weighted averages. When some subgroups are smaller or underrepresented in your eval set, their failure rates get diluted by the majority's success.

The fix is not a different accuracy threshold. It is disaggregated evaluation — computing your performance metrics per subgroup, defining disparity SLOs, and monitoring them continuously in production the same way you monitor latency and error rate.
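
As a rough sketch of what disaggregated evaluation means in practice, assuming each eval example carries a subgroup label: compute the error rate per subgroup alongside the aggregate, then gate the release on a disparity SLO. The subgroup names, toy counts, and the 2x SLO below are illustrative, not numbers from the post.

```python
# Disaggregated evaluation sketch: per-subgroup error rates plus a disparity SLO.
# Subgroup names, toy counts, and DISPARITY_SLO are illustrative assumptions.
def error_rate(results):
    return sum(1 for correct in results if not correct) / len(results)


def disaggregate(eval_results):
    """eval_results: dict of subgroup -> list of per-example booleans (correct?)."""
    per_group = {g: error_rate(r) for g, r in eval_results.items()}
    overall = error_rate([x for r in eval_results.values() for x in r])
    worst, best = max(per_group.values()), min(per_group.values())
    disparity = worst / best if best > 0 else float("inf")
    return overall, per_group, disparity


# A toy eval set where a small subgroup's failures vanish into the aggregate.
eval_results = {
    "lighter_skinned_men":  [True] * 992 + [False] * 8,   # 0.8% error, 1000 examples
    "darker_skinned_women": [True] * 66 + [False] * 34,   # 34% error, 100 examples
}
overall, per_group, disparity = disaggregate(eval_results)
pretty = {g: f"{e:.1%}" for g, e in per_group.items()}
print(f"aggregate error {overall:.1%}, per subgroup {pretty}, disparity {disparity:.0f}x")

DISPARITY_SLO = 2.0  # assumed: worst subgroup no more than 2x the best
if disparity > DISPARITY_SLO:
    print("disparity SLO violated; block the release")
```

The aggregate error here lands under 4% while one subgroup fails a third of the time, which is the dilution the paragraph above describes.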

The Bias Audit You Keep Skipping: Engineering Demographic Fairness into Your LLM Pipeline

· 10 min read
Tian Pan
Software Engineer

A team ships an LLM-powered feature. It clears the safety filter. It passes the accuracy eval. Users complain. Six months later, a researcher runs a 3-million-comparison study and finds the system selected white-associated names 85% of the time and Black-associated names 9% of the time — on identical inputs.

This is not a safety problem. It's a fairness problem, and the two require entirely different engineering responses. Safety filters guard against harm. Fairness checks measure whether your system produces equally good outputs for everyone. A model can satisfy every content policy you have and still diagnose Black patients at higher mortality risk than equally sick white patients, or generate thinner resumes for women than for men. These disparities are invisible to the guardrail that blocked a slur.

Most teams never build the second check. This post is about why you should and exactly how to do it.
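
As a preview, here is one minimal sketch of what that second check could look like for a selection-style LLM pipeline, assuming you log which name-associated candidate the model picked in each comparison. The group labels and the four-fifths threshold are assumptions for illustration, not the methodology of the study above.

```python
# Selection-rate fairness check over logged pairwise picks from an LLM pipeline.
# Group labels and the 0.8 (four-fifths) threshold are illustrative assumptions.
from collections import Counter


def selection_rates(picks):
    """picks: list of group labels, one per comparison the model decided."""
    counts = Counter(picks)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def passes_fairness_check(picks, min_ratio=0.8):
    """Flag the pipeline if any group is selected far less often than the top group."""
    rates = selection_rates(picks)
    top = max(rates.values())
    return all(rate / top >= min_ratio for rate in rates.values()), rates


# Toy logged outcomes: which name group the model picked in each comparison.
picks = ["white_assoc"] * 85 + ["black_assoc"] * 9 + ["other"] * 6
ok, rates = passes_fairness_check(picks)
print(f"selection rates {rates}, fairness check {'passed' if ok else 'FAILED'}")
```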