
Subgroup Fairness Testing in Production AI: Why Aggregate Accuracy Lies

· 11 min read
Tian Pan
Software Engineer

When a face recognition system reports 95% accuracy, your first instinct is to ship it. That instinct is wrong. The same system can simultaneously fail darker-skinned women at a 34% error rate while achieving 0.8% on lighter-skinned men — a disparity of more than 40x, fully hidden inside that reassuring aggregate number.

This is the aggregate accuracy illusion, and it has undermined production AI features in industries from hiring to healthcare to speech recognition. The pattern is structurally similar to Simpson's Paradox: a model that looks fair in aggregate can discriminate systematically across every meaningful subgroup simultaneously. Aggregate metrics are weighted averages. When some subgroups are smaller or underrepresented in your eval set, their failure rates get diluted by the majority's success.

The fix is not a different accuracy threshold. It is disaggregated evaluation — computing your performance metrics per subgroup, defining disparity SLOs, and monitoring them continuously in production the same way you monitor latency and error rate.

The Scale of the Problem

The evidence that aggregate metrics hide real harm is not theoretical. It is empirical and replicated across every AI application domain.

Speech recognition. A PNAS study testing five major commercial ASR systems (Amazon, Apple, Google, IBM, Microsoft) found average word error rates of 35% for Black speakers versus 19% for white speakers — nearly a 2x gap. A 2024 Georgia Tech study extended this to other minority English dialects, finding Spanglish and Chicano English speakers at the bottom, with Whisper performing better on AAVE specifically because it had more inclusive training data. The root cause is not vocabulary differences but acoustic model failure on phonological features of African American Vernacular English. The aggregate reported accuracy for each system looked fine.

Healthcare scoring. The Optum risk algorithm (documented in Science, 2019) used healthcare costs as a proxy for medical need. Because Black patients historically received less care, the algorithm ranked them as healthier than white patients with the same objective illness severity — by a factor of 3.25x. Pulse oximeters, ubiquitous in ICUs, report falsely elevated readings on darker skin tones: Black ICU patients experience nearly 3x the rate of undetected hypoxemia compared to white patients. Both systems had high aggregate accuracy metrics.

Resume screening. Brookings Institution research on AI resume screening found that white-associated names were preferred over Black-associated names in 85.1% of test cases, with equal treatment in only 6.3%. Aggregate acceptance rate metrics looked balanced.

Large language models. Models scoring 70%+ on English-language MMLU-ProX benchmarks drop to ~40% accuracy on Swahili — a 30-point gap. African languages (Wolof, Yoruba, Zulu) show even more severe degradation. The "multilingual" model label obscures performance that collapses for the billions of people who do not speak English.

These are not edge cases. They are the modal outcome when teams ship AI features without per-subgroup evaluation.

Why Aggregate Metrics Fail

The mathematical reason aggregate metrics deceive you is straightforward: if your evaluation set has 90% majority-group examples and 10% minority-group examples, and your model has 97% accuracy on the majority and 50% accuracy on the minority, your aggregate accuracy is 0.9 × 0.97 + 0.1 × 0.50 = 92.3%. That looks acceptable. The 50% minority performance, which is coin-flip accuracy, is invisible.
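The arithmetic above is worth making concrete. This sketch uses the same illustrative group weights and accuracies as the example in the text:

```python
# Sketch of the aggregate-accuracy illusion. Group weights and per-group
# accuracies are illustrative, matching the example in the text.
groups = {
    "majority": {"weight": 0.90, "accuracy": 0.97},
    "minority": {"weight": 0.10, "accuracy": 0.50},
}

# The aggregate is just a weighted average, so the minority's
# coin-flip accuracy is diluted by the majority's success.
aggregate = sum(g["weight"] * g["accuracy"] for g in groups.values())
print(f"aggregate accuracy: {aggregate:.1%}")  # 92.3%
for name, g in groups.items():
    print(f"{name}: {g['accuracy']:.0%}")
```

The 92.3% headline number is the only thing most dashboards report; the 50% row never surfaces unless you compute it per group.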

The problem compounds when you add intersectional subgroups. A Black woman denied a loan because of compounded race and gender bias may find that neither the race metric nor the gender metric individually captures her harm. Single-attribute disaggregation misses intersectional failures — you need sliced analysis across combinations.

There is also a training data feedback loop. Healthcare algorithms trained on historical treatment data inherit the undertreatment of minority populations as signal. Hiring algorithms trained on historical resume selections inherit the preferences of prior decision-makers. The training data encodes structural inequality and the aggregate metric reports it as objective truth.

The Subgroup Evaluation Methodology

Subgroup fairness testing is not complicated to implement. It requires three things: the right metrics, representative evaluation data, and a commitment to running this analysis before every production deployment.

Step 1: Define your subgroups explicitly. The relevant subgroups depend on your application domain. For vision systems: skin tone (using a 10-point Monk Skin Tone scale rather than the coarser 6-point Fitzpatrick), gender presentation, age range. For speech: dialect, native language, accent region. For NLP: language, reading level, cultural context. For systems that make consequential decisions: race, gender, socioeconomic indicators where legally permissible. Write these down. Make them part of your eval spec before you start building.

Step 2: Choose the right fairness metric for your domain. Several options exist, and they measure different properties:

  • Demographic parity difference: the gap in positive outcome rates between groups. Use this for hiring, loan approval, content recommendation — any decision where equal selection rates across groups is the goal.
  • Equalized odds difference: the maximum gap in both true positive rate and false positive rate across groups. Use this when misclassification costs matter and you need both types of errors to be equally distributed.
  • Disparate impact ratio: the ratio of positive outcome rates between groups. The EEOC-derived four-fifths rule treats a ratio below 0.8 (or, symmetrically, above 1.25) as evidence of adverse impact in employment decisions.
  • Recall parity: equal sensitivity across groups. Use this for medical diagnosis, fraud detection, safety systems — anywhere a missed positive is the dominant harm.
  • Calibration: whether predicted probabilities are equally well-calibrated across groups. A model can be aggregate-calibrated while being systematically overconfident for one group and underconfident for another.

Pick the metric that corresponds to the actual harm model for your feature. A feature that gives loan approvals has a different harm structure than one that triages emergency messages.
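To make the first two metrics concrete, here is a minimal pure-Python sketch computed from per-group positive-outcome rates. The group names and rates are hypothetical; in practice you would use a tested library implementation such as Fairlearn's:

```python
# Minimal sketches of two fairness metrics, computed from per-group
# positive-outcome rates. Illustrative only; production code should use
# a tested library (e.g. Fairlearn's metrics module).

def demographic_parity_difference(rates: dict[str, float]) -> float:
    """Gap between the highest and lowest positive-outcome rate."""
    return max(rates.values()) - min(rates.values())

def disparate_impact_ratio(rates: dict[str, float]) -> float:
    """Ratio of the lowest to the highest positive-outcome rate."""
    return min(rates.values()) / max(rates.values())

# Hypothetical selection rates per group.
rates = {"group_a": 0.40, "group_b": 0.30}
print(round(demographic_parity_difference(rates), 2))  # 0.1
print(round(disparate_impact_ratio(rates), 2))  # 0.75, below the 0.8 floor
```

Note that the ratio form (0.75) fails the four-fifths rule while the difference form (0.10) might pass a naive threshold, which is exactly why the metric must match the harm model.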

Step 3: Build a representative evaluation dataset. This is the hard part. Your existing eval set was probably not constructed with subgroup coverage in mind. You need minimum subgroup sample sizes to get reliable estimates — at minimum N≥100 per subgroup for stable metric estimation, N≥500 for intersectional slices. For sparse intersectional groups, report uncertainty intervals alongside point estimates and flag groups with insufficient sample sizes as under-monitored rather than pretending their metrics are reliable.
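One way to operationalize the "report uncertainty, flag sparse groups" rule is a Wilson score interval per subgroup, with the N≥100 floor from the text. A sketch, with hypothetical eval counts:

```python
# Sketch: attach a Wilson score interval to each subgroup's accuracy
# estimate and flag subgroups below the minimum sample size as
# under-monitored. The N >= 100 floor follows the text.
from math import sqrt

def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a proportion."""
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

MIN_N = 100  # minimum subgroup size for a stable point estimate

def report(subgroups: dict[str, tuple[int, int]]) -> dict[str, dict]:
    out = {}
    for name, (correct, n) in subgroups.items():
        lo, hi = wilson_interval(correct, n)
        out[name] = {"acc": correct / n, "ci": (lo, hi), "reliable": n >= MIN_N}
    return out

# Hypothetical eval counts: (correct, total) per subgroup.
print(report({"group_a": (930, 1000), "group_b": (24, 40)}))
```

The small group's wide interval is the honest answer; a bare 60% point estimate for 40 samples is not.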

If you cannot collect demographic metadata from users, you have several options: commission representative evaluation datasets with known demographic distributions, use proxy signals where appropriate, or partner with domain experts who can construct coverage-aware eval sets. What you cannot do is skip this and claim your model is fair.

Step 4: Make subgroup evals blocking gates. Fairlearn's MetricFrame computes any sklearn-compatible metric disaggregated by a sensitive feature column. Run it in CI. Write policy assertions — equalized_odds_difference < 0.05, disparate_impact_ratio in (0.8, 1.25) — as required passing conditions before model promotion. Treat a disparity threshold violation the same as a failing unit test: the build does not ship.
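The gate logic itself is only a few lines. In this sketch the disparity values are passed in directly; in a real pipeline they would come from a per-subgroup eval run such as a Fairlearn MetricFrame. The thresholds mirror the policy assertions above:

```python
# Sketch of a blocking CI gate. Disparity values would come from a
# per-subgroup eval run; thresholds mirror the policy assertions above.

def fairness_gate(equalized_odds_diff: float, disparate_impact: float) -> None:
    """Raise (failing the build) when any disparity threshold is violated."""
    failures = []
    if equalized_odds_diff >= 0.05:
        failures.append(f"equalized_odds_difference={equalized_odds_diff:.3f} >= 0.05")
    if not (0.8 < disparate_impact < 1.25):
        failures.append(f"disparate_impact_ratio={disparate_impact:.3f} outside (0.8, 1.25)")
    if failures:
        raise AssertionError("fairness gate failed: " + "; ".join(failures))

fairness_gate(0.03, 0.92)  # passes silently: model may promote
try:
    fairness_gate(0.11, 0.92)  # equalized odds violation: build fails
except AssertionError as exc:
    print(exc)
```

Raising an exception, rather than logging a warning, is the point: the failure mode of a non-blocking fairness check is that everyone ignores it.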

Disparity SLOs in Production

Pre-deployment evals catch the model you trained. They do not catch what happens when your production data distribution shifts and new subgroups emerge or existing subgroups degrade silently.

Define disparity SLOs the same way you define latency SLOs:

  • SLI: the metric, computed per subgroup. Example: equalized odds difference across gender subgroups, computed on a rolling 30-day window of production predictions.
  • SLO: the threshold that must be maintained. Example: equalized odds difference must remain below 0.05 over any rolling 30-day window.
  • Alert budget: the allowable headroom before alerting. If your SLO is 0.05, alert at 0.04 to give time to investigate before violation.
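The SLI/SLO/alert-budget structure above can be sketched as a small evaluation function. The window here is a list of daily equalized odds differences; the 0.05 SLO and 0.04 alert threshold follow the example values:

```python
# Sketch of disparity SLO evaluation over a rolling window.
# `window_metrics` would be the equalized odds difference computed per
# day over the last 30 days; thresholds follow the example above.
from statistics import mean

SLO = 0.05       # hard threshold: violation
ALERT_AT = 0.04  # alert budget: investigate before violation

def evaluate_slo(window_metrics: list[float]) -> str:
    """Return 'ok', 'alert', or 'violation' for a rolling window."""
    current = mean(window_metrics)
    if current >= SLO:
        return "violation"
    if current >= ALERT_AT:
        return "alert"
    return "ok"

print(evaluate_slo([0.020, 0.030, 0.025]))  # ok
print(evaluate_slo([0.045, 0.042, 0.044]))  # alert fires before violation
```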

Monitor disparity SLOs alongside your accuracy, latency, and error rate SLOs. They belong in the same dashboard. Fiddler AI and Arize both support out-of-the-box fairness metric dashboards with demographic parity, recall parity, and disparate impact ratio computed continuously on production traffic.

The minimum coverage problem will hit you in production. As you slice production traffic by multiple demographic attributes simultaneously, intersectional cells get sparse fast. A production system serving 1M users across 5 gender categories × 5 racial groups × 4 age brackets yields 100 cells. Many will have insufficient volume for statistically reliable metric estimation. Handle this the same way you handle low-traffic endpoints in SRE: aggregate to coarser granularity for SLO enforcement, flag sparse cells for dataset augmentation, and do not silently treat sparse-subgroup metrics as reliable.
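The coarsening fallback can be sketched directly: enforce the SLO at intersectional granularity where volume allows, roll sparse cells up to a single attribute, and flag them for data collection. The cell keys, counts, and the 500-sample floor here are illustrative:

```python
# Sketch of the sparse-cell fallback: enforce SLOs on well-populated
# intersectional cells, aggregate sparse cells up to the first attribute,
# and flag them for dataset augmentation. All values are illustrative.
from collections import defaultdict

MIN_CELL_N = 500  # illustrative minimum volume per intersectional cell

def enforceable_slices(cells: dict[tuple[str, str], int]) -> dict:
    """Split cells into directly enforceable ones and coarse fallbacks."""
    fine = {k: n for k, n in cells.items() if n >= MIN_CELL_N}
    sparse = {k: n for k, n in cells.items() if n < MIN_CELL_N}
    coarse = defaultdict(int)
    for (attr1, _attr2), n in sparse.items():
        coarse[attr1] += n  # roll sparse cells up to coarser granularity
    return {"fine": fine, "coarse": dict(coarse),
            "flag_for_collection": list(sparse)}

cells = {("woman", "18-25"): 1200, ("woman", "65+"): 80, ("man", "65+"): 90}
print(enforceable_slices(cells))
```

The crucial property is that sparse cells never silently disappear: they either roll up into an enforceable coarse slice or appear on the data-collection list.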

Remediation Patterns

When a disparity SLO violation fires, the fix depends on the root cause:

Data underrepresentation: If the underperforming subgroup was underrepresented in training data, the highest-ROI fix is to collect more data from that subgroup and retrain. This addresses the root cause directly.

Reweighting: Upweight examples from underperforming subgroups during training. Faster than data collection but doesn't add new information.

Threshold optimization (post-processing): Fairlearn's ThresholdOptimizer adjusts per-subgroup decision thresholds to achieve parity without retraining the underlying model. This is the fastest remediation path and works when you can't retrain on short notice.
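The core idea behind post-processing, separate decision thresholds per group, can be sketched in a few lines. This toy version just matches selection rates across groups; Fairlearn's ThresholdOptimizer optimizes thresholds against the chosen fairness constraint, which is more principled than this sketch:

```python
# Toy sketch of per-group threshold adjustment: pick each group's
# threshold so roughly the same fraction of that group passes. A real
# implementation optimizes against a fairness constraint instead.

def per_group_thresholds(scores_by_group: dict[str, list[float]],
                         target_rate: float) -> dict[str, float]:
    """Threshold each group so ~target_rate of its scores pass."""
    thresholds = {}
    for group, scores in scores_by_group.items():
        ranked = sorted(scores, reverse=True)
        k = max(1, round(target_rate * len(ranked)))
        thresholds[group] = ranked[k - 1]  # score of the k-th best example
    return thresholds

# Hypothetical model scores per group; group_b's scores skew lower.
scores = {
    "group_a": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05],
    "group_b": [0.6, 0.55, 0.5, 0.45, 0.4, 0.35, 0.3, 0.25, 0.2, 0.1],
}
print(per_group_thresholds(scores, target_rate=0.3))  # {'group_a': 0.7, 'group_b': 0.5}
```

A single global threshold of 0.7 would select 30% of group_a and 0% of group_b here; the per-group thresholds equalize selection rates without touching the model.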

In-processing constraints: For planned retraining cycles, Fairlearn's Exponentiated Gradient algorithm trains with explicit fairness constraints (equalized odds, demographic parity). This produces a model that satisfies the constraint by construction rather than patching thresholds afterward.

One important warning: removing sensitive features from the model does not remove bias. Models learn to use proxy variables — zip codes for race, names for gender, job titles for age. Fairness testing on outputs is not optional just because you removed the demographic column from the feature set.

The Intersection Problem

Single-attribute disaggregation misses intersectional failures. This is not a theoretical concern. A Black woman denied a loan due to compounded race-and-gender bias may look fine on both the gender fairness metric and the race fairness metric individually. The harm is only visible in the intersectional slice.

The challenge is statistical: you need enough samples per intersectional cell for reliable estimation. With coarse demographic categories (3 races × 2 genders × 3 age buckets = 18 cells), you need 18x the sample volume per metric estimate compared to aggregate analysis.

Size-adaptive hypothesis testing approaches handle this more rigorously. For cells above a size threshold, standard confidence intervals work. For sparse cells, Bayesian estimators with Dirichlet-multinomial priors give reliable uncertainty quantification. The key is to never report sparse intersectional metrics without uncertainty intervals, and to actively prioritize data collection for the intersectional slices where you have insufficient sample coverage.
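For a binary outcome, the Dirichlet-multinomial posterior reduces to a Beta-Binomial, which is enough to sketch the sparse-cell estimator. This version uses a Monte Carlo credible interval from the standard library to avoid a SciPy dependency; the Jeffreys prior (a = b = 0.5) is one common weakly informative choice, not something prescribed by the text:

```python
# Sketch of a Bayesian fallback for sparse cells: with a Beta prior on
# per-cell accuracy, the posterior after `correct` successes in `n`
# trials is Beta(a + correct, b + n - correct). Monte Carlo credible
# interval via the stdlib; Jeffreys prior (0.5, 0.5) is an assumption.
import random

def beta_credible_interval(correct: int, n: int, a: float = 0.5,
                           b: float = 0.5, draws: int = 20000,
                           seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a + correct, b + n - correct)
                     for _ in range(draws))
    return samples[int(0.025 * draws)], samples[int(0.975 * draws)]

# A sparse intersectional cell: 7 correct out of 9 samples.
lo, hi = beta_credible_interval(7, 9)
print(f"accuracy {7/9:.2f}, 95% credible interval ({lo:.2f}, {hi:.2f})")
```

The interval is wide, and that width is the signal: it tells you this cell needs data collection, not a dashboard number presented as if it were reliable.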

What Good Looks Like

A production AI system with mature subgroup fairness practice looks like this:

Before deployment, a model card documents per-subgroup performance across all defined sensitive attributes, known gaps, and limitations. Subgroup eval thresholds are blocking gates in CI. The eval dataset has documented coverage requirements with minimum sample sizes per subgroup.

In production, fairness SLIs are computed on rolling windows of production traffic and displayed alongside accuracy and latency in the same observability dashboard. Alerts fire when disparity metrics approach SLO thresholds, not after violation. Sparse subgroup cells are flagged as under-monitored. Remediation runbooks exist for each failure mode.

When a disparity SLO fires, there is a triage process — the same systematic diagnosis tree you apply to latency or error rate incidents — that identifies whether the cause is data drift, distribution shift, a model regression, or an evaluation coverage gap. The on-call engineer knows what to do.

The teams with this practice in place catch demographic disparities in staging. The teams without it discover them from user complaints, regulatory inquiries, or journalists. The technical investment is the same either way. The timing is not.

The Practical Starting Point

If you have no subgroup fairness practice today, start here:

Define the three to five most important subgroups for your feature. Pick the fairness metric that matches your harm model. Add MetricFrame from Fairlearn to your existing eval pipeline — it takes fewer than 20 lines of code. Set a threshold. Make it block promotion.

You will immediately discover that you do not have enough demographic metadata in your eval set to measure what you care about. That discovery is valuable. It tells you exactly what to collect next. The alternative is shipping without knowing, and finding out later that your aggregate 94% accuracy was built on 40% accuracy for the users who needed the feature most.
