Locale-Stratified Evals: How to Catch Non-English Regressions Your English Test Set Can't See
Your aggregate eval score is up 1.2 points after the last prompt change. Your CSAT on French queries dropped four points the same week. Both numbers are correct. They disagree because the eval set is 88% English, 6% Spanish, and a long tail of everything else, and no tail locale carries enough weight to move the rollup: a four-point drop on a slice that is 2% of the set shifts the aggregate by 0.08 points, well inside run-to-run noise. The French regression is in your data; it is just sitting below the noise floor of your top-line metric.
This is the most common shape of locale drift I see in production AI systems: not a sudden collapse, not a translated-string bug, but a steady performance gap that the rollup hides and the support queue eventually surfaces. By the time someone in the Paris office forwards a screenshot, you have shipped two more prompt changes on top of the regression and the bisect costs three engineering days.
The fix is not "translate more eval cases." That is necessary, but on its own it makes the dashboard greener without making the product better, because a translated eval set inherits all the English-centric assumptions of the original — including the ones that broke. The discipline is to stratify eval reporting by locale, calibrate judges per language, weight regressions by traffic share, and treat each major locale as its own gated channel. The rest of this post walks through what each of those means and where they tend to fail.
Why English-Centric Evals Hide Non-English Failures
The English-first eval set has a structural blind spot that compounds across four independent failure modes, and any one of them is enough to mask a real regression.
The first is tokenizer asymmetry. Ordinary English text tokenizes at roughly 1.3 tokens per word (about 0.75 words per token); Japanese, Chinese, Korean, and Arabic frequently tokenize at 2–3x that density. The same logical prompt, with the same instructions, few-shot examples, and context budget, leaves a Japanese query with substantially less room for the actual question and the actual answer. Attention degrades faster across that compressed working space, instruction-following gets noisier, and the long-context behavior you validated in English does not transfer. Your English eval cases will never reveal this because they are not the ones running out of budget.
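To make the asymmetry concrete, here is a minimal sketch that counts tokens for the same logical request in three languages. It uses tiktoken's cl100k_base encoding as a stand-in for whatever tokenizer your serving model actually uses, and the sample strings are illustrative, not from any benchmark.

```python
# Count tokens for the same logical request across locales.
# cl100k_base is a stand-in; swap in your model's real tokenizer.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "en": "Please summarize the attached meeting notes in three bullet points.",
    # Japanese rendering of the same request.
    "ja": "添付された会議メモを3つの箇条書きで要約してください。",
    # German rendering of the same request.
    "de": "Bitte fassen Sie die beigefügten Besprechungsnotizen in drei Stichpunkten zusammen.",
}

for locale, text in samples.items():
    n = len(enc.encode(text))
    print(f"{locale}: {n} tokens for the same logical request")
```

Run this over a sample of real prompts per locale and you get the context-budget headroom numbers that the English eval set will never show you.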
The second is latent-space bias. Most production frontier models are trained on a corpus that is overwhelmingly English, plus a long tail of high-resource languages. The internal representation is English-shaped, and non-English queries often pass through a translation-like intermediate step before reasoning. That step introduces its own losses, particularly for languages that build meaning by compounding: Japanese place names, German technical compounds, Chinese 父母 ("parents", from "father" plus "mother"). Reasoning quality drops in ways that look like the model "just being worse" but are actually language-specific.
The third is instruction-retention drift. The widely-cited finding from BenchMAX-style multilingual evaluation is that English performance stays stable across 6–10 turn conversations while non-English performance degrades materially. The model loses the constraint earlier, ignores the formatting rule earlier, drops the persona earlier. Almost no English eval is structured to catch a sixth-turn instruction-retention failure, because the failure does not show up in English.
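A retention probe for this is mechanically simple to build. The sketch below seeds a checkable constraint at turn 1 and records the first turn where the reply violates it; `chat` is a placeholder for your model client, and the three-bullet rule is just one example of a constraint that can be verified without a judge.

```python
def first_violation_turn(chat, user_turns, max_turns=8):
    # chat(messages) -> assistant reply string; placeholder for your client.
    messages = [{
        "role": "system",
        "content": "Every reply must be exactly three bullet points.",
    }]
    for turn, user_msg in enumerate(user_turns[:max_turns], start=1):
        messages.append({"role": "user", "content": user_msg})
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        bullets = [ln for ln in reply.splitlines() if ln.lstrip().startswith("-")]
        if len(bullets) != 3:
            return turn        # constraint dropped at this turn
    return None                # constraint survived every turn
```

Run the same probe with conversations authored in each locale and compare the distribution of first-violation turns; the BenchMAX-style finding predicts the English violations will come later than everyone else's.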
The fourth is cultural and grammatical bleed-through. Many English-shaped constraints — "respond in the second person," "use a friendly but not casual tone," "address the user by their first name" — do not map 1:1 across languages. A "friendly but not casual" tone in Japanese involves a register choice the prompt never specified. Polite second-person address in Korean involves honorifics the prompt never named. The model picks something, often the most generic option, and the answer reads as oddly stiff or oddly familiar to a native speaker — but the English judge has no way to score this.
The aggregate eval score does not see any of these because it is averaging over an English-heavy sample. The fix has to be structural, not statistical.
Locale-Stratified Eval Sets, Not Translated Ones
The first move is to stop reporting a single aggregate score and start reporting per-locale scores as gated channels. This sounds obvious, but the harder part is what goes into each per-locale set.
Translated eval cases are a starting point and a trap. They preserve the English logical structure, which means they preserve the English failure modes and miss the locale-specific ones. A translated case about "scheduling a meeting at 3pm on Thursday" does not test whether the model handles the Japanese 木曜日の午後三時 ("Thursday at 3pm") correctly, because the underlying ambiguity (does "Thursday" mean this week's or next week's?) is an English-shaped ambiguity. The Japanese version of the same problem has different ambiguities, such as the absolute-vs-relative date distinction and the choice between formal and conversational time expressions, which a translation will never surface.
What works better is a three-layer eval set per major locale:
- Translated cases as a sanity floor: catch obvious regressions and tokenization-driven length blowups. Cheap to maintain, low signal.
- Localized cases authored by native speakers: same product domain, same task shape, but written from scratch in the target language so the actual idiomatic patterns, register choices, and cultural references appear. Higher signal, more expensive.
- Failure-mode cases harvested from real user traffic in that locale: support tickets, low-rated turns, abandoned sessions in Japanese, French, Portuguese, whichever the locale is. Each case is annotated with the failure mode it is guarding against, so a future prompt change cannot accidentally regress it back.
The ratio depends on traffic. For a locale at 15% of traffic, you probably want a few hundred localized cases plus the harvested failures. For a tail locale at 1%, the translated layer might be the only thing that pencils out — and you should be honest that you have low confidence in that locale.
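One way to keep the three layers straight in tooling is to make the layer and the guarded failure mode first-class fields on every case. A minimal sketch of that record follows; the field names are assumptions for illustration, not any standard schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LocaleEvalCase:
    case_id: str
    locale: str                    # e.g. "fr-FR", "ja-JP"
    layer: str                     # "translated" | "localized" | "harvested"
    prompt: str
    rubric: str                    # what the judge is asked to score
    failure_mode: Optional[str] = None  # set on harvested cases, e.g. "turn6_format_drop"
    source: Optional[str] = None        # ticket or session id for harvested cases
```

Tagging the layer on every case is what lets you later report signal per layer and confirm (or refute) the claim below about where the regressions actually live.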
The big claim worth defending: a locale-stratified eval set with 200 native-authored cases will catch more real regressions than 2,000 machine-translated ones. The reason is signal density. Translated cases mostly re-test the English failure modes; native-authored cases test the language-specific ones, which is where the regressions actually live.
Locale-Conditioned Judges (and Their Own Biases)
The judge is the other half of the eval pipeline and a more dangerous source of locale blind spots. Two specific failure modes show up consistently.
The first is translationese bias. LLM judges systematically prefer machine-translated text over native-authored text — they reward the syntactic shape of translated output, which means a model that produces translation-flavored prose in the target language will outscore a model that produces idiomatic native prose. In low-resource languages this is severe enough to invert the ranking. If your judge prompt is in English and the candidate answer is in Japanese, the judge is implicitly translating the Japanese into an English-shaped representation to score it, and it is going to reward outputs that already look like that representation. This is the inverse of what you want.
The second is inconsistency across languages. Multilingual judge agreement measured by Fleiss' Kappa runs in the 0.1–0.32 range on average across languages, with the bottom of that range hitting low-resource languages and reasoning-heavy tasks. That is not "calibration drift." That is the judge essentially making different value calls on different languages while presenting the same numerical scale. You cannot compare a 7.4 in English to a 7.4 in Vietnamese and treat the gap as a real quality gap — the scales are not the same scale.
What helps:
- Per-locale judge prompts written in the target language with criteria explicitly grounded in that language's register and idiom expectations. "Polite but not overly formal" in English becomes a register-specific instruction in Japanese that names the actual register.
- Per-locale calibration sets: 20–50 frozen cases per locale with expert scores attached, used to compute a judge-vs-expert correlation and recalibrate the scale when it drifts (a minimal drift check is sketched after this list).
- Decomposed judging: break the rubric into language-agnostic criteria (factual correctness, instruction adherence, formatting) and language-specific criteria (register, idiom, native-speaker naturalness). Report and gate on them separately. Prompt scaffolding with explicit criteria produces ~0.05–0.10 Kappa improvement on its own, which is meaningful when your starting point is 0.2.
- Native-speaker spot review of judge scores at a sample rate — maybe 5% per locale — to catch drift in the judge before it propagates.
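For the calibration sets, the drift check itself is a few lines. This sketch uses Spearman rank correlation plus a mean-gap check; the thresholds are assumptions to tune against your own expert data, not standards.

```python
from scipy.stats import spearmanr

def judge_drifted(expert_scores, judge_scores, min_corr=0.75, max_mean_gap=0.5):
    # Rank disagreement catches the judge reordering cases; the mean gap
    # catches a uniform scale shift that rank correlation would miss.
    corr, _ = spearmanr(expert_scores, judge_scores)
    mean_gap = abs(
        sum(judge_scores) / len(judge_scores)
        - sum(expert_scores) / len(expert_scores)
    )
    return corr < min_corr or mean_gap > max_mean_gap
```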
A subtle point: even with all of this, judge scores across languages should not be compared as a single number. The right report shape is "did this locale regress against its own previous baseline," not "is this locale's score lower than English's."
Traffic-Weighted Reporting and Per-Locale Gating
Once the eval set is stratified and the judges are calibrated, the reporting layer is where most teams accidentally undo the work.
The default rollup of "average across all eval cases" puts non-English regressions back below the noise floor, because the per-locale signal gets averaged with a much larger English signal. Replace it with two reports.
The first is per-locale absolute score over time. One chart per major locale, gated independently. A 2-point drop in French is a French regression and gets flagged as a French regression, regardless of what English is doing. Do not let any prompt change ship if any gated locale dropped below its threshold, even if the aggregate is up.
The second is traffic-weighted impact. For each per-locale change, multiply by the locale's traffic share. A 5-point drop on a locale that is 1% of traffic is much smaller than a 1-point drop on a locale that is 20% of traffic. The traffic-weighted view tells you which regressions to chase first, while the per-locale view tells you which ones not to ship at all. The two are complementary, not substitutes.
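Both reports fit in a few lines. The sketch below uses illustrative scores, traffic shares, and a 1.0-point gate; all three are assumptions, not recommendations.

```python
# results: locale -> (baseline_score, candidate_score); traffic: locale -> share.
def gate_and_rank(results, traffic, max_drop=1.0):
    blocked, impact = [], []
    for locale, (baseline, candidate) in results.items():
        delta = candidate - baseline
        if delta < -max_drop:      # report 1: gate against the locale's own baseline
            blocked.append(locale)
        impact.append((locale, delta * traffic.get(locale, 0.0)))
    impact.sort(key=lambda pair: pair[1])  # report 2: worst weighted impact first
    return blocked, impact

blocked, impact = gate_and_rank(
    {"en": (7.8, 8.1), "fr": (7.2, 5.2), "ja": (6.9, 6.8)},
    {"en": 0.72, "fr": 0.06, "ja": 0.03},
)
# blocked == ["fr"]: the 2.0-point French drop fails the gate outright, even
# though the traffic-weighted sum of deltas (+0.09) is still positive, which
# is exactly why the aggregate cannot be the gate.
```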
A second-order pattern worth adopting: locale-specific canary slices in production. When a prompt change rolls out, route a small percentage of traffic per locale through the new prompt for a window — say a few hours per locale — and compare live quality signals (regenerate rate, thumbs-down rate, session abandon rate) against the locale's own baseline. Some regressions only show up under real traffic distribution, and the canary catches them before they reach the rest of the locale.
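For the canary comparison, a two-proportion test on one live signal is enough to start. A minimal sketch on thumbs-down rate, assuming you log interaction and thumbs-down counts per slice; statsmodels provides the test, and the alpha is an assumption to tune per locale sample size.

```python
from statsmodels.stats.proportion import proportions_ztest

def canary_regressed(base_bad, base_n, canary_bad, canary_n, alpha=0.05):
    # One-sided test: is the canary slice's thumbs-down rate higher than
    # the locale's own baseline rate?
    _, p_value = proportions_ztest(
        count=[canary_bad, base_bad],
        nobs=[canary_n, base_n],
        alternative="larger",
    )
    return p_value < alpha
```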
The runbook when a per-locale regression fires has three steps, in order: confirm against the locale's calibration set so you can rule out judge drift; bisect the prompt or model change against the per-locale eval set rather than the aggregate; and check whether the regression is concentrated in a specific intent (long-context, multi-turn, instruction-heavy) — most locale regressions cluster on one intent class rather than spreading evenly.
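For the third runbook step, the breakdown is a small group-by. This sketch assumes each eval case carries an intent tag and per-case scores from both prompt versions; the names are illustrative.

```python
from collections import defaultdict

def delta_by_intent(cases):
    # cases: iterable of (intent, baseline_score, candidate_score)
    buckets = defaultdict(list)
    for intent, baseline, candidate in cases:
        buckets[intent].append(candidate - baseline)
    # Worst average delta first; a concentrated regression shows up as one
    # intent with a large negative mean while the rest sit near zero.
    return sorted(
        ((intent, sum(ds) / len(ds), len(ds)) for intent, ds in buckets.items()),
        key=lambda row: row[1],
    )
```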
Where This Discipline Tends to Fail in Practice
A few practitioner patterns that show up after the basics are in place:
- Eval set rot per locale: the English set gets continuous attention because everyone uses it, while the French set stops getting failure cases added after the first quarter. Six months later the French set is testing a stale distribution. The fix is to formalize a per-locale failure-harvesting cadence — a weekly half-hour where a native speaker triages low-rated turns from that locale and promotes a few into the eval set.
- Judge upgrade asymmetry: a model migration is validated in English and shipped. The English judge handles the new model fine; the Japanese judge does not, because the new model's Japanese register shifted. Per-locale calibration sets catch this if they get re-run on every judge or judged-model change.
- Traffic share blindness: a team locks in per-locale gating tiers based on initial traffic share and forgets that the share is itself a function of product quality in that locale. The Japanese locale stays at 3% of traffic because the product is poor in Japanese, which keeps it below the gating threshold, which means Japanese regressions never block a ship. Periodically revisit each locale's gating tier against its addressable user base, not just its current traffic.
- Aggregate-first culture: leadership dashboards keep showing the single number even after engineering has the stratified report. The aggregate is the path of least resistance, and the moment it is the headline metric, prompt changes start optimizing for it again. The cleanest counter is to remove the aggregate from the gating layer entirely — the report can still show it, but no decision is allowed to use it.
The Takeaway
Locale drift is not a translation problem; it is an observability problem. Models will continue to be worse in non-English languages than in English for the foreseeable future — the training-data asymmetry, tokenizer asymmetry, and latent-space bias are structural and not going away on the current scaling trajectory. The question is whether your eval system can see the gap before your support queue does.
The minimum viable version is small: pick the two or three locales that are over 5% of traffic, build native-authored eval sets of a few hundred cases each, write a judge prompt in the target language with a small calibration set, and gate prompt changes per locale rather than on the aggregate. That is enough to surface most locale-specific regressions before they leak to users. The version after that — failure harvesting, per-locale canaries, decomposed judging — is incremental once the structure is in place.
The change in mental model is to stop treating non-English as a tail and start treating each major locale as a first-class product surface with its own quality channel. The English eval set is one of many, not the one. The aggregate score is a story, not a gate.
Sources
- https://lilt.com/blog/multilingual-llm-performance-gap-analysis
- https://www.comet.com/site/blog/complexities-for-non-latin-languages-llm-evaluations/
- https://arxiv.org/html/2502.07346v1
- https://mmluprox.github.io/
- https://aclanthology.org/2025.findings-emnlp.587.pdf
- https://arxiv.org/html/2603.10351
- https://arxiv.org/html/2603.18557
- https://arxiv.org/html/2502.16534v1
- https://openreview.net/forum?id=sr1zpOStM4
- https://artificialanalysis.ai/models/multilingual
- https://sea-lion.ai/blog/multilingual-llm-benchmarking/
