Translation Is Not Localization: The Cultural-Calibration Debt Your Multilingual AI Just Defaulted On
A multilingual launch that ships English prompts translated into N languages, with an English eval set translated into the same N languages, has not shipped a multilingual product. It has shipped one product N times, and made all the failure modes invisible to its own dashboards. The system is fluent and culturally off-key, and the metric the team optimized — translation quality — is the wrong axis to measure what users are reacting to.
The visible defect on launch day is small. A Japanese user receives a reply that is grammatically correct and conspicuously curt. An Indonesian user notices the assistant is cheerfully direct in a register that reads as rude. A Korean user gets advice framed around individual choice when the prompt was about a family decision. None of these are translation bugs. They are cultural-register bugs that translation cannot fix and translated evals cannot detect.
The discipline that has to land is straightforward to describe and unfamiliar to practice. It has four parts: locale-native evals authored in-region instead of translated from English; per-locale prompt variants that encode cultural conventions, which is a separate job from translation; an explicit politeness-and-refusal calibration step per locale, because the safety-tuned defaults from English-dominant RLHF do not transfer evenly; and named-entity handling that respects the locale's conventions for honorifics and name order. Each piece is doable; the trap is that none of them is visible from the place where most teams measure their multilingual rollout.
The Failure Mode the Translation Pipeline Hides
The instinct on a multilingual launch is to treat translation quality as the proxy for product quality. It is a comforting metric because it is measurable, the BLEU and COMET numbers move predictably, and the localization team has a defensible line of work. The proxy is wrong because the failures users notice live in registers and norms that translation systems explicitly try not to alter.
Consider the simplest version of this. A bank assistant in English is trained to respond to a wire-transfer question with a businesslike, slightly informal acknowledgment ("Sure, I can help you with that"). Translated into Japanese, the literal equivalent lands in a casual register that a Japanese-speaking user would never use with a financial institution. Translated into German, the same response can read as overly familiar. Translated into Brazilian Portuguese, the same response can read as cold compared to the local norm of warmth in service interactions. The translator is not wrong. The source text was wrong for the target locale, and translation faithfully preserved that error.
The eval set inherits the same mistake. A translated eval grades whether the model produced a faithful rendering of an English-anchored gold answer. It does not grade whether the answer reads as appropriate to a native user in the target locale. Manual audits of translated benchmarks have shown that translation artifacts — proper-name mistranslations, idiomatic losses, and lack of cultural adaptation — account for roughly 30 to 60 percent of apparent failures, and that even faithful translations cannot capture the cultural nuances the benchmark would need to test. The team ships, the dashboard is green, and the support inbox is the only system that knows the product is reading off-key.
What "Cultural Calibration" Actually Means in a Prompt Stack
The fix is not a longer system prompt with a sentence about being polite. The fix is to treat each locale as a first-class deployment surface with its own prompt variant, its own eval set, and its own quality bar.
Per-locale prompt variants encode the conventions that translation cannot infer from context. A Japanese variant specifies honorific level (often a teineigo default for service contexts, with sonkeigo for formal tasks), name handling (family name before given name, no Western-style "first name only" familiarity), and a refusal style that is more apologetic and indirect than the English default. A Korean variant picks among the seven traditional speech levels, only a few of which are in common everyday use, and matches the formality the user signals. A German variant chooses the du/Sie boundary. A Brazilian Portuguese variant warms the service register. These are not translations of each other; they are independent prompt designs that happen to live in the same product.
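One way to keep the variants independent is to store them as separate records rather than as translations of a single template. The following is a minimal sketch, not a prescribed schema: the `LocaleVariant` fields, the `VARIANTS` table, and `variant_for` are all hypothetical names, and the prompt text is left as a placeholder because it has to be authored in-region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocaleVariant:
    """One independently authored prompt design per locale (hypothetical schema)."""
    locale: str
    register: str       # default formality the assistant should use
    name_order: str     # "family_first" or "given_first"
    refusal_style: str  # how a decline should be phrased
    system_prompt: str  # authored in-region, not translated from English


# Illustrative entries only; the values summarize the conventions described above.
VARIANTS = {
    "ja-JP": LocaleVariant("ja-JP", "teineigo", "family_first",
                           "apologetic_indirect", "<authored in-region>"),
    "de-DE": LocaleVariant("de-DE", "Sie", "given_first",
                           "direct_formal", "<authored in-region>"),
    "pt-BR": LocaleVariant("pt-BR", "warm_service", "given_first",
                           "warm_direct", "<authored in-region>"),
}


def variant_for(locale: str) -> LocaleVariant:
    """Fail loudly instead of silently falling back to a translated English prompt."""
    try:
        return VARIANTS[locale]
    except KeyError:
        raise LookupError(f"no authored prompt variant for {locale!r}") from None
```

Raising on an unknown locale is a deliberate choice in this sketch: a silent fallback to the English prompt is exactly the failure the dashboards would not show.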
The eval set must be authored in-region by native speakers, not translated from English. This is the sub-step most teams skip because it is the most expensive. The cost is real and the alternative is worse. Newer multilingual benchmarks like SinhalaMMLU, TurkishMMLU, and HKMMLU have moved away from translation-based construction precisely because translated evals systematically grade the wrong thing — they reward a model for producing English-shaped answers in another language rather than locale-appropriate ones. A locale-native eval will surface failures a translated eval physically cannot, because the failure modes were edited out by the translation pipeline before the model ever saw them.
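A lightweight way to enforce this is to record provenance on every eval item and refuse to build a locale suite from translated items. The sketch below is an assumption about how such a check could look; `EvalItem`, its `provenance` and `rubric` fields, and `native_suite` are hypothetical names, not part of any existing eval framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalItem:
    locale: str
    prompt: str
    provenance: str          # "native" (authored in-region) or "translated"
    rubric: tuple[str, ...]  # dimensions a native grader scores, e.g. register


def native_suite(items: list[EvalItem], locale: str) -> list[EvalItem]:
    """Keep only items authored in-region for this locale; never fall back."""
    suite = [i for i in items if i.locale == locale and i.provenance == "native"]
    if not suite:
        raise ValueError(f"no locale-native eval items for {locale!r}")
    return suite


# Placeholder prompts; real items would be written by native speakers.
items = [
    EvalItem("ja-JP", "<prompt authored in-region>", "native",
             ("register", "refusal_style")),
    EvalItem("ja-JP", "<prompt translated from English>", "translated",
             ("faithfulness",)),
]
```

Here `native_suite(items, "ja-JP")` keeps only the first item, and a locale with nothing but translated items raises instead of producing a green dashboard.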
The named-entity handling rules are deceptively load-bearing. A model that addresses a Korean user as "Min-jun" instead of "Park Min-jun-nim" is not making a translation mistake; it is making a relationship mistake. A model that uses "Ms. Tanaka" in a Japanese business context where "Tanaka-sama" is the convention has just communicated something about its origin and its register that the user will read accurately and the team will not. These are the kind of details that a localization style guide encodes for human writers, and they need the equivalent encoding in the prompt stack, not as polish but as core behavior.
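The address rules can be sketched as a small per-locale table. This is an illustration under stated assumptions: `ADDRESS_RULES` and `address` are hypothetical names, the forms are romanized for readability (a real deployment would emit native script), and the Korean rule assumes full name plus -nim as a polite service form.

```python
# Each rule takes (family, given) and returns the form of address for that locale.
ADDRESS_RULES = {
    "ja-JP": lambda family, given: f"{family}-sama",          # family name + honorific
    "ko-KR": lambda family, given: f"{family} {given}-nim",   # full name, family first
    "en-US": lambda family, given: given,                     # first-name familiarity
}


def address(family: str, given: str, locale: str) -> str:
    """Return the locale's form of address; unknown locales raise, not default."""
    try:
        rule = ADDRESS_RULES[locale]
    except KeyError:
        raise LookupError(f"no address rule for {locale!r}") from None
    return rule(family, given)
```

Keeping the rules in data rather than in the prompt text makes them reviewable by the same in-region authors who write the style guide.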
Safety Defaults Do Not Transfer Across Languages
The harder problem is that the model itself was not trained evenly across the languages it speaks. RLHF for production models is overwhelmingly English-dominant, and recent mechanistic work has shown that refusal behavior is anchored to representations that activate cleanly on English token sequences and degrade as the input drifts to lower-resource languages. The practical consequences are uncomfortable.
Refusal rates drop sharply in some non-English contexts: published work on West African languages reported refusal rates falling to 35–55 percent on prompts that the same model refuses reliably in English. The same effect powers a class of cross-lingual jailbreaks where attackers translate a refused prompt into a language the safety training under-covered. The model's refusal direction is approximately universal across languages, but its ability to separate harmful from harmless prompts is not, and the gap is where the failure lives.
The flip side is that the same model can over-refuse in languages where the training data nudged it toward caution on topics that are routine in the target culture. A model that labels a culturally normal practice "unsafe" and refuses to discuss it, because its English-anchored safety training lacked the context, is making a different kind of error than falling for a jailbreak, and it is just as visible to the user.
The implication is that a politeness-and-refusal calibration step has to be part of every locale launch, not a one-time global tuning. The team needs to measure refusal rates per locale on a locale-native red-team set, measure over-refusal rates on a locale-native benign set, and treat any divergence from the English baseline beyond measurement noise as a release blocker rather than an accepted regression. Nor does the calibration stay valid once done: the base model changes, the safety tuning shifts, and a quarterly recalibration is closer to the right cadence than an annual one.
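The release gate described above can be sketched as a comparison of two rates per locale against the English baseline. This is a minimal sketch under assumptions: `LocaleCounts`, `release_blockers`, and the `tolerance` parameter (and its 0.05 default) are hypothetical, and the counts in the example are made up for illustration, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocaleCounts:
    locale: str
    harmful_refused: int  # refusals on the locale-native red-team set
    harmful_total: int
    benign_refused: int   # refusals on the locale-native benign set
    benign_total: int

    @property
    def refusal_rate(self) -> float:
        return self.harmful_refused / self.harmful_total

    @property
    def over_refusal_rate(self) -> float:
        return self.benign_refused / self.benign_total


def release_blockers(baseline, results, tolerance=0.05):
    """List (locale, metric) pairs that diverge from the baseline by more than tolerance."""
    blockers = []
    for r in results:
        for metric in ("refusal_rate", "over_refusal_rate"):
            if abs(getattr(r, metric) - getattr(baseline, metric)) > tolerance:
                blockers.append((r.locale, metric))
    return blockers


# Made-up counts, for illustration only.
baseline = LocaleCounts("en-US", 98, 100, 2, 100)
candidate = LocaleCounts("id-ID", 45, 100, 3, 100)
print(release_blockers(baseline, [candidate]))  # [('id-ID', 'refusal_rate')]
```

A non-empty list blocks the launch for that locale. Rerunning the same check on every base-model or safety-tuning change is what turns the audit into a recurring calibration.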
