
Building Multilingual AI Products: The Quality Cliff Nobody Measures

· 11 min read
Tian Pan
Software Engineer

Your AI product scores 82% on your eval suite. You ship to 40 countries. Three months later, French and German users report quality similar to English. Hindi and Arabic users quietly stop using the feature. Your aggregate satisfaction score barely budges — because English-speaking users dominate the metric pool. The cliff was always there. You just weren't measuring it.

This is the default story for most teams shipping multilingual AI products. The quality gap isn't subtle. QwQ-32B, the best-performing model tested in 2025, drops from 70.7% on English reasoning benchmarks to 32.8% on Swahili: a 54% relative performance collapse. The gap doesn't disappear as models get larger. It shrinks for high-resource languages and stays wide for everyone else.
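The 54% figure is a relative drop, not a gap in percentage points, and the two are easy to confuse in a report. A minimal sketch of the arithmetic, using the two scores quoted above (the function name `relative_drop` is just an illustrative helper):

```python
def relative_drop(baseline: float, score: float) -> float:
    """Fraction of the baseline score that is lost (0.25 means a 25% relative drop)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - score) / baseline


english, swahili = 70.7, 32.8  # benchmark scores quoted in the text, in percent

absolute_gap = english - swahili            # gap in percentage points
relative = relative_drop(english, swahili)  # share of English performance lost

print(f"absolute gap: {absolute_gap:.1f} points")
print(f"relative drop: {relative:.0%}")
```

The absolute gap is 37.9 points; dividing by the English baseline gives the roughly 54% relative drop.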

The Root Cause Lives in the Training Data

Pre-training data distribution is the original sin. English dominates Common Crawl, the primary data source for most large language models, at roughly 41% of all crawled content. Russian sits at 6.5%, German at 6%, Japanese at 5.7%, Chinese at 5%. Arabic falls below 1%. Hindi is below 0.5%. Bengali, Swahili, and most of the world's other languages are noise-level fractions.

Model training reflects this skew directly. LLaMA 2 was trained on roughly 90% English tokens, with other languages appearing in trace quantities: German at 0.17%, French at 0.16%, Chinese at 0.13%. LLaMA 3 improved the ratio slightly, but Meta's own documentation notes that only 5% of pretraining tokens were non-English, spread across 30 languages, and explicitly warns that performance "will not match English." BLOOM (the BigScience model) made deliberate multilingual design choices, yet even its balanced ROOTS corpus ended up 30% English by token count.

The downstream effect is measurable at the model internals level. Research analyzing cosine similarity between internal layer representations across languages found that LLaMA 2 7B's representation of Korean bears similarity scores of 0.2–0.5 to its English representations, while German — with 0.17% of training data — lands around 0.58. The metric correlates directly with training data proportion, and the pattern holds across LLaMA, Qwen, Mistral, and Gemma model families. This isn't a single vendor's quirk. It's a structural property of how models are built at scale.
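For readers who want to run this kind of check on their own model, the core measurement is ordinary cosine similarity between two representation vectors. The sketch below assumes you have already extracted a hidden-state vector (for example, mean-pooled over tokens) for the same sentence in two languages; the vectors shown are made-up toy values, not results from any model, and the extraction step is out of scope here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical pooled hidden states for one sentence in two languages.
english_vec = [0.12, -0.40, 0.33, 0.05]
other_vec = [0.10, -0.10, 0.45, -0.20]

print(f"similarity: {cosine_similarity(english_vec, other_vec):.2f}")
```

Averaging this score over many parallel sentences, per layer and per language, yields the kind of per-language similarity profile the research describes.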

Scaling doesn't fix it uniformly. Larger models improve absolute performance on low-resource languages, but the relative gap to English often persists, and in some capability dimensions it widens.

Three Distinct Failure Modes — Not One

Teams that think about multilingual quality usually frame it as a single problem: "the model knows less in other languages." That framing misses two equally important failure modes.

Capability cliff. The model has genuinely less knowledge and weaker reasoning in non-English languages because it saw less training data in those languages. This shows up in benchmarks: MMLU-ProX tested 36 models across 29 languages and found gaps of 24+ percentage points between high- and low-resource languages on identical questions. A healthcare chatbot study tested ChatGPT-3.5 and specialized medical models across Spanish, Chinese, and Hindi, and found 18% lower correctness, 29% lower consistency, and 13% lower verifiability compared to English. One specialized model gave irrelevant or contradictory responses to more than 67% of non-English medical queries.

Safety cliff. Alignment training (RLHF, Constitutional AI, safety fine-tuning) is conducted overwhelmingly in English. The constraints trained into models in English often don't transfer to other language surfaces. Research from Brown University demonstrated that GPT-4's safety guardrails, which held the success rate of harmful prompts below 1% in English, could be bypassed roughly 79% of the time by translating the same prompts into Zulu, Scots Gaelic, Hmong, or Guarani. This isn't a theoretical finding. OpenAI patched the issue after the research went public. Practitioners have independently confirmed similar gaps with Gemini, where safety refusals that work correctly in English are bypassed in the same session when queries switch to certain non-English languages.

Output language cliff. This is the one most teams discover first and misdiagnose. The model receives a query in Arabic and responds in English. It understood the question; it just defaulted to English output. LLaMA 2 instruction-tuned models score a 0.3% monolingual pass rate on Arabic queries, meaning nearly every response ignores the query language. The counterintuitive finding: instruction tuning worsened the problem. The base model handles language consistency better. Instruction fine-tuning, conducted primarily on English data, amplifies the English output bias.

This output language cliff appears in a subtler form in RAG systems. When the retrieved context is in English but the user queries in Chinese, the model frequently "drifts" to generating in English. In one systematic study, Chinese-target consistency dropped from 92% to 68.4% when retrieved context was English. In 70–98% of drift cases across languages, the model defaulted to English output.
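A rough version of this consistency check can be run on logged query/response pairs without any external dependency. The sketch below is an assumption-heavy heuristic, not the method used in the studies cited: it compares the dominant Unicode script of the query and the response, so it catches Arabic-to-English or Chinese-to-English drift but cannot tell apart languages that share a script (French versus English, for instance). A real pipeline would swap in a proper language identifier. The sample pairs are invented for illustration.

```python
import unicodedata
from collections import Counter
from typing import Iterable, Tuple


def dominant_script(text: str) -> str:
    """Return the most common script among the letters of `text`.

    The script is approximated by the first word of each character's
    Unicode name, e.g. 'LATIN', 'ARABIC', 'CJK', 'DEVANAGARI'.
    """
    counts: Counter = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def language_consistency(pairs: Iterable[Tuple[str, str]]) -> float:
    """Share of (query, response) pairs whose dominant scripts match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (query, response) pair")
    matches = sum(dominant_script(q) == dominant_script(r) for q, r in pairs)
    return matches / len(pairs)


# Invented examples: two responses stay in the query script, two drift to English.
sample_pairs = [
    ("ما هي عاصمة فرنسا؟", "عاصمة فرنسا هي باريس."),
    ("ما هو الذكاء الاصطناعي؟", "Artificial intelligence is a field of computer science."),
    ("什么是机器学习？", "机器学习是人工智能的一个分支。"),
    ("什么是向量数据库？", "A vector database stores embeddings."),
]

print(f"script consistency: {language_consistency(sample_pairs):.0%}")
```

Running the same check separately for requests with English retrieved context and requests with same-language context would show whether RAG is the source of the drift.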

Why English-Only Evals Don't Catch Any of This

The standard eval setup: build a test set in English, run your model, measure aggregate accuracy. Maybe translate a subset of examples for spot-checking. Ship to production.

The problems compound at every step.

Translation-based benchmarks carry the original English structure into the target language, which distorts results. A study of Spanish MMLU found translation errors — mistranslated technical terms, improper proper-name handling, semantic drift — accounted for 30–60% of apparent failures in some categories. Manually correcting these artifacts recovered up to 63% of failed items. Benchmark scores on translated content don't measure model capability in the target language; they measure some mix of model capability and translation quality.

Cultural bias compounds the problem. Analysis of MMLU found that 28% of questions require Western cultural knowledge, and 84.9% of geography questions focus on North American or European regions. Model rankings change substantially when you split culturally sensitive from culturally agnostic questions. A model optimized to score well on the full MMLU may be doing so by exploiting a Western-knowledge advantage rather than general reasoning capability.

Aggregate scores hide the distribution. A model reporting 75% on a "29-language average" could be at 95% on English and French while scoring 40% on Swahili and Bengali. Most model cards don't show per-language breakdowns. Model vendors who claim support for 30+ languages rarely disclose evaluation methodology or per-language performance data.
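The fix on the reporting side is cheap: never publish the average without the breakdown next to it. A minimal sketch, assuming you already have one accuracy number per language; the scores and the 0.60 floor below are hypothetical values chosen to mirror the example in the text, and `per_language_report` is an illustrative helper, not part of any eval framework.

```python
from statistics import mean
from typing import Dict, List, Union

Report = Dict[str, Union[float, str, List[str]]]


def per_language_report(scores: Dict[str, float], floor: float) -> Report:
    """Summarize per-language accuracy alongside the headline average."""
    if not scores:
        raise ValueError("scores must not be empty")
    worst = min(scores, key=scores.get)
    best = max(scores, key=scores.get)
    return {
        "average": mean(scores.values()),
        "worst_language": worst,
        "gap_best_to_worst": scores[best] - scores[worst],
        # Languages under the floor, lowest score first.
        "below_floor": sorted(
            (lang for lang, s in scores.items() if s < floor), key=scores.get
        ),
    }


# Hypothetical per-language accuracies (fractions, not percentages).
scores = {"en": 0.95, "fr": 0.95, "de": 0.90, "es": 0.90, "sw": 0.40, "bn": 0.40}

report = per_language_report(scores, floor=0.60)
print(f"average: {report['average']:.0%}")
print(f"below floor: {report['below_floor']}")
print(f"best-to-worst gap: {report['gap_best_to_worst']:.0%}")
```

With these made-up numbers the headline average is 75%, while two of six languages sit at 40%. Gating a release on the worst language or on the `below_floor` list, rather than on the average, makes the cliff visible before launch.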

The final failure is the production signal itself. Non-English users who consistently get worse answers will churn silently or stop using the feature — but their volume in aggregate satisfaction scores is too small to move the metric in the short term. You see aggregate CSAT stay flat while a significant user segment stops engaging. You'd need per-language CSAT instrumentation to see it. Almost no team has this.
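The arithmetic behind the flat metric is worth seeing once. The sketch below uses invented traffic numbers (English-heavy, with Hindi and Arabic at 1% of survey responses each) and invented satisfaction rates; only the weighting logic is the point. The segment keys and the `(count, rate)` layout are assumptions for illustration, not a schema from any analytics tool.

```python
from typing import Dict, Tuple

# (number of survey responses, satisfaction rate in [0, 1]) per language
Segment = Tuple[int, float]


def aggregate_csat(segments: Dict[str, Segment]) -> float:
    """Response-weighted satisfaction across all language segments."""
    total = sum(count for count, _ in segments.values())
    if total == 0:
        raise ValueError("no survey responses")
    return sum(count * rate for count, rate in segments.values()) / total


def per_language_delta(
    before: Dict[str, Segment], after: Dict[str, Segment]
) -> Dict[str, float]:
    """Change in satisfaction rate for each language present in both periods."""
    return {
        lang: after[lang][1] - before[lang][1] for lang in before if lang in after
    }


# Hypothetical data: every segment starts at 80% satisfaction.
before = {
    "en": (9000, 0.80),
    "fr": (400, 0.80),
    "de": (400, 0.80),
    "hi": (100, 0.80),
    "ar": (100, 0.80),
}
# Hindi and Arabic satisfaction collapses to 50%; everything else is unchanged.
after = {**before, "hi": (100, 0.50), "ar": (100, 0.50)}

print(f"aggregate before: {aggregate_csat(before):.1%}")
print(f"aggregate after:  {aggregate_csat(after):.1%}")
for lang, delta in per_language_delta(before, after).items():
    print(f"{lang}: {delta:+.0%}")
```

In this toy scenario the aggregate moves from 80.0% to 79.4%, well inside normal week-to-week noise, while two segments have each lost 30 points. Only the per-language view shows it.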

Language Detection and Routing That Actually Works
