
Building Multilingual AI Products: The Quality Cliff Nobody Measures

· 11 min read
Tian Pan
Software Engineer

Your AI product scores 82% on your eval suite. You ship to 40 countries. Three months later, French and German users report quality similar to English. Hindi and Arabic users quietly stop using the feature. Your aggregate satisfaction score barely budges — because English-speaking users dominate the metric pool. The cliff was always there. You just weren't measuring it.

This is the default story for most teams shipping multilingual AI products. The quality gap isn't subtle. A state-of-the-art model like QwQ-32B drops from 70.7% on English reasoning benchmarks to 32.8% on Swahili — a 54% relative performance collapse, on one of the strongest models tested in 2025. This gap doesn't disappear as models get larger. It shrinks for high-resource languages and stays wide for everyone else.

The Root Cause Lives in the Training Data

Pre-training data distribution is the original sin. English dominates Common Crawl, the primary data source for most large language models, at roughly 41% of all crawled content. Russian sits at 6.5%, German at 6%, Japanese at 5.7%, Chinese at 5%. Arabic falls below 1%. Hindi is below 0.5%. Bengali, Swahili, and most of the world's other languages are noise-level fractions.

Model training reflects this skew directly. LLaMA 2 trained on roughly 90% English tokens. Other languages appear as trace quantities: German at 0.17%, French at 0.16%, Chinese at 0.13%. LLaMA 3 improved the ratio slightly — but Meta's own documentation notes that only 5% of pretraining tokens were non-English across 30 languages, and explicitly warns that performance "will not match English." BLOOM (the BigScience model) made deliberate multilingual design choices, yet even its balanced ROOTS corpus ended up 30% English by token count.

The downstream effect is measurable at the model internals level. Research analyzing cosine similarity between internal layer representations across languages found that LLaMA 2 7B's representation of Korean bears similarity scores of 0.2–0.5 to its English representations, while German — with 0.17% of training data — lands around 0.58. The metric correlates directly with training data proportion, and the pattern holds across LLaMA, Qwen, Mistral, and Gemma model families. This isn't a single vendor's quirk. It's a structural property of how models are built at scale.

Scaling doesn't fix it uniformly. Larger models improve performance on low-resource languages, but the relative gap to English often persists or widens in some capability dimensions.

Three Distinct Failure Modes — Not One

Teams that think about multilingual quality usually frame it as a single problem: "the model knows less in other languages." That framing misses two equally important failure modes.

Capability cliff. The model has genuinely less knowledge and weaker reasoning in non-English languages because it saw less training data in those languages. This shows up in benchmarks: MMLU-ProX tested 36 models across 29 languages and found gaps of 24+ percentage points between high and low-resource languages on identical questions. A healthcare chatbot study tested ChatGPT-3.5 and specialized medical models across Spanish, Chinese, and Hindi — finding 18% lower correctness, 29% lower consistency, and 13% lower verifiability compared to English. One specialized model gave irrelevant or contradictory responses to more than 67% of non-English medical queries.

Safety cliff. Alignment training — RLHF, Constitutional AI, safety fine-tuning — is conducted overwhelmingly in English. The constraints trained into models in English often don't transfer to other language surfaces. Research from Brown University demonstrated that GPT-4's safety guardrails, which let harmful prompts through less than 1% of the time in English, could be bypassed roughly 79% of the time by translating the same prompts into Zulu, Scots Gaelic, Hmong, or Guarani. This isn't a theoretical finding. OpenAI patched the issue after the research went public. Practitioners have independently confirmed similar gaps with Gemini, where safety refusals that work correctly in English are bypassed in the same session when queries switch to certain non-English languages.

Output language cliff. This is the one most teams discover first and misdiagnose. The model receives a query in Arabic and responds in English. It understood the question — it just defaulted to English output. LLaMA 2 instruction-tuned models respond in English to Arabic queries with a 0.3% monolingual pass rate. Nearly every response ignores the query language. The counterintuitive finding: instruction tuning worsened the problem. The base model handles language consistency better. Instruction fine-tuning, conducted primarily on English data, amplifies the English output bias.

This output language cliff appears in a subtler form in RAG systems. When the retrieved context is in English but the user queries in Chinese, the model frequently "drifts" to generating in English. In one systematic study, Chinese-target consistency dropped from 92% to 68.4% when retrieved context was English. In 70–98% of drift cases across languages, the model defaulted to English output.

Why English-Only Evals Don't Catch Any of This

The standard eval setup: build a test set in English, run your model, measure aggregate accuracy. Maybe translate a subset of examples for spot-checking. Ship to production.

The problems compound at every step.

Translation-based benchmarks carry the original English structure into the target language, which distorts results. A study of Spanish MMLU found translation errors — mistranslated technical terms, improper proper-name handling, semantic drift — accounted for 30–60% of apparent failures in some categories. Manually correcting these artifacts recovered up to 63% of failed items. Benchmark scores on translated content don't measure model capability in the target language; they measure some mix of model capability and translation quality.

Cultural bias compounds the problem. Analysis of MMLU found that 28% of questions require Western cultural knowledge, and 84.9% of geography questions focus on North American or European regions. Model rankings change substantially when you split culturally sensitive from culturally agnostic questions. A model optimized to score well on the full MMLU may be doing so by exploiting Western-knowledge advantage rather than general reasoning capability.

Aggregate scores hide the distribution. A model reporting 75% on a "29-language average" could be at 95% on English and French while scoring 40% on Swahili and Bengali. Most model cards don't show per-language breakdowns. Model vendors who claim support for 30+ languages rarely disclose evaluation methodology or per-language performance data.
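A quick sketch makes the arithmetic concrete. The per-language scores below are invented for illustration, but the pattern matches what aggregate reporting hides:

```python
# Hypothetical per-language accuracy from one eval suite run.
scores = {"en": 95.0, "fr": 93.0, "de": 91.0, "hi": 52.0, "sw": 40.0, "bn": 43.0}

average = sum(scores.values()) / len(scores)  # the headline number
worst_lang = min(scores, key=scores.get)      # the hidden tail
print(f"average={average:.1f}, worst={worst_lang}={scores[worst_lang]:.1f}")
# average=69.0, worst=sw=40.0
```

A 69% headline average reads as "needs work"; a 40% Swahili score reads as "unusable." Only the per-language breakdown tells you which problem you have.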

The final failure is the production signal itself. Non-English users who consistently get worse answers will churn silently or stop using the feature — but their volume in aggregate satisfaction scores is too small to move the metric in the short term. You see aggregate CSAT stay flat while a significant user segment stops engaging. You'd need per-language CSAT instrumentation to see it. Almost no team has this.

Language Detection and Routing That Actually Works

Before you can route by language, you need to detect it reliably — and short-text language detection is harder than it sounds.

The production standard is FastText, which recognizes 176 languages and processes roughly 28,000 samples per second, significantly outperforming alternatives on both accuracy and throughput. But even FastText fails in predictable ways:

  • Short or ambiguous strings ("api" means fire in Indonesian but reads as an English acronym)
  • Greetings where users write in one language but want responses in another
  • Code-switching where users mix languages mid-sentence
  • Overconfidence on strings with no clear language signal

Production systems use domain-specific fine-tuning — training on customer service conversations rather than general web text — and supplement detection with greeting dictionaries that map common multilingual phrases to intended response languages.
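A minimal sketch of such a detection wrapper, with the classifier injected as a callable (in production this would wrap a FastText model's predict call). The greeting table and thresholds here are illustrative, not real production values:

```python
# Illustrative greeting dictionary: short salutations carry intent, not signal.
GREETINGS = {"bonjour": "fr", "hola": "es", "namaste": "hi", "hello": "en"}

def detect_language(text, classify, min_chars=20, min_conf=0.7):
    """Return (lang, confidence), or ("und", conf) when no reliable signal exists.

    `classify` stands in for a language-ID model's predict call, e.g. a
    wrapper around fasttext.load_model("lid.176.bin").
    """
    cleaned = text.strip().lower()
    # 1. Greeting dictionary lookup beats statistical detection on salutations.
    if cleaned in GREETINGS:
        return GREETINGS[cleaned], 1.0
    # 2. Short strings like "api" have no reliable language signal; don't guess.
    if len(cleaned) < min_chars:
        return "und", 0.0
    # 3. Defer to the classifier, but reject overconfident low-signal predictions.
    lang, conf = classify(cleaned)
    return (lang, conf) if conf >= min_conf else ("und", conf)
```

Returning an explicit "und" (undetermined) lets the downstream router fall back to session history or account locale instead of acting on a bad guess.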

Once you can detect language reliably, three routing architectures handle the rest:

Translate-then-process. Detect language → translate query to English → run English-optimized model → translate response back. This works well for knowledge retrieval tasks. It fails for culturally-grounded questions (where the English knowledge base doesn't apply), formal registers (legal, medical), and agglutinative languages where translation loses structural meaning. You're also compounding two rounds of translation error.

Language-specialist routing. Route to a language-specific fine-tuned model per detected language. Highest quality ceiling per language; highest operational cost. Only viable if you have enough traffic per language to justify maintaining separate model variants.

Unified model with language-aware prompting. Detect language → inject explicit language instruction into the system prompt ("Respond in {detected_language}") → use a single multilingual model. Research shows fully translated system prompts achieve above 95% correct language generation rates in most languages. Simpler to operate; quality ceiling is set by the base model's multilingual capability.

Most production systems start with Pattern 3 and add language-specific fine-tuning for their highest-traffic non-English languages as quality requirements tighten.
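The third pattern is simple enough to sketch end to end. Everything below is a hypothetical scaffold — the language table, the instruction wording, and the message shape are assumptions, not a specific vendor's API:

```python
# Illustrative language-name table; extend per target market.
LANGUAGE_NAMES = {"ar": "Arabic", "hi": "Hindi", "fr": "French", "en": "English"}

def build_messages(query: str, detected_lang: str) -> list:
    """Pin the output language in the system prompt before calling one
    multilingual model. Counters the English default-output bias."""
    lang_name = LANGUAGE_NAMES.get(detected_lang, "the same language as the user")
    system = (
        f"You are a helpful assistant. Respond only in {lang_name}, "
        f"even if retrieved context or examples are in another language."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": query},
    ]
```

The "even if retrieved context is in another language" clause targets the RAG drift failure described earlier; per the research cited above, translating the entire system prompt into the target language tends to work even better than an English-language instruction.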

Eval Design That Exposes the Cliff Before Launch

The shift that matters most: stop aggregating across languages before you understand per-language variance.

Use natively-authored test sets. The INCLUDE benchmark provides 197,000+ QA pairs drawn from local exam sources in 44 languages across 52 countries — questions written by native speakers for native speakers, not translated from English. These expose failures that translated benchmarks miss. If your team can't source native-authored questions, use back-translation as a quality check: translate your English examples to the target language, then translate back to English, and flag any items where meaning shifted significantly.
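The back-translation check can be automated. This is a rough sketch: `translate(text, src, tgt)` is a placeholder for your MT system, and the character-ratio similarity is a crude proxy — a production check would compare sentence embeddings instead:

```python
import difflib

def backtranslation_flags(items, translate, threshold=0.6):
    """Flag English eval items whose meaning may have drifted in translation.

    `translate(text, src, tgt)` is a placeholder for any MT call.
    difflib's character ratio is a cheap stand-in for semantic similarity.
    """
    flagged = []
    for item in items:
        # English -> target language -> back to English.
        round_trip = translate(translate(item, "en", "hi"), "hi", "en")
        score = difflib.SequenceMatcher(None, item.lower(), round_trip.lower()).ratio()
        if score < threshold:
            flagged.append((item, round_trip, round(score, 2)))
    return flagged
```

Flagged items go to a human reviewer; the automated pass just keeps obviously broken translations out of the benchmark.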

Separate capability accuracy from language consistency. A model can answer correctly in the wrong language. Measure "Correct Language Rate" (did the response match the query language?) and content accuracy independently. You might have a 90% capability ceiling with a 60% language consistency rate — two different problems requiring different fixes.
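Computing the two metrics independently is a few lines once responses are tagged. A sketch, where `lang_of` stands in for the language detector from earlier applied to model output:

```python
def score_responses(examples, lang_of):
    """Compute content accuracy and Correct Language Rate as separate metrics.

    Each example is (query_lang, response_text, is_correct). `lang_of` is a
    placeholder for a language detector applied to the response.
    """
    n = len(examples)
    accuracy = sum(1 for _, _, ok in examples if ok) / n
    clr = sum(1 for q_lang, resp, _ in examples if lang_of(resp) == q_lang) / n
    return {"accuracy": accuracy, "correct_language_rate": clr}
```

A high-accuracy, low-CLR result points at prompting or fine-tuning fixes; low accuracy points at base-model capability, which prompting won't recover.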

Test safety guardrails explicitly in low-resource languages. This is the evaluation step almost no team does. Before any launch to new language markets, run your safety test suite through translation into 5+ non-English languages that fall outside your primary training distribution. The Brown University finding about 79% bypass rates isn't an edge case — it's what happens when you train safety in English and ship globally without testing the assumption that safety transferred.

Tag every production request with detected language. This is the minimum viable monitoring change. Once you have per-language tags, you can compute per-language accuracy, latency, and satisfaction signals separately. Set language-specific drift alerts rather than global ones — English P95 latency at 400ms and Hindi P95 at 550ms might both be acceptable, but different alert thresholds apply.
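Per-language alerting can start as a lookup table of language-specific SLOs. The thresholds below are invented for illustration:

```python
# Hypothetical per-language P95 latency SLOs in milliseconds.
P95_LATENCY_SLO_MS = {"en": 450, "hi": 600, "default": 800}

def latency_alerts(p95_by_lang):
    """Return the languages breaching their language-specific P95 latency SLO,
    rather than comparing every language against one global threshold."""
    return [
        lang for lang, p95 in p95_by_lang.items()
        if p95 > P95_LATENCY_SLO_MS.get(lang, P95_LATENCY_SLO_MS["default"])
    ]
```

The same shape works for per-language accuracy and CSAT thresholds; the point is that each language gets its own baseline.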

Run per-language regression after every model update. Model improvements in English don't guarantee improvements in non-English languages. Capability in one language can actually regress when a model update shifts the data distribution or changes the alignment training. Your pre-promotion checklist should include a per-language battery, not just aggregate performance.
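A per-language promotion gate is one function in CI. A sketch, assuming per-language accuracy scores for the current and candidate models (the 2-point tolerance is an illustrative default, not a recommendation):

```python
def regression_gate(baseline, candidate, max_drop=2.0):
    """Block promotion if ANY language regresses more than `max_drop` points,
    even when the aggregate score improves. Inputs map language -> accuracy."""
    regressions = {
        lang: baseline[lang] - candidate.get(lang, 0.0)
        for lang in baseline
        if baseline[lang] - candidate.get(lang, 0.0) > max_drop
    }
    return len(regressions) == 0, regressions
```

Treating a missing language in the candidate's results as a score of zero is deliberate: an eval run that silently dropped a language should fail the gate, not pass it.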

The Measurement Commitment

The quality cliff is silent because most teams never build the instrumentation to see it. Non-English users in aggregate CSAT are a minority. Per-language satisfaction signals don't exist. Eval suites are English-only. Model cards don't disclose per-language breakdowns. The cliff is invisible until users stop showing up.

The fix isn't to wait for models to close the gap — the gap has existed since the first large-scale models shipped, and while it's narrowing in high-resource languages, it's persisting for the majority of the world's languages. The fix is to measure it.

Add language detection tagging to your request logs. Add a per-language capability benchmark to your eval suite — even a simplified 200-question subset across your target markets is better than nothing. Run your safety tests in at least 5 non-English languages before any launch. Build per-language monitoring dashboards with independent alert thresholds.

The engineers who will build the most reliable multilingual AI products aren't waiting for model vendors to solve this upstream. They're measuring it themselves, routing around the gaps, and filing bug reports against a comparison baseline that shows exactly how much worse the experience is for a user in Bangalore or Casablanca compared to one in San Francisco.

That comparison baseline has to exist before it can drive decisions. Right now, for most teams, it doesn't.
