The Multilingual Quality Cliff: Why Your LLM Works Great in English and Quietly Fails Everyone Else

· 10 min read
Tian Pan
Software Engineer

Your LLM passes every eval you throw at it. Latency is solid, accuracy looks fine, and the team ships with confidence. Then a user in Cairo files a bug: the structured extraction returns malformed JSON. A developer in Seoul notices the assistant ignores complex instructions after a few turns. A product manager in Mumbai realizes the chatbot's summarization is just wrong—subtly, consistently, wrong.

None of this showed up in your benchmarks because your benchmarks are in English.

This is the multilingual quality cliff: a performance drop that is steep, systematic, and almost universally invisible to teams that ship AI products. The gap isn't marginal. In long multi-turn conversations, Arabic and Korean users see accuracy around 40.8% on tasks where English users are at 54.8%—a 14-point gap that compounds with every additional turn. For structured editing tasks, the gap becomes catastrophic: non-English accuracy falls to 32–37% against acceptable English performance on the same tasks. The users feel this. Your dashboards don't.

Why English Dominates and Everything Else Suffers

The root cause is straightforward: LLMs are trained on the internet, and the internet is overwhelmingly English. GPT-3's training corpus was 92.65% English tokens. LLaMA 2 was 89.70% English. Even as models have expanded their language coverage, the base distribution never changed enough to close the gap.

This imbalance propagates through every stage of the model's development. Pretraining gives the model its fundamental understanding of language patterns, grammar, and world knowledge. When 90% of that signal is English, the model builds robust internal representations for English and weaker, more fragile representations for everything else. When a user writes in Arabic, the model is working from a much sparser map of that linguistic territory.

Instruction-tuning makes this worse. Even multilingual instruction datasets exhibit extreme English skew—not just in volume, but in complexity. Non-English examples tend to be shorter, simpler, and less representative of the tasks users actually run in those languages. The model learns to follow detailed, nuanced English instructions well. It learns to approximate similar behavior in other languages.

The tokenizer adds another compounding factor. Subword tokenizers trained on English-dominant corpora fragment non-Latin scripts dramatically. Arabic requires roughly 3x as many tokens as English to express the same semantic content. Some Indic and Southeast Asian scripts hit a 12x multiplier. This isn't just a cost problem—it directly degrades attention quality. More tokens mean more positions to track, more opportunities for the model to lose the thread of a constraint stated three hundred tokens ago. Long conversations in Arabic fragment into thousands of tokens that an equivalent English conversation would represent in hundreds.
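The compounding effect of token inflation can be made concrete with a back-of-the-envelope calculation. This sketch uses illustrative multipliers drawn from the ranges above—they are assumptions for demonstration, not measured values for any particular tokenizer—to show how the same context window shrinks and the same task costs more in fragmented scripts:

```python
# Illustrative sketch: how tokenizer fragmentation inflates context usage
# and per-task cost. Multipliers are rough assumptions for demonstration,
# not measured values for any specific tokenizer.

TOKEN_MULTIPLIER = {  # tokens needed relative to English for equal content
    "en": 1.0,
    "es": 1.3,
    "ar": 3.0,   # roughly 3x, per the pattern described above
    "my": 12.0,  # some Southeast Asian scripts hit ~12x
}

def effective_context_words(context_window_tokens: int, lang: str,
                            tokens_per_english_word: float = 1.3) -> int:
    """Approximate how many English-equivalent words fit in the window."""
    return int(context_window_tokens /
               (tokens_per_english_word * TOKEN_MULTIPLIER[lang]))

def per_task_cost(tokens_english: int, lang: str,
                  usd_per_1k_tokens: float = 0.01) -> float:
    """Cost of a task that would take `tokens_english` tokens in English."""
    return tokens_english * TOKEN_MULTIPLIER[lang] * usd_per_1k_tokens / 1000

# An 8k-token window holds far less Arabic than English content:
print(effective_context_words(8192, "en"))  # 6301
print(effective_context_words(8192, "ar"))  # 2100
```

The same conversation that comfortably fits an English context window can overflow it threefold in Arabic, which is exactly where long-conversation instruction loss begins.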

The final layer is what researchers call the "semantic attractor" effect. When a multilingual model processes non-English input, it internally translates to English, reasons in English, then translates the output back. Every translation step introduces loss. The model isn't natively reasoning in Arabic or Japanese—it's approximating a translation of English reasoning, and the approximation degrades with task complexity.

The Evaluation Blind Spot

The reason most teams don't catch this is simple: their eval suites are built in English, by English-speaking engineers, for English-speaking stakeholders.

Over 75% of major LLM benchmarks are designed English-first. Non-English evaluation, when it exists, is typically handled by translating English test cases—which preserves the language label but discards the cultural context, native linguistic patterns, and task-specific failure modes that actually matter. A test case asking "limit your response to 15 words" becomes harder in Spanish simply because Spanish typically needs more words to express the same content. Machine-translated benchmarks encode this kind of artifact throughout.
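The word-limit artifact is easy to demonstrate. In this sketch (the sentences are illustrative examples, not real benchmark items), the same 15-word constraint passes in English and fails on a faithful Spanish translation of the same content:

```python
# Minimal demonstration of a translation artifact in a benchmark: the same
# 15-word limit is easier to satisfy in English than in Spanish, because
# Spanish typically needs more words for the same content.

def within_word_limit(text: str, limit: int = 15) -> bool:
    return len(text.split()) <= limit

english = ("Please summarize the quarterly report and highlight "
           "the three biggest risks for the team")
spanish = ("Por favor, resume el informe trimestral y destaca "
           "los tres riesgos más grandes para el equipo")

print(len(english.split()), within_word_limit(english))  # 14 True
print(len(spanish.split()), within_word_limit(spanish))  # 16 False
```

A benchmark built this way penalizes the Spanish model run for an artifact of translation, not a real quality difference.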

The LLM-as-Judge pattern widely used in production evaluation has a specific failure mode here. Deployed evaluation models fail to identify quality drops in more than 50% of cases when the evaluated content is in a non-English language. The judge was trained on English performance data; its calibration for "this answer is wrong" breaks down for Arabic or Thai outputs. You get green evals and production failures simultaneously.

Safety evaluations have the same problem. Guardrails and alignment mechanisms trained on English data exhibit degraded performance in other languages, with gaps that correlate directly to how underrepresented the language was in training. A model that refuses harmful requests in English may not refuse them in Swahili—not because safety was intentionally disabled, but because the safety-relevant patterns simply weren't present in the training signal for that language.

The failure mode this creates is insidious. Outputs look plausible. Sentences are grammatically reasonable. The model didn't hallucinate nonsense; it produced something coherent that happens to be factually wrong, subtly biased, or missing a critical constraint. These failures don't trigger error rates or latency alerts. They surface as user trust erosion, support tickets, and churned accounts.

Per-Language Benchmarking as Infrastructure

Treating multilingual quality as an infrastructure problem rather than a feature changes how you approach it. The first step is measurement: you cannot fix what you cannot see.

The practical approach is adopting dedicated multilingual evaluation frameworks and running them on every supported language before and after any model change. MMLU-ProX covers 29 languages with over 11,000 questions per language, providing broad coverage for general knowledge and reasoning tasks. Microsoft's MEGA framework evaluates generative AI performance across a structured set of multilingual tasks. BenchMAX (2025) provides a comprehensive suite designed specifically for production evaluation of multilingual systems.

The critical insight from recent benchmark research is that translated test cases are insufficient. Performance on translated tests doesn't predict performance on natively authored content in the target language. Benchmarks built with native speakers—like MultiNRC and the natively constructed portions of MMLU-ProX—surface failure modes that translation-based evaluation misses entirely. If you support a language seriously, you need native-language test cases authored or reviewed by speakers of that language.

Setting language-specific quality thresholds is as important as the measurement itself. Don't accept "within 5% of English performance" as a blanket standard. Some tasks transfer well across languages; structured extraction and complex instruction following do not. Establish separate baselines for each language-task combination, and treat degradation below those baselines as a regression requiring a fix before shipping.
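A per-pair threshold check is a small piece of code to add to a release gate. This sketch assumes you already have benchmark scores from an eval run; the baseline values and task names are illustrative:

```python
# Sketch of a per-(language, task) quality gate. Baselines and scores
# below are illustrative, not recommended values.

BASELINES = {  # minimum acceptable accuracy per (language, task) pair
    ("en", "extraction"): 0.90,
    ("es", "extraction"): 0.85,
    ("ar", "extraction"): 0.70,
    ("ar", "summarization"): 0.75,
}

def regressions(scores: dict) -> list:
    """Return every (language, task) pair scoring below its own baseline."""
    return [pair for pair, score in scores.items()
            if score < BASELINES.get(pair, 1.0)]

run = {
    ("en", "extraction"): 0.93,
    ("es", "extraction"): 0.86,
    ("ar", "extraction"): 0.64,   # below the 0.70 Arabic baseline
    ("ar", "summarization"): 0.78,
}

failing = regressions(run)
print(failing)  # [('ar', 'extraction')]
```

Note the asymmetry this encodes: the English extraction score could improve in the same release that Arabic extraction regresses, and a single blended metric would hide it.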

Stratified sampling in production logging gives you ongoing signal. Log a percentage of non-English queries with associated quality signals—thumbs up/down, downstream task success, explicit user corrections. Disaggregate these by language. The goal is to know your actual per-language quality in production, not just at eval time.
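The logging side can be sketched in a few lines. This example assumes each logged interaction carries a detected language and a thumbs-up/down signal; the field names and sampling rates are illustrative:

```python
# Sketch: stratified sampling plus per-language quality aggregation.
# Field names ("lang", "thumbs_up") and rates are illustrative assumptions.

from collections import defaultdict

def sample_for_logging(query_id: int, lang: str) -> bool:
    """Deterministically oversample non-English traffic for quality review."""
    if lang == "en":
        return query_id % 100 == 0   # 1% of high-volume English traffic
    return query_id % 100 < 10       # 10% of everything else

def quality_by_language(logs) -> dict:
    """Aggregate thumbs-up rate per language."""
    totals = defaultdict(lambda: [0, 0])  # lang -> [positive, total]
    for entry in logs:
        totals[entry["lang"]][0] += entry["thumbs_up"]
        totals[entry["lang"]][1] += 1
    return {lang: pos / n for lang, (pos, n) in totals.items()}

logs = [
    {"lang": "en", "thumbs_up": 1}, {"lang": "en", "thumbs_up": 1},
    {"lang": "ar", "thumbs_up": 0}, {"lang": "ar", "thumbs_up": 1},
]
print(quality_by_language(logs))  # {'en': 1.0, 'ar': 0.5}
```

Oversampling minority languages matters: at natural traffic proportions, a low-volume language may produce too few logged examples to detect a regression at all.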

Language Routing and Fallback Strategies

Once you can measure the gap, the next question is what to do about it. For many teams, the most tractable immediate solution is routing: directing queries to the model best suited for the detected language.

Language routing infrastructure involves detecting the language of the incoming query, then selecting among available models based on historical performance data for that language-task combination. Trained routing frameworks like RouteLLM can reduce infrastructure cost by up to 85% while retaining 95% of high-capability model performance, by routing simple queries to cheaper models and complex or low-resource-language queries to the strongest available model.
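The core routing decision can be sketched as a lookup over historical performance: prefer the cheapest model that clears the quality bar for the detected language-task pair, and send everything else to the strongest model. Model names and scores here are made up for illustration:

```python
# Sketch of a performance-aware language router. The performance table,
# model names, and threshold are illustrative assumptions.

PERF = {  # historical accuracy per (language, task, model)
    ("en", "chat", "small-model"): 0.91,
    ("en", "chat", "large-model"): 0.94,
    ("ar", "chat", "small-model"): 0.58,
    ("ar", "chat", "large-model"): 0.81,
}
FALLBACK = "large-model"  # strongest available model

def route(lang: str, task: str, min_score: float = 0.75) -> str:
    # Prefer the cheapest model whose historical score clears the bar;
    # unknown languages get the strongest model by default.
    for model in ("small-model", "large-model"):  # ordered cheapest-first
        if PERF.get((lang, task, model), 0.0) >= min_score:
            return model
    return FALLBACK

print(route("en", "chat"))  # 'small-model'  (cheap model is good enough)
print(route("ar", "chat"))  # 'large-model'  (small model falls below bar)
print(route("sw", "chat"))  # 'large-model'  (no data -> strongest model)
```

This is the cost-saving mechanism in miniature: high-resource-language traffic rides the cheap model, while low-resource or unmeasured languages default to the most capable one.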

The router training matters. A router trained on generic benchmarks will generalize poorly to your specific task distribution. Routers trained on task-specific data that closely resembles the actual queries your users send—in the actual languages they send them in—are substantially more accurate. Preference-based training, where the router learns from human evaluations of model quality on your data, produces the best results.

For languages where even the strongest available model produces unacceptable quality, a translate-process-translate architecture can serve as a fallback. The query is translated to English, processed by the English-optimal model, and the output is translated back to the target language. This approach has real costs: translation introduces latency, each translation step loses fidelity, and the model's output may carry English-specific assumptions that don't translate. It's a fallback for when native performance is genuinely inadequate, not a substitute for native language quality.
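The fallback path is simple to express. In this sketch, `translate` and `run_model` are stand-ins for your translation service and LLM call—both hypothetical here, with string tags substituted so the control flow is visible:

```python
# Sketch of a translate-process-translate fallback. `translate` and
# `run_model` are hypothetical stand-ins; the string tags just make the
# two lossy translation hops visible.

def translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"   # replace with a real MT call

def run_model(prompt: str) -> str:
    return f"answer({prompt})"              # replace with a real LLM call

def answer_with_fallback(query: str, lang: str, native_ok: set) -> str:
    if lang in native_ok:
        return run_model(query)             # native path: no translation loss
    english_query = translate(query, lang, "en")
    english_answer = run_model(english_query)
    return translate(english_answer, "en", lang)  # second lossy hop

print(answer_with_fallback("hola", "es", {"en", "fr"}))
# [en->es] answer([es->en] hola)
```

The nested tags in the output make the trade-off explicit: every query routed this way passes through two translation steps, each of which can drop a constraint or a nuance.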

Language-specific fine-tuning is the higher-effort, higher-reward path. For supported languages that matter to your product, fine-tuning on native-language task data—even modest amounts of high-quality examples—can close a significant portion of the quality gap. Research consistently shows that data quality matters more than volume: 1,000 native-language examples authored by fluent speakers outperform 10,000 machine-translated examples. Parameter-efficient approaches like LoRA keep the fine-tuning cost manageable without requiring full model retraining.
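As a concrete starting point, a LoRA setup with the Hugging Face `peft` library is a short config fragment. The hyperparameters and target modules below are illustrative defaults, not tuned values, and the right `target_modules` depend on the base model's architecture:

```python
# Sketch of a parameter-efficient fine-tuning config using LoRA via the
# Hugging Face `peft` library. Values are illustrative starting points.

from peft import LoraConfig

config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,                         # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (model-dependent)
    task_type="CAUSAL_LM",
)
# get_peft_model(base_model, config) wraps the base model so that only the
# small LoRA adapters train while the full weights stay frozen.
```

Because only the adapters are trained, a per-language adapter can be maintained and swapped at inference time without storing a full model copy per language.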

What the 2025–2026 Model Landscape Gets Right and Wrong

The latest generation of models has made real progress on multilingual coverage. Qwen2.5 and Qwen3 now claim over 200 languages with expanded multilingual pretraining corpora. Gemini covers 140+ languages with native multimodal capability. LLaMA 3.1 extended strong multilingual instruction following across 100+ languages.

This progress is genuine but unevenly distributed. High-resource languages—Spanish, French, German, Japanese, Mandarin—have seen substantial improvements. The gap between these languages and English has narrowed. The gap for low-resource languages has not. African languages, Indigenous languages of the Americas, and regional languages with limited digital presence remain severely underserved. The best available models for these languages still perform at levels that would be unacceptable for English.

Tokenizer improvement is proceeding but slowly. Hybrid tokenization approaches and language-specific vocabulary expansion are showing promise in research, but the commercially deployed models most teams use still carry the token-count penalty for non-Latin scripts. When evaluating model cost for a multilingual product, factor in that some languages will bill at substantially higher per-task rates due to tokenizer inefficiency alone.

Evaluation quality is improving faster than model quality in this area. Native-speaker benchmarking is becoming standard practice. Safety evaluation is getting multilingual coverage. But production deployments still lag significantly—most teams running AI products haven't updated their eval infrastructure to match the pace of benchmark development.

The Operational Checklist

If you ship a multilingual AI product, the minimum viable configuration for responsible operation is:

  • Audit your eval coverage. Explicitly list every language your product supports and verify you have evaluation data for each. "We support Spanish" is a claim about quality, not just language detection.
  • Run per-language benchmarks before major model changes. Model upgrades that improve English performance sometimes degrade minority-language performance. Catch this before it reaches users.
  • Monitor production quality by language. Disaggregate your quality metrics. If you're not tracking success rates, user corrections, and error rates per language, you're flying blind.
  • Set language-specific thresholds. Different tasks degrade at different rates across languages. Treat each language-task pair as its own quality SLO.
  • Implement language routing. Even a simple rule-based router that sends low-resource language queries to your strongest model is better than treating all queries identically.
  • Document your language-quality commitments. If a language is in beta or has degraded quality, tell users. Silent degradation is a trust problem; acknowledged limitations are manageable.

The teams that get this right treat multilingual quality as a first-class engineering concern from the start—not an afterthought addressed when support tickets appear. The measurement infrastructure, routing logic, and per-language evaluation are built alongside the English baseline, not retrofitted after launch. In a world where most AI products launch English-first and never catch up for other languages, that discipline is a genuine differentiator.
