
Cross-Lingual Hallucination: Why Your LLM Lies More in Languages It Knows Less

· 9 min read
Tian Pan
Software Engineer

Your model scores 92% on your evaluation suite. Your French-speaking users complain constantly that it makes things up. Both of these facts can be true at the same time — and the gap between them is a structural problem in how multilingual AI systems are built and measured.

LLMs hallucinate 15–35% more frequently in non-English languages than in English. In low-resource languages like Swahili or Yoruba, the gap widens to performance deficits of up to 38 percentage points on the same factual questions. Yet most teams ship multilingual AI features with a single English-language eval suite, report aggregate benchmark scores that average away the problem, and only discover the damage when users in Paris or Mumbai start filing support tickets.

The cross-lingual hallucination problem is not primarily a model quality problem. It is a measurement and architectural failure that teams perpetuate by treating multilingual AI as "English AI with translation bolted on."

Why LLMs Know Less About the World in Your Language

The root cause is pretraining data imbalance, and the numbers are stark. Llama 3's training corpus is approximately 95% English and code, leaving roughly 5% for every other language on Earth. GPT-4 processes German and Italian text at around 50% higher cost than English — not because these are hard languages, but because the tokenizer was optimized for English, making the same semantic content take more tokens in other languages.

For genuinely low-resource languages, the situation is extreme. Processing Dzongkha or Odia can require 12 times as many tokens as equivalent English content. Burmese and Tibetan require byte-level representations roughly 4x longer than Chinese. Ukrainian, with its rich morphology, produces a higher token-to-word ratio than any Latin-script language.

This matters because token budgets are finite. A model that spends 12 tokens to represent what English would express in 1 token is processing fewer semantic units per inference call. Fewer examples in pretraining, fewer semantic units at inference — the model has genuinely learned less about the world as expressed in that language, and fills gaps with plausible-sounding confabulation.

The alignment data problem compounds this. Safety guardrails, instruction-following training, and factuality-oriented fine-tuning are overwhelmingly English-centric. The behaviors teams carefully tune into models — "say 'I don't know' when uncertain," "don't fabricate statistics" — were tuned in English. Research on safety in low-resource languages consistently finds that guardrails weaken sharply when inputs switch to code-mixed Hindi-English or low-resource African languages. The model learned the behavior; it didn't learn to generalize it across linguistic contexts.

The Benchmark Masking Problem

Here is how teams convince themselves they don't have a cross-lingual hallucination problem: they run MMLU or a similar benchmark, average across all languages, and report the result.

MMLU-ProX, a 29-language version of MMLU released in 2025, shows what's hiding inside that average. The same questions asked across languages reveal performance gaps of 24.3 percentage points between high- and low-resource languages. Translation artifacts account for 30–60% of failures in Spanish MMLU alone — improper handling of proper names, mistranslated technical terms, missing cultural context. Manual correction by native speakers recovers up to 63% of failed items in some categories.

The Mu-SHROOM benchmark (SemEval 2025) assessed hallucination detection across 10 languages including Arabic, German, Hindi, and Mandarin Chinese. The per-language results varied substantially despite all languages being "represented" in the training data. A model that looks acceptable in English and French can be generating harmful fabrications in Hindi and Arabic at rates the aggregate score never reveals.

The failure mode here is not just academic. When a company reports "our model achieves 89% accuracy on multilingual factual QA," that number may conceal 95% accuracy in English, 88% in Spanish, 74% in Hindi, and 61% in Swahili. Teams making product decisions based on the headline number are building on false ground.
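The masking effect is plain arithmetic. Here is a minimal sketch with hypothetical per-language accuracies and traffic shares (the numbers are illustrative, not from any real eval) showing how an English-dominated mix produces a reassuring headline while the worst language sits far below it:

```python
# Hypothetical per-language (accuracy, share of eval traffic); shares sum to 1.0.
per_language = {
    "en": (0.95, 0.65),
    "es": (0.88, 0.15),
    "hi": (0.74, 0.12),
    "sw": (0.61, 0.08),
}

# Traffic-weighted headline number -- dominated by the English slice.
aggregate = sum(acc * share for acc, share in per_language.values())

# The number the headline hides: the weakest language.
worst_lang, (worst_acc, _) = min(per_language.items(), key=lambda kv: kv[1][0])

print(f"headline accuracy: {aggregate:.1%}")      # ~88.7%
print(f"worst language: {worst_lang} at {worst_acc:.1%}")
```

A dashboard that reports only `aggregate` will never surface the 27-point spread between English and Swahili in this toy mix.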

The Specific Ways Cross-Lingual Hallucination Manifests

Not all hallucination types distribute evenly across languages. Understanding the specific failure modes helps prioritize detection and mitigation.

Entity confusion is the most common. A model that reliably answers "who invented the telephone" correctly in English may flip to a plausible-but-wrong answer in a language where fewer examples of that entity relationship exist in training. Transliterated personal names are a particular vulnerability — the model may have seen "Graham Bell" in English thousands of times but encountered the phonetic transliteration in Devanagari only dozens of times, creating a weak knowledge anchor that degrades under inference pressure.

Fabricated statistics appear disproportionately in low-resource language generation. When a model needs a supporting number and the language doesn't have a well-represented knowledge base to draw from, it generates plausible-sounding figures. The numbers are often in the right order of magnitude and grammatically correct — making them hard to catch without independent verification.

Cross-modal compounding was documented in the CCHall benchmark (ACL 2025), which tested vision-language models across 9 languages. Models correctly identified objects in images when generating in English but hallucinated properties — colors, sizes, attributes — when generating the same description in lower-resource languages. The visual grounding remained intact; the language-specific generation process introduced fabricated detail. A model might correctly describe an image as "a chair" in English but produce "a red wooden chair" in a language where it had fewer examples of constrained visual description.

Safety guardrail bypass deserves mention as a production risk distinct from factual hallucination. Toxicity filters, refusal behaviors, and harmful content detection that work reliably in English can fail in low-resource languages or code-mixed inputs. Teams shipping content moderation built on English-tuned classifiers are exposed to attacks that simply switch languages.

Per-Language Quality Auditing in Practice

The fix starts with measurement. Teams that only run English evals are not measuring what their multilingual users experience. Building per-language quality auditing into your eval pipeline is not optional once you ship to non-English markets.

The most effective approach is native-speaker annotation rather than machine translation. The BenchMAX benchmark demonstrates the quality difference: it uses 3 independent native-speaker annotators per sample across 16 languages. The effort is higher, but the signal is honest. Machine-translated eval sets import translation artifacts that contaminate the measurement — your scores reflect how well the model handles translated questions, not how well it handles naturally-expressed questions in that language.
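The mechanics of multi-annotator aggregation are simple enough to sketch. This is a toy majority-vote scheme over three independent labels per sample (the label names and the adjudication threshold are illustrative, not the BenchMAX protocol itself):

```python
from collections import Counter

def aggregate_labels(annotations):
    """Majority vote over independent native-speaker labels for one sample.

    Returns (label, agreement, needs_adjudication); ties or weak majorities
    get flagged for a human adjudicator instead of silently picking a winner.
    """
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(annotations)
    needs_adjudication = agreement <= 0.5
    return label, agreement, needs_adjudication

# Three annotators per sample, as in BenchMAX-style protocols.
label, agreement, review = aggregate_labels(
    ["hallucinated", "hallucinated", "faithful"]
)
```

With three annotators, a 2-of-3 majority yields a usable label with a recorded agreement score, which you can later use to weight or filter low-confidence eval items.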

For production systems that can't afford full annotation, cross-language consistency checking is a practical middle ground. Ask the same factual question in each target language. Compare outputs for logical consistency. If your model says one thing in French and the opposite in Arabic about the same entity, that inconsistency is a hallucination signal even without ground truth labels. The AlignX framework formalizes this into entity-level consistency scoring across languages.
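The consistency check can be sketched in a few lines. This is a simplified illustration, not the AlignX implementation: `ask` stands in for your model call, and the trivial string normalization here would, in a real system, map surface forms to canonical entity IDs before comparison:

```python
def consistency_check(ask, question_by_lang, normalize=str.strip):
    """Ask the same factual question in each language and flag disagreement.

    `ask(lang, question)` is a stand-in for a model call. Real systems would
    normalize answers to canonical entity IDs rather than raw strings.
    """
    answers = {lang: normalize(ask(lang, q)) for lang, q in question_by_lang.items()}
    consistent = len(set(answers.values())) == 1
    return consistent, answers

# Mock model that disagrees in Arabic -- a hallucination signal
# even without ground-truth labels.
responses = {
    "en": "Alexander Graham Bell",
    "fr": "Alexander Graham Bell",
    "ar": "Thomas Edison",
}
ok, answers = consistency_check(
    lambda lang, q: responses[lang],
    {lang: "Who invented the telephone?" for lang in responses},
)
```

Any language whose answer breaks the consensus becomes a candidate for review; no ground truth is needed to raise the flag.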

Automated multilingual hallucination detection tools have matured enough for production use. LettuceDetect supports hallucination detection in 7 languages (English, German, French, Spanish, Italian, Polish, Chinese) with a lightweight implementation suitable for inline inference pipelines. HaluAgent uses an autonomous multi-stage approach — sentence segmentation, tool-based verification, reflective reasoning with external sources — that scales to arbitrary languages where verification tools exist.
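The segment-then-verify pattern these tools share is worth seeing in skeletal form. The sketch below is not the LettuceDetect or HaluAgent API — it is a toy version of the pattern, where `verify` stands in for a retrieval- or tool-based checker:

```python
def detect_hallucinations(
    answer,
    verify,
    split=lambda t: [s.strip() for s in t.split(".") if s.strip()],
):
    """Toy multi-stage detection: segment into sentences, verify each against
    an external source, and return the unsupported claims.

    `verify(sentence)` is a stand-in for a real retrieval/tool-based checker;
    the naive period-split is a placeholder for proper sentence segmentation.
    """
    return [s for s in split(answer) if not verify(s)]

# Toy "knowledge base" for verification.
facts = {"Paris is the capital of France"}

flagged = detect_hallucinations(
    "Paris is the capital of France. The Eiffel Tower is 900 meters tall",
    verify=lambda s: s in facts,
)
```

The fabricated statistic is exactly the kind of claim that slips past fluency-based review: grammatical, plausible order of magnitude, and wrong.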

Production Architecture Patterns That Reduce Cross-Lingual Hallucination

Language-aware RAG with separate knowledge bases. The single most effective architectural mitigation is grounding generation in retrieved documents — but this only works if you maintain language-appropriate knowledge bases, not a single English-dominant index. A French query hitting an English knowledge base is already degraded before generation starts. MEGA-RAG, a multi-evidence guided architecture, achieves 40%+ hallucination reduction by grounding generation in language-matched retrieved content and using an additional refinement pass to reconcile conflicting retrieved evidence.
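The routing logic for language-matched retrieval is straightforward; the subtlety is recording when you fall back. A minimal sketch (the index callables and fallback policy are hypothetical, not MEGA-RAG's implementation):

```python
def retrieve(query, lang, indexes, fallback_lang="en"):
    """Route retrieval to a language-matched knowledge base.

    `indexes` maps language code -> a search callable. The dominant-language
    index is a last resort, and the returned language tag records whether
    the degraded path was taken, so it can be monitored.
    """
    index = indexes.get(lang)
    if index is not None:
        docs = index(query)
        if docs:
            return docs, lang
    # Degraded path: English-dominant fallback, flagged via the language tag.
    return indexes[fallback_lang](query), fallback_lang

# Toy indexes standing in for real per-language vector stores.
indexes = {
    "en": lambda q: ["en_doc_1"],
    "fr": lambda q: ["fr_doc_1"],
}
docs, used = retrieve("politique de retour", "fr", indexes)
```

Tracking the fallback rate per language tells you which knowledge bases need investment before hallucination metrics do.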

Per-language instruction sets. Language-specific system prompts matter more than most teams expect. A hedge instruction like "if you're uncertain, say so" tuned in English may not transfer reliably to languages where the alignment training didn't strongly reinforce the same behavior. For high-stakes multilingual deployments, maintain separate system prompts per language, explicitly test uncertainty expression in each language, and validate that refusal behaviors fire correctly under out-of-distribution inputs.
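In practice this is often just a prompt table with a loud failure mode for unvalidated languages. A sketch with hypothetical prompts (the French and Spanish instructions are hand-written equivalents of the English hedge, not machine translations):

```python
# Hypothetical per-language system prompts. Each uncertainty instruction is
# written and validated natively, not translated at request time.
SYSTEM_PROMPTS = {
    "en": "If you are uncertain, say so explicitly instead of guessing.",
    "fr": "Si vous n'êtes pas sûr, dites-le explicitement au lieu d'inventer une réponse.",
    "es": "Si no estás seguro, dilo explícitamente en lugar de inventar una respuesta.",
}

def system_prompt(lang):
    """Fail loudly for languages without a validated prompt rather than
    silently reusing the English one."""
    if lang not in SYSTEM_PROMPTS:
        raise KeyError(f"no validated system prompt for language {lang!r}")
    return SYSTEM_PROMPTS[lang]
```

The `KeyError` is deliberate: silently falling back to the English prompt is exactly the "English AI with translation bolted on" failure this post is about.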

Language-aware routing. Different models have materially different multilingual strengths. A routing layer that selects model and configuration based on the request language — rather than applying a single frontier model to everything — can improve accuracy while controlling cost. For high-resource European languages, frontier models perform well. For low-resource languages with limited training representation, specialized models or translation-to-English pipelines with domain-appropriate routing may outperform direct generation. Semantic routing combined with cost-aware fallback can reduce infrastructure costs 30–40% without accuracy loss on high-resource language queries.
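A routing layer can be as simple as a language-set lookup in front of your model registry. This sketch uses hypothetical route names and a hard-coded high-resource set; a production router would draw both from per-language eval results:

```python
# Languages where direct frontier-model generation is trusted (illustrative set).
HIGH_RESOURCE = {"en", "fr", "de", "es", "it", "pt"}

def route(lang, models):
    """Select a model/config per request language.

    `models` maps route names to backends. 'translate_pipeline' stands in for
    a translate-to-English, answer, translate-back path for low-resource
    languages; both names are placeholders.
    """
    name = "frontier" if lang in HIGH_RESOURCE else "translate_pipeline"
    return name, models[name]

models = {
    "frontier": "frontier-model",
    "translate_pipeline": "translate-then-answer",
}
```

The decision boundary (which languages go direct versus through translation) should come from your per-language eval results, not from intuition about which languages "seem" well supported.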

Temperature discipline per language. Lower sampling temperatures (0.1–0.3) reduce hallucination rates across all languages, but the effect is larger in low-resource contexts where the model's uncertainty is higher. For factual, constrained generation tasks in low-resource languages, greedy or near-greedy decoding is usually the right starting point. Reserve higher temperatures for creative tasks in well-represented languages where you can absorb some variance.
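Encoding that policy as configuration keeps it auditable. The thresholds and language set below are illustrative starting points, not tuned values:

```python
def sampling_params(lang, task):
    """Hypothetical decoding policy: near-greedy for factual tasks, with the
    lowest temperatures reserved for low-resource languages where model
    uncertainty -- and hallucination risk -- is highest.
    """
    high_resource = lang in {"en", "fr", "de", "es", "it"}
    if task == "factual":
        return {"temperature": 0.2 if high_resource else 0.1, "top_p": 0.9}
    # Creative tasks: allow variance only where the language is well represented.
    return {"temperature": 0.8 if high_resource else 0.3, "top_p": 0.95}
```

A table like this also gives you one place to tighten decoding for a language when its per-language eval scores slip.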

Batch-wise multilingual alignment. For teams doing fine-tuning, the most effective approach to improving non-English accuracy without degrading English is batch-wise alignment — constructing fine-tuning batches from semantically equivalent examples across languages rather than single-language batches. Research shows this improves non-English accuracy up to 23.9% without English regression.
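The batching itself can be sketched simply. This toy version assumes you already have a parallel corpus of semantically equivalent examples keyed by language; the batch layout is illustrative, not the exact construction from the cited research:

```python
import random

def multilingual_batches(parallel_corpus, batch_size, seed=0):
    """Yield fine-tuning batches built from semantically equivalent examples.

    `parallel_corpus` is a list of dicts mapping language -> example text for
    the same underlying content. Each batch contains every language version of
    `batch_size` items, so each gradient step sees aligned meanings rather
    than a single-language slice.
    """
    rng = random.Random(seed)  # deterministic shuffling for reproducibility
    items = list(parallel_corpus)
    rng.shuffle(items)
    for i in range(0, len(items), batch_size):
        chunk = items[i:i + batch_size]
        # Flatten: one training example per (item, language) pair.
        yield [(lang, text) for item in chunk for lang, text in item.items()]

corpus = [
    {"en": "The capital of France is Paris.", "fr": "La capitale de la France est Paris."},
    {"en": "Water boils at 100°C.", "fr": "L'eau bout à 100°C."},
]
batches = list(multilingual_batches(corpus, batch_size=1))
```

Keeping the language versions of an item in the same batch is the whole trick: the optimizer sees the same meaning expressed in multiple languages within a single step.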

The Measurement Gap Is the Product Gap

The cross-lingual hallucination problem is solvable at the architectural level, but most teams never solve it because they don't measure it. An English-only eval suite plus aggregate multilingual benchmark scores creates the appearance of parity where none exists.

The practical starting point is choosing your three or four most-used non-English languages, building small but native-speaker-annotated eval sets for each, and tracking per-language metrics alongside aggregate scores. The first time you run this, you will find gaps you didn't know you had. Those gaps are what your non-English users are experiencing right now.

The second step is treating language as an architectural variable — routing, knowledge bases, system prompts, and temperature settings — rather than a translation problem. Translation assumes the model knows the same things in all languages. It doesn't. Architecture that accounts for differential knowledge depth per language is what actually bridges the gap between what the benchmark says and what users report.


Cross-lingual hallucination is not a frontier research problem. The tools to measure it exist. The architectural patterns to reduce it are documented. What's missing in most production systems is the decision to treat non-English quality as a first-class concern rather than a localization detail.
