
Your System Prompts Are Still in English: The Silent Cost of Incomplete AI Localization

8 min read
Tian Pan
Software Engineer

Your team ships an AI feature. You celebrate the localization work: every button label, tooltip, and error message has been translated into twelve languages. The product manager signs off. The feature goes live globally.

Then, six weeks later, a user in Germany posts a screenshot. The AI's response has the right words but wrong register — awkward formality for a casual support context. A Japanese user reports that structured outputs contain dates formatted as MM/DD/YYYY, confusing their downstream tooling. A Brazilian support engineer notices the AI occasionally slips into English mid-sentence when reasoning through complex queries. These aren't infrastructure failures. Your dashboards show green. But for non-English users, the product is quietly worse.

The root cause is almost always the same: teams translate UI strings but leave system prompts in English. It feels like localization. It isn't.

Why System Prompts Are Not Neutral Text

A system prompt is not a configuration file. It is the reasoning scaffold the model operates inside. It establishes role, tone, output structure, constraints, and the implicit cultural contract between the model and the user. When that scaffold is written in English and the user is interacting in German, Japanese, or Arabic, the model faces a mismatch that English performance numbers will never surface.

Multilingual LLMs do not process all languages symmetrically. Internally, most major models default to an English-like representation space even when handling non-English inputs — they convert to English latent representations for reasoning, then translate back to the target language for output. This produces output that is grammatically correct but pragmatically off. Formality markers don't transfer cleanly. Politeness conventions differ. The phrase "be concise and direct" means something specific in English business writing; in Japanese, where indirectness is often a professional norm, the instruction generates responses that sound blunt to native speakers.

MMLU-ProX benchmarks across 29 languages reveal gaps as large as 38 points between English and Swahili performance on identical questions. Even high-resource languages like Greek and Arabic show 20–40% lower accuracy than English on complex reasoning tasks. These numbers reflect what happens at the model level when languages are treated as equivalent inputs. They also reflect what happens to your users when you ship a system prompt that assumes the model is always reasoning in English.

The Three Places Where Incomplete Localization Breaks Silently

Formality and register drift. German business contexts require formal address; a carelessly translated system prompt that defaults to the informal "du" will generate outputs that feel condescending. Spanish varies dramatically by region — a single prompt won't span Spain and Mexico without adjustment. English system prompts that say "be warm and approachable" embed English pragmatics. What "warm" means in a Japanese customer service context versus a Brazilian one versus a British one is not the same, and the model is not inferring cultural norms from the user's language — it's following the English-language instructions you gave it.

Research on prompt politeness across languages confirms that optimal politeness levels differ by language, not just culture. A study on cross-lingual prompt steerability found that English-language system prompts consistently produce higher bias in demographic descriptions, and that the gap widens with model size. Larger models trained on more English data amplify the English-centric defaults embedded in the system prompt.

Structured output format mismatches. Dates are the clearest signal. A model trained predominantly on English text defaults to MM/DD/YYYY. If your system prompt doesn't explicitly pin a format such as ISO 8601 (YYYY-MM-DD), the model generates locally ambiguous dates: October 3rd is written 10/03 under the American convention and 03/10 under the European one, and either is wrong if your downstream consumer expects a single canonical format.

Number separators follow the same pattern: 1,234.56 in English notation versus 1.234,56 in German notation. Currency placement varies. These aren't cosmetic issues — structured outputs consumed by code fail when the format is wrong. And they fail specifically for non-English locales, which means the failure is invisible in your English test coverage.
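To make the mismatch concrete, here is a minimal sketch using the third-party Babel library (an illustrative choice, not something this workflow requires) that renders the same date and number under different locale conventions:

```python
# pip install babel
from datetime import date

from babel.dates import format_date
from babel.numbers import format_decimal

d = date(2025, 10, 3)
for locale in ("en_US", "de_DE", "ja_JP"):
    # The same underlying values render differently under each locale's conventions.
    print(locale, format_date(d, format="short", locale=locale),
          format_decimal(1234.56, locale=locale))

# en_US 10/3/25    1,234.56
# de_DE 03.10.25   1.234,56
# ja_JP 2025/10/03 1,234.56
```

If the model emits pre-formatted strings rather than raw values, these are exactly the variants your prompt has to pin down.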

Domain vocabulary gaps. Specialized terminology often lacks clean translations, and LLMs fill the gap with literal translations or phonetic borrowings rather than domain-appropriate equivalents. A legal AI assistant using English-trained legal terminology will default to direct word-for-word translations that sound foreign to native-language legal practitioners. A medical AI in French may use anglicized technical terms where French clinical vocabulary exists. The users notice. The metrics don't.

Tokenization Makes the Problem Structural

Before the model even begins reasoning, tokenization imposes a penalty on non-English users that compounds every other problem. Languages using non-Latin scripts — Arabic, Chinese, Japanese, Korean, Hindi — require 2–15× more tokens than equivalent English text due to English-heavy tokenizer training data. A context window that holds a complete conversation in English may truncate the same conversation in Arabic, stripping the model of crucial context.

Ukrainian, Hindi, and official Indian languages show severe tokenization inefficiency. A researcher summarized this starkly: tokenization is "killing the multilingual LLM dream." This isn't a prompt engineering problem — it's infrastructure. But it interacts with prompt engineering decisions. A system prompt that is 300 tokens in English might be 800 tokens in Japanese, consuming context budget that should go to user content. Teams optimizing context window usage for English don't notice the asymmetry until production traffic in other languages arrives.
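You can measure this asymmetry directly for your own prompts. A quick sketch with the tiktoken tokenizer (counts will differ across model families and tokenizers):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models

samples = {
    "en": "Please summarize the customer's issue in two sentences.",
    "ja": "お客様の問題を2文で要約してください。",
    "ar": "يرجى تلخيص مشكلة العميل في جملتين.",
}

for lang, text in samples.items():
    # Token counts, not character counts, are what consume the context budget.
    print(lang, len(enc.encode(text)))
```

Running the same check on each localized system prompt shows how much context budget each language actually leaves for user content.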

The Eval Gap: Nobody Owns Multilingual Quality

Here is why these failures accumulate: the team that owns system prompts is the AI/ML team. The team that owns translations is the localization team. The team that owns evals is data science. No one is chartered to own the intersection — the quality of AI outputs for non-English users. So multilingual eval doesn't happen.

The research community has produced frameworks that close this gap. MEGA (Multilingual Evaluation of Generative AI, from Microsoft) covers 16 NLP datasets across 70 languages. MMLU-ProX runs 11,829 identical questions across 29 languages for direct comparison. BenchMAX covers 17 languages across 10 tasks. CIA Suite (Cross-Lingual Auto Evaluation) provides language-specific evaluator LLMs and ready test sets. These tools exist. They are not commonly used in production eval pipelines.

The organizational pattern is consistent: teams run evals in English (and maybe one or two additional languages), establish baselines, and monitor aggregate metrics. When a prompt change degrades output quality for Turkish or Indonesian users, no alert fires — because no per-language quality metric was established. The degradation is invisible until a user screenshots it.

Multi-tenant agent evaluation across 52 languages found that top-performing models achieved only 34% average accuracy across all languages, with English at 57% and low-resource languages below 10%. Teams shipping AI features globally are implicitly accepting this gap by not measuring it.

What Localization-Aware Prompt Design Actually Looks Like

The fix is not complicated. It is organizational — it requires someone to own it.

Localize the entire system prompt, not just the user-facing surface. "You are a helpful customer service agent" is not a neutral instruction. Translate it into the target language and adapt the implied role and tone. In German: "Sie sind ein hilfsbereiter Kundendienst-Mitarbeiter" with formal address. In Japanese, the cultural role of a service agent carries different formality and indirectness conventions that need explicit encoding.
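In practice this means maintaining one prompt per locale, written or reviewed by native speakers rather than machine-translated from the English original. A minimal sketch, with illustrative prompts and fallback policy:

```python
# Hypothetical per-locale prompt registry; entries are illustrative.
SYSTEM_PROMPTS = {
    "en-US": "You are a helpful customer service agent. Be concise and direct.",
    # "You are a helpful customer service agent. Use formal address throughout."
    "de-DE": ("Sie sind ein hilfsbereiter Kundendienst-Mitarbeiter. "
              "Verwenden Sie durchgehend die formelle Anrede."),
    # "You are a courteous customer service agent. Use polite (keigo) register."
    "ja-JP": "あなたは丁寧なカスタマーサービス担当者です。敬語を使ってください。",
}

def system_prompt_for(locale: str) -> str:
    # Fall back to English explicitly rather than silently mixing registers.
    return SYSTEM_PROMPTS.get(locale, SYSTEM_PROMPTS["en-US"])
```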

Enforce locale-specific structured output constraints explicitly. Don't assume the model infers date format from the user's language. State it: "Format all dates as YYYY-MM-DD regardless of locale" or "Use the following date format: [DD.MM.YYYY] for German users." Same for number separators, currency display, and measurement units. Constrained decoding at inference time adds a second enforcement layer when schema compliance is critical.
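That second layer can be as simple as validating structured outputs before anything downstream consumes them. A sketch using Pydantic (one validator among many that would work here):

```python
# pip install pydantic  (v2 API)
import re

from pydantic import BaseModel, field_validator

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

class Invoice(BaseModel):
    due_date: str
    amount: float  # keep numbers numeric; apply locale formatting at the UI layer

    @field_validator("due_date")
    @classmethod
    def date_must_be_iso(cls, v: str) -> str:
        if not ISO_DATE.match(v):
            raise ValueError(f"expected YYYY-MM-DD, got {v!r}")
        return v

# A German-formatted date from the model fails fast instead of corrupting data:
# Invoice(due_date="03.10.2025", amount=1234.56) raises a ValidationError.
```

Keeping numeric fields numeric and formatting at the edge sidesteps the separator problem entirely.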

Maintain language-segmented eval baselines. Run quality evals per language, not in aggregate. Establish baseline accuracy, structured output compliance rate, and hallucination rate per language independently. Alert on per-language drift, not just overall metrics. The overhead is real but proportional to the risk of shipping silently degraded products to non-English users.
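The monitoring itself needs no heavy machinery; a per-language drift check can be a few lines. A sketch with illustrative baselines and thresholds:

```python
# Baselines from your last audited eval run; languages and values are illustrative.
BASELINES = {"en": 0.91, "de": 0.84, "ja": 0.78, "tr": 0.71}
DRIFT_TOLERANCE = 0.05  # alert if accuracy falls more than 5 points below baseline

def check_drift(current: dict[str, float]) -> list[str]:
    alerts = []
    for lang, baseline in BASELINES.items():
        score = current.get(lang)
        if score is None:
            # Missing data is itself a finding: this language has no eval coverage.
            alerts.append(f"{lang}: no eval results this run")
        elif baseline - score > DRIFT_TOLERANCE:
            alerts.append(f"{lang}: accuracy {score:.2f} vs baseline {baseline:.2f}")
    return alerts

print(check_drift({"en": 0.92, "de": 0.76, "ja": 0.79}))
# ['de: accuracy 0.76 vs baseline 0.84', 'tr: no eval results this run']
```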

Use few-shot examples in the target language for low-resource languages. Few-shot prompting with culturally appropriate, domain-relevant examples in the user's language closes performance gaps that prompt translation alone cannot. This is especially effective for languages where zero-shot performance is substantially below English.
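Structurally, the exemplars just live alongside the locale's system prompt. A sketch with an illustrative Brazilian Portuguese exemplar:

```python
# Hypothetical per-locale few-shot store; exemplars are illustrative placeholders.
FEW_SHOT = {
    "pt-BR": [
        # "How do I get a duplicate of my invoice?"
        {"role": "user", "content": "Como emito a segunda via da fatura?"},
        # "You can generate a duplicate under the Invoices tab ..."
        {"role": "assistant", "content": "Você pode gerar a segunda via na aba Faturas ..."},
    ],
}

def build_messages(locale: str, system_prompt: str, user_message: str) -> list[dict]:
    messages = [{"role": "system", "content": system_prompt}]
    # Target-language exemplars anchor register and vocabulary, not just format.
    messages.extend(FEW_SHOT.get(locale, []))
    messages.append({"role": "user", "content": user_message})
    return messages
```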

Treat reasoning language as a variable, not a constant. For mid-resource languages, English chain-of-thought pivoting sometimes improves accuracy. For high-resource and low-resource languages, native-language reasoning typically performs on par with or better than English pivoting. There is no universal answer — benchmark per language family for your specific task domain.
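Benchmarking this means treating the reasoning-language instruction as one more variant in your eval harness. A sketch of the shape, with hypothetical stand-ins for the harness itself:

```python
# Reasoning-language variants for a German task; wording is illustrative.
COT_VARIANTS = {
    # "Think step by step in German before you answer."
    "native": "Denke Schritt für Schritt auf Deutsch nach, bevor du antwortest.",
    "english_pivot": "Reason step by step in English, then answer in German.",
}

GERMAN_QUESTIONS: list[str] = []  # your per-language eval set goes here

def run_eval(system_suffix: str, questions: list[str]) -> float:
    """Hypothetical stand-in: send each question with the suffixed system
    prompt to your model and score the answers."""
    raise NotImplementedError("wire this up to your model client and scorer")

for name, instruction in COT_VARIANTS.items():
    print(name, run_eval(system_suffix=instruction, questions=GERMAN_QUESTIONS))
```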

The Organizational Fix Is Cleaner Than the Technical Fix

The technical patterns above are well-understood. The organizational gap is the harder problem. Someone needs to own multilingual AI quality the same way someone owns English AI quality — with eval suites, monitoring dashboards, and incident response when per-language quality drops.

Localization teams know language. AI teams know prompts. Neither group owns the intersection by default. The fix is to explicitly charter a responsibility — whether that is a dedicated role, a shared SLA between teams, or a recurring audit cadence — for multilingual prompt quality and eval coverage.

Until that ownership exists, teams will keep translating UI strings, leaving system prompts in English, and discovering the degradation six weeks post-launch when a user takes a screenshot. The infrastructure metrics will stay green the whole time.
