Skip to main content

The Localized System Prompt Your Model Performs Worse Against Than the English Original

· 11 min read
Tian Pan
Software Engineer

Your English system prompt took six weeks to tune. A staff engineer rewrote the constraint list four times, the eval suite finally cleared 94% on the held-out task set, and the launch checklist green-lit it for production. Then the i18n team picked it up, ran it through the same translation pipeline that handles button labels and tooltips, and shipped the Japanese, German, Hindi, and Arabic variants the next sprint. The launch dashboard for non-English markets shows the same task volume, the same user funnel, and — until a support ticket from a Tokyo customer surfaces six months later — the same green status.

The Tokyo customer's complaint is that the agent ignored an instruction the English prompt explicitly forbids. You re-read the Japanese prompt and it says the same thing, semantically. You re-run the English eval suite against the English variant and it passes. There is no eval suite for the Japanese variant. There never was.

This is the architecture failure under the surface of every multilingual LLM product: the system prompt was localized as content, but it functions as conditioning. The eval suite measures the English contract. The translation pipeline measures string equivalence. Between them, the actual behavioral surface of the product in non-English markets is unmeasured and drifting.

A system prompt is not a translatable string

The translation pipeline that handles your UI is calibrated for a different artifact. Button labels, tooltips, and error messages are content the user reads — fidelity is preserved if the translated string conveys the same meaning to a human in the target locale. Translators have decades of conventions for this, professional QA, and a feedback loop that closes the moment a user complains the wording is wrong.

A system prompt is none of those things. It is a behavioral contract addressed to a model whose response to it depends on three layers the i18n team does not measure: the tokenizer's segmentation of the source string, the model's instruction-following calibration in that language, and the latent space alignment between the prompt's directives and the user's input.

The research literature has begun to quantify each of those layers separately. M-IFEval — the multilingual extension of IFEval — adapts instruction-following evaluation to French, Japanese, and Spanish and finds that the same model, scored against the same verifiable constraint, lands in materially different bands across the languages. XIFBench extends the picture to six languages with 558 instructions across five constraint categories (content, style, situation, format, numerical) and reports a systematic disparity between high-resource and low-resource targets. Marco-Bench-MIF takes it to 30 languages and confirms the gap widens as you walk away from English.

The numbers vary by paper, model, and constraint class, but the headline is consistent: an instruction the model honors 90% of the time when the system prompt is in English drops by 8–22 percentage points when the system prompt is re-expressed in a non-English language, even when the user-facing task and the model are held constant.

The localization team is not staffed to debug this

The translation pipeline runs on string-equivalence QA: a native speaker reads the translated string and confirms it conveys the original meaning. That QA passes on a translated system prompt — by construction, because the translator did their job. What the QA does not catch is that "convey the same meaning to a human reviewer" and "elicit the same instruction-following behavior from a model" are uncorrelated tasks.

The failure modes that escape this QA are quiet ones. The Japanese variant of "do not provide medical advice" parses correctly to a human reviewer, but the model's refusal calibration on the Japanese phrasing is weaker than on the English original, so the model issues advisory-sounding text the English variant suppresses. The German variant of "always cite the source document by section number" is structurally faithful, but the model's format-constraint adherence on German prompts is materially worse than on English, so the agent cites half the time. The Arabic variant of a numerical constraint is grammatically correct, but the model's numerical-constraint compliance — already the weakest constraint class even in English — collapses further.

The localization team can verify the string. They cannot verify the behavior. The team that can verify the behavior — the model-engineering team that wrote the English prompt and the eval suite that gates it — was not informed that 13 sibling variants were released into production.

English-centric reasoning is a property of the substrate

Why the gap exists at all is worth understanding, because it tells you which mitigations are available. The frontier models are trained on corpora dominated by English text, with multilingual data sampled less heavily and concentrated in a small number of high-resource languages. The model's internal reasoning trajectory — the chain of activations between prompt and output — leans toward English regardless of the surface language of the prompt. Cross-lingual analyses report that low-resource languages form shallow, isolated clusters in the latent space rather than mapping cleanly into a shared semantic geometry.

The practical consequence is that the model does a partial translation of your non-English prompt into its internal English-aligned representation before reasoning over it. The translation step is lossy. The losses concentrate on exactly the surfaces a system prompt is supposed to control: constraint adherence, refusal calibration, format fidelity, instruction priority. If the source-language phrasing in your prompt translates ambiguously to the model's internal representation, the constraint loses force; if the translation is grammatically valid but rare in the model's training distribution, the constraint loses salience.

This is why translating the prompt does not solve the problem — and often makes it worse. The original English prompt was already in the model's strongest representational language. Re-expressing it in a weaker representational language adds an internal re-translation step the model performs on every request. You are paying for an alignment loss in exchange for a property — surface-language match — that the model does not require to operate.

"Translate the prompt" and "translate the UI string" are different engineering decisions

Once you see the system prompt as model-conditioning rather than user-facing content, a different set of design moves opens up. The decision tree below replaces the one your i18n pipeline is following by default.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates