The Multilingual Token Tax: What Building AI for Non-English Users Actually Costs
Your product roadmap says "expand to Japan and Brazil." Your finance model says the LLM API line item is $X per month. Both of those numbers are wrong, and you won't discover it until the international rollout is weeks away.
Tokenization — the step that turns user text into integers your model can process — is profoundly biased toward English. A sentence in Japanese might require 2–8× as many tokens as the same sentence in English. That multiplier feeds directly into API costs, context window headroom, and response latency. Teams that model their AI budget on English benchmarks and then flip on a language flag are routinely surprised by bills 3–5× higher than expected.
The surprise compounds because it isn't just cost. Multilingual prompt engineering fails in ways that don't show up in your existing evals. Safety guardrails weaken in languages other than English. Benchmark scores that look great in aggregate hide performance gaps of 20+ percentage points for lower-resource languages. This post covers the mechanics of all three problems and the engineering patterns that actually address them.
The Token Fertility Problem
Modern LLM tokenizers — whether BPE, SentencePiece, or WordPiece variants — are trained predominantly on English text. Their vocabularies are optimized to encode English efficiently: roughly 4 characters per token for typical prose. Non-English languages pay a tax, measured as "fertility": tokens per word, relative to English. You can measure it yourself; see the sketch after the list below.
The numbers are stark:
- Korean: ~2.36× tokens per word versus English
- Japanese: ~2.12× tokens per word, with individual sentences reaching 8×
- Chinese (Mandarin): ~1.76× tokens per word, but roughly 1 token per character versus English's ~4 characters per token
- Arabic, Hindi, Burmese: 3–4× is common; morphologically complex agglutinative languages can reach higher
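Reproducing these multipliers on your own text takes a few lines. Here is a minimal sketch using tiktoken's o200k_base encoding (the GPT-4o tokenizer); the sample sentences are illustrative, not a rigorous parallel corpus:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's encoding; swap in your model's

# Parallel sentences (illustrative only; measure on your real traffic).
samples = {
    "English":  "The package will arrive tomorrow afternoon.",
    "Korean":   "소포는 내일 오후에 도착할 예정입니다.",
    "Japanese": "荷物は明日の午後に到着する予定です。",
}

baseline = len(enc.encode(samples["English"]))
for lang, text in samples.items():
    n = len(enc.encode(text))
    print(f"{lang:>8}: {n:3d} tokens  "
          f"{n / baseline:.2f}x English  "
          f"{len(text) / n:.2f} chars/token")
```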
The economic consequence is direct: if your system prompt and RAG context clock in at 50,000 tokens in English, the same content in Korean costs ~118,000 tokens. At GPT-4o pricing, that's roughly a 2.36× cost increase before you've generated a single output token. If your users are sending multi-turn conversations with substantial context, you may be eating 3–4× your English infrastructure cost per request.
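That arithmetic belongs in your cost model, not a spreadsheet footnote. A sketch with a measured fertility multiplier as a parameter; the per-token rates here are placeholders, not real pricing, so substitute your provider's current rate card:

```python
# Illustrative per-million-token rates; substitute your provider's rate card.
INPUT_USD_PER_M = 2.50
OUTPUT_USD_PER_M = 10.00

def monthly_llm_cost(input_tokens: int, output_tokens: int,
                     requests_per_month: int, fertility: float = 1.0) -> float:
    """Scale an English cost estimate by a measured fertility multiplier."""
    per_request = (input_tokens * INPUT_USD_PER_M
                   + output_tokens * OUTPUT_USD_PER_M) * fertility / 1_000_000
    return per_request * requests_per_month

print(monthly_llm_cost(50_000, 800, 100_000))                  # English baseline
print(monthly_llm_cost(50_000, 800, 100_000, fertility=2.36))  # Korean: ~2.36x
```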
Fertility also squeezes context windows. A 128K-token context that can fit 30,000 words of English comfortably might only fit 12,000–15,000 words of Korean conversation. For tasks that require large contexts — RAG pipelines, long document analysis, multi-turn support sessions — you run out of window much faster.
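A rough word-budget helper makes the squeeze concrete. The constants here are assumptions chosen to match the figures above: ~1.3 tokens per English word, and 70% of the window held back for output, system prompt, and conversation history:

```python
def word_budget(window_tokens: int, fertility: float,
                en_tokens_per_word: float = 1.3, headroom: float = 0.7) -> int:
    """Words of input context that fit, holding back `headroom` of the
    window for output, system prompt, and conversation history."""
    return int(window_tokens * (1 - headroom) / (en_tokens_per_word * fertility))

print(word_budget(128_000, fertility=1.0))   # ~29,500 English words
print(word_budget(128_000, fertility=2.36))  # ~12,500 Korean words
```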
The Double-Cost Rule
Fertility doesn't scale linearly into cost. The "Token Tax" research shows that doubling token fertility leads to roughly a 4× increase in training cost and inference latency. This happens because attention is quadratic in sequence length — longer sequences aren't just more tokens, they're more computation per token.
At production inference scale (not training), the immediate impact is API pricing: you pay per input and output token, so 2× fertility = 2× API bill, roughly. But the latency impact is more nuanced. Output generation is sequential, and more output tokens mean proportionally more time. For a 200-word response in Japanese that requires 400 tokens to encode versus 200 for English, your time-to-complete-response doubles even if time-to-first-token doesn't.
This matters for latency SLOs. If your English pipeline is comfortably within a p95 of 2 seconds, your Japanese pipeline — even against the same model, same hardware, same provider — may routinely exceed it on long responses. Capacity planning done on English benchmarks will underestimate GPU-hours needed for non-English workloads.
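A naive decode-time model shows how the same SLO splits by language. The time-to-first-token and decode throughput below are assumptions for illustration; substitute measurements from your own provider:

```python
def completion_time(ttft_s: float, output_tokens: int,
                    decode_tokens_per_s: float) -> float:
    """Naive latency model: time to first token, then sequential decoding."""
    return ttft_s + output_tokens / decode_tokens_per_s

SLO_S = 2.0
# Same 200-word answer: ~200 output tokens in English, ~400 in Japanese.
for lang, tokens in [("English", 200), ("Japanese", 400)]:
    t = completion_time(ttft_s=0.4, output_tokens=tokens, decode_tokens_per_s=150)
    print(f"{lang}: {t:.1f}s ({'within' if t <= SLO_S else 'exceeds'} {SLO_S}s SLO)")
```

With these assumed figures, the English response lands around 1.7 s and the Japanese one around 3.1 s against the same model and hardware, which is exactly the p95 split described above.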
Why Your Evals Don't Catch This
The standard multilingual evaluation story is: "we ran MMLU in 10 languages and got 82% accuracy." This is misleading in two ways.
First, most multilingual benchmarks are machine translations of English tasks. Translation preserves factual content but strips cultural context, changes idiomatic difficulty, and introduces artifacts. The INCLUDE benchmark (2025), which used native-language questions from local exams, found that models perform worse on native content than on translated equivalents — meaning translated benchmarks systematically overstate real-world performance for non-English users.
Second, aggregate scores hide per-language variance. The MMLU-ProX benchmark (2025) across 29 languages shows up to a 24.3 percentage-point gap between high-resource and low-resource languages. A model that scores 85% on English and 78% on German might score 62% on Swahili. Your single aggregate number doesn't tell you that your Swahili users are getting wrong answers more than one-third of the time.
The eval methodology you need is not "run your English eval suite against translated inputs." It is:
- Source native-language test cases from domain experts or local exam corpora
- Benchmark tokenization fertility with your actual production content distribution, not generic text (see the sketch after this list)
- Test safety and refusal behavior independently for each language; guardrails fail in a different mode than accuracy does, as the next section describes
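For the second item, a minimal sketch of measuring fertility on real traffic rather than generic text. The helper is hypothetical, and o200k_base is assumed as the encoding:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def production_fertility(samples_by_lang: dict[str, list[str]],
                         baseline: str = "en") -> dict[str, float]:
    """Token fertility per language, measured on production samples.

    samples_by_lang maps a language code to texts drawn from actual
    traffic (prompts, support tickets, retrieved documents), ideally
    meaning-aligned across languages. Must include the baseline key.
    """
    totals = {lang: sum(len(enc.encode(t)) for t in texts)
              for lang, texts in samples_by_lang.items()}
    return {lang: n / totals[baseline] for lang, n in totals.items()}
```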
The Safety Alignment Gap
This is the most operationally dangerous failure mode, and the least discussed.
Safety alignment in LLMs is trained primarily on English data. Guardrails that reliably block harmful output in English degrade measurably in other languages. Research on multilingual LLM safety (2025) finds that toxicity is consistently higher when models are prompted in non-English languages, and that models produce harmful content in those languages that would be filtered in English.
The mechanism is well understood: RLHF and Constitutional AI processes use predominantly English human feedback. When a model receives a jailbreak attempt in Korean, the request falls largely outside the distribution its refusal training covered, and refusal behavior is less reliable.
Asia-Pacific languages (Korean, Malay, Indonesian, Thai, Vietnamese) show particularly high rates of safety bypass through non-English prompts. If your AI feature operates in these regions, you need a language-specific red team exercise — not an English red team exercise with translated prompts.
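A sketch of what that exercise's harness might look like. Everything here is hypothetical scaffolding: `ask` wraps whatever model call you use, `is_refusal` should be a real classifier (keyword lists undercount polite partial compliance), and the prompt sets must be natively authored per language, not translated:

```python
from typing import Callable

def refusal_rates(ask: Callable[[str], str],
                  prompts_by_lang: dict[str, list[str]],
                  is_refusal: Callable[[str, str], bool]) -> dict[str, float]:
    """Per-language refusal rate over natively authored red-team prompts.

    A healthy result is uniformly high refusal across languages; a gap
    between English and any other language is the safety alignment gap
    made measurable.
    """
    return {
        lang: sum(is_refusal(lang, ask(p)) for p in prompts) / len(prompts)
        for lang, prompts in prompts_by_lang.items()
    }
```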
The operational implication: your English safety evaluation does not cover your non-English users. This is a compliance problem as much as a product quality problem.
References
- https://arxiv.org/html/2509.05486v1
- https://arxiv.org/pdf/2305.15425
- https://tonybaloney.github.io/posts/cjk-chinese-japanese-korean-llm-ai-best-practices.html
- https://digitalorientalist.com/2025/02/04/to-merge-or-not-to-merge-the-pitfalls-of-chinese-tokenization-in-general-purpose-llms/
- https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1538165/full
- https://arxiv.org/html/2503.10497v1
- https://arxiv.org/html/2505.17784
- https://latitude.so/blog/multilingual-prompt-engineering-for-semantic-alignment
- https://arxiv.org/html/2505.24119v1
- https://arxiv.org/html/2505.13141v1
