
The Multilingual Token Tax: What Building AI for Non-English Users Actually Costs

· 11 min read
Tian Pan
Software Engineer

Your product roadmap says "expand to Japan and Brazil." Your finance model says the LLM API line item is $X per month. Both of those numbers are wrong, and you won't discover it until the international rollout is weeks away.

Tokenization — the step that turns user text into integers your model can process — is profoundly biased toward English. A sentence in Japanese might require 2–8× as many tokens as the same sentence in English. That multiplier feeds directly into API costs, context window headroom, and response latency. Teams that model their AI budget on English benchmarks and then flip on a language flag are routinely surprised by bills 3–5× higher than expected.

The surprise compounds because it isn't just cost. Multilingual prompt engineering fails in ways that don't show up in your existing evals. Safety guardrails weaken in languages other than English. Benchmark scores that look great in aggregate hide performance gaps of 20+ percentage points for lower-resource languages. This post covers the mechanics of all three problems and the engineering patterns that actually address them.

The Token Fertility Problem

Modern LLM tokenizers — whether BPE, SentencePiece, or WordPiece variants — are trained predominantly on English text. Their vocabulary is optimized to encode English efficiently: roughly 4 characters per token for typical prose. Non-English languages pay a tax called "fertility," measured as tokens per word.

The numbers are stark:

  • Korean: ~2.36× tokens per word versus English
  • Japanese: ~2.12× tokens per word, with individual sentences reaching 8×
  • Chinese (Mandarin): ~1.76× tokens per word, but roughly 1 token per character versus English's ~4 characters per token
  • Arabic, Hindi, Burmese: 3–4× is common; morphologically complex agglutinative languages can reach higher

The economic consequence is direct: if your system prompt and RAG context clock in at 50,000 tokens in English, the same content in Korean costs ~118,000 tokens. At GPT-4o pricing, that's roughly a 2.36× cost increase before you've generated a single output token. If your users are sending multi-turn conversations with substantial context, you may be eating 3–4× your English infrastructure cost per request.
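The multiplier math above is easy to sanity-check in a few lines. A minimal sketch, assuming the illustrative fertility figures from the list above and a placeholder per-million-token price (substitute your provider's actual rate):

```python
# Rough per-request input-cost scaling under token fertility.
# Fertility multipliers are the illustrative figures from the table above;
# PRICE_PER_M_INPUT is a placeholder, not any provider's actual pricing.

FERTILITY = {"en": 1.0, "ko": 2.36, "ja": 2.12, "zh": 1.76}
PRICE_PER_M_INPUT = 2.50  # USD per 1M input tokens (hypothetical)

def request_cost(english_tokens: int, lang: str) -> float:
    """Estimated input cost for the same content rendered in `lang`."""
    tokens = english_tokens * FERTILITY[lang]
    return tokens / 1_000_000 * PRICE_PER_M_INPUT

english = request_cost(50_000, "en")  # the 50K-token English baseline
korean = request_cost(50_000, "ko")   # same content in Korean (~118K tokens)
print(f"multiplier: {korean / english:.2f}x")
```

The same function generalizes to output tokens and per-model rates; the point is that the multiplier is a property of the language-tokenizer pair, not of your prompt design.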

Fertility also squeezes context windows. A 128K-token context that can fit 30,000 words of English comfortably might only fit 12,000–15,000 words of Korean conversation. For tasks that require large contexts — RAG pipelines, long document analysis, multi-turn support sessions — you run out of window much faster.

The Double-Cost Rule

Fertility doesn't scale linearly into cost. The "Token Tax" research shows that doubling token fertility leads to roughly a 4× increase in training cost and inference latency. This happens because attention is quadratic in sequence length — longer sequences aren't just more tokens, they're more computation per token.

At production inference scale (not training), the immediate impact is API pricing: you pay per input and output token, so 2× fertility = 2× API bill, roughly. But the latency impact is more nuanced. Output generation is sequential, and more output tokens mean proportionally more time. For a 200-word response in Japanese that requires 400 tokens to encode versus 200 for English, your time-to-complete-response doubles even if time-to-first-token doesn't.

This matters for latency SLOs. If your English pipeline is comfortably within a p95 of 2 seconds, your Japanese pipeline — even against the same model, same hardware, same provider — may routinely exceed it on long responses. Capacity planning done on English benchmarks will underestimate GPU-hours needed for non-English workloads.
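One way to sanity-check SLO headroom before committing: since decode is sequential, completion time grows roughly linearly with output tokens. A sketch under assumed timing figures (time-to-first-token and per-token decode time here are illustrative, not measured from any particular provider):

```python
# Rough completion-time estimate: time-to-first-token plus sequential decode.
# ttft_s and s_per_token are illustrative defaults; measure your own stack.

def completion_time_s(output_tokens: int,
                      ttft_s: float = 0.4,
                      s_per_token: float = 0.02) -> float:
    """Estimated wall-clock time for a full response."""
    return ttft_s + output_tokens * s_per_token

english = completion_time_s(200)   # 200-token English answer
japanese = completion_time_s(400)  # same answer, ~2x output tokens in Japanese
print(f"english: {english:.1f}s, japanese: {japanese:.1f}s")
```

Plug in your measured time-to-first-token and per-token decode latency, then check the result against your p95 target for each locale's typical output length.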

Why Your Evals Don't Catch This

The standard multilingual evaluation story is: "we ran MMLU in 10 languages and got 82% accuracy." This is misleading in two directions.

First, most multilingual benchmarks are machine translations of English tasks. Translation preserves factual content but strips cultural context, changes idiomatic difficulty, and introduces artifacts. The INCLUDE benchmark (2025), which used native-language questions from local exams, found that models perform worse on native content than on translated equivalents — meaning translated benchmarks systematically overstate real-world performance for non-English users.

Second, aggregate scores hide per-language variance. The MMLU-ProX benchmark (2025) across 29 languages shows up to a 24.3 percentage-point gap between high-resource and low-resource languages. A model that scores 85% on English and 78% on German might score 62% on Swahili. Your single aggregate number doesn't tell you that your Swahili users are getting wrong answers more than one-third of the time.

The eval methodology you need is not "run your English eval suite against translated inputs." It is:

  1. Source native-language test cases from domain experts or local exam corpora
  2. Benchmark tokenization fertility with your actual production content distribution (not generic text)
  3. Test safety and refusal behavior independently for each language — safety guardrails are a different failure mode than accuracy, described next
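Step 2 can be a small harness around whatever tokenizer your production stack actually uses — the `tokenize` callable below is a pluggable stand-in, not a specific library. One caveat: tokens-per-word is ill-defined for unsegmented scripts like Japanese and Chinese, so for those, compare tokens-per-request on parallel content instead.

```python
# Fertility = tokens per word, measured relative to an English baseline.
# `tokenize` is pluggable: pass your production tokenizer's encode function.

from typing import Callable, Iterable, List

def fertility(samples: Iterable[str],
              tokenize: Callable[[str], List]) -> float:
    """Average tokens per whitespace-delimited word across samples."""
    tokens = words = 0
    for text in samples:
        tokens += len(tokenize(text))
        words += max(len(text.split()), 1)
    return tokens / words

def relative_fertility(locale_samples, english_samples, tokenize) -> float:
    """Cost multiplier: locale fertility over English fertility."""
    return (fertility(locale_samples, tokenize)
            / fertility(english_samples, tokenize))

# Demo with a toy character-level tokenizer (stand-in for a real BPE encoder):
toy = lambda s: list(s.replace(" ", ""))
print(relative_fertility(["안녕하세요 세계"], ["hello world"], toy))
```

Run this over ~1,000 real production samples per locale, not generic text — domain jargon, product names, and markup shift fertility noticeably.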

The Safety Alignment Gap

This is the most operationally dangerous failure mode, and the least discussed.

Safety alignment in LLMs is trained primarily on English data. Guardrails that reliably block harmful output in English degrade measurably in other languages. Research on multilingual LLM safety (2025) finds that toxicity is consistently higher when models are prompted in non-English languages — and that models produce harmful content in non-English that would be filtered in English.

The mechanism is well-understood: RLHF and Constitutional AI processes use predominantly English human feedback. When a model receives a jailbreak attempt in Korean, it maps the semantics through a lower-confidence pathway, and refusal behavior is less reliable.

Asia-Pacific languages (Korean, Malay, Indonesian, Thai, Vietnamese) show particularly high rates of safety bypass through non-English prompts. If your AI feature operates in these regions, you need a language-specific red team exercise — not an English red team exercise with translated prompts.

The operational implication: your English safety evaluation does not cover your non-English users. This is a compliance problem as much as a product quality problem.

Prompt Engineering Breaks Silently

Beyond tokenization costs and safety gaps, multilingual prompt engineering has its own failure modes that don't appear in English-only development.

Internal English reasoning. Models with heavy English training continue to reason internally in English even when prompted in another language. The implicit pipeline becomes: translate input from Japanese → reason in English → translate output to Japanese. This introduces translation errors on both ends, and the back-translation of nuanced reasoning is often degraded. Complex multi-step problems suffer most.

Instruction-following drift. Models systematically follow instructions less precisely in non-Latin script languages. If your English prompt says "respond in exactly three bullet points," the Japanese version may produce four bullets, prose paragraphs, or switch to a list format. Formatting instructions require explicit testing in each target language.

Idiomatic prompt patterns don't transfer. Prompt engineering tricks that work in English — chain-of-thought phrasing, role-setting language, few-shot example structure — often fail to produce the same effects when translated. A translated few-shot example that looks correct to a non-speaker may use idioms or constructs that change how the model interprets the instruction. Language-specific prompt examples, written by native speakers, consistently outperform translated English prompts.

Code-mixing and language confusion. Models inadvertently blend languages in output, especially when the system prompt is English but the user input is not. For multilingual support applications where code-switching is common (users who write in Spanglish, or switch mid-conversation), this becomes a latent source of low-quality responses.

Engineering Patterns That Work

Several patterns address these problems in production.

Language-specific system prompts, not translations. The most impactful change is writing system prompts natively in each target language with examples written by native speakers. Translated English prompts are not equivalent. Teams that have made this change — including Duolingo for their AI-assisted language content — report meaningful accuracy improvements over translated prompts.

Prompt cache your multilingual assets. Static content — terminology glossaries, behavioral instructions, domain knowledge chunks — should be in prompt cache. For applications with large stable system prompts, prompt caching can reduce costs 60–90% on cached portions. For non-English systems where the system prompt is large (because you need more examples to overcome the translation-reasoning gap), this is the most reliable cost lever.
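The cache arithmetic is worth running for your own prompt sizes. A sketch with a placeholder discount factor — providers price cached reads differently (some at roughly 10% of the normal input rate), so substitute your provider's actual terms:

```python
# Effective input cost with a prompt cache: cached tokens are billed at a
# discounted read rate. CACHE_DISCOUNT is a placeholder, not any provider's
# published pricing.

CACHE_DISCOUNT = 0.10  # cached tokens billed at 10% of the normal rate

def effective_input_tokens(cached: int, uncached: int) -> float:
    """Billable-equivalent input tokens per request."""
    return cached * CACHE_DISCOUNT + uncached

no_cache = effective_input_tokens(0, 120_000)
cached = effective_input_tokens(100_000, 20_000)  # large stable prefix cached
print(f"savings: {1 - cached / no_cache:.0%}")
```

At these assumed numbers the stable prefix drops billable input by 75% — and because non-English prompts are larger to begin with, the absolute savings are larger too.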

Separate your eval suite by language. Maintain a per-language eval harness with native test cases. Run it on every model version bump and every major prompt change. Aggregate multilingual scores are nearly useless as regression detectors — a change that improves English 3% while degrading Korean 8% looks neutral in an aggregate view.
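A minimal per-language regression gate makes the masking problem concrete. The scores and traffic weights below are hypothetical, chosen to mirror the +3%/−8% example above:

```python
# Traffic-weighted aggregates can look flat (or improved) while individual
# languages regress. A per-language gate catches what the average hides.
# All scores and weights are hypothetical.

def regressions(before: dict, after: dict, tolerance: float = 0.02) -> list:
    """Languages whose score dropped by more than `tolerance`."""
    return [lang for lang in before
            if before[lang] - after.get(lang, 0.0) > tolerance]

before = {"en": 0.82, "ko": 0.74}
after = {"en": 0.85, "ko": 0.66}   # English +3pts, Korean -8pts
weights = {"en": 0.9, "ko": 0.1}   # English-dominant traffic mix

agg = lambda scores: sum(scores[l] * weights[l] for l in scores)
print(f"aggregate delta: {agg(after) - agg(before):+.3f}")  # looks positive
print(regressions(before, after))
```

Here the traffic-weighted aggregate actually improves while Korean drops eight points — exactly the regression an aggregate dashboard would wave through.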

Model selection matters more for multilingual than for English. Among closed models, Gemini 2.0 has the lowest reported token fertility (SentencePiece with a large vocabulary, optimized for non-English). GPT-4o's o200k_base tokenizer is a significant improvement over earlier GPT-4 tokenizers. Claude's tokenizer has the highest reported fertility among major closed models. For CJK-heavy applications, the model choice has a direct 30–50% impact on API costs — this is a procurement decision, not just a quality decision.

For open-source deployments, Llama 3.1 shows the best tokenization efficiency. The community has also built language-specific vocabulary extensions for Chinese and Arabic that further improve fertility for those scripts.

Routing by language at the infrastructure layer. At sufficient scale, it makes sense to route by language, not just by task. A Japanese query for a creative writing task may perform better on a different model than an English query for the same task. Language-specific routing also lets you apply different safety policies, different prompt templates, and different cost models without tangling them in application logic.
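At its simplest, the router is a lookup at the gateway. A sketch — the model identifiers, template names, and policy names below are hypothetical placeholders, and language detection itself is assumed to happen upstream:

```python
# Route by detected language: model, prompt template, and safety policy are
# chosen per language. All identifiers are hypothetical placeholders.

from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    model: str
    prompt_template: str
    safety_policy: str

ROUTES = {
    "ja": Route("model-cjk-tuned", "ja_native_v3", "apac_strict"),
    "ko": Route("model-cjk-tuned", "ko_native_v2", "apac_strict"),
    "en": Route("model-general", "en_v7", "default"),
}
FALLBACK = ROUTES["en"]

def route_for(lang_code: str) -> Route:
    """Pick the per-language model/prompt/safety config, defaulting to English."""
    return ROUTES.get(lang_code, FALLBACK)

print(route_for("ja").model)
```

Keeping the table at the infrastructure layer means a new locale is a config change — new prompt template, stricter safety policy, different model — rather than branching logic scattered through application code.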

For high-stakes use cases, reasoning models narrow the gap. DeepSeek-style and o1-style reasoning models reduce the English-to-non-English performance gap by 8–12 percentage points on complex reasoning tasks, cutting the disparity for African languages nearly in half in some evaluations. The tradeoff is latency and cost — reasoning models are slower and more expensive — but for domains where accuracy in a lower-resource language matters (medical, legal, financial), the gap reduction may be worth the overhead.

What You Actually Need to Measure

Before shipping to non-English markets, run these instrumentation checks:

  1. Token fertility audit. Sample 1,000 real user inputs from each target locale. Measure average tokens per request. Divide by your English baseline. This is your cost multiplier. If you don't have real user data yet, synthetic benchmarks will underestimate it.

  2. Per-language accuracy on native tasks. Don't use translated English test cases. If you can't source native cases, use MMLU-ProX or INCLUDE for orientation, but note that those are still general-knowledge benchmarks — your domain may be worse.

  3. Safety bypass testing. For each language, try the top 20 English jailbreak patterns in that language. Run them against your application. Document the refusal rate. If it's materially lower than English, you have a compliance exposure.

  4. Latency profiling by language. Token-for-token latency is roughly constant, but if your Japanese responses require 2.5× more output tokens to convey equivalent information, your p95 latency will be 2.5× higher. Profile before your SLO commitments.

  5. Prompt instruction compliance. For each critical behavioral instruction in your prompt (formatting, length, structure), test whether it is followed correctly in each target language.
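The five checks above roll up naturally into a single pre-launch gate per locale. A sketch — the thresholds here are illustrative and should come from your own SLOs, budget model, and compliance requirements:

```python
# Pre-launch gate per locale: the five measurements above, with
# illustrative thresholds. Tune every threshold to your own requirements.

from dataclasses import dataclass

@dataclass
class LocaleAudit:
    cost_multiplier: float         # check 1: fertility vs English baseline
    native_accuracy: float         # check 2: accuracy on native test cases
    refusal_rate: float            # check 3: jailbreak refusal rate
    p95_latency_s: float           # check 4: profiled p95 latency
    instruction_compliance: float  # check 5: formatting/length adherence

def launch_blockers(a: LocaleAudit, en_refusal_rate: float) -> list:
    issues = []
    if a.cost_multiplier > 3.0:
        issues.append("cost")      # budget model needs rework
    if a.native_accuracy < 0.70:
        issues.append("accuracy")
    if a.refusal_rate < en_refusal_rate - 0.05:
        issues.append("safety")    # compliance exposure
    if a.p95_latency_s > 4.0:
        issues.append("latency")
    if a.instruction_compliance < 0.90:
        issues.append("prompting")
    return issues

ko = LocaleAudit(2.4, 0.76, 0.88, 3.1, 0.93)
print(launch_blockers(ko, en_refusal_rate=0.97))  # -> ['safety']
```

An empty list is the launch criterion; anything else is a named workstream with an owner before the locale ships.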

The Summary Math

If you've budgeted your AI infrastructure costs based on English API benchmarks:

  • Korean or Japanese users: expect 2–2.5× your English cost per request
  • Arabic or Hindi users: expect 2.5–4× your English cost per request
  • High-fertility languages (Burmese, Tibetan, some African languages): 4–8× is possible

This is not a rounding error. It's the difference between a feature being economically viable in a market or not. It is also the difference between a product that works for your English QA team and one that fails for real users who speak a different language.

The teams that discover this in production rather than in planning are the ones who evaluated their multilingual AI product using English-language benchmarks, shipped, and then worked backward from a support queue full of non-English user complaints and a cost overrun they didn't model.

The multipliers are known. The failure modes are documented. The patterns to address them exist. The only remaining question is whether you measure them before or after launch.
