
7 posts tagged with "multilingual"


Translation Is Not Localization: The Cultural-Calibration Debt Your Multilingual AI Just Defaulted On

· 12 min read
Tian Pan
Software Engineer

A multilingual launch that ships English prompts translated into N languages, with an English eval set translated into the same N languages, has not shipped a multilingual product. It has shipped one product N times, and made all the failure modes invisible to its own dashboards. The system is fluent and culturally off-key, and the metric the team optimized — translation quality — is the wrong axis to measure what users are reacting to.

The visible defect on launch day is small. A Japanese user receives a reply that is grammatically correct and conspicuously curt. An Indonesian user notices the assistant is cheerfully direct in a register that reads as rude. A Korean user gets advice framed around individual choice when the prompt was about a family decision. None of these are translation bugs. They are cultural-register bugs that translation cannot fix and translated evals cannot detect.

Cross-Lingual Hallucination: Why Your LLM Lies More in Languages It Knows Less

· 9 min read
Tian Pan
Software Engineer

Your model scores 92% on your evaluation suite. Your French-speaking users complain constantly that it makes things up. Both of these facts can be true at the same time — and the gap between them is a structural problem in how multilingual AI systems are built and measured.

LLMs hallucinate 15–35% more frequently in non-English languages than in English. In low-resource languages like Swahili or Yoruba, the gap widens to a 38-point performance deficit on the same factual questions. Yet most teams ship multilingual AI features with a single English-language eval suite, report aggregate benchmark scores that average away the problem, and only discover the damage when users in Paris or Mumbai start filing support tickets.
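The averaging effect is easy to reproduce with made-up numbers. Here is a minimal sketch (the languages, accuracies, and traffic shares are hypothetical, not figures from the post) of how a traffic-weighted aggregate can look healthy while one language slice is failing badly:

```python
# Hypothetical per-language eval results: (factual accuracy, share of traffic).
eval_results = {
    "en": (0.92, 0.80),
    "fr": (0.78, 0.12),
    "hi": (0.65, 0.05),
    "sw": (0.54, 0.03),
}

# The traffic-weighted aggregate most dashboards report.
aggregate = sum(acc * share for acc, share in eval_results.values())
print(f"aggregate accuracy: {aggregate:.1%}")   # ~87.8% -- looks fine

# The per-language breakdown the same data already contains.
for lang, (acc, _) in eval_results.items():
    print(f"  {lang}: {acc:.0%}")               # sw is missing almost half its questions
```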

The cross-lingual hallucination problem is not primarily a model quality problem. It is a measurement and architectural failure that teams perpetuate by treating multilingual AI as "English AI with translation bolted on."

The Multilingual Quality Cliff: Why Your LLM Works Great in English and Quietly Fails Everyone Else

· 10 min read
Tian Pan
Software Engineer

Your LLM passes every eval you throw at it. Latency is solid, accuracy looks fine, and the team ships with confidence. Then a user in Cairo files a bug: the structured extraction returns malformed JSON. A developer in Seoul notices the assistant ignores complex instructions after a few turns. A product manager in Mumbai realizes the chatbot's summarization is just wrong: subtly, consistently wrong.

None of this showed up in your benchmarks because your benchmarks are in English.

This is the multilingual quality cliff: a performance drop that is steep, systematic, and almost universally invisible to teams that ship AI products. The gap isn't marginal. In long multi-turn conversations, Arabic and Korean users see accuracy around 40.8% on tasks where English users are at 54.8%, a 14-point gap that compounds with every additional turn. For structured editing tasks the gap becomes catastrophic: 32–37% accuracy against English performance that remains acceptable. The users feel this. Your dashboards don't.
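To see why a per-turn gap compounds, here is a back-of-envelope sketch. It treats the quoted accuracies as independent per-turn success probabilities, which is our simplifying assumption rather than the benchmark's methodology:

```python
# If every turn must succeed for the conversation to stay on track, the
# cumulative success rate is the per-turn rate raised to the number of turns.
p_en, p_ar_ko = 0.548, 0.408   # per-turn accuracies quoted above

for turns in (1, 3, 5, 8):
    en = p_en ** turns
    other = p_ar_ko ** turns
    print(f"{turns} turns: en={en:.1%}  ar/ko={other:.1%}  ratio={en / other:.1f}x")
# The absolute numbers fall for everyone, but the English-vs-Arabic/Korean
# ratio grows from about 1.3x at one turn to roughly 10x by eight turns.
```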

The Multilingual Token Tax: What Building AI for Non-English Users Actually Costs

· 11 min read
Tian Pan
Software Engineer

Your product roadmap says "expand to Japan and Brazil." Your finance model says the LLM API line item is $X per month. Both are wrong, and you won't discover it until the international rollout is weeks away.

Tokenization — the step that turns user text into integers your model can process — is profoundly biased toward English. A sentence in Japanese might require 2–8× as many tokens as the same sentence in English. That multiplier feeds directly into API costs, context window headroom, and response latency. Teams that model their AI budget on English benchmarks and then flip on a language flag are routinely surprised by bills 3–5× higher than expected.
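If you want to see the multiplier on your own strings before the finance model locks in, a few lines with a tokenizer library will do it. This sketch assumes OpenAI's tiktoken and its cl100k_base encoding; the sample sentences and the per-token price are illustrative, not figures from the post:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# The "same" request, once in English and once in Japanese.
samples = {
    "en": "Please summarize this document and list the three main action items.",
    "ja": "この文書を要約して、主なアクションアイテムを3つ挙げてください。",
}

counts = {lang: len(enc.encode(text)) for lang, text in samples.items()}
print(counts, f"ja/en multiplier: {counts['ja'] / counts['en']:.1f}x")

# The same multiplier lands directly on the bill and the context budget.
price_per_1k_tokens = 0.01  # illustrative input price, USD
for lang, n in counts.items():
    print(f"{lang}: {n} tokens, ${n / 1000 * price_per_1k_tokens:.4f} per request")
```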

Tokenizer Blindspots That Break Production LLM Systems

· 10 min read
Tian Pan
Software Engineer

Most engineers who build on LLMs eventually learn the rough conversion: one token is about 0.75 English words, so a 4,000-token context window fits roughly 3,000 words. That number is fine for back-of-napkin estimates when your input is casual English prose. It is quietly wrong everywhere else — and "everywhere else" turns out to be most of the interesting production workloads.

Token miscalculations don't fail loudly. They show up as cost overruns that don't match any line item, as context windows that silently truncate the last few paragraphs of a document, or as multilingual pipelines that work fine in English testing and go 4× over budget the first week they hit real traffic. By the time you trace the issue back to tokenization, the damage is done.
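The cheapest fix is to stop estimating and start counting. A minimal guard, again assuming tiktoken and with placeholder budget numbers, makes the miscalculation loud instead of silent:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_WINDOW = 4_000        # placeholder model limit, in tokens
RESERVED_FOR_OUTPUT = 1_000   # placeholder headroom for the completion

def check_prompt_budget(prompt: str) -> int:
    """Return the real token count, raising before anything gets truncated."""
    budget = CONTEXT_WINDOW - RESERVED_FOR_OUTPUT
    n_tokens = len(enc.encode(prompt))
    # The 0.75-words-per-token rule of thumb, kept here for comparison only.
    heuristic = int(len(prompt.split()) / 0.75)
    if n_tokens > budget:
        raise ValueError(
            f"prompt is {n_tokens} tokens (word-count heuristic said ~{heuristic}), "
            f"over the {budget}-token budget"
        )
    return n_tokens
```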

Prompt Localization Debt: The Silent Quality Tiers Hiding in Your Multilingual AI Product

· 9 min read
Tian Pan
Software Engineer

Your AI feature shipped with a 91% task success rate. You ran evals, iterated on your prompt, and tuned it until it hit your quality bar. Then you launched globally — and three months later a user in Tokyo files a support ticket saying your AI "doesn't really understand" their input. Your Japanese users have been silently working around a feature that performs 15–20 percentage points worse than what your English users experience. Nobody on your team noticed because nobody was measuring it.

This is prompt localization debt: the accumulating gap between how well your AI performs in the language you built it for and every other language your users speak. It doesn't announce itself in dashboards. It doesn't cause outages. It just quietly creates second-class users.

Building Multilingual AI Products: The Quality Cliff Nobody Measures

· 11 min read
Tian Pan
Software Engineer

Your AI product scores 82% on your eval suite. You ship to 40 countries. Three months later, French and German users report quality similar to English. Hindi and Arabic users quietly stop using the feature. Your aggregate satisfaction score barely budges — because English-speaking users dominate the metric pool. The cliff was always there. You just weren't measuring it.

This is the default story for most teams shipping multilingual AI products. The quality gap isn't subtle. A state-of-the-art model like QwQ-32B drops from 70.7% on English reasoning benchmarks to 32.8% on Swahili, a 54% relative performance collapse, and that was the best available model tested in 2025. The gap doesn't disappear as models get larger. It shrinks for high-resource languages and stays wide for everyone else.