Skip to main content

10 posts tagged with "multilingual"

View all tags

Locale-Stratified Evals: How to Catch Non-English Regressions Your English Test Set Can't See

· 12 min read
Tian Pan
Software Engineer

Your aggregate eval score is up 1.2 points after the last prompt change. Your CSAT on French queries dropped four points the same week. Both numbers are correct. The reason they disagree is that the eval set is 88% English, 6% Spanish, and the rest is a long tail none of which sees enough traffic to move the rollup. The French regression is in your data — it is just sitting at three decimal places below the noise floor of your top-line metric.

This is the most common shape of locale drift I see in production AI systems: not a sudden collapse, not a translated-string bug, but a steady performance gap that the rollup hides and the support queue eventually surfaces. By the time someone in the Paris office forwards a screenshot, you have shipped two more prompt changes on top of the regression and the bisect costs three engineering days.

Your System Prompts Are Still in English: The Silent Cost of Incomplete AI Localization

· 8 min read
Tian Pan
Software Engineer

Your team ships an AI feature. You celebrate the localization work: every button label, tooltip, and error message has been translated into twelve languages. The product manager signs off. The feature goes live globally.

Then, six weeks later, a user in Germany posts a screenshot. The AI's response has the right words but wrong register — awkward formality for a casual support context. A Japanese user reports that structured outputs contain dates formatted as MM/DD/YYYY, confusing their downstream tooling. A Brazilian support engineer notices the AI occasionally slips into English mid-sentence when reasoning through complex queries. These aren't infrastructure failures. Your dashboards show green. But for non-English users, the product is quietly worse.

The root cause is almost always the same: teams translate UI strings but leave system prompts in English. It feels like localization. It isn't.

The Multilingual RAG Retrieval Gap: Why Cross-Lingual Queries Silently Fail Your Vector Search

· 11 min read
Tian Pan
Software Engineer

A team builds a RAG system. English retrieval hits 94% recall. They ship. Three months later, support tickets from French and German users pile up — the chatbot keeps returning irrelevant results or nothing at all. The engineers look at their monitoring dashboard. Overall recall: 91%. Nothing looks broken.

The corpus is English. The embedding model is English-only. The users are not. Every French query gets embedded into a vector space that was never designed to share coordinates with the English documents it's searching against. The cosine similarities aren't bad — they're geometrically meaningless. And because aggregate metrics aggregate, the problem is invisible until users complain loudly enough.

This is the multilingual RAG retrieval gap, and it's one of the most common silent failure modes in production AI systems serving non-English audiences.

Translation Is Not Localization: The Cultural-Calibration Debt Your Multilingual AI Just Defaulted On

· 12 min read
Tian Pan
Software Engineer

A multilingual launch that ships English prompts translated into N languages, with an English eval set translated into the same N languages, has not shipped a multilingual product. It has shipped one product N times, and made all the failure modes invisible to its own dashboards. The system is fluent and culturally off-key, and the metric the team optimized — translation quality — is the wrong axis to measure what users are reacting to.

The visible defect on launch day is small. A Japanese user receives a reply that is grammatically correct and conspicuously curt. An Indonesian user notices the assistant is cheerfully direct in a register that reads as rude. A Korean user gets advice framed around individual choice when the prompt was about a family decision. None of these are translation bugs. They are cultural-register bugs that translation cannot fix and translated evals cannot detect.

Cross-Lingual Hallucination: Why Your LLM Lies More in Languages It Knows Less

· 9 min read
Tian Pan
Software Engineer

Your model scores 92% on your evaluation suite. Your French-speaking users complain constantly that it makes things up. Both of these facts can be true at the same time — and the gap between them is a structural problem in how multilingual AI systems are built and measured.

LLMs hallucinate 15–35% more frequently in non-English languages than in English. In low-resource languages like Swahili or Yoruba, that gap widens to 38-point performance deficits on the same factual questions. Yet most teams ship multilingual AI features with a single English-language eval suite, report aggregate benchmark scores that average away the problem, and only discover the damage when users in Paris or Mumbai start filing support tickets.

The cross-lingual hallucination problem is not primarily a model quality problem. It is a measurement and architectural failure that teams perpetuate by treating multilingual AI as "English AI with translation bolted on."

The Multilingual Quality Cliff: Why Your LLM Works Great in English and Quietly Fails Everyone Else

· 10 min read
Tian Pan
Software Engineer

Your LLM passes every eval you throw at it. Latency is solid, accuracy looks fine, and the team ships with confidence. Then a user in Cairo files a bug: the structured extraction returns malformed JSON. A developer in Seoul notices the assistant ignores complex instructions after a few turns. A product manager in Mumbai realizes the chatbot's summarization is just wrong—subtly, consistently, wrong.

None of this showed up in your benchmarks because your benchmarks are in English.

This is the multilingual quality cliff: a performance drop that is steep, systematic, and almost universally invisible to teams that ship AI products. The gap isn't marginal. In long multi-turn conversations, Arabic and Korean users see accuracy around 40.8% on tasks where English users are at 54.8%—a 14-point gap that compounds with every additional turn. For structured editing tasks, that same gap widens to catastrophic: 32–37% accuracy versus acceptable English performance. The users feel this. Your dashboards don't.

The Multilingual Token Tax: What Building AI for Non-English Users Actually Costs

· 11 min read
Tian Pan
Software Engineer

Your product roadmap says "expand to Japan and Brazil." Your finance model says the LLM API line item is $X per month. Both of those numbers are wrong, and you won't discover it until the international rollout is weeks away.

Tokenization — the step that turns user text into integers your model can process — is profoundly biased toward English. A sentence in Japanese might require 2–8× as many tokens as the same sentence in English. That multiplier feeds directly into API costs, context window headroom, and response latency. Teams that model their AI budget on English benchmarks and then flip on a language flag are routinely surprised by bills 3–5× higher than expected.

Tokenizer Blindspots That Break Production LLM Systems

· 10 min read
Tian Pan
Software Engineer

Most engineers who build on LLMs eventually learn the rough conversion: one token is about 0.75 English words, so a 4,000-token context window fits roughly 3,000 words. That number is fine for back-of-napkin estimates when your input is casual English prose. It is quietly wrong everywhere else — and "everywhere else" turns out to be most of the interesting production workloads.

Token miscalculations don't fail loudly. They show up as cost overruns that don't match any line item, as context windows that silently truncate the last few paragraphs of a document, or as multilingual pipelines that work fine in English testing and go 4x over budget the first week they hit real traffic. By the time you trace the issue back to tokenization, the damage is done.

Prompt Localization Debt: The Silent Quality Tiers Hiding in Your Multilingual AI Product

· 9 min read
Tian Pan
Software Engineer

Your AI feature shipped with a 91% task success rate. You ran evals, iterated on your prompt, and tuned it until it hit your quality bar. Then you launched globally — and three months later a user in Tokyo files a support ticket that your AI "doesn't really understand" their input. Your Japanese users have been silently working around a feature that performs 15–20 percentage points worse than what your English users experience. Nobody on your team noticed because nobody was measuring it.

This is prompt localization debt: the accumulating gap between how well your AI performs in the language you built it for and every other language your users speak. It doesn't announce itself in dashboards. It doesn't cause outages. It just quietly creates second-class users.

Building Multilingual AI Products: The Quality Cliff Nobody Measures

· 11 min read
Tian Pan
Software Engineer

Your AI product scores 82% on your eval suite. You ship to 40 countries. Three months later, French and German users report quality similar to English. Hindi and Arabic users quietly stop using the feature. Your aggregate satisfaction score barely budges — because English-speaking users dominate the metric pool. The cliff was always there. You just weren't measuring it.

This is the default story for most teams shipping multilingual AI products. The quality gap isn't subtle. A state-of-the-art model like QwQ-32B drops from 70.7% on English reasoning benchmarks to 32.8% on Swahili — a 54% relative performance collapse on the best available model tested in 2025. And that's the best model. This gap doesn't disappear as models get larger. It shrinks for high-resource languages and stays wide for everyone else.