6 posts tagged with "localization"

Locale-Stratified Evals: How to Catch Non-English Regressions Your English Test Set Can't See

May 14, 2026 · 12 min read

Tian Pan

Software Engineer

Your aggregate eval score is up 1.2 points after the last prompt change. Your CSAT on French queries dropped four points the same week. Both numbers are correct. They disagree because the eval set is 88% English and 6% Spanish, and the remaining 6% is a long tail of locales, none of which carries enough weight to move the rollup. The French regression is in your data; it is just sitting three decimal places below the noise floor of your top-line metric.

This is the most common shape of locale drift I see in production AI systems: not a sudden collapse, not a translated-string bug, but a steady performance gap that the rollup hides and the support queue eventually surfaces. By the time someone in the Paris office forwards a screenshot, you have shipped two more prompt changes on top of the regression and the bisect costs three engineering days.
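To make the arithmetic behind that disagreement concrete, here is a minimal sketch of a locale-stratified rollup. The 88% English and 6% Spanish shares come from the excerpt above; the French share, the pass rates, and the names `rollup` and `regressed_locales` are hypothetical, chosen only to illustrate how a weighted average can rise while one locale falls.

```python
# Hypothetical eval-set mix: "en" and "es" shares follow the excerpt,
# the split of the remaining 6% between "fr" and "other" is assumed.
WEIGHTS = {"en": 0.88, "es": 0.06, "fr": 0.02, "other": 0.04}

# Hypothetical per-locale pass rates before and after a prompt change.
BEFORE = {"en": 0.900, "es": 0.850, "fr": 0.840, "other": 0.800}
AFTER = {"en": 0.915, "es": 0.850, "fr": 0.760, "other": 0.800}


def rollup(pass_rates, weights):
    """Weighted aggregate pass rate across locales."""
    return sum(pass_rates[loc] * weights[loc] for loc in weights)


def regressed_locales(before, after, threshold=0.02):
    """Locales whose pass rate dropped by more than `threshold`."""
    return sorted(
        loc for loc in before if before[loc] - after[loc] > threshold
    )


aggregate_delta = rollup(AFTER, WEIGHTS) - rollup(BEFORE, WEIGHTS)
per_locale_delta = {loc: AFTER[loc] - BEFORE[loc] for loc in WEIGHTS}
flagged = regressed_locales(BEFORE, AFTER)

print(f"aggregate change: {aggregate_delta * 100:+.2f} points")
for loc, delta in per_locale_delta.items():
    print(f"{loc}: {delta * 100:+.2f} points")
print("regressed locales:", flagged)
```

With these made-up numbers the aggregate moves up by a little over one point while the French stratum drops by eight. Reporting the per-locale deltas next to the rollup is what makes the regression visible.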

Your System Prompts Are Still in English: The Silent Cost of Incomplete AI Localization

May 7, 2026 · 8 min read

Tian Pan

Software Engineer

Your team ships an AI feature. You celebrate the localization work: every button label, tooltip, and error message has been translated into twelve languages. The product manager signs off. The feature goes live globally.

Then, six weeks later, a user in Germany posts a screenshot. The AI's response has the right words but the wrong register: awkward formality for a casual support context. A Japanese user reports that structured outputs contain dates formatted as MM/DD/YYYY, confusing their downstream tooling. A Brazilian support engineer notices the AI occasionally slips into English mid-sentence when reasoning through complex queries. These aren't infrastructure failures. Your dashboards show green. But for non-English users, the product is quietly worse.

The root cause is almost always the same: teams translate UI strings but leave system prompts in English. It feels like localization. It isn't.

Multilingual Eval Cost Amplification: Why Seven Locales Doesn't Cost 7×

April 28, 2026 · 14 min read

Tian Pan

Software Engineer

The financial planning spreadsheet for the international launch had a clean line item: "extend eval coverage to seven new locales — assume 7× current eval cost." The English eval suite took two weeks and $40K to build, so seven locales would be $280K and a quarter of engineering time. The CFO signed it. The VP of Product signed it. The launch shipped.

Six months later the actual eval bill had crossed $310K and the team was still standing up the last two locales. The labeling vendor had churned through three replacements for the Brazilian Portuguese pool because the first two kept producing inter-rater agreement scores an honest review would call random. The German judge model was scoring 6% lower than the English one on the same content. The team initially read this as a German model regression, until a manual audit revealed that the judge itself was the regression. And the eval lead was spending forty percent of their week on a question nobody had budgeted: how do we know when locale A's pass rate is actually worse than locale B's, versus when our cross-locale measurement is just noisier than the gap?
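One rough way to approach the eval lead's question is a two-proportion z-test on the pass counts of two locales. The sketch below uses only the Python standard library; the function name `two_proportion_z` and the sample counts are hypothetical. It addresses sampling noise only, and it assumes the judge scores both locales consistently, which is exactly the assumption the German judge audit above shows can fail.

```python
import math


def two_proportion_z(pass_a, n_a, pass_b, n_b):
    """Return (z, two-sided p-value) for the difference in pass rates.

    Uses the pooled-variance normal approximation, so it is only
    reasonable when both samples are fairly large.
    """
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # identical all-pass or all-fail samples
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: the same 6-point gap at two sample sizes.
z_large, p_large = two_proportion_z(450, 500, 420, 500)  # 90% vs 84%
z_small, p_small = two_proportion_z(45, 50, 42, 50)      # 90% vs 84%

print(f"n=500 per locale: z={z_large:.2f}, p={p_large:.4f}")
print(f"n=50 per locale:  z={z_small:.2f}, p={p_small:.4f}")
```

The same gap is distinguishable from noise at 500 examples per locale and not at 50, which is one reason per-locale sample sizes, and not just per-locale coverage, belong in the eval budget.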

Translation Is Not Localization: The Cultural-Calibration Debt Your Multilingual AI Just Defaulted On

April 28, 2026 · 12 min read

Tian Pan

Software Engineer

A multilingual launch that ships English prompts translated into N languages, with an English eval set translated into the same N languages, has not shipped a multilingual product. It has shipped one product N times, and made all the failure modes invisible to its own dashboards. The system is fluent and culturally off-key, and the metric the team optimized, translation quality, is the wrong axis for measuring what users are reacting to.

The visible defect on launch day is small. A Japanese user receives a reply that is grammatically correct and conspicuously curt. An Indonesian user notices the assistant is cheerfully direct in a register that reads as rude. A Korean user gets advice framed around individual choice when the prompt was about a family decision. None of these are translation bugs. They are cultural-register bugs that translation cannot fix and translated evals cannot detect.

Prompt Localization Debt: The Silent Quality Tiers Hiding in Your Multilingual AI Product

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

Your AI feature shipped with a 91% task success rate. You ran evals, iterated on your prompt, and tuned it until it hit your quality bar. Then you launched globally — and three months later a user in Tokyo files a support ticket that your AI "doesn't really understand" their input. Your Japanese users have been silently working around a feature that performs 15–20 percentage points worse than what your English users experience. Nobody on your team noticed because nobody was measuring it.

This is prompt localization debt: the accumulating gap between how well your AI performs in the language you built it for and every other language your users speak. It doesn't announce itself in dashboards. It doesn't cause outages. It just quietly creates second-class users.
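As a minimal sketch of how that gap could be measured, assuming you already have a task success rate per locale: the function names, the 10-point threshold, and every rate except the 91% English figure are hypothetical.

```python
def localization_debt(success_by_locale, base_locale="en"):
    """Gap to the base locale, in percentage points, for every other locale."""
    base = success_by_locale[base_locale]
    return {
        loc: round((base - rate) * 100, 1)
        for loc, rate in success_by_locale.items()
        if loc != base_locale
    }


def second_class_locales(debt, max_gap_points=10.0):
    """Locales whose gap to the base locale exceeds the allowed threshold."""
    return sorted(loc for loc, gap in debt.items() if gap > max_gap_points)


# Hypothetical measurements; only the 91% English rate is from the excerpt.
success = {"en": 0.91, "ja": 0.74, "de": 0.88}

debt = localization_debt(success)
flagged = second_class_locales(debt)

print(debt)
print("over threshold:", flagged)
```

Tracking this number per release, not just once at launch, is what turns a silent gap into something a dashboard can show.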

Cultural Calibration for Global AI Products: Why Translation Is 10% of the Problem

April 17, 2026 · 9 min read

Tian Pan

Software Engineer

There is a quiet failure mode baked into almost every globally deployed AI product. An engineer localizes the UI strings, runs the model outputs through a translation API, has a native speaker spot-check a handful of responses, and ships. The product is technically multilingual. It is not culturally competent. Users in Tokyo, Riyadh, and Chengdu receive outputs that are grammatically correct and culturally wrong — responses that signal disrespect, confusion, or distrust in ways the team will never see in aggregate metrics.

The research is unambiguous: every major LLM tested reflects the worldview of English-speaking, Protestant European societies. Studies testing models against representative data from 107 countries found not a single model that aligned with how people in Africa, Latin America, or the Middle East build trust, show respect, or resolve conflict. Translation patches the surface. The underlying calibration remains Western.
