Multilingual Eval Cost Amplification: Why Seven Locales Doesn't Cost 7×
The financial planning spreadsheet for the international launch had a clean line item: "extend eval coverage to seven new locales — assume 7× current eval cost." The English eval suite took two weeks and $40K to build, so seven locales would be $280K and a quarter of engineering time. The CFO signed it. The VP of Product signed it. The launch shipped.
Six months later the actual eval bill had crossed $310K and the team was still standing up the last two locales. The labeling vendor had churned through three replacements for the Portuguese-Brazilian pool because the first two kept producing inter-rater agreement scores an honest review would call random. The German judge model was scoring 6% lower than the English one on the same content — the team initially read this as a German model regression until a manual audit revealed the judge itself was the regression. And the eval lead was spending forty percent of their week on a question nobody had budgeted: how do we know when locale A's pass rate is actually worse than locale B's, versus when our cross-locale measurement is just noisier than the gap?
The arithmetic the spreadsheet used was N tokens per eval × L locales × per-token labeling cost. The arithmetic that actually applies is closer to N × L^1.3, plus a meta-eval cost that doesn't appear on the curve at all because nobody itemized it. The amplification is real, it is structural, and the org that doesn't model it before committing to the localization roadmap discovers it the same way every other unmodeled cost gets discovered: through quarterly review pages with the CFO asking which line item was wrong.
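To make the gap between the two curves concrete, here is a minimal sketch that plugs the planning example's figures into both models. The function names are illustrative, the 1.3 exponent is the rough figure quoted above rather than a fitted constant, and the unitemized meta-eval cost is deliberately left out.

```python
def linear_cost(base_cost: float, locales: int) -> float:
    """Spreadsheet model: cost grows in direct proportion to locale count."""
    return base_cost * locales


def amplified_cost(base_cost: float, locales: int, exponent: float = 1.3) -> float:
    """Superlinear model: cost grows as locales ** exponent (exponent is an assumption)."""
    return base_cost * locales ** exponent


base = 40_000  # cost of the English suite in the planning example
locales = 7

planned = linear_cost(base, locales)
modeled = amplified_cost(base, locales)

print(f"planned (7x):    ${planned:,.0f}")
print(f"modeled (7^1.3): ${modeled:,.0f}")
print(f"unbudgeted gap:  ${modeled - planned:,.0f}")
```

With seven locales the superlinear term works out to roughly 12.5 times the base cost instead of 7 times, before any meta-eval work is counted.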
Translation Doesn't Buy You What You Think It Buys You
The cheapest path to L locales of eval coverage is to translate the English eval suite. A vendor can machine-translate the prompts and the rubric, a human reviewer can clean up the output, and you have L suites at maybe 30% of the cost of authoring from scratch. Every team building international AI features tries this first. Most discover within the first month that the shortcut bought them numbers, not signal.
The core problem is that translated eval items are evaluating the wrong thing. A multiple-choice question translated from English into Portuguese-Brazilian primarily measures the model's translation fluency, not its ability to handle the kinds of queries Brazilian users actually send. The questions are calibrated to American educational frames, the cultural assumptions encoded in the distractors are American, and the failure modes that matter in Brazil — different formality registers, different name conventions, regional vocabulary that diverges from European Portuguese — are invisible to an item that was authored against an English distribution. Recent benchmarks like INCLUDE pointed this out by sourcing locally authored regional knowledge questions instead of translating: the gap between translated benchmarks and locale-native ones isn't a few percentage points of noise, it's a categorically different signal.
Worse, translated eval suites systematically underweight the failure modes you most need to catch. The English eval has a corpus of jailbreak prompts, edge-case formatting requests, and adversarial inputs that were collected from English-language production traffic over months. None of those translate. Brazilian users have their own jailbreak idioms, their own formatting expectations (date order, currency placement, address formats), their own polite-imperative constructions that the model needs to understand are still imperatives. Translating the English adversarial set gives you English adversarials in Portuguese clothing — a useful sanity check, but not an eval that catches what the locale will throw at you.
Teams that learn this the hard way go through three stages. Stage one: ship translated evals, get pass rates that look reasonable, ship the feature. Stage two: locale-native support starts surfacing complaints the eval suite never flagged. Stage three: tear out the translated evals and rebuild from locale-native authoring, at roughly the cost of the original English suite per locale — the cost the spreadsheet was supposed to avoid. The shortcut wasn't a shortcut; it was a deferred bill that came due during the first wave of negative reviews.
Locale-Pinned Judges and the Fleiss' Kappa Floor
LLM-as-judge is the cost-control reflex for English evals: instead of paying labelers $4-12 per item to grade rubric responses, you pay a few cents per item to have an LLM judge do it, calibrated against a small human-labeled gold set. The economics work in English because the judge model has seen enough English instruction-following to be reasonably calibrated, and the calibration set is cheap to build.
Multilingual breaks both halves of that bargain. Recent measurement shows multilingual LLM judges achieving an inter-rater Fleiss' Kappa of around 0.3 across a 25-language test. That is weak agreement: only about a third of the way from chance (0) to perfect (1), in the band the commonly used Landis and Koch scale labels merely "fair." Performance degrades sharply on lower-resource languages, and even on high-resource languages the judge's failure modes diverge from the human labeler's. Your judge isn't just a noisier version of the human; it's biased differently in different directions per locale. A judge that systematically over-credits fluency in Portuguese and under-credits semantic accuracy in Japanese is not a single judge with noise; it's seven different judges in a trench coat, and the cross-locale comparison you wanted to do with it is structurally invalid.
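For readers who want to see what a kappa in that range looks like, below is a small, self-contained sketch of the standard Fleiss' kappa calculation. The toy votes are invented for illustration and are not data from the cited measurement; "raters" here could be repeated judge runs, different judge models, or human labelers.

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """Fleiss' kappa for items that each received the same number of ratings.

    ratings[i] holds the category labels all raters gave to item i.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    category_totals: Counter = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Fraction of ordered rater pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    # Agreement expected by chance, from each category's overall share.
    total = n_items * n_raters
    expected = sum((c / total) ** 2 for c in category_totals.values())
    if expected == 1.0:
        return float("nan")  # only one category was ever used; kappa is undefined
    return (observed - expected) / (1 - expected)


# Toy data: three raters, four items (invented for illustration).
votes = [
    ["pass", "pass", "fail"],
    ["pass", "fail", "fail"],
    ["pass", "pass", "pass"],
    ["fail", "fail", "fail"],
]
print(f"{fleiss_kappa(votes):.2f}")  # 0.33
```

In this toy set the raters split on half the items and are unanimous on the other half, which is enough to land near the 0.3 figure quoted above.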
The mitigation isn't to abandon LLM judging — the human-only path doesn't fit the budget — but to accept that judge calibration is now a per-locale workstream. Each locale needs its own gold set, its own per-rubric calibration measurement, its own freshness cadence as the model and the prompt evolve, and its own ensemble strategy if the single-judge variance is too high to land calibrated decisions. The cost that the spreadsheet treated as a fixed setup cost (build the judge once, run it everywhere) is now a per-locale variable cost that scales with L on top of the labeling cost itself.
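One way to picture the per-locale workstream is a small report that compares judge scores with the human gold set, locale by locale. The sketch below is hypothetical throughout: the `GoldItem` record, the sample scores, and the thresholds in `needs_recalibration` are assumptions chosen for illustration, not values from any particular team or tool.

```python
from dataclasses import dataclass


@dataclass
class GoldItem:
    locale: str
    human_score: int  # rubric score from the human-labeled gold set
    judge_score: int  # score the LLM judge gave the same response


def calibration_by_locale(items: list[GoldItem]) -> dict[str, dict[str, float]]:
    """Per-locale exact-agreement rate and signed bias of judge vs. human."""
    by_locale: dict[str, list[GoldItem]] = {}
    for item in items:
        by_locale.setdefault(item.locale, []).append(item)

    report: dict[str, dict[str, float]] = {}
    for locale, rows in by_locale.items():
        n = len(rows)
        report[locale] = {
            "n": n,
            # Share of items where judge and human gave the same score.
            "agreement": sum(r.judge_score == r.human_score for r in rows) / n,
            # Positive bias means the judge over-credits relative to humans.
            "bias": sum(r.judge_score - r.human_score for r in rows) / n,
        }
    return report


def needs_recalibration(stats: dict[str, float],
                        min_agreement: float = 0.8,
                        max_abs_bias: float = 0.25) -> bool:
    """Flag a locale whose judge drifts from the gold set (thresholds are assumptions)."""
    return stats["agreement"] < min_agreement or abs(stats["bias"]) > max_abs_bias


gold = [
    GoldItem("pt-BR", 3, 4), GoldItem("pt-BR", 4, 5),
    GoldItem("pt-BR", 2, 3), GoldItem("pt-BR", 5, 5),
    GoldItem("ja-JP", 4, 4), GoldItem("ja-JP", 3, 3), GoldItem("ja-JP", 5, 5),
    GoldItem("ja-JP", 2, 2), GoldItem("ja-JP", 4, 3),
]
for locale, stats in calibration_by_locale(gold).items():
    print(locale, stats, "recalibrate" if needs_recalibration(stats) else "ok")
```

Tracking the signed bias separately from the agreement rate matters here because two locales can have the same agreement while the judge errs in opposite directions, which is exactly what makes cross-locale comparison unsafe.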
The org that doesn't separate "the judge needs calibration" from "the judge is calibrated" discovers the gap through a regression that looks like a model improvement. The new model rolls out, the German judge says quality went up, the launch goes ahead, and three weeks later support tickets show quality went down — because the judge's bias shifted favorably toward the new model's stylistic tics on a dimension that doesn't matter to German users, while it stopped noticing a regression on a dimension that does.
The Meta-Eval Nobody Budgeted
Here is where the cost curve breaks N × L. You have seven eval suites. Each suite is locally authored, locally judged, locally labeled. The Brazilian suite hits an 87% pass rate, the German suite 81%, the Japanese suite 74%. The product manager asks the obvious question: is the Japanese experience worse, or is the Japanese eval harder?
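A first, partial check is whether a gap between two pass rates is even larger than sampling noise. The sketch below uses a plain normal-approximation interval for the difference of two proportions; the suite size of 200 items per locale is an assumption made up for the example. It says nothing about whether one suite is harder or one judge is biased, which is the part that needs its own meta-eval.

```python
import math


def pass_rate_gap_interval(passes_a: int, n_a: int, passes_b: int, n_b: int,
                           z: float = 1.96) -> tuple[float, float, float]:
    """Gap in pass rate (A minus B) with a normal-approximation interval.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    p_a = passes_a / n_a
    p_b = passes_b / n_b
    gap = p_a - p_b
    std_err = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return gap, gap - z * std_err, gap + z * std_err


# Assumed suite size: 200 items per locale (hypothetical).
brazil_vs_japan = pass_rate_gap_interval(174, 200, 148, 200)    # 87% vs 74%
brazil_vs_germany = pass_rate_gap_interval(174, 200, 162, 200)  # 87% vs 81%

for name, (gap, low, high) in [("BR - JP", brazil_vs_japan),
                               ("BR - DE", brazil_vs_germany)]:
    verdict = "gap exceeds sampling noise" if low > 0 or high < 0 else "within sampling noise"
    print(f"{name}: gap={gap:+.3f}, interval=({low:+.3f}, {high:+.3f}) -> {verdict}")
```

Under that assumed suite size, the 13-point gap clears the interval while the 6-point gap does not, so even before difficulty and judge bias are considered, some of the headline differences may not be distinguishable from noise.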
- https://arxiv.org/abs/2411.19799
- https://arxiv.org/abs/2505.12201
- https://aclanthology.org/2025.findings-emnlp.587.pdf
- https://www.secondtalent.com/resources/data-annotation-costs-by-country-comparing-global-rates/
- https://datavlab.ai/post/data-annotation-pricing
- https://cohere.com/blog/towards-fair-and-comprehensive-multilingual-and-multicultural-llm-benchmarking
- https://arxiv.org/html/2406.14267v1
- https://medium.com/data-science-at-microsoft/building-and-evaluating-multilingual-rag-systems-943c290ab711
- https://benchlm.ai/multilingual
