Multilingual Eval Cost Amplification: Why Seven Locales Don't Cost 7×
The financial planning spreadsheet for the international launch had a clean line item: "extend eval coverage to seven new locales — assume 7× current eval cost." The English eval suite took two weeks and $40K to build, so seven locales would be $280K and a quarter of engineering time. The CFO signed it. The VP of Product signed it. The launch shipped.
Six months later the actual eval bill had crossed $310K and the team was still standing up the last two locales. The labeling vendor had churned through three replacements for the Brazilian Portuguese pool because the first two kept producing inter-rater agreement scores an honest review would call random. The German judge model was scoring 6% lower than the English one on the same content — the team initially read this as a German model regression until a manual audit revealed the judge itself was the regression. And the eval lead was spending forty percent of their week on a question nobody had budgeted: how do we know when locale A's pass rate is actually worse than locale B's, versus when our cross-locale measurement is just noisier than the gap?
The arithmetic the spreadsheet used was N tokens per eval × L locales × per-token labeling cost. The arithmetic that actually applies is closer to N × L^1.3, plus a meta-eval cost that doesn't appear on the curve at all because nobody itemized it. The amplification is real, it is structural, and the org that doesn't model it before committing to the localization roadmap discovers it the same way every other unmodeled cost gets discovered: in a quarterly review, with the CFO asking which line item was wrong.
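To make the shape concrete, here is a minimal sketch of the two models side by side. Everything numeric in it (the 1.3 exponent, the per-locale calibration figure, the per-pair anchor cost, the regulatory reserve) is an illustrative placeholder, not a measured value.

```python
# Illustrative only: the exponent and per-unit costs below are assumptions
# chosen to show the shape of the curve, not measured values.
from math import comb

def naive_eval_cost(english_build_cost: float, locales: int) -> float:
    """The spreadsheet model: every locale costs what the English suite cost."""
    return english_build_cost * locales

def amplified_eval_cost(
    english_build_cost: float,
    locales: int,
    labeling_exponent: float = 1.3,          # superlinear labeling/judging (assumed)
    calibration_per_locale: float = 8_000,   # per-locale judge gold set + calibration (assumed)
    meta_eval_per_pair: float = 3_000,       # anchor-panel spend per locale pair (assumed)
    regulatory_reserve: float = 15_000,      # compliance/ops reserve (assumed)
) -> float:
    """Three stacked curves: per-locale, cross-locale, and regulatory."""
    labeling = english_build_cost * locales ** labeling_exponent
    calibration = calibration_per_locale * locales
    meta_eval = meta_eval_per_pair * comb(locales, 2)  # L choose 2 pairwise comparisons
    return labeling + calibration + meta_eval + regulatory_reserve
```

The specific numbers matter less than the structure: the second and third terms are the ones the 7× line item never sees, and the third grows with the square of the locale count.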
Translation Doesn't Buy You What You Think It Buys You
The cheapest path to N locales of eval coverage is to translate the English eval suite. A vendor can machine-translate the prompts and the rubric, a human reviewer can clean it up, and you have N suites at maybe 30% of the cost of authoring from scratch. Every team building international AI features tries this first. Most discover within the first month that the shortcut bought them numbers, not signal.
The core problem is that translated eval items are evaluating the wrong thing. A multiple-choice question translated from English into Brazilian Portuguese primarily measures the model's translation fluency, not its ability to handle the kinds of queries Brazilian users actually send. The questions are calibrated to American educational frames, the cultural assumptions encoded in the distractors are American, and the failure modes that matter in Brazil — different formality registers, different name conventions, regional vocabulary that diverges from European Portuguese — are invisible to an item that was authored against an English distribution. Recent benchmarks like INCLUDE pointed this out by sourcing locally authored regional knowledge questions instead of translating: the gap between translated benchmarks and locale-native ones isn't a few percentage points of noise; it's a categorically different signal.
Worse, translated eval suites systematically underweight the failure modes you most need to catch. The English eval has a corpus of jailbreak prompts, edge-case formatting requests, and adversarial inputs that were collected from English-language production traffic over months. None of those translate. Brazilian users have their own jailbreak idioms, their own formatting expectations (date order, currency placement, address formats), their own polite-imperative constructions that the model needs to understand are still imperatives. Translating the English adversarial set gives you English adversarials in Portuguese clothing — a useful sanity check, but not an eval that catches what the locale will throw at you.
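One cheap piece of hygiene is to make provenance visible in the suite itself, so nobody mistakes a translated adversarial set for locale-native coverage. A minimal sketch, with hypothetical field names and failure-mode tags:

```python
# Hypothetical item schema for tracking what a locale suite actually measures.
# Field names and tag vocabulary are illustrative, not a standard.
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class EvalItem:
    item_id: str
    locale: str                     # e.g. "pt-BR", "de-DE", "ja-JP"
    provenance: str                 # "native-authored" vs "translated-from-en"
    failure_modes: list[str] = field(default_factory=list)  # e.g. "date-format", "formality-register"

def coverage_report(items: list[EvalItem]) -> dict:
    """How much of the suite is locale-native, and which failure modes it touches."""
    native = [i for i in items if i.provenance == "native-authored"]
    return {
        "native_fraction": len(native) / max(len(items), 1),
        "native_failure_modes": Counter(m for i in native for m in i.failure_modes),
    }
```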
Teams that learn this the hard way go through three stages. Stage one: ship translated evals, get pass rates that look reasonable, ship the feature. Stage two: locale-native support starts surfacing complaints the eval suite never flagged. Stage three: tear out the translated evals and rebuild from locale-native authoring, at roughly the cost of the original English suite per locale — the cost the spreadsheet was supposed to avoid. The shortcut wasn't a shortcut; it was a deferred bill that came due during the first wave of negative reviews.
Locale-Pinned Judges and the Fleiss' Kappa Floor
LLM-as-judge is the cost-control reflex for English evals: instead of paying labelers $4-12 per item to grade rubric responses, you pay a few cents per item to have an LLM judge do it, calibrated against a small human-labeled gold set. The economics work in English because the judge model has seen enough English instruction-following to be reasonably calibrated, and the calibration set is cheap to build.
Multilingual breaks both halves of that bargain. Recent measurement shows multilingual LLM judges achieving an inter-rater Fleiss' Kappa of around 0.3 across a 25-language test, which is in the territory of "barely better than chance agreement." Performance degrades sharply on lower-resource languages, and even on high-resource languages the judge's failure modes diverge from the human labeler's — meaning your judge isn't just a noisier version of the human, it's biased differently in different directions per locale. A judge that systematically over-credits fluency in Portuguese and under-credits semantic accuracy in Japanese is not a single judge with noise; it's seven different judges in a trench coat, and the cross-locale comparison you wanted to do with it is structurally invalid.
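Measuring that floor per locale is cheap once the per-item ratings are collected; the harder part is deciding to act on it. A minimal sketch using statsmodels' Fleiss' kappa, with an assumed 0.4 usability floor:

```python
# A minimal per-locale agreement check, assuming ratings are already collected
# as an (items x raters) matrix of category codes per locale. The 0.4 floor is
# an assumed threshold, not a standard.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def locale_kappa(ratings_by_locale: dict[str, np.ndarray]) -> dict[str, float]:
    """ratings_by_locale[locale] is an (n_items, n_raters) array of category codes."""
    out = {}
    for locale, ratings in ratings_by_locale.items():
        table, _ = aggregate_raters(ratings)          # per-item counts per category
        out[locale] = fleiss_kappa(table, method="fleiss")
    return out

def flag_unusable_locales(kappas: dict[str, float], floor: float = 0.4) -> list[str]:
    """Locales whose agreement is too close to chance to support a launch decision."""
    return [loc for loc, k in kappas.items() if k < floor]
```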
The mitigation isn't to abandon LLM judging — the human-only path doesn't fit the budget — but to accept that judge calibration is now a per-locale workstream. Each locale needs its own gold set, its own per-rubric calibration measurement, its own freshness cadence as the model and the prompt evolve, and its own ensemble strategy if the single-judge variance is too high to land calibrated decisions. The cost that the spreadsheet treated as a fixed setup cost (build the judge once, run it everywhere) is now a per-locale variable cost that scales with L on top of the labeling cost itself.
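What "the judge is calibrated" means operationally can be pinned down as a per-locale gate on cross-locale decisions. A sketch; the thresholds and the freshness window are assumptions, not recommendations:

```python
# Per-locale "is the judge usable?" gate. Thresholds, field names, and the
# staleness window are assumed values for illustration.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class JudgeCalibration:
    locale: str
    gold_set_size: int            # locally authored gold items the judge was scored against
    agreement_with_gold: float    # e.g. Cohen's kappa or exact-match rate vs human gold labels
    last_calibrated: date

def judge_is_usable(cal: JudgeCalibration,
                    min_gold: int = 300,
                    min_agreement: float = 0.75,
                    max_staleness_days: int = 90) -> bool:
    """Gate launch decisions and cross-locale comparisons on a recently calibrated judge."""
    fresh = (date.today() - cal.last_calibrated) <= timedelta(days=max_staleness_days)
    return cal.gold_set_size >= min_gold and cal.agreement_with_gold >= min_agreement and fresh
```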
The org that doesn't separate "the judge needs calibration" from "the judge is calibrated" discovers the gap through a regression that looks like a model improvement. The new model rolls out, the German judge says quality went up, the launch goes ahead, and three weeks later support tickets show quality went down — because the judge's bias shifted favorably toward the new model's stylistic tics on a dimension that doesn't matter to German users, while it stopped noticing a regression on a dimension that does.
The Meta-Eval Nobody Budgeted
Here is where the cost curve breaks away from N × L. You have seven eval suites. Each suite is locally authored, locally judged, locally labeled. The Brazilian suite hits an 87% pass rate, the German suite 81%, the Japanese suite 74%. The product manager asks the obvious question: is the Japanese experience worse, or is the Japanese eval harder?
This is the meta-eval question, and it is structurally a different problem from the per-locale evals. To answer it you need a calibration layer that compares the seven locale rubrics to each other, which means either a unified human panel that grades a sample across all seven locales (which requires hiring polyglot labelers who themselves are scarce and expensive — the "Japanese-and-Portuguese-and-German-fluent technical labeler" pool is roughly empty), or a synthetic anchor set translated across all locales that can give you a relative-difficulty signal even though, per the previous section, translated eval items are signal-degraded. Either path is expensive, the answer is approximate, and the org that doesn't fund it ends up with seven incomparable scoreboards and a leadership team that gravitates toward whichever locale's number happens to be highest.
The meta-eval cost scales worse than the per-locale cost because the comparison space is L choose 2, not L. A two-locale launch has one comparison to defend; a seven-locale launch has twenty-one, and the team that ignores most of them is implicitly committing to a "we're not comparing locales" stance that the first cross-region exec review will refuse to accept. The honest version of the meta-eval is something like "we sampled the seven locales against a polyglot anchor panel quarterly, and the calibration drift between locales is X%." Building that is six person-weeks of standing setup plus a recurring labeling spend that nobody itemized.
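The arithmetic is easy to run before committing to a locale count; the anchor-set size and the polyglot per-item rate below are assumptions for illustration:

```python
# Back-of-envelope meta-eval arithmetic: how many pairwise comparisons the org
# is implicitly committing to, and what a quarterly anchor panel costs.
# The sample size and per-item rate are assumed values.
from math import comb

def meta_eval_quarterly_cost(locales: int,
                             anchor_items_per_locale: int = 200,
                             polyglot_cost_per_item: float = 15.0) -> dict:
    pairs = comb(locales, 2)                        # comparisons to defend
    labeling = locales * anchor_items_per_locale * polyglot_cost_per_item
    return {"pairwise_comparisons": pairs, "quarterly_anchor_labeling": labeling}

# For 7 locales: comb(7, 2) == 21 comparisons, and 7 * 200 * $15 == $21,000 per
# quarter before vendor overhead and the standing setup cost.
```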
The teams that get this right tend to land on a discipline that names the meta-eval as a separate budget category from the per-locale evals: a meta-eval-engineer role, a meta-eval data set with its own freshness cadence, and a leadership ritual that distinguishes "Locale A is doing worse than Locale B" claims from "Locale A's eval is harder than Locale B's eval" claims. The org that doesn't budget for that ritual ends up litigating it informally in product reviews, with the meta-eval engineer playing devil's advocate every time and burning out within two quarters.
Cohort Drift Is Per-Locale, Not Shared
The eval refresh cadence in monolingual production is roughly quarterly: every quarter the eval engineer audits the live distribution, re-checks whether the eval set still looks like production traffic, and refreshes the items where drift has accumulated. The cost of a refresh is maybe 10-20% of the original build cost, and it amortizes nicely across the year.
In multilingual production the refresh cadence does not multiply by L cleanly because cohort drift happens on different clocks per locale. The Japanese cohort might be stable for three quarters and then shift dramatically when a major competitor launches in market. The Brazilian cohort might churn every quarter as new feature rollouts hit the local user base. The German cohort might drift slowly but on a different axis (formality, technical-domain distribution) than the English cohort. Each locale needs its own drift detection, its own refresh cadence, and its own per-refresh cost — and the ones that drift fastest are exactly the ones with the smallest labeler pools, which means the marginal cost of a refresh is highest in the locales that need it most.
A locale that hits a major drift event and needs a full refresh outside the quarterly cycle is a build-cost spike. The team that didn't budget for that spike either ships against a stale eval — the path to undetected regressions — or pulls the budget from another locale's cycle, breaking the comparable-treatment commitment the international launch was supposed to honor.
The discipline here is per-locale eval freshness as a tracked metric, with an SLO on "weeks since last drift audit" per locale and a budget reserve for unscheduled refreshes. The org that treats refresh as a uniform cadence discovers within a year that the locale with the loudest user complaints is also the one whose eval set is twelve months stale.
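A minimal sketch of what that tracking can look like, assuming the team samples the production topic mix per locale; the SLO value and the divergence proxy are illustrative, not prescriptive:

```python
# Per-locale freshness tracking: weeks since the last drift audit checked
# against an SLO, plus a crude drift score between the eval set's topic mix and
# current traffic. The SLO and threshold are assumed values; the symmetric
# KL-style score is an illustrative proxy, not a recommended detector.
from datetime import date
import math

def weeks_since(last_audit: date) -> float:
    return (date.today() - last_audit).days / 7

def topic_drift(eval_dist: dict[str, float], traffic_dist: dict[str, float]) -> float:
    """Symmetric KL-style divergence between topic frequency distributions."""
    topics = set(eval_dist) | set(traffic_dist)
    eps = 1e-9
    score = 0.0
    for t in topics:
        p = eval_dist.get(t, 0.0) + eps
        q = traffic_dist.get(t, 0.0) + eps
        score += p * math.log(p / q) + q * math.log(q / p)
    return score

def locale_needs_refresh(last_audit: date, drift_score: float,
                         slo_weeks: float = 13, drift_threshold: float = 0.2) -> bool:
    return weeks_since(last_audit) > slo_weeks or drift_score > drift_threshold
```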
The Staffing Model the Spreadsheet Forgot
The labeling cost is the line item most spreadsheets get right or close to right. The cost the spreadsheet almost always gets wrong is the labeler-management cost. Running a single English labeler pool of fifty people through one vendor is a part-time job for one program manager. Running seven locale-native labeler pools of fifty people through three to five vendors (because no single vendor has good coverage in all seven locales, especially in Asian markets) is a full-time job for one program manager and a half-time job for a labeler-quality engineer.
Per-locale costs the spreadsheet treats as zero include: vendor onboarding where the primary vendor lacks a labeler pool, time-zone-aware QA scheduling so the pool's questions can be answered without a 24-hour delay, vendor-quality monitoring that catches a labeler pool drifting in inter-rater agreement before it produces a quarter of unusable labels, and locale-specific privacy reviews — what the GDPR allows for German user-data labeling differs from what the LGPD allows for Brazilian data, which differs from what APPI allows for Japanese data, and the labeler contract has to encode that.
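The vendor-quality monitoring in that list doesn't need to be elaborate; a rolling weekly agreement check per pool is one way to surface a drifting pool before a quarter of labels becomes unusable. The window and floor below are assumed values:

```python
# Pool-health monitoring sketch: track each pool's weekly inter-rater agreement
# and flag it when the rolling average drops below a floor. Window and floor
# are assumed values, not recommendations.
from collections import deque

class PoolAgreementMonitor:
    def __init__(self, window_weeks: int = 4, floor: float = 0.5):
        self.history = deque(maxlen=window_weeks)
        self.floor = floor

    def record_week(self, kappa: float) -> bool:
        """Record this week's agreement (e.g. Fleiss' kappa); return True if the pool needs review."""
        self.history.append(kappa)
        rolling_mean = sum(self.history) / len(self.history)
        return rolling_mean < self.floor
```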
These costs scale roughly linearly with L until you hit the regulatory-complexity boundary, after which they scale superlinearly. The seventh locale isn't 7/6 the cost of the sixth locale; it's 7/6 the cost plus whatever the seventh privacy regime adds, which depending on the locale can dominate the labeling cost itself. A team launching into a market with strict data-residency requirements (parts of the EU, parts of LATAM, and China, where the labeler pool has to reside entirely in-country) finds the per-locale cost is dominated by compliance and operations, not labeling per se.
The staffing-model discipline that has to land before the localization roadmap commits is not "we'll add an eval engineer per locale" — that's overstaffing — but a tiered model: one eval-program manager who scales horizontally across locales, locale-specific eval-quality leads (often a half-FTE per locale, sourced internally from the support or i18n teams), a meta-eval engineer as a single shared role, and a vendor-management function that sits between engineering and procurement. The cost of that staffing model is real and persistent and is not in the launch spreadsheet because the launch spreadsheet treated eval as a one-time build cost rather than a steady-state operating cost.
The Architectural Realization
The realization that has to land in the AI feature org before the international launch commits is that eval cost in multilingual production is not a per-locale duplication problem. It is three different cost curves stacked on top of each other: per-locale labeling and judging that scales near-linearly with L, cross-locale meta-eval that scales with L^2 in the comparison space, and operational complexity that scales linearly with L until it hits a regulatory boundary and jumps. The 7× line item in the spreadsheet captures roughly half of the first curve and none of the other two.
Teams that survive their international launches without an eval-debt blowup do three things differently. They budget meta-eval as a named line item from day one, with a meta-eval engineer assigned and a quarterly cadence defended at planning time. They model labeler-pool health as a per-locale operational metric with SLOs, not a binary "the labelers are working" assumption — the cost of a quietly-failing pool isn't the labeling spend, it's the regression that ships because the eval missed it. And they refuse to ship into a new locale until the per-locale judge has been calibrated against a locally-authored gold set, even under launch pressure — a launched feature with an uncalibrated locale judge is a feature whose quality bar is unmonitored, and unmonitored features in production drift in directions the team doesn't see until the support inbox catches up.
The line in the international launch spreadsheet that says "7× eval cost" should read, in the version that actually predicts the bill, something closer to "L^1.3 labeling cost plus a per-locale calibration spend plus a meta-eval program plus a regulatory reserve." The spreadsheet that doesn't say that is committing to a launch whose eval-debt growth rate is going to outrun the eval-team headcount within four quarters — and the team that hits that wall first will recognize the moment by the support tickets they can't explain and the cross-locale comparisons they can no longer defend.
- https://arxiv.org/abs/2411.19799
- https://arxiv.org/abs/2505.12201
- https://aclanthology.org/2025.findings-emnlp.587.pdf
- https://www.secondtalent.com/resources/data-annotation-costs-by-country-comparing-global-rates/
- https://datavlab.ai/post/data-annotation-pricing
- https://cohere.com/blog/towards-fair-and-comprehensive-multilingual-and-multicultural-llm-benchmarking
- https://arxiv.org/html/2406.14267v1
- https://medium.com/data-science-at-microsoft/building-and-evaluating-multilingual-rag-systems-943c290ab711
- https://benchlm.ai/multilingual
