
The Model Card Benchmark Whose Methodology Shifted While Your Contract Cited the Number

· 11 min read
Tian Pan
Software Engineer

Your procurement team renewed the inference contract last quarter and noted, with quiet satisfaction, that the quality clause referencing "HumanEval pass@1 of 84%" had been comfortably exceeded by the provider's latest model card, which now reports 87%. Three points to the good. The clause is satisfied. The relationship is healthy. Meanwhile, your inference team's own regression suite — the one that actually exercises the tasks your product depends on — shows a 2% decline on held-out evaluation cases since the model update shipped. Both numbers are real. Only one of them is in the contract.

This is what it looks like when a marketing artifact is load-bearing in a legal document. The benchmark number on the model card is the headline of a measurement; the methodology that produced it is a footnote in an appendix nobody on the contract review chain reads. When the provider changes the methodology — switches from greedy decode to best-of-three sampling, adds a structured-output system message, swaps the prompt template to match the model's new chat tuning — the number moves in a way that has nothing to do with your traffic and everything to do with how the number is computed. Your contract clause cites the number. The counterparty controls the protocol that produces it. You've signed a clause whose meaning the other side can revise without violating it.

The benchmark is a measurement, not a property

The instinct most procurement language reflects is that a benchmark score is a property of the model — like its parameter count, or the size of its context window. Cite the number, pin the quality, move on. But a benchmark score is the result of a measurement, and a measurement has a protocol. HumanEval pass@1 of 84% is shorthand for "this model, evaluated on these 164 problems, prompted in this way, decoded with this strategy, scored with this criterion, produced correct code on 84% of attempts." Every clause in that sentence after "this model" is part of the number. Change any of them and the same model weights produce a different score.
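One way to make "the protocol is part of the number" concrete is to record the protocol as data next to the score. The sketch below is illustrative only: `EvalProtocol`, its field names, and the label strings such as "zero-shot" and "best-of-3" are assumptions made for this example, not a schema any provider publishes.

```python
from dataclasses import dataclass, asdict, replace
import hashlib
import json


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical record of everything besides the weights that shapes a score."""
    benchmark: str      # which problem set
    num_problems: int   # 164 for HumanEval
    prompt_style: str   # illustrative label for how the model is prompted
    decoding: str       # illustrative label for the decoding strategy
    scoring: str        # illustrative label for the pass criterion

    def fingerprint(self) -> str:
        """Short stable hash of the protocol; it changes if any field changes."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# The protocol the contract number was (implicitly) measured under.
contract_protocol = EvalProtocol(
    benchmark="HumanEval",
    num_problems=164,
    prompt_style="zero-shot",
    decoding="greedy",
    scoring="original-tests",
)

# Same benchmark, same problems, one field changed on the new model card.
card_protocol = replace(contract_protocol, decoding="best-of-3")

same_measurement = contract_protocol.fingerprint() == card_protocol.fingerprint()
print(same_measurement)  # False
```

Storing the fingerprint alongside every score you record makes a silent protocol change show up as a mismatch rather than as an improvement.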

The variance is not academic. Single-run pass@1 estimates of the same model on the same benchmark vary by 2 to 6 percentage points across runs, with standard deviations exceeding 1.5 points even at temperature zero, before any methodology change enters the picture. Switch from greedy decode to best-of-three sampling and the reported pass@1 can move by 5 to 10 points without touching the weights. Add few-shot examples to the prompt and you move it further. Use a chat-tuned template that matches the model's preferred format and you move it further still. HumanEval+, which uses the same problems but stricter test cases, drops GPT-4 from 88.4% to 76.2%: a twelve-point gap from a single methodology choice about what counts as a passing solution.
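The sampling effect can be checked with the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. The sketch below assumes "best-of-three" is scored as "any of three samples passes", which makes it equivalent to pass@3; that is an assumption about the scoring, and the per-problem counts in `results` are invented for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Invented data: four problems, ten samples each, with 10, 5, 2 and 0 passing.
results = [(10, 10), (10, 5), (10, 2), (10, 0)]

print(f"{benchmark_score(results, k=1):.4f}")  # 0.4250
print(f"{benchmark_score(results, k=3):.4f}")  # 0.6125
```

The weights and the samples are identical in both lines; only the rule for turning samples into a score changed.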

The provider knows this. The model card knows this. The footnote on page seventeen of the technical report knows this. The procurement contract, in most cases, does not.

Methodology drift is not a bug — it's how the number is allowed to move

The uncomfortable observation is that providers have a structural incentive to evolve evaluation methodology in the direction the new model's training optimized for. A model fine-tuned for structured output evaluated under a prompt that solicits structured output will score higher than the same model evaluated under the original unstructured prompt. A model trained on chat templates evaluated with the chat template applied will outscore the same model evaluated without it. None of this is dishonest. The model card discloses the change, usually in a footnote, sometimes in an appendix, occasionally in a separate methodology document linked from the main page. The disclosure satisfies the obligation. The headline number moves. The two facts are independently true.

What this means in practice is that a benchmark number on a model card is best understood as a measurement under the protocol the provider chose to use when measuring the model they wanted to ship. The protocol is part of the product. When the protocol changes, the number changes, and the change is intentional. The model card is doing exactly what it was designed to do: present the model favorably under a methodology that has been retrofitted to the model's strengths. This is not a flaw of the model card system. It is the model card system working as intended, in the service of a release-marketing function that lives one floor above where your contract lives.

The failure mode is the assumption that the methodology is stable. It is not stable. The number can rise three points across a release because the model got better. It can rise three points across a release because the evaluation harness got tuned. It can rise for both reasons simultaneously, with the contribution of each component undisclosed. Your contract clause cannot tell the difference, and neither can your procurement team. Your model card change tracker, if you have one (most teams do not), has to read the methodology footnote to tell the difference, and that footnote is often a single sentence buried in a PDF appendix.
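If you can re-run the new model under the old protocol yourself, the two contributions can be separated, at least roughly. The following is a minimal sketch under that assumption; `ScoreGrid`, `decompose`, and the three scores are hypothetical, and attributing the change in this order (model first, then protocol) is one reasonable convention rather than the only one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreGrid:
    """Pass rates in percentage points for three (model, protocol) pairs."""
    old_model_old_protocol: float  # the number the contract cited
    new_model_old_protocol: float  # your own re-run under the pinned protocol
    new_model_new_protocol: float  # the number on the new model card


def decompose(grid: ScoreGrid) -> dict[str, float]:
    """Split the headline change into a model part and a protocol part."""
    model_effect = grid.new_model_old_protocol - grid.old_model_old_protocol
    protocol_effect = grid.new_model_new_protocol - grid.new_model_old_protocol
    return {
        "headline_change": model_effect + protocol_effect,
        "model_effect": model_effect,
        "protocol_effect": protocol_effect,
    }


# Invented numbers: the card rises three points while the model slips one.
grid = ScoreGrid(
    old_model_old_protocol=84.0,
    new_model_old_protocol=83.0,
    new_model_new_protocol=87.0,
)
breakdown = decompose(grid)
print(breakdown)
# {'headline_change': 3.0, 'model_effect': -1.0, 'protocol_effect': 4.0}
```

Without the middle measurement, the headline change is all you see, which is exactly the position the contract clause is in.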

The contract clause that actually means what you think it means

The fix for a procurement clause that references a benchmark number is to reference the benchmark together with its version and its evaluation protocol. "HumanEval pass@1 of 84%" is a sentence whose meaning the counterparty can revise, without breaching the contract, by changing the protocol. "HumanEval pass@1 of 84% under the v1.0 evaluation protocol with greedy decode, zero-shot prompting, and the original test cases" is a sentence whose meaning is fixed. The methodology pin is the load-bearing clause. The number is the consequence.
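A pinned clause can also be checked mechanically. The sketch below encodes the clause from the paragraph above as a plain dictionary and refuses to compare a reported score against the threshold unless every methodology field matches. The field names, the `reported_card` values, and the helper functions are hypothetical; a real model card would have to be transcribed into this shape by hand.

```python
# The pinned clause, encoded field by field (field names are illustrative).
PINNED_CLAUSE = {
    "benchmark": "HumanEval",
    "metric": "pass@1",
    "protocol_version": "v1.0",
    "decoding": "greedy",
    "prompting": "zero-shot",
    "test_cases": "original",
    "threshold": 0.84,
}


def protocol_drift(pinned: dict, reported: dict) -> dict:
    """Return {field: (pinned, reported)} for each methodology field that differs."""
    return {
        field: (value, reported.get(field))
        for field, value in pinned.items()
        if field != "threshold" and reported.get(field) != value
    }


def clause_status(pinned: dict, reported: dict, score: float) -> tuple[bool, str]:
    """A score counts toward the clause only if it was measured as pinned."""
    drift = protocol_drift(pinned, reported)
    if drift:
        return False, "not comparable: " + ", ".join(sorted(drift))
    if score >= pinned["threshold"]:
        return True, "met"
    return False, "below threshold"


# Hypothetical transcription of a new model card's methodology footnote.
reported_card = {
    "benchmark": "HumanEval",
    "metric": "pass@1",
    "protocol_version": "v1.0",
    "decoding": "best-of-3",
    "prompting": "chat-template",
    "test_cases": "original",
}

print(clause_status(PINNED_CLAUSE, reported_card, 0.87))
# (False, 'not comparable: decoding, prompting')
```

Under this check, the 87% on the new card neither satisfies nor violates the clause; it is simply a different measurement.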

The harder version of this fix recognizes that the published benchmark is not what you actually care about. What you actually care about is whether the model performs well on your tasks. The contract clause that names the published benchmark is a proxy for the contract clause that names your held-out evaluation set, and the proxy is leaky in proportion to how much your traffic resembles the benchmark's distribution. For most teams, the answer is "not much" — your traffic is in a domain the benchmark does not cover, in a format the benchmark does not test, with success criteria the benchmark does not score against. The published number is informational. The contractual quality referent should be your own task suite, run under a protocol you control, against a regression baseline you maintain.
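In code, a quality referent you control can be as small as a paired comparison over your own held-out cases. The following is a rough sketch, assuming each case yields a simple pass or fail; the case names, the `max_drop` tolerance, and the report format are all made up for illustration, and a production gate would also want to account for run-to-run noise.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed."""
    return sum(results.values()) / len(results)


def regression_report(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    max_drop: float = 0.01,
) -> dict:
    """Compare two runs of the same suite under the same protocol."""
    if baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate must cover the same cases")
    # Cases that passed before and fail now, and the reverse.
    regressed = sorted(c for c in baseline if baseline[c] and not candidate[c])
    fixed = sorted(c for c in baseline if not baseline[c] and candidate[c])
    delta = pass_rate(candidate) - pass_rate(baseline)
    return {
        "delta": delta,
        "regressed": regressed,
        "fixed": fixed,
        "passes_gate": delta >= -max_drop,
    }


# Invented held-out cases from a hypothetical product task suite.
baseline = {"invoice-01": True, "invoice-02": True, "refund-01": True,
            "refund-02": False, "export-01": True}
candidate = {"invoice-01": True, "invoice-02": False, "refund-01": True,
             "refund-02": False, "export-01": False}

report = regression_report(baseline, candidate)
print(report["regressed"])    # ['export-01', 'invoice-02']
print(report["passes_gate"])  # False
```

Listing the regressed cases by name, rather than reporting only the aggregate, gives the contract conversation something concrete to point at.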
