2 posts tagged with "model-cards"

The Model Card Benchmark Whose Methodology Shifted While Your Contract Cited the Number

June 3, 2026 · 11 min read

Software Engineer

Your procurement team renewed the inference contract last quarter and noted, with quiet satisfaction, that the quality clause referencing "HumanEval pass@1 of 84%" had been comfortably exceeded by the provider's latest model card, which now reports 87%. Three points to the good. The clause is satisfied. The relationship is healthy. Meanwhile, your inference team's own regression suite — the one that actually exercises the tasks your product depends on — shows a 2% decline on held-out evaluation cases since the model update shipped. Both numbers are real. Only one of them is in the contract.

This is what it looks like when a marketing artifact is load-bearing in a legal document. The benchmark number on the model card is the headline of a measurement; the methodology that produced it is a footnote in an appendix nobody on the contract review chain reads. When the provider changes the methodology — switches from greedy decode to best-of-three sampling, adds a structured-output system message, swaps the prompt template to match the model's new chat tuning — the number moves in a way that has nothing to do with your traffic and everything to do with how the number is computed. Your contract clause cites the number. The counterparty controls the protocol that produces it. You've signed a clause whose meaning the other side can revise without violating it.

The Model Card Your Procurement Team Treated Like a Datasheet

June 2, 2026 · 11 min read

Tian Pan

Software Engineer

A model card is a research artifact. A datasheet is a contract. Procurement teams routinely read the first as if it were the second, and the AI vendor that handed it over is now bound to claims its engineering team thought were narrative.

This is the cleanest way to lose a renewal: you forwarded the same PDF you publish on your model index page, the customer's legal team excerpted four sentences into Schedule B, and twelve months later you discover that "intended use: general question answering" has become a contractual representation about scope of service. Your team measured those sentences in BLEU points. Their team is now measuring them in breach.

About Tian Pan