Vendor Benchmarks Are Your Ceiling, Not Your Forecast
The model release announcement lands on Tuesday morning. The blog post leads with a chart: HumanEval up four points, SWE-bench Verified up six, MATH up three, the agent harness du jour up a number that would have been a research paper a year ago. By Tuesday afternoon there is a Slack thread inside your company with screenshots of the chart and a question shaped like a decision: "Should we cut over?" The thread treats the benchmark delta as a forecast — as if those numbers describe what the new model will do for your product, on your prompts, in your tool harness, against your eval rubric. They do not. The vendor's number is the upper bound on what you might see. Your realized lift is somewhere between zero and roughly half of that headline, and you cannot know where it lands without running an eval the vendor did not run.
This is not a complaint about benchmark validity. The benchmarks are real. They are run against real eval suites. The vendor is not lying. The problem is that the vendor's harness is an idealized environment that strips away every variable a production deployment introduces, and a number generated under those conditions is structurally incapable of predicting behavior under yours. Treating it as a prediction is a category error — and it leads to procurement decisions, capacity-planning commitments, and rollout schedules that are calibrated against a fiction.
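
To make "the eval the vendor did not run" concrete, here is a minimal sketch of what that comparison looks like. Everything in it is hypothetical scaffolding: EvalCase, run_model, and passes are stand-in names for your own cases, your own prompt-and-tool harness, and your own rubric, and the toy models in the usage block exist only to make the script run. The point is narrow: the number the Slack thread actually needs is a delta measured on your suite, under your conditions, not the vendor's.

```python
# Sketch: measure realized lift on your own eval suite, with your own rubric,
# and compare it against the vendor's headline delta. All names here are
# illustrative placeholders, not a real harness.

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalCase:
    prompt: str
    expected: str  # whatever your rubric needs to grade an output


def pass_rate(
    cases: Sequence[EvalCase],
    run_model: Callable[[str], str],          # your production prompts + tool harness
    passes: Callable[[str, EvalCase], bool],  # your rubric, not the vendor's
) -> float:
    results = [passes(run_model(case.prompt), case) for case in cases]
    return sum(results) / len(results)


def realized_lift(
    cases: Sequence[EvalCase],
    run_current: Callable[[str], str],
    run_candidate: Callable[[str], str],
    passes: Callable[[str, EvalCase], bool],
) -> float:
    """Percentage-point delta on your suite: the number the decision actually needs."""
    return pass_rate(cases, run_candidate, passes) - pass_rate(cases, run_current, passes)


if __name__ == "__main__":
    # Toy stand-ins purely for illustration; in practice these call your deployment.
    cases = [EvalCase(prompt=f"case-{i}", expected="ok") for i in range(20)]
    run_current = lambda p: "ok" if int(p.split("-")[1]) % 10 < 6 else "miss"    # 12/20 pass
    run_candidate = lambda p: "ok" if int(p.split("-")[1]) % 10 < 7 else "miss"  # 14/20 pass
    passes = lambda output, case: output == case.expected

    lift = realized_lift(cases, run_current, run_candidate, passes)
    vendor_headline = 0.06  # e.g. "+6 points on SWE-bench Verified"
    print(f"realized lift on our suite: {lift:+.1%} (vendor headline: {vendor_headline:+.1%})")
```

Nothing in that sketch is sophisticated, and that is the argument: the comparison is cheap relative to the procurement and capacity decisions riding on it, and it is the only measurement that answers the question the Slack thread is actually asking.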
