The AI Procurement Gap: Why Your Vendor Evaluation Process Can't Handle Probabilistic Systems

11 min read
Tian Pan
Software Engineer

A procurement team I worked with spent eleven weeks scoring four LLM vendors against a 312-row RFP spreadsheet. They negotiated 99.9% uptime, $0.0008 per 1K input tokens, SOC 2 Type II, and a glossy benchmark PDF that put their selected vendor 2.3 points ahead on MMLU. The contract was signed on a Friday. The following Tuesday, the vendor silently rolled a model update, and the customer-support agent the team had built started routing roughly 14% of refund requests to the wrong queue. The uptime SLA was honored. The benchmark scores were unchanged. The procurement process had functioned exactly as designed, and the system was still broken.

This is the AI procurement gap. The instruments enterprise procurement uses to manage software risk — feature checklists, uptime guarantees, security questionnaires, sample benchmarks — were built for systems whose outputs are reproducible. None of those instruments measure the thing that actually determines whether an AI vendor will keep working for you: the behavioral stability of a stochastic surface that the vendor controls and you do not.

The RFP isn't broken because procurement teams are lazy. It's broken because the probabilistic shape of the underlying product has not yet propagated into the vendor management discipline. Most enterprise procurement frameworks assume that "the software the vendor sold us last week is the software we're running this week." That assumption silently dissolves the moment you sign with a frontier model provider, and nothing in the standard contract language tells you that's what just happened.

Why feature checklists and SLAs lie to you

A feature checklist is a binary surface ("does the API support function calling?") laid over a continuous one ("how reliably does function calling produce well-formed arguments for your tool schemas?"). The first question can be answered yes/no in a sales call. The second has a distribution, not an answer, and that distribution shifts every time the vendor reweights a model, retrains a tokenizer, or rolls a safety filter update. Enterprise RFPs almost never ask the second question, because traditional vendor management has no template for "we'd like a probability distribution over outcomes, not a yes."

Uptime SLAs are worse. A 99.95% uptime guarantee means the vendor's API will respond. It says nothing about whether the response is correct, well-formed, or behaviorally consistent with the response you got last month for the same input. Practitioners have started referring to this as the "up but broken" state — the vendor's status page is green, your monitoring shows zero 5xx errors, and your product is silently regressing. The most dangerous gap in current AI procurement is exactly this: enterprises rely on uptime SLAs to manage a risk surface where uptime is the cheapest dimension to satisfy and the least correlated with user experience.
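The "up but broken" state is detectable, but only with instrumentation the SLA never asks for. A minimal sketch, assuming you run a fixed behavioral canary set against the vendor daily (the function name, thresholds, and pass rates below are all illustrative assumptions, not terms from any real contract):

```python
import statistics

# Hypothetical canary: daily pass rates on a fixed behavioral probe set.
# All names and thresholds here are illustrative assumptions.

def up_but_broken(http_error_rate: float,
                  canary_pass_rates: list[float],
                  baseline_mean: float,
                  max_drop: float = 0.03) -> bool:
    """Flag the 'up but broken' state: the API is healthy by uptime
    metrics, yet behavioral quality has drifted below baseline."""
    api_healthy = http_error_rate < 0.001          # status page is green
    current = statistics.mean(canary_pass_rates)   # today's behavioral score
    regressed = (baseline_mean - current) > max_drop
    return api_healthy and regressed

# Example: zero 5xx errors, but routing accuracy fell from 0.96 to 0.82 —
# exactly the state an uptime SLA cannot see.
print(up_but_broken(0.0, [0.81, 0.83, 0.82], baseline_mean=0.96))  # True
```

The point of the sketch is the conjunction: the alert fires precisely when the vendor's dashboard and yours disagree, which is the condition no uptime metric can represent.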

The same gap applies to security questionnaires. SOC 2, ISO 27001, and PCI controls were designed to attest to the security of code that doesn't change between attestation and audit. They do not, and structurally cannot, attest to behavioral properties of a model whose weights or sampling parameters can be silently updated without breaching any of those controls.

The bring-your-own-eval RFP

The replacement is straightforward in concept, hard in execution: replace feature checklists with task suites. Instead of asking "does your model support tool use?", ship the vendor a sealed evaluation harness drawn from your actual production traffic, redacted appropriately, and require them to run it under documented conditions. Score on the metrics that matter for your task — schema conformance rate, abstention behavior on out-of-corpus queries, latency at p95 and p99 under realistic concurrency, cost per successful task completion — and treat single-run scores as worthless. A probabilistic system requires distributional scoring: at minimum three independent runs, with mean and standard deviation reported, and ideally the full output distribution returned for your team to inspect.
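The distributional-scoring idea can be sketched in a few lines. Here `run_suite` is a stand-in for your real harness — a callable that runs every case in the suite against the vendor API and returns a schema conformance rate; the fake harness at the bottom exists only to make the example runnable:

```python
import statistics
from typing import Callable

def score_distribution(run_suite: Callable[[], float],
                       runs: int = 3) -> dict[str, float]:
    """Score a probabilistic vendor: N independent runs, report the
    distribution, never a single point estimate."""
    scores = [run_suite() for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),   # the worst run is what acceptance gates care about
        "max": max(scores),
    }

# Fake harness returning slightly varying conformance rates across runs.
fake_runs = iter([0.97, 0.95, 0.99])
result = score_distribution(lambda: next(fake_runs))
print(result["mean"])  # 0.97
```

Reporting `min` alongside mean and standard deviation is deliberate: a vendor whose worst run dips below your floor fails the gate even if the average looks fine.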

Several practical design rules emerge once you adopt this stance:

  • Make the suite domain-specific. A vendor's MMLU score tells you almost nothing about whether their model can handle your insurance-claims classifier or your contract-extraction pipeline. The eval suite must be built from your task distribution, not borrowed from public leaderboards.
  • Include adversarial and edge-case slices, not just happy-path examples. Vendors will route happy paths well. The gap between vendors widens dramatically on inputs at the long tail.
  • Require regression runs across vendor updates. The eval suite is not a one-time gate; it's a continuous integration test against the vendor's surface. Build it once, then run it weekly against the model version you're pinned to and the latest version the vendor is pushing.
  • Score on calibration, not just accuracy. A vendor whose model is wrong 8% of the time but knows when it's likely wrong is operationally far cheaper than one that's wrong 5% of the time and confident on the wrong outputs. Standard accuracy metrics flatten this distinction.
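The calibration point in the last rule is worth making concrete with toy arithmetic. The costs and rates below are made-up assumptions, and the model is the best case for calibration (flagged outputs perfectly cover the errors) — but it shows why the 8%-wrong-but-calibrated vendor can be operationally cheaper than the 5%-wrong-and-confident one:

```python
def expected_cost(error_rate: float,
                  flagged_fraction: float,
                  review_cost: float = 1.0,
                  silent_error_cost: float = 20.0) -> float:
    """Cost per task when the model can flag its own low-confidence
    outputs: flagged items get cheap human review, unflagged errors
    reach the user at full cost. Assumes flags cover errors first."""
    caught = min(error_rate, flagged_fraction)   # errors intercepted by flags
    silent = error_rate - caught                 # confident wrong answers
    return flagged_fraction * review_cost + silent * silent_error_cost

# Vendor A: 8% wrong, but flags 10% of outputs and catches its errors.
# Vendor B: 5% wrong, flags nothing — every error is a confident one.
cost_a = expected_cost(0.08, flagged_fraction=0.10)  # 0.10 * 1 + 0 * 20 = 0.10
cost_b = expected_cost(0.05, flagged_fraction=0.00)  # 0 + 0.05 * 20 = 1.00
print(cost_a < cost_b)  # True: the "less accurate" vendor is 10x cheaper to run
```

Real flags are imperfect, so real gaps are smaller — but the asymmetry between review cost and silent-error cost is what standard accuracy metrics flatten.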

The structural change is harder than the technical one. Procurement teams that own the RFP do not typically have the engineering capacity to build, maintain, and interpret an eval harness. The teams that can build it (the engineers who will use the model) are not usually inside the procurement loop. Closing this gap is itself an organizational design decision: either procurement absorbs eval engineering capability, or engineering is given veto authority over vendor selection. The middle path — procurement runs the RFP, engineering runs a parallel pilot — produces the slowest decisions and the worst outcomes.

Contract clauses that matter for AI vendors

Once you've changed how you evaluate, the contract has to catch up. The standard SaaS contract template has at least four gaps that bite specifically on AI vendors.

Model-change notification windows. Most provider contracts grant the vendor unilateral right to update the model behind your API endpoint. For production systems with behavioral dependencies, this is unacceptable. Negotiate a minimum 90-day deprecation notice for any pinned model version, and a defined notification channel that reaches your engineering team rather than your accounts-payable inbox. Practitioners increasingly require this clause before signing, and serious vendors will agree to it for enterprise customers. Smaller vendors will push back; that pushback is itself a procurement signal worth weighting heavily.

Capacity reservations. Public-facing rate limits are a shared resource. During provider-side capacity events — a competing tenant's traffic spike, a regional outage failover, or a model launch that pulls inference resources — your traffic gets the same priority as everyone else's. For workloads where degraded latency or hard rate-limit errors translate to lost revenue, you need a reserved-capacity tier with documented headroom. The contract should specify the floor: how many requests per second, with what queue priority, under what conditions the floor is honored, and what compensation applies if it isn't.
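A capacity floor in the contract is only worth what your own logs can prove about it. A minimal client-side check, with placeholder floor numbers — in practice the contracted values come from the agreement and the observed values from your request logs:

```python
def floor_honored(observed_rps: float,
                  observed_429_rate: float,
                  contracted_rps: float = 50.0,
                  max_429_rate: float = 0.001) -> bool:
    """True if the vendor delivered at least the reserved throughput
    without shedding your traffic via rate-limit (HTTP 429) errors.
    Floor values here are placeholder assumptions, not real terms."""
    return observed_rps >= contracted_rps and observed_429_rate <= max_429_rate

# During a provider-side capacity event you offered 60 rps but were
# throttled to 41 rps with 4% of requests rejected:
print(floor_honored(41.0, 0.04))  # False — the compensation clause applies
```

Measuring both dimensions matters: a vendor can satisfy a raw throughput number while rejecting a meaningful fraction of requests, or vice versa.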

Data residency for traces, not just inputs. Most vendors will tell you where they process your input tokens. Far fewer will tell you where they store the inference traces, the prompt-completion logs used for abuse monitoring, and the eval data they collect when your usage triggers a content filter. For regulated industries — healthcare, financial services, EU jurisdictions under GDPR — this distinction matters. Your DPA needs language that covers the full lifecycle of every byte your system sends to the vendor, including data the vendor generates from observing your traffic.

Eval transparency and methodology disclosure. The most underused clause in current AI procurement is a requirement that the vendor disclose, under NDA if necessary, the eval methodology they use internally to validate model updates. A vendor who will share their internal regression suite, their accepted-degradation thresholds, and their pre-release human evaluation protocol is telling you something fundamental about their operational maturity. A vendor who refuses, or who tries to substitute a benchmark PDF, is also telling you something fundamental — just less flattering. This clause is harder to negotiate than the others, because it's culturally newer; it's also the one that distinguishes a vendor you can build a multi-year dependency on from one that will surprise you on a Tuesday.

What a vendor's eval methodology actually tells you

Procurement folklore says benchmarks are the comparable artifact across vendors. They are not, for two reasons. First, benchmarks are public; vendors train against them, intentionally or by data contamination, and the score gap between two frontier models on a public eval is mostly noise from the perspective of your specific workload. Second, the benchmark answers a question — "how well does this model do on this fixed task?" — that almost never matches your question, which is "how reliably does this model handle the specific distribution of inputs my product generates, and will that reliability hold across the next three model updates?"

The signal that actually distinguishes serious vendors is access to their own quality discipline. When you can see how a vendor thinks about regression — what they consider a regression, what threshold triggers a release block, how they validate against historical user traffic before pushing — you learn more in one document than from any third-party leaderboard. A vendor that runs sophisticated internal evals is one whose model behavior is bounded by something other than hope. A vendor that ships and watches the support queue is not.

This is also the most defensible signal against future model changes. The model the vendor sold you in Q1 will not be the model running in Q4. What persists is the methodology and the engineering culture that produced both. If the methodology is rigorous, the model in Q4 is likely to behave within the bounds your evals validated in Q1. If the methodology is "we'll see how it goes," no contract clause will save you.

The new procurement loop

What replaces the eleven-week RFP spreadsheet is closer to a continuous procurement function than a one-time event. The selection moment shrinks; the ongoing measurement expands. A working version looks roughly like this:

  • Pre-selection (weeks): Build the bring-your-own task suite from production traffic. Define the distributional acceptance criteria — not "must score 85% accuracy" but "p50 schema conformance must exceed 0.97 with standard deviation below 0.02 across three runs." Run two or three candidate vendors through the same harness under the same conditions and compare distributions, not point estimates.
  • Contracting (weeks): Negotiate the four AI-specific clauses above alongside standard SaaS terms. Reject vendors who will not commit to model-change notification windows; this is a structural signal about how seriously they take their enterprise customers.
  • Pinning and shadow-running (continuous): Pin the model version your evals accepted. Run weekly eval regressions against both the pinned version and any newer version the vendor is pushing. The diff is your migration runway.
  • Ongoing renegotiation: Treat the relationship as a managed dependency, not a settled contract. Capacity needs grow, evals expand, regulatory surface shifts; the vendor relationship should be reopened on a defined cadence, not when the SOC 2 audit is due.
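The shadow-running step reduces to a per-metric diff between the pinned version and the vendor's latest. A sketch, where the score dictionaries would come from your eval harness (the metric names and numbers below are illustrative):

```python
def regression_diff(pinned: dict[str, float],
                    latest: dict[str, float],
                    tolerance: float = 0.02) -> dict[str, float]:
    """Metrics where the latest model version regressed beyond tolerance
    relative to the pinned version. An empty dict means the migration
    runway is clear; a non-empty one means you have work before the
    vendor's deprecation window closes."""
    return {
        metric: latest[metric] - pinned[metric]
        for metric in pinned
        if pinned[metric] - latest.get(metric, 0.0) > tolerance
    }

pinned_scores = {"schema_conformance": 0.98, "abstention_f1": 0.91}
latest_scores = {"schema_conformance": 0.93, "abstention_f1": 0.92}
print(list(regression_diff(pinned_scores, latest_scores)))  # ['schema_conformance']
```

Note the asymmetry: improvements are ignored, only regressions surface. The weekly output of this diff is exactly the artifact that turns a vendor's 90-day deprecation notice into an actionable migration plan.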

The teams that operate this way — and there are still very few of them — share a structural feature: engineering owns the eval harness, legal owns the contract surface, and procurement owns the relationship cadence. The handoff between those three functions is where most enterprises currently leak risk. Closing it requires a procurement organization that is willing to admit its existing playbook does not work for AI, and that the cost of admitting this is small compared with the cost of the silent regressions waiting in production.

The enterprise that gets this right does not buy a model. It buys a measurement contract attached to a probabilistic surface, with explicit terms for how that surface is allowed to move and explicit instrumentation for catching it when it does. That mental model is the procurement primitive AI actually needs, and it is not yet in any standard playbook. The procurement team that builds it first — internally, painfully, before regulators force the issue — is the one that turns a Friday signature into something that still works on Tuesday.
