The AI Procurement Gap: Why Your Vendor Evaluation Process Can't Handle Probabilistic Systems

11 min read
Tian Pan
Software Engineer

A procurement team I worked with spent eleven weeks scoring four LLM vendors against a 312-row RFP spreadsheet. They negotiated 99.9% uptime, $0.0008 per 1K input tokens, SOC 2 Type II, and a glossy benchmark PDF that put their selected vendor 2.3 points ahead on MMLU. The contract was signed on a Friday. The following Tuesday, the vendor silently rolled a model update, and the customer-support agent the team had built started routing roughly 14% of refund requests to the wrong queue. The uptime SLA was honored. The benchmark scores were unchanged. The procurement process had functioned exactly as designed, and the system was still broken.

This is the AI procurement gap. The instruments enterprise procurement uses to manage software risk — feature checklists, uptime guarantees, security questionnaires, sample benchmarks — were built for systems whose outputs are reproducible. None of those instruments measure the thing that actually determines whether an AI vendor will keep working for you: the behavioral stability of a stochastic surface that the vendor controls and you do not.

The RFP isn't broken because procurement teams are lazy. It's broken because the probabilistic shape of the underlying product has not yet propagated into the vendor management discipline. Most enterprise procurement frameworks assume that "the software the vendor sold us last week is the software we're running this week." That assumption silently dissolves the moment you sign with a frontier model provider, and nothing in the standard contract language tells you that's what just happened.

Why feature checklists and SLAs lie to you

A feature checklist is a binary surface ("does the API support function calling?") laid over a continuous one ("how reliably does function calling produce well-formed arguments for your tool schemas?"). The first question can be answered yes/no in a sales call. The second has a distribution, not an answer, and that distribution shifts every time the vendor reweights a model, retrains a tokenizer, or rolls a safety filter update. Enterprise RFPs almost never ask the second question, because traditional vendor management has no template for "we'd like a probability distribution over outcomes, not a yes."

Uptime SLAs are worse. A 99.95% uptime guarantee means the vendor's API will respond. It says nothing about whether the response is correct, well-formed, or behaviorally consistent with the response you got last month for the same input. Practitioners have started referring to this as the "up but broken" state — the vendor's status page is green, your monitoring shows zero 5xx errors, and your product is silently regressing. The most dangerous gap in current AI procurement is exactly this: enterprises rely on uptime SLAs to manage a risk surface where uptime is the cheapest dimension to satisfy and the least correlated with user experience.
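The "up but broken" state is detectable, but only with instrumentation the SLA never asks for. A minimal sketch, assuming you run a fixed behavioral canary set against the vendor daily (the function name, thresholds, and pass rates below are all illustrative assumptions, not terms from any real contract):

```python
import statistics

# Hypothetical canary: daily pass rates on a fixed behavioral probe set.
# All names and thresholds here are illustrative assumptions.

def up_but_broken(http_error_rate: float,
                  canary_pass_rates: list[float],
                  baseline_mean: float,
                  max_drop: float = 0.03) -> bool:
    """Flag the 'up but broken' state: the API is healthy by uptime
    metrics, yet behavioral quality has drifted below baseline."""
    api_healthy = http_error_rate < 0.001          # status page is green
    current = statistics.mean(canary_pass_rates)   # today's behavioral score
    regressed = (baseline_mean - current) > max_drop
    return api_healthy and regressed

# Example: zero 5xx errors, but routing accuracy fell from 0.96 to 0.82 —
# exactly the state an uptime SLA cannot see.
print(up_but_broken(0.0, [0.81, 0.83, 0.82], baseline_mean=0.96))  # True
```

The point of the sketch is the conjunction: the alert fires precisely when the vendor's dashboard and yours disagree, which is the condition no uptime metric can represent.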

The same gap applies to security questionnaires. SOC 2, ISO 27001, and PCI controls were designed to attest to the security of code that doesn't change between attestation and audit. They do not, and structurally cannot, attest to behavioral properties of a model whose weights or sampling parameters can be silently updated without breaching any of those controls.

The bring-your-own-eval RFP

The replacement is straightforward in concept, hard in execution: replace feature checklists with task suites. Instead of asking "does your model support tool use?", ship the vendor a sealed evaluation harness drawn from your actual production traffic, redacted appropriately, and require them to run it under documented conditions. Score on the metrics that matter for your task — schema conformance rate, abstention behavior on out-of-corpus queries, latency at p95 and p99 under realistic concurrency, cost per successful task completion — and treat single-run scores as worthless. A probabilistic system requires distributional scoring: at minimum three independent runs, with mean and standard deviation reported, and ideally the full output distribution returned for your team to inspect.
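The distributional-scoring idea can be sketched in a few lines. Here `run_suite` is a stand-in for your real harness — a callable that runs every case in the suite against the vendor API and returns a schema conformance rate; the fake harness at the bottom exists only to make the example runnable:

```python
import statistics
from typing import Callable

def score_distribution(run_suite: Callable[[], float],
                       runs: int = 3) -> dict[str, float]:
    """Score a probabilistic vendor: N independent runs, report the
    distribution, never a single point estimate."""
    scores = [run_suite() for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),   # the worst run is what acceptance gates care about
        "max": max(scores),
    }

# Fake harness returning slightly varying conformance rates across runs.
fake_runs = iter([0.97, 0.95, 0.99])
result = score_distribution(lambda: next(fake_runs))
print(result["mean"])  # 0.97
```

Reporting `min` alongside mean and standard deviation is deliberate: a vendor whose worst run dips below your floor fails the gate even if the average looks fine.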

Several practical design rules emerge once you adopt this stance:

  • Make the suite domain-specific. A vendor's MMLU score tells you almost nothing about whether their model can handle your insurance-claims classifier or your contract-extraction pipeline. The eval suite must be built from your task distribution, not borrowed from public leaderboards.
  • Include adversarial and edge-case slices, not just happy-path examples. Vendors will route happy paths well. The gap between vendors widens dramatically on inputs at the long tail.
  • Require regression runs across vendor updates. The eval suite is not a one-time gate; it's a continuous integration test against the vendor's surface. Build it once, then run it weekly against the model version you're pinned to and the latest version the vendor is pushing.
  • Score on calibration, not just accuracy. A vendor whose model is wrong 8% of the time but knows when it's likely wrong is operationally far cheaper than one that's wrong 5% of the time and confident on the wrong outputs. Standard accuracy metrics flatten this distinction.
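The calibration point in the last rule is worth making concrete with toy arithmetic. The costs and rates below are made-up assumptions, and the model is the best case for calibration (flagged outputs perfectly cover the errors) — but it shows why the 8%-wrong-but-calibrated vendor can be operationally cheaper than the 5%-wrong-and-confident one:

```python
def expected_cost(error_rate: float,
                  flagged_fraction: float,
                  review_cost: float = 1.0,
                  silent_error_cost: float = 20.0) -> float:
    """Cost per task when the model can flag its own low-confidence
    outputs: flagged items get cheap human review, unflagged errors
    reach the user at full cost. Assumes flags cover errors first."""
    caught = min(error_rate, flagged_fraction)   # errors intercepted by flags
    silent = error_rate - caught                 # confident wrong answers
    return flagged_fraction * review_cost + silent * silent_error_cost

# Vendor A: 8% wrong, but flags 10% of outputs and catches its errors.
# Vendor B: 5% wrong, flags nothing — every error is a confident one.
cost_a = expected_cost(0.08, flagged_fraction=0.10)  # 0.10 * 1 + 0 * 20 = 0.10
cost_b = expected_cost(0.05, flagged_fraction=0.00)  # 0 + 0.05 * 20 = 1.00
print(cost_a < cost_b)  # True: the "less accurate" vendor is 10x cheaper to run
```

Real flags are imperfect, so real gaps are smaller — but the asymmetry between review cost and silent-error cost is what standard accuracy metrics flatten.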

The structural change is harder than the technical one. Procurement teams that own the RFP do not typically have the engineering capacity to build, maintain, and interpret an eval harness. The teams that can build it (the engineers who will use the model) are not usually inside the procurement loop. Closing this gap is itself an organizational design decision: either procurement absorbs eval engineering capability, or engineering is given veto authority over vendor selection. The middle path — procurement runs the RFP, engineering runs a parallel pilot — produces the slowest decisions and the worst outcomes.

Contract clauses that matter for AI vendors

Once you've changed how you evaluate, the contract has to catch up. The standard SaaS contract template has at least four gaps that bite specifically on AI vendors.

Model-change notification windows. Most provider contracts grant the vendor unilateral right to update the model behind your API endpoint. For production systems with behavioral dependencies, this is unacceptable. Negotiate a minimum 90-day deprecation notice for any pinned model version, and a defined notification channel that reaches your engineering team rather than your accounts-payable inbox. Practitioners increasingly require this clause before signing, and serious vendors will agree to it for enterprise customers. Smaller vendors will push back; that pushback is itself a procurement signal worth weighting heavily.

Capacity reservations. Public-facing rate limits are a shared resource. During provider-side capacity events — a competing tenant's traffic spike, a regional outage failover, or a model launch that pulls inference resources — your traffic gets the same priority as everyone else's. For workloads where degraded latency or hard rate-limit errors translate to lost revenue, you need a reserved-capacity tier with documented headroom. The contract should specify the floor: how many requests per second, with what queue priority, under what conditions the floor is honored, and what compensation applies if it isn't.
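A capacity floor in the contract is only worth what your own logs can prove about it. A minimal client-side check, with placeholder floor numbers — in practice the contracted values come from the agreement and the observed values from your request logs:

```python
def floor_honored(observed_rps: float,
                  observed_429_rate: float,
                  contracted_rps: float = 50.0,
                  max_429_rate: float = 0.001) -> bool:
    """True if the vendor delivered at least the reserved throughput
    without shedding your traffic via rate-limit (HTTP 429) errors.
    Floor values here are placeholder assumptions, not real terms."""
    return observed_rps >= contracted_rps and observed_429_rate <= max_429_rate

# During a provider-side capacity event you offered 60 rps but were
# throttled to 41 rps with 4% of requests rejected:
print(floor_honored(41.0, 0.04))  # False — the compensation clause applies
```

Measuring both dimensions matters: a vendor can satisfy a raw throughput number while rejecting a meaningful fraction of requests, or vice versa.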

Data residency for traces, not just inputs. Most vendors will tell you where they process your input tokens. Far fewer will tell you where they store the inference traces, the prompt-completion logs used for abuse monitoring, and the eval data they collect when your usage triggers a content filter. For regulated industries — healthcare, financial services, EU jurisdictions under GDPR — this distinction matters. Your DPA needs language that covers the full lifecycle of every byte your system sends to the vendor, including data the vendor generates from observing your traffic.

Eval transparency and methodology disclosure. The most underused clause in current AI procurement is a requirement that the vendor disclose, under NDA if necessary, the eval methodology they use internally to validate model updates. A vendor who will share their internal regression suite, their accepted-degradation thresholds, and their pre-release human evaluation protocol is telling you something fundamental about their operational maturity. A vendor who refuses, or who tries to substitute a benchmark PDF, is also telling you something fundamental — just less flattering. This clause is harder to negotiate than the others, because it's culturally newer; it's also the one that distinguishes a vendor you can build a multi-year dependency on from one that will surprise you on a Tuesday.

What a vendor's eval methodology actually tells you

Procurement folklore says benchmarks are the comparable artifact across vendors. They are not, for two reasons. First, benchmarks are public; vendors train against them, intentionally or by data contamination, and the score gap between two frontier models on a public eval is mostly noise from the perspective of your specific workload. Second, the benchmark answers a question — "how well does this model do on this fixed task?" — that almost never matches your question, which is "how reliably does this model handle the specific distribution of inputs my product generates, and will that reliability hold across the next three model updates?"

The signal that actually distinguishes serious vendors is access to their own quality discipline. When you can see how a vendor thinks about regression — what they consider a regression, what threshold triggers a release block, how they validate against historical user traffic before pushing — you learn more in one document than from any third-party leaderboard. A vendor that runs sophisticated internal evals is one whose model behavior is bounded by something other than hope. A vendor that ships and watches the support queue is not.

This is also the most defensible signal against future model changes. The model the vendor sold you in Q1 will not be the model running in Q4. What persists is the methodology and the engineering culture that produced both. If the methodology is rigorous, the model in Q4 is likely to behave within the bounds your evals validated in Q1. If the methodology is "we'll see how it goes," no contract clause will save you.

The new procurement loop

What replaces the eleven-week RFP spreadsheet is closer to a continuous procurement function than a one-time event. The selection moment shrinks; the ongoing measurement expands. A working version looks roughly like this:

  • Pre-selection (weeks): Build the bring-your-own task suite from production traffic. Define the distributional acceptance criteria — not "must score 85% accuracy" but "p50 schema conformance must exceed 0.97 with standard deviation below 0.02 across three runs." Run two or three candidate vendors through the same harness under the same conditions and compare distributions, not point estimates.
  • Contracting (weeks): Negotiate the four AI-specific clauses above alongside standard SaaS terms. Reject vendors who will not commit to model-change notification windows; this is a structural signal about how seriously they take their enterprise customers.
  • Pinning and shadow-running (continuous): Pin the model version your evals accepted. Run weekly eval regressions against both the pinned version and any newer version the vendor is pushing. The diff is your migration runway.
  • Ongoing renegotiation: Treat the relationship as a managed dependency, not a settled contract. Capacity needs grow, evals expand, regulatory surface shifts; the vendor relationship should be reopened on a defined cadence, not when the SOC 2 audit is due.
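The shadow-running step reduces to a per-metric diff between the pinned version and the vendor's latest. A sketch, where the score dictionaries would come from your eval harness (the metric names and numbers below are illustrative):

```python
def regression_diff(pinned: dict[str, float],
                    latest: dict[str, float],
                    tolerance: float = 0.02) -> dict[str, float]:
    """Metrics where the latest model version regressed beyond tolerance
    relative to the pinned version. An empty dict means the migration
    runway is clear; a non-empty one means you have work before the
    vendor's deprecation window closes."""
    return {
        metric: latest[metric] - pinned[metric]
        for metric in pinned
        if pinned[metric] - latest.get(metric, 0.0) > tolerance
    }

pinned_scores = {"schema_conformance": 0.98, "abstention_f1": 0.91}
latest_scores = {"schema_conformance": 0.93, "abstention_f1": 0.92}
print(list(regression_diff(pinned_scores, latest_scores)))  # ['schema_conformance']
```

Note the asymmetry: improvements are ignored, only regressions surface. The weekly output of this diff is exactly the artifact that turns a vendor's 90-day deprecation notice into an actionable migration plan.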

The teams that operate this way — and there are still very few of them — share a structural feature: engineering owns the eval harness, legal owns the contract surface, and procurement owns the relationship cadence. The handoff between those three functions is where most enterprises currently leak risk. Closing it requires a procurement organization that is willing to admit its existing playbook does not work for AI, and that the cost of admitting this is small compared with the cost of the silent regressions waiting in production.

The enterprise that gets this right does not buy a model. It buys a measurement contract attached to a probabilistic surface, with explicit terms for how that surface is allowed to move and explicit instrumentation for catching it when it does. That mental model is the procurement primitive AI actually needs, and it is not yet in any standard playbook. The procurement team that builds it first — internally, painfully, before regulators force the issue — is the one that turns a Friday signature into something that still works on Tuesday.
