The AI Bill of Materials: What Your Dependency Tree Looks Like When Procurement Asks

· 11 min read
Tian Pan
Software Engineer

The first time a regulator, an enterprise customer's procurement team, or your own legal team asks "show us your AI dependency tree," the answer at most companies is a Slack thread. Someone in the platform channel pings the model team. The model team pings the prompt owners. The prompt owners cc the data lead. Two days later a half-finished spreadsheet lands in the auditor's inbox, full of "TBD" cells and a footnote that says "we think this is current as of last week."

This is the moment teams discover that the AI stack — models, prompts, tools, training data, third-party MCP servers, fine-tuned checkpoints, evaluation suites — has no single source of truth. Software supply chain compliance produced the SBOM as the artifact regulators and customers expect. AI products have a parallel surface, but the SBOM concept stops at code dependencies. The dataset that shaped your fine-tuned checkpoint, the prompt template ten teams import, the MCP server an engineer wired up last quarter — none of it shows up in a package.json.

The fix is the AI Bill of Materials, or AIBOM: a continuously updated, machine-readable inventory of every AI component your product depends on, generated from instrumentation rather than from someone's memory. This is not a documentation exercise. It is becoming a contractual deliverable on the timeline of "next renewal cycle," and a compliance artifact on the timeline of August 2026, when the EU AI Act's core obligations land for high-risk systems.

Why the SBOM Concept Doesn't Cover AI

SBOMs were designed for code: libraries, versions, licenses, vulnerabilities. The model is "what binaries went into this build, and which of them have known CVEs." That works because software behavior is determined by code.

AI system behavior isn't. A frontier model's behavior is determined by its training data, its post-training procedure, its tokenizer, the system prompt it ships with, the temperature you set, the tools you wired in, the retrieval corpus it queries, and the version pin you specified, if you specified one. None of this lives in your dependency manifest. An SBOM that captures only the Python packages your inference service imports is missing the actual sources of behavior.

The gap shows up in concrete ways. A model provider rolls a "minor update" and your refusal patterns change overnight. A fine-tuning dataset includes scraped content from a license bucket your contracts don't cover. An engineer adds a third-party MCP server that quietly gets credentials to your CRM. A prompt template gets edited without a version bump and ten downstream features start producing different outputs. None of these are caught by traditional supply chain tooling, because traditional supply chain tooling doesn't know that prompts, datasets, or model checkpoints exist.

This is why the standards bodies have moved. The OWASP AIBOM Initiative is building open implementations on top of the existing CycloneDX 1.6 and SPDX 3.0 schemas, both of which now have AI/ML-specific component types. The CycloneDX spec — recently published as Ecma International standard ECMA-424 — supports AI/ML-BOM as a first-class document type alongside SBOM and SaaSBOM. SPDX 3.0 added AI profiles to capture model metadata, training data references, and evaluation results. The format wars are mostly settled. What teams are missing is the generation pipeline.
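To make the format concrete, here is a minimal sketch of what an AI/ML-BOM document can look like when built in Python. The top-level keys and the `machine-learning-model` and `data` component types follow CycloneDX 1.6; every name, version, and `bom-ref` value is a hypothetical placeholder, and a real document would carry far more metadata.

```python
import json

def build_mlbom(components):
    """Wrap a list of component dicts in a minimal CycloneDX 1.6 envelope."""
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.6",
        "version": 1,
        "components": components,
    }

# Hypothetical fine-tuned model and the dataset that shaped it.
model = {
    "type": "machine-learning-model",
    "bom-ref": "model:support-classifier",
    "name": "support-classifier",
    "version": "2024-11-01",
}
dataset = {
    "type": "data",
    "bom-ref": "data:support-tickets-v3",
    "name": "support-tickets-v3",
    "version": "3",
}

bom = build_mlbom([model, dataset])
print(json.dumps(bom, indent=2))
```

The point is only that models and datasets become components with stable references, the same way libraries do in an SBOM.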

The Four Surfaces You Have to Track

A useful AIBOM has to cover four surfaces, and most teams underestimate at least two of them.

Models: Every model invocation in production, with provider, model ID, version pin, and the feature that called it. This sounds easy until you realize how often "version pin" is actually "whatever the provider's latest alias resolves to today." A team I talked to recently discovered three different versions of the same Claude model in production simultaneously, because three different services had been deployed at different times and none had pinned versions. Their AIBOM lacked a row for "version drift across services." When they rolled out the registry, two of those features turned out to have measurable behavior differences nobody had been tracking.
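A drift check of this kind can be sketched in a few lines, assuming each invocation record already carries the resolved model version. The `Invocation` fields, service names, and version strings below are hypothetical, not any provider's actual identifiers.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Invocation:
    service: str
    provider: str
    model_family: str
    resolved_version: str  # what the alias actually resolved to
    pinned: bool           # True if the service requested an explicit version

def find_version_drift(invocations):
    """Return model families that resolve to more than one version in production."""
    versions = defaultdict(set)
    for inv in invocations:
        versions[(inv.provider, inv.model_family)].add(inv.resolved_version)
    return {key: sorted(v) for key, v in versions.items() if len(v) > 1}

def find_unpinned(invocations):
    """Return the services that rely on a floating alias."""
    return sorted({inv.service for inv in invocations if not inv.pinned})

records = [
    Invocation("search", "example-provider", "example-model", "2024-03", False),
    Invocation("support", "example-provider", "example-model", "2024-06", False),
    Invocation("billing", "example-provider", "example-model", "2024-10", False),
    Invocation("billing", "example-provider", "small-model", "2024-10", True),
]

drift = find_version_drift(records)
unpinned = find_unpinned(records)
```

Here `drift` reports the one family running three versions at once, and `unpinned` lists the services that would need a pin before the finding can be closed.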

Prompts: Every system prompt, every templated user prompt, every assistant pre-fill. These need IDs, version history, and an explicit owner. The reason they need an owner is that prompts have become critical business logic with no clear org placement — sometimes product owns them, sometimes engineering, sometimes neither. Without an owner, change management becomes "whoever last edited it." A real prompt registry stores these as configs rather than code, with the same git-style diffs and CI gates you'd put on a database migration. MLflow's prompt registry, LaunchDarkly's prompt management, and Vertex AI's prompt registry all converge on this same shape: prompt as versioned, environment-promoted artifact.
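One way to enforce "no edit without a version bump" is to key each prompt by ID and version and store a content hash alongside its owner. The sketch below is a toy in-memory registry under that assumption; `PromptRegistry` and its methods are hypothetical and do not mirror the API of MLflow, LaunchDarkly, or Vertex AI.

```python
import hashlib

class PromptRegistry:
    """Toy registry: (prompt_id, version) -> (owner, sha256 of the text)."""

    def __init__(self):
        self._entries = {}

    def register(self, prompt_id, version, owner, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = (prompt_id, version)
        existing = self._entries.get(key)
        # Same ID and version with different content means a silent edit.
        if existing is not None and existing[1] != digest:
            raise ValueError(
                f"{prompt_id}@{version} already registered with different "
                "content; bump the version"
            )
        self._entries[key] = (owner, digest)
        return digest

    def owner_of(self, prompt_id, version):
        return self._entries[(prompt_id, version)][0]

registry = PromptRegistry()
registry.register("refund-policy", "1", "support-eng", "You are a refunds assistant.")
```

A CI gate could call `register` for every prompt file in a change and fail the build on the `ValueError`.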

Tools: Every function-calling tool, every MCP server, every plugin reachable from an agent. Capability scope (read-only? write? what resources?), authentication path, deprecation status. This is where shadow AI lives. One enterprise inventory exercise turned up 150 agents on the official list and over 500 actually deployed. A separate audit of 22.4 million enterprise prompts identified 665 distinct generative AI tools across enterprise environments — most unauthorized. If your tool registry is "the array of objects in tools.py," you don't have a registry.
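A registry entry that carries scope, auth path, and deprecation status can be reconciled against what agents actually call. This is a minimal sketch; the `ToolEntry` fields and the tool names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolEntry:
    name: str
    scope: str            # "read" or "write"
    resources: tuple      # what the tool can touch, e.g. ("crm.contacts",)
    auth_path: str        # how the tool obtains credentials
    deprecated: bool = False

def reconcile(registry, observed_names):
    """Compare registered tools with tool names seen in production traffic."""
    registered = {t.name: t for t in registry}
    observed = set(observed_names)
    shadow = sorted(observed - set(registered))
    deprecated_in_use = sorted(
        n for n in observed if n in registered and registered[n].deprecated
    )
    return shadow, deprecated_in_use

registry = [
    ToolEntry("crm_lookup", "read", ("crm.contacts",), "service-account"),
    ToolEntry("ticket_close", "write", ("helpdesk.tickets",), "oauth", deprecated=True),
]
observed = ["crm_lookup", "ticket_close", "calendar_mcp", "crm_lookup"]

shadow, deprecated_in_use = reconcile(registry, observed)
```

Anything in `shadow` is a tool reachable in production that nobody registered, which is the gap between the official list and what is actually deployed.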

Datasets and checkpoints: Every training set, every fine-tuning dataset, every retrieval corpus, every evaluation set. Provenance, license, last-refresh timestamp, the checkpoint it produced. Research on the lineage of widely used fine-tuning datasets found license miscategorization rates above 50% and license information omission rates above 70%. If you fine-tune on a dataset whose license you've miscategorized, your model is shipping a problem you can't see — and you can't fix it without an AIBOM that ties checkpoint to dataset to license.
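The checkpoint-to-dataset-to-license chain can be checked mechanically once both ends are recorded. The sketch below assumes a flat list of datasets and checkpoints; the record shapes, IDs, and the allowed-license set are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Dataset:
    dataset_id: str
    license: Optional[str]   # None when no license was recorded
    last_refreshed: str      # ISO date string

@dataclass(frozen=True)
class Checkpoint:
    checkpoint_id: str
    dataset_ids: tuple

def license_findings(checkpoints, datasets, allowed):
    """Walk checkpoint -> dataset -> license and report anything not covered."""
    by_id = {d.dataset_id: d for d in datasets}
    findings = []
    for cp in checkpoints:
        for ds_id in cp.dataset_ids:
            ds = by_id.get(ds_id)
            if ds is None:
                findings.append((cp.checkpoint_id, ds_id, "unregistered dataset"))
            elif ds.license is None:
                findings.append((cp.checkpoint_id, ds_id, "license missing"))
            elif ds.license not in allowed:
                findings.append((cp.checkpoint_id, ds_id, "license not covered"))
    return findings

datasets = [
    Dataset("tickets-v3", "internal", "2025-01-15"),
    Dataset("forum-scrape", None, "2024-09-01"),
    Dataset("qa-pairs", "CC-BY-NC-4.0", "2024-11-20"),
]
checkpoints = [Checkpoint("ft-support-07", ("tickets-v3", "forum-scrape", "qa-pairs", "notes"))]

findings = license_findings(checkpoints, datasets, allowed={"internal", "Apache-2.0"})
```

Each finding names the checkpoint that inherits the problem, which is what makes the issue traceable after the model has shipped.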

The mistake teams make is treating these as four separate spreadsheets. They aren't. A production AI feature is the cross product of all four: this prompt, on this model version, calling these tools, against this retrieval corpus. Change any one and behavior changes. An AIBOM has to record the binding, not just the parts.
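Recording the binding can be as simple as one record per feature that names all four parts, plus a fingerprint that changes whenever any part changes. The field names and values below are hypothetical, and SHA-256 over canonical JSON is just one reasonable choice of fingerprint.

```python
import hashlib
import json
from dataclasses import dataclass, asdict, replace

@dataclass(frozen=True)
class Binding:
    feature: str
    prompt_id: str
    prompt_version: str
    provider: str
    model_version: str
    tools: tuple
    retrieval_corpus: str

    def fingerprint(self):
        """Stable hash of the full binding; tool order does not matter."""
        payload = asdict(self)
        payload["tools"] = sorted(payload["tools"])
        canonical = json.dumps(payload, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

current = Binding(
    feature="refund-assistant",
    prompt_id="refund-policy",
    prompt_version="2",
    provider="example-provider",
    model_version="example-model-2024-10",
    tools=("crm_lookup", "ticket_close"),
    retrieval_corpus="policy-docs-2025-01",
)
after_model_roll = replace(current, model_version="example-model-2025-02")
```

Comparing `current.fingerprint()` with `after_model_roll.fingerprint()` shows a different value even though the prompt, tools, and corpus are untouched, which is the signal that the feature's behavior may have moved.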

Generation by Instrumentation, Not by Wiki

The first attempt at an AIBOM is almost always a wiki page. Someone fills it out manually, gets thanked, and within six weeks the page is wrong. Manual AIBOMs do not scale; this is now well documented enough that it's the first thing the OWASP guidance addresses. The only sustainable approach is to generate the AIBOM from the same instrumentation you'd use for observability.

Concretely: every LLM call from your inference layer emits structured telemetry that includes model ID, model version, provider, prompt ID, prompt version, tool list, and retrieval source. Your AIBOM is a query against that stream. If a row appears in production telemetry that doesn't have a corresponding registered prompt, your AIBOM generator flags it as undocumented. If a model version appears that nobody deployed, your AIBOM flags it as drift. The artifact stops being something a human writes and becomes something the system continuously emits.
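The two checks described above reduce to set membership over the telemetry stream. Here is a minimal sketch, assuming events arrive as dictionaries; the key names, the `audit` function, and the sample values are all hypothetical, and a real pipeline would read from a log store rather than a list.

```python
def audit(events, registered_prompts, deployed_models):
    """Flag telemetry rows that have no matching registry or deployment record."""
    undocumented = set()
    drift = set()
    for e in events:
        prompt_key = (e["prompt_id"], e["prompt_version"])
        model_key = (e["provider"], e["model_id"], e["model_version"])
        if prompt_key not in registered_prompts:
            undocumented.add(prompt_key)
        if model_key not in deployed_models:
            drift.add(model_key)
    return {
        "undocumented_prompts": sorted(undocumented),
        "model_drift": sorted(drift),
    }

registered_prompts = {("refund-policy", "2")}
deployed_models = {("example-provider", "example-model", "2024-10")}

events = [
    {"provider": "example-provider", "model_id": "example-model",
     "model_version": "2024-10", "prompt_id": "refund-policy",
     "prompt_version": "2", "tools": ["crm_lookup"], "retrieval": "policy-docs"},
    {"provider": "example-provider", "model_id": "example-model",
     "model_version": "2025-02", "prompt_id": "upsell-draft",
     "prompt_version": "1", "tools": [], "retrieval": None},
]

report = audit(events, registered_prompts, deployed_models)
```

In this run `report` contains one undocumented prompt and one model version nobody deployed, both taken from the second event.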
