The Compliance Audit That Asked Which Model Produced Which Output
The auditor's question sounds simple. She has your appeals log open, points at a row from eight months ago, and asks which model decided that case. Your engineer pulls up the schema: there is a model column, and every decision in the audit window says v1. Then someone from the platform team mentions, almost in passing, that the alias behind v1 rotated four times during the audit period: a base model upgrade, a fine-tune refresh, a vendor-side capacity move, and one rollback that lasted six hours during an incident. The honest answer is that you cannot say which checkpoint produced that decision. The auditor writes something down. "We cannot say" is not a regulator-acceptable answer, and you have just learned that the system you shipped has been failing an audit requirement it was never designed to meet.
The gap here is not a missing log line. The gap is between two different ideas of what "model" means. To the engineers shipping the system, v1 is an endpoint — a stable contract callers can point at while the thing behind it gets upgraded for free. To the auditor, "the model that produced this decision" is a specific artifact: a weight checkpoint, a hash, a thing you could in principle re-run on the same input and get a defensibly similar output. Endpoint aliases were invented to hide checkpoint rotation from callers. Audit-grade provenance demands the opposite — that every decision be attributable to exactly the checkpoint that produced it. The two ideas were on a collision course from the start; the audit just happened to be where they met.
The Endpoint Is Not the Model
The convenience of an alias is real. You ship code that calls model: "v1" or model: "claude-sonnet-latest" or your internal risk-scoring-prod, and you do not have to deploy every time the model behind it changes. Provider-side, the same convenience is even more valuable: vendors rotate model versions, retire old snapshots, and redirect capacity without forcing every customer to cut a release. OpenAI's aliased endpoints behave this way, and Anthropic has been asked for similar -latest aliases for the same reason. The pattern is industry-standard; it would be unusual to find a production AI system that does not use it somewhere.
The problem is that "model" is a polysemous word, and the alias quietly chooses the wrong meaning for compliance purposes. When the data team builds a dashboard and stores model = "v1" next to each decision, they have stored the endpoint name, not the artifact. The endpoint name is approximately useless as an audit primitive, because the function the endpoint computed is not constant across the audit window. You did not run "v1" on the customer's case in February and the same "v1" on a near-identical case in May — you ran two different checkpoints reachable through the same string. Storing the endpoint name in the audit log is roughly equivalent to storing "the production database" instead of the specific row.
This is the silent-versioning problem in its most expensive form. The aliased endpoint that was supposed to free engineering from a coordination tax turned out to be a hidden coordination tax of a different shape — a tax the compliance team has to pay, in arrears, with the auditor watching.
What Audit-Grade Provenance Actually Requires
The regulatory frame is converging fast on a specific shape. The EU AI Act's Article 10 requires version-control records and provenance information that enable traceability between datasets and model versions; Article 12 requires automatic logging of events that allow full traceability of inputs, outputs, and decision points. The Federal Reserve's revised model risk management guidance — what used to live under SR 11-7 and was updated in April 2026 — keeps the same essential demand: a model is an artifact, not an alias, and the institution must be able to point at the specific artifact that produced any given decision. Adverse-action regimes under ECOA and FCRA make this concrete for consumer credit, where the principal-reasons obligation cannot be answered honestly if you do not know which model generated the score.
Translated out of regulatory language, the demand is: for every decision your AI system makes that has a downstream effect on a person — a credit denial, a claims rejection, a content takedown, a benefits determination — you should be able to produce, on demand, the exact checkpoint identifier and a record that lets a third party reason about that checkpoint's behavior. "We use model v1" does not clear that bar. "This decision was produced by checkpoint sha256:7b3f..., which is registered in our model registry with this card and this evaluation profile" does.
A useful test: imagine the auditor asks you to re-run the case through the same model that decided it. Can you? If the honest answer involves "we'd have to ask the vendor whether they still have that checkpoint hosted," your provenance is below audit grade. If the honest answer is "we cannot, because the production endpoint has since rotated," your provenance is below audit grade. The bar is reproducibility-with-respect-to-the-checkpoint, not reproducibility-with-respect-to-the-endpoint-name.
How the Drift Happens Without Anyone Noticing
The frustrating part is that no single change feels like a compliance violation. A platform engineer upgrades the underlying model behind v1 because the provider deprecated the older snapshot — they have to. A vendor rotates capacity across model versions to balance load — they always have. An MLOps team swaps in a freshly fine-tuned variant behind the same internal endpoint because they want to ship without coordinating with every caller — that is what the endpoint abstraction is for. A six-hour rollback during an incident is reverted before most of the company even sees the page.
Each of these is reasonable on its own. The stack of them is what produces the answer "v1 means four different checkpoints in the same audit window." And because each rotation is invisible at the call site — the request still says model: "v1" and gets back a response shaped like the previous responses — there is nothing in the application code, the request log, or the decision log that would flag the drift. The audit log records the endpoint. The endpoint records nothing.
A secondary failure mode is worth naming: even when teams know they should pin, they often pin in the wrong place. Pinning happens in a config file or environment variable that controls the next request, not in the audit log that records the last decision. If the config rotates between request time and audit time, the audit log inherits whatever the config says today, not what it said when the decision was made. Provenance has to be captured at decision time and stored with the decision, not derived later from a config that has moved on.
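A minimal sketch of that failure mode, under stated assumptions: the `alias_config` dictionary is a hypothetical stand-in for whatever config file or environment variable resolves the alias, and the checkpoint names `ckpt-a` and `ckpt-b` are placeholders. It contrasts looking the alias up at audit time with reading the identifier that was stored on the decision.

```python
from dataclasses import dataclass

# Hypothetical alias table standing in for a config file or environment variable.
alias_config = {"v1": "ckpt-a"}


@dataclass(frozen=True)
class Decision:
    case_id: str
    outcome: str
    requested_model: str    # the alias the caller asked for
    served_checkpoint: str  # captured when the decision was made


def decide(case_id: str, outcome: str, alias: str) -> Decision:
    # Snapshot the resolved checkpoint at decision time and keep it on the record.
    return Decision(case_id, outcome, alias, alias_config[alias])


def derive_later(decision: Decision) -> str:
    # Anti-pattern: resolve the alias against today's config at audit time.
    return alias_config[decision.requested_model]


february = decide("case-001", "denied", "v1")
alias_config["v1"] = "ckpt-b"  # the alias rotates after the decision was made

print(derive_later(february))      # whatever the config says now
print(february.served_checkpoint)  # what was in effect when the case was decided
```

After the rotation the two lookups disagree, and only the stored field still describes the decision.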
Pinning the Checkpoint at Decision Time
The architectural fix is unglamorous and worth doing anyway. At the moment a decision is produced, the system captures the checkpoint identifier the provider actually served, and persists that identifier next to the decision, in the same write, in the same transaction. Not the endpoint name. Not the alias. The specific checkpoint.
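One way this could look, sketched with Python's built-in `sqlite3` module. The table name, column names, and the `record_decision` helper are illustrative assumptions, not a prescribed schema. The idea is that the checkpoint identifier lives in the same row as the decision and is declared `NOT NULL`, so a decision cannot be persisted without it.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE decisions (
        case_id           TEXT PRIMARY KEY,
        outcome           TEXT NOT NULL,
        requested_model   TEXT NOT NULL,  -- the alias the caller asked for
        served_checkpoint TEXT NOT NULL,  -- the artifact that actually answered
        decided_at        TEXT NOT NULL
    )
    """
)


def record_decision(
    conn: sqlite3.Connection,
    case_id: str,
    outcome: str,
    requested_model: str,
    served_checkpoint: Optional[str],
) -> None:
    # "with conn" commits on success and rolls back on any exception, so the
    # decision and its checkpoint identifier are written together or not at all.
    with conn:
        conn.execute(
            "INSERT INTO decisions VALUES (?, ?, ?, ?, ?)",
            (
                case_id,
                outcome,
                requested_model,
                served_checkpoint,
                datetime.now(timezone.utc).isoformat(),
            ),
        )


# "sha256:aaaa" is a placeholder identifier, not a real hash.
record_decision(conn, "case-001", "denied", "v1", "sha256:aaaa")
```

Passing `None` as the checkpoint raises `sqlite3.IntegrityError` and leaves no row behind. A missing identifier then fails loudly at decision time instead of surfacing during the audit.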
For self-hosted models this is straightforward: you control the inference server, you know which checkpoint is loaded, and you can attach a hash of the weights to every response. For hosted APIs it requires more discipline. Most providers return some form of version identifier on the response — OpenAI returns a system fingerprint and a specific model name, Anthropic returns the resolved model in the response, and the major inference gateways expose similar fields. The discipline is to read that field, not the field the caller asked for, and to log the one the provider returned.
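A hedged sketch of that discipline follows. It works on a plain dictionary rather than any particular SDK object; the keys `model` and `system_fingerprint` mirror the fields named above, but the exact payload shape, the `extract_provenance` helper, and the sample values are assumptions for illustration.

```python
from typing import Any, Mapping, Optional, TypedDict


class Provenance(TypedDict):
    requested_model: str         # what the caller asked for
    served_model: str            # what the provider says it served
    fingerprint: Optional[str]   # extra identifier, when the provider exposes one
    alias_resolved: bool         # False if the provider only echoed the alias back


def extract_provenance(requested_model: str, response: Mapping[str, Any]) -> Provenance:
    served = response.get("model")
    if not served:
        # Refuse to fall back to the alias: that would log the request, not the result.
        raise ValueError("response carries no model identifier")
    return {
        "requested_model": requested_model,
        "served_model": served,
        "fingerprint": response.get("system_fingerprint"),
        "alias_resolved": served != requested_model,
    }


# Illustrative payload; the identifier values are placeholders.
response = {"model": "v1-snapshot-a", "system_fingerprint": "fp_example", "output": "denied"}
provenance = extract_provenance("v1", response)
print(provenance)
```

The `alias_resolved` flag is a small design choice: when a provider echoes the alias unchanged, the response field adds no information, and it is better to detect that than to record it silently.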
A few practices follow from this directly:
- Treat the request model field and the response model field as different columns. The request field records what you asked for. The response field records what you got. The audit query reads the response field. If your schema has only one column, your audit answers the wrong question.
- Hash what you can, refer to what you cannot. For self-hosted weights, store the weight hash. For hosted APIs, store whatever stable identifier the provider exposes plus a snapshot of the model card or the provider's published behavior notes from that day. The goal is that a future auditor can, in principle, reason about the artifact behind the identifier, not that you have the weights yourself.
- Pin aliases at the gateway, not in application code. If your platform offers v1 as an internal alias, the resolution from v1 to a concrete checkpoint should happen in a controlled gateway that logs the resolution, not opportunistically in calling services. One resolution point gives you one place to audit.
- Make rotation an event with a record. When the checkpoint behind an alias changes, that change should produce a durable record: who rotated it, when, from what to what, with what evaluation evidence. The audit story for any decision then has two layers: the per-decision checkpoint identifier, and the rotation history that lets you explain why that checkpoint was in use that day.
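The last two practices can be sketched together as a single in-memory resolution point that records every rotation. The `AliasGateway` and `RotationEvent` names, the field set, and the sample actors and evidence labels are all hypothetical; a real gateway would persist these events durably rather than hold them in a list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RotationEvent:
    alias: str
    from_checkpoint: Optional[str]  # None for the first assignment
    to_checkpoint: str
    actor: str                      # who rotated it
    evidence: str                   # pointer to the evaluation evidence
    at: datetime


class AliasGateway:
    """Single place where an alias is resolved to a concrete checkpoint."""

    def __init__(self) -> None:
        self._current: Dict[str, str] = {}
        self.rotations: List[RotationEvent] = []

    def rotate(self, alias: str, to_checkpoint: str, actor: str,
               evidence: str, at: Optional[datetime] = None) -> RotationEvent:
        event = RotationEvent(alias, self._current.get(alias), to_checkpoint,
                              actor, evidence, at or datetime.now(timezone.utc))
        self.rotations.append(event)
        self._current[alias] = to_checkpoint
        return event

    def resolve(self, alias: str) -> str:
        # Callers store this return value with the decision, not the alias.
        return self._current[alias]

    def checkpoint_at(self, alias: str, when: datetime) -> Optional[str]:
        # Replay the rotation history to find what the alias pointed at then.
        active = None
        for event in sorted(self.rotations, key=lambda e: e.at):
            if event.alias == alias and event.at <= when:
                active = event.to_checkpoint
        return active


gateway = AliasGateway()
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary illustrative date
gateway.rotate("v1", "sha256:aaaa", actor="platform", evidence="eval-1", at=t0)
gateway.rotate("v1", "sha256:bbbb", actor="mlops", evidence="eval-2",
               at=t0 + timedelta(days=30))

print(gateway.resolve("v1"))
print(gateway.checkpoint_at("v1", t0 + timedelta(days=10)))
```

The rotation history is the second layer of the audit story: it can explain why a given checkpoint was live on a given day, while the per-decision identifier remains the primary record.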
The decision-log write becomes slightly heavier and the schema gains a few columns. That is the entire cost. The cost of not doing it is the conversation the auditor opened with.
The Organizational Move Behind the Architectural One
The deeper change is treating the model registry as a system of record for compliance, not a tool for the ML team. A registry that records every checkpoint that has ever been promoted to production — with its hash, its evaluation profile, its training data lineage, its date range of production use, and the rotation events that moved traffic onto and off of it — is the artifact that lets the company answer the auditor's question without flinching. The registry is the index; the per-decision log is the pointer; together they form a queryable history of "which artifact decided what, and on what evidence was that artifact trusted in production at that moment."
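As a rough illustration of the index-and-pointer relationship, the sketch below joins a per-decision log against a registry table using `sqlite3`. The schema, the `explain_decision` helper, and every value in it are placeholders; a production registry would carry far more than two descriptive columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Enforce the pointer: a decision may only reference a registered checkpoint.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE checkpoints (
        checkpoint_id TEXT PRIMARY KEY,
        eval_profile  TEXT NOT NULL,
        lineage_ref   TEXT NOT NULL
    );
    CREATE TABLE decisions (
        case_id           TEXT PRIMARY KEY,
        outcome           TEXT NOT NULL,
        served_checkpoint TEXT NOT NULL REFERENCES checkpoints(checkpoint_id)
    );
    """
)

with conn:
    conn.execute("INSERT INTO checkpoints VALUES (?, ?, ?)",
                 ("sha256:aaaa", "eval-1", "lineage-1"))
    conn.execute("INSERT INTO decisions VALUES (?, ?, ?)",
                 ("case-001", "denied", "sha256:aaaa"))


def explain_decision(conn: sqlite3.Connection, case_id: str):
    # Follow the pointer from the decision into the registry entry.
    return conn.execute(
        """
        SELECT d.case_id, d.outcome, c.checkpoint_id, c.eval_profile, c.lineage_ref
        FROM decisions AS d
        JOIN checkpoints AS c ON c.checkpoint_id = d.served_checkpoint
        WHERE d.case_id = ?
        """,
        (case_id,),
    ).fetchone()


print(explain_decision(conn, "case-001"))
```

With foreign keys enabled, inserting a decision that names an unregistered checkpoint is rejected, which keeps the log from pointing at artifacts the registry cannot describe.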
This is also where the work belongs organizationally. Decision logging is an application concern, but checkpoint identity is a platform concern. The platform team owns the gateway that resolves aliases. The platform team owns the registry that records checkpoints. The application team owns the discipline of writing the resolved identifier into the decision log. The compliance team owns the requirement that those identifiers be queryable. None of those parties can do the job alone, and the audit is what reveals which seam was left open.
The seam most teams leave open is the simplest one: nobody decided that the model field in the decision log meant the checkpoint rather than the endpoint. Without that decision, the field defaulted to whatever was easiest, which was the endpoint, because the endpoint is what the calling code knew. The audit-grade fix is small. The audit-grade discipline is choosing to capture artifact identity at every layer where "model" is recorded, and refusing to let an alias stand in for an artifact when a regulator is going to ask. The auditor will eventually ask. You want the answer to already be in the schema.
- https://www.federalreserve.gov/supervisionreg/srletters/SR2602.pdf
- https://goteleport.com/blog/eu-ai-act-requirements/
- https://www.digitalocean.com/community/tutorials/model-silent-versioning-problem
- https://blogs.cisco.com/ai/model-provenance-constitution
- https://www.relyance.ai/ai-governance/eu-ai-act-compliance
- https://jumpcloud.com/it-index/what-is-an-immutable-decision-log
- https://atlan.com/know/training-data-lineage-for-llms/
- https://github.com/anthropics/anthropic-sdk-python/issues/1447
