The Model Registry Your Platform Team Built That Nobody Updated
A platform team I know spent two quarters building a model registry. It had everything the org chart asked for: a promotion workflow from dev to staging to prod, a CODEOWNERS-style approval matrix, lineage tracking, eval-score gates, a deprecation policy with a 30-day window, and a Backstage tile that showed which version of every model was live in which service. They cut a launch announcement, ran a brown bag, and added a row to the compliance binder.
Six months later, the highest-traffic agent in the company was running on a model card whose "owner" field still pointed at someone who had left, whose eval score was from a benchmark the team had since deprecated, and whose "approved by" name was the platform tech lead — who had never used that agent, never read its eval set, and had pressed approve at 11:43pm on a Thursday because the producer had pinged him in DMs saying the launch was tomorrow.
The registry was not broken. The promotion gates fired. The audit log was intact. Everything the launch announcement had promised was true. And the org had less real oversight of its production models than it had had eighteen months earlier, when the same decisions were made by an ML engineer reading the eval output by hand before pasting the model URI into a config file.
This post is about why that happened. It is not a story about bad tools or lazy reviewers. It is a story about what happens when you build a governance interface on top of an incentive structure that has not been updated to match — and why model registries, in particular, have become the canonical example of the failure mode.
The Symmetry That Was Never There
Every model registry I have seen ships with a diagram. The diagram shows two roles. On the left, the producer: a model owner who trains, evaluates, and submits a candidate for promotion. On the right, the reviewer: a steward who reads the model card, inspects the eval delta, and approves or rejects.
The diagram implies a kind of symmetry. Producer does work; reviewer checks work; the registry mediates. The promotion is gated, so quality is enforced. That is the story the registry tells about itself. It is also the story the registry has to tell, because if the symmetry were not real, the entire premise would collapse.
The symmetry is not real, and the reason it is not real is structural rather than personal. Producers have ship dates. Their performance review depends on whether the model went live this quarter. Their tickets are visible. Their managers can count their promotions.
Reviewers have none of these. The model review is not on their roadmap. It is a cost center inserted into their week by someone else's deadline. Their performance review will not be affected by approving a model; it will only be affected by blocking one badly. Their downside risk is real, their upside is approximately zero, and their queue depth is invisible to anyone who could change the staffing.
So one side has a clock. The other side has, at best, a vague obligation. There is no SLA on reviewer turnaround, because nobody could agree on what an SLA for "did you actually read the eval set" would look like. There is, however, a very clear cultural SLA of "do not be the person who blocks a launch by being slow." The result is exactly what you would predict: approvals happen on the producer's schedule, not the reviewer's, and the depth of review compresses to fit the time the producer was willing to wait.
How "Approval" Decays Into Rubber Stamp
This is not the failure mode of any individual reviewer. It is the failure mode of an oversight role that has been asked to scale without being given the resources to scale.
The literature on rubber-stamping in human-in-the-loop systems is by now depressingly clear: when reviewers are overwhelmed by volume, lack context for what they are seeing, or are under cultural pressure to approve, oversight becomes mechanical. They click through. They develop heuristics that pattern-match on irrelevant features — does the producer's name look familiar, does the eval delta look "directionally positive," is the description well-written — and they apply those heuristics in milliseconds, because their queue has thirty more items in it.
In a model registry, the pattern looks like this:
- The "evidence" the reviewer needs to read is a dashboard with twelve charts and a markdown blob written by the producer, who has every incentive to present the model favorably.
- The reviewer has no independent way to run the eval set themselves; they trust the score the producer's CI reported.
- The model card template is filled in once and almost never updated, so by review three the reviewer is skimming a document they have already read.
- The reviewer's name is recorded in the audit log against the approval, but the audit log is never used in performance review — it is only used after an incident, by which point the reviewer is being asked to defend an action they took six months ago.
A 2025 NACD survey found that only 36 percent of corporate boards had implemented a formal AI governance framework, and only 6 percent had AI-related management reporting metrics. That gap shows up at the engineering layer too. The org has a registry, but it does not have a working measurement of whether the registry's gates are doing anything. The dashboard says "100 percent of production models are registry-approved." It does not say what that approval is worth.
The Tooling Did Exactly What It Was Asked To
It is tempting to look at this and say the platform team built the wrong thing. They did not. They built the thing the executive sponsor asked for. The promotion workflow exists, the lineage is captured, the audit log is queryable. If a regulator showed up and asked "do you have a model registry with documented approvals," the answer is yes, and the registry would survive a paper audit cleanly.
The platform team's mistake, if it is one, is the same mistake every internal platform team makes when adoption is mandated rather than earned. They optimized for the artifacts the governance review would inspect, which are the artifacts that prove approvals happened. They did not optimize for the artifacts that would have made the approvals meaningful, which would have required taking a position on what reviewers were actually responsible for, and giving them tools to discharge that responsibility at the scale they were being asked to operate at.
Reviewer scale is the part that nobody costed. A platform team building a registry will tell you what the producer's experience looks like in great detail — they will demo the CLI, the eval CI integration, the Backstage tile. Ask them what the reviewer's daily workflow looks like, what tools the reviewer has to compare two model versions side by side, what happens when a reviewer has fifteen items in their queue and a one-hour budget that week, and you will usually get a shrug. The answer is: the reviewer uses the registry's web UI, which is the same UI as everyone else, and figures it out.
So the producers got a paved road. The reviewers got a ticket queue. The architecture predicts the outcome.
What Incentive-Design People Already Know
The most useful framing I have seen on this comes from platform engineering's own literature on golden paths. The community's strongest claim is that mandated adoption does not work — that 80 percent voluntary adoption is the target, and that the moment a path is mandated, you create resentment and shadow IT, and you lose the feedback loop that would otherwise improve the path.
Model registries are usually mandated. Producers have to use the registry because regulation, or the legal team, or the CTO's memo says so. Reviewers have to staff the queue because the same memo says someone has to. Neither side chose the workflow. Neither side gets to vote on whether it is working. Adoption metrics, when the platform team reports them, count compliance — what percentage of production models passed through the workflow — and not the thing anyone actually cares about, which is what percentage of bad models were caught.
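To make the gap between those two numbers concrete, here is a minimal sketch in Python. The `Decision` record and its `was_bad` field are assumptions for illustration: no registry hands you a "this model was bad" label, so someone has to assign it in hindsight from incident reports or re-evaluation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """Hypothetical review record; `was_bad` is a hindsight label, e.g. from incidents."""
    approved: bool
    was_bad: bool


def compliance_rate(production_approved_flags):
    """Share of production models that carry a registry approval."""
    flags = list(production_approved_flags)
    return sum(flags) / len(flags)


def catch_rate(decisions):
    """Share of bad submissions that review rejected; None if no bad ones are known."""
    bad = [d for d in decisions if d.was_bad]
    if not bad:
        return None
    return sum(1 for d in bad if not d.approved) / len(bad)


# Twenty submissions, all approved, three of which later proved bad.
decisions = [Decision(approved=True, was_bad=(i < 3)) for i in range(20)]
print(compliance_rate(d.approved for d in decisions))  # 1.0
print(catch_rate(decisions))                           # 0.0
```

In this made-up sample the compliance dashboard reads 100 percent while the catch rate is zero, which is the situation the adoption metric cannot distinguish from a healthy one.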
If you wanted to design a registry whose approvals were worth something, the platform engineering playbook is unsubtle about what you would have to do:
- Make reviewer effort visible. Track time-in-queue, time-to-decision, and number of items reviewed per reviewer per week. Publish these. Put them on a dashboard the reviewer's manager actually looks at. Otherwise, reviewer time is unaccounted for, and unaccounted-for time gets crowded out.
- Give reviewers a meaningfully different artifact than producers. A model card written by the producer is not evidence; it is advocacy. The registry should run an independent eval suite on submission, against a held-out set the producer cannot see, and present the reviewer with that delta plus the lineage of which datasets the producer trained on. The reviewer's job is then to read independent evidence, not graded homework.
- Match SLA to incentive. If approvals must happen within 24 hours to keep producers shipping, reviewers must be staffed for 24-hour turnaround. If they are not staffed for it, the 24-hour SLA will be hit by approving without reading, every single time. Either the SLA is wrong or the staffing is wrong; one of them must give.
- Track the post-hoc accuracy of approvals. Every model that goes to production and has an incident should produce a fact: "this was approved on this date by this reviewer, who saw this evidence." Aggregate. Publish. Use it as the reviewer's actual scorecard. Without this, reviewers are graded on speed only, and you have just told them to optimize for speed.
None of these are tools. They are policies. The registry can support them — most modern registries already capture the data — but the policies are what convert "approval" from a signature into a judgment.
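As a rough illustration of how little code the first and last of those policies need once the data is exported, here is a Python sketch. The `Approval` record and its field names are hypothetical, not any particular registry's schema. The assumption is that each decision can be exported with a reviewer, a submission time, a decision time, and a flag, filled in later, for whether the model had a production incident. Pass in one week of records to get per-week counts.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Approval:
    """Hypothetical export of one registry decision; field names are assumptions."""
    reviewer: str
    submitted_at: datetime
    decided_at: datetime
    approved: bool
    had_incident: bool = False  # set afterwards from incident reports


def reviewer_scorecard(approvals):
    """Summarize per-reviewer effort and post-hoc accuracy."""
    by_reviewer = defaultdict(list)
    for a in approvals:
        by_reviewer[a.reviewer].append(a)

    scorecard = {}
    for reviewer, items in by_reviewer.items():
        latencies = [a.decided_at - a.submitted_at for a in items]
        approved = [a for a in items if a.approved]
        incidents = sum(1 for a in approved if a.had_incident)
        scorecard[reviewer] = {
            "items_reviewed": len(items),
            "median_time_to_decision": median(latencies),
            "rejections": len(items) - len(approved),
            # Share of this reviewer's approvals that later had an incident.
            "incident_rate": incidents / len(approved) if approved else None,
        }
    return scorecard
```

The arithmetic is trivial. What the sketch cannot supply is the decision to put `incident_rate` next to `median_time_to_decision` on the dashboard a manager reads, which is the policy part.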
What Producers See, and Why That Matters
There is a producer-side asymmetry too, and it is the one that actually causes models to ship without scrutiny.
Producers do not see the reviewer's queue. They see a button that says "request promotion." They see a status that says "pending." They see a time-since-submitted clock that, if they are in a hurry, they can shorten by walking over to the reviewer's desk or DMing them on Slack. Every shortcut is rational from the producer's local incentive structure. The launch is tomorrow. The reviewer is online. The model is fine. The DM works.
The platform team often does not realize how much of the org's actual approval bandwidth has moved into DMs, because DMs do not show up on the registry dashboard. The registry sees the approval timestamp. It does not see the social pressure that produced it. The audit log captures the act and loses the context, which means the deepest signal about how well the system is actually working — the signal contained in "Mike approved this two minutes after Aisha sent him a Slack message saying the launch is tomorrow" — is the exact signal nobody is collecting.
This is not the producer's fault. The producer is doing what producers do everywhere: removing friction from their critical path. It is a leadership problem. Someone has to decide that the registry's authority is real, which means deciding that some launches will slip when the reviewer is not available, which means giving up something the org wants in exchange for an oversight property the org claims to want more.
That trade is unpopular every single time it is offered. Which is why, six months after the registry launches, the trade is no longer being offered.
What to Do About It
If you have a registry and you are not sure whether your approvals are real, three diagnostic questions will tell you within an afternoon.
First, ask your reviewers to count, from memory, how many models they have rejected in the last quarter. If the answer is zero, you do not have a review; you have a paperwork pipeline.
Second, look at your last ten production model approvals and check the timestamp delta between submission and decision. If the median is under an hour, your reviewers are not reading anything; there is not enough wall-clock time for them to. If the median is over a week, your reviewers are buried, and the registry is the bottleneck producers are routing around in private.
Third, find an incident report from the last year where a model misbehaved in production. Trace it back through the registry. Ask the reviewer of record what they saw at approval time and what they would have needed to catch the issue. The answer will almost always be "I would have needed to see X, and the registry didn't show me X, and even if it had I wouldn't have had time to act on it." That answer is the spec for what the next version of your registry has to do.
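Of the three, the second diagnostic is mechanical enough to script. The Python sketch below assumes you can pull submission-to-decision delays out of the audit log as `timedelta` values. The function name, the labels, and the sample delays are illustrative, and the one-hour and one-week thresholds are simply the ones named above.

```python
from datetime import timedelta
from statistics import median


def diagnose_latency(latencies, fast=timedelta(hours=1), slow=timedelta(weeks=1)):
    """Classify the median submission-to-decision delay against two thresholds."""
    if not latencies:
        raise ValueError("need at least one approval to diagnose")
    mid = median(latencies)
    if mid < fast:
        return mid, "too fast: reviewers are probably not reading"
    if mid > slow:
        return mid, "too slow: reviewers are buried, expect side channels"
    return mid, "plausible"


# Made-up delays for the last ten approvals, in minutes.
recent = [timedelta(minutes=m) for m in (2, 5, 12, 20, 30, 41, 45, 50, 90, 240)]
mid, verdict = diagnose_latency(recent)
print(mid, "->", verdict)
```

A "plausible" verdict only means the timing does not rule out real review. It says nothing about whether review actually happened, which is what the first and third questions are for.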
The tooling is the easy part. Nobody who has shipped a model registry ran out of road on the tooling. They ran out of road on the question of what reviewers are responsible for, what they are being measured on, and what they are being given in exchange for that responsibility. That question is a leadership question, not an MLOps question, and treating it as the latter is how the registry your team built becomes the registry that nobody updates.
- https://introl.com/blog/model-registry-governance-mlops-production-ai-2025
- https://atlan.com/know/model-registry-implementation-guide/
- https://www.superblocks.com/blog/ai-model-governance
- https://mlflow.org/docs/latest/ml/model-registry/
- https://sloanreview.mit.edu/article/ai-explainability-how-to-avoid-rubber-stamping-recommendations/
- https://cybermaniacs.com/cm-blog/rubber-stamp-risk-why-human-oversight-can-become-false-confidence
- https://www.aviator.co/blog/why-some-companies-fail-to-adopt-internal-developer-portal/
- https://medium.com/devops-ai-decoded/the-platform-engineering-adoption-crisis-nobody-talks-about-9377d1ef58c6
- https://platformengineering.org/blog/what-are-golden-paths-a-guide-to-streamlining-developer-workflows
- https://www.metacto.com/blogs/code-review-bottleneck-ai-development
