Your Eval Rubric Is the Real Product Spec — and No PM Signed Off on It
A product manager writes a paragraph: "The assistant should be helpful, accurate, and concise, and should never make the customer feel rushed." An engineer reads that paragraph, opens a YAML file, and writes 47 weighted criteria so the LLM-as-judge can produce a number on every trace. Six months later, that YAML file is the actual specification of the product. Every release is gated on it. Every regression alert fires on it. Every "this is shipping quality" decision routes through it. The PM has never read it.
This is the most common form of unintentional product ownership transfer in AI engineering today. The rubric is not a measurement of the spec — it is the spec, in the same way that a compiler is not a description of your language but the operational truth of it. And like compilers, rubrics have implementation details that silently determine semantics. Which failure mode gets a 0 versus a 0.5? Which criterion is weighted 0.3 versus 0.05? Which behavior is absent from the rubric and therefore goes uncounted entirely? Each of these is a product decision. None of them lived in the original brief.
The diagnostic question that exposes this is uncomfortable but cheap to ask: pull the current rubric and the original product brief into a single document and read them side by side. The teams that try this almost always find the rubric has drifted from the brief — usually toward what was easy to score, not what users care about. Two quarters of shipping velocity have been spent optimizing a slightly different product.
How the Translation Happens (and Why It's Invisible)
Rubric drift from the product brief isn't malicious or even careless. It's the predictable result of a translation problem nobody owns. The PM writes in directional language because directional language is what product briefs are made of: "feels helpful," "doesn't condescend," "respects the user's time." None of these compile. The engineer needs deterministic-ish criteria that an LLM judge can score consistently across runs, so they translate.
Translation requires choices the brief doesn't make. "Feels helpful" becomes "answers the literal question asked AND offers one relevant follow-up unless the user's intent was a quick lookup." Every word in that operationalization is a product call. Does a quick lookup get penalized for an unsolicited follow-up, or is the follow-up a positive? Is "answering the literal question" weighted equal to "useful follow-up," or higher? What if the model nails the answer but the follow-up is mildly off — is that a 1.0, a 0.7, or a 0.5?
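To make that concrete, here is one way the operationalization might look once it lands in the rubric file — a hypothetical sketch, not taken from any real rubric; the criterion name, weight, and scale anchors are all assumptions, and every one of them is a product call the brief never made.

```python
# Hypothetical operationalization of "feels helpful" as one judge-scored criterion.
# Every field encodes a product decision the brief never made: the weight, the
# quick-lookup exemption, and where partial credit starts.
FEELS_HELPFUL = {
    "id": "helpful_followup",
    "source_brief_claim": "The assistant should be helpful",
    "weight": 0.10,  # how much "helpful" counts against task completion: a product call
    "judge_instruction": (
        "Score 1.0 if the response answers the literal question asked AND offers "
        "one relevant follow-up. Score 0.7 if the answer is correct but the "
        "follow-up is only mildly relevant. Score 0.5 if no follow-up is offered. "
        "Exception: if the user's intent was a quick lookup, ignore the follow-up "
        "requirement entirely."
    ),
    "scale": [0.0, 0.5, 0.7, 1.0],
}
```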
The engineer makes the call because the engineer is the one whose code won't run otherwise. The call is reasonable. The call is undocumented. The call ships.
Multiply this by 47 criteria, then by every iteration of the rubric over six months, and the rubric has accumulated hundreds of product decisions. None of them appear in the product roadmap, the design doc, or the launch brief. The PM has been reviewing model outputs on a weekly basis and nodding because the outputs are good against the rubric — not because they're good against the original spec.
The Decisions Encoded in a Rubric Nobody Audits
Three categories of silent product decision tend to dominate. They're worth naming, because once named, they're easier to defend against.
Score-boundary decisions. When a rubric uses a 0/0.5/1 scale (or 1–5, or any ordinal), the boundaries between scores are product calls. Is "the assistant correctly identified the issue but solved a closely related, slightly wrong problem" a 0.5 or a 0? That answer determines whether your evals favor models that fail-loud (clear wrongness) over models that fail-soft (plausible wrongness). Most users prefer fail-loud — soft failures destroy trust faster — but most rubrics, when written by engineers optimizing for inter-rater agreement, score soft failures as partial credit because they're easier to grade.
Weighting decisions. Almost every analytic rubric assigns weights, and most of those weights are picked once and never revisited. A criterion weighted at 0.05 is approximately invisible — a model can fail it on every trace and the aggregate score barely moves. If "doesn't make the customer feel rushed" got translated into a single 0.05-weighted criterion while "task completion" got six criteria summing to 0.6, the rubric has unilaterally decided that customer experience is roughly one-twelfth as important as task completion. The PM almost certainly does not believe that.
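A back-of-the-envelope sketch makes the arithmetic concrete. The criteria names, weights, and scores below are hypothetical; the shape of the result is the point.

```python
# How much does the aggregate move if a 0.05-weighted criterion fails on every
# single trace? Weights and per-criterion scores are hypothetical.
weights = {"task_completion": 0.60, "accuracy": 0.20, "tone": 0.15, "not_rushed": 0.05}

def aggregate(scores: dict[str, float]) -> float:
    """Weighted sum of per-criterion scores, each in [0, 1]."""
    return sum(weights[c] * scores[c] for c in weights)

everything_passes = {c: 1.0 for c in weights}
rushed_every_time = {**everything_passes, "not_rushed": 0.0}

print(aggregate(everything_passes))   # 1.00
print(aggregate(rushed_every_time))   # 0.95 -- a dip a release gate may never notice
```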
Absent-criterion decisions. The criteria that aren't in the rubric are the most dangerous, because they're unobservable from inside the eval. If your rubric scores correctness, completeness, tone, and safety — but never scores whether the assistant gracefully handles a follow-up question that contradicts its prior answer — then "graceful self-correction under contradiction" is unevaluated, which means it's untested, which means it's whatever the model defaults to. Your product has a behavior in this category. You just don't know what it is.
Why Rubrics Drift Toward What's Easy to Score
There is a strong, structural pull on every rubric toward criteria that produce stable, defensible scores. Engineers who maintain the rubric want inter-rater agreement (between LLM judges, between LLM judges and humans) because low agreement makes the eval noisy and noisy evals don't gate anything. So the natural evolution of a rubric is: criteria that produce noisy scores get reworded toward something more concrete; criteria that can't be reworded get dropped; criteria that produce clean scores get retained and often duplicated.
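The mechanism is easy to sketch. Assuming two judge runs over the same traces, a per-criterion agreement check like the one below (a simple exact-match rate here; teams often use kappa or correlation instead) is what decides which criteria survive.

```python
# Per-criterion agreement between two hypothetical judge runs over the same traces.
# Low-agreement criteria are the ones that get reworded toward something more
# concrete, or dropped entirely.
def agreement_rate(judge_a: list[float], judge_b: list[float]) -> float:
    matches = sum(1 for a, b in zip(judge_a, judge_b) if a == b)
    return matches / len(judge_a)

# Illustrative scores for one concrete and one fuzzy criterion on five traces.
concrete = agreement_rate([1, 1, 0, 1, 0], [1, 1, 0, 1, 0])          # 1.0 -- survives
fuzzy    = agreement_rate([1, 0.5, 0.5, 1, 0], [0.5, 1, 0, 1, 0.5])  # 0.2 -- gets "fixed"
```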
The endpoint of this evolution is a rubric that scores beautifully — high agreement, low variance, clean dashboards — and measures something subtly off-axis from what users actually want. Recent work on rubric-based evaluation has documented exactly this drift, with research like the RULERS framework explicitly calling out "scale misalignment with human grading boundaries" as one of three recurring failure modes alongside rubric instability and unverifiable reasoning.
The same pull operates at the criterion-selection layer. "Was the response factually correct" is easy to score and produces clean numbers. "Did the response leave the user feeling like they were treated as a competent adult" is hard to score, produces noisy numbers, and is therefore likely to be either absent from your rubric or present as a single low-weighted catch-all that nobody trusts. Guess which one users actually notice.
This is not a problem of bad rubric design. It's a problem of optimization pressure: the rubric is being optimized for measurement properties (low variance, high agreement) rather than product properties (predicts user satisfaction, captures latent quality). Without a counterweight, measurement properties always win.
Treating the Rubric as a First-Class Product Artifact
The fix is not to write better rubrics. It's to relocate ownership and apply the same discipline to the rubric that already applies to feature specs. Three concrete moves work.
PM owns the rubric, in the same sense PM owns the spec. Not "PM signs off on quarterly updates" — PM owns the document, reviews every change, and is on the hook when the rubric and the product brief disagree. The implementation detail of who writes the YAML is irrelevant; the question is whose name is on the document and who has to defend its contents to the next product review. If the answer is "the engineer who set it up six months ago," your governance is broken regardless of how good your evals look.
Change-control the rubric like a feature spec. Every criterion added, every weight changed, every score-boundary redefined gets a dated entry in a decision log with one line on rationale and one line on what user-facing behavior is expected to shift. This sounds heavyweight; it isn't. The criterion-level decisions are already happening — they're just happening in a Slack thread that gets lost. Putting them in a versioned doc costs nothing and makes the rubric inspectable.
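For illustration, a decision-log entry doesn't need more structure than this; the field names below are hypothetical, and a dated row in a shared doc works just as well as code.

```python
# One possible shape for a rubric decision-log entry; field names are illustrative.
# The point is that every weight change or score-boundary redefinition gets a
# dated, reviewable record instead of a Slack thread.
from dataclasses import dataclass

@dataclass
class RubricChange:
    date: str            # when the change landed
    criterion: str       # which criterion was touched
    change: str          # e.g. "weight 0.05 -> 0.15"
    rationale: str       # one line: why
    expected_shift: str  # one line: what user-facing behavior should change
    approved_by: str     # the PM, not the engineer who edited the YAML

log = [
    RubricChange(
        date="2024-11-03",
        criterion="not_rushed",
        change="weight 0.05 -> 0.15",
        rationale="Brief says pacing is a core promise; 0.05 made it invisible.",
        expected_shift="Models that rush closings should now fail the release gate.",
        approved_by="pm-owner",
    ),
]
```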
Run a periodic rubric-vs-brief review. Once a quarter (or whatever cadence matches your release cycle), pull the current rubric next to the original product brief and the most recent revision of the brief. Read them in the same room. Note every criterion in the rubric that doesn't trace to something in the brief, and every claim in the brief that no criterion in the rubric covers. The first list is "scope creep into the rubric"; the second list is "behavior we say we care about but don't measure." Both lists almost always exist; both are almost always uncomfortable.
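One lightweight way to produce both lists is to record, for each criterion, which claim in the brief it traces to. The sketch below assumes such a mapping exists; all names are hypothetical.

```python
# Traceability check between rubric criteria and brief claims. A criterion that
# traces to None is scope creep into the rubric; a brief claim no criterion
# covers is behavior we say we care about but don't measure.
criteria_to_brief = {
    "task_completion": "resolves the customer's issue",
    "citation_count": None,             # engineering-convenience criterion
    "bullet_point_usage": None,         # engineering-convenience criterion
    "tone_not_condescending": "doesn't condescend",
}
brief_claims = {
    "resolves the customer's issue",
    "doesn't condescend",
    "never makes the customer feel rushed",
}

scope_creep = [c for c, claim in criteria_to_brief.items() if claim is None]
unmeasured  = brief_claims - {claim for claim in criteria_to_brief.values() if claim}

print(scope_creep)  # criteria nobody asked for
print(unmeasured)   # {"never makes the customer feel rushed"} -- said, not measured
```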
The Cross-Functional Review That Catches Engineering-Convenience Criteria
A specific failure mode worth instrumenting against: criteria that exist because they're easy to score, not because product asked for them. These tend to cluster around model-output features that are syntactically detectable: response length, presence of bullet points, structured-output schema compliance, citation count, refusal-rate tracking. Engineering wants these criteria because they're stable. Product may or may not care about them — and may actively disagree with the direction the rubric implies.
A useful exercise: walk every criterion in the current rubric with the PM and ask, "If we removed this criterion entirely, would the user notice the difference within a month?" Criteria where the answer is "no, but it stabilizes our scoring" are engineering-convenience criteria. They aren't necessarily wrong to keep — stable scoring has real value — but they should be tagged as such and weighted accordingly. They should not be silently absorbing weight that belongs to product-facing behaviors.
The same exercise in reverse: walk the product brief with the engineer and ask, "If a model started failing this part of the spec, would any criterion in our rubric notice?" Every "no" is a hole in the eval. Holes can be filled with imperfect criteria; the cost of an imperfect criterion is far lower than the cost of an unmeasured behavior.
The Eval Reflects the Model's Strengths, Not the User's Needs
The most insidious form of rubric drift happens when graders — human or LLM — converge on what is easy to grade rather than what matters. In LLM-as-judge pipelines, this convergence is mechanical: rubric authors iterate on criteria until inter-judge agreement crosses a threshold, and the criteria that pass the threshold are the ones with clear surface signals. Surface-signal criteria favor surface-quality models.
The compounding effect is that you start picking models, prompts, and even product features by what scores well on these criteria — which means you're picking for the same surface qualities your rubric over-weights. Over time, the product becomes whatever the rubric rewards. If the rubric rewards thorough-sounding multi-paragraph responses, your product becomes verbose. If the rubric rewards confident assertions, your product becomes confident even when it shouldn't be. Users feel the change before any dashboard does.
The way out is to keep at least one human-annotated, expensive, low-volume eval channel that does not go through the rubric — a small set of traces graded by people who know the product intent and grade against the brief, not the YAML. When the cheap rubric scores and the human channel diverge, the rubric has drifted and needs to be reanchored. This is the equivalent of the canary deployment for evals: small, expensive, slow, and the only thing that catches the failure mode the cheap path is structurally blind to.
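The drift check itself can stay simple. Here is a minimal sketch, assuming both channels score the same traces on a 0–1 scale; the divergence threshold is a product decision, not a statistical law.

```python
# Compare the cheap rubric channel against the small human-graded canary channel
# on the same traces. All scores and the threshold below are hypothetical.
def mean_divergence(rubric: list[float], human: list[float]) -> float:
    return sum(abs(r - h) for r, h in zip(rubric, human)) / len(rubric)

rubric_scores = [0.92, 0.88, 0.95, 0.90]   # what the release gate sees
human_scores  = [0.70, 0.85, 0.60, 0.75]   # what graders working from the brief see

if mean_divergence(rubric_scores, human_scores) > 0.15:
    print("Rubric has drifted from the brief -- reanchor before trusting the gate.")
```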
What to Do Monday Morning
If you're running an AI product and have never done the rubric-vs-brief review, do it this week. The exercise takes two hours and the findings are usually severe enough to warrant rewriting the rubric. While you're there, make three structural changes that prevent recurrence: put the rubric in version control with PM as a required reviewer on changes; require a one-line decision-log entry for every criterion modification; schedule a recurring quarterly rubric audit on the calendar with both the PM and the eng lead present.
The rubric will still be a translation. There is no way to evaluate an LLM product without translating directional intent into scorable criteria, and the translation will always introduce decisions the brief didn't make. The goal is not to eliminate the translation — it's to make the translation visible, auditable, and owned. The product manager who writes a paragraph and never sees the YAML is not delegating; they're abdicating. And in an AI product, the YAML is the product.
- https://www.productboard.com/blog/ai-evals-for-product-managers/
- https://www.news.aakashg.com/p/ai-evals
- https://medium.com/@anubhavgoyal0011/a-product-managers-guide-to-ai-evals-how-i-build-reliable-safe-and-high-quality-ai-features-cfd9ce2fecb6
- https://www.mindtheproduct.com/how-to-implement-effective-ai-evaluations/
- https://medium.com/@adnanmasood/rubric-based-evals-llm-as-a-judge-methodologies-and-empirical-validation-in-domain-context-71936b989e80
- https://galileo.ai/blog/agent-evaluation-framework-metrics-rubrics-benchmarks
- https://freeplay.ai/blog/defining-the-right-evaluation-criteria-for-your-llm-project-a-practical-guide
- https://arxiv.org/html/2603.00077v2
- https://arxiv.org/html/2601.08654
- https://www.evidentlyai.com/llm-guide/llm-as-a-judge
