Your Eval Rubric Is the Real Product Spec — and No PM Signed Off on It
A product manager writes a paragraph: "The assistant should be helpful, accurate, and concise, and should never make the customer feel rushed." An engineer reads that paragraph, opens a YAML file, and writes 47 weighted criteria so the LLM-as-judge can produce a number on every trace. Six months later, that YAML file is the actual specification of the product. Every release is gated on it. Every regression alert fires on it. Every "this is shipping quality" decision routes through it. The PM has never read it.
This is the most common form of unintentional product ownership transfer in AI engineering today. The rubric is not a measurement of the spec; it is the spec, in the same way that a compiler is not a description of your language but the operational truth of it. And like compilers, rubrics have implementation details that silently determine semantics. Which failure mode gets a 0 versus a 0.5? Which criterion is weighted 0.3 versus 0.05? Which behavior is absent from the rubric and therefore goes uncounted entirely? Each of these is a product decision. None of them lived in the original brief.
The diagnostic that exposes this is uncomfortable but cheap: pull the current rubric and the original product brief into a single document and read them side by side. Teams that try this almost always find the rubric has drifted from the brief, usually toward what was easy to score rather than what users care about. Two quarters of velocity have been measured against a slightly different product.
How the Translation Happens (and Why It's Invisible)
Rubric drift from the product brief isn't malicious or even careless. It's the predictable result of a translation problem nobody owns. The PM writes in directional language because directional language is what product briefs are made of: "feels helpful," "doesn't condescend," "respects the user's time." None of these compile. The engineer needs deterministic-ish criteria that an LLM judge can score consistently across runs, so they translate.
Translation requires choices the brief doesn't make. "Feels helpful" becomes "answers the literal question asked AND offers one relevant follow-up unless the user's intent was a quick lookup." Every word in that operationalization is a product call. Does a quick lookup get penalized for an unsolicited follow-up, or is the follow-up a positive? Is "answering the literal question" weighted equal to "useful follow-up," or higher? What if the model nails the answer but the follow-up is mildly off — is that a 1.0, a 0.7, or a 0.5?
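One way to see how many calls hide inside that single sentence is to write the criterion as code and notice how many parameters it needs. The sketch below is illustrative only: `HelpfulnessPolicy`, its field names, and the default values are all hypothetical and do not come from any real rubric.

```python
from dataclasses import dataclass

FOLLOW_UP_KINDS = {"none", "relevant", "off_target"}


@dataclass(frozen=True)
class HelpfulnessPolicy:
    """Hypothetical knobs. Each default below is a product decision."""
    answer_weight: float = 0.7                # share for answering the literal question
    follow_up_weight: float = 0.3             # share for offering a relevant follow-up
    off_target_follow_up_credit: float = 0.5  # partial credit for a mildly off follow-up
    penalize_follow_up_on_quick_lookup: bool = False


def score_helpfulness(answered, follow_up, quick_lookup, policy):
    """Score one trace in [0, 1] under the given policy."""
    if follow_up not in FOLLOW_UP_KINDS:
        raise ValueError(f"unknown follow_up kind: {follow_up!r}")
    answer_part = policy.answer_weight if answered else 0.0
    if quick_lookup:
        # No follow-up is expected on a quick lookup, so silence earns full credit.
        unsolicited = follow_up != "none"
        if unsolicited and policy.penalize_follow_up_on_quick_lookup:
            follow_part = 0.0
        else:
            follow_part = policy.follow_up_weight
    else:
        credit = {
            "none": 0.0,
            "relevant": 1.0,
            "off_target": policy.off_target_follow_up_credit,
        }[follow_up]
        follow_part = policy.follow_up_weight * credit
    return round(answer_part + follow_part, 3)


lenient = HelpfulnessPolicy()
strict = HelpfulnessPolicy(off_target_follow_up_credit=0.0,
                           penalize_follow_up_on_quick_lookup=True)

# Same trace, two policies: a correct answer with a mildly off follow-up.
print(score_helpfulness(True, "off_target", False, lenient))
print(score_helpfulness(True, "off_target", False, strict))
```

Under these assumed defaults, the identical trace scores 0.85 with the lenient policy and 0.7 with the strict one. Neither policy is wrong; the point is that somebody picked one.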
The engineer makes the call because the engineer is the one whose code won't run otherwise. The call is reasonable. The call is undocumented. The call ships.
Multiply this by 47 criteria, then by every iteration of the rubric over six months, and the rubric has accumulated hundreds of product decisions. None of them appear in the product roadmap, the design doc, or the launch brief. The PM has been reviewing model outputs on a weekly basis and nodding because the outputs are good against the rubric — not because they're good against the original spec.
The Decisions Encoded in a Rubric Nobody Audits
Three categories of silent product decision tend to dominate. They're worth naming, because once named, they're easier to defend against.
Score-boundary decisions. When a rubric uses a 0/0.5/1 scale (or 1–5, or any ordinal), the boundaries between scores are product calls. Is "the assistant correctly identified the issue but solved a closely related, slightly wrong problem" a 0.5 or a 0? That answer determines whether your evals favor models that fail loud (clear wrongness) over models that fail soft (plausible wrongness). Most users prefer fail-loud behavior, because soft failures destroy trust faster, but most rubrics, when written by engineers optimizing for inter-rater agreement, score soft failures as partial credit because they're easier to grade.
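A small worked example makes the boundary effect concrete. The two models and their outcome counts below are invented for illustration; the only thing that changes between the two runs is how much credit a soft failure earns.

```python
def mean_score(outcomes, soft_failure_credit):
    """Average per-trace score, given the credit assigned to a soft failure."""
    points = {"correct": 1.0, "soft_fail": soft_failure_credit, "loud_fail": 0.0}
    return sum(points[o] for o in outcomes) / len(outcomes)


# Hypothetical models evaluated on 100 traces each.
fail_loud_model = ["correct"] * 80 + ["loud_fail"] * 20  # wrong less often, and obviously
fail_soft_model = ["correct"] * 70 + ["soft_fail"] * 30  # wrong more often, but plausibly

for credit in (0.5, 0.0):
    loud = mean_score(fail_loud_model, credit)
    soft = mean_score(fail_soft_model, credit)
    winner = "fail-soft" if soft > loud else "fail-loud"
    print(f"soft-failure credit {credit}: loud={loud:.2f} soft={soft:.2f} -> {winner} ships")
```

With these assumed counts, partial credit of 0.5 ranks the fail-soft model higher (0.85 versus 0.80) even though it is correct less often. Scoring soft failures as 0 reverses the ranking (0.70 versus 0.80).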
Weighting decisions. Almost every analytic rubric assigns weights, and most of those weights are picked once and never revisited. A criterion weighted at 0.05 is approximately invisible — a model can fail it on every trace and the aggregate score barely moves. If "doesn't make the customer feel rushed" got translated into a single 0.05-weighted criterion while "task completion" got six criteria summing to 0.6, the rubric has unilaterally decided that customer experience is roughly one-twelfth as important as task completion. The PM almost certainly does not believe that.
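The arithmetic is easy to check. The sketch below uses a hypothetical rubric whose weights mirror the proportions described above (six task-completion criteria at 0.10 each, one "not rushed" criterion at 0.05); the remaining criterion names and weights are assumptions added so the weights sum to 1.

```python
# Hypothetical weights; only the 0.6 and 0.05 figures come from the text.
WEIGHTS = {
    **{f"task_completion_{i}": 0.10 for i in range(1, 7)},  # six criteria, 0.6 total
    "accuracy": 0.20,
    "concision": 0.15,
    "not_rushed": 0.05,
}


def aggregate(scores, weights=WEIGHTS):
    """Weighted mean of per-criterion scores, normalized by total weight."""
    total = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total


perfect = {name: 1.0 for name in WEIGHTS}
always_rushed = {**perfect, "not_rushed": 0.0}  # fails the criterion on every trace

task_weight = sum(w for name, w in WEIGHTS.items()
                  if name.startswith("task_completion"))
implied_ratio = WEIGHTS["not_rushed"] / task_weight

print(f"perfect model:        {aggregate(perfect):.2f}")
print(f"always-rushed model:  {aggregate(always_rushed):.2f}")
print(f"implied importance of 'not rushed' vs task completion: {implied_ratio:.3f}")
```

Under these assumptions a model that makes every customer feel rushed still scores 0.95. If the release gate sits at 0.90, that model ships.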
Absent-criterion decisions. The criteria that aren't in the rubric are the most dangerous, because they're unobservable from inside the eval. If your rubric scores correctness, completeness, tone, and safety — but never scores whether the assistant gracefully handles a follow-up question that contradicts its prior answer — then "graceful self-correction under contradiction" is unevaluated, which means it's untested, which means it's whatever the model defaults to. Your product has a behavior in this category. You just don't know what it is.
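A coverage check can surface some of these gaps, though only for behaviors that someone has already written down. The sketch below assumes each rubric criterion is tagged with the brief-level behaviors it is meant to measure; the behavior names, criterion names, and tagging scheme are all hypothetical.

```python
# Hypothetical behaviors the brief (or a PM review) says the product must exhibit.
BRIEF_BEHAVIORS = [
    "correctness",
    "completeness",
    "tone",
    "safety",
    "self_correction_under_contradiction",
]

# Hypothetical rubric: criterion name -> brief behaviors it claims to measure.
RUBRIC = {
    "answer_is_correct": ["correctness"],
    "covers_all_parts": ["completeness"],
    "polite_register": ["tone"],
    "no_unsafe_advice": ["safety"],
    "no_policy_violations": ["safety"],
}


def uncovered_behaviors(brief_behaviors, rubric):
    """Behaviors from the brief that no rubric criterion measures."""
    covered = {b for behaviors in rubric.values() for b in behaviors}
    return [b for b in brief_behaviors if b not in covered]


def unanchored_criteria(brief_behaviors, rubric):
    """Criteria that map to nothing in the brief (candidate drift)."""
    known = set(brief_behaviors)
    return sorted(name for name, behaviors in rubric.items()
                  if not known.intersection(behaviors))


print("unevaluated:", uncovered_behaviors(BRIEF_BEHAVIORS, RUBRIC))
print("not in brief:", unanchored_criteria(BRIEF_BEHAVIORS, RUBRIC))
```

This check cannot find behaviors that appear in neither document. Those still need a human reading traces with the brief in hand.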
Why Rubrics Drift Toward What's Easy to Score
There is a strong, structural pull on every rubric toward criteria that produce stable, defensible scores. Engineers who maintain the rubric want inter-rater agreement (between LLM judges, between LLM judges and humans) because low agreement makes the eval noisy and noisy evals don't gate anything. So the natural evolution of a rubric is: criteria that produce noisy scores get reworded toward something more concrete; criteria that can't be reworded get dropped; criteria that produce clean scores get retained and often duplicated.
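The pruning pressure described above can be sketched as a rule that looks only at judge agreement. The example computes Cohen's kappa by hand for two judges on binary pass/fail labels and flags criteria below a threshold. The criterion names, the labels, and the 0.6 threshold are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical pass/fail labels from two judges on the same eight traces.
JUDGE_LABELS = {
    "cites_order_id": ([1, 1, 0, 1, 0, 1, 1, 0], [1, 1, 0, 1, 0, 1, 1, 0]),
    "feels_unrushed": ([1, 0, 1, 1, 0, 0, 1, 0], [0, 1, 1, 0, 0, 1, 1, 1]),
}


def flagged_for_removal(judge_labels, min_kappa=0.6):
    """Criteria an agreement-only pruning rule would reword or drop."""
    return sorted(name for name, (a, b) in judge_labels.items()
                  if cohen_kappa(a, b) < min_kappa)


for name, (a, b) in JUDGE_LABELS.items():
    print(f"{name}: kappa={cohen_kappa(a, b):.2f}")
print("flagged:", flagged_for_removal(JUDGE_LABELS))
```

Nothing in `flagged_for_removal` knows which criterion users care about. With these made-up labels the concrete, checkable criterion survives and the experiential one is flagged. One mitigation is to send flagged criteria to the brief's owner for review rather than deleting them on the agreement score alone.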
The endpoint of this evolution is a rubric that scores beautifully — high agreement, low variance, clean dashboards — and measures something subtly off-axis from what users actually want. Recent work on rubric-based evaluation has documented exactly this drift, with research like the RULERS framework explicitly calling out "scale misalignment with human grading boundaries" as one of three recurring failure modes alongside rubric instability and unverifiable reasoning.
- https://www.productboard.com/blog/ai-evals-for-product-managers/
- https://www.news.aakashg.com/p/ai-evals
- https://medium.com/@anubhavgoyal0011/a-product-managers-guide-to-ai-evals-how-i-build-reliable-safe-and-high-quality-ai-features-cfd9ce2fecb6
- https://www.mindtheproduct.com/how-to-implement-effective-ai-evaluations/
- https://medium.com/@adnanmasood/rubric-based-evals-llm-as-a-judge-methodologies-and-empirical-validation-in-domain-context-71936b989e80
- https://galileo.ai/blog/agent-evaluation-framework-metrics-rubrics-benchmarks
- https://freeplay.ai/blog/defining-the-right-evaluation-criteria-for-your-llm-project-a-practical-guide
- https://arxiv.org/html/2603.00077v2
- https://arxiv.org/html/2601.08654
- https://www.evidentlyai.com/llm-guide/llm-as-a-judge
