Skip to main content

The PM-Eval Translation Gap: When Ship Decisions Outrun the Vocabulary

· 8 min read
Tian Pan
Software Engineer

The go/no-go meeting for an AI feature is, on the surface, a data-driven ritual. Engineering brings a slate of eval numbers — judge score deltas, slice accuracies, regression-against-baseline percentages — and the room decides. It looks rigorous. It usually isn't.

Here is the failure mode in one sentence: the person with the literacy to weight the eval slices does not have the authority to make the call, and the person with the authority cannot read the slices. The product manager owns the launch. The engineer owns the meaning of the numbers. Between them sits a translation gap, and into that gap rushes whoever speaks most confidently in the meeting.

The tell is that "ship at 87%" and "hold at 87%" are both defensible from the same scorecard, depending on which slice you weight. When a single dataset supports opposite conclusions and the deciding factor is rhetorical confidence rather than evidence, you do not have a data-driven process. You have a debate with a spreadsheet in the background.

This is not a competence problem on either side. The PM is not bad at their job, and the engineer is not hiding the ball. The problem is structural: the eval surface evolved faster than the shared vocabulary, so the artifact the team paid real money to produce — a careful, sliced, baseline-anchored eval set — arrives at the decision unreadable by the decision-maker.

How a PM Actually Reads a Scorecard

A non-engineer reading a code diff infers intent from the surrounding conversation, not from the diff itself. They watch who seems confident, who hedges, what the Slack thread said this morning. The diff is a prop.

A PM reads an eval scorecard the same way. Handed a table of numbers — intent_accuracy: 0.91, macro_F1: 0.78, adversarial_pass: 0.83, regression_vs_baseline: -1.2pp — the PM cannot tell which of those is load-bearing for this launch. So they fall back on the conversation. Is the eng lead nodding? Did the number go up or down since last week? Is anyone visibly worried?

This produces decisions that look metric-driven and are actually vibe-driven. The PM is not weighting slices. They are weighting people.

And the gap is asymmetric in a way that makes it worse. The engineer who built the eval knows exactly which slice matters — they know macro_F1 dropped because the rare "billing dispute" intent regressed, and they know billing disputes are 0.4% of traffic but 30% of escalations. That context lives in their head. It does not live on the scorecard. So unless the engineer happens to volunteer it, and happens to win the room, the decision is made without it.

Worse, the engineer often has the literacy but not the authority. They can say "I'd hold," but the PM owns the launch date, the roadmap commitment, and the sales call that's already on the calendar. If the PM's mental model of "AI quality" is two quarters out of date — because the eval surface added six new slices since the last time anyone walked them through it — the authority and the literacy have fully decoupled.

The Eval Set Is a Cross-Functional Artifact

Teams build eval sets as an engineering tool and are then surprised by who reads them. The eval set is not just a regression gate. It is a communication artifact, and its readership is larger than the team that wrote it: the PM reads it to make a launch call, the eng manager reads it for a staffing argument, the support lead reads it to predict ticket volume, sometimes legal reads it for a risk sign-off.

If you accept that, three things follow.

First, the scorecard needs a fixed template the PM sees on every release. Not a fresh ad-hoc table each time. When the format changes every launch, every checkpoint becomes a new debate about which evidence counts — and re-litigating the format is how teams end up re-justifying work that was already proven. A stable template means the PM learns to read it once and reuses that literacy forever.

Second, the thresholds have to be agreed before the eval runs. "We need 90% on intent accuracy to ship" is a product decision, not a technical one — it encodes how much imperfection users tolerate and how costly a failure is in your domain. If that number is set after the results are in, it is not a threshold. It is a justification, reverse-engineered to match whatever outcome the loudest person wants.

Third, every slice needs a name in product terms. "macro-F1 on the intent-classification head" means nothing to a PM. "How often the assistant correctly identifies what the user is asking for" means something. Same number, translated.

The Discipline That Closes the Gap

Five concrete practices, in rough order of leverage.

A fixed-template eval scorecard with pre-defined thresholds. One template, used every release. Each metric has a ship / hold / escalate band agreed by eng and product before the run. The scorecard's job is to render a recommendation — green, yellow, red — not just numbers. Green: every dimension meets threshold, no regression against baseline, adversarial pass above the safety floor. Yellow: core scenarios pass, some edge degradation, monitoring in place. Red: any safety failure, or core pass rate below threshold, or a significant regression with no offsetting gain. The PM does not have to compute the verdict. They have to understand it and override it consciously if they choose to.

A product-side eval review of real cases. Once per release, the PM literally walks through five failing cases and five passing cases — actual inputs, actual model outputs. This is the single highest-leverage practice on the list. A number is abstract; a transcript of the assistant confidently giving a customer the wrong refund policy is not. After three releases of this, the PM stops asking "is 87% good?" and starts asking "are the 13% that fail the ones we can afford to fail?" — which is the right question.

A translation glossary. A living one-pager that maps each eval slice to a product-language sentence. intent_accuracy → "how often we understand what the user wants." adversarial_pass → "how often a user trying to break the assistant fails to." groundedness → "how often the answer is actually supported by the docs we retrieved." Boring to maintain. It is the difference between a PM who can read the scorecard and one who can only read the room.

A quarterly calibration. Every quarter, PM and engineering re-anchor on which slices map to which user-visible outcomes. The eval surface drifts — new slices get added after incidents, old ones saturate and stop discriminating. Without a standing re-anchor, the PM's mental model silently expires while the scorecard keeps growing.

A launch-meeting decision protocol. The eval scorecard's recommendation is read out first, before the room debates. This is a small piece of meeting design with an outsized effect. When the data is the anchor, discussion adjusts away from a shared reference point. When discussion comes first, the data gets recruited as ammunition by whichever side already has a preference. Same numbers, opposite epistemics — the ordering decides which one you get.

Why This Is Leadership's Problem, Not a Template Problem

You could read the section above as a process checklist and stop there. That would miss the actual point.

The deeper failure is a leadership assumption: that AI quality can be delegated to a metric, and the metric will speak for itself. It will not. An eval score is a compression of a messy, multi-dimensional reality into a single number, and compression is lossy. Someone has to decompress it in the room, and that someone needs both the literacy and the authority — or a reliable bridge between the two people who hold them separately.

When the bridge is missing, the organization does not fail loudly. It fails quietly, by escalating launch decisions to whoever's intuition sounds most certain. Sometimes that intuition is right. The problem is that being right becomes uncorrelated with the evidence — and an organization that pays for careful evals and then decides by confidence has bought an expensive prop.

The fix is cheap relative to the eval infrastructure it protects. A glossary, a fixed template, ten transcripts a release, a quarterly hour, and a meeting-agenda reordering. None of it requires new tooling. All of it requires leadership to treat cross-functional eval literacy as a deliverable with an owner — not as something that will accrete on its own.

Build the literacy bridge before the next launch, not after the postmortem of the one that shipped at 87% because the room felt good about it.

References:Let's stay in touch and Follow me for more thoughts and updates