The PM-Eval Translation Gap: When Ship Decisions Outrun the Vocabulary
The go/no-go meeting for an AI feature is, on the surface, a data-driven ritual. Engineering brings a slate of eval numbers — judge score deltas, slice accuracies, regression-against-baseline percentages — and the room decides. It looks rigorous. It usually isn't.
Here is the failure mode in one sentence: the person with the literacy to weight the eval slices does not have the authority to make the call, and the person with the authority cannot read the slices. The product manager owns the launch. The engineer owns the meaning of the numbers. Between them sits a translation gap, and into that gap rushes whoever speaks most confidently in the meeting.
The tell is that "ship at 87%" and "hold at 87%" are both defensible from the same scorecard, depending on which slice you weight. When a single dataset supports opposite conclusions and the deciding factor is rhetorical confidence rather than evidence, you do not have a data-driven process. You have a debate with a spreadsheet in the background.
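To make the arithmetic concrete, here is a minimal sketch of how one scorecard can yield both answers. Everything in it is hypothetical: the slice names, the per-slice accuracies, the two weighting schemes, and the 0.85 ship bar are invented for illustration and are not drawn from any real eval.

```python
# All numbers below are hypothetical, chosen only to illustrate the point.
SLICE_ACCURACY = {
    "common_queries": 0.93,
    "long_tail": 0.78,
    "safety_critical": 0.62,
}

# Two defensible ways to weight the same slices (both assumed).
TRAFFIC_WEIGHTS = {"common_queries": 0.7, "long_tail": 0.2, "safety_critical": 0.1}
RISK_WEIGHTS = {"common_queries": 0.2, "long_tail": 0.2, "safety_critical": 0.6}

SHIP_THRESHOLD = 0.85  # assumed go/no-go bar


def weighted_score(accuracy, weights):
    """Weighted mean of slice accuracies, normalized by the total weight."""
    total = sum(weights.values())
    return sum(accuracy[name] * w for name, w in weights.items()) / total


def decision(score, threshold=SHIP_THRESHOLD):
    """Return 'ship' if the score clears the bar, otherwise 'hold'."""
    return "ship" if score >= threshold else "hold"


if __name__ == "__main__":
    schemes = {"traffic-weighted": TRAFFIC_WEIGHTS, "risk-weighted": RISK_WEIGHTS}
    for label, weights in schemes.items():
        score = weighted_score(SLICE_ACCURACY, weights)
        print(f"{label}: {score:.2f} -> {decision(score)}")
```

Under these assumed numbers, weighting by traffic share puts the aggregate at about 87% and clears the bar, while weighting by risk drops it to about 71% and fails. The scorecard is identical in both cases. The only thing that changed is the weight vector, which is exactly the choice that tends to go unstated in the meeting.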
