Your eval set is read by sales, marketing, legal, and customer success — and they each extract a different artifact than you intended. Build the engineering-vs-shareable split before a customer recognizes their own complaint in a procurement deck.
A 90-day onboarding plan for AI engineers that swaps stale architecture docs for shadowed eval reviews, supervised prompt diffs, and end-to-end judge calibration.
Grading an LLM against paid-cohort traces is grading on the easy distribution. The cohort still deciding whether to upgrade lives in the free tier — and the eval set ignores it.
Multi-year GPU commitments quietly bind product roadmaps to capacity decisions made by people who never saw the feature list. Here is the planning discipline that closes the gap.
AI teams routinely grade models against production conversations and call it an internal dataset. Under purpose-limitation rules, it is a separate processing event that nobody reviewed.
Static tool descriptions in agent prompts decay against live latency and error rates. A runtime cost-of-waiting signal in the prompt is what turns tool selection from a frozen eval artifact into a routing decision.
An English-first eval rollup hides regressions on French, Japanese, and Portuguese queries until churn surfaces them. The discipline of locale-stratified evals, locale-conditioned judges, and traffic-weighted reporting catches locale drift before users do.
Renaming an MCP tool is not an API deprecation — it's a model-distribution shift. Why the old name keeps showing up and how to phase it out without paging on-call.
Web AI features iterate in minutes; mobile AI features iterate on the platform's review clock. The architectural seams that keep both surfaces honest under one eval set and two release trains.
Swapping foundation models silently invalidates your eval baseline — human-anchored scores, LLM judges, snapshots, and team intuition all need re-anchoring, and the labor bill is usually larger than the per-token savings.
The pager fires because eval-on-traffic dropped four points, not because a service crashed. Runbook patterns, alert design, and rotation discipline for the failure modes that don't trip your existing alerts.
Per-customer system prompt customizations accumulate silently until model-migration day, when a single provider deprecation becomes 47 separate re-validations. The base-plus-overlay architecture and approval discipline that prevent it.