Distillation optimizes a divergence over a finite sample, then ships against a finite eval. Behaviors the eval never measured are free entropy the student is licensed to drop — and the ones it drops first are usually the rare-but-load-bearing ones.
Why vendor-side embedding upgrades silently break your A/B tests on retrieval features, and the experimentation discipline that closes the gap.
An escalate_to_human tool stops being human-in-the-loop the moment the downstream queue grows its own automation. Why the contract has to outlive the consumer.
An LLM judge whose endpoint silently updates is a measurement instrument with no calibration contract. Pin snapshots, build anchor sets, and run dual-judge windows so a six-point lift means your system improved, not that the ruler moved.
An eval rubric read by humans and an LLM judge drifts on two axes at once. Composite scores hide the motion. Here is the measurement protocol that keeps each drift attributable.
An offline eval built from a nightly 3am cron quietly becomes a survey of overnight batch retries and APAC traffic — and the leaderboard cannot tell you whose model it is.
A plateaued eval score does not always mean a model ceiling. When labelers homogenize, agreement metrics climb and the eval stops measuring what the team thinks it measures.
An LLM prompt experiment leaks assignment whenever the routing hash and the prompt assembler share an input. A walk through how the lift gets manufactured, the symptoms your dashboard does not surface, and the disciplines that close the gap.
Hosted fine-tunes share an API surface with base models but not a cost-of-latency curve. Here's why the cold-start tax hides in your p99 and never shows up on the bill.
When the thumbs-down button in your staging UI silently doubles as a training pipeline, you fine-tune on six months of personal taste, customer text, and engineer venting. Separate the debug surface from the curation surface — or ship a model trained on whatever your team was feeling that week.
Supervised fine-tuning quietly strips the refusal training your base model came with. Why task-only evals miss it, and the four practices that catch the regression before customers do.
Sharing one GPU pool between nightly training and morning inference looks like a utilization win until the p99 dashboard prices the externality. Why GPU partitions have to be physical, accounting has to follow latency class, and the morning tail won't be fixed in software.