Agents inherit your wiring but not your sense of place. When the prompt is the same in staging and prod, the model fills in 'where it is' from training data — and 'production database' is the default. Here is how to ground an agent in its environment.
Distillation optimizes a divergence over a finite sample, then ships against a finite eval. Behaviors the eval never measured are free entropy the student is licensed to drop — and the ones it drops first are usually the rare-but-load-bearing ones.
Why vendor-side embedding upgrades silently break your A/B tests on retrieval features, and the experimentation discipline that closes the gap.
An escalate_to_human tool stops being human-in-the-loop the moment the downstream queue grows its own automation. Why the contract has to outlive the consumer.
An LLM judge whose endpoint silently updates is a measurement instrument with no calibration contract. Pin snapshots, build anchor sets, and run dual-judge windows so a six-point lift means your system improved — not the ruler.
An eval rubric read by humans and an LLM judge drifts on two axes at once. Composite scores hide the motion. Here is the measurement protocol that keeps each drift attributable.
An offline eval built from a nightly 3am cron quietly becomes a survey of overnight batch retries and APAC traffic — and the leaderboard cannot tell you whose model it is.
A plateaued eval score does not always mean a model ceiling. When labelers homogenize, agreement metrics climb and the eval stops measuring what the team thinks it does.
An LLM prompt experiment leaks assignment whenever the routing hash and the prompt assembler share an input — a walk through how the lift gets manufactured, the symptoms your dashboard does not surface, and the disciplines that close the gap.
Hosted fine-tunes share an API surface with base models but not a cost-of-latency curve. Here's why the cold start tax hides in your p99 and never shows up on the bill.
When the thumbs-down button in your staging UI silently doubles as a training pipeline, you fine-tune on six months of personal taste, customer text, and engineer venting. Separate the debug surface from the curation surface — or ship a model trained on whatever your team was feeling that week.
Supervised fine-tuning quietly strips the refusal training your base model came with. Why task-only evals miss it, and the four practices that catch the regression before customers do.