JSON schema only validates shape, not truth. When agents hallucinate referential arguments that pass schema checks, retry loops launder the bug into a clean audit trail — here is the missing layer.
Pilot-based token cost forecasts miss the heavy tail of production users — and the bill arrives at the p99, not the median. How to price the distribution instead of the mean.
Provider region parameters look like AWS region pinning but behave like routing hints. Engineering teams that confuse the two ship a residency posture that only survives until the first real audit.
A streaming agent's stop button can train users to wait through bad answers instead of redirecting them. The fix is to treat interrupt as a turn in the conversation, not a circuit breaker on the API call.
An AI feature whose containment time exceeds its blast time has shipped a kill switch on paper, not in practice. Measure activation latency, tier it against damage rate, and write the number in the runbook.
A latency-budget router does exactly what its loss function asks for and silently downgrades the reasoning-eligible cohort. Why the aggregate eval hides the regression and what to instrument instead.
Legal review is a serial dependency on a parallel roadmap. The team that learns this on the first launch slip pays for it in every quarter after.
Translating a tuned English system prompt into 14 locales is not localization — it is a silent eval regression nobody re-measures. The model's instruction-following accuracy drops 8–22 points and your non-English users get an agent that ignores the constraints the English users see honored.
Retrieval@10 stays green while answer quality drifts. The gap is a U-shaped attention bias that lives in the seam between the retrieval team and the prompt team, and neither dashboard knows the slot the model never read.
Procurement reviewers read model cards as contractual representations, not research disclosures. Author a separate vendor due-diligence package before legal binds you to claims your engineering team wrote as narrative.
Provider deprecation cadence is not external weather. Treat vendor clocks as production infrastructure so the next sunset notice doesn't re-prioritize your quarter.
OAuth consent treats agents like single-purpose apps, but each chained tool call expands the realized authority your audit log has to explain. The one-shot screen rendered the worst case as one decision.