AI-feature work produces evidence — eval coverage, judge calibration, kill decisions — that the standard perf rubric has no slot for. Here is what to add.
A same-vendor LLM judge can make one prompt variant look better while production regresses. Here is why family bias passes every dashboard check, and how multi-vendor ensembles plus human calibration fix it.
A clean model migration with green evals and matching latency can quietly invalidate your provider's prefix cache and spike input-token costs for weeks. Here is the blind spot and the rollout discipline that prevents it.
The disclosure your single-turn compliance review approved cannot survive the agent loop that serves it — by turn fourteen the model is answering from a summary that quietly deleted 'I am an AI,' and that gap is now a regulatory liability with teeth.
Inference grew to 85% of enterprise AI spend while the org chart still treats it as an engineering line item. The fix is a named owner, a chargeback rule, and a pre-committed kill threshold — three things no tool can ship for you.
Multimodal agents shatter the span tree the moment voice, vision, and LLM each open their own root. The fix is a turn ID, attached artifacts, and one owner for the join.
Your token bucket measures user clicks; the bill measures model calls. When a single click fans out into thirty calls, the limiter at the HTTP boundary becomes a paper umbrella.
Your team's stated AI capability is the maximum of its members' skill, while delivery velocity is the median — and that gap is the most underpriced risk on your roadmap.
Six months ago your prompt prefix was 4k tokens and amortized to nearly free. Today it is 11k tokens, your cache hit rate is 31 percent, and nobody can point to the PR that did it.
Wiring an AI agent into CI with merge rights creates a new operational class your SRE runbook never named. Type its actions, queue what cannot run unattended, log every change with attribution, and rehearse the kill switch.
A tool nobody owns can sit in your shared agent catalog forever, taxing every inference in tokens, selection accuracy, and security surface. Build the deprecation lifecycle that lets you remove it.
Your AI feature has an undocumented contract with its users, encoded in the system prompt. A small fraction of patient users reverse-engineer it; the rest get a worse product. Surface the contract instead of hiding it.