A 98.4% structured-output success rate hides a 2% retry loop that quietly eats 12–18% of your inference budget. A practical guide to retry-token budgets, per-field failure dashboards, and fall-through paths that keep the bill honest.
Total GWh on a slide is not an AI sustainability metric. Task-watts joined to product telemetry is — and the dashboard your CFO is about to ask for cannot compute it yet.
Local tokenizers and provider billing counters disagree by 5–15% on the long-tail content your CI never tests. The gap eats your safety margin where your users live.
Function-calling layers default to fire-and-forget, with no call stack and no cycle detector — and the cost shows up as per-request token counts that drift upward as the tool catalog grows.
Cached tool results that look clean in the trace are quietly producing confidently-wrong agent answers. Treat the cache as a per-tool freshness contract — TTLs by volatility, freshness metadata in the result, bypass tiers, and a stale-cache eval slice.
Auto-generating LLM tool schemas from your OpenAPI spec ships your API documentation as prompt — and your agents pay the cost in misuse you never see in tests.
Shipping translated prompts and translated evals is not a multilingual launch. The failure modes are cultural, not linguistic, and your dashboards cannot see them.
AI features ship at 92% pass rate and slide to 78% twelve months later with no single change to blame. Five compounding clocks — model deprecations, weight rotations, input drift, prompt-patch debt, judge calibration — produce a cliff most teams discover only at deprecation deadline. The maintenance cadence that has to be on the calendar before launch.
Static type systems go blind at the prompt boundary. Three failure modes — interpolation, schema-as-prose, output parsing — and the disciplines that close the gap when the compiler can't see the seam.
Most AI teams split prompt ownership from product ownership and pay the coordination tax in regressions nobody owns. Here is the failure pattern and the rituals — shared release calendar, single dashboard, joint incident channel, and a four-artifact RACI — that make the split survivable.
Public ANN benchmarks run uniform query workloads, but production retrieval is Zipfian — and the gap shows up as melted shards, wasted RAM, and a p99 nobody planned for.
Vendor benchmark numbers describe a controlled harness, not your stack. The realized lift on your product is structurally smaller — and the only forecast worth signing budget against is your own shadow eval.