A staging agent sent a real customer email because one tool in its registry held a production credential. Why sandbox is now a per-tool property, and the attestation pattern that catches credential-tier drift before it ships.
Fine-tuning teaches a model to behave like your corpus — including the misspellings, hedges, and one rep's verbal tics. Here is how that inheritance happens and the curation pass that catches it.
Worker-critic agent loops promise convergence to quality but rarely deliver it — the verifier is a stochastic policy, the max-iterations cap is a budget gate dressed as a quality gate, and the patterns that restore termination treat the satisfaction surface as the real architectural problem.
Safety-tuned LLM agents refuse legitimate operator requests because the model can't tell an on-call engineer from an anonymous user. The fix is architectural — signed runbooks, capability tokens, and operator-mode channels — not retuning refusal calibration.
Agents execute multi-step plans into deploy freezes, active incidents, and red status pages because they cannot read the side channels humans absorb for free. Here is how to fix it.
Per-user token budgets bite hardest mid-conversation, where silent truncation, dropped tool calls, and model fallbacks read as a quality regression — and the upgrade conversation never happens.
Deflection rate counts silence, not help. The same number can mean a resolved customer or a churned one — and the dashboard can't tell which until the cohort report arrives.
AI-feature work produces evidence — eval coverage, judge calibration, kill decisions — that the standard perf rubric has no slot for. Here is what to add.
A same-vendor LLM judge can make one prompt variant look better while production regresses. Here is why family bias passes every dashboard check, and how multi-vendor ensembles plus human calibration fix it.
A clean model migration with green evals and matching latency can quietly invalidate your provider's prefix cache and spike input-token costs for weeks. Here is the blind spot and the rollout discipline that prevents it.
The disclosure your single-turn compliance review approved cannot survive the agent loop that serves it — by turn fourteen the model is answering from a summary that quietly deleted 'I am an AI,' and that gap is now a regulatory liability with teeth.
Inference grew to 85% of enterprise AI spend while the org chart still treats it as an engineering line item. The fix is a named owner, a chargeback rule, and a pre-committed kill threshold — three things no tool can ship for you.