Transparent tool retries silently burn wall-clock budget while the planner reasons against a stale deadline, producing a bimodal SLA failure that no single layer's metric catches.
A third population sits between cooperative users and malicious attackers: the curious customer who treats your AI agent as a puzzle. Here is how to build evals, refusals, and fallbacks that hold up when their probing is the moment your brand is being judged.
Provisioned throughput sized on user QPS quietly under-provisions agent products by the loop fan-out factor. Plan with model-call rate, loop depth, and burst tails instead.
Two agent runs of the same prompt almost never produce the same output. Diffing them at the text level hides the actual cause. Here is what structural diffing requires and how to build for it.
AI codebases carry a hidden domain-knowledge tax that turns three-week ramps into three-month ones. The fix is decision history, not architecture diagrams.
Hiding AI cost from users ships silent throttling and surprise downgrades. Treating the token budget as a real product surface — preview, caps, model selection — turns the cost ceiling from a churn driver into a monetization lever.
Six months into a human eval program, the inter-rater agreement number is a weighted average of three different implicit rubrics. The model didn't drift — the measurement instrument did.
Agent stacks emit four logs that don't agree. The fix isn't more logging — it's a transaction ID minted at the user-action boundary, a unified audit record, and retention sized to the compliance question, not the subsystem.
Production AI products drop their refusals when faced with framings of three words or fewer: 'hypothetically,' 'for educational purposes,' 'for a story.' How to detect and defend against the bypass vocabulary your users pick up from social platforms.
Compliance reviewers spot LLM failure modes that engineering evals systematically miss. Move them out of document-review gates and into the regression suite, so that legal sign-off becomes a statement about pinned test cases that run on every commit.
Long-running agent sessions silently leak tokens — quadratic cost growth and quality degradation hide inside conversation history. How to instrument, prune, and compact it.
Chat is a great input modality and a terrible output one. The moment your agent returns more than three results, the right answer is to render UI — not to keep talking.